Monitoring

Todd Hoff's picture

Product: Supervisor - Monitor and Control Your Processes

It's a sad fact of life, but processes die. I know, it's horrible. You start them, send them out into process space, and hope for the best. Yet sometimes, despite your best coding, they core dump, seg fault, or some other calamity befalls them. Unlike our messy biological world so cruelly ruled by entropy, in the digital world processes can be given another chance. They can be restarted. A greater destiny awaits. And hopefully this time the random lottery of unforeseen killing factors will be avoided and a long productive life will be had by all.

This is fun code to write because it's a lot more complicated than you might think. And restarting processes is a highly effective high availability strategy. Most faults are transient, caused by an unexpected series of events. Rather than taking drastic action, like taking a node out of production or failing over, transients can be effectively masked by simply restarting failed processes. Though complexity makes it a fun problem, it's also why you may want to "buy" rather than build. If you are in the market, Supervisor looks worth a visit.

Adapted from their website:
Supervisor is a Python program that allows you to start, stop, and restart other programs on UNIX systems. It can restart crashed processes.

Todd Hoff's picture

Product: Hyperic

From Wikipedia:

Hyperic HQ is a popular open source IT Operations computer system and network monitoring application software. It auto-discovers all system resources and their metrics, including hardware, operating systems, virtualization, databases, middleware, applications, and services. It watches hosts and services that you specify, alerting you when things go bad and again when they get better. It also provides historical charting and event correlation for faster problem identification.

Todd Hoff's picture

Product: collectd

From http://directory.fsf.org/project/collectd/ :

'collectd' is a small daemon which collects system information every 10 seconds and writes the results in an RRD-file. The statistics gathered include: CPU and memory usage, system load, network latency (ping), network interface traffic, and system temperatures (using lm-sensors), and disk usage. 'collectd' is not a script; it is written in C for performance and portability. It stays in the memory so there is no need to start up a heavy interpreter every time new values should be logged.

Todd Hoff's picture

How WordPress.com Tracks 300 Servers Handling 10 Million Pageviews

WordPress.com hosts 300 servers in 5 different data centers. It's always useful to learn how large installations manage all their unruly children: Currently we Nagios for server health monitoring, Munin for graphing various server metrics, and a wiki to keep track of all the server hardware specs, IPs, vendor IDs, etc. All of these tools have suited us well up until now, but there have been some scaling issues. The post covers how these different tools are working for them and the comment section has some interesting discussions too.

Todd Hoff's picture

Product: Ganglia Monitoring System

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Todd Hoff's picture

Product: Cacti

Cacti is a network statistics graphing tool designed as a frontend to RRDtool's data storage and graphing functionality. It is intended to be intuitive and easy to use, as well as robust and scalable. It is generally used to graph time-series data like CPU load and bandwidth use.

The frontend is written in PHP; it can handle multiple users, each with their own graph sets, so it is sometimes used by web hosting providers (especially dedicated server, virtual private server, and colocation providers) to display bandwidth statistics for their customers. It can be used to configure the data collection itself, allowing certain setups to be monitored without any manual configuration of RRDtool.

Todd Hoff's picture

Product: Munin Monitoriting Tool

Munin the monitoring tool surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug and play capabilities. After completing a installation a high number of monitoring plugins will be playing with no more effort.

Using Munin you can easily monitor the performance of your computers, networks, SANs, applications, weather measurements and whatever comes to mind. It makes it easy to determine "what's different today" when a performance problem crops up. It makes it easy to see how you're doing capacity-wise on any resources.

Syndicate content