Curing Operational Blindness
In the world of physical and virtual computing, there are thousands of metrics to choose from. These range from metrics describing a single running operating system to metrics describing the many processes of one application running across many servers. Obtaining visibility into the runtime attributes of these servers and processes is critical to effective operations. We must have operational vision, which consists of hindsight, sight and foresight. The ability to see past, present and future is critical to guiding effective operations and development. Without this vision, we have operational blindness.
We will use a human body and its organs as an analogy for a running operating system and its processes. There are many approaches to collecting the appropriate metrics to accomplish this goal. Let us explore a few of the more prominent ones:
Do Nothing Approach
Medical
We assume the body is healthy because we are not aware of anything wrong. Observed from afar, our system seems to be functioning.
System
This approach is the default state of a freshly running operating system environment. With this approach, we maintain the status quo and remain operationally blind. Without insightful metrics, we have a misguided sense of safety based on the operating system’s newness and the fact that its processes are still running. Effective business decisions guiding operations and development cannot be made.
Results
The blissful paradise of ignorance quickly turns to constant pain. An illness that could have easily been cured when symptoms first appeared instead requires painful and costly treatment.
The equivalents for running systems are processes ceasing to function, disk drives filling up, the dreaded and seemingly random “Out of memory killer”, or constant CPU usage that keeps critical business processes from keeping up with demand.
Operational blindness should never be the choice for the systems your business is built on. To minimize risk and maximize efficiency, issues need to be detected early and mitigated at the first responsible opportunity. If we are operationally blind, we can only prevent issues through guesswork and luck.
Examination and Scanning Mechanisms
Medical
This is analogous to going for a routine checkup with your physician. They poke and prod and scan your body to see how it is functioning, looking for warning signs that they have encountered in the past in order to mitigate health risks before they become issues.
System
We write scripts and small programs which, like our physicians, examine and scan the system looking for warning signs before they become issues. We may even instrument our critical business programs with invasive monitors that notify us when events happen. These checks run at regular intervals and watch for events that are similar to known past issues.
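As a rough illustration of such a check, here is a minimal sketch in Python using only the standard library. The function names, the thresholds and the choice of disk and load as the things to watch are all assumptions made for this example, not a description of any particular monitoring product.

```python
import os
import shutil
import time

# Hypothetical thresholds; tune them for your own systems.
DISK_USED_LIMIT = 0.90     # warn when a filesystem is more than 90% full
LOAD_PER_CPU_LIMIT = 1.5   # warn when 1-minute load exceeds 1.5 per CPU


def check_disk(path="/", limit=DISK_USED_LIMIT):
    """Return a warning string if `path` is fuller than `limit`, else None."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return None
    used_fraction = usage.used / usage.total
    if used_fraction > limit:
        return f"disk {path} is {used_fraction:.0%} full"
    return None


def check_load(limit=LOAD_PER_CPU_LIMIT):
    """Return a warning string if load per CPU exceeds `limit`, else None."""
    load_1min, _, _ = os.getloadavg()  # available on Unix-like systems only
    per_cpu = load_1min / (os.cpu_count() or 1)
    if per_cpu > limit:
        return f"load per CPU is {per_cpu:.2f}"
    return None


def run_checks(checks):
    """Run each check once and collect any warnings it reports."""
    warnings = []
    for check in checks:
        result = check()
        if result is not None:
            warnings.append(result)
    return warnings


def poll(checks, interval_seconds=60, rounds=3):
    """Run the checks `rounds` times, sleeping between rounds."""
    for i in range(rounds):
        for warning in run_checks(checks):
            print("WARNING:", warning)
        if i < rounds - 1:
            time.sleep(interval_seconds)
```

Calling `poll([check_disk, check_load])` would run both checks once a minute. Note what this cannot do: anything that happens between two rounds, or that no check was written for, goes unseen.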
Results
In this scenario, we are granted a kind of black-and-white operational vision. However, we are still unable to see in full color and with excellent foresight. This approach is still valuable, no matter which other methods we use.
At intervals, we run scripts which gather log data, poll running processes and collect system {cpu,ram,disk} resource information. Additionally, per-language instrumentation modifies language runtimes in order to collect trace data.
These are great sources of information, but they don’t compare to having detailed data from every process all the time, a “miss nothing” approach.
Brain and Central Nervous System
Medical
A physician’s dream would be to hook into the brain and central nervous system (BCNS) in order to intercept and know exactly what was happening at all times. The problem with such a monitoring system (if it even existed) is that it involves an extremely invasive operation and therefore should only be undertaken in cases where the benefits outweigh the risks.
System
The BCNS of our running systems is the kernel, and its nerves are called “modules”. We use a loaded kernel module to make our operational vision system become one with the BCNS of our running system.
Results
This approach is by far the most effective and complete way to obtain operational vision; however, the disadvantages are difficult to overcome. Each client must opt in to this expensive brain surgery. As with all surgeries, this is acceptable to but a few and causes great fear and anxiety. Others are turned away by the cost of such a procedure.
For these reasons, this approach is seldom chosen and not suitable for general use.
Heart Monitor Approach
Medical
All organs emit signals, and their functionality and behavior can be observed by watching these signals pass by. One example of this is wearing a heart monitor. We attach the monitor in the proper place and it intercepts our heart signals as our heart functions. After we have worn the monitor for a period of time, our physician collects the data from the chip either manually or by some remote submission mechanism.
System
Our running operating systems support this very same concept through a mechanism called a dynamic loader. The dynamic loader is used to add monitors, in the form of an additional library, that are automatically loaded with running processes. This monitor library merely observes (read-only) system calls as they occur and records them. Periodically, a collector process gathers the recorded data from each of these monitors for remote submission.
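The observe-record-collect pattern described above can be illustrated in miniature. The sketch below is not a dynamic-loader library; on Linux, such a monitor would typically be a compiled shared object preloaded through the loader (for example via `LD_PRELOAD`). It is only an in-process Python analogy, and the names `Monitor`, `observe`, `drain` and `summarize` are hypothetical: a wrapper records each call passing through it without altering the call, and a separate collector step periodically drains what was recorded.

```python
import functools
import threading
import time


class Monitor:
    """Records calls to wrapped functions without changing their behavior."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def observe(self, func):
        """Wrap `func` so every call is recorded, then passed through untouched."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                # Return values and exceptions pass through unchanged.
                return func(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                with self._lock:
                    self._events.append((func.__name__, elapsed))
        return wrapper

    def drain(self):
        """Hand all recorded events to the collector and start a fresh buffer."""
        with self._lock:
            events, self._events = self._events, []
        return events


def summarize(events):
    """Collector side: count recorded calls per function name."""
    counts = {}
    for name, _elapsed in events:
        counts[name] = counts.get(name, 0) + 1
    return counts
```

The property that matters is that the wrapped function neither knows nor cares that it is being watched, which is what keeps this style of monitoring low risk compared with modifying the program itself.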
Results
This approach hits the sweet spot between gathering the most data and the risk and effort involved in gathering it. The injected nanobot monitor watches all organs in the system using the same mechanism, so complete observation of system operations is possible.
Our Approach
Our goal for monitoring is to provide a rich suite of data with minimal impact on the system. After a ton of research, covering everything from rolling our own to evaluating multiple vendors, we chose to partner with AppFirst because their heart monitor/scanning approach to monitoring does just that. AppFirst’s add-on integration also provides quality operational insight for our customers. Please try it out right now and send us feedback on how it can better help your operations succeed!
At Engine Yard, we strive to maximize operational visibility while minimizing operational cost. We partner with our customers and focus on making them successful first, which will in turn make us successful.