SCOM: UNIX/Linux Process Monitoring in the Net-SNMP MP In Detail
September 23, 2009
Regardless of the operating system, monitoring the availability and resource utilization of individual processes is a pretty standard requirement. Between WMI and PerfMon counters, this is easy on Windows systems, but doing the same on UNIX/Linux systems can be a bit more complicated. In Operations Manager 2007 (R2) environments, there are three general approaches (excluding third-party products) that can be used to monitor individual processes on UNIX and Linux systems:
- An agent-based solution using the R2 Cross Platform agents
- A purely SNMP solution using tables in the HOST-RESOURCES MIB
- An extended SNMP solution using the proc or exec directives in the Net-SNMP agent’s snmpd.conf file
I think it’s fair to say that in most cases (and when it is supported), the R2 Cross Platform agent is the best and most robust approach. However, it’s almost an inevitability in medium and large enterprises that there will be some UNIX or Linux servers or appliances running distributions not supported by the R2 agents. In these cases, or if there is another compelling reason not to deploy agent software to the device, SNMP may be the best or only option. The pure SNMP option is probably the most universally applicable approach, but introduces a number of challenges, which I will discuss in this post. The third option brings a great degree of flexibility (particularly with the exec directive, which can return the result of an on-demand shell script to an SNMP OID) but requires decentralized configuration.
The approach that I took in the Net-SNMP Management Pack is a hybrid of the pure SNMP and extended SNMP options. The latest version of the MP (which I will be posting soon) supports process resource utilization through the HOST-RESOURCES MIB tables in addition to process availability monitoring facilitated by identifying the monitored processes with the proc directive in snmpd.conf. And as described in the previous post about the MP, if ultimate flexibility is needed, the Extensible Object capability with the exec directive is still supported.
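To illustrate the two snmpd.conf directives discussed here, the configuration looks something like the following (the process names, counts, and script path are examples, not values from the MP; see the snmpd.conf man page for the full syntax):

```
# proc NAME [MAX [MIN]] -- raise the error flag in prTable when the
# number of NAME processes falls outside MIN..MAX (0 = no limit)
proc sendmail 10 1
proc sshd

# exec NAME PROG [ARGS] -- expose the exit status and output of an
# arbitrary command via the extensible-object OIDs
exec filecheck /bin/sh /usr/local/bin/filecheck.sh
```

After editing snmpd.conf, the snmpd service must be restarted (or signaled to re-read its configuration) for the new directives to take effect.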
UNIX/Linux SNMP Process Monitoring In-Depth
While it would be possible to configure SNMP process monitoring in a centralized fashion by defining processes to be monitored on an SNMP management server and then monitoring by walking the agent’s hrSwRun SNMP table, this poses a number of challenges with Operations Manager. The challenges exist on multiple levels and may be best illustrated with some examples.
In the simplest scenario, suppose the monitoring requirement was to confirm that at least one instance of a particular process was running on a system. An SnmpProbe could be configured to walk the hrSwRunName column of the hrSwRun table, with the output data items passed to an ExpressionFilter that matches the returned value against a defined process name variable. For any match, the workflow would continue and could be passed to a monitor type. However, if no match were found (indicating a problem), no data item would make it through the filter, and there would be nothing to evaluate in subsequent modules.
Even if a creative workaround were implemented for the filtering problem, an issue remains when the monitoring requirement is to monitor both a minimum and maximum number of instances of the process. If the workflow of Scheduler -> SnmpProbe (walk) -> Expression Filter were used in this scenario, every matched data item that passes through the Expression Filter would be handled distinctly. There would be no easy way to calculate the count of the data items to derive a number of running processes, making this an unappealing option in most cases (though I did use something like this in the Net-SNMP MP; for details, see the discussion of zombie process monitoring below).
What I concluded to be a more reasonable approach is to utilize the process monitoring capabilities of Net-SNMP and require that processes to be monitored for availability (minimum and maximum counts) be configured with proc directives in the snmpd.conf. Processes configured for monitoring in this fashion are then exposed in the UCD-SNMP prTable, which includes the process name, current count, an error flag value, and a pre-formatted error message. With this information readily exposed, it is easy to discover the defined process monitors in this table as OpsMgr monitored objects and monitor the error flag (and collect the error message for alert descriptions).
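For reference, processes configured with proc directives surface in the UCD-SNMP prTable (base OID .1.3.6.1.4.1.2021.2), which can be inspected with snmpwalk. The transcript below is illustrative of the table's shape rather than captured from a live agent; exact value formatting varies by Net-SNMP version:

```
$ snmpwalk -v 2c -c public hostname 1.3.6.1.4.1.2021.2
UCD-SNMP-MIB::prNames.1 = STRING: sendmail
UCD-SNMP-MIB::prMin.1 = INTEGER: 1
UCD-SNMP-MIB::prMax.1 = INTEGER: 10
UCD-SNMP-MIB::prCount.1 = INTEGER: 0
UCD-SNMP-MIB::prErrorFlag.1 = INTEGER: error(1)
UCD-SNMP-MIB::prErrMessage.1 = STRING: No sendmail process running
```

Each row corresponds to one proc directive, so discovering the rows of this table is effectively discovering the administrator's own list of processes worth monitoring.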
For the sake of thoroughness, there are some conceivable scenarios where a Scheduler -> SnmpProbe (walk) -> Condition Detection workflow would be an ideal approach for monitoring. For example, if the presence of a process in one or more instances indicated a problem condition, this workflow could be configured, followed by an alert-generating write action that fires an alert for every matched instance.
Resource Utilization Monitoring
Having settled on using the Net-SNMP agent’s process monitoring configuration for process availability monitoring, I deliberated on monitoring process resource utilization for some time (and skipped it altogether in the first version of the Net-SNMP MP). Performance metrics (memory used and CPU time) can be accessed in the hrSwRunPerf table, and rows in the hrSwRunPerf table correspond to rows in the hrSwRun table through the Index values (which match the PID of the process on the agent). This, too, poses a number of challenges, which I will try to describe through hypothetical examples.
If one wanted to monitor for any process consuming memory above a certain threshold, the hrSwRunPerfMem column of the hrSwRunPerf table could be walked and filtered to match only processes that exceeded the threshold. This could then be passed to an alert-generating write action. However, the only values available in this SnmpDataItem would be the OID and the memory use value, which would not make for a very friendly alert description.
If the requirement were instead to monitor the memory utilization of a particular process, it would be even more difficult. The hrSwRun table would have to be walked to find the OID of the process with a matching name, that OID would have to be passed to a script probe to calculate the PID/Index, and then another SnmpProbe would have to be initiated to retrieve the values – made all the more complicated by the fact that there may be more than one instance of the process to monitor.
One way to work around these challenges is to forgo SNMP walks in the monitors themselves: use SNMP walks to periodically discover the processes in the hrSwRun table and their properties, and then use the discovered index to formulate specific OIDs for use by the SnmpProbe with SNMP get requests. Running processes can be quite numerous and dynamic, so such a discovery-based approach necessitates filtering by process name to restrict discovery to only the processes to be monitored.

Because I had already decided to use the proc directive to facilitate process availability monitoring in the Net-SNMP MP, I extended that configuration by adding a new class for Monitored Process Instances, which are hosted by the Monitored Process objects. So, as it is configured in the MP, process monitoring items defined in snmpd.conf are discovered (followed by a second property discovery). Another discovery then walks the hrSwRun table to match process names in the table to already-discovered Monitored Process objects, and a final discovery populates the properties of these monitored processes. The Monitored Process object is monitored for availability by polling the prTable for the error flag value, and performance data for the Monitored Process Instance objects are collected by polling the hrSwRunPerf table. While the discoveries introduce some latency in performance monitoring every time a PID changes, the assumption is that the processes singled out for monitoring will not be that variable, as they likely represent production services that do not recycle with high frequency.
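The discovery-then-get pattern described above can be sketched as follows. The column OIDs come from the HOST-RESOURCES-MIB; the function and variable names are my own illustration, not anything from the MP itself:

```python
# HOST-RESOURCES-MIB column OIDs; the row index in both tables is the
# PID of the process on the agent
HR_SWRUN_NAME     = "1.3.6.1.2.1.25.4.2.1.2"   # hrSwRunName
HR_SWRUN_PERF_CPU = "1.3.6.1.2.1.25.5.1.1.1"   # hrSwRunPerfCPU (centiseconds)
HR_SWRUN_PERF_MEM = "1.3.6.1.2.1.25.5.1.1.2"   # hrSwRunPerfMem (KBytes)

def perf_oids_for_process(walked_names, process_name):
    """Given a {pid: name} dict built by walking hrSwRunName, return the
    specific hrSwRunPerf OIDs to poll with SNMP get requests for one
    process name (there may be several matching PIDs)."""
    pids = [pid for pid, name in walked_names.items() if name == process_name]
    return [(f"{HR_SWRUN_PERF_CPU}.{pid}", f"{HR_SWRUN_PERF_MEM}.{pid}")
            for pid in pids]
```

A periodic discovery would rebuild this PID-to-OID mapping, and the monitors would then issue cheap gets against the specific OIDs instead of walking the whole table on every polling interval.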
Process CPU Monitoring with the hrSwRunPerf Table
The reason I used process memory utilization in the preceding examples is that CPU utilization monitoring is more complex. The hrSwRunPerfCpu object is an SNMP counter indicating the number of centiseconds of total CPU time used by the process. Like other SNMP counters (such as ifInOctets in the ifTable), this value is only meaningful when two samples are compared over time. If my math is correct, the calculation is as follows:
There are 100 centiseconds in 1 second of “wall-clock” time (as the MIB describes it). So, 100% of CPU time in 1 second would be 100 centiseconds, 50% of CPU time in 1 second would be 50 centiseconds, and 25% of CPU time over 60 seconds would be 1500 centiseconds. The formula to calculate the fraction of CPU used (0 – 1) is X = delta / (time * 100); thus 1500 / (60 * 100) = .25. The formula to return the percentage as a 0 – 100 value is (delta * 100) / (time * 100), or simply delta / time.

As for calculating the delta, I utilized an approach similar to that used by a number of monitors in the Cisco MP: maintain a small XML file in a temporary directory (%temp%\Custom_SNMPFiles\) for each object to be monitored by this data source. If an existing file is found, the data source (via a property bag script probe) compares the time stamp and preceding value (if they are recent enough) to the current values, returns the calculated values as a property bag, and then writes the current values to the XML file. It should be noted that the MIB definition for the hrSwRunPerfCpu counter warns that multi-processor systems may report values in excess of “wall-clock” time, which could produce percent utilization values above 100%. However, I believe this is an issue that has been resolved in most UNIX/Linux distributions.
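The delta math and the state-file idea can be sketched as follows. This is a minimal illustration, not the MP's actual property bag script probe; the JSON state file, its layout, and the function names are assumptions of the sketch:

```python
import json
import os
import time

# mirrors the MP's %temp%\Custom_SNMPFiles\ state directory
STATE_DIR = os.path.join(os.environ.get("TEMP", "/tmp"), "Custom_SNMPFiles")

def cpu_percent(prev_centisecs, curr_centisecs, interval_secs):
    """hrSwRunPerfCpu is a running total in centiseconds, so percent CPU
    over the interval is simply delta / time (the 100 centiseconds per
    second cancels the factor of 100). Clamp at 100 for the
    multi-processor case the MIB warns about."""
    delta = curr_centisecs - prev_centisecs
    return min(100.0, delta / interval_secs)

def sample(key, curr_centisecs, now=None):
    """Return percent CPU since the last sample for `key`, or None on the
    first sample, persisting state between polls like the MP's XML files."""
    now = now if now is not None else time.time()
    os.makedirs(STATE_DIR, exist_ok=True)
    path = os.path.join(STATE_DIR, f"{key}.json")
    result = None
    if os.path.exists(path):
        with open(path) as f:
            prev = json.load(f)
        elapsed = now - prev["ts"]
        if elapsed > 0:
            result = cpu_percent(prev["value"], curr_centisecs, elapsed)
    with open(path, "w") as f:
        json.dump({"ts": now, "value": curr_centisecs}, f)
    return result
```

The first poll for a given process returns nothing (there is no baseline yet), which matches the behavior of any counter-delta data source: a value can only be emitted from the second sample onward.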
Zombie Process Monitoring
I never fully understood why many UNIX monitoring packages included monitoring for zombie processes until one instance many years ago when I was troubleshooting a problem on an AIX 4.3 server. In the course of troubleshooting, I checked the running processes with a ps -ef command and was a bit surprised as thousands of processes scrolled by. It may not be a common problem, but it was one that I wanted to include monitoring for in the Net-SNMP MP. To implement this monitor, I utilized a collection rule that fires an SnmpProbe to walk the hrSwRun table and filter for processes with a status of invalid. Every match is passed to a script write action that writes the device IP and PID to a text file in a temporary directory. The monitor itself utilizes a script that scans the temporary directory for files matching the device IP (and recently modified) to tabulate a count of zombie processes. This value is then compared to a threshold to determine health state. To keep the monitoring script running in a reasonable time, it also cleans up old files in the temporary directory. In some cases, the collecting data source could spawn a high number of cscript processes concurrently if many matches were found, but I don't think this will be a problem: the condition should be relatively rare, the default threshold (25) is pretty low, and the chances of many monitored hosts having a high number of zombie processes at the same time are slim.
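The tabulation step can be sketched like this (a simplified stand-in for the MP's script, which is VBScript; the directory layout and file-naming convention here are assumptions of the sketch):

```python
import os
import time

def count_zombies(temp_dir, device_ip, max_age_secs=900):
    """Count marker files written by the collection rule for one device,
    removing (and not counting) files older than max_age_secs so the
    directory does not grow without bound."""
    now = time.time()
    count = 0
    for name in os.listdir(temp_dir):
        # assume markers are named "<device_ip>_<pid>.txt"
        if not name.startswith(device_ip + "_"):
            continue
        path = os.path.join(temp_dir, name)
        if now - os.path.getmtime(path) > max_age_secs:
            os.remove(path)  # stale marker from an earlier interval
        else:
            count += 1
    return count
```

Filtering by modification age is what lets a single marker directory serve multiple devices and intervals: a zombie that has since been reaped simply stops producing fresh markers, and its old files age out on the next pass.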
While this has been a rather wordy post, I hope I was able to illustrate some of the challenges in monitoring tabular SNMP objects with OpsMgr 2007 R2, as well as the methods I implemented to address those challenges in the Net-SNMP MP.