SCOM: A Net-SNMP Management Pack for UNIX Monitoring, Part I – Desiging the MP

The Net-SNMP agent is a cross-platform SNMP agent that is often bundled with UNIX (e.g. Solaris) and Linux distributions (including VMWare and Checkpoint Splat).   The agent implements some relatively standard MIBs, such as the UC Davis (UCD-SNMP-MIB) and HOST-RESOURCES MIBs, exposing a good selection of operating system metrics for monitoring.  Furthermore, the agent is remarkably extensible in a few different ways.  The agent can be extended with “dlMod” module extensions to integrate additional SNMP components, such as the VMWare or HP Insight Management Agent modules.   Additionally, the Net-SNMP agent can be configured to proxy for another SNMP agent listening on a different port, which is how the Sun SunFire Management Agent is typically configured.   Lastly, the Net-SNMP agent provides the ability to define processes, files, extensible objects, load average objects, and disks for specific monitoring.  For each of these, directives are configured in snmpd.conf with parameters and thresholds, and an SNMP table is exposed that will return the monitored value and status value to an SNMP poll, on-demand.   I had previously  posted an article on using the ext (Extensible Object/Arbitrary Extension) capabilities of the Net-SNMP agent to utilize shell scripts for monitoring of Sun hardware.   With this degree of flexibility, it’s easy to understand why the Net-SNMP agent is so widely utilized.

Design Approach for the Net-SNMP Management Pack

Building on the lessons learned from developing the Cisco Management Pack, I wanted to create a management pack with similar capabilities for monitoring of Net-SNMP agents, followed by a set of management packs for monitoring of vendor-specific hardware (e.g. HP Proliant servers running Linux).   After some thought, I decided that it would be advantageous to split the Net-SNMP management pack into two separate management packs, one that defines the base classes and performs discovery of the classes and one for the actual monitoring.   The reason for this is that I wanted a base management pack without any actual monitoring functions so that it could easily (and more flexibly) be reused for future hardware monitoring management packs.   To illustrate this point, if the native UNIX agent capabilities of R2 are being used to monitor OS-level objects, there is still a need to discover the Net-SNMP agent and create base classes for use in SNMP-based hardware monitoring, but in this case, monitoring OS-level objects with SNMP and the OpsMgr agent would be redundant.  With the multiple MP approach, just the base Net-SNMP management pack could be imported along with the hardware monitoring management packs, skipping the Net-SNMP monitoring management pack altogether.  Additionally, I wanted to leave the monitoring management pack unsealed for easy customization, and MP references can only be made to sealed management packs.  In this way, the Net-SNMP Library MP will be sealed, but the monitoring MP can remain unsealed.

The Net-SNMP Library Management Pack

Read more of this post

SCOM: Advanced SNMP Monitoring Part IV: Making the Cisco MP SP1 Compatible

As previously mentioned, the Cisco Management Pack that I have been describing in the past several posts is specific to OpsMgr R2, due to its reliance on the capability of the System.SnmpProbe’s function to return multiple items in an SNMP walk, which is a feature introduced in R2.   My intention was to design the management pack so that it could easily be made SP1 compatible by substituting the discovery methodology, but keeping the classes and monitoring objects intact.   I am still doing more testing to finalize the SP1 version, but in this post, I am going to introduce the discovery methods that I have implemented for SP1 compatibility.

In general, when dealing with SNMP objects stored in tables, options in SCOM 2007 SP1 are a bit limited.  If the table objects are static, rules could be configured targeting specific OID’s, but this is unlikely to be a common scenario.   Another option (and a quite good one) would be to segregate monitoring duties between SCOM (for servers and applications) and a task-specific SNMP monitoring tool like SolarWinds or JalaSoft for SNMP device monitoring.   But for the purposes of this management pack, the option I pursued involves wrapping an external SNMP provider in WSH scripts to perform the discovery.   I chose to use the WMI provider because of its easy use in VBscripts and object-oriented capabilities, but something like the Net-SNMP binaries could be called with ShellExec methods in a vbs script to accomplish the same goal.  

Before I get into the use of the WMI SNMP provider in this MP, there is some background information that should be discussed.  Firstly, when the WMI SNMP provider is installed, it includes support only for the RFC-1213 MIB.   In order to use the WMI SNMP provider for OID’s not in the RFC-1213 MIB, the MIB files must be converted to MOF files with smi2smir.exe and then imported with mofcomp.exe.    This can get a bit tricky, largely depending on how well constructed the vendor’s MIB is and the dependencies of the MIB.   Once these steps are completed, the WMI SNMP provider exposes SNMP objects as collections that can be accessed natively in VBscripts.   However, the WMI provider is not terribly scalable and when overloaded with requests, WMI will cease to function in a timely fashion.   I had done some testing with using the WMI SNMP provider for rules and monitors, but decided against this approach due to the ease with which the provider can be overloaded when monitoring a handful of objects with reasonably aggressive frequencies.   On the other hand, discoveries are not as time-sensitive as monitors/rules and don’t need to run as frequently, so by using the WMI SNMP provider for discoveries, and the System.SnmpProbe provider for monitors and rules, this hybrid approach should remain pretty scalable. 

Implementation Overview

Read more of this post

SCOM: Updates to the Cisco Management Pack

I have uploaded version 1.01.27 of the Cisco Management Pack.   This minor update includes a few fixes and some new features added in response to some helpful comments from readers who have tried this MP out.  The changes in this version are:

  1. Group Population rules have been fixed so that they can be edited in the GUI
  2. Performance data collection has been modified to not collect data when the interface is in a partially discovered state.  This should prevent duplication of performance counters for interfaces going forward.   Because the Interface discovery occurs in two parts: 1) discovering the interfaces and 2) discovering the interface properties, I added a property to the Interface class named: Discovered.  When the Interface discovery runs (which just discovers the index), the discovered property will be set to False.  Once the Interface Property discovery runs, this property will be set to True.   The performance collection rules have been modified to only collect data if this property is set to true. 
  3. Index discovery can now be filtered with a RegEx expression for a device via an override (create the override on the Discover Cisco Interfaces discovery).   By default, the RegEx expression:  (.|..|…) will be used to discover interfaces, which will discover all interfaces with indexes with 1, 2, or 3 digits in length.  To filter by interface index, the expression can be modified to:  (1|2|3|10|15) where the numbers are the index values for the interfaces to be discovered, separated by pipe characters.   The override for this filter can be targeted at all Cisco devices, a group, or specific devices.
  4. I have added a new property to the Interface class named:  “Speed – Override.”  This property allows for manually defining a speed value for the interface to be used in utilization calculations.  There are two scenarios where this is useful: 1) sometimes the value reported in the RFC-1213 ifTable for an interface’s speed isn’t accurate, and 2) if an Ethernet port is the last monitored hop before a lower bandwidth wide-area connection (i.e. the vendor equipment is not monitored), it can be useful to monitor the Ethernet port as if its speed were equivalent to the wide-area connection bandwidth.   For example, if a switch is the last monitored device before connecting to a vendor-managed MetroE connection with a maximum bandwidth of 500Mbit, monitoring utilization on the gigabit switch port will be more accurate if the speed were overridden to 500Mbit.  This property can be set with an override on the Discover Cisco Interface Properties object discovery, by setting the OverrideIFSpeed value to the desired speed, in bits per second.

SCOM: Advanced SNMP Monitoring Part III: The Completed Cisco Management Pack

Note:  This management pack has been replaced by the more feature-rich xSNMP suite.

I still have some more full-scale testing to complete, but I have finished the first version of the Cisco monitoring management pack that I have described in the last few posts.   This version is R2 only, but I hope to have an SP1 version of it complete soon. 

Read more of this post

SCOM: Advanced SNMP Monitoring Part II: Designing the SNMP Monitors

For the Cisco Management Pack, a handful of different approaches were required when designing the monitors and rules and their supporting workflows.  These approaches can generally be placed into three categories:

  • Simple SNMP GET operations
  • Operations that require data manipulation
  • Operations that require access to previously collected values  

For example, monitoring of an interface’s operational status only requires current SNMP GET requests (to retrieve the ifOperStatus and ifAdminStatus values) from the SNMP ifTable.  Condition detection for the monitor type definition can handle the SNMP values as they are presented with no further manipulation.   However, to monitor a value such as the percent free bytes on a Cisco memory pool, data manipulation is required.  In this example, a percentage value is not exposed via SNMP, but free and available values are.  So, to calculate a percentage, the free and available bytes values must be summed and then the free value divided by that sum.  And lastly, in some cases, a value from a previous poll must be referenced to detect the desired condition, such as monitoring for an increase in the collisions reported for a Cisco interface.  In this example, the locIfCollisions values can be retrieved from the locIfTable, but this is a continually augmenting counter, which is difficult to monitor.   By recording the value from the immediately previous poll, and comparing it with the current poll, a delta value can be established for that polling cycle to determine the number of collisions recorded in a single polling cycle. 

Read more of this post

VBS: Decoding Base64 Strings (in 10 lines of code)

This code snippet can be used to decode Base64 encoded strings into plain text, which I have found useful recently when working on VBscript scripts that require decoding the Base64-encoded SNMP community strings stored in OpsMgr.  I thought it would be worth sharing.

Function Decode(strB64)
 strXML = “<B64DECODE xmlns:dt=” & Chr(34) & _
        “urn:schemas-microsoft-com:datatypes” & Chr(34) & ” ” & _
        “dt:dt=” & Chr(34) & “bin.base64” & Chr(34) & “>” & _
        strB64 & “</B64DECODE>”
 Set oXMLDoc = CreateObject(“MSXML2.DOMDocument.3.0”)
 oXMLDoc.LoadXML(strXML)
 decode = oXMLDoc.selectsinglenode(“B64DECODE”).nodeTypedValue
 set oXMLDoc = nothing
End Function

SCOM: Advanced SNMP Monitoring Part I: Discovering Cisco Devices and Hosted Classes

When limited to editing within the Authoring pane of the Operations Console in OpsMgr, SNMP monitoring options are relatively limited – only SNMP GET requests are supported and data cannot be manipulated before calculating health state.   However, with some pretty heavy authoring, very complex SNMP monitoring becomes completely viable, particularly in R2.   In the next several posts, I will be describing some methods to perform advanced SNMP discovery and monitoring in OpsMgr 2007, such as discovering hosting relationships for entities like network interfaces, as well as passing Snmp data to script probes in order to perform state change monitoring and mathematical operations.  When all is said and done, I will post my management packs for SNMP monitoring of Cisco, Checkpoint, and Net-SNMP devices.   In this post, I will describe the methods I used for custom discovery of Cisco SNMP devices and their hosted entities (interfaces, power supplies, etc).

Cisco SNMP Management Pack: Designing the Classes and Groups

In order to facilitate monitoring of the device (e.g. CPU, Memory, up/down status), individual interfaces (e.g. status, utilization, resets), and Cisco EnvMon objects (e.g. Power Supplies, Temperature, etc), each of these entities are best represented as a unique class with hosting relationships.   For the objects stored in tables, there are two feasible options for discovery.  In SCOM 2007 R2, the ‘walkreturnmultipleitems’ attribute of the System.SnmpProbe probe action can be set in order to walk an SNMP table and return each snmpdata item individually.   In SCOM 2007 SP1, this option is not available, requiring the use of a script discovery with an external snmp probe (such as WMI) in order to retrieve the discovery items.  In this post, I will describe the R2 method, and I will address the script option using the WMI SNMP provider in a later post.  In either case, the class design will be similar.

For monitoring of Cisco SNMP devices, I created five classes:

  • CiscoSNMP.Class.CiscoDevice
  • CiscoSNMP.Class.CiscoDevice.Interface
  • CiscoSNMP.Class.CiscoDevice.PowerSupply
  • CiscoSNMP.Class.CiscoDevice.Fan
  • CiscoSNMP.Class.CiscoDevice.TemperatureSensor

For the four hardware entities, I created a hosting relationship so that they are hosted by the CiscoDevice class.   The classes, properties, and relationships are represented on this diagram.

Read more of this post

Coming Soon: SNMP Monitoring for Changes in Polled Values

Most SNMP monitoring can be facilitated by comparing the value of a specific retrieved SNMP object to an expected string or threshold, but monitoring for some conditions can only really be accomplished by comparing the current value to a previous value.  

Three examples of this are:

1)      Serial Interface Flapping:  If a serial connection is experiencing problems, the interface may bounce up and down rapidly.  If an SNMP poll on that interface is occurring every 1, 3, or 5 minutes, it may not detect any problems (if the interface is up for the poll), meaning that compromised availability could go undetected for several polling cycles.  These conditions can be detected by comparing the Interface Resets (locIFResets) counter in the Cisco local interfaces table  to previous values.   

2)      Default Gateway Changes on Redundant Routers:  In some redundant WAN router deployments, a default gateway change on the routers is indicative of a redundancy failover.  Because all devices and interfaces may be up and reachable before and after a failover, it may be difficult to detect when the failover has occurred, potentially meaning production traffic is routed over a slower backup link.   This can be detected by monitoring the Default Gateway value (ipRouteNextHop in the ipRouteTable) on the routers and detecting changes when compared to previous polling cycles.

3)      High-Availability State Changes on CheckPoint SPLAT Firewalls:  In an HA configuration on CheckPoint firewalls, the haStatus value will return a string value of “active” or “standby.”  The best way to detect an HA failover is to watch this value for a change.

These are just three examples of many potential scenarios where monitoring of an SNMP object is best served by comparing current values to previously polled values.  Unfortunately, this capability is not a common feature in many of the monitoring tools that I am familiar with.  

Over the next week (or more), I will be posting articles about how I have to implemented just such a monitor for the three described scenarios using the two monitoring products that I currently work with:  SolarWinds ORION and System Center Operations Manager.  In the case of ORION, these monitors can be implemented fairly easily with a bit of SQL work.  In the case of SCOM, it’s a little bit more complicated, but ultimately doable.

SCOM: An SSRS Custom Report for SNMP Device Performance Data Collected by Rules

While the SCOM Reporting implementation provides a great set of reports out of the box, there are a number of custom reports which I have found useful to develop.   The report described here is one to report on aggregated hourly performance counters collected on SNMP Network Devices.

 

First up, the queries:

Read more of this post

SCOM: One More Use for Custom Resolution States – A “Closed by Operator” State

I had previously written a post about using customization of SCOM’s resolution states to generate an additional notification, on-demand, in order to send the event to a helpdesk software system.  Another resolution state customization that I have found useful facilitates the opposite effect, preventing notifications. 

Unlike those generated by monitors, alerts generated by rules in System Center Operations Manager are not linked to an object’s health state, and are not closed automatically.  Rather, they must be closed manually (in the console or with a script) or deleted after a set period of time (database grooming).  While it is typically desirable for notifications to be generated when alerts generated by monitors are closed, as an indication of the conditions resolution, this is not usually desirable for alerts generated by rules, as the notification would just indicate that an operator closed the alert.   This is particularly true if operators are not managing the console 24/7. 

If notification subscriptions are scoped to classes or groups (and not individual rules/monitors), an easy way that I have found to prevent notifications when rule-generated alerts are closed is by adding a custom resolution state with the name: “Closed by Operator” and a value such as 254 and then modifying the Open alert views in the Operations Console to filter on resolution status less than 254 (instead of “not equal to 255”).  This has the effect of logically assigning two potential closed states for alerts, the system default Closed state of 255 and the “Closed by Operator” state of 254.   In the Operations Console, operators can right click an alert, choose Set Notification State and choose “Closed by Operator,” clearing the alert from the open alert views, and not generating a notification. 

Figure 1