SCOM: Advanced SNMP Monitoring Part II: Designing the SNMP Monitors

For the Cisco Management Pack, a handful of different approaches were required when designing the monitors and rules and their supporting workflows.  These approaches can generally be placed into three categories:

  • Simple SNMP GET operations
  • Operations that require data manipulation
  • Operations that require access to previously collected values  

For example, monitoring an interface’s operational status requires only current SNMP GET requests against the SNMP ifTable (to retrieve the ifOperStatus and ifAdminStatus values).  Condition detection in the monitor type definition can handle the SNMP values as they are presented, with no further manipulation.   However, monitoring a value such as the percentage of free bytes in a Cisco memory pool requires data manipulation.  A percentage value is not exposed via SNMP, but the used and free byte counts are.  So, to calculate the percentage, the used and free values must be summed and the free value divided by that sum.  Lastly, in some cases a value from a previous poll must be referenced to detect the desired condition, such as an increase in the collisions reported for a Cisco interface.  The locIfCollisions value can be retrieved from the locIfTable, but it is a continually increasing counter, which is difficult to monitor directly.   By recording the value from the immediately preceding poll and comparing it with the current value, a delta can be calculated that gives the number of collisions recorded in a single polling cycle.
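The second and third categories can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the management pack's actual workflow. The function names, the argument names (used_bytes, free_bytes), and the 32-bit counter width are assumptions made for the example.

```python
def percent_free(used_bytes: int, free_bytes: int) -> float:
    """Percent of the pool that is free, from the two byte counts polled via SNMP."""
    total = used_bytes + free_bytes
    if total == 0:
        raise ValueError("pool reports zero total bytes")
    return 100.0 * free_bytes / total


def counter_delta(previous: int, current: int, counter_bits: int = 32) -> int:
    """Increase of an ever-growing counter between two polls.

    Assumes the counter wrapped at most once if the current value is lower
    than the previous one (a device restart would also look like this).
    """
    if current >= previous:
        return current - previous
    return current + 2 ** counter_bits - previous
```

For instance, a pool with 750 bytes used and 250 bytes free is 25% free, and a collision counter that moves from 100 to 150 between polls gives a delta of 50 for that cycle.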


VBS: Decoding Base64 Strings (in 10 lines of code)

This code snippet can be used to decode Base64 encoded strings into plain text, which I have found useful recently when working on VBscript scripts that require decoding the Base64-encoded SNMP community strings stored in OpsMgr.  I thought it would be worth sharing.

Function Decode(strB64)
 ' Wrap the Base64 text in an XML element typed as bin.base64
 strXML = "<B64DECODE xmlns:dt=" & Chr(34) & _
        "urn:schemas-microsoft-com:datatypes" & Chr(34) & " " & _
        "dt:dt=" & Chr(34) & "bin.base64" & Chr(34) & ">" & _
        strB64 & "</B64DECODE>"
 Set oXMLDoc = CreateObject("MSXML2.DOMDocument.3.0")
 oXMLDoc.LoadXML strXML
 ' nodeTypedValue returns the decoded value of the typed element
 Decode = oXMLDoc.selectSingleNode("B64DECODE").nodeTypedValue
 Set oXMLDoc = Nothing
End Function
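For comparison, the same decoding step can be sketched in Python with the standard library. The function name decode_b64 is made up for this example, and the default of UTF-16LE is an assumption about how the stored string was encoded, so pass a different encoding if the decoded text looks wrong.

```python
import base64


def decode_b64(encoded: str, encoding: str = "utf-16-le") -> str:
    """Decode a Base64 string to text.

    The default encoding is an assumption, not something confirmed above.
    """
    raw = base64.b64decode(encoded)
    return raw.decode(encoding)
```

Calling decode_b64("cHVibGlj", "utf-8") returns the string public.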

SCOM: Advanced SNMP Monitoring Part I: Discovering Cisco Devices and Hosted Classes

When limited to editing within the Authoring pane of the Operations Console in OpsMgr, SNMP monitoring options are fairly restricted: only SNMP GET requests are supported, and data cannot be manipulated before calculating health state.   However, with some pretty heavy authoring, very complex SNMP monitoring becomes completely viable, particularly in R2.   In the next several posts, I will describe some methods for performing advanced SNMP discovery and monitoring in OpsMgr 2007, such as discovering hosting relationships for entities like network interfaces, and passing SNMP data to script probes in order to perform state change monitoring and mathematical operations.  When all is said and done, I will post my management packs for SNMP monitoring of Cisco, Checkpoint, and Net-SNMP devices.   In this post, I will describe the methods I used for custom discovery of Cisco SNMP devices and their hosted entities (interfaces, power supplies, etc.).

Cisco SNMP Management Pack: Designing the Classes and Groups

In order to facilitate monitoring of the device (e.g. CPU, memory, up/down status), individual interfaces (e.g. status, utilization, resets), and Cisco EnvMon objects (e.g. power supplies, temperature sensors), each of these entities is best represented as a unique class with hosting relationships.   For the objects stored in tables, there are two feasible options for discovery.  In SCOM 2007 R2, the ‘walkreturnmultipleitems’ attribute of the System.SnmpProbe probe action can be set in order to walk an SNMP table and return each SnmpData item individually.   In SCOM 2007 SP1, this option is not available, so a script discovery with an external SNMP provider (such as WMI) is required to retrieve the discovery items.  In this post, I will describe the R2 method, and I will address the script option using the WMI SNMP provider in a later post.  In either case, the class design will be similar.

For monitoring of Cisco SNMP devices, I created five classes:

  • CiscoSNMP.Class.CiscoDevice
  • CiscoSNMP.Class.CiscoDevice.Interface
  • CiscoSNMP.Class.CiscoDevice.PowerSupply
  • CiscoSNMP.Class.CiscoDevice.Fan
  • CiscoSNMP.Class.CiscoDevice.TemperatureSensor

For the four hardware entities, I created a hosting relationship so that they are hosted by the CiscoDevice class.   The classes, properties, and relationships are represented on this diagram.


Monitoring for SNMP Value Changes with SolarWinds ORION NPM

I had previously described a few example scenarios in which monitoring SNMP values for changes (from the values in previous polling cycles) could be useful.   In this post, I will describe the steps to configure monitoring for these scenarios in SolarWinds ORION NPM. 

Detecting changes in Checkpoint Firewall (Splat) High Availability State

The Checkpoint MIB exposes a good set of SNMP objects for state and performance monitoring of Checkpoint Secure Platform firewalls.   The state of firewall modules can be polled with the xxStatCode (numeric) or xxStatShortDescr (string) objects.  For example, Secure Virtual Networking can be monitored with the svnStatCode (1.3.6.1.4.1.2620.1.6.101) or svnStatShortDescr (1.3.6.1.4.1.2620.1.6.102) objects.  The same pattern applies to the other modules, such as HA, DTPS, and WAM.   However, in order to detect HA failovers, I monitor the haState (1.3.6.1.4.1.2620.1.5.6) object for changes (e.g. from “standby” to “active”).

Detecting Default Gateway (ipRouteNextHop) Changes on Cisco Routers

In some redundant configurations, a change in the device’s default gateway may be the best indicator of a failover to an alternate wide-area connection, which could be a problem if the backup WAN link has less bandwidth.   The ipRouteNextHop (1.3.6.1.2.1.4.21.1.7) object is located in the ipRouteTable of the ubiquitous RFC1213 (MIB-II) MIB.  The device’s default gateway is the first row listed in this table (the table is indexed by destination, and the default route’s destination is 0.0.0.0).

Detecting Serial Interface Flapping

Increases in the locIfResets (1.3.6.1.4.1.9.2.2.1.1.17) Cisco counter on a serial interface are a good indicator of flapping on the serial connection.   If the serial interface resets two or more times in a polling cycle, we can probably assume that it is flapping.  An administrative shut and start would be one reset, so monitoring for two or more resets avoids alerts when planned maintenance is being performed.   If the reset count doesn’t change for a few polling cycles, it can probably be assumed that the connection has stabilized.
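The flapping logic described above can be sketched as a small state tracker in Python. This is an illustration only, not how ORION evaluates the alert. The class name, the threshold of two resets, and the choice of three quiet cycles for "a few polling cycles" are assumptions for the example.

```python
class FlapDetector:
    """Tracks an interface reset counter across polls and flags flapping."""

    def __init__(self, reset_threshold: int = 2, stable_cycles: int = 3):
        self.reset_threshold = reset_threshold  # resets per cycle that indicate flapping
        self.stable_cycles = stable_cycles      # unchanged cycles needed to clear the flag
        self.previous = None
        self.quiet = 0
        self.flapping = False

    def update(self, resets: int) -> bool:
        """Feed the counter value from the latest poll; returns the flapping state."""
        if self.previous is None:
            # First poll: nothing to compare against yet.
            self.previous = resets
            return self.flapping
        delta = resets - self.previous
        self.previous = resets
        if delta >= self.reset_threshold:
            self.flapping = True
            self.quiet = 0
        elif delta == 0:
            self.quiet += 1
            if self.quiet >= self.stable_cycles:
                self.flapping = False
        else:
            # A single reset, or a counter restart: no alert, but not a quiet cycle.
            self.quiet = 0
        return self.flapping
```

Feeding the counter values 10, 10, 13, 14, 14, 14, 14 raises the flag on the jump from 10 to 13 and clears it after three unchanged polls.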


Coming Soon: SNMP Monitoring for Changes in Polled Values

Most SNMP monitoring can be facilitated by comparing the value of a specific retrieved SNMP object to an expected string or threshold, but monitoring for some conditions can only really be accomplished by comparing the current value to a previous value.  

Three examples of this are:

1)      Serial Interface Flapping:  If a serial connection is experiencing problems, the interface may bounce up and down rapidly.  If an SNMP poll on that interface occurs every 1, 3, or 5 minutes, it may not detect any problem (if the interface happens to be up at the time of the poll), meaning that compromised availability could go undetected for several polling cycles.  These conditions can be detected by comparing the Interface Resets (locIfResets) counter in the Cisco local interfaces table to previous values.

2)      Default Gateway Changes on Redundant Routers:  In some redundant WAN router deployments, a default gateway change on the routers is indicative of a redundancy failover.  Because all devices and interfaces may be up and reachable before and after a failover, it may be difficult to detect when the failover has occurred, potentially meaning production traffic is routed over a slower backup link.   This can be detected by monitoring the Default Gateway value (ipRouteNextHop in the ipRouteTable) on the routers and detecting changes when compared to previous polling cycles.

3)      High-Availability State Changes on CheckPoint SPLAT Firewalls:  In an HA configuration on CheckPoint firewalls, the haState object will return a string value of “active” or “standby.”  The best way to detect an HA failover is to watch this value for a change.

These are just three examples of many potential scenarios where monitoring of an SNMP object is best served by comparing current values to previously polled values.  Unfortunately, this capability is not a common feature in many of the monitoring tools that I am familiar with.  
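The common idea behind these scenarios, remembering the last polled value and comparing against it, can be sketched in Python as follows. This is a generic illustration rather than the ORION or SCOM implementation. The class name and the in-memory dictionary are assumptions, and a real monitor would need to persist the previous value between polls.

```python
from typing import Dict, Optional, Tuple


class ChangeDetector:
    """Remembers the last value seen per (device, object) and reports changes."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[str, str], str] = {}

    def check(self, device: str, obj: str, value: str) -> Optional[Tuple[str, str]]:
        """Return (previous, current) if the value changed since the last poll, else None."""
        key = (device, obj)
        previous = self._last.get(key)
        self._last[key] = value
        if previous is not None and previous != value:
            return (previous, value)
        return None
```

The first poll of an object only records the value. A later poll that returns “active” where the previous one returned “standby” yields the pair of old and new values, which can drive an alert.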

Over the next week (or more), I will be posting articles about how I have implemented just such a monitor for the three described scenarios using the two monitoring products that I currently work with:  SolarWinds ORION and System Center Operations Manager.  In the case of ORION, these monitors can be implemented fairly easily with a bit of SQL work.  In the case of SCOM, it’s a little more complicated, but ultimately doable.

SCOM: An SSRS Custom Report for SNMP Device Performance Data Collected by Rules

While the SCOM Reporting implementation provides a great set of reports out of the box, there are a number of custom reports which I have found useful to develop.   The report described here is one to report on aggregated hourly performance counters collected on SNMP Network Devices.

 

First up, the queries:


Monitoring HP Hardware Status on VMWare ESX Servers

HP provides great SCOM management packs for monitoring Proliant servers, but these management packs support only Windows agents.  If you’re running ESX on Proliant servers, it takes a little more effort to implement monitoring of hardware status.  Fortunately, HP also offers its Management Agents for ESX.   Thus, all that is needed to monitor HP ESX server hardware is a set of custom monitors to poll the SNMP data exposed by the management agent.   An overview of the process is as follows:

Installing the HP Management Agent for ESX

  1. Configure SNMP on the ESX servers and set options: http://thwack.com/blogs/geekspeak/archive/2008/10/30/how-to-enable-snmp-on-a-vmware-esx-server.aspx
  2. Download the HP ESX agent (make sure your server model is supported by the agent) and copy the .tgz file to a temporary location on the ESX server
  3. Extract the file hpmgmt-8.x.x-vmware3x.tgz with a tar -zxvf command
  4. In the extracted directory, run the install script.  Later versions of the agent have a preinstall_setup.sh script, which must be run manually first and requires a reboot.
  5. Among other configuration prompts, you will be asked whether to use an existing snmpd.conf.  If you choose “no,” the install will create a new snmpd.conf that has to be configured with your SNMP settings.
  6. If you use an existing snmpd.conf, you will have to add one line to it:  cd to /etc/snmp/, edit snmpd.conf, and add the following line:  dlmod cmaX /usr/lib/libcmaX.so   (this extends the SNMP agent to include the HP objects as a module).
  7. Restart snmp with:  service snmpd restart

Testing

The HP agents implement the Compaq MIBs under the OID 1.3.6.1.4.1.232.   To test, you can use an SNMP browser to remotely connect and walk this OID, or you can run an snmpwalk command on the ESX server itself:  snmpwalk -v 2c -c <read-only community name> localhost 1.3.6.1.4.1.232

Monitoring with SCOM

  1. Discover the ESX servers as Network Devices
  2. Create a group for HP ESX servers (optionally in a new management pack).  You can use dynamic inclusion logic by setting a filter on the Device Description (Contains vmnix)
  3. Create your SNMP monitors and rules, targeting the SNMP Network Device class.  Configure the monitors and rules to be disabled, and then use an override to enable them for the HP ESX server group
  4. Create any required views or console tasks

What to Monitor?

When HP purchased Compaq, they made a smart decision in utilizing the Compaq SNMP MIBs for all HP servers, as this is one of the better vendor SNMP implementations out there.   It has remained very consistent over the years and, most importantly, it tends to implement a single status value for each group of subcomponents that are represented in SNMP tables, so you don’t have to walk the table to get the overall status.    Thus, instead of checking the status of each disk drive, which will vary in number (and identifier in the table), you can just poll the cpqDaMibCondition (1.3.6.1.4.1.232.3.1.3) from the CPQIDA MIB to get the overall intelligent drive array health.  The agent’s System Management web console can be used for drilling into specific problems, so from a monitoring perspective, it is really only necessary to know when there is a problem and what its general nature is.

These are the SNMP objects that I like to alert on for HP servers running UNIX:

Component                    Object Name                   OID
CPU Fans                     cpqHeThermalCpuFanStatus      1.3.6.1.4.1.232.6.2.6.5.0
Drive Array Health           cpqDaMibCondition             1.3.6.1.4.1.232.3.1.3.0
Drive Array Controller (1)   cpqDaCntlCondition            1.3.6.1.4.1.232.3.2.2.1.1.6.1
Power supplies               cpqHEfltTolPwrSupply          1.3.6.1.4.1.232.6.2.9.1.0
System Fans                  cpqHeThermalSystemFanStatus   1.3.6.1.4.1.232.6.2.6.4.0
Temperature (Status)         cpqHeThermalTempStatus        1.3.6.1.4.1.232.6.2.6.3.0
Thermal Conditions           cpqHeThermalCondition         1.3.6.1.4.1.232.6.2.6.1.0
Integrated Management Log    cpqHeEventLogCondition        1.3.6.1.4.1.232.6.2.11.2.0
Critical Errors              cpqHeCritLogCondition         1.3.6.1.4.1.232.6.2.2.2.0
Correctable Memory Errors    cpqHeCorrMemLogStatus         1.3.6.1.4.1.232.6.2.3.1.0
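As a rough sketch of how polled values from these objects might be turned into a health state, the Python below maps integer results to states. It assumes the common Compaq condition enumeration of other(1), ok(2), degraded(3), and failed(4). That enumeration is an assumption here and does not hold for every object (some status objects use different values), so each object should be checked against its MIB definition. The function names and the state labels are made up for the example.

```python
# Assumed enumeration; verify per object against the MIB before relying on it.
CONDITION_NAMES = {1: "other", 2: "ok", 3: "degraded", 4: "failed"}
HEALTH = {"ok": "Healthy", "degraded": "Warning", "failed": "Critical"}


def health_state(raw_value: int) -> str:
    """Translate a polled condition integer into a monitoring health state."""
    name = CONDITION_NAMES.get(raw_value, "other")
    return HEALTH.get(name, "Unknown")


def summarize(polled: dict) -> dict:
    """Return only the objects that are not healthy, keyed by object name."""
    states = {obj: health_state(value) for obj, value in polled.items()}
    return {obj: state for obj, state in states.items() if state != "Healthy"}
```

With this mapping, a poll that returns 2 for cpqDaMibCondition and 3 for cpqHeThermalCondition would report only the thermal condition, as a warning.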

For reference on SNMP MIBs, ByteSphere provides a great Online MIB Database.  The primary Compaq MIBs to look for are: CPQHLTH, CPQIDA, CPQSTSYS, CPQHOST, CPQNIC, CPQTHRSH.

Sun Hardware Monitoring with Net-SNMP and Shell Scripts

While Sun, like most server vendors, offers a comprehensive suite of hardware monitoring agents and management tools, it can be frustrating to monitor Sun hardware using the Sun Management Agent and a third-party SNMP tool, such as System Center Operations Manager.   The Sun Management Agent’s SNMP implementation builds on the Entity-MIB (http://docs.sun.com/app/docs/doc/817-3155/6mip4hnov?l=en&a=view) with the SUN Platform-MIB, and while all relevant hardware monitoring data are exposed through this MIB implementation, there are problems with deploying wide-scale monitoring of these objects using SNMP get requests.  This is because the list of entity objects varies by server model, and can even vary depending on the number of objects, like hard drives.

To expound on this point, Sun servers that run the SMA agent will list all hardware sensors in the entPhysicalTable of the ENTITY-MIB.  The id value for each of the hardware sensors will correspond to the id value for the sensor status (administrative and operational status) in the sunPlatEquipmentTable in the SUN Platform-MIB.   However, on one model, id 15 might correspond to CPU 0 Fan 0, while on another model, id 15 would correspond to a different sensor.  Likewise, if a server had two CPUs, id 17 might correspond to CPU 1 Fan 0, but if that system had only one CPU, id 17 would correspond to a different sensor.

If you could use an SNMP table or SNMP walk request and return the results to a script that parses the output, this would not be a problem.  But like many SNMP monitoring tools, SCOM implements SNMP gets only, meaning that this variability in the OIDs cannot be handled by a single set of monitors.
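To illustrate why a walk plus a parsing script would sidestep the problem, here is a minimal Python sketch that joins the two tables on their shared id, so that sensors are addressed by name rather than by a model-specific index. The function name, the sensor names, and the status strings are hypothetical, and the dictionaries stand in for the results of walking the name and status columns.

```python
def status_by_name(names: dict, statuses: dict) -> dict:
    """Join sensor names and sensor statuses on their shared table id."""
    return {names[i]: statuses[i] for i in names if i in statuses}


# Two hypothetical server models where the same fan lives at different ids.
model_a = status_by_name({15: "CPU 0 Fan 0", 16: "PS 0"},
                         {15: "enabled", 16: "enabled"})
model_b = status_by_name({15: "PS 0", 17: "CPU 0 Fan 0"},
                         {15: "enabled", 17: "disabled"})
```

In both cases the fan can be looked up as "CPU 0 Fan 0" regardless of which id it was assigned, which is exactly what a fixed-OID get monitor cannot do.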

So, what’s a way to work around this without committing to deploying a secondary monitoring tool just for monitoring Sun hardware?   One solution lies in the extensibility of the Net-SNMP agent, which is the default SNMP agent for Solaris.   Net-SNMP allows the agent’s functionality to be extended by assigning commands to OIDs.  With this configuration, whenever the OID is polled, the command is run on demand and its output is returned as the SNMP value.   For more on this functionality, see the Extending Agent Functionality section at: http://www.net-snmp.org/docs/man/snmpd.conf.html

To utilize this for hardware monitoring, a standard set of shell or Perl scripts can be written and deployed to a uniform path on all of your Sun servers, each configured to return a value such as “Pass” if everything checks out, or “Fail: <reason for failure>” if problems are found.   The scripts can be written to use different status-checking commands for maximum portability (for example, one status command on systems with software disk redundancy and another on systems with hardware RAID).   A great starting point for example monitoring scripts can be found at Sun’s BigAdmin site:  http://www.sun.com/bigadmin/scripts/indexMon.html
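The “Pass” / “Fail: <reason for failure>” convention can be sketched as below. The example is written in Python rather than shell or Perl purely for illustration. The function names, the disk names, and the "ok" state string are hypothetical, and the step that gathers the disk states from a platform command is deliberately left out.

```python
def check_disks(states: dict) -> list:
    """Return one failure message per disk whose state is not 'ok'."""
    return [f"{disk} is {state}"
            for disk, state in sorted(states.items())
            if state != "ok"]


def format_result(failures: list) -> str:
    """Produce the single-line value the agent would return for the polled OID."""
    if not failures:
        return "Pass"
    return "Fail: " + "; ".join(failures)
```

A script built this way would print format_result(...) as its only output, so that the monitoring tool can match on a string beginning with “Fail” and show the reason in the alert. How the script is registered against an OID depends on the agent version and is covered in the snmpd.conf documentation linked above.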

The net result is that with well-written scripts and the Net-SNMP agent, a single monitoring solution can be deployed to all Sun servers, independent of their hardware model.   With consistent configuration in the snmpd.conf, the OIDs for each of the scripts (e.g. CPU script, HDD script, etc.) will be the same and can be polled with a single set of SNMP get monitors in SCOM or another utility.