Coming Soon: SNMP Monitoring for Changes in Polled Values

Most SNMP monitoring can be facilitated by comparing the value of a specific retrieved SNMP object to an expected string or threshold, but monitoring for some conditions can only really be accomplished by comparing the current value to a previous value.  

Three examples of this are:

1)      Serial Interface Flapping:  If a serial connection is experiencing problems, the interface may bounce up and down rapidly.  If an SNMP poll on that interface occurs only every 1, 3, or 5 minutes, it may not detect any problem (if the interface happens to be up at poll time), meaning that compromised availability could go undetected for several polling cycles.  These conditions can be detected by comparing the Interface Resets (locIfResets) counter in the Cisco local interfaces table to previous values.

2)      Default Gateway Changes on Redundant Routers:  In some redundant WAN router deployments, a default gateway change on the routers is indicative of a redundancy failover.  Because all devices and interfaces may be up and reachable before and after a failover, it may be difficult to detect when the failover has occurred, potentially meaning production traffic is routed over a slower backup link.   This can be detected by monitoring the Default Gateway value (ipRouteNextHop in the ipRouteTable) on the routers and detecting changes when compared to previous polling cycles.

3)      High-Availability State Changes on CheckPoint SPLAT Firewalls:  In an HA configuration on CheckPoint firewalls, the haStatus value will return a string value of “active” or “standby.”  The best way to detect an HA failover is to watch this value for a change.

These are just three examples of many potential scenarios where monitoring of an SNMP object is best served by comparing current values to previously polled values.  Unfortunately, this capability is not a common feature in many of the monitoring tools that I am familiar with.  
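The idea behind all three scenarios can be sketched in a few lines of Python. This is only an illustration of comparing a polled value to the previous one, not the ORION or SCOM implementation; the class and function names are hypothetical, and the 32-bit wrap handling is an assumption about the counter type.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ChangeMonitor:
    """Remembers the last polled value per (device, OID) and reports changes."""

    last_seen: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def check(self, device: str, oid: str, value: str) -> Optional[str]:
        """Return a description of the change, or None if nothing changed."""
        key = (device, oid)
        previous = self.last_seen.get(key)
        self.last_seen[key] = value  # keep for the next polling cycle
        if previous is None or previous == value:
            return None  # first poll, or value unchanged
        return f"{device} {oid}: {previous!r} -> {value!r}"


def counter_delta(previous: int, current: int, wrap: int = 2**32) -> int:
    """Increase of a counter between two polls, allowing for one wrap.

    Assumes a 32-bit counter; adjust `wrap` for other counter sizes.
    """
    if current >= previous:
        return current - previous
    return current + wrap - previous
```

A string value such as the firewall HA state would go through `ChangeMonitor.check`, while a resets counter would be alerted on whenever `counter_delta` is greater than zero.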

Over the next week (or more), I will be posting articles about how I have implemented just such a monitor for the three described scenarios using the two monitoring products that I currently work with:  SolarWinds ORION and System Center Operations Manager.  In the case of ORION, these monitors can be implemented fairly easily with a bit of SQL work.  In the case of SCOM, it’s a little bit more complicated, but ultimately doable.

SCOM: An SSRS Custom Report for SNMP Device Performance Data Collected by Rules

While the SCOM Reporting implementation provides a great set of reports out of the box, there are a number of custom reports which I have found useful to develop.   The report described here is one to report on aggregated hourly performance counters collected on SNMP Network Devices.

 

First up, the queries:

SCOM: One More Use for Custom Resolution States – A “Closed by Operator” State

I had previously written a post about using customization of SCOM’s resolution states to generate an additional notification, on-demand, in order to send the event to a helpdesk software system.  Another resolution state customization that I have found useful facilitates the opposite effect, preventing notifications. 

Unlike those generated by monitors, alerts generated by rules in System Center Operations Manager are not linked to an object’s health state, and are not closed automatically.  Rather, they must be closed manually (in the console or with a script) or deleted after a set period of time (database grooming).  While it is typically desirable for notifications to be generated when alerts generated by monitors are closed, as an indication of the condition’s resolution, this is not usually desirable for alerts generated by rules, as the notification would just indicate that an operator closed the alert.   This is particularly true if operators are not managing the console 24/7. 

If notification subscriptions are scoped to classes or groups (and not individual rules/monitors), an easy way that I have found to prevent notifications when rule-generated alerts are closed is to add a custom resolution state named “Closed by Operator” with a value such as 254, and then modify the open alert views in the Operations Console to filter on resolution state less than 254 (instead of “not equal to 255”).  This has the effect of logically assigning two potential closed states for alerts: the system default Closed state of 255 and the “Closed by Operator” state of 254.   In the Operations Console, operators can right-click an alert, choose Set Resolution State, and choose “Closed by Operator,” clearing the alert from the open alert views without generating a notification. 
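The view filter amounts to a simple comparison. As a rough sketch (the alert names are made up, and the value 254 is just the example value used above):

```python
CLOSED = 255              # SCOM's built-in Closed state
CLOSED_BY_OPERATOR = 254  # custom state, value chosen as in the text above


def is_open(resolution_state: int) -> bool:
    """Open-alert view filter: anything below the lowest 'closed' value."""
    return resolution_state < CLOSED_BY_OPERATOR


# Hypothetical alerts, for illustration only
alerts = [
    {"name": "Disk latency", "resolution_state": 0},
    {"name": "Config change", "resolution_state": CLOSED_BY_OPERATOR},
    {"name": "Service restart", "resolution_state": CLOSED},
]
open_alerts = [a["name"] for a in alerts if is_open(a["resolution_state"])]
```

Filtering on "less than 254" rather than "not equal to 255" is what keeps operator-closed alerts out of the open views.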

Figure 1

EventLog Search: A Utility App for Searching Windows Event Logs

On a number of occasions in the past few weeks, I’ve found myself wishing there was an easy way to search a machine’s event log for events that matched a string in the description.   I’ve primarily wanted such a utility while troubleshooting SCOM management packs, where I wanted to be able to filter for events with the management pack name in the description, from the Health Service Modules source, and with an event ID between 21000 and 22000.  While the native Windows Event Viewer provides filtering capabilities, you can’t enter a filter for the event message/description or search on a range of event IDs.   I believe there are third-party tools that let you do this, but it seemed pretty easy to whip up a Windows Forms application to provide this search functionality.   I was able to put something together in pretty short order, and it can be downloaded here.  The only requirement should be .NET version 2, and the utility doesn’t have to be installed to run.   I haven’t taken the time to implement much error handling or any documentation, but it seems to fit the bill pretty well and it should be pretty self-explanatory.
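The search criteria described above boil down to three tests per event. The following is not the utility's code (which is a .NET Windows Forms application); it is a small Python sketch of the same filter logic over already-retrieved events, with a hypothetical `Event` record.

```python
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    """Hypothetical, simplified event record."""
    source: str
    event_id: int
    message: str


def search_events(events: Iterable[Event], source: str,
                  id_min: int, id_max: int, text: str) -> Iterator[Event]:
    """Yield events from `source` whose ID is within the inclusive range
    and whose message contains `text` (case-insensitive)."""
    needle = text.lower()
    for event in events:
        if (event.source == source
                and id_min <= event.event_id <= id_max
                and needle in event.message.lower()):
            yield event
```

For the troubleshooting case above, the call would look something like `search_events(events, "Health Service Modules", 21000, 22000, "WebMon")`.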

The main search window:

The event details window (double-click a row in the Data Grid View to open):

WebMon: A SCOM Management Pack for Basic Web Site Monitoring, Configured with a Single XML File, Part II

The WebMon URL Monitoring management pack that I described in the previous post can be downloaded here.    Notes on deploying and configuring the MP are as follows:

Overview:

This management pack for SCOM 2007 is intended to provide basic web monitoring for multiple web sites while being very easy to deploy.   All configuration for each monitored URL is performed by editing a single XML file on the Watcher Node.   The management pack implements three classes:  1) WebMon Watcher Node, which hosts 2) WebMon Request and 3) WebMon Secure Request classes.   The Request and Secure Request classes are identical, except the Secure Request class utilizes NTLM/Integrated authentication through a RunAs profile.   The monitors and rules implemented are as follows (for each of the two request classes):

  • Monitors:
    • DNS Resolution Failure
    • Status Code (greater than a threshold)
    • Reachable
    • Error Code
    • CA Untrusted
    • Certificate Expired
    • Certificate Invalid
    • Response Time (greater than a threshold)
  • Rules:
    • Response Time Performance Collection

The monitors are rolled-up into an aggregate monitor used for alerting for each Request class.   A set of views are also created under the Web Application folder in the Monitoring section of the SCOM console.  These include state views for the Watcher Nodes, two Request classes, and a performance view displaying the historical response time for all requests. 

The advantage of this management pack over the native SCOM Web Application monitoring implementation for basic web monitoring is that it can be rapidly deployed and configured by simply editing the XML configuration file. 

Deployment and Configuration:

To deploy the WebMon URL Monitoring Management Pack:

  • Copy the sample webmonconfig.xml to a location on a local disk drive of each node intended to be a watcher node (the default path is C:\webmon\webmonconfig.xml)
  • Edit the configuration file with the desired settings (see the XML configuration section below)
  • Import the WebMon.xml file using the Import Management Pack function in the Operations Manager console, under Administration
  • Access the Authoring section of the Operations Manager console, click the Change Scope link and limit the scope to WebMon Watcher Node. 
  • Right-click the “WebMon Discovery” object under “Discovered Type: WebMon Watcher Node,” and choose Overrides->Override the Object Discovery->For a Specific Object of Class: Windows Server.   Select the servers designated as Watcher Nodes, and override the object discovery to be “enabled.”
  • If the webmonconfig.xml file is deployed to a non-default location, override the script arguments and update the first parameter (c:\webmon\webmonconfig.xml) to reflect the actual location of the configuration file.
  • The default interval for the object discovery is 15 minutes.  If this needs to be changed, edit the properties of the WebMon Discovery object, and adjust the schedule accordingly. 
  • If any sites are to be monitored with NTLM/Integrated authentication, a RunAs Profile must be configured.
    • Create a new RunAs account to be used by the watcher node, or determine an existing account to use.  
    • In the Administration section of the Operations Manager console, click RunAs Profiles.  Edit the properties of the WebMon Request RunAs Profile. 
    • Assign a RunAs account for the designated Watcher Node

Configuration

All configurable elements of the WebMon URL Monitoring Management Pack can be set in the webmonconfig.xml on the Watcher Node.   Multiple requests can be defined in the configuration file by adding another <request/> element.   The discovery script performs some basic validation, but the XML configuration should be edited carefully in order to prevent inadvertent errors due to invalid configuration.

The XML configuration file looks like:

<?xml version="1.0" encoding="utf-8"?>
<webmonconfig>
  <requests>
    <request>
      <requesturl>http://www.google.com</requesturl>
      <responsetimethreshold>10</responsetimethreshold>
      <retrycount>1</retrycount>
      <pollinginterval>300</pollinginterval>
      <statuscodevalue>399</statuscodevalue>
      <usentlm>false</usentlm>
    </request>
    <request>
      <requesturl>http://www.microsoft.com</requesturl>
      <responsetimethreshold>15</responsetimethreshold>
      <retrycount>0</retrycount>
      <pollinginterval>180</pollinginterval>
      <statuscodevalue>399</statuscodevalue>
      <usentlm>false</usentlm>
    </request>
  </requests>
</webmonconfig>

The configuration elements are:

requests/request

  • requesturl:  the URL of the request, either http:// or https://
  • responsetimethreshold:  the response time threshold in seconds; if this is surpassed, a warning alert will be generated
  • retrycount:  the number of attempts to retry the request, 0 or greater
  • pollinginterval:  the interval in seconds between requests
  • statuscodevalue:  an error alert will be generated if the response status code is greater than this value
  • usentlm:  input true for this value if the site requires NTLM authentication.  The credentials used are defined in the WebMon Request URL RunAsProfile
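Because a malformed file is the most likely source of trouble, it can help to check the configuration before deploying it. The sketch below is not the management pack's discovery script; it is a hypothetical Python pre-check that reads the elements listed above, and the validation rules it applies (required elements, URL scheme, non-negative retry count) are my own assumptions based on the element descriptions.

```python
import xml.etree.ElementTree as ET

# One request from the sample configuration (XML declaration omitted)
SAMPLE = """
<webmonconfig>
  <requests>
    <request>
      <requesturl>http://www.google.com</requesturl>
      <responsetimethreshold>10</responsetimethreshold>
      <retrycount>1</retrycount>
      <pollinginterval>300</pollinginterval>
      <statuscodevalue>399</statuscodevalue>
      <usentlm>false</usentlm>
    </request>
  </requests>
</webmonconfig>
"""


def _required_int(node, tag):
    """Read a child element as an integer, failing clearly if it is absent."""
    text = node.findtext(tag)
    if text is None:
        raise ValueError(f"missing <{tag}> element")
    return int(text)


def load_requests(xml_text):
    """Parse webmonconfig XML into a list of dicts, one per <request>."""
    root = ET.fromstring(xml_text.strip())
    parsed = []
    for node in root.findall("requests/request"):
        url = (node.findtext("requesturl") or "").strip()
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"requesturl must be http:// or https://: {url!r}")
        retries = _required_int(node, "retrycount")
        if retries < 0:
            raise ValueError("retrycount must be 0 or greater")
        parsed.append({
            "url": url,
            "response_time_threshold": _required_int(node, "responsetimethreshold"),
            "retry_count": retries,
            "polling_interval": _required_int(node, "pollinginterval"),
            "status_code_value": _required_int(node, "statuscodevalue"),
            "use_ntlm": (node.findtext("usentlm") or "false").strip().lower() == "true",
        })
    return parsed
```

A stray or unclosed tag makes `ET.fromstring` raise a parse error, which catches the most common editing mistake before the file reaches a watcher node.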

Using the Management Pack

For both the WebMon Request and WebMon Secure Request classes, aggregate monitors are configured to generate alerts if any of the monitors trigger a warning or error health state.   Health states can be viewed in the Monitoring section of the Operations Manager Console, under the Web Application\WebMon URL Monitoring folder.   The views include state views for the WebMon Watcher Node, WebMon Request, and WebMon Secure Request classes as well as a view to display the collected response time data for all of the URL requests.  While overrides on the individual monitors can be configured, configuration should be performed in the webmonconfig.xml file. 

Notes on Editing the Management Pack

Prior to editing the management pack, please reference the link below to read more about the design of this management pack.   Due to some issues with the way the Authoring Console handles MonitorTypes and the use of variables in some configuration elements, all edits should be made in an XML editor. 

Support and More Info

This management pack is provided as-is, with no implied or explicit warranty.    For more info about the design and development of this management pack, reference:  https://operatingquadrant.com/2009/08/22/webmon-a-scom-management-pack-for-basic-web-site-monitoring-configured-with-a-single-xml-file-part-i/

WebMon: A SCOM Management Pack for Basic Web Site Monitoring, Configured with a Single XML File, Part I

The native Web Application monitoring capabilities of SCOM are impressive to say the least, and provide excellent functionality for in-depth monitoring of complex web application transactions.   However, the administrative effort required to configure web application monitoring makes the implementation less than ideal for wide-scale basic web site monitoring.  In most cases, required web monitoring would entail monitors just for status code, reachability, response time, and perhaps some other checks like certificate validity.  I wanted to create a custom management pack to implement these monitors, with a minimal degree of configuration effort.   While researching available options, I came across a post by Russ Slaten describing a way to utilize the Microsoft SystemCenter WebApplication Library implementation to accomplish a similar goal.  

However, I wanted to take it a step further.  The key decision point in my design approach was that I wanted to be able to deploy a configuration file (in XML format) to each watcher node involved and define the URL’s and monitoring properties in that file.   I have completed this management pack, and I’m quite happy with it thus far.   I’ve described the Management Pack and development process below.

The WebMon MP, Design and Development

Recursively Listing Security Group Members with PowerShell

It seems like a favorite request of auditors is one for lists of all members of a set of local or domain groups that are associated with a resource that is being audited, and these requests typically stipulate that all members of nested groups must be listed as well (i.e. full recursion).  I used to use a VBS script to perform this functionality, but I’ve recently rewritten it in PowerShell.  The script accepts a group name (either local or domain) as well as a recursion depth as command line arguments and outputs a list of all group members to a text file.

The method used to retrieve the group members is:

 $Group = [ADSI]"WinNT://$GroupName,group"
 $Members = @($Group.psbase.Invoke("Members"))
 

With that, it’s just a matter of configuring a function that accepts a group name as an input parameter, outputs the members, and loops through the member groups until the defined recursion depth is reached.
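The recursion itself is independent of how members are retrieved. As a language-neutral illustration (this is not the downloadable PowerShell script), here is a Python sketch in which a plain dictionary stands in for the directory lookup; the function name, the indentation-based output, and the guard against circular nesting are my own choices.

```python
from typing import Dict, List, Optional, Set


def list_members(group: str, directory: Dict[str, List[str]], max_depth: int,
                 depth: int = 0, seen: Optional[Set[str]] = None) -> List[str]:
    """Return indented member names, expanding nested groups up to max_depth levels.

    `directory` maps a group name to its direct members; any member that is
    itself a key in `directory` is treated as a nested group.
    """
    seen = set() if seen is None else seen
    lines: List[str] = []
    if group in seen:
        return lines  # already expanded: guards against circular nesting
    seen.add(group)
    for member in directory.get(group, []):
        lines.append("  " * depth + member)
        if member in directory and depth + 1 < max_depth:
            lines.extend(list_members(member, directory, max_depth, depth + 1, seen))
    return lines


# Hypothetical groups, including a circular reference
directory = {"Admins": ["alice", "Ops"], "Ops": ["bob", "Admins"]}
report = list_members("Admins", directory, max_depth=3)
```

With a `max_depth` of 1, only the direct members of the top-level group are listed.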

The script can be downloaded here.  And the output looks like:

The script could be easily modified to accept a text file with a list of group or server names as input, or modified to output the results in HTML instead of plain text.

Monitoring HP Hardware Status on VMWare ESX Servers

HP provides great SCOM management packs for monitoring of Proliant servers, but only Windows agents are supported by these management packs.  If you’re running ESX on Proliant servers, it takes a little bit more effort to implement monitoring of hardware status.  Fortunately, HP also offers their Management Agents for ESX.   Thus, all that is needed to monitor HP ESX server hardware are some custom monitors to poll the snmp data exposed by the management agent.   An overview of the process for this is as follows:

Installing the HP Management Agent for ESX

  1. Configure SNMP on the ESX servers and set options: http://thwack.com/blogs/geekspeak/archive/2008/10/30/how-to-enable-snmp-on-a-vmware-esx-server.aspx
  2. Download the HP ESX agent (make sure your server model is supported by the agent) and copy the .tgz file to a temporary location on the ESX server
  3. Extract the file hpmgmt-8.x.x-vmware3x.tgz with a tar -zxvf command
  4. In the extracted directory, run the install script — later versions of the agent have a preinstall_setup.sh script which is to be manually run first, and requires a reboot. 
  5. Amongst other configuration prompts, you will be asked whether to use an existing snmpd.conf.  If you choose “no,” the install will create a new snmpd.conf that has to be configured with your SNMP settings.
  6. If you use an existing snmpd.conf, you will have to add one line to it:  cd to /etc/snmp/ and edit snmpd.conf.  Add the following line:  dlmod cmaX /usr/lib/libcmaX.so   – this extends the SNMP agent to include the HP objects as a module.
  7. Restart snmp with:  service snmpd restart

Testing

The HP agents implement the Compaq MIBs under the OID 1.3.6.1.4.1.232.   To test, you can use an SNMP browser to remotely connect and walk this OID, or, from the ESX server, you can use an snmpwalk command:  snmpwalk -v 2c -c <read-only community name> localhost 1.3.6.1.4.1.232

Monitoring with SCOM

  1. Discover the ESX servers as Network Devices
  2. Create a group for HP ESX servers (optionally in a new management pack).  You can use dynamic inclusion logic by setting a filter on the Device Description (Contains vmnix)
  3. Create your SNMP monitors and rules, targeting the SNMP Network Device class.  Configure the monitors and rules to be disabled, and then use an override to enable them for the HP ESX server group
  4. Create any required views or console tasks

What to Monitor?

When HP purchased Compaq, they made a smart decision in utilizing the Compaq SNMP MIBs for all HP servers, as this is one of the better vendor SNMP implementations out there.   It has remained very consistent over the years and, most importantly, it tends to implement a single status value for each group of subcomponents that are represented in SNMP tables, so you don’t have to walk the table to get the overall status.    Thus, instead of checking the status of each disk drive, which will vary in number (and identifier in the table), you can just poll the cpqDaMibCondition (1.3.6.1.4.1.232.3.1.3) from the CPQIDA MIB to get the overall intelligent drive array health.  The agent’s System Management web console can be used for drilling into specific problems, so from a monitoring perspective, it is really only necessary to know when there is a problem, and what its general nature is.

These are the SNMP objects that I like to alert on for HP servers running UNIX:

Component                    Object Name                    OID
CPU Fans                     cpqHeThermalCpuFanStatus       1.3.6.1.4.1.232.6.2.6.5.0
Drive Array Health           cpqDaMibCondition              1.3.6.1.4.1.232.3.1.3.0
Drive Array Controller (1)   cpqDaCntlCondition             1.3.6.1.4.1.232.3.2.2.1.1.6.1
Power supplies               cpqHEfltTolPwrSupply           1.3.6.1.4.1.232.6.2.9.1.0
System Fans                  cpqHeThermalSystemFanStatus    1.3.6.1.4.1.232.6.2.6.4.0
Temperature (Status)         cpqHeThermalTempStatus         1.3.6.1.4.1.232.6.2.6.3.0
Thermal Conditions           cpqHeThermalCondition          1.3.6.1.4.1.232.6.2.6.1.0
Integrated Management Log    cpqHeEventLogCondition         1.3.6.1.4.1.232.6.2.11.2.0
Critical Errors              cpqHeCritLogCondition          1.3.6.1.4.1.232.6.2.2.2.0
Correctable Memory Errors    cpqHeCorrMemLogStatus          1.3.6.1.4.1.232.6.2.3.1.0
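The condition-type objects in these MIBs return an integer rather than text, so a monitor needs to translate it. The sketch below assumes the usual Compaq condition enumeration (1 = other, 2 = ok, 3 = degraded, 4 = failed); not every object in the list uses it, so each object's definition should be checked in its MIB before relying on this mapping. The health state names are illustrative.

```python
# Assumed enumeration for Compaq "condition" objects; verify per object in the MIB.
CONDITION = {1: "other", 2: "ok", 3: "degraded", 4: "failed"}


def health_state(value: int) -> str:
    """Translate a polled condition integer into a monitor health state."""
    name = CONDITION.get(value)
    if name == "ok":
        return "Healthy"
    if name == "degraded":
        return "Warning"
    if name == "failed":
        return "Critical"
    return "Unknown"  # 'other', or a value outside the assumed enumeration
```

In a three-state monitor, this corresponds to a healthy expression of "equals 2", a warning expression of "equals 3", and a critical expression of "equals 4".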

For reference on SNMP MIBs, ByteSphere provides a great Online MIB Database.  The primary Compaq MIBs to look for are: CPQHLTH, CPQIDA, CPQSTSYS, CPQHOST, CPQNIC, CPQTHRSH.

SCOM: Automating Management Pack Documentation

It’s easy to argue the case for keeping accurate documentation of SCOM management pack monitors and customizations, both for the abstract purpose of maintaining good documentation, as well as the more practical purpose of being able to answer the “can you list what is being monitored?” question. However, it can be tedious to keep on top of this documentation. But, given the flexibility of the SCOM command shell, it’s relatively easy to configure a powershell script to automate the documentation of management pack entities. By using a script to loop through unsealed management packs and itemize management pack entities such as groups, rules, monitors and views, along with their description, all it takes to automate documentation of custom management packs is completing the description fields for objects as they are created.

To loop through unsealed management packs, we can define the list of management packs as an object:

$mps = Get-ManagementPack | where-object{$_.Sealed -eq $false} |Sort-Object DisplayName

Then create the loop logic:

foreach($mp in $mps)
{
  $mpDisplayName=$mp.DisplayName
  # …functions to list MP objects
}

With a set of functions to list the management pack objects that write the object lists to a formatted file (I use an HTML file so I can utilize CSS for formatting), we can create an automatically-generated document covering all unsealed MP’s.
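The formatting half of that process is straightforward. The posted script does this in PowerShell; purely as an illustration of the output step, here is a Python sketch that turns already-collected management pack data into an HTML fragment. The input structure (a list of dicts with `display_name` and per-type lists of name/description pairs) is hypothetical.

```python
from html import escape

ENTITY_TYPES = ("groups", "rules", "monitors", "views")


def render_report(packs):
    """Build a simple HTML document listing entities per management pack."""
    parts = ["<html><head><style>h2 { font-family: sans-serif; }</style></head><body>"]
    for pack in sorted(packs, key=lambda p: p["display_name"]):
        parts.append(f"<h2>{escape(pack['display_name'])}</h2>")
        for kind in ENTITY_TYPES:
            items = pack.get(kind, [])
            if not items:
                continue  # skip entity types the pack does not define
            parts.append(f"<h3>{kind.title()}</h3><ul>")
            for name, description in items:
                parts.append(f"<li><b>{escape(name)}</b>: {escape(description)}</li>")
            parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping the names and descriptions matters here, since description fields often contain characters such as < and & that would otherwise break the page.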

I’ve posted the script I use here.   This script exports groups, rules, monitors, and views for all unsealed management packs in a formatted html file.  As always, it is provided as-is.

https://i0.wp.com/jxjzig.bay.livefilestore.com/y1pjtnzM9RDmStqJ-ZMgzC1HYtJbO2G9rUylcr1AiV5DQ5Jrq_vO2QlZZom8juYCZgFcLGCiQJe1qd2v4aOE2wb4w/report.JPG

My preference is to use a single scheduled task that runs this documentation script as well as an automated export of all unsealed management packs. The output of both of these processes is then backed up nightly, creating a hands-off set of documentation and management pack history that can be utilized as necessary.

More on exporting unsealed MP’s for backup can be found at: http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1317380,00.html

Sun Hardware Monitoring with Net-SNMP and Shell Scripts

While Sun, like most server vendors, offers a comprehensive suite of hardware monitoring agents and management tools, it can be frustrating to monitor Sun hardware using the Sun Management Agent and a third-party SNMP tool, such as System Center Operations Manager.   The Sun Management Agent’s SNMP implementation builds on the Entity-MIB (http://docs.sun.com/app/docs/doc/817-3155/6mip4hnov?l=en&a=view) with the SUN Platform-MIB, and while all relevant hardware monitoring data are exposed through this MIB implementation, there are problems with deploying wide-scale monitoring of these objects using SNMP get requests.  This is because the list of entity objects varies by server model, and can even vary depending on the number of objects, like hard drives.

To expound on this point, Sun servers that run the SMA agent will list all hardware sensors in the entPhysicalTable of the ENTITY-MIB.  The id value for each of the hardware sensors will correspond to the id value for the sensor status (administrative and operational status) in the sunPlatEquipmentTable in the SUN Platform-MIB.   However, on one model, id 15 might correspond to CPU 0 Fan 0, but on another model, id 15 would correspond to a different sensor, and if that server had two CPU’s, id 17 might correspond to CPU 1 Fan 0, but if that system had only one CPU, id 17 would correspond to a different sensor.

If you could use an SNMP table or SNMP walk request and return the results to a script that parses the output, this would not be a problem.  But like many SNMP monitoring tools, SCOM implements SNMP gets only, meaning that variability in the OID for a given sensor makes it impractical to define one set of monitors that works across server models.
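To make the index problem concrete, here is a small Python sketch of what a walk-based script would do: look the sensor up by name in one table, then read its status from the other table using the same index. The sensor names, index values, and status strings are invented for illustration and are not real walk output.

```python
def find_sensor_status(names, statuses, wanted):
    """Locate a sensor by name and return (index, status), or None if absent.

    names:    index -> sensor description (as walked from entPhysicalTable)
    statuses: index -> status value (as walked from sunPlatEquipmentTable)
    """
    for index, name in names.items():
        if name == wanted:
            return index, statuses.get(index)
    return None


# Two hypothetical server models: the same sensor lives at different indexes
model_a_names = {15: "CPU 0 Fan 0", 16: "CPU 0 Fan 1"}
model_a_status = {15: "enabled", 16: "enabled"}
model_b_names = {15: "PSU 0", 17: "CPU 0 Fan 0"}
model_b_status = {15: "enabled", 17: "disabled"}
```

A fixed get against index 15 would read the fan on the first model and the power supply on the second, which is exactly the mismatch described above.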

So, what’s a way to work around this without committing to deploying a secondary monitoring tool just for monitoring SUN hardware?   One solution lies in the extensibility of the Net-SNMP agent, which is the default SNMP agent for Solaris.   Net-SNMP allows the extension of the agent’s functionality by assigning commands to OIDs.  With this configuration, whenever the OID is polled, the command is run on-demand and the output of the command is returned as the SNMP value.   For more on this functionality, see the Extending Agent Functionality section at: http://www.net-snmp.org/docs/man/snmpd.conf.html

To utilize this for hardware monitoring, a standard set of shell or PERL scripts can be written and deployed to a uniform path on all of your SUN servers, each configured to return a value such as “Pass” if everything checks out, or “Fail: <reason for failure>” if there are problems found.   The scripts can be written to support different status checking commands to support maximum portability (for example, using one status command on systems with software disk redundancy and another on systems with hardware RAID).   A great starting point for example monitoring scripts can be found at Sun’s BigAdmin site:  http://www.sun.com/bigadmin/scripts/indexMon.html
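The "Pass" / "Fail: reason" convention is easy to keep consistent if every script builds its output the same way. The post suggests shell or Perl for the scripts themselves; the Python sketch below only illustrates the output convention, and the check names and the commented snmpd.conf line (including the script path) are hypothetical.

```python
def summarize(checks):
    """Collapse individual check results into one string for SNMP polling.

    checks: dict mapping a check name to (ok, detail).
    Returns "Pass" if every check is ok, otherwise "Fail: <reasons>".
    """
    failures = [f"{name} ({detail})"
                for name, (ok, detail) in checks.items() if not ok]
    if not failures:
        return "Pass"
    return "Fail: " + "; ".join(failures)


# A script like this could then be registered in snmpd.conf with an
# extend-style directive, for example (hypothetical name and path):
#   extend hddcheck /opt/monitor/hdd_check.sh
```

Because the monitor only has to compare the returned string against "Pass", the same SNMP get monitor works regardless of which status commands the script runs underneath.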

The net result is that with well written scripts and the Net-SNMP agent, a single monitoring solution can be deployed to all Sun servers, independent of their hardware model.   With consistent configuration in the snmpd.conf, the OID’s for each of the scripts (e.g. CPU script, HDD script, etc) will be the same and can be polled with a single set of SNMP get monitors in SCOM or another utility.