Operations Manager Cross-Platform Authoring: Invoke Action Monitor

When monitoring UNIX/Linux servers, command execution or script-based monitors can provide a great deal of flexibility in many health-checking applications.   The Operations Manager 2007 R2 cross-platform agent facilitates the execution of shell command lines or executable binaries and scripts with the Microsoft.Unix.WSMan.Invoke.Probe module.   In this post, I will walk through the use of this module in an example monitoring scenario:  monitoring UNIX/Linux systems for the count of defunct/zombie processes.   The management pack described in this post can be downloaded here.

Background

The Microsoft.Unix.WSMan.Invoke.Probe is a nicely wrapped implementation of the module Microsoft.SystemCenter.WSManagement.TimedInvoker from the Microsoft.SystemCenter.WsManagement.Library management pack.  The Microsoft.Unix.WSMan.Invoke.Probe facilitates the execution of commands or processes on the agent with two Invoke Actions:  ExecuteCommand and ExecuteShellCommand.   The ExecuteCommand Invoke Action executes a script or binary executable (along with command line parameters), whereas the ExecuteShellCommand executes a command string in a shell environment.   While similar, a key functional difference between the two is that the ExecuteShellCommand Invoke Action supports command-line pipe operations while the ExecuteCommand Invoke Action does not.   So, any output filtering with awk, sed, or grep (for example) will require the use of the ExecuteShellCommand Invoke Action.  An example of using the ExecuteCommand Invoke Action in a discovery and monitor can be found in the Cross Platform MP Authoring Guide.   However, one advantage of using ‘one-liner’ commands with the ExecuteShellCommand Invoke Action in monitoring scenarios as opposed to calling local scripts with ExecuteCommand is that the need to distribute and maintain scripts to agents is eliminated and the monitoring script is thus embedded in the MP to be managed centrally. 
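The pipe distinction described above is the same one seen when launching a process directly versus through a shell. As a rough analogy only (this is Python's subprocess module, not the cross-platform agent's implementation, and the helper names are hypothetical), a minimal sketch:

```python
import subprocess

def run_direct(argv):
    """Run a binary directly with an argument list; no shell is involved,
    so '|' is passed to the program as an ordinary argument."""
    return subprocess.run(argv, capture_output=True, text=True).stdout.strip()

def run_in_shell(command):
    """Run a command string through the system shell, which interprets pipes."""
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout.strip()

# Direct execution: echo receives "|", "wc" and "-l" as literal arguments.
print(run_direct(["echo", "a", "|", "wc", "-l"]))

# Shell execution: the shell builds the pipeline and wc counts one line.
print(run_in_shell("echo a | wc -l"))
```

The first call simply echoes its arguments back, while the second returns the line count, which is why filtering with grep or awk needs the shell-based action.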

As for monitoring the defunct process count, the UNIX ps command can easily be used to identify defunct/zombie processes.  With some output manipulation by grep and awk, the command string can be configured to return just the number of defunct processes to StdOut: ps -eo 's' | grep Z | awk 'END{print NR}'
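To make the logic of that one-liner explicit, here is a small sketch in Python of the same counting step. It assumes the input is the state-only output of ps (one state code per line, with a header line); the function name and sample data are hypothetical.

```python
def count_defunct(ps_output: str) -> int:
    """Count lines whose process state starts with 'Z' (zombie/defunct)."""
    count = 0
    for line in ps_output.splitlines():
        state = line.strip()
        # The header line ("S") and blank lines do not start with "Z",
        # so they are skipped without special handling.
        if state.startswith("Z"):
            count += 1
    return count

# Hypothetical sample: a header followed by five process states.
sample = "S\nS\nR\nZ\nS\nZ\n"
print(count_defunct(sample))
```

The monitor only needs this single number on StdOut, which the condition detection can then compare against a threshold.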

Turning this command into a functional monitor then just requires a data source to execute the InvokeAction, a monitor type to define the condition detections and health states, and a unit monitor.

Walk Through


xSNMP Management Packs – Beta Version 1.0.8 Release

After many weeks of development efforts and testing, I have made the latest beta version of the xSNMP Management Packs available for download.   Before getting into any more detail, I would be remiss if I did not acknowledge the invaluable help provided by some of the volunteers who tested these management packs through all stages of development.  Many thanks in particular to Chris and Davey, who played a huge role in every stage of the development of the xSNMP MP’s.   Many thanks also to Gary and Björn for their great help in testing the MP’s.  

The included documentation covers recommendations for deployment and configuration as well as the details of the management packs.  Additional information about performance considerations in large SNMP monitoring environments can be found in this previous post.   At present, the following management packs are included in this suite, and more are currently in the works.

  • xSNMP Management Pack – Implements filtered discovery and monitoring of SNMP devices and interfaces that support the standard RFC1213 MIB, IF-MIB, and EtherLike-MIB.  This management pack is the core of the xSNMP Suite and contains public datasources that are utilized in the other optional management packs.
  • xSNMP Overrides Management Pack – This unsealed management pack can be used as a container for overrides, but also provides preconfigured groups and overrides for easily controlling interface monitoring through groups of network interfaces.
  • xSNMP APC Management Pack – Implements monitoring for APC Rackmount PDU, UPS, Automatic Transfer Switch, and Environmental Monitor devices.
  • xSNMP Brocade Management Pack – Implements chassis monitoring for Brocade Fibre-Channel switch devices (Fibre-Channel ports are monitored as network interfaces with the xSNMP MP).
  • xSNMP Check Point Secure Platform Management Pack – Implements module health and firewall HA failover monitoring for Check Point Secure Platform firewall devices.
  • xSNMP Cisco Management Pack – Implements additional monitoring for Cisco devices, primarily including chassis hardware monitoring for devices that support the EnvMon MIB, Entity-MIB, or Cisco-Stack MIB.
  • xSNMP Data Domain Management Pack – Implements monitoring for the performance, hardware status, and replication status of Data Domain Restorer storage appliances.
  • xSNMP HP ProCurve Management Pack – Implements component health monitoring for HP ProCurve switches and wireless access points.
  • xSNMP HP Proliant Management Pack – Implements hardware health monitoring for SNMP-enabled HP servers that support the Proliant Insight Management Agents.
  • xSNMP Net-SNMP Management Pack  – Implements operating system monitoring for Net-SNMP agent devices, such as UNIX/Linux servers through the UCD and Host-Resources MIBs. 
  • xSNMP Syslog Management Pack – Provides  warning and critical alert generating rules that can be enabled and filtered with overrides to alert on incoming syslog messages from discovered SNMP devices.

Feedback is, of course, welcomed.

Some Screenshots of the xSNMP Management Packs

Diagram view of an HP ProCurve device:

Diagram view of a Data Domain Restorer:

Health Explorer view for an APC UPS:

Diagram View for an HP Proliant server:

Performance View for a network interface:

Diagram view for a Brocade Fibre Channel Switch:

Scalability and Performance (Design and Testing) in the xSNMP Management Packs

In this post, I intend to describe some of the challenges in scaling SNMP monitoring in an Operations Manager environment to a large number of monitored objects, as well as my experiences from testing and the approaches that I took to address these challenges with the xSNMP Management Packs.

Background

In spite of the market availability of many task-specific SNMP monitoring applications boasting rich feature sets, I think that a strong case can be made for the use of System Center Operations Manager in this SNMP monitoring role. Using a single product for systems and infrastructure (SNMP) monitoring facilitates unparalleled monitoring integration (e.g. including critical network devices/interfaces or appliances in Distributed Application Models for vital business functions). The rich MP authoring implementation, dynamic discovery capabilities, and object-oriented modeling approach allow a level of depth and flexibility in SNMP monitoring not often found in pure SNMP monitoring tools.

However, Operations Manager is first and foremost a distributed monitoring application, most often depending on agents to run small workloads independently. Inevitably, running centralized monitoring workloads (i.e. SNMP polls) in a distributed monitoring application is going to carry a higher performance load than the same workloads in a task-specific centralized monitoring application that was built from the ground up to handle a very high number of concurrent polls with maximum efficiency. This centralized architecture would likely feature a single scheduler process that distributes execution of polls in an optimized fashion as well as common polling functions implemented in streamlined managed code. With SNMP monitoring in Operations Manager, any optimization of workload scheduling and code optimization more or less falls to the MP author to implement.

While working on the xSNMP Management Packs, I spent a lot of time testing different approaches to maximize efficiency (and thus scalability) in a centralized SNMP monitoring scenario. I’m sure there is always room for continual improvement, but I will try to highlight some of the key points of my experiences in this pursuit.

Designing for Cookdown

Cookdown is one of the most important concepts in MP authoring when considering the performance impact of workflows. A great summary of OpsMgr cookdown can be found here. In effect, the cookdown process looks for modules with identical configurations (including input parameters) and replaces the multiple executions of redundant modules with a single execution. So, if one wanted to monitor and collect historical data on the inbound and outbound percent utilization and Mb/s throughput of an SNMP network interface, a scheduler and SNMP Probe (with VarBinds defined to retrieve the in and out octets counters for the interface) could be configured. As long as each of the rules and monitors provided the same input parameters to these modules for each interface, the scheduler and SNMP probe would only execute once per interval per interface. Taking this a step further, the SNMP probe could be configured to gather all SNMP values for objects to monitor in the ifTable for this interface (e.g. Admin Status, Oper Status, In Errors, Out Errors), and these values could be used in even more rules and monitors. The one big catch here is that the SNMP Probe module stops processing SNMP VarBinds once it hits an error. So, it’s typically not a good idea to mix SNMP VarBinds for objects that may not be implemented on some agents with OIDs that would be implemented on all agents.
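The grouping idea behind cookdown can be illustrated with a short Python sketch. This is only a model of the concept, not how the Health Service implements it; the workflow names, module label, and configuration keys are all hypothetical.

```python
from collections import defaultdict

def cook_down(workflows):
    """Group workflows by (module type, configuration).

    Each distinct key represents one module execution that all workflows
    in the group can share.
    """
    groups = defaultdict(list)
    for name, module, config in workflows:
        # Identical configurations (including input parameters) produce the same key.
        key = (module, tuple(sorted(config.items())))
        groups[key].append(name)
    return groups

# Hypothetical workflows targeting the same interface.
workflows = [
    ("InUtilizationMonitor", "SnmpProbe", {"ifIndex": 3, "Interval": 300}),
    ("OutUtilizationRule", "SnmpProbe", {"ifIndex": 3, "Interval": 300}),
    ("InErrorsRule", "SnmpProbe", {"ifIndex": 3, "Interval": 600}),
]

groups = cook_down(workflows)
print(len(workflows), "workflows share", len(groups), "probe executions")
```

The third workflow passes a different interval, so it cannot share the probe execution with the first two. A single differing parameter is enough to break cookdown.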

Workflow Scheduling


Introducing the xSNMP Management Pack Suite

Introduction

Over the past several weeks, I’ve been hard at work on some new SNMP management packs for Operations Manager 2007 R2, to replace the Cisco SNMP MP and extend similar functionality to a wide range of SNMP-enabled devices.   In the next few posts, I hope to describe some of the design and development considerations related to these Management Packs, which I am calling the xSNMP Management Pack Suite.   For this post, I hope to give a basic overview of the development effort and resulting management packs.

As I was working on some feature enhancements to the Cisco SNMP Management Pack, and following some really great discussions with others on potential improvements, I concluded that a more efficient and effective design could be realized by aligning the management pack structure along the lines of the SNMP standard itself.   To expound on this point, much of the monitoring in the Cisco MP is not specific to Cisco devices, but rather is common to all SNMP devices.   The SNMP standard defines a hierarchical set of standard MIBs, and a hierarchical implementation of vendor-specific MIBs, designed to eliminate redundancy.   I tried to loosely adapt this model in the xSNMP MP architecture.   The first of the MP’s, and the one that all of the others depend on, is the root xSNMP Management Pack.   This management pack has a few functions:

  1. It performs the base discovery of SNMP devices (the discovery is targeted to the  SNMP Network Device class)
  2. It implements monitoring of the SNMP v1/v2 objects for discovered devices and interfaces
  3. It provides a set of standardized and reusable data sources for use in dependent management packs

From there, the remaining management packs implement vendor-specific monitoring.   Devices and/or interfaces are discovered for the vendor-specific management packs as derived objects from the xSNMP MP, and most of the discoveries, monitors, and rules utilize the common data sources from the xSNMP MP, which makes the initial and ongoing development for vendor-specific MP’s much more efficient.

Controlling Interface Monitoring

One of the topics frequently commented on with the Cisco SNMP Management Pack, and a subject of much deliberation, was that of selecting network interfaces for monitoring.   Even determining the optimal default interface monitoring behavior (disabled vs. enabled) isn’t a terribly easy decision.  For example, a core network switch in a datacenter may require that nearly all interfaces are monitored, while a user distribution switch may just require some uplink ports to be monitored.   In the end, I decided on an approach that seems to work quite well.   In the xSNMP Management Pack, all interface monitoring is disabled by default.   A second, unsealed management pack is also provided and includes groups to control interface monitoring (e.g. Fully Monitored, Not Monitored, Status Only Monitored).  Overrides are pre-configured in this MP to enable/disable the appropriate interface rules and monitors for these groups.   So, to enable interface monitoring for all Ethernet interfaces, a dynamic group membership rule can be configured to include objects based on interface type 6, or if critical interfaces are consistently labeled on switches with an Alias, the Interface Alias can be used in rules for group population.

Organizing Hosted Objects

For each of the management packs,  I tried to take a standardized approach for hierarchical organization of hosted objects and their relationships.   This organization was facilitated primarily through the use of arbitrary classes to contain child objects.   So, rather than discover all interfaces of a device with a single hosting relationship to the parent, an intermediary logical class (named “Interfaces”) is discovered with parent and child hosting relationships.   This approach has three primary benefits: 1) the graphical Diagram View is easier to navigate, 2) the object hierarchy is more neatly organized for devices that may be monitored by multiple MP’s (e.g. a server monitored by three MP’s for SNMP hardware monitoring, O/S monitoring, and application monitoring), and 3) the organization of hosted objects is consistent, even for devices with multiple entities exposed through a single SNMP agent. 

Scalability

With loads of invaluable help from some volunteer beta testers, a great deal of time has been spent testing and investigating performance and scalability for these management packs.  While I will save many of these details for a later post, I can offer a few comments on the topic.   In all but the smallest SNMP-monitoring environments, it’s highly advisable to configure SNMP devices to be monitored by a node other than the RMS.  For larger environments, one or more dedicated Management Servers or Agent Proxies (Operations Manager agents configured to proxy requests for SNMP devices) are preferred for optimal performance.    From our testing with these Management Packs, a dedicated agent proxy can be expected to effectively monitor between 1500 and 3500 objects, depending on the number of monitors/rules, the intervals configured, and the processing power of the agent proxy.   By object, I am referring to any discovered object that is monitored by SNMP modules, such as devices, interfaces, fans, file systems, power supplies, etc.   So, monitoring a switch infrastructure with 4000-6000 monitored network interfaces should be doable with two dedicated agent proxy systems.
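As a back-of-the-envelope check of that sizing guidance, the arithmetic can be sketched in Python. The per-proxy capacities are the two ends of the range quoted above; the function name is hypothetical, and real capacity depends on the factors already listed.

```python
import math

def proxies_needed(monitored_objects: int, objects_per_proxy: int) -> int:
    """Smallest number of agent proxies that covers the object count."""
    return math.ceil(monitored_objects / objects_per_proxy)

for objects in (4000, 6000):
    best_case = proxies_needed(objects, 3500)   # upper end of the tested range
    worst_case = proxies_needed(objects, 1500)  # lower end of the tested range
    print(objects, "objects:", best_case, "to", worst_case, "proxies")
```

Two proxies cover the 4000-6000 interface example when each proxy performs near the top of the tested range; with heavier rule sets or shorter intervals, the same arithmetic suggests planning for more.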

I intend to write in greater detail about these topics in the coming weeks, and hope to post the first public beta version of these management packs soon.

Automating WSS 3.0 Backups with a Script

Although SQL backups of the content databases for a Windows SharePoint Services farm can be used for data recovery, it’s usually a good idea to also perform backups through the stsadm.exe utility to facilitate site and object-level restores.   I recently took on a task to script a more robust solution for the automation of WSS farm backups, which I will describe here.

The stsadm.exe utility can be used to back up in two modes, site collection and catastrophic.  The site collection method backs up an individual site and its content, specified by URL, and the catastrophic backup method backs up the entire farm or a specified object in a full or differential mode.    I opted to go with the catastrophic backup method in this script to support differential backups and eliminate the requirement to enumerate individual sites for backup operations.

WSS Backup Script Overview

The script is a bit too long to post in its entirety, but it can be downloaded here.   The script accepts three parameters:

  • The target backup directory path
  • The backup type (full or differential)
  • The number of backups to maintain

The backup operation is relatively simple:  the script uses the WScript.Shell.exec method to execute the stsadm.exe command (after querying the registry to determine the install path of WSS).

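' Build the stsadm.exe command line from the install path, target directory, and backup type
Command = sPath & "\STSADM.exe -o backup -Directory " & BackupDir & " -BackupMethod " & BackupType

' Run the backup and capture its console output
Set oExec = WshShell.Exec(Command)
While Not oExec.StdOut.AtEndOfStream
    OutPut = oExec.StdOut.ReadAll
Wend
' Wait for the process to exit (a Status of 0 means it is still running)
Do While oExec.Status = 0
    WScript.Sleep 5000
Loop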

To improve monitoring of the operation, the script performs a shell execution to the eventcreate.exe utility to log status to the Windows Application Log.   (Although the WScript.Shell supports basic EventLog logging, I wanted to control the event source and ID, so the eventcreate.exe utility seemed to be a better option).

' Log an informational event on success, a warning event on failure
If blSuccessful Then
   Command = "Eventcreate /L Application /T INFORMATION /ID 943 /SO ""SharePoint Backups"" /D """ & sMesg & """"
Else
   Command = "Eventcreate /L Application /T WARNING /ID 944 /SO ""SharePoint Backups"" /D """ & sMesg & """"
End If
Set oExec = WshShell.Exec(Command)

The most complex operation of this WSS backup automation script is the maintenance of old backups.  The stsadm backup operation maintains an XML file named spbrtoc.xml in the backup directory with meta-data related to past backups.   While an example of deleting backups older than a certain time interval can be found here, I wanted to maintain past backups based on a count (x number of fulls, x number of differentials).    To implement this, the script loads meta-data from the Table of Contents XML file into an array, determines the number of backups to be purged (correlated to the current backup operation type – full or differential), flags the oldest backups for deletion, and then deletes the related backup directories and XML nodes.  
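The count-based retention step can be sketched as follows. This is a Python illustration of the selection logic only, not the VBScript from the download; the record fields ("method", "start", "directory") are hypothetical stand-ins for the meta-data read from spbrtoc.xml.

```python
def select_purge(backups, backup_type, keep):
    """Return the oldest backups of one type beyond the 'keep' newest.

    backups: list of dicts with hypothetical keys 'method', 'start', 'directory'.
    """
    same_type = sorted(
        (b for b in backups if b["method"].lower() == backup_type.lower()),
        key=lambda b: b["start"],  # ISO-style timestamps sort chronologically
    )
    excess = len(same_type) - keep
    return same_type[:excess] if excess > 0 else []

# Hypothetical history: three fulls and one differential.
history = [
    {"method": "Full", "start": "2009-06-01T01:00", "directory": "spbr0000"},
    {"method": "Differential", "start": "2009-06-02T01:00", "directory": "spbr0001"},
    {"method": "Full", "start": "2009-06-08T01:00", "directory": "spbr0002"},
    {"method": "Full", "start": "2009-06-15T01:00", "directory": "spbr0003"},
]

for backup in select_purge(history, "full", keep=2):
    print("purge", backup["directory"])
```

Only backups matching the current operation type are considered, so a full backup run never purges differentials and vice versa. Deleting the directories and the matching XML nodes would follow from the returned list.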

Automating With System Center Operations Manager 2007


Using the Operations Manager 2007 R2 Workflow Analyzer

I’ve only had my hands on the OpsMgr MP Authoring Resource Kit for about 24 hours now, but already the tools are proving to be invaluable.   This post describes a problem that I was able to investigate with the Workflow Analyzer tool to determine the exact cause of the issue.

Background

In a management pack I’m working on, I had a composite workflow designed to calculate SNMP network interface throughput and utilization by collecting the 32bit and 64bit in and out octet counters for an interface.  The SnmpProbe passes the values for all four VarBinds to an Expression Filter, which confirms that either VarBinds 1 and 2 (64bit) or VarBinds 3 and 4 (32bit) have values greater than or equal to zero.   The Expression Filter then passes matched data items to a PowerShell property bag probe, which compares the values to a previously collected value set (stored in a temporary file in the file system) in order to calculate delta values and interface utilization and throughput.

The script was written to use the 64bit counters if data are returned for the 64bit counters and 32bit counters if 64bit counter data are not returned.   I had been having some issues with this workflow when targeted to interfaces of devices that do not support 64bit interface octet counters.  From the lack of errors in the log, and evidence that the PowerShell script probe was not running (no temporary file being generated for these instances), I had concluded that the workflow was stopping with the post-SnmpProbe Expression Filter, but I didn’t know exactly why.   I had thought the Expression Filter was configured in such a way as to continue even if null values were returned for the 64bit counters. 
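The counter selection and delta calculation described here can be sketched in Python. This is not the PowerShell probe from the management pack; the dictionary keys ("hc" for the 64bit counter, "std" for the 32bit counter), the single-wrap handling, and the known interface speed are all assumptions made for illustration.

```python
def counter_delta(previous: int, current: int, bits: int) -> int:
    """Difference between two polls, allowing for one counter wrap."""
    if current >= previous:
        return current - previous
    return current + 2 ** bits - previous

def utilization_percent(prev: dict, curr: dict, seconds: int, speed_bps: int) -> float:
    """Percent utilization for one direction over the polling interval.

    Uses the 64bit counter when both polls returned it, else the 32bit counter.
    """
    if prev.get("hc") is not None and curr.get("hc") is not None:
        octets = counter_delta(prev["hc"], curr["hc"], 64)
    else:
        octets = counter_delta(prev["std"], curr["std"], 32)
    return octets * 8 * 100 / (seconds * speed_bps)

# Hypothetical device without 64bit counters, 100 Mb/s interface, 5 minute poll.
previous_poll = {"hc": None, "std": 1_000_000}
current_poll = {"hc": None, "std": 376_000_000}
print(utilization_percent(previous_poll, current_poll, 300, 100_000_000))
```

The fallback only works if the data item actually reaches the script, which is why a filter that rejects empty 64bit values upstream stops the whole workflow.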

Using the recently released Operations Manager 2007 R2 Workflow Analyzer, I was able to drill into the actual processing of the workflow and identify the issue.

Workflow Tracing

The steps I used to debug this workflow were:

Launch the Workflow Analyzer and create a new session:


Operations Manager MP Authoring Resource Kit

Microsoft has just released the Operations Manager Resource Kit, and my first impressions are very positive.   I haven’t had a chance to test drive all of the tools, but the MP Best Practices Analyzer and Workflow Analyzer show great potential.

The MP Best Practices Analyzer shows up under the Tools menu of the Authoring console and will scan a management pack for best practices compliance in great detail.  In my first use of this tool, I found it to be of great value. 

I’m looking forward to spending some time with the Workflow Analyzer, which provides a great interface for drilling into troubleshooting the more abstract elements of MP performance.  The Workflow Analyzer displays all loaded workflows for a management server, with the option to drill into a workflow to a specific instance, and then launch graphical debug tracing of that specific workflow.   Great stuff indeed.

There are a number of other tools in the RK, not the least of which includes a spell checker.

SCOM: Combining a System.SnmpProbe and System.Performance.DeltaValueCondition Modules to Calculate SNMP Counter Delta Values

I have previously written about using the combination of an SnmpProbe and script probe in Operations Manager workflows to facilitate manipulation of numeric values.   While this is currently the only way to perform numeric operations, there are some cases in which the only required manipulation of a numeric value is the calculation of a delta between two polls, such as calculating the number of interface collisions in an interval (from the ifTable) or calculating the number of interface resets in a polling cycle (from the Cisco locIfTable).  In these cases, the SnmpProbe can be combined with a System.Performance.DeltaValueCondition condition detection module to calculate the delta value without having to engage a script probe.

The Performance.DeltaValueCondition module expects Performance Data as an input, so a System.Performance.DataGenericMapper must be used between the SnmpProbe and DeltaValueCondition modules to do the data conversion.   The DeltaValueCondition accepts two options:  NumSamples and Absolute.   The NumSamples parameter sets the number of value samples to maintain in memory, and the value returned is the difference between the first and last samples in memory.  The “Absolute” parameter, when true, causes the DeltaValueCondition module to return the delta as the raw difference between the samples, and when false, causes the module to return the percentage of change.
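The behavior of the two parameters can be modeled with a small Python sketch. This is an illustration of the description above, not the module's actual implementation: the class name is hypothetical, and both the percentage formula (change relative to the first sample) and the choice to return nothing until the sample buffer is full are assumptions.

```python
from collections import deque

class DeltaValue:
    """Toy model of a delta calculation over a fixed window of samples."""

    def __init__(self, num_samples: int, absolute: bool):
        self.samples = deque(maxlen=num_samples)  # oldest sample drops off automatically
        self.absolute = absolute

    def add(self, value: float):
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return None  # assumption: no output until the window is full
        first, last = self.samples[0], self.samples[-1]
        if self.absolute:
            return last - first
        if first == 0:
            return None  # percentage change is undefined from a zero baseline
        return (last - first) / first * 100

# Hypothetical collision counter polled three times, raw difference mode.
collisions = DeltaValue(num_samples=2, absolute=True)
for polled in (100, 130, 135):
    print(collisions.add(polled))
```

With NumSamples set to 2, each output is simply the change since the previous poll, which is the per-interval count needed for counters such as collisions or resets.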

An example workflow can be represented in this diagram (the expression filter being used to validate the data returned from the SnmpProbe prior to continuing):


SCOM: Distributing Workflows with the SpreadInitializationOverInterval Scheduler Parameter

In Operations Manager distributed agent-based monitoring scenarios, resource utilization of the monitoring workflows is rarely a point of major concern as the data sources and probe actions typically consume only nominal resources of the agent host system at any given time.   However, in centralized monitoring scenarios, such as SNMP monitoring or wide-scale URL monitoring, the resource utilization of each workflow must be a primary concern as all workflows will execute on a small number of management servers/agent proxies and the potential for a massive number of workflows executing concurrently is very real.

While I had previously described some of my thoughts on workflow resource utilization with script probe actions, there is another highly relevant aspect of this general topic: workflow schedule distribution.   When working with centralized poll/probe monitors, almost every workflow will start with a scheduler.  By default, the Operations Manager scheduler module will not distribute scheduler initialization, and the result of this is that every workflow scheduled on an interval of X minutes will all fire at the same time – every X minutes since the initiation of the Health Service (unless a SyncTime is specified).   If, for example, 2000 network interfaces were polled with an SNMP probe every 5 minutes for status, every 5 minutes, 2000 workflows would execute simultaneously on the agent proxy system, and particularly if the workflow includes a script probe, the likely result would be oversubscription of the agent proxy’s CPU leading to script timeouts and/or SNMP poll failures.  

If the scheduled workflows could initialize at distributed times, so that they do not fire in a synchronized fashion, significant scalability improvements could be realized.  I had been experimenting with using a PowerShell script in a discovery probe to randomly determine a SyncTime and assign it as a property of an object, and then passing this randomized SyncTime to schedulers as a variable in order to distribute workflow schedules.  This worked to an extent, but was unnecessarily complicated and somewhat limited in effect (because the SyncTime parameter accepts times as an input in HH:MM format, initialization of workflows scheduled for 5 minute intervals could only be distributed across 5 possible initialization slots in the 5 minute interval).

However, I was very recently informed of a new R2-only scheduler parameter: SpreadInitializationOverInterval, which (as one would expect from the parameter name) distributes the initialization of the scheduler over a defined interval.   I’ve done a good bit of testing with this parameter and it works exactly as it should, which brings about major improvements in peak resource utilization in centralized monitoring scenarios.

Use of the parameter is quite simple: it expects a numeric value for the initialization interval (in seconds by default, but different time units can be specified with the Unit attribute), and for obvious reasons, it can’t be used along with a SyncTime parameter.  As for guidelines pertaining to ideal interval values, I have come to these conclusions in testing:  for monitors or rules that execute on relatively short intervals (e.g. 5, 10, 15 minutes), it works well to use the same interval for both the scheduler and SpreadInitializationOverInterval parameter.   This maximizes the load distribution facilitated by the spread initialization option.  For rules, monitors, or discoveries that execute infrequently (e.g. 4, 12, 24 hours), I prefer to set the SpreadInitializationOverInterval value to something like 30 minutes.   As an example, if a discovery workflow were scheduled to execute every 24 hours, setting the SpreadInitializationOverInterval parameter to 30 minutes would facilitate load distribution, but not require that any new objects that were added to the Management group wait up to 24 hours for discovery.
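The effect on peak load can be illustrated with a small simulation in Python. How the Health Service actually picks each workflow's start offset is not described here, so the uniform random offset below is purely an assumption, and the function names are hypothetical.

```python
import random
from collections import Counter

def first_fire_offsets(workflow_count: int, spread_seconds: int, seed: int = 0):
    """Assign each workflow a start offset within the spread window (assumed uniform)."""
    rng = random.Random(seed)
    return [rng.uniform(0, spread_seconds) for _ in range(workflow_count)]

def peak_per_second(offsets) -> int:
    """Largest number of workflows that start within the same one-second slot."""
    return max(Counter(int(offset) for offset in offsets).values())

# 2000 interface polls on a 5 minute schedule.
synchronized = [0.0] * 2000                  # default: everything fires together
spread = first_fire_offsets(2000, 300)       # initialization spread over the interval

print("synchronized peak:", peak_per_second(synchronized))
print("spread peak:", peak_per_second(spread))
```

Because every workflow keeps its own offset on subsequent intervals, the lower peak persists for as long as the workflows keep running rather than applying only to the first cycle.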

An example of the use of this parameter in a composite Data Source might look like this in XML:

<DataSource TypeID="System!System.Scheduler">
   <Scheduler>
      <SimpleReccuringSchedule>
         <Interval>$Config/Interval$</Interval>
         <SpreadInitializationOverInterval Unit="Seconds">$Config/Interval$</SpreadInitializationOverInterval>
      </SimpleReccuringSchedule>
      <ExcludeDates />
   </Scheduler>
</DataSource>

And the same scheduler in the Authoring Console:

The GUI “Configure” dialog in the Authoring Console doesn’t provide an option to set the SpreadInitializationOverInterval parameter, so it has to be edited in the XML.  This is as good an opportunity as any to highly recommend linking XML Notepad 2007 as the editor in the Ops Mgr Authoring Console.   XML Notepad 2007 is a great XML editor in general, but when used as an editor in the Authoring Console, it does automatic XSD verification, even providing drop-down selections of options: