Using the Operations Manager 2007 R2 Workflow Analyzer

I’ve only had my hands on the OpsMgr MP Authoring Resource Kit for about 24 hours now, but already the tools are proving to be invaluable.   This post describes a problem that I was able to investigate with the Workflow Analyzer tool to determine the exact cause of the issue.

Background

In a management pack I’m working on, I had a composite workflow designed to calculate SNMP network interface throughput and utilization by collecting the 32-bit and 64-bit in and out octet counters for an interface.  The SnmpProbe passes the values for all four VarBinds to an Expression Filter, which confirms that either VarBinds 1 and 2 (64-bit) or VarBinds 3 and 4 (32-bit) have values greater than or equal to zero.  The Expression Filter then passes matched data items to a PowerShell property bag probe, which compares the values to a previously collected value set (stored in a temporary file in the file system) in order to calculate delta values and interface utilization and throughput.
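For illustration, the validation half of that Expression Filter might be sketched like this. This is a simplified reconstruction, not the MP’s actual XML: it tests only the first VarBind of each pair, and the module ID and the "System!" alias are assumptions:

<ConditionDetection ID="FilterValidCounters" TypeID="System!System.ExpressionFilter">
   <Expression>
      <Or>
         <!-- 64-bit branch: VarBind 1 returned a non-negative value -->
         <Expression>
            <SimpleExpression>
               <ValueExpression>
                  <XPathQuery Type="Integer">/DataItem/SnmpVarBinds/SnmpVarBind[1]/Value</XPathQuery>
               </ValueExpression>
               <Operator>GreaterEqual</Operator>
               <ValueExpression>
                  <Value Type="Integer">0</Value>
               </ValueExpression>
            </SimpleExpression>
         </Expression>
         <!-- 32-bit branch: VarBind 3 returned a non-negative value -->
         <Expression>
            <SimpleExpression>
               <ValueExpression>
                  <XPathQuery Type="Integer">/DataItem/SnmpVarBinds/SnmpVarBind[3]/Value</XPathQuery>
               </ValueExpression>
               <Operator>GreaterEqual</Operator>
               <ValueExpression>
                  <Value Type="Integer">0</Value>
               </ValueExpression>
            </SimpleExpression>
         </Expression>
      </Or>
   </Expression>
</ConditionDetection>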

The script was written to use the 64-bit counters if data are returned for them, and the 32-bit counters if not.  I had been having some issues with this workflow when it was targeted at interfaces on devices that do not support 64-bit interface octet counters.  Given the lack of errors in the log, and evidence that the PowerShell script probe was not running (no temporary file was being generated for these instances), I had concluded that the workflow was stopping at the post-SnmpProbe Expression Filter, but I didn’t know exactly why.  I had thought the Expression Filter was configured in such a way as to continue even if null values were returned for the 64-bit counters.
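The counter-selection logic in the script amounts to something like this sketch (the variable names are hypothetical, not the MP’s actual code):

# Prefer the 64-bit HC octet counters when the device returned them;
# otherwise fall back to the 32-bit counters.
if (($ifHCInOctets -ne $null) -and ($ifHCOutOctets -ne $null))
{
    $inOctets  = [double]$ifHCInOctets
    $outOctets = [double]$ifHCOutOctets
}
else
{
    $inOctets  = [double]$ifInOctets
    $outOctets = [double]$ifOutOctets
}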

Using the recently released Operations Manager 2007 R2 Workflow Analyzer, I was able to drill into the actual processing of the workflow and identify the issue.

Workflow Tracing

The steps I used to debug this workflow were:

Launch the Workflow Analyzer and create a new session:



Operations Manager MP Authoring Resource Kit

Microsoft has just released the Operations Manager MP Authoring Resource Kit, and my first impressions are very positive.  I haven’t had a chance to test drive all of the tools, but the MP Best Practices Analyzer and Workflow Analyzer show great potential.

The MP Best Practices Analyzer shows up under the Tools menu of the Authoring Console and will scan a management pack for best-practices compliance in great detail.  In my first use of this tool, I found it to be of great value.

I’m looking forward to spending some time with the Workflow Analyzer, which provides a great interface for drilling into and troubleshooting the more abstract elements of MP performance.  The Workflow Analyzer displays all loaded workflows for a management server, with the option to drill into a specific instance of a workflow and then launch graphical debug tracing of that specific workflow.  Great stuff indeed.

There are a number of other tools in the Resource Kit as well, not least a spell checker.

SCOM: Combining a System.SnmpProbe and System.Performance.DeltaValueCondition Modules to Calculate SNMP Counter Delta Values

I have previously written about using the combination of an SnmpProbe and script probe in Operations Manager workflows to facilitate manipulation of numeric values.  While this is currently the only way to perform numeric operations, there are some cases in which the only required manipulation of a numeric value is the calculation of a delta between two polls, such as calculating the number of interface collisions in an interval (from the ifTable) or calculating the number of interface resets in a polling cycle (from the Cisco locIfTable).  In these cases, the SnmpProbe can be combined with a System.Performance.DeltaValueCondition condition detection module to calculate the delta value without having to engage a script probe.

The Performance.DeltaValueCondition module expects Performance Data as an input, so a System.Performance.DataGenericMapper must be used between the SnmpProbe and DeltaValueCondition modules to do the data conversion.  The DeltaValueCondition module accepts two options: NumSamples and Absolute.  The NumSamples parameter sets the number of value samples to maintain in memory, and the value returned is the difference between the first and last samples in memory.  The Absolute parameter, when true, causes the DeltaValueCondition module to return the delta as the raw difference between the samples; when false, it causes the module to return the percentage of change.

An example workflow can be represented in this diagram (the expression filter being used to validate the data returned from the SnmpProbe prior to continuing):
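Expressed as member modules, such a composite data source might be wired together roughly as follows. This is a sketch only: the module IDs, configuration elements, "Perf!" alias, and example OID (ifInOctets for interface instance 1) are illustrative rather than taken from a shipping MP:

<DataSourceModuleType ID="Example.SnmpCounterDelta.DS" Accessibility="Internal">
   <Configuration>
      <xsd:element name="Interval" type="xsd:integer" />
      <xsd:element name="IP" type="xsd:string" />
      <xsd:element name="CommunityString" type="xsd:string" />
   </Configuration>
   <ModuleImplementation>
      <Composite>
         <MemberModules>
            <DataSource ID="Scheduler" TypeID="System!System.Scheduler">
               <Scheduler>
                  <SimpleReccuringSchedule>
                     <Interval>$Config/Interval$</Interval>
                  </SimpleReccuringSchedule>
                  <ExcludeDates />
               </Scheduler>
            </DataSource>
            <ProbeAction ID="SnmpProbe" TypeID="System!System.SnmpProbe">
               <IsWriteAction>false</IsWriteAction>
               <IP>$Config/IP$</IP>
               <CommunityString>$Config/CommunityString$</CommunityString>
               <SnmpVarBinds>
                  <SnmpVarBind>
                     <!-- ifInOctets for interface instance 1, purely for illustration -->
                     <OID>.1.3.6.1.2.1.2.2.1.10.1</OID>
                     <Syntax>0</Syntax>
                     <Value VariantType="8" />
                  </SnmpVarBind>
               </SnmpVarBinds>
            </ProbeAction>
            <!-- Validate the returned data before mapping it -->
            <ConditionDetection ID="FilterData" TypeID="System!System.ExpressionFilter">
               <Expression>
                  <SimpleExpression>
                     <ValueExpression>
                        <XPathQuery Type="Integer">/DataItem/SnmpVarBinds/SnmpVarBind[1]/Value</XPathQuery>
                     </ValueExpression>
                     <Operator>GreaterEqual</Operator>
                     <ValueExpression>
                        <Value Type="Integer">0</Value>
                     </ValueExpression>
                  </SimpleExpression>
               </Expression>
            </ConditionDetection>
            <!-- Convert the SNMP data item to Performance Data -->
            <ConditionDetection ID="PerfMapper" TypeID="Perf!System.Performance.DataGenericMapper">
               <ObjectName>Network Interface</ObjectName>
               <CounterName>In Octets Delta</CounterName>
               <InstanceName>$Config/IP$</InstanceName>
               <Value>$Data/SnmpVarBinds/SnmpVarBind[1]/Value$</Value>
            </ConditionDetection>
            <!-- Keep two samples and return the raw difference between them -->
            <ConditionDetection ID="DeltaCalc" TypeID="Perf!System.Performance.DeltaValueCondition">
               <NumSamples>2</NumSamples>
               <Absolute>true</Absolute>
            </ConditionDetection>
         </MemberModules>
         <Composition>
            <Node ID="DeltaCalc">
               <Node ID="PerfMapper">
                  <Node ID="FilterData">
                     <Node ID="SnmpProbe">
                        <Node ID="Scheduler" />
                     </Node>
                  </Node>
               </Node>
            </Node>
         </Composition>
      </Composite>
   </ModuleImplementation>
   <OutputType>Perf!System.Performance.Data</OutputType>
</DataSourceModuleType>

Note the Composition section: data flows Scheduler, SnmpProbe, Expression Filter, DataGenericMapper, DeltaValueCondition, with each Node wrapping the module that feeds it.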


SCOM: Distributing Workflows with the SpreadInitializationOverInterval Scheduler Parameter

In Operations Manager distributed agent-based monitoring scenarios, resource utilization of the monitoring workflows is rarely a point of major concern, as the data sources and probe actions typically consume only nominal resources on the agent host system at any given time.  However, in centralized monitoring scenarios, such as SNMP monitoring or wide-scale URL monitoring, the resource utilization of each workflow must be a primary concern: all workflows execute on a small number of management servers/agent proxies, and the potential for a massive number of workflows executing concurrently is very real.

While I had previously described some of my thoughts on workflow resource utilization with script probe actions, there is another highly relevant aspect of this topic: workflow schedule distribution.  When working with centralized poll/probe monitors, almost every workflow starts with a scheduler.  By default, the Operations Manager scheduler module does not distribute scheduler initialization, so every workflow scheduled on an interval of X minutes fires at the same time: every X minutes from the initiation of the Health Service (unless a SyncTime is specified).  If, for example, 2000 network interfaces were polled for status with an SNMP probe every 5 minutes, then every 5 minutes 2000 workflows would execute simultaneously on the agent proxy system.  Particularly if the workflow includes a script probe, the likely result is oversubscription of the agent proxy’s CPU, leading to script timeouts and/or SNMP poll failures.

If the scheduled workflows could initialize at distributed times, so that they do not fire in a synchronized fashion, significant scalability improvements could be realized.  I had been experimenting with using a PowerShell script in a discovery probe to randomly determine a SyncTime and assign it as a property of an object, and then passing this randomized SyncTime to schedulers as a variable in order to distribute workflow schedules.  This worked to an extent, but it was unnecessarily complicated and somewhat limited in effect (because the SyncTime parameter accepts times in HH:MM format, initialization of workflows scheduled on 5-minute intervals could only be distributed across five possible initialization slots within each interval).

However, I was very recently informed of a new R2-only scheduler parameter: SpreadInitializationOverInterval, which (as one would expect from the parameter name) distributes the initialization of the scheduler over a defined interval.   I’ve done a good bit of testing with this parameter and it works exactly as it should, which brings about major improvements in peak resource utilization in centralized monitoring scenarios.

Use of the parameter is quite simple: it expects a numeric value for the initialization interval (seconds by default, though other time units can be specified with the Unit attribute), and for obvious reasons it can’t be used along with a SyncTime parameter.  As for guidelines on ideal interval values, I have come to these conclusions in testing: for monitors or rules that execute on relatively short intervals (e.g. 5, 10, or 15 minutes), it works well to use the same value for both the scheduler interval and the SpreadInitializationOverInterval parameter; this maximizes the load distribution the spread initialization option provides.  For rules, monitors, or discoveries that execute infrequently (e.g. every 4, 12, or 24 hours), I prefer to set the SpreadInitializationOverInterval value to something like 30 minutes.  As an example, if a discovery workflow were scheduled to execute every 24 hours, setting the SpreadInitializationOverInterval parameter to 30 minutes would still distribute the load, but would not require newly added objects in the management group to wait up to 24 hours for discovery.

An example of the use of this parameter in a composite Data Source might look like this in XML:

<DataSource TypeID="System!System.Scheduler">
   <Scheduler>
      <!-- Note: "SimpleReccuringSchedule" is the element's actual spelling in the MP schema -->
      <SimpleReccuringSchedule>
         <Interval>$Config/Interval$</Interval>
         <SpreadInitializationOverInterval Unit="Seconds">$Config/Interval$</SpreadInitializationOverInterval>
      </SimpleReccuringSchedule>
      <ExcludeDates />
   </Scheduler>
</DataSource>

And the same scheduler in the Authoring Console:

The GUI "Configure" dialog in the Authoring Console doesn’t provide an option to set the SpreadInitializationOverInterval parameter, so it has to be edited in the XML.  This is as good an opportunity as any to highly recommend linking XML Notepad 2007 as the editor in the Ops Mgr Authoring Console.  XML Notepad 2007 is a great XML editor in general, but when used as the editor in the Authoring Console, it performs automatic XSD validation and even provides drop-down selections of valid options:

SCOM: Peculiar Error Upgrading from SP1 to R2 – Cross Platform management pack(s) are imported…

This is probably pretty rare, but I came upon a peculiar error today while upgrading an Operations Manager 2007 SP1 RMS to R2. The error message was:

Product: System Center Operations Manager 2007 R2 — System Center Operations Manager Cross Platform management pack(s) are imported in this management group. Please delete these management packs(s) before upgrading to System Center Operations Manager R2.

This wouldn’t be a surprise except for the fact that there were no Cross Platform MPs imported; furthermore, since the server was running 2007 SP1, Cross Platform MPs couldn’t even have been imported in the first place. It took me a bit to figure out, but the problem turned out to be a custom management pack (for SNMP monitoring) with the string UNIX in the MP name. It looks like the installation package checks installed MPs for fairly generic strings to determine whether Cross Platform MPs are imported (presumably applying only to pre-RTM versions of R2). I had to remove the custom MP with the term UNIX in the name and rerun the upgrade.
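If you hit this, a quick way to find candidate MPs before rerunning the upgrade is to search the installed management packs from the Operations Manager Command Shell. A sketch (the installer’s exact match strings aren’t documented; UNIX is simply the one that bit me):

# Run from the Operations Manager 2007 Command Shell on the RMS.
# Lists installed management packs whose ID or display name contains "UNIX".
Get-ManagementPack |
   Where-Object { $_.Name -match 'UNIX' -or $_.DisplayName -match 'UNIX' } |
   Select-Object Name, DisplayName, Version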

Note:  Anders Bengtsson has described a situation with the same error when upgrading the second node in an RMS cluster, here.

Recursively Listing Security Group Members with PowerShell, Version 2

In this post, I described using PowerShell to generate a report of all members of a particular group, including nested groups.  The script used ADSI calls to the WinNT:// provider to recursively retrieve both local and domain group members.  However, it was brought to my attention that the WinNT:// provider cannot access group objects located in non-default OUs, only those in the default Users container.  In order to access group objects located in OUs, the ADSI queries have to be targeted at the LDAP:// provider.  This posed a bit of a challenge to correct, because LDAP:// can’t access local group members, so the script has to use the WinNT:// and LDAP:// providers selectively.

An updated version of this script can be downloaded here.  This version uses the same command-line syntax:  powershell.exe c:\scripts\listusers.ps1 "Domain_or_Server_Name\Group_Name" 5.  The first parameter is a domain name or server name, followed by a backslash and the group name; if the group name contains spaces, this parameter must be wrapped in quotes.  The second parameter is the recursion depth: how many levels of nested groups the script will traverse and report on.  The script outputs group membership details to the command window as well as to a text file (located in the same directory as the script) named after the group.

The inner workings of this script became more complicated in this version, largely because I wanted the script to automatically select which provider to use in the ADSI calls.
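The crux of that selection logic looks something like the following sketch (the function name is hypothetical and error handling is omitted; the full script also handles local groups and nested recursion):

# Resolve a group to an ADSI path, preferring LDAP:// so that groups in
# non-default OUs are found; fall back to WinNT:// for local groups.
function Get-GroupAdsPath([string]$DomainOrServer, [string]$GroupName)
{
    # Try an LDAP search of the domain for a group object with this name.
    $searcher = New-Object System.DirectoryServices.DirectorySearcher
    $searcher.SearchRoot = [ADSI]"LDAP://$DomainOrServer"
    $searcher.Filter = "(&(objectCategory=group)(cn=$GroupName))"
    $result = $searcher.FindOne()
    if ($result -ne $null)
    {
        return $result.Path   # e.g. LDAP://CN=MyGroup,OU=Groups,DC=...
    }
    # No LDAP match: treat the target as a server-local group.
    return "WinNT://$DomainOrServer/$GroupName,group"
}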


SCOM: WSH Vs. PowerShell Modules in Composite Workflows – Resource Utilization in SNMP Data Manipulation

One of the realities of working with SNMP monitoring is that, more often than not, the monitoring data are presented in a raw form that requires some kind of manipulation in order to render meaningful output.  For example, the required manipulation may be a simple arithmetic operation on two values to calculate a percentage, or, in the case of Counter data, mathematical operations based on the delta between values recorded across polling cycles.  In Operations Manager, these manipulations require exiting the realm of managed code and using script-based modules to perform the operations or to facilitate temporary storage of values from previous polling cycles.  Two sets of modules are available for the Operations Manager-supported scripting engines: WSH and PowerShell.  To date, I had been opting for VB scripts when authoring management packs, for two reasons: 1) WSH is universally deployed in Windows environments whereas PowerShell is not necessarily, so using VB scripts imposes no requirement to install PowerShell on proxy agents; and 2) I had assumed that the resource utilization impact of PowerShell was equal to or greater than that of WSH.  That assumption rested on the simple observation that, watching process resource utilization when launching powershell.exe and cscript.exe, powershell.exe consumes more memory and CPU time (assuming WSH 5.7 is installed).

The resource utilization of these script providers becomes a major concern particularly when implementing script-based modules in SNMP monitoring scenarios.  To illustrate the point: if a proxy agent were configured to proxy SNMP requests for 10 Cisco switches, each with an average of 20 discovered interfaces, and each interface were monitored with two monitors that use a script probe action to manipulate the raw SNMP data (e.g. collisions and octets), 400 scripts would be executed in a single polling cycle for the interface monitors alone in this small-scale scenario.  This poses a threat to the scalability of SNMP monitoring and could severely limit the number of devices/objects a single proxy agent can handle effectively.

In the course of trying to find a way to address this scalability issue, I was fortunate enough to communicate with someone possessing a great deal of insight into Operations Manager, who helpfully suggested that the PowerShell modules should be more efficient than the WSH-based modules in composite workflows.  I rewrote all of the scripts in the Cisco MP to convert them from VBScript to PowerShell and began some testing.  I was familiar with the tighter integration of PowerShell in R2 modules (PowerShell scripts no longer have to be launched as external commands), but to be honest, I was expecting to see a large number of powershell.exe processes spawned as the monitors fired.  That is not the case.  Rather, it looks like the modules execute the PowerShell script through the .NET Framework within the context of the monitoringhost.exe process.  This does appear to be more efficient overall, as the overhead associated with spawning new processes is effectively eliminated, and my impression thus far is that overall CPU utilization is reduced.
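For reference, the property bag scripts in question follow the usual MOM.ScriptAPI pattern; here is a minimal sketch (the parameter names and delta logic are illustrative). Note the final line: with the R2 PowerShell modules the bag is simply emitted to the output stream, since the script runs in-process:

param($StateFile, $CurrentValue)

# Create a property bag via the MOM.ScriptAPI COM object.
$api = New-Object -ComObject 'MOM.ScriptAPI'
$bag = $api.CreatePropertyBag()

# Read the value saved by the previous polling cycle, then persist the
# current value for the next cycle.
$previous = $null
if (Test-Path $StateFile) { $previous = [double](Get-Content $StateFile) }
Set-Content -Path $StateFile -Value $CurrentValue

# Only return a delta once two polls have been observed.
if ($previous -ne $null)
{
    $bag.AddValue('Delta', [double]$CurrentValue - $previous)
}

# Emit the bag to the output stream for the PowerShell probe module.
$bag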

However, switching from WSH scripts to PowerShell scripts in R2 workflows is a bit like jumping out of the frying pan and into the fire: instead of spawning a large number of processes, each consuming relatively small amounts of processor and memory resources, the PowerShell script modules drive a single process (monitoringhost.exe) to consume a large quantity of resources, particularly CPU cycles.  Overall, memory utilization looks a lot better with the PowerShell modules, and although CPU utilization does seem better as well, it is still a concern for scalability.

Thus far, I have been doing this performance testing in a development environment, with OpsMgr running on virtual machines on both workstation and older server-class hardware, neither of which gives a good indication of real-world scalability (particularly given that these VMs are running SQL, all OpsMgr duties, and SNMP simulations to boot).  On one of these woefully over-utilized VMs, somewhere around 130-150 interfaces on 10 monitored Cisco devices seemed to be the breaking point, but a more realistic OpsMgr deployment (segregated database, RMS, and MS duties) on physical hardware should be able to handle far more than that.  I will report an update once I get a chance to do some broader scalability testing with the PowerShell version of the MP on more appropriate hardware.

In summary, both the WSH and PowerShell probe and write action modules introduce a relatively heavy CPU load when used for data manipulation, heavy relative to the very simple operations required to manipulate SNMP data, and a managed-code module would be far more desirable if one were available.  At present, however, these two providers are the only supported mechanisms for handling data that require processing before being returned to a rule or monitor.  My testing thus far supports the assertion that R2 implements the PowerShell modules more efficiently than the WSH-based modules, which is welcome news given the relative ease and impressive flexibility of scripting with PowerShell.  I’ve seen some talk that PowerShell V2 is supposed to bring significant performance improvements over V1, and I hope to do some testing with the CTP version of V2 on an OpsMgr proxy agent in the very near future to see if it helps address any of the scalability challenges in SNMP monitoring with OpsMgr.  As for the best approach at present, PowerShell looks like the way to go, and the overall impact on the MS/proxy agents can be mitigated by spreading monitored objects across multiple proxy agents, limiting discovery to only those objects that actually need to be monitored (e.g. interfaces), and avoiding overly aggressive monitor schedules.

SCOM: Building on the Net-SNMP MPs

Due to the ubiquity of the Net-SNMP agent, the Net-SNMP management packs can be used for a wide range of UNIX/Linux devices, and one of my primary intentions in creating these management packs was to extend them to Linux-based proprietary platforms such as Check Point Secure Platform and VMWare ESX.  To that end, I am currently putting the finishing touches on management packs for Check Point Splat and VMWare ESX SNMP monitoring that reference the Net-SNMP Library MP. 

Check Point Secure Platform

SPlat is a hardened Linux operating system, which conveniently supports the Net-SNMP agent for manageability.  The Check Point-specific SNMP objects are exposed through the extended Net-SNMP agent as described in the CHECKPOINT-MIB.  So in this case, the Net-SNMP Monitoring MP can be used for basic system health, while an additional Check Point MP can be layered on to monitor the Check Point software modules for availability status and Firewall/VPN/etc. performance metrics.

VMWare ESX – SNMP

Of course, ESX server is a modified Red Hat Enterprise Linux distribution that also utilizes the Net-SNMP agent for SNMP support.  VMWare exposes ESX-specific objects to SNMP via dlmod extensions to the Net-SNMP agent, including VM guest information and some performance metrics.  So, in VMWare environments, the host operating system can be monitored for health through the traditionally Net-SNMP-implemented MIBs (UCD-SNMP, HOST-RESOURCES), while VMWare-specific counters can be monitored through the VMWare MIBs.

When it comes to monitoring VMWare, the VMWare SNMP implementation has the advantage of being easy to deploy and rather lightweight, and given the likelihood that SNMP may already be used in VMWare environments for full vendor hardware monitoring, it is a good way to introduce some monitoring of the hypervisor virtualization layer.  That said, the VMWare SNMP implementation does leave a lot to be desired: for example, alarms/events are only exposed through traps, only a few performance counters are available, and many VMWare Infrastructure objects are not represented at all.  For more comprehensive monitoring of a VMWare environment, the only real data provider choice seems to be the VMWare API.  I’m working on something along those lines presently, but I’ll post more on that at a later date.