Scalability and Performance (Design and Testing) in the xSNMP Management Packs
December 14, 2009 9 Comments
In this post, I intend to describe some of the challenges in scaling SNMP monitoring in an Operations Manager environment to a large number of monitored objects, as well as my experiences from testing and the approaches that I took to address these challenges with the xSNMP Management Packs.
Background
In spite of the market availability of many task-specific SNMP monitoring applications boasting rich feature sets, I think that a strong case can be made for the use of System Center Operations Manager in this SNMP monitoring role. Using a single product for systems and infrastructure (SNMP) monitoring facilitates unparalleled monitoring integration (e.g. including critical network devices/interfaces or appliances in Distributed Application Models for vital business functions). The rich MP authoring implementation, dynamic discovery capabilities, and object-oriented modeling approach allow a level of depth and flexibility in SNMP monitoring not often found in pure SNMP monitoring tools.
However, Operations Manager is first and foremost a distributed monitoring application, most often depending on agents to run small workloads independently. Inevitably, running centralized monitoring workloads (i.e. SNMP polls) in a distributed monitoring application is going to carry a higher performance load than the same workloads in a task-specific centralized monitoring application that was built from the ground up to handle a very high number of concurrent polls with maximum efficiency. This centralized architecture would likely feature a single scheduler process that distributes execution of polls in an optimized fashion as well as common polling functions implemented in streamlined managed code. With SNMP monitoring in Operations Manager, any optimization of workload scheduling and code optimization more or less falls to the MP author to implement.
While working on the xSNMP Management Packs, I spent a lot of time testing different approaches to maximize efficiency (and thus scalability) in a centralized SNMP monitoring scenario. I’m sure there is always room for continual improvement, but I will try to highlight some of the key points of my experiences in this pursuits.
Designing for Cookdown
Cookdown is one of the most important concepts in MP authoring when considering the performance impact of workflows. A great summary of OpsMgr cookdown can be found here. In effect, the cookdown process looks for modules with identical configurations (including input parameters) and replaces the multiple executions of redundant modules with a single execution. So, if one wanted to monitor and collect historical data on the inbound and outbound percent utilization and Mb/s throughput of an SNMP network interface, a scheduler and SNMP Probe (with VarBinds defined to retrieve the in and out octets counters for the interface) could be configured. As long as each of the rules and monitors provided the same input parameters to these modules for each interface, the scheduler and SNMP probe would only execute once per interval per interface. Taking this a step further, the SNMP probe could be configured to gather all SNMP values for objects to monitor in the IFTable for this interface (e.g. Admin Status, Oper Status, In Errors, Out Errors), and these values could be used in even more rules and monitors. The one big catch here is that the SNMP Probe module stops processing SNMP VarBind’s once it hits an error. So, it’s typically not a good idea to mix SNMP VarBinds for objects that may not be implemented on some agents with OIDS that would be implemented on all agents.
Workflow Scheduling