SCOM: Distributing Workflows with the SpreadInitializationOverInterval Scheduler Parameter

In Operations Manager distributed agent-based monitoring scenarios, the resource utilization of monitoring workflows is rarely a major concern, as the data sources and probe actions typically consume only nominal resources on the agent host system at any given time. However, in centralized monitoring scenarios, such as SNMP monitoring or wide-scale URL monitoring, the resource utilization of each workflow must be a primary concern: all workflows execute on a small number of management servers/agent proxies, and the potential for a massive number of workflows executing concurrently is very real.

While I had previously described some of my thoughts on workflow resource utilization with script probe actions, there is another highly relevant aspect of this general topic: workflow schedule distribution. When working with centralized poll/probe monitors, almost every workflow will start with a scheduler. By default, the Operations Manager scheduler module does not distribute scheduler initialization, so every workflow scheduled on an interval of X minutes fires at the same time: every X minutes from the initiation of the Health Service (unless a SyncTime is specified). If, for example, 2000 network interfaces were polled for status with an SNMP probe every 5 minutes, then every 5 minutes 2000 workflows would execute simultaneously on the agent proxy system. Particularly if the workflow includes a script probe, the likely result is oversubscription of the agent proxy's CPU, leading to script timeouts and/or SNMP poll failures.
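For reference, a bare interval-only scheduler configuration looks something like the sketch below (using a 300-second interval to match the 5-minute polling example); with no SyncTime and no spread, every workflow instance sharing this configuration fires on the same cadence relative to Health Service startup.

<DataSource TypeID="System!System.Scheduler">
   <Scheduler>
      <SimpleReccuringSchedule>
         <!-- 5-minute interval, expressed in seconds; no SyncTime, no spread -->
         <Interval>300</Interval>
      </SimpleReccuringSchedule>
      <ExcludeDates />
   </Scheduler>
</DataSource>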

If the scheduled workflows could initialize at distributed times, so that they do not fire in a synchronized fashion, significant scalability improvements could be realized. I had been experimenting with using a PowerShell script in a discovery probe to randomly determine a SyncTime, assign it as a property of an object, and then pass this randomized SyncTime to schedulers as a variable in order to distribute workflow schedules. This worked to an extent, but it was unnecessarily complicated and somewhat limited in effect: because the SyncTime parameter accepts times in HH:MM format, initialization of workflows scheduled on 5-minute intervals could only be distributed across one of five possible initialization slots within the 5-minute interval.
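As a sketch of that earlier approach, the scheduler below binds its SyncTime to a discovered property; the class and property names (MyMP.MyNetworkDevice, SyncTime) are purely hypothetical placeholders for whatever the discovery script populates:

<DataSource TypeID="System!System.Scheduler">
   <Scheduler>
      <SimpleReccuringSchedule>
         <!-- 5-minute interval; the randomized HH:MM SyncTime offsets the start, but only to one of five one-minute slots -->
         <Interval>300</Interval>
         <SyncTime>$Target/Property[Type="MyMP.MyNetworkDevice"]/SyncTime$</SyncTime>
      </SimpleReccuringSchedule>
      <ExcludeDates />
   </Scheduler>
</DataSource>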

However, I was very recently informed of a new R2-only scheduler parameter: SpreadInitializationOverInterval, which (as one would expect from the parameter name) distributes the initialization of the scheduler over a defined interval. I've done a good bit of testing with this parameter, and it works exactly as it should, bringing major improvements in peak resource utilization in centralized monitoring scenarios.

Use of the parameter is quite simple: it expects a numeric value for the initialization interval (in seconds by default, though different time units can be specified with the Unit attribute), and, for obvious reasons, it can't be used along with a SyncTime parameter. As for guidelines on ideal interval values, I have come to these conclusions in testing: for monitors or rules that execute on relatively short intervals (e.g. 5, 10, 15 minutes), it works well to use the same value for both the scheduler interval and the SpreadInitializationOverInterval parameter; this maximizes the load distribution facilitated by the spread initialization option. For rules, monitors, or discoveries that execute infrequently (e.g. every 4, 12, or 24 hours), I prefer to set the SpreadInitializationOverInterval value to something like 30 minutes. As an example, if a discovery workflow were scheduled to execute every 24 hours, setting the SpreadInitializationOverInterval parameter to 30 minutes would still facilitate load distribution, but would not require new objects added to the management group to wait up to 24 hours for discovery.

An example of the use of this parameter in a composite Data Source might look like this in XML:

<DataSource TypeID="System!System.Scheduler">
   <Scheduler>
      <SimpleReccuringSchedule>
         <Interval>$Config/Interval$</Interval>
         <SpreadInitializationOverInterval Unit="Seconds">$Config/Interval$</SpreadInitializationOverInterval>
      </SimpleReccuringSchedule>
      <ExcludeDates />
   </Scheduler>
</DataSource>

And the same scheduler in the Authoring Console:

The GUI "Configure" dialog in the Authoring Console doesn't provide an option to set the SpreadInitializationOverInterval parameter, so it has to be edited in the XML. This is as good an opportunity as any to highly recommend linking XML Notepad 2007 as the editor in the Ops Mgr Authoring Console. XML Notepad 2007 is a great XML editor in general, but when used as the editor in the Authoring Console, it performs automatic XSD validation and even provides drop-down selections of valid options.
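Returning to the guideline for infrequently scheduled workflows, a configuration along those lines might look something like this sketch, with the values hard-coded in seconds purely for illustration (a 24-hour interval with initialization spread over 30 minutes):

<DataSource TypeID="System!System.Scheduler">
   <Scheduler>
      <SimpleReccuringSchedule>
         <!-- 24-hour interval (86400 seconds), initialization spread over 30 minutes (1800 seconds) -->
         <Interval>86400</Interval>
         <SpreadInitializationOverInterval Unit="Seconds">1800</SpreadInitializationOverInterval>
      </SimpleReccuringSchedule>
      <ExcludeDates />
   </Scheduler>
</DataSource>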

About Kristopher Bash
Kris is a Senior Program Manager at Microsoft, working on UNIX and Linux management features in Microsoft System Center. Prior to joining Microsoft, Kris worked in systems management, server administration, and IT operations for nearly 15 years.

3 Responses to SCOM: Distributing Workflows with the SpreadInitializationOverInterval Scheduler Parameter

  1. Pingback: Scalability and Performance Design and Testing in the xSNMP Management Packs « Operating-Quadrant

  2. Patrik Dufresne says:

    Hi,

    I know it's an old post, but I've been experimenting with the SpreadInitializationOverInterval parameter. After extended testing, I noticed that using the parameter is even worse for the scheduling: it ran 120 instances of the workflow within 10 seconds. I was expecting those 120 instances to be spread over 300 seconds.

    So here is my question: what makes you believe the parameter is working? How did you test it?

    Here are some graphs. The X axis is the timeline; the Y axis is the number of executions during that slice of time. The step is set to 5 seconds.

    Without the parameter:
    http://picasaweb.google.com/lh/photo/Ahw9uvXWhbbyTfCdlzyXeg?feat=directlink

    With the parameter:
    http://picasaweb.google.com/lh/photo/6V4MpRR8yCPwTpjfIXKA4Q?feat=directlink

  3. Ben Lang says:

    I think I am running into this issue… I understand the methodology, but how do I change this setting? Also, how do I identify which SNMP polling workflow (rule or monitor) appears to be causing the issue?

    Thanks!
    BEN
