SCOM: Automatically Starting Maintenance Mode When Servers are Rebooted for Patching

While comprehensive server and application monitoring is a vital service for many enterprises, it can cause headaches during routine maintenance such as server patching. It is good security practice to apply patches as they become available (e.g. monthly), but without some way of temporarily suppressing alerting, the required reboots can lead to storms of alerts in monitored environments. If every server that reboots fires a few alerts (services stopped, synthetic monitors, etc.), and a few hundred servers are rebooted in a relatively short maintenance window, it doesn’t take long for a few thousand false alerts to be generated. Not only is this annoying and a drain on notification systems (email), it can also mean that real problems that manifest after the reboot are lost in the fray.

An ideal solution would suppress monitoring just before the shutdown process and re-engage it quickly after the reboot. Maintenance mode is the obvious mechanism for this in SCOM-monitored environments, but the manual interaction required by the GUI makes it impractical for large-scale maintenance. Several helpful solutions have been published that use the PowerShell exposure of maintenance mode to make the process more efficient. These include Tim McFadden’s remote maintenance mode scheduling utility: http://www.scom2k7.com/scom-remote-maintenance-mode-scheduler-20/ and Clive Eastwood’s AgentMM utility: http://blogs.technet.com/cliveeastwood/archive/2007/09/18/agentmm-a-command-line-tool-to-place-opsmgr-agents-into-maintenance-mode.aspx

One general way to address this ‘false alarm during patching’ issue would be to use one of these maintenance mode tools to drop groups of servers into maintenance mode. However, in my opinion, there are some gaps in this approach. Firstly, if wide-scale server maintenance (i.e. patching) is done in a rolling window over a weekend, it may be hard to schedule maintenance mode windows for groups of servers that cover only the time needed for the reboot. If the maintenance mode window is too broad, real problems that surface after the reboot may not be caught promptly. Additionally, the staff responsible for patch deployments and scheduling may not be the same staff responsible for SCOM administration. Finally, large-scale patch deployment is enough of a chore in itself, and an additional step in the process may not be welcomed by the staff responsible for it.

An Automated Solution:

The solution that I wanted was one that would automatically put servers into maintenance mode only when they are being rebooted for patching, and only for the minimum time necessary – all without administrative intervention.   If all went according to plan, the only alerts received during patch maintenance windows should be for problems that persisted after the reboot, i.e. the real problems.

The first part of this equation is the detection of patching. How does a monitor differentiate between a reboot caused by a problem or unplanned maintenance and a reboot caused by planned patch maintenance? The answer lies in the event log. WSUS or Windows Automatic Updates should log an Event ID 22 before initiating a reboot, and other tools such as Shavlik NetChk are likely to log a similar event to indicate a pending reboot from patching. If all else fails, a script could be set up in the patch management utility to log a custom event with the eventcreate.exe (Windows 2003) or logevent.exe (pre-Windows 2003) utilities. To trap these events, create an alert-generating Windows event log rule that matches the event indicating an imminent reboot due to patching. To make notification subscriptions easier in SCOM 2007 SP1, select “Maintenance” as the alert category.
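For the custom-event fallback, the call to eventcreate.exe might be sketched as follows. The helper name, the event ID 999, the source name PatchReboot and the description text are all placeholders chosen for illustration; the only hard constraint assumed here is that eventcreate.exe accepts custom event IDs in the range 1 to 1000.

```powershell
# Hypothetical helper: builds the eventcreate.exe arguments for a custom
# "reboot pending" event. The ID, source and text are placeholders.
function Get-PatchRebootEventArgs {
    param(
        [int]$EventId = 999,
        [string]$Source = 'PatchReboot',
        [string]$Description = 'Server will reboot to complete patch installation.'
    )
    # eventcreate.exe only accepts custom event IDs from 1 to 1000.
    if ($EventId -lt 1 -or $EventId -gt 1000) {
        throw "Event ID must be between 1 and 1000."
    }
    return @('/L', 'APPLICATION', '/T', 'INFORMATION',
             '/SO', $Source, '/ID', "$EventId", '/D', $Description)
}

$createArgs = Get-PatchRebootEventArgs
# Run on the server being patched, just before the reboot is triggered:
# & eventcreate.exe $createArgs
```

The event log rule in SCOM would then match on whatever ID and source are chosen here.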

The second part of this solution is the mechanism that triggers the maintenance mode script. In theory, the script could be called client-side as a recovery task, but this would require PowerShell to be installed on every agent as well as Run As accounts with appropriate rights in SCOM. A much simpler method is to use a Command notification channel in SCOM. With a command notification (and a matching notification subscription), the alert can call a PowerShell script server-side and pass it the affected object's identity (here, its display name). The script can then start maintenance mode for a set period of time.

Assuming the Managed Entity Display Name variable is passed to the script as the only argument, the main portion of a PowerShell script to do this would look something like:

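# Start maintenance mode now and end it $nMinutes later
$time = [DateTime]::Now
$nMinutes = 20
# Managed Entity Display Name passed in by the notification channel
$sHost = $args[0]
# Find the Windows Server object whose name starts with the host name
$class = Get-MonitoringClass | where {$_.DisplayName -eq "Windows Server"}
$monObj = $class | Get-MonitoringObject | where {$_.Name -like "$sHost*"}
New-MaintenanceWindow -MonitoringObject $monObj -Comment "Planned Maintenance - scripted" -StartTime $time -EndTime $time.AddMinutes($nMinutes)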

This example assumes that the server will be back in proper working order within 20 minutes of the initial event being logged.
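A script launched by powershell.exe from a notification channel does not start inside the Operations Manager Command Shell, so the SCOM cmdlets used above are not loaded automatically. A minimal sketch of the setup that would need to run first, assuming the SCOM 2007 snap-in is installed on the machine running the script; the function name Connect-OpsMgr and the server name rms01.example.local are placeholders, not part of the original script:

```powershell
# Assumption: placeholder name of the Root Management Server.
$rmsServer = 'rms01.example.local'

# Hypothetical helper: loads the SCOM 2007 cmdlets and connects to the
# management group so that Get-MonitoringClass and friends are available.
function Connect-OpsMgr {
    param([string]$Server)
    $snapin = 'Microsoft.EnterpriseManagement.OperationsManager.Client'
    # Load the snap-in only if it is not already present in this session.
    if (-not (Get-PSSnapin -Name $snapin -ErrorAction SilentlyContinue)) {
        Add-PSSnapin $snapin
    }
    # Switch to the monitoring provider and open the connection.
    Set-Location 'OperationsManagerMonitoring::'
    New-ManagementGroupConnection -ConnectionString:$Server | Out-Null
    Set-Location $Server
}
```

Calling Connect-OpsMgr -Server $rmsServer at the top of the script, before the maintenance mode lines, would be one way to use it.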

The sample PowerShell script can be downloaded here.

SCOM: Monitoring the WSH Version on Windows Agents

Given the severe performance issues that SCOM monitoring can cause on hosts without Windows Script Host 5.7, and the possibility that Windows File Protection can replace WSH 5.7 binaries with older versions (http://support.microsoft.com/default.aspx/kb/955360), I’ve found it useful to use a SCOM unit monitor to check managed agents for the expected WSH version (5.7 or later).

The script that I’ve written for this purpose first checks the OS Caption with WMI (to exclude 64-bit hosts from the check) and then checks the version of cscript.exe using a WSH FileSystemObject.

' sWinDir is assumed to already hold the Windows directory path (e.g. C:\WINDOWS)
Set oFSO = CreateObject("Scripting.FileSystemObject")
sFileVersion = oFSO.GetFileVersion(sWinDir & "\system32\cscript.exe")

To deploy this as a unit monitor, create a two-state timed script monitor. Set the unhealthy state expression to Property[@Name='status'] equals Error and the healthy state expression to Property[@Name='status'] equals OK.

A message with a description of the problem and the current cscript.exe version can be added to the alert with the $Data/Context/Property[@Name='Message']$ XPath string.
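The version comparison behind the monitor can be sketched in PowerShell for ad hoc checking. This is not the monitor script itself, which is VBScript; the function name is hypothetical, and the returned keys simply mirror the status and Message properties used in the expressions above.

```powershell
# Hypothetical helper mirroring the monitor's 'status' and 'Message' values.
function Get-WshVersionStatus {
    param(
        [string]$FileVersion,            # e.g. the string returned by GetFileVersion
        [string]$MinimumVersion = '5.7'  # expected WSH version or later
    )
    # Compare as version numbers, not as strings, so 5.10 sorts after 5.7.
    if ([version]$FileVersion -ge [version]$MinimumVersion) {
        $status = 'OK'
    } else {
        $status = 'Error'
    }
    return @{
        status  = $status
        Message = "cscript.exe version is $FileVersion (expected $MinimumVersion or later)."
    }
}

# Ad hoc check on the local machine (Windows only):
# $path = Join-Path $env:windir 'system32\cscript.exe'
# $fvi  = [System.Diagnostics.FileVersionInfo]::GetVersionInfo($path)
# $ver  = "{0}.{1}.{2}.{3}" -f $fvi.FileMajorPart, $fvi.FileMinorPart, $fvi.FileBuildPart, $fvi.FilePrivatePart
# (Get-WshVersionStatus -FileVersion $ver).status
```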