SCOM: Automatically Starting Maintenance Mode When Servers are Rebooted for Patching
August 15, 2009
One general way to address this ‘false alarm during patching’ issue would be to use one of these maintenance mode tools to drop groups of servers into maintenance mode. However, there are some gaps in this approach, in my opinion. Firstly, if wide-scale server maintenance (e.g. patching) is done in a rolling window over a weekend, it may be hard to schedule maintenance mode windows for groups of servers that cover only the time needed for the reboot. If the maintenance mode window is too broad, real problems that surface after the reboot may not be caught promptly. Additionally, the staff responsible for patch deployments and scheduling may not be the same staff responsible for SCOM administration. Finally, large-scale patch deployment is enough of a chore in itself, and an additional step in the process may not be welcomed by the staff who carry it out.
An Automated Solution:
The solution that I wanted was one that would automatically put servers into maintenance mode only when they are being rebooted for patching, and only for the minimum time necessary – all without administrative intervention. If all went according to plan, the only alerts received during patch maintenance windows should be for problems that persisted after the reboot, i.e. the real problems.
The first part of this equation is the detection of patching. How does a monitor differentiate between a reboot caused by unplanned maintenance or a problem and a reboot caused by planned patch maintenance? The answer lies in the event log. WSUS or Windows Automatic Updates should log an Event ID 22 before initiating a reboot, and other tools like Shavlik NetChk are likely to log a similar event to indicate a pending reboot from patching. If all else fails, a script could be set up in the patch management utility to use the eventcreate.exe (Windows 2003) or logevent.exe (pre-Windows 2003) utilities to log a custom event. To trap these events, create an alert-generating Windows Event log rule with the parameters needed to match the event that indicates an imminent reboot due to patching. To make notification subscriptions easier in SCOM 2007 SP1, we can make use of the “Maintenance” category, so select “Maintenance” as the alert category.
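For the fallback case where the patching tool logs nothing useful, a minimal sketch of the custom-event approach might look like the following. The helper name, the event ID 999, the source name PatchReboot, and the description text are all placeholders invented for illustration; the only requirement is that they match whatever the alert rule is configured to look for.

```powershell
# Hypothetical helper: builds the argument list for eventcreate.exe.
# Event ID, source, and description are illustrative defaults, not values
# from the original post; align them with your own alert rule.
function Get-PatchRebootEventArgs {
    param(
        [int]$EventId = 999,
        [string]$Source = 'PatchReboot',
        [string]$Description = 'Rebooting for planned patch maintenance'
    )
    # eventcreate.exe only accepts custom event IDs from 1 to 1000.
    if ($EventId -lt 1 -or $EventId -gt 1000) {
        throw "Event ID must be between 1 and 1000, got $EventId"
    }
    # /T = event type, /L = target log, /SO = source, /D = description
    return @('/T', 'INFORMATION', '/ID', "$EventId", '/L', 'APPLICATION',
             '/SO', $Source, '/D', $Description)
}

# Usage on the server being patched, just before the reboot is triggered:
#   $eventArgs = Get-PatchRebootEventArgs
#   & eventcreate.exe $eventArgs
```

Keeping the argument list in one place makes it easier to keep the patch script and the SCOM rule in agreement about which event ID and source signal a planned reboot.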
The second part of this solution is the mechanism to trigger the maintenance mode script. In theory, the script could be called client-side in the form of a recovery task, but this would require PowerShell to be installed uniformly across the managed servers, as well as Run As accounts with appropriate rights in SCOM. A much simpler method is to use a Command Notification channel in SCOM.
With a Command Notification (and a notification subscription), the alert can call a PowerShell script and pass it an identifier for the affected object, such as its display name. The PowerShell script can then start maintenance mode for a set period of time.
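One practical detail: a Command Notification launches plain powershell.exe rather than the Operations Manager Command Shell, so the script most likely has to load the SCOM snap-in and connect to the management group itself before the monitoring cmdlets will resolve. A rough sketch of that setup, assuming SCOM 2007 and a script running on (or able to reach) the root management server, is shown here. The function name and the default of localhost are illustrative assumptions.

```powershell
# Sketch: prepare a plain powershell.exe session to use the SCOM 2007 cmdlets.
# $RootMS is a placeholder for your root management server; 'localhost' only
# works if the script runs on that server.
function Connect-ScomForMaintenance {
    param([string]$RootMS = 'localhost')

    # Load the Operations Manager snap-in if it is not already loaded.
    $snapin = 'Microsoft.EnterpriseManagement.OperationsManager.Client'
    if (-not (Get-PSSnapin -Name $snapin -ErrorAction SilentlyContinue)) {
        Add-PSSnapin $snapin -ErrorAction Stop
    }

    # The monitoring cmdlets work relative to a management group connection
    # exposed through the OperationsManagerMonitoring provider.
    Set-Location 'OperationsManagerMonitoring::'
    New-ManagementGroupConnection -ConnectionString:$RootMS | Out-Null
    Set-Location $RootMS
}
```

Calling Connect-ScomForMaintenance once at the top of the script should leave the session in the same state the Command Shell shortcut normally provides.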
The main portion of a PowerShell script that starts maintenance mode would look something like the following, assuming the Managed Entity Display Name variable is passed to the script as the only argument:
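# Start the window now and keep it open for $nMinutes minutes
$time = [DateTime]::Now
$nMinutes = 20
# The managed entity display name arrives as the first script argument
$sHost = $args[0]
# Find the Windows Server object(s) whose name starts with that host name
$class = get-monitoringclass | where {$_.displayname -eq "Windows Server"}
$monObj = $class | get-monitoringobject | where {$_.name -like "$sHost*"}
New-MaintenanceWindow -MonitoringObject $monObj -Comment "Planned Maintenance - scripted" -StartTime $time -EndTime $time.AddMinutes($nMinutes)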
This example assumes that the server will be back in proper order within 20 minutes of the initial event being logged.
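One thing to watch with the prefix match in the script: a wildcard such as SRV1* would also match SRV10 or SRV11, which could put the wrong servers into maintenance mode. A stricter comparison, sketched here as a hypothetical helper that is not part of the original script, accepts only the exact name or a fully qualified name that begins with the host name followed by a dot.

```powershell
# Hypothetical helper: true when $ObjectName is $HostName itself or an FQDN
# that begins with "$HostName." (so SRV1 does not match SRV10).
function Test-HostNameMatch {
    param([string]$ObjectName, [string]$HostName)

    # An empty host name should never select anything.
    if ([string]::IsNullOrEmpty($HostName)) { return $false }

    # -eq is case-insensitive for strings; StartsWith is told to ignore case.
    $prefix = "$HostName."
    return (($ObjectName -eq $HostName) -or
            $ObjectName.StartsWith($prefix, [System.StringComparison]::OrdinalIgnoreCase))
}

# Possible use in place of the -like filter:
#   $monObj = $class | get-monitoringobject |
#       where { Test-HostNameMatch $_.name $sHost }
```

If the notification already passes a fully qualified name, the exact-match branch handles it and the prefix branch is simply never needed.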
The sample powershell script can be downloaded here.