SCOM: Automatically Starting Maintenance Mode When Servers are Rebooted for Patching

While a comprehensive deployment of server and application monitoring is a vital service for many enterprises, it can lead to headaches when it comes to routine maintenance such as server patching. Of course, it is good security practice to apply patches as they become available (e.g. monthly), but the required reboots can lead to storms of alerts in monitored environments unless there is some way of temporarily suppressing alerting. If every server that reboots fires a few alerts (services stopped, synthetic monitors, etc.), and a few hundred servers are rebooted in a relatively short maintenance window, it doesn’t take long for a few thousand false alerts to be generated. Not only is this annoying and a drain on notification systems (email), it can also mean that real problems that manifest after the reboot are lost in the fray.

An ideal solution for this problem would be for monitoring to be suppressed before the shutdown process and reengaged quickly after the reboot. Maintenance mode is the obvious mechanism to accomplish this in SCOM-monitored environments, but the manual interaction required by the GUI implementation makes it impractical for large-scale maintenance. Several helpful solutions have previously been published that use the PowerShell exposure of maintenance mode to make the process more efficient. These include Tim McFadden’s remote maintenance mode scheduling utility (http://www.scom2k7.com/scom-remote-maintenance-mode-scheduler-20/) and Clive Eastwood’s AgentMM utility (http://blogs.technet.com/cliveeastwood/archive/2007/09/18/agentmm-a-command-line-tool-to-place-opsmgr-agents-into-maintenance-mode.aspx).

One general way to address this ‘false alarm during patching’ issue would be to use one of these maintenance mode tools to drop groups of servers into maintenance mode. However, there are some gaps in this approach, in my opinion. Firstly, if wide-scale server maintenance (i.e. patching) is done in a rolling window over a weekend, it may be hard to schedule maintenance mode windows on groups of servers that cover only the time needed for the reboot. If the maintenance mode window is too broad, real problems that surface after the reboot may not be caught in a timely fashion. Additionally, the staff responsible for patch deployments and scheduling may not be the same staff responsible for SCOM administration. Finally, large-scale patch deployment is enough of a chore in itself, and an additional step in the process may not be welcomed by the staff responsible for patch deployments.

An Automated Solution:

The solution that I wanted was one that would automatically put servers into maintenance mode only when they are being rebooted for patching, and only for the minimum time necessary – all without administrative intervention.   If all went according to plan, the only alerts received during patch maintenance windows should be for problems that persisted after the reboot, i.e. the real problems.

The first part of this equation is the detection of patching. How does a monitor differentiate between a reboot caused by unplanned maintenance or a problem and a reboot caused by planned patch maintenance? The answer lies in the event log. WSUS or Windows Automatic Updates should log an Event ID 22 before initiating a reboot, and other tools like Shavlik NetChk are likely to log a similar event to indicate a pending reboot from patching. If all else fails, a script could be set up in the patch management utility to use the eventcreate.exe (Windows 2003) or logevent.exe (pre-Windows 2003) utilities to log a custom event. To trap these events, create an alert-generating Windows Event Log rule with the parameters required to alert on the event that indicates an imminent reboot due to patching. To make notification subscriptions easier in SCOM 2007 SP1, select “Maintenance” as the alert category.
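For the custom-event fallback, a minimal sketch of the eventcreate.exe call might look like the following. The event ID 999, the source name PatchMaintenance, the description text, and the helper name Get-PatchRebootEventArgs are all placeholders chosen for illustration; they are not values logged by WSUS or any patch tool. The sketch only builds the argument list, so the actual call is left as a comment.

```powershell
# Build the eventcreate.exe arguments for a "reboot pending" marker event.
# The ID, source name and description are illustrative assumptions.
function Get-PatchRebootEventArgs {
    param(
        [int]$EventId = 999,
        [string]$Source = 'PatchMaintenance',
        [string]$Description = 'Reboot pending: planned patch maintenance'
    )
    if ($EventId -lt 1 -or $EventId -gt 1000) {
        throw "eventcreate.exe only accepts custom event IDs from 1 to 1000."
    }
    # /L = log name, /T = event type, /SO = source, /ID = event ID, /D = description
    return @('/L', 'APPLICATION', '/T', 'INFORMATION',
             '/SO', $Source, '/ID', "$EventId", '/D', $Description)
}

# On the server about to be rebooted (Windows only):
# & eventcreate.exe (Get-PatchRebootEventArgs)
```

The event log rule would then be pointed at whatever log, source, and ID are chosen here.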

The second part of this solution is the mechanism to trigger the maintenance mode script. In theory, the script could be called client-side in the form of a recovery task, but this would require PowerShell to be deployed uniformly across the servers, as well as Run As accounts with appropriate rights in SCOM. A much simpler method is to use a Command Notification channel in SCOM.
With a Command Notification (and a notification subscription), the alert can call a PowerShell script and pass the object ID to the script. The PowerShell script can then start maintenance mode for a set period of time.

The main portion of a PowerShell script to do this would look something like the following, assuming the Managed Entity Display Name variable is passed to the script as the only argument:

# Start the maintenance window now and keep it open for 20 minutes
$time = [DateTime]::Now
$nMinutes = 20
# Managed Entity Display Name, passed in by the command notification
$sHost = $args[0]
$class = get-monitoringclass | where {$_.displayname -eq "Windows Server"}
$monObj = $class | get-monitoringobject | where {$_.name -like "$sHost*"}
New-MaintenanceWindow -MonitoringObject $monObj -Comment "Planned Maintenance - scripted" -StartTime $time -EndTime $time.AddMinutes($nMinutes)

This example assumes that the server will be back in proper order within 20 minutes of the initial event being logged.
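One thing to watch with a trailing-wildcard match on the host name: a short name such as web1 also matches web10. A tighter test, sketched here on plain strings rather than SCOM objects, accepts only the bare host name or the host name followed by a DNS suffix. The function name Test-HostMatch and the sample names are hypothetical.

```powershell
function Test-HostMatch {
    param([string]$Name, [string]$HostName)
    # -eq and -like are case-insensitive by default in PowerShell.
    # Escape wildcard characters so the host name is matched literally.
    $escaped = [System.Management.Automation.WildcardPattern]::Escape($HostName)
    return ($Name -eq $HostName) -or ($Name -like "$escaped.*")
}

# Made-up names for illustration
$names = 'web1.contoso.com', 'web10.contoso.com', 'WEB1'
$names | Where-Object { Test-HostMatch -Name $_ -HostName 'web1' }
```

This returns web1.contoso.com and WEB1 but not web10.contoso.com. In the maintenance mode script, the same test could stand in for the wildcard comparison inside the where clause, assuming the monitoring object names follow one of those two forms.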

The sample PowerShell script can be downloaded here.

Lots more behind the cut:

Read more of this post

“Must-Have” Tools – Sun VirtualBox

Over the past few years, I’ve bounced between virtualization products in my home test lab, including MS Virtual PC and Virtual Server, as well as VMware Server and Desktop. Given the option, I’d love to have a Windows 2008 Hyper-V server to run my home development environment, but given the hardware requirements, I don’t see that happening soon. Recently, I have made the switch to Sun’s free VirtualBox product for both desktop and server virtualization at home. VirtualBox is an end-user oriented hosted virtualization product with an impressive feature set, including 64-bit OS support and a native RDP server that allows RDP connections even when the guest OS isn’t loaded. VirtualBox can also read VMDK and VHD virtual disks. I was able to take some VHD files for VMs built in Virtual Server, create a new VirtualBox VM (using the VHD files) and boot the OS without performing any radical conversion process. Presently, I have a VirtualBox instance on my workstation for desktop OS virtualization, as well as a VirtualBox instance running on a CentOS server that I use for server OS virtualization. Using the VirtualBox command line implementation, services can easily be configured on Linux to start and suspend VMs automatically when the server boots or shuts down. For more info see: http://www.glump.net/howto/virtualbox_as_a_service

VirtualBox can be downloaded at:  www.virtualbox.org

SCOM: Monitoring the WSH Version on Windows Agents

Given the severe performance issues that SCOM monitoring can cause on hosts without Windows Script Host 5.7, and the possibility that WSH 5.7 binaries can be replaced with older versions by Windows File Protection (http://support.microsoft.com/default.aspx/kb/955360), I’ve found it useful to use a SCOM unit monitor to check managed agents for the expected WSH version (5.7 or later).

The script that I’ve written for this purpose first checks the OS Caption with WMI (to exclude 64-bit hosts from the check) and then checks the version of cscript.exe using a WSH FileSystemObject:

' sWinDir (the Windows directory path) is assumed to be set earlier in the full script
Set oFSO = CreateObject("Scripting.FileSystemObject")
sFileVersion = oFSO.GetFileVersion(sWinDir & "\system32\cscript.exe")

To deploy this as a unit monitor, create a two-state timed script monitor. Set the unhealthy state expression to Property[@Name='status'] equals Error and the healthy state expression to Property[@Name='status'] equals OK.

A message with a description of the problem and the current cscript.exe version can be added to the alert with the $Data/Context/Property[@Name="Message"]$ XPath string.
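The version test itself is worth a note, because version strings should be compared numerically, field by field, rather than as text. The monitor described above is a VBScript; the following is only a PowerShell sketch of the comparison logic, and the function name Test-WshVersion and the sample version strings are made up for illustration.

```powershell
# Minimum WSH version the monitor expects (5.7 or later, per the text above).
$minimum = [version]'5.7'

function Test-WshVersion {
    param([string]$FileVersion)
    # Casting to [version] compares major/minor/build numerically,
    # so a hypothetical 5.10 would correctly sort after 5.7.
    if ([version]$FileVersion -ge $minimum) { 'OK' } else { 'Error' }
}

Test-WshVersion '5.7.0.16599'
Test-WshVersion '5.6.0.8820'
```

The two return values line up with the OK and Error states used in the health state expressions.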

SCOM: Locally Monitoring a Listening TCP Port

Typically, one would monitor a TCP port remotely, from a designated watcher node, as a means of confirming the availability of a network service, but in some cases this may not be the most desirable way to poll for TCP port status. For example, if you want to monitor the availability of a TCP port on a large number of servers that are SCOM agent-managed systems, and you are concerned with the availability of that port only when the rest of the system is functioning normally, it may make more sense to monitor the port status locally from the agent. This minimizes the number of false alarms (a remote monitor will throw alerts each time the monitored nodes are taken down for maintenance) and also makes deployment much easier (simply target groups for overrides to enable the monitor).
A simple VBS script that calls portqry.exe (from the Windows Support Tools) or “netstat -an” and parses the output to confirm port listening status can fulfill the monitoring role in this scenario. I wrote such a script, which uses netstat -an to list the TCP ports on the local system and parses the output to determine whether the defined TCP port is in a listening state.

The core logic can be seen in this excerpt:

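' List TCP endpoints numerically; -p TCP is the documented protocol switch
sCmd = "netstat -an -p TCP"
Set objShell = CreateObject("WScript.Shell")
Set objExec = objShell.Exec(sCmd)
Set oStdOut = objExec.StdOut

bl_Healthy = False

' nPortToCheck is defined elsewhere in the full script
Do Until oStdOut.AtEndOfStream
 sLine = oStdOut.ReadLine
 ' The trailing space keeps port 80 from also matching 8080
 If InStr(sLine, "LISTENING") > 0 And InStr(sLine, ":" & nPortToCheck & " ") > 0 Then
  bl_Healthy = True
 End If
Loop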

The full script can be downloaded here (this is provided as-is, with no guarantee of function or support, test before deploying, etc).
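When parsing netstat output, a plain substring test for ":80" will also match ":8080", so it helps to anchor the match to the local address column. The following is a PowerShell sketch of that idea, not the downloadable VBScript; the function name Test-PortListening and the sample lines are made up, and the sample data is used in place of a live netstat call so the logic can be tried anywhere.

```powershell
function Test-PortListening {
    param([string[]]$NetstatLines, [int]$Port)
    # Match the local address column exactly: ":<port>" followed by whitespace,
    # so that port 80 does not also match 8080.
    $pattern = '^\s*TCP\s+\S+:' + $Port + '\s+\S+\s+LISTENING'
    foreach ($line in $NetstatLines) {
        if ($line -match $pattern) { return $true }
    }
    return $false
}

# Sample lines in the layout printed by "netstat -an" (made-up data).
$sample = @(
    '  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING',
    '  TCP    0.0.0.0:8080           0.0.0.0:0              LISTENING',
    '  TCP    10.0.0.5:3389          10.0.0.9:51234         ESTABLISHED'
)
Test-PortListening -NetstatLines $sample -Port 8080   # True
Test-PortListening -NetstatLines $sample -Port 80     # False
```

Port 3389 also returns False here, because that connection is established rather than listening.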

Walkthrough on creating a corresponding Unit Monitor behind the cut

Read more of this post

Watch Those Variant Types

When creating some types of Unit Monitors (such as SNMP polls) in the OpsMgr GUI, SCOM will set the default type of the values in the expressions to string. If your health monitoring logic performs numerical comparisons, this will produce unexpected and unreliable results, because the monitor will compare the numbers as strings instead of as numerical data. When the system believes the values are strings, the greater-than and less-than operators compare alphabetical order, not numerical value.

The good news is that the fix is rather easy: export the management pack, edit the type definitions in the XML file, and import it back.
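The effect of comparing numbers as strings is easy to reproduce outside of SCOM. The following PowerShell sketch is only an illustration of the pitfall, with made-up values; it does not touch a management pack.

```powershell
# Values as they might arrive from a poll: text, not numbers.
$polled    = '9'
$threshold = '10'

# String comparison: '9' sorts after '10', so this is True.
$asString = $polled -gt $threshold

# Numeric comparison after an explicit cast: 9 > 10 is False.
$asNumber = [int]$polled -gt [int]$threshold

"String compare:  $asString"
"Numeric compare: $asNumber"
```

A monitor working on the string form would report that 9 exceeds a threshold of 10, which is exactly the kind of unreliable result described above.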

For more details:

Read more of this post

“Must Have” Tools – Terminals

This utility has so changed the way I work, it’s hard to imagine life before it.  Terminals is a remote desktop utility that implements tabbed windows for RDP, VNC, SSH, ICA (and more) remote connections. 

Get it here: http://www.codeplex.com/Terminals

“Must Have” SCOM Monitors – Long Running SQL Block Detection

Marios Philippopoulos did a great job with an article on monitoring long-running SQL blocks with SCOM 2007 at SQLServerCentral.com. The article, Monitoring Database Blocking Through SCOM 2007 Custom Rules and Alerts (http://www.sqlservercentral.com/articles/Blocking/64038/), provides a set of scripts for detecting (and displaying vital information about) long-running blocks in SQL Server 2000 and 2005, as well as a walkthrough on configuring the SCOM elements for alerting. This is an absolute life-saver and can really cut down on troubleshooting time when dealing with application issues caused by problematic SQL blocking.

The only customizations I’ve made when implementing this monitoring configuration are some changes to the SQL script: tuning the block duration threshold and increasing the size of some of the type definitions in the temp table to allow for longer command name output.

I’d highly recommend this article to anyone using SCOM to monitor SQL Server.

SCOM PowerShell Script for Maintenance Mode by IP address match

The web brings us a seemingly endless supply of PowerShell scripts for manipulating maintenance mode in SCOM 2007. I had occasion to write yet another maintenance mode script recently and thought it might be worth throwing out there. This one puts objects (Windows Servers) into maintenance mode when the agent IP address matches a string. The script can be useful for scheduling maintenance mode when a remote site will be taken down for maintenance or patching.

#Connect to the RMS server and initialize the command shell
$rmsServerName="<SERVERNAME>"
add-pssnapin "Microsoft.EnterpriseManagement.OperationsManager.Client";
Set-Location "OperationsManagerMonitoring::";
$mgConn = New-ManagementGroupConnection -connectionString:$rmsServerName;
if($mgConn -eq $null)
{
 [String]::Format("Failed to connect to RMS on '{0}'",$rmsServerName);
 return;
}
Set-Location $rmsServerName;

#Set up maintenance mode variables
$time = [DateTime]::Now
$nMinutes=1440

$class=get-monitoringclass | where {$_.displayname -eq "Windows Server"}

Function StartMM($agent){
 $objMon=get-monitoringobject -ID $agent.id
 write-host "Starting Maintenance Mode for: " $objMon.displayname
 New-MaintenanceWindow -MonitoringObject $objMon -Comment "Suppressing IP network with a script" -StartTime $time -EndTime $time.AddMinutes($nMinutes)
}

get-agent |where {$_.IPAddress -like "192.168.1.*"} | ForEach-Object   {StartMM $_}
get-agent |where {$_.IPAddress -like "192.168.2.*"} | ForEach-Object   {StartMM $_}

To end maintenance mode, only a few slight modifications to the main body of the script are needed:

#Set up maintenance mode variables
#New end time: five minutes from now
$time = [DateTime]::Now.AddMinutes(5)
write-host $time

$class=get-monitoringclass | where {$_.displayname -eq "Windows Server"}

Function EndMM($agent){
 $objMon=get-monitoringobject -ID $agent.id
 write-host "Ending Maintenance Mode for: " $objMon.displayname
 Set-MaintenanceWindow -EndTime $time -MonitoringObject $objMon
}

get-agent |where {$_.IPAddress -like "192.168.1.*"} | ForEach-Object   {EndMM $_}
get-agent |where {$_.IPAddress -like "192.168.2.*"} | ForEach-Object   {EndMM $_}
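If more subnets need to be covered, the repeated get-agent lines can be collapsed into a single filter driven by a list of patterns. The following is a sketch of that filtering step only, run against stand-in objects instead of real agents; the names, addresses, and the $patterns variable are made up for illustration.

```powershell
# Stand-in objects with the two properties the filter needs (made-up data).
$agents = @(
    (New-Object PSObject -Property @{ Name = 'srv-a'; IPAddress = '192.168.1.10' }),
    (New-Object PSObject -Property @{ Name = 'srv-b'; IPAddress = '192.168.2.20' }),
    (New-Object PSObject -Property @{ Name = 'srv-c'; IPAddress = '192.168.10.30' })
)

# Subnet patterns to match; the "." before the "*" keeps 192.168.1.* from
# also matching 192.168.10.x addresses.
$patterns = '192.168.1.*', '192.168.2.*'

$matched = $agents | Where-Object {
    $ip = $_.IPAddress
    @($patterns | Where-Object { $ip -like $_ }).Count -gt 0
}
$matched | ForEach-Object { $_.Name }
```

This lists srv-a and srv-b. In the real script, the stand-in array would presumably be replaced by the output of get-agent, and the final ForEach-Object would call the start or end function instead of printing the name.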

A creative use for this script behind the cut:

Read more of this post

Selectively Generating Helpdesk Tickets from the SCOM Console – Using Resolution States

Many organizations have some form of an audit requirement that critical alerts from a monitoring system are mapped to helpdesk tickets for tracking. While the simplest solution would be a configuration that generated helpdesk tickets (with an email or script) for every critical alert fired, this is not always practical. Examples of this problem are easy to think of: if connectivity is lost to a remote site, many alerts (system and network) may be generated, but only one helpdesk ticket should be generated; or in SCOM-monitored environments, if an administrator reboots a cluster node and doesn’t put the cluster in maintenance mode, or a threshold monitor (e.g. SQL database free space) is floating just above and below the threshold repeatedly, many alerts can be generated, but not all should translate to helpdesk tickets.

This scenario leads to three possible outcomes: 1) superfluous helpdesk tickets are created with full automation, 2) no tickets are created (no automation), or 3) tickets are manually created after an operator reviews the alerts. Obviously, the last option is more desirable than creating no tickets and is likely to be more desirable than creating too many tickets. If you opt for manual ticket creation, it is very easy in SCOM to use custom Resolution States to let operators create tickets from within the SCOM console (an alert console task could also be used, but the resolution state option is much easier), resulting in a partially automated solution.

 When the resolution state of an alert is changed, SCOM reanalyzes the notification subscriptions and will trigger notification subscriptions that match the new resolution state. So, by adding a custom resolution state (perhaps named: “Create HelpDesk Ticket”) a new notification subscription can be added (filtered to this resolution state) in order to fire a different response than the original subscription that responded to the “New” resolution state. With this configuration, any SCOM operator can simply set the resolution state in the SCOM console and let the notifications take it from there. If your helpdesk system doesn’t listen for inbound email messages, a SCOM command notification could be used in the same way to fire a script or executable that creates the helpdesk ticket.

Walkthrough: Read more of this post

Getting Started

Firstly, the name. The Microsoft Operations Framework (http://technet.microsoft.com/en-us/library/cc539262.aspx) defines the Operating Quadrant as including “IT operating standards, processes, and procedures that are regularly applied to service solutions to achieve and maintain service levels within predetermined parameters.” The Service Management Functions included within the Operating Quadrant are:

System Administration
Security Administration
Directory Services Administration
Network Administration
Service Monitoring and Control
Storage Management
Job Scheduling

For many of us IT professionals, this is the world we live in. The purpose of this site is to serve as a repository for helpful articles and discussions on detailed real-world solutions for improving the network and system environments that we support.