Monitoring HP Hardware Status on VMware ESX Servers

HP provides excellent SCOM management packs for monitoring ProLiant servers, but these management packs support only Windows agents.  If you’re running ESX on ProLiant servers, it takes a bit more effort to implement monitoring of hardware status.  Fortunately, HP also offers their Management Agents for ESX, so all that is needed to monitor HP ESX server hardware is a set of custom monitors that poll the SNMP data exposed by the management agent.  An overview of the process follows.

Installing the HP Management Agent for ESX

  1. Configure SNMP on the ESX servers and set options: http://thwack.com/blogs/geekspeak/archive/2008/10/30/how-to-enable-snmp-on-a-vmware-esx-server.aspx
  2. Download the HP ESX agent (make sure your server model is supported by the agent) and copy the .tgz file to a temporary location on the ESX server
  3. Extract the file with: tar -zxvf hpmgmt-8.x.x-vmware3x.tgz
  4. In the extracted directory, run the install script. Later versions of the agent include a preinstall_setup.sh script, which must be run manually first and requires a reboot.
  5. Among other configuration prompts, you will be asked whether to use an existing snmpd.conf. If you choose “no,” the install will create a new snmpd.conf that must then be configured with your SNMP settings.
  6. If you use an existing snmpd.conf, you will have to add one line to it: edit /etc/snmp/snmpd.conf and add the line dlmod cmaX /usr/lib/libcmaX.so, which extends the SNMP agent to serve the HP objects as a module (see the example excerpt after this list).
  7. Restart the SNMP daemon with: service snmpd restart
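
For reference, a minimal snmpd.conf excerpt might look like the following. This is a sketch; the community string and location are placeholders, and only the dlmod line is HP-specific:

# /etc/snmp/snmpd.conf (excerpt)
rocommunity public              # read-only community string (placeholder)
syslocation "Lab datacenter"    # optional descriptive value
# Extend the agent with the HP Insight objects (OID subtree 1.3.6.1.4.1.232)
dlmod cmaX /usr/lib/libcmaX.so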

Testing

The HP agents implement the Compaq MIBs under the OID 1.3.6.1.4.1.232.  To test, you can use an SNMP browser to connect remotely and walk this OID, or run an snmpwalk command from the ESX server itself:  snmpwalk -v 2c -c <read-only community name> localhost 1.3.6.1.4.1.232

Monitoring with SCOM

  1. Discover the ESX servers as Network Devices
  2. Create a group for HP ESX servers (optionally in a new management pack).  You can use dynamic inclusion logic by setting a filter on the Device Description (Contains vmnix)
  3. Create your SNMP monitors and rules, targeting the SNMP Network Device class.  Configure the monitors and rules to be disabled, and then use an override to enable them for the HP ESX server group
  4. Create any required views or console tasks

What to Monitor?

When HP purchased Compaq, they made a smart decision in retaining the Compaq SNMP MIBs for all HP servers, as this is one of the better vendor SNMP implementations out there.  It has remained very consistent over the years, and most importantly, it tends to implement a single status value for each group of subcomponents represented in SNMP tables, so you don’t have to walk the table to get the overall status.  Thus, instead of checking the status of each disk drive, which will vary in number (and identifier in the table), you can just poll cpqDaMibCondition (1.3.6.1.4.1.232.3.1.3) from the CPQIDA MIB to get the overall intelligent drive array health.  The agent’s System Management web console can be used to drill into specific problems, so from a monitoring perspective, it is really only necessary to know when there is a problem and what its general nature is.

These are the SNMP objects that I like to alert on for HP servers running ESX or other UNIX-like operating systems:

Component                    Object Name                      OID
CPU Fans                     cpqHeThermalCpuFanStatus         1.3.6.1.4.1.232.6.2.6.5.0
Drive Array Health           cpqDaMibCondition                1.3.6.1.4.1.232.3.1.3.0
Drive Array Controller (1)   cpqDaCntlrCondition              1.3.6.1.4.1.232.3.2.2.1.1.6.1
Power Supplies               cpqHeFltTolPwrSupplyCondition    1.3.6.1.4.1.232.6.2.9.1.0
System Fans                  cpqHeThermalSystemFanStatus      1.3.6.1.4.1.232.6.2.6.4.0
Temperature (Status)         cpqHeThermalTempStatus           1.3.6.1.4.1.232.6.2.6.3.0
Thermal Conditions           cpqHeThermalCondition            1.3.6.1.4.1.232.6.2.6.1.0
Integrated Management Log    cpqHeEventLogCondition           1.3.6.1.4.1.232.6.2.11.2.0
Critical Errors              cpqHeCritLogCondition            1.3.6.1.4.1.232.6.2.2.2.0
Correctable Memory Errors    cpqHeCorrMemLogStatus            1.3.6.1.4.1.232.6.2.3.1.0
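
These objects follow the common Compaq condition convention of 1 = other, 2 = ok, 3 = degraded, and 4 = failed, so the SCOM monitor expressions can be essentially identical across objects. To spot-check any of them remotely (the hostname here is a placeholder):

snmpget -v 2c -c <read-only community name> esxhost01 1.3.6.1.4.1.232.3.1.3.0

A healthy drive array should return an INTEGER value of 2 (ok); 3 (degraded) or 4 (failed) indicate a problem.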

For reference on SNMP MIBs, ByteSphere provides a great online MIB database.  The primary Compaq MIBs to look for are: CPQHLTH, CPQIDA, CPQSTSYS, CPQHOST, CPQNIC, and CPQTHRSH.


SCOM: Automating Management Pack Documentation

It’s easy to argue the case for keeping accurate documentation of SCOM management pack monitors and customizations, both for the abstract purpose of maintaining good documentation and for the more practical purpose of being able to answer the “can you list what is being monitored?” question. However, it can be tedious to keep on top of this documentation. Given the flexibility of the SCOM command shell, though, it’s relatively easy to write a PowerShell script that automates the documentation of management pack entities. By using a script to loop through unsealed management packs and itemize entities such as groups, rules, monitors, and views, along with their descriptions, all it takes to automate documentation of custom management packs is completing the description fields for objects as they are created.

To loop through unsealed management packs, we can define the list of management packs as an object:

$mps = Get-ManagementPack | Where-Object {$_.Sealed -eq $false} | Sort-Object DisplayName

Then create the loop logic:

foreach($mp in $mps)
{
  $mpDisplayName = $mp.DisplayName
  # ...call the functions that list this MP's groups, rules, monitors, and views
}

With a set of functions that list the management pack objects and write those lists to a formatted file (I use an HTML file so I can utilize CSS for formatting), we can create an automatically generated document covering all unsealed MPs.
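
As a rough sketch of what one of those functions might look like (this is illustrative, not the downloadable script; it assumes $writer is a System.IO.StreamWriter opened on the HTML output file, and uses the management pack object’s GetRules() method):

function Write-MPRules($mp, $writer)
{
  # Emit one HTML table row per rule, with display name and description
  foreach($rule in $mp.GetRules() | Sort-Object DisplayName)
  {
    $writer.WriteLine("<tr><td>" + $rule.DisplayName + "</td><td>" + $rule.Description + "</td></tr>")
  }
}

Equivalent functions for groups, monitors, and views follow the same pattern.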

I’ve posted the script I use here.  This script exports groups, rules, monitors, and views for all unsealed management packs in a formatted HTML file.  As always, it is provided as-is.

[Screenshot: sample output of the generated HTML report (report.JPG)]

My preference is to combine a single scheduled task that runs this documentation script as well as an automated export of all unsealed management packs. The output of both of these processes is then backed up nightly, creating a hands-off set of documentation and management pack history that can be utilized as necessary.
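
The export half of that scheduled task can be a one-liner in the SCOM command shell; a minimal sketch, with a placeholder output path:

Get-ManagementPack | Where-Object {$_.Sealed -eq $false} | Export-ManagementPack -Path "C:\MPBackup"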

More on exporting unsealed MPs for backup can be found at: http://searchwindowsserver.techtarget.com/generic/0,295582,sid68_gci1317380,00.html

Sun Hardware Monitoring with Net-SNMP and Shell Scripts

While Sun, like most server vendors, offers a comprehensive suite of hardware monitoring agents and management tools, it can be frustrating to monitor Sun hardware using the Sun Management Agent and a third-party SNMP tool, such as System Center Operations Manager.  The Sun Management Agent’s SNMP implementation builds on the ENTITY-MIB (http://docs.sun.com/app/docs/doc/817-3155/6mip4hnov?l=en&a=view) with the Sun Platform MIB, and while all relevant hardware monitoring data are exposed through this MIB implementation, there are problems with deploying wide-scale monitoring of these objects using SNMP get requests.  This is because the list of entity objects varies by server model, and can even vary with the configuration of a given model, such as the number of hard drives.

To expand on this point, Sun servers running the SMA agent list all hardware sensors in the entPhysicalTable of the ENTITY-MIB.  The id value for each hardware sensor corresponds to the id value for that sensor’s status (administrative and operational) in the sunPlatEquipmentTable of the Sun Platform MIB.  However, on one model, id 15 might correspond to CPU 0 Fan 0, while on another model id 15 corresponds to a different sensor; likewise, on a system with two CPUs, id 17 might correspond to CPU 1 Fan 0, but on a single-CPU system id 17 would correspond to something else.

If you could use an SNMP table or SNMP walk request and return the results to a script that parses the output, this would not be a problem.  But like many SNMP monitoring tools, SCOM implements SNMP gets only, meaning that this variability in OID assignments across models prevents a single fixed set of OIDs from working everywhere.

So, what’s a way to work around this without committing to a secondary monitoring tool deployed just for Sun hardware?  One solution lies in the extensibility of the Net-SNMP agent, which is the default SNMP agent for Solaris.  Net-SNMP allows the agent’s functionality to be extended by assigning commands to OIDs.  With this configuration, whenever the OID is polled, the command is run on demand and its output is returned as the SNMP value.  For more on this functionality, see the Extending Agent Functionality section at: http://www.net-snmp.org/docs/man/snmpd.conf.html

To utilize this for hardware monitoring, a standard set of shell or Perl scripts can be written and deployed to a uniform path on all of your Sun servers, each configured to return a value such as “Pass” if everything checks out, or “Fail: <reason for failure>” if problems are found.  The scripts can be written to use different status-checking commands for maximum portability (for example, using one status command on systems with software disk redundancy and another on systems with hardware RAID); a sketch of the approach follows this paragraph.  A great starting point for example monitoring scripts can be found at Sun’s BigAdmin site:  http://www.sun.com/bigadmin/scripts/indexMon.html
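
As a sketch of how the pieces fit together (the OID, paths, and check logic below are illustrative placeholders, not a tested production script): the exec directive in snmpd.conf maps a fixed OID to a local script, and the script prints a single Pass/Fail line that the SNMP get retrieves.

# /etc/snmp/snmpd.conf (excerpt): anchor the disk check at a fixed OID
exec .1.3.6.1.4.1.2021.50 diskcheck /opt/hwmon/check_disk.sh

#!/bin/sh
# /opt/hwmon/check_disk.sh -- illustrative check: sum hard errors from iostat -En
errs=`iostat -En 2>/dev/null | awk '{ for (i = 1; i < NF; i++) if ($i == "Hard" && $(i+1) == "Errors:") sum += $(i+2) } END { print sum + 0 }'`
if [ "$errs" -eq 0 ]; then
    echo "Pass"
else
    echo "Fail: $errs hard errors reported by iostat -En"
fi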

The net result is that with well-written scripts and the Net-SNMP agent, a single monitoring solution can be deployed to all Sun servers, independent of hardware model.  With consistent configuration in snmpd.conf, the OIDs for each of the scripts (e.g. the CPU script, the HDD script, etc.) will be the same everywhere and can be polled with a single set of SNMP get monitors in SCOM or another utility.

SCOM: Automatically Starting Maintenance Mode When Servers are Rebooted for Patching

While a comprehensive deployment of server and application monitoring is a vital service for many enterprises, it can lead to headaches when it comes to routine maintenance for server patching.  Of course, it is good security practice to apply patches as they become available (e.g. monthly), but the regular required reboots can lead to storms of alerts in monitored environments without some way of temporarily suppressing alerting.  If every server that reboots fires a few alerts (services stopped, synthetic monitors, etc.), and a few hundred servers are rebooted in a relatively short maintenance window, it doesn’t take long for a few thousand false alerts to be generated.  Not only is this annoying and a drain on notification systems (email), it can also mean that real problems that manifest after the reboot are lost in the fray.  An ideal solution would suppress monitoring before the shutdown process and reengage it quickly after the reboot.  Maintenance mode is the obvious mechanism to accomplish this in SCOM-monitored environments, but the manual interaction required by the GUI implementation makes it impractical for large-scale maintenance.  Several helpful solutions have been published that utilize the PowerShell exposure of maintenance mode to make the process more efficient.  These include Tim McFadden’s remote maintenance mode scheduling utility (http://www.scom2k7.com/scom-remote-maintenance-mode-scheduler-20/) and Clive Eastwood’s AgentMM utility (http://blogs.technet.com/cliveeastwood/archive/2007/09/18/agentmm-a-command-line-tool-to-place-opsmgr-agents-into-maintenance-mode.aspx).

One general way to address this ‘false alarm during patching’ issue would be to utilize one of these maintenance mode tools to drop groups of servers into maintenance mode.  However, there are some gaps in this approach, in my opinion.  First, if wide-scale server maintenance (i.e. patching) is done in a rolling window over a weekend, it may be hard to schedule maintenance mode windows on groups of servers that cover just the time needed for the reboot; if the maintenance mode window is too broad, real problems that surface after the reboot may not be caught in an orderly fashion.  Additionally, the staff responsible for patch deployments and scheduling may not be the same staff responsible for SCOM administration.  Finally, large-scale patch deployment is enough of a chore in itself, and adding a step to the process may not be welcomed by the staff responsible for it.

An Automated Solution:

The solution that I wanted was one that would automatically put servers into maintenance mode only when they are being rebooted for patching, and only for the minimum time necessary, all without administrative intervention.  If all goes according to plan, the only alerts received during patch maintenance windows should be for problems that persist after the reboot, i.e. the real problems.

The first part of this equation is the detection of patching.  How does a monitor differentiate between a reboot due to unplanned maintenance or a problem and a reboot due to planned patch maintenance?  The answer lies in the event log.  WSUS or Windows Automatic Updates should log an Event ID 22 before initiating a reboot, and other tools like Shavlik NetChk are likely to log a similar event to indicate a pending reboot from patching.  If all else fails, the patch management utility can be set up to run a script that logs a custom event with eventcreate.exe (Windows 2003) or logevent.exe (pre-Windows 2003); a sample command follows.  To trap these events, create an alert-generating Windows Event Log rule with the required parameters to alert on the event that indicates an imminent reboot due to patching.  In order to make notification subscriptions easier in SCOM 2007 SP1, we can make use of the “Maintenance” category, so select “Maintenance” as the alert category.
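
For reference, a custom ‘reboot pending’ event could be logged with something like the following (the source name, event ID, and description are arbitrary placeholders; the event rule would then be configured to match whatever you choose):

eventcreate /T INFORMATION /ID 22 /L SYSTEM /SO PatchMaintenance /D "Reboot pending for patch installation"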

The second part of this solution is the mechanism to trigger the maintenance mode script.  In theory, the script could be called client-side in the form of a recovery task, but this would require uniform distributions of PowerShell as well as the use of Run As accounts with appropriate rights in SCOM.  A much simpler method is a Command Notification channel in SCOM.  With a Command Notification (and a notification subscription), the alert can call a PowerShell script and pass the alerting object’s name to the script, and the script can kick off maintenance mode for a set period of time.

The main portion of a PowerShell script to do this would be something like the following, assuming the Managed Entity Display Name is passed to the script as its only argument:

$time = [DateTime]::Now
$nMinutes = 20
$sHost = $args[0]
# Find the Windows Server object whose name matches the host passed in
$class = Get-MonitoringClass | where {$_.DisplayName -eq "Windows Server"}
$monObj = $class | Get-MonitoringObject | where {$_.Name -like "$sHost*"}
# Start a maintenance window for the defined duration
New-MaintenanceWindow -MonitoringObject $monObj -Comment "Planned Maintenance - scripted" -StartTime $time -EndTime $time.AddMinutes($nMinutes)

This example assumes that 20 minutes after the initial event is logged, the server should be back in proper order.
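
For reference, the Command Notification channel configuration might look something like the following; the script path is a placeholder, and the exact context variable name is an assumption here that should be verified against the alert context in your environment:

Full path of the command file: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
Command line parameters: -Command "& 'C:\scripts\Start-PatchMaintMode.ps1' '$Data/Context/DataItem/ManagedEntityDisplayName$'"
Startup folder: C:\scripts

Note that because the notification runs outside the Operations Manager command shell, the full script also needs to load the OpsMgr snap-in (Add-PSSnapin Microsoft.EnterpriseManagement.OperationsManager.Client) and connect to the management group before the cmdlets above will work; the downloadable script handles this.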

The sample PowerShell script can be downloaded here.


“Must-Have” Tools – Sun VirtualBox

Over the past few years, I’ve bounced between virtualization products in my home test lab, including MS Virtual PC and Virtual Server, as well as VMware Server and Workstation.  Given the option, I’d love to have a Windows 2008 Hyper-V server to run my home development environment, but given the hardware requirements, I don’t see that happening soon.  Recently, I made the switch to Sun’s free VirtualBox product for both desktop and server virtualization at home.  VirtualBox is an end-user-oriented product with an impressive feature set, including 64-bit OS support and a built-in RDP server that allows RDP connections even when the guest OS isn’t loaded.  VirtualBox can also read VMDK and VHD virtual disks.  I was able to take the VHD files for VMs built in Virtual Server, create a new VirtualBox VM using those VHD files, and boot the OS without performing any conversion process.  Presently, I have a VirtualBox instance on my workstation for desktop OS virtualization, as well as a VirtualBox instance running on a CentOS server that I use for server OS virtualization.  Using the VirtualBox command line interface, services can easily be configured on Linux to start and suspend VMs automatically with server boots; an example follows.  For more info see: http://www.glump.net/howto/virtualbox_as_a_service
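
For example, starting a VM headless and later saving its state from the shell looks something like this (the VM name is a placeholder; check VBoxManage --help for the option names in your version):

VBoxManage startvm "Win2003-Test" --type headless
VBoxManage controlvm "Win2003-Test" savestate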

VirtualBox can be downloaded at:  www.virtualbox.org

SCOM: Monitoring the WSH Version on Windows Agents

Given the severe performance issues that can be caused by SCOM monitoring on hosts without Windows Script Host 5.7, and the possibility that WSH 5.7 binaries can be replaced with older versions by Windows File Protection (http://support.microsoft.com/default.aspx/kb/955360), I’ve found it useful to use a SCOM unit monitor to check managed agents for the expected WSH version (5.7 or later).

The script that I’ve written for this purpose first checks the OS Caption with WMI (to exclude 64-bit hosts from the check) and then checks the version of cscript.exe using a FileSystemObject:

' Resolve the Windows directory, then read the version of cscript.exe
sWinDir = CreateObject("WScript.Shell").ExpandEnvironmentStrings("%windir%")
set oFSO = CreateObject("Scripting.FileSystemObject")
sFileVersion = oFSO.GetFileVersion(sWinDir & "\system32\cscript.exe")

To deploy this as a unit monitor, create a two-state timed script monitor.  Set the unhealthy state expression to Property[@Name='status'] equals Error, and the healthy state expression to Property[@Name='status'] equals OK.

A message with a description of the problem and the current cscript.exe version can be added to the alert with the $Data/Context/Property[@Name="Message"]$ XPath string.
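
Those expressions assume the script returns a property bag containing 'status' and 'Message' values; the tail end of such a script would look something like this sketch (variable names are illustrative, not the full script):

' Return the results to SCOM as a property bag
set oAPI = CreateObject("MOM.ScriptAPI")
set oBag = oAPI.CreatePropertyBag()
if bl_VersionOK then
    oBag.AddValue "status", "OK"
else
    oBag.AddValue "status", "Error"
    oBag.AddValue "Message", "cscript.exe version is " & sFileVersion & "; expected 5.7 or later"
end if
oAPI.Return(oBag)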

SCOM: Locally Monitoring a Listening TCP Port

Typically, one would monitor a TCP port remotely, from a designated watcher node, as a means of confirming the availability of a network service, but in some cases this may not be the most desirable way to poll TCP port status.  For example, if you want to monitor the availability of a TCP port on a large number of SCOM agent-managed systems, and you care about the availability of that port only when the rest of the system is functioning normally, it may make more sense to monitor the port status locally from the agent.  This minimizes false alarms (each time the monitored nodes are taken down for maintenance, a remote monitor will throw alerts) and also makes deployment much easier (simply target groups for overrides to enable the monitor).
A simple VBScript that calls portqry.exe (from the Windows Support Tools) or netstat and parses the output to confirm port listening status can fulfill the monitoring role in this scenario.  I wrote such a script that uses netstat to list the TCP ports currently in a listening state on the local system and parses the output to determine whether the defined TCP port is listening.

The core logic can be seen in this excerpt:

' nPortToCheck is defined earlier in the script
sCmd = "netstat -an -p TCP"
set objShell = CreateObject("WScript.Shell")
set objExec = objShell.Exec(sCmd)
set oStdOut = objExec.StdOut

bl_Healthy = false

' Scan each line of netstat output for the port in a LISTENING state
Do Until oStdOut.AtEndOfStream
    sLine = oStdOut.ReadLine
    ' The trailing space after the port avoids matching e.g. :80 against :8080
    If InStr(sLine, "LISTENING") > 0 And InStr(sLine, ":" & nPortToCheck & " ") > 0 Then
        bl_Healthy = True
    End If
Loop

The full script can be downloaded here (this is provided as-is, with no guarantee of function or support, test before deploying, etc).
