Oracle Management Pack for OpsMgr SCX Agents, Part 4: File System and Process Monitoring

While the vast majority of the rules and monitors in the Oracle SCX management pack (part 1, part 2, part 3) that I am working on use the basic SQL query data source (see part 2), which queries a specific parameter value and evaluates the result, two areas of monitoring required somewhat more complex discovery logic: file system monitoring and Oracle process monitoring. I found the challenge of implementing these two categories of monitoring engaging, and thought it might be worth writing a bit about the approaches that I took.

File System Monitoring

Oracle dependencies on file systems could exist in several forms, such as the Flash Recovery Area, Dump space destinations, Data File locations, and log archive destinations.   While the OpsMgr cross platform agent provides File System availability and space-monitoring out-of-the-box, I wanted to implement additional space monitoring for the File Systems that an Oracle instance depends on, with custom thresholds and alerting. 

In order to achieve this result, I started by creating a class for the Oracle File System.   An instance of this class would represent the File System that Oracle depends on.  By using the FS name as the key property, if multiple Oracle components used that File System, only a single instance of the class would be discovered.   In order to enhance alerting, I also added a set of boolean properties to the File System class that, once discovered, would indicate the nature of the dependency that Oracle has on the File System.  
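To illustrate the key-property idea outside of the management pack, here is a minimal Python sketch. The class name, the boolean property names, and the sample paths are all hypothetical; the point is only that keying on the file system name collapses several dependencies into one instance whose flags record each dependency.

```python
from dataclasses import dataclass


@dataclass
class OracleFileSystem:
    """One discovered file system; `name` plays the role of the key property."""
    name: str
    # Hypothetical dependency flags, one per category of Oracle dependency.
    holds_data_files: bool = False
    holds_dump_dest: bool = False
    holds_recovery_area: bool = False
    holds_archive_logs: bool = False


def merge_discoveries(findings):
    """Collapse (fs_name, flag_name) pairs into one object per file system."""
    instances = {}
    for fs_name, flag in findings:
        fs = instances.setdefault(fs_name, OracleFileSystem(name=fs_name))
        setattr(fs, flag, True)
    return instances


# Two separate discoveries report /u01, but only one instance results.
findings = [
    ("/u01", "holds_data_files"),
    ("/u02", "holds_archive_logs"),
    ("/u01", "holds_dump_dest"),
]
file_systems = merge_discoveries(findings)
```

An alert raised against the `/u01` instance can then mention that both data files and dump destinations are at risk.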

Discovering these File System objects did prove to be the tricky part.   To begin with, the File System path used for each of the dependency categories can be identified with a SQL query.

  • Table Space Data File:  select file_name from dba_data_files where tablespace_name = '<Table Space Name>';
  • Background Dump Destination:  select value from v$parameter where name='background_dump_dest';
  • Core Dump Destination:  select value from v$parameter where name='core_dump_dest';
  • User Dump Destination:  select value from v$parameter where name='user_dump_dest';
  • Flash Recovery Area Destination:  select value from v$parameter where name='db_recovery_file_dest';
  • Log Archive Destination:  select distinct destination from v$archive_dest where destination like '%/%' and status='VALID';

The SQL Query data source used heavily in this management pack executes SQL queries through a shell command execution, and I was able to modify that logic for use in these discoveries. Each of the queries above returns an actual path to a file or directory, and not the name of the file system, but an easy way on a UNIX/Linux system to return the name of the File System for a given path is the df command. So, by assigning the query result to a variable and then passing this variable as an argument to a df -Pk command, the actual file system name is returned as StdOut. Additionally, some output filtering using grep and awk was required to handle errors (i.e. if the query results are null) and to format the output. When the whole command string is put together, including the requisite environment configuration and login information for SQLPlus, it becomes an impressive one-liner shell command, for example:

ORACLE_HOME=<OracleHomePath>;export ORACLE_HOME;FSTMPVAR=`echo -e 'CONNECT <UserName>/<Password>@<OracleSID>;\nSET HEADING OFF;\nselect distinct destination from v$archive_dest where destination like '\''%/%'\''  and status='\''VALID'\'';'| <OracleHomePath>/bin/sqlplus -S /nolog|grep /`;if [ -z $FSTMPVAR ]; then FSTMPVAR="/null";fi;df -Pk $FSTMPVAR|grep /|awk '{ print $6}'

And the actual shell command used in the discovery with the data source variables:

ORACLE_HOME=$Config/OracleHome$;export ORACLE_HOME;FSTMPVAR=`echo -e 'CONNECT $RunAs[Name="OracleSCX.OracleSQLQueryAccount"]/UserName$/$RunAs[Name="OracleSCX.OracleSQLQueryAccount"]/Password$@$Config/OracleSID$; \nSET HEADING OFF;\nselect distinct destination from v$archive_dest where destination like '\''%/%'\''  and status='\''VALID'\'';'   |$Config/OracleHome$/bin/sqlplus -S /nolog|grep /`;if [ -z $FSTMPVAR ]; then FSTMPVAR="/null";fi;df -Pk $FSTMPVAR|grep /|awk '{ print $6}'
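The last stage of that one-liner (df -Pk, then grep and awk to pull the sixth column) can be easier to follow when written out long-hand. The following Python sketch is only an illustration of that stage, not part of the management pack; the function names and the sample df output are made up for the example.

```python
import subprocess


def parse_df_mount_point(df_output):
    """Return the 'Mounted on' column (6th field) of the first data row.

    Mirrors `grep / | awk '{ print $6 }'`: the header row is skipped
    because its sixth field does not start with a slash.
    """
    for line in df_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[5].startswith("/"):
            return fields[5]
    return None


def file_system_for_path(path):
    """Run `df -Pk <path>` and return the mount point, or None on failure."""
    result = subprocess.run(
        ["df", "-Pk", path], capture_output=True, text=True, check=False
    )
    return parse_df_mount_point(result.stdout)


# Made-up sample of POSIX-format df output for a path under /u01.
sample = (
    "Filesystem 1024-blocks Used Available Capacity Mounted on\n"
    "/dev/sda3 51475068 20123456 28730000 42% /u01\n"
)
mount_point = parse_df_mount_point(sample)
```

The -P flag matters here because it keeps each file system on a single line, which is what makes the fixed column position reliable.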

For the scalar queries that return a single path value, the StdOut from the shell command data source can be used directly in a filtered discovery mapper for a simple discovery.   The potential multiple values from the Archive Log Destination query do require a subsequent script discovery module to handle splitting the multi-line StdOut into unique instances.
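As a rough illustration of what that splitting step has to do, here is a short Python sketch. It is not the discovery script itself; the function name and sample output are hypothetical, and it assumes one file system name per line of StdOut.

```python
def unique_file_systems(stdout):
    """Split multi-line command output into distinct file system names.

    Blank lines and anything that is not an absolute path are dropped,
    and the first-seen order is preserved.
    """
    seen = []
    for line in stdout.splitlines():
        name = line.strip()
        if name.startswith("/") and name not in seen:
            seen.append(name)
    return seen


# Two archive destinations on /u02 and one on /u03 yield two instances.
sample_stdout = "/u02\n/u03\n/u02\n\n"
instances = unique_file_systems(sample_stdout)
```

Each entry in the resulting list would then become one discovered instance, keyed on the file system name.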

  

The net result of all of this is that the individual file systems that are depended on by Oracle for critical functions are uniquely discovered and can then easily be monitored for low space conditions (with specific Oracle-related thresholds).

Oracle Process Monitoring

A healthy Oracle Instance will involve quite a few running processes, and I wanted to implement monitoring for the status of critical processes, as well as collect the historical CPU and memory utilization of all Oracle processes. The OpsMgr SCX agent implements Unix process monitoring through the CIM classes SCX_UnixProcess and SCX_UnixProcessStatisticalInformation. One challenge with these two classes when monitoring Oracle processes is that, as far as the SCX_UnixProcess instance is concerned, all of the Oracle processes have the same name: oracle. The distinguishing identifier for the Oracle process (such as ora_dbw0_xxxx for a database writer process) is found in the value of the SCX_UnixProcess class "Parameters" property. This proved to be the main point to consider in both the availability and performance data sources, and I addressed it in a different way for each.

Monitoring the status of Oracle processes required a new composite monitor type, but was not otherwise overly complicated.   To check the status of a given process, I utilized a Microsoft.Unix.WSMan.TimedEnumerator Data Source, with a Uri of: http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_UnixProcess?__cimnamespace=root/scx and a filter of: select parameters from SCX_UnixProcess where name = "oracle"

This returns a result set containing the value for the Parameters property for each Oracle SCX_UnixProcess instance.  The output of this data source can then be filtered to match (or not match) the Oracle process to monitor using XPATH syntax such as:  //*[local-name()="Parameters" and text()="$Config/ProcessName$"].  I created availability monitors using this method for the following critical Oracle processes:

  • CKPT
  • LGWR
  • MMAN
  • PMON
  • SMON
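The XPath match on the Parameters element described above can be tried outside of OpsMgr. The Python sketch below imitates the local-name() and text() test by hand, since the standard library's ElementTree does not support those XPath functions. The XML shape, the namespace URI, and the process identifiers are invented for illustration and are not the actual WS-Man response format.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def process_is_running(xml_text, process_name):
    """True if any element named Parameters has exactly this text.

    Equivalent in spirit to:
    //*[local-name()="Parameters" and text()="<process_name>"]
    """
    root = ET.fromstring(xml_text)
    return any(
        local_name(el.tag) == "Parameters" and el.text == process_name
        for el in root.iter()
    )


# Invented sample: two Oracle background processes for an instance ORCL.
SAMPLE = """<Items xmlns:p="http://example.invalid/SCX_UnixProcess">
  <p:SCX_UnixProcess><p:Parameters>ora_pmon_ORCL</p:Parameters></p:SCX_UnixProcess>
  <p:SCX_UnixProcess><p:Parameters>ora_smon_ORCL</p:Parameters></p:SCX_UnixProcess>
</Items>"""

pmon_up = process_is_running(SAMPLE, "ora_pmon_ORCL")
lgwr_up = process_is_running(SAMPLE, "ora_lgwr_ORCL")
```

A healthy state corresponds to the match existing, and a critical state to the match being absent from the enumeration result.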

Collecting the CPU and memory utilization for all Oracle processes did require some additional steps.   Because the distinguishing identifier for each Oracle process is in the Parameters of the SCX_UnixProcess class, it is not easy to directly map processes with a specific parameter string to an instance of the SCX_UnixProcessStatisticalInformation.    As the process name could not be used to filter the instances of SCX_UnixProcessStatisticalInformation, it seemed that the best way was to map processes with their Handle property (the PID).   To facilitate this, I created a new class for Oracle Process objects, and implemented a discovery using a Microsoft.SystemCenter.WSManagement.Enumerator Data Source.   The data source enumerates the process name, handle, and parameters string for each process with a name of “oracle.”   This is then filtered with an overridable Regular Expression name filter property and mapped to discovery data.   This ultimately results in the discovery of all running Oracle processes and each process’s handle.   
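To make the filtering step concrete, here is a small Python sketch of a discovery that keeps only processes named "oracle" and applies an overridable regular expression to the parameters string. The row layout, handles, process identifiers, and default pattern are all hypothetical; the real discovery works on the enumerator output rather than on dictionaries.

```python
import re


def discover_oracle_processes(rows, name_filter=r"^ora_"):
    """Map each matching process identifier to its handle (PID).

    `rows` is a list of dicts with 'name', 'handle' and 'parameters'
    keys; `name_filter` stands in for the overridable regex property.
    """
    pattern = re.compile(name_filter)
    discovered = {}
    for row in rows:
        if row["name"] == "oracle" and pattern.search(row["parameters"]):
            discovered[row["parameters"]] = row["handle"]
    return discovered


# Invented enumeration result.
rows = [
    {"name": "oracle", "handle": "4101", "parameters": "ora_pmon_ORCL"},
    {"name": "oracle", "handle": "4117", "parameters": "ora_dbw0_ORCL"},
    {"name": "oracle", "handle": "5220", "parameters": "ora_j000_ORCL"},
    {"name": "sshd", "handle": "812", "parameters": "-D"},
]

all_procs = discover_oracle_processes(rows)
# An override that skips job queue processes (treated here as short-lived).
stable_procs = discover_oracle_processes(rows, name_filter=r"^ora_(?!j\d+_)")
```

Tightening the pattern through an override is one way to keep processes that come and go frequently out of the discovered set.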

The process handle can then be used to filter enumeration requests for the SCX_UnixProcessStatisticalInformation in order to return the current CPU percent utilization and memory KB used values. 
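The handle-based lookup amounts to a join between the discovered processes and the statistics instances. The Python sketch below shows that join under simplifying assumptions: the field names cpu_percent and memory_kb are placeholders rather than the actual property names of SCX_UnixProcessStatisticalInformation, and the sample values are invented.

```python
def stats_for_discovered(discovered, stats_rows):
    """Join discovered processes to statistics rows by handle (PID).

    `discovered` maps a process identifier to its handle; `stats_rows`
    is a list of dicts with 'handle', 'cpu_percent' and 'memory_kb'.
    Processes whose handle no longer exists are reported separately.
    """
    by_handle = {row["handle"]: row for row in stats_rows}
    collected, stale = {}, []
    for identifier, handle in discovered.items():
        row = by_handle.get(handle)
        if row is None:
            stale.append(identifier)  # PID changed since discovery ran
        else:
            collected[identifier] = (row["cpu_percent"], row["memory_kb"])
    return collected, stale


# Invented data: ora_dbw0_ORCL was restarted and now has a different PID.
discovered = {"ora_pmon_ORCL": "4101", "ora_dbw0_ORCL": "4117"}
stats_rows = [
    {"handle": "4101", "cpu_percent": 1.5, "memory_kb": 20480},
    {"handle": "9903", "cpu_percent": 3.0, "memory_kb": 65536},
]
collected, stale = stats_for_discovered(discovered, stats_rows)
```

The stale list shows what happens when a process restarts between discoveries: its old handle no longer matches any statistics instance, so no data is collected for it.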

One downside to this discovery-based approach is that Microsoft's recommended best practices call for discoveries to be scheduled at intervals no shorter than 4 hours. This means that, for a process that is restarted, the discovered process handle may be wrong for a significant period of time, which would impact data collection and lead to some unsightly errors in the event log. One way to deal with this issue is to filter discovery to exclude processes that are expected to restart frequently. I am continuing to give this some thought in the hopes of improving the approach, but the method seems to work pretty well overall at the moment.

About Kristopher Bash
Kris is a Senior Program Manager at Microsoft, working on UNIX and Linux management features in Microsoft System Center. Prior to joining Microsoft, Kris worked in systems management, server administration, and IT operations for nearly 15 years.

7 Responses to Oracle Management Pack for OpsMgr SCX Agents, Part 4: File System and Process Monitoring

  1. John Taylor says:

    Love the article… I have been able to create a composite module that lets me schedule timed enumerator events. Can you please provide a sample of a process availability monitor so I can incorporate it?

    Thanks

    • Kristopher Bash says:

      Ok. For process availability monitoring, you can look at the MS SCX management packs in the authoring console to see how they are doing Process monitoring. I am using something similar, but slightly different. To start with, create a composite monitor type, add configuration parameters for Interval, TargetSystem, and then the process parameters that you want to filter on (such as process name). For Member Modules, start with a Microsoft.Unix.WSMan.TimedEnumerator data source. This will do the querying, and be followed by expression filters to determine health state. For the configuration of the timed enumerator DS, here is an example that queries the “Name” field from all running processes:

      TargetSystem: $Config/TargetSystem$
      URI: http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_UnixProcess?__cimnamespace=root/scx
      Filter: select name from SCX_UnixProcess
      SplitItems: false
      Interval: $Config/Interval$

      Then for your expression filters, in this example, to check if a process is running with a specified string in its Name field:
      <Expression><Exists><ValueExpression><XPathQuery Type="String">//*[local-name()="Name" and text()="$Config/ProcessName$"]</XPathQuery></ValueExpression></Exists></Expression>
      And if it's not running, the same test wrapped in a Not:
      <Expression><Not><Expression><Exists><ValueExpression><XPathQuery Type="String">//*[local-name()="Name" and text()="$Config/ProcessName$"]</XPathQuery></ValueExpression></Exists></Expression></Not></Expression>

      Does that help?

  2. Pingback: Finishing the Oracle SCX Management Pack for OpsMgr Cross-Platform Agents « Operating-Quadrant

  3. John Taylor says:

    Kris ,

    How are you getting past the log monitoring limitations? Currently the log monitor, out of the box, monitors the log file once every 5 minutes and produces a single alert for matched criteria within that time frame.

    In Windows the file is kept open, and any time a match is found an alert can be generated using Param/Params.

    Please let me know .. I owe you a beer or whatever other payment you require ..

  4. Orhan says:

    John,

    You are right, that is the current behavior of the log file template, but I think it will be fixed in the upcoming CU3.

  5. PeterDK says:

    Hi,

    I see that CU4 is already released. Is the problem that John mentioned already solved?
    The behaviour of log file monitoring only checking every 5 minutes is not good. We are missing too many alerts.

  6. Pingback: Handling String Array() Values in OpsMgr WinRM Enumeration Data Sources « Operating-Quadrant
