RAL Tier1 CASTOR CMS Instance Monitoring

From GridPP Wiki

Introduction

As it stands, CASTOR provides no easy way to monitor and log the state of the system at a given time in a central location. These tools are intended to improve the situation. The CMS monitoring pages are available at https://webpp.phy.bris.ac.uk/cms.

Data is collected from several sources, parsed, and written in a standard log format that the plotting tools can read. The sources are:

  • Ganglia - for monitoring of network IO, load and free disk
  • CASTOR Instance LSF host - for monitoring of LSF queues and completed LSF jobs
  • CASTOR Instance Stage host - for monitoring of tape migration queues and pending disk to disk transfers

Prerequisites

  • An SSH key with no password
  • An account on a webserver, which can be accessed with SSH (and SCP) authenticated with the above key
  • A restricted account on the CASTOR Instance LSF host, authenticated as above
  • A restricted account on the CASTOR Instance Stage host, authenticated as above
  • A monitoring host that can connect to the LSF host, the Stage host and the webserver, with enough disk space mounted to hold the log files (~100MB / week)
  • Python >= 2.4 on the monitoring host, with Matplotlib (http://matplotlib.sourceforge.net/), which itself requires NumPy and PIL

Installation

The tools are available from http://webpp.phy.bris.ac.uk/cms/CastorMon.tar. Expand into a suitable location.

Machine Setup

Create the following directory structure on the logging filesystem to hold the logs, plots and state:

<Log Root>
  |-Archive
  |-State
  |-Plots
      |-Ganglia
      |-LsfHosts
      |-LsfQueues
      |-TapeMigr
      |-WaitD2D
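
The layout above can be created by hand or with a short script. The following is a minimal sketch, assuming a Python 3 interpreter (the tools themselves target Python >= 2.4); the name `create_log_tree` is hypothetical and not part of the distribution.

```python
import os

# Sub-directories required under <Log Root>, as listed above.
LOG_TREE = [
    "Archive",
    "State",
    os.path.join("Plots", "Ganglia"),
    os.path.join("Plots", "LsfHosts"),
    os.path.join("Plots", "LsfQueues"),
    os.path.join("Plots", "TapeMigr"),
    os.path.join("Plots", "WaitD2D"),
]


def create_log_tree(log_root):
    """Create the logging directory structure under log_root.

    Safe to run repeatedly: existing directories are left untouched.
    Returns the list of directory paths that make up the tree.
    """
    paths = []
    for sub in LOG_TREE:
        path = os.path.join(log_root, sub)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `create_log_tree("/stage/cms-data26/T1Logs")` would reproduce the layout used in the CMS configuration below.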

Logging Configuration

All logging is configured in logging.config; the CMS-specific version is included in the distribution. The file is reproduced below, followed by a description of each key:

######################################
# Settings for all logging modules
SshIdentity=/home/csf/cms/T1Logging/Security/id_rsa

StatePath=/stage/cms-data26/T1Logs/State
LogPath=/stage/cms-data26/T1Logs
ArchivePath=/stage/cms-data26/T1Logs/Archive

ArchiveSize=5000000
#TimedLogs=lastmonth:2678400

######################################
# LSF Logging
LsfHost=ccmslsf.ads.rl.ac.uk
LsfUser=cmslogging
LsfQueuePrefix=cms

######################################
# Ganglia logging
GangliaMetrics=bytes_in,bytes_out,cpu_user,cpu_system,cpu_wio,load_one,disk_free
GangliaAverages=sum,sum,average,average,average,average,sum
GangliaWebMethod=http://ganglia.gridpp.rl.ac.uk/cgi-bin/get-rrd-data/get-rrd-data.pl

######################################
# Migration + D2D Logging
StageHost=ccmsstager.ads.rl.ac.uk
StageUser=cmslogging

######################################
# Host configuration
ClusterName=Storage_CASTOR_CMS
HostConfig=/home/csf/cms/T1Logging/2.0/running_hosts.config

General Settings

  • SshIdentity - The path to the SSH private key
  • StatePath - The path to the state directory (<Log Root>/State)
  • LogPath - The path to the log directory (<Log Root>)
  • ArchivePath - The path to the archive directory (<Log Root>/Archive)
  • ArchiveSize - The filesize, in bytes, at which to rotate a logfile to Archive
  • TimedLogs (optional) - A list of rolling log names which hold the last n seconds of logging data
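
To illustrate the ArchiveSize setting, here is a minimal sketch of size-based rotation, assuming Python 3. It is not the code in LogUtils.py: the function name `rotate_if_needed` and the timestamp suffix used for archived files are assumptions made for the example.

```python
import os
import shutil
import time


def rotate_if_needed(log_file, archive_dir, archive_size):
    """Move log_file into archive_dir once it reaches archive_size bytes.

    Returns the path of the archived file, or None if nothing was rotated.
    """
    if not os.path.exists(log_file):
        return None
    if os.path.getsize(log_file) < archive_size:
        return None
    # Assumed naming scheme: original file name plus a UTC timestamp.
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    target = os.path.join(archive_dir, os.path.basename(log_file) + "." + stamp)
    shutil.move(log_file, target)
    return target
```

With ArchiveSize=5000000, a log file would be moved to the archive directory the first time the check runs after the file has grown to 5,000,000 bytes.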

LSF Logging

  • LsfHost - FQDN of the CASTOR Instance LSF host
  • LsfUser - The user to log in to the LSF host as
  • LsfQueuePrefix - Used to capture data for the relevant LSF queues

Ganglia Logging

  • GangliaMetrics - The metrics to collect from Ganglia for disk servers
  • GangliaAverages - Rules to indicate how metrics from individual hosts should be dealt with when federating over disk pools (summed or averaged over hosts)
  • GangliaWebMethod - The URL of the web method used to collect Ganglia data
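
The pairing of GangliaMetrics with GangliaAverages can be sketched as follows, assuming Python 3 and assuming the two lists are matched by position (they have the same length in the CMS file). The function `federate` and the host names in the usage note are hypothetical.

```python
def federate(metrics, modes, host_values):
    """Combine per-host metric values into one value per metric for a pool.

    metrics     -- list of metric names, e.g. ["bytes_in", "load_one"]
    modes       -- list of "sum" or "average", one per metric
    host_values -- dict mapping host name to a dict of metric -> value
    """
    result = {}
    for metric, mode in zip(metrics, modes):
        values = [v[metric] for v in host_values.values() if metric in v]
        if not values:
            continue  # no host reported this metric
        total = sum(values)
        if mode == "sum":
            result[metric] = total
        elif mode == "average":
            result[metric] = total / float(len(values))
        else:
            raise ValueError("unknown federation mode: %r" % mode)
    return result
```

For example, `federate(["bytes_in", "load_one"], ["sum", "average"], {"hostA": {"bytes_in": 100, "load_one": 1.0}, "hostB": {"bytes_in": 300, "load_one": 3.0}})` sums the network traffic over the pool but averages the load.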

Migration and D2D Logging

  • StageHost - FQDN of the CASTOR Instance Stager host
  • StageUser - The user to log in to the Stager host as

Host Configuration

  • ClusterName - The name of the instance monitoring cluster in Ganglia
  • HostConfig - Path to the disk server host configuration file. This file is written automatically, so the only requirement is that its parent directory already exists
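
The configuration file uses a simple key=value layout with # comments. A minimal sketch of reading such a file is given below, assuming Python 3; the helper names `parse_config` and `split_list` are hypothetical and the parser shipped with the tools may behave differently.

```python
def parse_config(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        settings[key.strip()] = value.strip()
    return settings


def split_list(value):
    """Split a comma-separated value such as GangliaMetrics into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

A commented-out entry such as `#TimedLogs=lastmonth:2678400` is ignored, so the optional TimedLogs key is simply absent from the result.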

Plotting Configuration

Plotting is configured in plotting.config. The settings used by CMS are shown below:

# Width and height of plots in inches
PlotWidth=4
PlotHeight=2.5

# Common font settings
TitleSize=9
YLabelSize=8
XTickLabelSize=8
YTickLabelSize=8
LegendSize=8

# DPI for saving plot images
SaveDpi=72

# General config
StatePath=/stage/cms-data26/T1Logs/State
SshIdentity=/home/csf/cms/T1Logging/Security/id_rsa
WebHost=webpp.phy.bris.ac.uk
WebUser=jj2643

# Log suffixes to plot
LogPath=/stage/cms-data26/T1Logs
ArchivePath=/stage/cms-data26/T1Logs/Archive
#LogPath=/Users/jamesjackson/Documents/workspace/CmsT1Logging/Logs
#ArchivePath=/Users/jamesjackson/Documents/workspace/CmsT1Logging/Logs/Archive
ReadBack=2678400
PlotTimes=3600,86400,604800,2678400
PlotTitles=Last Hour,Last Day,Last Week,Last Month
PlotNames=hour,day,week,month

# Date display for each suffix
TickMarks=10mins,2hours,days,4days
DataAverages=1,6,25,100

# Ganglia plotting details
GangliaLogName=ganglia
GangliaLogFormat=Bytes_In,Bytes_Out,CPU_User,CPU_System,CPU_WIO,Load_One,DiskFree
GangliaHosts=Storage_CASTOR_CMS,FarmRead,WanIn,WanOut,gdss125.gridpp.rl.ac.uk,gdss98.gridpp.rl.ac.uk
GangliaAverageModes=average,average,average,average,average,average,average
GangliaLocalPlotDir=/stage/cms-data26/T1Logs/Plots/Ganglia
GangliaRemoteWebDir=/var/www/html/cms/Plots/Ganglia

# LSF Host plotting details
LsfHostsLogName=lsfhosts
LsfHostsLogFormat=Status,Max,Jobs,Run,Suspended
LsfHostsAverageModes=none,max,max,max,max
LsfHosts=Storage_CASTOR_CMS,WanIn,WanOut,FarmRead,gdss71.gridpp.rl.ac.uk,<other hosts>,gdss127.gridpp.rl.ac.uk
LsfHostsLocalPlotDir=/stage/cms-data26/T1Logs/Plots/LsfHosts
LsfHostsRemoteWebDir=/var/www/html/cms/Plots/LsfHosts

# LSF Queue plotting details
LsfQueuesLogName=lsfqueues
LsfQueuesLogFormat=Status,Jobs,Pending,Running,Suspended
LsfQueuesAverageModes=none,average,average,average,average
LsfQueues=cmsFarmRead,cmsWanIn,cmsWanInTest,cmsWanOut,cmsWanOutTest
LsfQueuesLocalPlotDir=/stage/cms-data26/T1Logs/Plots/LsfQueues
LsfQueuesRemoteWebDir=/var/www/html/cms/Plots/LsfQueues

# Tape migration plotting details
TapeLogName=migration
TapeLogFormat=Queued
TapeAverageModes=max
TapeQueues=Storage_CASTOR_CMS,WanIn,WanOut,FarmRead,gdss71.gridpp.rl.ac.uk,<other hosts>,gdss127.gridpp.rl.ac.uk
TapeLocalPlotDir=/stage/cms-data26/T1Logs/Plots/TapeMigr
TapeRemoteWebDir=/var/www/html/cms/Plots/TapeMigr

# WaitD2D plotting details
WaitLogName=waitd2d
WaitLogFormat=Queued
WaitAverageModes=average
WaitQueues=Storage_CASTOR_CMS,WanIn,WanOut,FarmRead
WaitLocalPlotDir=/stage/cms-data26/T1Logs/Plots/WaitD2D
WaitRemoteWebDir=/var/www/html/cms/Plots/WaitD2D
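
PlotTimes, PlotTitles, PlotNames, TickMarks and DataAverages each hold four entries, which suggests they are matched by position, one entry per plot window. The sketch below, assuming Python 3, groups them that way; this positional reading and the function name `plot_windows` are assumptions rather than documented behaviour.

```python
def plot_windows(settings):
    """Group the per-window plotting lists into one dict per plot window.

    settings is a dict of raw string values as they appear in the file.
    """
    def items(key):
        return [item.strip() for item in settings[key].split(",")]

    times = [int(t) for t in items("PlotTimes")]
    titles = items("PlotTitles")
    names = items("PlotNames")
    ticks = items("TickMarks")
    averages = [int(a) for a in items("DataAverages")]

    columns = [times, titles, names, ticks, averages]
    if len(set(len(column) for column in columns)) != 1:
        raise ValueError("per-window lists must all have the same length")

    return [
        {"seconds": s, "title": t, "name": n, "ticks": k, "average": a}
        for s, t, n, k, a in zip(*columns)
    ]
```

With the values above, the last window covers 2678400 seconds (31 days), which matches the ReadBack setting.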

Initial Setup

Before running any logging, an initial host config needs to be created with the command:

python HostConfigUpdater.py -c <path to logging.config> --bootstrap

Running

Logging and plotting are driven by cron jobs, so ensure that the Python environment is available to them. For CMS, we use:

PYTHONPATH=/opt/lcg/lib:/opt/lcg/lib/python:/opt/edg/lib:/opt/edg/lib/python:/opt/glite/lib/python:/home/csf/cms/lib/python2.4/site-packages

The host configuration file is updated by HostConfigUpdater. Run it as often as you like; CMS runs it every 10 minutes so that configuration changes are picked up in reasonable time:

0,10,20,30,40,50 * * * * /home/csf/cms/T1Logging/2.0/HostConfigUpdater.py -c /home/csf/cms/T1Logging/2.0/logging.config

All logging apart from the processing of LSF job logs is handled by InstanceLogging.py. Run it as often as you want to collect data points; CMS runs it every minute. LSF job logging must run at an interval shorter than the average rotation time of the LSF logs, because only the most recently rotated log is analysed. CMS runs it every 15 minutes (at the current load, the logs rotate every 2-3 hours):

* * * * * /home/csf/cms/T1Logging/2.0/InstanceLogging.py -c /home/csf/cms/T1Logging/2.0/logging.config
0,15,30,45 * * * * /home/csf/cms/T1Logging/LsfLogs/DoLsfLogs.sh >> /dev/null 2>&1

Plotting is handled by a series of plotting scripts. These are run using nice as they are rather resource intensive:

0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotGanglia.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotLsfHosts.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotLsfQueues.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotMigrationQueues.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotWaitD2D.py >> /dev/null 2>&1

Technical Details

The tools are written in Python (requires >= 2.4), and consist of the following modules and configuration files:

Logging Core

  • LogUtils.py - Provides common functionality for writing to and rotating log files
  • GangliaLogger.py
  • LsfHostLogger.py
  • LsfJobLogger.py
  • LsfQueueLogger.py
  • MigrationQueueLogger.py
  • WaitD2DLogger.py
  • HostConfigUpdater.py

Logging Execution

  • InstanceLogging.py - Instantiates a thread for each logging process and waits for these to finish
  • RunLsfJobLogging.py - Checks whether an LSF job log has been rotated and, if so, fetches and parses the new log
  • logging.config
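
The thread-per-logger pattern described for InstanceLogging.py can be sketched as follows, assuming Python 3. This is an illustration only, not the code in the distribution: `run_loggers` is a hypothetical name, and collecting exceptions per logger is a design choice made for the example.

```python
import threading


def run_loggers(loggers):
    """Run each logger in its own thread and wait for all of them to finish.

    loggers maps a logger name to a zero-argument callable.
    Returns a dict of name -> exception for any logger that failed.
    """
    errors = {}

    def worker(name, func):
        try:
            func()
        except Exception as exc:  # record the failure, keep others running
            errors[name] = exc

    threads = [
        threading.Thread(target=worker, args=(name, func), name=name)
        for name, func in loggers.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Running the loggers in parallel means that one slow source, such as an SSH connection to the LSF host, does not delay the data points collected from the others.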

Plotting

  • PlotUtils.py - Provides common functionality for parsing and plotting log files created by LogUtils.py
  • PlotGanglia.py
  • PlotLsfHosts.py
  • PlotLsfQueues.py
  • PlotMigrationQueues.py
  • PlotWaitD2D.py
  • plotting.config