RAL Tier1 CASTOR CMS Instance Monitoring
Introduction
As it stands, CASTOR provides no easy way to monitor and log, in a central location, the state of the system at a given time. These tools are intended to improve the situation. The CMS monitoring can be found at https://webpp.phy.bris.ac.uk/cms.
Data is collected from a number of sources, parsed and output in a standard log format which can be read by the plotting tools. Data is collected from:
- Ganglia - for monitoring of network IO, load and free disk
- CASTOR Instance LSF host - for monitoring of LSF queues and completed LSF jobs
- CASTOR Instance Stage host - for monitoring of tape migration queues and pending disk to disk transfers
Prerequisites
- An SSH key with no password
- An account on a webserver, which can be accessed with SSH (and SCP) authenticated with the above key
- A restricted account on the CASTOR Instance LSF host, authenticated as above
- A restricted account on the CASTOR Instance Stage host, authenticated as above
- A monitoring host which can connect to the LSF host, the Stage host and the webserver, and which has enough disk space mounted to hold the log files (~100MB / week)
- Python >= 2.4 on the monitoring host, with Matplotlib (http://matplotlib.sourceforge.net/), which itself requires NumPy and PIL
Installation
Installation
The tools are available from http://webpp.phy.bris.ac.uk/cms/CastorMon.tar. Expand into a suitable location.
Machine Setup
For the logs, plots and state create the following directory structure on the logging filesystem:
<Log Root>
 |-Archive
 |-State
 |-Plots
    |-Ganglia
    |-LsfHosts
    |-LsfQueues
    |-TapeMigr
    |-WaitD2D
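The layout can also be created programmatically. The following is a minimal sketch, assuming the five plot directories sit under Plots (as the plot paths in plotting.config suggest); `create_log_tree` is a hypothetical helper and is not part of the distribution. It is written for a current Python rather than the 2.4 the tools target.

```python
import os

# Sub-directories expected under the log root. Nesting the plot
# directories under Plots is an assumption based on plotting.config.
LOG_SUBDIRS = [
    "Archive",
    "State",
    os.path.join("Plots", "Ganglia"),
    os.path.join("Plots", "LsfHosts"),
    os.path.join("Plots", "LsfQueues"),
    os.path.join("Plots", "TapeMigr"),
    os.path.join("Plots", "WaitD2D"),
]


def create_log_tree(log_root):
    """Create the logging directory layout under log_root.

    Existing directories are left untouched. Returns the list of paths.
    """
    paths = []
    for sub in LOG_SUBDIRS:
        path = os.path.join(log_root, sub)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

For the CMS setup this would be called as `create_log_tree("/stage/cms-data26/T1Logs")`.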
Logging Configuration
All logging is configured in logging.config - the CMS specific version is included in the distribution. The file is repeated below, and a description of the key values is given after:
######################################
# Settings for all logging modules
SshIdentity=/home/csf/cms/T1Logging/Security/id_rsa
StatePath=/stage/cms-data26/T1Logs/State
LogPath=/stage/cms-data26/T1Logs
ArchivePath=/stage/cms-data26/T1Logs/Archive
ArchiveSize=5000000
#TimedLogs=lastmonth:2678400

######################################
# LSF Logging
LsfHost=ccmslsf.ads.rl.ac.uk
LsfUser=cmslogging
LsfQueuePrefix=cms

######################################
# Ganglia logging
GangliaMetrics=bytes_in,bytes_out,cpu_user,cpu_system,cpu_wio,load_one,disk_free
GangliaAverages=sum,sum,average,average,average,average,sum
GangliaWebMethod=http://ganglia.gridpp.rl.ac.uk/cgi-bin/get-rrd-data/get-rrd-data.pl

######################################
# Migration + D2D Logging
StageHost=ccmsstager.ads.rl.ac.uk
StageUser=cmslogging

######################################
# Host configuration
ClusterName=Storage_CASTOR_CMS
HostConfig=/home/csf/cms/T1Logging/2.0/running_hosts.config
General Settings
- SshIdentity - The path to the SSH private key
- StatePath - The path to the state directory (<Log Root>/State)
- LogPath - The path to the log directory (<Log Root>)
- ArchivePath - The path to the archive directory (<Log Root>/Archive)
- ArchiveSize - The filesize, in bytes, at which to rotate a logfile to Archive
- TimedLogs (optional) - A list of rolling log names which hold the last n seconds of logging data
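To illustrate what ArchiveSize controls, here is a minimal sketch of size-based rotation. It is not the code in LogUtils.py: the function name `rotate_if_needed` and the timestamp suffix on the archived file are assumptions made for the example.

```python
import os
import shutil
import time


def rotate_if_needed(log_file, archive_dir, archive_size, now=None):
    """Move log_file into archive_dir once it reaches archive_size bytes.

    Returns the archived path, or None if no rotation was needed.
    The ".YYYYMMDD-HHMMSS" suffix is an illustrative naming choice.
    """
    if not os.path.exists(log_file):
        return None
    if os.path.getsize(log_file) < archive_size:
        return None
    # time.gmtime(None) uses the current time.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    name = "%s.%s" % (os.path.basename(log_file), stamp)
    archived = os.path.join(archive_dir, name)
    shutil.move(log_file, archived)
    return archived
```

With the CMS values, a logger would call this with `archive_size=5000000` and the ArchivePath directory before appending a new data point.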
LSF Logging
- LsfHost - FQDN of the CASTOR Instance LSF host
- LsfUser - The user to log in to the LSF host as
- LsfQueuePrefix - Used to capture data for the relevant LSF queues
Ganglia Logging
- GangliaMetrics - The metrics to collect from Ganglia for disk servers
- GangliaAverages - Rules to indicate how metrics from individual hosts should be dealt with when federating over disk pools (summed or averaged over hosts)
- GangliaWebMethod - The URL of the web method used to collect Ganglia data
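The pairing of GangliaMetrics with GangliaAverages can be sketched as follows. This is an illustration under assumptions, not the logic of GangliaLogger.py: `federate` is a hypothetical function, and each host's data is assumed to arrive as a dictionary of metric name to value.

```python
# Values copied from the CMS logging.config shown above.
GANGLIA_METRICS = "bytes_in,bytes_out,cpu_user,cpu_system,cpu_wio,load_one,disk_free".split(",")
GANGLIA_AVERAGES = "sum,sum,average,average,average,average,sum".split(",")


def federate(host_samples, metrics=GANGLIA_METRICS, rules=GANGLIA_AVERAGES):
    """Combine per-host metric dictionaries into one value per metric.

    Each metric is summed or averaged over hosts according to the rule
    at the same position in the rules list.
    """
    if len(metrics) != len(rules):
        raise ValueError("metrics and rules must have the same length")
    combined = {}
    for metric, rule in zip(metrics, rules):
        values = [sample[metric] for sample in host_samples if metric in sample]
        if not values:
            continue  # no host reported this metric
        total = sum(values)
        if rule == "sum":
            combined[metric] = total
        elif rule == "average":
            combined[metric] = total / float(len(values))
        else:
            raise ValueError("unknown rule %r for metric %r" % (rule, metric))
    return combined
```

Throughput and free disk are additive across a disk pool, so they are summed; CPU percentages and load only make sense per host, so they are averaged.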
Migration and D2D Logging
- StageHost - FQDN of the CASTOR Instance Stager host
- StageUser - The user to log in to the Stager host as
Host Configuration
- ClusterName - The name of the instance monitoring cluster in Ganglia
- HostConfig - Path to the disk server host configuration file. The file is written automatically, so the only requirement is that the directory it resides in already exists
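The configuration format is simple enough to read with a few lines of Python. The sketch below is one possible reader, assuming only what the file above shows (one Key=Value per line, with # starting a comment line); `parse_config` and `split_list` are hypothetical names and the tools themselves may parse the file differently.

```python
def parse_config(text):
    """Parse Key=Value lines into a dictionary.

    Blank lines and lines starting with '#' are skipped, so a
    commented-out setting such as '#TimedLogs=...' is ignored.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a Key=Value line
        settings[key.strip()] = value.strip()
    return settings


def split_list(value):
    """Split a comma-separated setting such as GangliaMetrics into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

Values are returned as strings; numeric settings such as ArchiveSize would be converted by the caller, for example `int(settings["ArchiveSize"])`.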
Plotting Configuration
# Width and height of plots in inches
PlotWidth=4
PlotHeight=2.5

# Common font settings
TitleSize=9
YLabelSize=8
XTickLabelSize=8
YTickLabelSize=8
LegendSize=8

# DPI for saving plot images
SaveDpi=72

# General config
StatePath=/stage/cms-data26/T1Logs/State
SshIdentity=/home/csf/cms/T1Logging/Security/id_rsa
WebHost=webpp.phy.bris.ac.uk
WebUser=jj2643

# Log suffixes to plot
LogPath=/stage/cms-data26/T1Logs
ArchivePath=/stage/cms-data26/T1Logs/Archive
#LogPath=/Users/jamesjackson/Documents/workspace/CmsT1Logging/Logs
#ArchivePath=/Users/jamesjackson/Documents/workspace/CmsT1Logging/Logs/Archive
ReadBack=2678400
PlotTimes=3600,86400,604800,2678400
PlotTitles=Last Hour,Last Day,Last Week,Last Month
PlotNames=hour,day,week,month

# Date display for each suffix
TickMarks=10mins,2hours,days,4days
DataAverages=1,6,25,100

# Ganglia plotting details
GangliaLogName=ganglia
GangliaLogFormat=Bytes_In,Bytes_Out,CPU_User,CPU_System,CPU_WIO,Load_One,DiskFree
GangliaHosts=Storage_CASTOR_CMS,FarmRead,WanIn,WanOut,gdss125.gridpp.rl.ac.uk,gdss98.gridpp.rl.ac.uk
GangliaAverageModes=average,average,average,average,average,average,average
GangliaLocalPlotDir=/stage/cms-data26/T1Logs/Plots/Ganglia
GangliaRemoteWebDir=/var/www/html/cms/Plots/Ganglia

# LSF Host plotting details
LsfHostsLogName=lsfhosts
LsfHostsLogFormat=Status,Max,Jobs,Run,Suspended
LsfHostsAverageModes=none,max,max,max,max
LsfHosts=Storage_CASTOR_CMS,WanIn,WanOut,FarmRead,gdss71.gridpp.rl.ac.uk,<other hosts>,gdss127.gridpp.rl.ac.uk
LsfHostsLocalPlotDir=/stage/cms-data26/T1Logs/Plots/LsfHosts
LsfHostsRemoteWebDir=/var/www/html/cms/Plots/LsfHosts

# LSF Queue plotting details
LsfQueuesLogName=lsfqueues
LsfQueuesLogFormat=Status,Jobs,Pending,Running,Suspended
LsfQueuesAverageModes=none,average,average,average,average
LsfQueues=cmsFarmRead,cmsWanIn,cmsWanInTest,cmsWanOut,cmsWanOutTest
LsfQueuesLocalPlotDir=/stage/cms-data26/T1Logs/Plots/LsfQueues
LsfQueuesRemoteWebDir=/var/www/html/cms/Plots/LsfQueues

# Tape migration plotting details
TapeLogName=migration
TapeLogFormat=Queued
TapeAverageModes=max
TapeQueues=Storage_CASTOR_CMS,WanIn,WanOut,FarmRead,gdss71.gridpp.rl.ac.uk,<other hosts>,gdss127.gridpp.rl.ac.uk
TapeLocalPlotDir=/stage/cms-data26/T1Logs/Plots/TapeMigr
TapeRemoteWebDir=/var/www/html/cms/Plots/TapeMigr

# WaitD2D plotting details
WaitLogName=waitd2d
WaitLogFormat=Queued
WaitAverageModes=average
WaitQueues=Storage_CASTOR_CMS,WanIn,WanOut,FarmRead
WaitLocalPlotDir=/stage/cms-data26/T1Logs/Plots/WaitD2D
WaitRemoteWebDir=/var/www/html/cms/Plots/WaitD2D
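PlotTimes, PlotTitles, PlotNames, TickMarks and DataAverages are parallel lists: the entry at a given position in each list describes the same plot window (3600 seconds is the "Last Hour" plot named hour, and 2678400 seconds is 31 days). The sketch below shows one way to pair them up and to select the data points that fall inside a window. `build_plot_windows` and `select_window` are hypothetical helpers, not functions from PlotUtils.py, and the exact meaning the plotting scripts give to DataAverages is not assumed here; it is simply carried along as an integer.

```python
# Values copied from the CMS plotting.config shown above.
PLOT_SETTINGS = {
    "PlotTimes": "3600,86400,604800,2678400",
    "PlotTitles": "Last Hour,Last Day,Last Week,Last Month",
    "PlotNames": "hour,day,week,month",
    "TickMarks": "10mins,2hours,days,4days",
    "DataAverages": "1,6,25,100",
}

LIST_KEYS = ("PlotTimes", "PlotTitles", "PlotNames", "TickMarks", "DataAverages")


def build_plot_windows(settings):
    """Zip the comma-separated plot lists into one dictionary per window."""
    columns = {}
    for key in LIST_KEYS:
        columns[key] = [item.strip() for item in settings[key].split(",")]
    lengths = set(len(values) for values in columns.values())
    if len(lengths) != 1:
        raise ValueError("plot lists differ in length: %r" % sorted(lengths))
    windows = []
    for i in range(lengths.pop()):
        windows.append({
            "seconds": int(columns["PlotTimes"][i]),
            "title": columns["PlotTitles"][i],
            "name": columns["PlotNames"][i],
            "ticks": columns["TickMarks"][i],
            "average": int(columns["DataAverages"][i]),
        })
    return windows


def select_window(points, seconds, now):
    """Keep (timestamp, value) points no older than `seconds` before `now`."""
    return [(t, v) for (t, v) in points if now - seconds <= t <= now]
```

Checking that the lists have equal length catches the most likely editing mistake, which is adding a new plot window to one list but not the others.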
Initial Setup
Before running any logging, an initial host config needs to be created with the command:
python HostConfigUpdater.py -c <path to logging.config> --bootstrap
Running
The logging and plotting are controlled by cron jobs, so ensure that the Python environment is available to jobs run from cron. For CMS, we use:
PYTHONPATH=/opt/lcg/lib:/opt/lcg/lib/python:/opt/edg/lib:/opt/edg/lib/python:/opt/glite/lib/python:/home/csf/cms/lib/python2.4/site-packages
Updating the host configuration file is handled by HostConfigUpdater.py. Run it as often as you like; CMS runs it every 10 minutes so that configuration updates are propagated in reasonable time:
0,10,20,30,40,50 * * * * /home/csf/cms/T1Logging/2.0/HostConfigUpdater.py -c /home/csf/cms/T1Logging/2.0/logging.config
All logging apart from the processing of LSF job logs is handled by InstanceLogging.py. Run it as often as you want to collect data points; CMS runs it every minute. LSF job logging must be run at an interval shorter than the average rotation time of the LSF logs, as only the last rotated log is analysed; if a log is rotated twice between runs, the earlier one is never parsed. CMS runs it every 15 minutes (at the current load, the logs rotate every 2-3 hours):
* * * * * /home/csf/cms/T1Logging/2.0/InstanceLogging.py -c /home/csf/cms/T1Logging/2.0/logging.config
0,15,30,45 * * * * /home/csf/cms/T1Logging/LsfLogs/DoLsfLogs.sh >> /dev/null 2>&1
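Because the LSF job logging runs far more often than the logs rotate, most runs will find nothing new. One way to detect a fresh rotation is to remember something that identifies the last rotated log that was processed and compare against it on each run. The sketch below is purely illustrative: the function `log_has_rotated`, the idea of a "fingerprint" string (for example the rotated log's modification time and size), and the use of a state file for it are all assumptions, not a description of what RunLsfJobLogging.py does.

```python
import os


def log_has_rotated(state_file, fingerprint):
    """Return True if fingerprint differs from the one stored in state_file.

    The new fingerprint is written back, so a repeated call with the
    same value returns False until the log is rotated again.
    """
    previous = None
    if os.path.exists(state_file):
        with open(state_file) as handle:
            previous = handle.read().strip()
    if previous == fingerprint:
        return False
    with open(state_file, "w") as handle:
        handle.write(fingerprint)
    return True
```

A caller would fetch and parse the rotated log only when this returns True, which keeps the 15 minute cron job cheap between rotations.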
Plotting is handled by a series of plotting scripts. These are run using nice as they are rather resource intensive:
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotGanglia.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotLsfHosts.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotLsfQueues.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotMigrationQueues.py >> /dev/null 2>&1
0,10,20,30,40,50 * * * * /bin/nice -10 /home/csf/cms/T1Logging/2.0/Plotting/PlotWaitD2D.py >> /dev/null 2>&1
Technical Details
The tools are written in Python (requires >= 2.4), and consist of the following modules and configuration files:
Logging Core
- LogUtils.py - Provides common functionality for writing to and rotating log files
- GangliaLogger.py
- LsfHostLogger.py
- LsfJobLogger.py
- LsfQueueLogger.py
- MigrationQueueLogger.py
- WaitD2DLogger.py
- HostConfigUpdater.py
Logging Execution
- InstanceLogging.py - Instantiates a thread for each logging process and waits for these to finish
- RunLsfJobLogging.py - Checks if an LSF job log has been rotated, fetches the new log if so and parses it.
- logging.config
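The thread-per-logger pattern described for InstanceLogging.py can be sketched as below. This is a generic illustration using the standard threading module, not the script's actual code; `run_loggers` and the logger callables passed to it are hypothetical.

```python
import threading


def run_loggers(loggers):
    """Run each logger callable in its own thread and wait for all of them.

    loggers maps a name to a zero-argument callable. Returns a dictionary
    of name -> ("ok", result) or ("error", exception), so that one failing
    logger does not prevent the others from being reported.
    """
    results = {}

    def worker(name, func):
        try:
            results[name] = ("ok", func())
        except Exception as exc:
            results[name] = ("error", exc)

    threads = [
        threading.Thread(target=worker, args=(name, func))
        for name, func in loggers.items()
    ]
    for thread in threads:
        thread.start()
    # Block until every logging thread has finished.
    for thread in threads:
        thread.join()
    return results
```

Running the loggers in parallel means a slow source, such as an SSH connection to the Stage host, delays only its own data point rather than the whole once-a-minute run.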
Plotting
- PlotUtils.py - Provides common functionality for parsing and plotting log files created by LogUtils.py
- PlotGanglia.py
- PlotLsfHosts.py
- PlotLsfQueues.py
- PlotMigrationQueues.py
- PlotWaitD2D.py
- plotting.config