Monitoring Tools for LCG
Latest revision as of 11:11, 20 April 2010
The LCG2 Grid is monitored at several levels. Some monitoring is done by local site administrators; other monitoring is developed and run by the UK/Ireland Regional Operations Centre (ROC) at RAL.
This document presents the commonly used monitoring techniques, progressing from the local options to the grid-wide monitoring done by the ROC.
Nagios
What It Is
Nagios is a host, service and network monitoring application. It can be deployed at a site to monitor the operation of both machines and specific network services, such as email or web servers. It can also be configured to send alert notifications by email or SMS.
How it is used in the UK LCG Grid
Nagios is not a 'grid' monitoring tool per se, but it can monitor grid services in the same way that it monitors more common network services such as HTTP or IMAP.
We maintain a page of Nagios Plugins and Configuration Tips on this wiki.
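To illustrate how a grid service can be checked like any other network service, here is a minimal sketch of a Nagios-style check written in Python. It relies only on the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) plus a one-line status message. The function names, the thresholds and the idea of testing a plain TCP connection are assumptions made for this example; it is not one of the plugins from the wiki page mentioned above.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=1.0, timeout=5.0,
              connect=socket.create_connection):
    """Return (exit_code, message) for a TCP connect test.

    `connect` is injectable so the check can be exercised without a network.
    The thresholds are illustrative, not recommended values.
    """
    start = time.monotonic()
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
    conn.close()
    elapsed = time.monotonic() - start
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - {host}:{port} connected in {elapsed:.2f}s"
    return OK, f"OK - {host}:{port} connected in {elapsed:.2f}s"


def main(argv):
    """Parse HOST and PORT, print the status line, return the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_grid_tcp HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[1])
    except ValueError:
        print(f"UNKNOWN - port must be an integer, got {argv[1]!r}")
        return UNKNOWN
    code, message = check_tcp(argv[0], port)
    print(message)
    return code
```

To use this as a plugin script, end the file with `import sys` and `sys.exit(main(sys.argv[1:]))` so that Nagios sees the exit code, and point a Nagios command definition at the script with the host and port of the grid service to be tested.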
Further Documentation
cfengine
What It Is
cfengine is a host-based configuration system. It can track filesystem changes, keep directories tidy, rotate logs and check remote connections, and it can raise alerts on any of these actions.
How it is used in the UK LCG Grid
See the cfengine category for an overview of all cfengine documentation on this wiki, or start at cfengine.
Ganglia
What It Is
Ganglia is a web-based tool for monitoring clusters of computers. Its primary focus is on monitoring hardware metrics like free memory/disk and CPU load.
A very useful property of Ganglia is that it can federate the monitoring of different clusters of machines. This enables a site (or many sites) to build a hierarchy of Ganglia monitors and get higher-level overviews of the status of their clusters.
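The kind of higher-level overview that federation gives can be sketched by rolling per-host metrics up into one summary per cluster. The Python example below parses a small XML sample and reports, for each cluster, how many hosts reported a one-minute load and what the mean was. The element and attribute names (GRID, CLUSTER, HOST, METRIC, NAME, VAL) and the metric name load_one are modelled on Ganglia's XML output but should be treated as assumptions and checked against your own installation; the cluster and host names are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the layout is an assumption modelled on Ganglia XML.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.0" SOURCE="gmetad">
  <GRID NAME="example-grid">
    <CLUSTER NAME="site-a">
      <HOST NAME="wn01.site-a.example.org">
        <METRIC NAME="load_one" VAL="0.5" TYPE="float" UNITS=""/>
      </HOST>
      <HOST NAME="wn02.site-a.example.org">
        <METRIC NAME="load_one" VAL="1.5" TYPE="float" UNITS=""/>
      </HOST>
    </CLUSTER>
    <CLUSTER NAME="site-b">
      <HOST NAME="wn01.site-b.example.org">
        <METRIC NAME="load_one" VAL="3.0" TYPE="float" UNITS=""/>
      </HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>"""


def cluster_overview(xml_text, metric_name="load_one"):
    """Summarise one metric per cluster: reporting hosts and mean value."""
    root = ET.fromstring(xml_text)
    overview = {}
    for cluster in root.iter("CLUSTER"):
        values = []
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") == metric_name:
                    values.append(float(metric.get("VAL")))
        overview[cluster.get("NAME")] = {
            "hosts_reporting": len(values),
            "mean": sum(values) / len(values) if values else None,
        }
    return overview


if __name__ == "__main__":
    for name, summary in cluster_overview(SAMPLE_XML).items():
        print(name, summary)
```

A real federation is configured in Ganglia itself; the sketch only shows the shape of the roll-up from hosts to clusters.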
How it is used in the UK LCG Grid
There is an ongoing project to federate the Ganglia monitoring of all sites in the UK LCG Grid. The root of the Ganglia hierarchy is hosted on the GridPP website.
Further Documentation
Ganglia can be very heavy on disk I/O. Steve Traylen has written an FAQ about using a RAM disk to store Ganglia files. It is strongly recommended that you use something like this for your Ganglia installation.
Grid Operations Centre
What It Is
The UK/Ireland ROC runs the Grid Operations Centre (GOC). The GOC monitors the operation of grid services at each site registered in its database. The primary interface to the GOC is its web site, goc.grid-support.ac.uk.
How it is used in the UK LCG Grid
Each site in the UK and Ireland is registered with the GOC. The GOC then submits small test jobs to each site, one every 90 minutes, to check that the site is still responding correctly to the grid.
The results of these monitoring tests are used to colour representative dots on the GridPP Monitoring Map. The GOC also publishes graphs of responsiveness over a period of 1, 7 and 31 days. These are linked from the Monitoring Map page.
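As a rough illustration of what a responsiveness figure over 1, 7 and 31 days could mean, the Python sketch below computes the fraction of test jobs that passed within each window, given one test every 90 minutes. How the GOC actually calculates its graphs is not described here, so the definition used (passed tests divided by tests run in the window), the function names and the sample data are all assumptions.

```python
from datetime import datetime, timedelta

TEST_INTERVAL = timedelta(minutes=90)   # one test job every 90 minutes
WINDOWS_DAYS = (1, 7, 31)               # windows used for the graphs


def expected_tests(days):
    """Number of test jobs that fit into a window of `days` days."""
    return int(timedelta(days=days) / TEST_INTERVAL)


def responsiveness(results, now, days):
    """Fraction of tests passed in the last `days` days.

    `results` is an iterable of (timestamp, passed) pairs. Returns None
    when no test falls inside the window. This definition is an assumption,
    not the GOC's documented method.
    """
    cutoff = now - timedelta(days=days)
    recent = [passed for when, passed in results if cutoff < when <= now]
    if not recent:
        return None
    return sum(recent) / len(recent)


if __name__ == "__main__":
    now = datetime(2010, 4, 20, 12, 0)
    # Invented history: every fourth test fails.
    history = [(now - i * TEST_INTERVAL, i % 4 != 0) for i in range(32)]
    for days in WINDOWS_DAYS:
        print(days, expected_tests(days), responsiveness(history, now, days))
```

With a 90-minute interval, a full day holds 16 tests, so a single failed test moves the one-day figure by about six percentage points, while the 31-day figure changes far less.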
A new site must register in the GOC database and register at least one administrator, who is responsible for keeping the site's GOC DB entry up to date.
If your site is going offline for maintenance or repair, you must register the scheduled downtime in the GOC database.
Further Documentation
- The GOC Database
- Registering your site with the GOC
- Registering yourself as a site administrator with the GOC
- Change Notification
GridICE
What It Is
GridICE is another centralised monitoring service. It monitors the grid resources available at each site, and across the grid as a whole.
How it is used in the UK LCG Grid
There is a GridICE server, run by the Regional Operations Centre at RAL, which monitors the entire LCG Grid worldwide. This is integrated with the LCG information system and therefore membership of the GridICE system requires no action on the part of a site administrator. Simply being registered in the LCG information system is enough.
Further Documentation
GStat
What It Is
GStat is a centralised monitoring service which monitors the worldwide Grid Information System. Its primary goals are to detect faults in the Information System, to verify the validity of the data it publishes, and to display useful data from it.
How it is used in the UK LCG Grid
GStat is run by the Regional Operations Centre located at Academia Sinica in Taiwan. As with the GridICE monitoring system, the UK LCG Grid is automatically included in its grid-wide monitoring.
Further Documentation
R-GMA
What It Is
From the R-GMA home page:
- The Relational Grid Monitoring Architecture provides a web service for information, monitoring and logging in a distributed computing environment. R-GMA makes all the information appear like one large Relational Database that may be queried to find the information required. It consists of Producers which publish information into R-GMA, and Consumers which subscribe.
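The "one large Relational Database" idea can be illustrated with a small stand-in. The Python sketch below does not use the R-GMA API; it uses the standard sqlite3 module, with a hypothetical table ServiceStatus and invented site names, to show a producer-like function inserting rows and a consumer-like function asking a SQL question of them. R-GMA's Consumers subscribe to information as it is published, whereas this example runs only a one-off query.

```python
import sqlite3


def make_store():
    """Create an in-memory table standing in for the virtual database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ServiceStatus ("
        "site TEXT, service TEXT, status TEXT, measured_at TEXT)"
    )
    return conn


def publish(conn, site, service, status, measured_at):
    """Producer-like step: add one status record (hypothetical schema)."""
    conn.execute(
        "INSERT INTO ServiceStatus VALUES (?, ?, ?, ?)",
        (site, service, status, measured_at),
    )


def query_failures(conn):
    """Consumer-like step: list (site, service) pairs that are not OK."""
    rows = conn.execute(
        "SELECT site, service FROM ServiceStatus "
        "WHERE status != 'OK' ORDER BY site, service"
    )
    return rows.fetchall()


if __name__ == "__main__":
    store = make_store()
    publish(store, "site-a", "CE", "OK", "2010-04-20T11:00:00")
    publish(store, "site-b", "CE", "FAILED", "2010-04-20T11:00:00")
    publish(store, "site-b", "SE", "OK", "2010-04-20T11:00:00")
    print(query_failures(store))
```

The point of the relational view is that whoever asks the question needs to know only the table layout, not which site or service published each row.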
How it is used in the UK LCG Grid
R-GMA is distributed as part of the LCG middleware releases. It is used in conjunction with a package called APEL to provide an accounting database for the UK grid.
Further Documentation
- R-GMA Home Page
- EGEE JRA1 (Information and Monitoring) site
- Documentation about installing and configuring the current LCG release version of R-GMA
- Latest information about the APEL package
About this Page
This page is maintained by Fraser Speirs.