Monitoring Tools for LCG


The LCG2 Grid is monitored at a number of different levels. Some monitoring is done by local site administrators; other monitoring is developed and run by the UK/Ireland Regional Operations Centre (ROC) at RAL.

This document presents the commonly used monitoring tools, progressing from the local options to the grid-wide monitoring done by the ROC.

Nagios

What It Is

Nagios is a host, service and network monitoring application. It can be deployed at a site to monitor the operation of both machines and specific network services, such as email or web servers. It can also be configured to send alert notifications by email or SMS.

How it is used in the UK LCG Grid

Nagios is not a 'grid' monitoring tool per se but it can be used to monitor the operation of grid services in the same way that it is used to monitor more common network services like HTTP or IMAP.
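
As an illustration of treating a grid service like any other network service, here is a minimal sketch of a Nagios-style check written in Python. It relies only on the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) plus a one-line status message. The script name `check_grid_tcp.py` and the functions `check_tcp` and `main` are hypothetical, and a plain TCP connection test is an assumption about what "working" means for a given service.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection; return (exit_code, status_line)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name lookup failures.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    """Hypothetical entry point: argv is [script, HOST, PORT]."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_grid_tcp.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(f"UNKNOWN - port must be an integer, got {argv[2]!r}")
        return UNKNOWN
    code, status_line = check_tcp(argv[1], port)
    print(status_line)
    return code
```

To use this as a plugin, the script would finish by calling `sys.exit(main(sys.argv))` so that Nagios sees the exit code. A real check for a grid service would usually go further than a port test, for example by submitting a request to the service.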

We maintain a page of Nagios Plugins and Configuration Tips on this wiki.

Further Documentation

cfengine

What It Is

cfengine is a host-based configuration system. It can track filesystem changes, keep directories tidy, rotate logs, check remote connections, and raise alerts on any of these actions.

How it is used in the UK LCG Grid

See the cfengine category for an overview of all cfengine documentation on this wiki, or start at cfengine.

Ganglia

What It Is

Ganglia is a web-based tool for monitoring clusters of computers. Its primary focus is on monitoring hardware metrics like free memory/disk and CPU load.

Ganglia can also federate the monitoring of different clusters of machines. This lets a site (or many sites) build a hierarchy of Ganglia monitors and get higher-level overviews of the status of their clusters.
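
To make the idea of a higher-level overview concrete, the following sketch summarises one metric per cluster from a Ganglia-style XML document. It assumes the XML layout that the Ganglia daemons publish (CLUSTER, HOST and METRIC elements carrying NAME and VAL attributes). The sample document, the grid, cluster and host names, and the function `mean_metric_per_cluster` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data in the assumed Ganglia XML layout.
SAMPLE_XML = """\
<GANGLIA_XML SOURCE="gmetad">
  <GRID NAME="example-grid">
    <CLUSTER NAME="site-a">
      <HOST NAME="wn01.site-a.example"><METRIC NAME="load_one" VAL="0.50"/></HOST>
      <HOST NAME="wn02.site-a.example"><METRIC NAME="load_one" VAL="1.50"/></HOST>
    </CLUSTER>
    <CLUSTER NAME="site-b">
      <HOST NAME="wn01.site-b.example"><METRIC NAME="load_one" VAL="3.00"/></HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>
"""


def mean_metric_per_cluster(xml_text, metric_name):
    """Return {cluster name: mean value of metric_name over its hosts}."""
    root = ET.fromstring(xml_text)
    summary = {}
    # iter() finds CLUSTER elements at any depth, with or without a GRID wrapper.
    for cluster in root.iter("CLUSTER"):
        values = [
            float(metric.get("VAL"))
            for host in cluster.iter("HOST")
            for metric in host.iter("METRIC")
            if metric.get("NAME") == metric_name
        ]
        if values:  # skip clusters that do not report this metric
            summary[cluster.get("NAME")] = sum(values) / len(values)
    return summary


overview = mean_metric_per_cluster(SAMPLE_XML, "load_one")
```

Here `overview` holds one averaged load figure per site, which is the kind of rolled-up view a federation is meant to provide.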

How it is used in the UK LCG Grid

There is an ongoing project to federate the Ganglia monitoring of all sites in the UK LCG Grid. The root of the Ganglia hierarchy is hosted on the GridPP website.

Further Documentation

Ganglia can be very heavy on disk I/O. Steve Traylen has written an FAQ about using a RAM disk to store Ganglia files. It is strongly recommended that you use something like this for your Ganglia installation.

Grid Operations Centre

What It Is

The UK/Ireland ROC runs the Grid Operations Centre (GOC). The GOC monitors the operation of grid services at each site registered in its database. The primary interface to the GOC is through the web site goc.grid-support.ac.uk.

How it is used in the UK LCG Grid

Each site in the UK and Ireland is registered with the GOC. The GOC then submits small test jobs to each site, one every 90 minutes, to check that the site is still responding correctly to the grid.

The results of these monitoring tests are used to colour representative dots on the GridPP Monitoring Map. The GOC also publishes graphs of responsiveness over a period of 1, 7 and 31 days. These are linked from the Monitoring Map page.
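
For a sense of scale, the arithmetic behind those windows can be sketched as follows. With one test every 90 minutes, a site receives 16 tests per day, so the 1, 7 and 31 day windows cover 16, 112 and 496 tests. How the GOC actually computes its graphs is not described here; the `responsiveness` function below simply assumes a plain pass fraction, and both function names are hypothetical.

```python
from datetime import timedelta

TEST_INTERVAL = timedelta(minutes=90)  # one test job every 90 minutes


def expected_tests(window):
    """Number of whole test intervals that fit into the given window."""
    return window // TEST_INTERVAL


def responsiveness(results):
    """Fraction of passed tests (assumed metric); None if there are no results."""
    results = list(results)
    if not results:
        return None
    return sum(1 for passed in results if passed) / len(results)


# Tests expected in each of the published graph windows.
tests_per_window = {days: expected_tests(timedelta(days=days)) for days in (1, 7, 31)}
```

Under this assumption, a site that failed 4 of its 16 tests in a day would show a responsiveness of 0.75 for the 1 day window.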

A new site will need to register in the GOC database and register at least one administrator, who will be responsible for keeping the site's GOC database entry up to date.

If your site is going offline for maintenance or repair, you must register the scheduled downtime in the GOC database.

Further Documentation

GridICE

What It Is

GridICE is another centralised monitoring service. It monitors the grid resources available at each site, and across the grid as a whole.

How it is used in the UK LCG Grid

There is a GridICE server, run by the Regional Operations Centre at RAL, which monitors the entire LCG Grid worldwide. This is integrated with the LCG information system and therefore membership of the GridICE system requires no action on the part of a site administrator. Simply being registered in the LCG information system is enough.

Further Documentation

GStat

What It Is

GStat is a centralised monitoring service which monitors the worldwide Grid Information System. Its primary goals are to detect faults in the Information System, verify the validity of its data, and display useful data from it.

How it is used in the UK LCG Grid

GStat is run by the Regional Operations Centre located at Academia Sinica in Taiwan. As with the GridICE monitoring system, the UK LCG Grid is automatically included in its grid-wide monitoring.

Further Documentation

R-GMA

What It Is

From the R-GMA home page:

The Relational Grid Monitoring Architecture provides a web service for information, monitoring and logging in a distributed computing environment. R-GMA makes all the information appear like one large Relational Database that may be queried to find the information required. It consists of Producers which publish information into R-GMA, and Consumers which subscribe.
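
The "one large relational database" idea can be pictured with a small analogy. The sketch below is not the R-GMA API: an in-memory SQLite database stands in for the virtual database, `publish` plays the role of a Producer and `query` the role of a Consumer. The table `ServiceStatus`, its columns and the site names are hypothetical.

```python
import sqlite3


def make_virtual_db():
    """Create the stand-in database with one hypothetical table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ServiceStatus (site TEXT, service TEXT, status TEXT)")
    return db


def publish(db, site, service, status):
    """Producer role: insert one tuple of monitoring information."""
    db.execute("INSERT INTO ServiceStatus VALUES (?, ?, ?)", (site, service, status))


def query(db, sql, params=()):
    """Consumer role: run an SQL query and return all matching rows."""
    return db.execute(sql, params).fetchall()


db = make_virtual_db()
publish(db, "site-a", "CE", "up")
publish(db, "site-b", "CE", "down")

# A consumer asks a relational question without knowing who published the rows.
down_sites = query(db, "SELECT site FROM ServiceStatus WHERE status = ?", ("down",))
```

In R-GMA itself the data stays distributed across the grid and only appears as a single database to the consumer, whereas this analogy keeps everything in one local process.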

How it is used in the UK LCG Grid

R-GMA is distributed as part of the LCG middleware releases. It is used in conjunction with a package called APEL to provide an accounting database for the UK grid.

Further Documentation

About this Page

This page is maintained by Fraser Speirs.