Nagios is an enterprise-class monitoring solution for hosts, services, and networks. Many sites in GridPP use it as part of their service availability monitoring. Other tools are described on the Monitoring Tools for LCG page.
There is interest in using CERN's LEMON sensors to output to Nagios.
What should be monitored
- Disks becoming full, or nearly so. In particular check that jobs are not filling /tmp, the home directories or other scratch space.
- Also check that disks don't run out of inodes.
- Node crashes and disk failures.
- NFS mounts failing - check that files can be written and read. In particular, check that the VO_[VONAME]_SW_DIR is readable.
- Clock skew - often because ntpd has died, or sometimes due to a problem with the clock to which ntpd is synchronised.
- ssh keys - some modes of operation require unchallenged ssh between WNs and the CE, or for MPI among the WNs. A simple first check is to verify that the WN can copy a file back to the CE.
- Host certificates expiring - make sure they get renewed in good time. Also try not to renew at times which may cause problems in future years, e.g. August or December.
- CRLs expiring - this can cause failures for certificates from a single CA, which can be hard to diagnose.
- Check that you can use GridFTP from each WN to the CE and SE (although this will need a valid proxy on the WN).
- Check the processes running on each WN - that the needed processes (ntpd, pbs, etc.) are running, and that other things (rogue processes, stuck jobs) are not.
- Check log files for signs of trouble. Look for messages such as "permission denied".
- Monitor the duration of jobs by WN - if all jobs to a particular WN are ending quickly it may well be faulty.
- Colin Morey has written a PBS queue monitoring plugin for Nagios.
- DESY have written a dCache monitoring plugin for Nagios. It parses the output of your dCache web monitoring page (e.g. http://srm.epcc.ed.ac.uk:2288).
- APROC have some useful-looking plugins for monitoring grid services, but I can't find the actual plugins. They also appear to be written in Python, and may not scale for monitoring a larger site.
- Chris Brew has written a simple script that uses the LCG same-query tool to get the SAM test results into Nagios.
- RAL-LCG2 Nagios plugins - please direct any questions about these plugins to email@example.com
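
For checks in the list above that have no ready-made plugin, a small site-specific plugin is often enough. The following is a minimal sketch, not taken from any of the plugins mentioned here, of how the disk-space and inode checks could be written in Python. It relies on the standard Nagios plugin convention of one line of status text plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function names and the 80%/90% thresholds are illustrative assumptions and should be adapted to the site.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_pct, warn_pct, crit_pct):
    """Map a usage percentage to a Nagios status code."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def used_percent(total, available):
    """Percentage in use; a total of 0 (no accounting) is treated as 0% used."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - available) / total


def check_path(path, warn_pct=80.0, crit_pct=90.0):
    """Return (status_code, message) for space and inode usage on `path`.

    The thresholds are hypothetical defaults, not recommended values.
    """
    try:
        st = os.statvfs(path)
    except OSError as exc:
        return UNKNOWN, "DISK UNKNOWN - cannot stat %s: %s" % (path, exc)
    # f_bavail/f_favail are what an unprivileged job can still use, so
    # root-reserved blocks and inodes are counted as already used.
    space = used_percent(st.f_blocks, st.f_bavail)
    inodes = used_percent(st.f_files, st.f_favail)
    # Report the worse of the two results.
    status = max(classify(space, warn_pct, crit_pct),
                 classify(inodes, warn_pct, crit_pct))
    message = "DISK %s - %s: %.1f%% space, %.1f%% inodes used" % (
        LABELS[status], path, space, inodes)
    return status, message
```

To use this as a plugin, print the message and exit with the status code, for example `status, message = check_path("/tmp"); print(message); sys.exit(status)` after an `import sys`. Running one instance per filesystem (/tmp, the home directories, other scratch space) keeps each Nagios service definition simple.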