Nagios Plugins

From GridPP Wiki

Nagios is an enterprise-class monitoring solution for hosts, services, and networks. Many sites in GridPP use it as part of their service availability monitoring. Other tools are described on the Monitoring Tools for LCG page.

There is interest in using CERN's LEMON sensors to feed their output to Nagios.

What should be monitored

  • Disks becoming full, or nearly so. In particular check that jobs are not filling /tmp, the home directories or other scratch space.
  • Also check that disks don't run out of inodes.
  • Node crashes and disk failures.
  • NFS mounts failing - check that files can be written and read. In particular, check that VO_[VONAME]_SW_DIR is readable.
  • Clock skew - often because ntpd has died, or sometimes because of a problem with the clock to which ntpd is synchronised.
  • ssh keys - some modes of operation require unchallenged ssh between WNs and the CE, or for MPI among the WNs. A simple first check is to verify that the WN can copy a file back to the CE.
  • Host certificates expiring - make sure they get renewed in good time. Since the expiry date will tend to fall at the same time each year, try not to renew at times that may cause problems in future years, e.g. August or December.
  • CRLs expiring - this can cause failures for certificates from a single CA, which can be hard to diagnose.
  • Check that you can use GridFTP from each WN to the CE and SE (although this will need a valid proxy on the WN).
  • Check the processes running on each WN - that the needed processes (ntpd, pbs, etc.) are running, and that unwanted ones (rogue processes, stuck jobs) are not.
  • Check log files for signs of trouble. Look for messages such as "permission denied".
  • Monitor the duration of jobs by WN - if all jobs sent to a particular WN are ending quickly, the node may well be faulty.
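
As an illustration of the first two items in the list, here is a minimal sketch of a check for disk space and inode usage, written in Python. It follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with a one-line status message on standard output. The function names, the 80% and 90% thresholds and the default path /tmp are assumptions chosen for the example, not settings used by any GridPP site.

```python
"""Sketch of a Nagios-style check for disk space and inode usage (Unix only)."""
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def percent_used(total, available):
    """Return the percentage in use, or 0.0 if the filesystem reports no total."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - available) / total


def classify(value, warn, crit):
    """Map a usage percentage onto OK, WARNING or CRITICAL."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_path(path, warn=80.0, crit=90.0):
    """Check one mount point; the thresholds are illustrative assumptions."""
    try:
        st = os.statvfs(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    # f_bavail and f_favail count what is available to unprivileged users,
    # so blocks reserved for root are treated as used.
    space = percent_used(st.f_blocks, st.f_bavail)
    inodes = percent_used(st.f_files, st.f_favail)
    status = max(classify(space, warn, crit), classify(inodes, warn, crit))
    message = f"DISK {LABELS[status]} - {path} space {space:.1f}% inodes {inodes:.1f}%"
    return status, message


def main(argv):
    """Check every path given (default: /tmp) and return the worst exit code."""
    paths = argv or ["/tmp"]
    results = [check_path(p) for p in paths]
    # One line of output; taking the maximum means UNKNOWN (3) outranks
    # CRITICAL (2), which is a simplification made for this sketch.
    print("; ".join(message for _, message in results))
    return max(status for status, _ in results)
```

To use this as a plugin, the script would end by passing the result to the shell, for example with `import sys` followed by `sys.exit(main(sys.argv[1:]))`, and would be called with the scratch areas to watch, such as /tmp and the home directories.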

Available Plugins

External Links