Nagios

From GridPP Wiki

Nagios is a network and host monitoring package available under the GPL. See either the Wikipedia summary or the product homepage for more details.

GridPP operates a UK-wide Nagios instance; details are here: http://www.gridpp.ac.uk/wiki/UKI_Regional_Nagios

Although not primarily designed as one of the Monitoring_Tools_for_LCG, it can alert administrators to failing services and potentially restart them, as well as provide availability statistics.

Monitoring Plugins

These are documented on a separate page.

Remote Hosts

Because Nagios runs on a central server, it can only interrogate the state of remote machines that are accessible over the network. It can therefore run any check on localhost, but for remote hosts it is restricted to the following:

  • Network services (e.g. check_ssh, used to see whether there is an sshd service on the target host)
  • 'Polled' local scripts that send their results back over a secure pipe (NRPE)
  • 'Pushed' results of passive / active checks sent back to the Nagios server (NSCA)
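
As a rough illustration of what a polled or pushed check involves, the sketch below shows a local check that follows the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and a helper that formats a passive result in the tab-separated form that send_nsca reads. The function names, host name, service name and thresholds are hypothetical and not taken from any GridPP configuration.

```shell
#!/bin/sh
# Hypothetical helper: map a usage percentage onto Nagios plugin exit codes
# (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
# Usage: check_usage PERCENT WARN_THRESHOLD CRIT_THRESHOLD
check_usage() {
    pct=$1 warn=$2 crit=$3
    case $pct in
        ''|*[!0-9]*) echo "UNKNOWN - usage '$pct' is not a number"; return 3 ;;
    esac
    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL - usage ${pct}%"; return 2
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING - usage ${pct}%"; return 1
    fi
    echo "OK - usage ${pct}%"; return 0
}

# Format one passive service result as a single tab-separated line:
# host <TAB> service description <TAB> return code <TAB> plugin output
nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Example: run the check, capture its exit code, and build the passive result.
rc=0
out=$(check_usage 93 80 90) || rc=$?
nsca_line node001 disk_usage "$rc" "$out"
```

With NRPE, the server would ask the remote daemon to run a script like check_usage and read back its output and exit code. With NSCA, the remote host would instead pipe the line from nsca_line into send_nsca (typically with -H for the server and -c for the client config file), assuming that tool is installed.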

Configuration Tips

  • See what others are doing, e.g. RALPP_Work_List_Nagios
  • Generate templates automatically to make repetitive groups simple. For example, Andrew Elwell has a set of shell scripts for each type of node (worker, server, disk) that contain loops such as:
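# $CFG must already be set to the path of the config file being generated
for i in $(seq 1 140); do
h=$(printf "%03d" "$i")    # zero-padded node number, e.g. 007
cat <<EOF >> "$CFG"
define host {
        host_name       node$h
        alias           Worker Node $h
        address         10.141.0.$i
        use             wn_template
}

EOF
done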

Rather than defining each service on each node individually, you can then define it once for the whole host group:

define hostgroup{
        alias                   Worker Nodes
        hostgroup_name          workernodes
}
 
define host{
        name    wn_template
        use     linux-server
        hostgroups      workernodes
        register        0
}
 
define service{
        hostgroup_name  workernodes
        service_description sshd
        check_command   check_ssh
        servicegroups   sshservers
        use             local-service
}
  • Group related services together using servicegroups
  • If you already restrict access to the webserver that Nagios runs under (htaccess or SSL/X.509), you can set cgi.cfg to allow user * and Nagios will take the authenticated user from $REMOTE_USER
  • An example SSL configuration follows. It is for Apache 2, and also shows how to apply basic certificate ACLs to the /nagios location.
        SSLEngine on
        # Strong ciphers only; no anonymous or NULL ciphers
        SSLCipherSuite HIGH:!aNULL:!eNULL:!MD5
        SSLCertificateFile /etc/apache2/ssl/nagios-hostcert.pem
        SSLCertificateKeyFile /etc/apache2/ssl/nagios-hostkey.pem
        SSLCACertificatePath    /etc/grid-security/certificates
        SSLCACertificateFile /etc/apache2/ssl/cacert.crt
        SSLOptions +ExportCertData +CompatEnvVars +StdEnvVars
        SSLVerifyClient require
        SSLVerifyDepth 2
        SSLUserName SSL_CLIENT_S_DN
        <Location /nagios>
                SSLRequire  %{SSL_CLIENT_S_DN} eq "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=colin morey" \
                        or  %{SSL_CLIENT_S_DN} eq "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=Someone Else"
        </Location>
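
Returning to the tips on generated host definitions and servicegroups, the following is a minimal sketch of how the generator loop could be wrapped in reusable functions, one per kind of object. The function names and their arguments are assumptions made for illustration; only the node naming, address range, wn_template and sshservers come from the examples above. Note that the service example refers to a servicegroup called sshservers, which Nagios would expect to be defined somewhere.

```shell
#!/bin/sh
# Hypothetical generator: emit COUNT host definitions on stdout.
# Usage: gen_hosts PREFIX COUNT TEMPLATE SUBNET
# The last address octet is the node number, so COUNT should stay below 255.
gen_hosts() {
    prefix=$1 count=$2 template=$3 subnet=$4
    i=1
    while [ "$i" -le "$count" ]; do
        h=$(printf '%03d' "$i")    # zero-padded node number, e.g. 007
        cat <<EOF
define host {
        host_name       $prefix$h
        alias           $prefix $h
        address         $subnet.$i
        use             $template
}

EOF
        i=$((i + 1))
    done
}

# Hypothetical generator: emit one servicegroup definition on stdout.
# Usage: gen_servicegroup NAME ALIAS
gen_servicegroup() {
    cat <<EOF
define servicegroup {
        servicegroup_name       $1
        alias                   $2
}

EOF
}

# Example: print two worker node definitions and the sshservers group.
gen_hosts node 2 wn_template 10.141.0
gen_servicegroup sshservers "SSH Servers"
```

Redirecting the output to a file included from the main configuration, and then running nagios -v against the main config file before reloading, is one way to catch mistakes in generated definitions.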

Notifications

By default Nagios comes with email notifications, but it can easily be extended to notify via pagers, SMS or even Jabber.
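
One common way to extend notifications is to point a Nagios command definition at a small script that formats the alert and hands it to whatever transport is available. The sketch below covers only the formatting step. The script path, command name and function names are hypothetical, and the actual delivery (pager gateway, SMS service, Jabber client) is left out because it depends on the site.

```shell
#!/bin/sh
# A command definition along these lines (path and name are assumptions)
# would pass standard Nagios macros to the script as arguments:
#
# define command {
#         command_name    notify-service-by-script
#         command_line    /usr/local/bin/notify.sh "$NOTIFICATIONTYPE$" "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$SERVICEOUTPUT$"
# }

# Build a one-line alert from: TYPE HOST SERVICE STATE OUTPUT
format_alert() {
    printf '%s: %s/%s is %s (%s)\n' "$1" "$2" "$3" "$4" "$5"
}

# Same message, cut to 160 characters for pager or SMS style transports.
short_alert() {
    format_alert "$@" | cut -c1-160
}

# Example: format a sample alert; a real script would pipe this to its transport.
short_alert PROBLEM node001 sshd CRITICAL "Connection refused"
```

Keeping the formatting separate from the delivery means the same message can be sent through several transports without touching the Nagios command definition.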