RALPP Work List Nagios


Progress on installing Nagios monitoring for the RALPP Tier 2

04/07/2006

Installed nagios rpms, set up very basic config

Installed nagios and nagios-plugins on heplnx182

Configured http access with:

htpasswd -c /etc/nagios/htpasswd.users nagiosadmin

Tried starting the nagios service, but it complained about problems with the config file. I had to edit /etc/nagios/nagios.cfg to comment out all the cfg_file entries other than minimal.cfg.
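As a rough sketch, the edited part of /etc/nagios/nagios.cfg might have looked something like the fragment below. Only minimal.cfg is named in these notes; the other file names are assumptions based on the sample files that Nagios packages of that era typically shipped, and may not match this install.

```cfg
# /etc/nagios/nagios.cfg (fragment)
# Only minimal.cfg stays active; the other sample object files are commented out.
# File names other than minimal.cfg are assumed, not taken from this install.
cfg_file=/etc/nagios/minimal.cfg
#cfg_file=/etc/nagios/checkcommands.cfg
#cfg_file=/etc/nagios/misccommands.cfg
#cfg_file=/etc/nagios/hosts.cfg
#cfg_file=/etc/nagios/services.cfg
```

Running nagios -v /etc/nagios/nagios.cfg performs a pre-flight check of the configuration and reports errors like these before the service is started.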

The nagios service then started and I could log into the web interface (after making apache reload its config), but I couldn't see any information on the one host in the config (localhost).

Eventually discovered that I had to edit /etc/nagios/cgi.cfg to give the userid I'd just set up (nagiosadmin) permission to access various bits of the CGI. Unscientifically enabled everything in sight. Now I can see the status of localhost.
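For reference, "enabling everything" for nagiosadmin in cgi.cfg would probably amount to something like the following. The directive names are the standard CGI authorization options; exactly which ones were set here is not recorded, so treat the list as an assumption.

```cfg
# /etc/nagios/cgi.cfg (fragment)
# Require a logged-in user, then grant nagiosadmin access to each CGI area.
use_authentication=1
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
```

The authorized_for_all_hosts and authorized_for_all_services lines are the ones that make host and service status visible to a user who is not listed as a contact for them.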

19/09/2006

Started messing with nrpe to do remote monitoring

Installed nagios-nrpe-plugin on heplnx182.

Installed nagios-nrpe and nagios-plugins on heplnx10.

Opened TCP port 5666 on heplnx10 for nrpe service and edited /etc/nagios/nrpe.cfg to allow connections from heplnx182

On heplnx182 ran:

[root@heplnx182 nagios]# /usr/lib/nagios/plugins/check_nrpe -H heplnx10.pp.rl.ac.uk -c check_users
USERS OK - 1 users currently logged in |users=1;5;10;0

Looks good!
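The nrpe.cfg changes on heplnx10 behind the test above would look roughly like this sketch. The address of heplnx182 is not given in these notes, so it is left as a placeholder; the check_users thresholds are inferred from the 5 and 10 shown in the performance data, and the plugin path is assumed to match the one used on heplnx182.

```cfg
# /etc/nagios/nrpe.cfg on heplnx10 (fragment)
server_port=5666
# Placeholder: replace with the address of heplnx182 (not given in these notes).
allowed_hosts=127.0.0.1,<address of heplnx182>
# Warn at 5 logged-in users, go critical at 10.
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
```

The name in square brackets is what check_nrpe passes with -c, so each remote check needs a matching command[...] line on the monitored node.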

Edited /etc/nagios/minimal.cfg on heplnx182 to include a new command, host and services:

define command{
	command_name	check_system_disk
	command_line	$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_system_disk
	}

define host{
        use                     generic-host            ; Name of host template to use
        host_name               heplnx10
        alias                   heplnx10
        address                 130.246.43.10
        check_command           check-host-alive
        max_check_attempts      10
        check_period		24x7
        notification_interval   120
        notification_period     24x7
        notification_options    d,r
        contact_groups  admins
        }

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       heplnx10
        service_description             PING
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
	notification_options		w,u,c,r
        notification_interval           960
        notification_period             24x7
	check_command			check_ping!100.0,20%!500.0,60%
        }

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       heplnx10
        service_description             Root Partition
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
	notification_options		w,u,c,r
        notification_interval           960
        notification_period             24x7
	check_command			check_system_disk!20%!10%!/
        }

and added heplnx10 to the test nodes nodegroup
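One thing to note: the check_system_disk command line above contains no $ARG1$-style macros, so the !20%!10%!/ arguments in the service definition are not passed on to check_nrpe. The thresholds actually applied are whatever nrpe.cfg on heplnx10 defines for check_system_disk. A sketch of a matching remote definition, assuming the standard check_disk plugin and that 20% warning and 10% critical free space on / were the intended values:

```cfg
# /etc/nagios/nrpe.cfg on heplnx10 (fragment, assumed)
# Warn below 20% free, critical below 10% free, on the root partition.
command[check_system_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```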

After reloading nagios, heplnx10 appeared on the web pages and the two services were checked.

Then I created a new config file, user-interfaces.cfg, and added commands, a host (heplnx101), a host group for the user interfaces and some services for general things. I added that file to the main nagios.cfg and installed the two rpms on heplnx101.
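A minimal sketch of the host group part of that file is shown below. The group name and alias are hypothetical, since the notes do not record them; only heplnx101 comes from the text.

```cfg
define hostgroup{
        hostgroup_name  user-interfaces         ; hypothetical name
        alias           User Interfaces         ; hypothetical alias
        members         heplnx101
        }
```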

nagios started monitoring the services as expected.

I then created a new notification group for me, stopped the nrpe service on heplnx101 and waited for the checks to go critical. Distinct lack of e-mails. Will look at that later (Ah, e-mail doesn't work with sendmail stopped!).
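A notification group of this kind is a contact plus a contactgroup. The sketch below uses hypothetical names and a placeholder address, and assumes the notify-by-email and host-notify-by-email commands from the Nagios sample configuration are defined.

```cfg
define contact{
        contact_name                    someadmin               ; hypothetical
        alias                           Some Admin
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-by-email         ; assumed sample command
        host_notification_commands      host-notify-by-email    ; assumed sample command
        email                           someadmin@example.org   ; placeholder
        }

define contactgroup{
        contactgroup_name       test-admins             ; hypothetical
        alias                   Test notification group
        members                 someadmin
        }
```

The services then reference the group through contact_groups. The e-mail commands hand the message to the local mail system, which is why nothing arrives when sendmail is stopped.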

20-26/09/2006

Started moving to a more permanent set up

Logical file structure

I've split the files up into each different type of definition, so I have:

commands.cfg
generic-templates.cfg
hostgroups.cfg
servicegroups.cfg
service-templates.cfg
time-periods.cfg

Then each host group has a directory containing a file with the host template and service definitions and a file with the host definitions.
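One way to wire this layout into nagios.cfg is a cfg_file line per shared file and a cfg_dir line per host group directory. This is a sketch: the notes do not say whether cfg_dir was used, and the directory name is hypothetical.

```cfg
# /etc/nagios/nagios.cfg (fragment)
cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/generic-templates.cfg
cfg_file=/etc/nagios/hostgroups.cfg
cfg_file=/etc/nagios/servicegroups.cfg
cfg_file=/etc/nagios/service-templates.cfg
cfg_file=/etc/nagios/time-periods.cfg
# One directory per host group; every .cfg file inside it is read.
cfg_dir=/etc/nagios/grid-workers        ; hypothetical directory name
```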

Hierarchy of templates

So for instance:

  1. generic-grid-worker-host inherits from
  2. generic-linux-host which in turn inherits from
  3. generic-host
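A sketch of how such a chain can be written follows. The template names are the ones listed above; which directives live in which template is an assumption, as the real templates are not shown. Each template carries register 0 because that directive is not inherited.

```cfg
define host{
        name                    generic-host            ; base template
        notifications_enabled   1
        max_check_attempts      10
        register                0                       ; template only, not a real host
        }

define host{
        name                    generic-linux-host
        use                     generic-host            ; inherit the base settings
        check_command           check-host-alive
        register                0
        }

define host{
        name                    generic-grid-worker-host
        use                     generic-linux-host
        hostgroups              7-GridWorkers           ; assumed placement of the group
        contact_groups          admins
        register                0
        }
```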

Most of the host definition is contained in these templates, so the actual host definitions look like:

define host{
	use			generic-grid-worker-host
	host_name		heplnc001
	alias			heplnc001.pp.rl.ac.uk
	address			130.246.45.1
}

The same is also true of services, where I define a generic service template for each service (say system-load-service-template) that defines everything the service does apart from which nodes it applies to. Then the individual service definitions use the hostgroups to apply the services to nodes:

define service{
        use                             system-load-service-template
	hostgroups			7-GridWorkers
        }
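The template behind that definition might look like the sketch below. Apart from the template name and generic-service, everything here is an assumption, including the check_system_load command name and the check_load command on the NRPE side.

```cfg
define command{
        command_name    check_system_load               ; hypothetical name
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_load
        }

define service{
        name                    system-load-service-template
        use                     generic-service
        service_description     System Load
        check_command           check_system_load
        register                0                       ; template only
        }
```

With the description and check command held in the template, the per-hostgroup definition only has to name the template and the group.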

This even works quite well for individual service instances, like checking that a web page is accessible. I just define the service template in the normal way, then override the service_description and check_command like this:

define service{ 
        use                             http-url-service-template
        host_name                       heplnx182 
        service_description             ganglia web accessable 
        check_command                   check_http_url!ganglia.gridpp.rl.ac.uk!/
        } 
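The check_http_url command itself is not shown in these notes. A plausible definition, assuming it wraps the standard check_http plugin with the host name as the first argument and the URL path as the second, would be:

```cfg
define command{
        command_name    check_http_url
        # $ARG1$ = web host name, $ARG2$ = URL path to fetch
        command_line    $USER1$/check_http -H $ARG1$ -u $ARG2$
        }

define service{
        name                    http-url-service-template
        use                     generic-service
        service_description     HTTP URL                ; overridden per instance
        check_command           check_http_url!localhost!/
        register                0
        }
```

With this definition, check_http_url!ganglia.gridpp.rl.ac.uk!/ requests the path / from ganglia.gridpp.rl.ac.uk.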

I've now installed it on most of the nodes, we're now checking 482 services on 118 hosts and have started to tailor the services to the hosts.

Chris brew 18:52, 26 Sep 2006 (BST)