RALPP Work List Nagios
Latest revision as of 17:52, 26 September 2006
Progress on installing Nagios monitoring for the RALPP Tier 2
04/07/2006
Installed nagios rpms, set up very basic config
Installed nagios and nagios-plugins on heplnx182
Configured http access with:
htpasswd -c /etc/nagios/htpasswd.users nagiosadmin
When I tried starting the nagios service it complained about problems with the config file. I had to edit /etc/nagios/nagios.cfg to comment out all the cfg_file entries other than minimal.cfg.
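For illustration, the relevant part of nagios.cfg might end up looking something like the fragment below. The two commented-out file names are assumptions based on the sample configs usually shipped with Nagios 2.x packages; the actual list on heplnx182 is not recorded here.

```
# /etc/nagios/nagios.cfg (fragment)
# Only minimal.cfg is left active; the other file names are assumed examples.
cfg_file=/etc/nagios/minimal.cfg
#cfg_file=/etc/nagios/checkcommands.cfg
#cfg_file=/etc/nagios/misccommands.cfg
```

Running nagios -v /etc/nagios/nagios.cfg checks the configuration and reports errors without starting the daemon, which is a quick way to confirm an edit like this before restarting the service.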
The nagios service then started and I could log into the web interface (after making apache reload its config) but I couldn't see any info on the one host in the config (localhost).
Eventually discovered that I had to edit /etc/nagios/cgi.cfg to give the userid I'd just set up (nagiosadmin) permission to access various bits of the CGI. Unscientifically enabled everything in sight. Now I can see the status of localhost.
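As a rough sketch, "enabling everything" for nagiosadmin in cgi.cfg would probably amount to something like the following. The directive names are the standard CGI authorization settings; exactly which ones were changed on heplnx182 is an assumption.

```
# /etc/nagios/cgi.cfg (fragment)
# Grant the nagiosadmin web user access to every part of the CGI interface.
use_authentication=1
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
```

A more careful setup would grant the command-related permissions only to the accounts that need them.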
19/09/2006
Started messing with nrpe to do remote monitoring
Installed nagios-nrpe-plugin on heplnx182
Installed nagios-nrpe and nagios-plugins on heplnx10
Opened TCP port 5666 on heplnx10 for nrpe service and edited /etc/nagios/nrpe.cfg to allow connections from heplnx182
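The nrpe.cfg change on heplnx10 would look roughly like the fragment below. The address used for heplnx182 is a documentation placeholder, not its real address, and the check_users thresholds are inferred from the 5 and 10 that appear in the check output further down.

```
# /etc/nagios/nrpe.cfg on heplnx10 (fragment)
server_port=5666
# Hosts allowed to talk to the NRPE daemon.
# 192.0.2.182 is a placeholder standing in for heplnx182.
allowed_hosts=127.0.0.1,192.0.2.182
# Command run when the Nagios server asks for "check_users".
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
```

The nrpe service needs restarting (or reloading) after the edit for the new allowed_hosts value to take effect.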
On heplnx182 ran:
[root@heplnx182 nagios]# /usr/lib/nagios/plugins/check_nrpe -H heplnx10.pp.rl.ac.uk -c check_users
USERS OK - 1 users currently logged in |users=1;5;10;0
Looks good!
Edited /etc/nagios/minimal.cfg on heplnx182 to include a new command, host and services:
define command{
    command_name    check_system_disk
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_system_disk
}

define host{
    use                     generic-host    ; Name of host template to use
    host_name               heplnx10
    alias                   heplnx10
    address                 130.246.43.10
    check_command           check-host-alive
    max_check_attempts      10
    check_period            24x7
    notification_interval   120
    notification_period     24x7
    notification_options    d,r
    contact_groups          admins
}

define service{
    use                     generic-service    ; Name of service template to use
    host_name               heplnx10
    service_description     PING
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_ping!100.0,20%!500.0,60%
}

define service{
    use                     generic-service    ; Name of service template to use
    host_name               heplnx10
    service_description     Root Partition
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_system_disk!20%!10%!/
}
and added heplnx10 to the test nodes nodegroup
After reloading nagios, heplnx10 appeared on the web pages and the two services were checked.
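For the Root Partition check to work, heplnx10 also needs a matching check_system_disk entry in its nrpe.cfg. That entry is not shown in these notes, so the line below is only a guess at its shape, using the standard check_disk plugin and the 20%/10% thresholds from the service definition.

```
# /etc/nagios/nrpe.cfg on heplnx10 (fragment, assumed definition)
# Warn below 20% free space on /, go critical below 10%.
command[check_system_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
```

Note that the check_system_disk command_line on the Nagios server does not reference $ARG1$, $ARG2$ or $ARG3$, so the !20%!10%!/ arguments in the service definition are not passed on; the thresholds that actually apply are whatever is set on the NRPE side.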
Then I created a new config file, user-interfaces.cfg, and added commands, a host (heplnx101), a host group for the user interfaces and some services for general things. I added that file to the general nagios.cfg and installed the two rpms on heplnx101.
nagios started monitoring the services as expected.
I then created a new notification group for me, stopped the nrpe service on heplnx101 and waited for the checks to go critical. Distinct lack of emails. Will look at that later (Ah, e-mail doesn't work with sendmail stopped!).
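A notification group of this kind is normally a contact plus a contactgroup. The sketch below shows the general shape only: the contact name, group name and e-mail address are hypothetical, and the two notification commands are the ones defined in the stock sample configs, which may differ from what is installed here.

```
define contact{
    contact_name                    chris                  ; hypothetical name
    alias                           Chris (test contact)
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,r
    service_notification_commands   notify-by-email        ; from the sample configs
    host_notification_commands      host-notify-by-email   ; from the sample configs
    email                           user@example.org       ; placeholder address
}

define contactgroup{
    contactgroup_name   test-admins                        ; hypothetical name
    alias               Test notification group
    members             chris
}
```

The sample e-mail notification commands rely on a working local mail setup on the Nagios server, which fits with the observation that nothing arrived while sendmail was stopped.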
20-26/09/2006
Started moving to a more permanent set up
Logical file structure
I've split the files up into each different type of definition, so I have:
commands.cfg
generic-templates.cfg
hostgroups.cfg
servicegroups.cfg
service-templates.cfg
time-periods.cfg
Then each host group has a directory containing a file with the host template and service definitions and a file with the host definitions.
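One way to wire a layout like this into nagios.cfg is a cfg_file line per shared file plus a cfg_dir line per host group directory. The /etc/nagios prefix for the shared files and the directory name grid-workers are assumptions for illustration.

```
# /etc/nagios/nagios.cfg (fragment)
# Shared definitions, one file per object type.
cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/generic-templates.cfg
cfg_file=/etc/nagios/hostgroups.cfg
cfg_file=/etc/nagios/servicegroups.cfg
cfg_file=/etc/nagios/service-templates.cfg
cfg_file=/etc/nagios/time-periods.cfg
# One directory per host group (directory name is hypothetical).
# Nagios reads every file ending in .cfg that it finds in the directory.
cfg_dir=/etc/nagios/grid-workers
```

With cfg_dir, adding a new host file to a group's directory needs no further change to nagios.cfg, only a reload.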
Hierarchy of templates
So for instance:
- generic-grid-worker-host inherits from
- generic-linux-host which in turn inherits from
- generic-host
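A chain like this can be sketched with name, use and register 0. The directive values below are borrowed from the heplnx10 host definition earlier on this page and are illustrative only; the real templates are not reproduced here, and generic-host is assumed to be the template from the sample config.

```
define host{
    name                    generic-linux-host         ; template name
    use                     generic-host               ; inherit from the base template
    check_command           check-host-alive
    max_check_attempts      10
    check_period            24x7
    notification_interval   120
    notification_period     24x7
    notification_options    d,r
    contact_groups          admins
    register                0                          ; template only, not a real host
}

define host{
    name                    generic-grid-worker-host
    use                     generic-linux-host         ; picks up everything above
    ; worker-specific overrides would go here
    register                0
}
```

The register setting is not inherited, so hosts that use these templates are registered as real hosts without having to set it themselves.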
Most of the host definition is contained in these templates, so the actual host definitions look like:
define host{
    use         generic-grid-worker-host
    host_name   heplnc001
    alias       heplnc001.pp.rl.ac.uk
    address     130.246.45.1
}
The same is also true of services, where I define a generic service template for each service (say system-load-service-template) that defines everything the service does apart from which nodes it applies to. Then the individual service definitions use the hostgroups to apply the services to nodes:
define service{
    use         system-load-service-template
    hostgroups  7-GridWorkers
}
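The pieces behind a definition like that might look roughly as follows. The command name check_system_load, the service description, the alias and the use of generic-service as a parent are all assumptions; only the template name, the hostgroup name and the host heplnc001 come from this page.

```
define command{
    command_name    check_system_load                  ; hypothetical, mirrors check_system_disk
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_load
}

define service{
    name                    system-load-service-template
    use                     generic-service            ; assumed parent template
    service_description     System Load
    check_command           check_system_load
    register                0                          ; template only
}

define hostgroup{
    hostgroup_name  7-GridWorkers
    alias           Grid Workers
    members         heplnc001                          ; further workers would be listed here
}
```

Because the template carries everything except the target, adding the service to another group of nodes is a three-line definition naming a different hostgroup.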
This even works quite well for individual service instances, like checking that a web page is accessible. I just define the service template in the normal way, then override the service_description and check_command like this:
define service{
    use                     http-url-service-template
    host_name               heplnx182
    service_description     ganglia web accessable
    check_command           check_http_url!ganglia.gridpp.rl.ac.uk!/
}
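The check_http_url command itself is not shown in these notes. A plausible definition, assuming it wraps the standard check_http plugin with the first argument as the host name and the second as the URL path, would be:

```
define command{
    command_name    check_http_url                     ; definition assumed, not copied from the live config
    command_line    $USER1$/check_http -H $ARG1$ -u $ARG2$
}
```

With the service above, $ARG1$ expands to ganglia.gridpp.rl.ac.uk and $ARG2$ to /, so the check requests the front page of the ganglia site.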
I've now installed it on most of the nodes. We're now checking 482 services on 118 hosts and have started to tailor the services to the hosts.
Chris brew 18:52, 26 Sep 2006 (BST)