UKI Regional Nagios

Introduction

The information on this page is not up to date. Please go to https://www.gridpp.ac.uk/wiki/UKI_WLCG_Regional_Nagios

Nagios is one of the most popular open source monitoring tools for systems, networks and applications. WLCG-Nagios was developed to integrate Nagios into LCG monitoring as part of the EGEE-SA1 Multi Level Monitoring (MLM) approach. Gridppnagios is a UKI ROC-level monitoring instance based on WLCG-Nagios. It monitors all the UKI sites registered in GOCDB, using a combination of three kinds of tests.

  • Remote Tests
  External agents such as SAM and ENOC perform tests on all the sites registered in GOCDB. WLCG-Nagios fetches these test results
  from the external agents and publishes them. These tests are called remote tests.
  • Local Tests
  WLCG-Nagios also performs its own tests on all the sites. These tests are similar to the SAM tests and require a proxy, because they submit test
  jobs to the sites and analyse the results.
  • Native Tests
  These are the native Nagios checks that are part of the Nagios package. They do not need a proxy. They will be deprecated in
  WLCG-Nagios.

For complete information about WLCG-Nagios, see the official WLCG-Nagios wiki page: https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgOverview

Access to gridppnagios

Gridppnagios was developed as part of the plan to move from central monitoring such as SAM to regional monitoring. It is meant for all UKI grid site admins and managers, so access to gridppnagios is restricted to people who have a valid grid certificate and are members of the dteam or ops VO. If you are not a member of the dteam or ops VO, you can send your certificate DN to k.mohammad1@physics.ox.ac.uk to be added. Any feedback would be very much appreciated.

Understanding the WLCG-Nagios

Several tools in WLCG-Nagios can help in troubleshooting errors. For example, you can see the history and trend of a particular test. All the tests marked with the PASV symbol are external tests, i.e. they are performed by the SAM or ENOC server and Nagios has only fetched the result from that server. So if any of these tests shows an error, check at SAM or ENOC first.

All the tests with the -dteam suffix are local tests performed by the Nagios server using a dteam VO proxy. These tests are equivalent to the SAM tests and can be very useful for early discovery of problems, because they run more frequently than the SAM tests.

Use of Firefox plugins

Gridppnagios publishes around 150 hosts and more than 2000 services, so it can be difficult to find the host you are interested in. The Nagios Checker Firefox add-on is a very good tool for filtering results. For example, if you are interested only in hosts at qmul.ac.uk, install the add-on, go to Tools -> Add-ons -> Nagios Checker -> Options -> Filtering, select the option for hosts matching a regular expression, and enter qmul.ac.uk.
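
Because the filter is a regular expression, an unescaped dot matches any character. As a rough illustration only, the effect of a stricter pattern can be previewed with grep. The add-on uses its own regex engine, filter_hosts is a hypothetical helper, and se01.qmul.ac.uk is a made-up host name:

```shell
# Sketch: preview which host names a filter pattern would select.
# filter_hosts is a hypothetical helper, not part of Nagios or the add-on.
filter_hosts() {
    # $1 is an extended regular expression; host names are read from stdin.
    grep -E -- "$1"
}

# Escaping the dots and anchoring with $ restricts the match to hosts
# whose name really ends in qmul.ac.uk.
printf '%s\n' \
    se01.qmul.ac.uk \
    t2se01.physics.ox.ac.uk \
    ce00.hep.ph.ic.ac.uk |
    filter_hosts 'qmul\.ac\.uk$'
```

With this sample list only se01.qmul.ac.uk is printed. The plain pattern qmul.ac.uk works just as well in practice; the escaped form is simply more precise.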

E-mail Notifications

Site managers can request email notifications for their sites. If you would like to receive email notifications about problems at your sites, drop me an email and I will enable notifications for them. Notifications will be sent to the address provided in GOCDB.

Installation

This is an all-in-one-box installation. I have installed a MyProxy server, the gLite UI and EGEE Nagios in one virtual machine. The MyProxy server and UI are required only for Nagios installations that also have to perform local tests. An external MyProxy server can be used instead, but that requires some changes on the server side, so it is better to install the MyProxy server on the same machine. This is the step-by-step procedure used to install WLCG-Nagios at Oxford.

Repositories

I have added the following repositories in addition to the OS repositories.

Requirements

You need a host certificate and access to the SAM portal. To get access to the SAM portal, you have to open a ticket here

site-info.def

The site-info.def file is described in detail on the official wiki page, but some of the more confusing attributes are shown here.


      PX_HOST=gridppnagios.physics.ox.ac.uk
      GRID_AUTHORIZED_RETRIEVERS="<subject of hostcert of nagios server>"
      GRID_TRUSTED_RETRIEVERS="<subject of hostcert of nagios server>" #output of 'openssl x509 -in hostcert.pem -noout -subject' 
      NCG_PROBES_TYPE=remote,native,local
      NAGIOS_HOST=gridppnagios.physics.ox.ac.uk
      NAGIOS_HTTPD_ENABLE_CONFIG=true
      NAGIOS_NCG_ENABLE_CONFIG=true
      NAGIOS_NAGIOS_ENABLE_CONFIG=true
      NAGIOS_CGI_ENABLE_CONFIG=true
      NCG_GOCDB_ROC_NAME=UKI
      VOS="ops dteam"
      VO_DTEAM_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/dteam?/dteam/'
      VO_OPS_VOMS_SERVERS='vomss://voms.cern.ch:8443/voms/ops?/ops/'


Installing and Configuring

   yum install httpd                                                  # install this first
   yum install lcg-CA glite-PX egee-NAGIOS glite-UI                   # better to install these together
   yaim -c -s site-info.def -n glite-PX -n glite-NAGIOS -n glite-UI

If the above steps complete without any error, the Nagios page should come up when you point a browser at the address of your server. It will publish remote and native results, but it will not be able to perform local tests because it does not have a proxy. So upload a proxy using your personal certificate from any UI where your personal certificate is installed.

     myproxy-init -c 336 -k NagiosRetrieve-gridppnagios.physics.ox.ac.uk-dteam -s gridppnagios.physics.ox.ac.uk -l nagios \
       -x -Z "/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=gridppnagios.physics.ox.ac.uk/emailAddress=e.macmahon1@physics.ox.ac.uk"

The above is a single command, split over two lines with a trailing backslash. It uploads a proxy with a validity of 336 hours (14 days) that can be retrieved only by a machine whose certificate DN matches the one given with the -Z option.

Extra Configuration Tips

Enabling Notification for a particular site

   To enable notification for a site, say Oxford, edit its contacts file and set:
   vi /etc/nagios/wlcg.d/UKI-SOUTHGRID-OX-HEP/contacts.cfg
       service_notification_options    w,u,c,r    #default is n
       host_notification_options       d,u,r
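
The same edit can be scripted instead of made by hand in vi. This is a minimal sketch, assuming the two option lines already exist in contacts.cfg in the layout shown above. The function name enable_notifications and the .bak backup suffix are illustrative choices, not part of WLCG-Nagios:

```shell
# Sketch: switch a site's contacts.cfg to the notification options above.
# enable_notifications is a hypothetical helper; $1 is the contacts.cfg path.
enable_notifications() {
    cfg="$1"
    # Keep a copy of the original file next to it.
    cp -p "$cfg" "$cfg.bak"
    # Replace whatever follows each option name with the new value,
    # preserving the original indentation and column spacing.
    sed -e 's/^\([[:space:]]*service_notification_options[[:space:]]\{1,\}\).*/\1w,u,c,r/' \
        -e 's/^\([[:space:]]*host_notification_options[[:space:]]\{1,\}\).*/\1d,u,r/' \
        "$cfg.bak" > "$cfg"
}
```

For the Oxford example this would be called as enable_notifications /etc/nagios/wlcg.d/UKI-SOUTHGRID-OX-HEP/contacts.cfg. Nagios has to re-read its configuration before the change takes effect, typically with service nagios reload.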

Granting Access

All members of the ops and dteam VOs are granted access to WLCG-Nagios automatically. A cron job runs every 6 hours to update /etc/nagios/htpasswd.users from the central VOMS server. To add a user who is not a member of either VO:

     Create a *.conf file in /etc/voms2htpasswd-static.d/, add the DN of the user to this file, and then run
     service Nagios-htpasswd restart
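
As a sketch of the first step, the helper below writes a DN into a .conf file. It assumes the file simply holds one DN per line, which this page does not spell out. The function name add_static_user, the file name and the DN in the usage note are all hypothetical:

```shell
# Sketch: add a user DN to a static file, assuming one DN per line.
# add_static_user is a hypothetical helper.
#   $1 directory holding the *.conf files
#   $2 file name without the .conf extension
#   $3 certificate DN of the user
add_static_user() {
    dir="$1"; name="$2"; dn="$3"
    file="$dir/$name.conf"
    touch "$file"
    # Append the DN only if that exact line is not already present.
    grep -qxF -- "$dn" "$file" || printf '%s\n' "$dn" >> "$file"
}
```

A call might look like add_static_user /etc/voms2htpasswd-static.d extra-users "/C=UK/O=eScience/OU=Example/CN=jane doe", followed by the restart step described above.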

Checking a service from command prompt at nagios server

It is possible to run a service check from the command prompt for troubleshooting. The -v option enables verbose output.

   nagios-run-check -v -H t2se01.physics.ox.ac.uk -s hr.srce.GridFTP-Transfer-dteam

Removing a test from host

To remove a test from a host:

   Add the following line to /etc/ncg/ncg.localdb:
   REMOVE_SERVICE!ce00.hep.ph.ic.ac.uk!org.glite.LocalLogger
   This stops the org.glite.LocalLogger test from being published for ce00.hep.ph.ic.ac.uk.
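
If several tests need to be removed, the rule can be appended with a small helper that avoids duplicate entries. This is only a sketch; remove_service is a hypothetical function name, and the rule format is the one shown above:

```shell
# Sketch: append a REMOVE_SERVICE rule to ncg.localdb exactly once.
# remove_service is a hypothetical helper.
#   $1 path to ncg.localdb
#   $2 host name
#   $3 service (test) name
remove_service() {
    localdb="$1"; host="$2"; service="$3"
    # Single-quoted format string, so the ! characters are taken literally.
    rule=$(printf 'REMOVE_SERVICE!%s!%s' "$host" "$service")
    touch "$localdb"
    # Append the rule only if that exact line is not already present.
    grep -qxF -- "$rule" "$localdb" || printf '%s\n' "$rule" >> "$localdb"
}
```

The example above corresponds to remove_service /etc/ncg/ncg.localdb ce00.hep.ph.ic.ac.uk org.glite.LocalLogger. Running it twice leaves a single rule in the file.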