Difference between revisions of "Backup Regional Nagios"

From GridPP Wiki

Latest revision as of 09:35, 13 September 2016

== The GridPP Nagios service has been decommissioned and this page is no longer valid ==

Backup Regional Nagios

A backup regional Nagios (https://gridppnagios.lancs.ac.uk/nagios/) has been installed at Lancaster. It works in hot-standby fashion: both Nagios instances are installed in the same manner and both run all the tests, but only one instance sends its results to the message bus, which is in turn consumed by the Operational Dashboard and the central MyEGI instance (https://grid-monitoring.cern.ch/myegi). Site availability and reliability are calculated from the results stored in the central MyEGI instance.

Current Active Nagios : gridppnagios.physics.ox.ac.uk

Standby  : gridppnagios.lancs.ac.uk

Procedure for switching Nagios Instance

    • Set NCG_BACKUP_INSTANCE=true in the site-info.def file of the instance that is going to act as backup.
    • Comment out the NCG_BACKUP_INSTANCE=true line in site-info.def on the active instance.
    • Run yaim on both instances: /opt/glite/yaim/bin/yaim -c -s /etc/yaim/site-info.def -n glite-UI -n glite-NAGIOS
      • The order of glite-UI and glite-NAGIOS in the line above matters.
    • Restart Nagios on both instances with /etc/init.d/nagios restart. This is needed because of a bug.
    • Update the Current Active Nagios entry on this wiki page (above) and here.

Running yaim takes around 20 minutes, so be ready to wait.
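The site-info.def edits in the switching procedure can be scripted. The following is a minimal sketch, not part of yaim: set_backup_role is a hypothetical helper name, and it assumes GNU sed and that the NCG_BACKUP_INSTANCE line starts at the beginning of a line in site-info.def. The yaim and restart commands in the trailing comments are the ones given in the procedure.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with yaim): toggle the
# NCG_BACKUP_INSTANCE line in a site-info.def file.
# Usage: set_backup_role <path-to-site-info.def> <standby|active>
set_backup_role() {
    file=$1
    role=$2
    case "$role" in
        standby)
            # Uncomment an existing line, or append one if none exists.
            if grep -q '^#* *NCG_BACKUP_INSTANCE=' "$file"; then
                sed -i 's/^#* *NCG_BACKUP_INSTANCE=.*/NCG_BACKUP_INSTANCE=true/' "$file"
            else
                echo 'NCG_BACKUP_INSTANCE=true' >> "$file"
            fi
            ;;
        active)
            # Comment the line out so this instance publishes results.
            sed -i 's/^NCG_BACKUP_INSTANCE=true/#NCG_BACKUP_INSTANCE=true/' "$file"
            ;;
        *)
            echo "usage: set_backup_role FILE standby|active" >&2
            return 1
            ;;
    esac
}

# After editing, on BOTH hosts (commands taken from the procedure):
#   /opt/glite/yaim/bin/yaim -c -s /etc/yaim/site-info.def -n glite-UI -n glite-NAGIOS
#   /etc/init.d/nagios restart
```

For example, set_backup_role /etc/yaim/site-info.def standby would be run on the host taking the backup role, and the same call with active on the other host.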

Other issues

  • If the top BDII at RAL goes down, or there are networking issues at the Tier-1, change BDII_HOST in site-info.def to another top BDII and run yaim.
  • Nagios uses a list of WMS servers to submit jobs:
 wms02.grid.hep.ph.ic.ac.uk lcgwms02.gridpp.rl.ac.uk svr023.gla.scotgrid.ac.uk lcgwms03.gridpp.rl.ac.uk

Keep an eye on this list to make sure that the majority of the WMS servers in it are not failing. If any WMS in the list is going to be down for a long time, it is better to remove it from the list: change site-info.def and run yaim. The other option is to manually change the /opt/glite/etc/ops/glite_* files and restart Nagios.
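Removing a failing WMS from the list is a simple string edit. As a rough sketch, the hypothetical drop_host helper below prints a space-separated host list with one entry removed. This page does not name the site-info.def variable that holds the WMS list, so the result still has to be pasted into the right place by hand before running yaim.

```shell
#!/bin/sh
# Hypothetical helper: print a space-separated host list with one host removed.
# Usage: drop_host "<host list>" <host-to-remove>
drop_host() {
    list=$1
    drop=$2
    out=""
    for h in $list; do
        # Skip the host being removed; keep every other entry in order.
        [ "$h" = "$drop" ] && continue
        out="${out:+$out }$h"
    done
    printf '%s\n' "$out"
}

# The WMS list from this page, with one host dropped as an illustration.
wms_list="wms02.grid.hep.ph.ic.ac.uk lcgwms02.gridpp.rl.ac.uk svr023.gla.scotgrid.ac.uk lcgwms03.gridpp.rl.ac.uk"
drop_host "$wms_list" svr023.gla.scotgrid.ac.uk
```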

  • I upload a long-term proxy to t2myproxy.physics.ox.ac.uk and lcgrbp01.gridpp.rl.ac.uk. Nagios uses only one MyProxy server at a time, and it is defined in the site-info.def file.