UKI WLCG Regional Nagios

From GridPP Wiki
{{KeyDocs|responsible=Kashif Mohammad|reviewdate=2017-02-15|accuratedate=2017-02-15|percentage=100}}
 

Latest revision as of 09:44, 14 June 2018

 The GridPP Nagios service has been decommissioned; this page is kept for historical purposes only.
 
 

WLCG now uses ARGO: https://www.gridpp.ac.uk/wiki/ARGO_Nagios_Monitoring

Nagios Status

Current Active Nagios : https://gridppnagios.physics.ox.ac.uk/nagios

Standby  : https://gridppnagios.lancs.ac.uk/nagios

My EGI portal  : https://mon.egi.eu/myegi/

Introduction to WLCG Nagios

WLCG Nagios replaced the old centralized SAM system for monitoring the WLCG grid infrastructure. It enables regional entities, such as a ROC or NGI, to deploy and maintain their own regional monitoring infrastructure. Data produced by the regional Nagios is also used by project-level systems for tasks such as Service Level Agreement (SLA) calculations. The Regional Dashboard uses the same data to show problems at sites, and tickets are subsequently created based on those alarms. WLCG Nagios was developed to integrate Nagios into WLCG monitoring as part of the EGEE-SA1 Multi Level Monitoring (MLM) approach. It is now maintained under EGI, and the current home page of WLCG Nagios is here. Apart from Nagios itself, the main components of WLCG Nagios are:


Aggregated Topology Provider (ATP): ATP is installed as part of the ROC/NGI WLCG Nagios package. It collects and aggregates topology information from various information providers such as GOCDB, the CIC Portal, BDII and the different VO feeds. It is the single authoritative information source for the current WLCG grid topology.

Nagios Configuration Generator (NCG): The configuration tool that generates the Nagios configuration from the current topology provided by ATP. A cron job runs NCG every six hours, so any change in topology appears in the Nagios portal within six hours.

Messaging Infrastructure: An ActiveMQ-based messaging infrastructure is used to publish all test results. The idea is that every test result is published to a topic or queue on the message bus, so that any tool can subscribe to that topic or queue and get the latest results. For example, the Regional Dashboard subscribes to the alarms queue and gets all results directly from the message bus.
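
The publish/subscribe idea described above can be illustrated with a small in-memory sketch. This is not the ActiveMQ API and not the WLCG Nagios code: the MessageBus class, the "alarms" destination name and the message fields are all hypothetical, chosen only to show how a consumer such as the Dashboard receives results without the publisher knowing about it.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """In-memory stand-in for a message broker (hypothetical, not ActiveMQ)."""

    def __init__(self) -> None:
        # Maps a destination name to the callbacks subscribed to it.
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, destination: str, callback: Callable[[dict], None]) -> None:
        """Register a callback to be invoked for every message on destination."""
        self._subscribers[destination].append(callback)

    def publish(self, destination: str, message: dict) -> int:
        """Deliver message to every subscriber; return how many received it."""
        callbacks = self._subscribers[destination]
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# A hypothetical dashboard that simply collects whatever arrives on "alarms".
bus = MessageBus()
dashboard_inbox: List[dict] = []
bus.subscribe("alarms", dashboard_inbox.append)
delivered = bus.publish("alarms", {"service": "CE", "status": "CRITICAL"})
```

The publisher only names a destination; any number of tools can subscribe to it independently, which is the property the text relies on.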

UKI Regional Nagios Monitoring Infrastructure

Oxford hosts and maintains the Regional Nagios (Gridppnagios) for the UK. WLCG Nagios runs on a Dell 610 machine with 16 GB of RAM. Jobs are submitted through WMS using a proxy generated from a robot certificate. We no longer use a dedicated WMS; the Nagios submission system picks a random WMS from a configured list. A dedicated SE (storage-monit.physics.ox.ac.uk) hosted at Oxford is used for the se-replication test of ARC-CE.

Access to WLCG Nagios Portal

Access to the WLCG Nagios portal is enabled for all members of the ops and dteam VOs, in addition to persons registered as site admin or regional staff in GOCDB. Site admins are also authorized to schedule tests for their respective sites. Regional staff members can schedule tests for all sites in the NGI.

Access can also be provided to other persons on request. Please send a mail to lcg_manager@physics.ox.ac.uk.

Understanding WLCG Nagios Tests

Nagios plugins are available for almost all grid services and are explained in detail here. Explaining every test is outside the scope of this wiki page, so this section gives a brief overview of the CE and SE tests, as these are the main components of a grid site.

CE Test

Nagios submits a job to the CE through WMS, and the result is sent to the message bus directly from the WN. Nagios receives this result from the message bus and publishes it as a passive result in the Nagios portal. The CE test consists mainly of the following steps:

  1. Check environment variables such as LCG_GFAL_INFOSYS.
  2. Check the versions of lcg-CA and the gLite middleware installed on the WN.
  3. Copy and register a file to the close SE, then download it to the WN and compare it.
  4. Replicate the same file to a central SE defined in the regional Nagios. For UKI it is storage-monit.physics.ox.ac.uk, but it can be any SE.
  5. Delete all test files from the SE(s).

SE test

A wrapper script launched from Nagios tests different metrics and publishes the results as passive results in the Nagios portal. The main metrics are:

  1. Get the full SRM endpoint and storage area from BDII.
  2. Copy a test file to the SRM and list it.
  3. Get the transport URL of the file, download it to temporary space, and then delete all test files.

Rescheduling Tests

Site admins registered in GOCDB can reschedule tests for their respective sites for troubleshooting.

Rescheduling SE test

Rescheduling the SE test is straightforward. Click on "org.sam.SRM-All-/ops/Role=lcgadmin" for your SE, then click on "Re-schedule the next check of this service" in the service command options, and then "commit". Note that org.sam.SRM-All is a wrapper metric, so rescheduling it reschedules all tests for your SE. You cannot reschedule the other SE tests because they are passive tests.

Rescheduling CE test

For the CE, reschedule "org.sam.CREAMCE-JobState-/ops/Role=lcgadmin" in the same way as above. Here too, org.sam.CREAMCE-JobState is the wrapper test for all other CE tests. Do not try to reschedule any passive test, because doing so throws an error that then persists. There is an extra complication for the CE test: you can only reschedule org.sam.CREAMCE-JobState if the status of this test is either OK or failure. You cannot reschedule it while the status is waiting, pending or running.
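
The rule for when the CE wrapper test may be rescheduled can be written down as a small check. This is only a sketch: the function name is hypothetical, and the status strings are the words used in the paragraph above, which may not match the exact labels shown in the Nagios portal.

```python
# States named in the text; the portal's exact labels may differ (assumption).
RESCHEDULABLE_STATES = {"ok", "failure"}
BLOCKING_STATES = {"waiting", "pending", "running"}


def can_reschedule_ce_wrapper(status: str) -> bool:
    """Return True if org.sam.CREAMCE-JobState may be rescheduled in this state."""
    state = status.strip().lower()
    if state in RESCHEDULABLE_STATES:
        return True
    if state in BLOCKING_STATES:
        return False
    # Anything else is not covered by the rule, so fail loudly instead of guessing.
    raise ValueError(f"unknown status: {status!r}")
```

Raising on unknown states is a deliberate choice here: the text only covers five states, so the sketch does not guess about others.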

MyEGI Portal

MyEGI is the visualization tool of the WLCG Nagios package. It replaces the SAMDB Portal. It consists of:

  • Regional GridMap
  • Service View : Shows all services with current status
  • Metric Status View : Shows the status of all tests per service and flavour.
  • History View : Shows a graph of the status of a service over time.

Backup Regional Nagios at Lancaster

https://gridppnagios.lancs.ac.uk/nagios

The backup Nagios functions in exactly the same way as the main regional Nagios. The only difference is that it does not send alarms to the Dashboard.


Maintenance and Troubleshooting

Switch Active Nagios between Oxford and Lancaster:

1. Uncomment #NCG_BACKUP_INSTANCE=true in site-info.def on the active Nagios and run yaim:

/opt/glite/yaim/bin/yaim -c -s /etc/yaim/site-info.def -n NAGIOS -n SAM_NAGIOS


This turns the active Nagios into the backup Nagios.


2. Now comment out NCG_BACKUP_INSTANCE in site-info.def on the backup Nagios and run yaim in the same way. It will become the active one.
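
The comment/uncomment step in the two instructions above can be sketched as a text transformation. This is an illustration under assumptions: the helper name is hypothetical, it works on the file contents as a string and leaves reading, writing and the yaim run to the operator, and OTHER_SETTING in the usage note is a made-up placeholder line.

```python
import re

# Matches the NCG_BACKUP_INSTANCE assignment whether or not it is commented out.
_BACKUP_LINE = re.compile(r"^(\s*)#?\s*(NCG_BACKUP_INSTANCE=.*)$")


def set_backup_instance(site_info_text: str, backup: bool) -> str:
    """Uncomment the setting (backup=True) or comment it out (backup=False)."""
    result = []
    for line in site_info_text.splitlines():
        match = _BACKUP_LINE.match(line)
        if match:
            indent, assignment = match.groups()
            line = f"{indent}{assignment}" if backup else f"{indent}#{assignment}"
        result.append(line)
    trailing_newline = "\n" if site_info_text.endswith("\n") else ""
    return "\n".join(result) + trailing_newline
```

For example, set_backup_instance("OTHER_SETTING=value\n#NCG_BACKUP_INSTANCE=true\n", True) returns the same text with the leading # removed from the second line. The yaim command shown above still has to be run afterwards on each host.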

Changing WMS for SAM Nagios:

VO_OPS_WMS_HOSTS in site-info.def lists the WMS hosts used for submitting jobs to service nodes. To remove a misbehaving WMS from the list, there are two options:


1. Change VO_OPS_WMS_HOSTS in site-info.def and run yaim as above.


2. Edit /etc/glite-wms/ops/glite_wms.conf and glite_wmsui.conf directly and then restart Nagios:

/etc/init.d/nagios restart
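
For the first option, dropping one host from the VO_OPS_WMS_HOSTS line could look like the sketch below. It assumes the value is a quoted, space-separated list of host names, which is not confirmed by this page; the helper name and the host names in the usage note are placeholders.

```python
def remove_wms_host(assignment: str, bad_host: str) -> str:
    """Return the VO_OPS_WMS_HOSTS assignment line without bad_host.

    Assumes the form NAME="host1 host2 ..." (space-separated, optionally quoted).
    """
    name, separator, raw_value = assignment.strip().partition("=")
    if not separator:
        raise ValueError("expected a NAME=value assignment")
    hosts = raw_value.strip().strip("\"'").split()
    kept = [host for host in hosts if host != bad_host]
    if not kept:
        # An empty list would leave Nagios with no WMS to submit through.
        raise ValueError("refusing to remove the last WMS host")
    joined = " ".join(kept)
    return f'{name}="{joined}"'
```

With placeholder hosts, remove_wms_host('VO_OPS_WMS_HOSTS="wms1.example.org wms2.example.org"', "wms2.example.org") gives VO_OPS_WMS_HOSTS="wms1.example.org". yaim still needs to be run afterwards, as in option 1.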


Default Top BDII:

The emi.cream.CREAMCE-DirectJobState test uses ldap://sam-bdii.cern.ch:2170 as its default top BDII. I have overridden this configuration on the Oxford SAM instance. The override is managed by Puppet, which changes the file /etc/ncg/ncg-localdb.d/creamcedjs.conf:

MODIFY_PARAMETRIC_PARAMETER!emi.cream.CREAMCE-DirectJobState!--ldap-uri!lcgbdii.gridpp.rl.ac.uk

So if the RAL top BDII is going into an extended downtime, change this file. The Lancaster SAM Nagios instance uses the default value.

The same applies to the org.sam.SRM-All-/ops/Role=lcgadmin test. It has also been overridden and is managed by Puppet. It requires /etc/ncg/ncg-localdb.d/srm.conf to be changed:

MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!lcgbdii.gridpp.rl.ac.uk
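
Both override lines shown above have the same shape: four fields joined by "!". A small helper can produce such a line when the top BDII has to be swapped by hand. This is a sketch inferred from those two lines only; the function name is hypothetical and the full NCG local-database syntax is not covered here.

```python
def localdb_line(directive: str, metric: str, option: str, value: str) -> str:
    """Join the four fields with '!' as in the override lines shown above."""
    fields = [directive, metric, option, value]
    for field in fields:
        # '!' is the separator, so it cannot appear inside a field.
        if not field or "!" in field or "\n" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "!".join(fields)


srm_override = localdb_line(
    "MODIFY_METRIC_PARAMETER", "org.sam.SRM-All", "--ldap-uri", "lcgbdii.gridpp.rl.ac.uk"
)
```

Passing a different host as the last argument gives the line to put in srm.conf while the RAL top BDII is down. Since both files are managed by Puppet, a manual edit would presumably need to be made in Puppet as well, or it may be reverted.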

There is no mechanism for failover to a different top BDII.


FAQ and Error Messages

   Q. I have added or removed a service. How long will Nagios take to update its configuration?
   A. Nagios reconfigures itself every six hours, so it will update within six hours of the information being published in the top BDII. No manual intervention is required.
   Q. How do I subscribe to Nagios alerts?
   A. There is no facility for users to subscribe to alerts themselves. Ask the Regional Nagios admin to set up email alerts for you.
   Q. Nagios job failing with Exit Code!=0
   A. The Nagios script launches a Simple-MTA process to send the result to the message bus. In most cases this error is thrown when the job completed but the
    Simple-MTA process could not be launched. The main things to check are:
    Check that openldap-clients is installed on the WN
    Run ldapsearch -b o=grid -h "TOP-BDII" -p 2170 -x "(GlueServiceType=msg.broker.stomp)" to check that the BDII defined on the WN is publishing
    information about the message broker
    Check BDII_LIST=lcgbdii.gridpp.rl.ac.uk:2170,lcg-bdii.gridpp.ac.uk:2170,lcg-bdii.cern.ch:2170
    Check that the firewall allows outgoing connections on port TCP:6163
   Q. org.sam.WN-RepCr test failing with "send2nsd: NS002 - send error : Bad credentials cannot create"
   A. Check the CRLs on the WN; they may be one of the reasons.
   Q. How to recalculate availability and reliability of sites in case of problems?
   A. https://wiki.egi.eu/wiki/PROC10
     https://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Re-computation+Policy
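
The checks listed under the "Exit Code!=0" answer above can be partly scripted on a WN. The sketch below is illustrative: the function names are hypothetical, the ldapsearch arguments are copied from the FAQ entry, and the broker host for the port check has to be supplied by the caller because this page does not name it.

```python
import socket
from typing import List, Tuple


def ldapsearch_command(top_bdii: str, port: int = 2170) -> List[str]:
    """Build the ldapsearch call from the FAQ as an argument list."""
    return [
        "ldapsearch", "-b", "o=grid", "-h", top_bdii, "-p", str(port),
        "-x", "(GlueServiceType=msg.broker.stomp)",
    ]


def parse_bdii_list(value: str) -> List[Tuple[str, int]]:
    """Split a BDII_LIST value such as 'host1:2170,host2:2170' into pairs."""
    pairs = []
    for entry in value.split(","):
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def can_connect(host: str, port: int = 6163, timeout: float = 5.0) -> bool:
    """Return True if an outgoing TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The list returned by ldapsearch_command can be passed to subprocess.run on a node where openldap-clients is installed. can_connect only shows whether a TCP connection is possible, so a False result points at either the firewall or the broker itself.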

Useful Links

UKI Myegi Page
  https://gridppnagios.physics.ox.ac.uk/myegi
UKI WLCG Nagios
  https://gridppnagios.physics.ox.ac.uk/nagios
Nagios Test for one site 
   https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?hostgroup=site-UKI-LT2-QMUL&style=detail

Nagios Test for one service, e.g. glexec
   https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=org.sam.glexec.CE&style=detail
Entry point for all SAM related information
   https://wiki.egi.eu/wiki/SAM


Operational Dashboard workflow
   https://forge.in2p3.fr/projects/opsportaluser/wiki/Operations_Dashboard 


How to use Dashboard and Nagios web interface : second part of presentation is good for Nagios
   https://documents.egi.eu/public/RetrieveFile?docid=301&version=6&filename=Training_guide_general_v1.pdf
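
The per-site and per-service links above follow one pattern: status.cgi with either a hostgroup or a servicegroup parameter plus style=detail. A minimal sketch for building such links, assuming that pattern holds for other sites and service groups (the function name is hypothetical, and the host is decommissioned, so the code only constructs the string):

```python
from urllib.parse import urlencode

STATUS_CGI = "https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi"


def status_url(group_type: str, group_name: str) -> str:
    """Build a detail-view link for a hostgroup or a servicegroup."""
    if group_type not in ("hostgroup", "servicegroup"):
        raise ValueError("group_type must be 'hostgroup' or 'servicegroup'")
    query = urlencode({group_type: group_name, "style": "detail"})
    return f"{STATUS_CGI}?{query}"


site_link = status_url("hostgroup", "site-UKI-LT2-QMUL")
service_link = status_url("servicegroup", "org.sam.glexec.CE")
```

The two example calls reproduce the two links listed above.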