UKI WLCG Regional Nagios

From GridPP Wiki

Revision as of 14:58, 3 October 2014

Nagios Status

Current Active Nagios : https://gridppnagios.physics.ox.ac.uk/nagios

Standby  : https://gridppnagios.lancs.ac.uk/nagios

Introduction to WLCG Nagios

WLCG Nagios replaces the old centralized SAM system for monitoring the WLCG grid infrastructure. It enables regional entities, such as a ROC or NGI, to deploy and maintain their own regional monitoring infrastructure. Data produced by a regional Nagios is also used by project-level systems for tasks such as Service Level Agreement (SLA) calculations. The Regional Dashboard uses the same data to show problems at sites, and tickets are subsequently created based on those alarms. WLCG Nagios was developed to integrate Nagios into WLCG monitoring as part of the EGEE-SA1 Multi Level Monitoring (MLM) approach. It is now maintained under EGI, and the current home page of WLCG Nagios is here. Apart from Nagios itself, the main components of WLCG Nagios are:


Aggregated Topology Provider (ATP) : ATP is installed as part of the ROC/NGI WLCG Nagios package. It collects and aggregates topology information from various information providers such as GOCDB, the CIC Portal, BDII and the different VO feeds. It is the single authoritative information source for the current WLCG grid topology.

Nagios Configuration Generator (NCG) : The configuration tool that generates the Nagios configuration from the current topology provided by ATP. A cron job runs NCG every six hours, so any change in topology appears in the Nagios portal within six hours.

Messaging Infrastructure : An ActiveMQ-based messaging infrastructure is used to publish all test results. The idea is that every test result is published to a topic or queue on the message bus, so any tool can subscribe to that topic or queue and receive the latest results. For example, the Regional Dashboard subscribes to the alarms queue and gets all results directly from the message bus.
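The publish/subscribe idea behind the messaging infrastructure can be illustrated with a toy in-memory bus. This is only a sketch of the pattern, not of ActiveMQ itself: the `MessageBus` class, the topic name "alarms", the host name and the message fields are all hypothetical and do not come from the WLCG Nagios code.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Toy in-memory stand-in for a topic-based message broker."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks subscribed to it.
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        """Register a consumer for every message published on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to all subscribers; return how many received it."""
        receivers = self._subscribers[topic]
        for callback in receivers:
            callback(message)
        return len(receivers)


# A hypothetical dashboard that simply collects every alarm it receives.
bus = MessageBus()
dashboard_inbox: List[dict] = []
bus.subscribe("alarms", dashboard_inbox.append)

delivered = bus.publish(
    "alarms",
    {"host": "ce.example.org", "metric": "org.sam.CREAMCE-JobState", "status": "CRITICAL"},
)
```

The point of the design is that the producer of a test result does not need to know who consumes it; a new tool only has to subscribe to the relevant topic.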

UKI Regional Nagios Monitoring Infrastructure

Oxford hosts and maintains the Regional Nagios (Gridppnagios) for the UK. WLCG Nagios runs on a Dell 610 machine with 16GB of RAM. Jobs are submitted through WMS using a proxy generated from a robot certificate. We no longer use a dedicated WMS; the Nagios submission system picks a WMS at random from a configured list. A dedicated SE (storage-monit.physics.ox.ac.uk) hosted at Oxford is used for the se-replication test of ARC-CE.

Access to WLCG Nagios Portal

Access to the WLCG Nagios portal is enabled for all members of the ops and dteam VOs, in addition to people registered as site admin or regional staff in GOCDB. Site admins are also authorized to schedule tests for their respective sites. Regional staff members can schedule tests for all sites in the NGI.

Access can also be granted to other people on request. Please send an email to lcg_manager@physics.ox.ac.uk.

Understanding WLCG Nagios Tests

Nagios plugins are available for almost all grid services and are explained in detail here. Explaining every test is outside the scope of this wiki, so this section gives a brief overview of the CE and SE tests, as these are the main components of a grid site.

CE Test

Nagios submits a job to the CE through WMS, and the result is sent to the message bus directly from the WN. Nagios receives this result from the message bus and publishes it as a passive result in the Nagios portal. The CE test consists mainly of the following steps:

  1. Check environment variables such as LCG_GFAL_INFOSYS
  2. Check the versions of lcg-CA and the gLite middleware installed on the WN
  3. Copy and register a file to the close SE, then download it to the WN and compare it with the original
  4. Replicate the same file to a central SE defined in the regional Nagios. For UKI this is storage-monit.physics.ox.ac.uk, but it can be any SE.
  5. Delete all test files from the SE(s).
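Two of the steps above, checking the environment and comparing the downloaded copy with the original, can be sketched locally as follows. This is a minimal illustration and not the actual probe: the function names are hypothetical, and comparing by SHA-256 checksum is an assumption, since the comparison method used by the real test is not described here.

```python
import hashlib
import os
import tempfile


def infosys_is_set(env=os.environ) -> bool:
    """Step 1: the worker node should define a non-empty LCG_GFAL_INFOSYS."""
    return bool(env.get("LCG_GFAL_INFOSYS", "").strip())


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def files_match(original: str, downloaded: str) -> bool:
    """Step 3: compare the downloaded copy with the original (assumed: by checksum)."""
    return sha256_of(original) == sha256_of(downloaded)


# Usage with temporary local files standing in for the SE round trip.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "testfile")
    copy = os.path.join(workdir, "testfile.downloaded")
    for path in (source, copy):
        with open(path, "w") as handle:
            handle.write("nagios test payload\n")
    round_trip_ok = files_match(source, copy)
```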

SE test

A wrapper script launched from Nagios tests the different metrics and publishes the results as passive results in the Nagios portal. The main metrics are:

  1. Get the full SRM endpoint and storage area from BDII
  2. Copy a test file to the SRM and list it
  3. Get the transport URL of the file, download it to tmp space, and then delete all test files.
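In stock Nagios, a passive result is typically submitted by writing a PROCESS_SERVICE_CHECK_RESULT line to the external command file. The sketch below only formats such a line; whether the WLCG wrapper script uses exactly this mechanism is an assumption, and the host name, the metric name "example-passive-metric" and the helper function are hypothetical.

```python
import time
from typing import Optional

# Standard Nagios plugin return codes.
STATUS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result_command(
    host: str,
    service: str,
    status: str,
    output: str,
    timestamp: Optional[int] = None,
) -> str:
    """Format a Nagios external command that submits one passive service result."""
    when = int(time.time()) if timestamp is None else int(timestamp)
    code = STATUS_CODES[status]
    return f"[{when}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"


line = passive_result_command(
    "se.example.org",
    "example-passive-metric",
    "OK",
    "test file copied and listed",
    timestamp=1412348280,
)
```

A wrapper metric would emit one such line per sub-test, which is why the individual SE metrics show up as passive services in the portal.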

Rescheduling Tests

Site admins registered in GOCDB can reschedule tests for their respective sites for troubleshooting.

Rescheduling SE test

Rescheduling an SE test is straightforward. Click on "org.sam.SRM-All-/ops/Role=lcgadmin" for your SE, then click "Re-schedule the next check of this service" in the service command options, and then "commit". Note that org.sam.SRM-All is a wrapper metric, so this reschedules all tests for your SE. You cannot reschedule the other SE tests individually because they are passive tests.

Rescheduling CE test

For a CE, reschedule "org.sam.CREAMCE-JobState-/ops/Role=lcgadmin" in the same way as above. Here too, org.sam.CREAMCE-JobState is the wrapper test for all other CE tests, and you should not try to reschedule any passive test because doing so throws an error that persists. There is an extra complication with the CE test: you can only reschedule org.sam.CREAMCE-JobState if its status is either OK or failure. You cannot reschedule it while the status is waiting, pending or running.
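The rescheduling rules for SE and CE tests can be summarised as a small decision function. This is a sketch of the rules as stated in this section, not code from Nagios: the function name is hypothetical, and the status strings are taken literally from the text rather than from the Nagios state model.

```python
# Wrapper metrics named in this section; everything else is treated as passive.
SE_WRAPPER = "org.sam.SRM-All"
CE_WRAPPER = "org.sam.CREAMCE-JobState"

# The CE wrapper may only be rescheduled in these states (as given in the text).
CE_RESCHEDULABLE_STATES = {"ok", "failure"}


def may_reschedule(service: str, status: str) -> bool:
    """Return True if the text above allows rescheduling this service now."""
    if service.startswith(SE_WRAPPER):
        # No status restriction is described for the SE wrapper.
        return True
    if service.startswith(CE_WRAPPER):
        return status.strip().lower() in CE_RESCHEDULABLE_STATES
    # Passive tests must not be rescheduled at all.
    return False
```

For example, `may_reschedule("org.sam.CREAMCE-JobState-/ops/Role=lcgadmin", "running")` is false, so the site admin should wait for the job to finish before rescheduling.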

MyEGI Portal

MyEGI is the visualization tool of the WLCG Nagios package. It replaces the SAMDB Portal. It consists of:

  • Regional GridMap
  • Service View : Shows all services with current status
  • Metric Status View : Shows the status of all tests per service and flavour.
  • History View : Shows a graph of the status of a service over time

Backup Regional Nagios at Lancaster

https://gridppnagios.lancs.ac.uk/nagios

The backup Nagios functions in exactly the same way as the main regional Nagios. The only difference is that it does not send alarms to the Dashboard.

FAQ and Error Messages

   Q. I have added or removed a service. How long will Nagios take to update its configuration?
   A. Nagios reconfigures itself every six hours, so it will update within six hours of the information being published in the top BDII. No manual intervention is required.
   Q. How do I subscribe to Nagios alerts?
   A. There is no facility for users to subscribe to alerts themselves. You have to ask the Regional Nagios admin to subscribe you to email alerts.
   Q. Nagios job failing with Exit Code!=0
   A. The Nagios script launches a Simple-MTA process to send the result to the message bus. In most cases this error means that the job completed but the Simple-MTA process could not be launched. The main things to check are:
    openldap-clients is not installed on the WN
    Run ldapsearch -b o=grid -h "TOP-BDII" -p 2170 -x "(GlueServiceType=msg.broker.stomp)" to verify that the BDII defined on the WN is publishing
    information about the message broker
    Check BDII_LIST=lcgbdii.gridpp.rl.ac.uk:2170,lcg-bdii.gridpp.ac.uk:2170,lcg-bdii.cern.ch:2170
    Check that the firewall allows outgoing connections on TCP port 6163
   Q. org.sam.WN-RepCr test failing with "send2nsd: NS002 - send error : Bad credentials cannot create"
   A. Check the CRLs on the WN; this may be one of the causes.
   Q. How do I get the availability and reliability of a site recalculated in case of a problem?
   A. See https://wiki.egi.eu/wiki/PROC10
      https://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Re-computation+Policy
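For the "Exit Code!=0" entry above, the BDII_LIST and firewall checks can be scripted from a worker node. The following is a rough sketch using only the Python standard library; the helper names are hypothetical, and the broker host to test on port 6163 is not given in this page, so it must be supplied by the caller.

```python
import socket
from typing import List, Tuple


def parse_bdii_list(value: str) -> List[Tuple[str, int]]:
    """Split a BDII_LIST value of the form host:port,host:port into pairs."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {item!r}")
        endpoints.append((host, int(port)))
    return endpoints


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if an outgoing TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


bdii_endpoints = parse_bdii_list(
    "lcgbdii.gridpp.rl.ac.uk:2170,lcg-bdii.gridpp.ac.uk:2170,lcg-bdii.cern.ch:2170"
)
```

Calling `tcp_port_open(host, port)` for each parsed BDII endpoint, and `tcp_port_open(broker_host, 6163)` for the message broker returned by the ldapsearch query, shows whether the outgoing connections are blocked.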

Useful Links

UKI MyEGI Page
   https://gridppnagios.physics.ox.ac.uk/myegi
UKI WLCG Nagios
   https://gridppnagios.physics.ox.ac.uk/nagios
Nagios tests for one site
   https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?hostgroup=site-UKI-LT2-QMUL&style=detail
Nagios tests for one service, e.g. glexec
   https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=org.sam.glexec.CE&style=detail
Entry point for all SAM related information
   https://wiki.egi.eu/wiki/SAM
Operational Dashboard workflow
   https://forge.in2p3.fr/projects/opsportaluser/wiki/Operations_Dashboard
How to use the Dashboard and Nagios web interface (the second part of the presentation is good for Nagios)
   https://documents.egi.eu/public/RetrieveFile?docid=301&version=6&filename=Training_guide_general_v1.pdf
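The per-site and per-service status links above follow a regular pattern, so equivalent links for other sites or service groups can be generated. The sketch below assumes that host groups are always named "site-" followed by the site name, which is inferred from the single example above; the function names are hypothetical.

```python
from urllib.parse import urlencode

STATUS_CGI = "https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi"


def site_status_url(site_name: str) -> str:
    """Detail view of all tests for one site (assumes the 'site-' hostgroup prefix)."""
    query = urlencode({"hostgroup": f"site-{site_name}", "style": "detail"})
    return f"{STATUS_CGI}?{query}"


def service_status_url(service_group: str) -> str:
    """Detail view of all tests for one service group, such as org.sam.glexec.CE."""
    query = urlencode({"servicegroup": service_group, "style": "detail"})
    return f"{STATUS_CGI}?{query}"


qmul_url = site_status_url("UKI-LT2-QMUL")
glexec_url = service_status_url("org.sam.glexec.CE")
```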


This page is a Key Document, and is the responsibility of Kashif Mohammad. It was last reviewed on 2014-10-03 when it was considered to be 100% complete. It was last judged to be accurate on 2014-10-03.