UKI WLCG Regional Nagios
Revision as of 15:44, 27 September 2016
The GridPP Nagios service has been decommissioned and this page is no longer valid. WLCG now uses ARGO: https://www.gridpp.ac.uk/wiki/ARGO_Nagios_Monitoring
- 1 Nagios Status
- 2 Introduction of WLCG Nagios
- 3 UKI Regional Nagios Monitoring Infrastructure
- 4 Access to WLCG Nagios Portal
- 5 Understanding WLCG Nagios Tests
- 6 Rescheduling Tests
- 7 MyEGI Portal
- 8 Backup Regional Nagios at Lancaster
- 9 Maintenance and Troubleshooting
- 10 FAQ and Error Messages
- 11 Useful Links
Nagios Status
Current Active Nagios : https://gridppnagios.physics.ox.ac.uk/nagios
Standby : https://gridppnagios.lancs.ac.uk/nagios
MyEGI portal : https://mon.egi.eu/myegi/
Introduction of WLCG Nagios
WLCG Nagios is the replacement for the old centralized SAM system used to monitor the WLCG grid infrastructure. It enables regional entities, such as a ROC or NGI, to deploy and maintain their own regional monitoring infrastructure. Data produced by the regional Nagios is also used by project-level systems to carry out tasks such as Service Level Agreement (SLA) calculations. The Regional Dashboard uses the same data to show problems at sites, and tickets are subsequently created based on those alarms. WLCG Nagios was developed to integrate Nagios into WLCG monitoring as part of the EGEE-SA1 Multi Level Monitoring (MLM) approach. It is now maintained under EGI, and the current home page of WLCG Nagios is here. Apart from Nagios itself, the main components of WLCG Nagios are:
- Aggregated Topology Provider (ATP) : ATP is installed as part of the ROC/NGI WLCG Nagios package. It collects and aggregates topology information from various information providers such as GOCDB, the CIC Portal, BDII and the different VO feeds. It is the single authoritative information source for the current WLCG grid topology.
- Nagios Configuration Generator (NCG) : The configuration tool that generates the Nagios configuration from the current topology provided by ATP. A cron job runs NCG every six hours, so any change in topology appears in the Nagios portal within six hours.
- Messaging Infrastructure : An ActiveMQ-based messaging infrastructure is used to publish all test results. Every test result is published to a topic or queue on the message bus, so any tool can subscribe to that topic or queue and get the latest results. For example, the Regional Dashboard subscribes to the alarms queue and gets all results directly from the message bus.
UKI Regional Nagios Monitoring Infrastructure
Oxford hosts and maintains the Regional Nagios (gridppnagios) for the UK. WLCG Nagios runs on a Dell 610 machine with 16GB of RAM. Jobs are submitted through a WMS using a proxy generated from a robot certificate. We no longer use a dedicated WMS; the Nagios submission system picks a random WMS from a configured list. A dedicated SE (storage-monit.physics.ox.ac.uk) hosted at Oxford is used for the se-replication test of ARC-CE.
Access to WLCG Nagios Portal
Access to the WLCG Nagios portal is enabled for all members of the ops and dteam VOs, in addition to persons registered as site admin or regional staff in GOCDB. Site admins are also authorized to schedule tests for their respective sites. Regional staff members can schedule tests for all sites in the NGI.
Access can also be provided to other persons on request. Please send a mail to firstname.lastname@example.org.
Understanding WLCG Nagios Tests
Nagios plugins are available for almost all grid services and are explained in detail here. Explaining all the tests is out of the scope of this wiki, so I give only a brief overview of the CE and SE tests, since these are the main components of a grid site.
Nagios submits a job to the CE through a WMS, and the result is sent to the message bus directly from the WN. Nagios receives this result from the message bus and publishes it as a passive result in the Nagios portal. The CE test mainly comprises the following steps:
- Check environment variables such as LCG_GFAL_INFOSYS
- Check the versions of lcg-CA and the gLite middleware installed on the WN
- Copy and register a file to the close SE, then download it to the WN and compare it
- Replicate the same file to a central SE defined in the regional Nagios. For UKI, it is storage-monit.physics.ox.ac.uk, but it can be any SE.
- Delete all test files from the SE(s).
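Two of the checks above can be approximated locally. The following is a minimal sketch, not the actual SAM probe code: the helper names `require_env` and `files_identical` are hypothetical, and plain local files stand in for the file that is uploaded to and downloaded from the SE.

```shell
#!/bin/sh
# Hypothetical helpers illustrating two CE test steps; not the real probe.

# Succeed only if the named environment variable is set and non-empty.
require_env() {
    eval "value=\${$1:-}"
    [ -n "$value" ]
}

# Succeed only if the two files have identical contents
# (stands in for comparing the uploaded file with the downloaded copy).
files_identical() {
    cmp -s "$1" "$2"
}
```

For example, `require_env LCG_GFAL_INFOSYS || echo "LCG_GFAL_INFOSYS is not set"` reports a missing variable on the WN.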
For the SE, a wrapper script is launched from Nagios; it tests different metrics and publishes the results as passive results in the Nagios portal. The main metrics are:
- Get the full SRM endpoint and storage area from the BDII
- Copy a test file to the SRM and list it
- Get the transport URL of the file, download it to tmp space and then delete all test files.
Rescheduling Tests
Site admins registered in GOCDB can reschedule tests for their respective sites for troubleshooting.
Rescheduling SE test
Rescheduling an SE test is quite straightforward. Click on "org.sam.SRM-All-/ops/Role=lcgadmin" for your SE, then click on "Re-schedule the next check of this service" in the service command options, and then "commit". Note that org.sam.SRM-All is a wrapper metric, so this reschedules all tests for your SE. You cannot reschedule the other SE tests individually because they are passive tests.
Rescheduling CE test
For a CE, reschedule "org.sam.CREAMCE-JobState-/ops/Role=lcgadmin" in the same way as above. Here too, org.sam.CREAMCE-JobState is the wrapper test for all other CE tests. Do not try to reschedule any passive test: it throws an error, and the error persists. An extra complication with the CE test is that you can only reschedule org.sam.CREAMCE-JobState when its status is either OK or failure. You cannot reschedule it while the status is waiting, pending or running.
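The rule above can also be expressed as a small guard in front of a Nagios external command. This is a sketch under assumptions: the helper names are hypothetical, "failure" is assumed to appear as the Nagios state CRITICAL, and the host name in the usage note is made up. `SCHEDULE_SVC_CHECK` is the standard Nagios external command for scheduling the next check of a service.

```shell
#!/bin/sh
# Sketch only: helper names and the state mapping are assumptions.

# JobState may be rescheduled only when its status is OK or failed
# (assumed here to be reported as CRITICAL), not while waiting,
# pending or running.
can_reschedule() {
    case "$1" in
        OK|CRITICAL) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a Nagios external command line that schedules a service check.
# $1 = host name, $2 = service description,
# $3 = check time (epoch seconds), $4 = current time (epoch seconds)
schedule_svc_check_line() {
    printf '[%s] SCHEDULE_SVC_CHECK;%s;%s;%s\n' "$4" "$1" "$2" "$3"
}
```

The printed line would be appended to the Nagios command file, whose location is set by `command_file` in nagios.cfg and depends on the installation. For example, `can_reschedule OK && schedule_svc_check_line ce.example.org "org.sam.CREAMCE-JobState-/ops/Role=lcgadmin" "$(date +%s)" "$(date +%s)"` prints one such line for a hypothetical CE.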
MyEGI Portal
MyEGI is the visualization tool of the WLCG Nagios package. It replaces the SAMDB Portal. It consists of
- Regional GridMap
- Service View : Shows all services with their current status
- Metric Status View : Shows the status of all tests per service and flavour.
- History View : Shows a graph of the status of a service over time
Backup Regional Nagios at Lancaster
The backup Nagios functions in exactly the same way as the main regional Nagios. The only difference is that it doesn't send alarms to the Dashboard.
Maintenance and Troubleshooting
Switch Active Nagios between Oxford and Lancaster:
1. Uncomment #NCG_BACKUP_INSTANCE=true in site-info.def on the active Nagios and run yaim
/opt/glite/yaim/bin/yaim -c -s /etc/yaim/site-info.def -n NAGIOS -n SAM_NAGIOS
This turns the active Nagios into the backup Nagios.
2. Now comment out NCG_BACKUP_INSTANCE in site-info.def on the backup Nagios and run yaim in the same way. It becomes the active one.
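The two edits to site-info.def can be scripted. The following is a minimal sketch, assuming the line appears in the file exactly as `#NCG_BACKUP_INSTANCE=true` or `NCG_BACKUP_INSTANCE=true`; the helper names are hypothetical, and the yaim command is the one given in the steps above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Apply a sed expression ($1) to a file ($2), keeping the file's permissions.
edit_in_place() {
    tmp=$(mktemp) || return 1
    sed "$1" "$2" > "$tmp" && cat "$tmp" > "$2"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Uncomment NCG_BACKUP_INSTANCE=true (this host becomes the backup).
mark_as_backup() {
    edit_in_place 's/^#[[:space:]]*\(NCG_BACKUP_INSTANCE=true\)/\1/' "$1"
}

# Comment out NCG_BACKUP_INSTANCE (this host becomes the active one).
mark_as_active() {
    edit_in_place 's/^\(NCG_BACKUP_INSTANCE=.*\)/#\1/' "$1"
}

# Re-run yaim with the command from the steps above.
run_yaim() {
    /opt/glite/yaim/bin/yaim -c -s /etc/yaim/site-info.def -n NAGIOS -n SAM_NAGIOS
}
```

On the currently active host one would run `mark_as_backup /etc/yaim/site-info.def && run_yaim`, and on the backup host `mark_as_active /etc/yaim/site-info.def && run_yaim`.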
Changing WMS for SAM Nagios:
VO_OPS_WMS_HOSTS in site-info.def lists the WMS hosts used for submitting jobs to service nodes. To remove a misbehaving WMS from the list, there are two options:
1. Change VO_OPS_WMS_HOSTS in site-info.def and run yaim as above.
2. Edit /etc/glite-wms/ops/glite_wms.conf and glite_wmsui.conf directly and then restart Nagios.
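For the first option, the new value of VO_OPS_WMS_HOSTS can be computed with a small helper. This is only a sketch: it assumes the variable holds a space-separated list of host names (check the actual format in your site-info.def), and the function name and host names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes a space-separated host list; names are hypothetical.

# Print the host list ($1) with one host ($2) removed.
remove_wms() {
    result=""
    for host in $1; do
        if [ "$host" != "$2" ]; then
            result="${result:+$result }$host"
        fi
    done
    printf '%s\n' "$result"
}
```

For example, `remove_wms "wms1.example.org wms2.example.org" wms2.example.org` prints the list without the misbehaving host; the result still has to be written back to site-info.def before running yaim.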
Default Top Bdii:
The emi-cream.CREAMCE-DirectJobState test uses ldap://sam-bdii.cern.ch:2170 as the default top BDII. I have overridden this configuration on the Oxford SAM instance. The override is managed by Puppet, which changes the /etc/ncg/ncg-localdb.d/creamcedjs.conf file.
So if the RAL top BDII is going into an extended downtime, change this file. The Lancaster SAM Nagios instance uses the default value.
The same applies to the org.sam.SRM-All-/ops/Role=lcgadmin test. It has been overridden and is managed by Puppet. It requires /etc/ncg/ncg-localdb.d/srm.conf to be changed.
There is no mechanism for failover to a different top BDII.
FAQ and Error Messages
Q. I have added or removed a service; how long will Nagios take to update its configuration? A. Nagios reconfigures itself every six hours, so it will update within six hours of the information being published in the top BDII. No manual intervention is required.
Q. How do I subscribe to Nagios alerts? A. There is no facility for users to subscribe to alerts themselves. You have to ask the Regional Nagios admin for a subscription to email alerts.
Q. Nagios job failing with Exit Code!=0 A. The Nagios script launches a Simple-MTA process to send the result to the message bus. In most cases this error is thrown when the job completed but the Simple-MTA process could not be launched. The main things to check are:
- openldap-clients is not installed on the WN
- Run ldapsearch -b o=grid -h "TOP-BDII" -p 2170 -x "(GlueServiceType=msg.broker.stomp)" to see that the BDII defined on the WN is publishing information about the message broker
- Check BDII_LIST=lcgbdii.gridpp.rl.ac.uk:2170,lcg-bdii.gridpp.ac.uk:2170,lcg-bdii.cern.ch:2170
- Check that the firewall allows outgoing connections on TCP port 6163
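The first three checks in this answer can be collected into a few shell helpers to run on a WN. This is a sketch only: the function names are hypothetical, BDII_LIST is assumed to be a comma-separated list of host:port entries as shown above, and the ldapsearch options are copied from the answer.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Is ldapsearch (from openldap-clients) available on this node?
have_ldapsearch() {
    command -v ldapsearch >/dev/null 2>&1
}

# Print each entry of a comma-separated BDII_LIST ($1) as "host port".
split_bdii_list() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r host port; do
        if [ -n "$host" ]; then
            printf '%s %s\n' "$host" "$port"
        fi
    done
}

# Ask one BDII ($1 = host, $2 = port) whether it publishes a STOMP broker.
query_broker() {
    ldapsearch -b o=grid -h "$1" -p "$2" -x "(GlueServiceType=msg.broker.stomp)"
}
```

A typical use would be `have_ldapsearch || echo "install openldap-clients"`, followed by `split_bdii_list "$BDII_LIST" | while read -r host port; do query_broker "$host" "$port"; done`. The outgoing firewall rule for TCP port 6163 still has to be checked separately.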
Q. org.sam.WN-RepCr test failing with "send2nsd: NS002 - send error : Bad credentials cannot create" A. Check the CRLs on the WN; they may be one of the reasons.
Q. How do I recalculate the availability and reliability of sites in case of a problem? A. See https://wiki.egi.eu/wiki/PROC10 and https://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Re-computation+Policy
Useful Links
UKI MyEGI page https://gridppnagios.physics.ox.ac.uk/myegi
UKI WLCG Nagios https://gridppnagios.physics.ox.ac.uk/nagios
Nagios tests for one site https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?hostgroup=site-UKI-LT2-QMUL&style=detail
Nagios tests for one service, e.g. glexec https://gridppnagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?servicegroup=org.sam.glexec.CE&style=detail
Entry point for all SAM related information https://wiki.egi.eu/wiki/SAM
Operational Dashboard workflow https://forge.in2p3.fr/projects/opsportaluser/wiki/Operations_Dashboard
How to use the Dashboard and Nagios web interface (the second part of the presentation is good for Nagios) https://documents.egi.eu/public/RetrieveFile?docid=301&version=6&filename=Training_guide_general_v1.pdf