Since December 2014, a new way to measure ATLAS computing availability has been introduced. It's called ASAP / ATLAS Site Availability and Performance.  From what I have gleaned about it, ASAP is a metric to replace ADCD site status. The status of the "PandaResource" of an analysis queue is used by ASAP (production queues may be included in future). A site is considered to be unavailable when its analysis queue is in test mode. This document will briefly describe some of the implications of this.
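
To spell out that rule, here is a minimal sketch in Python of the statement above. It is not the official ASAP implementation, and the "online" status string is just an assumed example of a non-test state:

<pre>
# The ASAP rule as described above: a site counts as unavailable while the
# PandaResource of its analysis queue is in "test" mode.
# "online" is an assumed example of a non-test status, for illustration only.
def site_available(analysis_queue_status):
    return analysis_queue_status.lower() != "test"

print(site_available("online"))  # True  -> site counts as available
print(site_available("test"))    # False -> site counts as unavailable
</pre>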
  
 
==HammerCloud==

Various tests are sent out to sites using a framework called "HammerCloud". The results of the various tests are combined using an algorithm; the HammerCloudATLASOperations wiki page describes it all. The upshot is that once a set of tests has failed, the site is set to test, i.e. not available. Ultimately that is used by ASAP to determine the site's unavailability periods.
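
For intuition only, here is a rough sketch of the kind of decision this leads to. The real algorithm is the one described on the HammerCloudATLASOperations page; the consecutive-failure rule and the threshold below are my own assumptions, purely for illustration:

<pre>
# Illustrative sketch only, NOT the real HammerCloud algorithm.
# Assumption: a queue goes to "test" after N consecutive functional-test failures.
ASSUMED_THRESHOLD = 3  # made-up number, for illustration

def queue_state(test_results, threshold=ASSUMED_THRESHOLD):
    """test_results: list of 'ok'/'failed' strings, oldest first."""
    streak = 0
    for result in test_results:
        streak = streak + 1 if result == "failed" else 0
    return "test" if streak >= threshold else "online"

print(queue_state(["ok", "failed", "failed", "failed"]))  # -> test (blacklisted)
print(queue_state(["failed", "ok", "failed"]))            # -> online
</pre>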
 
==How to get the important alerts ==
 
Unfortunately, at the moment, when a queue is set to test mode, the notification email is sent to cloud support and doesn’t necessarily go to our site admins. Some site admins may be able to subscribe themselves to the list (atlas-support-cloud-uk@cern.ch) at [https://e-groups.cern.ch/e-groups/Egroup.do?egroupId=133943&tab=3 e-groups.cern.ch]. Admins without the necessary security credentials can request to be subscribed; ask Elena Korolkova, Alessandra Forti or another GridPP representative of ATLAS.
  
 
Once you are getting the alerts, it's usually easy to set up filters that can find the messages for your site by searching the subject field for the name of the site's queues. A list of all Panda queues can be found here: [https://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=10101 dashb-atlas-ssb.cern.ch].
 
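If you would rather script this than rely on your mail client's filter rules, something like the following works. It is only a sketch: the IMAP server, credentials and folder are placeholders you would replace with your own, and the queue name is just the Liverpool example used later on this page:

<pre>
import imaplib

# Placeholders: substitute your own mail server, credentials and queue names.
IMAP_HOST = "imap.example.ac.uk"
QUEUES = ["ANALY_LIV_SL6"]

conn = imaplib.IMAP4_SSL(IMAP_HOST)
conn.login("username", "password")
conn.select("INBOX", readonly=True)

for queue in QUEUES:
    # Search the Subject field for the queue name, as suggested above.
    status, data = conn.search(None, '(SUBJECT "%s")' % queue)
    ids = data[0].split() if status == "OK" else []
    print("%s: %d matching alert(s)" % (queue, len(ids)))

conn.logout()
</pre>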
  
Alternatively, the HammerCloud documentation contains a [https://twiki.cern.ch/twiki/bin/viewauth/IT/HammerCloudATLASOperations#1_1_2_Setting_a_contact_email_fo section] that describes how to get additional individual emails per queue, although this functionality has not been working for a while, so you may run into trouble with it.
 
  
 
==Where to check ASAP site status==
 
You can check your site's ASAP status in the places below; the monitoring pages have buttons to select various sites, time-scales etc.

* In WLCG monitoring
** [https://wlcg-mon.cern.ch/dashboard/request.py/siteviewhistory?columnid=782#time=24&start_date=&end_date=&values=false&spline=false&debug=false&resample=false&sites=all&clouds=all analysis]
** [https://wlcg-mon.cern.ch/dashboard/request.py/siteviewhistory?columnid=1394#time=24&start_date=&end_date=&values=false&spline=false&debug=false&resample=false&sites=all&clouds=all production]
* In SAM3 interface
** [http://wlcg-sam-atlas.cern.ch/templates/ember/#/historicalsmry/heatMap?group=ATLAS_Cloud_FR&profile=ATLAS_AnalysisAvailability&time=7d analysis]
** [http://wlcg-sam-atlas.cern.ch/templates/ember/#/historicalsmry/heatMap?group=ATLAS_Cloud_FR&profile=ATLAS_ProductionAvailability&time=7d production]
* A new monitor will be prepared in [https://monit-grafana.cern.ch/dashboard/db/home?orgId=17 grafana]

<!--
* In Kibana (to be moved to grafana)
** https://monit-kibana.cern.ch/goto/7a7e2d1eb9354769538da98914fa4ac1
** https://monit-kibana.cern.ch/goto/f0d86fa60056b50db2bf6863b53fd97a
-->
  
 
==How to find ASAP related HC detailed status records==
 
Once you have had an alert or otherwise found out that you have a problem, it's time to look into the cause. The [http://hammercloud.cern.ch/hc/app/atlas/siteoverview/select HammerCloud Site Overview] is a good place to begin. Select your site and a time range to examine, and choose functional tests (FTs, used for blacklisting). You'll get some tables representing the PandaResources (queues) at your site. I can't understand the naming convention, but the important queue at present follows this pattern: ANALY_<site_mnemonic>_SL6, e.g. for Liverpool, the relevant one is ANALY_LIV_SL6. Each table has tabs along the top that represent the test type (template). You need to check them all. Look for tests that are f (red) or m (orange); those are the ones that have caused trouble. Clicking on a test takes you to its summary. You use the information on this page, and the ones linked from it, to debug the problem. The first thing to note is the Job info field in the second table (Job List). If the Job List table has several entries, choose jobs with a status of "failed".
  
 
==How to use HC detailed status records to debug common scenarios==
 
A website called [http://bigpanda.cern.ch/sites/?cloud=UK bigpanda] is the primary source of information. Click on a PanDA resource name in the list on the left (e.g. ANALY_LIV_SL6) and on the next page click on "View: ... jobs" to see the full list of jobs, or "View: ... job errors" to limit the output to jobs that went wrong. It's your own choice where to go next, but the links in the "Site error summary" table bring up a summary table for one job. Clicking on a job's PandaID gives you access to all the job's log files, stdout, stderr etc.
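
If you want to pull the same failed-job list out of bigpanda from a script rather than the browser, the monitor can return JSON when asked for it. The snippet below is a sketch under that assumption; the computingsite / jobstatus / days parameters mirror what the web pages show, and you should check the BigPanda documentation if it doesn't behave as expected:

<pre>
import json
import urllib.request

# Assumed usage of the BigPanda monitor's JSON interface: same URL as the web
# page, but requesting application/json. ANALY_LIV_SL6 is the Liverpool
# example queue used elsewhere on this page.
url = "http://bigpanda.cern.ch/jobs/?computingsite=ANALY_LIV_SL6&jobstatus=failed&days=1"
req = urllib.request.Request(url, headers={"Accept": "application/json"})

with urllib.request.urlopen(req, timeout=60) as resp:
    data = json.loads(resp.read().decode())

# The exact layout of the reply is not documented here; this just prints a
# count so you can see what came back.
jobs = data.get("jobs", data) if isinstance(data, dict) else data
print("got %d failed job records" % len(jobs))
</pre>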
 
An online [http://indico.cern.ch/event/276502/session/4/contribution/22/material/slides/0.pdf tutorial] shows how to use the BigPanda monitor. Please note that the BigPandaMonitor is presently incomplete and lacks some user interface controls. Nonetheless, it is useful and the tutorial gives clues about how to navigate it.
  
 
===Obvious cases===
 
Sometimes you strike it lucky, and the job info field tells you exactly what's wrong. For example, PanDA ID 2346079525 has this entry:
 
<pre>
lcg-cp:  error while loading shared libraries: libglobus_ftp_client.so.2: 
</pre>
  
 
Obviously there's a bad library version on the worker node.
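
A quick way to confirm that diagnosis on a suspect worker node (just an illustrative check, not an official procedure) is to try loading the library that the error message names:

<pre>
import ctypes

# Try to load the library that lcg-cp complained about in the error above.
# An OSError here means the node is missing (or has a broken)
# libglobus_ftp_client.so.2, matching the HammerCloud job failure.
try:
    ctypes.CDLL("libglobus_ftp_client.so.2")
    print("libglobus_ftp_client.so.2 loads fine on this node")
except OSError as err:
    print("problem loading libglobus_ftp_client.so.2: %s" % err)
</pre>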
  
As often as not, a lot more digging is required to find the cause. The following sections outline various approaches to different issues.
 
===More Obscure Errors ===

* blah
* blah
* blah
