ATLAS Monitoring For Sites

From GridPP Wiki

Production Monitoring

In case of problems related to the ATLAS VO, please contact

Prodsys Dashboard

Monitoring how your site is doing for ATLAS production is done through the Prodsys Dashboard.

From this dashboard you can navigate to the RAL cloud and see the success/failure rates for your site.

  • Click on your site name to get a view of jobs run at just your site.
  • Click on the error link to get a breakdown of the types of errors which occurred at your site.
    • The walltime plot of errors is more relevant than the job number plot: errors at the beginning of a job waste little walltime, so they are not so costly.
    • A lot of errors may occur because of job problems (TRFERROS) or infrastructure which is offsite. The ones you, as a site, really need to watch out for are stage in/out errors (EXEPANDA_DQ2PUT_FILECOPYERROR, EXEPANDA_DQ2_STAGEIN). The stage-out errors are especially expensive, since they occur after the job has already run.
  • Click on an individual error to see the ATLAS jobs which were affected by that error.
  • Click on an individual job to see the job's detailed information.
  • Click on the facilityid number to link to the PanDA logs for this job.
    • In particular, on the PanDA page, use "Show Logfile Extracts" to see important snippets of the logs from the pilot.
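The triage advice above (rank errors by walltime lost, and look first at stage in/out errors) can be sketched in a few lines of Python. This is only an illustration and not a dashboard API: the record layout of (error code, walltime in seconds) pairs and the function name summarise_errors are hypothetical, and the numbers would have to be copied by hand from the dashboard. Only the two error codes are taken from the text above.

```python
from collections import defaultdict

# Stage in/out error codes named above; these are the ones a site should watch.
SITE_RELEVANT = {"EXEPANDA_DQ2PUT_FILECOPYERROR", "EXEPANDA_DQ2_STAGEIN"}


def summarise_errors(records):
    """Rank error codes by walltime lost rather than by job count.

    `records` is a hypothetical list of (error_code, walltime_seconds) pairs.
    Returns (error_code, total_walltime, job_count, site_relevant) tuples,
    sorted with the most expensive error first.
    """
    walltime = defaultdict(float)
    jobs = defaultdict(int)
    for code, seconds in records:
        walltime[code] += seconds
        jobs[code] += 1
    summary = [
        (code, walltime[code], jobs[code], code in SITE_RELEVANT)
        for code in walltime
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = [
        ("EXEPANDA_DQ2_STAGEIN", 120),
        ("EXEPANDA_DQ2_STAGEIN", 90),
        ("EXEPANDA_DQ2PUT_FILECOPYERROR", 36000),
    ]
    for code, seconds, count, relevant in summarise_errors(sample):
        flag = "check your site" if relevant else "probably job or offsite"
        print(f"{code}: {count} job(s), {seconds:.0f} s lost, {flag}")
```

In the made-up sample, a single stage-out failure outweighs two early stage-in failures, which is the point of preferring the walltime plot to the job number plot.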

PanDA Dashboard

The PanDA production operations dashboard can give you a view of jobs which are still "inside" the production system. It gives you a more detailed view of what's happening, but at the expense of complexity (we don't really expect sites to look at this level of detail).

You need to click on the "UK" link to expand the UK cloud sites.

The PanDA job states are:

PanDA Job States

State         Description
defined       awaiting brokerage to a site
assigned      waiting for input data to arrive at site
activated     ready to run at site
running       picked up by pilot and running
holding       task ran but output registration failed
transferring  waiting for outputs to transfer back to T1 (numbers in red are jobs within 12 hours of timing out and being failed)
finished      job ran successfully and outputs back at T1
failed        job failed
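If you copy job states off the dashboard into a script, the list of states above can be held as a simple lookup. The following is a minimal sketch, not part of PanDA: the helper names count_by_state and in_flight are hypothetical, and treating only "finished" and "failed" as final states is an assumption read off the descriptions above.

```python
# Job states and descriptions, copied from the list above.
PANDA_JOB_STATES = {
    "defined": "awaiting brokerage to a site",
    "assigned": "waiting for input data to arrive at site",
    "activated": "ready to run at site",
    "running": "picked up by pilot and running",
    "holding": "task ran but output registration failed",
    "transferring": "waiting for outputs to transfer back to T1",
    "finished": "job ran successfully and outputs back at T1",
    "failed": "job failed",
}

# Assumption: only these two states are final; the list does not say so explicitly.
FINAL_STATES = {"finished", "failed"}


def count_by_state(job_states):
    """Count jobs per state, in the order listed above; reject unknown states."""
    counts = {state: 0 for state in PANDA_JOB_STATES}
    for state in job_states:
        if state not in counts:
            raise ValueError(f"unknown PanDA job state: {state!r}")
        counts[state] += 1
    return counts


def in_flight(counts):
    """Number of jobs that have not yet reached a final state."""
    return sum(n for state, n in counts.items() if state not in FINAL_STATES)


if __name__ == "__main__":
    # Made-up list of states, for illustration only.
    counts = count_by_state(["running", "running", "holding", "finished"])
    for state, n in counts.items():
        print(f"{state:<13}{n:>4}  {PANDA_JOB_STATES[state]}")
    print("still in the production system:", in_flight(counts))
```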

You can also click on the graph icons to get historical plots for your site (or cloud).

Data Movement Monitoring

DDM Dashboard

The main interface for data movement monitoring is the DDM Dashboard. Clicking on the "RAL" cloud expands all of the sites in the UK. N.B. At the moment the dashboard effectively shows SRMv2 endpoints, which are named SITE_SPACETOKEN.
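To group dashboard entries by site, the SITE_SPACETOKEN naming can be split back into its two parts. The sketch below assumes that the space token itself contains no underscore, so the split is made at the last underscore; that assumption, the function name split_endpoint and the example name are all illustrative and not taken from the dashboard.

```python
def split_endpoint(name):
    """Split a 'SITE_SPACETOKEN' endpoint name into (site, space_token).

    Assumption: the space token contains no underscore, so the name is
    split at its last underscore. Site names may contain underscores.
    """
    site, sep, token = name.rpartition("_")
    if not sep or not site or not token:
        raise ValueError(f"not of the form SITE_SPACETOKEN: {name!r}")
    return site, token


if __name__ == "__main__":
    # Hypothetical endpoint name, for illustration only.
    print(split_endpoint("EXAMPLE-SITE_EXAMPLETOKEN"))
```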

If you click on the error link you will see an expanded view which aggregates all errors of the same type. This is very useful. However, at the moment all errors on transfers to your site are reported, even those which occurred at the other end of the transfer, so read these errors carefully before getting too worried!