ATLAS Monitoring For Sites

Obsolete

Production Monitoring

Prodsys Dashboard

Click on your site name to get a view of the jobs run at your site only.
Click on the error link to get a breakdown of the types of errors which occurred at your site.
- The walltime plot of errors is more relevant than the job number plot, because errors at the beginning of a job waste little walltime and so are not so costly.
- Many errors are caused by job problems (TRFERROS) or by infrastructure which is offsite. The ones you, as a site, really need to watch out for are stage-in/stage-out errors (EXEPANDA_DQ2PUT_FILECOPYERROR, EXEPANDA_DQ2_STAGEIN). The stage-out errors are especially expensive, since they occur at the end of the job, after its walltime has already been spent.
Click on an individual error to see the ATLAS jobs which were affected by that error.
Click on an individual job to see the job's detailed information.
Click on the facilityid number to follow the link to the PanDA logs for this job.
- In particular, on the PanDA page, use "Show Logfile Extracts" to see important snippets of the pilot logs.
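
The point about walltime versus job count can be illustrated with a small sketch. It assumes you have somehow obtained a list of failed jobs as (error code, walltime in seconds) pairs; that input format, the function name and the sample numbers are hypothetical and are not something the dashboard is known to export. Only the two stage-in/stage-out error codes are taken from the text above.

```python
from collections import defaultdict

# Error codes named above as the ones a site should watch for.
SITE_RELEVANT = {"EXEPANDA_DQ2PUT_FILECOPYERROR", "EXEPANDA_DQ2_STAGEIN"}


def error_breakdown(failed_jobs):
    """Sum job count and lost walltime per error code.

    failed_jobs: iterable of (error_code, walltime_seconds) pairs
                 (a hypothetical input format, for illustration only).
    Returns rows of (code, n_jobs, walltime_seconds, site_relevant),
    sorted so the most expensive error in walltime comes first.
    """
    counts = defaultdict(int)
    walltime = defaultdict(float)
    for code, seconds in failed_jobs:
        counts[code] += 1
        walltime[code] += seconds
    rows = [(c, counts[c], walltime[c], c in SITE_RELEVANT) for c in counts]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample: three cheap stage-in failures, one expensive stage-out failure.
sample = [
    ("EXEPANDA_DQ2_STAGEIN", 120.0),
    ("EXEPANDA_DQ2_STAGEIN", 90.0),
    ("EXEPANDA_DQ2_STAGEIN", 150.0),
    ("EXEPANDA_DQ2PUT_FILECOPYERROR", 36000.0),
    ("TRFERROS", 5000.0),
]
breakdown = error_breakdown(sample)
for code, n_jobs, seconds, relevant in breakdown:
    print(f"{code:32s} jobs={n_jobs} walltime={seconds:.0f}s site_relevant={relevant}")
```

In this made-up sample the stage-in error affects the most jobs but costs the least walltime, while the single stage-out failure dominates the walltime ranking.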

Panda Dashboard

The PanDA production operations dashboard can give you a view of jobs which are still "inside" the production system. It gives a more detailed view of what is happening, but at the cost of extra complexity (we don't really expect sites to look at this level of detail).

You need to click on the "UK" link to expand the UK cloud sites.

The PanDA job states are:

PanDA Job States
State	Description
defined	awaiting brokerage to a site
assigned	waiting for input data to arrive at site
activated	ready to run at site
running	picked up by pilot and running
holding	task ran but output registration failed
transferring	waiting for outputs to transfer back to T1 (numbers in red are jobs within 12 hours of timing out and being failed)
finished	job ran successfully and outputs back at T1
failed	job failed
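
As a rough illustration of how the states in the table fit together, the sketch below stores them in a dictionary and counts how many jobs are still "inside" the production system. Treating "finished" and "failed" as the only final states is an assumption read off the table, and the function name and the input list of state strings are hypothetical.

```python
from collections import Counter

# State names and descriptions copied from the table above.
PANDA_JOB_STATES = {
    "defined": "awaiting brokerage to a site",
    "assigned": "waiting for input data to arrive at site",
    "activated": "ready to run at site",
    "running": "picked up by pilot and running",
    "holding": "task ran but output registration failed",
    "transferring": "waiting for outputs to transfer back to T1",
    "finished": "job ran successfully and outputs back at T1",
    "failed": "job failed",
}

# Assumption: these two are the only states in which a job has left the system.
FINAL_STATES = {"finished", "failed"}


def summarize_states(job_states):
    """Count jobs per state and the number still inside the system."""
    counts = Counter(job_states)
    unknown = set(counts) - set(PANDA_JOB_STATES)
    if unknown:
        raise ValueError(f"unknown PanDA job state(s): {sorted(unknown)}")
    in_system = sum(n for state, n in counts.items() if state not in FINAL_STATES)
    return counts, in_system


counts, in_system = summarize_states(
    ["running", "running", "holding", "finished", "failed", "transferring"]
)
print(counts["running"], in_system)
```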

You can also click on the graph icons to get historical plots for your site (or cloud).

Data Movement Monitoring

DDM Dashboard

The main interface for data movement monitoring is the DDM Dashboard. Clicking on the "RAL" cloud expands all of the sites in the UK. N.B. At the moment the dashboard effectively shows SRMv2 endpoints, which are named SITE_SPACETOKEN.
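
If you want to group the endpoints back into sites, the SITE_SPACETOKEN convention can be split as in the sketch below. It assumes that the space token itself contains no underscore, so the name is split at the last underscore; the endpoint names used are invented for illustration and the helper names are hypothetical.

```python
from collections import defaultdict


def split_endpoint(name):
    """Split a SITE_SPACETOKEN endpoint name into (site, space_token).

    Assumption: the space token contains no underscore, so the split
    is made at the last underscore in the name.
    """
    site, sep, token = name.rpartition("_")
    if not sep or not site or not token:
        raise ValueError(f"not of the form SITE_SPACETOKEN: {name!r}")
    return site, token


def group_by_site(endpoint_names):
    """Map each site to the list of space tokens seen for it."""
    sites = defaultdict(list)
    for name in endpoint_names:
        site, token = split_endpoint(name)
        sites[site].append(token)
    return dict(sites)


# Invented endpoint names, for illustration only.
grouped = group_by_site(
    ["EXAMPLE-SITE_DATADISK", "EXAMPLE-SITE_MCDISK", "OTHER-SITE_DATADISK"]
)
print(grouped)
```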

If you click on the error link you will see an expanded view which aggregates all errors of the same type. This is very useful. However, at the moment all errors on transfers to your site are reported, even those which occurred at the other end of the transfer, so read these errors carefully before getting too worried!
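
The idea of aggregating errors by type while separating out those from the other end of the transfer can be sketched as follows. It assumes each error has already been labelled, by reading it, with the side of the transfer it occurred on; the (side, message) record format, the "source"/"destination" labels and the sample messages are all hypothetical and are not a format the dashboard is known to provide.

```python
from collections import Counter


def aggregate_errors(errors):
    """Count identical errors.

    errors: iterable of (side, message) pairs, where side is "source" or
            "destination" (a hypothetical labelling, done by the reader).
    """
    return Counter(errors)


def errors_at_side(aggregated, side="destination"):
    """Keep only errors on one side; for transfers to your site, your
    site is the destination."""
    return {msg: n for (s, msg), n in aggregated.items() if s == side}


# Invented sample records, for illustration only.
sample_errors = [
    ("source", "file not found"),
    ("source", "file not found"),
    ("destination", "no space left"),
    ("source", "connection timed out"),
]
aggregated = aggregate_errors(sample_errors)
mine = errors_at_side(aggregated)
print(mine)
```

In this invented sample, three of the four reported errors occurred at the other end of the transfer and would not need action from the receiving site.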
