Steve Lloyd's Eponymous Tests Updated and Improved
Thu 4 Dec 2008
For the last two years, "Steve's Tests" have helped the UK Grid community by providing diagnostic information that helps system administrators debug their sites. Having just been moved to a dedicated server at Queen Mary, University of London, the suite of tests now runs without intervention, monitoring the UK Grid 24 hours a day, 365 days a year. Just before the move was finished, Steve also added a new network test to measure a user's view of the bandwidth between sites and within a site.
Started in January 2007 by Steve Lloyd, GridPP's collaboration board chair, as a single simple ATLAS job, the tests have become a complete set of tools for indicating and helping to diagnose the state of the UK Grid. The newest test, which Steve hopes will be the final one, is simply called Network Test and is an attempt to keep an eye on bandwidth issues at sites and between sites. When the job runs at a site, it copies a file from the Tier-1 to a storage element at that site. The file is then passed to the worker node on which the job is running, which moves a copy of the file to every other site in the UK. The time taken for each leg is recorded and the bandwidth calculated. Each site is then graded from red to green according to these numbers, with green (25MB/s) being the target.
These numbers give a good overview of what a user actually encounters when moving files around the Grid, whether from a remote location or within the site itself. However, the tests are not yet optimised: reliably measuring bandwidth is difficult, and the job's success rate needs to be improved. The current state of the UK's Grid network can be seen here: http://pprc.qmul.ac.uk/~lloyd/gridpp/nettest.html
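As a rough illustration of the grading described above, the sketch below turns one transfer leg (file size and elapsed time) into a bandwidth and maps it to a colour. Only the 25MB/s green target comes from the article; the function names, the intermediate "amber" band and its threshold are assumptions made for the example and are not taken from Steve's actual code.

```python
TARGET_MB_PER_S = 25.0  # green target quoted in the article


def bandwidth_mb_per_s(file_size_mb, seconds):
    """Bandwidth for one transfer leg, in MB/s."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return file_size_mb / seconds


def grade(bandwidth, target=TARGET_MB_PER_S):
    """Map a bandwidth to a colour.

    Assumption: anything at or above the target is green, at least half
    the target is amber, and everything else is red. The real test may
    use different bands.
    """
    if bandwidth >= target:
        return "green"
    if bandwidth >= target / 2:
        return "amber"
    return "red"


if __name__ == "__main__":
    # Hypothetical leg: a 500 MB file copied in 40 seconds.
    speed = bandwidth_mb_per_s(500, 40)
    print(speed, grade(speed))
```

Timing each leg separately, as the test does, means a slow result can be pinned on the Tier-1 copy, the local storage element or the transfer to a remote site.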

A screenshot of Steve's test at work
There are now three versions of the original ATLAS tests: running a simple Athena (the ATLAS software framework) "Hello World" straight from the ATLAS release libraries; building a custom "Hello World" from source; and attempting to analyse 100 pre-generated Z0 to e+e- events and calculate the Z0 mass. These tests pre-dated properly instrumented experiment 'dashboards' and gave sites useful diagnostic information for debugging, especially for misconfigured Worker Nodes in the early days, which were hard to spot with the other tests available at the time.
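The article does not say how the analysis job computes the Z0 mass. A standard approach, assumed here purely for illustration, is to take the invariant mass of the electron-positron pair: m = sqrt((E1 + E2)^2 - |p1 + p2|^2), in units where c = 1. The function name and the example numbers below are hypothetical and are not part of the ATLAS test.

```python
import math


def invariant_mass(e1, p1, e2, p2):
    """Invariant mass of two particles (natural units, c = 1).

    e1, e2 are energies in GeV; p1, p2 are (px, py, pz) momenta in GeV.
    """
    energy = e1 + e2
    px, py, pz = (a + b for a, b in zip(p1, p2))
    mass_squared = energy * energy - (px * px + py * py + pz * pz)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(mass_squared, 0.0))


if __name__ == "__main__":
    # Hypothetical back-to-back pair, treating the electrons as massless.
    electron = (45.6, (0.0, 0.0, 45.6))
    positron = (45.6, (0.0, 0.0, -45.6))
    print(invariant_mass(electron[0], electron[1], positron[0], positron[1]))
```

Over many events, the values from a calculation like this would be expected to cluster around the Z0 mass, which is what makes it a useful end-to-end check that a site can run a real analysis.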
Since the ATLAS tests are sent to each site's Computing Element (CE) individually, regardless of its status, they did not give a good indication of the overall performance seen by a user, whose jobs would not be sent to a site that was broken or down. Hence there is now a variation on the ATLAS tests (called the UK tests) in which the data analysis job is sent to any suitable, unspecified UK CE, which gives a much better indication of overall UK efficiency.
Sometimes the ATLAS tests failed not because of a problem at a particular site but because of failures in the general infrastructure, such as the Resource Broker (RB)/Workload Management System (WMS), Information System (BDII) or LCG File Catalogue (LFC). Separate tests were introduced to test each of these on its own. In the case of the RB/WMS, the results of the tests are used to dynamically select a 'good' RB/WMS to use when submitting jobs for the other tests. A separate Storage Element (SE) test was also introduced, which regularly tries to put a file on each UK SE, read it back and then delete it. This showed up several issues with user privileges, and it checks that normal users, as opposed to production managers, can use the SEs.
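The put, read back and delete cycle of the SE test can be sketched as below. This is a minimal illustration rather than the real test: InMemorySE is a hypothetical stand-in for a Storage Element and does not correspond to any Grid client API, and the file name and payload are invented for the example.

```python
class InMemorySE:
    """Hypothetical stand-in for a Storage Element (not a real Grid API)."""

    def __init__(self, name):
        self.name = name
        self._files = {}

    def put(self, path, data):
        self._files[path] = bytes(data)

    def get(self, path):
        return self._files[path]

    def delete(self, path):
        del self._files[path]


def se_round_trip(se, path="steve-test.dat", payload=b"hello grid"):
    """Put a file, read it back, compare it, then delete it.

    Returns (True, None) on success, or (False, step) naming the step
    that failed: "put", "get", "compare" or "delete".
    """
    step = "put"
    try:
        se.put(path, payload)
        step = "get"
        if se.get(path) != payload:
            return False, "compare"
        step = "delete"
        se.delete(path)
    except Exception:
        # Any error (for example a permissions problem) fails that step.
        return False, step
    return True, None


if __name__ == "__main__":
    print(se_round_trip(InMemorySE("example-se")))
```

Reporting which step failed matters here: a privilege problem of the kind the article mentions would typically show up as a failed "put" or "delete" rather than a corrupted read.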
As the Grid matured, the general testing infrastructure became more widespread and reliable, and the SAM (Service Availability Monitoring) tests became an important indicator of site performance. Although GridPP does not run its own SAM tests, the "Steve's Tests" framework is used to poll the SAM database regularly and store the results for display and historical reporting. Initially only the OPS (Operator) Critical Tests were monitored, but recently this has been extended to individual tests for ATLAS, CMS and LHCb as well.
For further information about Steve's jobs and the current status of the UK Grid see http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
© Copyright GridPP
If you wish to reproduce this piece please credit GridPP and contact Neasan O'Neill to say you are using it