ATLAS AOD File Access

From GridPP Wiki

Latest revision as of 11:32, 20 August 2010

This page describes how ATLAS analysis jobs access their AOD ROOT files, some recent work done to improve that access, and the testing of those improvements at sites in the UK.

A talk on this was given at the April 2010 GridPP Storage Workshop.

ROOT I/O / Reordering / TTreeCaches

ROOT data files contain data structures called Trees, which in turn contain data objects called Branches for a series of Entries. When these files are written they can end up ordered by Branch, by Entry, or in no particular order. If branches are scattered throughout the file then most attempts to analyse the data will result in "random"-like access to the file, with many seeks, and so put additional load on site storage. ATLAS have been exploring reordering the baskets in the file by entry, and also the use of TTreeCache.
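The link between on-disk ordering and seeks can be illustrated with a toy model. The sketch below is not ROOT code: it assumes every basket has the same size, that an analysis reads every branch for each range of entries in turn, and that any read which does not start where the previous one ended counts as a seek. The function names build_layout and count_seeks are made up for this illustration.

```python
import random


def build_layout(n_branches, n_baskets, order, seed=0):
    """Return the (branch, basket) pairs in the order they sit on disk."""
    baskets = [(b, k) for b in range(n_branches) for k in range(n_baskets)]
    if order == "branch":
        # All baskets of branch 0, then all baskets of branch 1, ...
        return baskets
    if order == "entry":
        # For each entry range, the baskets of every branch sit together.
        return sorted(baskets, key=lambda bk: (bk[1], bk[0]))
    if order == "random":
        rng = random.Random(seed)
        rng.shuffle(baskets)
        return baskets
    raise ValueError("unknown order: %r" % (order,))


def count_seeks(layout, n_branches, n_baskets):
    """Count non-contiguous jumps when reading all branches, entry range by entry range."""
    position = {bk: slot for slot, bk in enumerate(layout)}
    seeks = 0
    cursor = 0  # slot that a purely sequential read would reach next
    for k in range(n_baskets):
        for b in range(n_branches):
            slot = position[(b, k)]
            if slot != cursor:
                seeks += 1
            cursor = slot + 1
    return seeks


if __name__ == "__main__":
    for order in ("branch", "entry", "random"):
        layout = build_layout(4, 5, order)
        print(order, count_seeks(layout, 4, 5))
```

In this model a file ordered by entry is read with no seeks at all, while a file ordered by branch needs a seek for almost every basket. Real files have variable basket sizes and real jobs read only a subset of branches, so the numbers are only indicative.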

The effect of these changes can be seen by reading through the file in ROOT with Tree->GetEntry() and using TTreePerfStats to plot the access pattern. Some instructions are at Local_File_Access, and some plots of the results are available.
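How a read cache changes the access pattern can also be sketched with a toy model. This is a simplified illustration and not ROOT's TTreeCache implementation: it assumes basket requests arrive as (offset, length) pairs, that the cache collects requests until a byte budget is full, and that each batch is then sorted by offset and touching ranges are merged. The names batch_requests, merge_ranges and cached_read_plan are hypothetical.

```python
def batch_requests(requests, cache_bytes):
    """Group (offset, length) requests into batches that fit the cache budget."""
    batches, current, used = [], [], 0
    for offset, length in requests:
        if current and used + length > cache_bytes:
            batches.append(current)
            current, used = [], 0
        current.append((offset, length))
        used += length
    if current:
        batches.append(current)
    return batches


def merge_ranges(batch):
    """Sort a batch by offset and merge ranges that touch end to start."""
    merged = []
    for offset, length in sorted(batch):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


def cached_read_plan(requests, cache_bytes):
    """Return one list of merged ranges per batch of requests."""
    return [merge_ranges(batch) for batch in batch_requests(requests, cache_bytes)]


if __name__ == "__main__":
    # Two branches whose baskets are interleaved in request order.
    requests = [(0, 10), (100, 10), (10, 10), (110, 10), (20, 10), (120, 10)]
    print(cached_read_plan(requests, cache_bytes=60))
    print(cached_read_plan(requests, cache_bytes=30))
```

Without any batching the six requests above jump back and forth between two regions of the file. With a budget of 60 bytes they collapse into a single batch of two contiguous ranges, and a smaller budget of 30 bytes gives two batches, which shows why the cache size matters in this model.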

Hammercloud tests

Reordering

To test with real ATLAS Athena jobs, Hammercloud tests were run at QMUL (which uses Lustre, and so file:// access, to read files) and Glasgow (where DPM and rfio are used). The results from the tests on [reordered] and [old] files appear to show an improvement in the event rate from around 25 Hz to around 40 Hz, but this should be considered very preliminary as a number of factors could influence the results.

TTreeCache

Hammercloud Test 1260 at Glasgow had TTreeCache on, in a D3PDMaker job. 21 of the 80 jobs failed, but those that ran had a CPU efficiency of around 90% and an event rate of around 40 Hz. The cause of the failures needs further investigation.

Also, running a ROOT job with TTreeCache over RFIO throws up these errors:

Error in <TRFIOFile::TRFIOFile>: error doing rfio_lseek
Error in <TBranchElement::GetBasket>: File: rfio:pool2.glite.ecdf.ed.ac.uk//gridstorage015/atlas/2010-02-15/AODClone.root.5371914.0 at byte:0, branch:m_event_ID.m_run_number, entry:4212, badread=0, nerrors=1, basketnumber=1
R__unzip: error in header
Error in <TBasket::ReadBasketBuffers>: fNbytes = -1479295101, fKeylen = 7493, fObjlen = 220569376, noutot = 0, nout=0, nin=9, nbuf=0

Running the job with export RFIO_TRACE=31 may provide some further insight into the access pattern and the errors.

This turned out to be due to two separate bugs: one in DPM, fixed in the DPM-1.7.4-7 client libraries, and one in ROOT, for which the fix was backported to 5.26d and is available in the 15.9.0 and 16 series releases.

With these fixes in place, performance tests can be run; the results will appear here soon.