GridPP17 Discussion Session 4: Experiment Service Challenges
============================================================

Chair: David Kelsey (DK)
Panel: David Colling (DC), Catalin Condurache (CC), Peter Hobson (PH), Roger Jones (RJ), Raja Nandakumar (RN), Glenn Patrick (GP)
Notes: Caitriana Nicholson

DK introduced the session and asked each panel member to introduce themselves and say a few words in turn.

Roger Jones - ATLAS
-------------------

- The ATLAS Computer Systems Commissioning (CSC) is about to start. RJ does not distinguish between CSC and Service Challenges in general.
- Past experience has shown that large, monolithic data challenges can be dangerous, and that it is better to break them into a series of tasks, showing that each can be performed individually before proceeding to the whole.
- RJ displayed the CSC timeline. ATLAS is currently in a Monte Carlo production phase, continuing until spring; there is some distributed analysis (DA) but no plans for a separate DA test. The plan instead is to encourage more use of the DA system by users, and to monitor their use.
- The Conditions Database needs to be distributed, deployed and working by the end of January to allow the Calibration exercise in February/March.
- The ATLAS computing model allows a very low latency for reconstruction, so the calibration activity will need to be fast: this is the motivation for the calibration exercise.
- First-pass production tests have been done but reprocessing at Tier 1s still needs to be exercised; this requires SRM 2.2 to work, and so SRM 2.2 will need to be deployed and working at sites by February.
- In general, a lot of coordination is needed to make these things work!

Glenn Patrick - LHCb
--------------------

- LHCb Data Challenge '06 (DC06) started in May and is expected to continue into next year. Production has run well, on over 100 sites, with the UK at the head of the 'league table'. Reconstruction at Tier 1s is proceeding well.
- As for ATLAS, the complete chain has not yet been run in one go.
- In general, the UK has taken a leading role: the first VO box, for example, was set up at RAL, and RAL's expertise has been used to help other T1 sites.
- There will be an Alignment challenge beginning in March, incorporating calibration, use of the Conditions DB etc.
- The overall challenge is getting everything to work together, and in particular getting it to work when all the experiments are running simultaneously. Stuart Paterson tried running analysis and production jobs simultaneously and got about 70% efficiency initially; tweaking got this up to 90%. This is an area that needs more thought.

Raja Nandakumar - LHCb
----------------------

- LHCb DC06, running since May, has had a series of steps for getting data:
  - Production: runs fine, on over 100 sites without problems
  - Digitisation and simulation
  - Data sent to T1 sites
  - Reconstruction: RAL T1 had data throughput problems in July, mainly a hardware issue (not enough disk), which has now been resolved. Other T1s have had problems, e.g. NIKHEF, which is now considered a T2 because of firewall issues, and PIC, which had disk space problems.
  - Stripping: the output from the Reco stage was bad, giving problems for the stripping step. This has now been resolved.
- In summary, the full chain is about 95% ok, with about 40% of events reconstructed so far.

Dave Colling - CMS
------------------

- Last year, CMS threw away their entire software and started again; the plan is to have the new software ready and tested by the CSA challenge this year.
- The Monte Carlo chain went fine and indeed produced more events than planned.
- MC has now been imported to T0 and reconstructed at 25% of the data-taking rate, at the T1s.
- Analysis was performed at T2s, with 'job robots' used to access each event.
- Fewer problems with catalogues than other experiments, because PhEDEx is used and works nicely.
- Frontier is used for database replication.
- CSA is now winding down with all milestones met or exceeded. It is a real, not just a political, success.

Catalin Condurache - RAL T1 team
--------------------------------

- Had problems with hardware over the summer, particularly in ensuring sufficient disk capacity.
- This gave problems for the ATLAS SC4 T0->T1 challenge. Because of the lack of disk space, RAL had to choose between the T0->T1 and T1->T2 steps, so the T1->T2 step did not happen in the UK.
- LHCb are OK; they are happy with the new CASTOR installation. There may be a general change from dCache to CASTOR in the UK. (DC pointed out that the CMS challenge would not have worked without the RAL CASTOR deployment.)
- The hardware problems should hopefully be resolved by Christmas.

Peter Hobson - CMS
------------------

PH explained that he wanted to look at service challenges from a more socio-political stance, drawing from his experiences at Brunel.

- Brunel have had problems getting the network connection they need, but the CMS data challenge gave him the leverage needed to get a 1 Gb/s LAN and improve relationships with the university networking team.
- SCs also give users an incentive to realise that they cannot just keep doing things in the same old way any more! The process is a useful catalyst.
- Brunel did manage to produce significant amounts of LHCb MC data, showing that although the contribution to one experiment may be less than hoped, an institute can still contribute usefully to other experiments.
- RN interjected that LHCb have been very happy to get spare CPU from wherever they can! They have been doing analysis with GANGA, with very positive feedback from users; their experience is that running grid jobs with GANGA is faster than using the CERN LSF queue.

General Discussion
------------------

There was then a period of general discussion. Having had a good overview of what has happened in the past months, DK asked what GridPP can do to improve things in the future.

- RN: Keep providing more resources!
- Jeremy Coles: currently, up to 50% of resources may be unused at any one time. How can usage be improved? Is it just a case of site environments not being suitable for particular experiments, for example?
- RN: agreed that LHCb has problems running at particular sites.
- RJ: asked whether sites were failing the Site Functional Tests (SFTs) but perhaps still advertising their resources.
- GP: working on a web page to display this kind of information, which could be made more generally available.
- DK: There is a general problem of communication, and of knowing who should be contacted to get this kind of information.

Jeremy Coles then asked whether experiments were happy with the way problems are responded to.

- RN: Fine at the moment; the GGUS system works well. Questioned whether this will scale as more sites are added and more tickets are raised.
- GP: the EGEE broadcast system is now working better too.
- RN: It is also a problem that sites don't usually report problems, so problems are only noticed when jobs start failing.
- Tony Doyle asked whether there was a central web page people could refer to for this kind of information.
- RJ: the ATLAS dashboard page exists, but is hard to find!
- PH: he has used the LHCb pages; he knew what to do, and it worked! However, need to know where to look / who to contact.
- Perhaps it would help to have more experiment representatives at DTeam meetings.

DK: Summary is that communication is generally an issue and should be addressed!