Discussion Session: How ready are we for LHC start-up (site perspective)?
=========================================================================

Panel members: Tony Cass, Jeremy Coles, Yves Coppens, Peter Love, Andrew Sansum, Sam Skipsey

Chair: Steve Lloyd

Each panel member gave their view, followed by general discussion with the audience.

Tony
----

The answer to the "are we ready" question is clearly "no"! Looking at different components:

- CPU is fine
- Data management has problems. There is a lot of work required to deliver the SRM interfaces. For this year, the SRM 2.2 interface is probably not critical - it's not a show-stopper for shipping data to T1s, so not in a desperate mood at the moment.

Aside from the components, however, not convinced that most sites are ready for 24/7 operations, with people ready to respond in the middle of the night etc. RAL is putting together a response team but this kind of thing needs to be more widespread across sites. For example, it is not clear who you should call at a particular site if something is wrong. These things *need* to be sorted out between now and November.

Jeremy
------

Jeremy had a slide prepared which looked at a number of questions:

- can we deliver the physical resources? Probably. MoUs and so on are in place to do this.
- can we enable the resources to be used efficiently? Don't know. There are issues with middleware, fairshare usage etc
- can we maintain availability of resources? e.g. monitoring, failure recovery, backup support... probably about halfway there
- are we prepared for major incidents?
- do we understand experiment requirements? There is still a gap in our understanding of their requirements (and in their own understanding, sometimes!)
- can we implement essential requirements?
- is there sufficient coordination, e.g. between different experiments contending for resources in data transfer etc? This has not really been thought about properly.
- can we cope with an increasing number of users for user support etc? Fear that existing structures will not scale.

Yves
----

Looking at things from the Tier-2 perspective (Birmingham):

- T2 support has improved
- there is a problem with load balancing between sites, e.g. 500 jobs were recently queued at Birmingham while there were none at RAL
- Steve Lloyd's ATLAS tests highlight the need for end-to-end testing. These types of test are critical
- We really need to understand user needs
- Concerned about middleware, e.g. there were problems with the DPM upgrade recently. This is a serious issue.

Peter
-----

From the T2 perspective (Lancaster), things are looking good. The middleware problem is not as bad as for bigger sites. To try and understand Lancaster's readiness, he went through the Readiness Questionnaire.

- Networking is fine in terms of capacity, but has not been tested in a real-life situation as part of the campus network.
- dCache getting better all the time. Confident it will deliver required stability, although it takes a lot of work to get it going.
- Hardware: running out of space at the moment, but new building being built for 14 new racks.
- Middleware - under control
- Communications with experiments - ok, though more would be good

Overall - yes, the site is ready.

Andrew
------

Re-iterating what others have said. From the Tier-1 perspective:

- Hardware is on track; there were a few operational issues but these are understood. Biggest problem was with disk; had a lot of problems with it but these are now fixed.
- need to speed up deployment, but this means getting middleware sorted
- networking is fine, will soon connect to SJ5, no issues there
- new machine room is coming on track in the next year, so there will be no problem with accommodating equipment
- CASTOR and dCache: it is not planned to run dCache for data-taking, but will CASTOR at RAL be ready? There are good indications in some areas, but some concerns.
  There are operational issues, because the experiments' full data models through the Tier-1 have not been tested. These need to be tested as soon as possible. Among the operational issues are the missing LSF plugin at the moment, and problems with garbage collection on CASTOR. Upgrade of CASTOR may lead to further issues - upgrades are hard work.
- Big concern is middleware. They are busy working to improve availability and reliability of CE/SE etc, but the new gLite middleware will be installed in June. This is the same time as SRM 2.2, the Full Dress Rehearsals etc, so not good timing. We really don't know what the gLite middleware will be like.
- Finally, the main problems with the T1 are likely to be from chaotic users. Chaotic analysis could, for example, bring CASTOR down very quickly if it is not controlled.

Sam
---

From the perspective of a small T2 that may become a bigger one.

- Hardware will not be a problem.
- Not happy with the way people have had to move to SLC4, and the associated problems.

John Gordon commented that the SLC4 change has been known for ages. Sam replied that if the original timing had been kept, it would not have been an issue - the problem is in trying to do it at this crucial time.

General discussion
------------------

Tony Doyle put a question to Tony Cass about the problems seen with how the SLC3/4 changeover was done.

Tony C: the message of what things were wrong with the middleware did not get through to developers early enough. It is only now getting through, and people are getting the message of reliability being required, not performance. People were, in the past, still trying to add new functionality and did not realise the problems there would be getting it all to work with SLC4.

Peter Gronbech(?): only the UI has been ported so far. What about the CE, SE etc?

Tony C: gLite middleware will be on SLC4, so the current middleware will not be ported.

Mark Nelson: why aren't we going to work with what we have, and make that work, instead of trying to get the gLite CE working?

Tony C: WMS software is not a CERN responsibility, but this is a major concern.

Andrew Sansum: putting a gLite CE in this summer is not consistent with trying to get a stable service working!

Dave Colling: There is now a rapid development turnaround, with a new release every 2 weeks. The latest version looks better, but will give an update in a couple of weeks' time. Because the LCG RB is still running, people have been putting less effort into hardening the gLite WMS.

Raja Nandakumar: how will the large data flows affect the networking into the Tier-1? In particular, will there be contention between Diamond and LHC experiments for CASTOR access?

Andrew S: need to work out how to accommodate Diamond flow into CASTOR but it should not be a problem. For example, they are getting a second robot put in next year, and the networking is flexible.

Jeremy: conclusion seems to be that we are as ready as the software is!

Steve: we have the money to buy the hardware but the disk problems are worrying. Can T2s cope with this amount of disk?

Andrew: not really worried about T1 disk; RAL are getting new disk this year but phasing out some old disk. Also new RAID6 etc will help.

John Gordon: in terms of CPU, we know what we can buy but this will not necessarily translate to all the SPECint that the experiments expect. They need to know what they can expect to get.