Discussion Session (Chair: Pete Clarke)
"How ready are we for LHC startup? (User Perspective)"
Panel Members: Dave Colling, Karl Harrison, Roger Jones, Akram Khan, Raja Nandakumar, Glenn Patrick, Ben Waugh.

Pete: Introduced timeline and probing questions slide.

Roger: No, we are not ready. We have to shift attitude. We have a set of challenges regarding the commissioning of the system. It has got to be made to work! The user won't see the distinction. On the experiment side - production system scaling up to full scale. ATLAS analysis system - rolling out. Data management is not good enough. CMS may be better (on critical path). Reliability needs to be better. LSF at CERN for example - need tools to help people do that. How can users track problems? They have no understanding of what is going on. User analysis - need to work on T1s, such as on SRM 3.2. Need a lot more work on support structures - we need more experts available. In ATLAS there is a need for a technical contact person who could phone up sites. Dave Newbold highlighted that there is a need to keep local stuff local. The big problem is with mass storage systems - need more functionality - we are building data centres - but there is a disk imbalance. VOMS integration with middleware is essential - not an optional extra. Pragmatism is the key to get what is already there working and reliable.

Karl: Users find it easier to perform analysis using local resources. Given any chance not to use the grid, a user will take it. Certificates - getting them is a problem as it is a complicated procedure. Job submission - flexibility versus usability. There are a lot of unknowns. With the resource broker (RB), for example, problems have a slow response. Submission takes about 20 to 30 seconds, which is not acceptable. It is a mystery what the RB does when a job is submitted. Optimisation needed. Priorities - users competing with production systems. Queues behind production system. Main source of job failure is data distribution.
Job fails if you can't get at data. User awareness of Grid tools is very small - tutorials are useful but it is still difficult to convince users to use tools.

Akram: CMS stats. Data transfers reasonable. 10000 CMS production jobs submitted yesterday. 5000 analysis jobs submitted - stuff there working to a certain degree. So obviously there are people out there who know how to use the grid (a core of users/experts). In a way we are half there - things work to a certain degree and then fall over. RBs for example. Scalability problem if there are going to be hundreds of users. We have complex systems (sites can be down for days without people noticing). System is held together with duct tape! With patches here, there and everywhere! Data challenges show we can get non-experts to use the system. Takes a lot of effort on the part of the user. Guidance is needed with error messages. In summary the system works to a certain degree. Functionality is certainly not what it should be. There is a scalability problem. Could increase number of RBs.

Ben: We are closer than a year ago but still not there. People prefer the local system as it is easier to figure out problems. When you submit a job you find the data isn't actually there yet. Therefore the job fails. ATLAS tags - preselection of data doesn't exist. Tools for accessing tags still a little patchy. On a brighter note - tried Ganga - jobs failed but that wasn't Ganga's fault. Installing new VOs not straightforward.

Raja: LHCb needs:
CPU: Great response at T2 - thanks! Recent problems at T-1.
Networking: Not a problem (so far at least).
Storage: Mostly needed at T-1. "Unstable" dCache. Trying to get CASTOR working for us. Currently the **critical** issue for LHCb. Our needs are:
a. Disk & Tape storage: Fine so far
b. Access to the above storage: Problems
As long as these 3 points are covered we are fine. Right now, we have problems with access to the data. Some light at the end of the tunnel recently.
---------------------------------------------------------------------------------
LHCb experience from DC06:
In general, problems since December 06 in storage / dCache. Otherwise it was / seems fine.
Hope CASTOR will be a smoother experience
-> srm v2.2 implementation in Sept / Oct makes us worry (a little)
Overall Grid: UK region is best performing for CPU and availability of storage (with CERN). Other sites have problems with "storage". This means data management is the next big issue for us.
"Storage" problems include:
1. Disk availability
2. Access to tape (disk servers)
3. Stability of gridFTP doors
4. srm endpoints having perpetual problems
---------------------------------------------------------------------------------
How ready is LHCb:
1. Mostly
2. 3 years of experience now in running jobs "on the grid"
3. We already have stable grid software which works
4. About 3 months for DIRAC3 - more flexible, easier to maintain, extendable
5. glexec mechanism will help in analysis ...
-> Low priority for site admins?
-> Analysis with many users still an unknown
-> Large numbers of users will mean more problems
6. Try to manage data better.
-> need help from sites in this - not all T-1s have been steadily up
7. Many things are changing fast. Grid is not fully stable
-> New technology being deployed (SRM 2.2, StoRM, ...)
-> Metrics changing (SAM tests, ...)
8. Soon we will have all experiments active simultaneously (when we have data)
-> Still not tested how this will work
-> Contention for resources (e.g. CASTOR nightmare?)

Glenn: The software is written - Ganga for example. The big test is when production, reconstruction, reprocessing and user analysis all happen together. There will be a change of focus to users - we have 9 months to do that. ATLAS/CMS - transfer to T2s. Rely heavily on T1 for everything. Prioritisation of work is important. This is untested - having a mixed user community will be a challenge.
Discussion:

Data isn't there - Roger: Data isn't where it should be and we can't get it to where it should be (because users clog the system).

Raja: LHCb - jobs complete faster on the grid.

Steve: Error messages are incomprehensible - they are clearly not going to change - we should focus on the things that can change (or work around).

Jeremy (re submissions): People don't report problems.

Raja: When there is a problem there are 2 options: go through a local contact person at the site or submit a GGUS ticket.

Roger: People have to resubmit - some problems are transient - data movement for example.

Dave Colling: Submission is faster with gLite. Priorities - behind a load of production jobs. It is up to a site to give users different priorities - behaviour of RBs - match criteria user has.

Roger: Must do something smarter than use ETT.

Steve: Fix UI configuration - we can do something about this - but what we need is someone to do it.