"LCG Service Challenges" (plans for SC4 based on experience of SC3)
-------------------------------------------------------------------

Chair: Dave Kelsey (CCLRC)

Panel Members: Jeremy Coles, Brian Davies, John Gordon, Roger Jones, Paul Millar, Steve Traylen, Paul Trepka, Yong-Jun Zhang

John Gordon (CCLRC): Involved as Tier-1 and GridPP project manager. In his opinion the SC is a really good idea that serves as a kind of reality check. The SC involves a lot of planning but, on the other hand, it readily demonstrates the stage reached. The SC is really helpful for recognising problems.

Roger Jones (Lancaster University): The SC is a valuable idea. It helps motivate people to keep to the time scale for solving SC-related problems, and it helps to recognise the problems. The network is another issue for the next SC: at the moment we do not know how important a role the network (UK structure) will play in it.

Paul Millar (University of Glasgow): Doing some development work on metadata. Involved in ATLAS and coordinates communication and work between different people. In his opinion the SC is more like an experiment structure for testing software.

Steve Traylen (CCLRC): Working on Tier-1 and GridPP software. In his opinion the SC is a very good idea. It plays a very useful role in finding problems.

Brian Davies (Lancaster University): Working on the LCG project. In his opinion the SC can be used as a kind of guinea pig that helps to find problems.

Yong-Jun Zhang (Imperial College London): Not present at the panel.

Jeremy Coles (CCLRC): Prepared the presentation; please see below.

Presentation (prepared and presented by Jeremy Coles):
----------------------------------

Preparations required by Tier-2s for SC3
  - Implement SRM v1.1 dCache (configure disk pools)
  - Upgrade middleware
  - File Transfer Service client v1.2
  - Set up File Catalogue (MySQL)
  - Experiment specific components (CMS = PhEDEx-v2.2 & PubDB; ATLAS = )

Problems observed (from feedback so far)
  - Insufficient time for additional tests.
  - Lancaster would have liked more concurrent transfers to multiple IO nodes, for a longer period and at a higher rate
  - Data transfer rates need to be improved
  - SRM and network configuration changes are extra work
  - Installation of data management system services (for ATLAS)
  - Additional upgrades (2.5.0) for functionality
  - dCache pool node became overloaded due to the large number of parallel file transfers
  - Implementation of extra resources on a short timeline
  - Discovery of network problems when connecting SAN to dCache instance
  - Significant time spent debugging SE information system after upgrades
  - Would like to have tested DPM as well as dCache (not possible because of bugs in the then current DPM release)
  - UKLIGHT is a prototype and does not have production level service level agreements
  - Extensive network tuning and configuring of parallel transfer parameters required to stabilise rates
  - Severe challenges testing due to bugs and missing functionality in dCache

Service Challenge 3 enters a new phase
  Phase 1 (throughput tests), July 2005
  - dCache-SRM working at all sites
  - Tier-1 managed rates (on UKLIGHT) up to 650 Mb/s to CERN. This is similar to SC2 rates.
  - Edinburgh: 10 TB data transferred. Sustained rates of 220-250 Mb/s
  - Imperial: rates reached 400-480 Mb/s
  - Lancaster: 958 GB (978 files) over 8 days (~27 Mb/s sustained)
  Phase 2 (service phase), from 1st September 2005
  - The experiments will use the SC3 infrastructure for testing their models and production
  - Experiment (basic functionality) test jobs are being developed (to run as part of the SFTs) to check sites

Service Challenge 4 will affect all sites: start preparing!
  - SC4 consists of a Setup Phase starting on 1st April 2006, during which a number of throughput tests will be performed, followed by a Service Phase from 1st May 2006 until 30th September 2006
  - All service components for SC4 need to be delivered ready for production by 31st January 2006
  - Final testing and integration of components and services must be completed by 31st March 2006
  - More details in the panel discussion later today.

Service Challenge 4 dimensions
  - SC4 starts April 2006
  - SC4 ends with the deployment of the FULL PRODUCTION SERVICE
  - Deadline for component (production) delivery: end January 2006
  - Adds further complexity over SC3 (extra dimensions):
      - Additional components and services, e.g. COOL and other DB-related applications
      - Analysis Use Cases
      - SRM 2.1 features required by LHC experiments - have to monitor progress!
      - Most Tier-2s, all Tier-1s at full service level
      - Anything that dropped off list for SC3
      - Services oriented at analysis & end-user

What implications for the sites?
  Analysis farms:
  - Batch-like analysis at some sites (no major impact on sites)
  - Large-scale parallel interactive analysis farms and major sites
  - (100 PCs + 10 TB storage) x N
  User community:
  - No longer small (<5) team of production users
  - 20-30 work groups of 15-25 people
  - Large (100s to 1000s) numbers of users worldwide

SC4 Use Cases are still being decided
  Not covered so far in Service Challenges:
  - T0 recording to tape (and then out)
  - Reprocessing at T1s
  - Calibrations & distribution of calibration data
  - HEPCAL II Use Cases
  - Individual (mini-) productions (if / as allowed)
  Additional services to be included:
  - Full VOMS integration
  - COOL, other AA services, experiment-specific services (e.g. ATLAS HVS)
  - PROOF? xrootd?
    (analysis services in general)
  - Testing of next generation IBM and STK tape drives

Analysis Use Cases (HEPCAL II) will need to be part of SC4
  Production Analysis (PA)
    Goals in Context: Create AOD/TAG data from input for physics analysis groups
    Actors:           Experiment production manager
    Triggers:         Need input for individual analysis
  (Sub-)Group Level Analysis (GLA)
    Goals in Context: Refine AOD/TAG data from a previous analysis step
    Actors:           Analysis-group production manager
    Triggers:         Need input for refined individual analysis
  End User Analysis (EA)
    Goals in Context: Find the physics signal
    Actors:           End User
    Triggers:         Publish data and get the Nobel Prize :-)

The LCG SC4 timeline looks like this
  - Now - September: clarification of SC4 Use Cases, components, requirements, services etc.
  - October 2005: SRM 2.1 testing starts; FTS/MySQL; target for post-SC3 services
  - January 31st 2006: basic components delivered and in place
  - February / March: integration testing
  - February: SC4 planning workshop at CHEP (w/e before)
  - March 31st 2006: integration testing successfully completed
  - April 2006: throughput tests
  - May 1st 2006: Service Phase starts (note compressed schedule!)
  - September 1st 2006: Initial LHC Service in stable operation
  - Summer 2007: first LHC event data

based on the TDR milestones

Date        Description
31 Jan 06   All required software for baseline services deployed and
            operational at all Tier-1s and at least 20 Tier-2 sites.
30 Apr 06   Service Challenge 4 Set-up: Set-up complete and basic service
            demonstrated. Performance and throughput tests complete:
            Performance goal for each Tier-1 is the nominal data rate that
            the centre must sustain during LHC operation (see Table 7.2
            below), CERN-disk -> network -> Tier-1-tape. Throughput test
            goal is to maintain for three weeks an average throughput of
            1.6 GB/s from disk at CERN to tape at the Tier-1 sites. All
            Tier-1 sites must participate.
            The service must be able to support the full computing model
            of each experiment, including simulation and end-user batch
            analysis at Tier-2 centres.
31 May 06   Service Challenge 4: Start of stable service phase, including
            all Tier-1s and 40 Tier-2 centres.
30 Sept 06  1.6 GB/s data recording demonstration at CERN: Data generator
            -> disk -> tape sustaining 1.6 GB/s for one week using the
            CASTOR mass storage system.
30 Sept 06  Initial LHC Service in operation: Capable of handling the full
            target data rate between CERN and Tier-1s (see Table 7.2). The
            service will be used for extended testing of the computing
            systems of the four experiments, for simulation and for
            processing of cosmic-ray data. During the following six months
            each site will build up to the full throughput needed for LHC
            operation, which is twice the nominal data rate.
1 Apr 07    LHC Service Commissioned: A series of performance, throughput
            and reliability tests completed to show readiness to operate
            continuously at the target data rate and at twice this data
            rate for sustained periods.

GridPP action areas
  - Implement a better user support model
  - Support the deployment of an SRM at every Tier-2 site
  - Revisit site plans for implementing promised resources
  - Support the installation of any required local catalogues at sites
  - Investigate the experiment VO box requests. Make a recommendation to Tier-2s. Revisit as GridPP.
  - Better understand network links to sites (we do not want to saturate links)
  - Schedule transfer tests from Tier-1 to Tier-2: test rates and stability
  - Work closer with experiments?

Discussion:
-----------

A deployment checklist would be a good idea. It would give an opportunity to check what stage we have reached and would be very helpful for coordinating the remaining work. This is still an open topic. The proposal was to discuss it at the next deployment team meeting.
Everybody agreed that it is a good idea, but the details still need to be discussed and should be based on the recommendations and experience of other people.

Another topic discussed was how to extend the information available about the SC. In general everybody agreed that this is a good way forward. This is still an open topic.

Building on existing documentation makes the SC easier. It gives an opportunity to avoid repeating common mistakes and to concentrate on what is more important. Documentation should be well prepared and based on real experience. It is a good idea to attach solutions, or proposals for how to solve the existing problems. With good documentation we are able to sort out most of the problems.

Summary by Dave Kelsey:
-----------------------

1. The SC is a great idea that serves as a kind of reality check.
2. More documentation and support are needed.
3. A time scale and deadlines are needed for deployment.
4. The storage model is an important issue, especially for the storage group.
5. Communication of experience (to be discussed further at the next deployment meeting: what is the next stage?).
6. The network could be a very important issue in SC4 (more network skills are required).
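
Note on the SC4 throughput targets: to give sites a feel for the scale of the 1.6 GB/s milestones quoted in the presentation (three weeks from disk at CERN to Tier-1 tape, and one week of data recording at CERN), the implied data volumes can be estimated with a few lines of Python. This is only a rough sketch and not part of the presentation; the function names are made up for illustration, and decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) are an assumption, since the minutes do not say which convention was meant.

```python
# Rough estimate of the data volume implied by a sustained transfer rate.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 PB = 1e15 bytes).

SECONDS_PER_DAY = 86_400


def volume_pb(rate_gb_per_s: float, days: float) -> float:
    """Total volume in PB moved at a constant rate (GB/s) for a number of days."""
    total_bytes = rate_gb_per_s * 1e9 * days * SECONDS_PER_DAY
    return total_bytes / 1e15


def gbytes_to_gbits_per_s(rate_gb_per_s: float) -> float:
    """Convert a rate in gigabytes per second to gigabits per second."""
    return rate_gb_per_s * 8


if __name__ == "__main__":
    # Three-week throughput test, disk at CERN to tape at the Tier-1 sites.
    print(f"3 weeks at 1.6 GB/s: {volume_pb(1.6, 21):.2f} PB")
    # One-week data recording demonstration at CERN.
    print(f"1 week at 1.6 GB/s:  {volume_pb(1.6, 7):.2f} PB")
    # The same rate expressed as network bandwidth.
    print(f"1.6 GB/s = {gbytes_to_gbits_per_s(1.6):.1f} Gb/s")
```

Under these assumptions the three-week test corresponds to roughly 2.9 PB in total across all Tier-1 sites, and the one-week recording demonstration to just under 1 PB.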