GDB 16th January 2013
Notes taken by Andrew McNab in the room used at CERN. Vidyo repeatedly cut out during the meeting.
Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=197796
Contents
- 1 Welcome - Michel Jouvin (Universite de Paris-Sud 11 (FR))
- 2 Update on experiment support after EGI-Inspire SA3 - Ian Bird (CERN)
- 3 Future Computing Strategy for WLCG - Ian Bird (CERN)
- 4 Survey about configuration management tools and plans regarding central banning - Peter Solagna (EGI.eu)
- 5 CERN Computing Facilities’ Evolution - Wayne Salter (CERN)
- 6 Operations Co-ordination pre-GDB summary - Maria Girone (CERN)
- 7 Future directions for work on virtualisation/clouds - Michel Jouvin (LAL)
Welcome - Michel Jouvin (Universite de Paris-Sud 11 (FR))
- Slides: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=197796
- GDB summaries, not official minutes, at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGGDBDocs
- Looking for more volunteers to create summaries (Michel will format and take responsibility for any mistakes etc)
- Vidyo unreliable at pre-GDB yesterday
- April GDB needs moving due to a clash with the EGI Community Forum: may move a week earlier or be cancelled.
- March GDB is at KIT rather than CERN. Essential that you register with KIT to get site access.
- February topics: “Post-EMI discussion” follow-up, Federated storage WG final report; Future data access protocols (WG); Storage accounting update; Glexec next milestones
- February pre-GDB on AAI in Storage
- February/March storage WG face to face? Wahid polling the group.
- Actions in Progress: https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress
- MW client in CVMFS done (Matt) - request for testing - http://www.sysadmin.hep.ac.uk/wiki/EMI2Tarball
- Suggestion from Ian to merge pre-GDB and GDB to allow more in-depth discussion.
- Risk is lower attendance if it becomes a two-day meeting.
- Need to co-ordinate agenda(s) better.
Update on experiment support after EGI-Inspire SA3 - Ian Bird (CERN)
- Follow-up of the discussion at October GDB
- Slides: http://indico.cern.ch/getFile.py/access?contribId=2&resId=1&materialId=slides&confId=197796
- EMI finishes April 2013; EGI-Inspire to April 2014
- EU Horizon 2020 (FP8) requires relevance to industry and society (so need more than benefits to sciences)
- Large (33%,45%) reductions in CERN ES and GT groups
- IT must focus resources on cross-experiment solutions
- IT must continue with DM tools, information system, and simplify build, packaging
- POOL now only used by ATLAS and will be supported by them
- "Previous level of dedicated expert support to experiments unsustainable. Understood by ATLAS/CMS; issue for LHCb and ALICE"
Q&A:
- Future projects (i.e. funding):
- only from 2014; must be broader than just HEP and will probably require industrial partners
- anything that looks like grid middleware development will not be feasible
- need commitment from PH etc, not just IT, this time
- IT currently about 40 FTE for this area; going down to 24.
Future Computing Strategy for WLCG - Ian Bird (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=3&resId=1&materialId=slides&confId=197796
- Slides are from September 2012
- Data Management
- Using more standard tools makes collaboration with other sciences easier
- Long term data preservation is something that other areas have more experience of
- Must plan on next generation of e-Infrastructures (5-10 yrs): clouds? terabit networks?
- HEP software must support concurrency as hardware moves to large numbers of cores
- Workshop this year on long term strategy for all this
- Will create a single document for LHCC that explains computing models, how they will evolve for running between LS1 and LS2, based on experiment computing plans (TDRs?)
- Draft for C-RRB in October 2013
Q&A:
- Merge/co-ordinate with architects forum?
- Need for communication to sites about experiments' intentions about hardware architectures
- Do funding agencies really understand that LHC data builds up so storage requirements continually increase?
- Need more resources in 2014 to be ready for increased rates from the machine from mid-2015
- Difficulty of working with other big data areas: need to change existing frameworks (on either side); and overheads if involving other fields on a token basis
Survey about configuration management tools and plans regarding central banning - Peter Solagna (EGI.eu)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=4&resId=1&materialId=slides&confId=197796
- Had 82 unsupported services in December; now 26.
- In December, 44 sites with unsupported middleware; 15 now.
- Currently: 32 DPMs, 2 LFCs, 92 CEs still pointing at gLite WNs.
- From February, DPM, LFC and WNs not upgraded must go into downtime.
- YAIM under discussion as end of EMI nears. Product teams may choose own solutions.
- Survey will be sent out to see what sites are doing (quattor, puppet, Chef etc)
Q&A:
- Sites dependent on the tarball shouldn't have to go into downtime, since the "tarball" via CVMFS has only just been published
- Still interest in a central banning architecture; still being worked on, with ARGUS the frontrunner for how to enforce it
CERN Computing Facilities’ Evolution - Wayne Salter (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=5&resId=1&materialId=slides&confId=197796
- Lots of detail about upgrade to cooling, UPS etc in machine rooms. See slides.
- Parameters of Wigner extension to CERN T0 in Hungary:
- To be considered as an extension of the CERN Tier0, i.e. should be transparent for the experiments
- IT Operations to be done as if systems were at CERN
- All operations not requiring physical intervention to be done remotely from CERN
- All physical interventions to be done by Wigner staff on request from CERN using CERN ticketing system
- Wigner responsible for all facility operations
Operations Co-ordination pre-GDB summary - Maria Girone (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=1&resId=1&materialId=slides&confId=197796
- Many experiment improvements planned for LS1
- Willingness to approach solutions in common
- Topics: Network workshop summary / Security / Information System / CVMFS / Data Management / Clouds
- Discussions were part of each topic rather than in the planned roundtable at the end
- Vidyo problems as usual
- Giving ownership of CVMFS tasks to people is leading to more progress
- SHA-2 production certs will not be issued before August 1st (a minimal SHA-2 check is sketched at the end of this section)
- Validation during the following two quarters
- All deployed software to support SHA-2 in early summer
- Should all be transparent to users as SHA-2 issued when certificates are renewed
- Agreed to have a fully deployed and tested glexec system by end of 2013
- Fully validated system at scale by end of LS1
- Information working group proposed a central Service Discovery system (WLCG IS: static information about services)
- Information is gathered from a number of sources and collected in a WLCG Information System
- eg with systems like GOCDB as inputs; experiments could use this as input to their own systems
- Experiments agreed on the need to finish deploying FTS3
- Several experiments plan catalog improvements during LS1
- Remote data access proceeding for all four LHC experiments
- http and/or xrootd
- CMS proposal for WNs to be on the OPN and use it directly not seen as damaging
- Discussion about caching vs data placement strategies
- "Reasonable consensus on continuing the work on separating disk and tape"
- HLT resources are the largest “cloud” resources available to the experiments
- HLT system for ATLAS and CMS is 10-15% of their Grid resources.
- ALICE's 2018 HLT resources are estimated at 250k cores.
- LHCb already makes efficient use of the HLT
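- On the SHA-2 item above, a minimal sketch of how a site might check whether a certificate already carries a SHA-2 signature; the Python cryptography package and the hostcert path are assumptions for illustration, not tooling prescribed in the notes.
 # Minimal sketch only: report whether a certificate uses a SHA-2 family signature.
 # The 'cryptography' package and the hostcert path are illustrative assumptions.
 from cryptography import x509
 from cryptography.hazmat.backends import default_backend
 from cryptography.hazmat.primitives import hashes
 SHA2_ALGS = (hashes.SHA224, hashes.SHA256, hashes.SHA384, hashes.SHA512)
 def is_sha2_signed(pem_path):
     """Return True if the certificate at pem_path is signed with a SHA-2 hash."""
     with open(pem_path, "rb") as f:
         cert = x509.load_pem_x509_certificate(f.read(), default_backend())
     return isinstance(cert.signature_hash_algorithm, SHA2_ALGS)
 if __name__ == "__main__":
     print(is_sha2_signed("/etc/grid-security/hostcert.pem"))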
Future directions for work on virtualisation/clouds - Michel Jouvin (LAL)
- Slides: http://indico.cern.ch/getFile.py/access?sessionId=0&resId=0&materialId=0&confId=197796
- Proposal from December GDB is for WLCG to continue HEPiX virtualisation work
- Assess interest in this today
- From January pre-GDB, experiments plan to use HLT farms as clouds during LS1
- Currently no option to access the accounting services on these clouds
- ATLAS has plans to use PanDA statistics
- No full agreement: feeling that public resource usage should go into WLCG accounting
- For private clouds, APEL has demonstrated the ability to collect data from clouds (a toy wall-clock aggregation is sketched at the end of this section)
- CERN Agile infrastructure tests with ATLAS/CMS/IT/ES
- Test jobs and real production jobs
- ~200 VMs of 4 cores/8 GB for each experiment
- Condor used to instantiate VMs
- Successful tests, although some CPU efficiency problems not understood
- All experiments except LHCb planning to build cloud out of HLT clusters
- ATLAS/CMS using OpenStack but not part of CERN Agile
- ALICE will use CernVM Cloud
- LHCb will use existing non-cloud infrastructure on HLT
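- To illustrate the accounting point above: a toy sketch of the kind of aggregation a wall-clock-based cloud accounting record implies, summing cores x wall-clock hours per VO from VM start/stop times. Field names and sample data are assumptions, not the actual APEL cloud schema.
 # Toy sketch only: aggregate wall-clock usage per VO from VM lifecycle records.
 # Field names and sample data are illustrative assumptions, not the APEL schema.
 from datetime import datetime
 vm_records = [
     {"vo": "atlas", "start": "2013-01-10T08:00:00", "stop": "2013-01-10T20:00:00", "cores": 4},
     {"vo": "cms",   "start": "2013-01-10T09:00:00", "stop": "2013-01-11T09:00:00", "cores": 4},
 ]
 def wallclock_core_hours(records):
     """Sum cores x wall-clock hours per VO, regardless of CPU efficiency inside the VM."""
     totals = {}
     for r in records:
         start = datetime.strptime(r["start"], "%Y-%m-%dT%H:%M:%S")
         stop = datetime.strptime(r["stop"], "%Y-%m-%dT%H:%M:%S")
         hours = (stop - start).total_seconds() / 3600.0
         totals[r["vo"]] = totals.get(r["vo"], 0.0) + hours * r["cores"]
     return totals
 print(wallclock_core_hours(vm_records))  # {'atlas': 48.0, 'cms': 96.0}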
T0 - Juan Manuel Guijarro (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=6&sessionId=0&resId=0&materialId=slides&confId=197796
- Experiment pilot factories running on public clouds (Helix Nebula)
- Future site clouds: BNL, CNAF, IN2P3, UVIC; HLT farms, CERN IT OpenStack
- Need to be realistic and aligned with industry standards
- Accounting
- Strategy - wall clock instead of CPU time?
- Multiple tenants
- Integration with APEL
- Scheduling
- Dealing with limited resources? How to reclaim resources?
- Is a fair share mechanism needed with clouds? (a toy per-VO calculation is sketched at the end of this section)
- Federated Identity Management
- Multiple IT systems and organisations
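- On the scheduling/fair-share question above: a toy sketch of one possible policy, splitting a fixed pool of VM slots between VOs in proportion to pledge and reclaiming VMs from VOs over their share. Numbers and policy are purely illustrative; nothing like this was agreed at the meeting.
 # Toy sketch only: share a fixed pool of VM slots between VOs by pledge.
 # A negative delta means the site would reclaim running VMs from that VO.
 # Numbers and policy are illustrative assumptions, not an agreed mechanism.
 def vm_targets(total_slots, pledges, running):
     """Return {vo: (target, delta_vs_running)} from pledge fractions."""
     pledge_sum = float(sum(pledges.values()))
     result = {}
     for vo, pledge in pledges.items():
         target = int(round(total_slots * pledge / pledge_sum))
         result[vo] = (target, target - running.get(vo, 0))
     return result
 print(vm_targets(800,
                  pledges={"atlas": 40, "cms": 40, "lhcb": 15, "alice": 5},
                  running={"atlas": 500, "cms": 200, "lhcb": 50, "alice": 0}))
 # atlas -> (320, -180): 180 ATLAS VMs would be drained to respect the shares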
LHCb - Using institutional clouds - Philippe Charpentier (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=7&sessionId=0&resId=0&materialId=slides&confId=197796
- Say "institutional" rather than "private" clouds
- LHCb just needs pilot jobs running on the farm somehow
- Pilot job pulls work from the task queue
- Pilot job starts one or more job agents to run a job
- Jobs may generate more than one thread
- Batch system just needed to place pilot job on WN
- Batch systems also enforce limits to length of jobs - is this actually needed?
- Graphs comparing oscillation vs steady state between sites with different policies
- Advantages and disadvantages of VMs
- Better to have VM running much longer than jobs
- If VM is the whole WN then the experiment can optimise memory footprint and work on the node, potentially mixing CPU-bound and IO-bound applications to make full use of the node
- Should Accounting be wall clock based? (as with commercial clouds)
- Risk is paying for low efficiency if site is badly configured
- Ideal scenario (sketched in code at the end of this section):
- Start VM if resource available
- Contextualise for VO
- Pilot agent starts and begins to pull jobs from central task queue
- Optimisation is the experiment's problem
- Need way to communicate with VMs to tell them they (will) need to shut down
- How do VMs shut down if they have no work? Do they sleep? Can site tell them to shut down because they are over their pledge?
- Envisage controller in LHCb Dirac to manage VMs
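- A minimal sketch of the ideal scenario above: a VM-resident pilot agent that pulls work from the central task queue and shuts the VM down when idle or when the site asks for the resource back. The task-queue call and the shutdown-time flag are hypothetical placeholders, not actual DIRAC or site interfaces; the real control would sit in the LHCb DIRAC VM controller mentioned above.
 # Minimal sketch of the ideal scenario. fetch_job() and the drain flag are
 # hypothetical placeholders, not actual DIRAC or site interfaces.
 import os, subprocess, time
 DRAIN_FLAG = "/etc/machinefeatures/shutdowntime"  # assumed site-to-VM signal
 def fetch_job():
     """Placeholder: ask the central task queue for matching work (None if empty)."""
     return None
 def pilot_loop(idle_limit=3600):
     idle_since = time.time()
     while True:
         if os.path.exists(DRAIN_FLAG):
             break                      # site wants the slot back: stop taking work
         job = fetch_job()
         if job is None:
             if time.time() - idle_since > idle_limit:
                 break                  # idle too long: release the resource
             time.sleep(60)
             continue
         job.run()                      # start a job agent for this payload
         idle_since = time.time()
     subprocess.call(["shutdown", "-h", "now"])  # the VM, not a batch system, ends the slot
 if __name__ == "__main__":
     pilot_loop()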
ATLAS
- (No slides)
- In 2011, tried building Condor pool in cloud. VMs connecting to Condor pool. Machines running indefinitely. Could use cloud scheduler. Mostly following this solution.
- Also tried batchless configuration. Pilot running in infinite loop. Killing machine if needed. Part of lxcloud project. No need for pilot factory, or Condor pool.
CMS - Claudio Grandi (INFN - Bologna), David Colling (Imperial)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=9&sessionId=0&resId=1&materialId=slides&confId=197796
- CMS is now active in adapting the job submission framework to Cloud interfaces
- Testing on StratusLab
- CMS High Level Trigger (online) farm has been configured to be accessed through OpenStack for offline use when not needed for the online work
- Have looked at feasibility and cost of being able to double MC generation capacity for a month, say in the run up to a conference.
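- A back-of-the-envelope illustration of the kind of estimate behind the last bullet; every number (core count, duration, price per core-hour) is a hypothetical placeholder, not a figure given at the meeting.
 # Back-of-the-envelope sketch only: cost of doubling MC capacity for a month.
 # All numbers are hypothetical placeholders, not figures from the meeting.
 extra_cores = 20000          # assumed size of the extra (cloud) capacity
 hours = 30 * 24              # one month of wall clock
 price_per_core_hour = 0.05   # assumed commercial price, arbitrary currency unit
 core_hours = extra_cores * hours
 print("extra core-hours: %d, cost: %.0f" % (core_hours, core_hours * price_per_core_hour))
 # 20000 cores x 720 h = 14.4M core-hours; at 0.05 per core-hour that is 720000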
ALICE - Predrag Buncic (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=10&sessionId=0&resId=1&materialId=slides&confId=197796
- No official strategy for clouds in ALICE
- Initial work based on CernVM family of solutions
- Uniform access via xrootd, single task queue, no real T1/T2 distinction
- Would welcome and use pure cloud sites, provided they offer an API compatible with public clouds (see the sketch at the end of this section)
- No need for batch, as JobAgent interacts directly with Task Queue
- Summary
- We expect to live in mixed Grid/Cloud environment for some time
- Not an issue, Grid job or Cloud API are just the mechanisms to start Job Agent
- We plan to use CernVM family tools and collaborate with PH/SFT on developments
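- On the point above about pure cloud sites offering a public-cloud-compatible API: a minimal sketch of starting one contextualised VM through an EC2-compatible endpoint using boto. The library choice, endpoint, credentials, image id, flavour and user-data payload are all illustrative assumptions, not an agreed ALICE or WLCG interface.
 # Minimal sketch only: start one contextualised VM via an EC2-compatible API.
 # Endpoint, credentials, image id, flavour and user-data are assumptions.
 from boto.ec2.connection import EC2Connection
 conn = EC2Connection(aws_access_key_id="ACCESS_KEY",
                      aws_secret_access_key="SECRET_KEY",
                      host="cloud.example.org",       # hypothetical site endpoint
                      port=8773, path="/services/Cloud",
                      is_secure=False)
 user_data = "[cernvm]\ncontextualization_key=EXAMPLE\n"  # hypothetical payload
 reservation = conn.run_instances(image_id="ami-00000001",  # hypothetical CernVM image
                                  min_count=1, max_count=1,
                                  instance_type="m1.large",
                                  user_data=user_data)
 print(reservation.instances[0].id)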
Discussion
- Three kinds: commercial clouds (Amazon), private clouds that you own (HLTs), and public clouds (where you don't own the resources)
- For shared national research facilities, batch and fair share help coexist with other users and make use of available resources
- A dedicated facility is harder to get funding for - why isn't it funded from the HEP budget?
- Just running VMs for VOs is a much simpler way of working for dedicated HEP sites
- Need some way of increasing or reducing the number of VMs per VO in response to level of use and pledges
- Question about time constant of changing number of running VMs per VO to respect pledge.
- Cloud storage model?
- Do we want to know where storage is? If Amazon storage was cheap, would we mind?
- What will be the effect of moving responsibility for nodes from sites to VOs that are responsible for what happens inside their VMs?
- Lots of savings from not having to run CEs etc, using software like OpenStack that someone else supports
- Proposing future project using Cloud technology will attract EU money where grid middleware won't
- Accounting simplified by using wall clock and charging VOs while they have VMs running
- Batch systems do allow advance reservations though, perhaps for cheaper times of the day
- Do we agree that if a VO has the whole node then wallclock should be used for pledges?
- Conclusion: clouds are here, there is a little experience in integrating them; do we need a draft work plan?
- If people, especially from sites, are interested in participating please contact Michel