GDB 16th January 2013
Notes taken by Andrew McNab in the room used at CERN. Vidyo repeatedly cut out during the meeting.
Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=197796
Contents
- 1 Welcome - Michel Jouvin (Universite de Paris-Sud 11 (FR))
- 2 Update on experiment support after EGI-Inspire SA3 - Ian Bird (CERN)
- 3 Future Computing Strategy for WLCG - Ian Bird (CERN)
- 4 Survey about configuration management tools and plans regarding central banning - Peter Solagna (EGI.eu)
- 5 CERN Computing Facilities’ Evolution - Wayne Salter (CERN)
- 6 Operations Co-ordination pre-GDB summary - Maria Girone (CERN)
- 7 Future directions for work on virtualisation/clouds - Michel Jouvin (LAL)
Welcome - Michel Jouvin (Universite de Paris-Sud 11 (FR))
- Slides: http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=197796
- GDB summaries, not official minutes, at: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGGDBDocs
- Looking for more volunteers to create summaries (Michel will format and take responsibility for any mistakes etc)
- Vidyo unreliable at pre-GDB yesterday
- April GDB needs moving due to a clash with the EGI Community Forum: may move a week earlier or be cancelled.
- March GDB is at KIT rather than CERN. Essential that you register with KIT to get site access.
- February topics: “Post-EMI discussion” follow-up, Federated storage WG final report; Future data access protocols (WG); Storage accounting update; Glexec next milestones
- February pre-GDB on AAI in Storage
- February/March storage WG face to face? Wahid polling the group.
- Actions in Progress: https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress
- MW client in CVMFS done (Matt) - request for testing - http://www.sysadmin.hep.ac.uk/wiki/EMI2Tarball
- Suggestion from Ian to merge pre-GDB and GDB to allow more in-depth discussion.
- Risk is lower attendance if it becomes a two-day meeting.
- Need to co-ordinate agenda(s) better.
Update on experiment support after EGI-Inspire SA3 - Ian Bird (CERN)
- Follow-up of the discussion at October GDB
- Slides: http://indico.cern.ch/getFile.py/access?contribId=2&resId=1&materialId=slides&confId=197796
- EMI finishes April 2013; EGI-Inspire to April 2014
- EU Horizon 2020 (FP8) requires relevance to industry and society (so need more than benefits to sciences)
- Large (33%,45%) reductions in CERN ES and GT groups
- IT must focus resources on cross-experiment solutions
- IT must continue with DM tools, information system, and simplify build, packaging
- POOL now only used by ATLAS and will be supported by them
- "Previous level of dedicated expert support to experiments unsustainable. Understood by ATLAS/CMS; issue for LHCb and ALICE"
Q&A:
- Future projects (i.e. funding):
- only from 2014; must be broader than just HEP and will probably require industrial partners
- anything that looks like grid middleware development will not be feasible
- need commitment from PH etc, not just IT, this time
- IT currently about 40 FTE for this area; going down to 24.
Future Computing Strategy for WLCG - Ian Bird (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=3&resId=1&materialId=slides&confId=197796
- Slides are from September 2012
- Data Management
- Using more standard tools makes collaboration with other sciences easier
- Long term data preservation is something that other areas have more experience of
- Must plan on next generation of e-Infrastructures (5-10 yrs): clouds? terabit networks?
- HEP software must support concurrency as hardware moves to large numbers of cores
- Workshop this year on long term strategy for all this
- Will create a single document for LHCC that explains computing models, how they will evolve for running between LS1 and LS2, based on experiment computing plans (TDRs?)
- Draft for C-RRB in October 2013
Q&A:
- Merge/co-ordinate with architects forum?
- Need for communication to sites about experiments' intentions about hardware architectures
- Do funding agencies really understand that LHC data builds up so storage requirements continually increase?
- Need more resources in 2014 to be ready for increased rates from the machine from mid-2015
- Difficulty of working with other big data areas: need to change existing frameworks (on either side); and overheads if involving other fields on a token basis
Survey about configuration management tools and plans regarding central banning - Peter Solagna (EGI.eu)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=4&resId=1&materialId=slides&confId=197796
- Had 82 unsupported services in December; now 26.
- In December, 44 sites with unsupported middleware; 15 now.
- Currently: 32 DPMs, 2 LFCs, 92 CEs still pointing at gLite WNs.
- From February, DPM, LFC and WNs not upgraded must go into downtime.
- YAIM under discussion as end of EMI nears. Product teams may choose own solutions.
- Survey will be sent out to see what sites are doing (quattor, puppet, Chef etc)
Q&A:
- Sites dependent on the tarball shouldn't have to go into downtime, since the "tarball" via CVMFS has only just been published
- Still interest in a central banning architecture; still being worked on, with ARGUS the frontrunner for how to enforce it
CERN Computing Facilities’ Evolution - Wayne Salter (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=5&resId=1&materialId=slides&confId=197796
- Lots of detail about upgrade to cooling, UPS etc in machine rooms. See slides.
- Parameters of Wigner extension to CERN T0 in Hungary:
- To be considered as an extension of the CERN Tier0, i.e. should be transparent for the experiments
- IT Operations to be done as if systems were at CERN
- All operations not requiring physical intervention to be done remotely from CERN
- All physical interventions to be done by Wigner staff on request from CERN using CERN ticketing system
- Wigner responsible for all facility operations
Operations Co-ordination pre-GDB summary - Maria Girone (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=1&resId=1&materialId=slides&confId=197796
- Many experiment improvements planned for LS1
- Willingness to approach solutions in common
- Topics: Network workshop summary / Security / Information System / CVMFS / Data Management / Clouds
- Discussions were part of each topic rather than in the planned roundtable at the end
- Vidyo problems as usual
- Giving ownership of CVMFS tasks to people is leading to more progress
- SHA-2 production certs will not be issued before August 1st (a minimal SHA-2 check is sketched at the end of this section)
- Validation during the following two quarters
- All deployed software to support SHA-2 in early summer
- Should all be transparent to users as SHA-2 issued when certificates are renewed
- Agreed to have a fully deployed and tested glexec system by end of 2013
- Fully validated system at scale by end of LS1
- Information working group proposed a central Service Discovery system (WLCG IS: static information about services)
- Information is gathered from a number of sources and collected in a WLCG Information System
- eg with systems like GOCDB as inputs; experiments could use this as input to their own systems
- Experiments agreed on the need to finish deploying FTS3
- Several experiments plan catalog improvements during LS1
- Remote data access proceeding for all four LHC experiments
- http and/or xrootd
- CMS proposal for WNs to be on the OPN and use it directly not seen as damaging
- Discussion about caching vs data placement strategies
- "Reasonable consensus on continuing the work on separating disk and tape"
- HLT resources are the largest “cloud” resources available to the experiments
- HLT system for ATLAS and CMS is 10-15% of their Grid resources.
- ALICE's 2018 HLT resources are estimated at 250k cores.
- LHCb already makes efficient use of the HLT
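- On the SHA-2 item above, a minimal sketch of how a site might check whether a certificate already carries a SHA-2 signature; the Python cryptography package and the hostcert path are assumptions for illustration, not tooling prescribed in the notes.
 # Minimal sketch only: report whether a certificate uses a SHA-2 family signature.
 # The 'cryptography' package and the hostcert path are illustrative assumptions.
 from cryptography import x509
 from cryptography.hazmat.backends import default_backend
 from cryptography.hazmat.primitives import hashes
 SHA2_ALGS = (hashes.SHA224, hashes.SHA256, hashes.SHA384, hashes.SHA512)
 def is_sha2_signed(pem_path):
     """Return True if the certificate at pem_path is signed with a SHA-2 hash."""
     with open(pem_path, "rb") as f:
         cert = x509.load_pem_x509_certificate(f.read(), default_backend())
     return isinstance(cert.signature_hash_algorithm, SHA2_ALGS)
 if __name__ == "__main__":
     print(is_sha2_signed("/etc/grid-security/hostcert.pem"))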
Future directions for work on virtualisation/clouds - Michel Jouvin (LAL)
- Slides: http://indico.cern.ch/getFile.py/access?sessionId=0&resId=0&materialId=0&confId=197796
- Proposal from December GDB is for WLCG to continue HEPiX virtualisation work
- Assess interest in this today
- From January pre-GDB, experiments plan to use HLT farms as clouds during LS1
- Currently no option to access the accounting services on these clouds
- ATLAS has plans to use PanDA statistics
- No full agreement: feeling that public resource usage should go into WLCG accounting
- For private clouds, APEL has demonstrated the ability to collect data from clouds (a toy wall-clock aggregation is sketched at the end of this section)
- CERN Agile infrastructure tests with ATLAS/CMS/IT/ES
- Test jobs and real production jobs
- ~200 VMs of 4 cores/8 GB for each experiment
- Condor used to instantiate VMs
- Successful tests, although some CPU efficiency problems not understood
- All experiments except LHCb planning to build cloud out of HLT clusters
- ATLAS/CMS using OpenStack but not part of CERN Agile
- ALICE will use CernVM Cloud
- LHCb will use existing non-cloud infrastructure on HLT
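- To illustrate the accounting point above: a toy sketch of the kind of aggregation a wall-clock-based cloud accounting record implies, summing cores x wall-clock hours per VO from VM start/stop times. Field names and sample data are assumptions, not the actual APEL cloud schema.
 # Toy sketch only: aggregate wall-clock usage per VO from VM lifecycle records.
 # Field names and sample data are illustrative assumptions, not the APEL schema.
 from datetime import datetime
 vm_records = [
     {"vo": "atlas", "start": "2013-01-10T08:00:00", "stop": "2013-01-10T20:00:00", "cores": 4},
     {"vo": "cms",   "start": "2013-01-10T09:00:00", "stop": "2013-01-11T09:00:00", "cores": 4},
 ]
 def wallclock_core_hours(records):
     """Sum cores x wall-clock hours per VO, regardless of CPU efficiency inside the VM."""
     totals = {}
     for r in records:
         start = datetime.strptime(r["start"], "%Y-%m-%dT%H:%M:%S")
         stop = datetime.strptime(r["stop"], "%Y-%m-%dT%H:%M:%S")
         hours = (stop - start).total_seconds() / 3600.0
         totals[r["vo"]] = totals.get(r["vo"], 0.0) + hours * r["cores"]
     return totals
 print(wallclock_core_hours(vm_records))  # {'atlas': 48.0, 'cms': 96.0}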
T0 - Juan Manuel Guijarro (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=6&sessionId=0&resId=0&materialId=slides&confId=197796
- Experiment pilot factories running on public clouds (Helix Nebula)
- Future site clouds: BNL, CNAF, IN2P3, UVIC; HLT farms, CERN IT OpenStack
- Need to be realistic and aligned with industry standards
- Accounting
- Strategy - wall clock instead of CPU time?
- Multiple tenants
- Integration with APEL
- Scheduling
- Dealing with limited resources? How to reclaim resources?
- Is a fair share mechanism needed with clouds? (a toy per-VO calculation is sketched at the end of this section)
- Federated Identity Management
- Multiple IT systems and organisations
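- On the scheduling/fair-share question above: a toy sketch of one possible policy, splitting a fixed pool of VM slots between VOs in proportion to pledge and reclaiming VMs from VOs over their share. Numbers and policy are purely illustrative; nothing like this was agreed at the meeting.
 # Toy sketch only: share a fixed pool of VM slots between VOs by pledge.
 # A negative delta means the site would reclaim running VMs from that VO.
 # Numbers and policy are illustrative assumptions, not an agreed mechanism.
 def vm_targets(total_slots, pledges, running):
     """Return {vo: (target, delta_vs_running)} from pledge fractions."""
     pledge_sum = float(sum(pledges.values()))
     result = {}
     for vo, pledge in pledges.items():
         target = int(round(total_slots * pledge / pledge_sum))
         result[vo] = (target, target - running.get(vo, 0))
     return result
 print(vm_targets(800,
                  pledges={"atlas": 40, "cms": 40, "lhcb": 15, "alice": 5},
                  running={"atlas": 500, "cms": 200, "lhcb": 50, "alice": 0}))
 # atlas -> (320, -180): 180 ATLAS VMs would be drained to respect the shares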
LHCb - Using institutional clouds - Philippe Charpentier (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=7&sessionId=0&resId=0&materialId=slides&confId=197796
- Say "institutional" rather than "private" clouds
- LHCb just needs pilot jobs running on the farm somehow
- Pilot job pulls work from the task queue
- Pilot job starts one or more job agents to run a job
- Jobs may generate more than one thread
- Batch system just needed to place pilot job on WN
- Batch systems also enforce limits to length of jobs - is this actually needed?
- Graphs comparing oscillation vs steady state between sites with different policies
- Advantages and disadvantages of VMs
- Better to have VM running much longer than jobs
- If VM is the whole WN then the experiment can optimise memory footprint and work on the node, potentially mixing CPU-bound and IO-bound applications to make full use of the node
- Should Accounting be wall clock based? (as with commercial clouds)
- Risk is paying for low efficiency if site is badly configured
- Ideal scenario (sketched in code at the end of this section):
- Start VM if resource available
- Contextualise for VO
- Pilot agent starts and begins to pull jobs from central task queue
- Optimisation is the experiment's problem
- Need way to communicate with VMs to tell them they (will) need to shut down
- How do VMs shut down if they have no work? Do they sleep? Can site tell them to shut down because they are over their pledge?
- Envisage controller in LHCb Dirac to manage VMs
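- A minimal sketch of the ideal scenario above: a VM-resident pilot agent that pulls work from the central task queue and shuts the VM down when idle or when the site asks for the resource back. The task-queue call and the shutdown-time flag are hypothetical placeholders, not actual DIRAC or site interfaces; the real control would sit in the LHCb DIRAC VM controller mentioned above.
 # Minimal sketch of the ideal scenario. fetch_job() and the drain flag are
 # hypothetical placeholders, not actual DIRAC or site interfaces.
 import os, subprocess, time
 DRAIN_FLAG = "/etc/machinefeatures/shutdowntime"  # assumed site-to-VM signal
 def fetch_job():
     """Placeholder: ask the central task queue for matching work (None if empty)."""
     return None
 def pilot_loop(idle_limit=3600):
     idle_since = time.time()
     while True:
         if os.path.exists(DRAIN_FLAG):
             break                      # site wants the slot back: stop taking work
         job = fetch_job()
         if job is None:
             if time.time() - idle_since > idle_limit:
                 break                  # idle too long: release the resource
             time.sleep(60)
             continue
         job.run()                      # start a job agent for this payload
         idle_since = time.time()
     subprocess.call(["shutdown", "-h", "now"])  # the VM, not a batch system, ends the slot
 if __name__ == "__main__":
     pilot_loop()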
ATLAS
- (No slides)
- In 2011, tried building Condor pool in cloud. VMs connecting to Condor pool. Machines running indefinitely. Could use cloud scheduler. Mostly following this solution.
- Also tried batchless configuration. Pilot running in infinite loop. Killing machine if needed. Part of lxcloud project. No need for pilot factory, or Condor pool.
CMS - Claudio Grandi (INFN - Bologna), David Colling (Imperial)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=9&sessionId=0&resId=1&materialId=slides&confId=197796
- CMS is now active in adapting the job submission framework to Cloud interfaces
- Testing on StratusLab
- CMS High Level Trigger (online) farm has been configured to be accessed through OpenStack for offline use when not needed for the online work
- Have looked at feasibility and cost of being able to double MC generation capacity for a month, say in the run up to a conference.
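- A back-of-the-envelope illustration of the kind of estimate behind the last bullet; every number (core count, duration, price per core-hour) is a hypothetical placeholder, not a figure given at the meeting.
 # Back-of-the-envelope sketch only: cost of doubling MC capacity for a month.
 # All numbers are hypothetical placeholders, not figures from the meeting.
 extra_cores = 20000          # assumed size of the extra (cloud) capacity
 hours = 30 * 24              # one month of wall clock
 price_per_core_hour = 0.05   # assumed commercial price, arbitrary currency unit
 core_hours = extra_cores * hours
 print("extra core-hours: %d, cost: %.0f" % (core_hours, core_hours * price_per_core_hour))
 # 20000 cores x 720 h = 14.4M core-hours; at 0.05 per core-hour that is 720000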
ALICE - Predrag Buncic (CERN)
- Slides: http://indico.cern.ch/getFile.py/access?contribId=10&sessionId=0&resId=1&materialId=slides&confId=197796
- No official strategy for clouds in ALICE
- Initial work based on CernVM family of solutions
- Uniform access via xrootd, single task queue, no real T1/T2 distinction
- Would welcome and use pure cloud sites, provided they offer an API compatible with public clouds (see the sketch at the end of this section)
- No need for batch, as JobAgent interacts directly with Task Queue
- Summary
- We expect to live in mixed Grid/Cloud environment for some time
- Not an issue, Grid job or Cloud API are just the mechanisms to start Job Agent
- We plan to use CernVM family tools and collaborate with PH/SFT on developments
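- On the point above about pure cloud sites offering a public-cloud-compatible API: a minimal sketch of starting one contextualised VM through an EC2-compatible endpoint using boto. The library choice, endpoint, credentials, image id, flavour and user-data payload are all illustrative assumptions, not an agreed ALICE or WLCG interface.
 # Minimal sketch only: start one contextualised VM via an EC2-compatible API.
 # Endpoint, credentials, image id, flavour and user-data are assumptions.
 from boto.ec2.connection import EC2Connection
 conn = EC2Connection(aws_access_key_id="ACCESS_KEY",
                      aws_secret_access_key="SECRET_KEY",
                      host="cloud.example.org",       # hypothetical site endpoint
                      port=8773, path="/services/Cloud",
                      is_secure=False)
 user_data = "[cernvm]\ncontextualization_key=EXAMPLE\n"  # hypothetical payload
 reservation = conn.run_instances(image_id="ami-00000001",  # hypothetical CernVM image
                                  min_count=1, max_count=1,
                                  instance_type="m1.large",
                                  user_data=user_data)
 print(reservation.instances[0].id)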
Discussion
- Three kinds: commercial clouds (Amazon), private clouds that you own (HLTs), and public clouds (where you don't own the resources)
- For shared national research facilities, batch and fair share help coexist with other users and make use of available resources
- A dedicated facility is harder to get funding for - why isn't it funded from the HEP budget?
- Just running VMs for VOs is a much simpler way of working for dedicated HEP sites
- Need some way of increasing or reducing the number of VMs per VO in response to level of use and pledges
- Question about time constant of changing number of running VMs per VO to respect pledge.
- Cloud storage model?
- Do we want to know where storage is? If Amazon storage was cheap, would we mind?
- What will be the effect of moving responsibility for nodes from sites to VOs that are responsible for what happens inside their VMs?
- Lots of savings from not having to run CEs etc, using software like OpenStack that someone else supports
- Proposing future project using Cloud technology will attract EU money where grid middleware won't
- Accounting simplified by using wall clock and charging VOs while they have VMs running
- Batch systems do allow advance reservations though, perhaps for cheaper times of the day
- Do we agree that if a VO has the whole node then wallclock should be used for pledges?
- Conclusion: clouds are here, there is a little experience in integrating them; do we need a draft work plan?
- If people, especially from sites, are interested in participating please contact Michel