GDB 16th January 2013

Notes taken by Andrew McNab in the meeting room at CERN. Vidyo repeatedly cut out during the meeting.

Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=197796

Welcome - Michel Jouvin (Universite de Paris-Sud 11 (FR))

Update on experiment support after EGI-Inspire SA3 - Ian Bird (CERN)

  • Follow-up of the discussion at October GDB
  • Slides: http://indico.cern.ch/getFile.py/access?contribId=2&resId=1&materialId=slides&confId=197796
  • EMI finishes April 2013; EGI-Inspire to April 2014
  • EU Horizon 2020 (FP8) requires relevance to industry and society (so need more than benefits to sciences)
  • Large (33%, 45%) reductions in CERN ES and GT groups
  • IT must focus resources on cross-experiment solutions
  • IT must continue with DM tools, information system, and simplify build, packaging
  • POOL is now only used by ATLAS and will be supported by them
  • "Previous level of dedicated expert support to experiments unsustainable. Understood by ATLAS/CMS; issue for LHCb and ALICE"

Q&A:

  • Future projects (ie funding):
    • only from 2014 and must be broader than just HEP and probably require industrial partners
    • anything that looks like grid middleware development will not be feasible
    • need commitment from PH etc. not just IT this time
  • IT currently about 40 FTE for this area; going down to 24.

Future Computing Strategy for WLCG - Ian Bird (CERN)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=3&resId=1&materialId=slides&confId=197796
  • Slides are from September 2012
  • Data Management
    • Using more standard tools makes collaboration with other sciences easier
    • Long term data preservation is something that other areas have more experience of
  • Must plan on next generation of e-Infrastructures (5-10 yrs): clouds? terabit networks?
  • HEP software must support concurrency as hardware moves to large numbers of cores
  • Workshop this year on long term strategy for all this
  • Will create a single document for LHCC that explains computing models, how they will evolve for running between LS1 and LS2, based on experiment computing plans (TDRs?)
    • Draft for C-RRB in October 2013

Q&A:

  • Merge/co-ordinate with architects forum?
  • Need for communication to sites about experiments' intentions about hardware architectures
  • Do funding agencies really understand that LHC data builds up so storage requirements continually increase?
  • Need more resources in 2014 so ready for increased rates from machine from mid-2015
  • Difficulty of working with other big data areas: need to change existing frameworks (on either side); and overheads if involving other fields on a token basis

Survey about configuration management tools and plans regarding central banning - Peter Solagna (EGI.eu)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=4&resId=1&materialId=slides&confId=197796
  • Had 82 unsupported services in December; now 26.
  • In December, 44 sites had unsupported middleware; now 15.
  • Currently: 32 DPMs, 2 LFCs and 92 CEs still pointing at gLite WNs.
  • From February, DPMs, LFCs and WNs that have not been upgraded must go into downtime (a minimal version-check sketch follows this list).
  • YAIM's future is under discussion as the end of EMI nears; product teams may choose their own solutions.
  • A survey will be sent out to see what sites are doing (quattor, Puppet, Chef, etc.)
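
Relating to the unsupported-middleware figures above, the following is a minimal sketch of how a site might check whether a node still carries gLite-era metapackages rather than their EMI replacements. The package names and the assumption of an RPM-based host are illustrative only; adjust them to whatever the site actually deploys.

    #!/usr/bin/env python3
    # Sketch: report whether this node still has gLite-era metapackages installed.
    # Package names below are illustrative assumptions, not a definitive list.
    import os
    import subprocess

    OLD_PACKAGES = ["glite-WN", "glite-SE_dpm_mysql", "glite-LFC_mysql"]  # assumed gLite names
    NEW_PACKAGES = ["emi-wn"]                                             # assumed EMI name

    DEVNULL = open(os.devnull, "w")

    def installed(pkg):
        """Return True if an RPM with this name is installed."""
        return subprocess.call(["rpm", "-q", pkg], stdout=DEVNULL, stderr=DEVNULL) == 0

    old = [p for p in OLD_PACKAGES if installed(p)]
    new = [p for p in NEW_PACKAGES if installed(p)]
    if old and not new:
        print("gLite packages still installed, upgrade needed: %s" % ", ".join(old))
    else:
        print("no gLite-only metapackages found")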

Q&A:

  • Sites dependent on the tarball WN shouldn't have to go into downtime, since the "tarball" via CVMFS has only now been published
  • Still interest in a central banning architecture; work is ongoing, with ARGUS the frontrunner for how to enforce it

CERN Computing Facilities’ Evolution - Wayne Salter (CERN)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=5&resId=1&materialId=slides&confId=197796
  • Lots of detail about upgrade to cooling, UPS etc in machine rooms. See slides.
  • Parameters of Wigner extension to CERN T0 in Hungary:
    • To be considered as an extension of the CERN Tier0, i.e. should be transparent for the experiments
    • IT Operations to be done as if systems were at CERN
    • All operations not requiring physical intervention to be done remotely from CERN
    • All physical interventions to be done by Wigner staff on request from CERN using CERN ticketing system
    • Wigner responsible for all facility operations

Operations Co-ordination pre-GDB summary - Maria Girone (CERN)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=1&resId=1&materialId=slides&confId=197796
  • Many experiment improvements planned for LS1
  • Willingness to approach solutions in common
  • Topics: Network workshop summary / Security / Information System / CVMFS / Data Management / Clouds
  • Discussions were part of each topic rather than in the planned roundtable at the end
  • Vidyo problems as usual
  • Giving ownership of CVMFS tasks to people is leading to more progress
  • SHA-2 production certs will not be issued before August 1st (a signature-algorithm check sketch follows this list)
    • Validation during the following two quarters
    • All deployed software to support SHA-2 in early summer
    • Should all be transparent to users as SHA-2 issued when certificates are renewed
  • Agreed to have a fully deployed and tested glexec system by end of 2013
    • Fully validated system at scale by end of LS1
  • Information working group proposed a central Service Discovery system (WLCG IS: static information about services)
    • Information is collected from a number of sources and aggregated in a WLCG Information System
    • e.g. with systems like GOCDB as inputs; experiments could use this as input to their own systems (see the GOCDB query sketch after this list)
  • Experiments agreed on the need to finish deploying FTS3
  • Several experiments plan catalog improvements during LS1
  • Remote data access proceeding for all four LHC experiments
    • http and/or xrootd
    • CMS proposal for WNs to be on the OPN and use it directly was not seen as damaging
    • Discussion about caching vs data placement strategies
  • "Reasonable consensus on continuing the work on separating disk and tape"
  • HLT resources are the largest “cloud” resources available to the experiments
    • HLT system for ATLAS and CMS is 10-15% of their Grid resources.
    • ALICE's HLT resources in 2018 are estimated at 250k cores.
    • LHCb already makes efficient use of the HLT
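
Relating to the SHA-2 bullet above, here is a minimal sketch of how a user or site could check which signature algorithm a given certificate carries, driving the standard openssl command line from Python. The certificate path is an illustrative assumption.

    #!/usr/bin/env python3
    # Sketch: print the signature algorithm of an X.509 certificate, e.g. to see
    # whether it is still SHA-1 based. The path below is an illustrative assumption.
    import subprocess

    CERT = "/home/user/.globus/usercert.pem"  # assumed location of a user certificate

    text = subprocess.check_output(["openssl", "x509", "-in", CERT, "-noout", "-text"])
    for line in text.decode("utf-8", "replace").splitlines():
        if "Signature Algorithm" in line:
            print(line.strip())
            break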

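To illustrate the kind of static service information the proposed WLCG Information System would aggregate, the sketch below queries GOCDB's public programmatic interface for the service endpoints registered at one site. The PI URL, method name and example site name are assumptions based on the GOCDB PI as commonly used at the time; the available fields may differ.

    #!/usr/bin/env python3
    # Sketch: list service endpoints registered in GOCDB for one site.
    # URL, method and site name are assumptions for illustration.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    URL = ("https://goc.egi.eu/gocdbpi/public/"
           "?method=get_service_endpoint&sitename=RAL-LCG2")  # assumed example site

    root = ET.fromstring(urlopen(URL).read())
    for ep in root.findall("SERVICE_ENDPOINT"):
        print("%-40s %s" % (ep.findtext("HOSTNAME"), ep.findtext("SERVICE_TYPE")))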

Future directions for work on virtualisation/clouds - Michel Jouvin (LAL)

  • Slides: http://indico.cern.ch/getFile.py/access?sessionId=0&resId=0&materialId=0&confId=197796
  • Proposal from December GDB is for WLCG to continue HEPiX virtualisation work
    • Assess interest in this today
  • From January pre-GDB, experiments plan to use HLT farms as clouds during LS1
    • Currently no option to access the accounting services on these clouds
    • ATLAS has plans to use PanDA statistics
    • No full agreement: feeling that public resource usage should go into WLCG accounting
    • For private clouds, APEL has demonstrated ability to collect data from clouds
  • CERN Agile infrastructure tests with ATLAS/CMS/IT/ES
    • Test jobs and real production jobs
    • ~200 VMs of 4 cores/8 GB for each experiment
    • Condor used to instantiate VMs
    • Successful tests, although some CPU efficiency problems not understood
  • All experiments except LHCb planning to build a cloud out of HLT clusters (see the instantiation sketch after this list)
    • ATLAS/CMS using OpenStack but not part of CERN Agile
    • ALICE will use CernVM Cloud
    • LHCb will use existing non-cloud infrastructure on HLT

T0 - Juan Manuel Guijarro (CERN)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=6&sessionId=0&resId=0&materialId=slides&confId=197796
  • Experiment pilot factories running on public clouds (Helix Nebula)
  • Future site clouds: BNL, CNAF, IN2P3, UVIC; HLT farms, CERN IT OpenStack
  • Need to be realistic and aligned with industry standards
  • Accounting
    • Strategy - wall clock instead of CPU time? (a small worked comparison follows this list)
    • Multiple tenants
    • Integration with APEL
  • Scheduling
    • Dealing with limited resources? How to reclaim resources?
    • Is fair share mechanism needed with clouds?
  • Federated Identity Management
    • Multiple IT systems and organisations
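
Relating to the accounting bullet above, a small worked comparison (with invented numbers) of what wall-clock and CPU-time accounting would charge for the same two VM slots; on a wall-clock basis, as on commercial clouds, the VO carries the cost of any idle time inside its VMs.

    # Worked illustration with invented numbers: wall clock vs CPU time for VM slots.
    slots = [
        # (wall-clock hours, CPU hours) for a VM slot over one day
        (24.0, 21.6),   # well-filled VM, 90% CPU efficiency
        (24.0, 12.0),   # half-idle VM, 50% CPU efficiency
    ]

    wall_total = sum(w for w, c in slots)   # 48.0 h
    cpu_total = sum(c for w, c in slots)    # 33.6 h

    print("wall-clock charged: %.1f h" % wall_total)
    print("CPU-time charged:   %.1f h" % cpu_total)
    print("overall efficiency: %.0f%%" % (100.0 * cpu_total / wall_total))  # 70%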

LHCb - Using institutional clouds - Philippe Charpentier (CERN)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=7&sessionId=0&resId=0&materialId=slides&confId=197796
  • Say "institutional" rather than "private" clouds
  • LHCb just needs pilot jobs running on the farm somehow
  • Pilot job pulls work from the task queue
    • Pilot job starts one or more job agents to run a job (see the pilot-loop sketch after this list)
    • Jobs may generate more than one thread
  • Batch system just needed to place pilot job on WN
    • Batch systems also enforce limits to length of jobs - is this actually needed?
    • Graphs comparing oscillation vs steady state between sites with different policies
  • Advantages and disadvantages of VMs
    • Better to have VM running much longer than jobs
    • If VM is whole WN then experiment can optimise memory footprint and work on the node, potentially mixing CPU-bound and IO-bound applications to make full use of the node
    • Should Accounting be wall clock based? (as with commercial clouds)
      • Risk is paying for low efficiency if site is badly configured
  • Ideal scenario:
    • Start VM if resource available
    • Contextualise for VO
    • Pilot agent starts and begins to pull jobs from central task queue
    • Optimisation is the experiment's problem
    • Need way to communicate with VMs to tell them they (will) need to shut down
    • How do VMs shut down if they have no work? Do they sleep? Can site tell them to shut down because they are over their pledge?
  • Envisage controller in LHCb Dirac to manage VMs
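
As a complement to the "ideal scenario" above, the following is a minimal sketch of the pilot cycle described in the talk: a VM-hosted pilot pulls payloads from a central task queue, runs them, and lets the VM stop when there is no more work or the site asks for the resource back. All interfaces here are hypothetical placeholders, not DIRAC APIs.

    #!/usr/bin/env python3
    # Sketch of a VM-hosted pilot loop; every interface is a hypothetical placeholder.
    import time

    def get_matching_job(task_queue):
        """Ask the central task queue for a payload matching this resource (placeholder)."""
        return task_queue.pop(0) if task_queue else None

    def shutdown_requested():
        """Placeholder for a site/VO signal telling the VM to drain and stop."""
        return False

    def run_payload(job):
        print("running payload: %s" % job)

    def pilot_loop(task_queue, poll_interval=60, max_idle=600):
        """Pull and run payloads until told to stop or idle for too long."""
        idle = 0
        while not shutdown_requested():
            job = get_matching_job(task_queue)
            if job is not None:
                idle = 0
                run_payload(job)
            else:
                idle += poll_interval
                if idle >= max_idle:
                    break            # nothing to do: let the site reclaim the VM
                time.sleep(poll_interval)
        print("drained; the VM can now halt itself")

    pilot_loop(["job-1", "job-2", "job-3"], poll_interval=1, max_idle=2)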

ATLAS

  • (No slides)
  • In 2011, tried building a Condor pool in the cloud, with VMs connecting to the Condor pool and machines running indefinitely. Could use a cloud scheduler. Mostly following this solution.
  • Also tried a batchless configuration: pilot running in an infinite loop, killing the machine if needed. Part of the lxcloud project. No need for a pilot factory or Condor pool.

CMS - Claudio Grandi (INFN - Bologna), David Colling (Imperial)

  • Have looked at feasibility and cost of being able to double MC generation capacity for a month, say in the run up to a conference.

ALICE - Predrag Buncic (CERN)

  • Slides: http://indico.cern.ch/getFile.py/access?contribId=10&sessionId=0&resId=1&materialId=slides&confId=197796
  • No official strategy for clouds in ALICE
  • Initial work based on CernVM family of solutions
  • Uniform access via xrootd, single task queue, no real T1/T2 distinction
  • Would welcome and use pure cloud sites, provided they offer an API compatible with public clouds
  • No need for batch, as JobAgent interacts directly with Task Queue
  • Summary
    • We expect to live in mixed Grid/Cloud environment for some time
    • Not an issue, Grid job or Cloud API are just the mechanisms to start Job Agent
    • We plan to use CernVM family tools and collaborate with PH/SFT on developments

Discussion

  • It's commercial clouds (Amazon), private clouds that you own (HLTs), and public clouds (where you don't own the resources)
    • For shared national research facilities, batch and fair share help coexist with other users and make use of available resources
    • Dedicated facility harder to get funding for - why isn't it funded from the HEP budget?
  • Just running VMs for VOs is a much simpler way of working for dedicated HEP sites
    • Need some way of increasing or reducing the number per VO in response to level of use and pledges (a minimal pledge-scaling sketch follows this list)
  • Question about time constant of changing number of running VMs per VO to respect pledge.
  • Cloud storage model?
    • Do we want to know where storage is? If Amazon storage was cheap, would we mind?
  • What will be the effect of moving responsibility for nodes from sites to VOs that are responsible for what happens inside their VMs?
  • Lots of savings from not having to run CEs etc, using software like OpenStack that someone else supports
  • Proposing future project using Cloud technology will attract EU money where grid middleware won't
  • Accounting simplified by using wall clock and charging VOs while they have VMs running
  • Batch systems do allow advance reservations though, perhaps for cheaper times of the day
  • Do we agree that if you have the whole node then we should use wall clock for pledges?
  • Conclusion: clouds are here, there is a little experience in integrating them; need to draft a work plan?
    • If people, especially from sites, are interested in participating please contact Michel
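
On the pledge point raised above, a small sketch of how a site or VO controller might steer per-VO VM counts towards pledge shares; the numbers and the simple proportional policy are invented for illustration.

    # Sketch: steer per-VO VM counts towards pledge shares (invented numbers).
    TOTAL_SLOTS = 1000   # VM slots available at the site

    pledge_fraction = {"atlas": 0.50, "cms": 0.30, "lhcb": 0.15, "alice": 0.05}
    running = {"atlas": 420, "cms": 380, "lhcb": 120, "alice": 30}

    for vo in sorted(pledge_fraction):
        target = int(round(pledge_fraction[vo] * TOTAL_SLOTS))
        delta = target - running[vo]
        action = "start" if delta > 0 else "stop (or let drain)"
        print("%-6s target=%4d running=%4d -> %s %d VMs"
              % (vo, target, running[vo], action, abs(delta)))

    # How fast to apply the deltas (the "time constant" question above) is policy:
    # killing VMs early wastes work; waiting for them to drain converges more slowly.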