GDB 9th March 2011
March GDB
Wednesday 09 March 2011 at CC-IN2P3, Lyon
http://indico.cern.ch/conferenceDisplay.py?confId=126130
Contents
- 1 Introduction (John Gordon)
- 2 ACE - availabilities based on CREAM (Wojciech Lapka)
- 3 CREAM status (John Gordon)
- 4 glexec and MUPJ update (Maarten Litmaath)
- 5 LHC Open Network Environment (LHCONE) (John Shade)
- 6 File-systems Efficiency (Xavier Canehan)
- 7 LHCb (Stefan Roiser)
- 8 ATLAS (Stephan Jezequel)
- 9 CMS (Daniele Bonacorsi)
Introduction (John Gordon)
July meeting cancelled due to WLCG workshop.
forthcoming events:
- HEPIX Darmstadt 2-6 May.
- WLCG meeting DESY 11-13th July.
- EGI Technical Forum 19-23rd Sept.
News:
- R-GMA registry closed on 1 March.
- Sites publishing through glite-MON will no longer work.
- CPU installed capacity - 13 sites not publishing.
- 59 sites not publishing shares for LHC VOs.
CernVM-FS (CVMFS):
- CERN IT support being finalised.
- Security audit ready to report (positively).
- Replication/mirroring process is working: the mirror at RAL updates every hour. Working on updates triggered by CERN whenever changes occur, for use with RAL WNs. BNL progressing.
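CVMFS replication is pull-based: a mirror periodically compares the upstream repository revision with its own and fetches only when something new has been published, so an hourly update run is cheap when nothing has changed. A minimal conceptual sketch of that loop (the revision/fetch functions are simplified stand-ins, not the actual CVMFS tooling):

```python
# Conceptual sketch of pull-based mirror replication, as used for CVMFS
# mirrors. The revision/fetch logic is a simplified stand-in for the
# real CVMFS snapshot tooling.

def upstream_revision(state):
    """Revision currently published at the master server."""
    return state["upstream"]

def local_revision(state):
    """Revision last replicated to this mirror."""
    return state["local"]

def replicate(state):
    """Pull new content only if the upstream revision has advanced.

    Returns True if a snapshot was taken, False if already up to date.
    """
    if upstream_revision(state) <= local_revision(state):
        return False  # nothing new; the hourly run is a no-op
    # In reality: download the changed catalogs and file chunks, then
    # atomically switch the mirror over to the new revision.
    state["local"] = state["upstream"]
    return True

state = {"upstream": 42, "local": 41}
print(replicate(state))  # mirror was behind: snapshot taken
print(replicate(state))  # already current: no-op
```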
ACE - availabilities based on CREAM (Wojciech Lapka)
- CREAM nagios probe; everything OK for Tier-1 sites.
- Differences for ~30 sites - including Imperial and Brunel.
- Request to sites: please check that your services are correctly declared in GOCDB.
- New FCR mechanism is being tested by CMS.
CREAM status (John Gordon)
- A few sites in UK not supporting CREAM CE.
- Correct availability calculation should be available at end of March.
- WLCG pushing to decommission LCG-CE.
glexec and MUPJ update (Maarten Litmaath)
- Most Tier-1s now passing glexec test.
- test URL: https://samnag023.cern.ch/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CE&style=detail
ATLAS
- glexec tests now OK at various UK Tier-2's (and BNL)
- target proxies not accepted by PanDA
CMS
- USCMS use glexec in production at a few OSG sites (setuid mode)
- other T1/T2 sites being tested
- the glexec location ($GLEXEC_LOCATION/sbin) needs to be defined explicitly (bug opened)
- issue with the tarball WN installation still not clear
LHCb running Nagios glexec tests as well
- CREAM not yet tested
ALICE
- integrating glexec into AliEn
Next big push is to roll out glexec at the Tier-2s.
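For context on the items above: a multi-user pilot job hands each payload to glexec, which maps the payload owner's proxy to a local account and runs the command as that account; this is also why the $GLEXEC_LOCATION/sbin path matters to the frameworks. A dry-run sketch of the handoff (paths and payload command are illustrative; the environment variable names are per the glexec documentation as best recalled):

```python
import os

# Sketch of how a multi-user pilot job hands a payload to glexec for
# identity switching. Paths and the payload command are illustrative;
# the environment variable names follow the glexec documentation.

def build_glexec_command(payload_proxy, payload_cmd,
                         glexec_location="/opt/glite"):
    """Return (env, argv) for running a payload under glexec.

    glexec reads the payload owner's credentials from the environment,
    maps them to a local account, and runs the command as that account.
    """
    env = dict(os.environ)
    env["GLEXEC_CLIENT_CERT"] = payload_proxy   # proxy of the payload owner
    env["GLEXEC_SOURCE_PROXY"] = payload_proxy  # proxy handed to the payload
    glexec = os.path.join(glexec_location, "sbin", "glexec")
    return env, [glexec] + payload_cmd

# Dry run: print the command a pilot would execute. We do not actually
# exec it here, since glexec only works on a configured worker node.
env, argv = build_glexec_command(
    "/tmp/x509up_payload", ["/bin/sh", "-c", "echo hello from payload"])
print(" ".join(argv))
```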
LHC Open Network Environment (LHCONE) (John Shade)
The problem:
- LHC data models are evolving - more dynamic, less pre-placement of data
- network usage will increase and be more dynamic
- desire not to swamp general-purpose R&E networks with LHC traffic (although LHCONE allows R&E networks to be used if preferred)
The constraints:
- don't break LHCOPN
- distributed management
- designed for agility and expandability
- must appeal to funding agencies
3 levels of Tier-2
proposed solution:
- exchange points: "Exchange points will be built in carrier-neutral facilities so that any connector can connect with their own fiber or using circuits provided by any telecom provider."
- with LHCONE T2 and T3 will be able to obtain data from any T1 or T2
"LHCONE provides connectivity directly to T1s, T2s, and T3s, and to various aggregation networks, such as the European NRENs, GÉANT, and North American RONs, Internet2, ESnet, CANARIE, etc."
Next steps
- looking for feedback
- build a prototype at CERN
- refine, esp. monitoring
File-systems Efficiency (Xavier Canehan)
- CC-IN2P3 presented some results on file-system testing they have done
- seeing problems with latency on some jobs
Lunch
LHCb (Stefan Roiser)
- CVMFS in production use at some Tier-1s (NIKHEF, PIC) and in testing at RAL & CNAF
- Now running production on T0,T1 as well as T2 (up to 30k jobs running)
- Data consistency checks being done - developing new tools to automate this
- Setup of runtime environment sometimes times out (esp. at sites with AFS software areas), caused by a high number of file operations; fixed at IN2P3 but ongoing at CERN
- 'sawtooth pattern' in site usage, i.e. pulsing of usage
Tier-2s
- how to inform the Tier-2s of important upgrades?
- e.g. the CREAM bug
In conclusion: no major problems and good response from sites
ATLAS (Stephan Jezequel)
Operations over last quarter:
- ATLAS is 'gradually breaking cloud model'
- Consolidation of ATLAS and sites in preparation for data taking
- 34% of analysis being done by US
- DATADISK and MCDISK merged with DDM (FTS transfers/central deletion)
- Sites have to clean up dark data after migration
Space token shares for 2011: See https://twiki.cern.ch/twiki/bin/view/Atlas/StorageSetUp
ATLAS wants to break the cloud model to gain more flexibility. Obvious constraint: this should match the network connectivity between sites.
Actions:
- Prepare LFC consolidation at CERN
- Some T2s running G4 simulation for different T1s
- Direct transfers between some T2s and all T1s
In the future:
- Promote 'good' T2s to host primary replicas (only in T1s today)
Cross-cloud production
Reason:
- allocate more CPU resources for urgent simulation
Some big T2 sites already associated with many Tier-1s
- T2 connectivity to all T1s
- Select good T2 sites which will always transfer from/to all T1s: called T2Ds
- A long list of sites 'in probation'
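The T2D selection described above, picking the T2s that reliably transfer from/to all T1s, amounts to a filter on per-link transfer quality. A hypothetical sketch (site names, efficiencies, and the 90% cut are illustrative, not ATLAS policy):

```python
# Hypothetical sketch of selecting 'T2D' sites: Tier-2s whose transfer
# efficiency to/from every Tier-1 exceeds a threshold. Site names,
# efficiencies, and the 90% cut are illustrative only.

EFFICIENCY_CUT = 0.90

# per-T2 map of Tier-1 name -> observed transfer success rate
transfer_eff = {
    "T2_Good":      {"T1_A": 0.97, "T1_B": 0.95, "T1_C": 0.92},
    "T2_Probation": {"T1_A": 0.96, "T1_B": 0.70, "T1_C": 0.91},
}

def select_t2d(eff_map, cut=EFFICIENCY_CUT):
    """Promote a T2 to T2D only if it meets the cut against ALL T1s."""
    return sorted(
        t2 for t2, per_t1 in eff_map.items()
        if all(e >= cut for e in per_t1.values())
    )

print(select_t2d(transfer_eff))  # only the site good against every T1
```

A site failing the cut against even one T1 stays "in probation", which matches the long probation list mentioned above.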
ATLAS wants to reduce raw file size by compressing (zipping) files at Tier-0, for roughly a factor-2 reduction.
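The achievable factor depends entirely on the data. A quick way to estimate it on a sample is shown below; the synthetic, highly repetitive bytes used here compress far better than real raw detector data would, so the printed ratio is only illustrative:

```python
import gzip

# Estimate the compression ratio gzip achieves on a data sample.
# The sample here is synthetic and repetitive, so it compresses far
# better than real raw detector data; run this on a real file to
# estimate the ~2x figure quoted above.

def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    compressed = gzip.compress(data)
    return len(data) / len(compressed)

sample = b"event-header " * 10000  # stand-in for a raw data sample
print(f"compression ratio: {compression_ratio(sample):.1f}x")
```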
CMS (Daniele Bonacorsi)
Not many major issues in CMS Computing Operations since last time we met
- analysis, analysis, analysis
- preparing for 2011 data-taking
Site readiness: 40/50 Tier-2s consistently 'ready'
- CMS is monitoring the transition to CREAM; Brunel and Imperial are failing the test
glexec on the WN and ARGUS: "Initial tests indicate we have a long way to go for full deployment"
Conclusion: 'CMS Computing Operations OK since last time we met'