GDB February 2009


GDB Wednesday 11th February 2009

Agenda Notes: DTR

Introduction

(John Gordon) More suggestions for pre-GDB meeting topics. Flavia's talk postponed.

MB report

(Ian Bird) LHC running: first injection end of September, collisions end of October. Physics run from November for 1 year, with a short stop over Christmas. Heavy ions at the end of 2010. So need to go back to the earlier 2009/2010 plan (see MB minutes?).

EGI: policy board (PB) met in January in Prague; the revised blueprint is much improved. Need to tighten up the definition of an NGI. Concerns about continuity from EGEE3 to EGI - continuous service is imperative during the LHC run.

dCache workshop - self-help for T1s - dcache.org too small. Not all sites are using the redundancy/failover options.

ATLAS pCache

(Graeme Stewart) T1 reprocessing required the SQLite conditions file CDRelease.tag.gz - a 2GB file - which overloaded T1 SEs and led to job failures at RAL and NIKHEF. Possible solution: a local cache of the file on the WN. A small wrapper, pCache, checks for the file and downloads it if it is not there. It uses a hard link, so the cache needs to be on the same file system as each job. Sites need to say where the cache should go and how much space is available. There was some discussion.
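The check-download-link idea described above can be sketched roughly as follows. This is only an illustration of the technique, not the actual pCache code: the function name `link_from_cache`, its arguments and the `download` callable are all hypothetical, and real cache-size management and locking between concurrent jobs are left out.

```python
import os
import tempfile


def link_from_cache(cache_dir, name, job_dir, download):
    """Hard-link `name` from a node-local cache into a job's directory.

    `download(name, dest_path)` is a hypothetical callable that fetches the
    file from the SE; it is only called when the file is not cached yet.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, name)

    if not os.path.exists(cached):
        # Download to a temporary name first, then rename, so that other
        # jobs never see a half-written file in the cache.
        fd, tmp = tempfile.mkstemp(dir=cache_dir)
        os.close(fd)
        try:
            download(name, tmp)
            os.replace(tmp, cached)
        finally:
            if os.path.exists(tmp):
                os.unlink(tmp)

    target = os.path.join(job_dir, name)
    if not os.path.exists(target):
        # os.link raises OSError if cache_dir and job_dir are on different
        # file systems, which is why the cache location matters to sites.
        os.link(cached, target)
    return target
```

With this layout, several jobs on the same WN share one copy of the file on disk, and only the first job pays for the transfer from the SE.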

Mass Storage Performance

Graeme: ATLAS bulk pre-staging. In the computing model ATLAS RAW data is on tape. T1s reprocess 3 times per year and want to be able to reprocess within a month. This needs a factor 10 in speed for reading data over writing it, which for a 10% Tier-1 equates to 186 MB/s. Pre-staging service is provided by DDM - srmBringOnline - and uses srmLs to check whether files are ONLINE or NEARLINE. Bulk prestage tests: FDR08 dataset, 9TB at most Tier-1s. Pre-staged the 3000 files at TRIUMF at 300 MB/s (90 MB/s required). RAL: 420 MB/s - double the target. Other T1s did not do so well. Will try a whole-chain test reprocessing the cosmics runs next month. https://twiki.cern.ch/twiki/bin/view/Atlas/PreStageTests
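To put the quoted rates in context, the short calculation below converts a dataset size and a sustained rate into a staging time. It is a back-of-the-envelope sketch only: it assumes decimal units (1 TB = 1,000,000 MB) and that the 9TB, 3000-file figures apply to the TRIUMF test; the function names are made up for the illustration.

```python
def prestage_hours(dataset_tb, rate_mb_s):
    """Hours needed to stage `dataset_tb` terabytes at `rate_mb_s` MB/s.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    total_mb = dataset_tb * 1_000_000
    return total_mb / rate_mb_s / 3600


def mean_file_size_gb(dataset_tb, n_files):
    """Average file size in GB, assuming 1 TB = 1000 GB."""
    return dataset_tb * 1000 / n_files


if __name__ == "__main__":
    # Figures from the notes: 9TB dataset, 3000 files, 300 MB/s achieved
    # against 90 MB/s required.
    for label, rate in [("achieved", 300), ("required", 90)]:
        print(f"{label:9s} {rate:4d} MB/s -> {prestage_hours(9, rate):5.1f} h")
    print(f"mean file size: {mean_file_size_gb(9, 3000):.1f} GB")
```

Under these assumptions the 9TB sample stages in roughly a third of a day at the achieved rate, against a little over a day at the required rate.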

John: Tape performance at T1 sites: asked sites whether they feel VOs have tested them and whether monitoring is in place. FNAL: yes (only CMS). RAL: no, not fully tested at required rates, either independently or under combined loading. Monitoring: yes, but not in real time. Hard to plan future tape drive demand - need 3-5 year lead time.

ALICE WMS usage. 3 ALICE-dedicated WMS at CERN. wms204 showed backlogs. No clear conclusion, but some improvements have been suggested.

Reporting Installed Capacity

(Flavia Donno) There was no talk from Flavia.

The EGEE AUTHz Framework

(Christoph Witzig) Why? Different services use different authorization mechanisms. No single point to ban users from a site. Many sites don't know how to ban users. Banning and unbanning a user should be done at the command line - not a config file issue. No central grid-wide list and no central monitoring. Mechanism: glexec on the WN is the first step, then OSCT can implement their banning list, and then authZ on CREAM and WMS. glexec calls out to the authZ service host (configured via yaim), which holds local site banning rules and policies for pilot jobs (LCG-CE unchanged). Then grid-wide banning by OSCT can be implemented. Next step is integration with CREAM. Policy for global banning: respect for local site autonomy. Sites should implement this and give it priority over local policy - 6 hour timescale to ban a user (timescale under discussion). Certification in first half of April. Gradual deployment in six steps: 1. glexec on WN, 2. OSCT ban list, 3. integration into CREAM. Looking for volunteer sites and feedback.

Q: Relationship with SCAS? A: SCAS is a short-term solution; this is the longer-term solution. Q: What about storage? A: Need to discuss with storage people. At minimum, storage should look up the central list of banned users. Ian Bird: don't delay installation of SCAS.
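The decision order described in this talk (central ban list takes priority, then local site rules) might be expressed as in the sketch below. This is an illustration only and not the authZ service interface: the function name, the use of plain sets of DNs and the `local_policy` callable are all assumptions made for the example.

```python
def authorize(user_dn, central_ban_list, local_ban_list, local_policy=None):
    """Return (allowed, reason) for a user DN.

    Hypothetical model: ban lists are sets of DN strings and `local_policy`
    is an optional callable returning True if the site accepts the user.
    """
    # The grid-wide list is checked first, so a site's local policy
    # cannot re-admit a centrally banned user.
    if user_dn in central_ban_list:
        return False, "banned grid-wide"
    if user_dn in local_ban_list:
        return False, "banned by site"
    if local_policy is not None and not local_policy(user_dn):
        return False, "rejected by local policy"
    return True, "permitted"
```

For example, with a made-up DN present in the central list, `authorize` returns `(False, "banned grid-wide")` even when the local policy would accept every user.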

Job Wrapper Tests revisited - extracting configuration data from the sites'

(Thomas Low) Collect and visualise information: structural information, job information, system information, software information (java, perl, glite, GFAL) [seems like Greig's DPM monitoring]. Who would benefit: deployment team, VO, WLCG management, site management etc. What is needed: a monitoring tool which works on the WN, uses a reliable comms system, doesn't add overhead (30s max), plus display/visualisation of the data. How does it work: client on the WN sends messages to ActiveMQ, then to a database, then to the display. Client is executed before each job by the job wrapper. Client can be triggered by any job - job wrapper or SAM test or cron job - http://gridops.cern.ch/gcm/
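A client of the kind described above could look something like the following sketch. It is not the tool presented in the talk: the message layout, the function names and the list of probed tools are assumptions, and the `send` argument is a stand-in for whatever publishes to the message broker (no real ActiveMQ client is used here).

```python
import json
import platform
import shutil
import socket
import time

MAX_OVERHEAD_S = 30  # overhead limit quoted in the notes


def collect_wn_info(tools=("java", "perl", "python3")):
    """Gather a small snapshot of system and software information."""
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "system": {
            "os": platform.system(),
            "release": platform.release(),
            "arch": platform.machine(),
        },
        # Path of each tool on this node, or None if it is not installed.
        "software": {tool: shutil.which(tool) for tool in tools},
    }


def report(send, budget_s=MAX_OVERHEAD_S):
    """Collect info and hand it to `send` unless collection ran over budget.

    `send` is a hypothetical callable taking one JSON string; in a real
    deployment it would publish the message to the broker.
    """
    start = time.monotonic()
    message = json.dumps(collect_wn_info())
    if time.monotonic() - start > budget_s:
        return False  # skip reporting rather than delay the job further
    send(message)
    return True
```

A job wrapper, SAM test or cron job could each call `report` with the same `send` function, which matches the idea that any of them can trigger the client.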

Middleware Update

(Andreas Unterkircher) Release scripts for rollback per complete release; rollback per patch will come later. SA1 looking for early adopters (RAL is one, for glite_BDII_top). SL5/x86_64 WN pilot is over - a misconfiguration of PYTHONPATH is the only issue. Multiple WN versions on one WN: one RPM installs a complete version (akin to the tarball), hence can get multiple versions on a WN - currently under test. CREAM CE: release Jan 09 ... SCAS, patch history, GLEXEC etc: nothing much to report.

Status of the LCG-CE

(Andrey Kiryanov) LCG-CE was only meant as a stop-gap. Suffers scalability problems, mainly due to being based on globus-2 technology. Recent versions improve on this - gained us some time - OK for production but probably not for analysis, where we have to handle hundreds of users/roles with hundreds to thousands of jobs per CE. Discussion as to whether we need to move to CREAM and, if so, on what timescale. Issue as to whether CREAM CE will be ready and whether the LCG-CE can easily be ported to SL5. Need to see CREAM CE making good progress.

Multiuser Pilot Job Frameworks

(Maarten Litmaath) Glexec: ALICE questionnaire answers and plans submitted. Use of glexec by ALICE (AliEn) foreseen by June. What about testing by ATLAS and LHCb? Some bugs fixed - getting close to a SCAS-enabled glexec. LHCb completely ready to go. Not really clear what the status is.

VDT and OSG

(Alain Roy) Introduction to the OSG software stack and VDT (Virtual Data Toolkit, which makes up the largest part). OSG doesn't develop software but takes others'. Software Tools Group - new, looks at the big picture of OSG software, single point of contact for software providers.