GDB 8th April 2009

From GridPP Wiki

GDB Wednesday 8th April 2009

Agenda notes: DTR

Introduction

(John Gordon) pre-GDB proposals: storage for Tier-2's - looking like May; virtualisation in June; site management in July

upcoming events: STEP09 (worldwide) in June

EGI: separate strands progressing; some NGI's said they are only willing to take part in EGI if it delivers for LHC. Need an 'SSC' for Particle Physics. Discussion: Tier-1's need to be integrated with their NGI's. Went round all Tier-1's to explain how they will integrate with their NGI. UK: funding for NGI and PP grid for two years - so no problem foreseen. NGI = NGS + GridPP. Germany: NGI set-up managed by Gauss Alliance. Karlsruhe is one of the partners of this alliance. Netherlands: (Jeff), closest to NGI is NCF. Not disconnected, convergence scenarios being discussed. NDGF: Nordic countries will form separate NGI's - complex. Italy: in process of adapting. Taiwan: situation is not clear to us. France: Lyon Tier-1 is part of the process - more formal reply to come. Spain: potentially a problem, as the NGI is separate from the Tier-1.

Security Service Challenge - early 2009 campaign

(Sven Gabriel)

Similar to SSC3 (?) but using WMS. Preliminary results of challenge. Roleplay regarding actions to be carried out in response to a malicious user. TRIUMF and European Tier-1s have run. OSG to start. Went through each Tier-1 with summary of their responses. Discussion with INFN regarding their lack of a response. Plan to re-run storage challenge and SSC3.

STEP09

(Jamie Shiers) We have a good understanding of what WLCG ops means and it works, so we should refine rather than change it. SCOD - Service coordinator on duty rota has started. Need to see improvements in ordered recovery and communications. All major problems should be reported, especially for services classified as critical for a VO. SCOD asks for it. Tracking of incidents and the restoration is useful for your knowledgebase - need to learn from mistakes. Expect a 'service incident report' for all major incidents. Recommended to use a template (e.g. gridpp). pre-STEP planning: CMS style report card recommended for the other VO's. Possible analysis pre-GDB meeting in May?

CMS (Daniele Bonacorsi): test writing to tape at Tier-0 for several days. At Tier-1's - stress test of mass storage system and tapes. For Tier-2 part of STEP09: need to test analysis at high scale, including stage out of products to destination T2 (chaotic stage out, any T2 permutation allowed). Scale to 200k jobs per day - especially user stage out scenarios. Also test Frontier systems can work at load. Summary: draft plan is in place. CMS keen for multi-VO overlap.

ATLAS (Kors Bos): want to test out their full computing model, with the emphasis on the sites, so require full participation. Graeme is the coordinator. If Tier-2 sites want to opt out they need to do it by 15th May. Tier-2's will be tested doing data distribution, production and analysis. Data distribution will be tested using the existing DDM functional test of data from Tier-1 to Tier-2 and also distribution of calibration data. Metrics for Tier-2's include 100% data moved, no outages for > 24 hours. Simulation production metric is 15000 jobs/day exclusively in Tier-2's. There will be a 2 day grace period after which all merged hits must be on tape at Tier-2. Analysis testing at Tier-2's will be done using the HammerCloud. ATLAS will look at how sites manage analysis and production via fairshare at a site. Tier-2 sites will be expected to deliver 50% of ATLAS share of CPU capacity. Job efficiency should be >85% with >15 events/s with a 2 day grace period.
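As an illustration of how a site might check itself against the Tier-2 metrics quoted above, here is a minimal sketch. The thresholds (100% data moved, no outage longer than 24 hours, job efficiency >85%, >15 events/s) are taken from these notes; the `Tier2Report` class, its field names and the `failed_metrics` helper are hypothetical and not part of any ATLAS tool.

```python
from dataclasses import dataclass


@dataclass
class Tier2Report:
    """Hypothetical per-site summary; field names are illustrative only."""
    data_moved_fraction: float    # 1.0 means 100% of the test data arrived
    longest_outage_hours: float   # longest single outage during the exercise
    job_efficiency: float         # analysis job efficiency as a fraction
    events_per_second: float      # analysis event rate


def failed_metrics(report: Tier2Report) -> list:
    """Return the names of the metrics (as quoted in the notes) that were missed."""
    failures = []
    if report.data_moved_fraction < 1.0:
        failures.append("data_moved")
    if report.longest_outage_hours > 24:
        failures.append("outage")
    if report.job_efficiency <= 0.85:
        failures.append("job_efficiency")
    if report.events_per_second <= 15:
        failures.append("event_rate")
    return failures


good_site = Tier2Report(1.0, 6, 0.90, 18)
poor_site = Tier2Report(0.97, 30, 0.80, 18)
print(failed_metrics(good_site))
print(failed_metrics(poor_site))
```

The 2 day grace period is not modelled here; a real check would presumably apply it before counting a failure.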

EGEE Authorisation Service (Update)

(Christoph Witzig)

Christoph updated us on progress.

Identity Management - Levels of Assurance

(David Kelsey)

New large-scale academic federations are being deployed and we now need to confirm or redefine the WLCG position on matters. History: goes back to Dec 2000, no clear statement of identity assurance. Acceptable procedure refined to face to face identity. Now looking at Short Lived Credential Service (SLCS) - an automated service which translates credentials from a large site into grid certificates. There are a number of IGTF accredited SLCS CA's e.g. FNAL KCA (kerberos?) and more on the way e.g. DFN, Nether-Nordic. UK has Shibboleth CA but not accredited. Academic AAI federations - a good example is eduroam, used by some at CHEP and GridPP22. Identity schemas: eduPerson with attributes such as eduPersonPrincipalName (ePPN) of the form user@scope, but this is private information and IDPs usually refuse to release it.

How should WLCG respond? What do WLCG VO's and sites need? DK recommends we still need face-to-face identity vetting, we still need persistence of subject DN and appropriate presentation of real name in DN. He also likes the Australian shared token e.g. "John Citizen ZsiAvfxa0BXULgcz7QXknbGtfxk". Note that it is the real name in the DN that causes us all the data privacy problems. DK would like to push for status quo.
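To make the shared token idea concrete, here is a minimal sketch of how such an opaque, persistent identifier could be derived. It assumes the token is a salted SHA-1 hash, base64url encoded with the padding removed, which gives 27 characters like the example above. That construction, the `shared_token` function and all the input values are assumptions for illustration and were not presented at the meeting.

```python
import base64
import hashlib


def shared_token(user_id: str, idp_id: str, secret_salt: str) -> str:
    """Derive an opaque, repeatable token (assumed scheme, see lead-in)."""
    material = (user_id + idp_id + secret_salt).encode("utf-8")
    digest = hashlib.sha1(material).digest()           # 20 bytes
    encoded = base64.urlsafe_b64encode(digest)         # 28 chars, one '=' pad
    return encoded.decode("ascii").rstrip("=")         # 27 chars


# All values below are made up for the example.
token = shared_token("jcitizen", "https://idp.example.org/shibboleth", "site-secret")
print(len(token))
print("John Citizen " + token)
```

The same inputs always give the same token, so the subject DN stays persistent, while the token itself reveals nothing about the user beyond the real name placed next to it.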

Distributed Monitoring in EGEE

(Steve Traylen)

Summary of OA (operations and automation) Team face to face meeting. Overview of monitoring structure - GOCdb, ... SAM team looking to replace SAM portal. MyOSG looks good: http://myosg.grid.iu.edu/about - can use iGoogle to create your own dashboard.

Static site properties: a reliable definitive source of information is needed. Three possible solutions - 1. Status quo - using GOCdb. 2. Site BDII. 3. Sites publish to messaging system and GOCdb reads from messaging system. Next step: check Glue 1.3 can accommodate this.

Security Tests: Some SAM tests run. Need to port to Nagios-based frameworks. Pakiti campaign: OSCT will establish a Pakiti server to look at site patch status.

Aggregated Topology Provider (ATP) http://gridops.cern.ch/atp/ .

VO's need to run their tests in the new framework. Using Nagios API. Integration week to be held in May.

Middleware update

(Andreas Unterkircher)

SL4: WMS release showed problems - a workaround is available. SCAS/glexec is certified and in PPS. SL5: WN released on 23 March - no problems raised so far. But lcgManageVOTag and lcg-tag don't work.


I got called away at this point and so missed:

Pilot Jobs

(Maarten Litmaath)

and

SRMv2.2 - status of the SRM 2.2 MoU extension

(Andrea Sciaba)

Distributed Monitoring in the OSG

(Brian Bockelman)

OSG principle to make data available locally. Gratia is used for accounting - XML based. Grid and non-grid clusters can be monitored and only grid data forwarded. Now starting to incorporate accounting of data transfer.

Service monitoring: Probes run locally at site. They are hoping to get some improved troubleshooting tools.