GDB 14th December 2011

From GridPP Wiki

GDB

Wednesday, December 14, 2011

Notes from Rob Harper

Agenda and slides are at: http://indico.cern.ch/conferenceDisplay.py?confId=106651

The main points that caught my attention were...

  • Vidyo is planned to be used for the January GDB meeting.
  • The search for a new GDB chair is starting.
  • ALICE is looking at using signed jobs instead of proxy certificates as a delegation mechanism.
  • ALICE is ready for a transition to SL6 but the other experiments are not.
  • ATLAS are testing LHCOne and this is proving controversial.
  • Experiments are seeing increased pile-up, so need more memory to untangle it, and are looking at how to deal with the problem.
  • LHCb is running a big MC campaign for the next few months, mostly at T2s.
  • CMS are expecting heavy use of T1 and T2 sites over the next year.
  • An identity tracking test has been run at 3 sites. Also controversial, but it highlighted the inadvisability of shared local Unix identities for multi-user pilot jobs.

Apologies if I missed anything important. My full notes follow; they may be a bit scruffy and miss out some points, but hopefully they'll be of some use...




   09:00 - 09:30 Introduction
       09:00 Introduction 25' (John Gordon)

Quick report of other meetings, workshops & conferences: SC11 @ Seattle, LHCONE @ Amsterdam, EGI Info Services workshop.

Next year's meetings continue as the 2nd Wednesday of the month, with a couple of changes. The January meeting (which will be shorter) will focus on the TEGs.

Assorted upcoming meetings. In the UK, OGF is in Oxford in March.

GDB issues:

  • Multi-user (MU) pilot jobs, Argus, glexec, CREAM, virtualisation, cvmfs, glite-CLUSTER, etc.
  • Some discussion about whole-node scheduling to increase efficiency (memory/IO). Could possibly be done immediately if requirements and techniques were understood. Possible alternative is tailoring a VM of optimum size for jobs (eg if a job works best on 6 cores...) instead of whole-node.
  • JG asks if CREAM (EMI1 update 11) is working with SGE.
  • CREAM stability... documentation on tuning needs to be properly disseminated.
  • Vidyo... GDB suggested as test subject, probably in January. Info to come out before Xmas.


       09:25 GDB Chair 5' (Ian Bird)

Looking for a new chair for GDB to take over (probably) in May. First need a search committee of JG, IB and 2 volunteers. Any nominations for a chair? (2 year terms.)



   09:30 - 10:00 Middleware
   Convener: 	Doina Cristina Aiftimiei (Istituto Nazionale Fisica Nucleare (IT))

EMI 1 update 10 in November included SGE support in CREAM.

Update 11 on 2011-12-15... updates to gLite MPI, UNICORE Gateway 6, StoRM, UNICORE XUUDB, WMS.

Then on 2012-01-19... dCache and assorted UNICORE components.

Beta & acceptance testing... software available through testing/ (beta) and deployment/ (acceptance) repositories

UMD 1.4.0 due 2011-12-19, not including DPM/LFC v1.8.2, L&B v3.1.0, MPUI, StoRM v1.7.1, WMS, AMGA, LFC_oracle, VOMS_oracle, FTS. (The last 4 are not planned for inclusion.)

EMI in good shape for EPEL compliance. Using RPMlint.

IPv6... working with the HEPiX IPv6 WG to ensure compliance. Using a static code analyzer.

Supporting gLite security updates until 2012-04-30. Security updates are released when they are ready, not waiting for formal release.



   10:00 - 12:00 TEG
       10:00 Storage 40' (Daniele Bonacorsi)

Defining membership of group. Initially ~30 -- now ~50 (including some observers).

Kick-off meeting in November.

Trying to coordinate effort between the DM-TEG and SM-TEG as there is some overlap. There is now a list of topics shared between the groups. See slides for detail.

Discussing needs and experiences with experiments. Need to digest results in order to decide how to proceed.

Reporting on progress at Amsterdam TEG week in January.

The TEG is concerned that it may not be as ambitious as it could be because it is trying to avoid disrupting operations, etc. But the opinion was expressed that the TEGs should be asking ambitious questions and pushing the boundaries.



       10:40 Security 40'

Romain Wartel:

  • ~19 members, looking for more.
  • Drafting a risk assessment. Cost of security. Early draft is available (with no recommendations).
  • Aspects being looked into...
    • Publicity... aggravating factor for other risks. Need training to handle media.
    • Traceability... Needs to be fine-grained (identify individual users and identities). Need a form of unix identity switching.


Maarten Litmaath:

  • Authentication and Authorization Infrastructure (AAI) on WN.
  • Need to be able to contain incidents and trace origins, etc.
  • Concern with long-lived and general-purpose proxies.
  • Traceability helps to prevent spread or recurrence, and supports compliance and deniability for innocent parties. Need to trust VOs. Maybe sign payloads in future and become less reliant on the VO for investigation.
  • User payloads should be run under individual user IDs for traceability. If you don't separate users, can you find out who did something naughty?
  • Technologies... glexec needs a proxy -- can that be relaxed? Sudo needs to determine the target account somehow -- maybe a glexec wrapper. VMs. One account per job slot (only Condor at the moment). SELinux. (See the sketch after this list.)
  • There are legal issues but we are initially only concerned with operational issues. It can be hard to prove which user was responsible, but we need to know which DN was involved. Investigations can be hobbled by only knowing the pilot DN.
  • Longer term... the use of general-purpose proxies on WNs is questionable. Deal with the relationship between payload and user -- sign the payload to prevent tampering. Data ownership issues.
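
A minimal sketch (Python, not any particular pilot framework's code) of the glexec-style identity switch mentioned above: the pilot passes the payload owner's proxy to glexec, which maps it to a local account and runs the payload there, so activity on the WN is traceable to an individual user. The wrapper function, paths and environment variable names are illustrative assumptions based on common glexec deployments; check the actual glexec documentation for your site.

   import os
   import subprocess

   def run_payload_via_glexec(payload_cmd, payload_proxy, glexec="/usr/sbin/glexec"):
       """Run a user payload under the account mapped to payload_proxy (sketch only)."""
       env = dict(os.environ)
       # Proxy of the payload owner: glexec uses it to decide which local
       # account to switch to, giving per-user traceability on the WN.
       env["GLEXEC_CLIENT_CERT"] = payload_proxy
       env["GLEXEC_SOURCE_PROXY"] = payload_proxy  # proxy handed on to the target account
       return subprocess.call([glexec] + list(payload_cmd), env=env)

   # Hypothetical usage from inside a pilot job:
   # run_payload_via_glexec(["/bin/sh", "payload_wrapper.sh"], "/tmp/x509up_payload")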


Steffen Schreiner:

  • Delegation with proxy certificates.
  • Actors: user, VO, site. User delegates to VO, which delegates to site. Need to make this traceable.
  • Delegation can be specific: who & what, to whom and how. Or less so.
  • The current proxy is full delegation: whoever has it can do anything in my name. The problem is that (for example) site A can then delegate to site B using a proxy in the name of the user.
  • Why do we need a user proxy for the payload? We don't fully trust the VO. But do we trust the proxy cert to be valid, and can the user trust the site? This proves nothing much and removes non-repudiation for the user.
  • Proposal: signed grid jobs (as per ALICE) -- see the sketch after this list. Should be simpler and easier; proof of what was submitted, no extra permission delegated; should fit into glexec just fine. (Discussion suggested this needs a lot more investigation and discussion.)
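
To make the signed-job idea concrete, here is a toy sketch (Python, using the cryptography package) of signing and verifying a job description with the user's key pair. It is illustrative only and not the ALICE implementation; the function names and key handling are assumptions.

   from cryptography.exceptions import InvalidSignature
   from cryptography.hazmat.primitives import hashes, serialization
   from cryptography.hazmat.primitives.asymmetric import padding

   def sign_job(job_description: bytes, user_key_pem: bytes) -> bytes:
       # The user signs the job description with their private key before submission.
       key = serialization.load_pem_private_key(user_key_pem, password=None)
       return key.sign(job_description, padding.PKCS1v15(), hashes.SHA256())

   def verify_job(job_description: bytes, signature: bytes, user_pub_pem: bytes) -> bool:
       # The VO service or site verifies the signature against the user's public key,
       # proving what was submitted without any extra rights being delegated.
       key = serialization.load_pem_public_key(user_pub_pem)
       try:
           key.verify(signature, job_description, padding.PKCS1v15(), hashes.SHA256())
           return True
       except InvalidSignature:
           return False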


       11:20 Operational Tools 40' 	(Dr. Maria Girone)

The 6 weeks since launch have been used to assess the current status of the various areas. A workshop was held this week.

WG1: Monitoring... ALICE & LHCb in-house. ATLAS & CMS largely experiment dashboards. Sites use Nagios, Ganglia, Lemon, etc., but there is no need to converge on one system. SAM. HammerCloud is popular. Site Status Board (SSB) used by ATLAS & CMS -- integrated with ad-hoc tools. SiteView aggregates VO-specific monitoring.

Proposal... to use the SSB framework to implement site-tailored experiment views.

Suggesting PerfSONAR deployment across WLCG.

Site availability... SAM tests plus other quality metrics. Views... GridView; Dashboard (maybe different critical tests to GridView); ACE (not in production for experiments, but will soon calculate all availabilities).

Ops tests are useful but not sufficient. And GridView availabilities can be too good -- they may not reflect "real" usability. SAM does not measure the job robot, HammerCloud, data transfer quality, etc.

Proposal... Experiments extend tests to test more site-specific functionality (to be agreed & documented).

SAM framework extended to support external metrics (Panda, DIRAC, etc). Base availability on more relevant site functionality, independent of experiment side-issues. But side metrics are still published.
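
As a rough illustration of basing availability on more relevant site functionality, a toy calculation (Python) that counts a site as available in a time bin only when all tests flagged as critical pass. The test names and data layout are invented for the example and are not SAM's actual schema.

   def availability(bins, critical_tests):
       # bins: one dict per time bin, mapping test name -> "OK" or "CRITICAL".
       # A bin counts as available only if every critical test passed in it;
       # non-critical (experiment side-issue) metrics are reported but ignored here.
       ok = sum(1 for results in bins
                if all(results.get(t) == "OK" for t in critical_tests))
       return ok / len(bins) if bins else 0.0

   # Hypothetical example: an SRM test is critical, a job-robot metric is not.
   bins = [{"SRM-put": "OK",       "job-robot": "OK"},
           {"SRM-put": "OK",       "job-robot": "CRITICAL"},
           {"SRM-put": "CRITICAL", "job-robot": "OK"}]
   print(availability(bins, {"SRM-put"}))   # -> 0.67: two of the three bins count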

WG2: Support tools, etc... Tools are generally good, but GGUS availability and interfaces with other systems need improvement. Info system stability improvements needed. Need to investigate new solutions to scalability issues (e.g. Torque/Maui, SGE). Make experiment-site interaction better with fewer people.

WG3: application software management... Maybe converge on one configuration & build management system. Need one common s/w deployment tool (cvmfs?).

WG4&5 Middleware stuff... Robustness, simple config, monitoring, documentation.


   12:00 - 14:00 Lunch


   14:00 - 16:00 Experiment Operations


       14:00 Alice 20'
       Speaker: 	Latchezar Betev (CERN)

1.3PB data in 2011 HI run.

Running out of disk, so adding more and doing replica cleanup.

High luminosity -> a lot of pileup -> memory overruns.

Ready to migrate to SL6 (and even other distros, particularly Ubuntu) ASAP (for more up-to-date kernels and compilers). Discussion: if access to recent Linux+gcc is the motivator, would it be easier for sites to upgrade those rather than using a different distro?


       14:30 LHCb 30'
       Speaker: 	Elisa Lanciotti (CERN)

T2s have run ~25% of LHCb reprocessing. Specifically mentioned: Manchester, QMUL, Liverpool. A T2 "attaches" to a T1 (via a good network) to get data.

MC campaign has just started, mainly at T2s, expected to continue until restart of data-taking (late spring?)

Not ready for SLC6. (SLC?!)

LHCb likes cvmfs and they like the GridPP documentation on it!


       15:00 ATLAS 30'
       Speaker: 	Cedric Serfon (Ludwig-Maximilians-Univ. Muenchen (DE))

Last 3 months, throughput ~4GB/s. ~1M files/day. About 3.5PB T0->T1.

60% of CPU was on T2s. >50% between US, UK and DE. 70% is MC.

ATLAS reported some network issues linked to a switch to LHCOne. There was a lot of discussion about the fact that LHCOne is not in production and is still being tested -- which is what ATLAS is doing. Where sites join LHCOne, this may affect transfers to other sites. ATLAS tests involve a few T1s and T2s, all using perfSONAR-PS and dedicated FTS channels, and performance is being measured before & after connection to LHCOne.

The meeting got a bit derailed by discussion of LHCOne until it was decided to discuss this in more depth at a future meeting.


       15:30 CMS 30'
       Speaker: 	Ian Fisk (Fermi National Accelerator Lab. (US))

Dealing with increasing pile-up, making "the computing problem more challenging": memory requirements lead to low CPU efficiency. Planning whole node scheduling at T0.

Expecting next year's work to be memory limited (pile-up) so there may need to be some physics trade-offs.

Considering multi-core scheduling, memory per core (longer term)...

Expecting heavy use of T1 and T2 sites in 2012.



   16:00 - 16:20 TEG
       16:00 Databases 20'
       Speaker: 	Dave Dykstra (Fermi National Accelerator Lab. (US))

Workshop took place in November with ~30 participants.

Subgroups (conditions, detector controls, db ops, distributed tools & metadata) drafting report by mid-January.

Expecting increased use of NoSQL/structured storage.


   16:20 - 16:50 Identity Tracking
   Convener: 	Sven Gabriel (Unknown)

User traceability was not challenged in SSC 4&5.

A test was run involving unscheduled malicious jobs (which write a file in the pilot's home dir, then subsequent jobs access this file) via Panda on 3 sites. The VO-CSIRT was provided with a load of information.

Should be able to find compromised account/DN within 4-24 hours.

With shared Unix IDs this can be harder. Though WN local home dirs are easier than with shared homes.

Containment difficult in either case.

After 1 week the offending DN had not been found, so extraordinary resources were required for this. It was found after a vulnerability was identified.

With shared Unix IDs, it is possible to create untraceable jobs; even if an ID is found, proof is difficult. So don't do it!
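
A toy illustration (Python) of the traceability point: with per-user identity switching, the Unix owner of a suspicious file identifies a local account, and the site's account-to-DN mapping (site-specific; a plain dict here) points to one DN. With a shared pilot account every file has the same owner and that chain breaks. The function and the mapping are illustrative assumptions, not any site's actual tooling.

   import os
   import pwd

   def suspect_dn(path, account_to_dn):
       # Owner of the suspicious file -> local account name.
       account = pwd.getpwuid(os.stat(path).st_uid).pw_name
       # With pooled per-user accounts this narrows the incident to one DN;
       # with a shared pilot account it only says "the pilot user", i.e. nothing.
       return account_to_dn.get(account, "unknown (shared or unmapped account)")

   # Hypothetical usage with a site-maintained account-to-DN map:
   # suspect_dn("/home/pool047/dropped_file", {"pool047": "/DC=ch/DC=cern/CN=Some User"})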

There was discussion about such unscheduled drills: they have limited resources available, and they may cause problems for sites/VOs. Concerns were raised about whether it is appropriate to do drills like this in future. There was also discussion about what the real risks are and whether this is a useful test. Inconclusive, but there was a general dislike of how the test was conducted.