GDB 8th September 2010


GDB 8/9/2010 Agenda


10:00 Introduction 30' (John Gordon)

  • Pilot jobs and glexec have been installed and tested at the T1s. SAM/Nagios tests have been set up. Some preliminary testing has been done; experiments should now be able to test on a larger scale. More progress is expected next month.
  • CREAM CE: sites don't find it a problem, but ATLAS still seems to have an issue. Condor has provided a new version that should solve it, but it hasn't been tested yet.
  • Virtualisation: some activities are going on, and the experiments should now become more involved.
  • Information system stability: some top-level BDIIs are too old and cannot publish the latest information.
  • Installed capacity: there is a page describing how sites should publish their CPU and storage resources. T1s should go and check the installed capacity published by the new GStat 2.0 page. For the UK sites:

http://gstat-wlcg.cern.ch/apps/capacities/sites

Fair shares aren't included yet; the information is about raw capacity. The information is dynamic at the moment but will become static, backed by a database, in the future.

The deadline is the end of September for T1s and the end of October for T2s.
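
As a complementary cross-check of what a site actually publishes (the figures GStat displays come from the information system), the sketch below queries a top-level BDII for the GLUE 1.3 storage capacity attributes. It is only an illustration: the BDII host name and the use of the Python ldap3 library are assumptions, not anything agreed at the meeting.

 # Sketch: query a top-level BDII for published storage capacity (GLUE 1.3).
 # The host name is illustrative; point it at your own site or top-level BDII.
 from ldap3 import ALL, Connection, Server
 
 server = Server("ldap://lcg-bdii.cern.ch:2170", get_info=ALL)
 conn = Connection(server, auto_bind=True)
 
 # GlueSETotalOnlineSize / GlueSEUsedOnlineSize are the capacity figures (GB)
 # that capacity-reporting tools read for each storage element.
 # Narrow search_base to your site's subtree to limit the output.
 conn.search(
     search_base="o=grid",
     search_filter="(objectClass=GlueSE)",
     attributes=["GlueSEUniqueID", "GlueSETotalOnlineSize", "GlueSEUsedOnlineSize"],
 )
 for entry in conn.entries:
     print(entry.GlueSEUniqueID, entry.GlueSETotalOnlineSize, entry.GlueSEUsedOnlineSize)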

  • SAM tests are being phased out and regional Nagios instances have now taken over, although there is still a central instance.
  • The Director General has issued a technical plan that has to go back to the funding agencies; see John's slides for details.

10:30 Shared Software Areas 30' (Elisa Lanciotti, Ian Collier)

  • Results obtained using CernVM-FS at PIC for software distribution. The setup includes a squid cache to reduce WAN latency and a local cache on the WNs. On the testbed there was no dependence on the number of jobs, and job execution time was the same as, if not better than, with NFS.
  • LHCb has more than 4M small files which get touched very often; this is also one of the reasons the NFS servers get brought down (GridKa complains).
  • RAL made some preliminary scalability tests and managed to run 800 concurrent jobs, but hasn't looked at job efficiency in detail. This could be achieved with any caching system (AFS, proposed by Manchester since 2001 and currently used by CERN ;-)), but CernVM-FS's main advantage is that when a file is in the catalogue it doesn't get copied again even if it has multiple references (i.e. belongs to more than one release); a sketch of this deduplication follows this list. The next step is to scale to thousands of jobs. The main question is how the central service at CERN will be supported if sites move to it, and there is a skeleton of a security framework currently being worked on (checksums and validation done by the sgm account?). One of the things to look at is how much cache is needed on the clients and how big the squid cache has to be. Another big advantage is that this model eliminates the need to publish tags in the information system.
  • Most of the work gets done at CERN, where the experiments take care of installing the master release; sites then just use caches. There is still a validation problem under discussion.
  • The general consensus is that this is promising; a request for CernVM-FS to be supported should come through from the experiments.
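
A minimal sketch of the content-addressed idea behind that advantage (an illustration, not CernVM-FS code): files are stored in the cache under a hash of their content, so a file referenced by several releases is fetched and stored only once. All paths and names below are invented for the example.

 import hashlib
 import os
 import shutil
 
 CACHE_DIR = "/tmp/cas-cache"  # illustrative local cache location
 
 def content_hash(path):
     """Return the SHA-1 of a file's content (content-addressed naming)."""
     h = hashlib.sha1()
     with open(path, "rb") as f:
         for chunk in iter(lambda: f.read(1 << 20), b""):
             h.update(chunk)
     return h.hexdigest()
 
 def store(path):
     """Put a file into the cache only if its content is not already there."""
     os.makedirs(CACHE_DIR, exist_ok=True)
     digest = content_hash(path)
     target = os.path.join(CACHE_DIR, digest)
     if not os.path.exists(target):   # new content: fetched and stored once
         shutil.copyfile(path, target)
     return digest                    # releases refer to files by this digest
 
 # Two releases shipping an identical file map to the same cached object,
 # so the second release costs no additional transfer or space.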

11:00 Accounting 30' (Cristina Del Cano Novales)

  • APEL is now monitored by Nagios.
  • APEL has moved to ActiveMQ as its transport layer, and RGMA needs to be decommissioned. Only 30 sites are using the ActiveMQ publisher; 190 are still using RGMA.
  • MPI parser support will be introduced.
  • Regionalization will have to be flexible: regions can either publish to the central service, or set up their own accounting service and publish a summary to the central database, but they should do it all through the same ActiveMQ interface.
  • Nikhef didn't use the APEL parser but injected the information directly into RGMA; a way to do the same needs to be found.
  • The schema still supports only grid jobs. Is there anything on the horizon that includes local jobs?

The problem is that for APEL everything is a local job unless it is joined with grid information, as sketched below.
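
A minimal sketch of that join, with invented record and field names rather than the real APEL schema: batch records are matched against grid (gatekeeper/CE) mappings on the local batch job ID, and anything left unmatched counts as a local job.

 # Hypothetical batch accounting records, keyed by local batch job ID.
 batch_records = [
     {"local_job_id": "1001", "cpu_seconds": 3600},
     {"local_job_id": "1002", "cpu_seconds": 1800},
 ]
 
 # Hypothetical grid-side information mapping local job IDs to a user DN/VO,
 # as a gatekeeper or CE log would provide it.
 grid_mappings = {
     "1001": {"user_dn": "/DC=example/CN=Some User", "vo": "lhcb"},
 }
 
 grid_jobs, local_jobs = [], []
 for record in batch_records:
     grid_info = grid_mappings.get(record["local_job_id"])
     if grid_info:                      # joined with grid information: a grid job
         grid_jobs.append({**record, **grid_info})
     else:                              # no match: counted as a local job
         local_jobs.append(record)
 
 print(len(grid_jobs), "grid job(s),", len(local_jobs), "local job(s)")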

  • There is an accounting requirements collection going on in EGI. User-level accounting was a VO requirement that doesn't seem to be needed anymore, as the experiments have their own ways of keeping track of who has submitted jobs.

11:30 Middleware 30' (Andrew Elwell)

  • Batch systems will have best-effort support from a number of sites.
  • There is no centralised group in EGI that can decide what is considered critical and set dates for when things should be dropped. In particular, there is no date yet for dropping gLite 3.1 services. We will have to rely on the WLCG baseline release mechanism to understand what can be published.
  • lcg-vomscerts needs to be updated by the 10th of September.
  • A number of updates and fixes were listed.
  • To keep up to date with the situation, use the work plan tracking in Savannah: http://bit.ly/22we3i

12:00 Network Monitoring 20' (John Shade)

  • The most important thing is that there is now a prototype dashboard to monitor the LHCOPN, and they are also working on providing historical data.
 http://sonar1.munich.cnm.dfn.de/lhcopn-dashboard/cgi-bin-auto/cnm-table.cgi
  • It is not yet well supported. DANTE withdrew, and SARA and CERN picked up the task.

14:00 - 16:25 Experiment Operations

14:00 Alice 30' (Latchezar Betev)

  • Quite happy with the data collection.
  • Waiting for RAL to upgrade CASTOR, because the new version is "rock solid".
  • Tier2 storage is holding up fine.
  • Analysis uses about 5% of grid resources, with ~250 users.
  • Machine is stable

14:30 LHCb 30' (Roberto Santinelli)

  • Analysis running up to the computing model expectations.
  • Running xrootd
  • Using the CREAM CE in production and testing direct job submission.
  • Analysis at some T2s with DPM is under evaluation, but it puts a strain on the central system.
  • glexec is not in production, but there are Nagios tests in place.
  • Distribution is per run rather than per file (1 run == 1 site)
  • Prototyping HC tests
  • Data taking: quite impressive integrated luminosity as well.
  • A list of GGUS tickets and an analysis of the problems was presented. The number of tickets has doubled, partly because LHCb has started to make heavy use of a number of services at the Tier-1s.
  • The shared software area is one of the biggest "distributed" problems. Lyon has had problems since June 8th and has looked at using AFS, as CERN does. ATLAS jobs compiling software compound the problems, as they compete for the same resource.

GridKa again complained about the number of small files that get touched and bring NFS down. A site ID card was proposed, at least for the T1s.

  • The CREAM CE has also had a number of issues since June. The current release seems more stable.
  • Frequent crashes of the RAL disk servers required some memory tuning. Storage is still the most vulnerable component.

15:10 CMS 30' (Ian Fisk)

  • Sites are in pretty good shape.
  • Lots of work has been done to make T2-T2 transfers reliable, and this is paying off by increasing data availability.
  • Quality of transfers from CERN remarkably high
  • Detailed list of major issues with T1
  • Data for analysis exceeded 2 PB and physics groups manage their space at T2
  • Analysis has slowed down after ICHEP and during August but is ramping up again
  • The CREAM CE is in use; the latest version solves a lot of problems.
  • Working on pilot factories based on Condor glide-ins.
  • Asked for a Savannah-GGUS bridge, which is working fine, although a lot of additional features have been requested.

15:40 ATLAS 30' (Simone Campana, Stephan Jezequel)

  • Interesting slides about discrepancies between DQ2 and the information system in the reported storage. Most of the problems come from the fact that the IS uses the equation used = total - unused, where total is all the disk installed, whether functional or not. This often results in negative numbers being published when sites put data disks offline or into a read-only state (see the illustration after this list). A solution needs to be agreed. (For reference: https://gus.fzk.de/ws/ticket_info.php?ticket=54818)
  • T1 availability is mostly OK, apart from NIKHEF/SARA, which had heavy problems with storage and Oracle.
  • Few GGUS tickets for T1
  • A number of iterations have been required to converge on a reliable checksum computation for StoRM sites.
  • Storage is still a bit flaky. There are big differences in responsiveness and reliability between the T1s; there should be a detailed comparison in the future.
  • RAL had prolonged problems with storage.
  • SARA had prolonged problems with the LFC (caused by Oracle).
  • Actions have been identified to solve or minimise both problems.
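
As referenced in the DQ2/information-system item above, a minimal numerical illustration of how the derived figure used = total - unused can come out negative. The numbers are purely hypothetical; the point is only that a negative value appears whenever the published unused figure exceeds the published total, which the minutes above note happens when sites put data disks offline or into a read-only state and the two figures are no longer consistent.

 # Hypothetical figures for one storage space (invented for illustration).
 total_tb = 400    # published total capacity
 unused_tb = 450   # published free/unused capacity, momentarily inconsistent
                   # with the total while disks are offline or read-only
 
 used_tb = total_tb - unused_tb   # the derivation the information system uses
 print(used_tb)                   # -50: a negative "used" value gets published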