RAL Tier1 weekly operations Overview 20090810
Overview of Milestones and Metrics
Key High Level dates
- LHC schedule delayed by 6 weeks relative to the Chamonix date. We now expect first beam in mid-November. Collisions now in December?
- Freeze date now end of September
- Discussing with PMB if MoU commitment dates have any flexibility
Key Metrics
Owner | Description | Target | Achieved |
---|---|---|---|
Gareth Smith | Overall Tier-1 SAM Availability (last week) | 97% | 100% |
Gareth Smith | ALICE SAM Availability (Jun) | 97% | 73% |
Gareth Smith | ATLAS SAM Availability (Jun) | 97% | 71% |
Gareth Smith | CMS SAM Availability (Jun) | 97% | 71% |
Gareth Smith | LHCB SAM availability (Jun) | 97% | 67% |
Andrew Sansum | Fraction of Tier-1 Staff in Post (Jun) | 93% | 103% |
Gareth Smith | Number of days where called out (last spreadsheet full week) | 3 | 3 |
Matt Hodges | Percentage met of UB allocation of disk (Jul) | 100% | 83% |
Matt Hodges | Job Efficiency (Jul) | 85% | 84% |
Matt Hodges | Farm Occupancy (Jul) | 85% | 78% |
Matt Viljoen | Number of >Severe CASTOR Incidents (Jun) | 6 | 2 |
Availability was poor in June owing to the move of the Tier-1 to R89.
Key Production Milestones
See myactions:
https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/
High Level Schedule
- Final Update Window: Mon 13/07/09 to 30/09/09
- Tier-1 Stability Period (2): October to mid-November
- LHC first beam: mid-November
Note that:
- Software freeze date of end of September was considered reasonable by WLCG MB.
Disaster Management
Swine Flu (H1N1) is being handled in the Tier-1 Disaster Management System (currently level 2). Will probably enter the disk deployment problems as level 1.
Swine Flu Response Plan
See: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/TierOneSwineFlu
There will be a Tier-1 work at home day on Wednesday 19th August.
Purchasing and Finance
- GRIDPP finalising spend plan
- Commencing current disk and CPU tenders (Dave Corney leading). Disk PQQ is running. CPU PQQ will launch shortly.
Staffing
- One experiment support post started today (Andrew Lahiff). Second experiment support post: ready to make offer.
- EGEE funded PPS recruitment will start Monday 17th August.
PMB Experiment Reports
ATLAS
Require 2 weeks of stability during August. The start date has slipped, but no new date is available yet. The slippage leads to a clash with our LFC downtime.
CMS
LHCB
1) Restarted production last week, after new disk servers became available at CERN and all failover transfer requests had finished. The pending productions were started last Thursday and finished on Sunday (including the 10**9 minimum bias run), after quickly ramping up to > 18K simultaneously running jobs.
2) Bugs found and fixed within DIRAC, relating to job prioritisation.
3) Various Tier-1 sites (not RAL) ran out of storage in the MC-M-DST service class last week. More storage was quickly put in by those sites when alerted by GGUS tickets.
4) lcgwms02 at RAL had problems over the weekend. This caused various Monte Carlo simulation jobs to fail, primarily at Bristol.
Outlook: User analysis and further MC productions being prepared.
Hardware Deployment Report
Team will restart work - chaired by Matt Hodges.
Team Reports
Fabric
RAL Tier1 weekly operations Fabric 20090810
Grid Services
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20090810
CASTOR
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_10/08/2009
Database
http://www.gridpp.ac.uk/wiki/Operations_Report_10/08/2009