RAL Tier1 weekly operations Overview 20090810


Overview of Milestones and Metrics

Key High Level dates

  • LHC schedule delayed by 6 weeks relative to the Chamonix date. We now expect first beam in mid-November, with collisions possibly in December.
  • Freeze date now end of September
  • Discussing with PMB if MoU commitment dates have any flexibility


Key Metrics

Owner          Description                                                   Target  Achieved
Gareth Smith   Overall Tier-1 SAM Availability (last week)                   97%     100%
Gareth Smith   ALICE SAM Availability (Jun)                                  97%     73%
Gareth Smith   ATLAS SAM Availability (Jun)                                  97%     71%
Gareth Smith   CMS SAM Availability (Jun)                                    97%     71%
Gareth Smith   LHCb SAM Availability (Jun)                                   97%     67%
Andrew Sansum  Fraction of Tier-1 Staff in Post (Jun)                        93%     103%
Gareth Smith   Number of days where called out (last spreadsheet full week)  3       3
Matt Hodges    Percentage met of UB allocation of disk (Jul)                 100%    83%
Matt Hodges    Job Efficiency (Jul)                                          85%     84%
Matt Hodges    Farm Occupancy (Jul)                                          85%     78%
Matt Viljoen   Number of >Severe CASTOR Incidents (Jun)                      6       2

Availability was poor in June owing to the move of the Tier-1 to R89.
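
As a rough illustration of the size of the June shortfall, the per-experiment SAM availability figures from the Key Metrics table can be compared against the 97% target. This is only a minimal sketch: the names `june_availability` and `shortfall` are hypothetical and are not part of any Tier-1 tooling.

```python
# June SAM availability figures (percent), copied from the Key Metrics table.
TARGET = 97
june_availability = {"ALICE": 73, "ATLAS": 71, "CMS": 71, "LHCb": 67}


def shortfall(achieved, target=TARGET):
    """Return how many percentage points 'achieved' is below 'target' (0 if met)."""
    return max(0, target - achieved)


# Hypothetical helper structure: shortfall per experiment, in percentage points.
shortfalls = {vo: shortfall(value) for vo, value in june_availability.items()}

for vo, points in shortfalls.items():
    print(f"{vo}: {points} percentage points below the {TARGET}% target")
```

On these figures every experiment was between 24 and 30 percentage points below target in June.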

Key Production Milestones

See myactions:

https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/

High Level Schedule

Final Update Window            Mon 13/07/09 to 30/09/09
Tier-1 Stability Period (2)    October to mid-November
LHC First Beam                 mid-November

Note that:

  • Software freeze date of end of September was considered reasonable by WLCG MB.

Disaster Management

Swine Flu (H1N1) is being handled in the Tier-1 Disaster Management System (currently level 2). The disk deployment problems will probably be entered as level 1.

Swine Flu Response Plan

See: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/TierOneSwineFlu

There will be a Tier-1 work-at-home day on Wednesday 19th August.

Purchasing and Finance

  • GridPP is finalising its spend plan.
  • Commencing current disk and CPU tenders (Dave Corney leading). Disk PQQ is running. CPU PQQ will launch shortly.

Staffing

  • One experiment support post started today (Andrew Lahiff). For the second experiment support post, we are ready to make an offer.
  • EGEE-funded PPS recruitment will start on Monday 17th August.

PMB Experiment Reports

ATLAS

ATLAS require 2 weeks of stability during August. The start date has slipped, but no new date is available yet. The slippage leads to a clash with our LFC downtime.

CMS

LHCb

1) Production restarted last week, after new disk servers became available at CERN and all failover transfer requests had finished. The pending productions were started last Thursday and finished on Sunday (including the 10**9 minimum bias run), after quickly ramping up to > 18K simultaneously running jobs.

2) Bugs relating to job prioritisation were found and fixed within DIRAC.

3) Various Tier-1 sites (not RAL) ran out of storage in the MC-M-DST service class last week. Those sites quickly added more storage when alerted by GGUS tickets.

4) lcgwms02 at RAL had problems over the weekend. This caused various Monte Carlo simulation jobs to fail, primarily at Bristol.

Outlook: User analysis and further MC productions are being prepared.

Hardware Deployment Report

The team will restart work, chaired by Matt Hodges.

Team Reports

Fabric

RAL Tier1 weekly operations Fabric 20090810

Grid Services

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20090810

CASTOR

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_10/08/2009

Database

http://www.gridpp.ac.uk/wiki/Operations_Report_10/08/2009

Production

Production Team Report 2009-08-10