Difference between revisions of "RAL Tier1 weekly operations Overview 20091012"

Latest revision as of 14:01, 12 October 2009

Overview of Milestones and Metrics

Key Metrics

Owner          Description                                                   Target  Achieved
Gareth Smith   Overall Tier-1 SAM Availability (last week)                   97%     42%
Gareth Smith   ALICE SAM Availability (Aug)                                  97%     77%
Gareth Smith   ATLAS SAM Availability (Aug)                                  97%     75%
Gareth Smith   CMS SAM Availability (Aug)                                    97%     77%
Gareth Smith   LHCb SAM Availability (Aug)                                   97%     78%
Andrew Sansum  Fraction of Tier-1 Staff in Post (Aug)                        93%     103%
Gareth Smith   Number of days where called out (last spreadsheet full week)  3       2
Matt Hodges    Percentage met of UB allocation of disk (Aug)                 100%
Matt Hodges    Job Efficiency (Aug)                                          85%     67%
Matt Hodges    Farm Occupancy (Aug)                                          85%     41%
Matt Viljoen   Number of >Severe CASTOR Incidents (Aug)                      6       1

Key Production Milestones

See myactions:

https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/

High Level Schedule

Tier-1 Stability Period (2)    October to mid-November
LHC First beam                 mid-November
LHC Standby                    December 19th
Restart                        4th January
Run ends                       October 2010

Disaster Management

  • Swine Flu (H1N1): downgraded to level 1. No regular meetings; will re-activate when case frequency increases.
  • Disk deployment (level 2): testing with Viglen is ongoing. It is increasingly likely that we will escalate to level 3 if there is no progress soon.
  • Machine room air-conditioning: now level 2.
  • Water leak
  • Multiple RAID array failures (was 4): now level 2.

Purchasing and Finance

  • GridPP finalised high-level spend plan.
  • Disk tender at ITT evaluation stage.
  • CPU PQQ at ITT stage.
  • Tape drives purchased.
  • Finalising spend plan.

Staffing

PMB Experiment Reports

Experiments severely impacted by Tier-1 failure last week.

ATLAS

ATLAS report possible ongoing performance problems.

CMS

CMS report possible lost files.

LHCB

Hardware Deployment Report

A record of a verbal report on disk deployment progress this week from DepMon (Chris):

 * James T (deputy FabMon) has fulfilled the Resource Allocation
   tickets for the four LHC deployments.
 * Chris will assign tickets to Deployers from tomorrow (after fixing
   the Nagios documentation in the deployment process).
 * Deployers will be Chris (LHCb), Catalin (ALICE), Andrew Lahiff
   (CMS), and Tiju (ATLAS; first 10 of 25 servers).
 * High priority to be given to LHCb, who are running low on space in a
   D1 service class (RT#52203).

Matt


Team Reports

Fabric

RAL Tier1 weekly operations Fabric 20091012

Grid Services

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20091012

CASTOR

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_12/10/2009

Database

http://www.gridpp.ac.uk/wiki/Operations_Report_12/10/2009

Production

Production Team Report 2009-10-12