RAL Tier1 weekly operations Overview 20091019

From GridPP Wiki

Latest revision as of 14:13, 19 October 2009

Overview of Milestones and Metrics

Key Metrics

Owner Description Target Achieved
Gareth Smith Overall Tier-1 SAM Availability (last week) 97% 98%
Gareth Smith ALICE SAM Availability (Aug) 97% 77%
Gareth Smith ATLAS SAM Availability (Aug) 97% 75%
Gareth Smith CMS SAM Availability (Aug) 97% 77%
Gareth Smith LHCb SAM Availability (Aug) 97% 78%
Andrew Sansum Fraction of Tier-1 Staff in Post (Aug) 93% 103%
Gareth Smith Number of days where called out (last spreadsheet full week) 3
Matt Hodges Percentage met of UB allocation of disk (Aug) 100%
Matt Hodges Job Efficiency (Aug) 85% 67%
Matt Hodges Farm Occupancy (Aug) 85% 41%
Matt Viljoen Number of >Severe CASTOR Incidents (Aug) 6 1

Key Production Milestones

See myactions:

https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/

High Level Schedule

Tier-1 Stability Period (2)    October to mid-November
LHC First beam                 mid-November
LHC Standby                    December 19th
Restart                        4th January
Run ends                       October 2010

Disaster Management

  • Swine Flu (H1N1) downgraded to level 1. No regular meetings; will reactivate when case frequency increases.
  • Disk deployment (level 2): testing with Viglen is ongoing. It is increasingly likely that we will escalate to level 3 if there is no progress soon.
  • Machine room air-conditioning. Now level 2.
  • Water leak
  • Multiple RAID array failures: now level 2 (was 4).
  • CASTOR data loss (level 4). The severe data loss has damaged our reputation.

Purchasing and Finance

  • GridPP finalised high-level spend plan.
  • Disk tender at ITT evaluation stage.
  • CPU PQQ at ITT stage.
  • Tape drives ordered.
  • Finalising spend plan.

Staffing

PMB Experiment Reports

The PMB is very concerned about the recent loss of data and considers it the most severe incident ever experienced. It will carry out a review of the Tier-1 in December, focusing on data retention concerns.

ATLAS

Reports that the experiment is very concerned.

CMS

LHCB

Hardware Deployment Report

1. Disk servers deployed last week:
  • 1x lhcbMdst (Chris)
  • 6x lhcbNonProd (Chris) - awaiting lhcb decision

3. Deployment Rota (19/10 - 23/10):
  • FabMon: Martin
  • DeputyFabMon: None
  • DepMon: Chris
  • DeputyDepMon: None

4. Deployment for this week:
  • 2x Alice - genNonProd (Catalin)
  • 6x Cms - cmsNonProd (Andrew L.)
  • 10x Atlas - atlasNonProd (Tiju)
  • 1x Atlas - atlasNonProd (Richard)
  • 14x Atlas - not assigned yet

5. Blocking issues/problems:
  • Identified several problems with Disk Server deployment, such as:
      - access to Puppet boxes - FIXED
      - ssh-keys not forwarding - FIXED
      - readonly access to overwatch - James A. needs to be notified
      - incorrect partitions on new disk servers - FIXED, but this has delayed the deployment process
      - missing hostcert.pem for new disk servers - believe it is fixed now

Chris


Team Reports

Fabric

RAL Tier1 weekly operations Fabric 20091019

Grid Services

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20091019

CASTOR

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_19/10/2009

Database

http://www.gridpp.ac.uk/wiki/Operations_Report_19/10/2009

Production

Production Team Report 2009-10-19