RAL Tier1 weekly operations Overview 20091102

From GridPP Wiki

Latest revision as of 15:43, 2 November 2009

Overview of Milestones and Metrics

Key Metrics

Owner          Description                                                   Target  Achieved
Gareth Smith   Overall Tier-1 SAM Availability (last week)                   97%     100%
Gareth Smith   Alice SAM Availability (Sep)                                  97%     81%
Gareth Smith   ATLAS SAM Availability (Sep)                                  97%     85%
Gareth Smith   CMS SAM Availability (Sep)                                    97%     87%
Gareth Smith   LHCB SAM Availability (Sep)                                   97%     91%
Andrew Sansum  Fraction of (GRIDPP funded) Tier-1 Staff in Post (Sep)        93%     103%
Gareth Smith   Number of days where called out (last spreadsheet full week)  3       2
Matt Hodges    Percentage met of UB allocation of disk (Sep)                 100%    To follow - UB schedule not finalised yet
Matt Hodges    Job Efficiency (Sep)                                          85%     72%
Matt Hodges    Farm Occupancy (Sep)                                          85%     To follow - UB schedule not finalised yet
Matt Viljoen   Number of >Severe CASTOR Incidents (Sep)                      6       2

Key Production Milestones

See myactions:

https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/

High Level Schedule

LHC commissioning appears to be on track and beam injection tests have commenced.

Tier-1 Stability Period (2)     October to mid-November
LHC first beam                  Mid-November
LHC standby                     19th December
Restart                         4th January
Run ends                        October 2010

Disaster Management

  • Multiple RAID array failures (was level 4), now level 2. We have so far failed to find the electrical problem that is causing the EMC RAID arrays to be unstable.
  • Disk deployment (level 2): testing with Viglen is ongoing. There are still problems with the hardware, and recently tested solutions have failed acceptance. Expect to escalate this at this week's review meeting.
  • Machine room air-conditioning (level 2). Will be reviewed this week. A major procurement is underway for additional cooling capacity.
  • Water leak. Will be reviewed this week; expect to reduce in severity.
  • CASTOR data loss. Major review underway.
  • Swine Flu (H1N1) downgraded to level 1. No regular meetings; will re-activate if case frequency increases.

Purchasing and Finance

  • GRIDPP finalised high level spend plan.
  • Disk tender at ITT evaluation stage.
  • CPU PQQ at evaluation stage.
  • Tape drives arrived.
  • Finalising spend plan.

Staffing

At full complement

PMB Experiment Reports

ATLAS

ATLAS expressed concern at the WLCG MB last Tuesday regarding recent events at RAL.

CMS

No issues

LHCB

No report

Hardware Deployment Report (Chris)

Deployment working well.

1. Disk servers deployed last week:

* all disk servers for aliceTape
* all disk servers for atlasStripInput
* all disk servers for atlasSimStrip, except 3 which are on hold to test the DepMon role

2. Deployment Rota (02/11 - 06/11):

* FabMon:        Martin
* DeputyFabMon:  James T.
* DepMon:        Matt
* DeputyDepMon:  Chris

3. Deployment for this week:

* deploy 3 remaining atlasSimStrip disk servers
* deploy gdss383 (CMS) when it's fixed
* no other outstanding deployment requests!

4. Problems:

* gdss383 (CMS): broken, waiting for memory replacement

Team Reports

Fabric

RAL Tier1 weekly operations Fabric 20091102

Grid Services

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20091102

CASTOR

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_02/11/2009

Database

http://www.gridpp.ac.uk/wiki/Operations_Report_02/11/2009

Production

Production Team Report 2009-11-02