RAL Tier1 weekly operations Overview 20091019
Latest revision as of 14:13, 19 October 2009
Overview of Milestones and Metrics
Key Metrics
Owner | Description | Target | Achieved |
---|---|---|---|
Gareth Smith | Overall Tier-1 SAM Availability (last week) | 97% | 98% |
Gareth Smith | ALICE SAM Availability (Aug) | 97% | 77% |
Gareth Smith | ATLAS SAM Availability (Aug) | 97% | 75% |
Gareth Smith | CMS SAM Availability (Aug) | 97% | 77% |
Gareth Smith | LHCb SAM Availability (Aug) | 97% | 78% |
Andrew Sansum | Fraction of Tier-1 Staff in Post (Aug) | 93% | 103% |
Gareth Smith | Number of days with call-outs (last full week in spreadsheet) | 3 | |
Matt Hodges | Percentage of UB disk allocation met (Aug) | 100% | |
Matt Hodges | Job Efficiency (Aug) | 85% | 67% |
Matt Hodges | Farm Occupancy (Aug) | 85% | 41% |
Matt Viljoen | Number of Severe (or worse) CASTOR Incidents (Aug) | 6 | 1 |
Key Production Milestones
See myactions:
https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/
High Level Schedule
- Tier-1 Stability Period (2): October to mid-November
- LHC first beam: mid-November
- LHC standby: 19th December
- Restart: 4th January
- Run ends: October 2010
Disaster Management
- Swine Flu (H1N1): downgraded to level 1. No regular meetings; will re-activate when case frequency increases.
- Disk deployment (level 2): ongoing testing with Viglen. Increasing likelihood that we will escalate to level 3 if there is no progress soon.
- Machine room air-conditioning: now level 2.
- Water leak
- Multiple RAID array failures: level 2 (was 4).
- CASTOR data loss (level 4). The severe data loss has been damaging to our reputation.
Purchasing and Finance
- GridPP finalised the high-level spend plan.
- Disk tender at ITT evaluation stage.
- CPU PQQ at ITT stage
- Tape drives ordered
- Finalising spend plan.
Staffing
PMB Experiment Reports
The PMB is very concerned about the recent loss of data and considers it the most severe incident ever experienced. It will carry out a review of the Tier-1 in December, focusing on data-retention concerns.
ATLAS
The experiment reports that it is very concerned.
CMS
LHCB
Hardware Deployment Report
1. Disk servers deployed last week:
   - 1x lhcbMdst (Chris)
   - 6x lhcbNonProd (Chris) - awaiting LHCb decision
3. Deployment Rota (19/10 - 23/10):
   - FabMon: Martin
   - DeputyFabMon: None
   - DepMon: Chris
   - DeputyDepMon: None
4. Deployment for this week:
   - 2x Alice - genNonProd (Catalin)
   - 6x Cms - cmsNonProd (Andrew L.)
   - 10x Atlas - atlasNonProd (Tiju)
   - 1x Atlas - atlasNonProd (Richard)
   - 14x Atlas - not assigned yet
5. Blocking issues/problems: several problems identified with disk server deployment:
   - access to Puppet boxes - FIXED
   - ssh keys not forwarding - FIXED
   - read-only access to overwatch - James A. needs to be notified
   - incorrect partitions on new disk servers - FIXED, but this has delayed the deployment process
   - missing hostcert.pem for new disk servers - believed to be fixed now
Chris
Team Reports
Fabric
RAL Tier1 weekly operations Fabric 20091019
Grid Services
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20091019
CASTOR
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_19/10/2009
Database
http://www.gridpp.ac.uk/wiki/Operations_Report_19/10/2009