RAL Tier1 weekly operations Overview 20091019
Latest revision as of 14:13, 19 October 2009
Overview of Milestones and Metrics
Key Metrics
Owner | Description | Target | Achieved |
---|---|---|---|
Gareth Smith | Overall Tier-1 SAM Availability (last week) | 97% | 98% |
Gareth Smith | ALICE SAM Availability (Aug) | 97% | 77% |
Gareth Smith | ATLAS SAM Availability (Aug) | 97% | 75% |
Gareth Smith | CMS SAM Availability (Aug) | 97% | 77% |
Gareth Smith | LHCb SAM Availability (Aug) | 97% | 78% |
Andrew Sansum | Fraction of Tier-1 Staff in Post (Aug) | 93% | 103% |
Gareth Smith | Number of days with call-outs (last full week in spreadsheet) | 3 | |
Matt Hodges | Percentage of UB disk allocation met (Aug) | 100% | |
Matt Hodges | Job Efficiency (Aug) | 85% | 67% |
Matt Hodges | Farm Occupancy (Aug) | 85% | 41% |
Matt Viljoen | Number of Severe (or worse) CASTOR Incidents (Aug) | 6 | 1 |
Key Production Milestones
See myactions:
https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/
High Level Schedule
- Tier-1 Stability Period (2): October to mid-November
- LHC first beam: mid-November
- LHC standby: 19th December
- Restart: 4th January
- Run ends: October 2010
Disaster Management
- Swine Flu (H1N1): downgraded to level 1. No regular meetings; will re-activate when case frequency increases.
- Disk deployment (level 2): ongoing testing with Viglen. Increasing likelihood that we will escalate to level 3 if there is no progress soon.
- Machine room air-conditioning: now level 2.
- Water leak
- Multiple RAID array failures: level 2 (was 4).
- CASTOR data loss (level 4). The severe data loss has been damaging to our reputation.
Purchasing and Finance
- GridPP finalised the high-level spend plan.
- Disk tender at ITT evaluation stage.
- CPU PQQ at ITT stage
- Tape drives ordered
- Finalising spend plan.
Staffing
PMB Experiment Reports
The PMB is very concerned about the recent loss of data and considers it the most severe incident ever experienced. It will carry out a review of the Tier-1 in December, focusing on data-retention concerns.
ATLAS
The experiment reports that it is very concerned.
CMS
LHCB
Hardware Deployment Report
1. Disk servers deployed last week:
   - 1x lhcbMdst (Chris)
   - 6x lhcbNonProd (Chris) - awaiting LHCb decision
3. Deployment Rota (19/10 - 23/10):
   - FabMon: Martin
   - DeputyFabMon: None
   - DepMon: Chris
   - DeputyDepMon: None
4. Deployment for this week:
   - 2x Alice - genNonProd (Catalin)
   - 6x Cms - cmsNonProd (Andrew L.)
   - 10x Atlas - atlasNonProd (Tiju)
   - 1x Atlas - atlasNonProd (Richard)
   - 14x Atlas - not assigned yet
5. Blocking issues/problems: several problems identified with disk server deployment:
   - access to Puppet boxes - FIXED
   - ssh keys not forwarding - FIXED
   - read-only access to overwatch - James A. needs to be notified
   - incorrect partitions on new disk servers - FIXED, but this has delayed the deployment process
   - missing hostcert.pem for new disk servers - believed to be fixed now
Chris
Team Reports
Fabric
RAL Tier1 weekly operations Fabric 20091019
Grid Services
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20091019
CASTOR
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_19/10/2009
Database
http://www.gridpp.ac.uk/wiki/Operations_Report_19/10/2009