RAL Tier1 weekly operations Overview 20091012
Revision as of 14:01, 12 October 2009 by Andrew Sansum
Overview of Milestones and Metrics
Key Metrics
Owner | Description | Target | Achieved |
---|---|---|---|
Gareth Smith | Overall Tier-1 SAM Availability (last week) | 97% | 42% |
Gareth Smith | ALICE SAM Availability (Aug) | 97% | 77% |
Gareth Smith | ATLAS SAM Availability (Aug) | 97% | 75% |
Gareth Smith | CMS SAM Availability (Aug) | 97% | 77% |
Gareth Smith | LHCb SAM Availability (Aug) | 97% | 78% |
Andrew Sansum | Fraction of Tier-1 Staff in Post (Aug) | 93% | 103% |
Gareth Smith | Number of days where called out (last spreadsheet full week) | 3 | 2 |
Matt Hodges | Percentage met of UB allocation of disk (Aug) | 100% | |
Matt Hodges | Job Efficiency (Aug) | 85% | 67% |
Matt Hodges | Farm Occupancy (Aug) | 85% | 41% |
Matt Viljoen | Number of >Severe CASTOR Incidents (Aug) | 6 | 1 |
Key Production Milestones
See myactions:
https://myactions.gridpp.rl.ac.uk/all/where/category_name/Operational/
High Level Schedule
- Tier-1 Stability Period (2): October to mid-November
- LHC first beam: mid-November
- LHC standby: December 19th
- Restart: 4th January
- Run ends: October 2010
Disaster Management
- Swine Flu (H1N1) downgraded to level 1. No regular meetings; these will be re-activated when case frequency increases.
- Disk deployment (level 2): testing with Viglen is ongoing. It is increasingly likely that we will escalate to level 3 if there is no progress soon.
- Machine room air-conditioning. Now level 2.
- Water leak
- Multiple RAID Array failures (was 4) level 2
Purchasing and Finance
- GridPP finalised high-level spend plan.
- Disk tender at ITT evaluation stage.
- CPU PQQ at ITT stage
- Tape drives purchased
- Finalising spend plan.
Staffing
PMB Experiment Reports
Experiments severely impacted by Tier-1 failure last week.
ATLAS
Reports possible ongoing performance problems.
CMS
Reports possible lost files.
LHCb
Hardware Deployment Report
A record of a verbal report on disk deployment progress this week from DepMon (Chris):
- James T (deputy FabMon) has fulfilled the Resource Allocation tickets for the four LHC deployments.
- Chris will assign tickets to Deployers from tomorrow (after fixing the Nagios documentation in the deployment process).
- Deployers will be Chris (LHCb), Catalin (ALICE), Andrew Lahiff (CMS), and Tiju (ATLAS; first 10 of 25 servers).
- High priority to be given to LHCb, who are running low on space in a D1 service class (RT#52203).
Matt
Team Reports
Fabric
RAL Tier1 weekly operations Fabric 20091012
Grid Services
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20091012
CASTOR
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor_12/10/2009
Database
http://www.gridpp.ac.uk/wiki/Operations_Report_12/10/2009