RAL Tier1 weekly operations Overview 20090629
Overview of Milestones and Metrics
Key High Level dates
- LHC schedule expects first beam at the end of September and collisions at the end of October. See http://indico.cern.ch/getFile.py/access?contribId=11&resId=0&materialId=slides&confId=52248.
- There will be no formal change in WLCG planning for data taking until the WLCG workshop on 9th July.
- Data taking is then expected to continue (with a 2 week stop for Christmas) through much of 2010. Alternative scenarios are being discussed.
- Machine room migration has commenced and continues for 1 week
Key Metrics
Owner | Description | Target | Achieved |
---|---|---|---|
Gareth Smith | Overall Tier-1 SAM Availability (last week) | 42% | 100% |
Gareth Smith | Alice SAM Availability (May) | 97% | 60% |
Gareth Smith | ATLAS SAM Availability (May) | 97% | 80% |
Gareth Smith | CMS SAM availability (May) | 97% | 77% |
Gareth Smith | LHCB SAM availability (May) | 97% | 84% |
Andrew Sansum | Fraction of Tier-1 Staff in Post (May) | 93% | 103% |
Gareth Smith | Number of days where called out (last spreadsheet full week) | 3 | 3 |
Matt Hodges | Percentage met of UB allocation of disk (May) | 100% | 91% |
Matt Hodges | Job Efficiency (May) | 85% | 81% |
Matt Hodges | Farm Occupancy (May) | 85% | 43% |
Matt Viljoen | Number of >Severe CASTOR Incidents (May) | 6 | 1 |
Availability was poor in May. There was a major CASTOR upgrade to the database RAID controllers that overran. There were also a number of network interventions to upgrade the C300 switch. One of these caused severe disruption to CASTOR over much of the day.
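As a rough illustration of how far the May SAM figures fell short, the sketch below computes the gap to target for each experiment. The numbers are copied from the Key Metrics table above; the function name `shortfalls` and the dictionary layout are illustrative assumptions, not part of any Tier-1 tooling.

```python
# May SAM availability (percent), copied from the Key Metrics table.
TARGET = 97
achieved = {"Alice": 60, "ATLAS": 80, "CMS": 77, "LHCB": 84}


def shortfalls(achieved, target):
    """Return the percentage-point gap to target for each entry below target."""
    return {vo: target - value for vo, value in achieved.items() if value < target}


gaps = shortfalls(achieved, TARGET)

# List the experiments with the largest gap first.
for vo, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{vo}: {gap} points below the {TARGET}% target")
```

This only applies to metrics where higher is better; the call-out and CASTOR incident rows would need the comparison reversed.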
Key Production Milestones
See milestone spreadsheet
Unplanned
It has been impossible to move planning forward owing to lack of staff availability during the move.
- 3D upgrade and schedule required
- Nagios upgrade and schedule required
R89 Migration Summary
The migration is progressing well. See: http://www.gridpp.rl.ac.uk/blog/category/r89-migration/
By 29th (morning):
- CASTOR infrastructure moved and being rebuilt.
- Robotics (drives and media) moving Monday/Tuesday
- Disk servers continue until Friday 3rd July.
- Restart scheduled for 6th July.
High Level Schedule
Phase | Start | End |
---|---|---|
Phase II Migration (Tier-1) | Mon 22/06/09 | Fri 03/07/09 |
Phase II contingency (Tier-1 Frozen) | Mon 06/07/09 | Fri 17/07/09 |
Final Update Window | Mon 20/07/09 | Fri 31/07/09 |
Tier-1 Stability Period (2) | Mon 03/08/09 | Fri 28/08/09 |
LHC Experiments Require Stability | Mon 31/08/09 | Fri 25/09/09 |
LHC First beam | Mon 28/09/09 | Mon 28/09/09 |
LHC prepares for collisions | Mon 28/09/09 | Fri 23/10/09 |
LHC Collisions | Fri 23/10/09 | Fri 23/10/09 |
R89 Migration Downtime Plan
Step | When |
---|---|
WMS Drain commences | Wed 17/06/09 |
Batch System Drain | Sometime in w/e Sat 20/06/09 |
Batch System Off | Mon 22/06/09 (08:00) |
FTS Drain | Thu 25/06/09 (06:00) |
Full Stop (all services) | Thu 25/06/09 (08:00) |
Network Down | Thu 25/06/09 (08:00-16:00) |
Resume critical non moving services (eg RGMA) | Thu 25/06/09 16:00 |
Resume FTS; LFC; WMS; 3D …. | Fri 26/06/09 12:00 |

We have now achieved all the above. Still to come:

Step | When |
---|---|
CASTOR Core restart commences | Tue 30/06/09 |
Last Disk Server rack moves | Fri 03/07/09 |
Resume CASTOR Service | Mon 06/07/09 12:00 |
Resume batch Service | Mon 06/07/09 14:00 |
Service restart will be earlier if racks go earlier or team can restart over the weekend.
Disaster Management
Swine Flu (H1N1) is being handled in the Tier-1 Disaster Management System (currently level 2)
Purchasing and Finance
- Commencing current tenders. Good discussion with procurement last week. Initial attempt at schedule drawn up by David. First meeting of the Hardware Assessment Group (HAG) on 29th June.
Staffing
- One experiment support post accepted. Second post interviewed.
- PPS recruitment re-approved.
- YII post expected in July
- Extra CASTOR dbadmin accepted.
PMB Experiment Reports
No PMB today
ATLAS
CMS
LHCB
Hardware Deployment Report
None
Team Reports
Fabric
Grid Services
http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20090629
CASTOR
http://storage.esc.rl.ac.uk/castor/weekly_reports/RAL_Tier1_weekly_operations_CASTOR_20090629