RAL Tier1 weekly operations Overview 20090622
Latest revision as of 13:51, 23 June 2009
Overview of Milestones and Metrics
Key High Level dates
- LHC schedule expects first beam at the end of September and collisions at the end of October. See http://indico.cern.ch/getFile.py/access?contribId=11&resId=0&materialId=slides&confId=52248.
- Data taking is then expected to continue (with a 2-week stop for Christmas) through much of 2010. Alternative scenarios are being discussed.
- Machine room migration has commenced and continues for 2 weeks.
Key Metrics
Owner | Description | Target | Achieved |
---|---|---|---|
Gareth Smith | Overall Tier-1 SAM Availability (last week) | 97% | 100% |
Gareth Smith | ALICE SAM Availability (May) | 97% | 60% |
Gareth Smith | ATLAS SAM Availability (May) | 97% | 80% |
Gareth Smith | CMS SAM Availability (May) | 97% | 77% |
Gareth Smith | LHCb SAM Availability (May) | 97% | 84% |
Andrew Sansum | Fraction of Tier-1 Staff in Post (May) | 93% | 103% |
Gareth Smith | Number of days where called out (last spreadsheet full week) | 3 | 3 |
Matt Hodges | Percentage met of UB allocation of disk (May) | 100% | 91% |
Matt Hodges | Job Efficiency (May) | 85% | 81% |
Matt Hodges | Farm Occupancy (May) | 85% | 43% |
Matt Viljoen | Number of >Severe CASTOR Incidents (May) | 6 | 1 |
Availability was poor in May. There was a major CASTOR upgrade to the database RAID controllers that overran. There were also a number of network interventions to upgrade the C300 switch. One of these caused severe disruption to CASTOR over much of the day.
Key Production Milestones
See milestone spreadsheet
Unplanned
- 3D upgrade and schedule required
- Nagios upgrade and schedule required
R89 Migration Summary
The migration has started.
- CPU rack moves (and helpdesk) today, Tuesday and Wednesday
- Super Thursday - whole service stops; many rack moves
- CASTOR Friday and Monday
- Disk servers continue until Friday 3rd July.
High Level Schedule
Phase | Start | Finish |
---|---|---|
Phase II Migration (Tier-1) | Mon 22/06/09 | Fri 03/07/09 |
Phase II contingency (Tier-1 Frozen) | Mon 06/07/09 | Fri 17/07/09 |
Final Update Window | Mon 20/07/09 | Fri 31/07/09 |
Tier-1 Stability Period (2) | Mon 03/08/09 | Fri 28/08/09 |
LHC Experiments Require Stability | Mon 31/08/09 | Fri 25/09/09 |
LHC First beam | Mon 28/09/09 | Mon 28/09/09 |
LHC prepares for collisions | Mon 28/09/09 | Fri 23/10/09 |
LHC Collisions | Fri 23/10/09 | Fri 23/10/09 |
R89 Migration Downtime Plan
Event | Date and time |
---|---|
WMS Drain commences | Wed 17/06/09 |
Batch System Drain | Sometime in w/e Sat 20/06/09 |
Batch System Off | Mon 22/06/09 (08:00) |
FTS Drain | Thu 25/06/09 (06:00) |
Full Stop (all services) | Thu 25/06/09 (08:00) |
Network Down | Thu 25/06/09 (08:00-16:00) |
Resume critical non-moving services (e.g. RGMA) | Thu 25/06/09 16:00 |
Resume FTS; LFC; WMS; 3D …. | Fri 26/06/09 12:00 |
CASTOR Core restart commences | Tue 30/06/09 |
Last Disk Server rack moves | Fri 03/07/09 |
Resume CASTOR Service | Mon 06/07/09 12:00 |
Resume batch Service | Mon 06/07/09 14:00 |
Service restart will be earlier if the racks move earlier or the team can restart over the weekend.
Purchasing and Finance
- Commencing current tenders. Discussion with procurement on Tuesday 23rd June.
Staffing
- One experiment support post accepted. Second post interviewed.
- PPS recruitment failed – seeking re-approval.
- YII post expected in July
- Extra CASTOR dbadmin interviewed.
PMB Experiment Reports
ATLAS
CMS
LHCb
1. Problem with CASTOR configuration at RAL: the limit on the number of connections allowed by xrootd was 100, while many (>300) jobs were running off the three d0t1 servers. This led to intermittent job failures. The problem was seen on 16 June, then identified and solved on 17 June by Shaun, who increased the xrootd connection limit to 200 per diskserver.
3. DIRAC server found to run at its hardware limit when serving ~8K jobs. This is because more services are now coming online and there are more jobs (including user jobs) with shorter lifetimes and more heartbeats. Work is ongoing within LHCb to solve this.
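The connection-limit arithmetic behind item 1 can be sketched as follows. This is only an illustration, not the actual CASTOR or xrootd configuration: it assumes each job holds exactly one xrootd connection and that the limit applies per diskserver, and the function names are hypothetical.

```python
# Rough capacity check for the xrootd connection limit on the d0t1 servers.
# Assumptions (not from the report): one connection per job, and the limit
# is applied per diskserver. Function names are hypothetical.

def total_connection_capacity(servers: int, limit_per_server: int) -> int:
    """Total simultaneous connections the server pool can accept."""
    return servers * limit_per_server


def has_headroom(concurrent_jobs: int, servers: int, limit_per_server: int) -> bool:
    """True if the pool can accept all jobs under the assumptions above."""
    return concurrent_jobs <= total_connection_capacity(servers, limit_per_server)


if __name__ == "__main__":
    servers = 3    # d0t1 servers mentioned in the report
    jobs = 301     # smallest job count consistent with ">300"
    for limit in (100, 200):  # limit before and after the change
        capacity = total_connection_capacity(servers, limit)
        print(f"limit={limit}: capacity={capacity}, "
              f"fits {jobs} jobs: {has_headroom(jobs, servers, limit)}")
```

Under these assumptions the old limit gives a pool capacity of 300, which more than 300 jobs exceeds, while the new limit gives 600. In practice jobs are unlikely to be spread evenly across servers, so a single server can hit its limit before the pool as a whole does.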
Outlook for the week:
- RAL down for moving hardware to new building.
- Billion event production continues at Tier-2s.
- User analysis jobs.
Hardware Deployment Report
None
Team Reports
Fabric
Grid Services
https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20090622
CASTOR
http://storage.esc.rl.ac.uk/castor/weekly_reports/Tier-1Operations-castor-20090622.doc