RAL Tier1 weekly operations Overview 20090629

From GridPP Wiki
Revision as of 08:52, 29 June 2009 by Andrew sansum (Talk | contribs)

Overview of Milestones and Metrics

Key High Level dates

  • LHC schedule expects first beam at the end of September and collisions at the end of October. See http://indico.cern.ch/getFile.py/access?contribId=11&resId=0&materialId=slides&confId=52248.
  • There will be no formal change in WLCG planning for data taking until the WLCG workshop on 9th July.
  • Data taking is then expected to continue (with a 2 week stop for Christmas) through much of 2010. Alternative scenarios are being discussed.
  • Machine room migration has commenced and continues for 1 week.

Key Metrics

Owner          Description                                                   Target  Achieved
Gareth Smith   Overall Tier-1 SAM Availability (last week)                   42%     100%
Gareth Smith   Alice SAM Availability (May)                                  97%     60%
Gareth Smith   ATLAS SAM Availability (May)                                  97%     80%
Gareth Smith   CMS SAM Availability (May)                                    97%     77%
Gareth Smith   LHCB SAM Availability (May)                                   97%     84%
Andrew Sansum  Fraction of Tier-1 Staff in Post (May)                        93%     103%
Gareth Smith   Number of days where called out (last spreadsheet full week)  3       3
Matt Hodges    Percentage met of UB allocation of disk (May)                 100%    91%
Matt Hodges    Job Efficiency (May)                                          85%     81%
Matt Hodges    Farm Occupancy (May)                                          85%     43%
Matt Viljoen   Number of >Severe CASTOR Incidents (May)                      6       1

Availability was poor in May. A major CASTOR intervention to upgrade the database RAID controllers overran. There were also a number of network interventions to upgrade the C300 switch, one of which caused severe disruption to CASTOR for much of the day.

Key Production Milestones

See milestone spreadsheet

Unplanned

It has been impossible to move planning forward owing to lack of staff availability during the move.

  • 3D upgrade and schedule required
  • Nagios upgrade and schedule required

R89 Migration Summary

The migration is progressing well. See: http://www.gridpp.rl.ac.uk/blog/category/r89-migration/

By 29th (morning):

  • CASTOR infrastructure moved and being rebuilt.
  • Robotics (drives and media) moving Monday/Tuesday.
  • Disk servers continue until Friday 3rd July.
  • Restart scheduled for 6th July.

High Level Schedule

Phase II Migration (Tier-1)				Mon 22/06/09	Fri 03/07/09
Phase II contingency (Tier-1 Frozen)			Mon 06/07/09	Fri 17/07/09
Final Update Window					Mon 20/07/09	Fri 31/07/09
Tier-1 Stability Period (2)				Mon 03/08/09	Fri 28/08/09
LHC Experiments Require Stability			Mon 31/08/09	Fri 25/09/09  
LHC First beam				        	Mon 28/09/09	Mon 28/09/09
LHC prepares for collisions				Mon 28/09/09	Fri 23/10/09
LHC Collisions					        Fri 23/10/09	Fri 23/10/09
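
The weekday labels and phase lengths in the schedule above can be cross-checked mechanically. The following is a minimal Python sketch, assuming the rows are transcribed by hand into a list and that dates are in dd/mm/yy form; the names `SCHEDULE`, `parse_entry`, `weekday_mismatches` and `duration_days` are hypothetical and are not part of any Tier-1 tooling.

```python
from datetime import datetime

# Rows transcribed from the High Level Schedule: (phase, start, end).
SCHEDULE = [
    ("Phase II Migration (Tier-1)", "Mon 22/06/09", "Fri 03/07/09"),
    ("Phase II contingency (Tier-1 Frozen)", "Mon 06/07/09", "Fri 17/07/09"),
    ("Final Update Window", "Mon 20/07/09", "Fri 31/07/09"),
    ("Tier-1 Stability Period (2)", "Mon 03/08/09", "Fri 28/08/09"),
    ("LHC Experiments Require Stability", "Mon 31/08/09", "Fri 25/09/09"),
    ("LHC First beam", "Mon 28/09/09", "Mon 28/09/09"),
    ("LHC prepares for collisions", "Mon 28/09/09", "Fri 23/10/09"),
    ("LHC Collisions", "Fri 23/10/09", "Fri 23/10/09"),
]

# Fixed English abbreviations, so the check does not depend on the locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def parse_entry(text):
    """Split 'Mon 22/06/09' into its weekday label and a date (dd/mm/yy)."""
    label, date_text = text.split()
    return label, datetime.strptime(date_text, "%d/%m/%y").date()


def weekday_mismatches(schedule):
    """Return (phase, entry, actual weekday) for each wrongly labelled entry."""
    problems = []
    for phase, start, end in schedule:
        for entry in (start, end):
            label, day = parse_entry(entry)
            actual = DAY_NAMES[day.weekday()]
            if label != actual:
                problems.append((phase, entry, actual))
    return problems


def duration_days(start, end):
    """Inclusive length of a phase in calendar days."""
    return (parse_entry(end)[1] - parse_entry(start)[1]).days + 1


if __name__ == "__main__":
    for phase, start, end in SCHEDULE:
        print(f"{phase}: {duration_days(start, end)} days")
    print("Weekday mismatches:", weekday_mismatches(SCHEDULE))
```

The same check could be applied to the entries of a downtime plan by passing them to `weekday_mismatches` in the same (name, start, end) form.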

R89 Migration Downtime Plan

WMS Drain commences                                            Wed 17/06/09
Batch System Drain                           Sometime in w/e   Sat 20/06/09
Batch System Off                                               Mon 22/06/09 (08:00)
FTS Drain                                                      Thu 25/06/09  (06:00)
Full Stop (all services)                                       Thu 25/06/09  (08:00)
Network Down                                                   Thu 25/06/09  (08:00-16:00)
Resume critical non moving services   (eg RGMA)                Thu 25/06/09  16:00
Resume FTS; LFC; WMS; 3D ….                                    Fri 26/06/09  12:00

We have now achieved all the above. Still to come:

CASTOR Core restart commences                                  Tue 30/06/09  
Last Disk Server rack moves                                    Fri 03/07/09
Resume CASTOR Service                                          Mon 06/07/09  12:00
Resume batch Service                                           Mon 06/07/09   14:00

Service restart will be earlier if the racks move earlier or the team can restart over the weekend.

Disaster Management

Swine Flu (H1N1) is being handled in the Tier-1 Disaster Management System (currently level 2).

Purchasing and Finance

  • Commencing current tenders. Good discussion with procurement last week. Initial attempt at schedule drawn up by David. First meeting of Hardware Assessment Group (HAG) on 29th June.

Staffing

  • One experiment support post accepted. Second post interviewed.
  • PPS recruitment re-approved.
  • YII post expected in July.
  • Extra CASTOR dbadmin accepted.

PMB Experiment Reports

No PMB today

ATLAS

CMS

LHCB

Hardware Deployment Report

None

Team Reports

Fabric

Grid Services

http://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20090629

CASTOR

http://storage.esc.rl.ac.uk/castor/weekly_reports/RAL_Tier1_weekly_operations_CASTOR_20090629

Database

Production

Production Team Report 2009-06-29