RAL Tier1 weekly operations Overview 20090622

From GridPP Wiki

Latest revision as of 13:51, 23 June 2009

Overview of Milestones and Metrics

Key High Level dates

Key Metrics

Owner          Description                                                   Target  Achieved
Gareth Smith   Overall Tier-1 SAM Availability (last week)                   97%     100%
Gareth Smith   ALICE SAM Availability (May)                                  97%     60%
Gareth Smith   ATLAS SAM Availability (May)                                  97%     80%
Gareth Smith   CMS SAM Availability (May)                                    97%     77%
Gareth Smith   LHCb SAM Availability (May)                                   97%     84%
Andrew Sansum  Fraction of Tier-1 Staff in Post (May)                        93%     103%
Gareth Smith   Number of days where called out (last spreadsheet full week)  3       3
Matt Hodges    Percentage met of UB allocation of disk (May)                 100%    91%
Matt Hodges    Job Efficiency (May)                                          85%     81%
Matt Hodges    Farm Occupancy (May)                                          85%     43%
Matt Viljoen   Number of >Severe CASTOR Incidents (May)                      6       1

Availability was poor in May. A major CASTOR intervention to upgrade the database RAID controllers overran. There were also a number of network interventions to upgrade the C300 switch, one of which caused severe disruption to CASTOR over much of the day.

Key Production Milestones

See milestone spreadsheet

Unplanned

  • 3D upgrade and schedule required
  • Nagios upgrade and schedule required

R89 Migration Summary

The migration has started.

  • CPU rack moves (and helpdesk) today, Tuesday and Wednesday
  • Super Thursday: the whole service stops for many rack moves
  • CASTOR Friday and Monday
  • Disk servers continue until Friday 3rd July.

High Level Schedule

Phase II Migration (Tier-1)				Mon 22/06/09	Fri 03/07/09
Phase II contingency (Tier-1 Frozen)			Mon 06/07/09	Fri 17/07/09
Final Update Window					Mon 20/07/09	Fri 31/07/09
Tier-1 Stability Period (2)				Mon 03/08/09	Fri 28/08/09
LHC Experiments Require Stability			Mon 31/08/09	Fri 25/09/09  
LHC First beam				        	Mon 28/09/09	Mon 28/09/09
LHC prepares for collisions				Mon 28/09/09	Fri 23/10/09
LHC Collisions					        Fri 23/10/09	Fri 23/10/09
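The weekday labels in a schedule like the one above are easy to get out of step with the dates, so a quick consistency check can help. The following is a minimal sketch, assuming the dates are in DD/MM/YY form; the function name `weekday_mismatches` and the `SCHEDULE` list are illustrative and not part of any Tier-1 tooling.

```python
from datetime import datetime

# Rows copied from the High Level Schedule: (task, start, end).
SCHEDULE = [
    ("Phase II Migration (Tier-1)", "Mon 22/06/09", "Fri 03/07/09"),
    ("Phase II contingency (Tier-1 Frozen)", "Mon 06/07/09", "Fri 17/07/09"),
    ("Final Update Window", "Mon 20/07/09", "Fri 31/07/09"),
    ("Tier-1 Stability Period (2)", "Mon 03/08/09", "Fri 28/08/09"),
    ("LHC Experiments Require Stability", "Mon 31/08/09", "Fri 25/09/09"),
    ("LHC First beam", "Mon 28/09/09", "Mon 28/09/09"),
    ("LHC prepares for collisions", "Mon 28/09/09", "Fri 23/10/09"),
    ("LHC Collisions", "Fri 23/10/09", "Fri 23/10/09"),
]

# Fixed English abbreviations, so the check does not depend on the locale.
DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def weekday_mismatches(rows):
    """Return (task, label, actual_day) for every date whose weekday label is wrong."""
    mismatches = []
    for task, start, end in rows:
        for label in (start, end):
            stated_day, date_text = label.split()
            parsed = datetime.strptime(date_text, "%d/%m/%y")  # assumed DD/MM/YY
            actual_day = DAY_NAMES[parsed.weekday()]
            if actual_day != stated_day:
                mismatches.append((task, label, actual_day))
    return mismatches


if __name__ == "__main__":
    for task, label, actual_day in weekday_mismatches(SCHEDULE):
        print(f"{task}: '{label}' actually falls on a {actual_day}")
```

An empty result means every label agrees with its date; the same check could be run over any other dated table on the page.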

R89 Migration Downtime Plan

WMS Drain commences                                            Wed 17/06/09
Batch System Drain                           Sometime in w/e   Sat 20/06/09
Batch System Off                                               Mon 22/06/09 (08:00)
FTS Drain                                                      Thu 25/06/09  (06:00)
Full Stop (all services)                                       Thu 25/06/09  (08:00)
Network Down                                                   Thu 25/06/09  (08:00-16:00)
Resume critical non moving services   (eg RGMA)                Thu 25/06/09  16:00
Resume FTS; LFC; WMS; 3D ….                                    Fri 26/06/09  12:00
CASTOR Core restart commences                                  Tue 30/06/09  
Last Disk Server rack moves                                    Fri 03/07/09
Resume CASTOR Service                                          Mon 06/07/09  12:00
Resume batch Service                                           Mon 06/07/09   14:00

Services will restart earlier if the racks move earlier or the team can restart over the weekend.


Purchasing and Finance

  • Commencing current tenders. Discussion with procurement on Tuesday 23rd June.

Staffing

  • One experiment support post accepted. Second post interviewed.
  • PPS recruitment failed – seeking re-approval.
  • YII post expected in July
  • Extra CASTOR dbadmin interviewed.

PMB Experiment Reports

ATLAS

CMS

LHCB

1. Problem with CASTOR configuration at RAL: the limit on the number of connections allowed by xrootd was 100, while many (>300) jobs were running off the three d0t1 servers. This led to intermittent job failures. The problem was seen on 16 June, then identified and solved on 17 June by Shaun, who increased the limit on the number of xrootd connections to 200 per disk server.

3. The DIRAC server was found to run at its hardware limit when serving ~8K jobs. This is because more services are now coming online and there are more jobs (including user jobs) with shorter lifetimes and more heartbeats. Work is ongoing within LHCb to solve this.
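The arithmetic behind the xrootd problem in item 1 can be sketched as follows. This is an illustration only, and it rests on three assumptions that the report does not state: that the original limit of 100 applied per disk server (as the new limit of 200 does), that each job holds one connection, and that jobs spread evenly across the three d0t1 servers. The function names and the job counts of 350 and 600 are hypothetical.

```python
def total_connection_slots(disk_servers: int, limit_per_server: int) -> int:
    """Total simultaneous xrootd connections the pool can accept."""
    return disk_servers * limit_per_server


def rejected_connections(jobs: int, disk_servers: int, limit_per_server: int) -> int:
    """Connections that cannot be served, assuming one connection per job
    and perfectly even load across the disk servers."""
    return max(0, jobs - total_connection_slots(disk_servers, limit_per_server))


D0T1_SERVERS = 3   # from the report
OLD_LIMIT = 100    # limit before 17 June (assumed to be per disk server)
NEW_LIMIT = 200    # limit per disk server after the fix

if __name__ == "__main__":
    # 350 and 600 are illustrative job counts, not measured values.
    for jobs in (300, 350, 600):
        before = rejected_connections(jobs, D0T1_SERVERS, OLD_LIMIT)
        after = rejected_connections(jobs, D0T1_SERVERS, NEW_LIMIT)
        print(f"{jobs} jobs: {before} rejected at limit {OLD_LIMIT}, "
              f"{after} rejected at limit {NEW_LIMIT}")
```

Under these assumptions the old configuration offers 300 slots in total, so any load above 300 jobs overflows, whereas the new limit gives 600. In practice an uneven spread of jobs would hit a single server's limit sooner than this model suggests.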

Outlook for the week:

  • RAL down for moving hardware to new building.
  • Billion event production continues at Tier-2s.
  • User analysis jobs.

Hardware Deployment Report

None

Team Reports

Fabric

Grid Services

https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_Grid_20090622

CASTOR

http://storage.esc.rl.ac.uk/castor/weekly_reports/Tier-1Operations-castor-20090622.doc

Database

Report 22/06/2009

Production

Production Team Report 2009-06-22