RAL Tier1 weekly operations Grid 20100201

Operational Issues

Description	Start	End	Affected VO(s)	Severity	Status
Disk errors on LB01 host	27-Jan-2010	1-Feb-2010	all	low	failed disk replaced; service not disrupted
SQL server reboot problems	27-Jan-2010 12:00	27-Jan-2010 13:30		high	server didn't reboot after kernel upgrade; GRUB magic work needed
Alice VOBOX upgrade problems	27-Jan-2010 13:00	27-Jan-2010 14:30	Alice	high	filesystem problems after kernel and RPM upgrades; machine needed a re-install from scratch

Blocking Issues

Description	Requested Date	Required By Date	Priority	Status
Hardware for testing LFC/FTS resilience			High	DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for SCAS servers		2010-02-01	High	Hardware required for production SCAS servers - required to be in place by end of Feb
Hardware for Testbed			Medium	Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).
Hardware for SL5 CREAM CE for non-LHC SL5 batch access			Medium	Hardware required for CREAM CE for non-LHC VOs
Pool accounts for Super B VO	2010-01-13		Medium	Required to enable Super B VO on batch farm. Done

Developments/Plans

Highlights for Tier-1 Ops Meeting

ATLAS: Tier 1 throughput test performed today. RAL + FZK excluded.
ATLAS: Re-processing run still meant to start on Friday 5th February. Will know exact timetable on 3rd February.
CMS: continuous data taking will begin on 8th February
Disk deployment: meeting scheduled for 14:00-15:00 Tuesday

Highlights for Tier-1 VO Liaison Meeting

ALICE: possible decision to get rid of the lcg-CE at T1s and T2s
Plan T2K configuration of FTS. Request dedicated diskpool, subject to confirmation of allocation from UB

Detailed Individual Reports

Alastair

Understand remaining errors from HC test. [Done]
Continue updating RAL PP twiki. [Ongoing]
Prepare slides for presentation on computing requirements. [Ongoing]
Write Nagios script to warn when space tokens are nearly full. [To be implemented when Castor comes back]
Work with Brian + Chris on re-deploying disk servers to ATLAS space tokens.

Andrew

Testing of new CMSSW TTreeCache training patch (still not quite as good as lazy-download; found that it crashes with official re-reco config) [Done]
Investigated problems with CMS backfill jobs; MC production failed jobs [Done?]
Started adding PhEDEx-CASTOR consistency Ganglia monitoring
Test another new CMSSW I/O optimisation patch
Complete PhEDEx-CASTOR consistency Ganglia monitoring (PhEDEx part done)
Complete document about automatic job killing
Display Screen Equipment assessment (had to do a second one) [Done]

Catalin

WMS01 and 02 upgrades [Done]
kernel updates [Done]
re-installed one ALICE VOBOX [Done]
1-to-1 on Nagios configuration (with Jonathan)
chase CERN about tidying up the LFC schemas
test Alice xrootd (manager + peer) re-installation (with Chris)
quattorise additional LFC frontends (with Ian)

Derek

Reinstalling lcgce08 with host swap config [Done]
Reconfiguring lcgce01 [Done]
Installing SL5 SCAS server
Testing SL5 GLexec WN
Setting up testbed site in quattor
Released new yaim config rpm with updated GridPP VOMS server certificate
Installed new yaim config rpm on lcgce02 and csfnfs58

Matt

Test upgrade path from FTS2.1 to FTS2.2 on orisa
Plan ATLAS/R89 co-hosting of Grid Services
FTS drain and migration of front-ends back to somnus [Done]
Plan T2K configuration of FTS, and request dedicated diskpool

Richard

Manual Handling Training [Done]
Had a quattor session with IC to demonstrate problems with the current BDII build. Will do a fresh build to test the effect of altering the INSTALL_ROOT template variable (and report findings back to Michel Jouvin for subsequent inclusion in the QWG templates)
Currently working on one of the Nagios plugins assigned by Cheney
CASTOR items:
- Set up CCSE03..07 as CASTOR disk servers [Done]
- Waiting for resolution of:
  - Disk array problems on castor301
  - Powering off / Crashing problem on ccse02

Mayo

Create system for exporting Metrics report to spreadsheet [Done]
Adding bar chart to Metric system
Admin interface for Metric System

VO Reports

ALICE

possible decision to get rid of the lcg-CE at T1s and T2s

ATLAS

Tier 1 throughput test performed today. RAL + FZK excluded.
Re-processing run still meant to start on Friday 5th February. Will know exact timetable on 3rd February.

CMS

Clean-up at Tier 1s may begin soon in preparation for next data taking period
Continuous data taking will begin on 8th February
Backfill restarted (Tues night) in order to check CREAMCE monitoring problems

LHCb

OnCall/AoD Cover

Primary OnCall:
Grid OnCall: Derek (Mon-Sun)
AoD:
