RAL Tier1 weekly operations Grid 20100201

Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
Disk errors on LB01 host | 27-Jan-2010 | 1-Feb-2010 | all | low | failed disk replaced; service not disrupted
SQL server reboot problems | 27-Jan-2010 12:00 | 27-Jan-2010 13:30 | | high | server didn't reboot after kernel upgrade; GRUB magic work needed
Alice VOBOX upgrade problems | 27-Jan-2010 13:00 | 27-Jan-2010 14:30 | Alice | high | filesystem problems after kernel and RPM upgrades; machine needed a re-install from scratch

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Hardware for testing LFC/FTS resilience | | | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue
Hardware for SCAS servers | 2010-02-01 | | High | Hardware required for production SCAS servers; must be in place by end of Feb
Hardware for Testbed | | | Medium | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | | | Medium | Hardware required for CREAM CE for non-LHC VOs
Pool accounts for Super B VO | 2010-01-13 | | Medium | Required to enable Super B VO on batch farm.

Done

Developments/Plans

Highlights for Tier-1 Ops Meeting

  • ATLAS: Tier 1 throughput test performed today. RAL + FZK excluded.
  • ATLAS: Re-processing run still meant to start on Friday 5th February. Will know exact timetable on 3rd February.
  • CMS: continuous data taking will begin on 8th February
  • Disk deployment: meeting scheduled for 14:00-15:00 Tuesday

Highlights for Tier-1 VO Liaison Meeting

  • ALICE: possible decision to get rid of the lcg-CE at T1s and T2s
  • Plan T2K configuration of FTS. Request dedicated diskpool, subject to confirmation of allocation from UB

Detailed Individual Reports

Alastair

  • Understand remaining errors from HC test. [Done]
  • Continue updating RAL PP twiki. [Ongoing]
  • Prepare slides for presentation on computing requirements. [Ongoing]
  • Write Nagios script to warn when space tokens are nearly full. [To be implemented when Castor comes back]
  • Work with Brian + Chris on re-deploying disk servers to ATLAS space tokens.

Andrew

  • Testing of new CMSSW TTreeCache training patch (still not quite as good as lazy-download; found that it crashes with official re-reco config) [Done]
  • Investigated problems with CMS backfill jobs; MC production failed jobs [Done?]
  • Started adding PhEDEx-CASTOR consistency Ganglia monitoring
  • Test another new CMSSW I/O optimisation patch
  • Complete PhEDEx-CASTOR consistency Ganglia monitoring (PhEDEx part done)
  • Complete document about automatic job killing
  • Display Screen Equipment assessment (had to do a second one) [Done]

Catalin

  • WMS01 and 02 upgrades [Done]
  • kernel updates [Done]
  • re-installed one ALICE VOBOX [Done]
  • 1-to-1 on Nagios configuration (with Jonathan)
  • chase CERN about tidying up the LFC schemas
  • test Alice xrootd (manager + peer) re-installation (with Chris)
  • quattorise additional LFC frontends (with Ian)

Derek

  • Reinstalling lcgce08 with host swap config [Done]
  • Reconfiguring lcgce01 [Done]
  • Installing SL5 SCAS server
  • Testing SL5 GLexec WN
  • Setting up testbed site in quattor
  • Released new yaim config rpm with updated GridPP VOMS server certificate
  • Installed new yaim config rpm on lcgce02 and csfnfs58

Matt

  • Test upgrade path from FTS2.1 to FTS2.2 on orisa
  • Plan ATLAS/R89 co-hosting of Grid Services
  • FTS drain and migration of front-ends back to somnus [Done]
  • Plan T2K configuration of FTS, and request dedicated diskpool

Richard

  • Manual Handling Training [Done]
  • Had a quattor session with IC to demonstrate problems with the current BDII build. Will do a fresh build to test the effect of altering the INSTALL_ROOT template variable (and report findings back to Michel Jouvin for subsequent inclusion in the QWG templates)
  • Currently working on one of the Nagios plugins assigned by Cheney
  • CASTOR items:
    • Set up CCSE03..07 as CASTOR disk servers [Done]
    • Waiting for resolution of:
      • Disk array problems on castor301
      • Powering off / Crashing problem on ccse02

Mayo

  • Create system for exporting Metrics report to spreadsheet [Done]
  • Adding bar chart to Metric system
  • Admin interface for Metric System

VO Reports

ALICE

  • possible decision to get rid of the lcg-CE at T1s and T2s

ATLAS

  • Tier 1 throughput test performed today. RAL + FZK excluded.
  • Re-processing run still meant to start on Friday 5th February. Will know exact timetable on 3rd February.

CMS

  • Clean-up at Tier 1s may begin soon in preparation for next data taking period
  • Continuous data taking will begin on February 8
  • Backfill restarted (Tues night) in order to check CREAMCE monitoring problems

LHCb

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun)
  • AoD: