RAL Tier1 weekly operations Grid 20100125


Summary of Previous Week

Developments

  • Alastair
    • Ran tests on the Frontier server to confirm it is working well. A small number (< 10) of errors are not yet understood.
    • Completed version 1 of the Tier 1 VO requirements document, incorporating information provided by Raja.
    • Helped deploy disk servers to the ATLAS scratch disk when it became full. Wrote a script to clean the scratch disk should it rapidly fill again (see the sketch after this list).
  • Andrew
    • CMS data ops training at CERN (ProdAgent; MC production; backfill; creating workflows)
  • Catalin
    • Handed over the SL5 LHCb VOBOX
    • Updated ATLAS Frontier to 3.22
    • Tested the ALICE xrootd (manager + peer) re-installation
    • Followed up some post-reboot WMS issues with CERN
  • Derek
    • Added BLParser to lcgbatch01
    • Wrote Quattor template for SL5 gLExec
    • Added extra settings to the YAIM configuration so the site is categorised correctly by GSTAT
    • Deployed and tested Staged Rollout SL5 64 bit top BDII
    • Completed Multi User pilot job questionnaire
  • Matt
    • Migrated FTS agents to warm standby host
    • Documented R-GMA recovery procedure
    • Finished draft of Grid Services Disaster Recovery document
    • Provided test site BDII for CIP upgrade testing, and tested CIP output
    • Provided input for Grid Team for GridPP4
  • Richard
    • 2 days A/L
    • Catch-up
    • Fire Safety Training
    • Attended "share out" meeting for Nagios/NRPE plugins
    • Re-built the machine sv-08-02 as a test BDII server for Jens' CIP activity
    • CASTOR items:
      • Re-installed castor303.ads as a CASTOR disk server
      • castor301 needs the same treatment but has memory problems
  • Mayo
    • Encrypted passwords within the Metric system
    • Added a change-password feature to the Metric system
    • Fixed a bug within the Metric system
    • Worked on the tape statistics spreadsheet project: converting Excel charts to HTML
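
For reference, a minimal sketch of the scratch-disk cleaner mentioned in Alastair's item above, assuming a simple watermark policy; the mount point, thresholds and age cut-off are illustrative assumptions, not the production values:

  #!/usr/bin/env python
  # Sketch of a scratch-disk cleaner: when the filesystem is nearly
  # full, delete the oldest files until usage drops below a target.
  # SCRATCH_PATH, the watermarks and the age cut are illustrative.
  import os
  import time

  SCRATCH_PATH = "/atlas/scratchdisk"   # hypothetical mount point
  HIGH_WATERMARK = 0.95                 # start cleaning above 95% used
  LOW_WATERMARK = 0.85                  # stop cleaning below 85% used
  MIN_AGE_SECONDS = 7 * 24 * 3600       # never touch files under a week old

  def usage_fraction(path):
      st = os.statvfs(path)
      return 1.0 - float(st.f_bavail) / st.f_blocks

  def candidate_files(path):
      """All files under path, oldest first, skipping recent ones."""
      now = time.time()
      files = []
      for dirpath, dirnames, filenames in os.walk(path):
          for name in filenames:
              full = os.path.join(dirpath, name)
              try:
                  mtime = os.stat(full).st_mtime
              except OSError:
                  continue  # file vanished while walking
              if now - mtime > MIN_AGE_SECONDS:
                  files.append((mtime, full))
      files.sort()
      return [f for (mtime, f) in files]

  def clean(path):
      if usage_fraction(path) < HIGH_WATERMARK:
          return
      for victim in candidate_files(path):
          os.remove(victim)
          if usage_fraction(path) < LOW_WATERMARK:
              break

  if __name__ == "__main__":
      clean(SCRATCH_PATH)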

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
Disk errors on FTS Agents host | 20100119 14:00 | 20100121 09:00 | LHC | Low | Scheduled migration of agents to standby host

Plans for Week(s) Ahead

Plans

  • Alastair
    • Understand the remaining errors from the HC tests.
    • Continue updating RAL PP twiki.
    • Prepare slides for presentation on computing requirements.
    • Write a Nagios script to warn when space tokens are near full (see the sketch after this list).
  • Andrew
    • I/O testing: will try the new CMSSW TTreeCache patch & compare to lazy-download
    • Investigate CMS job status reporting problems for my backfill jobs
    • Investigate my backfill jobs killed by batch system
    • Investigate my MC production jobs aborting at UK T2s/T3s
    • Write document about automatic job killing
  • Catalin
    • WMS01 and WMS02 upgrades
    • Kernel updates
    • Chase CERN about tidying up the LFC schemas
    • Test the ALICE xrootd (manager + peer) re-installation (with ChrisK)
  • Derek
    • Reinstalling lcgce08 with host swap config
    • Reconfiguring lcgce01
    • Continuing work on gLExec and SCAS
  • Matt
    • FTS drain and migration of front-ends back to somnus
    • Test upgrade path from FTS2.1 to FTS2.2 on orisa
    • Planning ATLAS/R89 co-hosting of Grid Services
    • Plan T2K configuration of FTS, and request dedicated diskpool
  • Richard
    • Manual Handling Training
    • Feed back results from Jens' CIP testing into Quattor profile for BDII server
    • CASTOR items:
      • Finish installing the suite of RPMs needed on castor303 (new disk server)
      • Re-install castor301 when memory has been fixed.
  • Mayo
    • Automating the Metric report system
    • Adding charts to the Metric system
    • Web interface and script to fetch data for the tape robot statistics spreadsheet project
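
As a companion to the Nagios item in Alastair's plans, a minimal sketch of a space-token check plugin. Only the exit-code convention shown (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is required by Nagios; query_space_token is a hypothetical stand-in for however the used/total bytes of a token are actually obtained (e.g. an SRM query):

  #!/usr/bin/env python
  # Sketch of a Nagios check for space-token fullness.
  # Nagios plugin convention: exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
  import sys

  WARN_FRACTION = 0.80   # warn at 80% full (illustrative threshold)
  CRIT_FRACTION = 0.90   # critical at 90% full (illustrative threshold)

  def query_space_token(token):
      """Return (used_bytes, total_bytes) for a space token.
      Hypothetical placeholder: a real plugin would query SRM here."""
      raise NotImplementedError("plug in a real SRM query")

  def main(token):
      try:
          used, total = query_space_token(token)
      except Exception as exc:
          print("UNKNOWN: could not query %s: %s" % (token, exc))
          return 3
      fraction = float(used) / total
      message = "%s is %.0f%% full (%d of %d bytes)" % (
          token, 100 * fraction, used, total)
      if fraction >= CRIT_FRACTION:
          print("CRITICAL: " + message)
          return 2
      if fraction >= WARN_FRACTION:
          print("WARNING: " + message)
          return 1
      print("OK: " + message)
      return 0

  if __name__ == "__main__":
      sys.exit(main(sys.argv[1]))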

VO Reports

ALICE

  • Status Report (21 Jan) - very stable behaviour, CREAM stability exceptional, "a bit of free resources" (re: RAL farm between week 53/2009 and week 02/2010)

ATLAS

  • ATLAS confirmed that the RAL WMS is not critical for UK operations.

CMS

  • RAL ranked 2nd in the T1 Site Readiness Ranking on 2010-01-25 (covering the last 2 weeks)
  • Only JobRobot and Backfill (re-reco) jobs running recently
  • High CMS network usage on Friday was due to lazy-download not being specified in the reco config file
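
For context, lazy-download is enabled in a CMSSW configuration through the AdaptorConfig service; a minimal sketch of the stanza that was missing from the reco config might look like the following (parameter names and values as commonly documented, not copied from the actual production file):

  import FWCore.ParameterSet.Config as cms

  process = cms.Process("RERECO")
  # Ask the storage layer to fetch file blocks on demand over the WAN
  # instead of streaming whole files; this is the kind of stanza that
  # was missing from the reco config. (Sketch only, not the actual file.)
  process.add_(cms.Service("AdaptorConfig",
      cacheHint = cms.untracked.string("lazy-download"),
      readHint = cms.untracked.string("auto-detect")))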

LHCb

  • Other Tier-1s (IN2P3, PIC) are also reporting low job efficiencies. LHCb user or application problems are suspected.
  • LHCb confirmed that the RAL WMS is not critical for UK operations.
  • A new release of DIRAC is out, so testing on the SL5 VOBOX should be able to proceed shortly.

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
Disk problems on FTS agents host | lcgfts01 | Scheduled | 20100121 07:00 | 20100121 09:00 | LHC

Requirements and Blocking Issues

Description | Required By | Priority | Status
Hardware for testing LFC/FTS resilience | - | High | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through the RT Fabric queue
Hardware for Testbed | - | High | Required for change validation, load testing, etc. Also for phased rollout (which replaces PPS).
Hardware for SCAS servers | Feb 1 2010 | High | Hardware required for production SCAS servers; required to be in place by end of Feb
Hardware for SL5 CREAM CE for non-LHC SL5 batch access | - | Medium | Hardware required for a CREAM CE for non-LHC VOs
Pool accounts for Super B VO | - | Medium | Required to enable the Super B VO on the batch farm

OnCall/AoD Cover

  • Primary OnCall:
  • Grid OnCall: Derek (Mon-Sun)
  • AoD: