RAL Tier1 weekly operations Grid 20100927

From GridPP Wiki
Revision as of 14:29, 27 September 2010 by Alastair dewhurst


Operational Issues

Description | Start | End | Affected VO(s) | Severity | Status
RAID software failure on lcglb01 | 14 Sep 2010 | | all | low | replacement disk received; Fabric to swap it

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)

Blocking Issues

Description | Requested Date | Required By Date | Priority | Status
Name change for glite-APEL box | 15 Sep 2010 | early October | Medium |

Developments/Plans

Highlights for Tier-1 Ops Meeting

Highlights for Tier-1 VO Liaison Meeting

Detailed Individual Reports

Alastair

  • Wrote disk draining script and twiki page.
  • Dealt with data loss at RALPP Higgs group disk last week
  • Working on ATLAS software server, testing CVMFS
    • 825 test jobs have been run.
    • lcg0805 has been set up for production-style testing; the queue still needs to be added to the ATLAS system.
    • Production tasks submitted.
  • Writing script to graph transfer times for FTS transfers
  • Working on HammerCloud test of CASTOR 2.1.9
    • Analysis queue set up
    • Need to copy DBrelease into pre-prod and replicate

Andrew

  • Fixed (again) ganglia CPU efficiency monitoring (crond wasn't running the script) [Done]
  • Setting up & testing glite-APEL [Ongoing]
  • CMS data ops
    • Running rereco at RAL, PIC, FNAL, ASGC [Ongoing]
  • Revising CMS change-control form

Catalin

  • improve WMS monitoring [done]
  • work on Helpdesk MySQL database migration [done]
  • migrate remaining databases [ongoing]
  • kernel upgrades on SL5 nodes [ongoing]
  • halt old SL4 LFC FEs
  • prepare nodes in ATLAS building for power shutdown

Derek

  • CREAM CE quattor profile [ongoing]
  • Investigating CREAM CE instability [ongoing]
  • WN update rollout (folded into kernel update) [done]

Matt

  • Further testing of Quattorised gLite3.2 FTS FEs. [Ongoing]
  • Quattorisation of MyProxy nodes (write up Change Control). [Ongoing]
  • Capacity Signoff meeting followup. [Done]
  • Migrated FTS agents after h/w failure. [Done]
  • Scheduled FTS drain for LHCb. [Done]
  • Tested AFS cache on UI02 (with Ian). [Done]
  • Rework FTS change control; factor out ATLAS power off. [Done]
  • Work on Nagios plugins (tier1-nagios plugins build with Richard; restarter configuration). [Done]

Richard

  • Simplified process of building tier1-nagios-plugins rpm and updated 2 plugins [Done]
  • Prepping for kernel updates on the RAL top-level BDIIs
  • Working on the "team status page" being developed as an action from team awayday [ongoing]
  • Reviewing G/S process documentation [ongoing]
  • CASTOR items:


VO Reports

ALICE

ATLAS

CMS

  • 31 corrupt MC files at RAL (from gdss280) have been globally invalidated.

LHCb

OnCall/AoD Cover

OnCall Rota

  • Primary OnCall: Catalin (Mon-Sun)
  • Grid OnCall: