RAL Tier1 weekly operations Grid 20091012

From GridPP Wiki

Summary of Previous Week

Developments

  • Alastair
    • Induction
  • Andrew
    • updated all entries in the September pbsjobs MySQL table with data generated using UT; removed entries from 25 problem worker nodes
    • UB schedule for September
    • wrote Nagios script for checking PhEDEx watchdog agents
    • restored efficiencies in Ganglia
    • Training: CASTOR, library, display screen equipment
  • Catalin
    • helped reconfigure the LFC front ends (FEs)
    • sorted out issues with Alice SW area
    • prepared Alice SL5 VOBOX deployments
  • Derek
    • n/a
  • Matt
    • Reconfigured FTS/LFC frontends.
    • Progressed disk deployment.
    • Reviewed Grid Services installation/recovery documentation.
    • Rescaled KSI2K batch capacity from HEP-SPEC06 ratings.
  • Richard
    • n/a
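
A minimal sketch of the KSI2K rescaling mentioned under Matt above, assuming the commonly quoted WLCG conversion of 4 HEP-SPEC06 per kSI2K. The function names and the node figures are hypothetical, for illustration only, and are not the Tier1's actual farm numbers or scripts.

```python
# Assumption: 4 HEP-SPEC06 (HS06) correspond to 1 kSI2K (commonly quoted WLCG factor).
HS06_PER_KSI2K = 4.0


def hs06_to_ksi2k(hs06: float) -> float:
    """Convert a HEP-SPEC06 rating to kSI2K."""
    return hs06 / HS06_PER_KSI2K


def batch_capacity_ksi2k(nodes) -> float:
    """Total batch capacity in kSI2K.

    nodes is an iterable of (node_count, hs06_per_node) pairs.
    """
    return sum(count * hs06_to_ksi2k(hs06) for count, hs06 in nodes)


# Hypothetical worker-node generations: (number of nodes, HS06 rating per node).
example_nodes = [(10, 80.0), (5, 64.0)]
print(batch_capacity_ksi2k(example_nodes))  # 280.0
```

Keeping the conversion factor in a single constant makes it straightforward to republish the capacity if a different factor is agreed.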

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity | Status
Failure of FTS/LFC RAC hardware | 2009-10-06 | Ongoing | LHC for FTS; ATLAS + non-LHC for LFC | Critical | [2009-10-09] FTS/LFC running on alternate hardware.
Failure of 3D RAC hardware | 2009-10-06 | Ongoing | ATLAS, LHCb | High | [2009-10-09] Services not yet restored.

Plans for Week(s) Ahead

Plans

  • Alastair
    • ATLAS jamboree
    • Job plan
  • Andrew
    • Complete September UB schedule
    • Testing CMS skimming without LazyDownload
    • Delete some CMS "dark" data
    • Training: Oracle self-service, fire equipment
  • Catalin
    • ready to deploy Squid and FronTier for ATLAS (pending HW provisioning)
    • finalise SL5 VOBOX arrangements for Alice
    • attend Oracle Self Service Training course
  • Derek
  • Matt
    • Oracle SSC training
    • GDB
    • Meetings to discuss publishing of HEP-SPEC06
    • Disaster recovery planning
  • Richard
    • n/a

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
Failure of FTS/LFC RAC hardware | FTS, FTM, LFC (ATLAS, non-LHC) | Unscheduled Outage | Oct 06 (12:25) | Oct 07 (16:00) | LHC for FTS; ATLAS + non-LHC for LFC
Failure of 3D RAC hardware | LFC (LHCb) | Unscheduled Outage | Oct 06 (12:25) | Ongoing | ATLAS, LHCb

Requirements and Blocking Issues

Description | Required By | Priority | Status
HW for Squid deployment | ATLAS | High | Request made via RT Fabric queue
HW for FronTier deployment | ATLAS | Medium | Request made via RT Fabric queue
HW for SL5 64-bit VOBOX | Alice | Medium | Request made via RT Fabric queue
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated to this.
Hardware for testing LFC/FTS resilience | | Medium | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience; request for HW made through RT Fabric queue

OnCall/AoD Cover

  • Primary OnCall: Catalin (Mon-Thu)
  • Grid OnCall:
  • AoD: Catalin (Wed)