RAL Tier1 weekly operations Grid 20091005

From GridPP Wiki

Summary of Previous Week

Developments

  • Andrew
    • Deployed pbslogs2mysql and gmetric-eff.pl on lcgbatch01 using Quattor
    • Deployed October CPU allocations using Quattor
    • Updated PhEDEx to 3_2_6; wrote documentation
    • Changed number of streams for CLOUDCMSUS-RALLCG2 FTS channel
    • Preparations for testing CMS skimming without LazyDownload
    • Investigated very low efficiencies for bio
    • Adjustments to LFC ganglia monitoring
    • Training: manual handling; Quattor; APR
  • Catalin
    • Re-installed WMS02 and made it hotswappable; updated documentation
    • Quattor and Castor training
    • Some work on WMS purging ('held jobs' issue)
  • Derek
    • n/a
  • Matt
    • Reviewed Grid Services installation/recovery documentation
    • BDII reconfigured for CIP upgrade
    • Generated disk deployment requests for Q4/09 allocations
  • Richard
    • Installed BDII using Quattor; Quattor training
    • Increased use of RT reports

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
lcgce07 partition failure | 18/09/09 | | none yet (potentially alice, cms, lhcb lose resilience) | medium

Plans for Week(s) Ahead

Development Priorities

  • Alastair
    • Induction
  • Andrew
    • Updates to September PBS jobs database
    • Testing CMS skimming without LazyDownload
    • Delete some CMS "dark" data
    • Add monitoring of PhEDEx watchdog agents to Nagios
    • Training: CASTOR; welcome to library; display screen equipment
  • Catalin
    • Review the requests for Frontier deployment for ATLAS
    • Chase ALICE SW area issue
    • Work on WMS purging
  • Derek
    • n/a
  • Matt
    • Disaster recovery planning
  • Richard
    • n/a

Resource Requests

Downtimes

Description | Hosts | Type | Start | End | Affected VO(s)
WMS02 hotswappable | lcgwms02 | Scheduled Outage | Sep 22 (16:00) | Sep 30 (17:00) | LHC
Oracle ASM patching | FTS, FTM, LFCs | Scheduled At Risk | Oct 01 (13:30) | Oct 01 (16:30) | All

Requirements and Blocking Issues

Description | Required By | Priority | Status
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | We have made a commitment to test PPS pre-releases, and have no hardware dedicated for this.
Hardware for testing LFC/FTS resilience | | Medium | DataServices want to deploy a DataGuard configuration to test LFC/FTS resilience

OnCall/AoD Cover

  • Primary OnCall: Catalin (Fri-Sun)
  • Grid OnCall: Catalin (Mon), Matt (Tue-Thu)
  • AoD: