RAL Tier1 weekly operations Grid 20090629

From GridPP Wiki

Summary of Previous Week

Developments

  • Catalin
    • Debugging LFC streaming
    • R89-related activities
      • Cabling and network setup
      • Restart Grid services
  • Derek
    • YII Objectives
    • Cron job with a lower age threshold to mitigate the 32k subdirectory limit for the ATLAS pool account on CEs
    • R89-related activities
      • Stop/Start services
  • Matt
    • R89-related activities
      • Added mechanism to override Nagios service restarters
      • Stop/Start services
    • Reviewed Grid service/process documentation
    • Generated stats for ATLAS FTS transfers during STEP09
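
The age-threshold cleanup mentioned under Derek's items could look roughly like the sketch below. It is an illustration only, not the script actually deployed at RAL: the function names, the use of directory mtime as the age measure, and the `SUBDIR_LIMIT` constant (taken from the "32k" figure in this report) are all assumptions. The idea is simply to list subdirectories older than a cutoff and remove them, with a dry-run default so that a lowered threshold can be checked before anything is deleted.

```python
import os
import shutil
import time
from pathlib import Path

# Assumption: approximate per-directory subdirectory cap (the "32k" in this report).
SUBDIR_LIMIT = 32000


def stale_subdirs(parent, max_age_days, now=None):
    """Return subdirectories of `parent` whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    with os.scandir(parent) as entries:
        for entry in entries:
            # Symlinks are not followed, so only real subdirectories are considered.
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(Path(entry.path))
    return sorted(stale)


def clean(parent, max_age_days, dry_run=True, now=None):
    """Remove stale subdirectories; with dry_run=True only report them."""
    targets = stale_subdirs(parent, max_age_days, now)
    if not dry_run:
        for path in targets:
            shutil.rmtree(path, ignore_errors=True)
    return targets


def headroom(parent):
    """Number of further subdirectories that fit before SUBDIR_LIMIT is reached."""
    with os.scandir(parent) as entries:
        count = sum(1 for e in entries if e.is_dir(follow_symlinks=False))
    return SUBDIR_LIMIT - count
```

Run from cron, a wrapper would call `clean(pool_home, max_age_days, dry_run=False)` for each affected pool account home directory; lowering `max_age_days` frees directory entries sooner, at the cost of removing job directories that may still be wanted for debugging.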

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
Production pool account at 32k subdirectory limit | 2009-06-03 | Ongoing | ATLAS | High
LB01 RAID failure | 2009-06-17 | Ongoing | All | Low
lfc0448 - SMART errors detected | 2009-06-15 | Ongoing | All | Low

Plans for Week(s) Ahead

Development Priorities

  • Catalin
    • Support the R89 move (if needed)
    • Finalise plan for ATLAS LFC separation
  • Derek
    • Quattorise test batch system
    • Implement new Helpdesk queue for Production team
  • Matt
    • Plan SL4 to SL5 migration
    • Move production proxy to host in R89
    • June resource accounting
    • 2009/Q2 FTS metrics

Resource Requests

Downtimes

Description | Start | End | Affected VO(s)
WMS drain ahead of R89 move | 2009-06-17 10:00 | 2009-06-26 12:00 | All
R89 move | 2009-06-25 06:00 | 2009-06-26 12:00 | All

Requirements and Blocking Issues

Description | Required By | Priority | Status
SL5 Worker Node Kickstart | | High | Post-kickstart configuration needed; not yet suitable for bulk deployment
LB01 RAID failure | | Medium | Disk replacement needed
lfc0448 disk failures | | Medium | Disk replacement needed
Non-capacity HW for testing | | Medium | Still using the old HW
Hardware for PPS | | Medium | May need to deploy imminently

OnCall/AoD Cover

  • Primary OnCall
  • Grid OnCall
    • Derek
  • AoD
    • Derek: Wednesday