RAL Tier1 weekly operations Grid 20090713

From GridPP Wiki

Summary of Previous Week

Developments

  • Derek
    • CE Services restarted
    • Quattorising maui configuration, updating torque server profile to use new QWG release
    • Listened to GDB
    • CA updates on Grid Services nodes
    • Moved lcg-support alias to point at Support queue
    • Removed ops publishing from LHCb LFC

Operational Issues and Incidents

Description                                       | Start      | End     | Affected VO(s) | Severity
Production pool account at 32k subdirectory limit | 2009-06-03 | Ongoing | ATLAS          | High
LB01 RAID failure                                 | 2009-06-17 | Ongoing | All            | Low
lfc0448 - SMART errors detected                   | 2009-06-15 | Ongoing | All            | Low
lcgpx0619 - RAID failure                          | 2009-07-03 | Ongoing | All            | Low
helpdesk DB tables not backed up                  | 2009-07-01 | Ongoing | none           | Medium

Plans for Week(s) Ahead

Development Priorities

  • Derek
    • Continue quattorising torque server
    • Schedule FTS drain before Castor downtime
    • Implement OPN ticket merging in Notifications queue
    • Update blog versions

Resource Requests

Downtimes

Description          | Start            | End              | Affected VO(s)
LFC ATLAS separation | 2009-07-20 08:00 | 2009-07-20 17:00 | All

Requirements and Blocking Issues

Description                 | Required By | Priority | Status
SL5 Worker Node Kickstart   |             | High     | Post-kickstart configuration needed; not yet suitable for bulk deployment
LB01 RAID failure           |             | Medium   | Testing hotswap configuration
lfc0448 disk failures       |             | Medium   | Disk replacement needed
Non-capacity HW for testing |             | Medium   | Still using the old HW
Hardware for PPS            |             | Medium   | May need to deploy imminently

OnCall/AoD Cover

  • Primary OnCall
    • Catalin (Wed-)
  • Grid OnCall
    • Derek (Mon,Tue)
  • AoD