RAL Tier1 weekly operations Grid 20090706

From GridPP Wiki

Summary of Previous Week

Developments

  • Catalin
    • Work on the hot-swapping feature for non-capacity HW
    • Planning the LFC separation
    • Debugging the lhcb-lfc SAM test issue
  • Derek
    • YII Objectives
    • Quattorising torque server
    • New Support helpdesk queue for production team
  • Matt
    • Plan SL4 to SL5 migration (with Derek)
    • Move production MyProxy to host in R89
    • June resource accounting (except tape usage)
    • 2009/Q2 FTS metrics
    • Attempt to quattorise lcgui02 (with Ian)
    • Nagios script to detect the 32k subdirectory limit for the problem ATLAS user
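
A check of this kind could look roughly like the sketch below. It is not the script used at RAL. It assumes the limit is a per-directory cap of about 32,000 subdirectories (as on ext3), and the function names, the 80%/95% thresholds and the script name `check_subdir_limit.py` are all hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
#!/usr/bin/env python3
"""Nagios-style check: alert when a directory nears a subdirectory limit."""
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_subdirs(path):
    """Count the immediate subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def evaluate(count, limit=32000, warn_frac=0.80, crit_frac=0.95):
    """Map a subdirectory count to an (exit_code, message) pair.

    limit, warn_frac and crit_frac are assumed values, not site settings.
    """
    if count >= limit * crit_frac:
        return CRITICAL, f"CRITICAL - {count} subdirectories (limit {limit})"
    if count >= limit * warn_frac:
        return WARNING, f"WARNING - {count} subdirectories (limit {limit})"
    return OK, f"OK - {count} subdirectories (limit {limit})"


def main(argv):
    """Run the check on argv[1], print one status line, return the exit code."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_subdir_limit.py <directory>")
        return UNKNOWN
    try:
        count = count_subdirs(argv[1])
    except OSError as exc:
        print(f"UNKNOWN - cannot read {argv[1]}: {exc}")
        return UNKNOWN
    code, message = evaluate(count)
    print(message)
    return code

# To run as a plugin, finish the script with: sys.exit(main(sys.argv))
```

Keeping the threshold logic in `evaluate` separate from the filesystem scan means the thresholds can be tested without creating thousands of directories.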

Operational Issues and Incidents

Description                                       | Start      | End     | Affected VO(s) | Severity
Production pool account at 32k subdirectory limit | 2009-06-03 | Ongoing | ATLAS          | High
LB01 RAID failure                                 | 2009-06-17 | Ongoing | All            | Low
lfc0448 - SMART errors detected                   | 2009-06-15 | Ongoing | All            | Low
lcgpx0619 - RAID failure                          | 2009-07-03 | Ongoing | All            | Low
helpdesk DB tables not backed up                  | 2009-07-01 | Ongoing | none           | Medium

Plans for Week(s) Ahead

Development Priorities

  • Derek
    • Restart CE services
    • FTS changes due to downtimes
    • Listen in on GDB
    • Quattorise test batch system
    • Accounting and metrics
  • Matt

Resource Requests

Downtimes

Description                 | Start            | End              | Affected VO(s)
WMS drain ahead of R89 move | 2009-06-17 10:00 | 2009-06-26 12:00 | All
R89 move                    | 2009-06-25 06:00 | 2009-06-26 12:00 | All
LFC ATLAS separation        | 2009-07-20 08:00 | 2009-07-20 17:00 | All

Requirements and Blocking Issues

Description                 | Required By | Priority | Status
SL5 Worker Node Kickstart   |             | High     | Post-kickstart configuration needed; not yet suitable for bulk deployment
LB01 RAID failure           |             | Medium   | Testing hotswap configuration
lfc0448 disk failures       |             | Medium   | Disk replacement needed
Non-capacity HW for testing |             | Medium   | Still using the old HW
Hardware for PPS            |             | Medium   | May need to deploy imminently

OnCall/AoD Cover

  • Primary OnCall
  • Grid OnCall
    • Derek
  • AoD