RAL Tier1 weekly operations Fabric 20091102

Summary of week gone

Developments

  • All:
  • Martin:
    • HEPiX
  • Ian:
    • HEPiX
  • James T:
    • A/L
  • Jonathan:
    • updated NIS netgroup lcghosts
    • sorted out problems with atlasbackup for some nodes (at least twice for some)
    • migrated farm home filesystems from /home/csf to /home/tier1
    • Nagios configuration updates
    • removed netnag from Sure and callout script
    • updated iptables on nodes to allow Nagios monitoring to work
    • updated RPM tier1-nagios-plugins and distributed via touch and Quattor
    • worked on Quattor configuration for slave server
    • rebooted nagger after networking stopped
  • James A:
    • 95% of time spent on disk server problems.
    • 5% helping people with Quattor.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss140 and gdss143 fixed and back in production.
    • gdss318: replaced 4x2GB memory. (Fixed)
    • gdss126 fixed and given back to castor
    • Moved touch and scrooge from Atlas (A1 upper) with James A in R89.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 86 and 168.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6 Oct (am) | not in sight | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Catch up from week at HEPiX
    • Finalise Disk procurement ITT evaluation
    • Start on CPU procurement ITT evaluation
    • EMC array debacle
  • Ian:
    • Catch up from week at HEPiX
    • Quattor workshop in Brussels
  • James T:
    • Catch up.
    • Take over disk server testing from James.
    • Preparation for CRISTAL 2.
  • Jonathan:
    • Quattor implementation for Nagios slave
    • check on environment for SL4/SL5 systems
    • assist Babar to migrate home filesystems
    • Nagios configuration updates
  • James A:
    • Handing disk server problems back to JIT.
    • Quattor workshop in Brussels.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss154: two-drive failure. (Intervention)
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss67, 86 and 168.

Absences

  • Ian
    • Quattor workshop (Tues 3rd - Thurs 5th)
  • James T
    • (Possibly) A/L Friday 6th
  • Jonathan
    • A/L on Wednesday (4th November)
  • James A
    • Quattor workshop (Tues 3rd - Thurs 5th)
    • Annual Leave (Mon 9th - Fri 20th).

Fabric On-Call

  • Mon-Sun: Martin

Advance Warning of Requirements and Blocking issues

Services Issues

  • Various requests for hardware.
