RAL Tier1 weekly operations Fabric 20091123


Summary of week gone

Developments

  • All:
  • Martin:
    • Completed CPU ITT evaluation
    • More work on EMC arrays problems
    • Benchmarking work on Nehalem chip systems
    • Work on test nodes for LFC database resilience tests
  • Ian:
    • Much work on Quest FP7 bid
    • Cleaning up remaining issues with kernel security update
    • Quattor tutorial for Castor
  • James T:
    • Updated PXE boot images for Viglen
    • CRISTAL2 preparation
    • Kernel updates
    • Fixed several Ganglia problems
    • Tried out the OSSEC intrusion detection system
    • AoD Wednesday
  • Jonathan:
    • updated BIOS on 11 sv-08 systems to allow hot swap of soft RAID disks and cleared BIOS logs
    • worked on developing backup strategy on core systems to improve resilience
    • made final dump of csflnx353 to archive tape and shut system down
    • corrected backup check scripts on rhubarb
    • started Ganglia monitoring on cpre004 (CIP server)
    • added Castor specific userids and groups to NIS
    • Nagios configuration updates
    • updated RPMs tier1-sudo-config, tier1-nrpe-config and tier1-nagios-plugins
  • James A:
    • A/L
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss163 double disk failure (under test).
    • gdss368 moved into rack in HPD room with Martin and James T.
    • gdss161 given back to Castor.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 163 and 282.

Operational Issues and Incidents

| Index | Description | Start | End | Severity | Affected VO(s) |
| | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All |

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

| Component | Description | Start | End | Affected VO(s) | Type |

Development priorities

  • All
    • Work on evacuating A1 Upper (Castor admin and LSF systems)
  • Martin:
    • complete CPU ITT evaluation
    • testing sample hardware
    • install database test boxes
  • Ian:
    • Wrap up Quattor FP7 bid
    • Assist with quattorising new disk servers
    • SL5 vobox in quattor
    • Look at capacity modelling for Andrew
  • James T:
    • Progress meeting with Viglen
    • Disk server "quattorisation"
  • Jonathan:
    • further work on developing backup strategy on core systems to improve resilience
    • updates to farm to allow migration of Babar functional userids to new home filesystem server
    • Quattor implementation for Nagios slave
    • security updates to disk servers to prevent general user logins
    • Nagios configuration updates
  • James A:
    • Catching up after A/L.
    • Trying to focus on SINDES; the majority of time this week is theoretically ring-fenced for it.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Return gdss67 to the Castor team once testing is finished.
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss67, 134, 163 and 282.

Absences

  • James T:
    • CRISTAL2 course (Wednesday - Friday)

Fabric On-Call

  • Mon-Sun: James T (Fabric Mon-Thu, Primary Fri-Sun)

Advance Warning of Requirements and Blocking Issues

Services Issues

  • Various requests for hardware.
    • Working on various hardware requests for Services team.
