RAL Tier1 weekly operations Fabric 20090907

Summary of week gone

Developments

  • All
  • Martin:
    • Kernel patching
    • Procurement issues
  • Ian:
    • Annual Leave
  • James T:
    • Ongoing liaison with Viglen over disk issues
    • Continued acceptance testing of Streamline nodes
    • Kickstart for CASTOR on new Streamline nodes
    • Began to put new machines into Overwatch
  • Jonathan:
    • updated RPMs on several servers and rebooted for new kernel
    • corrected /var partition full problem on system consoles
    • updated NIS netgroup to add new batch workers and new CE
    • prepared local sendmail configuration RPM for SL5 and installed on touch
    • cleared up lots of atlasbackup problems (failed to purge old files)
    • restarted syslog daemon on a number of servers that were still logging to the old loggers
    • Nagios configuration updates
    • issued updated version of RPM tier1-nagios-plugins (version 2.0-54)
    • 2 days laboratory holidays
  • James A:
    • Spent most of the time working on SL5 WN configuration in Quattor.
  • Kash:
    • Created spreadsheet of serial numbers etc. of drives that have failed in new Viglen kit.
    • Drive replacement.
    • Fixing broken WNs.
    • gdss169 completed verify. (Intervention)
    • gdss332: replaced 4 x 2 GB memory and returned to CASTOR.
    • lcg0643 and lcg0658: faulty memory (1 x 1 GB) replaced with memory borrowed from lcg0651 and moved in HPD area. (Fixed)
    • gdss81 and gdss151 returned to CASTOR.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, gdss169 and gdss243.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
1 | gdss67 RAID failure (RT#49145) | 18/08/09 | Ongoing | Critical | CMS
2 | gdss164 RAID5 failure (RT#49192) | 19/08/09 | Ongoing | Critical | BaBar
3 | Kernel security problem | 13/08/09 | Ongoing | Critical | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type
ogma | Migration of Oracle database to a 64-bit system. A 3 hour outage has been declared, although it is expected that the database will only be unavailable for a short while at some point during this interval. | 2009-09-08, 09:00:00 | 2009-09-08, 12:00:00 | ?? | Down

Development priorities

  • All
  • Martin:
    • Annual Leave
  • Ian:
    • Quattor
  • James T:
    • Keep up to speed with Viglen on disk issues and run any tests they need
    • Restoration of gdss164
    • Admin on Duty on Wednesday
    • SL5 64-bit software server build
    • Quattor: disk servers and ganglia config files
    • Tuning on gen disk servers at Brian's request
    • Annual leave on Friday
  • Jonathan:
    • migrate NIS servers to new hardware and operating system
    • migrate Nagios slave servers to new hardware
    • Nagios configuration updates
    • plan for migration of home filesystem server to new hardware
    • plan for migration of mail server to new hardware
  • James A:
    • Restore two ARTEMIS units in Atlas, moving one to CICT subnet, one to T1 subnet.
    • Network cabling for new NC twins.
    • Write up power control project plan.
    • Give assistance with CE08 when needed.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Create graphs of drive failures (daily, weekly and monthly).
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss67, gdss169 and gdss243.

Absences

  • Martin:
    • A/L all week
  • James:
    • A/L Friday
    • A/L 17-18 September


Fabric On-Call

  • Mon-Thurs:
    • James (Primary Mon - Wed)

Advance Warning of Requirements and Blocking Issues

Services Issues

  • RT# 44835 – non capacity HW for testing (Services)
