RAL Tier1 weekly operations Fabric 20090810

From GridPP Wiki

Summary of week gone

Developments

  • All
    • Team Awayday
  • Martin:
    • Procurements
    • Further setup and configuration of SAN, arrays and systems for resilient LFC/FTS/3D Oracle services
  • Ian:


  • James T:
    • Quattor disk server work.
    • Escalation of Viglen 2008 disk acceptance testing including gathering stats and information.
    • Streamline 2008 disk server testing nearing completion
  • Jonathan:
    • added new disk servers (gdss368-477) to /etc/mail/local-host-names on pat and restarted sendmail
    • searched for old usernames for Brian/Kier
    • corrected ownership problems in /kickstart/yum directories
    • applied for updated host certificate for pat
    • copied yumit host certificate to /etc/grid-security on touch to replace existing wrong certificate
    • Nagios configuration update
  • James A:


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Updated wiki with supplier contacts and the complaints procedure for new parts. (Fabric procedures)
    • gdss196: replaced 4-port RAID card/IPMI card. (Fixed)
    • gdss345: replaced RAID card battery; given back to Castor.
    • gdss192: added IPMI card; given back to Castor. (Fixed)
    • gdss169: near miss (double disk failure); data saved by swift action. (Fixed)
    • gdss198: replaced IPMI card and updated firmware. (Fixed)
    • gdss166: given back to Castor. (Fixed)
    • gdss213: another near miss (3 drive failures); data saved with the cooperation of James T and the Castor team.
    • Working on Viglen and Streamline 2008 disk servers.
    • Working on gdss73, 95, 152, 243 and 256.
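
The host range added to /etc/mail/local-host-names this week (gdss368-477) can be expanded into one name per line with a short script. This is only a sketch: the function name and the optional domain argument are hypothetical, and the real entries on pat may well use fully qualified names that this page does not give.

```python
def expand_host_range(prefix, first, last, domain=""):
    """Return hostnames prefix<first>..prefix<last>, inclusive.

    `domain` is a hypothetical optional suffix; it is left empty by
    default because the source does not state the domain in use.
    """
    if first > last:
        raise ValueError("first must not exceed last")
    suffix = f".{domain}" if domain else ""
    return [f"{prefix}{n}{suffix}" for n in range(first, last + 1)]


# The range from this week's report: gdss368 through gdss477.
hosts = expand_host_range("gdss", 368, 477)

# One name per line, the layout local-host-names expects.
print("\n".join(hosts))
```

The output would still need to be appended to the file and sendmail restarted by hand, as was done this week.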


Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All


  • Martin:
    • Procurements
    • Further setup and configuration of SAN, arrays and systems for resilient non-Castor Oracle service
    • OS updates on LFC/FTS RAC and associated changes to enable visibility of resilient hardware
  • Ian:


  • James T:
    • Quattor disk server work including workshop with Michel Jouvin
    • Meeting with Viglen about the 2008 disk servers.
    • Begin work on tasks from "away day"
    • Move switch/PDU syslog to new loggers.
    • Decommission old loggers.
  • Jonathan:
    • reboot AFS servers
    • release new versions of tier1-nagios-plugins and tier1-nrpe-config
    • Nagios configuration updates
  • James A:


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on Viglen and Streamline 2008 disk servers.
    • Continuing work on gdss73, 95, 152, 243 and 256.

Absences

  • None planned

Fabric On-Call

  • Mon-Sun:

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 44835 – non capacity HW for testing (Services)

Category:RAL_Tier1
