RAL Tier1 weekly operations Fabric 20090803

Summary of week gone

Developments

  • All
  • Martin:
    • Procurements
    • Further setup and configuration of SAN, arrays and systems for resilient LFC/FTS/3D Oracle services
    • Meeting with SGI (Rackable)
  • Ian:
    • Revised Quattor Workplan
    • Production quattor server
    • Introducing jrha and jiant to Quattor
    • Primary on call
  • James T:
    • Updated the verify scheduling system. It now:
      • Takes account of machines in status retired or intervention.
      • Sleeps for a variable time to avoid too many connections to the database.
    • Crash course on Quattor configuration of disk servers with Ian
    • Lots of work on the disk server acceptance testing.
    • Handed some of the checking of machines in testing over to Kash. Most of the Streamline machines come out of testing this week.
    • New loggers now running. Just a few machines/switches/PDUs logging to old loggers.
  • Jonathan:
    • A/L
  • James A:
    • Added RAS's Production Actions to MyActions system.
    • Discussed WN deployments with IC.
    • Continued learning about TORQUE & MAUI.
    • Deployed new MySQL server for MINOS.
    • Started network load-testing on new CPU.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss356 replaced 4x2GB memory; given back to Castor.
    • gdss196 replaced 4-port RAID card.
    • gdss166 given back to Castor.
    • lcg1001-1002 replaced power supply. (Fixed)
    • gdss391 replaced faulty fan cable. (Fixed)
    • gdss392 replaced faulty memory in RAID card and one faulty drive. (Fixed)
    • gdss410 replaced faulty memory DIMM 1A. (Fixed)
    • Working on gdss213. (3 drive failures)
    • Working on Viglen and Streamline 2008 disk servers.
    • Working on gdss73, 196, 198, 243 and 256.
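
The verify scheduling changes listed under James T above (skipping machines in status retired or intervention, and sleeping for a variable time before connecting to the database) could be sketched roughly as follows. This is only an illustration, not the actual Tier1 code: the function names, the status strings, the 30-second ceiling and the connect callable are all assumptions made for the example.

```python
import random
import time

# Assumed status names, taken from the wording of the report.
SKIPPED_STATUSES = {"retired", "intervention"}


def machines_to_verify(machines):
    """Return hostnames whose status does not exclude them from verification.

    `machines` is an iterable of (hostname, status) pairs (hypothetical layout).
    """
    return [name for name, status in machines if status not in SKIPPED_STATUSES]


def jittered_delay(max_seconds):
    """Pick a random delay in [0, max_seconds] so clients do not all connect at once."""
    return random.uniform(0, max_seconds)


def run_verify(machines, connect, max_delay=30.0, sleep=time.sleep):
    """Sleep for a variable time, then open a single database connection.

    `connect` is a hypothetical callable that returns a database connection.
    """
    sleep(jittered_delay(max_delay))
    conn = connect()
    return conn, machines_to_verify(machines)
```

Spreading the start times randomly means that many hosts started by the same scheduler tick do not all hit the database in the same second.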

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
    • AwayDay Wednesday
  • Martin:
    • Procurements
    • Further setup and configuration of SAN, arrays and systems for resilient non-Castor Oracle service
  • Ian:
    • Finalising production quattor server
    • Quattor deployment - further work on RAL specific components
    • Further work on organisation of Quattor templates
  • James T:
    • Quattor disk server deployment.
    • Fabric team "day" and preparation.
    • Viglen 2008 disk acceptance testing (mostly Kash though).
    • Move switch/PDU syslog to new loggers.
    • Decommission old loggers.
  • Jonathan:
    • Catch up
    • Reboot AFS servers
    • Release new versions of tier1-nagios-plugins and tier1-nrpe-config
    • Nagios configuration updates
  • James A:
    • Monday, Thursday & Friday: Focus on QUATTOR.
    • Tuesday: Prepare for Fabric Day.
    • Wednesday: Fabric Day.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on Viglen and Streamline 2008 disk servers.
    • Continuing work on gdss73, 196, 198, 243 and 256.

Absences

  • All: AwayDay Wednesday: NO FABRIC TEAM COVER EXCEPT IN EMERGENCY (power failure, thermonuclear detonation...)

Fabric On-Call

  • Mon-Sun: Ian

Advance Warning of Requirements and Blocking issues

Services Issues

  • RT# 44835 – non capacity HW for testing (Services)
