RAL Tier1 weekly operations Fabric 20090817


Summary of week gone

Developments

  • All
    • Aircon failure recovery (twice) - ongoing
  • Martin:
    • Procurements
    • Further setup and configuration of SAN, arrays and systems for resilient LFC/FTS/3D Oracle services
  • Ian:
    • Michel Jouvin's Visit
    • Finalised production Quattor server
    • Introduced JFW to Quattor
    • Committed first fixes back to Quattor trunk
    • Work on FP7 bid with Morgan Stanley
  • James T:
    • Checking the disk servers.
  • Jonathan:
    • Quattor tutorial and initiation
    • Obtained and installed renewed host certificates for yumit and pat
    • Updated Fabric alarm response spreadsheet for loggers and help desk
    • Updated TierOneOncallProcess page on wiki to correct text of test pager alarm
    • Corrected root password for several disk servers
    • Disabled local access to Tier1 systems for non-Tier1 users (security issue)
    • Fixed mail quota for a user
    • Updated Nagios configuration on nincom and netnag
  • James A:
    • Implemented new thermal shutdown systems.
    • Deployed two new ARTEMIS instances.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Working on disk servers and worker nodes after the aircon problem.
    • Working on gdss73, 95, 152, 243 and 256.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
      | Aircon failure requiring emergency stop of Tier1 | 14:00 Monday 10th | 19:00 Monday 10th | Critical | All
      | Aircon failure requiring emergency stop of Tier1 | 00:30 Wednesday 12th | Ongoing | Critical | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
    • Ongoing recovery of Tier1 service after aircon failures
  • Martin:
    • Procurements
    • Meeting with Arista (switch vendor)
  • Ian:
    • Work with Derek on Quattorising batch01
    • Quattor deployment - further work on RAL specific components
    • Further meetings re Quattor FP7 bid
    • Primary on call
  • James T:
    • Quattor templates
    • Talk for TDG on Monday 24th.
  • Jonathan:
    • Reboot AFS servers
    • Update Nagios configurations as required
    • Release updated RPMs tier1-nagios-plugins and tier1-nrpe-config
    • Sort out atlasbackups after robot problems
    • Quattor work for Nagios
  • James A:
    • Re-connect and restart batch farm.
    • Focus on Quattor as much as possible.
    • Implement further enhancements to thermal shutdown systems.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on disk servers and worker nodes.
    • Continuing work on gdss73, 95, 152, 243 and 256.

Absences

  • All:
    • Working from home Wednesday (except Kash and Martin)
  • Jonathan:
    • A/L Tuesday
  • Ian:
    • A/L Friday

Fabric On-Call

  • Mon-Sun:

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 44835 – non capacity HW for testing (Services)
