RAL Tier1 weekly operations Fabric 20090619

From GridPP Wiki

Summary of week gone

Developments

  • All:
    • Move preparation
  • Martin:
    • Preparation and plans for move
  • Ian:
    • Fabric Management Plan
    • Preparation for move
    • Work on Quattor w. Derek
    • Deployed first BDII
    • Finalising Netnag nagios monitoring for move
  • James T:
    • Fabric on call Mon – Thurs
    • Added a "force" feature to the verify system for use during interventions if necessary
    • Move worksheet completion
    • Disk server pre/post move procedures
    • Script to quiesce CASTOR/LSF on disk servers.
  • Jonathan:
    • completed tests of the method for restoring the AFS volume glite-sw if its file server (currently afs1) crashes (the procedure now needs to be documented)
    • on sl4sys32, installed RPMs tier1-yum-lcg-ca-certs, ca_UKeScienceRoot-2007 and ca_UKeScienceCA-2007, and changed /etc/krb5.conf to allow Kerberos authentication for access to the SVN repository
    • Nagios configuration update
    • repaired MySQL table nagios_logentries
    • created configuration for minimal Nagios server (netnag)
  • James A:
    • Working with suppliers to ensure hand-over of new systems into acceptance testing.
    • Moved some of the remaining ARTEMIS sensors in Atlas to new IPs.
    • Continued laying network cables in preparation for rack moves.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Worked with Streamline Engineer in R89.
    • gdss156 ready for production (needs to be moved back into its rack).
    • gdss81: two drive failures (drives replaced); server given back to CASTOR.
    • Cabling in R89 with James A.
    • Working on gdss73, 196, 198, 207, 102.

Operational Issues and Incidents

Description Start End Affected VO(s) Severity
gdss245 - Root (/) file system went read-only. Kash could not find a hardware fault. The server was re-installed with verifies turned on to try to weed out any problems.

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB


Description Start End Affected VO(s) Severity
None

Development Priorities

  • All
    • Move
  • Martin:
    • Move
  • Ian:
    • Quattor/QWG work
  • James T:
    • Primary on call Mon - Thurs.
    • Recovery (and other) documentation.
    • Acceptance testing new disk hardware.
  • Jonathan:
  • James A:
    • Monitoring relocation of Streamline systems to R89.
    • Continue laying network cables in preparation for CASTOR rack moves.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on gdss73, 192, 196, 198, 207, 102, 226.

Absences

  • JW: A/L Monday

Fabric On-Call

  • Mon-Thur: James T
  • Fri-Sun: Ian

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 38567 - Dedicated WN for Alice (SW area + gridftp area):
    • Ongoing
  • RT# 40180 - Resurrect PPS hardware
    • Three units powered up
  • RT# 44835 – non-capacity HW for testing (Services)
