RAL Tier1 weekly operations Fabric 20090612


Summary of week gone

Developments

  • All:
    • Away Days
  • Martin:
    • Preparation and plans for move
  • Ian:
    • Configured and tweaked transitional Nagios server
    • Planning IPMI/management network with James A
    • Investigated network traffic over OPN from CERN
    • Discussed BDII config with Matt H
    • Further Quattor work
  • James T:
    • Primary on call Mon - Thurs.
    • Rolled out scheduled verify system to non-CASTOR disk servers and non-production CASTOR disk servers (nonProd, spare, test,...), both 3ware- and Areca-based.
    • Tested hot swap on ALICE software server and handed over to Grid Services. Just needs another rsync of the data from csfnfs58 when it's put into production.
    • Merged wiki pages for CASTOR@RAL and Tier1 Experiments Liaison meetings.
    • Made changes to disk deployment docs that were agreed with Chris some time ago.
  • Jonathan:
    • started adding documentation to wiki to explain how to do simple Nagios configuration updates
    • worked on method of restoring AFS volume glite-sw if file server (currently afs1) crashes out of hours
    • updated Nagios configuration to add new BDII nodes
    • repaired MySQL table nagios_logentries
    • solved problem of Nagios bleeper not getting mail messages
    • followed up problem of mail to valid user being rejected by iCritical after further occurrence (FBU Footprints #19663)
    • added times to plan for update of Nagios server
  • James A:
    • More work on networking and other systems to allow for equipment moves.
    • Arranged for hand-over of new systems for acceptance testing starting 15th June.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss125 and 160 have been given back to CASTOR.
    • gdss156 ready for production (needs to be moved back into rack).
    • gdss205: replaced 8x1GB memory (given back to CASTOR).
    • gdss139, 142, 145, 149, 151 and 211 have been given back to CASTOR.
    • Cabling in R27 with James A.
    • Working on gdss73, 196, 198, 207, 102.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type
Castor | Changing castor disk servers (Compusys05, Viglen06, Viglen07 AMD) to temporary network prior to the machine room migration | 18/06 10:00 | 18/06 11:30 | All | At Risk

Development priorities

  • All:
    • Preparation for h/w move
    • Cabling in R89
  • Martin:
    • Preparation and plans for move
  • Ian:
    • Set up notifications on transitional Nagios server
    • Set up basic grid services tests
    • Further planning of Fabric Management system
  • James T:
    • Primary on call Mon - Thurs.
    • Recovery (and other) documentation.
    • Acceptance testing new disk hardware.
  • Jonathan:
    • complete adding simple Nagios configuration documentation to wiki
    • continue work building Nagios binaries on nagger
    • Nagios configuration updates as required
    • complete work on method of restoring AFS volume glite-sw if file server (currently afs1) crashes out of hours and document
    • resurrect plan to move home filesystem to new server
    • create SL5 version of tier1-sendmail-config RPM
    • continue to plan AFS migration from Kerberos 4 to Kerberos 5
  • James A:
    • Working with suppliers to ensure hand-over of new systems into acceptance testing.
    • Move remaining ARTEMIS sensors in Atlas to new IPs.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on gdss73, 192, 196, 198, 207, 102, 226.

Absences

  • None

Fabric On-Call

  • James T

Advance Warning of Requirements and Blocking issues

Services Issues

  • RT# 38567 - Dedicated WN for Alice (SW area + gridftp area)
  • RT# 40180 - Resurrect PPS hardware
    • Three units powered up
  • RT# 44835 - non-capacity HW for testing (Services)
