RAL Tier1 weekly operations Fabric 20090605

Summary of week gone

Developments

  • All
    • R89 cabling
    • Settling in to R89
  • Martin:
  • Ian:
    • Scheduling network intervention post-mortem
    • Primary on-call over weekend
    • Fabric Management Working Group first meeting
    • Built secondary Nagios server to monitor network during transitions
    • Photos of machine room
  • James T:
    • Discussed and decided on roll-out of scheduled verifies.
    • Packaged scheduled verify system, waiting for James A. to make required changes to OverWatch.
    • Built ALICE software server (need to test hot swap before handing it to Grid Services)
    • Discussed and decided on disk intervention procedure with the CASTOR team and Kashif.
  • Jonathan:
    • completed update to TierOneSystemRecoveryProcedures wiki documentation and updated Strategic Action 61
    • fixed problems with various farm nodes
    • located files written by Marian describing installation of glite-ui as a virtual machine
    • Nagios configuration updates
    • got replacement number for pager 1, updated oncall program
    • created kickstart files for installation of new Nagios master server, installed and started build of Nagios v3.0.6
    • corrected space problem on lcgsql0363 by removing directory /var/lib/mysql/nagios29_sav
    • repaired MySQL table nagios_logentries by deleting old entries
    • Safety Refresher Training course and R89 Safety Tour
  • James A:
    • Continued to assist engineers where necessary.
    • Continued to lay networking for CASTOR/ADS racks with assistance from team.
    • Assisted with connecting new cables to old equipment in A5 lower to allow migration of systems to R89.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss338 ready for deployment.
    • gdss160 has passed its acceptance test. Presently verifying.
    • gdss125, gdss156 ready for production.
    • gdss140 given back to CASTOR.
    • gdss212 and gdss255: 8x1GB memory replaced (given back to CASTOR).
    • Working on gdss73, 139, 151, 196, 198, 207, 211, 102, 205.

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)
Too few NFS daemons on lcg0616 (CMS s/w server) caused hung mounts on WNs. Fixed quickly, and jobs continued to run.

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
    • Cabling in R89
    • EScience Away Day (Thursday/Friday) – not Martin/Kash
  • Martin:
  • Ian:
    • Configure Nagios transitional network monitor
    • Further Fabric management project work
    • Preliminary look at Morgan Stanley Aquilon tool
    • Plan IPMI/management network with James A
  • James T:
    • Primary on call Mon - Thurs.
    • Test new disk.
    • Recovery documentation.
    • Roll out scheduled verify system to non-CASTOR disk servers and non-production CASTOR disk servers (nonProd, spare, test,...).
    • Test hot swap on ALICE software server and hand over to Grid Services.
  • Jonathan:
    • add documentation to wiki to explain how to do simple Nagios configuration updates
    • continue to build Nagios binaries on nagger
    • add timings to plan for migration of Nagios master server to new hardware and Nagios version 3
    • Nagios configuration updates as required
    • resurrect plan to move home filesystem to new server
    • create SL5 version of tier1-sendmail-config RPM
    • continue to plan AFS migration from Kerberos 4 to Kerberos 5
  • James A:
    • More work on networking and other systems to allow for equipment moves.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on gdss73, 139, 151, 196, 198, 207, 211, 102, 205.

Absences

  • Kash A/L Friday
  • All bar Martin & Kash out Thursday/Friday

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 38567 - Dedicated WN for ALICE (SW area + gridftp area)
  • RT# 40180 - Resurrect PPS hardware
    • Three units powered up
  • RT# 44835 – non capacity HW for testing (Services)

