RAL Tier1 weekly operations Fabric 20090706

Summary of Previous Week

Developments

  • All
    • Tier1 relocation
    • Sporadic attendance at HEPSysMan
  • Martin
    • Managing Tier1 relocation
    • Network management
    • Procurements
  • Ian
    • Monitoring network etc during move
    • Work with MattH and Derek on gLite/Quattor
    • HEPSysMan talk, and attended some sessions
  • James T
    • Re-initialized arrays on Streamline 2008 kit as they were configured differently.
    • Presentation on verifies at HEPSysMan
  • Jonathan
    • updated DHCP server to set new MAC address for puppetdev (for Chris K)
    • worked on list of Fabric Team documentation (for Martin/Gareth)
    • worked with James A on network cabling for Datastore
    • updated iptables on lcgsql0363 to correct netmask for some rules
    • allow pat to handle mail for Castor DB systems (request from Cheney)
    • corrected atlasbackup problems for 6 nodes (old tapes not deleted)
    • updated Nagios configuration on netnag (temporary Nagios server for R89 migration)
    • disabled check for nagios process on master server from nagios01/2/5
    • prepared and released updated RPM tier1-nrpe-config with additional servers (nagger, netnag)
    • worked on installation of Nagios 3 on nagger
  • James A
    • Started up all batch capacity (old & new).
    • Started simple load testing on all WNs in R89 to test air-con and begin acceptance testing of new systems.
    • Assisted where necessary with cabling of various racks and systems.
    • Laid cables for ADS shelves with JFW.
  • Kash
    • Drive replacement.
    • Fixing broken WNs.
    • gdss156 ready for production.
    • Moved srm servers from R27 to R89 with MJB.
    • Replaced new memory in gdss192, 207, 192, 226 and 357.
    • Working on gdss73, 192, 196, 198, 102, 128, 266, 121, 135, 150, 243.

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
Tier1 Move | 18 June | 6 July | All | Severe

Plans for Week(s) Ahead

Operational Issues and Incidents

Description | Start | End | Affected VO(s) | Severity
Off-site network access will be down due to software and board updates to the site routers. | ~07:30 7 July | ~10:30 7 July | All | Severe

Development Priorities

  • Martin
    • Procurement preparation
    • Updating of network switch software (Tuesday)
  • Ian
    • Fabric Working group
    • Start work on production Quattor server
    • Quattor FP7 bid preparation
  • James T
    • Polish off move (xrootd/NFS servers, remaining interventions)
    • Acceptance testing new disk hardware
  • Jonathan
    • Nagios configuration updates as required
    • restart normal Nagios service with callouts
    • complete list of Fabric Team documentation
    • complete adding simple Nagios configuration documentation to wiki
    • continue configuration work on nagger
    • resurrect plan to move home filesystem to new server
    • create SL5 version of tier1-sendmail-config RPM
    • continue to plan AFS migration from Kerberos 4 to Kerberos 5
  • James A
    • Startup of batch system.
    • Join Ian's work on Quattor.
    • Updating IPMI card firmware on various systems.
  • Kash
    • Drive replacement.
    • Fixing broken WNs.
    • Working with Viglen Engineer.
    • Continue working on gdss73, 192, 196, 198, 102, 128, 266, 121, 135, 150, 243.

Absences

  • Ian: Wednesday - toil

Fabric On-Call

  • Mon-Thu: James T

Advance Warning of Requirements and Blocking Issues

Service Issues

  • RT# 38567 - Dedicated WN for Alice (SW area + gridftp area):
    • Ongoing
  • RT# 40180 - Resurrect PPS hardware
    • Three units powered up
  • RT# 44835 - Non-capacity HW for testing (Services)
