RAL Tier1 weekly operations Fabric 20090706
From GridPP Wiki
Revision as of 21:35, 7 July 2009 by Martin bly (Talk | contribs)
Summary of Previous Week
Developments
- All
- Tier1 relocation
- Sporadic attendance at HEPSysMan
- Martin
- Managing Tier1 relocation
- Network management
- Procurements
- Ian
- Monitoring network etc during move
- Work with MattH and Derek on gLite/Quattor
- HEPSysMan talk, and attended some sessions
- James T
- Re-initialized arrays on Streamline 2008 kit as they were configured differently.
- Presentation on verifies at HEPSysMan
- Jonathan
- updated DHCP server to set new MAC address for puppetdev (for Chris K)
- worked on list of Fabric Team documentation (for Martin/Gareth)
- worked with James A on network cabling for Datastore
- updated iptables on lcgsql0363 to correct netmask for some rules
- allow pat to handle mail for Castor DB systems (request from Cheney)
- corrected atlasbackup problems for 6 nodes (old tapes not deleted)
- updated Nagios configuration on netnag (temporary Nagios server for R89 migration)
- disabled check for nagios process on master server from nagios01/2/5
- prepared and released updated RPM tier1-nrpe-config with additional servers (nagger, netnag)
- worked on installation of Nagios 3 on nagger
- James A
- Started up all batch capacity (old & new).
- Started simple load testing on all WNs in R89 to test air-con and begin acceptance testing of new systems.
- Assisted where necessary with cabling of various racks and systems.
- Laid cables for ADS shelves with JFW.
- Kash
- Drive replacement.
- Fixing broken WNs.
- gdss156 ready for production.
- Moved srm servers from R27 to R89 with MJB.
- Replaced new memory in gdss192, 207, 192, 226 and 357.
- Working on gdss73, 192, 196, 198, 102, 128, 266, 121, 135, 150, 243.
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
Tier1 Move | 18 June | 6 July | All | Severe |
Plans for Week(s) Ahead
Operational Issues and Incidents
Description | Start | End | Affected VO(s) | Severity |
---|---|---|---|---|
Off-site network access will be down for software and board updates to the site routers. | ~07:30 7 July | ~10:30 7 July | All | Severe |
Development Priorities
- Martin
- Procurement preparation
- Updating of network switch software (Tuesday)
- Ian
- Fabric Working group
- Start work on production Quattor server
- Quattor FP7 bid preparation
- James T
- Polish off move (xrootd/NFS servers, remaining interventions)
- Acceptance testing new disk hardware
- Jonathan
- Nagios configuration updates as required
- restart normal Nagios service with callouts
- complete list of Fabric Team documentation
- complete adding simple Nagios configuration documentation to wiki
- continue configuration work on nagger
- resurrect plan to move home filesystem to new server
- create SL5 version of tier1-sendmail-config RPM
- continue to plan AFS migration from Kerberos 4 to Kerberos 5
- James A
- Startup of batch system.
- Join Ian's work on Quattor.
- Updating IPMI card firmware on various systems.
- Kash
- Drive replacement.
- Fixing broken WNs.
- Working with Viglen Engineer.
- Continue working on gdss73, 192, 196, 198, 102, 128, 266, 121, 135, 150, 243.
Absences
- Ian: Wednesday - toil
Fabric On-Call
- Mon-Thu: James T
Advance Warning of Requirements and Blocking Issues
Service Issues
- RT# 38567 - Dedicated WN for Alice (SW area + gridftp area):
- Ongoing
- RT# 40180 - Resurrect PPS hardware
- Three units powered up
- RT# 44835 - Non-capacity HW for testing (Services)