RAL Tier1 weekly operations Fabric 20090713
Summary of week gone
Developments
- All
- Martin:
- Post-move tidying up
- Procurements
- Ian:
- Quattor FP7 bid planning Meeting
- Troubleshooting reinstallation on Quattor server
- New hardware installation timeline
- James T:
- Unquiesced remaining disk servers following move.
- Acceptance testing of Streamline kit.
- Built new loggers and started testing.
- Primary on call over the weekend.
- Jonathan:
- worked on reducing space used and changed logrotate policy on system loggers
- updated RPMs on several systems and rebooted where required
- renamed csfmonitor-pbs RPM as tier1-batchinfo, updated to write to nagiosdb and tested
- corrected problems on various systems following service restart after R89 move
- created MySQL database minos_dogwood1 for user
- updated RPMs on servers and rebooted
- James A:
- Startup of batch system.
- Joined Ian's work on Quattor.
- Updated IPMI card firmware on a few systems.
- Cabled CASTOR Rack F (Certification Systems).
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Worked with Viglen Engineers.
- gdss192: replaced memory, mainboard and RAID card. (Fixed)
- gdss266: added new RAID card. (Fixed)
- gdss198: replaced memory, mainboard and RAID card. (Still broken)
- Replaced memory in gdss102, 223, 216 and 236.
- Working on gdss73, 196, 198, 128, 121, 135, 150, 243
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
--- | --- | --- | --- | --- | --- |
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
--- | --- | --- | --- | --- | --- |
Development priorities
- All
- Martin:
- Physical installation of resilient arrays for Oracle systems
- Castor and kernel upgrade on disk servers
- Possible reset on Tuesday of Stack-8 to regain access to it
- Planning of Fabric intervention schedule for period to 31 August
- Ian:
- Python Course (Weds & Thurs)
- Work with Martin installing production Quattor Server
- Configuring production Quattor Server
- James T:
- Python course Monday and Tuesday.
- Acceptance testing of Viglen kit now that it has all been handed over.
- Keep an eye on all acceptance tests.
- A bit of work on the CASTOR intervention on Tuesday (disk servers).
- Finish testing new loggers, copy data across and put into production.
- Quattor disk server deployment.
- Jonathan:
- Nagios configuration updates as required
- update RPMs on core servers and reboot as required
- complete adding simple Nagios configuration documentation to wiki
- continue configuration work on nagger and nagiosdb
- resurrect plan to move home filesystem to new server
- create SL5 version of tier1-sendmail-config RPM
- James A:
- CASTOR Upgrade.
- Start looking at Sindes for IC.
- Add RAS's Production Actions to MyActions system.
- Document user access to t1bofh.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Continuing work on gdss73, 196, 198, 128, 121, 135, 150, 243, 218, 260 and 248.
Absences
- James:
- On leave Wed-Fri.
Fabric On-Call
- Mon-Thu: James T
Advanced Warning of Requirements and Blocking issues
Services Issues
- RT# 38567 - Dedicated WN for Alice (SW area + gridftp area)
- lcg0614 handed to Services Team
- RT# 44835 - non-capacity HW for testing (Services)