RAL Tier1 weekly operations Fabric 20090720
From GridPP Wiki
Martin bly (Talk | contribs) |
Latest revision as of 17:09, 20 July 2009
Summary of week gone
Developments
- All
- Martin:
- (Saturday) Terminated testing on 27 disk servers (Viglen08) due to severe problems on each
- Install and initial configuration of EMC data arrays and second SAN switch for resilient non-Castor Oracle services
- Ian:
- Python Course
- Installing production Quattor server
- Resolving Quattor installation mechanism issues
- James T:
- Python course Monday and Tuesday.
- Acceptance testing of 2008 disk now in full swing
- Keep an eye on all acceptance tests.
- Rolled out the verify scheduling across the production disk servers, with a few exceptions to follow up.
- Finished testing new loggers, copied data across and put new logger1 into production.
- Started work on Quattor disk server deployment.
- Jonathan:
- sorted out atlasbackup problems on lcgsql0363, touch
- updated RPMs on several core systems
- installed tier1-batchinfo RPM on csflnx353, lcgbatch01
- updated Nagios configuration
- updated plan for migration of Nagios MySQL database and master server to new hosts
- James A:
- CASTOR Upgrade.
- Documented user access to t1bofh.
- On leave Wed-Fri.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- gdss126 and gdss166 given back to castor (Fixed).
- gdss192 given back to castor. (Fixed; working without IPMI card)
- Working on gdss73, 196, 198, 128, 121, 135, 150, 243, 218, 260 and 248.
Operational Issues and Incidents
Index | Description | Start | End | Severity | Affected VO(s) |
---|---|---|---|---|---|
Summary of plans for week ahead
Scheduled and Cancelled Down Times
Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB
Component | Description | Start | End | Affected VO(s) | Type |
---|---|---|---|---|---|
Nagios | Cutover of backend database from lcgsql0363 to nagiosdb | Monday pm | Monday pm | None | No risk |
Development priorities
- All
- Martin:
- Further setup and configuration of SAN and arrays for resilient non-Castor Oracle service
- Ian:
- Further Quattor install server work
- James T:
- TOIL on Monday all day.
- Keep an eye on all acceptance tests.
- Quattor disk server deployment.
- Jonathan:
- switch Nagios and batch jobs MySQL databases to nagiosdb with update of Mimic (James A)
- work with James T to sort out system loggers
- reboot AFS servers (after kernel updates last week)
- complete updates to plan to move home filesystem to new server
- James A:
- Start looking at Sindes for IC.
- Go through Quattor installs with IC.
- Add RAS's Production Actions to MyActions system.
- Update Mimic to use new Nagios database.
- Kash:
- Drive replacement.
- Fixing broken WNs.
- Move "Marley" from R27 (Ops).
- Create list and location of Viglen 06 Disk servers for adding additional RAID cards.
- Continue working on gdss73, 196, 198, 128, 121, 135, 150, 243, 218, 260 and 248.
Absences
- Ian:
- A/L from midday Wednesday
- James T:
- TOIL on Monday
- Jonathan:
- A/L Wednesday (weather dependent), Friday.
- A/L All w/b 27/7
Fabric On-Call
Advanced Warning of Requirements and Blocking issues
Services Issues
- Update of ntpd RPM can leave ntpd process not running (3 instances seen and corrected; there is a Nagios test for ntpd daemon)
- RT# 44835 – non-capacity HW for testing (Services)
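For the ntpd issue noted above, a daemon test of the kind mentioned could look roughly like the sketch below. This is not the Nagios test actually deployed at the Tier1; it is a minimal illustration that assumes a Linux host exposing `/proc/<pid>/comm`, and the function names `running_process_names`, `check_daemon` and `main` are hypothetical. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```python
import os

# Conventional Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def running_process_names(proc_root="/proc"):
    """Return command names of running processes (assumes Linux /proc)."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                names.append(fh.read().strip())
        except OSError:
            continue  # process exited while we were scanning
    return names


def check_daemon(name, process_names):
    """Return (exit_code, message) for a daemon given a list of process names."""
    count = sum(1 for p in process_names if p == name)
    if count > 0:
        return OK, f"OK - {count} {name} process(es) running"
    return CRITICAL, f"CRITICAL - no {name} process running"


def main():
    """Hypothetical entry point: print a status line and return the exit code."""
    try:
        status, message = check_daemon("ntpd", running_process_names())
    except OSError as exc:
        status, message = UNKNOWN, f"UNKNOWN - cannot read process list: {exc}"
    print(message)
    return status
```

A wrapper script would call `sys.exit(main())` so that Nagios sees the exit code; running such a check right after an RPM update would catch the case where ntpd has not been restarted.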