Production Team Report 2010-12-06
RAL Tier1 Production Team Report for 6th December 2010.
AoD This Week
Mon - Wed: John; Thu: Gareth; Fri: John
Last week
- Gareth: GoD and AoD for 1 day.
- John: Deploying Nagios tests for read-only file systems and for fsprobe.
- Tiju: AoD for 4 days.
Changes to Operating procedures
- Call-outs being added for FSPROBE errors and read-only file systems.
Declared Outages in GOC DB
- Mon 6th - Wed 8th Dec: Upgrade of Atlas Castor instance.
- Tue 7th Dec: FTM at risk while it is being quattorized.
- 10th - 13th Dec: Power outage in the Atlas building (next weekend).
Advanced Warning
- Weekend 11/12 Dec: Power outage in Atlas building.
- Monday 13th December - UPS test.
Other Changes
- Fabric:
  - Removal of sl08 disk servers from production (with Castor team).
  - Double the network link to the tape robot stack (stack 12), postponed from the last TS. (Requires Castor stop).
  - Swap out the older of the pair of SAN switches in the Tier1 Oracle databases for its new replacement. (Requires FTS, LFC, 3D stop).
- Database:
  - Re-visit non-Castor database multipathing.
  - Increase shared memory for OGMA, LUGH & SOMNUS.
- Grid Services:
  - Changes to increase resilience of the BDII service.
- Castor:
  - Change ATLAS Castor permissions to prevent users deleting data.
  - Removal of sl08 disk servers from production.
- Networks:
  - None