Production Team Report 2011-01-17
RAL Tier1 Production Team Report for 17th January 2011.
AoD This Week
Mon: Gareth; Tue-Thu: Tiju; Fri: Gareth
Last week
- Gareth: AoD (2 days); post-mortems on disk servers.
- John: A/L (annual leave).
- Tiju: AoD (3 days); set up a trial of SMS notifications in parallel with the pager.
Changes to Operating procedures
- Trial of SMS notifications in parallel with the pager.
Declared Outages in GOC DB
- Mon/Tue 17/18 January - Upgrade to 64-bit OS on ATLAS Castor disk servers.
- Sat/Sun 22/23 January - Power outage in the Atlas building.
Advanced Warning
- Tuesday 18th January - Increase shared memory for OGMA.
- Wednesday 19th January - gLite updates to site BDIIs.
Other Changes
- Fabric:
- Application of kernel update to batch server.
- Add a further gateway address to enable an additional IP range.
- Double the network link to the tape robot stack (stack 12). (Requires a Castor stop; probably to be done when the Oracle updates are applied.)
- Replace the older of the pair of SAN switches serving the Tier1 Oracle databases with its new replacement. (Requires a stop of FTS, LFC and 3D.)
- Database:
- Oracle 10.2.0.5 upgrade. (To follow once CERN has applied the update to similar databases.)
- Revisit multipathing for the non-Castor databases.
- Increase shared memory for OGMA, LUGH & SOMNUS (proposed for Tuesday 18th Jan).
- Grid Services:
- Changes to increase the resilience of the BDII service.
- gLite update on site BDII nodes.
- Change the batch job OS selection mechanism (part of enabling scheduling by node).
- Castor:
- Upgrade CMS disk servers to 64-bit - probably Mon/Tue 31st Jan/1st Feb.
- Upgrade GEN disk servers to 64-bit - date TBD.
- Change ATLAS Castor permissions to prevent users from deleting data.
- Upgrade Puppetmaster and clients.
- Networks:
- None