Tier1 Operations Report 2012-03-07
RAL Tier1 Operations Report for 7th March 2012
Review of Issues during the week 29th February to 7th March 2012.
- The Castor database saw very heavy load from Atlas during the early hours of Saturday (3rd March), which appears to have been linked to a particular workload.
- On Saturday (3rd March) the network link between the UKlight and SAR routers failed for a bit under an hour around lunchtime; this affected transfers to/from Tier2s. A fibre transceiver in this link was replaced on Tuesday morning (6th March) while the FTS was down for a scheduled intervention.
Resolved Disk Server Issues
- GDSS513 (LHCbDst - D1T0) was removed from production on Wednesday (29th February) following multiple drive failures. It was returned to service the next morning (1st March).
Current operational status and issues.
- There is a known issue with the Atlas SRMs which is being investigated. A patched version has been rolled out that provides a workaround for the problem and stops the SRMs crashing. The remaining impact of this problem is minimal.
Ongoing Disk Server Issues
- None.
Notable Changes made this last week
- Thursday (1st March): One of the two batches of new worker nodes was moved to production.
- Tuesday (6th March): Castor databases moved to the final hardware configuration for the main database, with Oracle Data Guard enabled to synchronize updates to the backup database.
- Tuesday (6th March): FTS updated to version 2.2.8 (still using Oracle 10).
- Tuesday (6th March): Applied the network routing change required to extend the range of addresses that route over the OPN.
Forthcoming Work & Interventions
- The Tier1 internal mail server ("Pat") will be replaced in the next couple of weeks.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.
- Databases:
  - Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
  - Switch LFC/FTS/3D to new Database Infrastructure.
  - Update LFC/FTS databases to Oracle 11.
- Castor:
  - Update the Castor Information Provider (CIP) (needs to be re-scheduled).
  - Move to use Oracle 11g (requires a minor Castor update).
- Networking:
  - Install new Routing & Spine layers for Tier1 network.
  - Main RAL network updates - early summer.
- Fabric:
  - BIOS/firmware updates, other re-configurations (adding IPMI cards, etc.)
- Grid Services:
  - Updates of Grid Services (including WMS, MyProxy, LFC front ends) to EMI/UMD versions.
Entries in GOC DB starting between 29th February and 7th March 2012.
There were no unscheduled outages during this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor & CEs (batch). | SCHEDULED | OUTAGE | 06/03/2012 10:00 | 06/03/2012 13:30 | 3 hours and 30 minutes | Castor Outage During Migration of Castor Oracle Databases to new hardware. |
lcgfts.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/03/2012 08:00 | 06/03/2012 13:05 | 5 hours and 5 minutes | Upgrade to FTS 2.2.8. Will include starting with a fresh database so all channels drained and any transfers waiting in the ready queue will be lost. |
castor GEN instance: srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 29/02/2012 08:00 | 29/02/2012 12:00 | 4 hours | Update of GEN Castor instance to version 2.1.11-8 |
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
79867 | Green | Less Urgent | In Progress | 2012-03-04 | 2012-03-07 | SNO+ | snoplus.snolab.ca LFC |
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-03-07 | SNO+ | glite-wms-job aborted |
74353 | Red | Very Urgent | In Progress | 2011-09-16 | 2012-03-02 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-03-02 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |