Tier1 Operations Report 2012-02-29
From GridPP Wiki
RAL Tier1 Operations Report for 29th February 2012
Review of Issues during the week 22nd to 29th February 2012.
- Some problems relating to batch job submission have been investigated and fixed. These include correcting some information reported to the information system ("resources_default.walltime") and applying a patch to WMS01 to fix a proxy delegation problem seen by LHCb.
- On Tuesday (28th) there was a short (around 15 minute) network issue which had minimal impact on Tier1 operations, although we did see a spike in FTS transfer failures.
Resolved Disk Server Issues
- None.
Current operational status and issues.
- There is a known issue with the Atlas SRMs which is being investigated.
Ongoing Disk Server Issues
- Wednesday 29th Feb. GDSS513 (LHCbDst - D1T0) removed from production following multiple drive failures.
Notable Changes made this last week
- Monday 27 Feb. Upgrade of LHCb Castor instance to version 2.1.11-8.
- Wednesday 29 Feb. Upgrade of GEN Castor instance to version 2.1.11-8. (Castor 2.1.11-8 upgrade now complete.)
- Thursday 23 Feb. Application of Oracle "PSU" patches to Atlas 3D & LHCb 3D/LFC systems ("OGMA" & "LUGH")
- Tuesday 28th Feb. Electrical work took place to prepare for moving part of the cooling system onto the UPS supply.
- Updated drivers have been applied to tape servers which has increased the performance of the T10KB & C tape drives.
Forthcoming Work & Interventions
- Next Tuesday (6th March) (TBC) Castor outage for the second (and final) step of the Castor database migration, which includes enabling Oracle Data Guard.
- Next Tuesday (6th March) (TBC) Apply network routing change required to extend range of addresses that route over the OPN.
- Week beginning 5th March (TBC) FTS update to version 2.2.8.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.
- Databases:
  - Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
  - Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
- Castor:
  - Update the Castor Information Provider (CIP) (Need to re-schedule.)
  - Move to use Oracle 11g (requires a minor Castor update.)
- Networking:
  - Install new Routing & Spine layers for Tier1 network.
  - Main RAL network updates - early summer.
- Fabric:
  - BIOS/firmware updates, Other re-configurations (adding IPMI cards, etc.)
- Grid Services:
  - Updates of Grid Services (including WMS, MyProxy, LFC front ends) to EMI/UMD versions.
Entries in GOC DB starting between 22nd and 29th February 2012.
There were no unscheduled outages during this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k). | SCHEDULED | OUTAGE | 29/02/2012 08:00 | 29/02/2012 16:00 | 8 hours | Update of GEN Castor instance to version 2.1.11-8 |
srm-lhcb | SCHEDULED | OUTAGE | 27/02/2012 08:00 | 27/02/2012 13:06 | 5 hours and 6 minutes | Update of LHCb Castor instance to version 2.1.11-8 |
lcgvo05 | SCHEDULED | WARNING | 22/02/2012 11:00 | 24/02/2012 14:35 | 2 days, 3 hours and 35 minutes | Outage on Atlas vobox for Alastair to investigate |
srm-atlas | SCHEDULED | OUTAGE | 22/02/2012 08:00 | 22/02/2012 12:50 | 4 hours and 50 minutes | Update of Atlas Castor instance to version 2.1.11-8 |
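The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch, not part of the GOC DB tooling: the function name `outage_duration` and the constant `FMT` are hypothetical, and it assumes timestamps use the day/month/year format shown in the table.

```python
from datetime import datetime

# Assumed timestamp layout, matching the Start/End columns above (DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> tuple[int, int, int]:
    """Return the elapsed time between two timestamps as (days, hours, minutes)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# Example rows taken from the table above.
print(outage_duration("22/02/2012 11:00", "24/02/2012 14:35"))  # lcgvo05
print(outage_duration("27/02/2012 08:00", "27/02/2012 13:06"))  # srm-lhcb
```

Applied to the four entries, this reproduces the durations listed in the table.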
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
79732 | Green | Less Urgent | In Progress | 2012-02-28 | 2012-02-28 | hone | hone jobs after submission through lcgwms03.gridpp.rl.ac.uk WMS are at Waiting status too long time. |
79545 | Red | Top Priority | Waiting Reply | 2012-02-23 | 2012-02-24 | LHCb | Zombie jobs at RAL |
79428 | Red | Less Urgent | Waiting Reply | 2012-02-21 | 2012-02-23 | SNO+ | glite-wms-job aborted |
77026 | Red | Less Urgent | In Progress | 2011-12-05 | 2012-02-28 | BDII | |
74353 | Red | Very Urgent | Waiting Reply | 2011-09-16 | 2012-02-27 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-02-21 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |