Tier1 Operations Report 2012-02-22
RAL Tier1 Operations Report for 22nd February 2012
Review of Issues during the week 15th to 22nd February 2012.
- There were some failures of the Atlas SRM SAM tests early on Friday morning. There is currently a known problem with the Atlas SRMs, which is worked around by an aggressive re-starter. On this occasion the re-starter failed, and one of the SRMs was manually restarted after a call-out.
Resolved Disk Server Issues
- None.
Current operational status and issues.
- There is a known issue with the Atlas SRMs (see above).
- There is a problem with some batch job submission. It is believed to occur when a VO uses information from the BDII in the submission process, and was exposed by the batch server upgrade last week.
Ongoing Disk Server Issues
- None
Notable Changes made this last week
- On Monday (20th February) the CMS Castor instance was upgraded to version 2.1.11-8, with new hardware being introduced for the Atlas Castor head nodes.
- The same update for the Atlas Castor instance was completed this morning (Wed. 22nd Feb.).
Forthcoming Work & Interventions
- Thursday 23 Feb. Application of Oracle "PSU" patches to Atlas 3D & LHCb 3D/LFC systems ("OGMA" & "LUGH")
- Tuesday 28th Feb (morning). Electrical work to prepare for moving part of the cooling system onto the UPS supply. Some other electrical work continues for the whole week (27 Feb - 2 Mar).
- Week beginning 5th March (TBC) FTS update to version 2.2.8.
Declared in the GOC DB
- Monday 27 Feb. Upgrade of LHCb Castor instance to version 2.1.11-8.
- Wednesday 29 Feb. Upgrade of GEN Castor instance to version 2.1.11-8.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.
- Databases:
  - Regular Oracle "PSU" patches are pending.
  - Switch Castor and LFC/FTS/3D to new Database Infrastructure (started).
    - Next step of these changes is to move Castor databases and enable Data Guard.
- Castor:
  - Update the Castor Information Provider (CIP). (Need to re-schedule.)
  - Move to use Oracle 11g (requires a minor Castor update).
- Networking:
  - Changes required to extend range of addresses that route over the OPN.
  - Install new Routing & Spine layers.
- Fabric:
  - BIOS/firmware updates, other re-configurations (adding IPMI cards, etc.)
- Grid Services:
  - Updates of Grid Services (including WMS, FTS, MyProxy, LFC front ends) to EMI/UMD versions.
Entries in GOC DB starting between 15th and 22nd February 2012.
There were no unscheduled outages during this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcgvo05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 22/02/2012 11:00 | 21/02/2013 12:00 | 365 days, 1 hour | Outage on Atlas vobox for Alastair to investigate |
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 22/02/2012 08:00 | 22/02/2012 16:00 | 8 hours | Update of Atlas Castor instance to version 2.1.11-8 |
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 20/02/2012 08:00 | 20/02/2012 15:35 | 7 hours and 35 minutes | Update of CMS Castor instance to version 2.1.11-8 |
lcgwms01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 09/02/2012 15:00 | 15/02/2012 12:00 | 5 days, 21 hours | System unavailable - EMI installation |
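The Duration column above can be cross-checked from the Start and End columns. The following is a minimal sketch, assuming the timestamps are in `DD/MM/YYYY HH:MM` form as they appear in the table; the function name `outage_duration` and the constant `GOCDB_FORMAT` are hypothetical names chosen for illustration and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the table above (DD/MM/YYYY HH:MM).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def outage_duration(start, end):
    """Return (days, hours, minutes) between two table timestamps."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# Rows taken from the table above.
print(outage_duration("20/02/2012 08:00", "20/02/2012 15:35"))  # (0, 7, 35)
print(outage_duration("09/02/2012 15:00", "15/02/2012 12:00"))  # (5, 21, 0)
print(outage_duration("22/02/2012 11:00", "21/02/2013 12:00"))  # (365, 1, 0)
```

The last row spans 29th February 2012, which is why a 365-day outage ends one calendar day short of the anniversary.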
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
79428 | Green | Less Urgent | In Progress | 2012-02-21 | 2012-02-21 | SNO+ | glite-wms-job aborted |
79720 | Green | Very Urgent | Waiting Reply | 2012-02-21 | 2012-02-22 | t2k.org | All jobs failing at RAL |
79283 | Red | Top Priority | In Progress | 2012-02-16 | 2012-02-22 | LHCb | Job publishing problem for LHCb at RAL |
77026 | Red | Less Urgent | In Progress | 2011-12-05 | 2012-02-03 | | BDII |
74353 | Red | Very Urgent | Waiting Reply | 2011-09-16 | 2012-02-10 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-02-21 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |