Tier1 Operations Report 2012-01-18
From GridPP Wiki
RAL Tier1 Operations Report for 18th January 2012
Review of Issues during the week 11th to 18th January 2012.
- On Thursday (12th Jan) we had a couple of Atlas SAM test failures caused by a deadlock in the Atlas stager database, which occurred between 09:00 and 09:15. The deadlock cleared itself but did cause a set of file transfer failures.
- On Monday (16th Jan) there was a problem with the network link between the UKLight and SAR routers that started at 12:18 and was fixed at 14:01. This affected data transfers to/from Tier2s that do not go over the OPN.
- The planned updates to the SRMs ran into problems. The update to srm-atlas was delayed from Monday (16th) to Tuesday (17th), partly owing to the network problem (above) and partly because the planned work by Atlas centrally only took place on the Tuesday (17th). However, this update encountered problems and was backed out. The planned updates to the SRMs for the remaining Castor instances have been cancelled. These updates will be rescheduled once the problem is understood.
Resolved Disk Server Issues
- GDSS568 (Atlas DataDisk) was taken out of production around midday on Monday (16th Jan) owing to a memory error. It was returned to production later that afternoon after memory had been replaced.
Current operational status and issues.
- None.
Ongoing Disk Server Issues
- GDSS400 (Atlas Tape - D0T1) failed around 03:00 on Sunday morning (15th January) with FSProbe errors. There were no files awaiting migration to tape and the server was withdrawn from production. It is undergoing tests.
Notable Changes made this last week
- APEL update to UMD distribution. (Monday 16th Jan).
- Doubled CPU/wall time limits for grid4000M queue (used by Atlas)
Forthcoming Work & Interventions
- Update to Oracle 11 for Atlas 3D services (Ongoing today).
- Update to the Castor Information Provider (CIP). (Thursday 26th January).
- Atlas plan to move their LFC from RAL to CERN during the week beginning 6th February. We are aiming to schedule a number of other changes for the same period.
Declared in the GOC DB
- Update to Oracle 11 for the LHCb LFC and 3D databases. (Tuesday 31st January).
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced. A significant amount of work will need to be done during the LHC stop at the start of 2012.
- Infrastructure:
  - Move part of the cooling system onto the UPS supply. (Should not require service interruption.) (Planned for Tuesday 28th February).
- Databases:
  - Regular Oracle "PSU" patches are pending.
  - Switch Castor and LFC/FTS/3D to new Database Infrastructure (started)
- Castor:
  - SRM updates to version 2.11. (Need to be rescheduled)
  - Castor 2.1.11 upgrade.
  - Replace hardware running Castor Head Nodes.
  - Move to use Oracle 11g.
- Networking:
  - Changes required to extend range of addresses that route over the OPN.
  - Re-configure networking in UPS room.
  - Install new Routing & Spine layers.
  - Final updates to the RAL DNS infrastructure (two DNS servers still to replace)
- Fabric:
  - BIOS/firmware updates, equipment moves in machine room (consolidation of equipment in racks; some rack moves), other re-configurations (adding IPMI cards, etc.)
  - Network changes, including:
    - Relocating rack housing central switch (C300); resolving known issue with network in UPS room.
    - Changes to accommodate new networking equipment.
- Grid Services:
  - Updates of Grid Services (including LB, APEL, batch server) to UMD versions (mainly in new year).
- VO:
  - Migrate Atlas LFC to CERN. (Expected in week beginning 6th Feb.)
Entries in GOC DB starting between 11th and 18th January 2012.
There were no unscheduled outages during this period.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
castor 'GEN' instance SRMs (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k) | SCHEDULED | WARNING | 17/01/2012 09:00 | 17/01/2012 16:00 | 7 hours | SRM upgrade |
lcgapel0676.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 16/01/2012 12:00 | 16/01/2012 14:00 | 2 hours | Migration from glite to UMD. |
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | WARNING | 16/01/2012 09:00 | 16/01/2012 14:45 | 5 hours and 45 minutes | Atlas SRMs being upgraded |
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
77026 | Red | Less Urgent | Waiting Reply | 2011-12-05 | 2012-01-16 | | BDII |
74353 | Red | very urgent | Waiting Reply | 2011-09-16 | 2012-01-16 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-12-15 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |
68077 | Red | less urgent | in progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published |
64995 | Red | less urgent | in progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas |