Tier1 Operations Report 2011-11-16
RAL Tier1 Operations Report for 16th November 2011
Review of Issues during the week 9th to 16th November 2011.
- The Post Mortem (SIR) has been prepared for the problems with the Atlas Castor instance a week ago (Saturday - Monday 30 Oct - 1 Nov). See: https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111031_Castor_ATLAS_Outage
- In the middle of last week there were problems with transfers for two LHCb users. This was traced to a root CA certificate that had been introduced in the latest distribution but which, due to an oversight, had not been rolled out to the three LHCb SRMs. (These three SRMs have a different set-up to the other SRMs.)
- Since the weekend there have been problems with Alice access to Castor. The problem remains unresolved despite intensive investigation on Monday and Tuesday. We are currently awaiting input from Alice, and our present thinking is that the problem lies at the Alice end.
- A large number of pilot jobs were reported as failing (GGUS ticket from LHCb). On Monday (14th Nov) two nodes were found to be failing a significant number of jobs, although below the threshold for the 'black hole' detector. These two batch worker nodes were removed from production.
- Around 05:00 this morning (Wednesday 16th) there was a DNS problem that affected CERN. This also affected SAM tests of the Tier1 and had some impact on VO usage of our site.
Resolved Disk Server Issues
- None.
Current operational status and issues.
- The slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance) continue to be investigated. Improvements to rates to/from the RAL Tier1 have been made and now only two channels (NIKHEF to RAL; Birmingham to RAL) remain below an acceptable threshold.
- We continue to work with the Perfsonar network test system to understand some anomalies seen. The initial set-up was on virtual machines; dedicated hardware has now been obtained to run Perfsonar.
- We are currently patching all Grid Services nodes that run a BDII.
Ongoing Disk Server Issues
- Gdss456 (AtlasDataDisk) failed with a read only file system on Wednesday 28th September. Fabric Team have completed their work on this server and it is awaiting re-deployment.
- As reported in recent weeks, we are seeing a high number of 'SMART' errors reported by a particular batch of disk servers. Most of these are spurious and resolved by an updated version of the disk controller firmware. This update has been successfully applied to D0T1 disk servers and will be rolled out to the affected D1T0 disk servers over the next week or two.
Notable Changes made this last week
- A firmware update was applied to the RAL Firewall yesterday morning (Tuesday 15th Nov) which addresses a problem of some packet loss.
Forthcoming Work & Interventions
- Tuesday 22nd November. Failover of main RAL link (from Reading link to London link) during maintenance. Should be transparent.
- Tuesday 29th November. Failover of OPN link to backup during maintenance. Should be transparent.
Declared in the GOC DB
- None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced:
- Update, in rolling manner, the Site and Top-BDII nodes to the UMD release.
- Regular Oracle "PSU" patches are pending.
- Update the disk controller firmware on D1T0 nodes in the batch of servers reporting spurious SMART errors.
- There are also plans to move part of the cooling system onto the UPS supply that may require a complete outage (including systems on UPS).
- Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
- Networking change required to extend range of addresses that route over the OPN.
- Address permissions problem regarding Atlas User access to all Atlas data.
- Replace hardware running Castor Head Nodes (aimed for end of year).
Entries in GOC DB starting between 9th to 16th November 2011.
There was 1 unscheduled entry in the GOC DB for this last week, which was for the problems on the Alice xrootd manager.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole Site | SCHEDULED | OUTAGE | 15/11/2011 07:55 | 15/11/2011 09:00 | 1 hour and 5 minutes | Short outage during work on site network link. We anticipate around a 30 minute break in connectivity but have allowed some contingency. |
srm-alice | UNSCHEDULED | OUTAGE | 14/11/2011 09:30 | 14/11/2011 11:25 | 1 hour and 55 minutes | Investigating problem with xrootd manager. |
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
76023 | Red | very urgent | in progress | 2011-11-05 | 2011-11-16 | Camont | LB query failed |
75395 | Red | urgent | in progress | 2011-10-17 | 2011-11-15 | T2K | WMS 'jumping' (Ticket now with L&B support) |
74353 | Red | very urgent | waiting for reply | 2011-09-16 | 2011-11-07 | Pheno | Proxy not renewing properly from WMS |
68853 | Red | less urgent | On hold | 2011-03-22 | 2011-11-07 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s) |
68077 | Red | less urgent | in progress | 2011-02-28 | 2011-09-20 | | Mandatory WLCG InstalledOnlineCapacity not published |
64995 | Red | less urgent | in progress | 2010-12-03 | 2011-09-20 | | No GlueSACapability defined for WLCG Storage Areas |