RAL Tier1 Operations Report for 10th April 2013
The Post Mortem review of the failure of disk server GDSS594 (GenTape) in February that led to the loss of 68 T2K files has been completed. This can be seen at:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20130219_Disk_Server_Failure_File_Loss
Review of Issues during the week 3rd to 10th April 2013.
- On Tuesday morning, 9th April, a planned intervention on the site networking ran into problems, and the RAL site was disconnected from the rest of the world for around 100 minutes. The intervention had previously been announced as a scheduled 'Warning' in the GOC DB and the FTS had been drained. Internal Tier1 services carried on OK during the external break.
- Two files were declared lost to Atlas following the failure of disk server GDSS454.
Resolved Disk Server Issues
- GDSS454 (AtlasDataDisk D1T0) failed with a read-only file system on Sunday 7th April. Following checks it was returned to service on Monday (8th). Two files that were being written at the time of the failure were declared lost to Atlas.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to check for this regularly and take action to minimise its effects.
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains; investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue, with five SL6/EMI-2 worker nodes and its own CE, is in place.
- There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
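The batch start-rate check mentioned above could be sketched roughly as follows. This is a minimal illustration only, not the actual RAL script: the function names, threshold, and window length are all assumptions.

```python
# Sketch of a batch job start-rate check (hypothetical; not the actual RAL script).
# Counts jobs started within a recent window and flags when the rate is too low,
# which would be the trigger for corrective action on the batch system.
import time

START_RATE_THRESHOLD = 10   # assumed: minimum job starts per check window
CHECK_WINDOW_SECONDS = 600  # assumed: look back over the last 10 minutes

def starts_in_window(start_times, now, window=CHECK_WINDOW_SECONDS):
    """Count job start timestamps (epoch seconds) falling inside the window."""
    return sum(1 for t in start_times if now - window <= t <= now)

def needs_action(start_times, now):
    """True when too few jobs have started recently, i.e. the start rate
    has stalled and the script should take its corrective action."""
    return starts_in_window(start_times, now) < START_RATE_THRESHOLD

# Example: only 3 starts in the last 10 minutes -> action needed.
now = time.time()
recent = [now - 60, now - 120, now - 300]
print(needs_action(recent, now))  # True
```

In practice such a check would run from cron and query the batch system (e.g. Torque/Maui job logs) for the start timestamps; the corrective action itself is site-specific and not shown here.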
Ongoing Disk Server Issues
Notable Changes made this last week
- Updating of disk controller firmware on the Clustervision '11 batch of disk servers is ongoing (the LHCb servers were done this week).
- Kernel/errata updates and removal of AFS software (as opposed to just disabling) being done across worker nodes.
- New disk servers deployed in production (540TB to AtlasDataDisk; 720TB to CMSDisk).
- One of the two batches of new worker nodes (the one from OCF) has been deployed into production.
- This evening (Wed 10th April, 18:00 - 23:59 BST) there is emergency maintenance affecting both the main and backup links to CERN. The site is declared as 'Warning'.
- Tomorrow (Thursday 11th April) Outage of LFC and FTS services (10:00 - 12:00). The Oracle database behind these services uses two disk arrays. One of the arrays is reporting errors and the database will be reconfigured (rebalanced) to move the data off the faulty array. FTS transfers will be drained before the outage.
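The drain-then-intervene pattern used for the FTS outage above can be sketched as follows. The helper names are illustrative assumptions; a real drain is performed through the FTS admin tools, with this kind of loop only confirming that in-flight transfers have completed before work starts.

```python
# Sketch of waiting for an FTS drain to complete before an outage
# (illustrative only; the count_active callable stands in for whatever
# query reports the number of in-flight transfers).
import time

def wait_for_drain(count_active, timeout=3600, poll_interval=60, sleep=time.sleep):
    """Poll count_active() until no transfers remain or the deadline passes.

    count_active : callable returning the number of in-flight transfers
    Returns True if the service drained within the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_active() == 0:
            return True
        sleep(poll_interval)
    return count_active() == 0

# Example with a fake counter that reports 2, then 1, then 0 transfers:
remaining = [2, 1, 0]
drained = wait_for_drain(lambda: remaining.pop(0) if remaining else 0,
                         timeout=10, poll_interval=0, sleep=lambda s: None)
print(drained)  # True
```

Injecting the sleep function keeps the loop testable without real delays; the default of one hour matches the drain period quoted in the GOC DB entries below.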
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- One of the disk arrays hosting the LFC/FTS/3D databases has given some errors. An intervention to move the 'somnus' (LFC & FTS) data off this array is planned for tomorrow. A further intervention will be required on the array itself which will affect the Atlas 3D service.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Addition of caching DNSs into the Tier1 network.
- Grid Services
- Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check (will require some downtime).
Entries in GOC DB starting between 3rd and 10th April 2013.
There were two unscheduled entries in the GOC DB for the last week: one outage (for the problematic network intervention) and one warning (for emergency maintenance on the CERN OPN links).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 10/04/2013 18:00 | 11/04/2013 00:00 | 6 hours | Emergency maintenance has been announced for both the main and backup OPN links RAL - CERN.
Whole Site | UNSCHEDULED | OUTAGE | 09/04/2013 07:45 | 09/04/2013 09:25 | 1 hour 40 minutes | Problem during planned network intervention broke connectivity to the site. (Retrospective addition to the GOC DB; the intervention was originally declared as a Warning.)
Whole Site | SCHEDULED | WARNING | 09/04/2013 07:30 | 09/04/2013 08:30 | 1 hour | At Risk around two short (few-minute) breaks in external connectivity to the RAL Tier1 required for a network upgrade. The FTS will be drained for an hour beforehand as a precaution.
Whole Site | UNSCHEDULED | WARNING | 03/04/2013 18:00 | 04/04/2013 00:00 | 6 hours | Emergency maintenance has been announced for both the main and backup OPN links RAL - CERN. No outage is expected during this maintenance; services are considered at risk only.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
93149 | Green | Less Urgent | On Hold | 2013-04-05 | 2013-04-08 | Atlas | RAL-LCG2: jobs failing with "cmtside command was timed out"
93136 | Yellow | Less Urgent | In Progress | 2013-04-05 | 2013-04-05 | EPIC | Problems downloading job output using RAL WMS (epic VO)
92266 | Red | Less Urgent | In Progress | 2013-03-06 | 2013-04-09 | | Certificate for RAL myproxy server
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in querying jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/03/13 | 100 | 100 | 100 | 100 | 100 |
03/04/13 | 100 | 100 | 100 | 100 | 99.3 | Job cancelled/purged.
04/04/13 | 100 | 100 | 99.2 | 95.9 | 100 | Atlas: single SRM test failure "User timeout". CMS: single SRM test failure "User timeout".
05/04/13 | 100 | 100 | 100 | 100 | 100 |
06/04/13 | 100 | 100 | 100 | 99.4 | 100 | Single SRM test failure "User timeout" at very end of day (main effect is on the 7th).
07/04/13 | 100 | 100 | 100 | 92.5 | 100 | Single SRM test failure "User timeout" at very end of day (main effect is on the 7th).
08/04/13 | 100 | 100 | 99.1 | 87.7 | 100 | Atlas: 1 * "could not open connection to srm-atlas.gridpp.rl.ac.uk". CMS: total of three SRM test failures; 1 * "could not open connection to srm-cms.gridpp.rl.ac.uk", 2 * "User timeout".
09/04/13 | 91.7 | 100 | 92.7 | 90.4 | 93.4 | Problem during planned central networking intervention disconnected site for around 100 minutes.