RAL Tier1 Operations Report for 3rd July 2013
Review of Issues during the week 26th June to 3rd July 2013.
- On Wednesday afternoon (26th June) a problem appeared on the CVMFS stratum-1 server hosted at RAL; the server was unavailable until Monday 1st July. This affected the Tier1 batch system, as CVMFS clients did not fail over cleanly, and was compounded by the failure coinciding with a period when CMS's software in CVMFS was also broken. The failure also affected Tier2s, and CVMFS for the minor (non-LHC) VOs was impacted as well. A mirror for some of the VOs supported on that server (MICE, NA62 and H1) had recently become available at CERN, and information on how to re-configure CVMFS clients to use it was distributed. A post mortem report for this incident is being prepared.
- On Friday (28th June) a problem arose with the Atlas Castor instance. Initial focus was on the SRM database; however, after working on this it became clear there were further, separate problems with the Atlas Castor instance, which led to the instance being declared down for part of the weekend. The problem was fixed late on Saturday evening, and the outage was ended in the GOC DB on Sunday morning after the instance had run OK overnight. The problem has subsequently been traced to a known bug in Castor 2.1.12; a hotfix (not applied at RAL) had been available for it, and it is fixed in version 2.1.13. A post mortem report for this incident is being prepared.
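In essence, the re-configuration distributed for the minor VOs points the CVMFS client at the new CERN replica so that it has more than one stratum-1 to fail over to. A minimal sketch of such a client setting follows; the URLs are illustrative placeholders, not the actual replica addresses given in the information distributed to sites:

```shell
# Illustrative CVMFS client setting (e.g. in /etc/cvmfs/default.local).
# The hostnames below are placeholders, not the advisory's actual values.
# Listing more than one stratum-1, separated by ';', allows the client
# to fail over when one replica is unavailable; @fqrn@ is expanded by
# the client to the fully qualified repository name.
CVMFS_SERVER_URL="http://cvmfs-stratum-one.example.cern.ch/cvmfs/@fqrn@;http://cvmfs-replica.example.rl.ac.uk/cvmfs/@fqrn@"
```

After changing the setting, the client would typically need its configuration reloaded (for example via `cvmfs_config reload`) for the new server list to take effect.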
Resolved Disk Server Issues
Current operational status and issues
- The problem reported last week, in which the RAL site firewall's logging of a large number of connection requests caused high load, has been resolved in the short term; however, the number of ALICE batch jobs remains capped. A report on the problem is being prepared, and once complete we will contact ALICE to pursue the matter.
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The problem of LHCb jobs failing due to long job set-up times remains, and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
- The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE and LHCb are being brought on board with the testing.
Ongoing Disk Server Issues
Notable Changes made this last week
- Castor Nameserver Upgraded to version 2.1.13-9.
- Site and Top BDIIs now all on EMI-3 / SL6.
- Puppet (used for some Castor configuration management) upgraded.
- On Thursday (27th June) the test ARC-CEs were added into the BDII.
- As part of investigating both the job failures at start-up and last week's CVMFS client issues, CVMFS client version 2.1.11 has been rolled out across most of the batch farm.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Following the Castor Nameserver upgrade today (3rd July) we plan to upgrade the individual instance stagers on the following dates: Atlas: Wednesday 10th July; CMS & GEN: Tuesday 23rd July; LHCb: Tuesday 30th July.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13 (ongoing)
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Grid Services:
- Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
- Fabric:
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
- A 2-day maintenance is being planned for sometime in October or November to cover the following items. It is expected to require around a half-day outage of power to the UPS room, with Castor and batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 26th June and 3rd July 2013.
There were four unscheduled entries in the GOC DB, all for problems with the Atlas storage: the first for a problem with the Atlas SRM database (on Wednesday 26th June), the remainder for the Atlas Castor problem that ran from Friday to Sunday (28th - 30th June).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor and Batch (all SRMs and CEs) | SCHEDULED | OUTAGE | 03/07/2013 09:00 | 03/07/2013 14:00 | 5 hours | Castor nameserver upgrade to version 2.1.13-9. Castor and batch services unavailable.
srm-atlas | UNSCHEDULED | OUTAGE | 29/06/2013 13:00 | 30/06/2013 11:36 | 22 hours and 36 minutes | Ongoing problems with the Atlas Castor instance.
srm-atlas | UNSCHEDULED | OUTAGE | 28/06/2013 18:00 | 29/06/2013 13:00 | 19 hours | Ongoing problem with the Atlas Castor instance.
srm-atlas | UNSCHEDULED | OUTAGE | 28/06/2013 14:30 | 28/06/2013 18:00 | 3 hours and 30 minutes | Problem with the Atlas Castor instance being investigated.
srm-atlas | UNSCHEDULED | OUTAGE | 26/06/2013 10:40 | 26/06/2013 11:50 | 1 hour and 10 minutes | Problem with the database behind the Atlas SRM.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-07-02 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
26/06/13 | 100 | 97.6 | 99.0 | 100 | 100 | Alice: CE test jobs cancelled (timeout/dropped). Atlas: single SRM test failure (could not open connection to srm-atlas.gridpp.rl.ac.uk).
27/06/13 | 100 | 100 | 97.2 | 100 | 100 | CVMFS problem (/cvmfs/atlas.cern.ch/repo/sw: Transport endpoint is not connected).
28/06/13 | 100 | 100 | 45.9 | 100 | 100 | Problem with the Atlas Castor database.
29/06/13 | 100 | 100 | 7.4 | 100 | 100 | Problem with the Atlas Castor database.
30/06/13 | 100 | 100 | 61.3 | 100 | 100 | Test jobs repeatedly landing on a bad WN (with CVMFS not working - cd: /cvmfs/atlas.cern.ch/repo/sw: Transport endpoint is not connected).
01/07/13 | 100 | 100 | 77.9 | 100 | 100 | Mainly the above problem recurring overnight (accounting for maybe 20% of the lost day), plus some 'user timeouts' on the SRM test (the remaining few percent).
02/07/13 | 100 | 100 | 94.8 | 100 | 100 | Again a mixture of the above two reasons.
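Several of the Atlas availability losses above trace back to worker nodes with a stale CVMFS mount ("Transport endpoint is not connected"). A minimal health-check sketch that a node self-test could run is given below; the repository path is the standard Atlas one, while the 30-second timeout is an assumption, not a value taken from our configuration:

```shell
#!/bin/sh
# Probe the Atlas CVMFS repository. A stale mount either hangs or fails
# with "Transport endpoint is not connected", so treat any failure of
# the listing (including a timeout) as an unhealthy mount.
REPO=/cvmfs/atlas.cern.ch/repo/sw
if timeout 30 ls "$REPO" >/dev/null 2>&1; then
    echo "cvmfs mount healthy"
else
    echo "cvmfs mount unhealthy"
fi
```

A batch system could run such a probe from a job-start wrapper and set the node offline on failure, which would have avoided test jobs repeatedly landing on the bad WN noted above.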