RAL Tier1 Operations Report for 17th July 2013
Review of Issues during the week 10th to 17th July 2013.
- The CVMFS problems, notably affecting CMS, continued while we verified that CVMFS client version 2.1.12 works correctly. It does, and this version has now been rolled out across the batch farm.
- The Atlas Castor 2.1.13-9 upgrade overran significantly (by 4 hours) last Wednesday (10th July). The problems were in updating the configurations and OS of the head nodes and disk servers; the upgrade itself completed successfully.
- The problem reported last week of failing connections to the batch server has continued. It started at the same time as the batch server was updated; that update was rolled back last Thursday (11th) but the problem remains.
- On Wednesday late afternoon monitoring showed unusual activity (or lack of it) on the Castor GEN instance, which was put into a 'warning' state in the GOC DB overnight. No problems were subsequently identified.
Resolved Disk Server Issues
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The problem of LHCb jobs failing due to long job set-up times is still under investigation. The recent update of the CVMFS clients to v2.1.12 is promising.
- The testing of FTS3 is continuing and the service is being put on a more 'production' footing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- Testing is ongoing with the proposed new batch system (ARC-CEs, Condor, SL6). Atlas and CMS are running work through this; ALICE, LHCb & H1 are being brought on board with the testing.
Ongoing Disk Server Issues
- On Thursday evening, 11th July, GDSS664 (AtlasDataDisk, D1T0) failed. There have been significant problems rebuilding the RAID array containing the data, and at one point Atlas were warned that we might have data loss. However, the server was brought back up on Tuesday (16th) and, following checksumming of a sample of files to validate the data, the server is being drained ahead of further investigations.
Notable Changes made this last week
- Castor Atlas instance (stager) was upgraded to version 2.1.13-9 last Wednesday (10th).
- CVMFS client version 2.1.12 has been rolled out to most of the batch farm.
- Software updates applied to the batch server the week before were rolled back on Thursday 11th July.
- The two ARC-CEs were added to the GOC DB a week ago and were set to 'monitored' this Monday (15th).
Declared in the GOC DB
- Tuesday 23rd July: Upgrade of CMS and LHCb Castor instances to version 2.1.13-9.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Two reboots of site firewall between 07:45 and 08:45: Tuesday 23rd July.
- Update the remaining Castor stagers on the following dates: CMS & LHCb: Tuesday 23rd July; GEN Tuesday 30th July.
- Wednesday 24th July: Transition of the Thames Valley Network to Janet 6.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Upgrade to version 2.1.13 (ongoing)
- Networking:
- Single link to UKLight Router to be restored as paired (2*10Gbit) link.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Grid Services:
- Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 Worker Nodes.
- Fabric:
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
- A 2-day maintenance is being planned for October or November to cover the following items. It is expected to require around a half-day outage of power to the UPS room, with Castor & batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
- Intervention required on the "Essential Power Board" & Remedial work on three (out of four) transformers.
- Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
- Electrical safety check. This will require significant (most likely 2 days) downtime during which time the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 10th and 17th July 2013.
There were three unscheduled entries in the GOC DB. One was an unscheduled OUTAGE, when the Atlas Castor upgrade overran. The other two were unscheduled WARNINGs: one for the batch system (as a change made earlier was reverted), the other for the Castor 'GEN' instance, which was experiencing problems.
| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
| All CEs: lcgce01, lcgce02, lcgce04, lcgce10, lcgce11, lcgce12 | UNSCHEDULED | WARNING | 11/07/2013 09:30 | 11/07/2013 10:35 | 1 hour and 5 minutes | Batch service At Risk during work on batch server. |
| Castor GEN instance: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 10/07/2013 18:00 | 11/07/2013 09:25 | 15 hours and 25 minutes | Some problems seen with the Castor GEN instance which are not fully understood. Instance working but put in Warning overnight. |
| srm-atlas | UNSCHEDULED | OUTAGE | 10/07/2013 14:00 | 10/07/2013 18:00 | 4 hours | Extending outage of the Atlas Castor instance as the upgrade overran. |
| srm-atlas | SCHEDULED | OUTAGE | 10/07/2013 09:00 | 10/07/2013 14:00 | 5 hours | Upgrade of Atlas Castor Stager to version 2.1.13-9. |
Open GGUS Tickets (Snapshot at time of meeting)
| GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
| 95820 | Green | Less Urgent | In Progress | 2013-07-17 | 2013-07-17 | CMS | Many errors with file access at RAL today, maybe related to high load (~5000 jobs running) on the file server. |
| 95757 | Green | Less Urgent | In Progress | 2013-07-15 | 2013-07-17 | CMS | Jobs are failing at a particular node. |
| 95671 | Yellow | Less Urgent | In Progress | 2013-07-11 | 2013-07-17 | LHCb | Many jobs are failing at T1_UK_RAL, related to the availability of a CMSSW release. |
| 95435 | Red | Urgent | In Progress | 2013-07-04 | 2013-07-04 | LHCb | CVMFS problem at RAL-LCG2 |
| 91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-07-16 | | LFC webdav support |
| 86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host |
Availability Report

| Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
| 10/07/13 | 100 | 97.1 | 67.6 | 100 | 100 | Atlas: Castor upgrade; ALICE: CE test failures (CEs could not contact batch server). |
| 11/07/13 | 100 | 97.5 | 100 | 100 | 96.2 | CE test failures (CEs could not contact batch server). |
| 12/07/13 | 100 | 100 | 100 | 100 | 96.9 | CE test failures (CEs could not contact batch server). |
| 13/07/13 | 100 | 100 | 98.7 | 100 | 100 | SRM test failures (Castor). |
| 14/07/13 | 100 | 100 | 90.4 | 100 | 100 | SRM test failures (Castor). |
| 15/07/13 | 100 | 100 | 94.8 | 91.9 | 100 | SRM test failures (Castor). |
| 16/07/13 | 100 | 96.9 | 100 | 95.9 | 100 | ALICE: CE test failure; CMS: SRM test failures (Castor). |