Latest revision as of 12:09, 2 October 2013
RAL Tier1 Operations Report for 2nd October 2013
Review of Issues during the fortnight 18th September to 2nd October 2013.
- As reported at the last meeting: There was a problem in the Oracle database behind Castor overnight 17/18 Sep. This was a known problem and the fix was to change an Oracle parameter. A Castor outage, with a batch stop/pause, was carried out on Wednesday 18th in order to pick up the changed parameter.
- On Thursday (19th) there was a problem on some batch worker nodes with /tmp filling up and the nodes being set offline. This was traced to some ALICE jobs. Good communications with ALICE enabled a prompt resolution.
- On Saturday (28th Sep) a fault in a CVMFS repository at CERN caused problems, in particular with /cvmfs/lhcb-conddb. This caused the Condor farm's healthcheck to fail, and for a couple of hours (until this test was removed) Condor batch jobs did not start.
- There have been some problems with the Torque/Maui batch server. This affected ALICE test jobs on Saturday (28th). The problem was resolved on Tuesday (1st Oct) when some stuck batch jobs were cleaned up.
Resolved Disk Server Issues
- GDSS670 (AliceDisk - D1T0) failed on Sunday (22nd Sep). A RAID verify failed owing to a faulty disk. It was returned to service the next day.
- GDSS595 (GenTape - D0T1) was unavailable for a couple of hours on Monday (23rd Sep). The system needed to be restarted as it wouldn't see a replacement disk drive.
- GDSS611 (LHCbDst - D1T0) failed on Tuesday (24th Sep). The disk controller was reporting many failed drives. The system was returned to service on Thursday (26th Sep) and has been drained for further investigation.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
- The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity, and all its WNs are running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs will be upgraded to SL6 tomorrow. That farm is expected to restart tomorrow with around 20% of the total batch capacity initially, with the remaining nodes being added over the following days.
- Just before this meeting a problem arose that is believed to be in a network switch stack. We lost access to some older servers (including those in Atlas HotDisk and some older worker nodes). Investigations are ongoing.
Ongoing Disk Server Issues
- GDSS673 (CMSDisk - D1T0) is out of production. It failed on Saturday (28th Sep) - possibly due to a disk failing during a RAID verify. The system was returned to service on Monday (30th Sep). However, it failed again when another disk failed while it was still rebuilding the RAID array.
Notable Changes made this last fortnight.
- The SRMs have been upgraded to SL5.9 with updated errata and kernel (lcgsrm13 on Monday 23rd, the remaining Atlas SRMs on Wed. 25th, all others on Thursday 26th).
- FTS3 was upgraded to version 3.1.14-1 on Wed 25th and to version 3.1.16-1 the next day.
- On Thursday 26th the upgrade of all three Top-BDII nodes (lcgbdii01, lcgbdii03, lcgbdii04) to the latest EMI v3.8.0 release was completed.
- The BDII component on a number of systems (including ARC and CREAM CEs for the Condor farm) has been upgraded.
- Condor farm CEs set to Production in the GOC DB on Monday (30th). At this point the Condor farm had around 50% of total batch capacity.
- Tuesday 1st October: Update to Janet6 infrastructure for the RAL site connections and the backup OPN link to CERN.
- LCGCE01,02,04,10,11 (the Torque/Maui farm) are in an Outage while the WNs are upgraded to SL6.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Tomorrow (Thursday 3rd October) - upgrade of the Torque/Maui batch farm WNs to SL6. Drain already underway.
- Monday 7th October: Replacement of fans in UPS (UPS not available for 4-5 hours).
- On Tuesday 8th October the primary RAL OPN link to CERN will migrate to the SuperJanet 6 infrastructure.
- Re-establishing the paired (2*10Gbit) link to the UKLight router.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Grid Services:
  - Testing of alternative batch systems (Condor) along with ARC & CREAM CEs and SL6 Worker Nodes.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
- Infrastructure:
  - A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this the following issues will be addressed:
    - Intervention required on the "Essential Power Board".
    - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    - Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 18th September and 2nd October 2013.
There was one unscheduled outage in the GOC DB for this period. This is for the stop of Castor (and batch) following a database problem.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (all SRM end points) and all batch (All CEs) | UNSCHEDULED | OUTAGE | 18/09/2013 11:45 | 18/09/2013 14:45 | 3 hours | We are seeing errors in the database behind Castor. A Castor restart will be done to fix this. Will also pause batch jobs during this time.
lcgwms05.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 06/09/2013 13:00 | 11/09/2013 10:55 | 4 days, 21 hours and 55 minutes | Upgrade to EMI-3
lcgce12.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 05/09/2013 13:00 | 04/10/2013 13:00 | 29 days | CE (and the SL6 batch queue behind it) being decommissioned.
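The Duration values in the GOC DB entries above can be cross-checked from the Start and End timestamps. The following is a minimal sketch rather than anything taken from the GOC DB itself: the function name `downtime_duration` and the short service labels are hypothetical, and the timestamps are assumed to be day/month/year in a single time zone.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as printed in the entries above.
FMT = "%d/%m/%Y %H:%M"

def downtime_duration(start: str, end: str):
    """Return the elapsed time between two timestamps as (days, hours, minutes)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60

# Start/End pairs copied from the entries above; the labels are shortened for illustration.
entries = {
    "Castor and batch": ("18/09/2013 11:45", "18/09/2013 14:45"),
    "lcgwms05": ("06/09/2013 13:00", "11/09/2013 10:55"),
    "lcgce12": ("05/09/2013 13:00", "04/10/2013 13:00"),
}

for service, (start, end) in entries.items():
    days, hours, minutes = downtime_duration(start, end)
    print(f"{service}: {days} days, {hours} hours, {minutes} minutes")
```

Under these assumptions the three computed durations agree with the Duration values listed (3 hours; 4 days, 21 hours and 55 minutes; 29 days).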
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
97516 | Red | Urgent | Waiting Reply | 2013-09-23 | 2013-09-30 | T2K | [SE][StatusOfPutRequest][SRM_REQUEST_INPROGRESS] errors.
97479 | Red | Very Urgent | On Hold | 2013-09-20 | 2013-09-30 | Atlas | RAL-LCG2, high job failure rate
97385 | Red | Less Urgent | On Hold | 2013-09-17 | 2013-09-26 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname
95996 | Red | Urgent | On Hold | 2013-07-22 | 2013-09-17 | OPS | SHA-2 test failing on lcgce01
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
18/09/13 | 97.2 | 100 | 90.6 | 93.2 | 93.0 | Castor stop for DB restart caused test failures across all VOs (except Alice). Atlas also had a couple of other SRM SUM test failures.
19/09/13 | 100 | 100 | 100 | 100 | 100 |
20/09/13 | 100 | 100 | 100 | 100 | 100 |
21/09/13 | 100 | 100 | 100 | 100 | 100 |
22/09/13 | 100 | 100 | 100 | 100 | 100 |
23/09/13 | 100 | 100 | 100 | 100 | 100 |
24/09/13 | 100 | 100 | 100 | 100 | 100 |
25/09/13 | 100 | 100 | 98.1 | 100 | 100 | Failed one SUM test during an SRM upgrade, then another (Delete) failure overnight.
26/09/13 | 100 | 100 | 100 | 95.8 | 100 | Single SRM test failure as SRMs were updated.
27/09/13 | 100 | 100 | 100 | 100 | 100 |
28/09/13 | 100 | 95.8 | 100 | 100 | 100 | Batch problem (Cannot connect to batch server).
29/09/13 | 100 | 100 | 100 | 100 | 100 |
30/09/13 | 100 | 100 | 100 | 95.7 | 100 | Single SRM test failure on PUT: Error reading token data header:
01/10/13 | 100 | 100 | 74.8 | 95.8 | 95.8 | Atlas problem affected all sites; single test failure for LHCb during Janet6 transition; single test failure for CMS (timeout).
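As a rough summary of the daily figures above, the per-VO availability can be averaged over the 14 days from 18/09/13 to 01/10/13. This is a minimal sketch only: it assumes a plain unweighted mean of the daily percentages, which is not necessarily how official availability figures are aggregated, and the names `below_100` and `mean_availability` are hypothetical.

```python
DAYS = 14  # 18/09/13 to 01/10/13 inclusive

# Only days with at least one value below 100 are listed (values copied from the
# figures above); every VO is taken as 100 on any day where it is not listed.
below_100 = {
    "18/09/13": {"OPS": 97.2, "Atlas": 90.6, "CMS": 93.2, "LHCb": 93.0},
    "25/09/13": {"Atlas": 98.1},
    "26/09/13": {"CMS": 95.8},
    "28/09/13": {"Alice": 95.8},
    "30/09/13": {"CMS": 95.7},
    "01/10/13": {"Atlas": 74.8, "CMS": 95.8, "LHCb": 95.8},
}

def mean_availability(vo: str) -> float:
    """Unweighted mean of the daily availability percentages for one VO."""
    lost = sum(100.0 - day.get(vo, 100.0) for day in below_100.values())
    return 100.0 - lost / DAYS

for vo in ("OPS", "Alice", "Atlas", "CMS", "LHCb"):
    print(f"{vo}: {mean_availability(vo):.1f}")
```

Storing only the exceptions keeps the data short and makes the days that pulled each average down easy to see.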