RAL Tier1 Operations Report for 12th June 2013
Review of Issues during the fortnight 29th May to 12th June 2013.
- Very high load was seen on some disk servers in the CMS disk pool during the first part of last week (2-4 June). The Castor team made some tuning changes and CMS reduced the load on the disk pool, which resolved the issue.
- On Tuesday 4th June there was a load test of the UPS/Generator. The test ran into problems when a circuit breaker failed to close, and cooling was stopped for around 20 minutes. One batch of worker nodes was manually stopped in response.
- There was a problem with the OPS test availabilities for the Site BDII on Monday/Tuesday (10/11 June) when the test ARC-CEs were added to the BDII. These CEs were subsequently removed from the BDII information, but it took some time for the tests to clear. (A sketch of how the published entries can be inspected follows this list.)
- There are ongoing intermittent problems starting LHCb batch jobs.
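For reference, the CE entries a site BDII publishes can be inspected over anonymous LDAP, which is how incomplete GLUE data of the kind seen above would show up. The following is a minimal sketch, assuming the python "ldap3" package; the BDII hostname is a placeholder (the real endpoint is not given here), while RAL-LCG2 is the site's GOC DB name.

```python
# Minimal sketch: list the CE entries a site BDII publishes.
# Assumes the "ldap3" package; hostname below is a placeholder.
from ldap3 import Server, Connection

# Site BDIIs serve GLUE data over anonymous LDAP on port 2170.
server = Server("site-bdii.example.ac.uk", port=2170)
conn = Connection(server, auto_bind=True)  # anonymous bind

# GLUE 1.3 entries for a site live under mds-vo-name=<SITE>,o=grid.
conn.search(
    search_base="mds-vo-name=RAL-LCG2,o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateStatus"],
)

# A CE publishing incomplete information would show up here as
# missing entries or odd attribute values.
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateStatus)
```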
Resolved Disk Server Issues
- GDSS713 (CMSDisk - D1T0) crashed on the morning of Thursday 30th May. It was returned to service the following morning (31st May). No hardware faults were found during testing.
Current operational status and issues
- Following the failure of the UPS/Generator load test on 4th June we are currently running without generator backup.
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The problem of LHCb jobs failing due to long job set-up times remains, and investigations continue. Recent updates to the CVMFS clients have improved the situation for Atlas.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
Ongoing Disk Server Issues
None.
Notable Changes made this last week
- On Thursday (6th June) the Atlas 3D database ("Ogma") was unavailable for around 90 minutes while the Oracle voting disk was reconfigured.
- Work has continued on testing newer versions of CVMFS to investigate the job set-up problems. (A sketch of the kind of timing check involved follows.)
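For illustration, the slow job set-up shows itself as high first-access latency on the CVMFS mount. The sketch below is a hypothetical check, not the team's actual diagnostic; it assumes a worker node with the standard /cvmfs/lhcb.cern.ch repository mounted and uses only the Python standard library.

```python
# Hypothetical sketch: time cold-cache metadata access on a CVMFS mount.
import os
import time

REPO = "/cvmfs/lhcb.cern.ch"  # standard LHCb CVMFS repository path

def timed_walk(root, limit=500):
    """Stat up to `limit` entries under `root`; return (seconds, count).

    On a cold CVMFS cache each lstat may trigger a network fetch of
    catalogue data, so a long elapsed time points at client/cache issues.
    """
    start = time.monotonic()
    seen = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))  # lstat: don't follow links
            seen += 1
            if seen >= limit:
                return time.monotonic() - start, seen
    return time.monotonic() - start, seen

elapsed, n = timed_walk(REPO)
print(f"stat'ed {n} files under {REPO} in {elapsed:.1f}s")
```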
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Following on from the failure of the UPS/Generator load test, a failed battery on a control board needs to be replaced. Once that has been done, the test will be re-scheduled.
- Re-establishing the paired (2*10Gbit) link to the UKLight router. (We aim to do this in the next few weeks.)
- The problem reported last week following the upgrade of the non-Tier1 'facilities' Castor instance to version 2.1.13 is now understood and fixed. We will continue to monitor this closely before re-scheduling the upgrade of the Tier1 Castor instances.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Single link to the UKLight Router to be restored as a paired (2*10Gbit) link.
  - Update the core Tier1 network and change the connection to the site and OPN, including:
    - Install a new routing layer for the Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Grid Services:
  - Testing of alternative batch systems (SLURM, Condor) along with ARC-CEs and SL6 worker nodes.
  - Upgrade of the one remaining EMI-1 component (UI) is being planned.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
  - A 2-day maintenance window is being planned for sometime in October or November to cover the following items. It is expected to require around a half-day outage of power to the UPS room, with Castor and batch down for the remaining 1.5 days as equipment is switched off in rotation for the tests.
    - Intervention required on the "Essential Power Board", plus remedial work on three (of four) transformers.
    - Remedial work on the BMS (Building Management System), one of whose three modules is faulty.
    - Electrical safety check. This will require significant downtime (most likely the full 2 days), during which the above infrastructure issues will also be addressed.
Entries in GOC DB starting between 29th May and 12th June 2013.
There were no unscheduled entries in the GOC DB starting during the last fortnight.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | SCHEDULED | WARNING | 04/06/2013 10:00 | 04/06/2013 12:00 | 2 hours | Warning (At Risk) during test of UPS generator.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
94755 | Green | Urgent | Waiting Reply | 2013-06-10 | 2013-06-11 | | Error retrieving data from lcgwms04
94731 | Green | Less Urgent | In Progress | 2013-06-07 | 2013-06-10 | cernatschool | WMS for cernatschool.org
94543 | Red | Less Urgent | Waiting Reply | 2013-06-04 | 2013-06-11 | SNO+ | Job outputs not being retrieved
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-05-29 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host
Availability Report:
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
29/05/13 | 100 | 100 | 100 | 100 | 100 |
30/05/13 | 100 | 100 | 100 | 100 | 100 |
31/05/13 | 100 | 100 | 100 | 100 | 100 |
01/06/13 | 100 | 100 | 100 | 100 | 100 |
02/06/13 | 100 | 100 | 99.0 | 100 | 100 | Single SRM Put test failure: "Zero number of replicas".
03/06/13 | 100 | 100 | 100 | 100 | 100 |
04/06/13 | 100 | 100 | 100 | 100 | 100 |
05/06/13 | 100 | 100 | 98.2 | 100 | 100 | Two separate test failures ("Zero number of replicas", "User timeout").
06/06/13 | 100 | 100 | 99.1 | 100 | 100 | Single test failure to delete a file.
07/06/13 | 100 | 100 | 100 | 100 | 100 |
08/06/13 | 100 | 100 | 100 | 100 | 100 |
09/06/13 | 100 | 100 | 98.2 | 96.0 | 100 | Atlas: several failures to delete the test file. CMS: single failure to get a file.
10/06/13 | 49.4 | 100 | 100 | 100 | 100 | ARC-CEs, which are under test, were added to the BDII; the information they published was incomplete and the site failed BDII sanity checks until they were removed.
11/06/13 | 89.4 | 100 | 98.4 | 100 | 100 | OPS: continuation of the ARC-CE/BDII issue; although it was fixed during the previous working day, the test did not clear until the early hours of the morning. Atlas: failures to connect to the SRM, probably a problem elsewhere, as a few other sites (including a couple of Tier1s) saw the same error at roughly the same time.