Latest revision as of 09:02, 16 October 2013
RAL Tier1 Operations Report for 16th October 2013
Review of Issues during the week 9th to 16th October 2013.
- The Torque/Maui farm has been problematic during this last week. It is currently running with 50% of our total batch capacity.
- A single file was reported lost to CMS. It was on CMSDisk. This was uncovered by the checksum checker. A copy was still available on CMSTape here at RAL and was copied across.
Resolved Disk Server Issues
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xroot redirector, whereas before it could only serve as a proxy.
- The Condor batch farm has been marked as in production. It contains around 50% of the total batch capacity, and all its WNs are running SL6. The remaining nodes are in the Torque/Maui farm, whose WNs have also been upgraded to SL6. We plan to keep this configuration (both farms running SL6 WNs, each with 50% of the total capacity) until early November.
Ongoing Disk Server Issues
Notable Changes made this last week.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Re-establishing the paired (2*10Gbit) link to the UKLight router. This is proposed to take place next Wednesday morning, 23rd October, during which Castor will be stopped and batch paused.
- Interruption to services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits. Initial plans propose Castor down for the day on Tuesday 5th.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required, initially to update the disk array's firmware.
- Infrastructure:
  - A 2-day maintenance on the UPS along with the safety testing of associated electrical circuits is being planned for the 5th/6th November (TBC). The impact of this on our services is still being worked out. During this the following issues will be addressed:
    - Intervention required on the "Essential Power Board".
    - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
    - Electrical safety check. This will take place over a couple of days during which time individual UPS circuits will need to be powered down.
Entries in GOC DB starting between the 9th and 16th October 2013.
There were no entries in the GOC DB for the last week.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
97908 | Amber | Less Urgent | In Progress | 2013-10-09 | 2013-10-09 | | Backup UK VOMS servers
97868 | Red | Less Urgent | In Progress | 2013-10-08 | 2013-10-14 | T2K | CVMFS for t2k.org
97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-10-04 | OPS | SHA-2 test failing on lcgce01
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-06-17 | | correlated packet-loss on perfsonar host

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
09/10/13 | 100 | 100 | 97.1 | 96.0 | 100 | Atlas: SRM test failure (Invalid argument); CMS: SRM test failure (Error reading token data header)
10/10/13 | 100 | 100 | 100 | 100 | 100 |
11/10/13 | 100 | 100 | 100 | 100 | 100 |
12/10/13 | 100 | 100 | 100 | 100 | 100 |
13/10/13 | 100 | 100 | 97.8 | 100 | 100 | Problem with Torque/Maui batch server.
14/10/13 | 100 | 100 | 100 | 100 | 100 |
15/10/13 | 100 | 100 | 99.2 | 100 | 100 | SRM test failure (Too many threads busy with Castor)