RAL Tier1 Operations Report for 24th April 2013
Review of Issues during the week 17th to 24th April 2013.
Resolved Disk Server Issues
- GDSS371 (AtlasTape - D0T1) failed during the evening of Tuesday 16th April. It was returned to production during the following afternoon.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains and investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service.)
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
- There is an outstanding problem (and GGUS ticket) affecting the certificate on the MyProxy server.
Ongoing Disk Server Issues
Notable Changes made this last week
- On Wednesday (17th April) a change was made to the way the batch scheduler fills job slots - as part of ongoing investigations into job set-up failures.
- This morning (24th April) Oracle PSU patches were applied to the databases behind LFC, FTS & Atlas 3D and to the Castor standby databases.
- On Thursday (18th April) the first three disk servers from the second batch of 2012 orders were put into production in AtlasDataDisk.
- This morning a new top BDII node was added into the alias. This replaced a failed server and means there are again three nodes behind the Top-BDII alias.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Oracle patches will be applied to the main Castor databases during an At Risk next Wednesday (1st May).
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing. (Alice disk servers in this batch remain to be done).
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Testing of alternative batch systems (e.g. SLURM).
  - Upgrade of other EMI-1 components (UI) under investigation.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required.
- Infrastructure:
  - Intervention required on the "Essential Power Board" & remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check. This will require significant (maybe 2 days) downtime.
Entries in GOC DB starting between 17th and 24th April 2013.
There were no unscheduled outages during the last week.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, | SCHEDULED | WARNING | 24/04/2013 09:00 | 24/04/2013 13:00 | 4 hours | Warning during application of Oracle patches to back-end databases behind FTS, LFC and Atlas 3D systems.
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
93149 | Red | Less Urgent | On Hold | 2013-04-05 | 2013-04-08 | Atlas | RAL-LCG2: jobs failing with " cmtside command was timed out"
92266 | Red | Less Urgent | Waiting for Reply | 2013-03-06 | 2013-04-16 | | Certificate for RAL myproxy server
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-04-03 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
17/04/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
18/04/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
19/04/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "user timeout"
20/04/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM test failure "user timeout"
21/04/13 | 100 | 100 | 100 | 68.3 | 100 | Problem with CMS's monitoring
22/04/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "user timeout"
23/04/13 | 100 | 100 | 99.2 | 100 | 100 | Single SRM test failure "user timeout"