RAL Tier1 Operations Report for 13th March 2013
Review of Issues during the week 6th to 13th March 2013.
- On Wednesday & Thursday (6/7 March) there were problems on the RAL network. One "Outage" plus two "Warnings" were declared in the GOC DB. The problems caused intermittent breaks in the Tier1's connectivity - mainly to the outside world but also to the rest of RAL. The core networking team found and fixed these problems on the Thursday afternoon.
- On Monday (11th) there was a problem with tape migration traced to a single corrupt file on a disk server. The data loss has been reported to T2K.
- The planned network intervention yesterday (Tues 12th March) overran significantly. The outage lasted just over 12 hours in total (08:45 to 21:00), almost double the 6 hours 45 minutes planned. Furthermore, following problems during the intervention, the network uplink to the UKLight router is now running on a single 10Gbit link, rather than a pair of such links.
Resolved Disk Server Issues
- GDSS594 (GenTape), which failed a few weeks ago, has now been retired from service and will be used for spares.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects.
- We are investigating a higher job failure rate for LHCb and Atlas. This appears to be caused by job set-ups taking a long time. Yesterday (12th Mar) we made a change to run jobs re-niced, although initial results suggest this has not fixed the problem.
- Investigations are ongoing into asymmetric bandwidth to/from other sites. We are seeing some poor outbound rates - a problem which disappears when we are not loading the network. Awaiting confirmation of the effect of the changes yesterday (12th Mar).
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
Ongoing Disk Server Issues
- GDSS519 (GenTape) was put into a draining mode following discovery of a single corrupt file. Following the migration of all remaining files it has been taken out of production to be checked out.
Notable Changes made this last week
- The core network switch in the Tier1 Network has been replaced (Tuesday 12th March) providing more ports for network expansion.
- During the network change (Tuesday 12th March) the uplink to one of the network stacks (stack 13), serving SL09 disk servers (~3PB of storage), was doubled in capacity (from 2*10Gbit to 4*10Gbit) to resolve a bottleneck.
- Batch queue parameters were modified to run jobs on the worker nodes re-niced.
- The RAL site BDIIs have been upgraded to EMI-2.
- New EMI-2 WMS nodes (lcgwms04, lcgwms05, lcgwms06) have been added into production; the old EMI-1 ones (lcgwms01, 02, 03) will be drained and retired shortly (in any case by the end of March).
- The Castor client software has been upgraded to version 2.1.13 on one batch of worker nodes.
- The update of disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- A change to the certificate used by the MyProxy server will be introduced on Monday (18th Mar).
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
  - Intervention required on the "Essential Power Board" & remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check (will require some downtime).
Entries in GOC DB starting between 6th and 13th March 2013.
There were five unscheduled entries in the GOC DB for last week. Three of these (one "Outage", two "Warnings") were for the RAL networking problems on Wed/Thu 6/7 March. The other two were unscheduled extensions to a planned downtime to restructure the Tier1 network on 12th March.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | OUTAGE | 12/03/2013 19:00 | 12/03/2013 21:00 | 2 hours | The main work on our network is over however it is taking a little time to restore services. Unfortunately it is therefore necessary to make a small further extension to our downtime.
Whole Site | UNSCHEDULED | OUTAGE | 12/03/2013 15:30 | 12/03/2013 19:00 | 3 hours and 30 minutes | Extending Outage as some problems encountered during the intervention to reconfigure the core of the Tier1's network.
Whole Site | SCHEDULED | OUTAGE | 12/03/2013 08:45 | 12/03/2013 15:30 | 6 hours and 45 minutes | Reconfiguration of core network within the RAL Tier1. Storage (Castor) services will be stopped. LFC stopped. FTS and Batch drained of active transfers/jobs. Other services (e.g. BDII) may see some short breaks in connectivity.
Whole Site | UNSCHEDULED | WARNING | 07/03/2013 10:00 | 07/03/2013 16:30 | 6 hours and 30 minutes | Some network issues ongoing and under investigation.
Whole Site | UNSCHEDULED | WARNING | 06/03/2013 15:00 | 07/03/2013 10:00 | 19 hours | At risk while recovering from network outage.
Whole Site | UNSCHEDULED | OUTAGE | 06/03/2013 09:30 | 06/03/2013 15:00 | 5 hours and 30 minutes | Network outage at the RAL tier1
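As a cross-check on the 12th March figures, the planned and unplanned parts of that outage can be added up from the start and end times of the three "Whole Site" OUTAGE entries above. The following is a minimal sketch only: the names `ENTRIES`, `duration` and `total_by_kind` are illustrative and not part of any GOC DB tooling, and the timestamps are copied by hand from the entries rather than fetched from the GOC DB.

```python
from datetime import datetime, timedelta

# Timestamp format used in the GOC DB entries listed above.
FMT = "%d/%m/%Y %H:%M"

# (scheduled?, start, end) for the three Whole Site OUTAGE entries on
# 12 March, copied by hand from the entries above (illustrative data
# structure, not a GOC DB export format).
ENTRIES = [
    ("SCHEDULED", "12/03/2013 08:45", "12/03/2013 15:30"),
    ("UNSCHEDULED", "12/03/2013 15:30", "12/03/2013 19:00"),
    ("UNSCHEDULED", "12/03/2013 19:00", "12/03/2013 21:00"),
]


def duration(start, end):
    """Return the length of one entry as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def total_by_kind(entries):
    """Sum entry durations separately for SCHEDULED and UNSCHEDULED."""
    totals = {}
    for kind, start, end in entries:
        totals[kind] = totals.get(kind, timedelta()) + duration(start, end)
    return totals


totals = total_by_kind(ENTRIES)
planned = totals["SCHEDULED"]
overrun = totals["UNSCHEDULED"]
total = planned + overrun

print("planned:", planned)
print("overrun:", overrun)
print("total:  ", total)
print("total / planned: %.2f" % (total / planned))
```

With the times as listed, this gives 6 hours 45 minutes planned, 5 hours 30 minutes of unscheduled extension and 12 hours 15 minutes in total, a ratio of about 1.8, which matches the "almost double" description of the overrun.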
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92459 | Green | Less Urgent | In Progress | 2013-03-12 | 2013-03-13 | EPIC | LFC support for epic.vo.gridpp.ac.uk VO
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-08 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-03-04 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91687 | Red | Less Urgent | In Progress | 2013-02-21 | 2013-03-06 | epic | Support for epic.vo.gridpp.ac.uk VO on WMS
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-02-22 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-05 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-06 | | correlated packet-loss on perfsonar host

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
06/03/13 | 97.3 | 100 | 99.2 | 84.5 | 91.8 | Network problems affecting RAL
07/03/13 | 100 | 100 | 77.3 | 83.4 | 83.4 | Network problems affecting RAL
08/03/13 | 100 | 100 | 100 | 100 | 100 |
09/03/13 | 100 | 100 | 100 | 100 | 100 |
10/03/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM Test failure - User timeout.
11/03/13 | 100 | 100 | 100 | 100 | 100 |
12/03/13 | 44.2 | 25.2 | 43.4 | 43.4 | 41.7 | Planned network update (C300 replacement) which overran