RAL Tier1 Operations Report for 20th March 2013
Review of Issues during the week 13th to 20th March 2013.
- On Friday afternoon (15th Mar) at approximately 15:40 there was a problem on our Tier1 network that lasted about 6 minutes. This caused a spike in FTS transfer failures as well as some SUM test failures. In general, services continued to run normally, but staff checked round for possible problems.
- On Tuesday (19th Mar) at around midday there was a problem on the site network that lasted around 15 minutes. Tier1 services carried on running although there were some test failures at this time.
Resolved Disk Server Issues
- GDSS519 (GenTape D0T1) was put into a draining mode following discovery of a single corrupt file last Wed morning (13th Mar). The server was checked out, confirming only one file was bad and revealing a faulty disk drive that was replaced. The server was returned to production later that day.
Current operational status and issues
- The uplink to the UKLight Router is running on a single 10Gbit link, rather than a pair of such links.
- There have been intermittent problems over the past few weeks with the start rate for batch jobs. A script has been introduced to regularly check for this and take action to minimise its effects.
- The problem of LHCb and Atlas jobs failing due to long job set-up times remains. (The change to run jobs re-niced has not resolved the problem.) Investigations continue.
- The testing of FTS3 is continuing. (This runs in parallel with our existing FTS2 service).
- We are participating in xrootd federated access tests for Atlas.
- A test batch queue with five SL6/EMI-2 worker nodes and its own CE is in place.
- The change to the certificate used by the MyProxy server, announced for Monday 18th Mar, had to be backed out. An alternative solution to the MyProxy certificate problem reported in GGUS#92266 is being worked on.
Ongoing Disk Server Issues
Notable Changes made this last week
- An analysis of data rates shows that the intervention on Tuesday 12th Mar (replacing the core C300 switch and modifying the link to the UKLight router) has resolved the problem of asymmetric data rates in/out of the Tier1.
- The APEL publisher was upgraded from UMD-1 to UMD-2 last Thursday (14th Mar).
- The Castor client software has been upgraded to version 2.1.13 on all worker nodes.
- Updating of the disk controller firmware on the Clustervision '11 batch of disk servers is ongoing.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- A program of updating the disk controller firmware in the 2011 Clustervision batch of disk servers is ongoing.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Upgrade to version 2.1.13.
- Networking:
  - Single link to UKLight Router to be restored as paired (2*10Gbit) link.
  - Update core Tier1 network and change connection to site and OPN, including:
    - Install new Routing layer for Tier1.
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
  - Addition of caching DNSs into the Tier1 network.
- Grid Services:
  - Upgrade of other EMI-1 components (APEL, UI) under investigation.
- Infrastructure:
  - Intervention required on the "Essential Power Board" & remedial work on three (out of four) transformers.
  - Remedial work on the BMS (Building Management System) due to one of its three modules being faulty.
  - Electrical safety check (will require some downtime).
Entries in GOC DB starting between 13th and 20th March 2013.
There were no unscheduled entries in the GOC DB for the last week.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgrbp01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 18/03/2013 10:00 | 18/03/2013 11:00 | 1 hour | Warning for hour following replacement of a certificate on the MyProxy server. (Ref GGUS ticket 92266)
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
92266 | Amber | Less Urgent | In Progress | 2013-03-06 | 2013-03-19 | | Certificate for RAL myproxy server
91974 | Red | Urgent | In Progress | 2013-03-04 | 2013-03-13 | | NAGIOS *eu.egi.sec.EMI-1* failed on lcgwms01.gridpp.rl.ac.uk@RAL-LCG2
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-03-13 | | LFC webdav support
91146 | Red | Urgent | In Progress | 2013-02-04 | 2013-03-14 | Atlas | RAL input bandwith issues
91029 | Red | Very Urgent | On Hold | 2013-01-30 | 2013-02-27 | Atlas | FTS problem in queryin jobs
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-03-19 | | correlated packet-loss on perfsonar host

Day | OPS | Alice | Atlas | CMS | LHCb | Comment
13/03/13 | 100 | 100 | 100 | 100 | 100 |
14/03/13 | 100 | 100 | 100 | 100 | 100 |
15/03/13 | 100 | 100 | 99.2 | 91.8 | 95.8 | Problem with Tier1 Network (packet storm) around 15:40 caused some failures. Also CMS had a single "SRM timeout" earlier in the day.
16/03/13 | 100 | 100 | 100 | 100 | 100 |
17/03/13 | 100 | 100 | 99.1 | 100 | 100 | Single SRM Test failure "zero number of replicas"
18/03/13 | 100 | 100 | 100 | 100 | 100 |
19/03/13 | 100 | 100 | 96.2 | 95.9 | 100 | Test failures (SRM & CE) around midday owing to Site Network problem. In addition Atlas suffered a few other (mainly) SRM test failures earlier in the day.