RAL Tier1 Operations Report for 2nd April 2014
Review of Issues during the fortnight 19th March to 2nd April 2014.
- There was a short (around 5 minute) break in external connectivity to the Tier1 on the morning of Thursday 20th March, and a similar event the following morning.
- There was a failover of an Atlas Castor Database early evening on Tuesday 25th March. The failover triggered a call-out and the database was put back onto its allocated node. The cause is a bug that has been reported to Oracle.
- On Friday, 28th March, we were not running some of the CE SUM tests in a timely manner. It was found that owing to a separate change in the Condor configuration we were no longer prioritising the test jobs. This was fixed.
Ongoing Disk Server Issues
- GDSS239 (Atlas HotDisk) crashed this morning. This is being investigated.
Notable Changes made this last fortnight.
- The rollout of WNs updated to the EMI-3 version continues and is expected to be completed this week.
- The EMI-3 Argus server is being rolled out for use across all CEs and WNs.
- The old MyProxy server (lcgrbp01.gridpp.rl.ac.uk) was turned off today. Its replacement (myproxy.gridpp.rl.ac.uk) is in production.
- The 2013 purchases of worker nodes are being added to the farm this week.
- Two of the CV2013 disk servers (120TB each) have been added to LHCbDst. A further 9 are being added today. Three further servers are in CMS non-prod and are due to be moved into production imminently.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
102902 | Green | Urgent | In Progress | 2014-04-01 | 2014-04-02 | MICE & NA62 | Stale .cvmfswhitelist file MICE VO
102611 | Green | Urgent | In Progress | 2014-03-24 | 2014-03-24 | | NAGIOS *eu.egi.sec.Argus-EMI-1* failed on argusngi.gridpp.rl.ac.uk@RAL-LCG2
101968 | Yellow | Less Urgent | On Hold | 2014-03-11 | 2014-0-01 | Atlas | RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 | Red | Less Urgent | In Progress | 2014-02-09 | 2014-04-01 | | ARC CEs have VOViews with a default SE of "0"
99556 | Red | Very Urgent | On Hold | 2013-12-06 | 2014-03-21 | | NGI Argus requests for NGI_UK
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-03-13 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
19/03/14 | 100 | 100 | 100 | 88.6 | 100 | 99 | 73 | Multiple SRM test failures (load problems).
20/03/14 | 100 | 100 | 99.7 | 99.6 | 100 | 100 | n/a | Atlas: One SRM Test failure; CMS - CE Test failures on all 3 Arc-ce’s (no compatible resources).
21/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
22/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
23/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
24/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | n/a |
25/03/14 | 100 | 100 | 99.0 | 89.8 | 100 | 98 | 99 | Atlas: Castor database problem (Atlas_srm DB moved to another RAC node following a DB crash); CMS SRM SUM test failures separated through day.
26/03/14 | 100 | 100 | 100 | 87.1 | 100 | 100 | 99 | Four separate SRM test failures.
27/03/14 | 100 | 100 | 100 | 96.5 | 100 | 97 | 100 | Two test failures of SRM Put test.
28/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
29/03/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
30/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
31/03/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
01/04/14 | 100 | 100 | 100 | 100 | 100 | 100 | 99 |
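The daily availability figures above can be condensed into a per-VO summary for the fortnight. The following is a minimal Python sketch, not part of the Tier1 tooling: the `ROWS` data is copied by hand from the table, the function name `summarise` is hypothetical, and the two HammerCloud columns are left out because some of their entries are "n/a".

```python
from statistics import mean

# SUM availability (%) per day, copied by hand from the table above.
# Tuple layout: day, OPS, Alice, Atlas, CMS, LHCb (HammerCloud columns omitted).
VOS = ("OPS", "Alice", "Atlas", "CMS", "LHCb")
ROWS = [
    ("19/03/14", 100, 100, 100, 88.6, 100),
    ("20/03/14", 100, 100, 99.7, 99.6, 100),
    ("21/03/14", 100, 100, 100, 100, 100),
    ("22/03/14", 100, 100, 100, 100, 100),
    ("23/03/14", 100, 100, 100, 100, 100),
    ("24/03/14", 100, 100, 100, 100, 100),
    ("25/03/14", 100, 100, 99.0, 89.8, 100),
    ("26/03/14", 100, 100, 100, 87.1, 100),
    ("27/03/14", 100, 100, 100, 96.5, 100),
    ("28/03/14", 100, 100, 100, 100, 100),
    ("29/03/14", 100, 100, 100, 100, 100),
    ("30/03/14", 100, 100, 100, 100, 100),
    ("31/03/14", 100, 100, 100, 100, 100),
    ("01/04/14", 100, 100, 100, 100, 100),
]


def summarise(rows, vos=VOS):
    """Return {vo: (mean availability, [days below 100%])} for each VO."""
    summary = {}
    for col, vo in enumerate(vos, start=1):
        values = [row[col] for row in rows]
        degraded = [row[0] for row in rows if row[col] < 100]
        summary[vo] = (mean(values), degraded)
    return summary


if __name__ == "__main__":
    for vo, (avg, degraded) in summarise(ROWS).items():
        days = ", ".join(degraded) or "none"
        print(f"{vo}: mean {avg:.1f}%, days below 100%: {days}")
```

The mean here is a plain unweighted average of the daily figures, which is an assumption of this sketch and may differ from how official monthly availability is calculated.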