RAL Tier1 Operations Report for 28th May 2014
Review of Issues during the week 21st to 28th May 2014.
|
- The persistent minor problems with the WMSs reported last week are ongoing. The developers have been contacted.
- The problem reported last week, in which some half dozen Atlas files were lost during the draining of a disk server at the end of February, is now understood. This was an isolated incident and draining has been resumed.
- There was a problem with FTS2 last Friday (16th) which led to a ticket from CMS.
- The checksum checker found a corrupt LHCb file in Castor which has been declared lost.
Resolved Disk Server Issues
|
Current operational status and issues
|
Ongoing Disk Server Issues
|
Notable Changes made this last week.
|
- CVMFS client version 2.1.19 is in the final stages of being rolled out to the whole farm, following successful testing so far.
- One new disk server has been deployed to CMS disk, replacing a server (GDSS758) that failed a couple of weeks ago.
- A new tape controller server (ACSLS) was put into operation yesterday morning (Tuesday 20th May).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor (SRMs) and batch (CEs). | SCHEDULED | OUTAGE | 10/06/2014 08:50 | 10/06/2014 15:00 | 6 hours and 10 minutes | Castor and batch services down during upgrade of Castor Nameserver to version 2.1.14.
arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/06/2014 10:00 | 02/06/2014 12:00 | 2 hours | Upgrade arc-ce02 and arc-ce03 to v. 4.1.0.
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0.
All Castor (All SRM endpoints) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change.
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
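The Duration column in the table above follows directly from the Start and End timestamps. The sketch below shows that arithmetic, assuming the timestamps are in DD/MM/YYYY HH:MM form as they appear in the table. The function name `downtime_duration` and its output wording (comma-separated parts) are illustrative choices, not something taken from the GOC DB itself.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as shown in the table.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the length of a downtime as a human-readable string."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    if delta.total_seconds() < 0:
        raise ValueError("end is earlier than start")

    days = delta.days
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60

    # Build the string from the non-zero parts only.
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return ", ".join(parts) or "0 minutes"


if __name__ == "__main__":
    # Castor Nameserver upgrade outage
    print(downtime_duration("10/06/2014 08:50", "10/06/2014 15:00"))  # 6 hours, 10 minutes
    # lcgui02 decommissioning outage
    print(downtime_duration("30/04/2014 14:00", "29/05/2014 13:00"))  # 28 days, 23 hours
```

Both results agree with the durations listed for the corresponding entries.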
Advance warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
- Provisional dates for the Castor 2.1.14 upgrade: Nameserver: Tuesday 10th June; Stagers: CMS- Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
- Maintenance will be carried out on the diesel generator tomorrow morning (22nd May) from 09:00 - 11:00. Should we suffer a mains power failure during this time window we will not have generator backup.
- On Wednesday 28th May we plan to move the network switches for some Castor disk servers to the mesh network to alleviate a bottleneck. This will be during an at risk on Castor.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Castor 2.1.14 testing is largely complete, although a new minor version (2.1.14-12) will be released soon.
- Networking:
- Move switches connecting recent disk server batches ('11, '12) onto the Tier1 mesh network.
- Make routing changes to allow the removal of the UKLight Router.
- Fabric:
- We are phasing out the use of the software server used by the small VOs.
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
- There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 21st and 28th May 2014.
|
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0.
All SRMs (All Castor) | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change.
lcgui02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 30/04/2014 14:00 | 29/05/2014 13:00 | 28 days, 23 hours | Service being decommissioned.
Open GGUS Tickets (Snapshot during morning of meeting)
|
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
105571 | Green | Less Urgent | In Progress | 2014-05-21 | 2014-05-27 | LHCb | BDII and SRM publish inconsistent storage capacity numbers
105405 | Yellow | Urgent | In Progress | 2014-05-14 | 2014-05-15 | | please check your Vidyo router firewall configuration
105308 | Yellow | Less Urgent | On Hold | 2014-05-11 | 2014-05-27 | Atlas | Jobs at RAL-LCG2_MCORE are failing with "Failed to open shared memory object: Permission denied"
105161 | Amber | Less Urgent | In Progress | 2014-05-05 | 2014-05-16 | H1 | hone jobs submitted into CREAM queues through lcgwms05.gridpp.rl.ac.uk & lcgwms06.gridpp.rl.ac.uk WMSs are are Ready status long time (more as 5 hours)
105100 | Red | Urgent | In Progress | 2014-05-02 | 2014-05-12 | CMS | T1_UK_RAL Consistency Check (May14)
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2014-05-21 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
21/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 |
22/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
23/05/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 |
24/05/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
25/05/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
26/05/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
27/05/14 | 100 | 100 | 99.1 | 100 | 100 | 96 | 100 | Single SRM Get test failure.
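For a quick summary of the week, the daily figures above can be averaged per column. The sketch below does this for the three columns that were not at 100 on every day. The dictionary `DAILY` and the helper `weekly_mean` are illustrative names, and a plain arithmetic mean is an assumption here; the report itself does not state how weekly figures are derived.

```python
from statistics import mean

# Daily percentages for 21/05/14 to 27/05/14, copied from the table above.
# OPS, Alice, CMS and LHCb were 100 on every day and are omitted.
DAILY = {
    "Atlas": [100, 100, 100, 100, 100, 100, 99.1],
    "Atlas HC": [98, 99, 97, 100, 99, 98, 96],
    "CMS HC": [99, 100, 99, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages, rounded to two decimals."""
    return round(mean(values), 2)


if __name__ == "__main__":
    for column, values in DAILY.items():
        print(f"{column}: {weekly_mean(values)}")
```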