RAL Tier1 Operations Report for 18th June 2014
Review of Issues during the week 11th to 18th June 2014.
- There was a problem with cream-ce01, which started failing tests on Friday. It was put into downtime and drained over the weekend, after which its database was reset. It was returned to service on Monday (16th).
- The CMS Castor stager update to version 2.1.14-13 took place yesterday (Tuesday) as planned. Immediately after the upgrade the disk server re-balancer caused some difficulties, but these were understood and resolved within the announced outage. Nevertheless, the service was placed in 'warning' in the GOC DB for the rest of the day and overnight.
Resolved Disk Server Issues
|
Current operational status and issues
|
Ongoing Disk Server Issues
|
Notable Changes made this last week.
- On Tuesday 10th June the Castor nameserver was updated to version 2.1.14-13. While Castor was down the opportunity was taken to update the firmware in some network switches and apply kernel/errata updates to the Castor disk servers.
- A new ARC CE, arc-ce04, has been brought into production.
- ARC CEs arc-ce02 & arc-ce03 have been upgraded to version 4.1.0. (All ARC CEs now updated).
- The host certificate on arc-ce02 has been updated and the new certificate is SHA-2 signed. There was an initial error during the application of this certificate, but that was corrected and the service is now running OK with the new (SHA-2) certificate.
- On the 2nd June LHCb access was removed from cream-ce01.
- On 11th June work was under way to install a new tape controller system (ACSLS). There had been some problems with the new server, but the most recent test (the week before) was successful.
- On Monday (16th June) LHCb access was removed from cream-ce02.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Dates for the Castor 2.1.14 stager upgrades: CMS - Tue 17th June; LHCb - Thu 19th June; GEN - Tue 24th June; Atlas - Thu 26th June.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - The Castor 2.1.14 upgrade is underway. Some checks are ongoing before finally deciding whether the stagers will go directly to minor version 2.1.14-13 (rather than 2.1.14-11 as previously planned).
  - CIP-2.2.15 publishes online resources correctly for CASTOR 2.1.14 but not nearline (they appear as zero, due to a modification to how data is held in CASTOR). CIP will be updated. We can double check the Rajanian Problem at the same time which was understood and fixed earlier.
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 11th and 18th June 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor (all SRM endpoints) and batch (all CEs) | SCHEDULED | OUTAGE | 10/06/2014 08:50 | 10/06/2014 12:45 | 3 hours and 55 minutes | Castor and batch services down during upgrade of Castor Nameserver to version 2.1.14.
arc-ce04.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 10/06/2014 07:05 | 10/06/2014 12:45 | 5 hours and 40 minutes | Stopping work on new CE around upgrade of Castor Storage System.
Castor (all SRM endpoints) and batch (all CEs) | SCHEDULED | OUTAGE | 10/06/2014 06:50 | 10/06/2014 08:50 | 2 hours | Castor and batch services down for Networking Change before upgrade of Castor Nameserver to version 2.1.14.
cream-ce01.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 07/06/2014 10:00 | 10/06/2014 12:00 | 3 days, 2 hours | EMI-3 update 14 upgrade
srm-lhcb-tape.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 05/06/2014 15:25 | 05/06/2014 16:42 | 1 hour and 17 minutes | Problem affecting disk cache in front of tape.
srm-atlas.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 05/06/2014 15:25 | 05/06/2014 16:41 | 1 hour and 16 minutes | Problem affecting disk cache in front of tape. Non-Tape service classes unaffected.
All srm endpoints | SCHEDULED | WARNING | 04/06/2014 08:00 | 04/06/2014 17:00 | 9 hours | Warning (At Risk) on tape systems during testing of new tape library controller.
arc-ce02.gridpp.rl.ac.uk, arc-ce03.gridpp.rl.ac.uk | SCHEDULED | WARNING | 02/06/2014 10:00 | 02/06/2014 12:00 | 2 hours | Upgrade arc-ce02 and arc-ce03 to v. 4.1.0.
arc-ce01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/05/2014 10:00 | 28/05/2014 12:00 | 2 hours | Upgrade of ARC CE to version 4.1.0.
All Castor (all srm endpoints): srm-alice, srm-atlas, srm-biomed, srm-cert, srm-cms, srm-dteam, srm-hone, srm-ilc, srm-lhcb, srm-mice, srm-minos, srm-na62, srm-preprod, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | WARNING | 28/05/2014 09:30 | 28/05/2014 11:30 | 2 hours | At Risk on Castor (All SRM endpoints) during small internal network change.
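The Duration column in the GOC DB entries above follows directly from the Start and End timestamps. As a minimal sketch for cross-checking it, assuming the timestamps are in day/month/year 24-hour form as they appear here (the function name `downtime_duration` and the format constant are hypothetical and not part of any GOC DB tool):

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the Start/End columns (e.g. "10/06/2014 08:50").
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> tuple:
    """Return (days, hours, minutes) elapsed between two timestamps.

    Assumes end is not earlier than start.
    """
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60


# Castor nameserver upgrade outage: listed as 3 hours and 55 minutes.
print(downtime_duration("10/06/2014 08:50", "10/06/2014 12:45"))  # (0, 3, 55)
# cream-ce01 EMI-3 upgrade outage: listed as 3 days, 2 hours.
print(downtime_duration("07/06/2014 10:00", "10/06/2014 12:00"))  # (3, 2, 0)
```

Returning a tuple rather than formatted text sidesteps the two phrasings the entries use ("3 hours and 55 minutes" and "3 days, 2 hours").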
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
105571 | Amber | Less Urgent | In Progress | 2014-05-21 | 2014-06-02 | LHCb | BDII and SRM publish inconsistent storage capacity numbers
105405 | Red | Urgent | In Progress | 2014-05-14 | 2014-06-10 | | please check your Vidyo router firewall configuration
105100 | Red | Less Urgent | In Progress | 2014-05-02 | 2014-05-30 | CMS | T1_UK_RAL Consistency Check (May14)
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-06-17 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
11/06/14 | 100 | 91.3 | 100 | 99.3 | 100 | 99 | 99 | Problem with Argus server.
12/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 97 |
13/06/14 | 100 | 100 | 95.3 | 100 | 100 | 98 | 99 |
14/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
15/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
16/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
17/06/14 | 100 | 100 | 100 | 72.9 | 100 | 96 | 100 | CMS Castor stager 2.1.14 upgrade.