RAL Tier1 Operations Report for 25th June 2014
Review of Issues during the week 18th to 25th June 2014.
- On Tuesday morning (24th June) the database system for the Atlas SRM crashed and failed over to another node. It was subsequently moved back to the correct RAC node.
Resolved Disk Server Issues
Current operational status and issues
- There have been some problems with xroot following the CMS Castor stager update to version 2.1.14-13 last Tuesday (17th June). The current CMS workload exhausts the available xroot slots on some disk servers and then fails over to the redirector (proxy) to source the files elsewhere. Tuning has made a significant improvement but the issue remains.
- There are ongoing problems with xroot on AliceDisk since the Castor 2.1.14 update.
Ongoing Disk Server Issues
Notable Changes made this last week.
- Monday (23rd June) cream-ce01 was upgraded to use EMI
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 26/06/2014 10:00 | 26/06/2014 16:00 | 6 hours | LHCb Castor instance down for Castor 2.1.14 Stager Update
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Dates for the Castor 2.1.14 stager upgrades: LHCb - Thu 26th June; Atlas - Tues 1st July.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - The Castor 2.1.14 upgrade is underway.
  - The CIP is compatible with Castor version 2.1.14. There is an issue reported by LHCb to be investigated.
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 18th and 25th June 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor GEN instance: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 24/06/2014 09:30 | 24/06/2014 17:00 | 7 hours and 30 minutes | Castor GEN instance down for Castor 2.1.14 Stager Update.
cream-ce02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/06/2014 11:00 | 24/06/2014 12:00 | 3 days, 1 hour | EMI-3 update 14
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 17/06/2014 16:30 | 18/06/2014 12:00 | 19 hours and 30 minutes | Investigating some problems following the Castor 2.1.14 update of the CMS stager.
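The Duration values in the GOC DB entries above follow directly from the Start and End timestamps. As a cross-check, a minimal Python sketch of that arithmetic is given below. The function name `downtime_duration` and the output wording (comma-separated parts) are illustrative assumptions and are not how the GOC DB itself formats durations.

```python
from datetime import datetime

# Timestamp layout used in the GOC DB tables of this report: DD/MM/YYYY HH:MM
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as readable text.

    Hypothetical helper for illustration only.
    """
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    parts = []
    for value, unit in ((delta.days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            # Pluralise the unit unless the value is exactly 1.
            parts.append(f"{value} {unit}{'' if value == 1 else 's'}")
    return ", ".join(parts) or "0 minutes"


# Start/End pairs copied from the entries listed above.
entries = {
    "Castor GEN instance": ("24/06/2014 09:30", "24/06/2014 17:00"),
    "cream-ce02.gridpp.rl.ac.uk": ("21/06/2014 11:00", "24/06/2014 12:00"),
    "srm-cms.gridpp.rl.ac.uk": ("17/06/2014 16:30", "18/06/2014 12:00"),
}

for service, (start, end) in entries.items():
    print(f"{service}: {downtime_duration(start, end)}")
```

The three results agree with the Duration column (7 hours 30 minutes, 3 days 1 hour, and 19 hours 30 minutes).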
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106324 | Green | Urgent | In Progress | 2014-06-18 | 2014-06-23 | CMS | pilots losing network connections at T1_UK_RAL
105571 | Red | Less Urgent | In Progress | 2014-05-21 | 2014-06-02 | LHCb | BDII and SRM publish inconsistent storage capacity numbers
105405 | Red | Urgent | In Progress | 2014-05-14 | 2014-06-10 | | please check your Vidyo router firewall configuration
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-06-17 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
18/06/14 | 100 | 100 | 100 | 99.2 | 100 | 100 | 90 | Some problems with CMS Castor following the stager upgrade led to single SRM test failure.
19/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 91 |
20/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 96 |
21/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 |
22/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
23/06/14 | 100 | 100 | 96.7 | 100 | 100 | 100 | 99 | Couple of SRM Get test failures (SRM_FILE_BUSY)
24/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
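For a quick view of the week as a whole, the daily figures above can be reduced to a mean and a minimum per column. The Python sketch below does this with the values copied from the table. The helper name `weekly_summary` and the choice of mean and minimum as summary statistics are assumptions made for illustration; they are not an official availability metric.

```python
from statistics import mean

# Daily percentages copied from the table above, 18/06/14 to 24/06/14.
availability = {
    "OPS": [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 100, 100, 100, 96.7, 100],
    "CMS": [99.2, 100, 100, 100, 100, 100, 100],
    "LHCb": [100, 100, 100, 100, 100, 100, 100],
    "Atlas HC": [100, 100, 99, 100, 99, 100, 100],
    "CMS HC": [90, 91, 96, 96, 100, 99, 100],
}


def weekly_summary(columns):
    """Map each column name to (mean rounded to one decimal, minimum).

    Hypothetical helper for illustration only.
    """
    return {
        name: (round(mean(values), 1), min(values))
        for name, values in columns.items()
    }


for name, (average, lowest) in weekly_summary(availability).items():
    print(f"{name}: mean {average}, minimum {lowest}")
```

The minimum is reported alongside the mean because a single poor day, such as the CMS HC figure on 18/06/14, is largely hidden by a seven-day average.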