Revision as of 10:36, 25 June 2014
RAL Tier1 Operations Report for 25th June 2014
Review of Issues during the week 18th to 25th June 2014.
|
- On Tuesday morning (24th June) the database system for the Atlas SRM crashed and failed over to another node. It was subsequently moved back to the correct RAC node.
Resolved Disk Server Issues
|
Current operational status and issues
|
- There have been some problems with xroot following the CMS Castor stager update to version 2.1.14-13 last Tuesday (17th June). The current CMS workload exhausts the available xroot slots on some disk servers and then fails over to the redirector (proxy) to source the files elsewhere. Tuning made a significant improvement but the issue remains.
Ongoing Disk Server Issues
|
Notable Changes made this last week.
|
- On Monday (23rd June) cream-ce02 was upgraded to EMI-3 update 14.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 26/06/2014 10:00 | 26/06/2014 16:00 | 6 hours | LHCb Castor instance down for Castor 2.1.14 Stager Update
Advance warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
- Dates for the Castor 2.1.14 stager upgrades: LHCb - Thu 26th June; Atlas - Tues 1st July.
- We are starting to plan the termination of the FTS2 service now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - The Castor 2.1.14 upgrade is underway.
  - The CIP is compatible with Castor version 2.1.14. There is an issue reported by LHCb to be investigated.
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 18th and 25th June 2014.
|
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor GEN instance: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | SCHEDULED | OUTAGE | 24/06/2014 09:30 | 24/06/2014 17:00 | 7 hours and 30 minutes | Castor GEN instance down for Castor 2.1.14 Stager Update.
cream-ce02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 21/06/2014 11:00 | 24/06/2014 12:00 | 3 days, 1 hour | EMI-3 update 14
srm-cms.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 17/06/2014 16:30 | 18/06/2014 12:00 | 19 hours and 30 minutes | Investigating some problems following the Castor 2.1.14 update of the CMS stager.
Open GGUS Tickets (Snapshot during morning of meeting)
|
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106472 | Green | Less Urgent | In Progress | 2014-06-25 | 2014-06-25 | Atlas | Missing ATLAS file at RAL's tape instance
106324 | Green | Urgent | In Progress | 2014-06-18 | 2014-06-23 | CMS | pilots losing network connections at T1_UK_RAL
105571 | Red | Less Urgent | In Progress | 2014-05-21 | 2014-06-02 | LHCb | BDII and SRM publish inconsistent storage capacity numbers
105405 | Red | Urgent | In Progress | 2014-05-14 | 2014-06-10 | | please check your Vidyo router firewall configuration
98249 | Red | Urgent | In Progress | 2013-10-21 | 2014-06-17 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
18/06/14 | 100 | 100 | 100 | 99.2 | 100 | 100 | 90 | Some problems with CMS Castor following the stager upgrade led to a single SRM test failure.
19/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 91 |
20/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 96 |
21/06/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 |
22/06/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
24/06/14 | 100 | 100 | 96.6 | 100 | 100 | 100 | 99 | Atlas error was: 1 File was NOT copied from SRM
23/06/14 | 100 | 66 | 95.5 | 95.9 | 100 | 98 | 96 | Alice had issues with Castor upgrades. Atlas had file not copied errors and CMS had an error stating zero number of replicas.