Revision as of 07:24, 24 May 2017
RAL Tier1 Operations Report for 24th May 2017
Review of Issues during the week 17th to 24th May 2017.
- Following the failure of the UPS in building R89 on Friday 28th April, a replacement UPS was installed at the end of last week. This was brought into use yesterday (16th May) and a UPS/generator load test was then successfully carried out this morning.
- There was a significant problem with the CMS Castor instance over the weekend that severely affected availabilities. Space was only available on a small number of disk servers and these became heavily overloaded.
- Atlas and CMS were affected for a couple of hours yesterday when Castor was reporting disk pools as full. An update had unexpectedly caused processes on disk servers to restart and this had a knock-on effect.
- There have been some problems with the ECHO CEPH xrootd gateways. An xrootd proxy cache has been installed on these gateways and this has resolved the issue. However, the root cause is still being investigated.
Resolved Disk Server Issues
- GDSS724 (AtlasDataDisk - D1T0) crashed on Wednesday evening (17th May). It was returned to service, initially read-only, on the 19th. Three zero-sized files were lost at the time of the crash.
Current operational status and issues
- We are still seeing a rate of failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities. CMS are also looking at file access performance and have turned off "lazy-download". This will be re-addressed once we have upgraded to Castor 2.1.16.
- LHCb Castor performance. I have left this item in place although there has been a Castor update for LHCb and testing has been carried out this week.
Ongoing Disk Server Issues
- GDSS773 (LHCbDst - D1T0) crashed on Sunday (21st May). Investigations are ongoing.
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
- Atlas Castor instance updated to Castor version 2.1.16-13. (The Atlas SRMs were already at version 2.1.16).
- CEs being migrated to use the load balancers in front of the argus service.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms-disk.gridpp.rl.ac.uk, srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 25/05/2017 10:00 | 25/05/2017 16:00 | 6 hours | Upgrade of CMS Castor instance to version 2.1.16.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor (including SRMs) to version 2.1.16. Central nameserver done. Current plan: LHCb stager on Thursday 11th May. Others to follow.
- Update Castor SRMs - CMS & GEN still to do. Problems seen with the SRM update mean these will wait until Castor 2.1.16 is rolled out.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6.
  - Update Castor to version 2.1.16 (ongoing).
  - Merge AtlasScratchDisk into larger Atlas disk pool.
- Networking:
  - Increase OPN link to CERN from 2*10Gbit to 3*10Gbit links.
  - Enable first services on production network with IPv6 now that the addressing scheme has been agreed. (Perfsonar already working over IPv6.)
- Services:
  - Put argus systems behind a load balancer to improve resilience.
  - The production FTS needs updating. This will no longer support the SOAP interface. (The "test" FTS, used by Atlas, has already been upgraded.)
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/05/2017 10:00 | 23/05/2017 11:34 | 1 hour and 34 minutes | Upgrade of Atlas Castor instance to version 2.1.16.
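The Duration column in the downtime tables follows directly from the Start and End timestamps. A minimal sketch of that arithmetic, assuming the `DD/MM/YYYY HH:MM` layout shown above; the function name `downtime_duration` is hypothetical and is not part of any GOC DB tooling:

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the tables above (day/month/year, 24-hour clock).
GOCDB_TIME_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two table timestamps."""
    started = datetime.strptime(start, GOCDB_TIME_FORMAT)
    ended = datetime.strptime(end, GOCDB_TIME_FORMAT)
    if ended < started:
        raise ValueError("end precedes start")
    return ended - started


# Atlas Castor upgrade entry from the table above.
print(downtime_duration("23/05/2017 10:00", "23/05/2017 11:34"))  # 1:34:00
```

The same call on the scheduled CMS outage (25/05/2017 10:00 to 16:00) gives the 6 hours listed for it.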
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
128350 | Green | Less Urgent | In Progress | 2017-05-16 | 2017-05-16 | Atlas | RAL-LCG2_DATADISK: transfer errors
128308 | Green | Urgent | In Progress | 2017-05-14 | 2017-05-15 | CMS | Description: T1_UK_RAL in error for about 6 hours
128180 | Green | Urgent | In Progress | 2017-05-05 | 2017-05-08 | | WLGC-IPv6 Tier-1 readiness
127968 | Green | Less Urgent | In Progress | 2017-04-27 | 2017-04-27 | MICE | RAL castor: not able to list directories and copy to
127967 | Green | Less Urgent | On Hold | 2017-04-27 | 2017-04-28 | MICE | Enabling pilot role for mice VO at RAL-LCG2
127612 | Red | Alarm | In Progress | 2017-04-08 | 2017-05-09 | LHCb | CEs at RAL not responding
127598 | Green | Urgent | In Progress | 2017-04-07 | 2017-05-12 | CMS | UK XRootD Redirector
127597 | Yellow | Urgent | Waiting for Reply | 2017-04-07 | 2017-05-16 | CMS | Check networking and xrootd RAL-CERN performance
127240 | Amber | Urgent | Waiting for Reply | 2017-03-21 | 2017-05-15 | CMS | Staging Test at UK_RAL for Run2
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-05-10 | | CASTOR at RAL not publishing GLUE 2.
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 841); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas ECHO | Atlas HC | Atlas HC ECHO | CMS HC | Comment
17/05/17 | 100 | 100 | 100 | 91 | 100 | 100 | 96 | 98 | 100 | Intermittent SRM test failures. (User timeout)
18/05/17 | 100 | 100 | 98 | 79 | 100 | 100 | 94 | 100 | 100 | Atlas: One SRM test failure; CMS: Intermittent SRM test failures. (timeout)
19/05/17 | 100 | 100 | 100 | 78 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. (User)
20/05/17 | 100 | 100 | 100 | 83 | 100 | 100 | 95 | 100 | 100 | Intermittent SRM test failures. (User)
21/05/17 | 100 | 100 | 100 | 80 | 100 | 100 | 100 | 199 | 100 | Intermittent SRM test failures. (User)
22/05/17 | 100 | 100 | 100 | 83 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures. (User)
23/05/17 | 100 | 100 | 92 | 96 | 100 | 100 | 100 | 100 | 100 | Atlas Castor 2.1.16 update; CMS: Intermittent SRM test failures. (timeout)
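As a cross-check on the daily figures above, a column can be averaged over the week and each value checked to be a valid percentage. A minimal sketch, with the CMS and Atlas HC ECHO values transcribed from the table; the names `mean` and `out_of_range` are hypothetical helpers, not part of any monitoring tool:

```python
# Daily percentages transcribed from the table above (17/05/17 to 23/05/17).
DAYS = ["17/05/17", "18/05/17", "19/05/17", "20/05/17",
        "21/05/17", "22/05/17", "23/05/17"]
CMS = [91, 79, 78, 83, 80, 83, 96]
ATLAS_HC_ECHO = [98, 100, 100, 100, 199, 100, 100]


def mean(values):
    """Arithmetic mean of a non-empty list of percentages."""
    return sum(values) / len(values)


def out_of_range(days, values):
    """Return (day, value) pairs that are not valid percentages (0 to 100)."""
    return [(d, v) for d, v in zip(days, values) if not 0 <= v <= 100]


print(round(mean(CMS), 1))                 # 84.3
print(out_of_range(DAYS, ATLAS_HC_ECHO))   # [('21/05/17', 199)]
```

Any entry reported by `out_of_range` is worth checking against the source dashboard before the figures are reused.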