RAL Tier1 Operations Report for 17th May 2017
Review of Issues during the week 10th to 17th May 2017.
- Following the failure of the UPS in building R89 on Friday 28th April, a replacement UPS was installed at the end of last week. This was brought into use yesterday (16th May), and a UPS/generator load test was successfully carried out this morning.
- There was a significant problem with the CMS Castor instance over the weekend that severely affected availabilities. Space was only available on a small number of disk servers and these became heavily overloaded.
- Atlas and CMS were affected for a couple of hours yesterday when Castor was reporting disk pools as full. An update had unexpectedly caused processes on disk servers to restart and this had a knock-on effect.
- There have been some problems with the ECHO CEPH xrootd gateways.
Resolved Disk Server Issues
- GDSS744 (AtlasDataDisk - D1T0) crashed on Monday morning (15th May). Two disk drives were replaced and it was returned to service (initially read-only) at the end of Tuesday afternoon (16th).
Current operational status and issues
- We are still seeing a rate of failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities. CMS are also looking at file access performance and have turned off "lazy-download". This will be re-addressed once we have upgraded to Castor 2.1.16.
- LHCb Castor performance. I have left this item in place although there has been a Castor update for LHCb and testing has been carried out this week.
Ongoing Disk Server Issues
Limits on concurrent batch system jobs.
Notable Changes made since the last meeting.
- New UPS installed in R89 and tested.
- LHCb Castor instance updated to Castor version 2.1.16-13.
- Edinburgh Dirac site now moving 'production' files to Castor.
- Support for the following VOs removed from batch (as no longer supported by GridPP): "hone" "fusion" "superbvo.org"
None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor (including SRMs) to version 2.1.16. Central nameserver done. Current plan: LHCb stager on Thursday 11th May. Others to follow.
- Update Castor SRMs - CMS & GEN still to do. Problems seen with the SRM update mean these will wait until Castor 2.1.16 is rolled out.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6.
  - Update Castor to version 2.1.16 (ongoing).
  - Merge AtlasScratchDisk into larger Atlas disk pool.
- Networking:
  - Enable first services on production network with IPv6 now that the addressing scheme has been agreed. (Perfsonar already working over IPv6).
- Services:
  - Put argus systems behind a load balancer to improve resilience.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | UNSCHEDULED | WARNING | 16/05/2017 14:00 | 16/05/2017 17:00 | 3 hours | Emergency warning while bringing UPS supply back online.
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 11/05/2017 10:00 | 11/05/2017 12:50 | 2 hours and 50 minutes | Downtime while upgrading LHCb Castor instance to 2.1.16
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
128350 | Green | Less Urgent | In Progress | 2017-05-16 | 2017-05-16 | Atlas | RAL-LCG2_DATADISK: transfer errors
128308 | Green | Urgent | In Progress | 2017-05-14 | 2017-05-15 | CMS | Description: T1_UK_RAL in error for about 6 hours
128180 | Green | Urgent | In Progress | 2017-05-05 | 2017-05-08 | | WLCG-IPv6 Tier-1 readiness
127968 | Green | Less Urgent | In Progress | 2017-04-27 | 2017-04-27 | MICE | RAL castor: not able to list directories and copy to
127967 | Green | Less Urgent | On Hold | 2017-04-27 | 2017-04-28 | MICE | Enabling pilot role for mice VO at RAL-LCG2
127612 | Red | Alarm | In Progress | 2017-04-08 | 2017-05-09 | LHCb | CEs at RAL not responding
127598 | Green | Urgent | In Progress | 2017-04-07 | 2017-05-12 | CMS | UK XRootD Redirector
127597 | Yellow | Urgent | Waiting for Reply | 2017-04-07 | 2017-05-16 | CMS | Check networking and xrootd RAL-CERN performance
127240 | Amber | Urgent | Waiting for Reply | 2017-03-21 | 2017-05-15 | CMS | Staging Test at UK_RAL for Run2
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-05-10 | | CASTOR at RAL not publishing GLUE 2.
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 841); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas ECHO | Atlas HC | Atlas HC ECHO | CMS HC | Comment
10/05/17 | 100 | 100 | 85 | 100 | 100 | 100 | 100 | 100 | 99 | CMS: Mainly CE tests failed owing to xroot redirection failing. Atlas: Intermittent SRM test failures.
11/05/17 | 100 | 100 | 96 | 83 | 87 | 100 | 100 | 100 | 96 | Atlas: Single SRM test failure. CMS: 83%, solid block of failures for CE & SRM tests. LHCb: Castor Stager 2.1.16 update.
12/05/17 | 100 | 100 | 94 | 99 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures.
13/05/17 | 100 | 100 | 94 | 39 | 100 | 100 | 100 | 100 | 100 | Atlas: Intermittent SRM test failures. CMS: SRM and CE tests failing with problems writing into Castor.
14/05/17 | 100 | 100 | 90 | 35 | 100 | 100 | 98 | 100 | 100 | Atlas: Intermittent SRM test failures. CMS: SRM and CE tests failing with problems writing into Castor.
15/05/17 | 100 | 100 | 90 | 99 | 100 | 100 | 100 | 100 | 100 | Intermittent SRM test failures.
16/05/17 | 100 | 100 | 90 | 91 | 100 | 100 | 91 | 100 | 94 | Both Atlas and CMS had a block of failures reporting disk pool full.