RAL Tier1 Operations Report for 22nd March 2017
Review of Issues during the week 15th to 22nd March 2017.
- There was a problem with the Atlas Castor instance on the evening of Wednesday 15th Mar. The oncall was contacted and Castor services restarted to fix it. The cause was a known bug that causes exhaustion of a particular database resource.
- A crash of one of the five hypervisors in the Microsoft Hyper-V high availability cluster caused a number of VMs to reboot overnight Thursday-Friday (16-17 Mar).
- We have an ongoing problem with the SRM SAM tests for Atlas, which are failing much of the time. We have confirmed that this is not affecting Atlas operationally; only the tests are failing. We still have a GGUS ticket open with Atlas as the test appears to be problematic.
- On Friday (17th March) there was a problem with the Argus server when a disk partition filled up. This affected CMS glexec tests.
Resolved Disk Server Issues
Current operational status and issues
- We are still seeing a rate of failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities, but the level of failures is lower than it was a few weeks ago.
Ongoing Disk Server Issues
Limits on concurrent batch system jobs.
- Atlas Pilot (Analysis) 1500
- CMS Multicore 460
Notable Changes made since the last meeting.
- Last week nine of the '14 generation disk servers (each 100TB) were deployed into AtlasDataDisk. (These are from the batch that was used as CEPH test servers.)
- Work ongoing on replacing two of the chillers.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/03/2017 11:00 | 23/03/2017 17:00 | 6 hours | Upgrade of Castor SRMs for LHCb to version 2.1.16-10
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor SRMs starting with LHCb. (Announced for 23rd March.)
- Chiller replacement - work ongoing.
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
- Update SRMs to new version, including updating to SL6.
- Bring some newer disk servers ('14 generation) into service, replacing some older ('12 generation) servers.
- Databases
- Removal of "asmlib" layer on Oracle database nodes. (Ongoing)
- Networking
- Enable first services on production network with IPv6 once addressing scheme agreed.
- Infrastructure:
- Two of the chillers supplying the air-conditioning for the R89 machine room will be replaced.
Entries in GOC DB starting since the last report.
None
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
126905 | Green | Less Urgent | In Progress | 2017-03-02 | 2017-03-02 | solid | finish commissioning cvmfs server for solidexperiment.org
126184 | Yellow | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-03-02 | | CASTOR at RAL not publishing GLUE 2.
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment
15/03/17 | 100 | 100 | 100 | 99 | 100 | 99 | 98 | 99 | Single SRM test failure (timeout)
16/03/17 | 100 | 99 | 100 | 99 | 100 | 79 | 72 | 99 | ALICE: Test failed with 'no compatible resources found in BDII'. CMS: Single SRM test failure
17/03/17 | 100 | 100 | 100 | 85 | 100 | 100 | 100 | 97 | Problem with glexec tests from CMS.
18/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
19/03/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
20/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 100 | SRM test failures on GET (User timeout).
21/03/17 | 100 | 100 | 100 | 98 | 100 | 100 | 100 | 99 | SRM test failures on GET (User timeout).