RAL Tier1 Operations Report for 12th April 2017
Review of Issues during the week 5th to 12th April 2017.
- The LHCb Castor instance has had problems throughout the last week. Initially the new SRM version appeared to be causing a bottleneck. This was fixed, but the stager then appeared to be struggling as well. Work to resolve this is ongoing.
- Some batch job submission errors have been seen by CMS and LHCb. These are not yet understood and investigation is ongoing.
- Over the weekend there were problems with the Atlas Frontier systems. Lyon were also affected.
- On Monday problems were reported on one of the ARC CEs (AC-CE4) and its services were restarted.
Resolved Disk Server Issues
Current operational status and issues
- We are still seeing failures of the CMS SAM tests against the SRM. These affect our CMS availability figures, but the failure rate is lower than it was a few weeks ago.
Ongoing Disk Server Issues
- gdss673 (LHCb-Tape) was removed from production on the morning of 05/04/2017 because of a double disk failure.
Limits on concurrent batch system jobs.
- Atlas Pilot (Analysis) 1500
- CMS Multicore 550
- LHCb 1000
Notable Changes made since the last meeting.
- Increased limit on number of CMS multicore jobs from 460 to 550 due to increased pledge for 2017.
- Out of Hours cover for the CEPH ECHO service is being piloted.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/04/2017 09:05 | 18/04/2017 12:00 | 6 days, 2 hours and 55 minutes | server migration
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Update Castor SRMs - CMS & GEN still to do. This is awaiting a full understanding of the problem seen with LHCb.
- Chiller replacement - work ongoing.
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6.
  - Bring some newer disk servers ('14 generation) into service, replacing some older ('12 generation) servers.
- Networking:
  - Enable first services on production network with IPv6 once addressing scheme agreed.
- Infrastructure:
  - Two of the chillers supplying the air-conditioning for the R89 machine room are being replaced.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgwms04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 12/04/2017 09:05 | 18/04/2017 12:00 | 6 days, 2 hours and 55 minutes | server migration
srm-lhcb.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 09/04/2017 12:00 | 10/04/2017 12:00 | 24 hours | Problems with LHCb transfers
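The Duration column in the GOC DB entries above follows directly from the Start and End timestamps. As a minimal sketch of that arithmetic, assuming day-first timestamps in the form shown in the table and ignoring time zones and daylight saving (the function name `downtime_duration` is hypothetical and not part of any GOC DB tooling):

```python
from datetime import datetime

# Assumed timestamp layout, matching the Start/End columns above (day first).
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> tuple[int, int, int]:
    """Return the downtime length as (days, hours, minutes).

    Timestamps are treated as naive local values; no time zone
    or daylight-saving correction is applied.
    """
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


# The lcgwms04 server migration entry.
print(downtime_duration("12/04/2017 09:05", "18/04/2017 12:00"))
# The srm-lhcb unscheduled outage entry.
print(downtime_duration("09/04/2017 12:00", "10/04/2017 12:00"))
```

The first call reproduces the 6 days, 2 hours and 55 minutes listed for lcgwms04; the second gives one full day, which the table reports as 24 hours.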
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
127388 | Green | Less Urgent | In Progress | 2017-03-29 | 2017-04-03 | LHCb | [FATAL] Connection error for some file
127240 | Green | Urgent | In Progress | 2017-03-21 | 2017-03-27 | CMS | Staging Test at UK_RAL for Run2
126905 | Green | Less Urgent | Waiting Reply | 2017-03-02 | 2017-04-03 | solid | finish commissioning cvmfs server for solidexperiment.org
126184 | Amber | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-03-02 | | CASTOR at RAL not publishing GLUE 2.
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 841); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment
29/03/17 | 100 | 100 | 100 | 97 | 83 | 100 | 99 | 100 | SRM test failures
30/03/17 | 100 | 100 | 100 | 100 | 88 | 100 | 100 | 100 | SRM test failures
31/03/17 | 100 | 100 | 100 | 98 | 50 | 96 | 100 | 100 | SRM test failures
01/04/17 | 100 | 100 | 67 | 100 | 79 | 100 | 100 | 100 | Atlas: Missing data; LHCb: SRM test failures
02/04/17 | 100 | 100 | 100 | 98 | 100 | 100 | 89 | 100 | SRM test failures
03/04/17 | 100 | 100 | 100 | 97 | 82 | 98 | 94 | 99 | SRM test failures
05/04/17 | 100 | 100 | 100 | 100 | 88 | 100 | 100 | 100 | SRM test failures
06/04/17 | 100 | 100 | 100 | 99 | 84 | 100 | 100 | 100 | SRM test failures for both CMS and LHCb.
07/04/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
08/04/17 | 100 | 96 | 88 | 75 | 94 | 100 | 100 | 100 | A hypervisor failure led to problems for one of the CEs and argus.
09/04/17 | 100 | 100 | 100 | 45 | 88 | 100 | 100 | 100 | CMS: Problem with CMS Castor (transfer manager problems); LHCb: SRM test failures.
10/04/17 | 100 | 100 | 100 | 60 | 100 | 100 | 100 | 100 | CMS: Ongoing problem with CMS Castor (above) fixed during morning.
11/04/17 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |