RAL Tier1 Operations Report for 25th January 2017
Review of Issues during the week 18th to 25th January 2017.
- We are still seeing CMS SAM SRM test failures. These are attributed to the total load on the instance. (Now recorded as an ongoing operational problem below.)
- There was a performance problem on the LHCb Castor instance following the upgrade last Wednesday. This was resolved by the end of the following day.
- On Monday the LHCb Castor instance was stopped while OS security patches were applied and nodes rebooted. It took a couple of hours after the planned intervention to return the last disk servers to service, as a couple of them had problems on reboot.
Resolved Disk Server Issues
- GDSS772 (LHCbDst - D1T0) failed on Thursday evening, 19th Jan. Back read-only the following afternoon. A disk drive was replaced. Two files reported lost.
- GDSS667 (AtlasScratchDisk - D1T0) failed on Sunday morning (22nd Jan). It was returned to service read-only the following afternoon. One drive with a lot of media errors was replaced. Eleven files reported lost to Atlas.
- GDSS776 (LHCbDst - D1T0) had problems after the reboots to pick up security patches on Monday (23rd). It was returned to service the following day.
Current operational status and issues
- There is a problem, seen by LHCb, of a low but persistent rate of failures when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites.
- We are seeing failures of the CMS SAM tests against the SRM, which are affecting our CMS availability figures. We attribute these failures to load on Castor.
Ongoing Disk Server Issues
- GDSS780 (LHCbDst - D1T0) crashed at around 8am this morning (Wed 25th Jan). System under investigation.
Notable Changes made since the last meeting.
- Castor 2.1.15 updates carried out on the LHCb and Atlas stagers.
- The top BDIIs have been put behind load balancers. (This was recently done for the site BDIIs.)
- Migration of LHCb data from 'C' to 'D' tapes is ongoing. Now over 90% done, with fewer than 100 of the 1000 tapes still to do.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor CMS instance | SCHEDULED | OUTAGE | 31/01/2017 10:00 | 31/01/2017 16:00 | 6 hours | Castor 2.1.15 Upgrade. Only affecting CMS instance. (CMS stager component being upgraded.)
Castor GEN instance | SCHEDULED | OUTAGE | 26/01/2017 10:00 | 26/01/2017 16:00 | 6 hours | Castor 2.1.15 Upgrade. Only affecting GEN instance. (GEN stager component being upgraded.)
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
- Update to Castor version 2.1.15. This upgrade is now partly done.
- Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 24/01/2017 10:00 | 24/01/2017 13:27 | 3 hours and 27 minutes | Castor 2.1.15 Upgrade. Only affecting Atlas instance. (Atlas stager component being upgraded.)
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 23/01/2017 10:30 | 23/01/2017 12:30 | 2 hours | Downtime while patching CVE-7117 on LHCb Castor instance.
srm-lhcb.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 19/01/2017 18:00 | 20/01/2017 17:00 | 23 hours | Ongoing problems with LHCb Castor instance when under load.
srm-lhcb.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 19/01/2017 10:00 | 19/01/2017 18:00 | 8 hours | Ongoing problems with LHCb Castor instance when under load.
srm-lhcb.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 18/01/2017 20:00 | 19/01/2017 10:03 | 14 hours and 3 minutes | Problem with LHCb Castor instance.
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/01/2017 10:00 | 18/01/2017 15:39 | 5 hours and 39 minutes | Castor 2.1.15 Upgrade. Only affecting LHCb instance. (LHCb stager component being upgraded.)
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
125856 | Green | Top Priority | Waiting Reply | 2017-01-06 | 2017-01-18 | LHCb | Permission denied for some files
124876 | Amber | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2016-12-07 | | CASTOR at RAL not publishing GLUE 2. We looked at this as planned in December (report).
|
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 844); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
18/01/17 | 100 | 100 | 100 | 94 | 72 | 100 | 100 | LHCb: Castor 2.1.15 update; CMS: SRM test failures - User timeout
19/01/17 | 100 | 100 | 100 | 97 | 87 | 100 | 100 | LHCb: problems after 2.1.15 upgrade; CMS: SRM test failures - User timeout
20/01/17 | 100 | 100 | 100 | 87 | 100 | 99 | 100 | SRM test failures - User timeout
21/01/17 | 100 | 100 | 100 | 90 | 100 | 100 | 100 | SRM test failures - User timeout
22/01/17 | 100 | 100 | 100 | 92 | 100 | 100 | 100 | SRM test failures - User timeout
23/01/17 | 100 | 100 | 100 | 100 | 92 | 100 | 100 | Patching nodes in LHCb Castor instance. (New kernel.)
24/01/17 | 100 | 100 | 86 | 98 | 100 | 96 | 100 | Atlas: Castor 2.1.15 update; CMS: SRM test failures - User timeout