RAL Tier1 Operations Report for 22nd February 2017
Review of Issues during the week 15th to 22nd February 2017.
- There was a file access problem seen by LHCb last night. This appears to have been a temporary problem (thread starvation within the SRM) that went away this morning.
- There remain some issues following the Castor 2.1.15 upgrade:
  - An occasional problem with a database resource (number of cursors) becoming exhausted. This has affected more than one of the instances. Investigations are ongoing. Castor version 2.1.16 contains a bugfix in this area.
  - We are managing memory leaks seen in the transfer manager component.
- We still see some timeout test failures in SAM tests for CMS.
Resolved Disk Server Issues
Current operational status and issues
- We are still seeing failures of the CMS SAM tests against the SRM. These are affecting our CMS availability, but the level of failures has been reduced recently.
Ongoing Disk Server Issues
- GDSS663 (AtlasTape - D0T1) crashed on Saturday (18th Feb). Two faulty disks found and replaced. Expected back in service imminently.
Limits on concurrent batch system jobs.
- LHCb Pilot 4500
- Atlas Pilot (Analysis) 1600
- CMS Multicore 470
Notable Changes made since the last meeting.
- The OPNR (OPN router) was enabled for IPv6 this morning. A reboot was required to enable IPv6 ACLs.
- Various systems have had security and other patches applied. In particular back end database systems are being updated to remove a software layer ("asmlib").
- Two batches of worker nodes are running SL7 with the jobs themselves in SL6 containers.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | WARNING | 01/03/2017 07:00 | 01/03/2017 11:00 | 4 hours | Warning on site during network intervention in preparation for IPv6.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Pending - but not yet formally announced:
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
- Networking:
  - Enabling IPv6 onto production network.
- Databases:
  - Removal of "asmlib" layer on Oracle database nodes.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lfc.gridpp.rl.ac.uk | SCHEDULED | WARNING | 22/02/2017 08:45 | 22/02/2017 13:00 | 4 hours and 15 minutes | LFC Oracle backend security updates
All Castor and ECHO storage and Perfsonar. | SCHEDULED | WARNING | 22/02/2017 07:00 | 22/02/2017 11:00 | 4 hours | Warning on Storage and Perfsonar during network intervention in preparation for IPv6.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
1267 | Green | Very Urgent | In Progress | 2017-02-22 | 2017-02-22 | LHCb | File access problem at RAL
126718 | Green | Urgent | In Progress | 2017-02-21 | 2017-02-21 | Atlas | UK RAL-LCG2-ECHO DATADISK: ~8k deletion error due to "Device or resource busy"
126532 | Green | Urgent | In Progress | 2017-02-09 | 2017-02-21 | Atlas | RAL tape staging errors
126184 | Green | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-02-10 | | CASTOR at RAL not publishing GLUE 2. Looking at it again now (Feb), progress made on back end. Need to update ticket.
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment
15/02/17 | 100 | 100 | 100 | 96 | 100 | 99 | 100 | 100 | Timeouts on CMS SRM tests.
16/02/17 | 100 | 100 | 100 | 92 | 100 | 100 | 100 | 100 | Timeouts on CMS SRM tests.
17/02/17 | 100 | 100 | 100 | 88 | 100 | 100 | 99 | 100 | Timeouts on CMS SRM tests.
18/02/17 | 100 | 100 | 100 | 97 | 100 | 100 | 96 | 100 | Timeouts on CMS SRM tests.
19/02/17 | 100 | 100 | 100 | 97 | 100 | 100 | 99 | 100 | Timeouts on CMS SRM tests.
20/02/17 | 100 | 100 | 100 | 96 | 100 | 98 | 97 | 100 | Timeouts on CMS SRM tests.
21/02/17 | 100 | 100 | 100 | 98 | 100 | 98 | 93 | 100 | Timeouts on CMS SRM tests.
- CMS confirmed their xroot redirection tests are passing OK for RAL Castor.
- Catalin has carried out a survey of the users of the WMS service. This indicates ongoing interest in this service.