RAL Tier1 Operations Report for 8th March 2017
Review of Issues during the week 1st to 8th March 2017.
|
- Following a discussion at last week's liaison meeting, an apparent cap on LHCb batch jobs was removed. However, it turned out that the restriction in the Condor configuration file was erroneous and had no effect: LHCb batch jobs were not being limited.
- There has been a problem accessing AtlasScratchDisk in Castor this morning (8th Mar). Not yet fully understood.
Resolved Disk Server Issues
|
Current operational status and issues
|
- We are still seeing failures of the CMS SAM tests against the SRM. These are affecting our (CMS) availabilities, but the failure rate is lower than it was a few weeks ago.
Ongoing Disk Server Issues
|
Limits on concurrent batch system jobs.
|
- Atlas Pilot (Analysis) 1500
- CMS Multicore 460
Notable Changes made since the last meeting.
|
- CMS PhEDEx debug transfers switched from CASTOR to Ceph ECHO.
- Ongoing work applying security and other patches. More back-end database systems have been updated to remove a software layer ("asmlib").
- IPv6 disabled across systems in preparation for enabling IPv6 in the routers.
None
Advanced warning for other interventions
|
The following items are being discussed and are still to be formally scheduled and announced.
|
Pending - but not yet formally announced:
- Merge AtlasScratchDisk into larger Atlas disk pool.
Listing by category:
- Castor:
  - Update SRMs to new version, including updating to SL6. This will be done after the Castor 2.1.15 update.
- Networking:
  - Enabling IPv6 on the production network.
- Databases:
  - Removal of "asmlib" layer on Oracle database nodes.
Entries in GOC DB starting since the last report.
|
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | WARNING | 08/03/2017 07:00 | 08/03/2017 11:00 | 4 hours | Warning on site during network intervention in preparation for IPv6.
Open GGUS Tickets (Snapshot during morning of meeting)
|
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
126905 | Green | Less Urgent | In Progress | 2017-03-02 | 2017-03-02 | solid | finish commissioning cvmfs server for solidexperiment.org
126184 | Green | Less Urgent | In Progress | 2017-01-26 | 2017-02-07 | Atlas | Request of inputs for new sites monitoring
124876 | Red | Less Urgent | On Hold | 2016-11-07 | 2017-01-01 | OPS | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk
117683 | Red | Less Urgent | On Hold | 2015-11-18 | 2017-03-02 | | CASTOR at RAL not publishing GLUE 2. Looking at it again now (Feb), progress made on back end. Need to update ticket.
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 845); Atlas HC ECHO = Atlas ECHO (Template 842); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | Atlas HC ECHO | CMS HC | Comment
01/03/17 | 100 | 100 | 83 | 81 | 100 | 99 | 100 | 99 | Atlas: Ongoing problems with SRM test; CMS - CE test failures due to problem with argus server.
02/03/17 | 100 | 100 | 59 | 99 | 100 | 100 | 99 | 100 | Atlas: Ongoing problems with SRM test; CMS - timeouts in SRM tests.
03/03/17 | 100 | 100 | 91 | 98 | 100 | 97 | 100 | 100 | Atlas: Ongoing problems with SRM test; CMS - timeouts in SRM tests.
04/03/17 | 100 | 100 | 96 | 100 | 100 | 100 | 100 | 99 | Atlas: Ongoing problems with SRM test.
05/03/17 | 100 | 97 | 98 | 88 | 92 | 97 | 100 | 100 | Atlas: Ongoing problems with SRM test; CMS - timeouts in SRM tests; LHCb - some SRM test failures.
06/03/17 | 100 | 68 | 89 | 97 | 100 | 99 | 98 | 100 | Alice: Central monitoring problem; Atlas: Ongoing problems with SRM test; CMS - timeouts in SRM tests.
07/03/17 | 100 | 100 | 100 | 93 | 100 | 94 | 99 | 100 | Timeouts on CMS SRM tests.
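For readers who want a single figure per VO, the daily availabilities in the table above can be summarised with a few lines of Python. This is only an illustrative sketch: the dictionary name `AVAILABILITY` and the function `weekly_summary` are made up for this example, and the unweighted arithmetic mean used here is an assumption, not necessarily how availability is aggregated officially.

```python
from statistics import mean

# Daily availability (%) for 01/03/17 to 07/03/17, copied from the table above.
# The dictionary and function names are hypothetical, chosen for this sketch only.
AVAILABILITY = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 97, 68, 100],
    "Atlas": [83, 59, 91, 96, 98, 89, 100],
    "CMS":   [81, 99, 98, 100, 88, 97, 93],
    "LHCb":  [100, 100, 100, 100, 92, 100, 100],
}


def weekly_summary(table):
    """Return {vo: (mean availability rounded to 0.1, lowest daily value)}."""
    return {vo: (round(mean(vals), 1), min(vals)) for vo, vals in table.items()}


if __name__ == "__main__":
    for vo, (avg, low) in weekly_summary(AVAILABILITY).items():
        print(f"{vo:6s} mean {avg:5.1f}%  lowest day {low}%")
```

On these figures Atlas has the lowest weekly mean, driven largely by the 59% day on 02/03/17, which matches the ongoing SRM test problems noted in the comments.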