RAL Tier1 Operations Report for 9th March 2016
Review of Issues during the week 2nd to 9th March 2016.
- There were packet loss problems within part of the Tier1 network. These did not seem to affect operations, and the effect was unusual: for example, some access to the WNs from offices was affected, but access from disk servers was not. The cause was tentatively traced to OpenStack (unconfirmed). Fixed on Monday (7th).
Resolved Disk Server Issues
Current operational status and issues
- LHCb sees a low but persistent rate of failure when copying the results of batch jobs to Castor. A further problem sometimes occurs when these (failed) writes are then attempted to storage at other sites. A recent modification has improved, but not completely fixed, this.
- The intermittent, low-level, load-related packet loss seen over external connections is still being tracked. Likewise, we have been working to understand a remaining low level of packet loss seen within part of our Tier1 network.
Ongoing Disk Server Issues
- GDSS677 (CMSTape - D0T1) crashed on the morning of Tuesday 8th March. It is being checked out.
Notable Changes made since the last meeting.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- The Castor 2.1.15 update will soon be scheduled. We await successful completion of testing before scheduling.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update to Castor version 2.1.15.
  - Migration of data from T10KC to T10KD tapes (affects Atlas & LHCb data).
- Networking:
  - Replace the UKLight Router. Then upgrade the 'bypass' link to the RAL border routers to 2*10Gbit.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, LFC).
- Grid Services:
  - A Load Balancer (HAProxy) will be used in front of the FTS service.
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 03/03/2016 10:00 | 03/03/2016 11:00 | 1 hour | Update of Production FTS3 service to version 3.4.2
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
119841 | Amber | Less Urgent | In Progress | 2016-03-01 | 2016-03-08 | LHCb | HTTP support for lcgcadm04.gridpp.rl.ac.uk
117683 | Green | Less Urgent | In Progress | 2015-11-18 | 2016-02-17 | | CASTOR at RAL not publishing GLUE 2
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
02/03/16 | 100 | 100 | 90 | 60 | 100 | 100 | 100 | Problem affected CE tests across many sites for both Atlas & CMS.
03/03/16 | 100 | 100 | 100 | 61 | 100 | 100 | 100 | Ongoing problem affecting CE tests across many sites for CMS.
04/03/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
05/03/16 | 100 | 100 | 100 | 100 | 100 | 100 | N/A |
06/03/16 | 100 | 100 | 87 | 100 | 100 | 100 | N/A | No results for Atlas tests - not sure what is happening.
07/03/16 | 100 | 100 | 56 | 66 | 73 | 100 | 100 | Missing monitoring results. To be fixed up.
08/03/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
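
The daily figures in the table above can be summarised as a per-column weekly average. The following is a minimal Python sketch of that arithmetic, not an official availability calculation: the function name `weekly_means`, the use of a plain arithmetic mean, and the choice to skip `N/A` entries are all assumptions made for illustration. Note also that the comments for 06/03 and 07/03 indicate missing test or monitoring results, so averages over those days may understate the real availability.

```python
from statistics import mean

# Column names as they appear in the table (the Day and Comment columns are omitted).
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas HC", "CMS HC"]

# Daily figures (percent) copied from the table above, one row per day.
ROWS = """\
02/03/16 | 100 | 100 | 90 | 60 | 100 | 100 | 100
03/03/16 | 100 | 100 | 100 | 61 | 100 | 100 | 100
04/03/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100
05/03/16 | 100 | 100 | 100 | 100 | 100 | 100 | N/A
06/03/16 | 100 | 100 | 87 | 100 | 100 | 100 | N/A
07/03/16 | 100 | 100 | 56 | 66 | 73 | 100 | 100
08/03/16 | 100 | 100 | 100 | 100 | 100 | 100 | 100
"""


def weekly_means(rows):
    """Return the mean of each column, rounded to one decimal place.

    Assumption: 'N/A' cells are skipped rather than counted as zero.
    """
    values = {name: [] for name in COLUMNS}
    for line in rows.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # cells[0] is the date; the remaining cells line up with COLUMNS.
        for name, cell in zip(COLUMNS, cells[1:]):
            if cell != "N/A":
                values[name].append(float(cell))
    return {name: round(mean(v), 1) for name, v in values.items()}


if __name__ == "__main__":
    for name, average in weekly_means(ROWS).items():
        print(f"{name}: {average}")
```

Under these assumptions the week averages roughly 90.4 for Atlas, 83.9 for CMS and 96.1 for LHCb, with the other columns at 100.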