RAL Tier1 Operations Report for 23rd July 2014
Review of Issues during the week 16th to 23rd July 2014.
- The recurring problems with the SRM processes for the Castor GEN instance, which had been crashing since Friday (11th), were resolved on Friday (18th). The cause was a malformed file name in a request that was not trapped by the relevant SRM code.
- On Thursday (17th) the Castor disk cache for AtlasTape filled up. This was traced to the garbage collector not running and was immediately fixed.
- On Friday (18th) there were problems with Atlas FAX.
- Yesterday (Tuesday 22nd) there was a problem with the site network that effectively took the Tier1 off-air from ?? to ??. This was coincident with, but not caused by, an ongoing network update (to use the "RIP" protocol).
Resolved Disk Server Issues
Current operational status and issues
- We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
- There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues
Notable Changes made this last week.
- New CERN VOMS servers added.
- The Castor repack instance is being updated to version 2.1.14-13 (ongoing now).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.
- The FTS3 service will be updated (to v3.2.26) on Monday morning, 21st July.
- The core RAL site network will be updated to use RIP for network routing on Tuesday 22nd July.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Move the switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 16th and 23rd July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole Site | UNSCHEDULED | WARNING | 22/07/2014 12:00 | 22/07/2014 15:00 | 3 hours | Extending the earlier warning during update to site network routing.
Whole Site | UNSCHEDULED | OUTAGE | 22/07/2014 10:40 | 22/07/2014 11:25 | 45 minutes | Unexpected outage during update to site network routing.
Whole Site | UNSCHEDULED | WARNING | 22/07/2014 10:00 | 22/07/2014 11:00 | 1 hour | Extending the earlier warning during update to site network routing.
Whole Site | SCHEDULED | WARNING | 22/07/2014 07:00 | 22/07/2014 10:00 | 3 hours | Warning during update to site network routing.
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | WARNING | 21/07/2014 11:00 | 21/07/2014 13:00 | 2 hours | Update of FTS3 service to version v3.2.26
perfsonar-ps01.gridpp.rl.ac.uk, perfsonar-ps02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 14/07/2014 11:00 | 14/08/2014 11:00 | 31 days | Systems being decommissioned. They have been replaced by lcgps01.gridpp.rl.ac.uk and lcgps02.gridpp.rl.ac.uk
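The Duration column in the GOC DB entries can be cross-checked from the Start and End timestamps. The following is a minimal sketch, not part of any GOC DB tooling: the short service labels and the helper names `duration` and `durations` are illustrative assumptions, and the timestamps are copied from the entries above.

```python
from datetime import datetime

# Timestamp layout used in the GOC DB listing (day/month/year hour:minute).
FMT = "%d/%m/%Y %H:%M"

# (label, start, end) copied from the entries above; labels are shortened
# here for readability and are not official GOC DB identifiers.
DOWNTIMES = [
    ("Whole Site (unscheduled warning)", "22/07/2014 12:00", "22/07/2014 15:00"),
    ("Whole Site (unscheduled outage)", "22/07/2014 10:40", "22/07/2014 11:25"),
    ("Whole Site (unscheduled warning)", "22/07/2014 10:00", "22/07/2014 11:00"),
    ("Whole Site (scheduled warning)", "22/07/2014 07:00", "22/07/2014 10:00"),
    ("lcgfts3 (scheduled warning)", "21/07/2014 11:00", "21/07/2014 13:00"),
    ("perfsonar-ps01/ps02 (scheduled outage)", "14/07/2014 11:00", "14/08/2014 11:00"),
]


def duration(start, end):
    """Return the elapsed time between two GOC DB style timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def durations(rows=DOWNTIMES):
    """Return (label, timedelta) pairs for every downtime row."""
    return [(label, duration(start, end)) for label, start, end in rows]


if __name__ == "__main__":
    for label, delta in durations():
        print(f"{label}: {delta}")
```

Running the script prints one line per entry, which can be compared by eye against the Duration column.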
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106655 | Yellow | Less Urgent | In Progress | 2014-07-04 | 2014-07-16 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam)
106324 | Red | Urgent | In Progress | 2014-06-18 | 2014-07-07 | CMS | pilots losing network connections at T1_UK_RAL
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
16/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 98 |
17/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
18/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 99 |
19/07/14 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
20/07/14 | 100 | 100 | 100 | 100 | 100 | 96 | 100 |
21/07/14 | 100 | 100 | 100 | 100 | 100 | 100 | 96 |
22/07/14 | 100 | 95.6 | 97.4 | 95.6 | 100 | 96 | 96 | Site networking problem.
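For a quick weekly summary, the daily figures above can be averaged per column. This is a minimal sketch that assumes a plain unweighted mean over the seven days, which may differ from how the official availability reports aggregate the data. The names `COLUMNS`, `DAILY` and `weekly_mean` are illustrative.

```python
# Column order matches the availability table above.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas HC", "CMS HC"]

# Daily percentages copied from the table, keyed by day.
DAILY = {
    "16/07/14": [100, 100, 100, 100, 100, 100, 98],
    "17/07/14": [100, 100, 100, 100, 100, 97, 100],
    "18/07/14": [100, 100, 100, 100, 100, 97, 99],
    "19/07/14": [100, 100, 100, 100, 100, 98, 100],
    "20/07/14": [100, 100, 100, 100, 100, 96, 100],
    "21/07/14": [100, 100, 100, 100, 100, 100, 96],
    "22/07/14": [100, 95.6, 97.4, 95.6, 100, 96, 96],
}


def weekly_mean(daily=DAILY, columns=COLUMNS):
    """Return the unweighted mean per column, rounded to one decimal place."""
    per_column = zip(*daily.values())  # transpose rows (days) into columns
    return {
        name: round(sum(values) / len(values), 1)
        for name, values in zip(columns, per_column)
    }


if __name__ == "__main__":
    for name, value in weekly_mean().items():
        print(f"{name}: {value}")
```

Only the 22/07/14 row, the day of the site networking problem, pulls the Alice, Atlas and CMS averages below 100.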