RAL Tier1 Operations Report for 9th July 2014
Review of Issues during the week 2nd to 9th July 2014.
- There were problems with the SRM (not Castor) for the GEN instance on Thursday and Friday of last week (3/4 July). It was fixed by a database edit.
- There were problems with Atlas multicore jobs on Friday 4th July. We believe this is an Atlas issue.
Resolved Disk Server Issues
Current operational status and issues
- We are still investigating xroot access to CMS Castor following the upgrade on the 17th June.
- There is a problem with the dteam SRM regional nagios tests, which may be caused by how dteam is published by the CIP.
Ongoing Disk Server Issues
Notable Changes made this last week.
- On Tuesday and Wednesday (8th and 9th July) the Atlas Castor instance was upgraded to version 2.1.14-13. Castor Atlas was returned to production at 10:40 this morning.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- We are planning the termination of the FTS2 service (announced for 2nd September) now that almost all use is on FTS3.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Networking:
  - Move switches connecting the 2011 disk server batches onto the Tier1 mesh network.
  - Make routing changes to allow the removal of the UKLight Router.
- Fabric:
  - We are phasing out the use of the software server used by the small VOs.
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
  - There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 2nd and 9th July 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | SCHEDULED | OUTAGE | 08/07/2014 06:00 | 09/07/2014 10:40 | 1 day, 6 hours | Atlas Castor instance down for Castor 2.1.14 Stager Update
Castor GEN: srm-alice, srm-biomed, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-na62, srm-snoplus, srm-superb, srm-t2k | UNSCHEDULED | WARNING | 03/07/2014 07:45 | 03/07/2014 13:00 | 5 hours and 15 minutes | Problem with SRMs for Castor GEN instance. (However Castor itself - e.g. xroot access - working OK).
Whole site | SCHEDULED | WARNING | 02/07/2014 10:00 | 02/07/2014 11:00 | 1 hour | RAL Tier1 site in warning state due to UPS/generator test.
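As a cross-check on the Duration column, the elapsed time for each entry can be recomputed from its Start and End timestamps. The following is a minimal sketch only: the helper names `elapsed` and `describe` are hypothetical, and the timestamps are assumed to be day/month/year in a single time zone. Computed this way, the two WARNING entries agree with the table (5 hours 15 minutes and 1 hour), while the srm-atlas entry comes out at 1 day, 4 hours and 40 minutes against the listed 1 day, 6 hours. The report does not say which figure is authoritative.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout: day/month/year hour:minute, as in the table above.
FMT = "%d/%m/%Y %H:%M"

# (service, start, end) copied from the GOC DB entries above.
ENTRIES = [
    ("srm-atlas", "08/07/2014 06:00", "09/07/2014 10:40"),
    ("Castor GEN SRMs", "03/07/2014 07:45", "03/07/2014 13:00"),
    ("Whole site", "02/07/2014 10:00", "02/07/2014 11:00"),
]


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two table timestamps (hypothetical helper)."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def describe(delta: timedelta) -> str:
    """Format a timedelta as whole days, hours and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m"


if __name__ == "__main__":
    for service, start, end in ENTRIES:
        print(f"{service}: {describe(elapsed(start, end))}")
```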
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
106753 | Green | Less Urgent | In Progress | 2014-07-09 | 2014-07-09 | Atlas | Errors in transfers to RAL-LCG2
106695 | Green | Less Urgent | In Progress | 2014-07-08 | 2014-07-08 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2
106655 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | Ops | [Rod Dashboard] Issues detected at RAL-LCG2 (srm-dteam)
106640 | Green | Less Urgent | In Progress | 2014-07-04 | 2014-07-04 | ILC | Failure to submit jobs to RAL-LCG2 CEs
106610 | Green | Less Urgent | In Progress | 2014-07-02 | 2014-07-02 | HyperK | HyperK support
106324 | Yellow | Urgent | In Progress | 2014-06-18 | 2014-07-01 | CMS | pilots losing network connections at T1_UK_RAL
105405 | Red | Urgent | On Hold | 2014-05-14 | 2014-07-01 | | please check your Vidyo router firewall configuration
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
02/07/14 | 100 | 100 | 100 | 100 | 100 | 98 | 99 |
03/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
04/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
05/07/14 | 100 | 100 | 100 | 100 | 100 | 92 | 100 |
06/07/14 | 100 | 100 | 100 | 100 | 100 | 99 | 100 |
07/07/14 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
08/07/14 | 100 | 100 | 41 | 100 | 100 | 100 | 99 | Atlas Castor upgrade.