RAL Tier1 Operations Report for 27th November 2013
Review of Issues during the week 20th to 27th November 2013.
- The problem with one batch of worker nodes that has been reported in previous weeks has been solved. These systems were put back in production on Thursday (21st Nov).
- On Monday (25th November) the primary OPN link to CERN failed. The failover was not clean: the router at the CERN end switched to the backup link, but the router at the RAL end did not. Once the problem was identified, the primary link was forced down at the RAL end and all traffic ran over the backup link. The following morning the primary link was fixed and traffic was switched back to it.
- On Monday (25th November) there was a problem with one of the hypervisor clusters that led to problems on some service machines that run as VMs there (FTS, Alice VO box, arc-ce03).
- On Tuesday evening (26th November) there was a problem with one of the WMS systems, WMS05, caused by a user job filling up the available space.
Resolved Disk Server Issues
Current operational status and issues
- The Condor batch farm is running fine. Some tweaks have been applied to the scheduling in the light of experience (e.g. the Condor priority half-life was increased from 1 to 3 days).
- The FTS3 testing continues. Two updates have been applied in this last week.
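To illustrate what the half-life change means for fair-share scheduling, here is a minimal sketch of exponential usage decay. It assumes that past usage loses half of its weight every half-life period, which is how a priority half-life is generally understood; in HTCondor this is normally controlled by the `PRIORITY_HALFLIFE` setting (given in seconds), but the exact configuration used at RAL is not stated in this report. The function name `decay_factor` is hypothetical.

```python
# Sketch only: how much weight past usage still carries under a given
# priority half-life. Assumes simple exponential decay (an assumption,
# not taken from the RAL configuration).

def decay_factor(age_days: float, halflife_days: float) -> float:
    """Fraction of usage from `age_days` ago that still counts today."""
    return 0.5 ** (age_days / halflife_days)

# Compare the old (1 day) and new (3 day) half-life settings.
for age in (1, 3, 7):
    old = decay_factor(age, halflife_days=1)
    new = decay_factor(age, halflife_days=3)
    print(f"usage from {age} day(s) ago: weight {old:.3f} (1-day) vs {new:.3f} (3-day)")
```

With the longer half-life, heavy usage from several days ago continues to count against a user for longer, so short bursts of activity are remembered rather than forgotten overnight.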
Ongoing Disk Server Issues
Notable Changes made this last week.
- On Tuesday (26th Nov) a firmware update was applied to one of the disk arrays used by the LFC/FTS2/Atlas3D databases. The array had been showing a fault, and the firmware upgrade was required in order to investigate it.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
  - Possible move of Tier1 core network switch in January (TBC).
  - Implementation of new site firewall.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting between the 20th and 27th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts (FTS2) | UNSCHEDULED | OUTAGE | 26/11/2013 15:00 | 26/11/2013 15:15 | 15 minutes | Investigating problems with restarting FTS2 service after intervention earlier today
lcgft-atlas, lcgfts (FTS2), lfc.gridpp | SCHEDULED | OUTAGE | 26/11/2013 09:30 | 26/11/2013 15:00 | 5 hours and 30 minutes | Outage of LFC, FTS2 and Atlas 3D/Frontier during work on disk array used by back end database.
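The durations in the GOC DB entries above can be cross-checked from the start and end times. The following is a small sketch using only the Python standard library; the helper name `duration_minutes` is hypothetical, and the day/month/year format string is an assumption based on how the timestamps are written in the table.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as in the table.
FMT = "%d/%m/%Y %H:%M"

def duration_minutes(start: str, end: str) -> int:
    """Length of a downtime in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Start and end times copied from the two entries listed above.
entries = [
    ("26/11/2013 15:00", "26/11/2013 15:15"),  # unscheduled FTS2 outage
    ("26/11/2013 09:30", "26/11/2013 15:00"),  # scheduled LFC/FTS2/Frontier outage
]

for start, end in entries:
    print(f"{start} -> {end}: {duration_minutes(start, end)} minutes")
```

Taken together, the two entries show the scheduled outage running straight into the unscheduled one, for a combined interruption to FTS2 from 09:30 to 15:15.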
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
99162 | Green | Less Urgent | In Progress | 2013-11-25 | 2013-11-25 | | Publishing default values
99161 | Green | Less Urgent | In Progress | 2013-11-25 | 2013-11-25 | | GLUE 2 obsolete entries
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | In Progress | 2013-10-08 | 2013-11-18 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-11-15 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
20/11/13 | 100 | 100 | 100 | 100 | 100 |
21/11/13 | 100 | 100 | 100 | 100 | 100 |
22/11/13 | 100 | 100 | 100 | 100 | 100 |
23/11/13 | 100 | 100 | 100 | 100 | 100 |
24/11/13 | 100 | 100 | 100 | 100 | 100 |
25/11/13 | 100 | 100 | 82.4 | 83.0 | 84.9 | CERN Primary link failed but failover didn't work correctly
26/11/13 | 100 | 95.8 | 100 | 89.7 | 87.8 | BDII problem at CERN affected many sites.
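As a rough summary of the daily figures above, the sketch below computes a simple unweighted mean per column over the seven days listed. This is only an illustration: it assumes the values are percentages and that a plain average is meaningful, which may differ from how official availability figures are calculated. The variable names are hypothetical.

```python
from statistics import mean

# Daily values copied from the table above, 20/11/13 to 26/11/13.
availability = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 95.8],
    "Atlas": [100, 100, 100, 100, 100, 82.4, 100],
    "CMS":   [100, 100, 100, 100, 100, 83.0, 89.7],
    "LHCb":  [100, 100, 100, 100, 100, 84.9, 87.8],
}

# Unweighted mean over the week, rounded to one decimal place.
weekly_mean = {vo: round(mean(days), 1) for vo, days in availability.items()}

for vo, value in weekly_mean.items():
    print(f"{vo}: {value}")
```

The two dips on 25th and 26th November line up with the comments in the table: the OPN failover problem and the BDII problem at CERN.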