RAL Tier1 Operations Report for 20th November 2013
Review of Issues during the week 13th to 20th November 2013.
- One batch of worker nodes is still under investigation. A BIOS/Firmware update has been applied to the nodes. At present around 50% of the batch are back in production and these are being monitored ahead of putting the remainder back.
- Following a report by LHCb, a number of LHCb files in the D1T0 service class have been found to be problematic. Castor believes there are copies of each file on both tape and disk, but the disk copy does not exist. These files can be repaired, and the problem does not imply any data loss. Work is ongoing to establish the full extent of the problem and to understand its cause.
Resolved Disk Server Issues
- GDSS720 (AtlasDataDisk - D1T0) was returned to service yesterday morning (Tuesday 19th). The system had crashed on 22nd October. It was drained and, following a firmware update to the RAID controller, underwent two weeks of acceptance testing.
Current operational status and issues
Ongoing Disk Server Issues
Notable Changes made this last week.
- A modification was made to increase the published job time limit on ARC CEs for LHCb.
- The size of the CASTOR overhead has been reduced from 5% to 1% on a small number of disk servers (two for CMS; three for Atlas). The impact of this will be evaluated before a wider roll-out of this change.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
CEs for Torque/Maui farm. (lcgce01, lcgce02, lcgce04, lcgce10, lcgce11) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Tuesday 26th November: Upgrading the firmware in a disk array. This will cause an interruption to the LFC, Atlas 3D and FTS2 services for a few hours (FTS3 is unaffected).
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 13th and 20th November 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Whole site | SCHEDULED | WARNING | 13/11/2013 10:00 | 13/11/2013 12:00 | 2 hours | RAL site in warning state due to power generator test.
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned.
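The Duration values in the GOC DB entries above follow directly from the Start and End timestamps. As a quick sanity check, the following minimal Python sketch recomputes them. It assumes the timestamps are day-first (DD/MM/YYYY HH:MM), and the helper name `downtime_duration` is hypothetical, not part of GOC DB or any grid tooling.

```python
from datetime import datetime

# Assumed timestamp layout: day-first, as the dates in the table suggest.
FMT = "%d/%m/%Y %H:%M"


def downtime_duration(start, end):
    """Return (days, hours, minutes) between two table timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    # timedelta stores whole days separately from the leftover seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60


# Whole-site warning for the power generator test.
print(downtime_duration("13/11/2013 10:00", "13/11/2013 12:00"))  # (0, 2, 0)
# Decommissioning outage of the Torque/Maui CEs.
print(downtime_duration("05/11/2013 07:00", "30/11/2013 23:59"))  # (25, 16, 59)
```

Both results agree with the Duration column: 2 hours, and 25 days, 16 hours and 59 minutes.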
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98764 | Red | Less Urgent | Waiting Reply | 2013-11-08 | 2013-11-11 | SNO+ | Storage request
98625 | Red | Urgent | Waiting Reply | 2013-11-04 | 2013-11-15 | LHCb | Data unavailable for Brazilian proxies at RAL-LCG2
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | In Progress | 2013-10-08 | 2013-11-18 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org
97025 | Red | Less urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
91658 | Red | Less Urgent | In Progress | 2013-02-20 | 2013-11-15 | | LFC webdav support
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
|
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
13/11/13 | 100 | 100 | 100 | 100 | 100 |
14/11/13 | 100 | 100 | 100 | 100 | 100 |
15/11/13 | 100 | 100 | 100 | 100 | 100 |
16/11/13 | 100 | 100 | 100 | 100 | 100 |
17/11/13 | 100 | 100 | 100 | 100 | 100 |
18/11/13 | 100 | 100 | 100 | 100 | 100 |
19/11/13 | 100 | 100 | 100 | 100 | 100 |