From GridPP Wiki
RAL Tier1 Operations Report for 4th December 2013
Review of Issues during the week 27th November to 4th December 2013.
- A problem was reported last week with one of the WMS systems, WMS05, caused by a user job filling up the available disk space. Our initial clean-up was insufficient: the disk on WMS05 filled up again and the system stopped accepting jobs overnight Thursday/Friday.
- One file has been reported lost to Atlas. It was found to be missing during the (ongoing) Atlas file renaming.
Resolved Disk Server Issues
- Two disk servers (gdss238, gdss239) in AtlasHotDisk were out of production from Thursday to Friday (28-29 Nov) while they were physically moved. (The rack space was required for this year's purchases.)
Current operational status and issues
Ongoing Disk Server Issues
Notable Changes made this last week.
- On Friday 29th Nov. the site-BDIIs were updated to EMI-3 update 9.
- Some batch system parameters have been adjusted as experience is gained with the new system, notably when Atlas were running a large number of whole node jobs.
- Wednesday 11th December: UPS/Generator Load Test at 10:00. Site in 'warning' state.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- There will be an interruption to the small VO's software server as it is to be physically moved.
Listing by category:
- Databases:
  - Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
  - Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
  - Possible move of Tier1 core network switch in January (TBC).
  - Implementation of new site firewall.
  - Update core Tier1 network and change connection to site and OPN including:
    - Install new Routing layer for Tier1
    - Change the way the Tier1 connects to the RAL network.
    - These changes will lead to the removal of the UKLight Router.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting between the 27th November and 4th December 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 26/11/2013 15:00 | 26/11/2013 15:15 | 15 minutes | Investigating problems with restarting FTS2 service after intervention earlier today
lcgft-atlas.gridpp.rl.ac.uk, lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 26/11/2013 09:30 | 26/11/2013 15:00 | 5 hours and 30 minutes | Outage of LFC, FTS2 and Atlas 3D/Frontier during work on disk array used by back end database.
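The Duration column in the GOC DB entries above follows directly from the Start and End timestamps. As a minimal sketch for cross-checking such entries, assuming the timestamps are in day/month/year order as shown (the names `ENTRIES` and `outage_minutes` are illustrative and not part of any GOC DB tooling):

```python
from datetime import datetime

# Start/End pairs copied from the two GOC DB entries listed above.
# The format (day/month/year hour:minute) is assumed from the report itself.
ENTRIES = [
    ("26/11/2013 15:00", "26/11/2013 15:15"),  # unscheduled FTS2 outage
    ("26/11/2013 09:30", "26/11/2013 15:00"),  # scheduled LFC/FTS2/Frontier outage
]

TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"


def outage_minutes(start: str, end: str) -> int:
    """Return the length of an outage in whole minutes."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(
        start, TIMESTAMP_FORMAT
    )
    return int(delta.total_seconds() // 60)


for start, end in ENTRIES:
    minutes = outage_minutes(start, end)
    hours, rest = divmod(minutes, 60)
    print(f"{start} -> {end}: {hours} h {rest} min")
```

Under these assumptions the two entries come out at 15 minutes and 5 hours 30 minutes, matching the Duration column.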
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-12-03 | T2K | CVMFS for t2k.org
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
27/11/13 | 100 | 91.1 | 100 | 100 | 58.4 | Ongoing problem that affected all sites. (For Alice additional scheduling issue - see 28/11)
28/11/13 | 100 | 51.5 | 100 | 100 | 100 | Problem scheduling Alice test jobs coming into the 'whole node' queue.
29/11/13 | 100 | 100 | 100 | 100 | 100 |
30/11/13 | 100 | 100 | 100 | 100 | 100 |
01/12/13 | 100 | 100 | 100 | 100 | 100 |
02/12/13 | 100 | 100 | 100 | 100 | 100 |
03/12/13 | 100 | 100 | 100 | 100 | 100 |
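For a quick per-VO summary of the daily figures above, the following sketch takes a plain unweighted mean over the seven days. This is only an illustration: the report does not state how weekly availability is aggregated, and the names `DAILY` and `weekly_mean` are hypothetical.

```python
# Daily percentages copied from the table above, 27/11/13 to 03/12/13.
DAILY = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [91.1, 51.5, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 100, 100, 100, 100, 100],
    "CMS":   [100, 100, 100, 100, 100, 100, 100],
    "LHCb":  [58.4, 100, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily figures (an assumed aggregation)."""
    return sum(values) / len(values)


for vo, values in DAILY.items():
    print(f"{vo}: {weekly_mean(values):.1f}")
```

Only Alice and LHCb fall below 100 over the week, reflecting the two days with comments in the table.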