Difference between revisions of "Tier1 Operations Report 2013-12-04"
From GridPP Wiki
Gareth smith (Talk | contribs) |
Latest revision as of 13:23, 4 December 2013
RAL Tier1 Operations Report for 4th December 2013
Review of Issues during the week 27th November to 4th December 2013.
- A problem was reported last week with one of the WMS systems, WMS05, caused by a user job filling up the available disk space. Our initial clean-up was insufficient: the disk on WMS05 became rather full again and the system stopped accepting jobs overnight Thursday/Friday.
- One file has been reported lost to Atlas. It was found to be missing during the (ongoing) Atlas file renaming.
Resolved Disk Server Issues
- Two disk servers (gdss238, gdss239) in AtlasHotDisk were out of production from Thursday to Friday (28-29 Nov) while they were physically moved. (The rack space was required for this year's purchases.)
Current operational status and issues
- Nothing To Report.
Ongoing Disk Server Issues
- None
Notable Changes made this last week.
- On Friday 29th Nov. the site-BDIIs were updated to EMI-3 update 9.
- Some batch system parameters have been adjusted as experience is gained with the new system, notably when Atlas were running a large number of whole node jobs.
Declared in the GOC DB
- Wednesday 11th December: UPS/Generator Load Test at 10:00. Site in 'warning' state.
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- There will be an interruption to the small VOs' software server as it is to be physically moved.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- Castor 2.1.14 testing is starting. It is expected to be a few months before deployment.
- Networking:
- Possible move of Tier1 core network switch in January (TBC).
- Implementation of new site firewall.
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight Router.
- Fabric:
- Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting between the 27th November and 4th December 2013.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | OUTAGE | 26/11/2013 15:00 | 26/11/2013 15:15 | 15 minutes | Investigating problems with restarting FTS2 service after intervention earlier today |
lcgft-atlas.gridpp.rl.ac.uk, lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 26/11/2013 09:30 | 26/11/2013 15:00 | 5 hours and 30 minutes | Outage of LFC, FTS2 and Atlas 3D/Frontier during work on disk array used by back end database. |
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
98249 | Red | Urgent | Waiting Reply | 2013-10-21 | 2013-11-18 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
98122 | Red | Less Urgent | Waiting Reply | 2013-10-17 | 2013-11-18 | cernatschool | CVMFS access for the cernatschool.org VO |
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-12-03 | T2K | CVMFS for t2k.org |
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-11-18 | HyperK | CVMFS for hyperk.org |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-11-05 | | Myproxy server certificate does not contain hostname
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
27/11/13 | 100 | 91.1 | 100 | 100 | 58.4 | Ongoing problem that affected all sites. (For Alice additional scheduling issue - see 28/11) |
28/11/13 | 100 | 51.5 | 100 | 100 | 100 | Problem scheduling Alice test jobs coming into the 'whole node' queue. |
29/11/13 | 100 | 100 | 100 | 100 | 100 | |
30/11/13 | 100 | 100 | 100 | 100 | 100 | |
01/12/13 | 100 | 100 | 100 | 100 | 100 | |
02/12/13 | 100 | 100 | 100 | 100 | 100 | |
03/12/13 | 100 | 100 | 100 | 100 | 100 | |
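
The daily figures above can be condensed into a per-VO average for the week. The following is only a minimal sketch: the dictionary layout and the `weekly_mean` helper are illustrative assumptions, not part of any GridPP or WLCG tooling, and the values are copied by hand from the table. It takes a simple unweighted mean over the seven days, which may differ from how official availability summaries are calculated.

```python
from statistics import mean

# Daily availability (%) for 27/11/13 to 03/12/13, copied from the
# Availability Report table. The dictionary layout is an assumption
# made for this sketch only.
availability = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [91.1, 51.5, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 100, 100, 100, 100, 100],
    "CMS":   [100, 100, 100, 100, 100, 100, 100],
    "LHCb":  [58.4, 100, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Return the unweighted mean of the daily figures, to one decimal place."""
    return round(mean(values), 1)


# Hypothetical summary: one averaged figure per VO for the week.
weekly = {vo: weekly_mean(values) for vo, values in availability.items()}

for vo, avg in weekly.items():
    print(f"{vo}: {avg}%")
```

Only Alice and LHCb fall below 100% in this summary, matching the two days with comments in the table.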