Tier1 Operations Report 2013-10-30
From GridPP Wiki
Revision as of 10:54, 30 October 2013 by Gareth smith
RAL Tier1 Operations Report for 30th October 2013
Review of Issues during the week 23rd to 30th October 2013. |
- The Torque/Maui batch farm still has one batch of worker nodes disabled. Apart from that it has run reasonably well. The Condor farm has run OK.
- Two files were declared lost to Atlas following the failure of GDSS720. These were in transit as the server went down.
Resolved Disk Server Issues |
- None
Current operational status and issues |
- The FTS3 testing has continued very actively with Atlas. Problems with FTS3 are being uncovered during these tests. Patches are being regularly applied to FTS3 to deal with issues.
- We are participating in xrootd federated access tests for Atlas. The server has now been successfully configured to work as an xrootd redirector, whereas before it could only serve as a proxy.
- We are running with the two farms, Condor and Torque/Maui, in production. The Torque/Maui farm will be decommissioned after the intervention next week and its nodes moved into the Condor farm.
- The uplink from the Tier1 core switch to the UKLight router, which was doubled last week, has been working OK since that change.
Ongoing Disk Server Issues |
- GDSS720 (AtlasDataDisk - D1T0) crashed during the evening of 22nd October. It has been drained. Following a firmware update to the RAID controller it is undergoing two weeks of acceptance testing before being returned to production.
Notable Changes made this last week. |
- CVMFS client version 2.1.15-1 has been rolled out to all worker nodes in the Condor farm.
- A further update was applied to FTS3 last Wednesday, 23rd October (upgraded to version 3.1.33-1).
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
BDIIs (lcgbdii, site-bdii), lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, Myproxy (lcgrbp01, myproxy) | SCHEDULED | WARNING | 05/11/2013 07:00 | 06/11/2013 12:00 | 1 day, 5 hours | Warning (At Risk) on services during intervention on Uninterruptible Power Supply (UPS). Some services (LFC, FTS) will experience two breaks of around one to two hours during this period. |
All Castor (all SRMs), Atlas Frontier | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 05/11/2013 21:00 | 14 hours | Stop of systems (Castor, Frontier/3D database) during work on Uninterruptible Power Supply (UPS). |
Condor batch farm (arc-ce01, arc-ce02, arc-ce03, cream-ce01, cream-ce02), lcgargus01, VO boxes, lcgapel01, atlas-squid, cms-squid, UIs (lcgui01, lcgui02), WMSs (lcgwms04, lcgwms05, lcgwms06), Perfsonar (perfsonar-ps01, perfsonar-ps02) | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 06/11/2013 15:00 | 1 day, 8 hours | Stop of systems (Batch, WMS) during work on Uninterruptible Power Supply (UPS). |
lcgce01, lcgce02, lcgce04, lcgce10, lcgce11 | SCHEDULED | OUTAGE | 05/11/2013 07:00 | 30/11/2013 23:59 | 25 days, 16 hours and 59 minutes | Service being decommissioned. |
lcgwms04, lcgwms05, lcgwms06 | SCHEDULED | OUTAGE | 01/11/2013 12:00 | 05/11/2013 07:00 | 3 days, 19 hours | Drain of WMSs ahead of their shutdown during work on UPS. |
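The Duration column above follows directly from the Start and End columns. A minimal sketch of that arithmetic is shown here; the function name `downtime_duration` is hypothetical and is not part of the GOC DB, and the timestamp format is assumed to be the day/month/year form used in the table.

```python
from datetime import datetime

# Timestamp format assumed from the Start/End columns of the table above.
FMT = "%d/%m/%Y %H:%M"

def downtime_duration(start: str, end: str) -> tuple:
    """Return (days, hours, minutes) between two timestamps (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes

# Decommissioning entry for lcgce01 etc.
print(downtime_duration("05/11/2013 07:00", "30/11/2013 23:59"))
# WMS drain entry.
print(downtime_duration("01/11/2013 12:00", "05/11/2013 07:00"))
```

For the two rows used here this gives 25 days, 16 hours and 59 minutes, and 3 days, 19 hours, matching the declared durations.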
Advance warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- Interruption to services over Tuesday/Wednesday 5/6 November during work on the UPS and safety testing of its circuits. Outages and Warnings declared in GOC DB.
Listing by category:
- Databases:
- Switch LFC/FTS/3D to new Database Infrastructure.
- Castor:
- None
- Networking:
- Update core Tier1 network and change connection to site and OPN including:
- Install new Routing layer for Tier1
- Change the way the Tier1 connects to the RAL network.
- These changes will lead to the removal of the UKLight router.
- Fabric
- One of the disk arrays hosting the FTS, LFC & Atlas 3D databases is showing a fault and an intervention is required - initially to update the disk array's firmware.
Entries in GOC DB starting between the 23rd and 30th October 2013. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor (all SRMs), batch (All CEs), lcgfts, lfc | SCHEDULED | OUTAGE | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network so some services stopped during the work. Other services at risk. |
All systems not in the above outage. | SCHEDULED | WARNING | 23/10/2013 09:45 | 23/10/2013 12:15 | 2 hours and 30 minutes | Upgrade (doubling) of network data link. Some risk of disruption to our Tier1 network - some services At Risk. (Other services declared down in separate GOC DB entry). |
Open GGUS Tickets (Snapshot at time of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
98337 | Amber | Urgent | In Progress | 2013-10-23 | 2013-10-23 | Mice | Slow file uploads to castor (MICE) |
98249 | Red | Urgent | In Progress | 2013-10-21 | 2013-10-30 | SNO+ | please configure cvmfs stratum-0 for SNO+ at RAL T1 |
98214 | Red | Less Urgent | In Progress | 2013-10-19 | 2013-10-21 | CMS | HC Job failure reading dataset from T1_UK_RAL storage |
98122 | Red | Less Urgent | In Progress | 2013-10-17 | 2013-10-30 | cernatschool | CVMFS access for the cernatschool.org VO |
97868 | Red | Less Urgent | Waiting Reply | 2013-10-08 | 2013-10-30 | T2K | CVMFS for t2k.org |
97759 | Red | Urgent | On Hold | 2013-10-04 | 2013-10-04 | OPS | SHA-2 test failing on lcgce01 |
97385 | Red | Less Urgent | In Progress | 2013-09-17 | 2013-10-14 | HyperK | CVMFS for hyperk.org |
97025 | Red | Less Urgent | On Hold | 2013-09-03 | 2013-09-12 | | Myproxy server certificate does not contain hostname |
91658 | Red | Less Urgent | On Hold | 2013-02-20 | 2013-09-03 | | LFC webdav support |
86152 | Red | Less Urgent | On Hold | 2012-09-17 | 2013-10-18 | | correlated packet-loss on perfsonar host |
Availability Report |
Day | OPS | Alice | Atlas | CMS | LHCb | Comment |
---|---|---|---|---|---|---|
23/10/13 | 89.6 | 89.6 | 87.4 | 89.6 | 89.6 | Systems stopped for doubling of data uplink. |
24/10/13 | 100 | 100 | 85.9 | 100 | 100 | Atlas Castor problem caused by a draining disk server. |
25/10/13 | 100 | 100 | 100 | 100 | 100 | |
26/10/13 | 100 | 100 | 99.5 | 100 | 100 | Single SRM test failure "Error reading token data header:" |
27/10/13 | 100 | 100 | 100 | 100 | 100 | |
28/10/13 | 100 | 100 | 100 | 100 | 100 | |
29/10/13 | 100 | 100 | 100 | 95.9 | 100 | Single SRM test failure "Error reading token data header:" |
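The 89.6 figure for 23/10/13 is consistent with the 2 hours and 30 minutes outage declared in the GOC DB for that day. The sketch below shows the arithmetic under a simplifying assumption: that availability is the fraction of the 24-hour day outside the outage. The real figures come from periodic service tests, so this is only an approximation; it does not account for the lower Atlas value of 87.4. The function name `availability_percent` is hypothetical.

```python
def availability_percent(outage_minutes: float, period_minutes: float = 24 * 60) -> float:
    """Percentage of the period not covered by the outage (simplified model)."""
    return round(100.0 * (1 - outage_minutes / period_minutes), 1)

# 23/10/2013 09:45 to 12:15 is 150 minutes of outage.
print(availability_percent(150))  # 89.6
```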