From GridPP Wiki
Latest revision as of 12:05, 5 August 2015
RAL Tier1 Operations Report for 5th August 2015
Review of Issues during the week 29th July to 5th August 2015.
- There have been further problems with network connectivity caused by a failure of our secondary router, which at the time was our only router in service. The first outage was on Tuesday (28th July), as reported last week. Another outage was ongoing at the time of last week's meeting; it lasted around five hours and was caused both by problems getting the router back up and by a subsequent failure to propagate routing information via the RIP protocol. A third outage, lasting around an hour, occurred on Sunday morning (2nd August). These preceded a long-planned scheduled intervention on 4th August, when an engineer from the router vendor was present. As a result of this intervention and the associated discussions we believe the router problems are now largely understood, and we are again running with a resilient pair of network routers.
Resolved Disk Server Issues
- None.
Current operational status and issues
- The post mortem review of the network incident on the 8th April is being finalised.
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked.
- There are some ongoing issues for CMS. These are: a problem with the Xroot (AAA) redirection accessing Castor; slow file open times using Xroot; and poor batch job efficiencies.
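Low-level packet loss of the kind being tracked on the OPN link is typically spotted from ping statistics. A minimal sketch of that check (illustrative only — this is not the monitoring actually used on the link, and the threshold is an assumption):

```python
import re

# Illustrative only: parse the summary line of ping output and flag
# loss above a threshold. Not the actual OPN link monitoring.
def packet_loss_percent(ping_summary):
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_summary)
    if m is None:
        raise ValueError("no packet-loss figure found")
    return float(m.group(1))

summary = "1000 packets transmitted, 997 received, 0.3% packet loss"
loss = packet_loss_percent(summary)
if loss > 0.1:  # example alert threshold (assumption, not site policy)
    print(f"WARN: {loss}% packet loss on OPN link")
```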
Ongoing Disk Server Issues
- None.
Notable Changes made since the last meeting.
- During yesterday's intervention on the Tier1 routers a number of problems were understood. We again have a resilient pair of routers which are running the latest production firmware and the RIP protocol is enabled.
- The test of the updated worker node configuration (with grid middleware delivered via CVMFS) continues on one whole batch of Worker Nodes. We now propose continuing the rollout to more worker nodes. (There is a complication in that some OS patches are being applied to these nodes now; these will also be tested before the rollout is extended.)
- Investigative work continues into the ongoing issues with CMS Castor. This included putting the CMS Xroot reads through the Castor scheduler again.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | UNSCHEDULED | WARNING | 04/08/2015 15:00 | 05/08/2015 15:00 | 24 hours | Warning on Site following investigations of problem with network router. |
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of Castor disk servers to SL6. We plan to do this during the second part of August (17-28). Rollout plan to be discussed.
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Whilst we need to confirm exact dates we are looking at the following days in September:
- Tuesday 8th: one-day outage for Atlas & GEN.
- Tuesday 15th: at risk for Atlas & GEN.
- Thursday 17th: one-day outage for ALL instances.
- Tuesday 22nd: one-day at risk for ALL instances.
- Thursday 24th: half-day outage for ALL instances.
- Extending the rollout of the new worker node configuration.
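As a quick sanity check on the proposed September dates above (a hypothetical helper, not part of the scheduling process), the named weekdays can be verified against the 2015 calendar:

```python
from datetime import date

# Hypothetical check (not part of the report tooling): confirm that the
# proposed September 2015 intervention days fall on the weekdays named above.
proposed = [
    (date(2015, 9, 8), "Tuesday"),
    (date(2015, 9, 15), "Tuesday"),
    (date(2015, 9, 17), "Thursday"),
    (date(2015, 9, 22), "Tuesday"),
    (date(2015, 9, 24), "Thursday"),
]
for d, expected in proposed:
    assert d.strftime("%A") == expected, f"{d} is not a {expected}"
```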
Listing by category:
- Databases:
- Switch LFC/3D to new Database Infrastructure.
- Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor and Atlas Frontier (LFC already done).
- Castor:
- Update SRMs to new version (includes updating to SL6).
- Update the Oracle databases behind Castor to version 11.2.0.4. This will require some downtimes (see above).
- Update disk servers to SL6.
- Update to Castor version 2.1.15.
- Networking:
- Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
- Make routing changes to allow the removal of the UKLight Router.
- Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | UNSCHEDULED | WARNING | 04/08/2015 15:00 | 05/08/2015 15:00 | 24 hours | Warning on Site following investigations of problem with network router. |
Whole site | SCHEDULED | OUTAGE | 04/08/2015 08:30 | 04/08/2015 15:00 | 6 hours and 30 minutes | Site Outage during investigation of problem with network router. |
Whole site | UNSCHEDULED | OUTAGE | 02/08/2015 10:00 | 02/08/2015 11:00 | 1 hour | Site was inaccessible due to a problem with the network router |
Whole site | UNSCHEDULED | WARNING | 29/07/2015 16:30 | 30/07/2015 16:30 | 24 hours | Problems on the network have been fixed. Putting whole site AT-RISK for next 24 hours. |
Whole site | UNSCHEDULED | OUTAGE | 29/07/2015 13:10 | 29/07/2015 16:51 | 3 hours and 41 minutes | Outage while we work on network problems |
Whole site | UNSCHEDULED | WARNING | 29/07/2015 11:10 | 29/07/2015 13:10 | 2 hours | Outage while we investigate network problems |
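The Duration column in the tables above follows directly from each entry's start and end times. A small sketch that recomputes it (a hypothetical helper, not part of the GOC DB):

```python
from datetime import datetime

# Hypothetical helper (not part of the GOC DB): recompute an entry's
# duration from its start and end timestamps, formatted as in the tables.
def duration_text(start, end, fmt="%d/%m/%Y %H:%M"):
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    minutes = rem // 60
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("s" if hours != 1 else ""))
    if minutes:
        parts.append(f"{minutes} minute" + ("s" if minutes != 1 else ""))
    return " and ".join(parts) if parts else "0 minutes"

print(duration_text("29/07/2015 13:10", "29/07/2015 16:51"))  # 3 hours and 41 minutes
```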
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
115434 | Green | Less Urgent | In Progress | 2015-08-03 | 2015-08-03 | SNO+ | glite-wms-job-status warning |
115417 | Green | Very Urgent | Waiting Reply | 2015-08-02 | 2015-08-03 | LHCb | CVMS problem at RAL-LCG2 |
115387 | Green | Less Urgent | In Progress | 2015-08-03 | 2015-08-03 | SNO+ | XRootD for SNO+ from RAL |
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes | |
113836 | Amber | Less Urgent | In Progress | 2015-05-20 | 2015-06-24 | GLUE 1 vs GLUE 2 mismatch in published queues | |
108944 | Red | Less Urgent | In Progress | 2014-10-01 | 2015-07-17 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
29/07/15 | 77.9 | 78.0 | 78.0 | 79.0 | 73.0 | 97 | 100 | Problem with Tier1 network router. |
30/07/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
31/07/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
01/08/15 | 97.4 | 100 | 100 | 100 | 100 | 100 | 99 | SRM tests failed on Castor GEN instance. |
02/08/15 | 97.3 | 100 | 95.0 | 95.0 | 97.0 | 67 | 96 | Problem with Tier1 network router. |
03/08/15 | 100 | 100 | 100 | 94.0 | 96.0 | 93 | 97 | CMS: ARC CE test failures; LHCb: SRM test failure. |
04/08/15 | 72.9 | 73.0 | 73.0 | 73.0 | 73.0 | 100 | 56 | Planned intervention on the Tier1 Network Routers. |