Latest revision as of 07:36, 27 August 2015
RAL Tier1 Operations Report for 26th August 2015
Review of Issues during the week 19th to 26th August 2015. |
- Last week we reported a significant backlog of migrations to tape for Atlas, as the instance was seeing very high load. Castor caught up with the backlog overnight Thursday/Friday (20/21 Aug) and there have been no problems since then.
Resolved Disk Server Issues |
- None
Current operational status and issues |
- As reported last week, the post-mortem review of the network incident on the 8th April has been finalised. It can be seen here:
  - https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20150408_network_intervention_preceding_Castor_upgrade
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked. An attempted fix (replacing the cables between the Tier1 core network and the UKLight router) had no effect.
- There are some ongoing issues for CMS. There is a problem with the Xroot (AAA) redirection accessing Castor, and file open times using Xroot are slow. The previously poor batch job efficiencies have improved since a change to the Linux I/O scheduler on the CMS disk servers some ten days ago (see the sketch below).
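The scheduler change mentioned above concerns the Linux block-layer I/O scheduler, which is selected per block device through sysfs. As an illustration only (the report does not say which scheduler was chosen, and the device names below are placeholders), a minimal Python sketch of how the active scheduler can be read and changed:

```python
#!/usr/bin/env python
# Illustrative sketch only: inspect and change the block I/O scheduler on a
# Linux disk server via sysfs. The devices matched here ("sd*") and the
# scheduler named in the commented-out call are examples, not the actual
# configuration of the RAL CMS disk servers.
import glob
import os

def current_scheduler(device):
    """Return (active, available) schedulers for a block device."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        entries = f.read().split()
    # The active scheduler is shown in square brackets, e.g. "noop deadline [cfq]".
    active = next((e.strip("[]") for e in entries if e.startswith("[")), "unknown")
    available = [e.strip("[]") for e in entries]
    return active, available

def set_scheduler(device, scheduler):
    """Select a scheduler for a device (requires root)."""
    with open("/sys/block/%s/queue/scheduler" % device, "w") as f:
        f.write(scheduler)

if __name__ == "__main__":
    for path in glob.glob("/sys/block/sd*"):
        dev = os.path.basename(path)
        active, available = current_scheduler(dev)
        print("%s: active=%s available=%s" % (dev, active, ",".join(available)))
        # Example only -- uncomment to switch a device to the deadline scheduler:
        # set_scheduler(dev, "deadline")
```

On most kernels the sysfs file lists the available schedulers with the active one in square brackets, and writing one of the listed names selects it for that device.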
Ongoing Disk Server Issues |
- None
Notable Changes made since the last meeting. |
- The rollout of the new worker node configuration continues. Around a quarter of the batch farm has now been upgraded.
- The network link between the Tier1 core network and the UKLight router was switched to a new pair of fibres.
- As part of the housekeeping work on our router pair, the old static routes have been removed (today, 26th August).
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 13/10/2015 08:00 | 13/10/2015 16:00 | 8 hours | Outage of All Castor instances during upgrade of Oracle back end database. |
All Castor | SCHEDULED | WARNING | 08/10/2015 08:30 | 08/10/2015 20:30 | 12 hours | Warning (At Risk) on All Castor instances during upgrade of back end Oracle database. |
All Castor | SCHEDULED | OUTAGE | 06/10/2015 08:00 | 06/10/2015 20:30 | 12 hours and 30 minutes | Outage of All Castor instances during upgrade of Oracle back end database. |
Castor Atlas & GEN instances. | SCHEDULED | WARNING | 22/09/2015 08:30 | 22/09/2015 20:30 | 12 hours | Warning (At Risk) on Atlas and GEN Castor instances during upgrade of back end Oracle database. |
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 15/09/2015 08:00 | 15/09/2015 20:30 | 12 hours and 30 minutes | Outage of Atlas and GEN Castor instances during upgrade of Oracle back end database. |
Whole Site | SCHEDULED | WARNING | 03/09/2015 08:25 | 03/09/2015 09:25 | 1 hour | At risk on site during brief network re-configuration. Actual change expected at 07:30 (UTC). |
All Castor Disk Only Service Classes. | SCHEDULED | WARNING | 27/08/2015 09:30 | 27/08/2015 19:00 | 9 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. (Continuation of rolling upgrade from previous day). |
All Castor Disk Only Service Classes. | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. |
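The two disk-only warnings above describe a rolling SL6 upgrade, in which one disk server at a time is taken out of production for around 30 minutes while the rest stay in service. The sketch below illustrates that general pattern only; the host names and the drain/upgrade/re-enable helpers are hypothetical and are not the procedure actually used at RAL.

```python
#!/usr/bin/env python
# Hypothetical sketch of a rolling update: upgrade one disk server at a time
# so that only a single machine (~30 minutes each) is out of production at
# any point. Host names and helper commands are illustrative placeholders.
import subprocess
import time

DISK_SERVERS = ["gdss-example-01", "gdss-example-02"]  # placeholder names

def run(host, command):
    """Run a command on a host over ssh, raising on failure."""
    subprocess.check_call(["ssh", host, command])

def rolling_upgrade(servers):
    for host in servers:
        print("Draining %s" % host)
        run(host, "/usr/local/bin/disable-diskserver")   # hypothetical helper
        run(host, "/usr/local/bin/upgrade-to-sl6")       # hypothetical helper
        run(host, "/usr/local/bin/enable-diskserver")    # hypothetical helper
        print("%s back in production" % host)
        time.sleep(60)  # settle time before moving to the next server

if __name__ == "__main__":
    rolling_upgrade(DISK_SERVERS)
```

Handling one server at a time keeps only a single machine out of production at any moment, which is consistent with the intervention being declared as a Warning (At Risk) rather than an Outage.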
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- Upgrade of Castor disk servers to SL6. For the D1T0 Service Classes this is being done today/tomorrow (26/27 August) with extended 'At Risks'. (Declared in GOC DB)
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Dates declared in GOC DB (See above).
- Some detailed internal network reconfigurations to be tackled now that the routers are stable. Notably:
  - Brief (less than 20 seconds) break in internal connectivity while systems in the UPS room are re-connected. (3rd Sep.)
  - Replacement of cables and connectivity to the UKLight router that provides our link to both the OPN Link to CERN and the bypass route for other data transfers.
- Extending the rollout of the new worker node configuration.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
  - Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor, Atlas Frontier (LFC done).
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update the Oracle databases behind Castor to version 11.2.0.4. Will require some downtimes (See above).
  - Update disk servers to SL6 (ongoing).
  - Update to Castor version 2.1.15.
- Networking:
  - Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
  - Make routing changes to allow the removal of the UKLight Router.
  - Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
Entries in GOC DB starting since the last report. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor disk-only service classes. | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. |
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
115805 | Green | Less Urgent | In progress | 2015-08-21 | 2015-08-21 | SNO+ | RAL WMS logs |
115573 | Green | Urgent | In progress | 2015-08-07 | 2015-08-26 | CMS | T1_UK_RAL Consistency Check (August 2015) |
115434 | Green | Less Urgent | In progress | 2015-08-03 | 2015-08-07 | SNO+ | glite-wms-job-status warning |
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 | | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes
108944 | Red | Less Urgent | In Progress | 2014-10-01 | 2015-08-17 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud. Availability and HammerCloud figures are percentages.
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
19/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
20/08/15 | 100 | 100 | 100 | 100 | 100 | 97 | 100 | |
21/08/15 | 100 | 100 | 100 | 100 | 100 | 86 | 96 | |
22/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
23/08/15 | 100 | 100 | 100 | 100 | 100 | 94 | n/a | |
24/08/15 | 100 | 100 | 100 | 100 | 100 | 94 | 100 | |
25/08/15 | 100 | 100 | 98.0 | 100 | 100 | 97 | 98 | Atlas: Single SRM test failure. (Alice originally reported as 95%, but subsequently corrected.)