RAL Tier1 Operations Report for 26th August 2015
Review of Issues during the week 19th to 26th August 2015.
- Last week we reported a significant backlog of migrations to tape for Atlas as the instance was seeing very high load. Castor caught up with the backlog overnight Thursday/Friday (20/21 Aug) and there have been no problems since then.
Resolved Disk Server Issues
- None
Current operational status and issues
- As reported last week, the post-mortem review of the network incident on the 8th April has been finalised. It can be seen here:
  - https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20150408_network_intervention_preceding_Castor_upgrade
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked; a minimal loss-probe sketch follows this list. An attempted fix (replacing the cables between the Tier1 core network and the UKLight router) had no effect.
- There are some ongoing issues for CMS. There is a problem with the Xroot (AAA) redirection accessing Castor, and file open times using Xroot are slow. Batch job efficiencies have improved since a change to the Linux I/O scheduler on the CMS disk servers some ten days ago (a scheduler sketch also follows this list).
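As a rough illustration of how low-level loss on a link like the OPN can be tracked, below is a minimal probe sketch in Python. This is not the monitoring actually used at RAL; the target host name and sampling parameters are placeholders.

```python
#!/usr/bin/env python
"""Minimal packet-loss probe: pings a target at a fixed rate and logs the
loss percentage per sampling interval. Illustrative only -- this is not
the monitoring used at RAL, and the target host is a placeholder."""
import re
import subprocess
import time

TARGET = "opn-gw.example.org"  # placeholder for an OPN-facing endpoint
COUNT = 100                    # pings per sampling interval

def sample_loss(target, count):
    """Run ping once and parse the '% packet loss' figure it reports."""
    result = subprocess.run(
        ["ping", "-q", "-c", str(count), "-i", "0.2", target],
        capture_output=True, text=True,
    )
    match = re.search(r"([\d.]+)% packet loss", result.stdout)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    # One ~20 second sample per minute; persistent low-level loss shows up
    # as a stream of small non-zero percentages rather than isolated spikes.
    while True:
        loss = sample_loss(TARGET, COUNT)
        print("%s loss=%s%%" % (time.strftime("%Y-%m-%d %H:%M:%S"), loss))
        time.sleep(60)
```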
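For reference, the Linux I/O scheduler can be inspected and changed per block device through sysfs. The sketch below is a generic illustration, not the change applied to the CMS disk servers; the report does not record which scheduler was selected, so any value passed to set_scheduler() is an assumption.

```python
#!/usr/bin/env python
"""Inspect (and optionally set) the Linux I/O scheduler for block devices
via sysfs. Generic illustration only: the report does not record which
scheduler the CMS disk servers were switched to, so any target value
passed to set_scheduler() is an assumption."""
import glob

def schedulers(device):
    """Return (available, active) for e.g. device='sda'. sysfs marks the
    active scheduler with square brackets, e.g. 'noop deadline [cfq]'."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        names = f.read().split()
    active = next(n for n in names if n.startswith("[")).strip("[]")
    return [n.strip("[]") for n in names], active

def set_scheduler(device, name):
    """Activate the named scheduler (requires root; takes effect
    immediately but does not persist across reboots)."""
    with open("/sys/block/%s/queue/scheduler" % device, "w") as f:
        f.write(name)

if __name__ == "__main__":
    # Read-only by default: list the active scheduler on each disk.
    for path in glob.glob("/sys/block/sd*"):
        dev = path.rsplit("/", 1)[-1]
        avail, active = schedulers(dev)
        print("%s: active=%s available=%s" % (dev, active, ",".join(avail)))
```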
Ongoing Disk Server Issues
- None
Notable Changes made since the last meeting.
- The rollout of the new worker node configuration continues. Around a quarter of the batch farm has now been upgraded.
- The network link between the Tier1 core network and the UKLight router was switched to a new pair of fibres.
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 13/10/2015 08:00 | 13/10/2015 16:00 | 8 hours | Outage of All Castor instances during upgrade of Oracle back end database. |
All Castor | SCHEDULED | WARNING | 08/10/2015 08:30 | 08/10/2015 20:30 | 12 hours | Warning (At Risk) on All Castor instances during upgrade of back end Oracle database. |
All Castor | SCHEDULED | OUTAGE | 06/10/2015 08:00 | 06/10/2015 20:30 | 12 hours and 30 minutes | Outage of All Castor instances during upgrade of Oracle back end database. |
Castor Atlas & GEN instances. | SCHEDULED | WARNING | 22/09/2015 08:30 | 22/09/2015 20:30 | 12 hours | Warning (At Risk) on Atlas and GEN Castor instances during upgrade of back end Oracle database. |
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 15/09/2015 08:00 | 15/09/2015 20:30 | 12 hours and 30 minutes | Outage of Atlas and GEN Castor instances during upgrade of Oracle back end database. |
Whole Site | SCHEDULED | WARNING | 03/09/2015 08:25 | 03/09/2015 09:25 | 1 hour | At risk on site during brief network re-configuration. Actual change expected at 07:30 (UTC). |
All Castor Disk Only Service Classes. | SCHEDULED | WARNING | 27/08/2015 09:30 | 27/08/2015 19:00 | 9 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. (Continuation of rolling upgrade from previous day). |
All Castor Disk Only Service Classes. | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. |
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of Castor disk servers to SL6. For the D1T0 service classes this is being done today/tomorrow (26/27 August) with extended 'At Risks'. (Declared in GOC DB; see the rolling-upgrade sketch at the end of this section.)
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Dates declared in GOC DB (See above).
- Some detailed internal network reconfigurations to be tackled now that the routers are stable. Notably:
  - Brief (less than 20 seconds) break in internal connectivity while systems in the UPS room are re-connected. (3rd Sep.)
  - Replacement of cables and connectivity to the UKLight router that provides our link to both the OPN link to CERN and the bypass route for other data transfers.
- Extending the rollout of the new worker node configuration.
Listing by category:
- Databases:
  - Switch LFC/3D to new database infrastructure.
  - Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor, Atlas Frontier (LFC done).
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update the Oracle databases behind Castor to version 11.2.0.4. Will require some downtimes (see above).
  - Update disk servers to SL6 (ongoing).
  - Update to Castor version 2.1.15.
- Networking:
  - Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
  - Make routing changes to allow the removal of the UKLight router.
  - Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
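The disk server SL6 updates above follow the rolling pattern described in the GOC DB entries: one server at a time, each out of production for roughly 30 minutes. Below is a minimal sketch of that pattern; the drain/reinstall/re-enable commands and the host names are hypothetical, not the actual Castor or deployment tooling.

```python
#!/usr/bin/env python
"""Sketch of a rolling upgrade: take one disk server out of production at
a time, rebuild it, and wait for it to return before moving on. All
commands and host names below are hypothetical placeholders."""
import subprocess
import time

SERVERS = ["gdss700", "gdss701", "gdss702"]  # hypothetical host names
EXPECTED_DOWNTIME = 30 * 60                  # ~30 minutes per server

def run(cmd):
    """Run a step and raise if it fails, so a bad step halts the rollout."""
    subprocess.check_call(cmd)

for host in SERVERS:
    run(["disable-diskserver", host])   # hypothetical: drain from Castor
    run(["reinstall-sl6", host])        # hypothetical: trigger SL6 rebuild
    deadline = time.time() + 2 * EXPECTED_DOWNTIME
    # Poll until the rebuilt server answers again, with a hard timeout.
    while subprocess.call(["ping", "-c", "1", "-W", "2", host]) != 0:
        if time.time() > deadline:
            raise RuntimeError("%s did not come back; halting rollout" % host)
        time.sleep(30)
    run(["enable-diskserver", host])    # hypothetical: return to production
```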
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor disk-only service classes. | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
115805 | Green | Less Urgent | In progress | 2015-08-21 | 2015-08-21 | SNO+ | RAL WMS logs |
115573 | Green | Urgent | In progress | 2015-08-07 | 2015-08-26 | CMS | T1_UK_RAL Consistency Check (August 2015) |
115434 | Green | Less Urgent | In progress | 2015-08-03 | 2015-08-07 | SNO+ | glite-wms-job-status warning |
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 | | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes |
108944 | Red | Less Urgent | In progress | 2014-10-01 | 2015-08-17 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud. All figures are percentages.
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
19/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
20/08/15 | 100 | 100 | 100 | 100 | 100 | 97 | 100 | |
21/08/15 | 100 | 100 | 100 | 100 | 100 | 86 | 96 | |
22/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | |
23/08/15 | 100 | 100 | 100 | 100 | 100 | 94 | n/a | |
24/08/15 | 100 | 100 | 100 | 100 | 100 | 94 | 100 | |
25/08/15 | 100 | 95.0 | 98.0 | 100 | 100 | 97 | 98 | Alice: Single ARC CE test failure; Atlas: Single SRM test failure. |