Difference between revisions of "Tier1 Operations Report 2015-09-02"
From GridPP Wiki
Revision as of 09:25, 2 September 2015
RAL Tier1 Operations Report for 2nd September 2015
Review of Issues during the week 26th August to 2nd September 2015.
- Last week we reported a significant backlog of migrations to tape for Atlas as the instance was seeing very high load. Castor caught up with the backlog overnight Thursday/Friday (20/21 Aug) and there have been no problems since then.
Resolved Disk Server Issues
- None
Current operational status and issues
- As reported last week the post-mortem review of the network incident on the 8th April has been finalised. It can be seen here:
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked. An attempted fix (replacing the cables between the Tier1 core network and the UKLight router) had no effect.
- There are some on-going issues for CMS. There is a problem with the Xroot (AAA) redirection accessing Castor and file open times using Xroot are slow. The poor batch job efficiencies have been improved since a change to the Linux I/O scheduler on CMS disk servers some ten days ago.
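The report does not say which I/O scheduler was selected on the CMS disk servers. As a hedged sketch only: on Linux the active scheduler for a block device is shown in square brackets in `/sys/block/<device>/queue/scheduler`, and the Python below parses that format. The device name `sda` and the sample string are assumptions for illustration, not values from the RAL servers.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Extract the bracketed (active) scheduler from text such as 'noop deadline [cfq]'."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no active scheduler marked in {sysfs_text!r}")
    return match.group(1)


def read_scheduler(device: str = "sda") -> str:
    """Read the active scheduler for a device; 'sda' is a hypothetical device name."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Sample string only; not taken from the RAL disk servers.
print(active_scheduler("noop deadline [cfq]"))  # cfq
```

Keeping the parsing separate from the file read means the parser can be checked on any machine, while `read_scheduler` is only meaningful on a Linux host with that device present.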
Ongoing Disk Server Issues
- None
Notable Changes made since the last meeting.
- The rollout of the new worker node configuration continues. Around a quarter of the batch farm has now been upgraded.
- The network link between the Tier1 core network and the UKLight router was switched to a new pair of fibres.
- As part of the housekeeping work on our Router pair the old static routes have been removed (today - 26th August).
Declared in the GOC DB
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 13/10/2015 08:00 | 13/10/2015 16:00 | 8 hours | Outage of All Castor instances during upgrade of Oracle back end database. |
All Castor | SCHEDULED | WARNING | 08/10/2015 08:30 | 08/10/2015 20:30 | 12 hours | Warning (At Risk) on All Castor instances during upgrade of back end Oracle database. |
All Castor | SCHEDULED | OUTAGE | 06/10/2015 08:00 | 06/10/2015 20:30 | 12 hours and 30 minutes | Outage of All Castor instances during upgrade of Oracle back end database. |
Castor Atlas & GEN instances. | SCHEDULED | WARNING | 22/09/2015 08:30 | 22/09/2015 20:30 | 12 hours | Warning (At Risk) on Atlas and GEN Castor instances during upgrade of back end Oracle database. |
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 15/09/2015 08:00 | 15/09/2015 20:30 | 12 hours and 30 minutes | Outage of Atlas and GEN Castor instances during upgrade of Oracle back end database. |
Whole Site | SCHEDULED | WARNING | 03/09/2015 08:25 | 03/09/2015 09:25 | 1 hour | At risk on site during brief network re-configuration. Actual change expected at 07:30 (UTC). |
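The Duration column in the table above can be cross-checked from the Start and End columns. A minimal Python sketch, assuming the timestamps are in the dd/mm/yyyy HH:MM form shown in the table; the function name `downtime_duration` is illustrative and is not part of any GOC DB tooling.

```python
from datetime import datetime, timedelta

GOCDB_TIME_FORMAT = "%d/%m/%Y %H:%M"  # assumed from the table above


def downtime_duration(start: str, end: str) -> timedelta:
    """Return end - start for two timestamps in the table's format."""
    begin = datetime.strptime(start, GOCDB_TIME_FORMAT)
    finish = datetime.strptime(end, GOCDB_TIME_FORMAT)
    if finish < begin:
        raise ValueError("end precedes start")
    return finish - begin


# Two rows taken from the table above.
print(downtime_duration("06/10/2015 08:00", "06/10/2015 20:30"))  # 12:30:00
print(downtime_duration("13/10/2015 08:00", "13/10/2015 16:00"))  # 8:00:00
```

Both results agree with the declared durations of 12 hours and 30 minutes and 8 hours respectively.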
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of Castor disk servers to SL6. For the D1T0 Service Classes this is being done today/tomorrow (26/27 August) with extended 'At Risks'. (Declared in GOC DB)
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Dates declared in GOC DB (See above).
- Some detailed internal network reconfigurations to be tackled now that the routers are stable. Notably:
  - Brief (less than 20 seconds) break in internal connectivity while systems in the UPS room are re-connected. (3rd Sep.)
  - Replacement of cables and connectivity to the UKLIGHT router that provides our link to both the OPN Link to CERN and the bypass route for other data transfers.
- Extending the rollout of the new worker node configuration.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
  - Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor, Atlas Frontier (LFC done)
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update the Oracle databases behind Castor to version 11.2.0.4. Will require some downtimes (See above)
  - Update disk servers to SL6 (ongoing)
  - Update to Castor version 2.1.15.
- Networking:
  - Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
  - Make routing changes to allow the removal of the UKLight Router.
  - Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
All Castor (All SRMs) | UNSCHEDULED | WARNING | 27/08/2015 19:00 | 28/08/2015 15:02 | 20 hours and 2 minutes | We are investigating a problem that looks to affect transfers to particular sites from some of our disk servers. |
All Castor (All SRMs) | SCHEDULED | WARNING | 27/08/2015 09:30 | 27/08/2015 19:00 | 9 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. (Continuation of rolling upgrade from previous day). |
All Castor Disk only (All SRMs for disk-only) | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. |
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
115805 | Green | Less Urgent | In progress | 2015-08-21 | 2015-08-21 | SNO+ | RAL WMS logs |
115434 | Green | Less Urgent | Waiting Reply | 2015-08-03 | 2015-09-02 | SNO+ | glite-wms-job-status warning |
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 | | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes |
108944 | Red | Less Urgent | In progress | 2014-10-01 | 2015-08-26 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
26/08/15 | 100 | 100 | 100 | 100 | 100 | 93 | 81 | |
27/08/15 | 100 | 98.0 | 100 | 100 | 100 | 91 | 97 | Alice: Single SRM test failure during upgrade of Castor disk servers to SL6. |
28/08/15 | 100 | 100 | 100 | 100 | 100 | 90 | 96 | |
29/08/15 | 100 | 100 | 100 | 100 | 100 | 95 | 96 | |
30/08/15 | 100 | 100 | 100 | 100 | 100 | 96 | 97 | |
31/08/15 | 89.8 | 100 | 98.0 | 100 | 100 | 95 | 98 | Atlas: Single SRM test failure; OPS: Central monitoring problem. |
01/09/15 | 65.7 | 100 | 87.0 | 100 | 100 | 96 | 96 | OPS and Atlas: Central monitoring problems. |