RAL Tier1 Operations Report for 26th August 2015
Review of Issues during the week 19th to 26th August 2015.
- Last week we reported a significant backlog of migrations to tape for Atlas, as the instance was seeing very high load. Castor cleared the backlog overnight Thursday/Friday (20/21 Aug) and there have been no problems since then.
Resolved Disk Server Issues
Current operational status and issues
- As reported last week the post-mortem review of the network incident on the 8th April has been finalised. It can be seen here:
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked. An attempted fix (replacing the cables between the Tier1 core network and the UKLight router) had no effect.
- Some issues for CMS are ongoing: there is a problem with Xroot (AAA) redirection when accessing Castor, and file open times using Xroot are slow. The previously poor batch job efficiencies have improved since a change to the Linux I/O scheduler on the CMS disk servers some ten days ago.
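For anyone wanting to confirm which I/O scheduler a disk server is actually using: Linux exposes the per-device setting in `/sys/block/<device>/queue/scheduler`, with the active scheduler shown in square brackets. The sketch below parses that format. The report does not say which scheduler was selected, so the sample strings and the helper names `active_scheduler` and `read_scheduler` are illustrative assumptions, not the Tier1 configuration.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler shown in square brackets in the sysfs text."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        # Some devices list no bracketed entry; treat that as an error here.
        raise ValueError(f"no active scheduler marked in {sysfs_text!r}")
    return match.group(1)


def read_scheduler(device: str) -> str:
    """Read the active scheduler for a block device name such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    # Sample strings only; these scheduler names are not taken from the report.
    print(active_scheduler("noop deadline [cfq]"))
    print(active_scheduler("noop [deadline] cfq"))
```

Running `read_scheduler` across the CMS disk servers before and after a change would give a quick consistency check, assuming the device names are known for each host.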
Ongoing Disk Server Issues
Notable Changes made since the last meeting.
- The rollout of the new worker node configuration continues. Around a quarter of the batch farm has now been upgraded.
- The network link between the Tier1 core network and the UKLight router was switched to a new pair of fibres.
- As part of the housekeeping work on our router pair, the old static routes have been removed (today, 26th August).
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 13/10/2015 08:00 | 13/10/2015 16:00 | 8 hours | Outage of All Castor instances during upgrade of Oracle back end database.
All Castor | SCHEDULED | WARNING | 08/10/2015 08:30 | 08/10/2015 20:30 | 12 hours | Warning (At Risk) on All Castor instances during upgrade of back end Oracle database.
All Castor | SCHEDULED | OUTAGE | 06/10/2015 08:00 | 06/10/2015 20:30 | 12 hours and 30 minutes | Outage of All Castor instances during upgrade of Oracle back end database.
Castor Atlas & GEN instances. | SCHEDULED | WARNING | 22/09/2015 08:30 | 22/09/2015 20:30 | 12 hours | Warning (At Risk) on Atlas and GEN Castor instances during upgrade of back end Oracle database.
Castor Atlas & GEN instances. | SCHEDULED | OUTAGE | 15/09/2015 08:00 | 15/09/2015 20:30 | 12 hours and 30 minutes | Outage of Atlas and GEN Castor instances during upgrade of Oracle back end database.
Whole Site | SCHEDULED | WARNING | 03/09/2015 08:25 | 03/09/2015 09:25 | 1 hour | At risk on site during brief network re-configuration. Actual change expected at 07:30 (UTC).
All Castor Disk Only Service Classes. | SCHEDULED | WARNING | 27/08/2015 09:30 | 27/08/2015 19:00 | 9 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade. (Continuation of rolling upgrade from previous day).
All Castor Disk Only Service Classes. | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade.
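The Duration column in the table above can be cross-checked against the Start and End columns. The following is a minimal sketch of that arithmetic, assuming the `dd/mm/yyyy HH:MM` timestamp format shown in the table; the function name `describe_duration` and the `DECLARED` list are introduced here only for illustration.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M"  # timestamp format used in the downtime table

# (start, end, declared duration) copied from the table above.
DECLARED = [
    ("13/10/2015 08:00", "13/10/2015 16:00", "8 hours"),
    ("08/10/2015 08:30", "08/10/2015 20:30", "12 hours"),
    ("06/10/2015 08:00", "06/10/2015 20:30", "12 hours and 30 minutes"),
    ("22/09/2015 08:30", "22/09/2015 20:30", "12 hours"),
    ("15/09/2015 08:00", "15/09/2015 20:30", "12 hours and 30 minutes"),
    ("03/09/2015 08:25", "03/09/2015 09:25", "1 hour"),
    ("27/08/2015 09:30", "27/08/2015 19:00", "9 hours and 30 minutes"),
    ("26/08/2015 09:30", "26/08/2015 18:00", "8 hours and 30 minutes"),
]


def describe_duration(start: str, end: str) -> str:
    """Format the gap between two timestamps in the table's wording."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("end is before start")
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("" if hours == 1 else "s"))
    if minutes:
        parts.append(f"{minutes} minute" + ("" if minutes == 1 else "s"))
    return " and ".join(parts) or "0 minutes"


if __name__ == "__main__":
    for start, end, declared in DECLARED:
        computed = describe_duration(start, end)
        status = "ok" if computed == declared else "MISMATCH"
        print(f"{start} -> {end}: {computed} ({status})")
```

The timestamps are treated as naive local times, which is adequate here because every entry starts and ends on the same day.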
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
- Upgrade of Castor disk servers to SL6. For the D1T0 Service Classes this is being done today/tomorrow (26/27 August) with extended 'At Risks'. (Declared in GOC DB)
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Dates declared in GOC DB (See above).
- Some detailed internal network reconfigurations to be tackled now that the routers are stable. Notably:
  - Brief (less than 20 seconds) break in internal connectivity while systems in the UPS room are re-connected. (3rd Sep.)
  - Replacement of cables and connectivity to the UKLight router, which provides both our OPN link to CERN and the bypass route for other data transfers.
- Extending the rollout of the new worker node configuration.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
  - Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor, Atlas Frontier (LFC done).
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update the Oracle databases behind Castor to version 11.2.0.4. Will require some downtimes (see above).
  - Update disk servers to SL6 (ongoing).
  - Update to Castor version 2.1.15.
- Networking:
  - Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
  - Make routing changes to allow the removal of the UKLight Router.
  - Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
Entries in GOC DB starting since the last report.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All Castor disk-only service classes. | SCHEDULED | WARNING | 26/08/2015 09:30 | 26/08/2015 18:00 | 8 hours and 30 minutes | Warning on Castor Disk Storage as servers upgraded to SL6. This is a rolling update of the servers - and each is expected to be unavailable for around 30 minutes during the upgrade.
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
115805 | Green | Less Urgent | In progress | 2015-08-21 | 2015-08-21 | SNO+ | RAL WMS logs
115573 | Green | Urgent | In progress | 2015-08-07 | 2015-08-26 | CMS | T1_UK_RAL Consistency Check (August 2015)
115434 | Green | Less Urgent | In progress | 2015-08-03 | 2015-08-07 | SNO+ | glite-wms-job-status warning
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 |  | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes
108944 | Red | Less Urgent | In progress | 2014-10-01 | 2015-08-17 | CMS | AAA access test failing at T1_UK_RAL
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment
19/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
20/08/15 | 100 | 100 | 100 | 100 | 100 | 97 | 100 |
21/08/15 | 100 | 100 | 100 | 100 | 100 | 86 | 96 |
22/08/15 | 100 | 100 | 100 | 100 | 100 | 98 | 100 |
23/08/15 | 100 | 100 | 100 | 100 | 100 | 94 | n/a |
24/08/15 | 100 | 100 | 100 | 100 | 100 | 94 | 100 |
25/08/15 | 100 | 95.0 | 98.0 | 100 | 100 | 97 | 98 | Alice: Single ARC CE test failure; Atlas: Single SRM test failure.
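A weekly figure per column can be derived from the daily values above. The sketch below is one way to do it, assuming a plain arithmetic mean rounded to one decimal place and assuming that an `n/a` entry is skipped rather than counted as zero. Neither convention is stated in the report, and the names `COLUMNS`, `ROWS` and `weekly_means` are introduced here for illustration only.

```python
from statistics import mean

COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb", "Atlas HC", "CMS HC"]

# Daily values copied from the table above; None stands for "n/a".
ROWS = {
    "19/08/15": [100, 100, 100, 100, 100, 98, 100],
    "20/08/15": [100, 100, 100, 100, 100, 97, 100],
    "21/08/15": [100, 100, 100, 100, 100, 86, 96],
    "22/08/15": [100, 100, 100, 100, 100, 98, 100],
    "23/08/15": [100, 100, 100, 100, 100, 94, None],
    "24/08/15": [100, 100, 100, 100, 100, 94, 100],
    "25/08/15": [100, 95.0, 98.0, 100, 100, 97, 98],
}


def weekly_means(rows):
    """Average each column over the days that have a value (assumed convention)."""
    means = {}
    for index, name in enumerate(COLUMNS):
        values = [row[index] for row in rows.values() if row[index] is not None]
        means[name] = round(mean(values), 1)
    return means


if __name__ == "__main__":
    for name, value in weekly_means(ROWS).items():
        print(f"{name}: {value}")
```

Skipping `n/a` means the CMS HC average is taken over six days rather than seven; counting it as zero would give a noticeably lower figure.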