Difference between revisions of "Tier1 Operations Report 2015-08-12"
From GridPP Wiki
(Created page with "==RAL Tier1 Operations Report for 12th August 2015== __NOTOC__ ====== ====== <!-- ************************************************************* -----> <!-- ***********Start R...") |
(→) |
||
(21 intermediate revisions by one user not shown) | |||
Line 55: | Line 55: | ||
| style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made since the last meeting. | | style="background-color: #b7f1ce; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Notable Changes made since the last meeting. | ||
|} | |} | ||
+ | * Deployed changes to remove glite-CLUSTER node from information system and shutdown cream-ce01 and cream-ce02. | ||
* Atlas have transferred a share of their FTS service back to RAL. | * Atlas have transferred a share of their FTS service back to RAL. | ||
* The test of the updated worker node configuration (with grid middleware delivered via CVMFS) continues on a one whole batch of Worker Nodes. We are now draining a second batch of worker nodes. | * The test of the updated worker node configuration (with grid middleware delivered via CVMFS) continues on a one whole batch of Worker Nodes. We are now draining a second batch of worker nodes. | ||
− | * Investigative work into the ongoing issues for CMS Castor. | + | * Investigative work into the ongoing issues for CMS Castor. We have now changed the I/O scheduler on the disk servers. |
<!-- *************End Notable Changes made this last week************** -----> | <!-- *************End Notable Changes made this last week************** -----> | ||
<!-- ****************************************************************** -----> | <!-- ****************************************************************** -----> | ||
Line 67: | Line 68: | ||
|- | |- | ||
| style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB | | style="background-color: #d8e8ff; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Declared in the GOC DB | ||
− | None | + | |} |
+ | {| | ||
+ | | None | ||
+ | |} | ||
<!-- **********************End GOC DB Entries************************** -----> | <!-- **********************End GOC DB Entries************************** -----> | ||
<!-- ****************************************************************** -----> | <!-- ****************************************************************** -----> | ||
Line 82: | Line 86: | ||
<!-- ******* still to be formally scheduled and/or announced ******* -----> | <!-- ******* still to be formally scheduled and/or announced ******* -----> | ||
* Upgrade of Castor disk servers to SL6. We plan to do this during the second part of August (17-28). Rollout plan to be discussed. | * Upgrade of Castor disk servers to SL6. We plan to do this during the second part of August (17-28). Rollout plan to be discussed. | ||
− | * Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Whilst we need to confirm exact dates we are looking at the following days in September: | + | * Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Whilst we need to confirm exact dates we are looking at the following days in September, but the dates are likely to be revised: |
** Tuesday 8th: day's Outage for Atlas & GEN. | ** Tuesday 8th: day's Outage for Atlas & GEN. | ||
** Tuesday 15th: at risk on Atlas & GEN. | ** Tuesday 15th: at risk on Atlas & GEN. | ||
Line 115: | Line 119: | ||
| style="background-color: #7c8aaf; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Entries in GOC DB starting since the last report. | | style="background-color: #7c8aaf; border-bottom: 1px solid silver; text-align: center; font-size: 1em; font-weight: bold; margin-top: 0; margin-bottom: 0; padding-top: 0.1em; padding-bottom: 0.1em;" | Entries in GOC DB starting since the last report. | ||
|} | |} | ||
− | |||
− | |||
{| border=1 align=center | {| border=1 align=center | ||
|- bgcolor="#7c8aaf" | |- bgcolor="#7c8aaf" | ||
Line 127: | Line 129: | ||
! Reason | ! Reason | ||
|- | |- | ||
− | | Whole site | + | | Whole site |
| UNSCHEDULED | | UNSCHEDULED | ||
| WARNING | | WARNING | ||
Line 133: | Line 135: | ||
| 05/08/2015 15:00 | | 05/08/2015 15:00 | ||
| 24 hours | | 24 hours | ||
− | | Warning on Site following investigations of problem with network router. | + | |Warning on Site following investigations of problem with network router. |
|} | |} | ||
+ | |||
<!-- **********************End GOC DB Entries************************** -----> | <!-- **********************End GOC DB Entries************************** -----> | ||
<!-- ****************************************************************** -----> | <!-- ****************************************************************** -----> | ||
Line 189: | Line 152: | ||
|-style="background:#b7f1ce" | |-style="background:#b7f1ce" | ||
! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject | ! GGUS ID !! Level !! Urgency !! State !! Creation !! Last Update !! VO !! Subject | ||
+ | |- | ||
+ | | 115573 | ||
+ | | Green | ||
+ | | Urgent | ||
+ | | In progress | ||
+ | | 2015-08-07 | ||
+ | | 2015-08-11 | ||
+ | | CMS | ||
+ | | T1_UK_RAL Consistency Check (August 2015) | ||
+ | |- | ||
+ | | 115512 | ||
+ | | | ||
+ | | very urgent | ||
+ | | waiting for reply | ||
+ | | 2015-08-05 | ||
+ | | 2015-08-12 | ||
+ | | lhcb | ||
+ | | User cannot submit jobs directly to RAL | ||
|- | |- | ||
| 115434 | | 115434 | ||
| Green | | Green | ||
| Less Urgent | | Less Urgent | ||
− | | | + | | waiting for reply |
− | + | ||
| 2015-08-03 | | 2015-08-03 | ||
+ | | 2015-08-07 | ||
| SNO+ | | SNO+ | ||
| glite-wms-job-status warning | | glite-wms-job-status warning | ||
|- | |- | ||
| 115387 | | 115387 | ||
Line 213: | Line 185: | ||
| In Progress | | In Progress | ||
| 2015-08-03 | | 2015-08-03 | ||
− | | 2015-08- | + | | 2015-08-11 |
| SNO+ | | SNO+ | ||
| XRootD for SNO+ from RAL | | XRootD for SNO+ from RAL | ||
Line 225: | Line 197: | ||
| | | | ||
| FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes | | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes | ||
|- | |- | ||
| 108944 | | 108944 | ||
Line 262: | Line 225: | ||
! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment | ! Day !! OPS !! Alice !! Atlas !! CMS !! LHCb !! Atlas HC !! CMS HC !! Comment | ||
|- | |- | ||
− | | | + | | 05/08/15 || 100 || 100 || style="background-color: lightgrey;" | 98.0 || style="background-color: lightgrey;" | 92.0 || 100 || 100 ||style="background-color: lightgrey;" | 51 || Atlas had a Single SRM test failure. CMS had high load on disk servers. |
|- | |- | ||
− | | | + | | 06/08/15 || 100 || 100 || 100 || style="background-color: lightgrey;" | 83 || 100 || 100 || 100 || Several SRM test failures. (CMS instance known to be performing badly as a lot of batch job failures). |
|- | |- | ||
− | | | + | | 07/08/15 || 100 || 100 || style="background-color: lightgrey;" | 98 || 100 || 100 || 100 || 100 || Atlas had one SRM failure [SRM_INVALID_PATH] No such file or directory |
|- | |- | ||
− | | | + | | 08/08/15 || style="background-color: lightgrey;" | 93.9 || 100 || style="background-color: lightgrey;" | 98 || style="background-color: lightgrey;" | 85 || style="background-color: lightgrey;" | 89 || style="background-color: lightgrey;" | 95 || style="background-color: lightgrey;" | 85 || Site network problems |
|- | |- | ||
− | | | + | | 09/07/15 || 100 || 100 || 100 || 100 || style="background-color: lightgrey;" | 89 || 100 || 100 || LHCb had one SRM failure, [SRM_INVALID_PATH] No such file or directory |
|- | |- | ||
− | | | + | | 10/08/15 || 100 || 100 || 100 || 100 || 100 || 100 || 100 || |
|- | |- | ||
− | | | + | | 11/07/15 || 100 || 100 || 100 || 100 || 100 || 100 || 56 || |
|} | |} | ||
<!-- **********************End Availability Report************************** -----> | <!-- **********************End Availability Report************************** -----> | ||
<!-- *********************************************************************** -----> | <!-- *********************************************************************** -----> |
Latest revision as of 13:20, 12 August 2015
RAL Tier1 Operations Report for 12th August 2015
Review of Issues during the week 5th to 12th August 2015. |
- There was a site network outage on Saturday 8th August. The Tier1 was affected from approx 07:30 until 10:00. The issue was resolved when a member of the network team came on site and re-seated a card in a router.
Resolved Disk Server Issues |
- None.
Current operational status and issues |
- The post mortem review of the network incident on the 8th April is being finalised.
- The intermittent, low-level, load-related packet loss over the OPN to CERN is still being tracked.
- There are some ongoing issues for CMS: a problem with the Xroot (AAA) redirection accessing Castor, slow file open times using Xroot, and poor batch job efficiencies.
Ongoing Disk Server Issues |
- gdss720 (part of ATLASDATADISK) crashed on Tuesday 11th. The machine is currently being drained so the fabric team can replace some components.
Notable Changes made since the last meeting. |
- Deployed changes to remove the glite-CLUSTER node from the information system and shut down cream-ce01 and cream-ce02.
- Atlas have transferred a share of their FTS service back to RAL.
- The test of the updated worker node configuration (with grid middleware delivered via CVMFS) continues on one whole batch of worker nodes. We are now draining a second batch of worker nodes.
- Investigative work into the ongoing issues for CMS Castor continues. We have now changed the I/O scheduler on the disk servers.
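The report does not say which I/O scheduler was selected or how the change was rolled out. As a minimal sketch only, assuming Linux disk servers that expose the standard sysfs file `/sys/block/<device>/queue/scheduler`, the following Python shows how the active scheduler could be read back to confirm such a change. The function names, the device name and the sample string are illustrative and are not taken from the Tier1 configuration.

```python
from pathlib import Path


def parse_scheduler(text):
    """Parse the contents of a sysfs scheduler file.

    The kernel lists the available schedulers separated by spaces and
    marks the active one with square brackets, e.g. "noop deadline [cfq]".
    Returns (active, available); active is None if nothing is bracketed.
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device, sysfs_root="/sys/block"):
    """Read the scheduler for one block device (hypothetical helper)."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())


if __name__ == "__main__":
    # Sample string for illustration only; not the disk servers' real setting.
    print(parse_scheduler("noop deadline [cfq]"))
```

On a real host, `read_scheduler("sda")` would read the live value; the device name `sda` is an assumption and would need to match the data disks on each server.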
Declared in the GOC DB |
None |
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
- Upgrade of Castor disk servers to SL6. We plan to do this during the second part of August (17-28). Rollout plan to be discussed.
- Upgrade of the Oracle databases behind Castor to version 11.2.0.4. This is a multi-step intervention. Exact dates are still to be confirmed and are likely to be revised, but we are looking at the following days in September:
  - Tuesday 8th: day's outage for Atlas & GEN.
  - Tuesday 15th: at risk on Atlas & GEN.
  - Thursday 17th: day's outage for ALL instances.
  - Tuesday 22nd: day's at risk for ALL instances.
  - Thursday 24th: half day outage for ALL instances.
- Extending the rollout of the new worker node configuration.
Listing by category:
- Databases:
  - Switch LFC/3D to new Database Infrastructure.
  - Update to Oracle 11.2.0.4. This will affect all services that use Oracle databases: Castor, Atlas Frontier (LFC done).
- Castor:
  - Update SRMs to new version (includes updating to SL6).
  - Update the Oracle databases behind Castor to version 11.2.0.4. Will require some downtimes (see above).
  - Update disk servers to SL6.
  - Update to Castor version 2.1.15.
- Networking:
  - Increase bandwidth of the link from the Tier1 into the RAL internal site network to 40Gbit.
  - Make routing changes to allow the removal of the UKLight Router.
  - Cabling/switch changes to the network in the UPS room to improve resilience.
- Fabric:
  - Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC).
Entries in GOC DB starting since the last report. |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
Whole site | UNSCHEDULED | WARNING | 04/08/2015 15:00 | 05/08/2015 15:00 | 24 hours | Warning on Site following investigations of problem with network router. |
Open GGUS Tickets (Snapshot during morning of meeting) |
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject |
---|---|---|---|---|---|---|---|
115573 | Green | Urgent | In progress | 2015-08-07 | 2015-08-11 | CMS | T1_UK_RAL Consistency Check (August 2015) |
115512 | | very urgent | waiting for reply | 2015-08-05 | 2015-08-12 | lhcb | User cannot submit jobs directly to RAL |
115434 | Green | Less Urgent | waiting for reply | 2015-08-03 | 2015-08-07 | SNO+ | glite-wms-job-status warning |
115387 | Green | Less Urgent | In Progress | 2015-08-03 | 2015-08-11 | SNO+ | XRootD for SNO+ from RAL |
115290 | Green | Less Urgent | On Hold | 2015-07-28 | 2015-07-29 | | FTS3@RAL: missing proper host names in subjectAltName of FTS agent nodes |
108944 | Red | Less Urgent | In Progress | 2014-10-01 | 2015-07-17 | CMS | AAA access test failing at T1_UK_RAL |
Availability Report |
Key: Atlas HC = Atlas HammerCloud (Queue ANALY_RAL_SL6, Template 508); CMS HC = CMS HammerCloud
Day | OPS | Alice | Atlas | CMS | LHCb | Atlas HC | CMS HC | Comment |
---|---|---|---|---|---|---|---|---|
05/08/15 | 100 | 100 | 98.0 | 92.0 | 100 | 100 | 51 | Atlas had a Single SRM test failure. CMS had high load on disk servers. |
06/08/15 | 100 | 100 | 100 | 83 | 100 | 100 | 100 | Several SRM test failures. (CMS instance known to be performing badly as a lot of batch job failures). |
07/08/15 | 100 | 100 | 98 | 100 | 100 | 100 | 100 | Atlas had one SRM failure [SRM_INVALID_PATH] No such file or directory |
08/08/15 | 93.9 | 100 | 98 | 85 | 89 | 95 | 85 | Site network problems |
09/08/15 | 100 | 100 | 100 | 100 | 89 | 100 | 100 | LHCb had one SRM failure, [SRM_INVALID_PATH] No such file or directory |
10/08/15 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
11/08/15 | 100 | 100 | 100 | 100 | 100 | 100 | 56 | |
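As an aid to reading the availability table, the sketch below computes a simple unweighted mean of the seven daily figures in each column. This is an illustration only: the report does not state how weekly availability is officially aggregated, and the names `AVAILABILITY` and `weekly_mean` are introduced here for the example. The values are copied from the table above, in row order.

```python
# Daily availability figures (per cent), one list entry per row of the table.
AVAILABILITY = {
    "OPS":      [100, 100, 100, 93.9, 100, 100, 100],
    "Alice":    [100, 100, 100, 100, 100, 100, 100],
    "Atlas":    [98.0, 100, 98, 98, 100, 100, 100],
    "CMS":      [92.0, 83, 100, 85, 100, 100, 100],
    "LHCb":     [100, 100, 100, 89, 89, 100, 100],
    "Atlas HC": [100, 100, 100, 95, 100, 100, 100],
    "CMS HC":   [51, 100, 100, 85, 100, 100, 56],
}


def weekly_mean(values):
    """Unweighted arithmetic mean of the daily figures."""
    return sum(values) / len(values)


if __name__ == "__main__":
    means = {name: weekly_mean(vals) for name, vals in AVAILABILITY.items()}
    # Print the columns from lowest to highest mean availability.
    for name, mean in sorted(means.items(), key=lambda item: item[1]):
        print(f"{name:<9} {mean:6.2f}")
```

An unweighted mean treats every day equally; it does not account for the length of individual test failures, so it should be read as a rough summary rather than an official figure.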