Difference between revisions of "Tier1 Operations Report 2018-11-13"
From GridPP Wiki
Latest revision as of 08:57, 14 November 2018
RAL Tier1 Operations Report for 5th November 2018
Review of Issues during the week 30th October to the 5th November 2018. |
- Our CV17 storage nodes have been rebooting at random. These nodes are not yet fully weighted up in Echo, and further weight increases have been paused while the problem is investigated. A firmware update is being rolled out. Machines with the new firmware appear to be fixed, but we are still gathering statistics. No impact on users has been observed.
- On 7th November, the first LHCb space token was migrated from Castor to Echo. This is the FAILOVER space token, which is used by jobs from other sites if their storage is down.
- ATLAS are unable to get their jobs to run via Singularity. We believe this is because the jobs require privileged containers, whereas we currently offer only unprivileged ones. We hope that ATLAS can improve their code.
- The CMS AAA issue remains ongoing, but we believe the situation is improving. On Friday we updated XRootD to a version that fixes a known bug, and the amount of red in the SAM tests dropped over the weekend.
Current operational status and issues |
- NTR
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
gdss788 | WLCG | gridTape | d0t1 | - |
Limits on concurrent batch system jobs. |
- None currently enforced.
Notable Changes made since the last meeting. |
- NA62 were successfully migrated to the new consolidated Castor tape instance.
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
CASTOR | 26325 | Yes | Outage | 15/11/2018 | 15/11/2018 | 3Hrs | Castor SRM endpoint migration |
CASTOR | 26338 | Yes | Outage | 20/11/2018 | 20/11/2018 | 3Hrs | CASTOR out as part of Oracle Patch Installation (Neptune and Pluto environments) |
- <s>No ongoing downtime</s>
- <s>No downtime scheduled in the GOCDB for next 2 weeks</s>
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting). |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
138218 | cms | in progress | urgent | 09/11/2018 | 12/11/2018 | CMS_Data Transfers | Transfers failing from RAL_Buffer to TIFR | WLCG |
138033 | atlas | in progress | urgent | 01/11/2018 | 08/11/2018 | Other | singularity jobs failing at RAL | EGI |
137897 | enmr.eu | waiting for reply | urgent | 23/10/2018 | 12/11/2018 | Accounting | enmr.eu accounting at RAL | EGI |
137822 | lhcb | in progress | top priority | 18/10/2018 | 31/10/2018 | File Transfer | FTS server seems in bad state. | WLCG |
137650 | cms | in progress | very urgent | 09/10/2018 | 12/11/2018 | CMS_AAA WAN Access | Low HC xrootd success rates at T1_UK_RAL | WLCG |
137153 | t2k.org | in progress | urgent | 12/09/2018 | 12/11/2018 | Data Management - generic | LFC entry has file size 0, prevents registering of additional replicas | EGI |
GGUS Tickets Closed Last week |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
138103 | cms | solved | urgent | 05/11/2018 | 07/11/2018 | CMS_Data Transfers | Transfers failing from RALPP to RAL | WLCG |
138077 | cms | solved | urgent | 02/11/2018 | 08/11/2018 | CMS_SAM tests | SAM test critical T1_UK_RAL | WLCG |
138028 | lhcb | verified | urgent | 01/11/2018 | 12/11/2018 | File Access | File cannot be staged | WLCG |
137752 | other | solved | less urgent | 15/10/2018 | 12/11/2018 | VO Specific Software | Replicate OSG CVMFS repositories to EGI stratum 1s | EGI |
136701 | lhcb | verified | very urgent | 14/08/2018 | 06/11/2018 | File Transfer | background of transfer errors | WLCG |
136199 | lhcb | verified | very urgent | 18/07/2018 | 06/11/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG |
Availability Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2018-11-07 | 100 | 100 | 99 | 100 | 100 | 100 | |
2018-11-08 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-09 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-10 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-11 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-12 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-13 | 100 | 100 | 87 | 100 | 100 | 100 | |
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2018-11-06 | 98 | 100 | |
2018-11-07 | 100 | 99 | |
2018-11-08 | 100 | 99 | |
2018-11-09 | 100 | 98 | |
2018-11-10 | 100 | 96 | |
2018-11-11 | 100 | 92 | |
2018-11-12 | 100 | 99 | |
2018-11-13 | 100 | 99 | |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
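The colour bands quoted above the availability and HammerCloud tables (Red <90%, Orange <97%) can be applied mechanically. The following is a minimal sketch only: the function name `classify` and the label `"ok"` for figures at or above target are assumptions made for illustration, not part of any GridPP or WLCG tooling.

```python
def classify(percent, red_below=90.0, orange_below=97.0):
    """Return the colour band for a daily percentage figure.

    Thresholds follow the table headers: Red <90%, Orange <97%.
    The "ok" label is a hypothetical name for values at or above target.
    """
    if percent < red_below:
        return "red"
    if percent < orange_below:
        return "orange"
    return "ok"


# CMS HC figures copied from the HammerCloud table above.
cms_hc = {"2018-11-09": 98, "2018-11-10": 96, "2018-11-11": 92}
flags = {day: classify(value) for day, value in cms_hc.items()}
```

Applied to the tables above, this flags the CMS availability of 87 on 2018-11-13 as red and the CMS HC figures of 96 and 92 as orange.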
Notes from Meeting. |