Difference between revisions of "Tier1 Operations Report 2019-01-14"
From GridPP Wiki
Latest revision as of 09:54, 16 January 2019
RAL Tier1 Operations Report for 14th January 2019
Review of Issues during the week 7th January 2019 to the 14th January 2019. |
- With the minor exceptions below, it is still (suspiciously!) calm and peaceful at the Tier-1. Very much business as usual.
- Issues with the ARC CEs are ongoing but are not yet causing significant operational problems. We have identified a large number of LHCb MCFastSimulation jobs that are probably creating the load on the system.
- On Wednesday 9th January an Echo storage node started rebooting itself and was removed from the cluster. There was no loss of availability to the cluster. An engineer is to attend site to fix it.
Current operational status and issues |
- NTR
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
gdss804 | lhcb | lhcbDst | d1t0 | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Limits on concurrent batch system jobs. |
- ALICE - 1000
Notable Changes made since the last meeting. |
- NTR
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
- | - | - | - | - | - | - |
- No ongoing downtime
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting). |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
139081 | lhcb | in progress | top priority | 07/01/2019 | 14/01/2019 | Local Batch System | Aborted pilots for LHCb | WLCG |
138891 | ops | waiting for reply | less urgent | 17/12/2018 | 09/01/2019 | Operations | [Rod Dashboard] Issue detected : egi.eu.lowAvailability-/RAL-LCG2@RAL-LCG2_Availability | EGI |
138665 | mice | in progress | urgent | 04/12/2018 | 11/01/2019 | Middleware | Problem accessing LFC at RAL | EGI |
138500 | cms | in progress | urgent | 26/11/2018 | 20/12/2018 | CMS_Data Transfers | Transfers failing from T2_PL_Swierk to RAL | WLCG |
138361 | t2k.org | on hold | less urgent | 19/11/2018 | 15/01/2019 | Other | RAL-LCG2: t2k.org LFC to DFC transition | EGI |
138033 | atlas | in progress | urgent | 01/11/2018 | 08/01/2019 | Other | singularity jobs failing at RAL | EGI |
137897 | enmr.eu | waiting for reply | urgent | 23/10/2018 | 15/01/2019 | Workload Management | enmr.eu accounting at RAL | EGI |
GGUS Tickets Closed Last week |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
139108 | ops | solved | less urgent | 09/01/2019 | 15/01/2019 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-ARIS@arc-ce04.gridpp.rl.ac.uk | EGI |
139106 | lhcb | verified | very urgent | 09/01/2019 | 10/01/2019 | File Transfer | FTS3 transfer problem at RAL-LCG2 | WLCG |
138833 | none | verified | urgent | 13/12/2018 | 14/01/2019 | Storage Systems | Tape reporting is not being updated | EGI |
Availability Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2019-01-08 | 100 | 100 | 100 | 98 | 100 | -1 | Ref GGUS#138891 |
2019-01-09 | 100 | 100 | 100 | 100 | 98 | -1 | Ref GGUS#138891 |
2019-01-10 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
2019-01-11 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
2019-01-12 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
2019-01-13 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
2019-01-14 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
2019-01-15 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
2019-01-16 | 100 | 100 | 100 | 100 | 100 | -1 | Ref GGUS#138891 |
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2019-01-07 | 100 | 99 | |
2019-01-08 | 100 | 100 | |
2019-01-09 | 100 | 99 | |
2019-01-10 | 100 | 100 | |
2019-01-11 | 100 | 100 | |
2019-01-12 | 100 | 95 | |
2019-01-13 | 100 | 98 | |
2019-01-14 | 100 | 100 | |
2019-01-14 | 100 | 100 | |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
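As a rough illustration of the colour bands quoted in the Availability and Hammercloud sections above (target 97.0%, orange below 97%, red below 90%), the following Python sketch classifies daily percentages. The function name `classify` and the "no data" label are hypothetical, and treating a negative value such as the -1 in the OPS column as a missing measurement is an assumption; the report itself does not say what -1 means.

```python
# Thresholds taken from the report: "Target ... 97.0% | Red <90% | Orange <97%"
TARGET = 97.0
RED_BELOW = 90.0


def classify(availability: float) -> str:
    """Map a daily availability percentage to a colour band (hypothetical helper)."""
    if availability < 0:
        # Assumption: negative values such as -1 mark a missing measurement.
        return "no data"
    if availability < RED_BELOW:
        return "red"
    if availability < TARGET:
        return "orange"
    return "ok"


# CMS HC values copied from the Hammercloud table above.
cms_hc = {
    "2019-01-07": 99,
    "2019-01-08": 100,
    "2019-01-09": 99,
    "2019-01-10": 100,
    "2019-01-11": 100,
    "2019-01-12": 95,
    "2019-01-13": 98,
    "2019-01-14": 100,
}

# Keep only the days that fall outside the target band.
flagged = {day: classify(v) for day, v in cms_hc.items() if classify(v) != "ok"}
print(flagged)
```

With these inputs only 2019-01-12 (95) falls below the 97% target, so it is the single day reported as orange.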
Notes from Meeting. |