Difference between revisions of "Tier1 Operations Report 2019-02-11"
From GridPP Wiki
Revision as of 10:38, 12 February 2019
RAL Tier1 Operations Report for 11th February 2019
Review of Issues during the week 28th January 2019 to the 5th February 2019. |
- Our tape robot experienced a 2-hour unscheduled outage on Wednesday. The outage was the result of a failed network switch (internal to the robot) and was resolved after diligent use of the power switch.
- The CASTOR team are moving ALICE to the new xrootd redirector setup.
- Garbage collection issues with the new wlcgTape, which were partly responsible for the problems NA62 experienced last week, have now been understood and resolved.
- The ARC-CEs are intermittently reporting ‘unknown’ for SAM test results. We believe we understand the cause, and that upgrading to the current release version will resolve the issue.
Current operational status and issues |
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
gdss811 | LHCb | lhcbDst | d1t0 | - |
Limits on concurrent batch system jobs. |
- ALICE - 1000
Notable Changes made since the last meeting. |
- NTR
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
- | - | - | - | - | - | - |
- No ongoing downtime
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting). |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
139575 | cms | in progress | urgent | 07/02/2019 | 08/02/2019 | CMS_AAA WAN Access | T1_UK_RAL SAM xrootd reads failing | WLCG |
139477 | ops | in progress | less urgent | 01/02/2019 | 07/02/2019 | Operations | [Rod Dashboard] Issue detected : org.nordugrid.ARC-CE-submit-ops@arc-ce04.gridpp.rl.ac.uk | EGI |
139476 | mice | in progress | less urgent | 01/02/2019 | 06/02/2019 | Other | LFC dump | EGI |
139306 | dteam | in progress | less urgent | 24/01/2019 | 29/01/2019 | Monitoring | perfsonar hosts need updating | EGI |
138665 | mice | on hold | urgent | 04/12/2018 | 30/01/2019 | Middleware | Problem accessing LFC at RAL | EGI |
138500 | cms | on hold | urgent | 26/11/2018 | 30/01/2019 | CMS_Data Transfers | Transfers failing from T2_PL_Swierk to RAL | WLCG |
138361 | t2k.org | in progress | less urgent | 19/11/2018 | 31/01/2019 | Other | RAL-LCG2: t2k.org LFC to DFC transition | EGI |
138033 | atlas | in progress | urgent | 01/11/2018 | 31/01/2019 | Other | singularity jobs failing at RAL | EGI |
137897 | enmr.eu | on hold | urgent | 23/10/2018 | 31/01/2019 | Workload Management | enmr.eu accounting at RAL | EGI |
GGUS Tickets Closed Last week |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
139538 | cms | solved | urgent | 05/02/2019 | 05/02/2019 | CMS_Data Transfers | Some transfers failing to RAL - SRM_AUTHORIZATION_FAILURE | WLCG |
139414 | lhcb | verified | very urgent | 30/01/2019 | 05/02/2019 | Other | Jobs Failed with Segmentation fault at RAL-LCG2 | WLCG |
139405 | ops | verified | less urgent | 30/01/2019 | 05/02/2019 | Operations | [Rod Dashboard] Issue detected : org.bdii.GLUE2-Validate@site-bdii.gridpp.rl.ac.uk | EGI |
139404 | none | verified | top priority | 30/01/2019 | 01/02/2019 | Other | This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release. | WLCG |
139380 | cms | solved | urgent | 29/01/2019 | 31/01/2019 | CMS_Facilities | T1_UK_RAL failing SAM tests inside Singularity | WLCG |
139375 | atlas | solved | urgent | 29/01/2019 | 04/02/2019 | Other | RAL-LCG2 transfers fail with "the server responded with an error 500" | WLCG |
139328 | cms | solved | urgent | 25/01/2019 | 29/01/2019 | CMS_Facilities | T1_UK_RAL SRM tests failing | WLCG |
139312 | cms | solved | urgent | 25/01/2019 | 29/01/2019 | CMS_Data Transfers | Corrupted files at RAL_Buffer? | WLCG |
139245 | cms | solved | urgent | 21/01/2019 | 04/02/2019 | CMS_Data Transfers | Transfers failing from CNAF_Disk to RAL_Buffer | WLCG |
Availability Report |
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2019-01-29 | 100 | 100 | 97 | 100 | 100 | -1 | |
2019-01-30 | 100 | 100 | 98 | 100 | 100 | 100 | |
2019-01-31 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-02-01 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-02-02 | 100 | 100 | 98 | 100 | 100 | 100 | |
2019-02-03 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-02-04 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-02-05 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-02-06 | 100 | 100 | 100 | 100 | 100 | 100 | |
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2019-01-23 | 100 | 98 | |
2019-01-24 | 100 | 98 | |
2019-01-25 | 100 | 98 | |
2019-01-26 | 100 | 91 | |
2019-01-27 | 100 | 97 | |
2019-01-28 | 100 | 93 | |
2019-01-29 | 100 | 98 | |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
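The colour thresholds quoted above the Hammercloud table (Red <90%, Orange <97%) can be applied mechanically to the daily figures. The following is a minimal sketch, not part of the Tier1 tooling: the function name `classify_availability` and the label "green" for values at or above the 97% target are assumptions made for illustration, and the input values are copied from the CMS HC column of the table.

```python
def classify_availability(percent: float) -> str:
    """Map a daily availability percentage to a colour band.

    Thresholds follow the report: Red <90%, Orange <97%.
    The "green" label for values at or above target is an assumption.
    """
    if percent < 90.0:
        return "red"
    if percent < 97.0:
        return "orange"
    return "green"


# CMS HC values copied from the Hammercloud table in this report.
cms_hc = {
    "2019-01-23": 98,
    "2019-01-24": 98,
    "2019-01-25": 98,
    "2019-01-26": 91,
    "2019-01-27": 97,
    "2019-01-28": 93,
    "2019-01-29": 98,
}

# Keep only the days that fall below the 97% target.
flagged = {
    day: classify_availability(value)
    for day, value in cms_hc.items()
    if classify_availability(value) != "green"
}

for day, band in flagged.items():
    print(f"{day}: {cms_hc[day]}% ({band})")
```

With these inputs, only 2019-01-26 and 2019-01-28 fall below target, both in the orange band; 2019-01-27 sits exactly on the 97% target and is therefore not flagged.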
Notes from Meeting. |