Latest revision as of 09:03, 21 May 2019
RAL Tier1 Operations Report for 20th May 2019
Review of Issues during the week 13th May 2019 to the 20th May 2019. |
- PPD had an unpatched Jenkins server which had been compromised. This was identified on Friday 17th May. It was not a Grid resource (we believe it was an old MICE server).
- We are seeing high outbound packet loss over IPv6. Central networking intend to perform a firmware update on the morning of Wednesday 22/05/2019.
- DUNE jobs are running again at Tier-1. This was reported 10 days ago, but we believe it had been a problem for longer.
- LHCb Castor (Disk) is now read only.
Current operational status and issues |
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Limits on concurrent batch system jobs. |
- ALICE - 1000
Notable Changes made since the last meeting. |
- NTR
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
- No ongoing downtime
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting). |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution |
---|---|---|---|---|---|---|---|---|---|
141262 | lhcb | in progress | very urgent | 21/05/2019 | 21/05/2019 | File Access | Users are getting [FATAL] Auth failed | WLCG | |
140870 | t2k.org | in progress | less urgent | 25/04/2019 | 14/05/2019 | Data Management - generic | Files vanished from RAL tape? | EGI | |
140447 | dteam | on hold | less urgent | 27/03/2019 | 14/05/2019 | Network problem | packet loss outbound from RAL-LCG2 over IPv6 | EGI | |
140220 | mice | in progress | less urgent | 15/03/2019 | 15/05/2019 | Other | mice LFC to DFC transition | EGI | |
139672 | other | in progress | urgent | 13/02/2019 | 30/04/2019 | Middleware | No LIGO pilots running at RAL | EGI |
GGUS Tickets Closed Last week |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution |
---|---|---|---|---|---|---|---|---|---|
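141108 | dune | verified | top priority | 10/05/2019 | 17/05/2019 | Workload Management | Problem submitting DUNE jobs to RAL CEs | EGI | For the avoidance of doubt, DUNE uses the same queue names as LIGO (or CMS), and as can be seen we are happily running DUNE jobs.
Total: LIGO_UK_RAL_arc_ce01 Total LIGO_frontend OSG_Flock_frontend gpfrontend01_frontend
Status Running 45 0 0 45
Idle 5 5 0 0
Waiting 5 5 0 0
Pending 0 0 0 0
Stage in 0 0 0 0
Stage out 0 0 0 0
Unknown 0 0 0 0
Held 0 0 0 0
Requested Max glideins 792 10 0 782
Idle 0 0 0 0
Client Monitor Claimed 90 0 0 90
User Run Here 90 0 0 90
User Running 90 0 0 90
Unmatched 6391 0 0 6391
User idle 0 0 0 0
Registered 0 0 0 0
Info age 33427 71 0 33356
Troubleshoot Diff(Status: Running - CM: Registered) 45 0 0 45
Diff(Status: Idle - Requested: Idle) 5 5 0 0
|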
141105 | ops | verified | less urgent | 10/05/2019 | 17/05/2019 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI | The problem has been solved.
Details about the solution --------- Passing tests now, thanks! andrew mcnab |
140965 | cms | closed | urgent | 02/05/2019 | 17/05/2019 | CMS_Data Transfers | Datatransfers T2_AT_VIenna -> T1_UK_RAL_Buffer failing | WLCG | Possible corruption on the file coming from Vienna may have made the file un-deleteable at RAL. The file was replaced with a new copy at Vienna and transfer in debug stream is now green. |
140887 | atlas | closed | urgent | 27/04/2019 | 13/05/2019 | File Transfer | UK RAL-LCG2 ransfer error with: srm-ifce err: Communication error on send | WLCG | This is not a RAL issue, but a problem with Wuppertalprod already ticketed at https://ggus.eu/index.php?mode=ticket_info&ticket_id=140883 .
Closing this ticket. |
140660 | cms | closed | urgent | 09/04/2019 | 16/05/2019 | CMS_Central Workflows | FIle read issues for Workflows where data is located at T1_UK_RAL | WLCG | The gridmap has been replaced and we appear to be working again. As such I'm going to close this ticket as implementation of the full solution is beyond the scope of this ticket. |
Availability Report |
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2019-05-14 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-15 | 100 | 100 | 99 | 100 | 100 | 100 | |
2019-05-16 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-17 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-18 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-19 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-20 | 100 | 100 | 100 | 100 | 100 | 100 |
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2019-05-14 | 100 | 99 | |
2019-05-15 | 100 | 99 | |
2019-05-16 | 100 | 100 | |
2019-05-17 | 100 | 100 | |
2019-05-18 | 100 | 99 | |
2019-05-19 | 100 | 100 | |
2019-05-20 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
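The colour thresholds stated above the Hammercloud table (Red <90%, Orange <97%, target 97.0%) can be sketched as a small classifier. This is only an illustration: the function name `hc_band` and the label "on target" for values at or above 97% are assumptions, not something the report or HammerCloud defines.

```python
def hc_band(availability: float) -> str:
    """Classify a daily availability percentage using the report's thresholds.

    "red" and "orange" follow the stated limits (<90% and <97%);
    "on target" is a hypothetical label for values meeting the 97.0% target.
    """
    if availability < 90.0:
        return "red"
    if availability < 97.0:
        return "orange"
    return "on target"


# CMS HC values for 2019-05-14 to 2019-05-20, copied from the table above.
cms_hc = [99, 99, 100, 100, 99, 100, 100]
bands = [hc_band(v) for v in cms_hc]
print(bands)
```

Under these thresholds every day in the reporting week meets the target for both Atlas HC and CMS HC.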
Notes from Meeting. |