Difference between revisions of "Tier1 Operations Report 2019-05-20"
Revision as of 08:23, 21 May 2019
RAL Tier1 Operations Report for 20th May 2019
Review of Issues during the week of 13th May 2019 to 20th May 2019. |
- PPD had an unpatched Jenkins server which had been compromised. This was identified on Friday 17th May. It was not a Grid resource (we believe it was an old MICE server).
- We are seeing high outbound packet loss over IPv6. Central networking intend to perform a firmware update on the morning of Wednesday 22nd May 2019.
- DUNE jobs are running again at the Tier-1. The problem was reported 10 days ago, but we believe it had existed for longer.
- LHCb Castor (Disk) is now read-only.
Current operational status and issues |
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Limits on concurrent batch system jobs. |
- ALICE - 1000
Notable Changes made since the last meeting. |
- NTR
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
- No ongoing downtime
Advance warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during the morning of the meeting). |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution |
---|---|---|---|---|---|---|---|---|---|
141108 | dune | in progress | top priority | 10/05/2019 | 15/05/2019 | Workload Management | Problem submitting DUNE jobs to RAL CEs | EGI | |
140870 | t2k.org | in progress | less urgent | 25/04/2019 | 14/05/2019 | Data Management - generic | Files vanished from RAL tape? | EGI | |
140773 | lhcb | in progress | top priority | 18/04/2019 | 08/05/2019 | Storage Systems | Removal of Echo unbearably slow | WLCG | |
140447 | dteam | on hold | less urgent | 27/03/2019 | 14/05/2019 | Network problem | packet loss outbound from RAL-LCG2 over IPv6 | EGI | |
140220 | mice | in progress | less urgent | 15/03/2019 | 15/05/2019 | Other | mice LFC to DFC transition | EGI | |
139672 | other | in progress | urgent | 13/02/2019 | 30/04/2019 | Middleware | No LIGO pilots running at RAL | EGI |
GGUS Tickets Closed Last week |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope | Solution |
---|---|---|---|---|---|---|---|---|---|
141105 | ops | solved | less urgent | 10/05/2019 | 14/05/2019 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI | The problem has been solved. Details about the solution: Passing tests now, thanks! andrew mcnab |
140932 | enmr.eu | solved | less urgent | 30/04/2019 | 08/05/2019 | Other | how to install cvmfs on worker nodes | EGI | Hi Enrico, I'm going to make the assumption that as "it works perfectly", I can mark this ticket as solved. Best regards Darren |
140887 | atlas | closed | urgent | 27/04/2019 | 13/05/2019 | File Transfer | UK RAL-LCG2 ransfer error with: srm-ifce err: Communication error on send | WLCG | This is not a RAL issue, but a problem with Wuppertalprod already ticketed at https://ggus.eu/index.php?mode=ticket_info&ticket_id=140883 . Closing this ticket. |
140758 | lhcb | closed | urgent | 17/04/2019 | 08/05/2019 | File Access | lhcbUser svcClass not working as it should ? | WLCG | Hi guys, I'm assuming I can now resolve this one again? Cheers D. |
Availability Report |
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2019-05-06 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-07 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-08 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-09 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-10 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-11 | 100 | 100 | 100 | 100 | 74 | 74 | |
2019-05-12 | 100 | 100 | 100 | 100 | 24 | 24 | |
2019-05-13 | 100 | 100 | 100 | 100 | 100 | 100 |
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2019-05-06 | 100 | 99 | |
2019-05-07 | 100 | 99 | |
2019-05-08 | 100 | 100 | |
2019-05-09 | 100 | 100 | |
2019-05-10 | 100 | 98 | |
2019-05-11 | 100 | 100 | |
2019-05-12 | 100 | 100 | |
2019-05-13 | 100 | 100 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
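The colour thresholds stated above the HammerCloud table (Red <90%, Orange <97%) can be applied mechanically to the daily figures. The following is only a minimal sketch: the function name `classify` and the "green" label for values at or above target are assumptions made for illustration, since the report names only the red and orange bands. The CMS HC values are copied from the table above.

```python
def classify(availability, red_below=90.0, orange_below=97.0):
    """Return the colour band for a daily availability percentage.

    Thresholds follow the report: Red <90%, Orange <97%.
    The "green" label for everything else is an assumption.
    """
    if availability < red_below:
        return "red"
    if availability < orange_below:
        return "orange"
    return "green"


# CMS HC values for 2019-05-06 to 2019-05-13, copied from the table.
cms_hc = [99, 99, 100, 100, 98, 100, 100, 100]
bands = [classify(value) for value in cms_hc]
print(bands)
```

Under these thresholds every CMS HC day in the reporting period falls at or above the 97% target, so none would be flagged.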
Notes from Meeting. |