Difference between revisions of "Tier1 Operations Report 2019-06-03"
From GridPP Wiki
Revision as of 08:33, 4 June 2019
RAL Tier1 Operations Report for 3rd June 2019
Review of Issues during the week 27th May 2019 to the 3rd June 2019. |
- Ongoing: we are seeing high outbound packet loss over IPv6. Central networking performed a firmware update on the border routers, but this didn’t resolve the issue. The plan is to move connections to the new border routers in mid-June, and to do this before debugging any further.
- The old LHCb Castor instance lost three disk servers over the weekend. We don’t intend to spend much effort recovering them. The old LHCb Castor instance will be decommissioned on Friday 7th June, after which no files will be recoverable.
Current operational status and issues |
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
gdss813 | LHCb | lhcb | d1t0 | - |
gdss815 | LHCb | lhcb | d1t0 | - |
gdss778 | LHCb | lhcb | d1t0 | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Limits on concurrent batch system jobs. |
Notable Changes made since the last meeting. |
- NTR
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
- No ongoing downtime
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting). |
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope |
---|---|---|---|---|---|---|---|---|---|
141549 | TEAM | atlas | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-06-03 08:08:00 | ATLAS-RAL-Frontier and some of Lpad-RAL-LCG2 squid degraded | WLCG |
141537 | TEAM | lhcb | RAL-LCG2 | very urgent | NGI_UK | in progress | 2019-05-31 19:28:00 | Pilots Failed at RAL-LCG2 | WLCG |
141462 | TEAM | lhcb | RAL-LCG2 | top priority | NGI_UK | in progress | 2019-06-02 05:45:00 | Error: Connection limit exceeded | WLCG |
141262 | TEAM | lhcb | RAL-LCG2 | very urgent | NGI_UK | in progress | 2019-05-31 09:23:00 | Users are getting [FATAL] Auth failed | WLCG |
140870 | USER | t2k.org | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-05-14 13:19:00 | Files vanished from RAL tape? | EGI |
140447 | USER | dteam | RAL-LCG2 | less urgent | NGI_UK | on hold | 2019-05-22 14:20:00 | packet loss outbound from RAL-LCG2 over IPv6 | EGI |
140220 | USER | mice | RAL-LCG2 | less urgent | NGI_UK | in progress | 2019-05-15 11:07:00 | mice LFC to DFC transition | EGI |
139672 | USER | other | RAL-LCG2 | urgent | NGI_UK | waiting for reply | 2019-06-03 09:23:00 | No LIGO pilots running at RAL | EGI |
GGUS Tickets Closed Last week |
Ticket-ID | Type | VO | Site | Priority | Responsible Unit | Status | Last Update | Subject | Scope |
---|---|---|---|---|---|---|---|---|---|
141359 | USER | ops | RAL-LCG2 | less urgent | NGI_UK | verified | 2019-05-31 08:04:00 | [Rod Dashboard] Issue detected : org.sam.SRM-Put@srm-lhcb.gridpp.rl.ac.uk | EGI |
141333 | ALARM | none | RAL-LCG2 | top priority | NGI_UK | verified | 2019-05-28 10:54:00 | This TEST ALARM has been raised for testing GGUS alarm work flow after a new GGUS release. | WLCG |
Availability Report |
Day | Atlas | Atlas-Echo | CMS | LHCb | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2019-05-27 | 100 | 100 | 100 | 100 | 100 | 100 | |
2019-05-28 | 100 | 100 | 90 | 100 | 98 | 100 | |
2019-05-29 | 100 | 100 | 85 | 100 | 100 | 100 | |
2019-05-30 | 100 | 100 | 90 | 100 | 93 | 100 | |
2019-05-31 | 100 | 100 | 100 | 100 | 89 | 100 | |
2019-06-01 | 100 | 100 | 100 | 95 | 89 | 100 | |
2019-06-02 | 100 | 100 | 100 | 100 | 84 | 100 | |
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2019-05-27 | 100 | 98 | |
2019-05-28 | 100 | 98 | |
2019-05-29 | 100 | 99 | |
2019-05-30 | 100 | 100 | |
2019-05-31 | 100 | 100 | |
2019-06-01 | 100 | 99 | |
2019-06-02 | 100 | 100 | |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
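The colour bands quoted above the Hammercloud table (Red <90%, Orange <97%) can be applied mechanically to the daily figures. The following is a minimal Python sketch of that check, not the tooling actually used to produce the report. The function name `classify` and the "green" label for values at or above target are assumptions made for illustration; the scores are copied from the table in row order.

```python
# Thresholds quoted in the report: target 97.0%, Red < 90%, Orange < 97%.
RED_BELOW = 90.0
ORANGE_BELOW = 97.0


def classify(score: float) -> str:
    """Return the colour band for one daily HammerCloud score (percent)."""
    if score < RED_BELOW:
        return "red"
    if score < ORANGE_BELOW:
        return "orange"
    # Assumed label: the report only names the Red and Orange bands.
    return "green"


# Daily values copied from the Hammercloud table, in row order.
atlas_hc = [100, 100, 100, 100, 100, 100, 100]
cms_hc = [98, 98, 99, 100, 100, 99, 100]

for name, scores in (("Atlas HC", atlas_hc), ("CMS HC", cms_hc)):
    bands = [classify(s) for s in scores]
    print(name, bands)
```

With the figures reported for this week, every day for both Atlas HC and CMS HC is at or above the 97% target, so no day falls into the Red or Orange band.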
Notes from Meeting. |