Difference between revisions of "Tier1 Operations Report 2018-11-13"
From GridPP Wiki
Latest revision as of 08:57, 14 November 2018
RAL Tier1 Operations Report for 5th November 2018
Review of Issues during the week 30th October to the 5th November 2018. |
- We have noticed that our CV17 storage nodes have been randomly rebooting. These nodes are not fully weighted up in Echo and further increases have been stopped while the problem is sorted out. A firmware update is being pushed out. Machines that have the new firmware appear to be fixed, but we are still gathering statistics. It should be noted that there has been no observed impact from the user perspective.
- On 7th November, the first LHCb space token was migrated from Castor to Echo. This is the FAILOVER space token which is used by jobs from other sites if their storage is down.
- ATLAS are unable to get their jobs to run via Singularity. We believe this is because they require privileged containers and we currently offer only unprivileged ones. We would hope that ATLAS could improve their code.
- CMS AAA issue remains ongoing, but we believe the situation is improving. We updated the XRootD version to fix a known bug on Friday and the amount of red in the SAM tests dropped over the weekend.
Current operational status and issues |
- NTR
Resolved Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
- | - | - | - | - |
Ongoing Castor Disk Server Issues |
Machine | VO | DiskPool | dxtx | Comments |
---|---|---|---|---|
gdss788 | WLCG | gridTape | d0t1 | - |
Limits on concurrent batch system jobs. |
- None currently enforced.
Notable Changes made since the last meeting. |
- NA62 were successfully migrated to the new consolidated Castor tape instance.
Entries in GOC DB starting since the last report. |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
- | - | - | - | - | - | - | - |
Declared in the GOC DB |
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|---|
CASTOR | 26325 | Yes | Outage | 15/11/2018 | 15/11/2018 | 3Hrs | Castor SRM endpoint migration |
CASTOR | 26338 | Yes | Outage | 20/11/2018 | 20/11/2018 | 3Hrs | CASTOR out as part of Oracle Patch Installation (Neptune and Pluto environments) |
- No ongoing downtime
- No downtime scheduled in the GOCDB for the next 2 weeks
Advanced warning for other interventions |
The following items are being discussed and are still to be formally scheduled and announced. |
Listing by category:
- DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting). |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
138218 | cms | in progress | urgent | 09/11/2018 | 12/11/2018 | CMS_Data Transfers | Transfers failing from RAL_Buffer to TIFR | WLCG |
138033 | atlas | in progress | urgent | 01/11/2018 | 08/11/2018 | Other | singularity jobs failing at RAL | EGI |
137897 | enmr.eu | waiting for reply | urgent | 23/10/2018 | 12/11/2018 | Accounting | enmr.eu accounting at RAL | EGI |
137822 | lhcb | in progress | top priority | 18/10/2018 | 31/10/2018 | File Transfer | FTS server seems in bad state. | WLCG |
137650 | cms | in progress | very urgent | 09/10/2018 | 12/11/2018 | CMS_AAA WAN Access | Low HC xrootd success rates at T1_UK_RAL | WLCG |
137153 | t2k.org | in progress | urgent | 12/09/2018 | 12/11/2018 | Data Management - generic | LFC entry has file size 0, prevents registering of additional replicas | EGI |
GGUS Tickets Closed Last week |
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope |
---|---|---|---|---|---|---|---|---|
138103 | cms | solved | urgent | 05/11/2018 | 07/11/2018 | CMS_Data Transfers | Transfers failing from RALPP to RAL | WLCG |
138077 | cms | solved | urgent | 02/11/2018 | 08/11/2018 | CMS_SAM tests | SAM test critical T1_UK_RAL | WLCG |
138028 | lhcb | verified | urgent | 01/11/2018 | 12/11/2018 | File Access | File cannot be staged | WLCG |
137752 | other | solved | less urgent | 15/10/2018 | 12/11/2018 | VO Specific Software | Replicate OSG CVMFS repositories to EGI stratum 1s | EGI |
136701 | lhcb | verified | very urgent | 14/08/2018 | 06/11/2018 | File Transfer | background of transfer errors | WLCG |
136199 | lhcb | verified | very urgent | 18/07/2018 | 06/11/2018 | File Transfer | Lots of submitted transfers on RAL FTS | WLCG |
Availability Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas | Atlas-Echo | CMS | LHCB | Alice | OPS | Comments |
---|---|---|---|---|---|---|---|
2018-11-07 | 100 | 100 | 99 | 100 | 100 | 100 | |
2018-11-08 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-09 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-10 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-11 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-12 | 100 | 100 | 100 | 100 | 100 | 100 | |
2018-11-13 | 100 | 100 | 87 | 100 | 100 | 100 | |
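The colour bands stated above the table (target 97.0%, orange below 97%, red below 90%) can be applied mechanically to the daily figures. The following is a minimal sketch only: the function name `classify_availability` and the label `"ok"` for days that meet the target are assumptions made for illustration and are not part of the report or of any monitoring tool.

```python
def classify_availability(percent, target=97.0, red_below=90.0):
    """Return the colour band for one daily availability figure.

    Thresholds follow the report: red below 90%, orange below 97%.
    The "ok" label for days meeting the target is an assumed name.
    """
    if percent < red_below:
        return "red"
    if percent < target:
        return "orange"
    return "ok"


# CMS column of the availability table (2018-11-07 to 2018-11-13).
cms = {
    "2018-11-07": 99,
    "2018-11-08": 100,
    "2018-11-09": 100,
    "2018-11-10": 100,
    "2018-11-11": 100,
    "2018-11-12": 100,
    "2018-11-13": 87,
}

# Keep only the days that fall below the target.
flagged = {
    day: classify_availability(value)
    for day, value in cms.items()
    if classify_availability(value) != "ok"
}
print(flagged)
```

With the CMS figures above, only 2018-11-13 (87%) falls below the target, and it lands in the red band.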
Hammercloud Test Report |
Target Availability for each site is 97.0% | Red <90% | Orange <97% |
Day | Atlas HC | CMS HC | Comment |
---|---|---|---|
2018-11-06 | 98 | 100 | |
2018-11-07 | 100 | 99 | |
2018-11-08 | 100 | 99 | |
2018-11-09 | 100 | 98 | |
2018-11-10 | 100 | 96 | |
2018-11-11 | 100 | 92 | |
2018-11-12 | 100 | 99 | |
2018-11-13 | 100 | 99 |
Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud
Notes from Meeting. |