Difference between revisions of "Tier1 Operations Report 2018-11-06"

From GridPP Wiki

Revision as of 09:59, 6 November 2018

RAL Tier1 Operations Report for 5th November 2018

Review of Issues during the week 30th October to the 5th November 2018.
  • ALICE have had authentication problems with CASTOR. An update was applied to CASTOR on the 29th October, which promptly broke it. This was reverted the same day but issues remained until Thursday.
  • LHCb have started syncing more of their data from Castor to Echo. They have written over a PB to Echo since the 2nd November (just under 3 days). The write rate into Echo is about 10 times higher than normal ATLAS and CMS production work (5GB/s instead of 500MB/s).
  • Some (1-5%) of CMS GridFTP SAM test jobs are failing against Echo due to "System error in bind: Address already in use". This occurs when GridFTP cannot find a contiguous block of ports to use for a transfer. This potential problem has been known about for a long time, but we believed we had sufficient mitigation in place to prevent it causing any real issues. It may be related to the bulk transfers LHCb have been doing in the last week.
  • CMS AAA problems remain. The manager continues to crash at random. As mitigation we will set up a second instance in an attempt to hide the problem. Additionally, this week we will be pushing out the newest version of XRootD (4.8.5), which claims to fix the problem.
  • ATLAS migrated to the new tape instance (wlcgTape) on Wednesday 31st October. ATLAS are now completely off their old instance which will be decommissioned. CMS and non-LHC VOs will follow before Christmas.
  • After NA62 lost data at CERN, the Castor team recalled what we had backed up at RAL to the new wlcgTape instance, as its buffer was larger and more performant than the gen instance one. This sped up recovery by a day or so.
  • We received multiple GGUS tickets regarding the FTS problems, which quickly pointed to an IPv6 issue. Inbound IPv6 traffic was being blocked to machines that were not on the OPN subnet (i.e. a firewall problem). We believe the issue was fixed on Friday, and we did not receive any further complaints over the weekend. IPv6 problems, both at RAL and at other sites, are impacting the FTS service very frequently. To mitigate this, we are reverting the FTS test instance to IPv4 only; this will allow VOs to continue to function if the problem recurs. We are also planning to move the FTS service onto the OPN subnet this week.
  • It was also discovered that CERN has been incorrectly routing IPv6 packets. At the LHCONE meeting last week it was noted that KIT was receiving packets from RAL via LHCONE. It turned out that KIT was only advertising its IPv6 address to those on the OPN via the LHCONE. This should have meant no IPv6 transfers between RAL and KIT were possible. The fact that the Tier-1 is not on the LHCONE does cause confusion for other sites, especially Tier-1s, who assume we are part of it.
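As a rough cross-check of the LHCb write rate quoted in the review above, the average rate implied by "over a PB in just under 3 days" can be worked out directly. This is only a sketch: it assumes decimal units (1 PB = 10^15 bytes), exactly 1 PB written and exactly 3 days elapsed, so the result is a lower bound on the true average. The function name is illustrative.

```python
# Assumed decimal units; the report does not say whether PB/GB are decimal or binary.
PB = 10**15
GB = 10**9
SECONDS_PER_DAY = 86_400


def average_rate_gb_per_s(bytes_written, days):
    """Average write rate in GB/s for a volume written over a number of days."""
    return bytes_written / (days * SECONDS_PER_DAY) / GB


# "Over a PB" in "just under 3 days": both roundings make this a lower bound.
lhcb_avg = average_rate_gb_per_s(1 * PB, 3)
print(f"LHCb average write rate: at least {lhcb_avg:.2f} GB/s")
```

Under these assumptions the average comes out just under 4 GB/s, which is consistent with the quoted rate of about 5 GB/s.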
Current operational status and issues
  • NTR
Resolved Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -
Ongoing Castor Disk Server Issues
Machine VO DiskPool dxtx Comments
- - - - -


Limits on concurrent batch system jobs.
  • None currently enforced.
Notable Changes made since the last meeting.
  • None.
Entries in GOC DB starting since the last report.
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
- | - | - | - | - | - | - | -
Declared in the GOC DB
Service | ID | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-atlas | 26250 | Yes | Outage | 31/10/2018 09:00 | 31/10/2018 12:00 | - | Migration of ATLAS to wlcgTape.
  • No ongoing downtime
  • No downtime scheduled in the GOCDB for next 2 weeks
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Castor:
    • Disk-only storage will end in Castor. A new single tape-only Castor instance (called WLCGTape) is being tested. This uses generic Castor headnodes on SL7, configured by Quattor/Aquilon, with a slightly newer Castor version.
    • Update systems to use SL7, configured by Quattor/Aquilon. (Tape servers done)
  • Internal
    • DNS servers will be rolled out within the Tier1 network.
Open GGUS Tickets (Snapshot taken during morning of the meeting).

Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
138103 | cms | in progress | urgent | 05/11/2018 | 05/11/2018 | CMS_Data Transfers | Transfers failing from RALPP to RAL | WLCG
138077 | cms | in progress | urgent | 02/11/2018 | 05/11/2018 | CMS_SAM tests | SAM test critical T1_UK_RAL | WLCG
138033 | atlas | in progress | urgent | 01/11/2018 | 01/11/2018 | Other | singularity jobs failing at RAL | EGI
138028 | lhcb | in progress | urgent | 01/11/2018 | 01/11/2018 | File Access | File cannot be staged | WLCG
137897 | enmr.eu | waiting for reply | urgent | 23/10/2018 | 05/11/2018 | Accounting | enmr.eu accounting at RAL | EGI
137822 | lhcb | in progress | top priority | 18/10/2018 | 31/10/2018 | File Transfer | FTS server seems in bad state. | WLCG
137752 | other | in progress | less urgent | 15/10/2018 | 02/11/2018 | VO Specific Software | Replicate OSG CVMFS repositories to EGI stratum 1s | EGI
137650 | cms | in progress | very urgent | 09/10/2018 | 02/11/2018 | CMS_AAA WAN Access | Low HC xrootd success rates at T1_UK_RAL | WLCG
137153 | t2k.org | in progress | urgent | 12/09/2018 | 05/11/2018 | Data Management - generic | LFC entry has file size 0, preventsw registering of additional replicas | EGI
GGUS Tickets Closed Last week
Request id | Affected vo | Status | Priority | Date of creation | Last update | Type of problem | Subject | Scope
137881 | lhcb | verified | urgent | 22/10/2018 | 24/10/2018 | Other | Low level of running jobs at RAL-LCG2 | WLCG
137723 | cms | closed | urgent | 14/10/2018 | 29/10/2018 | CMS_Data Transfers | PhEDEx component Agent Watchdog donw at T1_UK-RAL | WLCG
137634 | cms | closed | urgent | 08/10/2018 | 24/10/2018 | CMS_Data Transfers | Transfers failing from T1_UK_RAL_Disk to METU | WLCG
137391 | atlas | closed | urgent | 25/09/2018 | 24/10/2018 | Network problem | UK RAL-LCG2 transfer errors with Communication error on send | WLCG
137195 | ops | verified | less urgent | 14/09/2018 | 28/10/2018 | Operations | [Rod Dashboard] Issues detected at RAL-LCG2 | EGI
124876 | ops | solved | less urgent | 07/11/2016 | 24/10/2018 | Operations | [Rod Dashboard] Issue detected : hr.srce.GridFTP-Transfer-ops@gridftp.echo.stfc.ac.uk | EGI

Availability Report

Target Availability for each site is 97.0%. Red: <90%. Orange: <97%.
Day Atlas Atlas-Echo CMS LHCB Alice OPS Comments
2018-10-22 100 100 99 100 100 100
2018-10-23 100 100 99 100 100 100
2018-10-24 100 100 98 100 100 100
2018-10-25 100 100 95 100 100 100
2018-10-26 100 100 100 100 93 100
2018-10-27 100 100 100 100 100 100
2018-10-28 100 100 100 100 100 100
2018-10-29 100 100 99 100 55 100
2018-10-30 100 100 99 100 66
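The colour thresholds stated for the availability table can be applied mechanically. The following is a minimal sketch, not the tool used to produce the report: the function name and the "ok" label for values at or above target are assumptions, while the 90% and 97% thresholds and the sample values are taken from the table above.

```python
def classify(availability, red_below=90.0, orange_below=97.0):
    """Map a daily availability percentage to the report's colour bands."""
    if availability < red_below:
        return "red"
    if availability < orange_below:
        return "orange"
    return "ok"  # assumed label: the report names only Red and Orange


# Sample values copied from the availability table.
samples = {
    ("Alice", "2018-10-29"): 55,
    ("Alice", "2018-10-26"): 93,
    ("CMS", "2018-10-25"): 95,
    ("CMS", "2018-10-24"): 98,
}

for (vo, day), value in samples.items():
    print(f"{day} {vo}: {value}% -> {classify(value)}")
```

With these thresholds, the ALICE figure for 29th October falls in the red band, while the CMS dip on 25th October is orange.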
Hammercloud Test Report
Target Availability for each site is 97.0%. Red: <90%. Orange: <97%.
Day Atlas HC CMS HC Comment
2018-10-22 100 98
2018-10-23 100 97
2018-10-24 100 99
2018-10-25 100 99
2018-10-26 100 99
2018-10-27 100 99
2018-10-28 98 98
2018-10-29 100 98
2018-10-30 - -

Key: Atlas HC = Atlas HammerCloud (Queue RAL-LCG2_UCORE, Template 841); CMS HC = CMS HammerCloud

Notes from Meeting.