Line 126:
<!-- *********************************************************** ----->
<!-- ***********************Start T1 text*********************** ----->
− '''Tue 6th November''' Report for the Experiments Liaison Meeting (06/11/2018) is [https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2018-11-06 here].
+ '''Tue 13th November''' Report for the Experiments Liaison Meeting (13/11/2018) is [https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2018-11-13 here].
<!-- *********************************************************** ----->
<!-- **********************End T1 text************************** ----->
− * ALICE have had authentication problems with CASTOR. An update was applied to CASTOR on 29th October, which promptly broke it; the update was reverted the same day, but issues remained until Thursday.
+ * We have noticed that the CV17 storage nodes have been rebooting at random. These nodes are not fully weighted up in Echo, and further weight increases have been stopped while the problem is sorted out. A firmware update is being pushed out; machines with the new firmware appear to be fixed, but we are still gathering statistics. There has been no observed impact from the user perspective.
+ * On 7th November, the first LHCb space token was migrated from Castor to Echo! This is the FAILOVER space token, which is used by jobs from other sites if their storage is down. The test jobs are succeeding.
− * LHCb have started syncing more of their data from Castor to Echo. They have written over a PB to Echo since 2nd November (just under 3 days). The write rate into Echo is about 10 times higher than normal ATLAS and CMS production work (5 GB/s rather than 500 MB/s).
+ * NA62 were successfully migrated to the new consolidated Castor tape instance.
+ * ATLAS are unable to get their jobs to work via Singularity. We believe this is because the jobs require privileged containers, while we currently offer only unprivileged ones. We would hope that ATLAS can improve their code.
− * Some (1-5%) of CMS gridFTP SAM test jobs are failing against Echo with "System error in bind: Address already in use". This happens when GridFTP cannot find a contiguous block of ports to use for a transfer (a minimal illustration appears below the table). The potential problem has been known about for a long time, but we believed we had sufficient mitigation in place to prevent it causing any real issues. It may be related to the bulk transfers LHCb have been doing in the last week.
− * CMS AAA problems remain. The manager continues to crash at random. As mitigation we will set up a second instance in an attempt to hide the problem. Additionally, this week we will be pushing out the newest version of XRootD (4.8.5), which claims to fix the problem.
+ * The CMS AAA issue remains ongoing, but we believe the situation is improving. We updated the XRootD version on Friday to fix a known bug, and the amount of red in the SAM tests dropped over the weekend.
− * ATLAS migrated to the new tape instance (wlcgTape) on Wednesday 31st October. ATLAS are now completely off their old instance, which will be decommissioned. CMS and the non-LHC VOs will follow before Christmas.
− * After NA62 lost data at CERN, the Castor team recalled what we had backed up at RAL to the new wlcgTape instance, as its buffer is larger and more performant than the gen instance's. This sped up the recovery by a day or so.
− * We received multiple GGUS tickets regarding the FTS problems, which quickly pointed to an IPv6 issue: inbound IPv6 traffic was being blocked to machines that were not on the OPN subnet (i.e. a firewall problem; see the connectivity sketch below the table). We believe we fixed the issue on Friday and did not get any further complaints over the weekend. IPv6 problems, both at RAL and at other sites, are impacting the FTS service very frequently. To mitigate this, we are reverting the FTS test instance to IPv4 only, which will allow VOs to continue to function if the problem recurs. We are also planning to move the FTS service onto the OPN subnet this week.
− * It was also discovered that CERN had been incorrectly routing IPv6 packets. At the LHCONE meeting last week it was noted that KIT was receiving packets from RAL via LHCONE. It turned out that KIT was only advertising its IPv6 address to those on the OPN via the LHCONE, which should have meant no IPv6 transfers between RAL and KIT were possible. The fact that the Tier-1 is not on the LHCONE causes confusion for other sites, especially other Tier-1s, who assume we are part of it.
|}
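As background to the gridFTP bind failures noted above: GridFTP data channels bind listening ports from a configured range (the GLOBUS_TCP_PORT_RANGE environment variable on a stock Globus install), and when no port in the range is free the bind fails with "Address already in use". Below is a minimal Python sketch of that failure mode, simplified to plain exhaustion of the range rather than GridFTP's exact allocation logic, using an illustrative five-port range.
<syntaxhighlight lang="python">
import socket

# Illustrative five-port range; a production GridFTP server takes its
# data-port range from the GLOBUS_TCP_PORT_RANGE environment variable.
PORT_RANGE = range(50000, 50005)

def bind_in_range(port_range):
    """Bind the first free port in the range, as a data channel must;
    raise if every port is already taken."""
    for port in port_range:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("", port))
            return sock
        except OSError:  # this port is busy, try the next one
            sock.close()
    raise OSError("System error in bind: Address already in use")

# Fill the whole range, then show the error a busy transfer node sees.
held = [bind_in_range(PORT_RANGE) for _ in PORT_RANGE]
try:
    bind_in_range(PORT_RANGE)  # no free port left, so this raises
except OSError as exc:
    print(exc)
finally:
    for sock in held:
        sock.close()
</syntaxhighlight>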
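For the FTS IPv6 firewall issue noted above, the tell-tale pattern was IPv4 connections succeeding while IPv6 connections from hosts off the OPN subnet were dropped. A minimal dual-stack check along those lines is sketched below; the host name and port are placeholders, not the real FTS endpoint.
<syntaxhighlight lang="python">
import socket

# Placeholder endpoint for illustration only; substitute the real
# FTS host name and port.
HOST, PORT = "fts.example.gridpp.rl.ac.uk", 8446

def check(family, label):
    """Attempt a TCP connect over one address family and report it."""
    try:
        infos = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        print(f"{label}: no address found ({exc})")
        return
    sock = socket.socket(family, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect(infos[0][4])
        print(f"{label}: connected to {infos[0][4][0]}")
    except OSError as exc:  # a timeout here matched the firewall drop
        print(f"{label}: failed ({exc})")
    finally:
        sock.close()

check(socket.AF_INET, "IPv4")    # succeeded throughout the incident
check(socket.AF_INET6, "IPv6")   # timed out for hosts off the OPN subnet
</syntaxhighlight>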
<!-- ****************Start Storage & DM****************** ----->