Tier1 Operations Report 2014-03-19


RAL Tier1 Operations Report for 19th March 2014

Review of Issues during the week 12th to 19th March 2014.
  • In the early evening of Wednesday 12th March there was a failure of the primary OPN link to CERN between 17:00 and 19:00. Traffic flowed over the backup link; however, the failover was not clean and during this time we were failing the VO SUM tests.
  • There was a problem with one of the FTS2 agent systems in the early hours of Thursday 13th March. Owing to a configuration error the hypervisor hosting this virtual machine rebooted, and this particular VM was not configured to restart automatically (see the sketch after this list). This was resolved by the primary on-call.
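The FTS2 incident above came down to a guest not being set to restart when its hypervisor rebooted. As a minimal sketch only, assuming a libvirt-managed hypervisor with the libvirt-python bindings (the report does not name the virtualisation platform), the snippet below marks guests to autostart on host boot; the guest name shown is hypothetical.

#!/usr/bin/env python
# Minimal sketch: mark guests so they restart when the hypervisor boots.
# Assumes a libvirt-managed hypervisor with libvirt-python installed; the
# actual virtualisation platform at RAL is not stated in this report.
import sys
import libvirt

def ensure_autostart(names):
    conn = libvirt.open("qemu:///system")    # assumed local connection URI
    try:
        for name in names:
            dom = conn.lookupByName(name)
            if not dom.autostart():
                dom.setAutostart(1)          # start this guest on host boot
                print("enabled autostart for %s" % name)
            else:
                print("autostart already set for %s" % name)
    finally:
        conn.close()

if __name__ == "__main__":
    ensure_autostart(sys.argv[1:])           # e.g. "fts2-agent01" (hypothetical name)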
Resolved Disk Server Issues
  • None
Current operational status and issues
  • There have been problems with the CMS Castor instance through the last week. These are triggered by high load on CMS_Tape - with all the disk servers that provide the cache for this ervice class running flat out (as far as network connectivity goes). Work is underway to increase the throughput of this disk cache.
  • The Castor Team are now able to reproduce the intermittent failures of Castor access via the SRM that has been reported in recent weeks. Understanding of the problem is significantly adcanced and further investigations are ongoing using the Castor Preprod instance. Ideas for a workaround are being developed.
  • The problem of full Castor disk space for Atlas has been eased. Working with Atlas the file deletion rate has been somewhat improved. However, there is still a problem that needs to be understood.
  • Around 50 files in tape backed service classes (mainly in GEN) have been found not to have migrated to tape. This is under investigation. The cause for some of these is understood (a bad tape at time of migration). CERN will provide a script to re-send the remaining ones to tape.
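The CMS_Tape item above describes the cache disk servers as running flat out on network connectivity. As an illustrative check only, the snippet below samples /proc/net/dev on a disk server to estimate whether a NIC is saturated; the interface name and the 1 Gbit/s line rate are assumptions, not values taken from this report.

#!/usr/bin/env python
# Minimal sketch: estimate NIC throughput on a disk server by sampling
# /proc/net/dev twice, to check whether a cache server is network-bound.
# Interface name and line rate are assumed, not taken from this report.
import time

def tx_rx_bytes(iface):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[8]), int(fields[0])   # tx bytes, rx bytes
    raise ValueError("interface %s not found" % iface)

def throughput(iface="eth0", interval=10, line_rate_bps=1e9):
    tx1, rx1 = tx_rx_bytes(iface)
    time.sleep(interval)
    tx2, rx2 = tx_rx_bytes(iface)
    tx_bps = 8.0 * (tx2 - tx1) / interval
    rx_bps = 8.0 * (rx2 - rx1) / interval
    print("%s: tx %.0f Mbit/s, rx %.0f Mbit/s (%.0f%% of assumed line rate)"
          % (iface, tx_bps / 1e6, rx_bps / 1e6,
             100.0 * max(tx_bps, rx_bps) / line_rate_bps))

if __name__ == "__main__":
    throughput()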
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • The move of the Tier1 to use the new site firewall took place on Monday 17th March between 07:00 and 07:30. FTS (2 & 3) services were drained and stopped during the change. The batch system was also reconfigured so that new batch jobs would not start during this period (see the sketch after this list). The change was successful. There was a routing problem that affected the LFC in particular, and external access from many worker nodes, but this was fixed in around an hour.
  • One batch of worker nodes was updated to the EMI-3 version of the WN software a week ago. So far so good.
  • The EMI3 Argus server is in use for most of the CEs and one batch of worker nodes.
  • The planned and announced UPS/Generator load test scheduled for this morning (19th March) was cancelled.
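The firewall change above required that no new batch jobs start during the intervention window. As a minimal sketch only, assuming an HTCondor batch system with administrative access from the submit host (the report itself does not name the batch software), the snippet below peacefully pauses job starts before the window and resumes them afterwards.

#!/usr/bin/env python
# Minimal sketch: pause and resume batch job starts around a maintenance window.
# Assumes an HTCondor pool and administrative rights; the report does not name
# the batch software, so treat this as illustrative only.
import subprocess

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

def pause_job_starts():
    # Peacefully shut down the startds: running jobs finish, no new ones start.
    run(["condor_off", "-all", "-peaceful", "-startd"])

def resume_job_starts():
    # Bring the startds back so the pool accepts new jobs again.
    run(["condor_on", "-all", "-startd"])

if __name__ == "__main__":
    pause_job_starts()
    # ... network intervention happens here ...
    resume_job_starts()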
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is largely complete. We are starting to look at possible dates for rolling this out (probably around April).
  • Networking:
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric
    • We are phasing out the use of the software server used by the small VOs.
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 12th and 19th March 2014.
Service Scheduled? Outage/At Risk Start End Duration Reason
Whole Site SCHEDULED WARNING 17/03/2014 07:00 17/03/2014 17:00 10 hours Site At Risk during and following change to use new firewall.
lcgfts.gridpp.rl.ac.uk, lcgfts3.gridpp.rl.ac.uk SCHEDULED OUTAGE 17/03/2014 06:00 17/03/2014 09:00 3 hours Drain and stop of FTS services during update to new site firewall.
srm-cms.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 14/03/2014 09:40 14/03/2014 10:26 46 minutes Problem with CMS Castor instance being investigated.
srm-cms.gridpp.rl.ac.uk UNSCHEDULED OUTAGE 14/03/2014 04:15 14/03/2014 07:15 3 hours Currently investigating problems with Oracle DB behind Castor CMS
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
101968 Green Less Urgent On Hold 2014-03-11 2014-03-12 Atlas RAL-LCG2_SCRATCHDISK: One dataset to delete is causing 1379 deletion errors
101079 Red Urgent In Progress 2014-02-09 2014-03-17 ARC CEs have VOViews with a default SE of "0"
101052 Red Urgent In Progress 2014-02-06 2014-03-17 Biomed Can't retrieve job result file from cream-ce02.gridpp.rl.ac.uk
99556 Red Very Urgent In Progress 2013-12-06 2014-03-06 NGI Argus requests for NGI_UK
98249 Red Urgent Waiting Reply 2013-10-21 2014-03-13 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 Red Less Urgent On Hold 2013-09-03 2014-03-04 Myproxy server certificate does not contain hostname
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
12/03/14 100 100 91.4 93.7 90.4 There was a failure of the Primary OPN link to CERN. Traffic flipped to the backup link but the failover was not complete.
13/03/14 100 100 100 91.7 100 2 SRM test failures (both "User Timeout") See next entry for cause.
14/03/14 100 100 100 72.1 100 SRM test failures. These appear as the same "User Timeout" problem as yesterday - a bad request in the database.
15/03/14 100 100 100 100 100
16/03/14 100 100 100 100 100
17/03/14 100 100 99.1 87.7 100 Atlas: Single SRM Test ("User Timeout"); CMS: Continuation of what are believed to be load triggered problems in CMS_Tape.
18/03/14 100 100 97.9 64.2 100 Atlas: Single SRM Test failure ("could not open connection to srm-atlas"); CMS: Continuation of above problems.