Tier1 Operations Report 2014-01-08

From GridPP Wiki

RAL Tier1 Operations Report for 8th January 2014

Review of Issues during the weeks 18th December 2013 to 8th January 2014.
  • With the Christmas & New Year holiday falling since the last report, this has been a quiet three weeks. Operations were smooth over the holiday period.
  • There were a number of restarts of the xrootd daemon on disk servers in AliceDisk over the days 21-23 December. These then stopped and the root cause remains unknown. There was also a problem with the Stager on the Castor GEN instance on Tuesday 24th December which was traced to a logging problem and fixed.
  • The Atlas file renaming in Castor is ongoing, with around 14 million files renamed so far. The number of files lost has been found to be around 90% lower than previously thought; the earlier figure was inflated by attempts to rename some files more than once.
Resolved Disk Server Issues
  • None.
Current operational status and issues
  • None
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • The batch farm roll-out of a Condor update and a reduction in the memory over-commit (as well as kernel/errata updates) was completed before Christmas.
  • L&B servers lcglb03 & lcglb04 have been replaced by new SL6 EMI-3 update L&B nodes lcglb01.gridpp.rl.ac.uk and lcglb02.gridpp.rl.ac.uk.
  • EMI-3 update 8 was applied to the LFC nodes.
  • There have been some tweaks to the Condor configuration for multicore jobs: multicore accounting groups were added for Atlas to enable fairshares for this type of job, and the algorithm used to free up job slots for multicore jobs was updated.
  • One of the tranches of CPU orders is currently being delivered.
Declared in the GOC DB
  • There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
  • Networking:
    • Implementation of new site firewall.
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1
      • Change the way the Tier1 connects to the RAL network.
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 18th December 2013 and 8th January 2014.
Service Scheduled? Outage/At Risk Start End Duration Reason
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk SCHEDULED OUTAGE 18/12/2013 11:00 31/01/2014 00:00 43 days, 13 hours Old EMI-2 hosts to be retired
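The quoted duration can be cross-checked from the Start and End columns. The following is a minimal sketch in Python, assuming the timestamps are day/month/year in a single timezone; the helper name `outage_duration` and the format constant are illustrative only and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed layout of the Start/End columns in the table above.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"

def outage_duration(start, end):
    """Return (days, hours) elapsed between two timestamps."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    # timedelta keeps whole days and the leftover seconds separately.
    return delta.days, delta.seconds // 3600

days, hours = outage_duration("18/12/2013 11:00", "31/01/2014 00:00")
print(f"{days} days, {hours} hours")  # prints "43 days, 13 hours"
```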
Open GGUS Tickets (Snapshot at time of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
100086 Green Less Urgent Waiting Reply 2014-01-07 2014-01-08 T2K WMS jobs cleared too rapidly
99768 Red Less Urgent Waiting Reply 2013-12-13 2014-01-07 Atlas RAL-LCG2_DATADISK: transfer failures with "source file doesn't exist"
99647 Red Less Urgent Waiting Reply 2013-12-12 2013-12-17 SNO+ lcg-cp connection timeouts
99556 Red Very Urgent In Progress 2013-12-06 2014-01-07 NGI Argus requests for NGI_UK
98249 Red Urgent Waiting Reply 2013-10-21 2014-01-06 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
98122 Red Less Urgent Waiting Reply 2013-10-17 2014-01-06 cernatschool CVMFS access for the cernatschool.org VO
97025 Red Less Urgent On Hold 2013-09-03 2014-01-06 Myproxy server certificate does not contain hostname
86152 Red Less Urgent On Hold 2012-09-17 2013-10-18 correlated packet-loss on perfsonar host
Availability Report
Day OPS Alice Atlas CMS LHCb Comment
18/12/13 100 100 100 100 100
19/12/13 100 100 99.1 100 100 Single SRM GET failure: "could not open connection to srm-atlas.gridpp.rl.ac.uk"
20/12/13 100 100 100 100 100
21/12/13 100 100 100 100 100
22/12/13 100 100 100 100 100
23/12/13 100 100 100 100 100
24/12/13 100 100 100 100 100
25/12/13 100 100 100 100 100
26/12/13 100 100 100 100 100
27/12/13 100 100 100 100 100
28/12/13 100 100 100 100 100
29/12/13 100 100 100 100 100
30/12/13 100 100 100 100 100
31/12/13 100 100 100 100 100
01/01/14 100 100 100 100 100
02/01/14 100 100 100 100 100
03/01/14 100 100 100 100 100
04/01/14 100 100 100 100 100
05/01/14 100 100 100 100 100
06/01/14 100 100 100 100 100
07/01/14 100 100 100 100 100
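Days that fell below 100% can be picked out of the table mechanically. The sketch below, in Python, assumes each row is whitespace-separated with the date first, five percentages in the column order of the header, and an optional free-text comment last; the function names `parse_row` and `below_target` are hypothetical.

```python
# Column order taken from the table header above.
COLUMNS = ["OPS", "Alice", "Atlas", "CMS", "LHCb"]

def parse_row(line):
    """Split one row into (day, {column: percent}, comment)."""
    # Date plus five values; whatever remains is the comment.
    parts = line.split(maxsplit=6)
    day = parts[0]
    values = {name: float(v) for name, v in zip(COLUMNS, parts[1:6])}
    comment = parts[6] if len(parts) > 6 else ""
    return day, values, comment

def below_target(rows, target=100.0):
    """List (day, column, percent) for every value under the target."""
    hits = []
    for line in rows:
        day, values, _ = parse_row(line)
        hits.extend((day, name, pct) for name, pct in values.items() if pct < target)
    return hits

rows = [
    "18/12/13 100 100 100 100 100",
    '19/12/13 100 100 99.1 100 100 Single SRM GET failure: "could not open connection to srm-atlas.gridpp.rl.ac.uk"',
]
print(below_target(rows))  # prints [('19/12/13', 'Atlas', 99.1)]
```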