Tier1 Operations Report 2014-01-29

From GridPP Wiki
Revision as of 13:26, 29 January 2014 by Gareth smith (Talk | contribs)

RAL Tier1 Operations Report for 29th January 2014

Review of Issues during the week 22nd to 29th January 2014.
  • On Thursday afternoon (23rd January) the condor master daemons on worker nodes were restarted due to a configuration error. The system recovered by itself, although some batch jobs were lost.
  • There were problems with one of the WMS systems (lcgwms05) during the night Monday/Tuesday (27/28 Jan) when a user ran many unsuitable jobs through this WMS and filled up the available disk space.
  • On Monday there was a database crash on one of the nodes in the Castor Database Oracle RAC. The crash has the symptoms of a bug that has been seen before and previously reported to Oracle. The processes on that node failed over to other nodes in the RAC. During the afternoon the processes were put back into their normal locations. There was a transitory effect on Castor as these changes happened.
  • There was a problem with failing CMS SAM tests on the ARC-CEs from the end of yesterday afternoon until late this morning. This was traced to a configuration problem: the relevant CMS user was being given too low a priority on the batch system.
Resolved Disk Server Issues
  • None
Current operational status and issues
  • Nothing to report.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week.
  • As reported at the last meeting there was a failed attempt to update FTS3 (to resolve openssl problems) on Tuesday 22nd Jan which was backed out. The upgrade was repeated, this time successfully, on Monday 27th Jan. During this second attempt all existing proxies on the FTS3 systems were deleted.
Declared in the GOC DB
  • There is an entry for the retirement of two old (and replaced) Logging & Bookkeeping servers.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Tuesday 4th February: 2-hour break in tape access to test a new server that provides the interface to the tape library. This server will be required to support T10000D tape drives.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Castor 2.1.14 testing is ongoing. A date for deployment awaits successful completion of this testing.
  • Networking:
    • Implementation of new site firewall. Date for Tier1 proposed to be 11th March. (Initial changes for links that do not affect the Tier1 commenced this week.)
    • Update core Tier1 network and change connection to site and OPN including:
      • Install new Routing layer for Tier1 & change the way the Tier1 connects to the RAL network. (Required before firewall changes on 11th March).
      • These changes will lead to the removal of the UKLight Router.
  • Fabric:
    • Firmware updates on remaining EMC disk arrays (Castor, FTS/LFC)
    • There will be circuit testing of the remaining (i.e. non-UPS) circuits in the machine room during 2014.
Entries in GOC DB starting between the 22nd and 29th January 2014.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts3.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 27/01/2014 10:00 | 27/01/2014 12:00 | 2 hours | Upgrade of FTS3 gridsite and openssl. Will remove existing proxies on the server as part of upgrade.
lcglb03.gridpp.rl.ac.uk, lcglb04.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 18/12/2013 11:00 | 31/01/2014 00:00 | 43 days, 13 hours | old EMI-2 hosts to be retired
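The Duration values in the GOC DB entries above follow directly from the Start and End times. As a cross-check, a minimal Python sketch is given here; the function name `outage_duration` and the output wording are illustrative assumptions and are not part of any GOC DB tooling.

```python
from datetime import datetime

# Day/month/year timestamp layout as used in the GOC DB listing above.
FMT = "%d/%m/%Y %H:%M"


def outage_duration(start: str, end: str) -> str:
    """Return the time between two GOC DB style timestamps as 'D days, H hours'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    days = delta.days
    hours = delta.seconds // 3600  # whole hours left over after full days
    parts = []
    if days:
        parts.append(f"{days} days")
    if hours or not parts:
        parts.append(f"{hours} hours")
    return ", ".join(parts)


print(outage_duration("27/01/2014 10:00", "27/01/2014 12:00"))  # 2 hours
print(outage_duration("18/12/2013 11:00", "31/01/2014 00:00"))  # 43 days, 13 hours
```

Both results agree with the Duration column. The sketch ignores minutes and singular forms ("1 day"), which do not occur in this week's entries.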
Open GGUS Tickets (Snapshot during morning of meeting)
GGUS ID Level Urgency State Creation Last Update VO Subject
100558 Green Less Urgent Waiting Reply 2014-01-27 2014-01-28 SNO+ Moving data
100507 Yellow Less Urgent Waiting Reply 2014-01-23 2014-01-27 CMS [sr #141722] Transfers from Caltech to RAL are failing
100343 Red Less Urgent In Progress 2014-01-16 2014-01-27 RAL WMS still generating 512 proxies
100114 Red Less Urgent In Progress 2014-01-08 2014-01-10 Jobs failing to get from RAL WMS to Imperial
99556 Red Very Urgent In Progress 2013-12-06 2014-01-21 NGI Argus requests for NGI_UK
98249 Red Urgent On Hold 2013-10-21 2014-01-14 SNO+ please configure cvmfs stratum-0 for SNO+ at RAL T1
97025 Red Less Urgent On Hold 2013-09-03 2014-01-06 Myproxy server certificate does not contain hostname
86152 Red Less Urgent On Hold 2012-09-17 2013-10-18 correlated packet-loss on perfsonar host
Availability Report
Day | OPS | Alice | Atlas | CMS | LHCb | Comment
22/01/14 | 100 | 100 | 100 | 100 | 100 |
23/01/14 | 100 | 100 | 100 | 100 | 100 |
24/01/14 | 100 | 100 | 100 | 100 | 100 |
25/01/14 | 100 | 100 | 100 | 100 | 100 |
26/01/14 | 100 | 100 | 100 | 100 | 100 |
27/01/14 | 100 | 100 | 99.1 | 100 | 100 | Single SRM PUT failure: "could not open connection to srm-atlas.gridpp.rl.ac.uk". Coincident with database processes being put back on the correct nodes in the Oracle RAC.
28/01/14 | 100 | 100 | 100 | 72.0 | 100 | CMS batch jobs not being run. Problem with the relevant CMS user being given too low a priority on the batch system.
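
For a rough week-level summary of the daily figures above, a simple unweighted mean per VO can be computed as sketched below. This is an assumption made for illustration only: it is not the official availability calculation, and the names `daily` and `weekly_mean` are hypothetical.

```python
# Daily availability (%) for 22/01/14 to 28/01/14, copied from the table above.
daily = {
    "OPS":   [100, 100, 100, 100, 100, 100, 100],
    "Alice": [100, 100, 100, 100, 100, 100, 100],
    "Atlas": [100, 100, 100, 100, 100, 99.1, 100],
    "CMS":   [100, 100, 100, 100, 100, 100, 72.0],
    "LHCb":  [100, 100, 100, 100, 100, 100, 100],
}


def weekly_mean(values):
    """Unweighted mean of the daily percentages, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


for vo, values in daily.items():
    print(f"{vo}: {weekly_mean(values)}")
```

On this basis the single SRM failure on 27/01/14 has little effect on the Atlas figure for the week, while the CMS batch priority problem on 28/01/14 lowers the CMS weekly mean to 96.0.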