RAL Tier1 weekly operations castor 16/10/2015

From GridPP Wiki

Operations News

  • The 11.2.0.4 DB upgrade (moving the Neptune nodes back to R89 and Pluto to R26) has been completed successfully.
  • 5 new nodes have been added to atlasTape.
  • Attempted fix for LHCb's intermittent stage request failures - increase 'Stage Request' thread count from 4 to 12.

Operations Problems

  • SRM DB duplicates continue to appear. RA is developing an automatic script to clean up the usual case of these.
  • LHCb ticket regarding a low level of failures: these are likely to be failures in nsmkdir. Look for timeouts in the name server logs to establish where the timeout occurs. Mitigation may be to increase the thread count on the NS. Be careful with this one: a large increase could exceed the connection limit to the DB, as the number of NS daemons is quite high.
  • Putdone without a put: the Stager has 4 threads for this type of request. Shaun suggests (with CERN agreement) increasing this to 12.
  • The high load seen on atlasTape has eased, unfortunately shortly before we added 5 new nodes to the instance.
  • The checksum issue and its tickets are still present. These are thought to be due to a CASTOR bug fixed in 2.1.15.
  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
  • gdss707 fell over again. Will be drained and have a CPU change.
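
The caution above about raising the NS thread count can be checked with simple arithmetic before any change is made. The following is a minimal sketch only, assuming the worst case in which every thread of every NS daemon holds one DB connection. The function names and all the numbers in the usage note are hypothetical and are not taken from the RAL configuration.

```python
def total_db_connections(num_daemons: int, threads_per_daemon: int) -> int:
    """Worst-case connection count, assuming one DB connection per thread."""
    return num_daemons * threads_per_daemon


def max_safe_threads(num_daemons: int, connection_limit: int, reserved: int = 0) -> int:
    """Largest per-daemon thread count that stays within the DB connection limit.

    `reserved` is the number of connections kept back for other clients
    of the same database (an assumption of this sketch).
    """
    if num_daemons <= 0:
        raise ValueError("num_daemons must be positive")
    usable = connection_limit - reserved
    # Integer division gives the per-daemon share; never return a negative count.
    return max(usable // num_daemons, 0)
```

For example, with a hypothetical 10 NS daemons, a DB limit of 300 connections and 50 connections reserved for other clients, `max_safe_threads(10, 300, 50)` gives the per-daemon ceiling to compare against any proposed thread count.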

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Tape-backed nodes to SL6 - scheduled for Monday/Tuesday.
  • CASTOR 2.1.15
    • Currently scheduled for late this year/early next.
  • Upgrade of Oracle clients to 11.2.0.4

Advanced Planning

Tasks

  • CASTOR Face-to-face 2nd-3rd November.

Interventions

Staffing

  • CASTOR on-call person next week
    • SdW
  • Staff absence/out of the office:
    • BD Monday, Thursday, Friday

New Actions

Existing Actions

  • SdW to revisit schema changes to improve SRM DB dup position W/C 12th Oct.
  • Production team to check srm03 regarding lack of CRL updates
  • RA to schedule no-op change to be included in quattor / persistent (CMSdisk)
  • BC to document the procedure for HW return to CASTOR
  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • BD to chase AD about using the space reporting tool we made for him
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII

Completed actions