RAL Tier1 weekly operations castor 16/11/2015

Operations News

  • SRMdb dups script created
  • Errata applied on production and all other instances
  • 11.2.0.4 clients rolled out on preprod
  • RA, SdW, GTF and AS have been to CERN for a CASTOR face-to-face meeting
  • Attempted fix for LHCb's intermittent stage request failures: increased the 'Stage Request' thread count from 4 to 12.
  • BD looking at porting persistent tests to Ceph

Operations Problems

  • Disk server name lookup issue (CV11s): more a system issue than a CASTOR one. CV11 upgrades are on hold until it is understood.
  • NS dump failed because backup times changed - more frequent (Brian)
  • RA has developed an automatic script to clean up the usual case of these.
  • stager_qry has been running slowly for ATLAS
  • The failed CV11 nodes from atlasTape have been fixed and returned to prod. We are looking into our procedures to see if acceptance testing should be improved.
  • LHCb ticket re. low level of failures: likely to be a failure in nsmkdir. Look for timeouts in the name server logs to establish where the timeout occurs. Mitigation may be to increase the thread count on the NS. Be a little careful with this one: vastly increasing the thread count could exceed the connection limit to the DB (the number of NS daemons is quite high).
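
The caution about the NS thread count can be made concrete with a small calculation. The sketch below is illustrative only. It assumes the worst case, in which every thread on every NS daemon holds its own DB connection. The function names and all figures are hypothetical and not taken from the production configuration.

```python
def total_db_connections(ns_daemons: int, threads_per_daemon: int) -> int:
    """Worst case: every thread on every NS daemon holds one DB connection (assumption)."""
    return ns_daemons * threads_per_daemon


def max_safe_threads(ns_daemons: int, db_connection_limit: int, reserved: int = 0) -> int:
    """Largest per-daemon thread count that stays within the DB connection limit.

    `reserved` is the number of connections kept back for other DB clients.
    """
    return (db_connection_limit - reserved) // ns_daemons


# Hypothetical figures, for illustration only.
daemons = 10
limit = 300
for threads in (20, 40):
    total = total_db_connections(daemons, threads)
    status = "within limit" if total <= limit else "exceeds limit"
    print(f"{threads} threads x {daemons} daemons = {total} connections ({status})")

print("max threads per daemon:", max_safe_threads(daemons, limit, reserved=50))
```

Because the total scales with the number of daemons, even a modest per-daemon increase can use up a large part of the DB connection budget.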

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • 2.1.15 Tape - completion of upgrade next week (Tim)
  • Network change to reconnect to the Atlas building: Wed 09:30, with a short break to the standby DBs
  • Network change affecting the CASTOR headnodes: still to be scheduled (GS)


  • Tape backed nodes to SL6 - scheduled for Monday/Tuesday.
  • CASTOR 2.1.15
    • Initial installation on vcert underway. Currently scheduled for early next year.
  • Upgrade of Oracle clients to 11.2.0.4

Things from the CASTOR F2F

  • CASTOR can run on an arbitrary Ceph underlayer.
  • No 2.1.16 is planned any time soon.
  • SL7 test build ready soon

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • CASTOR on-call person next week
    • RA
  • Staff absence/out of the office:
    • SdW Wed
    • Chris all week

New Actions

  • RA to produce a list of requirements for tape drive dedication in future CASTOR versions, to discuss with CERN
  • RA/BD to review unowned RT tickets


Existing Actions

  • SdW to revisit schema changes to improve SRM DB dup position W/C 12th Oct.
  • Production team to check srm03 regarding lack of CRL updates
  • RA to schedule no-op change to be included in quattor / persistent (CMSdisk)
  • BC to document the procedure for HW return to CASTOR
  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • BD to chase AD about using the space reporting thing we made for him
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII
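
As a possible starting point for the /etc/noquattor and fetch-crl checks in the action above, a node-level check might look like the sketch below. It is not an agreed implementation. It assumes that the presence of /etc/noquattor is the condition to flag, that CRLs are stored as `*.r0` files under /etc/grid-security/certificates, and that a CRL not updated for 24 hours counts as stale. The function names and the threshold are hypothetical.

```python
import glob
import os
import time


def noquattor_present(path="/etc/noquattor"):
    """Return True if the noquattor flag file exists on this node."""
    return os.path.exists(path)


def stale_crls(cert_dir="/etc/grid-security/certificates", max_age_hours=24.0, now=None):
    """Return the CRL files (*.r0) whose modification time is older than max_age_hours.

    The directory, file pattern and age threshold are assumptions for illustration.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p for p in glob.glob(os.path.join(cert_dir, "*.r0"))
        if os.path.getmtime(p) < cutoff
    )


if __name__ == "__main__":
    if noquattor_present():
        print("WARNING: /etc/noquattor is present")
    old = stale_crls()
    if old:
        print(f"WARNING: {len(old)} CRL file(s) not updated in the last 24 hours")
```

Both functions take their paths as parameters, so the same checks could be exercised against a test directory before being wired into monitoring.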

Completed actions