RAL Tier1 weekly operations castor 06/10/2014


Operations News

  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Disk server redeployments continue (e.g. D1T0 servers reused as D0T1) ... 5 servers for LHCb still to go, see below
  • SL6 headnode work is progressing well - hoping for rollout in November
  • xrootd security advisory concerning the FAX component shipped within xrootd
  • Useful breakout sessions at the CASTOR face-to-face - deadlock analysis and bugs confirmed, plus discussions on simplifying headnode configurations and removing a single point of failure
  • Oracle PSU DB patch for the Neptune standby (Neptr26) and Neptune primary (Neptr89) completed (not the cause of the recent issues)

Operations Problems

  • gdss720 - still issues around this server. There is currently no evidence that the physical server is misbehaving, but draining has been largely unsuccessful (slow) and ATLAS were suffering many failed transfers involving gdss720. The server is currently read-only and will be revisited once the current CASTOR issues are resolved.
  • gdss707 (atlasStripInput) was taken out of production over the weekend - memory tests completed and the RAID controller firmware has been upgraded
  • gdss763 (also atlasStripInput) - RAID card replaced, back in production.
  • Still having difficulty deploying 5 servers into LHCb production because of a Mellanox network/Quattor issue - Fabric team to fix
  • ATLAS load causing high wait I/O on several servers - partly understood, and we are working towards some config changes (see the monitoring sketch after this list)
  • DB crash on the Neptune standby last Wednesday (1st) after an Oracle patch had previously been applied - not production affecting. Oracle report it is a known issue but do not currently have a fix for our version.
  • Current CASTOR problems with the cache on an EMC disk array - ATLAS were worst affected. Our position was improved by moving the ATLAS/GEN stager/SRM to standby on Monday 6th Oct
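As a rough illustration of the "high wait I/O" metric mentioned above, the sketch below (plain Python; the interval and threshold are hypothetical and not taken from our real monitoring) samples /proc/stat twice and reports the fraction of CPU time spent waiting on I/O:

 #!/usr/bin/env python
 # Minimal sketch: estimate the iowait fraction on a disk server by sampling
 # the aggregate "cpu" line of /proc/stat twice. Interval and threshold are
 # illustrative values only.
 import time

 def cpu_times():
     # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
     with open("/proc/stat") as f:
         return [int(x) for x in f.readline().split()[1:]]

 def iowait_fraction(interval=1.0):
     before = cpu_times()
     time.sleep(interval)
     after = cpu_times()
     deltas = [a - b for a, b in zip(after, before)]
     total = sum(deltas)
     return deltas[4] / float(total) if total else 0.0  # field 4 is iowait

 if __name__ == "__main__":
     frac = iowait_fraction()
     print("iowait over the last second: %.1f%%" % (100 * frac))
     if frac > 0.30:  # arbitrary example threshold
         print("high wait I/O - worth a closer look at this server")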

Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Juan to patch the CASTOR DBs (PSU patches for Pluto and Juno) - standard change ... note this is currently delayed due to the EMC cache issue
  • 2.1.14-14 stress testing in PreProd along with new errata
  • A Tier 1 database cleanup is planned to remove a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.


Advanced Planning

Tasks

  • Possible future upgrade to CASTOR 2.1.14-15 after Christmas
  • Switch admin machines: lcgccvm02 to lcgcadm05
  • A new VM configured to run against the standby CASTOR database will be created as a front-end for dark data and similar queries (a comparison sketch follows this list)
  • Replace DLF with Elasticsearch
  • Correct the partition alignment issue (3rd CASTOR partition) on the new CASTOR disk servers (an alignment-check sketch also follows this list)
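On the dark-data front-end above: the core of such queries is comparing the set of files the database expects on a disk server with the set actually present on its filesystems. The sketch below is a minimal, hypothetical illustration of that comparison - the input dump files (db_files.txt, disk_files.txt) are invented names and do not reflect the real tooling or schema:

 #!/usr/bin/env python
 # Minimal sketch of a dark-data comparison for one disk server.
 # Inputs are hypothetical dumps with one file path per line:
 #   db_files.txt   - paths the database expects to find on this server
 #   disk_files.txt - paths actually found on the server's filesystems

 def read_paths(fname):
     with open(fname) as f:
         return set(line.strip() for line in f if line.strip())

 db_files = read_paths("db_files.txt")
 disk_files = read_paths("disk_files.txt")

 dark_data = disk_files - db_files    # on disk but unknown to the database
 missing = db_files - disk_files      # expected by the database but not on disk

 print("dark data candidates: %d" % len(dark_data))
 print("missing from disk:    %d" % len(missing))
 for path in sorted(dark_data):
     print("DARK    " + path)
 for path in sorted(missing):
     print("MISSING " + path)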
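On the partition alignment task: a quick way to see whether a partition starts on a 1 MiB boundary (2048 x 512-byte sectors) is to read its start sector from sysfs. The sketch below is illustrative only - the device and partition names are examples, not the actual procedure we will use on the disk servers:

 #!/usr/bin/env python
 # Minimal sketch: report whether a partition's start sector is 1 MiB aligned
 # (i.e. divisible by 2048 x 512-byte sectors). Device names are examples only.
 import sys

 def start_sector(disk, part):
     # e.g. /sys/block/sda/sda3/start holds the partition's starting sector
     with open("/sys/block/%s/%s/start" % (disk, part)) as f:
         return int(f.read().strip())

 if __name__ == "__main__":
     disk = sys.argv[1] if len(sys.argv) > 1 else "sda"
     part = sys.argv[2] if len(sys.argv) > 2 else disk + "3"
     start = start_sector(disk, part)
     state = "aligned" if start % 2048 == 0 else "MISALIGNED"
     print("%s starts at sector %d: %s" % (part, start, state))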


Interventions

  • Moved the ATLAS/GEN stager/SRM to the standby database on Monday 6th Oct to mitigate the current CASTOR problems with the cache on an EMC disk array (ATLAS were worst affected)

Staffing

  • CASTOR on-call person
    • Matt


  • Staff absence/out of the office:
    • Rob - Out all week