RAL Tier1 weekly operations castor 29/02/2016

From GridPP Wiki
Revision as of 10:41, 26 February 2016 by Alison Packer 52064d6050


Operations News

  • No disk server issues this week
  • glibc updates applied and all CASTOR systems rebooted. There were initial issues with head nodes: 7 failed to reboot due to their build history. ACTION: their Quattor build needs to be revisited so that this does not recur.
  • Main CIP system failed; we have failed over to the test CIP machine. Once the hardware failure is fixed, we will fail back to the production system.
  • 11.2.0.4 DB client update had to be rescheduled and should go ahead on Monday 29th. It has been running in pre-prod for a considerable amount of time, so this should be transparent.


  • castor 2.1.15 update
    • ns upgrade on day of 29 Feb - 3 March; downtime for all VOs
    • stager upgrade for one VO week commencing 21/3/16
  • Repack updated to 2.1.14-15
  • 2.1.15 works on preprod (RAL xroot rpm build) but has not been put under stress yet
  • castor 2.1.16 coming soon - SRM integration into CASTOR code base
  • ATLAS gSoap Errors; JK (SdW advised) restarted SRM front ends
  • CMS AAA still an issue
  • LHCb upload still problematic


  • VO DiRAC people from Leicester are coming online
  • 2.1.15 had its first airing in change control; 2.1.15 is currently not working for us.
  • New tape-backed disk servers for Tier1 to replace CV11; recommendation made to Martin
  • Merging tape pools wiki created by Shaun
  • 2.1.15 name server tested
  • New SRM on vcert2
  • New SRM (SL6) with bug fixes available - needs test
  • Gfal-cat command failing for ATLAS reading of nsdumps from CASTOR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. Developers looking to fix within: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842
  • LHCb batch jobs failing to copy results into CASTOR - changes made seem to have improved the situation but not fixed it (Raja). Increasing the number of connections to the NS db (more threads)
  • BD looking at porting persistent tests to Ceph