RAL Tier1 weekly operations castor 13/05/2016

From GridPP Wiki

Operations News

  • New MICE user set up


Operations Problems

  • Aircon issue - impact reduced by stopping the batch farm. A question was raised about how batch is turned back on, given concerns about swamping CASTOR.
  • tape library issues
  • gfal investigations - awaiting membership of dteam etc for George P
  • draining - George has training on manual draining technique (atlas)


  • gfalcat does not work with CASTOR; the underlying issue is fixed for gfalcopy but not for gfalcat (gfal developers responsible) - tracking
  • AtlasScratch - ATLAS users are still having problems accessing atlasScratch files - investigations ongoing
  • GDSS771 crashed - now in draining
  • Draining is not working for ATLAS (it does, however, seem to work for LHCb). Brian has changed parameters as recommended by Shaun, with no improvement. The manual method of draining still works: diskServerLs and stager_get (to move a file to another disk server)


Planned, Scheduled and Cancelled Interventions

  • host certs on many disk servers should be updated (gridftp relies on this)
  • CASTOR 2.1.15 - issue with writes
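
A minimal sketch of how the disk servers needing new host certs might be identified. It is illustrative only and not an existing Tier1 tool. It assumes the expiry line has already been obtained on each server with `openssl x509 -noout -enddate -in <cert file>`, which prints a line of the form `notAfter=...`. The function names and the 30-day threshold are hypothetical.

```python
import ssl

SECONDS_PER_DAY = 86400.0


def days_until_expiry(enddate_line, now_epoch):
    """Return days left before expiry, given an `openssl x509 -enddate` line.

    enddate_line is expected to look like 'notAfter=May 13 12:00:00 2017 GMT'.
    now_epoch is the current time in seconds since the Epoch (e.g. time.time()).
    """
    prefix = "notAfter="
    if not enddate_line.startswith(prefix):
        raise ValueError("unexpected enddate line: %r" % enddate_line)
    # ssl.cert_time_to_seconds parses the certificate time format into epoch seconds.
    not_after = ssl.cert_time_to_seconds(enddate_line[len(prefix):].strip())
    return (not_after - now_epoch) / SECONDS_PER_DAY


def needs_renewal(enddate_line, now_epoch, threshold_days=30):
    """True if the certificate expires within threshold_days (hypothetical policy)."""
    return days_until_expiry(enddate_line, now_epoch) < threshold_days
```

In practice `now_epoch` would be `time.time()`, and the result could feed a list of servers to update before gridftp transfers start failing.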


Long-term projects

  • SL7 CASTOR - disk servers are higher priority (frontend to CEPH); this will be Aquilon based


  • RA has produced a python script to handle SRM db duplication issue which is causing callouts. This script has been tested and will now be put into production, as a cron job. This should be a temporary fix, so a bug report should be made to the FTS development team, via AL.
  • JJ – Glue 2 for CASTOR, used for publishing information. RA is writing the data-getting end in Python, JJ is writing the Glue 2 end in LISP. No schedule as yet.
  • WAN tuning
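
RA's script for the SRM db duplication issue is not reproduced in these minutes. Purely as an illustration of the general idea, the sketch below groups rows by a key and lists the ids of the extra copies. The row layout, the field names (`id`, `name`) and the "keep the lowest id" rule are all assumptions and not the actual SRM schema or the logic of the production script.

```python
from collections import defaultdict


def find_duplicates(rows, key_fields):
    """Group rows (dicts) by the given key fields; return only groups with >1 row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


def rows_to_remove(rows, key_fields, id_field="id"):
    """Return the sorted ids of the surplus rows, keeping the lowest id per group.

    Keeping the lowest id is an assumed policy chosen for illustration.
    """
    doomed = []
    for group in find_duplicates(rows, key_fields).values():
        keep = min(group, key=lambda r: r[id_field])
        doomed.extend(r[id_field] for r in group if r is not keep)
    return sorted(doomed)
```

A cron-driven version would most likely report the ids first and delete only after review, which fits the code-review action recorded further down.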

Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing
  • Deployment of SRM 2.14

Staffing

  • Chris out Friday and following Monday
  • Castor on Call person next week: RA for next 2 weeks

New Actions

Existing Actions

  • GS ask Kashif re RAID firmware updates on d0t1 v2011 machines and if there are other batches of machines that should be upgraded
  • GP to work with BD to take over WAN tuning work developed by BD (aquilon / SCDB)
  • GP to create WAN tuning wiki page
  • RA/AS new tool for monitoring srm db dups - the user type
  • RA to get someone to code review his SRM_DB_DUPLICATES blatting script
  • GS - is there any documentation on handling broken CIPs? (raised following CIP failure at the weekend)
  • GS Callout for CIP only in waking hours?
  • RA ensure quattorising atlas consistency check - Rob to talk to Andrew L
  • RA to try stopping tapeserverd mid-migration to see if it breaks - ask Tim.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII - progress
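
For the cleanlostfiles action above, a minimal sketch of what logging each use to syslog could look like. This is not the actual cleanlostfiles code. It assumes a Python tool, the local syslog socket at `/dev/log` (usual on Linux), and hypothetical function names and message format.

```python
import logging
import logging.handlers


def build_audit_logger(handler=None, name="cleanlostfiles"):
    """Return a logger that sends one line per invocation to syslog.

    If no handler is given, a SysLogHandler on /dev/log is used
    (assumption: a local syslog daemon is listening on that socket).
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if handler is None:
        handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    # Drop any handlers left from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
    logger.addHandler(handler)
    return logger


def log_invocation(logger, user, argv):
    """Record who ran the tool and with which arguments."""
    logger.info("invoked by %s with args: %s", user, " ".join(argv))
```

Passing the handler in makes it possible to test the message format without a syslog daemon. The tool itself would call `build_audit_logger()` with no arguments.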

Completed Actions

  • BD check if D drives have arrived for WLCG
  • BD report draining issues to CERN
  • BD MICE ticket - asking for a separate tape pool for d0t1 for Monte Carlo