RAL Tier1 weekly operations castor 20/07/2015

From GridPP Wiki

List of CASTOR meetings

Operations News

  • Proposed CASTOR face to face W/C Oct 5th or 12th

Operations Problems

  • CMS still upset. We have asked them to define exactly why their jobs are slow.
  • Brian and Shaun investigating double putstart problem
  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating; it is needed for lhcb and vo.dirac.ac.uk


Blocking Issues

  • GridFTP bug in SL6: stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Stress test SRM, possibly deploy the week after (Shaun)


Advanced Planning

Tasks

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • Discussed CASTOR 2017 planning, see wiki page.


Interventions


Staffing

  • CASTOR on-call person next week
    • Rob
  • Staff absence/out of the office:
    • Rob out Monday afternoon
    • Chris out Wed morning


Actions

  • Shaun to modify cleanlostfiles to log to syslog so we can track its use
  • Shaun to look into GC improvements - notify if file in inconsistent state
  • Shaun to replace canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
  • Rob/Jens to look at information provider re DiRAC (reporting disk only etc)
  • All to book meeting with Rob re draining / disk deployment / decommissioning ...
  • Rob to look into procedural issues with CMS disk server interventions
  • Bruno to document processes to control services previously controlled by puppet
  • Gareth to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • Rob to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.
  • Rob to get jobs thought to cause CMS pileup
  • Bruno to put SL6 on preprod disk
  • Bruno / Rob to write change control doc for SL6 disk
  • Shaun testing/working on gfalcopy rpms
  • Someone to find out which access protocol MICE uses
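
The thresholds in the canbemigr replacement action (warn at 8 hours, alert at 24 hours) could be sketched roughly as below. This is only an illustration: the function name `migration_status` and the idea of passing in how long a file has been waiting for migration are assumptions, not part of the existing test or of any CASTOR interface.

```python
from datetime import timedelta

# Thresholds taken from the action item: warn at 8 hours, alert at 24 hours.
WARN_AFTER = timedelta(hours=8)
ALERT_AFTER = timedelta(hours=24)


def migration_status(unmigrated_age: timedelta) -> str:
    """Classify a file that has not yet been migrated to tape.

    Hypothetical helper: `unmigrated_age` is how long the file has been
    waiting for migration, however the real test ends up measuring that.
    """
    if unmigrated_age >= ALERT_AFTER:
        return "alert"
    if unmigrated_age >= WARN_AFTER:
        return "warn"
    return "ok"
```

A monitoring check would call this for the oldest unmigrated file and raise the corresponding state; how that age is obtained is left open here.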

Completed actions

  • Rob/Gareth to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
  • Rob/Alastair to clarify what we are doing with 'broken' disk servers
  • Gareth to ensure that there is a ping test etc to the atlas building