RAL Tier1 weekly operations castor 20/07/2015

Operations News

CMS still upset. We have asked them to define exactly why their jobs are slow.
Brian and Shaun investigating double putstart problem
The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - needed for lhcb and vo.dirac.ac.uk

GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.

Tasks

Interventions

Staff absence/out of the office:
- Rob out Monday afternoon
- Chris out Wed morning

Shaun to look into GC improvements - notify if file in inconsistent state
Shaun to replace canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
Rob/Jens to look at information provider re DiRAC (reporting disk only etc)
All to book meeting with Rob re draining / disk deployment / decommissioning ...
Rob to look into procedural issues with CMS disk server interventions
Bruno to document processes to control services previously controlled by puppet
Gareth to arrange meeting castor/fab/production to discuss the decommissioning procedures
Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
Rob to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.
Rob to get jobs thought to cause CMS pileup
Bruno to put SL6 on preprod disk
Bruno / Rob to write change control doc for SL6 disk
Shaun testing/working gfalcopy rpms
Someone to find out which access protocol MICE use
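
The replacement for the canbemigr test (warn 8h / alert 24h) could be sketched roughly as below. This is only an illustration of the threshold logic: the function name, the status strings and the choice to trigger exactly at the threshold are all assumptions, and how the oldest unmigrated file is found in CASTOR is not covered.

```python
from datetime import datetime, timedelta

# Thresholds taken from the action item; everything else here is hypothetical.
WARN_AFTER = timedelta(hours=8)
ALERT_AFTER = timedelta(hours=24)


def migration_status(oldest_unmigrated, now):
    """Classify the age of the oldest file still waiting for tape migration.

    oldest_unmigrated: datetime the oldest unmigrated file was written,
                       or None if nothing is waiting.
    now:               datetime to compare against.
    """
    if oldest_unmigrated is None:
        return "OK"
    age = now - oldest_unmigrated
    # Assumption: a file exactly at a threshold already triggers that level.
    if age >= ALERT_AFTER:
        return "ALERT"
    if age >= WARN_AFTER:
        return "WARN"
    return "OK"
```

A wrapper script would map these strings onto whatever exit codes the monitoring system expects.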

Rob/Gareth to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
Rob/Alastair to clarify what we are doing with 'broken' disk servers
Gareth to ensure that there is a ping test etc to the atlas building