RAL Tier1 weekly operations castor 20/07/2015
From GridPP Wiki
Revision as of 10:59, 17 July 2015 by Rob Appleyard
Operations News
- Proposed CASTOR face to face W/C Oct 5th or 12th
Operations Problems
- CMS still upset. We have asked them to define exactly why their jobs are slow.
- Brian and Shaun investigating double putstart problem
- The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - needed for lhcb and vo.dirac.ac.uk
Blocking Issues
- GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.
Planned, Scheduled and Cancelled Interventions
- Stress test SRM, possibly deploy the week after (Shaun)
Advanced Planning
Tasks
- Proposed CASTOR face to face W/C Oct 5th or 12th
- Discussed CASTOR 2017 planning, see wiki page.
Interventions
Staffing
- Castor on Call person next week
- Rob
- Staff absence/out of the office:
- Rob out Monday afternoon
- Chris out Wed morning
Actions
- Shaun to look into GC improvements - notify if file in inconsistent state
- Shaun to replace canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
- Rob/Jens to look at information provider re DiRAC (reporting disk only etc)
- All to book meeting with Rob re draining / disk deployment / decommissioning ...
- Rob to look into procedural issues with CMS disk server interventions
- Bruno to document processes to control services previously controlled by puppet
- Gareth to arrange meeting castor/fab/production to discuss the decommissioning procedures
- Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
- Rob to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.
- Rob to get jobs thought to cause CMS pileup
- Bruno to put SL6 on preprod disk
- Bruno / Rob to write change control doc for SL6 disk
- Shaun testing/working gfalcopy rpms
- Someone (owner to be assigned) - find out what access protocol MICE use
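Note on the canbemigr replacement action above: the warn 8h / alert 24h thresholds could be applied along the lines of the sketch below. This is only an illustration, not the planned test. The function names, the use of a file's write time as the reference point, and the choice to treat the thresholds as inclusive are all assumptions; how the list of unmigrated files is obtained from CASTOR is not covered.

```python
from datetime import datetime, timedelta

# Thresholds taken from the action item (warn 8h / alert 24h).
WARN_AFTER = timedelta(hours=8)
ALERT_AFTER = timedelta(hours=24)

# Severity ordering used to pick the worst state across files.
SEVERITY = {"OK": 0, "WARN": 1, "ALERT": 2}


def migration_status(written_at, now):
    """Classify one unmigrated file by how long it has been waiting.

    Assumption: the age is measured from the time the file was written,
    and the thresholds are inclusive.
    """
    age = now - written_at
    if age >= ALERT_AFTER:
        return "ALERT"
    if age >= WARN_AFTER:
        return "WARN"
    return "OK"


def worst_status(unmigrated_write_times, now):
    """Return the most severe status over all unmigrated files.

    An empty list means nothing is waiting for tape, so the result is OK.
    """
    worst = "OK"
    for written_at in unmigrated_write_times:
        status = migration_status(written_at, now)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst
```

A monitoring check would call worst_status with the write times of files still waiting for tape and map the result to its own warning and critical states.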
Completed actions
- Rob/Gareth to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
- Rob/Alastair to clarify what we are doing with 'broken' disk servers
- Gareth to ensure that there is a ping test etc to the atlas building