RAL Tier1 weekly operations castor 10/08/2015

Operations News

Operations Problems

  • CMS still unhappy. We believe that the current blocker is the read rate from disk; for this reason we are looking at undoing the 'heart bypass' implemented for CMS and using the scheduler again.
    • AL is trying to reproduce this on preprod.
  • Removal of the 'heart bypass' may have led to timeouts.
    • RA has contacted the CERN developers for a fix.
  • The large numbers of checksum tickets seen by the production team are thought to be due to the rebalancer.
    • Appears not to be caused by the rebalance un-sticking script.

  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - this is needed for LHCb and vo.dirac.ac.uk. A staleness-check sketch follows below.
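
A minimal staleness-check sketch for the gridmap problem above. It assumes the map file lives at the conventional /etc/grid-security/grid-mapfile path and that an update is expected at least every 6 hours; both the path and the threshold are assumptions, not confirmed values for lcgcadm04.

    #!/usr/bin/env python
    # Staleness check for the grid-mapfile on the WebDAV host (sketch).
    # The path and update cadence below are assumptions - adjust to the real deployment.
    import os
    import sys
    import time

    GRIDMAP = "/etc/grid-security/grid-mapfile"   # assumed location
    MAX_AGE_HOURS = 6                             # assumed update cadence

    def main():
        try:
            age_hours = (time.time() - os.path.getmtime(GRIDMAP)) / 3600.0
        except OSError as err:
            print("CRITICAL: cannot stat %s: %s" % (GRIDMAP, err))
            return 2
        if age_hours > MAX_AGE_HOURS:
            print("WARNING: %s last updated %.1f h ago (limit %d h)"
                  % (GRIDMAP, age_hours, MAX_AGE_HOURS))
            return 1
        print("OK: %s updated %.1f h ago" % (GRIDMAP, age_hours))
        return 0

    if __name__ == "__main__":
        sys.exit(main())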

Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.

Planned, Scheduled and Cancelled Interventions

  • Stress test the SRM; possible deployment the week after (Shaun).
  • Upgrade CASTOR disk servers to SL6
    • All servers now PXE boot; the CV11 servers have an issue with an incorrect sdb scheme.
  • Oracle patching schedule planned (ending 13th October).

Advanced Planning

Tasks

  • Proposed CASTOR face-to-face meeting, week commencing Oct 5th or 12th.
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions

Staffing

  • CASTOR on-call person next week
    • RA
  • Staff absence/out of the office:

New Actions

  • GS to investigate how/whether we need to declare xrootd endpoints in GOCDB/BDII

Existing Actions

  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • SdW to look into GC improvements - notify if a file is in an inconsistent state
  • SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h); see the threshold sketch after this list
  • SdW testing/working gfalcopy rpms
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • RA to look into procedural issues with CMS disk server interventions
  • RA to get jobs thought to cause CMS pileup
    • AL to look at optimising lazy download
  • RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
  • RA to investigate why we are getting partial files
  • BC / RA to write the change control document for SL6 disk servers
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange a meeting between CASTOR, Fabric and Production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • BD to set all non-target ATLAS nodes to read-only to see if it stops the partial-file draining problem
  • BD to chase AD about using the space-reporting tool we made for him
  • JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.
  • Someone to find out what access protocol MICE use
  • All to book a meeting with Rob re draining / disk deployment / decommissioning ...
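
Re the canbemigr replacement above: a minimal sketch of the warn-8h / alert-24h threshold logic only. It assumes the caller has already obtained, via whatever CASTOR name-server query is appropriate, the creation times of the files still waiting to go to tape; that query is not shown, and the function and variable names here are illustrative.

    #!/usr/bin/env python
    # Threshold logic for a canbemigr-style "not yet on tape" check (sketch).
    import time

    WARN_HOURS = 8    # warn threshold from the action item
    ALERT_HOURS = 24  # alert threshold from the action item

    def classify(unmigrated_ctimes, now=None):
        """Return (exit_code, message) for a list of creation times (epoch seconds)."""
        now = time.time() if now is None else now
        if not unmigrated_ctimes:
            return 0, "OK: no files waiting for tape migration"
        oldest_hours = max((now - t) / 3600.0 for t in unmigrated_ctimes)
        if oldest_hours >= ALERT_HOURS:
            return 2, "ALERT: oldest un-migrated file is %.1f h old" % oldest_hours
        if oldest_hours >= WARN_HOURS:
            return 1, "WARN: oldest un-migrated file is %.1f h old" % oldest_hours
        return 0, "OK: oldest un-migrated file is %.1f h old" % oldest_hours

    if __name__ == "__main__":
        # Example with two hypothetical files created 2 h and 30 h ago.
        now = time.time()
        code, message = classify([now - 2 * 3600, now - 30 * 3600], now)
        print(message)  # -> ALERT: oldest un-migrated file is 30.0 h old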

Completed actions

  • RA/GS to write some new docs to cover on-call procedures for CMS with the introduction of unscheduled xroot reads
  • RA/AD to clarify what we are doing with 'broken' disk servers
  • GS to ensure that there is a ping test etc to the atlas building
  • BC to put SL6 on preprod disks.
  • RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.