RAL Tier1 weekly operations castor 03/08/2015

From GridPP Wiki

Operations News

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • New SRM version under testing. Works OK for a single job but hits trouble with many jobs.
  • SL6 disk server config is now testable, although it still needs a solution for CV11 nodes and hooks to add WAN tuning.

Operations Problems

  • CMS still upset. We believe that the current blocker is the read rate from disk, so we are looking at undoing the 'heart bypass' implemented for CMS and using the scheduler again.
    • AL is trying to reproduce this on preprod.
  • The large number of checksum tickets seen by the production team is thought to be due to the rebalancer. RA to test whether the rebalance un-sticking script is to blame.
  • BD and SdW investigating double putstart problem
  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - needed for lhcb and vo.dirac.ac.uk


Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Stress test SRM, possibly deploy the week after (Shaun)
  • Upgrade CASTOR disk servers to SL6
  • Downtime 4th Aug for network test


Advanced Planning

Tasks

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions


Staffing

  • CASTOR on-call person next week
    • RA
  • Staff absence/out of the office:
    • RA out Friday morning
    • CP probably back Mon
    • SdW out all week

New Actions

  • RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
    • AL to look at optimising lazy download
  • BD to set all non-target ATLAS nodes to R/Only to see if it stops partial file draining problem
  • RA to investigate why we are getting partial files
  • BD to chase AD about using the space reporting thing we made for him
  • JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.


Existing Actions

  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing
  • SdW to look into GC improvements - notify if file in inconsistent state
  • SdW to replace canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • All to book meeting with Rob re draining / disk deployment / decommissioning ...
  • RA to look into procedural issues with CMS disk server interventions
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING
  • RA to get jobs thought to cause CMS pileup
  • BC / RA to write change control doc for SL6 disk
  • SdW testing/working gfalcopy rpms
  • Someone - MICE, what access protocol do they use?
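
The warn 8h / alert 24h thresholds in the canbemigr replacement action above could be expressed roughly as follows. This is only a sketch, not the actual test: the function names, the use of file creation time as the age reference, and the choice to treat the thresholds as inclusive are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Thresholds from the action item: warn at 8 hours, alert at 24 hours.
WARN_AFTER = timedelta(hours=8)
ALERT_AFTER = timedelta(hours=24)


def classify_unmigrated(created_at, now):
    """Return 'ok', 'warn' or 'alert' for one file still waiting for tape.

    Assumption: age is measured from the file's creation time and the
    thresholds are inclusive (exactly 8h already counts as 'warn').
    """
    age = now - created_at
    if age >= ALERT_AFTER:
        return "alert"
    if age >= WARN_AFTER:
        return "warn"
    return "ok"


def worst_state(creation_times, now):
    """Return the most severe state across all unmigrated files."""
    severity = {"ok": 0, "warn": 1, "alert": 2}
    states = [classify_unmigrated(t, now) for t in creation_times]
    # An empty list means nothing is waiting for migration, so report 'ok'.
    return max(states, key=severity.__getitem__, default="ok")
```

A monitoring check would feed `worst_state` the creation times of files not yet migrated (how those are obtained from CASTOR is not covered here) and map the result to its own status levels.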

Completed actions

  • RA/GS to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
  • RA/AD to clarify what we are doing with 'broken' disk servers
  • GS to ensure that there is a ping test etc to the atlas building
  • BC to put SL6 on preprod disks.
  • RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.