RAL Tier1 weekly operations castor 20/07/2015

Latest revision as of 11:07, 24 July 2015

List of CASTOR meetings

Operations News

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • New SRM version under testing. Works OK for a single job but hits trouble with many jobs.
  • SL6 disk server config now testable, although it still needs a solution for CV11 nodes and hooks to add WAN tuning (an illustrative tuning sketch follows this list).
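
The WAN-tuning hook mentioned above does not yet exist; the following is a minimal sketch of what such a hook could apply, assuming the standard Linux TCP buffer sysctls. The keys and values are illustrative only, not the settings RAL will deploy.

  #!/usr/bin/env python
  # Illustrative WAN-tuning hook for SL6 disk servers.
  # The sysctl keys are standard Linux TCP tunables; the values are example
  # numbers only, not the configuration actually used at RAL.
  import subprocess

  WAN_TUNING = {
      "net.core.rmem_max": "33554432",
      "net.core.wmem_max": "33554432",
      "net.ipv4.tcp_rmem": "4096 87380 33554432",
      "net.ipv4.tcp_wmem": "4096 65536 33554432",
  }

  def apply_wan_tuning():
      for key, value in WAN_TUNING.items():
          # 'sysctl -w' applies the setting immediately; making it persistent
          # across reboots would be the job of the config-management hook.
          subprocess.check_call(["sysctl", "-w", "%s=%s" % (key, value)])

  if __name__ == "__main__":
      apply_wan_tuning()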

Operations Problems

  • CMS are still unhappy; we have asked them to define exactly why their jobs are slow.
  • Brian and Shaun are investigating the double putstart problem.
  • The grid-mapfile on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - this is needed for LHCb and vo.dirac.ac.uk (a staleness-check sketch follows this list).
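
Until the auto-update is fixed, a simple staleness check could flag the problem earlier. Below is a minimal sketch assuming the conventional grid-mapfile location and an arbitrary 24-hour threshold; the actual path and update interval on lcgcadm04 may differ.

  #!/usr/bin/env python
  # Staleness check for the grid-mapfile on the WebDAV host.
  # Path and threshold are assumptions for illustration only.
  import os
  import sys
  import time

  GRIDMAP = "/etc/grid-security/grid-mapfile"  # conventional location (assumed)
  MAX_AGE_HOURS = 24                           # illustrative threshold

  def main():
      age_hours = (time.time() - os.path.getmtime(GRIDMAP)) / 3600.0
      if age_hours > MAX_AGE_HOURS:
          print("CRITICAL: %s last updated %.1f hours ago" % (GRIDMAP, age_hours))
          return 2
      print("OK: %s updated %.1f hours ago" % (GRIDMAP, age_hours))
      return 0

  if __name__ == "__main__":
      sys.exit(main())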


Blocking Issues

  • GridFTP bug on SL6 - stops any Globus copy if the client is using a particular library. This is a show-stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Stress test the new SRM; possibly deploy the week after (Shaun).


Advanced Planning

Tasks

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions


Staffing

  • Castor on Call person next week
    • RA
  • Staff absence/out of the office:
    • RA out Friday morning
    • CP probably back Mon
    • SdW out all week

New Actions

  • RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
  • AL to look at optimising lazy download
  • BD to set all non-target ATLAS nodes to read-only to see if it stops the partial-file draining problem
  • RA to investigate why we are getting partial files
  • BD to chase AD about using the space-reporting tool we made for him
  • JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.


Existing Actions

  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing (a logging sketch follows this list)
  • SdW to look into GC improvements - notify if file in inconsistent state
  • SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • All to book meeting with Rob re draining / disk deployment / decommissioning ...
  • RA to look into procedural issues with CMS disk server interventions
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING (a check sketch follows this list)
  • RA to get jobs thought to cause CMS pileup
  • BC / RA to write change control doc for SL6 disk
  • SdW testing/working gfalcopy rpms
  • Someone to find out which access protocol MICE use
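
For the cleanlostfiles logging action above, one way to record each use in syslog is a thin wrapper around the existing script. This is a sketch only: the wrapper approach, the script path and the message format are assumptions, not the change currently under testing.

  #!/usr/bin/env python
  # Sketch: wrapper that logs each cleanlostfiles invocation to syslog,
  # then runs the real script unchanged.  The path to the real script and
  # the message format are illustrative assumptions.
  import getpass
  import subprocess
  import sys
  import syslog

  REAL_SCRIPT = "/usr/local/bin/cleanlostfiles.real"  # assumed location

  def main(argv):
      syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_LOCAL0)
      syslog.syslog(syslog.LOG_INFO,
                    "run by %s with args: %s" % (getpass.getuser(), " ".join(argv)))
      rc = subprocess.call([REAL_SCRIPT] + argv)
      syslog.syslog(syslog.LOG_INFO, "exited with status %d" % rc)
      return rc

  if __name__ == "__main__":
      sys.exit(main(sys.argv[1:]))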
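
For the /etc/noquattor and fetch-crl action, a Nagios-style probe could look like the sketch below. The CRL directory, the 24-hour threshold and the exit codes follow common conventions and are assumptions rather than the actual check GS is investigating.

  #!/usr/bin/env python
  # Sketch of a combined check: warn if /etc/noquattor is present on a
  # production node, or if fetch-crl output looks stale.
  # Paths and the age threshold are illustrative assumptions.
  import glob
  import os
  import sys
  import time

  NOQUATTOR = "/etc/noquattor"
  CRL_DIR = "/etc/grid-security/certificates"  # where fetch-crl places CRLs
  MAX_CRL_AGE_HOURS = 24                       # illustrative threshold

  def newest_crl_age_hours():
      crls = glob.glob(os.path.join(CRL_DIR, "*.r0"))
      if not crls:
          return None
      newest = max(os.path.getmtime(p) for p in crls)
      return (time.time() - newest) / 3600.0

  def main():
      problems = []
      if os.path.exists(NOQUATTOR):
          problems.append("%s present (node not under Quattor control)" % NOQUATTOR)
      age = newest_crl_age_hours()
      if age is None:
          problems.append("no CRLs found in %s" % CRL_DIR)
      elif age > MAX_CRL_AGE_HOURS:
          problems.append("newest CRL is %.1f hours old" % age)
      if problems:
          print("WARNING: " + "; ".join(problems))
          return 1
      print("OK: Quattor enabled and CRLs fresh")
      return 0

  if __name__ == "__main__":
      sys.exit(main())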

Completed actions

  • RA/GS to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
  • RA/AD to clarify what we are doing with 'broken' disk servers
  • GS to ensure that there is a ping test etc to the atlas building
  • BC to put SL6 on preprod disks.
  • RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.