RAL Tier1 weekly operations castor 20/07/2015

Latest revision as of 11:07, 24 July 2015

List of CASTOR meetings

Operations News

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • New SRM version under testing. Works OK for a single job but hits trouble with many jobs.
  • SL6 disk server config now testable, although it still needs a solution for CV11 nodes and hooks to add WAN tuning (an illustrative tuning sketch follows this list).
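
The WAN-tuning hook mentioned above does not yet exist; the following is a minimal sketch of what such a hook could apply, assuming the standard Linux TCP buffer sysctls. The keys and values are illustrative only, not the settings RAL will deploy.

  #!/usr/bin/env python
  # Illustrative WAN-tuning hook for SL6 disk servers.
  # The sysctl keys are standard Linux TCP tunables; the values are example
  # numbers only, not the configuration actually used at RAL.
  import subprocess

  WAN_TUNING = {
      "net.core.rmem_max": "33554432",
      "net.core.wmem_max": "33554432",
      "net.ipv4.tcp_rmem": "4096 87380 33554432",
      "net.ipv4.tcp_wmem": "4096 65536 33554432",
  }

  def apply_wan_tuning():
      for key, value in WAN_TUNING.items():
          # 'sysctl -w' applies the setting immediately; making it persistent
          # across reboots would be the job of the config-management hook.
          subprocess.check_call(["sysctl", "-w", "%s=%s" % (key, value)])

  if __name__ == "__main__":
      apply_wan_tuning()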

Operations Problems

  • CMS are still unhappy; we have asked them to define exactly why their jobs are slow.
  • Brian and Shaun are investigating the double putstart problem.
  • The grid-mapfile on the WebDAV host lcgcadm04.gridpp.rl.ac.uk is not auto-updating - this is needed for LHCb and vo.dirac.ac.uk (a staleness-check sketch follows this list).
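
Until the auto-update is fixed, a simple staleness check could flag the problem earlier. Below is a minimal sketch assuming the conventional grid-mapfile location and an arbitrary 24-hour threshold; the actual path and update interval on lcgcadm04 may differ.

  #!/usr/bin/env python
  # Staleness check for the grid-mapfile on the WebDAV host.
  # Path and threshold are assumptions for illustration only.
  import os
  import sys
  import time

  GRIDMAP = "/etc/grid-security/grid-mapfile"  # conventional location (assumed)
  MAX_AGE_HOURS = 24                           # illustrative threshold

  def main():
      age_hours = (time.time() - os.path.getmtime(GRIDMAP)) / 3600.0
      if age_hours > MAX_AGE_HOURS:
          print("CRITICAL: %s last updated %.1f hours ago" % (GRIDMAP, age_hours))
          return 2
      print("OK: %s updated %.1f hours ago" % (GRIDMAP, age_hours))
      return 0

  if __name__ == "__main__":
      sys.exit(main())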


Blocking Issues

  • GridFTP bug on SL6 - stops any Globus copy if the client is using a particular library. This is a show-stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Stress test the new SRM; possibly deploy the week after (Shaun).


Advanced Planning

Tasks

  • Proposed CASTOR face to face W/C Oct 5th or 12th
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions


Staffing

  • Castor on Call person next week
    • RA
  • Staff absence/out of the office:
    • RA out Friday morning
    • CP probably back Mon
    • SdW out all week

New Actions

  • RA to undo CMS's 'heart bypass' (unscheduled reads) to see if read rates are improved
  • AL to look at optimising lazy download
  • BD to set all non-target ATLAS nodes to read-only to see if it stops the partial-file draining problem
  • RA to investigate why we are getting partial files
  • BD to chase AD about using the space-reporting tool we made for him
  • JS, RA and GS to propose dates for Oracle 11.2.0.4 patching.


Existing Actions

  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing (a logging sketch follows this list)
  • SdW to look into GC improvements - notify if file in inconsistent state
  • SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h)
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • All to book meeting with Rob re draining / disk deployment / decommissioning ...
  • RA to look into procedural issues with CMS disk server interventions
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING (a check sketch follows this list)
  • RA to get jobs thought to cause CMS pileup
  • BC / RA to write change control doc for SL6 disk
  • SdW testing/working gfalcopy rpms
  • Someone to find out which access protocol MICE use
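
For the cleanlostfiles logging action above, one way to record each use in syslog is a thin wrapper around the existing script. This is a sketch only: the wrapper approach, the script path and the message format are assumptions, not the change currently under testing.

  #!/usr/bin/env python
  # Sketch: wrapper that logs each cleanlostfiles invocation to syslog,
  # then runs the real script unchanged.  The path to the real script and
  # the message format are illustrative assumptions.
  import getpass
  import subprocess
  import sys
  import syslog

  REAL_SCRIPT = "/usr/local/bin/cleanlostfiles.real"  # assumed location

  def main(argv):
      syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_LOCAL0)
      syslog.syslog(syslog.LOG_INFO,
                    "run by %s with args: %s" % (getpass.getuser(), " ".join(argv)))
      rc = subprocess.call([REAL_SCRIPT] + argv)
      syslog.syslog(syslog.LOG_INFO, "exited with status %d" % rc)
      return rc

  if __name__ == "__main__":
      sys.exit(main(sys.argv[1:]))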
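
For the /etc/noquattor and fetch-crl action, a Nagios-style probe could look like the sketch below. The CRL directory, the 24-hour threshold and the exit codes follow common conventions and are assumptions rather than the actual check GS is investigating.

  #!/usr/bin/env python
  # Sketch of a combined check: warn if /etc/noquattor is present on a
  # production node, or if fetch-crl output looks stale.
  # Paths and the age threshold are illustrative assumptions.
  import glob
  import os
  import sys
  import time

  NOQUATTOR = "/etc/noquattor"
  CRL_DIR = "/etc/grid-security/certificates"  # where fetch-crl places CRLs
  MAX_CRL_AGE_HOURS = 24                       # illustrative threshold

  def newest_crl_age_hours():
      crls = glob.glob(os.path.join(CRL_DIR, "*.r0"))
      if not crls:
          return None
      newest = max(os.path.getmtime(p) for p in crls)
      return (time.time() - newest) / 3600.0

  def main():
      problems = []
      if os.path.exists(NOQUATTOR):
          problems.append("%s present (node not under Quattor control)" % NOQUATTOR)
      age = newest_crl_age_hours()
      if age is None:
          problems.append("no CRLs found in %s" % CRL_DIR)
      elif age > MAX_CRL_AGE_HOURS:
          problems.append("newest CRL is %.1f hours old" % age)
      if problems:
          print("WARNING: " + "; ".join(problems))
          return 1
      print("OK: Quattor enabled and CRLs fresh")
      return 0

  if __name__ == "__main__":
      sys.exit(main())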

Completed actions

  • RA/GS to write some new docs to cover oncall procedures for CMS with the introduction of unscheduled xroot reads
  • RA/AD to clarify what we are doing with 'broken' disk servers
  • GS to ensure that there is a ping test etc to the atlas building
  • BC to put SL6 on preprod disks.
  • RA to remove Facilities disk servers from cedaRetrieve to go back to Fabric for acceptance testing.