RAL Tier1 weekly operations castor 17/08/2015

Operations News

  • CMS disk read issues much improved.
    • Changed the I/O scheduler on the cmsDisk nodes from 'cfq' to 'noop'. We are now seeing performance roughly in line with other CMS Tier 1 sites (a sketch of the change follows this list).
  • Disk server draining is on hold because the ATLAS disk pools are very full
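
A minimal sketch of the scheduler change above, assuming a kernel of the SL5/SL6 era that exposes the legacy 'cfq' and 'noop' elevators through sysfs; the device name is illustrative and the write requires root:

  # Inspect and switch the I/O scheduler of a block device via sysfs.
  from pathlib import Path

  def current_scheduler(device: str) -> str:
      """Return the active scheduler, e.g. 'noop' from 'cfq [noop] deadline'."""
      text = Path(f"/sys/block/{device}/queue/scheduler").read_text()
      return text.split("[", 1)[1].split("]", 1)[0]

  def set_scheduler(device: str, scheduler: str) -> None:
      """Write the new scheduler name; it takes effect immediately."""
      Path(f"/sys/block/{device}/queue/scheduler").write_text(scheduler)

  if __name__ == "__main__":
      dev = "sda"  # illustrative; a cmsDisk data disk in practice
      print("before:", current_scheduler(dev))
      set_scheduler(dev, "noop")
      print("after:", current_scheduler(dev))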

Operations Problems

  • The large number of checksum tickets seen by the production team is not thought to be due to the rebalancer. The source of these tickets needs to be identified; Shaun is investigating (see the checksum sketch after this list).
  • The gridmap file on the webdav host lcgcadm04.gridpp.rl.ac.uk is not auto-updating.
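
For reference while the checksum tickets are chased, a minimal sketch of recomputing a disk copy's Adler-32 checksum (the type CASTOR normally records) for comparison with the catalogue value; the path and expected value here are hypothetical:

  import zlib

  def adler32_of(path: str, chunk_size: int = 1 << 20) -> str:
      """Compute the Adler-32 of a file, returned as 8 lowercase hex digits."""
      checksum = 1  # Adler-32 starts from 1, not 0
      with open(path, "rb") as f:
          while chunk := f.read(chunk_size):
              checksum = zlib.adler32(chunk, checksum)
      return f"{checksum & 0xFFFFFFFF:08x}"

  if __name__ == "__main__":
      expected = "0a1b2c3d"                     # hypothetical catalogue value
      actual = adler32_of("/path/to/diskcopy")  # hypothetical disk copy path
      print("match" if actual == expected else f"MISMATCH {actual} != {expected}")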

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Stress test the SRM, with possible deployment the week after (Shaun)
  • Upgrade CASTOR disk servers to SL6
    • One tape-backed node being upgraded, more tape-backed nodes next week.
  • Oracle patching schedule planned (ending 13th October).

Advanced Planning

Tasks

  • Proposed CASTOR face-to-face meeting, week commencing 5th or 12th October
  • Discussed CASTOR 2017 planning, see wiki page.

Interventions

Staffing

  • Castor on Call person next week
    • SdW
  • Staff absence/out of the office:

New Actions

Existing Actions

  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing (see the logging sketch after this list)
  • SdW to look into GC improvements - notify if a file is in an inconsistent state
  • SdW to replace the canbemigr test - to test for files that have not been migrated to tape (warn 8h / alert 24h; a check sketch follows this list)
  • SdW testing / working on the gfalcopy rpms
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • RA to look into procedural issues with CMS disk server interventions
  • RA to investigate why we are getting partial files
  • BC to document processes to control services previously controlled by puppet
  • GS to arrange meeting castor/fab/production to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING (a probe sketch follows this list)
  • BD to chase AD about using the space reporting tool we made for him
  • Someone to find out what access protocol MICE use
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB BDII
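
On the cleanlostfiles action above, a minimal sketch of logging each invocation to syslog, assuming the tool is (or wraps) a Python script; the tag and message format are illustrative:

  import sys
  import syslog

  def log_invocation(argv):
      """Record that cleanlostfiles ran, and with what arguments, via syslog."""
      syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_USER)
      syslog.syslog(syslog.LOG_INFO, "invoked with args: " + " ".join(argv[1:]))

  if __name__ == "__main__":
      log_invocation(sys.argv)
      # ... the existing cleanlostfiles logic would follow here ...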
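
On the canbemigr replacement, a sketch of a Nagios-style check applying the warn-at-8h / alert-at-24h thresholds; how the ages of unmigrated files are obtained is left as a hypothetical stager query:

  import sys

  WARN_HOURS = 8
  ALERT_HOURS = 24

  def check_unmigrated(ages_hours):
      """Return a Nagios-style exit code based on the oldest unmigrated file."""
      if not ages_hours:
          print("OK: no files awaiting tape migration")
          return 0
      oldest = max(ages_hours)
      if oldest >= ALERT_HOURS:
          print(f"CRITICAL: file unmigrated for {oldest:.1f}h (>= {ALERT_HOURS}h)")
          return 2
      if oldest >= WARN_HOURS:
          print(f"WARNING: file unmigrated for {oldest:.1f}h (>= {WARN_HOURS}h)")
          return 1
      print(f"OK: oldest unmigrated file is {oldest:.1f}h old")
      return 0

  if __name__ == "__main__":
      # Hypothetical input: ages in hours of files still awaiting migration,
      # as returned by a stager query; hard-coded here for illustration.
      sys.exit(check_unmigrated([2.5, 9.0]))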
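
On the /etc/noquattor and fetch-crl checks, a minimal probe sketch; the CRL directory, '.r0' suffix and staleness limit are assumptions about a typical fetch-crl setup:

  import os
  import sys
  import time

  def check_node(crl_dir="/etc/grid-security/certificates", max_age_hours=24.0):
      """Warn if quattor is disabled or no CRL has been refreshed recently."""
      problems = []
      # /etc/noquattor disables configuration management; flag it if present.
      if os.path.exists("/etc/noquattor"):
          problems.append("/etc/noquattor present (quattor disabled)")
      # fetch-crl keeps CRL files fresh; flag the lot if none is recent.
      crl_times = [os.path.getmtime(os.path.join(crl_dir, name))
                   for name in os.listdir(crl_dir) if name.endswith(".r0")]
      if time.time() - max(crl_times, default=0.0) > max_age_hours * 3600:
          problems.append("no CRL refreshed in the last %gh" % max_age_hours)
      if problems:
          print("WARNING: " + "; ".join(problems))
          return 1
      print("OK: quattor enabled and CRLs fresh")
      return 0

  if __name__ == "__main__":
      sys.exit(check_node())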

Completed actions