Difference between revisions of "RAL Tier1 weekly operations castor 18/05/2015"

Revision as of 14:10, 15 May 2015

Operations News

SL6 disk server config is production ready, but will not be deployed until DB 11.2.0.4 upgrade is done.
2 SL6 tape servers now running in production.
Testing CASTOR rebalancer on preproduction, and developing associated tools.

Operations Problems

CMS CASTOR file open time issues affecting batch farm efficiency

  * We have determined that the most serious incidence of this problem is due to a hot dataset that is located almost entirely on one node. Shaun has devised a system to redistribute this dataset across the rest of the cmsDisk pool.

Retrieval errors from facilities castor - a number of similar incidents seen in last few weeks
Log rotate issue - transfer manager logs for rebalancing on CASTOR stagers - Resolved
GDSS757 cmsDisk / 763 atlasStripInput - new motherboards, fabric acceptance test complete, ready for deployment.
A higher number of checksum errors (Alice) - found to be due to VO actions. This is being cleared up.

OLDER>>>>
Deadlocks on atlas stager DB (post 2.1.14-15) - not serious but investigation ongoing
Atlas are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian working on this.
There is a problem with latest gfal libraries - not a new problem but Atlas are starting to exercise functionality and identifying these issues
castor functional test on lcgccvm02 causing problems - Gareth reviewing.

Blocking Issues

GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a showstopper for SL6 on disk servers.

Planned, Scheduled and Cancelled Interventions

Advanced Planning

Tasks

Create new small VOs for DiRAC (and LIGO?)
Bruno working on SL6 disk servers
Complete deployment of SL6 tape servers
Provide new VM? to provide castor client functionality to query the backup DBs
Plans to ensure PreProd represents production in terms of hardware generation are underway
Switch admin machines from lcgccvm02 to lcgcadm05
Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers
Intervention to upgrade CASTOR DBs to Oracle 11.2.0.4

Interventions

Staffing

Castor on Call person next week
- Rob

Staff absence/out of the office:
- Shaun not available on Thurs

Actions

Rob and Shaun to continue fixing CMS
Rob to try and reproduce SRM DB Dups issue with FTS transfers
Rob - change checksum validator so it maintains case when displaying
Bruno to document processes to control services previously controlled by puppet
Gareth to arrange meeting castor/fab/production to discuss the decommissioning procedures
Gareth to investigate providing checks for /etc/noquatto on production nodes & checks for fetch-crl

Completed Actions

Brian - On Wed, tell Tim if he can start repacking mctape (Done)
Shaun to send plots to Matt to support above action
Rob to book meeting to discuss possible workarounds/investigations for CMS issue - xroot timeout / server(read-ahead) or Network tuning / CASTOR bug / deploying more disk servers
Rob - talk with Shaun to see if it's possible to reject anything that does not have a spacetoken (answer: this is not a good idea)
Brian - to discuss unroutable files / spacetokens at DDM meeting on Tuesday (Done)
Rob to pick up DB cleanup change control (Done)
Chris/Rob to arrange a meeting to discuss CMS performance/xroot issues (is performance appropriate, if not plan to resolve) - inc. Shaun, Rob, Brian, Gareth (Potential fix implemented)

@@ Line 2: / Line 2: @@
 == Operations News ==
-* ATLAS draining complete
 * SL6 disk server config is production ready, but will not be deployed until DB 11.2.0.4 upgrade is done.
 * 2 SL6 tape servers now running in production.
-* Testing CASTOR rebalancer on preproduction.
+* Testing CASTOR rebalancer on preproduction, and developing associated tools.
 == Operations Problems ==
-* CMS CASTOR issues affecting batch farm efficiency - several areas of investigation / mitigation underway
+* CMS CASTOR file open time issues affecting batch farm efficiency
-* Seeing a high volume of SRM DB file dup errors, Rob is currently investigating (trying to reproduce with FTS transfers)
+   * We have determined that the most serious incidence of this problem is due to a hot dataset that is located almost entirely on one node. Shaun has devised a system to redistribute this dataset across the rest of the cmsDisk pool.
 * Retrieval errors from facilities castor - a number of similar incidents seen in last few weeks
-* log rotate issue - transfer manager logs for rebalancing on CASTOR stagers
+* Log rotate issue - transfer manager logs for rebalancing on CASTOR stagers - Resolved
 * GDSS757 cmsDisk  / 763 atlasStripInput - new motherboards, fabric acceptance test complete, ready for deployment.
-* Elastic tape data corruption - CASTOR team assisting with extraction of data for review
+* A higher number of checksum errors (Alice) - found to be due to VO actions. This is being cleared up.
-* A higher number of checksum errors (Alice) investigation ongoing
 * OLDER>>>>
 * Deadlocks on atlas stager DB (post 2.1.14-15) - not serious but investigation ongoing
-* Issue with missing files in LHCb – race condition reported to CERN
+* Atlas are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian working on this.
-* Atlas are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian / Rob actions below.
 * There is a problem with latest gfal libraries - not a new problem but Atlas are starting to exercise functionality and identifying these issues
-* storageD retrieval from castor problems - investigation ongoing
 * castor functional test on lcgccvm02 causing problems - Gareth reviewing.
@@ Line 37: / Line 34: @@
 * Create new small VOs for DiRAC (and LIGO?)
 * Bruno working on SL6 disk servers
+* Complete deployment of SL6 tape servers
 * Provide new VM? to provide castor client functionality to query the backup DBs
 * Plans to ensure PreProd represents production in terms of hardware generation are underway
@@ Line 47: / Line 45: @@
 == Staffing ==
 * Castor on Call person next week
-** Shaun
+** Rob
 * Staff absence/out of the office:
-** Juan out from Tues
+** Shaun not available on Thurs
-** Jens out Monday/Tues
-** Shaun not available on Wed
 == Actions ==
-* Rob and Matt to escalate CASTOR slowness to Massimo at CERN
+* Rob and Shaun to continue fixing CMS
-* Shaun to send plots to Matt to support above action
-* Rob to book meeting to discuss possible workarounds/investigations for CMS issue - xroot timeout / server(read-ahead) or Network tuning / CASTOR bug / deploying more disk servers
 * Rob to try and reproduce SRM DB Dups issue with FTS transfers
 * Rob - change checksum validator so it maintains case when displaying
@@ Line 67: / Line 61: @@
 == Completed Actions ==
 * Brian - On Wed, tell Tim if he can start repacking mctape (Done)
+* Shaun to send plots to Matt to support above action
+* Rob to book meeting to discuss possible workarounds/investigations for CMS issue - xroot timeout / server(read-ahead) or Network tuning / CASTOR bug / deploying more disk servers
 * Rob - talk with Shaun to see if it's possible to reject anything that does not have a spacetoken (answer: this is not a good idea)
 * Brian - to discuss unroutable files / spacetokens at DDM meeting on Tuesday (Done)
 * Rob to pick up DB cleanup change control (Done)
 * Chris/Rob to arrange a meeting to discuss CMS performance/xroot issues (is performance appropriate, if not plan to resolve) - inc. Shaun, Rob, Brian, Gareth (Potential fix implemented)
