RAL Tier1 weekly operations castor 09/02/2015
[https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor List of CASTOR meetings]
Operations News
- Draining: Atlas now draining again; LHCb also draining
- Tier 1 CASTOR stopped and rebooted for the GHOST vulnerability (and the CIP)
- Facilities CASTOR rebooted for the GHOST vulnerability
Operations Problems
- There is a problem with the latest gfal libraries; this is not a new problem, but Atlas are starting to exercise the functionality and are identifying these issues
- Atlas are putting files (sonar.test files) into unroutable paths; this looks like an issue with the space token used
- Problems with StorageD retrievals from CASTOR; investigation is ongoing
- CMS heavy load and high job failure rate: hot-spotting on 3 servers; the files were spread to many servers, which then also became loaded. CMS moved away from RAL for these files. This issue needs to be discussed in more detail.
- CASTOR functional test on lcgccvm02 causing problems; Gareth is reviewing
- The 150k zero-size files reported last week have almost all been dealt with; CMS files are outstanding
- Files with no namespace (ns) or xattr checksum value in CASTOR are failing transfers from RAL to BNL using the BNL FTS3 server (see the sketch after this list).
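As an illustration of the checksum problem above (not part of the meeting notes), here is a minimal sketch using the gfal2 Python bindings to ask an endpoint for an ADLER32 checksum. It assumes the gfal2-python bindings are installed and a valid grid proxy exists; the SURL is a hypothetical example, not a real RAL path.

    # Minimal sketch, assuming gfal2-python and a valid grid proxy.
    # The SURL below is a made-up example, not a real RAL path.
    import gfal2

    ctx = gfal2.creat_context()
    surl = "srm://srm-example.gridpp.rl.ac.uk/castor/example.rl.ac.uk/prod/atlas/somefile"

    try:
        # Files with no namespace or xattr checksum are the ones failing the
        # RAL -> BNL FTS3 transfers; a missing checksum shows up here as an error.
        print(ctx.checksum(surl, "ADLER32"))
    except gfal2.GError as err:
        print("checksum lookup failed:", err)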
Blocking Issues
- GridFTP bug in SL6: it stops any globus copy if a client is using a particular library. This is a show-stopper for SL6 on the disk servers.
Planned, Scheduled and Cancelled Interventions
- Oracle upgrade (11.2.0.4) of preprod on 9th Feb; this will require a short outage
- Oracle PSU update: Tuesday on the primary, Wednesday on the standby
- Upgrade Oracle DB to version 11.2.0.4 (Late February?)
- Upgrade CASTOR to version 2.1.14-14 OR 2.1.14-15 (February)
Advanced Planning
Tasks
- The DB team need to plan some work which will put the DBs under load for approximately one hour; not terribly urgent, but it needs to be done in the new year.
- Provide a new VM(?) with CASTOR client functionality to query the backup DBs
- Plans to ensure PreProd represents production in terms of hardware generation are underway
- Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
- Switch admin machines from lcgccvm02 to lcgcadm05
- Correct the partition alignment issue (3rd CASTOR partition) on the new CASTOR disk servers
Interventions
Actions
- Rob to pick up DB cleanup change control
- Bruno to document the processes for controlling services previously controlled by Puppet
- Gareth to arrange a CASTOR/Fabric/Production meeting to discuss the decommissioning procedures
- Chris/Rob to arrange a meeting to discuss CMS performance/xroot issues (is performance appropriate; if not, plan to resolve) - inc. Shaun, Rob, Brian, Gareth
- Gareth to investigate providing checks for /etc/noquatto on production nodes and checks for fetch-crl (a possible shape for such a check is sketched after this list)
- Matt to identify a suitable time to patch CASTOR Facilities; the DB team will synchronise patching of Juno with this outage. Done.
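For the action above on checks for /etc/noquatto and fetch-crl, a possible shape for such a check is sketched below. This is only an assumption of what the check might look like; the paths, the two-day CRL age limit and the exit codes are illustrative, not agreed values.

    # Hedged sketch of a node check: warn if /etc/noquatto is present (spelling
    # as used in the minutes) or if fetch-crl output looks stale. The paths and
    # the two-day threshold are assumptions, not an agreed implementation.
    import os
    import sys
    import time

    NOQUATTOR_FLAG = "/etc/noquatto"
    CRL_DIR = "/etc/grid-security/certificates"   # where fetch-crl places CRLs
    MAX_CRL_AGE = 2 * 24 * 3600                    # two days, in seconds

    problems = []

    if os.path.exists(NOQUATTOR_FLAG):
        problems.append(NOQUATTOR_FLAG + " present - node excluded from configuration management")

    crls = []
    if os.path.isdir(CRL_DIR):
        crls = [f for f in os.listdir(CRL_DIR) if f.endswith(".r0")]
    if not crls:
        problems.append("no CRL files found in " + CRL_DIR)
    else:
        newest = max(os.path.getmtime(os.path.join(CRL_DIR, f)) for f in crls)
        if time.time() - newest > MAX_CRL_AGE:
            problems.append("newest CRL is older than two days - is fetch-crl running?")

    if problems:
        print("WARNING: " + "; ".join(problems))
        sys.exit(1)
    print("OK")
    sys.exit(0)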
Staffing
- Castor on Call person
  - Rob
- Staff absence/out of the office:
  - Chris working from home Monday