Difference between revisions of "RAL Tier1 weekly operations castor 20/04/2015"

Latest revision as of 13:38, 20 April 2015

GDSS757 cmsDisk / 763 atlasStripInput - new motherboards, fabric acceptance test finishes 23/4/15
CMS has been suffering from hot files - these have been duplicated on other disk servers
Minor issue with xroot libraries on 2 LHCb disk servers following reverted 2.1.14-15 upgrade 08/04/15 - resolved Friday 10th April
Deadlocks on atlas stager DB (post 2.1.14-15) - not serious but investigation ongoing

OLDER>>>>
Issue with missing files in LHCb – race condition reported to CERN
Atlas are putting files (sonar.test files) into un-routable paths - this looks like an issue with the space token used. Brian / Rob actions below.
There is a problem with the latest gfal libraries - not a new problem, but Atlas are starting to exercise this functionality and are identifying these issues
storageD retrieval from castor problems - investigation ongoing
CMS heavy load & high job failure – hot spotting 3 servers, files spread to many servers which then also became loaded. CMS moved away from RAL for these files. Need to discuss this issue in more detail.
castor functional test on lcgccvm02 causing problems - Gareth reviewing
150k zero size files reported last week have almost all been dealt with, CMS files outstanding
Files with no ns or xattr checksum value in castor are failing transfers from RAL to BNL using the BNL FTS3 server.

GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.
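
For the item on files with no ns or xattr checksum value, a rough sketch of how a missing or mismatched checksum could be spotted on a local copy of a file. This assumes the checksums are Adler-32 stored as hex strings, which the notes do not state; the function names are hypothetical and nothing here calls a CASTOR API.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def checksum_matches(path, stored):
    """Compare a stored hex checksum with the file content.

    An empty or missing stored value counts as a failure, mirroring the
    "no checksum value" case. The comparison ignores letter case and
    missing leading zeros (both are assumptions about the stored format).
    """
    if not stored:
        return False
    return adler32_hex(path) == stored.strip().lower().zfill(8)
```

The comparison is case-insensitive on purpose, so a display-only change to how a validator prints the value would not affect the result.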

Tasks

SL6 tape server
Bruno working on SL6 disk servers
DB team need to plan some work which will result in the DBs being under load for approx 1h
Provide a new VM (?) offering castor client functionality to query the backup DBs
Plans to ensure PreProd represents production in terms of hardware generation are underway
Possible future upgrade to CASTOR 2.1.14-15 in March or April
Switch from admin machines: lcgccvm02 to lcgcadm05
Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers

Interventions

Rob - revisit deployment plans (esp people in office towards the end of March)
Brian - On Wed, tell Tim if he can start repacking mctape
Rob - change checksum validator so it maintains case when displaying
Rob - talk with Shaun to see if it's possible to reject anything that does not have a spacetoken
Brian - to discuss unroutable files / spacetokens at the DDM meeting on Tuesday
Rob to pick up DB cleanup change control
Bruno to document processes to control services previously controlled by puppet
Gareth to arrange meeting castor/fab/production to discuss the decommissioning procedures
Chris/Rob to arrange a meeting to discuss CMS performance/xroot issues (is performance appropriate, if not plan to resolve) - inc. Shaun, Rob, Brian, Gareth
Gareth to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl
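
One possible shape for the node checks in the last action, as a sketch only. It assumes the check amounts to "does the marker file exist" and "are any CRL files older than a threshold", and that CRL files carry a `.r0` suffix; the function names, the threshold and the directory are illustrative and would need to match the real setup.

```python
import os
import time


def marker_present(path):
    """True if the marker file (path supplied by the caller) exists."""
    return os.path.exists(path)


def stale_crls(crl_dir, max_age_hours=24, now=None):
    """List CRL files in crl_dir not refreshed within max_age_hours.

    Only names ending in ".r0" are considered (an assumption about how
    the CRL files are named). `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    limit = max_age_hours * 3600
    stale = []
    for name in sorted(os.listdir(crl_dir)):
        if not name.endswith(".r0"):
            continue
        full = os.path.join(crl_dir, name)
        # Modification time stands in for "last refreshed".
        if now - os.path.getmtime(full) > limit:
            stale.append(name)
    return stale
```

A monitoring probe could raise a warning when `marker_present(...)` is true on a production node, or when `stale_crls(...)` returns a non-empty list.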

Staff absence/out of the office:
- Shaun – out all week
- Brian – back on Tuesday 21st
- Jens – out Monday

@@ Line 2: / Line 2: @@
 == Operations News ==
-* SRM certs have been renewed
+* Tier 1 CASTOR 2.1.14-15 upgrade completed successfully
 * Draining - Atlas still draining
+* SL6 Disk and Tape server config is production ready
 == Operations Problems ==
-* GDSS757 cmsDisk still waiting for new motherboard
+* GDSS757 cmsDisk  / 763 atlasStripInput - new motherboards, fabric acceptance test finishes 23/4/15
-* Process to generate new grid-mapfiles has not been failing, currently being corrected
+* CMS has been suffering from hot files - these have been duplicated on other disk servers
+* Minor issue with xroot libraries on 2 LHCb disk servers following reverted 2.1.14-15 upgrade 08/04/15 - resolved Friday 10th April
+* Deadlocks on atlas stager DB (post 2.1.14-15) - not serious but investigation ongoing
 * OLDER>>>>
 * Issue with missing files in LHCb – race condition reported to CERN
@@ Line 26: / Line 29: @@
 == Planned, Scheduled and Cancelled Interventions ==
 * Upgrade Oracle DB to version 11.2.0.4 (Late February?)
-* 2.1.14-15 Facilities on Tuesday 31st March, and Tier 1 on Wednesday 8th April (expected outage 4 hours TBC)
 == Advanced Planning ==