RAL Tier1 weekly operations castor 08/02/2016

Operations News

  • The DiRAC VO, with people from Leicester, is coming online.
  • CASTOR 2.1.15 had its first airing at change control; 2.1.15 is currently not working for us.
  • New tape-backed disk servers for the Tier1, to replace the CV '11 generation; a recommendation has been made to Martin.
  • Repack upgrade from 2.1.14 to 2.1.15.
  • Wiki page on merging tape pools created by Shaun.
  • The 2.1.15 name server has been tested.
  • New SRM on vcert2.
  • New SRM (SL6) with bug fixes is available; needs testing.
  • The gfal-cat command is failing for ATLAS when reading nsdumps from CASTOR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. The developers are looking to fix this within: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842 (a read sketch follows this list).
  • LHCb batch jobs are failing to copy results into CASTOR; the changes made seem to have improved the situation but have not fixed it (Raja). The number of connections to the name server database is being increased (more threads).
  • BD is looking at porting the persistent tests to Ceph.
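
The gfal-cat failure above can be checked independently of the command-line tool through the gfal2 Python bindings. A minimal read sketch, assuming the gfal2 bindings are installed and using a hypothetical SURL (the real endpoint and nsdump path will differ):

    import gfal2

    ctx = gfal2.creat_context()
    # Hypothetical SURL -- the real RAL endpoint and nsdump path differ.
    url = "srm://srm-atlas.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/nsdump"
    f = ctx.open(url, "r")
    data = f.read(1048576)            # 1 MiB chunks, much as gfal-cat reads
    while data:
        # process or print the chunk here
        data = f.read(1048576)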

Operations Problems

  • Ongoing AAA issues in CMS.
  • Tape loss: a corrupt ATLAS MC tape. Tim recovered some files; Alastair reports it holds 2012 Monte Carlo data. Tim is sending the tape off for analysis.
  • gdss667 is back in production.
  • 20 GB/s out of atlasstripinput from 2000 jobs; a lot of the connections were RFIO (a sketch for counting these follows this list).
  • CV '11 generation RAID card controller firmware update required; Western Digital drives are failing at a high rate.
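
One quick way to see where the RFIO traffic originates is to count established connections to rfiod per remote host, straight from /proc/net/tcp on a disk server. A sketch in Python 2 (as shipped on the SL6 nodes), assuming rfiod is listening on its conventional port 5001 (check the local configuration):

    import socket
    import struct

    # rfiod's conventional port; an assumption -- verify on the disk servers.
    RFIO_PORT = 5001

    def remote_peers(port):
        """Count ESTABLISHED TCP connections to `port` by remote IP."""
        counts = {}
        f = open("/proc/net/tcp")
        try:
            f.readline()                          # skip the header row
            for line in f:
                fields = line.split()
                local, remote, state = fields[1], fields[2], fields[3]
                if state != "01":                 # 01 == ESTABLISHED
                    continue
                if int(local.split(":")[1], 16) != port:
                    continue
                # IPv4 address is little-endian hex here (on x86)
                raw = struct.pack("<I", int(remote.split(":")[0], 16))
                ip = socket.inet_ntoa(raw)
                counts[ip] = counts.get(ip, 0) + 1
        finally:
            f.close()
        return counts

    for ip, n in sorted(remote_peers(RFIO_PORT).items()):
        print "%-15s %d connections" % (ip, n)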

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Merge all CASTOR tape-backed pools into one.
  • Upgrade of the Oracle clients to 11.2.0.4 (already running in preprod): planning the production rollout, Gen instance first, with a possible change control for production.
  • WAN tuning proposal; possibly to be put through change control (Brian).
  • CASTOR 2.1.15.


Long-term projects

  • RA has produced a Python script to handle the SRM database duplication issue that is causing callouts. Running the script is awkward because the Python version on the SRM servers is still 2.4, but RA will pursue this (a 2.4-compatible sketch follows this list). SdW has reviewed the script and is confident that it is low risk.
  • JJ – GLUE 2 for CASTOR, relating to publishing information; it is not clear that a specific action is associated with this.
  • JS – replacing the Tier1 CASTOR database hardware. ACTION: RA/JS to discuss disk requirements.
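
For reference, the 2.4 constraint mainly rules out later conveniences (the with statement, conditional expressions, collections.defaultdict), not the task itself. A sketch of the duplicate-detection step that stays within Python 2.4, assuming cx_Oracle and entirely hypothetical connection details and table/column names (the real SRM schema will differ):

    import cx_Oracle

    # Hypothetical DSN and schema -- the real SRM database details differ.
    conn = cx_Oracle.connect("srm_user/secret@srm_db")
    cur = conn.cursor()
    try:
        # SURLs recorded more than once: the duplication behind the callouts.
        cur.execute("""
            SELECT surl, COUNT(*)
              FROM srm_requests          -- hypothetical table name
             GROUP BY surl
            HAVING COUNT(*) > 1
        """)
        for surl, n in cur.fetchall():
            # Plain 2.4 constructs only: no 'with', no conditional expressions
            print "%s appears %d times" % (surl, n)
    finally:
        cur.close()
        conn.close()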


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • CASTOR on-call person next week:
    • Rob
  • Staff absence/out of the office:
    • Shaun is in until 1st March.

New Actions

  • BD to understand where the RFIO connections from ATLAS are coming from.
  • BD to track down current RFIO usage within CASTOR and plan a migration away from it.
  • GS to arrange a meeting to discuss the remaining actions on the CV '11 and '12 generations (when Kashif is back).
  • BD to clarify whether separating the DiRAC data is a necessity.
  • RA to take vcert to 2.1.14, install the new SRM on the vcert SRM node, and perform functional tests (see the sketch after this list).
  • BD to ensure the ATLAS consistency check is Quattorised.
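
A functional test of the new SRM on vcert could be a simple put/stat/delete round trip via the gfal2 Python bindings. A rough sketch, with a hypothetical endpoint and path (not the real vcert hostname):

    import gfal2

    ctx = gfal2.creat_context()
    # Hypothetical endpoint and path -- not the real vcert hostname.
    surl = ("srm://vcert-srm.gridpp.rl.ac.uk"
            "/castor/ads.rl.ac.uk/test/srm-smoke-test")

    # Upload a local file, stat it on the SRM, then clean up.
    params = ctx.transfer_parameters()
    ctx.filecopy(params, "file:///tmp/srm-smoke-test", surl)
    info = ctx.stat(surl)
    print "size reported by SRM: %d bytes" % info.st_size
    ctx.unlink(surl)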

Existing Actions

  • Rob to send Shaun’s merging tape pools wiki page to CERN for review.
  • Rob and Shaun to review Alison’s wiki page and the docs from the CASTOR handover, and update the CASTOR procedures.
  • Rob to deploy a '14 generation into preprod.
  • Shaun to present the plan to merge the tape-backed pools into one on 29th January.
  • Rob to look at DLS / CEDA data rates.
  • BD, re. the WAN tuning proposal: discuss with GS whether it needs a change control.
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use; under testing (see the sketch after this list).
  • GS to investigate how/whether we need to declare xrootd endpoints in GOCDB/BDII; in progress.
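
The cleanlostfiles change amounts to tagging each invocation in syslog. If done from Python, the standard syslog module is sufficient; a minimal sketch, with hypothetical message content:

    import os
    import sys
    import syslog

    # Tag each invocation so usage can be tracked in /var/log/messages.
    syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO,
                  "invoked by uid %d with args: %s"
                  % (os.getuid(), " ".join(sys.argv[1:])))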