RAL Tier1 weekly operations castor 01/02/2016

Operations News

  • Merging tape pools wiki created by Shaun
  • 2.1.15 name server tested
  • New SRM on vcert2
  • New SRM (SL6) with bug fixes available - needs testing
  • Gfal-cat command failing for ATLAS reading of nsdumps from CASTOR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=117846. Developers are looking to fix this within: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118842 (see the gfal2 sketch after this list)
  • LHCb batch jobs failing to copy results into CASTOR - changes made seem to have improved the situation but not fixed it (Raja). Increasing the number of connections to the NS database (more threads)
  • BD looking at porting persistent tests to Ceph
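
For reference on the gfal-cat issue above, here is a minimal sketch of reproducing the read through the gfal2 Python bindings rather than the gfal-cat CLI. The URL is a hypothetical placeholder, and the open/read call pattern assumes the standard gfal2 bindings are installed on the client node.

    # Minimal sketch: read the start of an nsdump file the way gfal-cat would,
    # via the gfal2 Python bindings. The URL below is a hypothetical placeholder.
    import sys
    import gfal2

    URL = "root://example-castor-host//castor/ads.rl.ac.uk/atlas/nsdump.txt"  # hypothetical

    def main():
        ctx = gfal2.creat_context()         # gfal2 context (loads protocol plugins)
        try:
            f = ctx.open(URL, "r")          # open the remote file read-only
            sys.stdout.write(f.read(4096))  # read the first 4 kB, as gfal-cat streams it
        except gfal2.GError as err:         # gfal2 raises GError on protocol/transfer errors
            sys.stderr.write("gfal2 read failed: %s\n" % err)
            sys.exit(1)

    if __name__ == "__main__":
        main()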

Operations Problems

  • atlasScratchDisk ... gdss667 suffered a triple disk failure; most data lost. Two callouts: disk servers down and the limit on xroot connections was hit - SAM tests failing.
  • gdss620 - Gen tape, still with Fabric
  • CV '11 generation RAID card controller firmware update required - Western Digital drives failing at a high rate
  • Fac tape drive broken - Tim has taken it out of the tape pool

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • Merge all CASTOR tape-backed pools into one
  • 11.2.0.4 Oracle client updates (running in preprod) - possible change control for production
  • WAN tuning proposal - possibly put into change control (Brian)
  • CASTOR 2.1.15
  • Upgrade of Oracle clients to 11.2.0.4 - planning for production rollout, Gen first


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. Running the script is a problem because the Python version on the SRM servers is still 2.4; however, RA will pursue this. SdW has reviewed the script and is confident that it is low risk. (See the sketch after this list.)
  • JJ – GLUE 2 for CASTOR, related to publishing information; not sure there was a specific action associated with this
  • JS – replacing Tier1 CASTOR DB hardware. ACTION: RA/JS to discuss disk requirements
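
As a rough illustration of the kind of cleanup the duplication script performs, below is a minimal sketch kept to Python 2.4-compatible syntax (no with-statements), since that is the interpreter available on the SRM servers. The table and column names are hypothetical placeholders rather than the real SRM schema, and cx_Oracle is assumed as the database driver.

    # Rough sketch of duplicate-row cleanup against an Oracle SRM database,
    # restricted to Python 2.4-compatible syntax (no 'with' statements).
    # Table and column names are hypothetical placeholders, not the real schema.
    import cx_Oracle

    def remove_duplicates(dsn, user, password):
        conn = cx_Oracle.connect(user, password, dsn)
        cur = conn.cursor()
        try:
            # Keep the lowest-id row per request token and delete the rest.
            cur.execute("""
                DELETE FROM srm_requests
                 WHERE id NOT IN (SELECT MIN(id)
                                    FROM srm_requests
                                   GROUP BY request_token)""")
            conn.commit()
            return cur.rowcount
        finally:
            cur.close()
            conn.close()

    if __name__ == "__main__":
        print "removed %d duplicate rows" % remove_duplicates(
            "srmdb_alias", "srm_admin", "changeme")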


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • CASTOR on-call person next week
    • Rob
  • Staff absence/out of the office:
    • Brian out Tues 26th - Thurs 4th
    • Shaun out 1st Feb for a week, then back in until 1st March

New Actions

  • Rob to double-check gdss667 has had its firmware updated
  • Chris to document the Corbin method for a disk pool having issues: restart the xroot daemons first, then try restarting diskmanagerd, and finally use the 'nuclear' method of transfermanager stop - diskmanagerd restart - transfermanager start (see the sketch after this list)
  • Rob to send Shaun’s merging tape pools wiki to CERN for review
  • Rob and Shaun to review Alison’s wiki page, docs from castor handover and update castor procedures
  • Rob to send email to Kevin O'N re any issues with Fac tape config change level 2
  • Tim to apply Fac tape config change level 1 - i.e. a DLS read-only tape on Monday 1st
  • Gareth to share the reporting available for Fac tape
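
As a starting point for documenting the Corbin method above, here is a hedged sketch of the escalation as a small wrapper script. The exact service names and the use of the SysV 'service' command are assumptions about how the daemons are managed; escalate to the next step only if the pool is still misbehaving after the previous one.

    # Sketch of the 'Corbin method' escalation for a misbehaving disk pool:
    #   1) restart the xroot daemons,
    #   2) restart diskmanagerd,
    #   3) 'nuclear' option: stop transfermanager, restart diskmanagerd,
    #      start transfermanager.
    # Service names and the SysV 'service' command are assumptions here.
    import subprocess

    def service(name, action):
        """Run 'service <name> <action>' and return True if it exited cleanly."""
        return subprocess.call(["service", name, action]) == 0

    def step1_restart_xroot():
        service("xrootd", "restart")

    def step2_restart_diskmanager():
        service("diskmanagerd", "restart")

    def step3_nuclear():
        # Only reach for this if steps 1 and 2 did not clear the problem.
        service("transfermanagerd", "stop")
        service("diskmanagerd", "restart")
        service("transfermanagerd", "start")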


Existing Actions

  • Juan to discuss with Shaun what to do with the workaround applied to the "subrequesttodo" procedure in CASTOR 2.1.15.
  • Rob to create castor 14 gen build (for Ceph)
  • Shaun to present the plan to merge the tape-backed pools into one on 29th Jan
  • Rob to look at DLS / CEDA data rates in relation
  • BD re. WAN tuning proposal - discuss with GS, does it need a change control?
  • RA to try stopping tapeserverd mid-migration to see if it breaks.
  • RA (was SdW) to modify cleanlostfiles to log to syslog so we can track its use - under testing (see the syslog sketch after this list)
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB / BDII - in progress
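
On the cleanlostfiles action above, here is a minimal sketch of the kind of syslog hook being tested, using the standard-library syslog module; the ident string and message format are assumptions, not the actual modification under test.

    # Minimal sketch: record each cleanlostfiles invocation in syslog so its
    # use can be tracked. Ident string and message format are assumptions.
    import os
    import sys
    import syslog

    def log_invocation():
        syslog.openlog("cleanlostfiles", syslog.LOG_PID, syslog.LOG_DAEMON)
        syslog.syslog(syslog.LOG_INFO,
                      "invoked by uid=%d args=%s" % (os.getuid(),
                                                     " ".join(sys.argv[1:])))
        syslog.closelog()

    if __name__ == "__main__":
        log_invocation()
        # ... existing cleanlostfiles logic would follow here ...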