RAL Tier1 weekly operations castor 06/10/2014


Operations News

  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Disk server redeployments continue (e.g. D1T0 servers reused as D0T1) ... 5 servers for LHCb still to go, see below
  • SL6 headnode work is progressing well - hoping for rollout in November
  • xrootd security advisory concerning the FAX component shipped within xrootd
  • Useful breakout sessions at the CASTOR face-to-face - deadlock analysis and bugs confirmed, plus discussions on simplifying headnode configurations and removing a single point of failure
  • Oracle PSU DB patch for the Neptune standby (Neptr26) and Neptune primary (Neptr89) completed (not the cause of the recent issues)

Operations Problems

  • gdss720 - still issues around this server. There is currently no evidence that the physical server is misbehaving, but draining has been largely unsuccessful (slow) and ATLAS were suffering many failed transfers involving gdss720. The server is currently read-only and will be revisited once the current CASTOR issues are resolved.
  • gdss707 (atlasStripInput) was taken out of production over the weekend - memory tests completed and the RAID controller firmware has been upgraded
  • gdss763 (also atlasStripInput) - RAID card replaced, back in production.
  • Still having difficulty deploying 5 servers into LHCb production because of a Mellanox network/Quattor issue - Fabric team to fix
  • ATLAS load causing high wait I/O on several servers - partly understood, and we are working towards some config changes (see the monitoring sketch after this list)
  • DB crash on the Neptune standby last Wednesday (1st) after an Oracle patch had previously been applied - not production affecting. Oracle report it is a known issue but do not currently have a fix for our version.
  • Current CASTOR problems with the cache on an EMC disk array - ATLAS were worst affected. Our position was improved by moving the ATLAS/GEN stager/SRM to standby on Monday 6th Oct
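As a rough illustration of the "high wait I/O" metric mentioned above, the sketch below (plain Python; the interval and threshold are hypothetical and not taken from our real monitoring) samples /proc/stat twice and reports the fraction of CPU time spent waiting on I/O:

 #!/usr/bin/env python
 # Minimal sketch: estimate the iowait fraction on a disk server by sampling
 # the aggregate "cpu" line of /proc/stat twice. Interval and threshold are
 # illustrative values only.
 import time

 def cpu_times():
     # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
     with open("/proc/stat") as f:
         return [int(x) for x in f.readline().split()[1:]]

 def iowait_fraction(interval=1.0):
     before = cpu_times()
     time.sleep(interval)
     after = cpu_times()
     deltas = [a - b for a, b in zip(after, before)]
     total = sum(deltas)
     return deltas[4] / float(total) if total else 0.0  # field 4 is iowait

 if __name__ == "__main__":
     frac = iowait_fraction()
     print("iowait over the last second: %.1f%%" % (100 * frac))
     if frac > 0.30:  # arbitrary example threshold
         print("high wait I/O - worth a closer look at this server")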

Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if a client is using a particular library. This is a show stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Juan to patch the CASTOR DBs (PSU patches for Pluto and Juno) - standard change ... note this is currently delayed due to the EMC cache issue
  • 2.1.14-14 stress testing in PreProd along with new errata
  • A Tier 1 database cleanup is planned to remove a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.


Advanced Planning

Tasks

  • Possible future upgrade to CASTOR 2.1.14-15 after Christmas
  • Switch admin machines: lcgccvm02 to lcgcadm05
  • A new VM configured to run against the standby CASTOR database will be created as a front-end for dark data and similar queries (a comparison sketch follows this list)
  • Replace DLF with Elasticsearch
  • Correct the partition alignment issue (3rd CASTOR partition) on the new CASTOR disk servers (an alignment-check sketch also follows this list)
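On the dark-data front-end above: the core of such queries is comparing the set of files the database expects on a disk server with the set actually present on its filesystems. The sketch below is a minimal, hypothetical illustration of that comparison - the input dump files (db_files.txt, disk_files.txt) are invented names and do not reflect the real tooling or schema:

 #!/usr/bin/env python
 # Minimal sketch of a dark-data comparison for one disk server.
 # Inputs are hypothetical dumps with one file path per line:
 #   db_files.txt   - paths the database expects to find on this server
 #   disk_files.txt - paths actually found on the server's filesystems

 def read_paths(fname):
     with open(fname) as f:
         return set(line.strip() for line in f if line.strip())

 db_files = read_paths("db_files.txt")
 disk_files = read_paths("disk_files.txt")

 dark_data = disk_files - db_files    # on disk but unknown to the database
 missing = db_files - disk_files      # expected by the database but not on disk

 print("dark data candidates: %d" % len(dark_data))
 print("missing from disk:    %d" % len(missing))
 for path in sorted(dark_data):
     print("DARK    " + path)
 for path in sorted(missing):
     print("MISSING " + path)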
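On the partition alignment task: a quick way to see whether a partition starts on a 1 MiB boundary (2048 x 512-byte sectors) is to read its start sector from sysfs. The sketch below is illustrative only - the device and partition names are examples, not the actual procedure we will use on the disk servers:

 #!/usr/bin/env python
 # Minimal sketch: report whether a partition's start sector is 1 MiB aligned
 # (i.e. divisible by 2048 x 512-byte sectors). Device names are examples only.
 import sys

 def start_sector(disk, part):
     # e.g. /sys/block/sda/sda3/start holds the partition's starting sector
     with open("/sys/block/%s/%s/start" % (disk, part)) as f:
         return int(f.read().strip())

 if __name__ == "__main__":
     disk = sys.argv[1] if len(sys.argv) > 1 else "sda"
     part = sys.argv[2] if len(sys.argv) > 2 else disk + "3"
     start = start_sector(disk, part)
     state = "aligned" if start % 2048 == 0 else "MISALIGNED"
     print("%s starts at sector %d: %s" % (part, start, state))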


Interventions

  • Moved the ATLAS/GEN stager/SRM to the standby database on Monday 6th Oct to mitigate the current CASTOR problems with the cache on an EMC disk array (ATLAS were worst affected)

Staffing

  • CASTOR on-call person
    • Matt


  • Staff absence/out of the office:
    • Rob - Out all week