RAL Tier1 Weekly Operations CASTOR 13/10/2014

From GridPP Wiki

Operations News

  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Disk server redeployments continue (e.g. D1T0 servers reused as D0T1) ... 5 servers for LHCb remain, see below
  • xrootd security advisory concerning the FAX component within xrootd
  • SL6 headnode work is progressing well - hoping to test in the CASTOR vcert instance next week
  • PSU Oracle patch on Juno completed


Operations Problems

  • gdss720 - a RAID issue (LSI card) currently looks to be at the root of its problems. Kashif attempted to replace the RAID controller on Friday 10th but was unsuccessful. The server is now running read-only again with the same RAID controller as before.
  • Still having difficulty deploying 5 servers into LHCb production because of a Mellanox network/Quattor issue - Fabric to fix
  • ATLAS load: high I/O wait on several servers - partly understood, and working towards some config changes
  • T2K could not get through the SRM - Shaun caused and corrected the problem (it would also have affected SNO+ and another VO)
  • T2K files with a physical size but no size in the namespace (blocking a disk server decommissioning) - Brian is talking to T2K.
  • CIP stopped updating - this particular problem is thought to be related to the CASTOR switchover caused by the EMC failure. Action: add the CIP to the instructions for CASTOR failover. The CASTOR team decided to wait until the DBs are rolled back.
  • Duplicates in the SRM user-file DB tables are still occurring - test files only
  • EMC array issues - fail-back scheduled for Tuesday 10:00–12:00 (Data Guard looks to have a day and a half left to sync up the DBs - end of Sat 11th)
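The T2K item above (files with a physical size on disk but no size recorded in the namespace) can be illustrated with a minimal cross-check sketch. This is a hypothetical example using plain dictionaries as stand-ins for the disk-server and namespace inventories; the function name and data are illustrative, not part of the CASTOR API.

```python
# Hypothetical sketch: cross-check disk-server file sizes against namespace
# records to flag files with a physical size but no namespace size.
# The inventories and paths below are illustrative, not CASTOR output.

def find_size_mismatches(disk_files, namespace_files):
    """Return paths present on disk with a non-zero physical size whose
    namespace entry is missing or records a zero size."""
    mismatches = []
    for path, physical_size in disk_files.items():
        ns_size = namespace_files.get(path)
        if physical_size > 0 and not ns_size:
            mismatches.append(path)
    return sorted(mismatches)

# Example inventories (sizes in bytes)
disk = {"/castor/t2k/run1.dat": 1024, "/castor/t2k/run2.dat": 2048}
ns = {"/castor/t2k/run1.dat": 1024, "/castor/t2k/run2.dat": 0}

print(find_size_mismatches(disk, ns))  # run2.dat has no namespace size
```

Files flagged this way would block decommissioning because their disk copies cannot be verified against the namespace before removal.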


Blocking Issues

  • GridFTP bug in SL6 - stops any Globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • A Tier 1 database cleanup is planned to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
  • Juan to further patch the CASTOR DBs (PSU patches for Pluto and Juno) - standard change ... TBC
  • CASTOR 2.1.14-14 stress testing in PreProd along with new errata
  • Tuesday 10:00 am SAN switchback
  • LHCb boxes into production (assuming the Mellanox issues are resolved)


Advanced Planning

Tasks

  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch admin machines: from lcgccvm02 to lcgcadm05
  • A new VM configured to run against the standby CASTOR database will be created as a front end for dark-data and similar queries.
  • Replace DLF with Elasticsearch
  • Correct the partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
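The partition-alignment task above comes down to simple arithmetic: a partition is aligned when its starting byte offset is an exact multiple of the RAID stripe width. The following sketch shows that check; the sector size and stripe width are assumed values for illustration, not measurements from the new disk servers.

```python
# Illustrative partition-alignment check. A misaligned partition start makes
# every stripe-sized write straddle two stripes, hurting RAID throughput.
# SECTOR_BYTES and STRIPE_BYTES are assumed values, not measured ones.

SECTOR_BYTES = 512          # assumed logical sector size
STRIPE_BYTES = 64 * 1024    # assumed 64 KiB RAID stripe width

def is_aligned(start_sector, stripe_bytes=STRIPE_BYTES):
    """True if the partition's start offset falls on a stripe boundary."""
    return (start_sector * SECTOR_BYTES) % stripe_bytes == 0

print(is_aligned(2048))   # 2048 * 512 B = 1 MiB boundary: aligned
print(is_aligned(63))     # legacy CHS start at sector 63: misaligned
```

Starting partitions at a 1 MiB boundary (sector 2048 with 512-byte sectors) is a common convention precisely because 1 MiB is a multiple of typical stripe widths.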


Interventions

  • CASTOR problems with the cache on an EMC disk array - ATLAS were worst affected. Our position was improved by moving the ATLAS/Gen stager/SRM to standby on Monday 6th Oct; moving back on Tuesday 14th at 10am. Improved monitoring and the speed of EMC's response are being looked into.

Staffing

  • CASTOR on-call person
    • Chris


  • Staff absence/out of the office:
    • All castor team in