RAL Tier1 weekly operations castor 30/11/2015

Operations News

  • LHCb batch jobs failing to copy results into CASTOR - the changes made seem to have improved the situation but not fixed it (Raja). Increasing the number of connections to the NS DB (more threads).
  • V09 disk machines - decommissioned but still running, GS turned off 25 this week
  • 2.1.15-20 now on vcert


  • SRMdb dups script created
  • Errata applied on production and all other systems
  • 11.2.0.4 clients rolled out on preprod
  • RA, SdW, GTF and AS have been to CERN for a CASTOR face-to-face meeting
  • Attempted fix for LHCb's intermittent stage request failures - increase 'Stage Request' thread count from 4 to 12.
  • BD looking at porting persistent tests to Ceph

Operations Problems

  • SL6 CV11 - no issues seen this week
  • The 3 disk servers that were out have all gone back in - CPUs replaced


  • SL6 upgrade, CV11 problems – systems have been stable this week. RA has done some stress testing, but the results were inconclusive in that they did not produce a comparable result, so it may simply not be possible to simulate the behaviour on the pre-production system. RA – decision made to continue with the upgrade to SL6 in any case. ACTIONS:
    • RA to ensure the procedure for dealing with any recurrence of this issue is documented for on-call personnel.
    • GS to ensure this is mentioned at the on-call meeting, and to check with CW and KM what they are testing on the machines which have had the fault, as we have seen them returned to production and then showing the fault again.
  • Disk server name lookup issue (CV11s) - more a system issue than a CASTOR one. CV11 upgrades are on hold until this is understood.
  • NS dump failed because the backup times changed - backups are now more frequent (Brian)
  • RA has developed an automatic script to clean up the usual case of these.
  • stager_qry has been running slowly for ATLAS
  • The failed CV11 nodes from atlasTape have been fixed and returned to prod. We are looking into our procedures to see if acceptance testing should be improved.
  • LHCb ticket re. a low level of failures, likely to be failures in nsmkdir. Look for timeouts in the name server logs to prove where the timeout is (a minimal log-scan sketch follows this list). Mitigation may be to increase the thread count on the NS; be a little careful with this one, as vastly increasing it could exceed the connection limit to the DB (the number of NS daemons is quite high).
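
The ticket above suggests looking for timeouts in the name server logs. Below is a minimal sketch of that kind of log scan in Python; the /var/log/castor path and the plain 'timeout' keyword match are assumptions for illustration, not the actual CASTOR NS log location or message format.

 #!/usr/bin/env python
 # Hypothetical helper: count timeout-related lines in name server logs.
 # The log glob and the 'timeout' keyword are assumptions; adjust to the
 # real CASTOR NS log location and message format before use.
 import glob
 import re

 LOG_GLOB = "/var/log/castor/nsd*.log"   # assumed location
 PATTERN = re.compile(r"timeout", re.IGNORECASE)

 def count_timeouts(log_glob=LOG_GLOB):
     hits = {}
     for path in glob.glob(log_glob):
         count = 0
         f = open(path)
         for line in f:
             if PATTERN.search(line):
                 count += 1
         f.close()
         hits[path] = count
     return hits

 if __name__ == "__main__":
     results = count_timeouts()
     for path in sorted(results):
         print "%6d timeout lines in %s" % (results[path], path)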

Blocking Issues

Planned, Scheduled and Cancelled Interventions

  • 9th Dec - very short network break for CASTOR headnodes due to removal of an old core network switch.


  • 2.1.15 Tape - completion of upgrade next week (Tim)
  • network change to reconnect to the Atlas building - Wed 09:30, short break to the standby DBs
  • network change affecting CASTOR headnodes - to be scheduled (GS)
  • Tape backed nodes to SL6 - scheduled for Monday/Tuesday.
  • CASTOR 2.1.15
    • Initial installation on vcert underway. Currently scheduled for early next year.
  • Upgrade of Oracle clients to 11.2.0.4

Things from the CASTOR F2F

  • CASTOR can run on an arbitrary Ceph underlayer.
  • No 2.1.16 is planned any time soon.
  • SL7 test build ready soon


Long-term projects

  • RA has produced a Python script to handle the SRM DB duplication issue which is causing callouts. Running the script is a problem because the version of Python on the SRM servers is still 2.4; RA will pursue this. SdW has reviewed it and is confident that it is low risk. A sketch of this kind of clean-up follows this list.
  • JJ – GLUE 2 for CASTOR, something to do with publishing information; not sure there was a specific action associated with this
  • JS – replacing Tier1 CASTOR DB hardware. ACTION: RA/JS to discuss disk requirements
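
As a follow-up to the SRM DB duplication item above, here is a minimal sketch of the kind of duplicate-row clean-up such a script might perform, written to stay compatible with the Python 2.4 interpreter on the SRM servers and assuming cx_Oracle is available there. The connection string and the srm_requests / request_token / id names are hypothetical placeholders, not the real SRM schema; the script as deployed may work quite differently.

 #!/usr/bin/env python
 # Hypothetical sketch of an SRM DB duplicate clean-up (Python 2.4 style,
 # so no 'with' statement). Table and column names are placeholders only.
 import cx_Oracle

 DB_DSN = "srm_user/password@SRMDB"   # placeholder connect string

 def remove_duplicates():
     conn = cx_Oracle.connect(DB_DSN)
     cur = conn.cursor()
     try:
         # Keep the lowest id for each duplicated request token, delete the rest.
         cur.execute("""
             DELETE FROM srm_requests
              WHERE id NOT IN (SELECT MIN(id)
                                 FROM srm_requests
                                GROUP BY request_token)""")
         deleted = cur.rowcount
         conn.commit()
         return deleted
     finally:
         cur.close()
         conn.close()

 if __name__ == "__main__":
     print "removed %d duplicate rows" % remove_duplicates()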


Advanced Planning

Tasks

  • CASTOR 2.1.15 implementation and testing

Interventions

  • Remaining D0T1 disk servers

Staffing

  • Castor on Call person next week
    • CP
  • Staff absence/out of the office:
    • RA out from Wednesday 25th November until 7th Dec
    • BD out 30th Nov, 1st Dec
    • AS out all next week - JS in on his own


New Actions

  • BD/RA – review tickets prior to RA going on holiday
  • RA/JS vcert upgrade to 2.1.15-20
  • GS/RA to revisit the CASTOR decommissioning process in light of the production team updates to their decommissioning process
  • RA to try stopping tapeserverd mid-migration to see if it breaks. RA to check with TF for a more detailed explanation of what he did and why.
  • RA to produce a list of requirements for tape-drive dedication in future CASTOR versions, to discuss with CERN
  • RA/BD review unowned RT tickets


Existing Actions

  • SdW to revisit schema changes to improve SRM DB dup position W/C 12th Oct.
  • Production team to check srm03 regarding lack of CRL updates
  • RA to schedule no-op change to be included in quattor / persistent (CMSdisk)
  • BC to document the procedure for hardware return to CASTOR
  • SdW to modify cleanlostfiles to log to syslog so we can track its use - under testing (a minimal syslog sketch follows this list)
  • RA/JJ to look at information provider re DiRAC (reporting disk only etc)
  • GS to arrange a castor/fab/production meeting to discuss the decommissioning procedures
  • GS to investigate providing checks for /etc/noquattor on production nodes & checks for fetch-crl - ONGOING (a sketch of such a check also follows this list)
  • BD to chase AD about using the space-reporting tool we made for him
  • GS to investigate how/if we need to declare xrootd endpoints in GOCDB/BDII
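
The cleanlostfiles action above asks for its use to be logged to syslog. Below is a minimal sketch of how that hook could look using Python's standard logging.handlers.SysLogHandler; the 'cleanlostfiles' tag and the /dev/log socket are assumptions about how the real script would be wired up.

 #!/usr/bin/env python
 # Hypothetical sketch: record each cleanlostfiles invocation in syslog so
 # its use can be tracked. The tag and socket path are assumptions.
 import logging
 import logging.handlers
 import sys

 def get_syslog_logger(tag="cleanlostfiles"):
     logger = logging.getLogger(tag)
     logger.setLevel(logging.INFO)
     handler = logging.handlers.SysLogHandler(address="/dev/log")
     handler.setFormatter(logging.Formatter(tag + ": %(message)s"))
     logger.addHandler(handler)
     return logger

 if __name__ == "__main__":
     log = get_syslog_logger()
     # Record how the tool was invoked, so usage can be audited later.
     log.info("invoked with args: %s" % " ".join(sys.argv[1:]))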
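
For the /etc/noquattor and fetch-crl action above, a minimal sketch of a Nagios-style check is given below, assuming CRLs land in the standard /etc/grid-security/certificates directory; the 24-hour staleness threshold and exit codes are illustrative choices rather than agreed monitoring policy.

 #!/usr/bin/env python
 # Hypothetical check: warn if /etc/noquattor is present or the newest CRL
 # looks stale. Paths and the 24h threshold are assumptions for illustration.
 import glob
 import os
 import sys
 import time

 NOQUATTOR = "/etc/noquattor"
 CRL_GLOB = "/etc/grid-security/certificates/*.r0"   # usual fetch-crl output
 MAX_AGE = 24 * 3600   # illustrative staleness threshold (seconds)

 def main():
     problems = []
     if os.path.exists(NOQUATTOR):
         problems.append("%s present - quattor updates disabled" % NOQUATTOR)
     crls = glob.glob(CRL_GLOB)
     if not crls:
         problems.append("no CRL files found under %s" % CRL_GLOB)
     else:
         newest = max([os.path.getmtime(p) for p in crls])
         if time.time() - newest > MAX_AGE:
             problems.append("newest CRL is older than %d hours" % (MAX_AGE / 3600))
     if problems:
         print "WARNING: " + "; ".join(problems)
         return 1
     print "OK: noquattor absent and CRLs fresh"
     return 0

 if __name__ == "__main__":
     sys.exit(main())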

Completed actions