RAL Tier1 weekly operations castor 10/06/2016
Minutes from previous meeting
Operation news
Gareth will review the situation with the tape robot and libraries, perform safety checks and circulate an update email
GS will ask Kashif about RAID firmware updates on d0t1 v2011 machines and whether there are other batches of machines that should be upgraded
Operation problems
Two disk servers, gdss698 and gdss718, went out of production and were brought back again
Bad xrootd certificate on gen lsf node (fixed)
Planned/Scheduled/Cancelled Interventions
Completed draining of gdss680 - investigating remaining files
Draining of gdss703 in progress
Long-term projects
Further progress has been made with CASTOR 2.1.15 upgrade
Staffing
CP on call next week
Actions
BD and CP to find out about zero-sized files on CASTOR facilities
RA to check out the doc for xroot certificates
BD to review outstanding RT tickets on CASTOR queue
GP and BD to chase the dteam VO for the GP membership request
GP to review mailing lists he is on – with Rob?
GP access GOCDB
GP to document the alternative draining procedure on wiki
GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters
GP to talk to Andrew Lahiff about a SL7 upgrade on the worker nodes using aquilon.
RA SRM DB duplicates removal script is under testing
Completed actions
GP to review outstanding SNO+ GGUS ticket
GP access GGUS
GP to discuss with DB team to include file size in Nameserver dumps. The goal is to identify zero length files
Operation problems
There was an exceedingly high number of SRM requests from the t2k VO, which resulted in repeated time-outs on their end (RT #172486). It also resulted in a large backlog of tape recalls, possibly because they were requesting many small files spread across many tapes.
Kevin reported that many files were stuck in the STAGEIN status on the facilities stager - resolved by RA
xrootd certificate expired on another machine
ATLAS had a series of failures this Friday morning, as suggested by a peak in the staged waiting time around then
Long-term projects
The CASTOR 2.1.15 upgrade seems to work apart from the part that deals with the SRM reads
Staffing
CP on call next week
GP/RA/BC/GV at CERN on Mon and Tue
Actions
BD and CP to request info from Diamond about the zero-sized files
RA to update the doc for xroot certificates
BD to review outstanding RT tickets on CASTOR queue
GP and BD to chase the dteam VO for the GP membership request
GP to review mailing lists he is on
GP access GOCDB
GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters
GP to arrange a meeting with Bruno about the aquilon migration and the SL7 upgrade
RA SRM DB duplicates removal script is under testing
Completed Actions
GP to document the alternative draining procedure on wiki