RAL Tier1 weekly operations castor 17/06/2016

From GridPP Wiki

Minutes from the previous meeting

Operation problems

There was an exceedingly high number of SRM requests from the t2k VO, which resulted in repeated time-outs on their end (RT #172486). It also resulted in a large backlog of tape recalls, possibly because they were requesting many small files spread across many tapes.

Kevin reported that many files were stuck in the STAGEIN status on the facilities stager - resolved by RA

xrootd certificate expired on another machine

ATLAS had a series of failures around this Friday morning, as suggested by a peak in the staged waiting time around then.

Long-term projects

The CASTOR 2.1.15 upgrade seems to work apart from the part that deals with the SRM reads

Staffing

CP on call next week

GP/RA/BC/GV at CERN on Mon and Tue

Actions

BD and CP to request info from Diamond about the zero-sized files

RA to update the doc for xroot certificates

BD to review outstanding RT tickets on CASTOR queue

GP and BD to chase the dteam VO for the GP membership request

GP to review mailing lists he is on

GP access GOCDB

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to arrange a meeting with Bruno about the aquilon migration and the SL7 upgrade

RA SRM DB duplicates removal script is under testing

Completed actions

GP to document the alternative draining procedure on wiki

Operation problems

Hot SRM for gen mainly due to t2k transfers

Tape library problems occurred again early this week. There was an instability with the ACSLS software last night. Tim will bring up a new machine running ACSLS today.

DB resource exhaustion issues. Around 15 June there were about twice as many writes to the primary database, causing ca. 20 min of writes to the standby database. Need to keep track of DB activity over the next few weeks.

Long-term projects

RA did some debugging work on 2.1.15 at CERN and found that the SRM problem is not trivial. He will be in touch with Giuseppe about this.

CASTOR will be replaced at CERN by 2022. Need to consider what will happen at RAL.

Staffing

RA on annual leave next week

BD in a meeting from Mon to Thu

CP/BD will attend the Data Intensive workshop on Mon

CP on call next week

Actions

CP to request final confirmation from Diamond and do test recalls on the zero-sized files

RA to update the doc for xroot certificates

GP to ask Jens about the pending membership request to dteam VO

GP to review mailing lists he is on

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to arrange a meeting with Bruno about the aquilon migration and the SL7 upgrade

RA SRM DB duplicates removal script is under testing

Completed actions

BD to review outstanding RT tickets on CASTOR queue

GP access GOCDB