RAL Tier1 weekly operations castor 10/06/2016

From GridPP Wiki

Latest revision as of 11:58, 10 June 2016

Minutes from previous meeting

Operation news

Gareth will review the situation with the tape robot and libraries, perform safety checks and circulate an update email

GS will ask Kashif re RAID firmware updates on d0t1 v2011 machines and if there are other batches of machines that should be upgraded

Operation problems

Two disk servers, gdss698 and gdss718, went out of production and were brought back again

Bad xrootd certificate on gen lsf node (fixed)

Planned/Scheduled/Cancelled Interventions

Completed draining of gdss680 - investigating remaining files

Draining of gdss703 in progress

Long-term projects

Further progress has been made with CASTOR 2.1.15 upgrade

Staffing

CP on call next week

Actions

BD and CP to find out about zero-sized files on CASTOR facilities

RA to check out the doc for xroot certificates

BD to review outstanding RT tickets on CASTOR queue

GP and BD to chase the dteam VO for the GP membership request

GP to review mailing lists he is on – with Rob?

GP access GOCDB

GP to document the alternative draining procedure on wiki

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to talk to Andrew Lahiff about an SL7 upgrade on the worker nodes using aquilon.

RA SRM DB duplicates removal script is under testing

Completed actions

GP to review outstanding SNO+ GGUS ticket

GP access GGUS

GP to discuss with DB team to include file size in Nameserver dumps. The goal is to identify zero length files

Operation problems

There was an exceedingly high number of SRM requests from the t2k VO, which resulted in repeated time-outs on their end - #RT 172486. It also resulted in a large backlog of tape recalls, possibly because they were requesting many small files spread across many tapes.

Kevin reported that many files were stuck in the STAGEIN status on the facilities stager - resolved by RA

xrootd certificate expired on another machine

ATLAS had a series of failures this Friday morning, as suggested by a peak in the staged waiting time around then

Long-term projects

The CASTOR 2.1.15 upgrade seems to work apart from the part that deals with the SRM reads

Staffing

CP on call next week

GP/RA/BC/GV at CERN on Mon and Tue

Actions

BD and CP to request info from Diamond about the zero-sized files

RA to update the doc for xroot certificates

BD to review outstanding RT tickets on CASTOR queue

GP and BD to chase the dteam VO for the GP membership request

GP to review mailing lists he is on

GP access GOCDB

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to arrange a meeting with Bruno about the aquilon migration and the SL7 upgrade

RA SRM DB duplicates removal script is under testing

Completed Actions

GP to document the alternative draining procedure on wiki