RAL Tier1 weekly operations castor 03/06/2016


Agenda:

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

- 2.1.15

- Progress

- Planning

5. Special topics

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB


Minutes from previous meeting

Operations news

New tape pools created for LHCb and CMS

Draining of GDSS680 (atlas strip), with 30k files per partition on the server, worked


Operations problems

Tape robot - ran OK overnight (26-27/5). Currently using backup controller server. Oracle engineer due 11:00 27/5.

GDSS635 - atlas tape ... slight confusion: there were staged files on the server (not canbemigrs) when the filesystem was rebuilt

40 files in atlas scratch had zero size in the CASTOR namespace; BD declared them lost to Atlas

Atlas is seeing failing transfers because the file size and checksum that Rucio held were different.

SNO+ GGUS ticket has been outstanding for some time


Planned, Scheduled and Cancelled Interventions

CASTOR 2.1.15


Long-term projects

SL6 to SL7 upgrade on all CASTOR tape servers


Staffing

Bank Holiday on Monday

RA out Fri 3rd

CP on call from Tuesday


Actions

GP needs to review the mailing lists he is on and check whether he can access GGUS

GP to discuss with the DB team including file size in Nameserver dumps. The goal is to identify zero-length files.

GP to review outstanding SNO+ GGUS ticket

GP and BD to chase the dteam VO for the GP membership request

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to talk to Andrew Lahiff about an SL7 upgrade on the worker nodes using Aquilon.

SRM DB duplicates removal script is under testing


Completed actions

BD and RA to test the newly created tape families for ATLAS today


Operations news

Gareth will review the situation with the tape robot and libraries, perform safety checks and circulate an update email

GS will ask Kashif about RAID firmware updates on d0t1 v2011 machines and whether there are other batches of machines that should be upgraded

Operations problems

Two disk servers, gdss698 and gdss718, went out of production and were brought back again

Bad xrootd certificate on gen lsf node (fixed)

Planned/Scheduled/Cancelled Interventions

Completed draining of gdss680 - investigating remaining files

Draining of gdss703 in progress

Long-term projects

Further progress has been made with the CASTOR 2.1.15 upgrade

Staffing

CP on call next week

Actions

BD and CP to find out about zero-sized files on CASTOR facilities

RA to check out the doc for xroot certificates

BD to review outstanding RT tickets on CASTOR queue

GP and BD to chase the dteam VO for the GP membership request

GP to review mailing lists he is on – with Rob?

GP access GOCDB

GP to document the alternative draining procedure on wiki

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to talk to Andrew Lahiff about an SL7 upgrade on the worker nodes using Aquilon.

RA: SRM DB duplicates removal script is under testing

Completed actions

GP to review outstanding SNO+ GGUS ticket

GP access GGUS

GP to discuss with the DB team including file size in Nameserver dumps. The goal is to identify zero-length files.