RAL Tier1 weekly operations castor 15/07/2016

From GridPP Wiki

Minutes from the previous meeting

Operation problems

High load on one of the cmsTape servers - gdss676(?)

gdss730 and gdss654 went out of production

Draining of gdss748 is complete. The server is out of CASTOR and has been handed over to the fabric team to swap drives back with gdss755.

Operation news

The tape system is now fixed and back to normal operation with all drives included. Preventive maintenance of the two robots will be carried out on a date to be agreed.

DB load was not excessive, but we need to find out why the ATLAS stager caused load peaks. This needs some focused effort; perhaps what we need to do is ensure we have enough space for the logs from the primary to the backup.

The draining script is ready

Long-term projects

Work on the 2.1.15 upgrade continues, liaising with CERN. Need to find the license under which CASTOR is distributed for the new users.

Migration to aquilon and SL7 upgrade

Staffing

GS out Tuesday

AS out Friday

RA on call (assumed, TBC)

Actions

CASTOR TEAM Durham / Leicester Dirac data - need to create separate tape pools / uid / gid

RA disk servers requiring RAID update - locate servers and plan the update with fabric

RA to decide what to do with the persistent data (for the daily test) that is still on GenScratch

RA to update the doc for xroot certificates

GP to review with RA the mailing lists he is on

GP/RA to look at the stress test results for gdss596 and evaluate the WAN tuning parameters

Complete testing of the SRM DB duplicates removal script written by RA

Operation problems

CMS external xroot test is failing

gdss619 showed hardware problems and had to be set to read-only mode for RAID verify tests.

No route to tape issues for CMS due to the way file classes are set up

A number of facilities tape drives were down

An LHCb tape containing 800 files has been physically lost. Tim is chasing this up.

Operation news

Tape system library is stable

Deployment of the Dell 2015 tape buffers has started. Three of them have been deployed to the atlasNonProd service class.

Long-term projects

Not much success with fixing the 2.1.15 installation in liaison with CERN

Migration to aquilon and SL7 upgrade. Intermediate step: configure a VM as a tape server.

Staffing

CP on call next week

RA may take some time off in lieu

GP may leave earlier on certain days

Actions

CASTOR TEAM Durham / Leicester Dirac data - need to create separate tape pools / uid / gid

RA disk servers requiring RAID update - locate servers and plan the update with fabric

RA to decide what to do with the persistent data (for the daily test) that is still on GenScratch

RA to update the doc for xroot certificates

GP to review with RA the mailing lists he is on

GP/RA to look at the stress test results for gdss596 and evaluate the WAN tuning parameters