RAL Tier1 weekly operations castor 27/05/2016

From GridPP Wiki
Revision as of 12:06, 1 June 2016 by George Patargias c592d6dd61 (Talk | contribs)



Minutes from the previous meeting

Operations news

Automated workflow for disk server deployment has been disabled

New CASTOR functional testing using xrootd will be enabled on Monday 23/5/2016

CASTOR issues

Heavy workload on the ATLAS scratch disk, resulting in almost nothing being achieved

Full recovery from the tape robot and air conditioning problems. Chris checked the status of the migration queues last weekend and on Mon 16/5

Double putStart problem on CASTOR facilities

Some work to be done on the improvement of the new draining script's logic

gdss664 was brought back to production on 18/05/2016 at ca. 15:00 following a successful rebuild

gdss727 (production D1T0 CMS disk server) reported an FSProbe error; it was removed from production and Overwatch was updated (RT 172141)

xrootd segmentation fault on atlas-xrd-proxy01. John Kelly investigated /var/log/messages and /var/log/xrootd/manager/atlas/xrootd.log.20160516 and found that the machine was busy shortly before the error. GP tried to debug the dumped core file but could not run xrootd as root.

Ongoing work on the upgrade to CASTOR 2.1.15 on preprod

GP and BD to chase the dteam VO for the GP membership request

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to talk to Andrew Lahiff about an SL7 upgrade on the worker nodes using Aquilon.

SRM DB duplicates removal script is under testing
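For context, removing duplicates from a database table generally means keeping one row per key and deleting the rest. The sketch below is purely illustrative: the real SRM database is Oracle with a different schema, so the table name, columns, and use of sqlite3 here are all hypothetical, chosen only to show the keep-oldest-delete-rest logic such a script would need.

```python
import sqlite3

# Hypothetical stand-in for an SRM file table; the real CASTOR SRM
# schema is Oracle and differs -- this only illustrates the dedup logic.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE srm_file (id INTEGER PRIMARY KEY, sfn TEXT)")
con.executemany("INSERT INTO srm_file (sfn) VALUES (?)",
                [("/castor/a",), ("/castor/a",), ("/castor/b",)])

# Delete every row whose file name (sfn) also appears on a row with a
# smaller id, i.e. keep only the oldest copy of each duplicate.
con.execute("""
    DELETE FROM srm_file
    WHERE id NOT IN (SELECT MIN(id) FROM srm_file GROUP BY sfn)
""")
remaining = [row[0] for row in con.execute("SELECT sfn FROM srm_file ORDER BY sfn")]
print(remaining)  # ['/castor/a', '/castor/b']
```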

BD and RA will test the newly created tape families for ATLAS today, Fri 20/5


Operations news

New tape pools created for LHCb and CMS

Draining of GDSS680 (atlas strip), with 30k files per partition on the server, worked


Operations problems

Tape robot - ran OK overnight (26-27/5). Currently using backup controller server. Oracle engineer due 11:00 27/5.

GDSS635 (atlas tape): slight confusion; there were staged files on the server (not canbemigrs) when the filesystem was rebuilt

40 files in ATLAS scratch had zero size in the CASTOR namespace; BD declared them lost to ATLAS

ATLAS seeing failing transfers because the file size and checksum held by Rucio were inconsistent.
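A mismatch of this kind is normally confirmed by recomputing the size and Adler-32 checksum of the stored copy and comparing them with the catalogue values. A minimal sketch of that check (the catalogue record and file contents are illustrative, not Rucio's actual API or data):

```python
import zlib

def adler32_hex(data: bytes) -> str:
    # Adler-32 checksums are conventionally recorded as 8 hex digits.
    return format(zlib.adler32(data) & 0xffffffff, "08x")

# Illustrative catalogue record -- not real Rucio output.
catalogue = {"size": 5, "adler32": adler32_hex(b"hello")}

stored = b"hello"  # the bytes actually on disk
ok = (len(stored) == catalogue["size"]
      and adler32_hex(stored) == catalogue["adler32"])
print("consistent" if ok else "mismatch: declare the file bad to the VO")
```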

SNO+ GGUS ticket has been outstanding for some time


Planned, Scheduled and Cancelled Interventions

CASTOR 2.1.15


Long-term projects

SL6 to SL7 upgrade on all CASTOR tape servers

Staffing

Bank Holiday on Monday

RA out Fri 3rd

Oncall CP from Tuesday


Actions

GP to review the mailing lists he is on and check whether he can access GGUS

GP to discuss with the DB team including the file size in Nameserver dumps. The goal is to identify zero-length files
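Once the size is present in the dump, finding zero-length entries is a simple filter. A sketch assuming a whitespace-separated dump with one entry per line (the real nameserver dump format may differ; the fields and paths below are hypothetical):

```python
# Hypothetical nameserver dump lines: "<fileid> <size> <path>".
# The real CASTOR nameserver dump format may differ; this only
# illustrates the zero-length filter proposed above.
dump = """\
101 524288 /castor/ads.rl.ac.uk/prod/atlas/file_a
102 0 /castor/ads.rl.ac.uk/prod/atlas/file_b
103 0 /castor/ads.rl.ac.uk/prod/atlas/file_c
"""

zero_length = []
for line in dump.splitlines():
    fileid, size, path = line.split(None, 2)
    if int(size) == 0:
        zero_length.append(path)

print(zero_length)  # paths of the two zero-size files
```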

GP to review outstanding SNO+ GGUS ticket

GP and BD to chase the dteam VO for the GP membership request

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to talk to Andrew Lahiff about an SL7 upgrade on the worker nodes using Aquilon.

SRM DB duplicates removal script is under testing


Completed actions

BD and RA to test the newly created tape families for ATLAS today