RAL Tier1 weekly operations CASTOR 01/07/2016
Contents
Minutes from the previous meeting
Operation news
DB load has eased off, but we still do not understand the root cause of the previous high load
The tape library has been stable for ~40h
Operation problems
Multi-disk failure on the disk array serving the standby Neptune DB – now recovered by Fabric / Andrey
gdss743 failed and went out of production, but is now operational.
gdss748 suffered RAID and motherboard hardware failures. Its data disks have been moved into the chassis of gdss755 (preprod), and that server has been renamed gdss748. It is currently running read-only (ATLAS can still delete). The meeting decided to start draining the gdss748 disks on Monday 27/6; Fabric have been informed.
Brian notes that GenScratch still has data on it (it is being decommissioned and is not accessible)
Persistent data (for the daily test) is still on GenScratch - need to discuss what to do with it
Long-term projects
RA did some debugging work on 2.1.15 at CERN and found that the SRM problem is not trivial. He will be in touch with Giuseppe about this.
CASTOR will be replaced at CERN by 2022. We need to consider what will happen at RAL.
Staffing
Chris possibly out Tuesday, otherwise all here
RA on call - TBC
Actions
GP/RA to make sure we start draining gdss748
CASTOR TEAM: Durham / Leicester Dirac data - need to create separate tape pools / uids / gids
RA: disk servers requiring RAID update - locate the servers and plan the update with Fabric
RA to decide what to do with the persistent data (for the daily test) still on GenScratch
RA to update the doc for xroot certificates
GP to review mailing lists he is on
GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters
RA: the SRM DB duplicates removal script is under testing
Completed actions
CP to request final confirmation from Diamond and do test recalls on the zero-sized files
GP to ask Jens about the pending membership request to dteam VO
GP to arrange a meeting with Bruno about the Aquilon migration and the SL7 upgrade
Operation news
The DB load is high but not excessive. Thresholds were increased.
The tape library is stable, although the exact cause of the problem is still not clear. Oracle is working on the problem in the US. The stable -> unstable -> stable cycle was completed successfully in a controlled manner.
Removal of GenScratch: need to find new locations for the persistent test files
Operation problems
The ATLAS transfer manager stopped working; CASTOR reported that too many files were open. RA fixed the problem by raising the open-file limit for the process. We will see whether the DB load has any relation to this problem.
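The open-file-limit fix above can be sketched as follows; the daemon name and limit values in the comments are illustrative assumptions, not the values actually applied to the transfer manager.

```shell
# Show the current soft and hard open-file limits for this shell
ulimit -Sn
ulimit -Hn

# The effective limit of an already-running process can be read from procfs
# (replace "self" with the daemon's PID)
grep "open files" /proc/self/limits

# A persistent raise would go in /etc/security/limits.conf, e.g. (illustrative
# user and values only):
#   stagerd  soft  nofile  65536
#   stagerd  hard  nofile  65536
```

The daemon must be restarted after such a change for the new limit to take effect.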
The CIP reports zero tape usage for Dirac. A fix from Rob and Jens is underway.
Long-term projects
Work on the 2.1.15 upgrade continues in liaison with CERN. We need to find out under which license CASTOR is distributed for the new users.
GP and Bruno had a preliminary discussion about Aquilon and the SL7 upgrade on the tape servers. Bruno is happy to oversee this project.
Staffing
RA is off next week starting from late Tuesday
CP is on call
Actions
CASTOR TEAM: Durham / Leicester Dirac data - need to create separate tape pools / uids / gids
RA: disk servers requiring RAID update - locate the servers and plan the update with Fabric
RA to decide what to do with the persistent data (for the daily test) still on GenScratch
RA to update the doc for xroot certificates
GP to review mailing lists he is on
Complete testing of the SRM DB duplicates removal script written by RA
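As an illustration of the kind of logic a DB duplicates removal script typically contains (the real SRM schema is not given in these minutes; the table and column names below are invented for the sketch, and SQLite stands in for the production database):

```python
import sqlite3

# Build a toy table with a duplicate row. "requests" and "file_name" are
# placeholder names, not the actual SRM schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY, file_name TEXT)")
conn.executemany("INSERT INTO requests (file_name) VALUES (?)",
                 [("a",), ("a",), ("b",)])

# Keep the lowest id per logical key and delete every other copy.
conn.execute("""
    DELETE FROM requests
    WHERE id NOT IN (SELECT MIN(id) FROM requests GROUP BY file_name)
""")

remaining = conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]
print(remaining)  # → 2
```

Testing such a script against a copy of the data before running it on the live DB, as is being done here, guards against the DELETE matching more rows than intended.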
Completed actions
GP/RA to make sure we start draining gdss748
GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters