RAL Tier1 weekly operations castor 08/07/2016

Minutes from the previous meeting

Operation news

DB load has eased off, but we still do not understand the root cause of the previous load.

The tape library has been stable for ~40h

Operation problems

Multi-disk failure on the disk array serving the standby Neptune DB – now recovered by fabric / Andrey

gdss743 failed and went out of production but is now operational.

gdss748 suffered RAID and motherboard hardware failures. Its data disks have been put into the chassis of gdss755 (preprod) and the server renamed to gdss748. It is currently running as read-only (ATLAS can still delete). The meeting took the decision to start draining the gdss748 disks on Monday 27/6. Fabric have been informed.

Brian notes that GenScratch still has data on it (it is being decommissioned and is not accessible)

DB load has eased off, but we still do not understand the root cause of the previous load.

Persistent data (for the daily test) is still on GenScratch - need to discuss what to do with it

Long-term projects

RA did some debugging work on 2.1.15 at CERN and found that the SRM problem is not trivial. He will be in touch with Giuseppe about this.

CASTOR will be replaced at CERN by 2022. We need to consider what will happen at RAL.

Staffing

Chris possibly out Tuesday, otherwise all here

RA on call - TBC

Actions

GP/RA to make sure we start draining gdss748

CASTOR team: Durham / Leicester DiRAC data - need to create separate tape pools / uid / gid

RA: disk servers requiring a RAID update - locate the servers and plan the update with fabric

RA: decide what to do with the persistent data (for the daily test) that is still on GenScratch

RA to update the doc for xroot certificates

GP to review mailing lists he is on

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

RA: the SRM DB duplicates removal script is under testing (see the sketch below)
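
For reference, a minimal sketch of the general approach such a script can take - flag rows whose key appears more than once via GROUP BY / HAVING - is given below. This is not RA's actual script; the table and column names are hypothetical and the real SRM schema will differ.

 # Minimal sketch of a duplicate check, not RA's actual removal script.
 # Table and column names below are hypothetical; the real SRM schema differs.
 DUPLICATE_QUERY = """
     SELECT file_name, COUNT(*) AS copies
     FROM   srm_requests            -- hypothetical table name
     GROUP  BY file_name
     HAVING COUNT(*) > 1
 """
 
 def find_duplicates(conn):
     """Return (file_name, copies) rows for entries that appear more than once."""
     cur = conn.cursor()            # any DB-API 2.0 connection, e.g. cx_Oracle
     cur.execute(DUPLICATE_QUERY)
     return cur.fetchall()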

Completed actions

CP to request final confirmation from Diamond and do test recalls on the zero-sized files

GP to ask Jens about the pending membership request to dteam VO

GP to arrange a meeting with Bruno about the Aquilon migration and the SL7 upgrade

Operation problems

High load on one of the cmsTape servers - gdss676(?)

gdss730 and gdss654 went out of production

Draining of gdss748 is complete. The server is out of CASTOR and has been handed over to the fabric team to swap the drives back with gdss755

Operation news

The tape system is now fixed and is back to normal operation with all drives included. Preventive maintenance of the two robots will be carried out on a date to be agreed.

DB load was not excessive, but we need to find out why the ATLAS stager caused load peaks. This needs some focused effort; perhaps what we need to do is ensure we have enough space for the logs shipped from the primary to the backup.
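
As a starting point for that check, the sketch below reports free space on the volume that holds the shipped logs and flags when it drops below a margin. The path and threshold are assumptions for illustration only, not the actual database configuration.

 # Sketch: warn when the area holding logs shipped from the primary to the
 # backup is low on space. LOG_AREA and MIN_FREE_GB are assumed values.
 import shutil
 import sys
 
 LOG_AREA = "/var/lib/db/shipped_logs"   # hypothetical log location
 MIN_FREE_GB = 50                        # hypothetical safety margin
 
 def enough_space(path, min_free_gb):
     """Return True if the filesystem holding `path` has enough free space."""
     usage = shutil.disk_usage(path)
     free_gb = usage.free / 1024**3
     print("%s: %.1f GB free of %.1f GB" % (path, free_gb, usage.total / 1024**3))
     return free_gb >= min_free_gb
 
 if __name__ == "__main__":
     sys.exit(0 if enough_space(LOG_AREA, MIN_FREE_GB) else 1)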

The draining script is ready

Long-term projects

Work on the 2.1.15 upgrade continues, liaising with CERN. We need to find out under which license CASTOR is distributed for the new users.

Migration to Aquilon and SL7 upgrade

Staffing

GS out Tuesday

AS out Friday

RA on call (assumed, TBC)

Actions

CASTOR team: Durham / Leicester DiRAC data - need to create separate tape pools / uid / gid

RA: disk servers requiring a RAID update - locate the servers and plan the update with fabric

RA: decide what to do with the persistent data (for the daily test) that is still on GenScratch

RA to update the doc for xroot certificates

GP to review with RA the mailing lists he is on

GP/RA to look at the stress test results for gdss596 and evaluate the WAN tuning parameters (a sketch for recording the TCP settings in force follows these actions)

Complete testing of the SRM DB duplicates removal script written by RA
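
To help compare the gdss596 stress test results against the tuning in place, a minimal sketch for recording the kernel TCP settings in force on the server during a run is given below. The parameter list is an assumed, typical set of WAN tuning knobs; the actual parameters changed on gdss596 are not recorded in these minutes.

 # Sketch: snapshot common TCP/WAN tuning parameters on a disk server so that
 # stress test results can be compared against the settings in force.
 # The parameter list is an assumed, typical set, not the gdss596 change list.
 from pathlib import Path
 
 TCP_PARAMS = [
     "net.ipv4.tcp_rmem",
     "net.ipv4.tcp_wmem",
     "net.core.rmem_max",
     "net.core.wmem_max",
     "net.ipv4.tcp_congestion_control",
 ]
 
 def read_sysctl(name):
     """Read a sysctl value from /proc/sys (dots map to path components)."""
     return Path("/proc/sys", *name.split(".")).read_text().strip()
 
 if __name__ == "__main__":
     for param in TCP_PARAMS:
         try:
             print("%s = %s" % (param, read_sysctl(param)))
         except OSError as err:
             print("%s: unreadable (%s)" % (param, err))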