Difference between revisions of "RAL Tier1 weekly operations castor 24/06/2016"

Revision as of 12:29, 24 June 2016

Minutes from the previous meeting

Operation problems

Hot SRM on the Gen instance, mainly due to t2k transfers

Tape library problems occurred again early this week. There was an instability with the ACSLS software last night. Tim will put in place a new machine running ACSLS today

DB resource exhaustion issues. Around 15 June there were about twice as many writes to the primary database, causing ca. 20 minutes of writes to the standby database. We need to keep track of DB activity over the next few weeks

Long-term projects

RA did some debugging work on 2.1.15 at CERN and found that the SRM problem is not trivial. He will be in touch with Giuseppe about this.

CASTOR will be replaced at CERN by 2022. We need to consider what will happen at RAL

Staffing

RA on annual leave during next week

BD in a meeting from Mon to Thu

CP/BD will attend the Data Intensive workshop on Mon

CP on call next week

Actions

CP to request final confirmation from Diamond and do test recalls on the zero-sized files
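The minutes do not record how the zero-sized files are identified before the test recalls; as a purely illustrative sketch (the directory layout and function name are invented, not taken from the Diamond setup), a scan for zero-byte files might look like:

```python
import os

def find_zero_sized(root):
    """Yield paths of regular files under `root` whose size is exactly 0 bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that vanished mid-scan.
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                yield path
```

Any candidate list produced this way would still need confirmation from Diamond before recalls are attempted, as the action above says.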

RA to update the doc for xroot certificates

GP to ask Jens about the pending membership request to dteam VO

GP to review mailing lists he is on

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

GP to arrange a meeting with Bruno about the aquilon migration and the SL7 upgrade

RA: the SRM DB duplicates-removal script is under testing

Completed actions

BD to review outstanding RT tickets on CASTOR queue

GP access GOCDB

Operation news

DB load has eased off, but we do not yet understand the root cause of the earlier load

The tape library has been stable for ~40h

Operation problems

Multi-disk failure on the disk array serving the standby Neptune DB; now recovered by Fabric / Andrey

gdss743 failed and went out of production but is now operational.

gdss748 suffered RAID and motherboard hardware failures. Its data disks have been put into the chassis of gdss755 (preprod) and that server renamed to gdss748. It is currently running read-only (ATLAS can still delete). The meeting took the decision to start draining the gdss748 disks on Monday 27/6. Fabric have been informed.

Brian notes that GenScratch still has data on it (it is being decommissioned and is not accessible)

Persistent data (for the daily test) is still on GenScratch; we need to discuss what to do with it

Long-term projects

RA did some debugging work on 2.1.15 at CERN and found that the SRM problem is not trivial. He will be in touch with Giuseppe about this.

CASTOR will be replaced at CERN by 2022. We need to consider what will happen at RAL

Staffing

Chris is possibly out on Tuesday; otherwise all here

RA on call (TBC)

Actions

GP/RA to make sure we start draining gdss748

CASTOR team: Durham / Leicester DiRAC data; we need to create separate tape pools / uid / gid

RA: for the disk servers requiring a RAID update, locate the servers and plan the update with Fabric

RA to decide what to do with the persistent data (for the daily test) that is still on GenScratch

RA to update the doc for xroot certificates

GP to review mailing lists he is on

GP and BD to perform stress testing of gdss596 to evaluate the new WAN parameters

RA: the SRM DB duplicates-removal script is under testing
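The minutes do not describe how the duplicates-removal script works. As a purely illustrative sketch (the table and column names here are invented, not the real CASTOR/SRM schema), one common SQL pattern is to keep only the row with the smallest primary key for each duplicated key, shown here against an in-memory SQLite database:

```python
import sqlite3

# Toy table standing in for duplicated SRM records; 'requests' and
# 'file_id' are invented names, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY, file_id TEXT)")
conn.executemany("INSERT INTO requests (file_id) VALUES (?)",
                 [("a",), ("a",), ("b",), ("b",), ("b",), ("c",)])

# Delete every row whose file_id already appears on a row with a smaller id,
# leaving exactly one row per file_id.
conn.execute("""
    DELETE FROM requests
    WHERE id NOT IN (SELECT MIN(id) FROM requests GROUP BY file_id)
""")
remaining = [r[0] for r in conn.execute(
    "SELECT file_id FROM requests ORDER BY file_id")]
```

On a production Oracle database the real script would also need to decide which duplicate is authoritative and be tested against a standby first, which is presumably why it is still under testing.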


Completed actions

CP to request final confirmation from Diamond and do test recalls on the zero-sized files

GP to ask Jens about the pending membership request to dteam VO

GP to arrange a meeting with Bruno about the aquilon migration and the SL7 upgrade