Difference between revisions of "RAL Tier1 weekly operations castor 16/6/2017"

From GridPP Wiki

Latest revision as of 16:16, 19 June 2017

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

  1. SL7 upgrade on tape servers
  2. SRM upgrade to SL6/CASTOR 2.1.16
  3. SL5 elimination from CASTOR functional test boxes and tape verification server
  4. CASTOR stress test improvement

5. Special topics

  1. Future CASTOR upgrade methodology

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

gdss732 crashed and was removed from production

srmbed on the GEN SRMs kept stopping, causing failures throughout the week. As a result, SNOplus raised a GGUS ticket. The problem may be related to a DB row contention issue seen on the evening of Tue 13/6, which Miguell attributes to a network disruption. It may be worthwhile checking the SNOplus request rate.

Operation news

RA has a fix for the memory leak seen on lcgclsf01, and it is being rolled out gradually across the Tier-1

Plans for next week

Roll out the WAN tuning parameters on all CASTOR disk servers on Monday 19/6

Long-term projects

CIP migration to aquilon and upgrade to SL6

SL6 upgrade on functional test boxes and tape verification server: the aquilon configuration is complete for the functional test box and the tape verification server; tests on these two hosts are pending

Tape-server migration to aquilon and SL7 upgrade: resumed work on this; re-factoring and re-compiling

CASTOR stress test improvement

Actions

Ensure that Fabric is on track with the deployment of the new DB hardware

Drain and decommission/recommission the 12 generation disk servers

Staffing

RA on call