Difference between revisions of "RAL Tier1 weekly operations castor 16/6/2017"

From GridPP Wiki

Revision as of 14:29, 16 June 2017

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

  1. SL7 upgrade on tape servers
  2. SRM upgrade to SL6/CASTOR 2.1.16
  3. SL5 elimination from CASTOR functional test boxes and tape verification server
  4. CASTOR stress test improvement

5. Special topics

  1. Future CASTOR upgrade methodology

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

gdss732 crashed and was removed from production

srmbed on the GEN SRMs kept stopping, causing failures throughout the week. As a result, SNOplus raised a GGUS ticket. The problem may be related to the DB row contention issue seen on the evening of Tue 13/6, which Miguell attributes to a network disruption

Operation news

RA has a fix for the memory leak seen on lcgclsf01, and it is being rolled out gradually across the Tier-1

Plans for next week

Roll out the WAN tuning params on all CASTOR disk servers on Monday 19/6

Long-term projects

CIP migration to aquilon and upgrade to SL6

SL6 upgrade on functional test boxes and tape verification server: aquilon configuration is complete for the functional test box and the tape verification server; tests for these two hosts are pending

Tape-server migration to aquilon and SL7 upgrade: resumed work on this; re-factoring and re-compiling

CASTOR stress test improvement

Actions

Ensure that Fabric is on track with the deployment of the new DB hardware

Drain and decommission/recommission the 12 generation disk servers

Staffing

RA on call