RAL Tier1 weekly operations castor 16/6/2017
Latest revision as of 16:16, 19 June 2017
Draft agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
   1. SL7 upgrade on tape servers
   2. SRM upgrade to SL6/CASTOR 2.1.16
   3. SL5 elimination from CASTOR functional test boxes and tape verification server
   4. CASTOR stress test improvement
5. Special topics
   1. Future CASTOR upgrade methodology
6. Actions
7. Anything for CASTOR-Fabric?
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
gdss732 crashed and was removed from production
srmbed on the GEN SRMs kept stopping, causing failures throughout the week. As a result, SNOplus raised a GGUS ticket (https://ggus.eu/?mode=ticket_info&ticket_id=128954). The problem may be related to a DB row contention issue seen on the evening of Tue 13/6, which Miguell attributes to a network disruption. It may be worthwhile checking the SNOplus request rate.
Operation news
RA has a fix for the memory leak seen on lcgclsf01, and it is being rolled out gradually across the Tier-1
Plans for next week
Roll out the WAN tuning params on all CASTOR disk servers on Monday 19/6
Long-term projects
CIP migration to aquilon and upgrade to SL6
SL6 upgrade on functional test boxes and tape verification server: aquilon configuration is complete for the functional test box and the tape verification server; tests for these two hosts are pending
Tape-server migration to aquilon and SL7 upgrade: resumed work on this; re-factoring and re-compiling
CASTOR stress test improvement
Actions
Ensure that Fabric is on track with the deployment of the new DB hardware
Drain and decommission/recommission the 12 generation disk servers
Staffing
RA on call