RAL Tier1 weekly operations castor 02/6/2017
Latest revision as of 07:47, 8 June 2017
Draft agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
1. SL7 upgrade on tape servers
2. SRM upgrade to SL6/CASTOR 2.1.16
3. SL5 elimination from CASTOR functional test boxes and tape verification server
4. CASTOR stress test improvement
5. Special topics
1. Future CASTOR upgrade methodology
6. Actions
7. Anything for CASTOR-Fabric?
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
Gen VOs failing tests on ARGO because OPS VO was missing from srm2_storage.conf RT189820
ALICE problems accessing data on CASTOR RT189724
gdss658 called out with fsprobe errors and was removed from production RT189789
Many PrepareToPut requests timed out early on Friday morning while waiting for a response from the stager, taking more than 20 each. Miguel reported that "Last night (Fri), between 1:30 am and 7:00 am, there were several alerts of blocking sessions (CMS STAGER)"
Operation news
The LHCb xrootd manager is now running on the stager head node to get around the problem with the wrong TURL returned by the SRM
Gen stager and SRMs were upgraded to 2.1.16
Plans for next week
Long-term projects
CIP migration to aquilon and upgrade to SL6
SL6 upgrade on functional test boxes and tape verification server: some more aquilon features were added
Tape-server migration to aquilon and SL7 upgrade: resumed work on this; re-factoring and re-compiling
CASTOR stress test improvement
Actions
GP to check the rate of TURL requests from LHCb
DB hardware upgrade tracking
Drain and decommission/recommission the 12 generation disk servers
RA to get a new source control management system sorted for CASTOR script development
GP to prepare a report on the performance of the WAN parameters deployed on CMS disk servers