RAL Tier1 weekly operations castor 02/6/2017


Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

  1. SL7 upgrade on tape servers
  2. SRM upgrade to SL6/CASTOR 2.1.16
  3. SL5 elimination from CASTOR functional test boxes and tape verification server
  4. CASTOR stress test improvement

5. Special topics

  1. Future CASTOR upgrade methodology

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

Gen VOs failing tests on ARGO because OPS VO was missing from srm2_storage.conf RT189820

ALICE problems accessing data on CASTOR RT189724

gdss658 called out with fsprobe errors and was removed from production RT189789

Many PrepareToPut requests timed out while waiting for a response from the stager early on Friday morning, taking more than 20 each. Miguel reported that "Last night (Fri), between 1:30 am and 7:00 am, there were several alerts of blocking sessions (CMS STAGER)".

Operation news

The LHCb xrootd manager is now running on the stager head node to get around the problem with the wrong TURL returned by the SRM

Gen stager and SRMs were upgraded to 2.1.16

Plans for next week

Long-term projects

CIP migration to aquilon and upgrade to SL6

SL6 upgrade on functional test boxes and tape verification server: some more aquilon features were added

Tape-server migration to aquilon and SL7 upgrade: resumed work on this; re-factoring and re-compiling

CASTOR stress test improvement

Actions

GP to check the rate of TURL requests from LHCb

DB hardware upgrade tracking

Drain and decommission/recommission the 12 generation disk servers

RA to get a new source control management system sorted for CASTOR script development

GP to prepare a report on the performance of the WAN parameters deployed on CMS disk servers

Staffing