RAL Tier1 weekly operations castor 25/11/2016
Latest revision as of 12:16, 25 November 2016
Draft agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
   1. Castor 2.1.15
   2. SL7 upgrade on tape servers
5. Special topics
6. Actions
7. Anything for CASTOR-Fabric?
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
gdss750 (lhcbDst) failed due to fsprobe errors and was removed from production
gdss651 (preProd) is still down RT177006
Renewal of the Gen SRM host certificates is complicated by the need to include the alternative hostnames for the different VOs in each certificate. This requires talking to Jens before the certificate request is approved. Whenever a new VO is added to Gen, the certificates need to be re-issued.
Service alarm: Castor functional test lhcbUser on host lcgcadm05 177667
gdss784 appears on Ganglia under its IP address rather than its host name
Operation news
All CV14 disk servers have been deployed into full production in lhcbDst 176041
Started draining and decommissioning the CV11 disk servers in aliceDisk 176040
Long-term projects
Castor 2.1.15 upgrade has been postponed until January 2017
First draft of castor tapeserver features almost complete
Special topics
Remake transfer rate plots for larger files (> 0.5 GB) and covering longer time periods: these requirements have been implemented in the script. The script still needs to be modified to ignore transfers that finished the day after they started.
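As a rough illustration of the selection described above, here is a minimal Python sketch. It is not the actual plotting script: the `Transfer` record, its field names, and the `select_transfers` helper are hypothetical, and treating 0.5 GB as 0.5 × 10^9 bytes is an assumption. It keeps only transfers larger than 0.5 GB that started and finished on the same calendar day, which also drops anything that ran past midnight.

```python
from dataclasses import dataclass
from datetime import datetime

# 0.5 GB threshold from the notes; decimal gigabytes are an assumption.
MIN_SIZE_BYTES = 0.5 * 1000**3


@dataclass
class Transfer:
    # Hypothetical record layout; the real script's fields are not described in the notes.
    size_bytes: int
    start: datetime
    end: datetime

    @property
    def rate_mb_per_s(self) -> float:
        """Average transfer rate in MB/s over the whole transfer."""
        seconds = (self.end - self.start).total_seconds()
        return self.size_bytes / 1e6 / seconds


def select_transfers(transfers):
    """Keep transfers above the size threshold that start and end on the same day."""
    return [
        t for t in transfers
        if t.size_bytes > MIN_SIZE_BYTES
        and t.start.date() == t.end.date()  # drops transfers that cross midnight
        and t.end > t.start                 # guards against zero or negative durations
    ]
```

Comparing calendar dates is the simplest reading of "finished the day after they started"; if the intent is instead to exclude only very long transfers, a duration cut-off would be the alternative.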
Actions
Create new tape pools for dirac and update the SRM grid-map file accordingly 160227
Start gathering tape recall stats for ATLAS 177612
Present AL two alternatives to choose from: 1) Create generic fileclass/tapepool 2) Remove the "unroutable file to tape" callout to working hours
Discuss with Khash the urgency of the RAID upgrade on CV13 ds and plan the intervention
Delete empty dirs from CASTOR (prompted by BD)
Test DB upgrade to CASTOR 2.1.15
Schedule with AL a CASTOR upgrade of preprod from scratch
RA to talk to AL about merging old CMS tape pools
Staffing
GP on call next week
RA away on A/L
Miguel, the new DBA, has started