Difference between revisions of "RAL Tier1 weekly operations castor 02/12/2016"

From GridPP Wiki

Latest revision as of 16:34, 8 December 2016

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

  1. Castor 2.1.15
  2. SL7 upgrade on tape servers

5. Special topics

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

puppetdev was down

gdss726 (cmsDisk) failed, showed fsprobe errors and was removed from production RT177879

gdss747 (atlasStripInput) failed and was removed from production. Two drives had to be replaced. It is currently rebuilding

SAM tests failed on both cmsDisk and cmsTape, RT177950, due to heavy load from production transfers. Fixed by restarting transfer managers on scheduler and utility nodes

Operation news

CV13 firmware upgrade has been scheduled for next week; gdss726 has been upgraded RT177723

Long-term projects

Castor 2.1.15 upgrade has been postponed until January 2017

First draft of castor tapeserver features completed and published for review. lcgcts02.gridpp.rl.ac.uk (vcert) was added to magDB and imported to aquilon.

Special topics

Remake transfer rate plots for larger files (> 0.5 GB) and covering longer time periods: these requirements have been implemented in the script. The script still needs to be modified to ignore transfers that finished on the day after they started.
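The selection described above can be sketched roughly as follows. This is not the team's actual script: the `Transfer` record, its field names, the decimal reading of 0.5 GB and the MB/s unit are all assumptions made for illustration. The sketch keeps only transfers larger than the size threshold that start and finish on the same calendar day, which excludes transfers that ran past midnight.

```python
from datetime import datetime
from typing import List, NamedTuple

# Assumption: "0.5 GB" is read as decimal gigabytes (0.5 * 10**9 bytes).
MIN_SIZE_BYTES = 0.5 * 10**9


class Transfer(NamedTuple):
    """Hypothetical transfer record; field names are illustrative only."""
    start: datetime
    end: datetime
    size_bytes: int


def keep_transfer(t: Transfer) -> bool:
    """Keep large transfers that start and finish on the same calendar day."""
    is_large = t.size_bytes > MIN_SIZE_BYTES
    same_day = t.start.date() == t.end.date()
    has_duration = t.end > t.start  # guards against division by zero below
    return is_large and same_day and has_duration


def rate_mb_per_s(t: Transfer) -> float:
    """Average transfer rate in MB/s (decimal megabytes)."""
    seconds = (t.end - t.start).total_seconds()
    return t.size_bytes / 1e6 / seconds


def select_rates(transfers: List[Transfer]) -> List[float]:
    """Rates of the transfers that pass the filter, in input order."""
    return [rate_mb_per_s(t) for t in transfers if keep_transfer(t)]
```

The resulting list of rates can then be passed to whatever plotting code the script already uses.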

Actions

Create new tape pools for dirac and update the SRM grid-map file accordingly RT160227

RA to talk to AL about merging old CMS tape pools

Start gathering tape recall stats for ATLAS RT177612

Move the "unroutable file to tape" callout to working hours

Delete empty dirs from CASTOR (prompted by BD)

Test DB upgrade to CASTOR 2.1.15

Schedule with AS a CASTOR upgrade of preprod from scratch

Consider moving puppetdev to new hardware or a VM (suggested by Kashif) RT177712