RAL Tier1 weekly operations castor 22/03/2019

From GridPP Wiki

Latest revision as of 10:48, 22 March 2019

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • New facd0t1 disk servers
    • All new facd0t1 disk servers are in production and working without issues
    • The old servers will be retired next.
  • Facilities headnodes requested on VMware; ticket not yet completed.
    • Willing to accept delays on this until ~May.
    • Queued behind new disk, tape robot and a number of Diamond ICAT tasks.
  • Acceptance testing of the new tape robot completed
    • New-style tape server installation ongoing.
    • Tape library ready for CASTOR-side testing
  • Aquilon disk servers ready to go, also queued behind tape robot.

Operation problems

  • ATLAS are periodically submitting silly SAM tests that impact availability and cause pointless callouts.
    • Rob has created a ticket with Tim; work is in progress.
  • CASTOR metric reporting for GridPP.
    • Looking for clarity on precisely which metrics are relevant and, given CASTOR's changed role, which system RA should report on.

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
  • CASTOR side tape robot testing.

Long-term projects

  • New CASTOR WLCGTape instance.
    • LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
  • CASTOR disk server migration to Aquilon.
    • Change ready to implement.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1.

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can be actually implemented.
    • CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
  • RA to look at making all fileclasses have nbcopies >= 1.
  • Problem with the functional test node using a personal proxy, which expires some time in July.
    • Rob met with Jens, requested an appropriate certificate.

Staffing

  • RA out for the next two weeks: at HEPiX next week, on A/L the week after.

AoB

On Call

GP on call