RAL Tier1 weekly operations castor 08/02/2019

From GridPP Wiki

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  • Brief tape 'mishap' on Wednesday afternoon. The Tier 1 tape robot shut down and, because it is the master, took Facilities down with it. Fixed within a couple of hours.
    • Question about monitoring - RA to ask Tim for a follow-up explanation.
  • The misconfigured na62 file class was corrected to migrate all files to tape, and all existing files were forcibly migrated.
  • An incorrectly configured service class had prevented automatic garbage collection for a week. Deleting this service class restored GC. No data loss.

Operation news

  • Reviewed use of the 'failjobswhennospace' flag on wlcgTape and set it to 'True' for all service classes.
  • New facd0t1 disk servers
    • Ready to go into production from a CASTOR perspective
    • Decommissioning tickets created for all machines occupying the rack (rack 214, row 2) in which these disk servers will be placed
    • Final deployment waiting on the Fabric team to remove the decommissioned machines from the machine room.
  • Decision to be made about adding excess C tape drives to Facilities.
    • CP and AP agreed to revisit this topic in May.
  • We have ordered a new SpectraLogic TFinity robot, to be installed in March. Work on installing the robot will absorb much time over the next few months.
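The 'failjobswhennospace' change above could be scripted along these lines. This is a hypothetical sketch only: the command and option names (printsvcclass, modifysvcclass, --failjobswhennospace) are assumptions and should be checked against the admin CLI of the deployed CASTOR release.

```shell
# Hypothetical sketch: enable 'failjobswhennospace' on every service class
# of the wlcgTape instance. Command and option names are assumptions, not
# verified against the deployed CASTOR release.
for sc in $(printsvcclass | awk '{print $1}'); do
    modifysvcclass --name "$sc" --failjobswhennospace yes
done
```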

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
  • Replacement of Facilities CASTOR d0t1 ingest nodes.
    • Agreed to install these with only one CASTOR partition. This worked.
    • Now ready to deploy once the Fabric team are ready.
  • Migrate ALICE to wlcgTape

Long-term projects

  • New CASTOR WLCGTape instance.
    • Created new ALICE xrootd redirector VM and passed Change Control.
    • Ready to implement ALICE tape move to wlcgTape on Monday
    • Agreed that we are fine with moving the tape before aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon: the gdss742 problems appeared to be specific to that hardware type, which is not one we will use in production. Switched to gdss808 as a development node.
    • Plan to apply the new Aquilon disk server profile and see what happens.
  • Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try, follow up with JK.
  • Various bits of monitoring/stats are broken for wlcgTape, investigate further.
    • Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that magically produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this (action with production team).
    • The networking graph on the front page now works.
    • We need a clear statement from stakeholders of what they need and how we fall short of that.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.


  • AD wants us to ensure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty directories.
    • Some discussion about what exactly is required and how it can actually be implemented.
    • CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
    • JK to complete improvements on Tier 1 front page.
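The fileclass proposal above might look something like the following, assuming nschclass (the CASTOR namespace command for changing a directory's file class) is used. The class name and directory path are illustrative assumptions, not the real values.

```shell
# Sketch of the proposed namespace lock-down. Assumes a fileclass
# 'noMigration' already exists that requires one tape copy but has no
# migration route, so any write into these directories fails.
# 'noMigration' and the path below are illustrative, not real values.
nschclass noMigration /castor/ads.rl.ac.uk/prod/experiment/d1t0dir
```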


Availability

  • GP out Wednesday

On Call

GP on call