RAL Tier1 weekly operations castor 18/01/2019

From GridPP Wiki


Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * None!

Operation news

  * New facd0t1 disk servers
     * Ready to go into production from a CASTOR perspective
     * Final deployment waiting on Fabric team to sort out their placement in the machine room.
  * Decision to be made about adding excess C tape drives to Facilities.
     * Some have been added, but development work would be needed to add the remainder due to a shortage of servers.
     * The C drives are also going out of maintenance cover.
     * Action on Chris to discuss with Alison.
  * Removed fdsdss34-5 from diamondRecall and sent for decommissioning
  * Kernel patching for CASTOR standbys on Tuesday 15th Jan 
  * RA met DM and GS to discuss storage metrics. Noted that we need to revise which source metric to use and take a broader look at this.

Plans for next few weeks

  * Neptune and Pluto Oracle patching already done, kernel patching date TBD (~ 2 weeks)
  * Oracle/kernel patching for CASTOR Facilities DB (22nd Jan)
      * This will not require a downtime.
  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with a single CASTOR partition; this worked.

Long-term projects

  * New CASTOR WLCGTape instance. Remaining task: create a separate xrootd redirector for ALICE.
     * Stress testing did not work, probably due to ALICE's authentication system.
     * GridPP security officer happy with this.
     * We have determined that using ALICE's auth system at RAL is too complicated, so we will waive the stress test on the ALICE xrootd redirector node. ALICE will run some external tests.
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile, but there are problems with the SL7 installation (RT216885).
  * Need to discuss monitoring/metrics for Aquilonised disk servers.
  * Various bits of monitoring/stats are broken for wlcgTape, investigate further.
      * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this.


  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs. There was some discussion about what exactly is required and how it can actually be implemented.
     * CASTOR team proposal: switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted.
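The fileclass switch above could be scripted across the affected directories. A minimal sketch, assuming the CASTOR name-server command `nschclass` (which changes a directory's fileclass) is available on the node, and using a hypothetical fileclass name `d1t0-frozen` and an illustrative namespace path; the script builds the command list in dry-run mode so it can be reviewed before anything is executed:

```python
import subprocess


def freeze_dirs(dirs, fileclass="d1t0-frozen", dry_run=True):
    """Switch each namespace directory to a fileclass that requires a
    tape copy but has no migration route, so any write attempt fails.

    The fileclass name is a placeholder; replace it with whatever class
    is defined on the instance. With dry_run=True no command is run,
    only the command list is returned for review.
    """
    cmds = [["nschclass", fileclass, d] for d in dirs]
    if not dry_run:
        for cmd in cmds:
            # Requires the CASTOR client tools on the host.
            subprocess.run(cmd, check=True)
    return cmds


# Example (dry run): inspect the commands before running for real.
# The path below is illustrative, not a real production directory.
cmds = freeze_dirs(["/castor/ads.rl.ac.uk/prod/example/d1t0"])
```

Keeping the dry-run default means the list of directories can be signed off before the change is applied instance-wide.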


  * CP WfH Monday
  * RA out for the week after next

On Call

  * GP for the next two weeks