RAL Tier1 weekly operations castor 25/01/2019


Parent Article

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * One 'slightly odd' recall on Diamond on Thursday night, investigation ongoing.

Operation news

  * New facd0t1 disk servers
     * Ready to go into production from a CASTOR perspective
     * Final deployment is waiting on the Fabric team to sort out their placement in the machine room.
  * Decision to be made about adding excess C tape drives to Facilities.
     * Some have been added, but development work would be needed to add the remainder due to a shortage of servers.
     * The C drives are also approaching the end of their maintenance cover.
     * Action on Chris to discuss with Alison.
  * Moved three more D drives into wlcgTape; wlcgTape now has 15 drives and Tier 1 has 5.
  * Neptune and Pluto Oracle kernel patching done.
  * Oracle/kernel patching for the CASTOR Facilities DB done.

Plans for next few weeks

  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. This worked.
     * Now ready to deploy.

Long-term projects

  * New CASTOR WLCGTape instance. Remaining work: create a separate xrootd redirector for ALICE (a hedged configuration sketch follows after this list).
     * The redirector worked when attached to vcert but not when attached to production. This is being worked on; possibly a CUPV issue?
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile, but there are problems with the SL7 installation (RT216885).
     * Action is with the Fabric team to sort out the SL7 installation. Possibly a hardware issue; a new, non-broken node has been requested.
  * Need to discuss monitoring/metrics for Aquilonised disk servers. GP to try this and follow up with JK.
  * Various bits of monitoring/stats are broken for wlcgTape; to be investigated further.
      * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that magically produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this (action with the production team).
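
For the ALICE redirector item above, a minimal, hedged sketch of what a stand-alone XRootD redirector (manager) plus disk-server configuration could look like; hostnames, ports, and the exported path are placeholders, and the CASTOR- and ALICE-specific pieces (token authorisation, CASTOR xrootd plugin) are deliberately omitted.

  # Hypothetical xrootd/cmsd configuration sketch for a separate ALICE redirector.
  # Hostnames, ports and the exported path are placeholders, not production values.
  all.role manager if alice-redirector.example.site
  all.role server  if gdss*.example.site
  # Disk servers subscribe to the redirector's cmsd on its default port.
  all.manager alice-redirector.example.site:1213
  # Namespace exported to clients (placeholder path).
  all.export /castor/example.site/prod/alice
  # Standard client-facing xrootd port.
  xrd.port 1094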
 

Actions

  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty directories. Some discussion about what exactly is required and how it can actually be implemented.
     * The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted (a hedged sketch follows below).
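
A minimal sketch of how this proposal could be applied with the standard CASTOR name-server command-line tools, assuming a suitable fileclass already exists; the fileclass name and the directory path are hypothetical placeholders, not agreed values.

  # Hedged sketch: point a retired d1t0 directory at a fileclass that requires
  # a tape copy but has no migration route, so writes there error out as proposed.
  # 'noMigrationClass' and the directory path are placeholders.
  nslistclass                                   # list the fileclasses known to the name server
  nschclass noMigrationClass /castor/example.site/prod/someVO/retired_d1t0_dir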

Staffing

  * RA out next week

On Call

  * GP