RAL Tier1 weekly operations castor 11/01/2019

From GridPP Wiki

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * gdss804 failed but is now back in production.
  * gdss739 has been returned to production.

Operation news

  * New facd0t1 disk servers
     * fdsdss54 has been installed and attached to preprod for testing.
     * This generation has been configured with a single data partition.
     * They may need to be re-racked to clear space for the new tape robot.
     * fdsdss55 and 56 have been handed over to the CASTOR team.
  * Funding decision to be made about adding excess C tape drives to Facilities.
  * Fixed the WLCG tape accounting system.

Plans for next few weeks

  * Kernel patching for CASTOR standbys on Tuesday 15th Jan 
  * Neptune and Pluto Oracle patching already done, kernel patching date TBD (~ 2 weeks)
  * Eris Oracle patching already done, kernel patching done in October but it runs a different OS (Oracle Linux 7)
  * Oracle/kernel patching for CASTOR Facilities DB (23rd Jan)
  * Decommission fdsdss34 and fdsdss35 - replaced by fdsdss51-3.
     * Chris to confirm that he's happy with fdsdss51-3.
  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. Try installing one this way, adding it to CASTOR preprod and testing.

Long-term projects

  * New CASTOR WLCGTape instance. Still to do: create a separate xrootd redirector for ALICE.
     * Stress testing did not work, probably due to ALICE's authentication system.
     * GridPP security officer to do a once-over before opening external access, then ask ALICE experts to do a test.
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile, but there are problems with the SL7 installation (RT216885).
  * Need to discuss monitoring/metrics for Aquilonised disk servers.
  * Various bits of monitoring/stats are broken for wlcgTape; to be investigated further.
     * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that magically produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this.
 

Actions

  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data (namespace cleanup/deletion of empty dirs). There was some discussion about what exactly is required and how it can actually be implemented.
     * CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
  * RA to set up a meeting with DM and GS to discuss storage metrics.

Staffing

  * RA out on Monday.

On Call

  * RA