RAL Tier1 weekly operations castor 04/01/2019

From GridPP Wiki

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * gdss739 failed and was removed from production; it is now back in read-only mode.
  * xrootd stopped on the WLCGTape disk servers after the host certificate renewal; the certificate installation script needs to be modified.
  * The current theory is that the WLCG tape accounting problem occurs because it uses the scponly shell. John is sceptical.

Operation news

  * fdsdss54 (one of the Facilities d0t1 servers) is installed and attached to preprod for testing. The data partitions need to be mounted and the Fabric 'Dell Hardware Looksie' script needs to be run.
  * Ongoing discussions about adding excess C tape drives to Facilities.

Plans for next few weeks

  * Oracle/kernel patching for CASTOR Facilities DB (23rd Jan)
  * Decommission fdsdss34 and fdsdss35, which are replaced by fdsdss51-3. Question for Diamond: are three nodes sufficient? Answer: no problems currently observed.
  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. Try installing one this way, adding it to CASTOR preprod and testing.

Long-term projects

  * New CASTOR WLCGTape instance. Still to do: create a separate xrootd redirector for ALICE.
     * Prototype created and completed a basic functional test. 
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile, but there are problems with the SL7 installation (RT216885).
  * Need to discuss monitoring/metrics for Aquilonised disk servers.
  * Various bits of monitoring/stats are broken for WLCGTape and need further investigation.
      * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that magically produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this.
 

Actions

  * Ganglia/disk accounting system broke before Christmas when the databases moved from Frost to dbssql04. John has fixed the immediate problem but more may arise. Given CASTOR d1t0 is going away anyway, open questions: Can we retire this and do we need it for Echo?
  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs. Some discussion about what exactly is required and how it can actually be implemented.

Staffing

  * Everyone in.

On Call

  * GP