RAL Tier1 weekly operations castor 21/12/2018

From GridPP Wiki
Revision as of 11:45, 21 December 2018 by Rob Appleyard


Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * fdsdss51 is back in prod.
  * The current theory is that the WLCG tape accounting problem is caused by its use of the scponly shell. John is sceptical.
  * Ongoing issue with gdss789 - crashed, returned to prod. VO need to be informed that it's back. Kash wants to reinstall the OS, to be done after Christmas.

Operation news

  * LHCb are staging data for the major reprocessing of Run1 and Run2 data (4.7 PB in total) that will be carried out in 2019. This is why they are full.
  * Ongoing discussions about adding tape drives to Facilities.

Plans for next few weeks

  * Christmas
  * Oracle/kernel patching for CASTOR Facilities DB (January, precise date to be agreed with Martin)
  * Decommission fdsdss34 and fdsdss35 - replaced by fdsdss51-3. Question for Diamond: Are three nodes sufficient? Answer: No problems currently observed.
  * Replacement of CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. Try installing one this way, adding it to CASTOR preprod and testing.

Long-term projects

  * New CASTOR WLCGTape instance. Things that need doing: Create a separate xrootd redirector for ALICE.
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile, but there are problems with the SL7 installation (RT216885).
     * Need to discuss monitoring/metrics for Aquilonised disk servers.
     * Various bits of monitoring/stats are broken for WLCGTape; investigate further.

Actions

  * Ganglia/disk accounting system broke before Christmas when the databases moved from Frost to dbssql04. John has fixed the immediate problem, but more may arise. Given that CASTOR d1t0 is going away anyway, open questions: Can we retire this, and do we need it for Echo?

Staffing

  * Everyone eating Christmas dinner

On Call

  * RA over Christmas