RAL Tier1 weekly operations castor 14/12/2018

From GridPP Wiki
Revision as of 10:52, 14 December 2018 by Rob Appleyard 7f7797b74a (Talk | contribs)

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * LHCb are full and failing tests; they know about this.
  * Ganglia/disk accounting system broke when the databases moved from Frost to dbssql04. John has fixed the immediate problem but more may arise. Given CASTOR d1t0 is going away anyway, open questions: can we retire this, and do we need it for Echo?
  * The wlcgTape account broke when the node was moved to OpenStack. There appears to be a key/auth problem; Rob and Brian are investigating. sshd.conf doesn't have the wlcgTape nodes; this needs doing in Aquilon.
  * New Facilities disk server (fdsdss51) had a bad drive and suffered degraded performance while Diamond were busy; problem resolved. Brian to refer this node to the Fabric team.
  * CMS stagerjobs on Pluto don't like spinning idle and eat CPU. Miguel proposed disabling them; agreed that this is a good idea because they aren't needed and can be restarted if required (Brian points out that this will be needed if we still need to clean the stager).

Operation news

 * LHCb are staging data for the major reprocessing of Run1 and Run2 data (4.7 PB in total) that will be carried out in 2019 (this is the cause of them being full)

Plans for next few weeks

  * Christmas
  * Oracle/kernel patching for CASTOR Facilities DB (January, precise date to be agreed with Martin)
  * Decommission fdsdss34 and fdsdss35 - replaced by fdsdss51-3. Question for Diamond: are three nodes sufficient?

Long-term projects

  * New CASTOR WLCGTape instance. Things that need doing: create a separate xrootd redirector for ALICE
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile,
    but there are problems with the SL7 installation (RT216885)

Actions

Staffing

  * RA out Monday-Wednesday next week

On Call

  * GP until Thursday.