Difference between revisions of "RAL Tier1 weekly operations castor 21/12/2018"

From GridPP Wiki
 
 
Latest revision as of 14:11, 21 December 2018

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * fdsdss51 is back in prod.
  * Current theory is that the WLCG tape accounting problem is because it's using the scponly shell. John is sceptical.
  * Ongoing issue with gdss780: it crashed and has been returned to prod. The VO needs to be informed that it's back. Kash wants to reinstall the OS, to be done after Christmas.
  * castor-stager01 failed on Wed evening causing some disruption in WLCGTape. Resolved.

Operation news

  * LHCb are staging data for the major reprocessing of Run1 and Run2 data (4.7 PB in total) that will be carried out in 2019 (this is the cause of them being full)
  * Ongoing discussions about adding tape drives to Facilities.
  * All cmsDisk and cmsTape disk servers have been decommissioned.
  * Two additional tape servers (lcgcts15 and lcgcts18) were allocated to WLCGTape. The current ratio of WLCGTape to LHCb/ALICE tape servers is 12 to 8.

Plans for next few weeks

  * Christmas
  * Oracle/kernel patching for CASTOR Facilities DB (January, precise date to be agreed with Martin)
  * Decommission fdsdss34 and fdsdss35, which have been replaced by fdsdss51-3. Question for Diamond: Are three nodes sufficient? Answer: No problems currently observed.
  * Replacement of CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. Try installing one this way, adding it to CASTOR preprod and testing.

Long-term projects

  * New CASTOR WLCGTape instance. Things that need doing: create a separate xrootd redirector for ALICE. Completed Aquilon installation this week.
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile, but there are problems with the SL7 installation (RT216885).
     * Need to discuss monitoring/metrics for Aquilonised disk servers.
     * Various bits of monitoring/stats are broken for wlcgTape, investigate further.

Actions

  * Ganglia/disk accounting system broke before Christmas when the databases were moved from Frost to dbssql04. John has fixed the immediate problem but more may arise. Given CASTOR d1t0 is going away anyway, open questions: Can we retire this, and do we need it for Echo?
  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs. Some discussion about what exactly is required and how this can actually be implemented.

Staffing

  * Everyone eating Christmas dinner

On Call

  * RA over Christmas