Difference between revisions of "RAL Tier1 weekly operations castor 04/01/2019"

From GridPP Wiki

Latest revision as of 11:01, 4 January 2019

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * gdss739 failed and was removed from production; it is now back in read-only mode.
  * xrootd stopped on WLCGTape disk servers after host cert renewal; the certificate installation script needs to be modified.
  * Current theory is that the WLCG tape accounting problem is because it's using the scponly shell. John is sceptical.

Operation news

  * fdsdss54 (one of the Facilities d0t1 servers) is installed and attached to preprod for testing. Needs data partitions to be mounted and Fabric 'Dell Hardware Looksie' script to be run.
  * Ongoing discussions about adding excess C tape drives to Facilities.

Plans for next few weeks

  * Oracle/kernel patching for CASTOR Facilities DB (23rd Jan)
  * Decommission fdsdss34 and fdsdss35 - replaced by fdsdss51-3. Question for Diamond: Are three nodes sufficient? Answer: No problems currently observed.
  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. Try installing one this way, adding it to CASTOR preprod and testing.

Long-term projects

  * New CASTOR WLCGTape instance. Things that need doing: create a separate xrootd redirector for ALICE.
     * Prototype created and completed a basic functional test. 
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile,
    but there are problems with the SL7 installation (RT216885).
  * Need to discuss monitoring/metrics for Aquilonised disk servers.
  * Various bits of monitoring/stats are broken for wlcgTape, investigate further.
      * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that magically produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this.
 

Actions

  * Ganglia/disk accounting system broke before Christmas when the databases moved from Frost to dbssql04. John has fixed the immediate problem but more may arise. Given CASTOR d1t0 is going away anyway, open questions: Can we retire this and do we need it for Echo?
  * AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs. Some discussion about what exactly is required and how this can actually be implemented.

Staffing

  * Everyone in.

On Call

  * GP