RAL Tier1 weekly operations castor 01/02/2019

Latest revision as of 17:53, 1 February 2019

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  * The recurring issue of CASTOR marking unmounted tapes as BUSY appeared again. A migration backlog
    built up but cleared soon after Tim removed the BUSY flag
  * gdss739 and gdss811 (lhcbDst) failed and were removed from production.
  * na62 files (~196,000) associated with the disk-only file class (na62-tape0) were found in the WLCGTape cache
  * GC on WLCGTape is not working properly and as a result the tape buffer ran out of space.
    Had to initiate deletion of 581,587 ATLAS files (elog: https://elog.gridpp.rl.ac.uk/Tier1/7140)
    as a temporary mitigation; a sketch of how such a bulk cleanup might be scripted is below.
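
Purely as an illustration (not a record of what was actually run), a minimal Python sketch of how a bulk disk-copy cleanup like this could be scripted, assuming the standard CASTOR stager_rm client with its usual -M (file) and -S (service class) options; the input file name and service class name are hypothetical:

    import subprocess

    # Hypothetical input: one CASTOR path per line (e.g. taken from an nsls listing).
    FILE_LIST = "atlas_tape_buffer_cleanup.txt"
    SVC_CLASS = "wlcgTape"   # hypothetical service class name

    with open(FILE_LIST) as f:
        paths = [line.strip() for line in f if line.strip()]

    # Work in batches so one bad path does not abort the whole run.
    BATCH = 500
    for i in range(0, len(paths), BATCH):
        cmd = ["stager_rm", "-S", SVC_CLASS]
        for p in paths[i:i + BATCH]:
            cmd += ["-M", p]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"batch starting at {i} failed: {result.stderr.strip()}")

If the mitigation was done this way, stager_rm would only invalidate the disk copies known to the stager; namespace entries and any tape copies are left untouched, which fits a temporary workaround while GC is broken.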

Operation news

  * New facd0t1 disk servers
     * Ready to go into production from a CASTOR perspective
     * Decommissioning tickets created for all machines occupying the rack (rack 214, row 2) in which 
       these disk servers will be placed 
     * Final deployment is waiting on the Fabric team to remove the decommissioned machines from the machine room.
  * Decision to be made about adding excess C tape drives to Facilities.
     * Some have been added, but development work would be needed to add the remainder due to a shortage of servers.
     * The C drives are also coming to the end of their maintenance cover.
     * Action on Chris to discuss with Alison.

Plans for next few weeks

  * Replacement of Facilities CASTOR d0t1 ingest nodes.
     * Agreed to install these with only 1 CASTOR partition. This worked.
     * Now ready to deploy.
  * Migrate ALICE to WLCGTape

Long-term projects

  * New CASTOR WLCGTape instance. Things that need doing: create a separate xrootd redirector for ALICE.
     * The redirector is now attached to the WLCGTape instance and passes all ALICE external tests.
  * CASTOR disk server migration to Aquilon: gdss742 has been compiled with a draft Aquilon profile
    but there are problems with the SL7 installation (RT216885: https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=216885)
     * Action is with the Fabric team to sort out the SL7 installation. Possible hardware issue; a new, non-faulty node has been requested (a sketch of the profile-compile step is shown after this list).
  * Need to discuss monitoring/metrics for Aquilonised disk servers. GP to try, follow up with JK.
  * Various bits of monitoring/stats are broken for wlcgTape, investigate further.
      * Discussion about the Tier 1 front page. Agreed to try changing the URL on the page to see if that magically produces the correct graph. Noted that Ganglia is going away and we don't want to spend too much time on this (action with the production team).
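
As an illustration of the Aquilon step only (not what was run), a minimal Python sketch of asking the Aquilon broker to compile a disk server's profile, assuming the standard aq client and its make command; the hostname list is illustrative:

    import subprocess

    # Illustrative list: gdss742 is the server mentioned above; any further
    # hostnames would be placeholders for the rest of the migration.
    HOSTS = ["gdss742.gridpp.rl.ac.uk"]

    for host in HOSTS:
        # 'aq make' asks the Aquilon broker to (re)compile the host's profile.
        result = subprocess.run(["aq", "make", "--hostname", host],
                                capture_output=True, text=True)
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"{host}: profile compile {status}")
        if result.returncode != 0:
            print(result.stderr.strip())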

Actions

  * AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup / deletion of empty directories. There was some discussion about what exactly is required and how it can actually be implemented.
  * The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted (a minimal sketch of how this could be applied follows below).
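
A minimal Python sketch of how the proposal could be applied, assuming the standard CASTOR name server nschclass command (class name followed by directory path); the fileclass name and directory list file are hypothetical placeholders, not agreed values:

    import subprocess

    # Hypothetical fileclass that requires a tape copy but has no migration
    # route, so writes into directories using it will fail.
    BLOCKING_CLASS = "tape-no-migration"

    # Hypothetical list of former d1t0 namespace directories, one per line.
    DIRS_FILE = "d1t0_dirs_to_lock.txt"

    with open(DIRS_FILE) as f:
        dirs = [line.strip() for line in f if line.strip()]

    for d in dirs:
        # nschclass changes the file class set on the directory; new files
        # created under it then inherit that class.
        result = subprocess.run(["nschclass", BLOCKING_CLASS, d],
                                capture_output=True, text=True)
        if result.returncode != 0:
            print(f"{d}: {result.stderr.strip()}")

Changing the class on a directory does not touch files that already exist; it only affects files created under it afterwards, which matches the intent of blocking new writes without disturbing anything already there.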

Staffing

  * All in

On Call

RA on call