RAL Tier1 weekly operations castor 05/04/2019


Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Old facd0tl disk servers have been decommissioned
    • Proved that facilities can recall a zero-sized file
  • Facilities headnodes requested on VMware; the ticket is not done yet and the Facilities VMware cluster is still under construction
    • Willing to accept delays on this until ~May
    • Queued behind tape robot and a number of Diamond ICAT tasks
  • Acceptance testing of the new tape robot completed
    • New-style tape server installation ongoing.
    • Tape library for CASTOR-side testing in progress now
      • Large-scale read testing completed; it appears successful, but analysis is underway on a few outstanding queries
  • Aquilon disk servers ready to go, also queued behind tape robot

Operation problems

  • TimF was running a tape verify for Diamond and hit an issue where every file on the tape was verified rather than the intended sample (first 10, last 10, and 10 random files in the middle), so the verify took far longer than the usual ~30 minutes and the data was inaccessible in the meantime (fd1866)
    • We would like the test not to run on the read-only Diamond tape server, but this is not possible on the facilities instance
    • Not going to investigate further unless the problem recurs
  • ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
    • Tim A has updated the ticket, indicating he will raise the issue with the appropriate people
  • CASTOR metric reporting for GridPP
    • Looking for clarity on precisely what metrics are relevant, and given CASTOR's changed role, what system RA should report on.
  • LHCb are trying to access files using the wrong service class (default); a sketch of how a client selects the intended service class follows this list
  • I/O data traffic on gdss700 stalls; under investigation
  • CEDA outbound certificates are going to expire, but Rob H is on the case
  • SQL error during an nsls on the facilities instance
    • This shows up as an error of 'no file on CASTOR'
    • No issues seen on Hermes; was there an issue on the CASTOR side?
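
The LHCb service-class item above suggests the fix sits on the client side: tell the stager which service class to use rather than falling back to default. A minimal sketch of that, assuming the stager honours the STAGE_SVCCLASS environment variable, that stager_qry is on the PATH, and that 'lhcbDst' and the example /castor path are the right targets; both are placeholders, not taken from these minutes.

#!/usr/bin/env python
"""Hedged sketch: query a CASTOR file under an explicit service class.

Assumptions (not confirmed in the minutes): the stager honours the
STAGE_SVCCLASS environment variable, stager_qry is on the PATH, and
'lhcbDst' is the service class LHCb should be using instead of 'default'.
"""
import os
import subprocess

CASTOR_PATH = "/castor/ads.rl.ac.uk/grid/lhcb/some/file"  # hypothetical example path
SVC_CLASS = "lhcbDst"                                      # assumed correct service class

def stager_query(path, svcclass):
    """Run stager_qry for `path` with the service class set via the environment."""
    env = dict(os.environ, STAGE_SVCCLASS=svcclass)
    result = subprocess.run(
        ["stager_qry", "-M", path],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr

if __name__ == "__main__":
    rc, out, err = stager_query(CASTOR_PATH, SVC_CLASS)
    print("exit code:", rc)
    print(out or err)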

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified (a sketch of such a comparison follows this list).
  • Continue CASTOR-side tape robot testing.
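
The nonstandard-settings action above is essentially a comparison of each pool's configuration against an agreed baseline. A minimal sketch of that comparison, with entirely made-up pool names, setting keys, and baseline values standing in for whatever the CASTOR team exports from the real configuration:

"""Hedged sketch: flag nonstandard CASTOR disk pool settings.

The baseline values and the per-pool settings below are illustrative
placeholders; in practice they would be exported from the stager
configuration rather than hard-coded.
"""

# Agreed "standard" settings every pool is expected to use (assumed values).
BASELINE = {
    "gc_policy": "lru",
    "max_replicas": 1,
    "migration_enabled": True,
}

# Per-pool settings as exported from the instance (placeholder data).
POOLS = {
    "atlasTape": {"gc_policy": "lru", "max_replicas": 1, "migration_enabled": True},
    "lhcbDst":   {"gc_policy": "fifo", "max_replicas": 2, "migration_enabled": True},
}

def nonstandard_settings(pools, baseline):
    """Yield (pool, setting, actual, expected) for every deviation from the baseline."""
    for pool, settings in sorted(pools.items()):
        for key, expected in baseline.items():
            actual = settings.get(key)
            if actual != expected:
                yield pool, key, actual, expected

if __name__ == "__main__":
    for pool, key, actual, expected in nonstandard_settings(POOLS, BASELINE):
        print(f"{pool}: {key} = {actual!r} (standard is {expected!r})")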

Long-term projects

  • New CASTOR WLCGTape instance.
    • LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
  • CASTOR disk server migration to Aquilon.
    • Change ready to implement.
    • More meaningful stress test needs to be carried out.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can actually be implemented.
    • CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted (a rough sketch follows this list).
  • RA to look at making all fileclasses have nbcopies >= 1 (an audit sketch also follows this list).
  • Problem with functional test node using a personal proxy which expires some time in July.
    • Rob met with Jens, requested an appropriate certificate.
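
For the d1t0 write-blocking proposal above, a rough sketch of the reassignment step. It assumes a suitable blocking fileclass already exists, that the name server command is invoked as 'nschclass <class> <directory>', and that the directory list comes out of the namespace cleanup; the class name and paths below are placeholders.

"""Hedged sketch: point retired d1t0 directories at a write-blocking fileclass.

Assumptions (not from the minutes): a fileclass named 'noMigrationRoute'
already exists with a tape copy required but no migration route, the name
server command is invoked as 'nschclass <class> <directory>', and the
directory list would really come from the namespace cleanup exercise.
"""
import subprocess

BLOCKING_CLASS = "noMigrationRoute"                        # hypothetical fileclass name
RETIRED_D1T0_DIRS = [
    "/castor/ads.rl.ac.uk/prod/experiment/d1t0/old",       # placeholder path
]
DRY_RUN = True

def switch_fileclass(directory, fileclass, dry_run=True):
    """Reassign `directory` to `fileclass`; only print the command in dry-run mode."""
    cmd = ["nschclass", fileclass, directory]
    if dry_run:
        print("would run:", " ".join(cmd))
        return 0
    return subprocess.call(cmd)

if __name__ == "__main__":
    for d in RETIRED_D1T0_DIRS:
        switch_fileclass(d, BLOCKING_CLASS, DRY_RUN)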
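
For the nbcopies action, a minimal audit sketch; the 'CLASS_NAME'/'NBCOPIES' field names assumed in the nslistclass output are a guess and would need checking against the real command before use.

"""Hedged sketch: flag CASTOR fileclasses with nbcopies < 1.

Assumption: 'nslistclass' prints one block per fileclass containing
'CLASS_NAME' and 'NBCOPIES' lines; the exact field names must be checked
against the real output before relying on this.
"""
import subprocess

def list_fileclasses():
    """Return nslistclass output as text (empty string if the command fails)."""
    try:
        return subprocess.check_output(["nslistclass"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return ""

def classes_without_tape_copy(listing):
    """Yield (fileclass name, nbcopies) for every class with fewer than one copy."""
    name, copies = None, None
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "CLASS_NAME":
            name = fields[1]
        elif len(fields) >= 2 and fields[0] == "NBCOPIES":
            copies = int(fields[1])
        if name is not None and copies is not None:
            if copies < 1:
                yield name, copies
            name, copies = None, None

if __name__ == "__main__":
    for cls, n in classes_without_tape_copy(list_fileclasses()):
        print(f"fileclass {cls} has nbcopies = {n}")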

Staffing

  • RA back next week

AoB

On Call

RA on call