RAL Tier1 weekly operations castor 05/04/2019


Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Old facd0tl disk servers have been decommissioned
    • Proved that facilities can recall a zero-sized file
  • Facilities headnodes requested on VMware; the ticket is not done yet and the Facilities VMware cluster is still under construction
    • Willing to accept delays on this until ~May
    • Queued behind tape robot and a number of Diamond ICAT tasks
  • Acceptance testing of the new tape robot completed
    • New-style tape server installation ongoing.
    • Tape library for CASTOR-side testing in progress now
      • Large-scale read testing completed; it appears successful, but analysis is underway on a few outstanding queries
  • Aquilon disk servers ready to go, also queued behind tape robot

Operation problems

  • TimF was running a tape verify for Diamond and hit an issue where every file on the tape was verified rather than the intended sample (first 10, last 10, and 10 random files in the middle), so the verify took far longer than the usual ~30 minutes and the data was inaccessible in the meantime (fd1866)
    • We would like the test not to run on the read-only Diamond tape server, but this is not possible on the facilities instance
    • Not going to investigate further unless the problem recurs
  • ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
    • Tim A has updated the ticket, indicating he will raise the issue with the appropriate people
  • CASTOR metric reporting for GridPP
    • Looking for clarity on precisely what metrics are relevant, and given CASTOR's changed role, what system RA should report on.
  • LHCb are trying to access files using the wrong service class (default); a sketch of how a client selects the intended service class follows this list
  • I/O data traffic on gdss700 stalls; under investigation
  • CEDA outbound certificates are going to expire, but Rob H is on the case
  • SQL error during an nsls on the facilities instance
    • This shows up as an error of 'no file on CASTOR'
    • No issues seen on Hermes; was there an issue on the CASTOR side?
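
The LHCb service-class item above suggests the fix sits on the client side: tell the stager which service class to use rather than falling back to default. A minimal sketch of that, assuming the stager honours the STAGE_SVCCLASS environment variable, that stager_qry is on the PATH, and that 'lhcbDst' and the example /castor path are the right targets; both are placeholders, not taken from these minutes.

#!/usr/bin/env python
"""Hedged sketch: query a CASTOR file under an explicit service class.

Assumptions (not confirmed in the minutes): the stager honours the
STAGE_SVCCLASS environment variable, stager_qry is on the PATH, and
'lhcbDst' is the service class LHCb should be using instead of 'default'.
"""
import os
import subprocess

CASTOR_PATH = "/castor/ads.rl.ac.uk/grid/lhcb/some/file"  # hypothetical example path
SVC_CLASS = "lhcbDst"                                      # assumed correct service class

def stager_query(path, svcclass):
    """Run stager_qry for `path` with the service class set via the environment."""
    env = dict(os.environ, STAGE_SVCCLASS=svcclass)
    result = subprocess.run(
        ["stager_qry", "-M", path],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr

if __name__ == "__main__":
    rc, out, err = stager_query(CASTOR_PATH, SVC_CLASS)
    print("exit code:", rc)
    print(out or err)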

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified (a sketch of such a comparison follows this list).
  • Continue CASTOR-side tape robot testing.
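
The nonstandard-settings action above is essentially a comparison of each pool's configuration against an agreed baseline. A minimal sketch of that comparison, with entirely made-up pool names, setting keys, and baseline values standing in for whatever the CASTOR team exports from the real configuration:

"""Hedged sketch: flag nonstandard CASTOR disk pool settings.

The baseline values and the per-pool settings below are illustrative
placeholders; in practice they would be exported from the stager
configuration rather than hard-coded.
"""

# Agreed "standard" settings every pool is expected to use (assumed values).
BASELINE = {
    "gc_policy": "lru",
    "max_replicas": 1,
    "migration_enabled": True,
}

# Per-pool settings as exported from the instance (placeholder data).
POOLS = {
    "atlasTape": {"gc_policy": "lru", "max_replicas": 1, "migration_enabled": True},
    "lhcbDst":   {"gc_policy": "fifo", "max_replicas": 2, "migration_enabled": True},
}

def nonstandard_settings(pools, baseline):
    """Yield (pool, setting, actual, expected) for every deviation from the baseline."""
    for pool, settings in sorted(pools.items()):
        for key, expected in baseline.items():
            actual = settings.get(key)
            if actual != expected:
                yield pool, key, actual, expected

if __name__ == "__main__":
    for pool, key, actual, expected in nonstandard_settings(POOLS, BASELINE):
        print(f"{pool}: {key} = {actual!r} (standard is {expected!r})")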

Long-term projects

  • New CASTOR WLCGTape instance.
    • LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
  • CASTOR disk server migration to Aquilon.
    • Change ready to implement.
    • More meaningful stress test needs to be carried out.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA working with James to sort out the gridmap-file distribution infrastructure and get a machine with a better name for this than castor-functional-test1

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
    • Some discussion about what exactly is required and how this can actually be implemented.
    • CASTOR team proposal is to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted (a rough sketch follows this list).
  • RA to look at making all fileclasses have nbcopies >= 1 (an audit sketch also follows this list).
  • Problem with functional test node using a personal proxy which expires some time in July.
    • Rob met with Jens, requested an appropriate certificate.
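
For the d1t0 write-blocking proposal above, a rough sketch of the reassignment step. It assumes a suitable blocking fileclass already exists, that the name server command is invoked as 'nschclass <class> <directory>', and that the directory list comes out of the namespace cleanup; the class name and paths below are placeholders.

"""Hedged sketch: point retired d1t0 directories at a write-blocking fileclass.

Assumptions (not from the minutes): a fileclass named 'noMigrationRoute'
already exists with a tape copy required but no migration route, the name
server command is invoked as 'nschclass <class> <directory>', and the
directory list would really come from the namespace cleanup exercise.
"""
import subprocess

BLOCKING_CLASS = "noMigrationRoute"                        # hypothetical fileclass name
RETIRED_D1T0_DIRS = [
    "/castor/ads.rl.ac.uk/prod/experiment/d1t0/old",       # placeholder path
]
DRY_RUN = True

def switch_fileclass(directory, fileclass, dry_run=True):
    """Reassign `directory` to `fileclass`; only print the command in dry-run mode."""
    cmd = ["nschclass", fileclass, directory]
    if dry_run:
        print("would run:", " ".join(cmd))
        return 0
    return subprocess.call(cmd)

if __name__ == "__main__":
    for d in RETIRED_D1T0_DIRS:
        switch_fileclass(d, BLOCKING_CLASS, DRY_RUN)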
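
For the nbcopies action, a minimal audit sketch; the 'CLASS_NAME'/'NBCOPIES' field names assumed in the nslistclass output are a guess and would need checking against the real command before use.

"""Hedged sketch: flag CASTOR fileclasses with nbcopies < 1.

Assumption: 'nslistclass' prints one block per fileclass containing
'CLASS_NAME' and 'NBCOPIES' lines; the exact field names must be checked
against the real output before relying on this.
"""
import subprocess

def list_fileclasses():
    """Return nslistclass output as text (empty string if the command fails)."""
    try:
        return subprocess.check_output(["nslistclass"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return ""

def classes_without_tape_copy(listing):
    """Yield (fileclass name, nbcopies) for every class with fewer than one copy."""
    name, copies = None, None
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "CLASS_NAME":
            name = fields[1]
        elif len(fields) >= 2 and fields[0] == "NBCOPIES":
            copies = int(fields[1])
        if name is not None and copies is not None:
            if copies < 1:
                yield name, copies
            name, copies = None, None

if __name__ == "__main__":
    for cls, n in classes_without_tape_copy(list_fileclasses()):
        print(f"fileclass {cls} has nbcopies = {n}")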

Staffing

  • RA back next week

AoB

On Call

RA on call