Difference between revisions of "RAL Tier1 weekly operations castor 22/02/2019"
From GridPP Wiki
Revision as of 10:57, 1 March 2019
Standing agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
- Over the weekend, the CERN EOS instance for ALICE had an outage, which sent all the ALICE traffic from CERN to aliceDisk, overwhelming it and causing trouble.
- LHCb currently have a problem reading some files on lhcbDst. The CASTOR team is investigating.
- Some disk-to-disk (d2d) copies were attempted. This has been explained as LHCb failing to read a file from the lhcbDst service class and then retrying via the default service class, which triggers a d2d copy to lhcbRawRdst (which fails anyway).
- There does not appear to be a problem with local data integrity.
- gdss776 crashed and was taken out of production; it has been returned in read-only mode.
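The d2d fallback described above can be sidestepped on the client side by pinning requests to the intended service class instead of letting them fall through to the default. A hedged sketch using the standard CASTOR stager client; the file path and service-class usage here are illustrative, not commands taken from the incident:

```shell
# Hedged sketch: the /castor path is a placeholder, not a real file.
# Check the file's status in the intended service class first...
stager_qry -M /castor/ads.rl.ac.uk/prod/lhcb/some/file -S lhcbDst

# ...then issue the read/recall request explicitly against lhcbDst, so a
# failure surfaces there rather than falling through to the default
# service class (which is what triggered the d2d copies to lhcbRawRdst).
stager_get -M /castor/ads.rl.ac.uk/prod/lhcb/some/file -S lhcbDst
```

This only illustrates the service-class pinning idea; whether LHCb's framework can be configured this way is a separate question for the experiment.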
Operation news
- New facd0t1 disk servers
  - Ready to go into production from a CASTOR perspective.
  - The hardware has been racked, but hasn't been cabled yet.
- Migrated ALICE to WLCGTape.
  - Argo tests are failing for srm-alice since the upgrade.
  - Due to the Argo tests insisting on SRM access, which ALICE don't use, we have marked srm-alice as unmonitored.
- Shuffled two tape servers in lhcbRawRdst.
- New Facilities DB environment ready.
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
  - CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
- Replacement of Facilities CASTOR d0t1 ingest nodes
  - Now ready to deploy once the Fabric team are ready.
Long-term projects
- New CASTOR WLCGTape instance.
  - The LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
- CASTOR disk server migration to Aquilon.
  - Prototype profile passes functional tests perfectly.
  - Change control approved.
- Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try it and follow up with JK.
  - Ganglia seems to Just Work (TM).
- One of the test InfluxDB servers (05) can be used for CASTOR InfluxDB development.
  - Telegraf is working on an Aquilonised disk server and is successfully sending core metrics to InfluxDB.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
  - Ticket with the Fabric team to make the VMs.
- RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1.
Actions
- AD wants us to ensure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty directories.
  - There was some discussion about exactly what is required and how it can actually be implemented.
  - The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted.
- RA to look at making all fileclasses have nbcopies >= 1.
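The fileclass proposal above could be applied with the CASTOR name-server tools. A hedged sketch, assuming a suitably configured fileclass already exists; the class name and directory path below are hypothetical placeholders, not the real namespace:

```shell
# Hedged sketch: "d1t1_noroute" and the /castor path are hypothetical.
# Attach a fileclass that requires a tape copy but has no migration route
# to a retired d1t0 directory, so any new write errors out at migration:
nschclass d1t1_noroute /castor/ads.rl.ac.uk/prod/someVO/retired_d1t0_area

# Verify the directory's fileclass afterwards (option availability may
# vary by CASTOR version):
nsls -d --class /castor/ads.rl.ac.uk/prod/someVO/retired_d1t0_area
```

The design choice here is that no directories are deleted up front; writes simply fail at the migration-route lookup, which matches the team's stated proposal.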
Staffing
- Everyone in
AoB
- Namespace dumps are fixed.
On Call
RA on call