RAL Tier1 weekly operations castor 22/02/2019

Standing agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

  • Over the weekend, CERN EOS for ALICE had an outage, which sent all of the ALICE traffic from CERN to aliceDisk; this overwhelmed it and caused problems.
  • LHCb currently have a problem reading some files on lhcbDst. The CASTOR team is investigating.
    • Some disk2disk (d2d) copies were attempted. This has been explained as LHCb failing to read a file from the lhcbDst service class and then retrying against the default service class, which triggers a d2d copy to lhcbRawRdst (which also fails); this fallback is sketched after this list.
    • We don't seem to have a problem with the local data integrity.
  • gdss776 crashed and was taken out of production; it has since been returned to production in read-only mode.
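
For illustration only, a minimal Python sketch of the fallback described above. The service-class names come from the notes, but read_from_service_class is a hypothetical stand-in for the real CASTOR client call, not an actual API.

  def read_from_service_class(path, svc_class):
      """Hypothetical stand-in for a CASTOR read issued against one service class."""
      # For the affected files the read fails regardless of service class.
      raise IOError("cannot read %s from service class %s" % (path, svc_class))

  def read_with_fallback(path):
      # The behaviour described above: a failed read from lhcbDst is retried
      # against the default service class. Because that class does not hold the
      # file, the stager schedules a d2d copy towards lhcbRawRdst, which also
      # fails for the affected files.
      try:
          return read_from_service_class(path, "lhcbDst")
      except IOError:
          return read_from_service_class(path, "default")

  if __name__ == "__main__":
      try:
          read_with_fallback("/castor/ads.rl.ac.uk/placeholder/lhcb/file")  # placeholder path
      except IOError as err:
          print("read failed twice: %s" % err)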

Operation news

  • New facd0t1 disk servers
    • Ready to go into production from a CASTOR perspective.
    • The hardware has been racked, but hasn't been cabled yet.
  • Migrated ALICE to WLCGTape.
    • Argo tests are failing for srm-alice since the upgrade.
    • Because the Argo tests insist on SRM access, which ALICE do not use, we have marked srm-alice as unmonitored.
  • New Facilities headnodes: DNS request created.
  • Decommissioned the former genTape disk servers
  • LHCb disk server move complete on Tuesday.

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified; one possible way to generate the list is sketched after this list.
  • Replacement of Facilities CASTOR d0t1 ingest nodes
    • Now ready to deploy once Fabric team are ready.
  • New Facilities DB environment will be ready on Monday.
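
A possible starting point for the settings audit, sketched in Python. The baseline values and the per-pool settings shown here are placeholders, not the real production values; in practice the per-pool dictionary would be filled from the stager database or the usual pool listing tools.

  # Hypothetical standard settings: placeholder names and values only.
  STANDARD = {"gc_policy": "lru", "max_replicas": 1, "migration_hold_hours": 0}

  # Hypothetical per-pool settings; in practice these would be read from the
  # stager DB or the pool listing tools rather than hard-coded.
  POOLS = {
      "aliceDisk":   {"gc_policy": "lru", "max_replicas": 1, "migration_hold_hours": 0},
      "lhcbDst":     {"gc_policy": "lru", "max_replicas": 2, "migration_hold_hours": 0},
      "lhcbRawRdst": {"gc_policy": "fifo", "max_replicas": 1, "migration_hold_hours": 12},
  }

  def nonstandard_settings(pools, standard):
      """Return {pool: {setting: (actual, standard)}} for settings that differ."""
      report = {}
      for pool, settings in sorted(pools.items()):
          diffs = {k: (v, standard[k])
                   for k, v in settings.items()
                   if k in standard and v != standard[k]}
          if diffs:
              report[pool] = diffs
      return report

  for pool, diffs in nonstandard_settings(POOLS, STANDARD).items():
      for setting, (actual, expected) in diffs.items():
          print("%s: %s = %r (standard: %r)" % (pool, setting, actual, expected))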

Long-term projects

  • New CASTOR WLCGTape instance.
    • The LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
  • CASTOR disk server migration to Aquilon.
    • Prototype profile passes functional tests perfectly.
  • Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try it and follow up with JK.
    • Ganglia seems to Just Work (TM)
    • One of the test InfluxDB servers (05) can be used for CASTOR InfluxDB development.
    • Telegraf is working on an Aquilonised disk server and is successfully sending core metrics to InfluxDB; a quick spot check is sketched after this list.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1.
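
As a spot check on the Telegraf-to-InfluxDB path, something like the Python sketch below could be used. It assumes the influxdb Python client is available and that Telegraf is writing its default cpu measurement, with the standard host tag, into a database called telegraf on the test server; the InfluxDB host name and the disk server name are placeholders, not confirmed production values.

  from influxdb import InfluxDBClient

  # Placeholders/assumptions: test server name, database name, measurement
  # ("cpu", a Telegraf default), tag ("host") and the disk server name.
  client = InfluxDBClient(host="influxdb05.example.org", port=8086, database="telegraf")

  # Count recent CPU points reported by one Aquilonised disk server.
  result = client.query(
      "SELECT count(usage_idle) FROM cpu "
      "WHERE host = 'aquilonised-disk-server.example.org' AND time > now() - 10m"
  )
  points = list(result.get_points())
  if points and points[0].get("count", 0) > 0:
      print("Telegraf is delivering recent cpu metrics for this host")
  else:
      print("no recent cpu points found - check the Telegraf/InfluxDB path")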

Actions

  • AD wants us to make sure that experiments cannot write to the parts of the namespace that were used for d1t0 data: namespace cleanup and deletion of empty directories.
    • There was some discussion about what exactly is required and how it can actually be implemented.
    • The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; any attempted write will then fail with an error. A sketch of how the switch could be scripted follows this list.
  • RA to look at making all fileclasses have nbcopies >= 1; a rough audit script is also sketched below.
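
One way the fileclass switch could be scripted, sketched in Python. The fileclass name and the directory list are placeholders, and the assumed nschclass <classname> <directory> syntax should be verified against the installed nameserver client tools before use.

  # Placeholders: the fileclass name and directory list below are illustrative;
  # the real values would come from the namespace cleanup described above.
  TAPE_NO_ROUTE_CLASS = "tapeNoMigrationRoute"   # hypothetical fileclass name
  FORMER_D1T0_DIRS = [
      "/castor/ads.rl.ac.uk/prod/placeholder/d1t0-area-1",
      "/castor/ads.rl.ac.uk/prod/placeholder/d1t0-area-2",
  ]

  # Emit the commands for review rather than running them directly. Assumed
  # syntax: "nschclass <classname> <directory>"; check before use.
  for directory in FORMER_D1T0_DIRS:
      print("nschclass %s %s" % (TAPE_NO_ROUTE_CLASS, directory))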
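For the nbcopies action, a rough audit could parse the output of nslistclass, which lists each fileclass with (among other fields) an NBCOPIES value; the exact field layout assumed below should be checked against the local version.

  import subprocess

  # Assumption: nslistclass prints one "KEY value" pair per line for each
  # fileclass, including CLASS_NAME and NBCOPIES; adjust the parsing if the
  # local output differs.
  output = subprocess.check_output(["nslistclass"]).decode()

  class_name = None
  for line in output.splitlines():
      fields = line.split()
      if len(fields) < 2:
          continue
      key, value = fields[0], fields[1]
      if key == "CLASS_NAME":
          class_name = value
      elif key == "NBCOPIES" and int(value) < 1:
          print("fileclass %s has nbcopies = %s" % (class_name, value))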

Staffing

  • Everyone in

AoB

  • Namespace dumps have broken.
    • OK with writing these to tape.
  • Discussed the general principle that everything written to wlcgTape must be migrated to tape; a possible consistency check is sketched after this list.
    • This preserves simplicity and eliminates the failure mode of experiments writing to an area they think goes to tape but doesn't.
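
A possible consistency check for that principle, sketched in Python: scan a namespace dump for files that have no tape segment yet. The dump filename and format assumed here (one file per line, with path, size and number of tape segments as whitespace-separated columns) are assumptions, not the real dump layout.

  # Assumptions: the dump covers the wlcgTape instance and has one file per
  # line with whitespace-separated columns path, size_bytes, n_tape_segments;
  # adapt the parsing to the real namespace dump format.
  DUMP_FILE = "wlcgTape_namespace_dump.txt"   # placeholder filename

  unmigrated = []
  with open(DUMP_FILE) as dump:
      for line in dump:
          fields = line.split()
          if len(fields) < 3:
              continue
          path, size, n_segments = fields[0], int(fields[1]), int(fields[2])
          if n_segments == 0:
              unmigrated.append((path, size))

  if unmigrated:
      for path, size in unmigrated:
          print("no tape copy yet: %s (%d bytes)" % (path, size))
  else:
      print("every listed file has at least one tape segment")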

On Call

GP on call