Difference between revisions of "RAL Tier1 weekly operations castor 22/02/2019"
From GridPP Wiki
Revision as of 10:57, 1 March 2019
Standing agenda
1. Problems encountered this week
2. Upgrades/improvements made this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Operation problems
- Over the weekend, the CERN EOS instance for ALICE had an outage, which sent all the ALICE traffic from CERN to aliceDisk, overwhelming it and causing trouble.
- LHCb currently have a problem reading some files on lhcbDst. The CASTOR team is investigating.
- Some disk-to-disk (d2d) copies were attempted. This has been explained as LHCb failing to read a file from the lhcbDst service class and then retrying via the default service class, which triggers a d2d copy to lhcbRawRdst (which fails anyway).
- There does not appear to be a problem with local data integrity.
- gdss776 crashed and was taken out of production; it has been returned in read-only mode.
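The d2d fallback described above can be sidestepped on the client side by pinning requests to the intended service class instead of letting them fall through to the default. A hedged sketch using the standard CASTOR stager client; the file path and service-class usage here are illustrative, not commands taken from the incident:

```shell
# Hedged sketch: the /castor path is a placeholder, not a real file.
# Check the file's status in the intended service class first...
stager_qry -M /castor/ads.rl.ac.uk/prod/lhcb/some/file -S lhcbDst

# ...then issue the read/recall request explicitly against lhcbDst, so a
# failure surfaces there rather than falling through to the default
# service class (which is what triggered the d2d copies to lhcbRawRdst).
stager_get -M /castor/ads.rl.ac.uk/prod/lhcb/some/file -S lhcbDst
```

This only illustrates the service-class pinning idea; whether LHCb's framework can be configured this way is a separate question for the experiment.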
Operation news
- New facd0t1 disk servers
  - Ready to go into production from a CASTOR perspective.
  - The hardware has been racked, but hasn't been cabled yet.
- Migrated ALICE to WLCGTape.
  - Argo tests are failing for srm-alice since the upgrade.
  - Due to the Argo tests insisting on SRM access, which ALICE don't use, we have marked srm-alice as unmonitored.
- Shuffled two tape servers in lhcbRawRdst.
- New Facilities DB environment ready.
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
  - CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
- Replacement of Facilities CASTOR d0t1 ingest nodes
  - Now ready to deploy once the Fabric team are ready.
Long-term projects
- New CASTOR WLCGTape instance.
  - The LHCb migration is with LHCb at the moment; they are not blocked. Mirroring of lhcbDst to Echo is complete.
- CASTOR disk server migration to Aquilon.
  - Prototype profile passes functional tests perfectly.
  - Change control approved.
- Need to discuss Ganglia monitoring/metrics for Aquilonised disk servers. GP to try it and follow up with JK.
  - Ganglia seems to Just Work (TM).
- One of the test InfluxDB servers (05) can be used for CASTOR InfluxDB development.
  - Telegraf is working on an Aquilonised disk server and is successfully sending core metrics to InfluxDB.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
  - Ticket with the Fabric team to make the VMs.
- RA is working with James to sort out the gridmap-file distribution infrastructure and to get a machine with a better name for this than castor-functional-test1.
Actions
- AD wants us to ensure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty directories.
  - There was some discussion about exactly what is required and how it can actually be implemented.
  - The CASTOR team's proposal is to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted.
- RA to look at making all fileclasses have nbcopies >= 1.
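The fileclass proposal above could be applied with the CASTOR name-server tools. A hedged sketch, assuming a suitably configured fileclass already exists; the class name and directory path below are hypothetical placeholders, not the real namespace:

```shell
# Hedged sketch: "d1t1_noroute" and the /castor path are hypothetical.
# Attach a fileclass that requires a tape copy but has no migration route
# to a retired d1t0 directory, so any new write errors out at migration:
nschclass d1t1_noroute /castor/ads.rl.ac.uk/prod/someVO/retired_d1t0_area

# Verify the directory's fileclass afterwards (option availability may
# vary by CASTOR version):
nsls -d --class /castor/ads.rl.ac.uk/prod/someVO/retired_d1t0_area
```

The design choice here is that no directories are deleted up front; writes simply fail at the migration-route lookup, which matches the team's stated proposal.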
Staffing
- Everyone in
AoB
- Namespace dumps are fixed.
On Call
RA on call