Difference between revisions of "RAL Tier1 weekly operations castor 26/04/2019"
From GridPP Wiki
(Undo revision 19924 by Rob Appleyard 7f7797b74a (talk))
Latest revision as of 09:50, 3 May 2019
Standing agenda
1. Achievements this week
2. Problems encountered this week
3. What are we planning to do next week?
4. Long-term project updates (if not already covered)
5. Special topics
6. Actions
7. Review Fabric tasks
   1. Link
8. AoTechnicalB
9. Availability for next week
10. On-Call
11. AoOtherB
Achievements this week
- Facilities headnodes requested on VMware; ticket not done yet. Facilities VMware cluster still under construction.
  - Willing to accept delays on this until ~May
  - Queued behind tape robot and a number of Diamond ICAT tasks
- Aquilon disk servers ready to go, also queued behind tape robot
- New Spectra tape robot
  - Finalised configuration for the tape servers
- Produced lots of stats on CASTOR ingest rates
Operation problems
- gdss738 (lhcbDst) failed, back in production read-only.
- gdss811 (lhcbDst) returned to prod in the pre-Easter week with an HDD for the OS instead of an SSD
- T2K issues with finding files on tape (GGUS 140870)
- ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
  - TA has updated the ticket, indicating he will raise the issue with the appropriate people
- LHCb raised an issue with xroot access to lhcbUser, believed to be resolved now.
Plans for next few weeks
- Examine further standardisation of CASTOR pool settings.
  - CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
- CASTOR tape testing to continue after the production tape robot networking is installed
- Test preprod against Bellona (RT223698)
Long-term projects
- New CASTOR WLCGTape instance.
  - LHCb migration to Echo is in progress, being sped up by failing CASTOR disk servers
- CASTOR disk server migration to Aquilon.
  - Need to work with Fabric to get a realistic.
- Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
  - Ticket with Fabric team to make the VMs.
- The problem of castor-functional-test1 has been absorbed into the task of sorting out worker node grid-mapfile generation and distribution.
Actions
- AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty dirs.
  - Some discussion about what exactly is required and how this can actually be implemented.
  - CASTOR team proposal is either:
    - to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
    - to run a recursive nschmod on all the unneeded directories to make them read-only.
  - CASTOR team split over the correct approach.
- Problem with functional test node using a personal proxy which expires some time in July.
  - RA met with JJ, requested an appropriate certificate.
- RA and DM to sit down to sort out the storage metric question.
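The second proposal in the first action above (making the unneeded directories read-only) comes down to clearing the write bits on each directory's mode. A minimal Python sketch of that calculation follows, assuming the directory paths and their current modes have already been listed by some other means. The example paths and the helper names `read_only_mode` and `plan_chmod` are hypothetical, and the sketch deliberately stops short of invoking nschmod, whose exact arguments are not covered in these notes.

```python
import stat

# Mask covering the owner, group and other write bits.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def read_only_mode(mode: int) -> int:
    """Return the permission bits of `mode` with every write bit cleared."""
    return stat.S_IMODE(mode) & ~WRITE_BITS


def plan_chmod(entries):
    """Yield (path, new_mode) for each entry whose mode would change.

    `entries` is an iterable of (path, mode) pairs; how that listing is
    obtained from the name server is outside this sketch (assumption).
    """
    for path, mode in entries:
        new_mode = read_only_mode(mode)
        if new_mode != stat.S_IMODE(mode):
            yield path, new_mode


# Hypothetical example paths and modes, for illustration only.
entries = [
    ("/castor/example/d1t0/a", 0o775),
    ("/castor/example/d1t0/a/b", 0o755),
    ("/castor/example/d1t0/c", 0o555),  # already read-only, so skipped
]

plan = list(plan_chmod(entries))
for path, new_mode in plan:
    # Print the octal mode that would be applied to each directory.
    print(f"{new_mode:o} {path}")
```

Producing a plan first, rather than changing modes directly, gives the team a list to review before anything is applied, which may help given that the approach is still under discussion.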
Staffing
AoB
On Call
RA on call.