Difference between revisions of "RAL Tier1 weekly operations castor 26/04/2019"

From GridPP Wiki
(Undo revision 19924 by Rob Appleyard 7f7797b74a (talk))
 
Line 30: Line 30:
 
* Facilities headnodes requested on VMWare, ticket not done yet. Facilities VMWare cluster still under construction
 
** Willing to accept delays on this until ~May
 
** Progress ongoing.
+
** Queued behind tape robot and a number of Diamond ICAT tasks
 
* Aquilon disk servers ready to go, also queued behind tape robot
 
** Designing a stress test based on CC meeting (IOZone on SL6, IOZone on SL7, compare)
 
 
* New Spectra tape robot
 
** Fibre-optic cabling up ongoing.
+
** Finalised configuration for the Tape servers
** Initial performance tests promising (800MB/s)
+
* Produced lots of stats on CASTOR ingest rates
* LHCb now running batch jobs using Echo
+
* Migrated Facilities CASTOR from Juno to Bellona.
+
  
 
== Operation problems ==
 
  
* T2K issues with finding files on tape (GGUS 140870) - Currently on Alastair
+
* gdss738 (lhcbDst) failed, back in production read-only.
* ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts - Currently with Tim Adye
+
* gdss811 (lhcbDst) returned to prod in the pre-Easter week with an HDD for the OS instead of an SSD
 +
* T2K issues with finding files on tape (GGUS 140870)
 +
* ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
 +
** TA has updated the ticket, indicating he will raise the issue with the appropriate people
 +
* LHCb raised an issue with xroot access to lhcbUser, believed to be resolved now.
  
 
== Plans for next few weeks ==
 
Line 49: Line 50:
 
** CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
 
 
* Castor tape testing to continue after the production tape robot networking is installed
 
 +
* Test preprod against Bellona (RT223698)
  
 
== Long-term projects ==
 
Line 55: Line 57:
 
** LHCb migration to Echo is in progress, being sped up by failing CASTOR disk servers
 
 
* CASTOR disk server migration to Aquilon.
 
** Need to work with Fabric to get a stress test (see above)
+
** Need to work with Fabric to get a realistic.
 +
* Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
 +
** Ticket with Fabric team to make the VMs.
 
* The problem of castor-functional-test1 has been absorbed into the task of sorting out worker node grid-mapfile generation and distribution.
 
  
Line 71: Line 75:
  
 
== Staffing ==
 
 
Rob out from end of next week.
 
  
 
== AoB ==
 
Line 78: Line 80:
 
== On Call ==
 
  
GP on call.
+
RA on call.

Latest revision as of 09:50, 3 May 2019

Parent article

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Facilities headnodes requested on VMWare, ticket not done yet. Facilities VMWare cluster still under construction
    • Willing to accept delays on this until ~May
    • Queued behind tape robot and a number of Diamond ICAT tasks
  • Aquilon disk servers ready to go, also queued behind tape robot
  • New Spectra tape robot
    • Finalised configuration for the Tape servers
  • Produced lots of stats on CASTOR ingest rates

Operation problems

  • gdss738 (lhcbDst) failed, back in production read-only.
  • gdss811 (lhcbDst) returned to prod in the pre-Easter week with an HDD for the OS instead of an SSD
  • T2K issues with finding files on tape (GGUS 140870)
  • ATLAS are periodically submitting SAM tests that impact availability and cause pointless callouts
    • TA has updated the ticket, indicating he will raise the issue with the appropriate people
  • LHCb raised an issue with xroot access to lhcbUser, believed to be resolved now.

Plans for next few weeks

  • Examine further standardisation of CASTOR pool settings.
    • CASTOR team to generate a list of nonstandard settings and consider whether they are justified.
  • CASTOR tape testing to continue after the production tape robot networking is installed
  • Test preprod against Bellona (RT223698)

Long-term projects

  • New CASTOR WLCGTape instance.
    • LHCb migration to Echo is in progress, being sped up by failing CASTOR disk servers
  • CASTOR disk server migration to Aquilon.
    • Need to work with Fabric to get a realistic stress test.
  • Deadline of end of April to get Facilities moved to generic VM headnodes and 2.1.17 tape servers.
    • Ticket with Fabric team to make the VMs.
  • The problem of castor-functional-test1 has been absorbed into the task of sorting out worker node grid-mapfile generation and distribution.

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data (namespace cleanup and deletion of empty dirs).
    • Some discussion about what exactly is required and how this can actually be implemented.
    • CASTOR team proposal is either:
      • to switch all of these directories to a fileclass with a requirement for a tape copy but no migration route; this will cause an error whenever any writes are attempted.
      • to run a recursive nschmod on all the unneeded directories to make them read-only.
    • CASTOR team split over the correct approach.
  • Problem with the functional test node using a personal proxy, which expires some time in July.
    • RA met with JJ, requested an appropriate certificate.
  • RA and DM to sit down to sort out storage metric question

Staffing

AoB

On Call

RA on call.