RAL Tier1 weekly operations castor 28/06/2019

Standing agenda

1. Achievements this week

2. Problems encountered this week

3. What are we planning to do next week?

4. Long-term project updates (if not already covered)

5. Special topics

6. Actions

7. Review Fabric tasks

  1.   Link

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Achievements this week

  • Started deletion of LHCb data from lhcbDst
  • Started repermissioning of SNO+ data in response to a GGUS ticket
  • Renamed some MICE data in response to a request (see the rename sketch after this list).
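
A minimal sketch of how a rename request like the MICE one might be carried out with the CASTOR name server's nsrename command; the directory paths below are hypothetical examples rather than the actual MICE paths, and the Python wrapper is purely illustrative.

  #!/usr/bin/env python
  # Hedged sketch: rename CASTOR namespace entries with 'nsrename'.
  # The old/new paths are hypothetical, not the actual MICE directories.
  import subprocess

  RENAMES = [
      ("/castor/ads.rl.ac.uk/prod/mice/olddir",
       "/castor/ads.rl.ac.uk/prod/mice/newdir"),
  ]

  for old, new in RENAMES:
      # nsrename changes the name server entry only; the data itself is untouched.
      subprocess.check_call(["nsrename", old, new])
      print("renamed %s -> %s" % (old, new))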

Operational problems

  • Unscheduled Facilities downtime this week
    • On Wednesday, Fabric intervened to replace a PDU board on rack 88 (which contains the Facilities headnodes)
    • Also on Thursday (at about 11:00), an intervention was made on fdsclsf05 to replace some memory and fix the power supply.
    • At 11:12, network connectivity was lost to many (but not all) Facilities machines.
      • We believe this was related to someone jogging a network cable during the hardware intervention.
      • Facilities admins raised the alarm and the Fabric team investigated.
    • At 11:46, connectivity to all the machines except fdscstg05 came back.
      • fdscstg05 needed a new power supply on Friday; we think this was probably the cause of that machine's issue.
    • After lunch, fdscstg05 was restarted and it started working again.
    • We then tested the service and ended the downtime
  • We also had problems this week with SRM transfers failing for CMS and LHCb
    • Lots of transfers timing out and SAM tests failing.
    • DB team also reported high DB load from the SRM
    • Logs were full of activity from a particular T2K user (see the log-count sketch after this list).
    • We think this T2K user was treating the SRM very roughly, hitting it with a very large number of requests.
    • When they were asked to stop, the problem went away.
  • More drives than expected down in Facilities during a busy period.
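
On the SRM overload, a minimal sketch of the kind of log check that shows one user dominating the request stream. The log location, file pattern and the DN="..." field are assumptions for illustration, not the actual CASTOR SRM log layout.

  #!/usr/bin/env python
  # Hedged sketch: count SRM requests per client DN to spot a single user
  # hammering the endpoint. Log path and line format are assumptions.
  import collections
  import glob
  import re

  LOG_GLOB = "/var/log/castor/srmv2/*.log"   # hypothetical log location
  DN_RE = re.compile(r'DN="([^"]+)"')        # assumed per-request DN field

  counts = collections.Counter()
  for path in glob.glob(LOG_GLOB):
      with open(path) as f:
          for line in f:
              m = DN_RE.search(line)
              if m:
                  counts[m.group(1)] += 1

  # The ten busiest DNs; a single DN dominating the list is the smoking gun.
  for dn, n in counts.most_common(10):
      print("%8d  %s" % (n, dn))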

Plans for the next few weeks

  • Decommissioned lhcbDst; hardware awaiting retirement.
  • Kevin has done some StorageD functional tests with the new tape robot
  • Brian C is currently testing StorageD/ET on the new robot
  • Replace Facilities headnodes with VMs.
    • Waiting until Kevin is back from holiday.
    • Scheduled for the 30th July.

Long-term projects

  • Migration of name server to VMs on 2.1.17-xx is waiting until aliceDisk is decommissioned.
  • CASTOR disk server migration to Aquilon.
    • Need to work with Fabric to get a stress test (see above)
  • Facilities headnode replacement:
    • SL7 VM headnodes need changes to their personalities for Facilities.
    • SL7 headnodes are being tested by GP
  • Implementing DUNE on Spectralogic robot is paused.
  • Migrate VCert to VMWare.
  • Move VCert into the Facilities domain so we have a facilities test instance.

Actions

  • AD wants us to make sure that experiments cannot write to the part of the namespace that was used for d1t0 data: namespace cleanup/deletion of empty directories.
    • Some discussion about what exactly is required and how this can actually be implemented.
    • The CASTOR team's proposal is either:
      • to switch all of these directories to a fileclass that requires a tape copy but has no migration route; this will cause an error whenever a write is attempted, or
      • to run a recursive nschmod on all the unneeded directories to make them read only (see the nschmod sketch after this list).
    • The CASTOR team is split over the correct approach.
  • Problem with the functional test node using a personal proxy which expires some time in July.
    • RA met with JJ and requested an appropriate certificate.
    • Follow up with JJ or ST next week (see the proxy-check sketch after this list).
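
On the d1t0 namespace action, a minimal sketch of the second option (the recursive read-only nschmod). The list file, the 555 mode and the parsing of nsls -l output are assumptions for illustration; only the idea of removing write permission with nschmod comes from the proposal above.

  #!/usr/bin/env python
  # Hedged sketch: walk each listed d1t0 directory and drop write permission
  # with 'nschmod'. List file, mode and 'nsls -l' parsing are assumptions.
  import subprocess

  LIST_FILE = "d1t0_dirs.txt"   # hypothetical file: one CASTOR directory per line
  READONLY_MODE = "555"         # r-xr-xr-x: no writes for anyone

  def subdirs(path):
      """Return subdirectories of a CASTOR directory via 'nsls -l'."""
      out = subprocess.check_output(["nsls", "-l", path]).decode()
      dirs = []
      for line in out.splitlines():
          fields = line.split()
          if fields and fields[0].startswith("d"):
              dirs.append(path.rstrip("/") + "/" + fields[-1])
      return dirs

  def make_readonly(path):
      """Recursively set the read-only mode on a directory tree."""
      subprocess.check_call(["nschmod", READONLY_MODE, path])
      for d in subdirs(path):
          make_readonly(d)

  if __name__ == "__main__":
      with open(LIST_FILE) as f:
          for line in f:
              top = line.strip()
              if top:
                  make_readonly(top)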
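
On the proxy action, a minimal sketch of a check that could run on the functional test node, assuming the test proxy lives at a path like /home/castortest/testproxy.pem (hypothetical) and that a week's warning is enough; voms-proxy-info -timeleft prints the remaining proxy lifetime in seconds.

  #!/usr/bin/env python
  # Hedged sketch: warn when the functional-test proxy is close to expiry.
  # Proxy path and the 7-day threshold are assumptions.
  import os
  import subprocess
  import sys

  PROXY_FILE = "/home/castortest/testproxy.pem"   # hypothetical proxy location
  WARN_BELOW = 7 * 24 * 3600                      # warn with a week to go

  env = dict(os.environ, X509_USER_PROXY=PROXY_FILE)
  out = subprocess.check_output(["voms-proxy-info", "-timeleft"], env=env)
  timeleft = int(out.decode().strip())

  days = timeleft / 86400.0
  if timeleft < WARN_BELOW:
      print("WARNING: functional-test proxy expires in %.1f days" % days)
      sys.exit(1)
  print("Proxy OK: %.1f days left" % days)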

Staffing

  • Everybody in

AoB

On Call

RA on call