RAL Tier1 weekly operations castor 17/11/2017

Parent article: https://www.gridpp.ac.uk/wiki/RAL_Tier1_weekly_operations_castor

Draft agenda

1. Problems encountered this week

2. Upgrades/improvements made this week

3. What are we planning to do next week?

  1. Facilities Headnode Replacement

4. Long-term project updates (if not already covered)

  1. SL7 upgrade on tape servers
  2. SL5 elimination from CASTOR functional test boxes and tape verification server
  3. CASTOR stress test improvement
  4. Generic CASTOR headnode setup

5. Special topics

6. Actions

7. Anything for CASTOR-Fabric?

8. AoTechnicalB

9. Availability for next week

10. On-Call

11. AoOtherB

Operation problems

Two disk servers failed this week:

gdss776 - Unexplained crash. Back in production, read-only.

gdss698 - Drive error(s). Currently out of production.

MLF: Strange DB-side problem on Orpheus (vcert) - possible underlying hardware issue that needs Fabric intervention.

Ongoing:

Tape usage plots for the Facilities instance of CASTOR do not match those for Tier1 (RT196714) - IN PROGRESS, SEE TICKET

Continuing problems with SOLID writes to CASTOR

Operation news

Increased the number of tape drives available to NA62 due to a large migration backlog.

Upgraded LHCb SRM to 2.1.16-18

Patched Tier 1 CASTOR to latest available errata/kernel

New CASTOR functional test node is in production.

Plans for next week

Decommission old functional test box

Patch Facilities disk servers to the latest available errata/kernel. Outline plan (sketched below): set half of the nodes read-only, then disable them once there are no ongoing transfers or CANBEMIGR files, reboot them, wait for them to come back up and re-enable them, then repeat for the other half. The same plan applies to the recall caches (cedaRetrieve and diamondRecall).
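
A rough sketch of that rolling sequence is below, for illustration only. The "castor-admin" command and the node names are placeholders, not real CASTOR CLI names or the actual Facilities disk servers; the real procedure would use whatever stager/Fabric tooling is agreed.

  #!/usr/bin/env python
  # Illustrative rolling patch/reboot loop for the plan above.
  # "castor-admin" is a PLACEHOLDER command name, not a real CASTOR CLI;
  # the node names are examples, not the actual Facilities disk servers.
  import subprocess
  import time

  FIRST_HALF = ["gdss-example-01", "gdss-example-02"]
  SECOND_HALF = ["gdss-example-03", "gdss-example-04"]

  def set_state(node, state):
      # mark the disk server READONLY / DISABLED / PRODUCTION in the stager
      subprocess.check_call(["castor-admin", "set-state", node, state])

  def has_pending_work(node):
      # true while the node still has ongoing transfers or CANBEMIGR files
      out = subprocess.check_output(["castor-admin", "pending", node])
      return int(out.strip()) > 0

  def reboot_and_wait(node):
      subprocess.check_call(["ssh", node, "reboot"])
      time.sleep(60)
      while subprocess.call(["ping", "-c", "1", "-W", "2", node]) != 0:
          time.sleep(30)

  for batch in (FIRST_HALF, SECOND_HALF):
      for node in batch:
          set_state(node, "READONLY")
      for node in batch:
          while has_pending_work(node):
              time.sleep(60)
          set_state(node, "DISABLED")
          reboot_and_wait(node)
          set_state(node, "PRODUCTION")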

Long-term projects

CASTOR stress test improvement - Script writing in progress; awaiting testing and validation

Tape server migration to Aquilon - Still need to do Facilities - GP to co-ordinate with GTF

Headnode migration to Aquilon - Templates for the stager written, but not yet compiled or tested. Focus is on features rather than personalities for now.

Target: Combined headnodes running on SL7/Aquilon - implement CERN-style 'Macro' headnodes.

Draining of the remainder of the '12 generation HW - waiting on the CMS migration to Echo. No draining currently ongoing.

Special topics

Patching - consider turning the repeatedly re-used patching plan into a standard procedure.

Actions

RA to check if the disk server setting, changed to bring disk servers back more quickly after a CASTOR shutdown, is still in place

RA/GP: Run GFAL unit tests against CASTOR. Get them here: https://gitlab.cern.ch/dmc/gfal2/tree/develop/test/
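
As a complement to the full unit-test suite, a minimal gfal2-python smoke check against a CASTOR SRM endpoint might look like the sketch below. The SURL is a placeholder, not a real RAL path; a valid grid proxy and the gfal2-python bindings are assumed.

  #!/usr/bin/env python
  # Minimal gfal2 smoke check against a CASTOR SRM endpoint.
  # The SURL below is a PLACEHOLDER, not a real RAL path; a valid grid proxy
  # and the gfal2-python bindings are assumed to be available.
  import gfal2

  SURL = "srm://srm-example.gridpp.rl.ac.uk/castor/example.rl.ac.uk/test/gfal_smoke"

  ctx = gfal2.creat_context()

  # upload a small local file, overwriting any previous copy
  params = ctx.transfer_parameters()
  params.overwrite = True
  ctx.filecopy(params, "file:///etc/hostname", SURL)

  # read back metadata and checksum, then clean up
  print(ctx.stat(SURL).st_size)
  print(ctx.checksum(SURL, "ADLER32"))
  ctx.unlink(SURL)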

GP: Turn CASTOR SRMs & other Aquilonised nodes from clients to servers

GP to talk to AL about service ownership/handover of the xroot manager boxes - DONE

GP to chase up the Fabric team about getting action on RT197296 (use fdscspr05 as the preprod stager, replacing ccse08)

Staffing

GP on call

GP out Friday afternoon

RA out until 6th December