RAL Tier1 weekly operations castor 26/01/2015

From GridPP Wiki

List of CASTOR meetings

Operations News

  • Draining - ongoing
  • Name server SL6 upgrade completed - no issues
  • Redundant atlasHotdisk service class and disk pool removed from CASTOR

Operations Problems

  • Certificates on fdsdss20 to fdsdss30 expire on 1st Feb - Gareth has raised this with Fabric
  • castor functional test on lcgccvm02 causing problems - Gareth reviewing
  • Problems with storageD retrieval from castor - investigation ongoing
  • The 150k zero-size files reported last week have almost all been dealt with; the CMS files are still outstanding
  • Files with no ns or xattr checksum value in castor are failing transfers from RAL to BNL using the BNL FTS3 server.

Blocking Issues

  • GridFTP bug in SL6 - stops any globus copy if the client is using a particular library. This is a show-stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • Removal of redundant CASTOR DB tables Monday 26th 9am (Shaun)
  • Kernel upgrade on Castor SL5 disk/srm/tape - Tuesday 27th/Wednesday 28th/Thursday 29th
  • Kernel upgrade on Castor facilities - scheduled for Monday 26th 9-10am
  • Oracle upgrade of preprod 2nd Feb - will require a short outage
  • Oracle PSU patching 3rd (Neptune)/4th (Pluto) - castor production at risk 4th Feb
  • Upgrade Oracle DB to version 11.2.0.4 (Late February?)
  • Upgrade CASTOR to version 2.1.14-14 OR 2.1.14-15 (Early February)


Advanced Planning

Tasks

  • DB team need to plan some work which will put the DBs under load for approx 1h - not terribly urgent but needs to be done in the new year.
  • Provide a new VM(?) offering castor client functionality to query the backup DBs
  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Correct partitioning alignment issue (3rd CASTOR partition) on new castor disk servers

Interventions


Actions

  • Rob to pick up DB cleanup change control
  • Bruno to document processes to control services previously controlled by puppet
  • Gareth to arrange a castor/fab/production meeting to discuss the decommissioning procedures

Staffing

  • Castor on Call person
    • Chris
  • Staff absence/out of the office:
    • Rob out until Monday 2nd Feb