RAL Tier1 weekly operations castor 19/01/2015

Operations News

  • Draining - ongoing
  • SL6 name server upgrade postponed due to limited CASTOR team resource; now likely to take place this week

Operations Problems

  • Heavy LHCb usage is causing significant CASTOR SRM load, resulting in SRM test failures. An additional SRM (SRM03) has been added and is being monitored.
  • Problems with storageD retrieval from CASTOR; investigation ongoing
  • 150k files across all VOs have a zero-size entry in the namespace. Of these, ~150 do have a file in the stager and have been fixed. Lists of affected files have been, or are being, provided to the VOs.
  • Files with no name server (ns) or xattr checksum value in CASTOR are failing transfers from RAL to BNL via the BNL FTS3 server.


Blocking Issues

  • GridFTP bug in SL6: any globus copy fails if the client is using a particular library. This is a showstopper for SL6 on disk servers.

Planned, Scheduled and Cancelled Interventions

  • Kernel upgrade on CASTOR SL5 disk/SRM/tape servers. Currently in planning, but likely to be in the last week of January
  • A Tier 1 Database cleanup is planned so as to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
  • Upgrade of CASTOR headnodes to SL6 Gen, week commencing 5th Jan
  • Upgrade Oracle DB to version 11.2.0.4 (Late February?)
  • Upgrade CASTOR to version 2.1.14-14 or 2.1.14-15 (Early February)


Advanced Planning

Tasks

  • Schedule name server upgrades to SL6
  • The DB team needs to plan some work that will put the DBs under load for approx. 1h; not urgent, but it needs to be done in the new year.
  • Provide a new VM(?) offering CASTOR client functionality to query the backup DBs
  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch admin machines from lcgccvm02 to lcgcadm05
  • Correct partitioning alignment issue (3rd CASTOR partition) on new CASTOR disk servers

Interventions


Actions

  • Rob to pick up DB cleanup change control
  • Bruno to document processes to control services previously controlled by Puppet


Staffing

  • CASTOR on-call person
    • Matt
  • Staff absence/out of the office:
    • Chris WFH Tues afternoon
    • Rob out Thurs - Monday 2nd Feb