RAL Tier1 Weekly Operations CASTOR 13/10/2014

From GridPP Wiki

Operations News

  • Plans to ensure PreProd represents production in terms of hardware generation are underway
  • Disk server redeployments continue (e.g. D1T0 servers reused as D0T1) ... 5 servers for LHCb remain, see below
  • xrootd security advisory concerning the FAX component within xrootd
  • SL6 headnode work is progressing well - hoping to test in the CASTOR vcert instance next week
  • PSU Oracle patch on Juno completed


Operations Problems

  • gdss720 - a RAID issue (LSI card) currently looks to be at the root of its problems. Kashif attempted to replace the RAID controller on Friday 10th but was unsuccessful. The server is now running read-only again with the same RAID controller as before.
  • Still having difficulty deploying 5 servers into LHCb production because of a Mellanox network/Quattor issue - Fabric to fix
  • ATLAS load: high I/O wait on several servers - partly understood, and working towards some config changes
  • T2K could not get through the SRM - Shaun caused and corrected the problem (it would also have affected SNO+ and another VO)
  • T2K files with a physical size but no size in the namespace (blocking a disk server decommissioning) - Brian is talking to T2K.
  • CIP stopped updating - this particular problem is thought to be related to the CASTOR switchover caused by the EMC failure. Action: add the CIP to the instructions for CASTOR failover. The CASTOR team decided to wait until the DBs are rolled back.
  • Duplicates in the SRM user-file DB tables are still occurring - test files only
  • EMC array issues - fail-back scheduled for Tuesday 10:00–12:00 (Data Guard looks to have a day and a half left to sync up the DBs - end of Sat 11th)
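The T2K item above (files with a physical size on disk but no size recorded in the namespace) can be illustrated with a minimal cross-check sketch. This is a hypothetical example using plain dictionaries as stand-ins for the disk-server and namespace inventories; the function name and data are illustrative, not part of the CASTOR API.

```python
# Hypothetical sketch: cross-check disk-server file sizes against namespace
# records to flag files with a physical size but no namespace size.
# The inventories and paths below are illustrative, not CASTOR output.

def find_size_mismatches(disk_files, namespace_files):
    """Return paths present on disk with a non-zero physical size whose
    namespace entry is missing or records a zero size."""
    mismatches = []
    for path, physical_size in disk_files.items():
        ns_size = namespace_files.get(path)
        if physical_size > 0 and not ns_size:
            mismatches.append(path)
    return sorted(mismatches)

# Example inventories (sizes in bytes)
disk = {"/castor/t2k/run1.dat": 1024, "/castor/t2k/run2.dat": 2048}
ns = {"/castor/t2k/run1.dat": 1024, "/castor/t2k/run2.dat": 0}

print(find_size_mismatches(disk, ns))  # run2.dat has no namespace size
```

Files flagged this way would block decommissioning because their disk copies cannot be verified against the namespace before removal.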


Blocking Issues

  • GridFTP bug in SL6 - stops any Globus copy if a client is using a particular library. This is a show-stopper for SL6 on disk servers.


Planned, Scheduled and Cancelled Interventions

  • A Tier 1 database cleanup is planned to eliminate a number of excess tables and other entities left over from previous CASTOR versions. This will be change-controlled in the near future.
  • Juan to further patch the CASTOR DBs (PSU patches for Pluto and Juno) - standard change ... TBC
  • CASTOR 2.1.14-14 stress testing in PreProd along with new errata
  • Tuesday 10:00 am SAN switchback
  • LHCb boxes into production (assuming the Mellanox issues are resolved)


Advanced Planning

Tasks

  • Possible future upgrade to CASTOR 2.1.14-15 post-Christmas
  • Switch admin machines: from lcgccvm02 to lcgcadm05
  • A new VM configured to run against the standby CASTOR database will be created as a front end for dark-data and similar queries.
  • Replace DLF with Elasticsearch
  • Correct the partition alignment issue (3rd CASTOR partition) on new CASTOR disk servers
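The partition-alignment task above comes down to simple arithmetic: a partition is aligned when its starting byte offset is an exact multiple of the RAID stripe width. The following sketch shows that check; the sector size and stripe width are assumed values for illustration, not measurements from the new disk servers.

```python
# Illustrative partition-alignment check. A misaligned partition start makes
# every stripe-sized write straddle two stripes, hurting RAID throughput.
# SECTOR_BYTES and STRIPE_BYTES are assumed values, not measured ones.

SECTOR_BYTES = 512          # assumed logical sector size
STRIPE_BYTES = 64 * 1024    # assumed 64 KiB RAID stripe width

def is_aligned(start_sector, stripe_bytes=STRIPE_BYTES):
    """True if the partition's start offset falls on a stripe boundary."""
    return (start_sector * SECTOR_BYTES) % stripe_bytes == 0

print(is_aligned(2048))   # 2048 * 512 B = 1 MiB boundary: aligned
print(is_aligned(63))     # legacy CHS start at sector 63: misaligned
```

Starting partitions at a 1 MiB boundary (sector 2048 with 512-byte sectors) is a common convention precisely because 1 MiB is a multiple of typical stripe widths.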


Interventions

  • CASTOR problems with the cache on an EMC disk array - ATLAS were worst affected. Our position was improved by moving the ATLAS/Gen stager/SRM to standby on Monday 6th Oct; moving back on Tuesday 14th at 10am. Improved monitoring and the speed of EMC's response are being looked into.

Staffing

  • CASTOR on-call person
    • Chris


  • Staff absence/out of the office:
    • All castor team in