RAL Tier1 weekly operations castor 08/03/2010

From GridPP Wiki

Summary of Previous Week

  • Matthew:
    • 2.1.8/2.1.9 strategy and presented new features to Liaison meeting
    • Planning for CASTOR session at GridPP24
    • Database DR
    • Kicked off plans for moving forward to new production database hardware
    • Installed lcg_utils on castoradm3 for stress testing
    • Depmon (and backup Castor on Duty) duties
    • Wrote presentation for T1 Away Day
  • Shaun:
    • Assisted with disk server deployment problems
    • Fixed t2k tape recall problem
    • Implemented tweak to address CMS job problems
    • CODD (Friday)
  • Chris:
    • Continued testing the number of job slots on a per-protocol basis
    • Doing some work on Quattor Tape and Disk Server
    • Started preparing test infrastructure for castor upgrades
    • Implemented fix for Atlas for LFS events
    • Castor on Duty person
    • Friday off
  • Cheney:
    • ..
  • Tim:
    • Hardware installs
    • CS1818 problem investigation
    • Pre-prod VDQM (big-id) problems.
    • T10KB drives on Pre-prod
  • Richard:
    • Worked on new version of pre-prod benchmarking tool
  • Brian:
    • ..
  • Jens:
    • Expounding on the Correct Interpretation(tm) of information

Developments for this week

  • Matthew:
    • ..
  • Shaun:
    • COD
    • Castor Monitoring prototyping
    • Testing distribution of new tnsnames file
  • Chris:
    • Continue testing the number of job slots on a per-protocol basis. Waiting for LHCB to test rootd
    • Do some work with polymorphic machines
    • Prepare cold stand-by central server
    • Do some work on Quattor Tape Server
    • Preparing test infrastructure for castor upgrades
  • Cheney:
    • ..
  • Tim:
    • T10KB drive testing on Pre-prod
    • Getting new tape servers into operation
  • Richard:
    • Complete new version of pre-prod benchmarking tool and create a Wiki page to document it
  • Brian:
    • ..
  • Jens:
    • Getting preprod and/or cert cipped. Pick up CIP 2.2.0 again.

Operations Issues

  • Large number of jobs failing because access to a small number of hot files was saturated. A new service class with replica=30, using the same disk pool as cmsFarmRead, was added to deal with this.
  • One faulty Atlas tape identified (CS1818)
  • Problems with RPMs missing from redeployed disk servers after they went into production. CASTOR team members have introduced a final disk server sign-off for deploying new disk servers to production.
  • Another BigID occurrence, this time on the Preprod VDQM (first time on this schema)
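
As an illustration of the missing-RPM issue above, one part of a disk server sign-off could be an automated comparison of the installed packages against a required list. The following is only a minimal sketch under stated assumptions: the function names and the idea of a maintained required-package list are hypothetical and are not taken from the actual sign-off procedure.

```python
import subprocess


def installed_packages():
    """Return the set of RPM names installed on the local machine.

    Uses the standard `rpm -qa --queryformat` query; must be run on the
    disk server being checked.
    """
    result = subprocess.run(
        ["rpm", "-qa", "--queryformat", "%{NAME}\n"],
        check=True,
        capture_output=True,
        text=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_packages(required, installed):
    """Return the required package names that are not installed, sorted."""
    return sorted(set(required) - set(installed))


def sign_off(required):
    """Hypothetical sign-off step: True only if no required RPM is missing."""
    missing = missing_packages(required, installed_packages())
    for name in missing:
        print(f"MISSING: {name}")
    return not missing
```

The required list (an assumption here) would be whatever set of RPMs the team expects on a production disk server; `sign_off(required_list)` would then be run on the server before it is put into production.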

Blocking issues

  • Still awaiting preprod database

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB


Advanced Planning

  • Gen upgrade to 2.1.8 in 2010 Q1
  • Install/enable gridftp-internal on Gen (before 2.1.8 upgrade)


  • Castor on Call person: Shaun
  • Staff absences:
    • Matt on paternity leave for 2 weeks from approx 8 March