RAL Tier1 weekly operations castor 26/04/2010

From GridPP Wiki

Summary of Previous Week

  • Matthew:
    • Building and testing new puppetmaster server
    • Testing quattorized disk server deployment
    • Writing plans for Investigating Alternatives to CASTOR project
  • Shaun:
    • Solved problem of ATLAS recalls (missing account on disk server)
    • Worked with James Jackson on implementing a solution to the mighunter contention for CMS
    • Solving and testing of ATLAS software problems (32 bit issue)
    • APR
    • Strategic Objectives
  • Chris:
    • ..
  • Richard:
    • Continuing with p/p stress testing
  • Brian:
    • ..
  • Jens:
    • ..

Developments for this week

  • Matthew:
    • CASTOR Database - The Way Forward meeting
    • Testing quattorized disk server deployment
    • Writing plans for Investigating Alternatives to CASTOR project
    • Testing new puppetmaster server
  • Shaun:
    • Make LSF pending jobs information available to CMS
    • Catch up with SRM developments
    • Keep trying to get SRM rate monitoring into castormon
    • APR (cont...)
  • Chris:
    • ..
  • Richard:
    • Continuing with p/p stress testing
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • The number of pending jobs in CMS increased substantially, resulting in a callout. We will make castormon data files available to give PhEDEx feedback, so that it can avoid overloading CASTOR in the future.
  • gdss420 was found to be missing its /exportstage/castor* partitions during deployment into atlasNonProd

Blocking issues

  • Preprod stress testing is taking longer than anticipated. We are cutting the number of tests from 10k to 5k and restricting the file size tests to 100 MB and 2 GB only.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                                        Start          End            Type               Affected VO(s)
Add new node to database SAN for running backups   28/4/10 1015   28/4/10 1430   At risk Downtime

Advanced Planning

  • Upgrade to 2.1.8/2.1.9 2010
  • Upgrade to SRM 2.8-6 after testing is complete
  • ATLAS want to know how much capacity is available in disabled servers (published as Capability). Low priority CIP change to do this.
  • CASTOR Instance for Non LHC 2010Q2
  • Install/enable gridftp-internal on Gen (Before 2.1.8 upgrade)

Staffing

  • CASTOR on-call person: Shaun
  • Staff absences:
    • Matthew (Tues morning)
    • Jens (Mon-Wed)