RAL Tier1 weekly operations castor 29/03/2010

From GridPP Wiki
Revision as of 13:59, 29 March 2010 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • Matthew:
    • CASTOR Database Way Forward
    • Tier1 Open Day talk
    • Investigating safeguarding CASTOR Tier0 data (T2K, MICE, MINOS)
    • Organizing CASTOR panel session at GridPP24
    • Finalizing 2.1.8/2.1.9 Test Plan and Stress Testing specifications
  • Shaun:
    • Tier 1 Open day talk
    • LHCb Jamboree
    • Scheduling Upgrades
    • Fixing deployment problems
    • COD duties
  • Chris:
    • Tested maximum number of job slots for root protocol with Raja
    • Building 4 cold stand-by central castor servers and doing the final configuration
    • Deploying disk servers
    • DepMon duties
    • Castor on Call duties Mon-Tue
    • Doing work related to Tier1 Security Group project
  • Cheney:
    • cleaning machine room
    • investigate sls timeouts
    • build new robot controller
    • fix zfs on new robot controller
    • investigate oracle install problems
    • check over castor151 backups
    • relocate fibre channel switches
    • replace failed drive in vtl
    • fix backup problems on nagger
    • bring up tape servers with mir problems
  • Tim:
    • ..
  • Richard:
    • Deploying some disk servers into cmsNonProd and lhcbNonProd
    • Continuing with stress-testing of pre-prod instance and contributing towards test-plan
  • Brian:
    • Clearance of stuck migration files
    • Chase up of redeployment tickets.
    • T2 work
  • Jens:
    • Mostly bkg stuff, a little CIP 2.2.0 dev.

Developments for this week

  • Matthew:
    • Tier1 Open Day
    • CASTOR DB Disaster Recovery plans
    • CASTOR On Duty work
    • Publishing list of 'approved exceptions' - changes that don't require formal change control
  • Shaun:
    • Tier 1 open day
    • Presenting upgrade timelines
    • CASTOR SRM Monitoring
    • Testing SRM 2.8-6
  • Chris:
    • Test SL5 (64bit) disk server with xfs
    • Test cold stand-by central castor servers and then write documentation
    • Disk server deployment duties
    • Test Quattor disk server procedure and build castor disk server
    • Castor 2.1.8/2.1.9 upgrade work
    • Doing work related to Tier1 Security Group project
  • Richard:
    • Tweaking stress-testing script to meet requirements of test-plan
    • Running stress-testing script on pre-prod instance
  • Brian:
    • T1 Open Day
    • T2 Storage for LHC Media/Start of 7TeV Day
    • T2s
  • Jens:
    • See if I can get round to finishing new CIP features for ATLAS and test on preprod or cert.

Operations Issues

  • Problem transferring files to gdss346 (atlasSimRaw) due to an error during deployment

Blocking issues

None

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description              Start             End               Type     Affected VO(s)
update LSF license keys  26/03/2010 12:00  26/03/2010 12:30  At-risk  All
update LSF license keys  29/03/2010 09:30  29/03/2010 10:30  At-risk  All

Advanced Planning

  • Upgrade to 2.1.8/2.1.9 (2010)
  • CASTOR Instance for Non LHC (2010Q2)
  • Install/enable gridftp-internal on Gen (Before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Matthew
  • Staff absences:
    • None