RAL Tier1 weekly operations castor 01/03/2010

From GridPP Wiki

Summary of Previous Week

  • Matthew:
    • CASTOR database - Disaster recovery coordination
    • Forming our 2.1.8/2.1.9 strategy
    • Coordinating interventions
    • Depmon duties - deploying 100 TB into atlasSimStrip
    • First look at CIP code
  • Shaun:
    • Identifying and correcting problems with new disk server deployment
    • Completed investigation of ATLAS SAM timeouts
    • Prototyping of monitoring updates.
  • Chris:
    • Continuing to test the number of job slots on a per-protocol basis
    • Doing some work on Quattor Tape Server
    • Start preparing test infrastructure for castor upgrades
    • Finish testing fixes for ATLAS
  • Cheney:
    • Build of vulcan database cluster for preprod
    • Fixed backups (the backup could not write to its index file; cause unknown).
  • Tim:
    • ..
  • Richard:
    • Converted the CERN CASTOR stress tests to Perl, to get around limits on the number of concurrent threads and to make it easier to bolt on instrumentation for benchmarking
  • Brian:
    • Disk Deployment assignment
    • Comparing CASTOR stager_qry/bdii/dq2 accounting values
    • Disabled Tape investigation
  • Jens:
    • Support for experiments interpreting CIP information, SRM related support

Developments for this week

  • Matthew:
    • 2.1.8/2.1.9 strategy
    • Database - DR and new hardware plans
    • Hardware spend plans
    • Install lcg_utils on castoradm3 for stress testing
    • Depmon (and backup CASTOR on Day) duties
    • Write presentation for T1 Away Day
  • Shaun:
    • More monitoring prototyping
    • SRM work
  • Chris:
    • Castor on Duty
    • Implement ATLAS fix: "Reduce Atlas LSF clean period to 14400 (sec)"
    • Continue testing the number of job slots on a per-protocol basis. Waiting for LHCb to test rootd
    • Do some work with polymorphic machines
    • Concentrate on Quattor Tape Server
  • Cheney:
    • Handover new Vulcan database cluster
  • Tim:
    • Sort out new hardware
    • Start installing new tape servers?
    • More work on RAC resilience planning
  • Jens:
    • Work on CIP 2.2.0 release

Operations Issues

  • Some disk servers lost routing table: switch needs to have its cache refreshed
  • Missing RPMs on new disk servers - kickstart repository was incomplete
  • xinetd not working on newly deployed disk servers. Needed restarting.

Blocking issues

  • Still don't have an IP address allocated for one node of the new Vulcan database cluster.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description                 | Start            | End              | Type    | Affected VO(s)
ORACLE security patch       | 02/03/2010 10:00 | 02/03/2010 11:00 | At-risk | All
Change to LSF configuration | 02/03/2010 10:00 | 02/03/2010 11:00 | At-risk | ATLAS

Advanced Planning

  • Gen upgrade to 2.1.8 (2010 Q1)
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Chris
  • Matt on paternity leave for 2 weeks from approx 8 March
  • Staff absences: Chris (Friday)