RAL Tier1 weekly operations castor 4/04/2010

From GridPP Wiki

Latest revision as of 12:16, 7 April 2010

Summary of Previous Week

  • Matthew:
    • Tier1 Open Day
    • Determine 2.1.8/2.1.9 upgrade plan for test upgrades of production db snapshots
    • CASTOR DB Disaster Recovery and Way Forward plans
    • CoD/Depmon work
    • Published list of 'approved exceptions' - changes that don't require formal change control
    • Running functional tests against 2.1.7 certification
  • Shaun:
    • ..
  • Chris:
    • Partially tested cold stand-by central castor servers
    • Disk server deployment duties
    • Castor 2.1.8/2.1.9 upgrade work
    • Doing work related to Tier1 Security Group project
  • Richard:
    • Created the necessary setups for the mix of commands agreed for the pre-prod stress testing
    • Ran the above commands (using a limited number of threads and repeats) to "test the tests"
  • Brian:
    • ..
  • Jens:
    • Continued CIP 2.2 dev. Discussions with CERN about roadmap for CIP/CASTOR interface.
    • Added to test plan for CASTOR upgrade, testing with the grid.

Developments for this week

  • Matthew:
    • Preparing for GridPP24 CASTOR session
    • Install puppetmaster on new hardware
    • Depmon work
  • Shaun:
    • Meet Oracle consultant
  • Chris:
    • Test SL5 (64bit) disk server with xfs
    • Test cold stand-by central castor servers and then write documentation
    • Disk server deployment duties
    • Test Quattor disk server procedure and build castor disk server
    • Castor 2.1.8/2.1.9 upgrade work
    • Doing work related to Tier1 Security Group project
  • Richard:
    • Will kick off tests to run during the weekend
  • Brian:
    • ..
  • Jens:
    • See if I can finally get enough femtoseconds to install CIP on preprod or cert.

Operations Issues

  • Puppet master became unstable after latest disk server deployment. We need to move it to new hardware.
  • Reboot of Neptune node: castor151. Suspicion is that it is related to the NFS mounts for backup
  • A few CMS job failures caused by SRM reporting incorrect free space; the errors appear in the FTS log
  • Networking issues over the weekend (4/4/10) led to a site outage, including CASTOR

Blocking issues

None

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

None

Advanced Planning

  • Upgrade to 2.1.8/2.1.9 in 2010
  • Upgrade to SRM 2.8-6 after testing is complete
  • ATLAS want to know how much capacity is available in disabled servers (published as Capability). Low priority CIP change to do this.
  • CASTOR instance for non-LHC (2010Q2)
  • Install/enable gridftp-internal on Gen (Before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Chris
  • Staff absences:
    • Matthew (Wed, Thu morning)
    • Richard (Tue/Wed)