RAL Tier1 weekly operations castor 15/02/2010

From GridPP Wiki

Summary of Previous Week

  • Matthew:
    • 2.1.9 fact finding
    • meetings at CERN
    • Production resiliency investigations
  • Shaun:
    • lhcbUser diskcopy problem
    • srmMonitoring work for castormon
    • Production database system analysis work
  • Chris:
    • Castor on Duty
    • Fixed PreProduction tape server
    • Fixing problems on the quattorised disk server
    • Working on PreProduction instance
    • Testing maximum number of job slots for rfio for new disk servers (still ongoing)
  • Cheney:
    • Set up a test database service on a private network
    • Prepared cdbe07 to take over from cdbc08
    • Investigating why the database system once worked 100% but no longer does
    • Assisting people with writing nagios plugins
    • Design of virtualisation architecture
  • Tim:
    • ..
  • Richard:
    • Started running the CERN stress tests on the new pre-prod instance
    • Also started a run against the first quattorised disk server
  • Brian:
    • ..
  • Jens:
    • ..

Developments for this week

  • Matthew:
    • Production resiliency investigations
    • CoD work
    • Facilities evaluation support
  • Shaun:
    • LHCb disk copies
    • SRM Development
    • Nameserver trigger
  • Chris:
    • Continue testing maximum number of job slots for new disk servers
    • Start working on Quattor tape server
    • Finish Puppet manifests for polymorphic central servers
    • Work on LHCb disk2disk problem
  • Cheney:
    • Memory upgrades
  • Tim:
    • ..
  • Richard:
    • Continue running the CERN stress tests
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • c08 continues to be unstable. Plan to remove it from production.
  • Two disk servers in ATLAS and one in CMS are showing routing problems.
  • Migration stopped for CMS; restarted Friday.

Blocking issues

  • Lack of an adequate database on preprod is stopping us from doing proper stress testing

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Memory upgrade on 2 db servers | 15/02/2010 10:30 | 15/02/2010 16:00 | At-risk | All
Memory upgrade on 1 db server | 17/02/2010 10:30 | 17/02/2010 16:00 | At-risk | All

Advanced Planning

  • Gen upgrade to 2.1.8 in 2010 Q1
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Matt
  • Staff absences: Shaun (Wednesday, Thursday, Friday), Jens (Monday, Wednesday, Thursday, Friday) - TBC, Cheney (Thursday, Friday), Matthew (Friday)