RAL Tier1 weekly operations castor 08/02/2010

From GridPP Wiki
Revision as of 14:55, 15 February 2010 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • Matthew:
    • High level CASTOR strategy for 2010 (inc. 2.1.9)
    • Coordinating team debugging multipath EMC problems
  • Shaun:
    • Configuration Analysis of Production Systems
    • Castor On Duty things
      • Fixing CMS recall problem
      • Fixing missing passwd entries for stage:st and lsfadmin accounts
      • Bringing up and testing of CASTOR instances
  • Chris:
    • Configuring repack instance
    • Working on PreProd instance
    • Preparing test disk server for new Alice peer/manager
    • Preparing preprod instance to test max number of job slots
    • Writing puppet manifests for preproduction disk servers
  • Cheney:
    • ..
  • Tim:
    • RAC upgrade and getting it working again
    • Hardware purchasing
    • Getting repack working after install
  • Richard:
    • Completed setting up current set of pre-prod disk servers
  • Brian:
    • ATLAS D1T0 draining and disk removal.
  • Jens:
    • CIP upgrade finally completed, plus the related pre- and post-upgrade coordination and testing.

Developments for this week

  • Matthew:
    • Meetings at CERN and ATLAS Jamboree
    • Coordinating team debugging multipath EMC problems
    • 2.1.9 fact finding at CERN
  • Shaun:
    • More configuration analysis
    • Looking at ways of improving resilience of current system
    • LHCb disk-2-disk copy problems
    • SRM development if time permits
  • Chris:
    • Castor On Duty (M-F)
    • Test max number of job slots on a per-protocol basis
    • Looking at why vdqm/vmgr are not working on preprod
    • Get back to polymorphic configuration
    • Install SRM machine(s) for preprod
  • Cheney:
    • ..
  • Tim:
    • More hardware purchasing
    • Looking at why RAC is not working
    • Getting RAC stability back to what it should be
  • Richard:
    • Run stress tests on pre-prod instance
  • Brian:
    • CASTOR Draining
    • Educating AD on Draining.
  • Jens:
    • Ideally, some CIP development.

Operations Issues

  • Problem with CMS recalls - now fixed
  • c08 continues to be unstable. Plan to remove it from production
  • Approx. 8 corrupt files discovered on gdss66 (cmsFarmRead) and reported to CMS. None were critical.
  • Entries in /etc/passwd disappeared on gdss67,110. Accidental redeployment?

Blocking issues

  • Lack of Quattor configuration files for SLC4.8 is stopping us evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Upgrade of memory on database nodes | (Ongoing) | (Ongoing) | At Risk | All instances
Fix EMC multipath issues | (Ongoing) | (Ongoing) | At Risk | All instances

Advanced Planning

  • Gen upgrade to 2.1.8 (2010 Q1)
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Chris
  • Staff absences: Matthew at CERN (Tuesday, Wednesday)