RAL Tier1 weekly operations castor 01/02/2010

Summary of Previous Week

  • Matthew:
    • Fixed bug in persistent test suite
    • Configured IPMI on central nodes
    • Planning CERN trip
    • CoD duties
    • Testing CASTOR
  • Shaun:
    • SRM Development
    • Repack instance tape problem
    • Corrupted files
    • Strategy Meeting
    • Testing SRM monitoring
  • Chris:
    • Coordinating CASTOR shutdown
    • Configured IPMI for SRM machines
    • Applied the latest kernel and other outstanding upgrades to all CASTOR servers
    • Working on PreProd instance
    • Reinstalling repack instance
    • Pre-work for Security Challenge
  • Cheney:
    • ..
  • Tim:
    • ..
  • Richard:
    • Set up CCSE02..CCSE07 as CASTOR disk servers
  • Brian:
    • Draining ATLAS RAID5 servers
    • Audit of RAID5 in D1TX disk servers
  • Jens:
    • CIP deployment discussion with CERN

Developments for this week

  • Matthew:
    • Testing CASTOR
    • Setting up preprod stress test
    • Assessing relevance of 2009 strategic actions and dropping where necessary
  • Shaun:
    • Analysis of ATLAS timeouts
    • Disk-2-Disk copy problems on LHCb
    • Pre-production tape server / VDQM problem
  • Chris:
    • Restart production CASTOR services
    • Finish work for repack instance
    • Finish work for PreProd instance
    • Test the number of job slots for new disk servers on a per-protocol basis
    • Test redeployment procedure
    • Test xrootd disk server for aliceTape
  • Cheney:
    • ..
  • Tim:
    • ..
  • Richard:
    • Finalise configuration of CASTOR disk servers: CCSE02..07 + CASTOR301 + CASTOR303 + GDSS198 + GDSS368
  • Brian:
    • Draining ATLAS RAID5 servers
    • Check removal of ATLAS servers ready for re-install
  • Jens:
    • CIP upgrade deployed

Operations Issues

  • Major problems with the multipath setup were encountered when moving back to the EMC kit, resulting in 5 days of unscheduled downtime (a path-health check sketch follows)
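
A minimal diagnostic sketch of the kind of path-health check used while chasing such multipath problems. This is not part of the CASTOR toolchain or the original report; the script, its failure keywords and its behaviour are assumptions. It only wraps the standard multipath -ll command and flags any paths not reported as healthy.

  #!/usr/bin/env python
  # Hypothetical helper (not from the original report): flag unhealthy
  # device-mapper multipath paths by parsing the output of "multipath -ll".
  import subprocess
  import sys

  def unhealthy_paths():
      """Return the "multipath -ll" lines describing paths that look bad."""
      proc = subprocess.Popen(["multipath", "-ll"], stdout=subprocess.PIPE)
      output = proc.communicate()[0].decode("utf-8", "replace")
      bad = []
      for line in output.splitlines():
          # Healthy path lines normally report states such as "active ready";
          # treat "failed", "faulty" or "offline" as unhealthy (assumed keywords).
          if any(word in line for word in ("failed", "faulty", "offline")):
              bad.append(line.strip())
      return bad

  if __name__ == "__main__":
      bad = unhealthy_paths()
      if bad:
          print("Unhealthy multipath paths:")
          for line in bad:
              print("  " + line)
          sys.exit(1)
      print("All multipath paths look healthy.")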

Blocking issues

  • Lack of Quattor configuration files for SLC4.8 is preventing us from evaluating Quattor alongside CASTOR 2.1.8. The PreProd setup will initially proceed with a Kickstart-based deployment.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Upgrade of memory on database nodes | (Ongoing) | (Ongoing) | At Risk | All instances
Upgrade LHC CIP and introduce redundancy of CIPs | 1/2/10 11:00 STC | 1/2/10 12:00 STC | At Risk | All instances
Big Intervention - Day 1: fsck & kernel upgrades of all disk servers & head nodes (apart from tape servers); add IPMI to head nodes; restrict user login on disk servers; update fetch-crl rpm on disk servers | 27/1/10 08:00 | 27/1/10 24:00 | Downtime | All instances
Big Intervention - Day 2: move DB to EMC kit; replace cdbc08 and add new DB archive log destination; install NameServer checksum trigger | 28/1/10 00:00 | 28/1/10 17:00 | Downtime | All instances

Advanced Planning

  • Gen upgrade to 2.1.8 (2010 Q1)
  • Install/enable gridftp-internal on Gen (this year, before the 2.1.8 upgrade)

Staffing

  • CASTOR on-call person: Shaun
  • Staff absences: Matthew (Thursday)