RAL Tier1 weekly operations castor 25/01/2010

Latest revision as of 15:47, 25 January 2010

Summary of Previous Week

  • Matthew:
    • Intervention planning
    • CIP and new castoradm1 testing
    • Setting up preprod stress test
    • Planning RAFL spend for new non-LHC instance
    • Assisting DCI team members for evaluating CASTOR for Facilities
    • Defining CASTOR roles for GridPP4
  • Shaun:
    • Load testing of checksum trigger
    • Analysis of diskserver problem on certification
    • Assisting ASGC
    • Documentation updates
    • Installation of new castor client library and kernel on srm boxes
  • Chris:
    • Castor on Duty
    • Testing access restriction to disk servers with Jonathan
    • Working and testing on preproduction instance
    • Working on certification instance
    • Doing verification work for disk server deployed by Quattor
  • Cheney:
    • Assist DB team with database failover testing
    • Dished out programming work for NRPE plugins
    • Started on restore of broken ads0pt02
    • Rebuilt using DR procedure and cut over new castoradm1
    • Listened to a chap from Google tell us how it should be done
    • Assist Kash with fitting memory upgrades to database servers
  • Tim:
    • Configuring preprod tape system
  • Richard:
    • Began re-building castor301 and castor303 as pre-prod disk servers. Castor301 is down with memory problems (JA handling it); the apparent SCSI errors on castor303 have now been resolved. Some castor packages still need upgrading to the required versions
    • Deployed gdss272 and gdss273 disk servers into atlasScratchDisk
  • Brian:
    • Draining of ATLAS simStrip RAID5 disk servers
    • Clean-up of empty obsolete directories from the ATLAS nameserver
    • Continued '10 tape families for ATLAS DATATAPE
    • Analysis of SCRATCHDISK filling and decommissioning of 7 DATADISK servers for migration to SCRATCHDISK
  • Jens:
    • CIP upgrade testing

Developments for this week

  • Matthew:
    • Intervention planning
    • Setting up preprod stress test
    • Assisting DCI team members for evaluating CASTOR for Facilities
    • Planning CERN trip
    • CoD duties
  • Shaun:
    • Installation and testing of nameserver trigger
    • Investigating high failure rates of ATLAS SAM tests
    • castormon developments
    • Documentation updates
    • New disk pool for t2k
  • Chris:
    • Preparation for 'Big intervention' this week
    • Coordinate Castor section for the intervention on Wednesday and Thursday
    • Test and verify disk server deployed by Quattor
    • Brian and I need to prepare 'redeployment' procedure for disk servers which were in production
    • Test maximum number of job slots for new disk servers
    • Verify lcgflex01/02/03 and enable Quattor if all 3 are identical
    • Test 64 bit disk server with XFS (if time permits)
  • Cheney:
    • Database changes
    • Bring into use EMC kit
  • Tim:
    • Spend stuff
    • More pre-prod tape stuff
  • Richard:
    • Complete the update of castor packages on pre-prod disk servers castor301 and castor303.
  • Brian:
    • Draining of ATLAS simStrip RAID5 disk servers
  • Jens:
    • CIP upgrade (postponed from previous week), new release for other CASTOR sites

Operations Issues

  • none

Blocking issues

  • Lack of Quattor configuration files for SLC4.8 is stopping us evaluating Quattor alongside CASTOR 2.1.8. Preprod setup will initially proceed with a Kickstart-based deployment.
  • Preprod DB can only be delivered after EMC testing is done (3nd week after Jan'10)

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description | Start | End | Type | Affected VO(s)
Upgrade of memory on database nodes | (Ongoing) | (Ongoing) | At Risk | All instances
Upgrade LHC CIP and introduce redundancy of CIPs | 1/2/10 11:00 STC | 1/2/10 12:00 STC | At Risk | All instances
Big Intervention - Day 1: Fsck & kernel upgrades of all disk servers & head nodes (apart from tape servers). Add IPMI to head nodes. Restrict user login on disk servers. Update fetch-crl rpm on disk servers | 27/1/10 08:00 | 27/1/10 24:00 | Downtime | All instances
Big Intervention - Day 2: Move DB to EMC kit. Replace cdbc08 and add new DB archive log destination. Install NameServer CheckSum Trigger | 28/1/10 00:00 | 28/1/10 17:00 | Downtime | All instances

Advanced Planning

  • Gen upgrade to 2.1.8 in 2010 Q1
  • Install/enable gridftp-internal on Gen (This year/before 2.1.8 upgrade)

Staffing

  • Castor on Call person: Matthew