RAL Tier1 weekly operations castor 14/09/2009

From GridPP Wiki
Revision as of 07:12, 15 September 2009 by Matt viljoen

Summary of Previous Week

  • SRM 2.8 upgrade on Gen (Shaun, DB Team)
  • CASTOR away day (All)
  • CIP testing (Jens)
  • Finalizing plans for database performance tuning (DB Team)
  • Certification and testing of 2.1.8 NS (Chris)
  • Preparing disk server deployment documentation (Chris)
  • Dealing with CASTOR DB incident (All)
  • Implemented monitoring of tape robot controller (Cheney)
  • Verifying backups (Cheney)
  • Updating kernels (Cheney)
  • Investigating distribution of RAID5/6 servers across service classes (Brian)
  • SRM 2.9 development (Shaun)
  • Installation of new CASTOR servers (Matt/Tim)

Developments for this week

  • 2.1.8 NS Upgrade (Chris)
  • SRM 2.8 upgrade on LHCb, CMS (Shaun, DB Team)
  • DB Performance Tuning (DB Team)
  • Updating DB Kernels (Cheney)
  • Investigating distribution of RAID5/6 servers across service classes (Brian)
  • Investigating cause of DB hardware problems (Cheney)
  • Installation of new CASTOR servers (Tim/Cheney/Matt)
  • Preproduction planning (Richard/Matt/Tim/Chris)
  • T10KB tape deployment plans (Tim/Matt)

Ongoing

  • CastorMon monitoring graphs for Gen instance (Brian)
  • Setting up Preproduction (Matt, Chris)

Operations Issues

  • DB hardware failure affecting all instances

Blocking issues

  • Problems with the Ganglia check on the Gen instance are delaying work on monitoring (in hand)

Planned, Scheduled and Cancelled Down Times

Entries in/planned to go to GOCDB

Description                                   Start         End           Type      Affected VO(s)
Upgrade LHCb SRM to 2.8                       14/9/09 1000  14/9/09 1200  Downtime  LHCb
Nameserver upgrade and database optimization  15/9/09 0900  15/9/09 1300  Downtime  All
Update kernels on database servers            15/9/09 1300  16/9/09 1700  At Risk   All
Upgrade CMS SRM to 2.8                        16/9/09 1000  16/9/09 1200  Downtime  CMS
Upgrade ATLAS SRM to 2.8                      21/9/09 1000  21/9/09 1200  Downtime  ATLAS
Suspend CASTOR during R89 UPS test            22/9/09 0800  22/9/09 1000  Downtime  All

Changes to Production Milestones

None

Advanced Planning

  • CIP upgrade to include nearline publishing (Sept)
  • SRM 2.8 upgrade (Sept)
  • Upgrade nameserver to 2.1.8 (Sept)
  • Black and White lists? (Possibly during Sept)
  • Improve resiliency to central services (This year)

Staffing

  • CASTOR on-call person: Shaun