RAL Tier1 weekly operations castor 06/07/2009

From GridPP Wiki
Revision as of 14:45, 6 July 2009 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • Moving CASTOR central services to R89 and then bringing up/testing (All)
  • SRM development (Shaun)
  • Certification of 2.1.7-27 with new LSF configuration (Chris)

Developments for this week

  • Monitoring CASTOR as it is brought back into production (All)
  • 2.1.7-27 upgrade preparation - testing synchronisation and kernel upgrades (Chris)
  • SRM development (Shaun)
  • CIP development (Jens)

Ongoing

  • Cleaning up database for a future 2.1.8 upgrade (Shaun)
  • Setting up Preproduction (Matt)
  • Test 2.1.8-8 on tape drives (Tim)
  • Prepare preproduction platform for stress testing (crosstalk investigations suspended) (Chris/Matt)
  • Adding virtual disk servers to preproduction (Matt)

Operations Issues

  • 5 disk servers (2 disk-only) under intervention since R89 move
  • Tape servers were stuck in BUSY state after CASTOR startup and needed to be reset
  • 3 dead PSUs on head nodes
  • ypbind did not start up on a head node, even though it was set to ON with chkconfig

Blocking issues

none

Scheduled and Cancelled Down Times

Entries in/planned to go to GOCDB

Description                               Start         End           Type      Affected VO(s)
R89 move                                  6/6/09 1200   10/6/09 1700  At Risk   All
Apply Oracle BigID patch                  13/7/09 0800  13/7/09 1700  At Risk   All
2.1.7-27 upgrade and LSF reconfiguration  14/7/09 0800  14/7/09 1700  Downtime  All
2.1.7-27 upgrade and LSF reconfiguration  14/7/09 0700  15/7/09 1700  At Risk   All

Changes to Operational Milestones

Description Changed Status
Migrate to new Oracle database hardware (H) DB team, Cheney DONE
Test and deploy new LSF configuration to remove the need for NFS mounts (H) Chris Ongoing
Certify and upgrade to 2.1.7-27 with new functionality which tweaks synchronisation (H) Chris Ongoing
Apply Oracle BigID fix to fix (H) DB team New

Advanced Planning

  • Preferably do kernel upgrades of all systems during 2.1.7-27 upgrade
  • SRM 2.8 upgrade (sometime during July)
  • Start using Black and White lists (sometime during July)
  • CIP upgrade to include nearline publishing (sometime during July)
  • Upgrade nameserver to 2.1.8 (September?)

Staffing

  • CASTOR on-call person (also CASTOR on day duty): Shaun
  • Chris at CRISTAL1 course Mon-Wed
  • Matt at CERN for STEP09 post mortem Thurs, Fri