RAL Tier1 weekly operations castor 20090706


Summary of Previous Week

  • Moving CASTOR central services to R89, then bringing them back up and testing them
  • SRM development (Shaun)
  • Certification of 2.1.7-27 with new LSF configuration (Chris)


Developments for this week

  • Monitoring CASTOR as it is brought back into production (All); see the connectivity-check sketch after this list
  • 2.1.7-27 upgrade preparation - testing synchronisation and kernel upgrades (Chris)
  • SRM development (Shaun)
  • CIP development (Jens)
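
A minimal sketch of the kind of connectivity probe that could support the monitoring item above, assuming hypothetical host names and ports for the central services (the real RAL endpoints are not listed here):

 #!/usr/bin/env python
 """Sketch of a post-restart connectivity probe for CASTOR central services.
 The host names and ports below are placeholders, not the real RAL endpoints."""
 import socket

 # Hypothetical (host, port) pairs for the central services (name server,
 # stager, SRM front end); substitute the production values.
 SERVICES = [
     ("castor-ns.example.org", 5510),
     ("castor-stager.example.org", 9002),
     ("srm.example.org", 8443),
 ]

 def is_reachable(host, port, timeout=5.0):
     """Return True if a TCP connection to host:port succeeds within timeout."""
     try:
         socket.create_connection((host, port), timeout).close()
         return True
     except (socket.error, socket.timeout):
         return False

 if __name__ == "__main__":
     for host, port in SERVICES:
         state = "OK" if is_reachable(host, port) else "UNREACHABLE"
         print("%-28s %5d  %s" % (host, port, state))

Such a probe only confirms that each daemon is accepting TCP connections; functional tests (e.g. an srmPing or a small stager request) would still be needed on top of it.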

Ongoing

  • Cleaning up the database ahead of a future 2.1.8 upgrade
  • Setting up Preproduction (Matt)
  • Testing 2.1.8-8 on tape drives (Tim)
  • Preparing the preproduction platform for stress testing (crosstalk investigations suspended) (Chris/Matt)
  • Adding virtual disk servers to preproduction (Matt)


Operations Issues

  • Tape servers were stuck in BUSY state after CASTOR startup and needed to be reset
  • Three dead PSUs on head nodes
  • ypbind did not start up on a head node, even though it was chkconfig-ed on; see the check sketched after this list
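
As a follow-up to the ypbind incident, a minimal sketch of a check that flags services enabled via chkconfig but not actually running; it assumes a Red Hat style head node where the chkconfig and service commands are available:

 #!/usr/bin/env python
 """Sketch of a boot-configuration check prompted by the ypbind incident:
 flag services that chkconfig enables at boot but that are not running."""
 import os
 import subprocess

 def enabled_at_boot(name):
     """True if 'chkconfig --list <name>' shows the service on in any runlevel."""
     proc = subprocess.Popen(["chkconfig", "--list", name],
                             stdout=subprocess.PIPE)
     out = proc.communicate()[0]
     return ":on" in out.decode()

 def is_running(name):
     """True if 'service <name> status' exits 0, i.e. the daemon is running."""
     devnull = open(os.devnull, "w")
     try:
         return subprocess.call(["service", name, "status"],
                                stdout=devnull, stderr=devnull) == 0
     finally:
         devnull.close()

 if __name__ == "__main__":
     for svc in ("ypbind",):          # extend with other critical services
         if enabled_at_boot(svc) and not is_running(svc):
             print("%s: enabled at boot but not running" % svc)

Run after a reboot (or from cron) on each head node, this would have caught the ypbind mismatch before it affected users.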


Blocking issues

None.


Scheduled and Cancelled Down Times

Entries in, or planned to go into, the GOCDB

Description                                 Start (dd/mm/yy hhmm)   End (dd/mm/yy hhmm)   Type       Affected VO(s)
R89 move                                    25/6/09 0600            6/6/09 1200           Downtime   All
R89 move                                    6/6/09 1200             10/6/09 1700          At Risk    All
Apply Oracle BigID patch                    13/7/09 0800            13/7/09 1700          At Risk    All
2.1.7-27 upgrade and LSF reconfiguration    14/7/09 0800            14/7/09 1700          Downtime   All
2.1.7-27 upgrade and LSF reconfiguration    14/7/09 0700            15/7/09 1700          At Risk    All


Advanced Planning

  • Preferably, carry out kernel upgrades on all systems during the 2.1.7-27 upgrade
  • SRM 2.8 upgrade (sometime during July)
  • Start using Black and White lists (sometime during July)
  • CIP upgrade to include nearline publishing (sometime during July)


Staffing

  • CASTOR on-call person (also covering CASTOR day duty): Shaun
  • Chris is at the CRISTAL1 course Mon-Wed
  • Matt is at CERN for the STEP09 post-mortem Thu-Fri