RAL Tier1 weekly operations castor 24/04/2012

From GridPP Wiki

Latest revision as of 08:37, 23 April 2012

Operations News

  • Increased the number of d2d copy slots on the atlasStripInput '07 and '08 servers to help drain disk servers more quickly (Fri)
  • Fixed SLS tape monitoring

Operations Problems

  • Xrootd problems (transfer failures) for atlasStripDeg were traced to a missing path/svc mapping in xrd.cf on the Atlas DLF machine (Wed)
  • Atlas declared 2 files lost because the clock was out of sync on the Atlas stager machine. The problem has been fixed and a Nagios check created to monitor any time drift (Thur)
  • Gdss445 had problems with d2d copies, affecting draining mode. Fixed by recreating the LSF dynamic libraries and restarting the LSF client daemons (Thur)
  • Gdss209 (atlasScratchDisk) went down on Friday night and was recovered on Sunday morning
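
As an illustration of the kind of Nagios check mentioned above for clock drift, here is a minimal sketch. It is not the check deployed at RAL: the function name `classify_drift`, the thresholds and the message format are all assumptions made for this example. Only the exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) is the standard Nagios plugin one.

```python
# Hypothetical sketch of a Nagios-style clock drift check.
# Thresholds and message wording are assumed, not taken from the real check.

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_drift(offset_seconds, warn=1.0, crit=5.0):
    """Map a measured clock offset (in seconds) to a Nagios status and message.

    offset_seconds may be negative (local clock behind the reference) or
    None when no measurement could be obtained.
    """
    if offset_seconds is None:
        return UNKNOWN, "CLOCK UNKNOWN - no offset measurement available"
    drift = abs(offset_seconds)
    if drift >= crit:
        return CRITICAL, f"CLOCK CRITICAL - offset {offset_seconds:+.3f}s"
    if drift >= warn:
        return WARNING, f"CLOCK WARNING - offset {offset_seconds:+.3f}s"
    return OK, f"CLOCK OK - offset {offset_seconds:+.3f}s"
```

A real plugin would obtain the offset from an NTP query, print the message on one line and exit with the returned status code so that Nagios can raise the alert.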

Blocking Issues

  • none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

Description              Start  End  Type     Affected VO(s)  Lead by
CIP 2.2.0 upgrade (STC)  TBD    TBD  At-risk  All             Matthew

Advanced Planning

Tasks

  • Test and re-apply CIP upgrade (Jens, Matthew)
  • Test and certify 2.1.12-4 and 2.1.11-9 (Matthew, Chris)
  • Stress testing of Transfer Manager (TM) (Shaun, All) DONE
  • Ganglia monitoring for TM (Rob, Chris) IN PROGRESS
  • Re-instantiate certification on HyperV VMs using Quattor+Puppet (Rob)
  • Stress testing of CV11 generation disk servers on preprod (Rob, Matthew)
  • Selection of disk-only prototype solution (Shaun, Rob, Brian, James)
  • Switch to Tape Gateway on repack and test (Tim, Matthew) DONE

Interventions

  • Upgrade repack to 2.1.12-4 (Apr)
  • Switch from LSF to TM after 2.1.11-8 upgrade. Will need to better stress-test TM on preprod with more disk servers. (Apr)
  • Switch to Tape Gateway (TG) once it has been tested on repack (May)
  • Upgrade Castor Facilities and Tier1 instances to 2.1.11-9 (Jun)
  • Upgrade Oracle to 11g (Jun)
  • Upgrade to 2.1.12 on Tier1 instances once we are happy with the performance of TM and TG (Jul)

Staffing

  • Castor on Call person: Chris
  • Staff absence/out of the office:
    • Rob (A/L)