RAL Tier1 weekly operations castor 08/08/2011

From GridPP Wiki
Revision as of 10:56, 5 August 2011 by Matt viljoen (Talk | contribs)


Operations News

  • Repack deployed into production with 6 disk servers
  • New Facilities databases successfully stress-tested with preprod. Up to 1000 DB transactions per minute were recorded, double the typical production peaks on Tier1 instances
  • Tier1 dashboard now switched from the old Castormon to its replacement, the new Nagios Castor monitoring
  • Cron job on ATLAS now runs at 13:00 rather than at night, to reduce the number of low-space callouts

Operations Problems

  • Seven LHCb files from Sep/Oct 2010 were found to be present in the NS but physically missing. The cause of their loss is unknown.

Blocking Issues

none

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to new Database infrastructure, which has a Dataguard backup instance in R26
  • Move Facilities DB instance to new Database hardware running 10g
  • Upgrade SRMs to 2.11, which incorporates VOMS support
  • Start migrating from T10KA to T10KC media later this year
  • Certify 2.1.11 and evaluate the new LSF replacement
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

  • Castor on-call person: Chris
  • Staff absence/out of the office:
    • Matthew (A/L)