RAL Tier1 weekly operations castor 24/10/2011

From GridPP Wiki
Revision as of 15:00, 24 October 2011 by Matt viljoen (Talk | contribs)


Operations News

WAN tuning changes were rolled out to approximately half of the production disk servers on the 21st. It remains to be seen whether this has improved transfer rates.

Operations Problems

  • Three CMS disk servers (gdss303, gdss304, gdss305) were found to hold a large amount of dark data: they had been redeployed from another instance with cleanLostFiles run on them, but without waiting for garbage collection to complete. In future, the CASTOR team will wipe the data partitions of redeployed disk servers with "rm -rf" to avoid this problem.
  • Database hardware problems on Saturday brought down all CASTOR instances. Service was restored on Sunday after a hardware reconfiguration.
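The wipe step mentioned above can be sketched as a small script. This is a hypothetical illustration only, not the Tier1 team's actual procedure: the function name and the example mount points are assumptions, and deleting contents with `find -delete` (rather than a bare `rm -rf`) keeps the mount point directory itself in place.

```shell
#!/bin/sh
# Hypothetical sketch: clear the data partitions of a redeployed disk
# server so no stale files survive as dark data on the new instance.
set -eu

# wipe_partitions: delete everything under each given mount point,
# keeping the mount point directory itself.
wipe_partitions() {
    for p in "$@"; do
        if [ -d "$p" ]; then
            echo "wiping $p"
            find "$p" -mindepth 1 -delete
        else
            echo "skipping $p (not a directory)" >&2
        fi
    done
}

# Example (assumed mount points):
# wipe_partitions /exportstage/castor1 /exportstage/castor2
```

Running this before the server rejoins its new instance removes any files that cleanLostFiles and garbage collection have not yet dealt with.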

Blocking Issues

  • We need to understand the cause of the hardware problem on the new database disk array before we can migrate production databases to it.

Planned, Scheduled and Cancelled Interventions

Entries in / planned to go to GOCDB: none

Advanced Planning

  • Move Tier1 instances to the new database infrastructure, with a Dataguard backup instance in R26
  • Upgrade SRMs to 2.11, which incorporates VOMS support
  • Certify 2.1.11 and evaluate the Transfer Manager (the new LSF replacement)
  • Quattorization of remaining SRM servers
  • Hardware upgrade, Quattorization and Upgrade to SL5 of Tier1 CASTOR headnodes

Staffing

  • Castor on Call person: Matthew
  • Staff absence/out of the office:
    • Matthew at LTUG (Wed) and in DL (Fri)