RAL Tier1 weekly operations Fabric 20110110

From GridPP Wiki


Developments

  • All:
  • Martin:
  • Ian:
    • Catching up after holiday
    • Setting up new database nodes for testing
    • Some work on virtualisation tests
    • Fixing repository updates
    • Generating new errata templates in Quattor
  • Tim:
  • James A:
    • On leave
  • James T:
    • Post Christmas catch up
    • Discovered with Shaun that rsyslog is using TCP for log messages to DLF, which may be the cause of unresponsive disk servers.
    • Tested update of puppet using quattor
    • SL10/V10 acceptance tests
    • Investigated the CERN burn-in tests
    • Updated documentation
  • Cheney:
    • Fixed Nagios on Solaris boxes
    • Cleared down the errors on servers from the Xmas week
    • Straightened out and checked over backups of various sorts
    • Emptied out the Outlook inbox
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss380 still with Streamline for fix (crashed with single faulty drive).
    • gdss417 acceptance testing (crashed with single faulty drive).
    • gdss357: replaced memory and power distribution board with Viglen engineer.
    • Updated wiki for Spares for Xmas period.
    • Job plan review.
    • Fabric Hardware failure metrics.
    • Streamline/Areca disk servers crashed due to single faulty drive (ongoing).
    • gdss70 given back to Castor team.
    • gdss337 kernel panic (faulty memory).
    • gdss283 crashed with file system problem (intervention).
    • gdss68: re-created array but it still fails to see the replacement drive (probably faulty backplane).
    • SL 2010 and Viglen 2010 disk servers in testing.
    • gdss496: SCSI errors, out of production.
    • SL 2009 Auto rebuild on hotspare fails.
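
The rsyslog item under James T above could be checked across disk servers with something like the following. This is a minimal sketch, assuming the servers use rsyslog's legacy forwarding syntax, where `@@host` forwards over TCP and `@host` over UDP. The host name `dlf.example.org` and the function name are hypothetical, and the sketch only reports which protocol is configured; it does not confirm that TCP forwarding is the cause of the unresponsive servers.

```python
import re

# Legacy rsyslog forwarding actions: "@@host" means TCP, "@host" means UDP.
# An optional "(options)" group may follow the "@" signs.
FORWARD_RE = re.compile(
    r"^(?P<selector>\S+)\s+(?P<at>@{1,2})(?:\([^)]*\))?"
    r"(?P<target>\[[^\]]+\]|[^\s:@]+)(?::(?P<port>\d+))?\s*$"
)


def find_forwarding_rules(config_text):
    """Return (selector, protocol, target) for each forwarding line."""
    rules = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip blanks, comments and "$" directives.
        if not stripped or stripped[0] in "#$":
            continue
        match = FORWARD_RE.match(stripped)
        if match:
            protocol = "tcp" if match.group("at") == "@@" else "udp"
            rules.append((match.group("selector"), protocol, match.group("target")))
    return rules


# Hypothetical configuration; the host name is illustrative only.
SAMPLE = """\
# forward everything to the central logger
*.* @@dlf.example.org:514
local3.* @dlf.example.org
*.info /var/log/messages
"""

if __name__ == "__main__":
    for selector, protocol, target in find_forwarding_rules(SAMPLE):
        print(selector, protocol, target)
```

Running the function over each server's rsyslog configuration would list any selectors still forwarding over TCP, which is a starting point for deciding where to switch protocol or add queueing.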

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Finalising and publishing errata templates
    • Help Shaun with Facilities Castor configuration
    • Virtualisation/iSCSI testing
  • Tim:
  • Cheney:
    • DMF rsync setup
    • DMF samba users setup
    • DMF disaster recovery plan
    • Write a mathematical model for disk storage problems
  • James T:
    • Preparation for ATLAS SL5 64-bit upgrade
    • Disk servers as iSCSI targets
    • Test puppet update on kickstarted disk servers
    • Puppet -> Quattor migration on disk servers
  • James A:
    • Catching up after holiday
    • Adding new sensors to Artemis
    • Working to get new worker nodes into testing
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Job plan review update.
    • SL 2009 Auto rebuild on hotspare fails.
    • Hardware failure metrics continue.
    • Continuing decommissioning of old batch systems (R 27).

Absences

  • Cheney out Tuesday morning.
  • James T A/L Thursday

Fabric On-Call

  • Kashif Monday - Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues


RAL Tier1 weekly operations fabric

Category:RAL_Tier1