RAL Tier1 weekly operations Fabric 20110124



Developments

  • All:
  • Martin:
  • Ian:
  • Tim:
  • James A:
    • Atlas power off completed with no issues.
    • Bulk of ClusterVision WN issues now sorted, now trying to push through acceptance.
    • Continued benchmarking.
  • James T:
    • ATLAS SL5 x86_64 upgrade
    • End of acceptance tests on Viglen and Streamline 2010 disks
    • AFS documentation
    • First aid course Wednesday and Thursday
  • Cheney:
    • Atlas power shutdown.
    • Backup performance testing.
    • Huge backup testing.
    • Generate solarb stats.
    • DMF disaster recovery.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss380 added new MAC address in DHCP and re-installed.
    • gdss189 read-only filesystem (SCSI errors).
    • gdss496 SCSI errors. Reported to Streamline with logs. (Intervention)
    • gdss451 replaced 4 x 2 GB memory. (Back into production same day)
    • Fabric Hardware failure metrics.
    • Jetstor systems multiple drive failures.
    • gdss576 and gdss577 getting logs to send to Streamline.
    • gdss496 post-mortem.
    • lcgwms03 replaced drive by hot swap.
    • gdss283 crashed with file system problem. (Intervention)
    • Quattor01 went down. No problem found.
    • SL 2010 and Viglen 2010 disk servers in testing.
    • SL 2009 auto rebuild on hot spare fails.


Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
  • Tim:
  • Cheney:
    • DMF disaster recovery
  • James T:
    • Disk checks in Kash's absence
    • Review testing of Streamline and Viglen 2010 disks
    • Preparation for CMS SL5 x86_64 upgrade
    • Streamline 2008 testing
    • Project management training Thursday and Friday
  • James A:
    • Pushing ClusterVision WNs through acceptance testing.
    • Hardware support while Kash on leave.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • SL 2009 auto rebuild on hot spare fails.
    • Hardware failure metrics continue.
    • SL08 testing.
    • Continue decommissioning old batch systems (R 27).

Absences

  • Kash out Monday (Annual Leave)
  • JRHA out Wednesday (Annual Leave)
  • James T out Thursday and Friday (project management training)

Fabric On-Call

  • Monday - Sunday

Advance Warning of Requirements and Blocking Issues

Services Issues



Category:RAL_Tier1