RAL Tier1 weekly operations Fabric 20100823

Developments

  • All:
  • Martin:
  • Ian:
  • Tim:
  • Jonathan:
    • Out all week
  • James A:
  • James T:
  • Cheney:
    • ADS cache node ran out of disk space
    • trying to locate key to unlock disk array
    • move preprod database to florence disk array x 2
    • investigate security alert on ads0pt02
    • regenerate web stats for hinode external customer
    • fix backups on buxton-kiki
    • make space on dmf


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss110 fsprobe errors. (Intervention)
    • Replaced 1 drive in Streamline 2009 (Test) disk servers.
    • gdss490, 492, 499, 501 and 505 crashed during acceptance testing. (Reported)
    • gdss381 crashed with single drive failure. (Intervention)
    • lcgfts02: replaced both drives. (Fixed)
    • gdss280 fsprobe errors (Intervention)
    • Hardware failure stats/graphs.
    • Preparing Viglen 2006 disk servers with new RAID configuration for Castor Preprod.
    • Streamline/Areca disk servers crashed due to single faulty drive. (Ongoing)


Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan on leave Tuesday - Thursday so out all week

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
  • Tim:
  • Cheney:
    • Facilities castor
  • Jonathan:
    • update imapd certificate on pat
    • start Nagios process on new slave server (nagios01 for batch workers) and shut down old Nagios slave servers once stable
    • release new versions of RPMs tier1-nagios-plugins, tier1-sudo-config and tier1-nrpe-config; make the change to tier1-nrpe-config in the Quattor configuration as well
  • James T:
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss417 crashed. (Intervention)
    • gdss468 down (Intervention)
    • Update daily status of Streamline 2009 disk servers testing.
    • Continuing decommissioning of old batch systems. (R 27)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Fabric On-Call

  • Kashif Hafeez

Advanced Warning of Requirements and Blocking issues

Services Issues
