RAL T1 weekly ops Fabric 20110711

From GridPP Wiki

Developments

  • All:
  • Tim:
  • James A:
    • Started looking at SL6 deployment with Quattor.
    • Started Magda on datasource integration.
    • Moved two clusters of worker nodes to new subnet.
  • Cheney:
    • Helped Roger with extract of 100 million files from DMF.
    • Some TLC for the disk arrays.
    • Fixed various backups.
    • Battling with Quattor to send out Amanda.
    • Received two big NFS servers to test backups.


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old disk servers/batch systems.
    • Report incident to SHE.
    • Catch up with James.
    • Update firmware on Jetstor systems (ongoing); updated on three so far.
    • Try to send a faulty drive from SL08 batch to Areca.
    • Completed 'verify fix' on SL09 disk servers with bad blocks on drives.
    • Update hardware database.
    • Check and configure 7 SL08 disk servers for deployment.
    • Send SMART test logs for Viglen 2009 systems to David Power.
    • Appointment with OCH.
    • Write commands on wiki to check BBU status on RAID cards (Adaptec and Areca).
    • Appointment with OCH physio.
    • Enable the battery-protected write cache option on all SL09 disk servers.
    • Put risk assessment notice in Test room.
    • gdss208 read-only file system.
    • High rate of drive failures in Viglen 07 generation.


  • Martin:
  • Ian:


Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Tim:
  • Cheney:
    • Battle of wills with Quattor.
    • Stabilisation of large backups through Amanda.
  • James A:
    • Continue overseeing move of worker nodes to new subnet.
    • Continue work on SL6 in Quattor.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Hardware failure metrics continue.
    • Continue SL08 testing.
    • Continue decommissioning old disk servers/batch systems (R 27).
    • Continue labelling racks and systems in UPS and HPD room.


  • Martin:
  • Ian:

Absences

Fabric On-Call

  • Kashif : Monday - Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues

