RAL T1 weekly ops Fabric 20110509

Developments (last week)

  • All:
  • Tim:
    • Metrics
    • Chasing useless Oracle to get T10KC drives installed
    • Central Backup stuff
    • DMF tape usage
    • DMF futures and costs
    • Castor tape funny.
  • James A:
    • Annual Leave
  • Cheney:
    • Writing job plan
    • Metrics
    • Solaris amanda testing
    • RCG meeting
    • Boot up thought bubble website for RDG peeps.
    • Extract amanda data to database
    • Writing DSM, disk server stats.
    • Didn't go to any conferences
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • quattor02 is showing no further SCSI errors after driver and firmware updates.
    • Firmware update on all Viglen 2007 disk servers. (ongoing)
    • Updating firmware on Jetstor systems. (ongoing) Updated on three so far.
    • gdss502 started acceptance test for 7 days.
    • Use Adaptec Storage Manager to monitor Storage servers. (SL09, V09 and SL10)
    • SL08: testing 3 disk servers with multiple drive failures and failed array.
    • Replaced two CV10 nodes.
    • gdss293 started memory test.
    • Added IPMI addresses for NC rack systems in dhcp and network spreadsheet.
    • gdss206 taken out of production due to multiple drive failures.
    • Booked review meeting with Andrew and James for Fabric metrics for other hardware failures
  • Martin:
    • HEPiX
  • Ian:
    • HEPiX
    • Planning switching Atlas production using CVMFS
    • Science Oxford talk (before bank holiday weekend)


Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities (Coming week)

  • All:
  • Tim:
    • Central Backup Document
    • T10KBs on DMF
    • SDB production service planning
    • More hassle for Oracle till C drives installed
    • Chat to SSC about renewing framework agreements
  • Cheney:
    • Solaris amanda testing
    • Set up backups webpage
    • Chomp the infosec for amanda
    • Think about training and how-tos for amanda
  • James A:
    • Catching up after leave.
    • Clearing ticket backlog.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Hardware failure metrics continue.
    • Continue SL08 testing.
    • Continue decommissioning old batch systems. (R 27)
    • Continue labelling racks and systems in UPS and HPD room.
  • Martin:
    • Database systems installations
    • Job plans
    • SLM stuff
    • ESC/CICT common ops program meeting
    • Networking issues
    • Catch up
  • Ian:
    • Switch Atlas production to use CVMFS
    • Joint e-Science/CICT project meeting
    • EGI User Virtualisation meeting (Thursday/Friday)
    • Catching up

Absences

  • Ian - Out Thursday-Friday
  • Tim - Out Tuesday

Fabric On-Call

  • Ian - Primary - Monday-Tuesday & Saturday-Sunday
  • Kash - Fabric on-call - Wednesday-Friday

Advanced Warning of Requirements and Blocking issues

Services Issues
