RAL Tier1 weekly operations Fabric 20100531

From GridPP Wiki

Developments

  • All:
  • Martin:
  • Ian:
    • Virtualisation platform research
    • Managing Performance course
    • Helped Tiju set up EGEE Nagios server with Quattor
  • Tim:
  • Jonathan:
    • reset AFS password for user
    • responded to request from security re calls from AFS servers to commercial system
    • 1 Nagios configuration update
    • corrected bug in check_spma.sh plugin
    • issued new versions of RPMs tier1-nagios-plugins, tier1-sudo-config and tier1-nrpe-config
    • worked on Job Plan for 2010-2011
  • James A:
  • James T:
    • Wrote benchmarking document for tender process and ran benchmarks on Viglen 2009 kit to get performance figures.
    • Streamline 2009 testing
    • Completed job plan
    • Deployment allocations
    • Started investigating mail forwarding on Quattor systems (taken up by Ian)
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • Daily hardware failure status reports on Streamline 2009 disk servers to James T.
    • gdss380 swapped drives with gdss368. (Intervention)
    • gdss423: faulty backplane found and replaced by Viglen engineer. (Fixed)
    • gdss85 double disk failure. (Intervention)
    • gdss321 given back to production.
    • gdss332 probably has a faulty IPMI card.
    • gdss153 and gdss165 given back to production.
    • Reported that Streamline/Areca disk servers crashed due to a single faulty drive.
    • gdss272 three faulty drives. (Replaced all three drives)
    • gdss213 two faulty drives. (Back to production)
    • Job plan with MJB.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave all week
  • Jonathan on leave, Tuesday 25th

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, the GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Virtualisation platform development
    • CRISTAL 2 preparation
    • Reviewing site wide Quattor configurations
    • complete job plan
  • Tim:
  • Cheney:
  • Jonathan:
    • start regular check restores of home filesystem
    • complete Job Plan
    • continue investigations on setting up AFS directory as ATLAS software server
    • Nagios configuration updates
  • James T:
    • Streamline 2009 testing
    • Security team work
    • Adaptec and LSI support for Nagios and the verify system
    • CRISTAL feedback for Ian
    • Job plan into SSC
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing decommissioning of old batch systems. (R 27)
    • Daily hardware failure status reports on Streamline 2009 disk servers to James T.
    • Move gdss423 back to machine room.
    • Replace memory in gdss67.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • All on Bank holiday, Monday 31st May
  • Tim on leave 1st-4th June
  • James T on leave 7th to 11th June

Fabric On-Call

Ian: Primary, Tuesday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues

