RAL Tier1 weekly operations Fabric 20100816

Developments

  • All:
  • Martin:
  • Ian:
    • Work on virtualisation testbed
    • Configuring iSCSI storage
  • Tim:
  • Jonathan:
    • got new host certificate for pat, saved it on touch, and created Change Control for update of imapd certificate
    • applied 2 updates to source for RPM tier1-sudo-config for CMS
    • corrected problem with logrotate on lcgcts0[1-9] by correcting Quattor configuration
    • AFS user assistance
    • user changes
    • 1 Nagios configuration update
    • installed slave server for batch workers (nagios01) using Quattor, created initial configuration for nagios01 and verified it; submitted Change Control for switch to new slave
    • updated source for tier1-nrpe-config RPM
  • James A:
    • Planning cabling for Castor racks G and H.
    • General Quattoryness.
  • James T:
    • Disk server bits and pieces in Kash's absence.
    • Streamline 2009 testing:
      • 57 machines testing OK, due to finish on Wednesday 18th August.
      • Two machines stopped testing. Streamline, LSI and WD are investigating the cause.
      • One machine still at LSI until problem with the two machines has been diagnosed.
    • Work on new Areca firmware for Streamline 2008 machines to fix problems with arrays going offline.
      • Testing on gdss380 has run for a week without problems.
      • Three machines have now shown this problem (gdss380, gdss381, gdss417), so we've escalated to Streamline.
    • Started to re-configure IPMI on the disk servers to use the 10.0.0.0/8 network addresses and document access.
    • Initial stab at quattorised logger (work in progress).
  • Cheney:
    • not around.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss110 fsprobe. (Started memtest)
    • Replaced 2 drives in Streamline 2009 (Test) disk servers.
    • gdss490 and gdss492 crashed during acceptance testing. (reported)
    • gdss381 crashed with single drive failure. (Intervention)
    • lcgfts02 replaced drive (sda).
    • Hardware failure stats/graphs.
    • Preparing Viglen 2006 disk servers with new raid configuration for Castor Preprod.
    • Streamline/Areca disk servers crashing due to a single faulty drive. (ongoing)


Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Preparation for GridPP 25
    • Further work on virtualisation testbed
  • Tim:
  • Cheney:
    • Cutover to another disk array for Castor preprod
  • Jonathan:
    • On leave Tuesday-Thursday, so out all week
  • James T:
    • A/L Monday and Tuesday
    • Disk server IPMI set up
    • Quattorised loggers
  • James A:
    • Cabling Castor racks G & H.
    • Various Nagios test changes and developments.
    • Planning and administration for meetings.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss417 crashed again. (Intervention)
    • Update daily status of Streamline 2009 disk servers testing.
    • Continuing to decommission old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan on leave Tuesday - Thursday (so out all week)
  • Cheney on leave Thursday
  • Tim on leave all week

Fabric On-Call

  • Kashif - Monday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues
