RAL Tier1 weekly operations Fabric 20090727


Summary of week gone

Developments

  • All
  • Martin:
  • Ian:
    • Revised Quattor Workplan
    • Production quattor server
    • Introducing jrha and jiant to Quattor
  • James T:
    • TOIL on Monday.
    • Worked on problems with testing on Viglen 2008 disk - ongoing
    • New loggers rolled out, most hosts using them now. Acceptance machines now using a test disk server as syslog server to prevent flooding of the loggers.
  • Jonathan:
    • worked on moving logs from csflnx266/270
    • assisted Kashif to move marley hardware from R27 to R89
    • assigned Tier1 userid for Mayo Agard-Olubu
    • Nagios configuration updates
    • switched off callouts from netnag
    • switched Nagios MySQL databases from lcgsql0363 to nagiosdb
    • switched off callouts from nincom whilst sysreq and automate servers were being moved
  • James A:
    • Started looking at Sindes for IC.
    • Went through Quattor installs with IC.
    • Updated Mimi(c) to use new Nagios database.
    • Fixed a few Mimi(c) bugs.
    • Started investigating interaction between kernel, TORQUE & MAUI memory limits to understand job kills.
    • Applied gmetric-df to software servers.
    • Added option to pool account creation script to skip batch system checks at install time.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss356: tested memory, no fault found. (Still in intervention)
    • gdss128, 150 and 248: given back to Castor. (Fixed)
    • gdss218, 260 and 269: replaced 8x1GB memory; given back to Castor. (Fixed)
    • lcgpx0619: replaced 2 drives with new ones and updated BIOS for hot-swapping. (Fixed)
    • Working on Viglen and Streamline 2008 disk servers.
    • Working on gdss73, 196, 198 and 243.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Further setup and configuration of SAN, arrays and systems for resilient non-Castor Oracle service
    • Procurements
  • Ian:
    • Finalising production quattor server
    • Quattor deployment - further work on RAL specific components
  • James T:
    • Viglen 2008 disk acceptance testing
    • Move switch/PDU syslog to new loggers.
    • Decommission old loggers.
    • Quattor disk server deployment.
  • Jonathan:
    • A/L
  • James A:
    • Add RAS's Production Actions to MyActions system.
    • Discuss WN deployments with IC.
    • Continue learning about TORQUE & MAUI.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on Viglen and Streamline 2008 disk servers.
    • Continuing work on gdss73, 196, 198 and 243.

Absences

  • Ian:
  • James T:
  • Jonathan:
    • A/L all week

Fabric On-Call

  • Ian (Primary on-Call)

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 44835 – non-capacity HW for testing (Services)
