RAL Tier1 weekly operations Fabric 20090713

From GridPP Wiki

Summary of week gone

Developments

  • All
  • Martin:
    • Post-move tidying up
    • Procurements
  • Ian:
    • Quattor FP7 bid planning meeting
    • Troubleshooting reinstallation on Quattor server
    • New hardware installation timeline
  • James T:
    • Unquiesced remaining disk servers following move.
    • Acceptance testing of Streamline kit.
    • Built new loggers and started testing.
    • Primary on call over the weekend.
  • Jonathan:
    • worked on reducing space used and changed logrotate policy on system loggers
    • updated RPMs on several systems and rebooted where required
    • renamed csfmonitor-pbs RPM to tier1-batchinfo, updated it to write to nagiosdb, and tested it
    • corrected problems on various systems following service restart after R89 move
    • created MySQL database minos_dogwood1 for user
    • updated RPMs on servers and rebooted
  • James A:
    • Startup of batch system.
    • Joined Ian's work on Quattor.
    • Updated IPMI card firmware on a few systems.
    • Cabled CASTOR Rack F (Certification Systems).
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Worked with Viglen Engineers.
    • gdss192: replaced memory, mainboard and RAID card (fixed).
    • gdss266: added new RAID card (fixed).
    • gdss198: replaced memory, mainboard and RAID card (still broken).
    • Replaced memory in gdss102, 223, 216 and 236.
    • Working on gdss73, 196, 198, 128, 121, 135, 150, 243

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Physical installation of resilient arrays for Oracle systems
    • Castor and kernel upgrade on disk servers
    • Possible reset of Stack-8 on Tuesday to regain access to it
    • Planning of Fabric intervention schedule for period to 31 August
  • Ian:
    • Python Course (Weds & Thurs)
    • Work with Martin installing production Quattor Server
    • Configuring production Quattor Server
  • James T:
    • Python course Monday and Tuesday.
    • Acceptance testing of Viglen kit now that it has all been handed over.
    • Keep an eye on all acceptance tests.
    • A bit of work on the CASTOR intervention on Tuesday (disk servers).
    • Finish testing new loggers, copy data across and put into production.
    • Quattor disk server deployment.
  • Jonathan:
    • Nagios configuration updates as required
    • update RPMs on core servers and reboot as required
    • complete adding simple Nagios configuration documentation to wiki
    • continue configuration work on nagger and nagiosdb
    • resurrect plan to move home filesystem to new server
    • create SL5 version of tier1-sendmail-config RPM
  • James A:
    • CASTOR Upgrade.
    • Start looking at Sindes for IC.
    • Add RAS's Production Actions to MyActions system.
    • Document user access to t1bofh.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing work on gdss73, 196, 198, 128, 121, 135, 150, 243, 218, 260 and 248.

Absences

  • James:
    • On leave Wed-Fri.

Fabric On-Call

  • Mon-Thu: James T

Advance Warning of Requirements and Blocking Issues

Services Issues

  • RT# 38567 - Dedicated WN for ALICE (SW area + gridftp area)
    • lcg0614 handed to Services Team
  • RT# 44835 - non-capacity HW for testing (Services)
