RAL Tier1 weekly operations Fabric 20090720

Summary of week gone

Developments

  • All
  • Martin:
    • (Saturday) Terminated testing on 27 disk servers (Viglen08) due to severe problems on each
    • Install and initial configuration of EMC data arrays and second SAN switch for resilient non-Castor Oracle services
  • Ian:
    • Python Course
    • Installing production Quattor server
    • Resolving Quattor installation mechanism issues
  • James T:
    • Python course Monday and Tuesday.
    • Acceptance testing of 2008 disk now in full swing
    • Keep an eye on all acceptance tests.
    • Rolled out the verify scheduling across the production disk servers, with a few exceptions to follow up.
    • Finished testing new loggers, copied data across and put new logger1 into production.
    • Started work on Quattor disk server deployment.
  • Jonathan:
    • sorted out atlasbackup problems on lcgsql0363, touch
    • updated RPMs on several core systems
    • installed tier1-batchinfo RPM on csflnx353, lcgbatch01
    • updated Nagios configuration
    • updated plan for migration of Nagios MySQL database and master server to new hosts
  • James A:
    • CASTOR Upgrade.
    • Documented user access to t1bofh.
    • On leave Wed-Fri.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss126 and gdss166 given back to castor (Fixed).
    • gdss192 given back to castor (Fixed; working without IPMI card).
    • Working on gdss73, 196, 198, 128, 121, 135, 150, 243, 218, 260 and 248.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type
Nagios | Cutover of backend database from lcgsql0363 to nagiosdb | Monday pm | Monday pm | None | No risk

Development priorities

  • All
  • Martin:
    • Further setup and configuration of SAN and arrays for resilient non-Castor Oracle service
  • Ian:
    • Further Quattor install server work
  • James T:
    • TOIL on Monday all day.
    • Keep an eye on all acceptance tests.
    • Quattor disk server deployment.
  • Jonathan:
    • switch Nagios and batch jobs MySQL databases to nagiosdb with update of Mimic (James A)
    • work with James T to sort out system loggers
    • reboot AFS servers (after kernel updates last week)
    • complete updates to plan to move home filesystem to new server
  • James A:
    • Start looking at Sindes for IC.
    • Go through QUATTOR installs with IC.
    • Add RAS's Production Actions to MyActions system.
    • Update Mimic to use new nagios database.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Move "Marley" from R27 (Ops).
    • Create list and location of Viglen 06 Disk servers for adding additional Raid cards.
    • Continue working on gdss73, 196, 198, 128, 121, 135, 150, 243, 218, 260 and 248.

Absences

  • Ian:
    • A/L from midday Wednesday
  • James T:
    • TOIL on Monday
  • Jonathan:
    • A/L Wednesday (weather dependent), Friday.
    • A/L All w/b 27/7

Fabric On-Call

Advance Warning of Requirements and Blocking issues

Services Issues

  • Update of ntpd RPM can leave ntpd process not running (3 instances seen and corrected; there is a Nagios test for ntpd daemon)
  • RT# 44835 – non capacity HW for testing (Services)
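
For the ntpd item above, a process check of the kind Nagios runs could look roughly like the sketch below. This is not the Tier1's actual Nagios test: the function names `classify`, `check_process` and `main` are hypothetical, and the sketch assumes a Linux host with `pgrep` available. It relies only on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(name, returncode, pids):
    """Map a pgrep result to a Nagios (exit_code, message) pair.

    pgrep exits 0 when at least one process matched and 1 when none did;
    any other status is treated here as UNKNOWN.
    """
    if returncode == 0 and pids:
        return OK, "OK: %s running (pid %s)" % (name, ", ".join(pids))
    if returncode == 1:
        return CRITICAL, "CRITICAL: %s is not running" % name
    return UNKNOWN, "UNKNOWN: pgrep exited with status %d" % returncode


def check_process(name):
    """Run `pgrep -x NAME` (exact process-name match) and classify the result."""
    try:
        result = subprocess.run(
            ["pgrep", "-x", name],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
    except OSError as exc:
        # pgrep missing or not executable: the check cannot give an answer.
        return UNKNOWN, "UNKNOWN: could not run pgrep: %s" % exc
    pids = result.stdout.decode().split()
    return classify(name, result.returncode, pids)


def main():
    """Print the one-line status Nagios displays and return the exit code."""
    code, message = check_process("ntpd")
    print(message)
    return code
```

To use this as a plugin, the script would end with `sys.exit(main())` (after `import sys`), so that Nagios sees the exit code. Reporting a stopped daemon as CRITICAL rather than WARNING is a choice made for this sketch, on the grounds that an RPM update leaving ntpd down needs manual correction.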
