RAL Tier1 weekly operations Fabric 20090902


Summary of week gone

Developments

  • All
  • Martin:
    • Procurements
    • Security incident overview, kernel updates
    • Hardware asset tagging
  • Ian:
    • Quattor work for SL5 WNs
  • James T:
    • Continued testing of new Viglen machines.
    • Met with Viglen to get a status update on their end and to give an update on ours; progress is being made.
    • Worked on Quattor disk server configuration.
    • Fabric on call over the weekend.
  • Jonathan:
    • updated SSH keys for root userid on farm nodes
    • updated /etc/exports on nfs1 to remove NO_ROOT_SQUASH for /home/farm (new home filesystem)
    • rebooted several servers after kernel security update
    • with Richard investigated problem with rpm command on lcgmon01
    • found problems with Quattor installation of lcgbatch01
    • updated NIS netgroup to remove old batch workers and add sv08 hosts to t1auto
    • fixed mail quota for user
    • Nagios configuration updates
  • James A:
    • Focussed on Quattor as much as possible.
    • Built support infrastructure for SL5 x86_64 Quattorised WN deployment.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss169: array is reinitializing; will verify once it completes. (Intervention)
    • gdss72 and gdss126 have been given back to CASTOR.
    • lcgpx0620: replaced memory and moved it back into the UPS room. (Fixed)
    • gdss87: added an additional RAID card. (Failed to update firmware)
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 73, 169, and 243.
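
The /etc/exports change noted under Jonathan above (removing NO_ROOT_SQUASH for /home/farm) could be sketched roughly as follows. This is only an illustration: the helper name drop_export_option and the sample export line (client pattern and other options) are assumptions, not the actual contents of the file on nfs1.

```python
import re


def drop_export_option(line: str, path: str, option: str) -> str:
    """Return an /etc/exports line with `option` removed for `path`.

    `line` is expected without a trailing newline. Lines for other
    paths, blank lines and comments are returned unchanged.
    """
    fields = line.split()
    if not fields or fields[0].startswith("#") or fields[0] != path:
        return line

    def clean(match: re.Match) -> str:
        # Rebuild one "(opt1,opt2,...)" group without the unwanted option.
        opts = [o for o in match.group(1).split(",")
                if o.strip().lower() != option.lower()]
        # If nothing is left, drop the parentheses so defaults apply.
        return "(" + ",".join(opts) + ")" if opts else ""

    return re.sub(r"\(([^)]*)\)", clean, line)


# Hypothetical export line, for illustration only.
sample = "/home/farm  *.example.org(rw,sync,no_root_squash)"
print(drop_export_option(sample, "/home/farm", "no_root_squash"))
```

With the option removed, the Linux NFS server falls back to its default of root_squash for that export. On a typical Linux NFS server the edited file is usually re-read with `exportfs -ra`; whether that was the procedure used here is not stated above.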

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | Kernel security problem | 13/08/09 | Ongoing | Critical | All

Summary of plans for week ahead

Note: The working week is three days (Weds-Fri).

Scheduled and Cancelled Down Times

Entries of type Down, At Risk or Cancelled that are in, or planned to go into, GOCDB

Component | Description | Start | End | Affected VO(s) | Type
UIs, user-accessed hosts | Reboots | As soon as new kernels are available | When they've all been updated | All | Down Time (short)

Development priorities

  • All
  • Martin:
    • Kernel patching
    • Procurement issues
  • Ian:
    • A/L
  • James T:
    • Liaise with Viglen on disk issues. Next meeting 13.30 on Thursday.
    • Quattor work on disk servers and ganglia gmond config.
    • Add new disk servers to Overwatch.
    • Acceptance tests of 2008 Streamline machines.
  • Jonathan:
    • reboot SL3 servers after installing new kernel
    • check Quattor installation for new batch workers and correct problems
    • restart work on new Nagios server
    • restart work on new home filesystem server
    • Nagios configuration updates as required
  • James A:
    • Lead WN deployment.
    • Ensure backups are working correctly for quattor01
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Create graphs of drive failures (daily, weekly and monthly).
    • Continue working on 2008 disk servers and worker nodes.
    • Working on gdss67, 152, 169, and 243.
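
The daily, weekly and monthly drive-failure graphs listed under Kash above need failure counts grouped by period. A minimal sketch of that grouping is shown here; the function name count_failures and the sample dates are made up for illustration and are not real failure records. The resulting counts could then be fed to whatever plotting tool is preferred.

```python
from collections import Counter
from datetime import date


def count_failures(failure_dates, period):
    """Count drive failures per day, ISO week or month.

    `period` is one of "daily", "weekly" or "monthly". Returns a dict
    mapping a period label to the number of failures, sorted by label.
    """
    counts = Counter()
    for d in failure_dates:
        if period == "daily":
            key = d.isoformat()
        elif period == "weekly":
            year, week, _ = d.isocalendar()
            key = f"{year}-W{week:02d}"
        elif period == "monthly":
            key = f"{d.year}-{d.month:02d}"
        else:
            raise ValueError(f"unknown period: {period}")
        counts[key] += 1
    return dict(sorted(counts.items()))


# Hypothetical failure dates, for illustration only.
sample_failures = [
    date(2009, 8, 31),
    date(2009, 9, 1),
    date(2009, 9, 1),
    date(2009, 9, 7),
]

for period in ("daily", "weekly", "monthly"):
    print(period, count_failures(sample_failures, period))
```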

Absences

  • Martin:
    • A/L 7-11 Sept
  • Ian:
    • A/L Weds-Fri


Fabric On-Call

  • Mon-Sun:

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT# 44835 – non capacity HW for testing (Services)
