RAL Tier1 weekly operations Fabric 20090914

From GridPP Wiki

Summary of week gone

Developments

  • All
  • Martin:
    • A/L
  • Ian:
  • James T:
    • Ongoing liaison with Viglen over disk issues
    • Continued acceptance testing of Streamline nodes
    • Kickstart for CASTOR on new Streamline nodes
    • Quattor build of SL5/64-bit VO software server
  • Jonathan:
    • fixed /var filesystem full problem for system consoles
    • built RPM tier1-batchinfo (v 1.0-12) without Requires: perl-DB-mysql
    • investigated NIS problems
    • created final backups for /pool and /pool/machines on csflnx266/270
    • fixed problems for 2 users
    • Nagios configuration updates
  • James A:
    • Focussed on Quattor as much as possible.
    • Deployed all new worker nodes with SL5 64-bit via Quattor into the new batch farm.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss169 fixed and given back to CASTOR.
    • gdss67 moved into the Test area for further intervention, with the help of James A and James T.
    • gdss302 given back to CASTOR.
    • lcgfts02 configured for hot-swapping.
    • gdss172: RAID card battery replaced (fixed); given back to CASTOR.
    • gdss164: two drives replaced and an additional RAID card added, with James T.
    • Labeled (from the front) the Clustervision 2007 worker nodes.
    • Created graphs of hardware/drive failures.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 78, 85, 86, 105, 110 and 243.

Operational Issues and Incidents

Index  Description                       Start     End      Severity  Affected VO(s)
1      gdss67 RAID failure (RT#49145)    18/08/09  Ongoing  Critical  CMS
2      gdss164 RAID5 failure (RT#49192)  19/08/09  Ongoing  Critical  BaBar

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
  • Martin:
    • Catchup
    • Procurements
    • Next steps in database migration plan
  • Ian:
  • James T:
    • Keep up to speed with Viglen on disk issues and run any tests they need
    • SL5/64-bit software server build
    • Quattor: disk servers and ganglia config files
    • Tuning on gen disk servers at Brian's request
  • Jonathan:
    • create list of archived tapes on wiki
    • work on migration of NIS servers to new hardware and a new version of SL
    • add update to /var/yp/nicknames for systems managed by Quattor
    • work on plan to move home filesystem to new server
  • James A:
    • Migration of 75% of batch capacity to SL5.
    • Migration of SL4 nodes to new scheduler.
    • Provide assistance with WMS quattorisation.
    • Troubleshoot deployment of software install boxes.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss67, 78, 85, 86, 105, 110 and 243.

Absences

  • Ian:
    • Monday
  • James T:
    • A/L Thu-Fri

Fabric On-Call

  • Tue-Sun: Ian (Primary on-call)

Advanced Warning of Requirements and Blocking issues

Services Issues

  • RT#44835 – non-capacity HW for testing (Services)
