RAL Tier1 weekly operations Fabric 20100524


Developments

  • All:
  • Martin:
  • Ian:
  • Tim:
    • Installing new T10KB tape servers
    • DMF admin
  • Cheney:
    • Fixed log rotation in Quattor
    • Changed DMF backups
    • Replaced battery in EMC kit
    • Installed ACSLS on kiki
    • Investigated tape server problems
    • Fixed metrics problem
    • Investigated hinode security alerts
  • Jonathan:
    • updated NIS netgroup file to update list of disk servers and to add new top BDII servers
    • updated iptables on enigma to allow Nagios checks via NRPE
    • updated NIS group file to bring Tier1-Fabric and Tier1-Castor groups up to date
    • worked on AFS solution for Atlas software server
    • 4 Nagios configuration updates
    • stopped Nagios process on nagios01/05; later issued poweroff command
    • updated Twiki documentation for Nagios slave server change
    • installed mysql.i386 RPM on nagger to solve problem with MySQL check
    • new versions of tier1-nagios-plugins, tier1-sudo-config and tier1-nrpe-config RPMs
  • James A:
    • Change control proposal for ATLAS software server upgrade.
    • Job Plan
  • James T:
    • Acceptance testing Streamline '09 kit
    • Modified Quattor disk server installations to correctly detect and mount XFS data partitions
    • SL5/XFS disk server change planning and documentation.
    • Deployment allocations sliding block puzzle.
    • Job plan
    • Discovered networking problems on SL5 disk servers, caused by incorrect router ACLs; fixed after Martin raised an urgent networking ticket.
    • Work on benchmarking disk servers for specification in the tender documents.


  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • Daily hardware failures status of Streamline 2009 disk servers to James T.
    • gdss380 crashed with single drive failure. (Intervention)
    • gdss423 moved into test area with John. (Intervention)
    • gdss71 replaced RAID card memory with John. (Fixed)
    • gdss434 rebuild completed successfully. Back to castor.
    • gdss332 probably faulty IPMI card. (Intervention)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave 24th to 28th

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
  • Ian:
  • Tim:
    • Work on CMS T10KB migration
    • Plan install of Facilities Castor
    • New user for DMF
    • Sort out DMF backups
  • Cheney:
    • Sit in sun
    • cold beer
  • Jonathan:
    • start regular check restores of home filesystem
    • Job Plan
    • stop exports for some old filesystems on csfnfs58
    • continue investigations on setting up AFS directory as Atlas software server
    • Nagios configuration updates
  • James T:
    • Acceptance testing Streamline '09 kit
    • Allocations and deployment
    • Benchmarking disk servers
    • Job plan
  • James A:
    • Taking Streamline 2009 WNs out of Acceptance Testing.
    • Benchmarking Streamline 2009 WNs.
    • Inspecting Streamline 2009 WNs.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing decommissioning of old batch systems. (R 27)
    • Daily hardware failures status of Streamline 2009 disk servers to James T.
    • gdss380: sending logs to vendor. Received new RAID card.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave Monday 24th to Friday 28th May
  • Jonathan on leave Tuesday 25th May
  • Tim on leave 1st-4th June
  • Ian on leave Friday pm

Fabric On-Call

Ian: Monday-Thursday; Kash: Friday-Monday

Advanced Warning of Requirements and Blocking issues

Services Issues

    • Xen development environment not available for Hadoop testing due to network issues. Work on a resolution has started, but a fix is unlikely to be available for a while yet.


