RAL Tier1 weekly operations Fabric 20100517

From GridPP Wiki

Developments

  • All:
  • Martin:
    • Visit to CERN for:
      • F2F meetings with Tim Bell, Olof Bärring
      • Virtualisation Working Group F2F
      • May GDB
    • Work on APRs
    • Database infrastructure paper
  • Ian:
    • Attended HEPiX Virtualisation Working Group F2F
    • Met with services virtualisation admins at CERN
    • Attended part of GDB
    • Quattor documentation
  • Tim:
    • More work on new tape servers
    • Investigating repack checksum errors
    • Finished exporting data for BOPCRIS
    • DMF: sorting out some bad tapes
    • Library had two drives with stuck tapes
  • Cheney:
    • investigate logrotate problem on tape servers (cron not running)
    • sort out two blown PSUs that failed at the same time
    • fix various backup glitches
    • add servers to Nagios and tweak various odds and ends
    • add new VO to sls and tsbn
    • testing of DB backup restores (failed)
    • temporary fixes for various problems with new tape servers
  • Jonathan:
    • corrected atlasbackup problem
    • investigated and solved intermittent “Connection refused” problems for lcgvo-02-21
    • restored Somnus Oracle database from backups
    • renamed AFS userid
    • Completed APR
  • James A:
    • Kept sv-08-16 up-to-date with current ATLAS software from lcg0617 ahead of switch-over.
    • Re-cabled Streamline 2009 storage nodes into production network and removed part of testing network from rack.
    • Finished APR.
    • Started change-control request for upgrading ATLAS software server.
  • James T:
    • Retrieved and installed new certificates on gdss87-367
    • Got certificates for Viglen09 kit.
    • Drive swapping in Kash's absence.
    • APR
    • Installed Streamline '09 kit via Quattor
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • Daily hardware failures status of Streamline 2009 disk servers to James T.
    • gdss228, gdss229 and gdss232 given back to castor.
    • gdss423: four faulty drives and probably a faulty RAID card. (Reported)
    • Faxed dispatch note to Viglen for reference.
    • gdss434: new drive not shown. (Needs reboot)
    • gdss71: faulty memory. (Intervention)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave Monday 24th to Friday 28th May

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • Database infrastructure plan document + costings
    • Disk ITT
    • APR + Plan stuff
  • Ian:
    • Set up Red Hat repositories for Quattor installation
    • Help Tim with final tape server work
    • Job plans
    • Virtualisation strategy task work
  • Tim:
    • Get remaining tape servers into production
    • Prepare repack for CMS migration
    • DMF: sort out Peter Chui's requirements
    • Get CLF access to DMF
  • Cheney:
    • Improvements to DMF backups
    • SRB tasks
    • Fix patching
  • Jonathan:
    • start regular check restores of home filesystem
    • close nagios01/05 (old Nagios slave servers)
    • stop exports for some old filesystems on csfnfs58
    • continue investigations on setting up AFS directory as Atlas software server
    • Nagios configuration updates
  • James T:
    • Acceptance testing Streamline '09 kit
    • Job Plan
    • Change control request for moving to SL5 64-bit + XFS on disk servers
    • Assigning Viglen '06 disk servers to preprod so that the Viglen '08 machines in preprod can be reclaimed to satisfy allocations
    • Disk server benchmarking
  • James A:
    • Working on change control for ATLAS software server upgrade.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing decommissioning of old batch systems (R 27).
    • Daily hardware failures status of Streamline 2009 disk servers to James T.

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney on leave Monday 24th to Friday 28th May

Fabric On-Call

  • Ian all week

Advanced Warning of Requirements and Blocking issues

Services Issues

