RAL Tier1 weekly operations Fabric 20100503


Developments

  • All:
    • APRs
  • Martin:
  • Ian:
    • Work on VOboxes with Catalin and Andrew Lahiff
    • Help with new Atlas software server
  • Tim:
    • Install new tape servers and drives
    • NDA talks with Oracle/Sun/STK on T10KC
    • monthly stats
    • Modify script to move tapes from free to VO tape pool
  • Cheney:
    • set up a new server to back up archives of database redo logs
    • started looking at dmf backups optimisation
    • set up more nagios checks for disk arrays
    • docco - castor
    • learnt some basic Quattor
    • testing ahead of db changes
  • Jonathan:
    • fixed atlasbackup problems on several nodes
    • cleaned up /tmp on lcgui01 by removing files that have not been accessed for 150 days
    • stopped pacman web service on csfmove02 (RT #57936)
    • increased home filesystem quota for user
    • Nagios configuration updates
    • issued new versions of tier1-nagios-plugins, tier1-nrpe-config and tier1-sudo-config
  • James A:
  • James T:
    • AoD Thursday
    • Disk server recovery docco
    • Updated hardware raid nagios check to support Viglen '08 kit
    • Log searches for security challenge
    • Telecon with Streamline/Western Digital/LSI/Boston regarding Streamline '09 kit
    • Handed SL5 disk server build to CASTOR team
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • Daily hardware failures status of Streamline 2009 disk servers to James T.
    • gdss290 verified raid array. (Fixed)
    • gdss312 replaced IPMI card. (Fixed)
    • Streamline Engineer service call. gdss490 taken by Graham.
    • gdss420 replaced 24-port RAID controller card. (Fixed)
    • Castor C2certdb received replacement drive.
    • Wrong labels on Viglen 2009 disk servers. (Reported to James T)
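
Jonathan's /tmp cleanup above (removing files not accessed for 150 days) can be sketched as follows. This is a minimal illustration only, not the procedure actually run on lcgui01: the function names and the dry-run default are assumptions, and it relies on the filesystem recording access times (it would not behave as intended on a mount using noatime).

```python
import os
import stat
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days=150, now=None):
    """Return sorted paths of regular files under root whose last
    access time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: never follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_atime < cutoff:
                stale.append(path)
    return sorted(stale)


def remove_stale_files(root, max_age_days=150, dry_run=True, now=None):
    """Delete stale files under root. With dry_run=True (the default),
    only report what would be deleted."""
    handled = []
    for path in find_stale_files(root, max_age_days, now):
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # could not delete; leave it out of the report
        handled.append(path)
    return handled
```

Calling remove_stale_files("/tmp") lists the candidates without deleting anything; passing dry_run=False performs the deletion. Defaulting to a dry run is a design choice made here so the candidate list can be reviewed first.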

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Tim in London Thursday

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
  • Martin:
  • Ian:
    • Visit to CERN
  • Tim:
    • Get T10KB migration going
    • Plan install of Facilities Castor
  • Cheney:
    • get the backups working right - they lose sync
    • patching
    • dmf backups
    • docco
  • Jonathan:
    • start regular check restores of home filesystem
    • final checks of new Nagios slave
    • continue investigations on setting up AFS directory as Atlas software server
    • Nagios configuration updates
  • James T:
    • AoD Thursday
    • Keep an eye on Streamline '09 testing
    • APR
    • Disk swaps for Kash
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continue decommissioning old batch systems. (R 27)
    • Viglen Engineers service call on Wednesday 28th April 2010.
    • gdss290 has filesystem errors and has probably lost data. (Intervention)
    • gdss312 and gdss337: replace IPMI card.
    • gdss420: replace 24-port RAID controller card.
    • Daily hardware failures status of Streamline 2009 disk servers to James T.

Absences

  • Monday is Bank Holiday
  • Jonathan on partial retirement (not in on Monday and Friday)

Fabric On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues


Category:RAL_Tier1