RAL Tier1 weekly operations Fabric 20100705

Developments

  • All:
  • Martin:
  • Ian:
    • Facilities Castor planning
    • Setting up new Castor Dell boxes for acceptance testing
    • prepared (and fixed) ncm-resolver to allow config of resolv.conf
    • added quattor components for Castor to repositories
  • Tim:
    • Check test T10KB migrations went OK
    • Start mass CMS T10KB migration
    • DMF - set up "site bit" in DMF to control migration for user
    • Finish planning for FaC instance
    • Sort out STK maintenance contract
  • Jonathan:
    • worked on deleting old filesystems from csfnfs55/58 and gdss130/132
    • updated RPMs on NIS servers and mail server
    • added userid accounts for BADC testing of Castor (RT #61654)
    • brought Quattor up to date for minor manual change to nagios06.gridpp.rl.ac.uk
    • updated kernel RPMs on nagger and rebooted
    • updated RPMs on nagios0[2-4] and rebooted for new kernel
    • issued new version of RPM tier1-sudo-config (1.0-53)
    • set kernel tuning parameter on nagios06
  • James A:
    • Cabled up several of the Dell services nodes; more remain to be done.
    • Finished troubleshooting multipath configuration problems.
    • Developed several new squid ganglia metrics with Alastair for monitoring the frontier service.
  • James T:
    • HDD replacements in Kash's absence
    • TAVS (Tier1 Array Verify Scheduler)
      • Added support for Adaptec cards
      • Fixed bug in error handling (thanks to John for spotting that)
      • Built RPM (still needs to be deployed)
    • Change control for LHCb WAN tuning
    • Streamline 2009 disk server problems
      • Meeting with Gareth Glaccum
      • Shipped machine to LSI in the USA for testing
  • Cheney:
    • database backup restore testing
    • chasing down rogue emails
    • document how to replace array disks
    • turn-off of old ADS arrays
    • show ops details of using ipmi
    • workaround for sls availability
    • put live tsbn web version
    • testing of new version of Amanda backup RPM
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss67 filesystem problem. Replaced RAID card. (Intervention)
    • gdss78 being re-installed.
    • gdss207 given back to castor.
    • gdss474 waiting for backplane.
    • gdss486 collected by Gareth (Streamline) for testing.
    • gdss512 shipped to LSI (USA) for testing.
    • gdss420 low voltage on battery. (waiting for battery)
    • Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Setting up test HyperV server for services virtualisation
    • Facilities Castor instance planning
    • WLCG Collaboration workshop
  • Tim:
    • Monitor dust levels
    • New "check_tape_pools" script
  • Cheney:
    • change control docco improvements
  • Jonathan:
    • Administrator on Duty on Wednesday and Thursday (during WLCG collaboration workshop)
    • continue work on shutting down old NFS servers
    • create Nagios slave for worker nodes
    • Nagios configuration updates
  • James T:
    • Fix bug in ncm-etcservices Quattor component
    • LHCb WAN tuning
    • File system creation and testing of gdss67
    • WLCG workshop Wednesday to Friday
  • James A:
    • Continue cabling up Dell services nodes.
    • Support systems while team members are at the WLCG Collaboration workshop.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss78 needs re-install and array created from scratch.
    • gdss380: run 7-day acceptance test.
    • Continue decommissioning old batch systems. (R 27)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Ian, James T and Martin at WLCG collaboration workshop in London Wednesday-Friday

Fabric On-Call

  • Ian primary on call Monday-Sunday, except Saturday

Advanced Warning of Requirements and Blocking issues

Services Issues

