RAL Tier1 weekly operations Fabric 20101206

Developments

  • All:
  • Martin:
  • Ian:
    • Created project plan for preprod Quattorised SRMs
    • Set up initial Quattor config and installed preprod SRMs
    • Began public wiki page for cvmfs testing/setup
    • Tested latest version of cvmfs client
    • Ongoing virtualisation testing
    • Job plan reviews
  • Tim:
    • Problems with SL8500
    • FaC tape server config
    • Sorting out problems with various tapes
    • Tape drive microcode updates
    • Power blip
    • Tracking down the cause of the slow network transfers


  • Jonathan:
  • James A:
    • Developed and tested deployment of grid map files with ZipWire.
    • Worked with Chris K to upgrade a fully Quattorised facilities instance to 2.1.9-10.
    • Spent some time recovering batch workers after the site power glitch.
  • James T:
    • Liaising with Viglen and Streamline over 2010 delivery testing.
    • Wrote FSPROBE Nagios check.
    • Experiments with iSCSI on Linux clients/initiators.
  • Cheney:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems. (R 27)
    • gdss380 still with Streamline for fix. (Crashed with single faulty drive)
    • gdss417 acceptance testing. (Crashed with single faulty drive)
    • gdss280 replaced 16-port RAID card.
    • gdss117 replaced RAID card and 3 drives.
    • Power outage on Wednesday 01/12/2010. Lots of drive failures.
    • Hardware failure stats/graphs.
    • Streamline/Areca disk servers crashed due to single faulty drive. (ongoing)
    • gdss90 and gdss120 given back to Castor team.


Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
  • Ian:
    • Any required updates to SRMs
    • Add detail to public cvmfs page
    • Investigate Nagios checks for cvmfs client
    • Look at and test cvmfs mirroring prototype
    • Rebuild Hyper-V cluster
    • Job plan reviews
  • Tim:
    • More CMS repacking
    • Stats/Metric generation
    • Preparing for move to My Oracle Support
  • Cheney:
  • Jonathan:
  • James T:
    • CASTOR 2.1.9 upgrade on disk servers
    • Continue iSCSI experiments
    • Familiarisation with AFS+krb5
    • Tidy up overwatch
    • Job plan updates
  • James A:
    • Liaise with ClusterVision engineers while new Worker Node delivery takes place.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Job plan review.
    • Configure and install gdss117 and gdss280 with Quattor.
    • Continue decommissioning old batch systems. (R 27)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Cheney: date for being off has changed, now Nov 24th. Early warning: likely to be off most of December (date subject to change)
  • Tim Wed afternoon and Thursday

Fabric On-Call

  • Kash Monday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues

