Difference between revisions of "RAL Tier1 weekly operations Fabric 20091116"

From GridPP Wiki

Latest revision as of 15:50, 16 November 2009

Summary of week gone


  • All:
  • Martin:
    • Completed disk procurement evaluation
    • More work on EMC array problems
    • CPU ITT evaluation
  • Ian:
    • Work on Quest FP7 bid
    • Rolling out kernel security update on quattor system
    • First look at disk failure stats
  • James T:
    • Updated ganglia configs for Storage_LHCb and Services_Grid
    • Viglen disk server problems
    • TOASTER prep
  • Jonathan:
    • reconfigured NIS servers to allow access to shadow map from any port
    • check AFS servers for contacts from compromised Manchester system
    • BIOS update for sv-08-06 (to be lcgcc-s3-06)
    • sorted out problems with atlasbackup for many nodes
    • sorted out ntp configuration problem on t1pg0373
    • Nagios configuration updates
    • updated tier1-nagios-plugins to version 2.0-58
    • gave talks about Nagios to Production Team etc
  • James A:
    • A/L
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss262: replaced 8x1GB memory; fixed and back in production.
    • gdss67: needs to run a 7-day test.
    • gdss125: given back to Castor.
    • gdss413: replaced 4x2GB memory (ready for deployment).
    • sl4sys32-sl4sys64: replaced PSU.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 163 and 282.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6 Oct, am | Not in sight | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
    • Work on evacuating A1 Upper (Castor admin and LSF systems)
  • Martin:
    • complete CPU ITT evaluation
    • testing sample hardware
    • install database test boxes
  • Ian:
    • Further Quattor FP7 work (last week)
    • Finish rolling out new kernels on Quattor-managed machines
    • Kernels on SL4 batch workers
    • Work on CPU procurements
    • Castor Quattor tutorial
  • James T:
    • Viglen disk server problems
    • CRISTAL2 preparation
    • Catch up on helpdesk tickets and other actions
    • Disk server kernel updates
  • Jonathan:
    • Set up regular checks of backups for home filesystem, AFS volumes and MySQL databases
    • Quattor implementation for Nagios slave
    • update environment for SL5 systems
    • updates to farm to allow Babar functional userids to migrate home filesystem
    • Nagios configuration updates
  • James A:
    • A/L
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss67: rebuild from scratch and move to the HPD room.
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss67, 163 and 282.


  • James A
    • Annual Leave (Mon 9th - Fri 20th).

Fabric On-Call

  • Mon-Sun: Ian

Advanced Warning of Requirements and Blocking issues

Services Issues

  • Various requests for hardware.

