RAL Tier1 weekly operations Fabric 20091109

From GridPP Wiki

Summary of week gone

Developments

  • All:
    • SSC HR rollout tasks
  • Martin:
    • Completed Disk Procurement eval
    • Work on EMC arrays problem
    • Survey of CPU tender responses
    • Meeting with Seagate and Viglen re 2008 disk acceptance
  • Ian:
    • Work on Quest FP7 bid
    • Attended Quattor Workshop (and QUEST F2F)
    • Making new kernels available to Quattor managed systems
  • James T:
    • Catch up
    • Viglen Disk Server Problems
    • CRISTAL2 preparation
    • A/L Friday
  • Jonathan:
    • updated RPMS and rebooted for new kernel on many systems
    • sorted out problems with atlasbackup for some nodes
    • sorted out Nagios problems for some servers
    • arranged for gdss411-413 to be installed as Castor disk servers prior to deployment
    • migrated farm home filesystems from /home/csf to /home/tier1 for remaining users (except for bfactory functional userids)
    • Nagios configuration of updates
    • with Kash, shut down nagger and nagiosdb to replace faulty memory in nagger
  • James A:
    • Quattor Workshop @ Brussels
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss154 and 168 fixed and back in production.
    • gdss383: replaced 4x2GB memory; fixed and ready for deployment.
    • gdss117, 139 and 154 fixed and given back to Castor.
    • gdss67 fixed after a long intervention (replaced with a new 24-port RAID card); data saved. Needs to be rebuilt from scratch.
    • gdss411, 412 and 413 fixed and ready for deployment.
    • Replaced memory in nagger with Mr. Wheeler.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 125, 282 and 403.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
 | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | not in sight | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
    • Work on evacuating A1 Upper (Castor admin and LSF systems)
  • Martin:
    • Move EMC kit
    • Spares and additional hardware for database arrays
    • CPU ITT evaluation
      • Testing sample hardware
  • Ian:
    • Further Quattor FP7 work (last two weeks)
    • Roll out new kernels on Quattor managed machines
    • Look at disk stats with Kash
    • Work on CPU procurements
  • James T:
    • Progress meeting with Viglen
    • CRISTAL2 preparation
    • Disk server cover in Kash's absence
    • Catch up on helpdesk tickets and meeting actions
    • TOASTER preparation
  • Jonathan:
    • Quattor implementation for Nagios slave
    • update environment for SL5 systems
    • updates to farm to allow Babar functional userids to migrate home filesystem
    • Nagios configuration updates
  • James A:
    • A/L
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Rebuild gdss67 from scratch and move it into the HPD room.
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss67, 125, 282 and 403.

Absences

  • Jonathan
    • A/L on Thursday (12th November)
  • James A
    • Annual Leave (Mon 9th - Fri 20th).

Fabric On-Call

  • Mon-Sun: Ian is Primary On-Call

Advanced Warning of Requirements and Blocking issues

Services Issues

  • Various requests for hardware.
