RAL Tier1 weekly operations Fabric 20091221


Summary of week gone

Developments

  • All:
    • Tier1 Review as required
  • Martin:
    • Minor procurements, particularly networking
    • Planning to move or decommission hosts in A1 Upper
  • Ian:
  • James T:
    • Met with Viglen regarding disk swap out on 2008 procurement.
    • Removed Nincom as a ganglia data source for Services_Monitoring.
    • Installed tier1-oldprocesskiller on machines that were missing it.
    • Fixed drivemap errors on disk servers.
    • Disk replacements while Kash was on leave.
    • Updated various boxes.
    • Backed up /pool on csfsys{a,b} and copied both backups to sl4sys{32,64}
    • Preparation for disk server kernel intervention, including script to install workaround for null pointer dereference vulnerabilities.
    • Investigated SL3 repos on disk servers (ongoing).
  • Jonathan:
    • Updated RPMs on AFS servers, including a new kernel.
    • Disabled a user's login and removed their SSH key from root login (on some disk servers) after a laptop theft.
    • Corrected an atlasbackup problem on one node.
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R27).
    • Moved srm0383 from Atlas to R89 UPS room.
    • gdss105 and gdss354 given back to Castor.
    • gdss171 (double disk failure) fixed and given back to Castor.
    • gdss79 (fsprobe): started memtest on Friday.
    • Moved pallet to Atlas with MJB.
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss79 and gdss282.

Absences

  • Jonathan: S/L Mon-Thu
  • Ian: S/L Mon-Wed
  • Kash: A/L Thu

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday 6/Oct am | UPS issues to be fixed | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All
    • Move/decommission hosts in A1 Upper
  • Martin:
    • Minor procurements
  • Ian:
  • James T:
    • Preparation for disk server intervention on Tuesday 22nd.
    • Disk server intervention.
    • Job plan bits and bobs.
    • Tidy up for Christmas.
    • Fabric/on site on call for whole of Christmas break.
  • Jonathan:
    • Stop export of /home/csf from csfnfs02
    • Check for web servers on csfmove02
    • Write and test script for active checks of database statuses on peaceful
  • James A:
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing to decommission old batch systems (R27).
    • Continuing work on 2008 disk servers and worker nodes.
    • Continuing work on gdss79 and gdss282.

Absences

  • All: Fri 25/12 to Fri 1/1. Back 4/1.
  • Ian: Mon-Thu
  • Jonathan: Tue

Fabric On-Call

  • James T Primary on-call to 28th Dec.

Advanced Warning of Requirements and Blocking issues

  • Unable to proceed with Atlas TAG migration to 64-bit because the arrays are being used for 3D systems while the EMC kit is flaky.

Services Issues

  • Various requests for hardware.
    • Working on hardware provision for Services team testbeds.
