RAL Tier1 weekly operations Fabric 20100621

Developments

  • All:
  • Martin:
  • Ian:
    • Information gathering and planning for Facilities Castor instance
    • Services virtualisation planning
    • Tested Quattor component for controlling resolv.conf
  • Tim:
    • Sort out VTL mess
    • CMS T10K migration
    • Facilities Castor planning
    • Prep for tape workshop next week
  • Jonathan:
    • Updated tier1-sudo-config SVN source for lcgce03
    • 2 Nagios configuration updates
    • Checked for pager problem
  • James A:
    • Multipathing stuff
  • James T:
    • Catch up after leave
    • Increased size of RAM disk (/var/lib/ganglia) on ganglia01 as it was getting full
    • SL09 acceptance problems
      • Wrote acceptance test summary document
      • Added Gareth from Streamline's key to affected disk servers
      • Held face-to-face meeting with Streamline and Boston (with WD on the phone)
    • Disk sweeps in Kash's absence
    • Rebuild file systems on gdss67
  • Cheney:
    • Patching
    • Got Hadoop virtual machines up for dCache testing
    • Upgraded tsbn stats
    • Closed off Mayo's project work for me
    • Various Nagios check tweaks
    • DB DR testing
    • Set up Samba accounts and access for Mike Courthold's team
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R 27).
    • gdss67 filesystem problem. James A is investigating it. (Intervention)
    • gdss107, 108, 109 and 110 moved from HPD to LPD room with John Kelly.
    • gdss474: faulty backplane; arranging an engineer.
    • gdss207: tried installing it twice, but it didn't work.
    • gdss390: replaced fan with John. (Fixed)
    • Streamline/Areca disk servers crashed due to a single faulty drive. (Ongoing)


Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Jonathan on leave on Tuesday (15th) and Thursday (17th)

Operational Issues and Incidents

Index Description Start End Severity Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component Description Start End Affected VO(s) Type

Development priorities

  • All:
  • Martin:
    • CRISTAL 1 Monday
    • Away day Tuesday
  • Ian:
    • CERN Monday-Friday
    • WLCG Multicore and Virtualisation Workshop
    • Information exchange re services virtualisation and Castor configuration
  • Tim:
    • CMS T10KB migration
    • Facilities Castor planning
    • Dust stuff
  • Cheney:
    • Fix SRB comms problem
  • Jonathan:
    • e-Science away day
    • finish preparing talk about AFS for Fabric Team
    • continue work on shutting down csfnfs58 (old NFS server)
    • Nagios configuration updates
  • James T:
    • Away day Tuesday
    • Cover in Kash's absence
    • SL09 testing
    • Quattor /etc/services fix for rfiod on SL5 disk servers
  • James A:
    • CRISTAL 1 Monday
    • Away day Tuesday
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continuing to decommission old batch systems (R 27).

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Ian at CERN all week
  • Tim at SARA Mon/Tues next week

Fabric On-Call

James T primary on-call Monday-Thursday

Ian primary on-call Friday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues

