RAL Tier1 weekly operations Fabric 20100322

From GridPP Wiki

Developments

  • All:
  • Martin:
    • Chasing minor procurement receipting
    • Networking planning
    • Discussions surrounding Atlas software area and use of AFS for it
    • Decommissioning CV04 kit
    • Installing new Dell hardware in rack
    • HEPiX bookings
    • Management issues
  • Ian:
    • Attended Quattor workshop
    • Set up gLite update 62 and contributed it back to QWG
    • Got test version of Aquilon running on system at RAL
  • James T:
    • Quattor installation of Viglen 09 kit for testing
    • Testing of Viglen 09 kit
      • Started testing
      • Monitoring for faults
    • Tier1 tour preparation
    • Disk server deployment
      • Documented procedure for deployment using quattor
      • Allocated disk servers for deployment into Atlas NonProd with Quattor
  • Jonathan:
    • fixed atlasbackup problems on several nodes
    • fixed ntpd process problem on lcgfts0423
    • investigated issues around setting up software area for Atlas VO
    • added AFS userid
    • Nagios configuration updates
    • built local up-to-date 64 bit version of nagios-plugins RPM
  • James A:
    • Completed SL54 upgrade on first five racks of WNs (270 Nodes, ~49% of Farm)
    • Started acceptance testing on Viglen 2009 WNs.
    • Provisioned network and power cabling for CASTOR Rack G.
    • Worked on OPB and open day tours with JIT.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Decommissioning old batch systems (R27).
    • Moved Viglen twin system to logistics for collection.
    • gdss126 given back to castor.
    • Unpacked and moved new Dell servers in UPS room with MJB.
    • Added switches and cable bars to the rack in R27 A5 Lower with MJB.
    • Worked with HP Engineer. (Graham)
    • gdss211 partition and re-install.
    • Managed to install Linux on Castor server cdbd03 (working OK).
    • Castor server Fakecdb13 still working. (Intervention)

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • minor procurements issues (receipting, invoicing)
    • open day talk
    • catch up with Cheney and Tim
    • drafting various change control notices
  • Ian:
    • Help James T with 64 bit Castor disk server
    • Help James A with new quattor server
    • finalise bringing new software install server into production
  • James T:
    • SL5.4 x86_64 + XFS disk server build for CASTOR testing
    • Keep an eye on Viglen 2009 testing
    • Tier1 tour prep
  • Jonathan:
    • continue work on setting up AFS storage for Atlas software
    • continue reconfiguration of nagios06
    • continue work on disposal of old kit from A1 Upper machine room
  • James A:
    • Monitoring acceptance testing.
    • Starting acceptance testing on Streamline 2009 WNs.
    • Continue SL54 upgrade, aiming for four-to-five more racks (112-148 nodes, another ~20-27% of the farm) by the end of the week.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continue decommissioning old batch systems (R27).
    • Continue working with HP engineer.
    • Re-pack the new disk servers' rack sliders for return (wrong sliders).
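
The percentages quoted for the SL54 upgrade can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the farm size can be estimated from the reported "270 nodes, ~49% of farm" figure, and the function names are hypothetical, not part of any Tier1 tooling.

```python
# Figures from the report: 270 nodes upgraded so far, roughly 49% of the farm.
UPGRADED_NODES = 270
UPGRADED_FRACTION = 0.49  # approximate value, as stated in the report


def estimated_farm_size(nodes: int, fraction: float) -> float:
    """Estimate the total farm size from a node count and the fraction it represents."""
    return nodes / fraction


def share_of_farm(nodes: int, farm_size: float) -> float:
    """Return the percentage of the farm that `nodes` represents."""
    return 100.0 * nodes / farm_size


# Estimated total, then the planned range of 112 to 148 further nodes.
farm = estimated_farm_size(UPGRADED_NODES, UPGRADED_FRACTION)
low = share_of_farm(112, farm)
high = share_of_farm(148, farm)

print(f"Estimated farm size: ~{farm:.0f} nodes")
print(f"Planned racks cover ~{low:.0f}% to ~{high:.0f}% of the farm")
```

Under that assumption the farm comes out at roughly 551 nodes, and 112 to 148 further nodes correspond to about 20% to 27% of it, which agrees with the range given in James A's plan.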

Absences

  • Jonathan on partial retirement (not in on Monday and Friday)
  • Kashif A/L (Tuesday)

Fabric On-Call

Ian Primary on call Monday-Sunday

Advanced Warning of Requirements and Blocking issues

Services Issues
