RAL Tier1 weekly operations Fabric 20091012

From GridPP Wiki

Latest revision as of 15:04, 12 October 2009

Summary of week gone

Developments

  • All
  • Martin:
    • Disk procurement ITT evaluation
    • Deployment of 3D databases onto old hardware because power feed problems made the EMC arrays unstable
    • Meeting with Seagate about disk problems
  • Ian:
  • James T:
    • Viglen testing:
      • Meeting with
      • Drives swapped for a different batch in 10 machines (220 drives).
      • Logs captured by Seagate on 2 October showed further issues, so they issued another firmware update.
      • More logs captured from timed-out drives on Thursday 8th.
      • Tested racks with the functional earth removed - same problems.
    • user_xattr mount option rolled out to all CASTOR disk servers.
    • Created Storage_CASTOR_Gen ganglia cluster for Brian (former CASTOR team blocking issue).
    • Cleaned up some fabric tickets.
    • DNS request for repack server.
    • HEPSYSMAN on Wednesday 7th (talked about Tier1 storage).


  • Jonathan:
    • configured nagios@nagger.gridpp.rl.ac.uk as PBS operator
    • worked on migration of user home filesystems to new server
    • updated RPMs on core servers and rebooted where required
    • updated wiki documentation to reflect the change of Nagios master server to nagger
    • added new users to Tier1 and AFS
    • added new top directory superb for Babar (RT #52070)
    • Nagios configuration updates on servers and clients
  • James A:
    • Lots of work on BatchWorkers in QUATTOR.
    • Brought SL5 farm to 90% of KSI2K Capacity.
    • Shrunk SL4 farm correspondingly.
    • Made some minor progress with SINDES.
    • Some changes to ARTEMIS for UPS room.
    • Removed AtlasBackup from base machine template in QUATTOR
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • gdss354 fixed and back in production.
    • gdss218: backplane cables were connected the wrong way round (fixed).
    • gdss126: double disk failure; array verification completed.
    • 220 Seagate drives dispatched and handed to the Seagate engineer.
    • Completed adding additional RAID cards in v06 (Castor disk servers).
    • Working on 2008 disk servers and worker nodes.
    • Working on gdss67, 86, 126 and 170.

Operational Issues and Incidents

Index | Description | Start | End | Severity | Affected VO(s)
- | EMC arrays serving 3D/LFC/FTS databases made unstable by attempts to stabilise the Castor EMC arrays | Tuesday am | not in site | Catastrophic | All

Summary of plans for week ahead

Scheduled and Cancelled Down Times

Type=Down/At Risk/Cancelled entries in/planned to go to GOCDB

Component | Description | Start | End | Affected VO(s) | Type

Development priorities

  • All
  • Martin:
    • Disk procurement ITT evaluation
    • CPU procurement ITT clarifications
  • Ian:
  • James T:
    • Assign machines for deployment.
    • Send out requests for people to complete CRISTAL 2 feedback forms.
    • Viglen testing:
      • Continue testing latest firmware.
      • Prepare to hand over to someone else.
  • Jonathan:
    • work on migration of Tier1 home filesystem to new server
    • work on installing Nagios slave servers using Quattor
    • Nagios configuration updates as required
  • James A:
    • Continue pushing forward with SINDES.
    • Take over disk issues from James T.
    • Integrate BMS alerts into ARTEMIS data stream.
  • Kash:
    • Drive replacement.
    • Fixing broken WNs.
    • Continue working on 2008 disk servers and worker nodes.
    • Continue working on gdss67, 86, 126 and 170.

Absences

  • James T
    • James T on A/L from Thursday 15th until Monday November 2nd.

Fabric On-Call

  • Mon-Fri:

Advanced Warning of Requirements and Blocking issues

Services Issues

  • Various requests for hardware.
