RAL Tier1 weekly operations castor 04/07/2010

Summary of Previous Week

  • Matthew:
    • CoD work
    • 2.1.9 change control document
    • arranging for access of BADC/NEODC people to Gen and helping them to get started
    • getting 2.1.9 DLF working on pre-prod with rsyslog & testing
    • assisting with Security Challenge 4
    • setting up a website on the e-Science wiki for the 2.1.9 upgrade
  • Shaun:
    • 2.1.9 Upgrades
    • Moving last disk server into cmsTemp
    • Fixed gdss539
  • Chris:
    • Castor 2.1.9 tests and work related to it
    • SSC4
    • Alice disk servers filling up /var partitions
    • rsyslog
  • Richard:
    • Installed the machine lcg0625 as a test CIP server
    • Trying to get latest (2.1.9-x) functional tests running on pre-prod
    • Re-config pre-prod to reverse the "local ns" change
    • Starting a run of stress tests for 2.1.9 pre-prod
  • Brian:
    • ..
  • Jens:
    • ..

Developments for this week

  • Matthew:
    • Finalizing plans for NIS + networking for CASTOR for Facilities
    • 2.1.9 DLF configuration & testing
    • Testing 2.1.9 tape migration
    • WLCG (Wed-Fri)
  • Shaun:
    • SRM work
    • CoD
  • Chris:
    • Castor 2.1.9 tests and work related to it
    • WLCG meeting
    • Finishing SSC4
  • Richard:
    • Continuing with stress tests for 2.1.9 pre-prod
    • 4.5 days A/L
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • Approx. 1000 CMS files were lost on gdss67 (D1T0 cmsFarm) after a failure of a RAID array. The array was recreated prematurely, before CMS had been informed of the file loss. A postmortem has been written: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS
  • A Transtech disk array controller reset itself for an unknown reason, causing mounts on 6 nodes to go read-only and stopping the backup of DB redo logs. Rebooting the nodes during a downtime on 5/7/10 fixed the issue.

Blocking issues

None

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

None

Advanced Planning

  • Upgrade to 2.1.8/2.1.9 2010

Staffing

  • Castor on Call person: Shaun
  • Staff absences:
    • Jens (Mon,Tue)