RAL Tier1 weekly operations castor 05/07/2010

From GridPP Wiki
Revision as of 14:26, 6 July 2010 by Matt viljoen (Talk | contribs)

Summary of Previous Week

  • Matthew:
  • Shaun:
    • 2.1.9 Upgrades
    • Moving last disk server into cmsTemp
    • Fixed gdss539
  • Chris:
    • Castor 2.1.9 tests and work related to it
    • SSC4
    • Alice disk servers filling up /var partitions
    • rsyslog
  • Richard:
    • Installed the machine lcg0625 as a test CIP server
    • Trying to get latest (2.1.9-x) functional tests running on pre-prod
    • Re-config pre-prod to reverse the "local ns" change
    • Starting a run of stress tests for 2.1.9 pre-prod
  • Brian:
    • ..
  • Jens:
    • ..

Developments for this week

  • Matthew:
    • Finalizing plans for NIS + networking for CASTOR for Facilities
    • 2.1.9 DLF configuration & testing
    • Testing 2.1.9 tape migration
    • WLCG (Wed-Fri)
  • Shaun:
    • SRM work
    • COD
  • Chris:
    • Castor 2.1.9 tests and work related to it
    • WLCG meeting
    • Finishing SSC4
  • Richard:
    • Continuing with stress tests for 2.1.9 pre-prod
    • 4.5 days A/L
  • Brian:
    • ..
  • Jens:
    • ..

Operations Issues

  • Approx. 1000 CMS files were lost on gdss67 (D1T0 cmsFarm) after a failure of a RAID array. The array was recreated prematurely, before CMS had been informed of the file loss. A postmortem has been written: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS
  • A Transtech disk array controller reset itself for an unknown reason, causing mounts on 6 nodes to go read-only and stopping the backup of DB redo logs. Rebooting the nodes during a downtime on 5/7/10 fixed the issue.
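
A check for the read-only mount symptom above could be sketched as follows. This is only an illustration, not the monitoring actually in use at RAL: it assumes Linux nodes where the mount table is available in the usual /proc/mounts format, and the function name find_readonly_mounts is hypothetical.

```python
def find_readonly_mounts(mounts_text):
    """Return the mount points whose option list contains the 'ro' flag.

    mounts_text is expected in /proc/mounts format (assumption):
    device, mount point, filesystem type, options, dump, pass.
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Exact match only, so options such as 'errors=remount-ro'
        # are not mistaken for the read-only flag.
        if "ro" in options:
            readonly.append(mount_point)
    return readonly
```

On a node, the function could be given the contents of /proc/mounts, and any mount point it returns for a filesystem that should be writable would be worth alerting on.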

Blocking issues

  • CERN do not provide a 32-bit build of xroot, so we cannot install it on most of our disk servers.

Planned, Scheduled and Cancelled Interventions

Entries in/planned to go to GOCDB

None

Advanced Planning

  • Upgrade to 2.1.8/2.1.9 2010

Staffing

  • Castor on Call person: Shaun
  • Staff absences:
    • Jens (Mon,Tue)