RAL Tier1 weekly operations castor 05/07/2010
From GridPP Wiki
Revision as of 14:26, 6 July 2010 by Matt viljoen (Talk | contribs)
Summary of Previous Week
- Matthew:
- CoD work
- 2.1.9 change control document
- arranging access to Gen for BADC/NEODC people and helping them get started
- getting 2.1.9 DLF working on pre-prod with rsyslog & testing
- assisting with Security Challenge 4
- set up website on e-Science wiki for 2.1.9 upgrade: https://wiki.e-science.cclrc.ac.uk/web1/bin/view/EScienceInternal/CASTOR219Upgrade
- Shaun:
- 2.1.9 Upgrades
- Moving last disk server into cmsTemp
- Fixed gdss539
- Chris:
- Castor 2.1.9 tests and work related to it
- SSC4
- Alice disk servers filling up /var partitions
- rsyslog
- Richard:
- Installed the machine lcg0625 as a test CIP server
- Trying to get latest (2.1.9-x) functional tests running on pre-prod
- Re-config pre-prod to reverse the "local ns" change
- Starting a run of stress tests for 2.1.9 pre-prod
- Brian:
- ..
- Jens:
- ..
Developments for this week
- Matthew:
- Finalizing plans for NIS + networking for CASTOR for Facilities
- 2.1.9 DLF configuration & testing
- Testing 2.1.9 tape migration
- WLCG (Wed-Fri)
- Shaun:
- SRM work
- COD
- Chris:
- Castor 2.1.9 tests and work related to it
- WLCG meeting
- Finishing SSC4
- Richard:
- Continuing with stress tests for 2.1.9 pre-prod
- 4.5 days A/L
- Brian:
- ..
- Jens:
- ..
Operations Issues
- Approx. 1000 CMS files were lost on gdss67 (D1T0 cmsFarm) after a failure of a RAID array. The array was recreated prematurely, before CMS had been informed of the file loss. A postmortem has been written: http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS
- A Transtech disk array controller reset itself for an unknown reason, causing mounts on 6 nodes to go read-only and stopping the backup of DB redo logs. A reboot of the nodes during a downtime on 5/7/10 fixed the issue.
Blocking issues
- CERN don't provide a 32-bit build of xroot, which means we can't install it on most of our disk servers.
Planned, Scheduled and Cancelled Interventions
Entries in/planned to go to GOCDB
None
Advanced Planning
- Upgrade to 2.1.8/2.1.9 2010
Staffing
- Castor on Call person: Shaun
- Staff absences:
- Jens (Mon,Tue)