Tier1 Operations Report 2009-12-09

RAL Tier1 Operations Report for 9th December 2009.

This is a review of issues since the last meeting on 2nd December.

Review of issues during the week 2nd to 9th December.

  • Last week it was reported that the repack exercise on ALICE tapes had revealed a problem with three files, which have been declared lost. All ALICE tapes have now been repacked and no further examples of the problem have been found.
  • On Wednesday (2nd December) a minor configuration change was made on the SRMs, aimed at reducing timeouts. However, it was realized that transfers were then failing and the change was backed out. (This happened just before last week's meeting but was not reported there.)
  • On Friday (4th December) there was a problem with the FTS caused by an Oracle session that had become stuck. The session was killed at around 10am and FTS transfers resumed after a break of around 6 hours. (A sketch of this type of intervention is given after this list.)
  • Over the weekend CMS had problems with transfers to US Tier 2s. This was traced to new disk servers missing a script which kills old gridftp processes (an illustrative sketch of such a cleanup follows this list). This has now been resolved: the script has been rolled out to all disk servers, and our deployment procedure has been modified so that newly deployed disk servers include it.
  • A tape migration problem for ATLAS, ongoing since yesterday (8th December), was resolved by the end of this morning (around 11am).
  • This morning a fix was rolled out to all SL5 systems (including all worker nodes) for the severe acpid package vulnerability announced yesterday.
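
The stuck FTS session mentioned above was dealt with by identifying and killing the offending Oracle session. Below is a minimal, purely illustrative Python sketch of that kind of intervention, assuming the cx_Oracle module, DBA privileges, and a hypothetical connection string and idle threshold; it is not the procedure actually followed by the database team.

    # Hypothetical sketch: list long-running active Oracle sessions and kill
    # a chosen one. Connection details and the threshold are illustrative only.
    import cx_Oracle

    DSN = "fts-db.example.ac.uk/FTSDB"   # hypothetical connection string
    STUCK_THRESHOLD = 3600               # seconds a call may run before we worry

    conn = cx_Oracle.connect("dba_user", "dba_password", DSN, mode=cx_Oracle.SYSDBA)
    cur = conn.cursor()

    # List active sessions whose current call has been running for a long time.
    cur.execute(
        """SELECT sid, serial#, username, last_call_et
             FROM v$session
            WHERE status = 'ACTIVE'
              AND username IS NOT NULL
              AND last_call_et > :threshold""",
        threshold=STUCK_THRESHOLD,
    )
    for sid, serial, username, elapsed in cur:
        print(f"session {sid},{serial} ({username}) active for {elapsed}s")

    # After confirming which session is stuck, terminate it.
    sid, serial = 123, 45678             # example values taken from the list above
    cur.execute(f"ALTER SYSTEM KILL SESSION '{sid},{serial}' IMMEDIATE")
    conn.close()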
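
The CMS transfer issue above was fixed by a housekeeping script that removes stale gridftp processes on the disk servers. Below is a minimal illustrative Python sketch of such a cleanup, assuming the psutil module, an assumed process name and a six-hour age cut-off; the script actually deployed on the disk servers may differ.

    # Illustrative sketch only: kill gridftp transfer processes that have been
    # running longer than a cut-off. The name and age limit are assumptions.
    import time
    import psutil

    PROCESS_NAME = "globus-gridftp-server"   # assumed name of the transfer processes
    MAX_AGE = 6 * 3600                       # assumed cut-off of six hours

    now = time.time()
    for proc in psutil.process_iter(["pid", "name", "create_time"]):
        try:
            if proc.info["name"] != PROCESS_NAME:
                continue
            age = now - proc.info["create_time"]
            if age > MAX_AGE:
                print(f"killing stale {PROCESS_NAME} pid={proc.info['pid']} age={age:.0f}s")
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # Process has already exited or is not ours to touch; ignore it.
            pass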

Current operational status and issues.

  • Reported last week was the double disk failure on gdss138, part of the LHCb_Dst space token (D1T0), with resulting data loss. It was also reported that, during tests after the disks were replaced, a further disk error was found. This disk was replaced, more extensive tests were run, and the server was readied for production. However, the return to production is delayed pending discussions on migrating to RAID6-only servers for this service class.
  • Long-standing database disk array problem: the test of the UPS bypass has been delayed and will now be scheduled for early January (see below). We now plan to turn off critical services and databases so that we are better able to recover in a controlled manner should there be a problem during the test.
  • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser service class. This is still under investigation.
  • A mismatch between tape contents and Castor metadata is being investigated. This dates from 2007 and has been found for CMS data; so far investigations have found no other evidence of the problem. It affects 11 tapes holding a total of 983 files.
  • A configuration issue on the CREAM CE (lcgce01) caused problems for CMS Monte Carlo production jobs. We are awaiting application of the fix.

Advance warning:

  • Thursday 10th December. At Risk on site-bdii to enable resilience in updating CIP (Castor Information Provider).
  • Possible: during the first part of the week beginning Monday 21st December:
    • Turn off the old home NFS file system (already replaced; this is just a tidy-up).
    • Outage of Castor and batch for disk server reboots.
    • Migrate CIP (Castor Information Provider) to more resilient hardware.
  • Updates being planned for January.
    • Tuesday 5th January: Test of the UPS bypass. The databases will be stopped, so this will also involve stopping services (Castor, LFC, FTS).
    • Establishing further plans for January.

Table showing entries in GOC DB starting between 2nd and 9th December.

  • None