Tier1 Operations Report 2010-01-13


RAL Tier1 Operations Report for 13th January 2010.

Review of Issues during the week 6th to 13th January 2010.

  • Problem on the Castor 'GEN' instance with the process that allocates disk servers to write requests. This was traced to garbage appended to the end of a database table. During an outage to the GEN instance on Friday (8th) the table was cleaned up and the problem resolved.
  • BNL have had problems transferring files (for Atlas) from RAL. This was traced to an interaction between the FTS (BNL are running the new version 2.2) and requests for files on a disk server that is draining. The problem is temporary (until the draining completes) and Brian has a manual workaround for when it occurs.
  • Night 11/12 January: Air conditioning failure in part of the Atlas building. However, this was not a significant problem for the Tier1 as we have no critical equipment in the areas affected.
  • 13th Jan: At Risk on LHCb 3D and FTS while an engineer fixes a problem on one of the two Oracle RAC nodes behind these services. (Delayed from last week owing to snow.)

Current operational status and issues.

  • FSPROBE errors reported on gdss79 (LHCbDst) and gdss70 (LHCbMDst - D1T1). Checksums have been received from LHCb and are being compared against the files on disk (a comparison sketch follows this list). Both servers remain out of service. Initial findings show some checksum differences that remain to be understood.
  • Long-standing database disk array problem: Following the successful test of the UPS bypass last week, plans are underway to migrate the databases back. This will initially be to non-UPS power; once the UPS problems are resolved, the disk arrays will be moved back to UPS power. (See the advanced warning section below.)
  • Issues with the WMSs reported over the holiday are awaiting the installation of a patch (due during January).
  • The following items are unchanged since the last meeting:
    • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is under investigation.
    • There is a problem with Castor disk-to-disk copies for LHCb from the LHCbUser Service Class. This is still under investigation.
    • A mismatch between tape contents and Castor meta-data is being investigated. This dates from 2007 and has been found for CMS data; so far investigations have not found evidence of the problem elsewhere. It affects 11 tapes holding a total of 983 files (see the cross-check sketch after this list).
    • A configuration issue on the CREAM CE (lcgce01) caused problems for Monte-Carlo production jobs for CMS. Awaiting application of the fix.
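
On the checksum comparison for gdss79 and gdss70 above: the sketch below shows one minimal way such a comparison could be done, assuming the checksums received from LHCb are Adler-32 values (a common grid storage checksum) delivered as a two-column manifest of file path and hex checksum. The manifest format and all names here are illustrative assumptions, not the actual Tier1 tooling.

    # Minimal sketch: verify files on disk against a checksum manifest.
    # Assumption (not from the report): checksums are Adler-32, given as
    # "<path> <hex-checksum>" per line. Names are hypothetical.
    import sys
    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Adler-32 of a file, computed incrementally for large files."""
        value = 1  # Adler-32 starts from 1, not 0
        with open(path, 'rb') as f:
            while True:
                block = f.read(chunk_size)
                if not block:
                    break
                value = zlib.adler32(block, value)
        return value & 0xffffffff  # force an unsigned 32-bit result

    def compare_manifest(manifest_path):
        """Print each file whose on-disk checksum differs from the manifest."""
        mismatches = 0
        with open(manifest_path) as manifest:
            for line in manifest:
                path, expected_hex = line.split()
                expected = int(expected_hex, 16)
                actual = adler32_of(path)
                if actual != expected:
                    mismatches += 1
                    print('MISMATCH %s expected=%08x actual=%08x'
                          % (path, expected, actual))
        print('%d mismatch(es) found' % mismatches)

    if __name__ == '__main__':
        compare_manifest(sys.argv[1])  # e.g. lhcb_checksums.txt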
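
Similarly, for the tape/meta-data mismatch above, the following sketch shows the general shape of such a cross-check, assuming (hypothetically) that the tape contents and the Castor meta-data can each be dumped to a plain-text file with one file identifier per line.

    # Minimal sketch: set comparison between a tape's actual contents and
    # what the Castor meta-data says it should hold. Input file names and
    # formats are assumptions for illustration only.
    def load_set(path):
        """Read one identifier per line into a set, skipping blank lines."""
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    def cross_check(tape_listing, catalogue_listing):
        on_tape = load_set(tape_listing)
        in_catalogue = load_set(catalogue_listing)
        # Files the meta-data expects but the tape lacks, and vice versa.
        return in_catalogue - on_tape, on_tape - in_catalogue

    if __name__ == '__main__':
        missing, orphaned = cross_check('tape_contents.txt', 'castor_metadata.txt')
        print('%d catalogued files missing from tape' % len(missing))
        print('%d files on tape unknown to the meta-data' % len(orphaned))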

Advanced warning:

The overall approach is to try to complete the interventions by the end of January and then run stably through February up to LHC start-up.

  • Tuesday/Wednesday 19/20 January (preceded by a farm drain): a two-day outage to carry out a significant amount of work, including:
    • Migrating Oracle databases for Castor, LFC & FTS back to their original disk arrays.
    • FSCK all disk servers and update their kernels (see the sketch after this list).
    • Update the batch engine to 64-bit (this requires the farm drain) and apply kernel updates on the worker nodes.
    • Various other updates to CEs, WMSs and some network reconfiguration.
  • Wednesday 27th January: Migrate 3D & LHCb-LFC databases back to original disk arrays.
  • (Date to be decided) Network intervention on the Site Access and UKLight routers.
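
For the FSCK item in the 19/20 January list above, the sketch below illustrates one way such a pass could be scripted, assuming each data filesystem can be unmounted, checked and remounted in turn. The device list, and the reliance on /etc/fstab for remounting, are assumptions; the actual RAL procedure is not described in this report.

    # Minimal sketch: offline fsck of a disk server's data filesystems.
    # Must run as root; the device list below is a placeholder.
    import subprocess
    import sys

    DATA_FILESYSTEMS = ['/dev/sdb1', '/dev/sdc1']  # hypothetical devices

    def fsck_filesystem(device):
        """Unmount a data filesystem, force a full fsck, then remount it."""
        subprocess.check_call(['umount', device])
        # -f forces a check even if the filesystem looks clean;
        # -y answers yes to repair prompts so the run is unattended.
        status = subprocess.call(['fsck', '-f', '-y', device])
        # Remounting by device assumes an /etc/fstab entry exists for it.
        subprocess.check_call(['mount', device])
        return status

    if __name__ == '__main__':
        for dev in DATA_FILESYSTEMS:
            status = fsck_filesystem(dev)
            print('%s: fsck exit status %d' % (dev, status))
            if status > 1:  # >1 means uncorrected errors or a reboot needed
                sys.exit(status)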

Entries in GOC DB starting between 6th and 13th January 2010.

  • There were no scheduled outages in this period.
  • One unscheduled entry (the At Risk for the air conditioning problem) had to be extended.
  • There were two entries (one an "At Risk", one an "Outage") for the Castor GEN instance problem.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lhcb-lfc, lugh | SCHEDULED | AT_RISK | 13/01/2010 11:00 | 13/01/2010 15:00 | 4 hours | An engineer is coming to investigate memory errors on one (of a pair) of the Oracle RAC nodes behind this service.
ce.ngs, lcgce01, lcgce02, srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-t2k | UNSCHEDULED | OUTAGE | 08/01/2010 13:00 | 08/01/2010 15:00 | 2 hours | Outage during investigation of a problem on the Castor 'GEN' instance that affects the allocation of write requests to disk servers.
Tier1 site | UNSCHEDULED | AT_RISK | 07/01/2010 16:16 | 08/01/2010 14:00 | 21 hours and 44 minutes | Continuing At Risk owing to ongoing problems on the air conditioning system, which has no standby.
Castor GEN instance (srm-alice, srm-dteam, srm-hone, srm-ilc, srm-mice, srm-minos, srm-t2k) | UNSCHEDULED | AT_RISK | 07/01/2010 11:58 | 07/01/2010 16:00 | 4 hours and 2 minutes | At Risk on the Castor 'GEN' instance during investigation into a problem with the process within Castor that allocates disk servers.
Tier1 site | UNSCHEDULED | AT_RISK | 07/01/2010 09:50 | 07/01/2010 16:00 | 6 hours and 10 minutes | At Risk declared as there are faults on one (of two) chillers and one (of two) pumps that provide the cooling, leaving no standby.