Tier1 Operations Report 2011-10-12


RAL Tier1 Operations Report for 12th October 2011

Review of Issues during the week 5th to 12th October 2011.

  • On Wednesday lunchtime (5th) one of the site BDIIs was not working. This was picked up by a test for the presence of information about our site (RAL-LCG2) in the Top BDII (a minimal sketch of such a check is given after this list). A restart of the BDII service fixed it and the problem was resolved within an hour.
  • On Wednesday (5th) late afternoon we started to see problems on CE08 (an LCG CE). During Wednesday and Thursday attempts to fix these were intermittently successful. However, an unscheduled warning was declared in the GOC DB on CE08 from Thursday lunchtime (6th) until the following morning, when the problem was resolved.
  • On Thursday (6th) the batch farm was very full (although this is regularly the case nowadays). There were 2000 LHCb jobs running which, owing to large memory requirements, blocked a further 1000 job slots. Temporary adjustments were made to the maximum number of jobs LHCb could run.
  • On Tuesday (11th) there was a problem with the Atlas Castor instance. This was the same problem seen two weeks ago: a duplicate request in the stager database. (A post mortem is being produced for that earlier incident.) This time the symptoms were recognised quickly and the problem was fixed. An outage of 75 minutes was declared in the GOC DB for this incident.
  • We have set up a Perfsonar network test system. Initial results show some anomalies, such as high latencies. These may be a consequence of something in the test environment, but they need to be understood.
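
The site-presence test mentioned in the first item above can be thought of as a simple LDAP query against the Top BDII for our GlueSite entry. The sketch below is illustrative only: the Top BDII hostname is a placeholder and the exact query used by the production monitoring is an assumption.

    #!/usr/bin/env python
    """Check that the RAL-LCG2 GlueSite entry is published in a Top BDII.
    Hypothetical sketch: the hostname below is a placeholder, not the real host."""

    import subprocess
    import sys

    TOP_BDII = "top-bdii.example.org"   # placeholder hostname (assumption)
    SITE_NAME = "RAL-LCG2"

    def site_published(top_bdii, site):
        """Return True if the site's GlueSiteUniqueID is visible in the Top BDII."""
        cmd = [
            "ldapsearch", "-x", "-LLL",
            "-H", "ldap://%s:2170" % top_bdii,
            "-b", "o=grid",
            "(GlueSiteUniqueID=%s)" % site,
            "GlueSiteUniqueID",
        ]
        output = subprocess.check_output(cmd)
        return site in output.decode("utf-8", "replace")

    if __name__ == "__main__":
        found = site_published(TOP_BDII, SITE_NAME)
        print("%s %s in the Top BDII" % (SITE_NAME, "found" if found else "NOT found"))
        sys.exit(0 if found else 2)

Run regularly, a non-zero exit status from a check along these lines is enough to raise an alarm when the site entry drops out of the Top BDII.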

Resolved Disk Server Issues

  • None

Current operational status and issues.

  • Following a routine maintenance check, an intermittent short was found on the 11kV feed into the computer building. The fault was located and, following some internal switching, the discharge stopped. A transparent intervention made recently may have fixed this, but further tests are needed to confirm it.
  • Atlas reports slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign Tier2s). The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue. The types of things being studied include the disk servers' TCP/IP configurations (see the sketch after this list).
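
As an illustration of the kind of disk server TCP/IP settings being looked at, the sketch below dumps the kernel tuning parameters most often relevant to WAN transfer rates. The parameter list is an assumption chosen for illustration, not the actual set under study at RAL.

    #!/usr/bin/env python
    """Dump kernel TCP tuning parameters often relevant to WAN transfer rates.
    Illustrative sketch: the parameter list is an assumption, not the set studied at RAL."""

    import os

    # Standard Linux sysctls commonly inspected when chasing asymmetric transfer rates.
    PARAMS = [
        "net/core/rmem_max",
        "net/core/wmem_max",
        "net/ipv4/tcp_rmem",
        "net/ipv4/tcp_wmem",
        "net/ipv4/tcp_window_scaling",
        "net/ipv4/tcp_congestion_control",
    ]

    def read_sysctl(key):
        """Return the current value of a sysctl key, or None if it is not present."""
        path = os.path.join("/proc/sys", key)
        try:
            with open(path) as handle:
                return handle.read().strip()
        except IOError:
            return None

    if __name__ == "__main__":
        for key in PARAMS:
            value = read_sysctl(key)
            print("%-35s %s" % (key.replace("/", "."), value if value is not None else "<not set>"))

Comparing such a dump between servers with good and poor WAN rates is one simple way to spot configuration differences.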

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server is still undergoing tests.
  • gdss456 (AtlasDataDisk) failed with a read-only file system on Wednesday 28th September. The system is being drained ahead of further investigation. However, draining is taking some time (3.5K files left to drain as of yesterday, 11th Oct).
  • On Thursday (29th Sep) FSPROBE reported a problem on gdss295 (CMSFarmRead). This server has undergone tests and had the firmware updated on its RAID controller cards. It is currently verifying its RAID array before being returned to production.
  • gdss296 (CMSFarmRead) has been out of production since 20th August. This server is undergoing acceptance tests before being returned to production.

Notable Changes made this last week

  • The writing of Atlas data to the T10KC tapes continues without problems, and a start has been made on migrating Atlas data from the 'A' to the 'C' tapes. This migration has thrown up the first problem files while copying off the 'A' tapes, and we are clarifying the procedure we will use to manage these.
  • On Tuesday (11th) the FTS agents, which manage the file transfers for each FTS channel, were re-configured to spread them across two nodes. The single node that had been running them all was starting to show high load.
  • A disk-only (D1T0) service class has been created in Castor for Alice and is now in operation.

Forthcoming Work & Interventions

  • WMS02: we are planning an intervention from Thursday (13th Oct) at 16:00 until the following Thursday (20th). This allows time to drain its jobs and for the maintenance needed to resolve a problem of its database growing too large.
  • Tuesday 18th October. Microcode updates for the tape libraries. No tape access from 09:00 to 13:00.
  • Tuesday 1st November: There will be an intervention on the network link into RAL that will last up to an hour. We plan to declare a site Outage and will drain out the FTS and pause batch work.

Declared in the GOC DB

  • LCGCE09 (CREAM CE) will be unavailable from Thursday (13th) until Thursday (20th) for a gLite update to be applied (preceded by a drain).

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix the problem on the 11kV power feed into the building and to connect parts of the cooling system to the UPS. The 11kV problem may already have been fixed, but we await confirmation. Moving part of the cooling system onto the UPS supply may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address the permissions problem regarding Atlas user access to all Atlas data.
  • Further updates to CEs (CE06 de-commissioning; gLite updates on CE09 still outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in the GOC DB starting between 5th and 12th October 2011.

There were two unscheduled entries in the GOC DB in this last week: an outage on the Atlas Castor SRM (srm-atlas) and a warning on CE08.

Service   | Scheduled?  | Outage/At Risk | Start            | End              | Duration                | Reason
lcgfts    | SCHEDULED   | WARNING        | 11/10/2011 09:00 | 11/10/2011 11:00 | 2 hours                 | Re-configuration of FTS channel agents.
srm-atlas | UNSCHEDULED | OUTAGE         | 11/10/2011 08:45 | 11/10/2011 10:00 | 1 hour and 15 minutes   | Outage while we investigate a problem with Castor Atlas.
lcgce08   | UNSCHEDULED | WARNING        | 06/10/2011 13:00 | 07/10/2011 09:15 | 20 hours and 15 minutes | We are experiencing problems on this CE (LCGCE08) which are under investigation.