Tier1 Operations Report 2011-09-28

RAL Tier1 Operations Report for 28th September 2011

Review of Issues during the week 21st to 28th September 2011.

  • On Thursday morning (22nd Sep) many Atlas file transfers were failing. The number of FTS transfers was reduced (to 25% of the normal value) and the system recovered. The rate of file transfers was then ramped back up during the day.
  • Over the weekend (24/25 Sep) a large backlog (around 12,000 files) of Atlas tape migrations built up, caused by the high rate of tape migration requests. The backlog is now being processed.
  • Starting on Friday evening (23rd Sep) an error in the CMS SAM tests caused us to fail those tests for around 7 hours.
  • On Tuesday morning (27th Sep) at around 4am we started to fail transfers to/from the Atlas Castor instance. A GGUS alarm ticket was received. Investigations during the morning found an inconsistency in the Castor Oracle database. The instance was declared to be in an outage in the GOC DB and was returned to production at 15:30 that day. A Post Mortem will be prepared for this incident.
  • Packet loss on the main network link from the RAL site is now at a very low level, well below any level that would cause operational difficulties. This will continue to be monitored, but it is appropriate to remove it from the list of ongoing problems.
  • Added on 12-Oct-11: A problem was found on an LHCb disk server (gdss500): the xroot process was running but not doing anything, and it may have been in this state for some time. This was picked up on 28th Sep.

Resolved Disk Server Issues

  • None.

Current operational status and issues.

  • We have seen a high rate of writing to tape and our stock of T10KA/B tapes is diminishing rapidly. We will bring the T10KC tapes into service this week.
  • Following a routine maintenance check, a problem was located on the 11kV feed into the computer building: an intermittent short was taking place. After some internal switching the discharge stopped. A transparent intervention made recently may have fixed the underlying fault, but further tests are needed to confirm this.
  • Atlas report slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well, but between RAL and foreign Tier2s. The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue. The types of things being studied include disk server TCP/IP configurations (a comparison sketch follows this list).
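
As an illustration of the TCP/IP configuration comparison mentioned above, the sketch below dumps the kernel tuning parameters most often implicated in asymmetric wide-area transfer rates so that they can be diffed between disk servers. This is a minimal, hypothetical aid, not the tooling actually used in the investigation, and the parameter list is an assumption about which settings are relevant.

    #!/usr/bin/env python
    """Minimal sketch: dump TCP tuning parameters relevant to wide-area transfers.

    Run this on each disk server and diff the output between hosts. The list of
    parameters is illustrative, not the actual set examined at the Tier1.
    """
    import socket

    # sysctl keys commonly inspected when debugging WAN transfer throughput
    PARAMS = [
        "net/core/rmem_max",                # maximum receive socket buffer
        "net/core/wmem_max",                # maximum send socket buffer
        "net/ipv4/tcp_rmem",                # min/default/max TCP receive buffer
        "net/ipv4/tcp_wmem",                # min/default/max TCP send buffer
        "net/ipv4/tcp_congestion_control",  # congestion control algorithm in use
        "net/ipv4/tcp_window_scaling",      # must be enabled for large windows
        "net/ipv4/tcp_sack",                # selective acknowledgements
    ]

    def read_sysctl(key):
        """Read a sysctl value via /proc/sys; return None if the key is absent."""
        try:
            with open("/proc/sys/" + key) as f:
                return f.read().strip()
        except IOError:
            return None

    if __name__ == "__main__":
        host = socket.gethostname()
        for key in PARAMS:
            print("%s %s = %s" % (host, key.replace("/", "."), read_sysctl(key)))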

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server is still undergoing tests.
  • On Wednesday (21st Sep) an investigation into failing transfers for Atlas led to the discovery of corrupt files on disk server gdss487 in AtlasDataDisk. These are old files that were added to Castor before checksumming was enabled; around 150 files have been lost. Final clearing up is still ongoing with Atlas. A validation of all files on the disk server that have checksums showed no problems (a sketch of this kind of check follows this list).
  • The checksumming check has also thrown up a problem with one Alice file on gdss460 (AliceTape). This appears to be a one-off transfer failure and has been reported to Alice as a data loss.
  • This morning (Wed 28th Sep) at around 07:00 the monitoring found a read-only file system on gdss456 (AtlasDataDisk). This server is currently out of production while this is investigated.
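
For illustration, the sketch below shows the kind of checksum validation referred to in the items above: it recomputes the Adler-32 checksum of each file on disk and compares it with the expected value. It assumes a plain-text listing of "path expected_checksum" pairs; the actual validation is driven from Castor and its details are not covered in this report.

    #!/usr/bin/env python
    """Minimal sketch of a checksum validation pass over a disk server.

    Expects a listing file with one "path expected_adler32_hex" pair per line.
    The real validation is driven from Castor; this is an illustration only.
    """
    import sys
    import zlib

    def adler32_of_file(path, blocksize=1 << 20):
        """Compute the Adler-32 checksum of a file, reading 1 MB at a time."""
        checksum = 1  # Adler-32 starting value
        with open(path, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                checksum = zlib.adler32(block, checksum)
        return checksum & 0xffffffff  # force an unsigned 32-bit result

    def validate(listing):
        """Return the number of files whose on-disk checksum does not match."""
        mismatches = 0
        with open(listing) as f:
            for line in f:
                if not line.strip():
                    continue
                path, expected = line.split()
                if adler32_of_file(path) != int(expected, 16):
                    print("MISMATCH %s expected=%s" % (path, expected))
                    mismatches += 1
        return mismatches

    if __name__ == "__main__":
        sys.exit(1 if validate(sys.argv[1]) else 0)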

Notable Changes made this last week

  • The disk controller firmware in all production Viglen2007a disk servers has been updated. Most were done during last week (19-23 September). The opportunity was taken to do the last few (which were Atlas ones) during the outage of the Atlas Castor instance yesterday (Tuesday 27th Sep).

Forthcoming Work & Interventions

  • Start writing Atlas data to T10KC tapes tomorrow (29th September).

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix the problem on the 11kV power feed to the building. This may already have been fixed, but we await confirmation. There are also plans to connect part of the cooling system to the UPS supply; this may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs: CE06 de-commissioning; gLite updates on CE09 still outstanding.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 21st and 28th September 2011.

There were two unscheduled entries during this week. Both relate to the problem on the Atlas Castor instance on Tuesday 27th September.

Service   | Scheduled?  | Outage/At Risk | Start            | End              | Duration               | Reason
srm-atlas | UNSCHEDULED | WARNING        | 27/09/2011 09:50 | 27/09/2011 15:30 | 5 hours and 40 minutes | Outage to investigate and fix the problems with the Oracle database for Castor Atlas
srm-atlas | UNSCHEDULED | OUTAGE         | 27/09/2011 04:00 | 27/09/2011 11:00 | 7 hours                | We are currently investigating a problem with Castor Atlas