Tier1 Operations Report 2011-10-05

RAL Tier1 Operations Report for 5th October 2011

Review of Issues during the week 28th September to 5th October 2011.

  • At the end of last week (Friday 30th Sep) there was a large backlog of tape migrations for Atlas, particularly from the AtlasMCTape service class, which was filling up. Over the weekend (1/2 Oct) FTS transfers for Atlas were throttled back to ensure this area did not fill up completely. By Monday the tape migration backlog had been cleared and the FTS channels were opened back up.
  • Over the weekend, and extending up to Tuesday (1 - 4 Oct), there were network connectivity problems across the Tier1 network. This caused some operational degradation - for example, some CE SAM tests failed with connection problems. Some packet loss could also be seen within the Tier1 network. Investigations showed anomalous data flows and large numbers of discarded packets. The problem disappeared in the early hours of Tuesday morning (4th October). A review of activity shows that it was triggered (although we believe not caused) by the repack operation that had been ongoing, and it was possible to reproduce the problem using generated traffic (see the sketch after this list). Flushing the MAC address caches in the interfaces of the C300 switch fixed most of the problems and returned data flows to normal. Some lower-level issues remained, but this has been an effective resolution of the problem. Nevertheless, why the problem arose in the first place is not understood.
  • This morning (5th October) one of the five nodes that make up the Top BDII, lcgbdii0633, hung. It was brought back up around an hour later.
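
As a purely illustrative sketch of the kind of test referred to above (reproducing the problem with generated traffic), the short Python script below sends a burst of UDP datagrams from one host and counts how many arrive at another within a fixed window. The port, packet count and payload size are placeholders chosen for illustration; this is not the tool actually used in the investigation.

    #!/usr/bin/env python
    # Minimal UDP traffic generator/counter of the kind that could be used
    # to reproduce anomalous flows and estimate packet loss between two
    # hosts on the Tier1 network. Port, counts and payload are assumptions.

    import socket
    import sys
    import time

    PORT = 9999            # arbitrary test port (assumption)
    PACKETS = 10000        # number of datagrams to send
    PAYLOAD = b"x" * 1400  # close to a full Ethernet frame

    def send(target):
        """Send a burst of numbered UDP datagrams to the receiver."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for i in range(PACKETS):
            sock.sendto(("%08d" % i).encode() + PAYLOAD, (target, PORT))
        print("sent %d datagrams to %s:%d" % (PACKETS, target, PORT))

    def receive(seconds=30):
        """Count datagrams arriving within a fixed window and report loss."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", PORT))
        sock.settimeout(1.0)
        got = 0
        deadline = time.time() + seconds
        while time.time() < deadline:
            try:
                sock.recv(2048)
                got += 1
            except socket.timeout:
                pass
        print("received %d/%d datagrams (%.1f%% loss)"
              % (got, PACKETS, 100.0 * (PACKETS - got) / PACKETS))

    if __name__ == "__main__":
        if sys.argv[1:] and sys.argv[1] == "send":
            send(sys.argv[2])    # run "send <receiver-host>" on one node
        else:
            receive()            # run with no arguments on the other node

Run the receiver on one node and the sender on another; comparing the loss figure across different source/destination pairs gives a simple way to localise where packets are being discarded.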

Resolved Disk Server Issues

  • On Friday (30th Sep) GDSS374 (AtlasFarm) rebooted. The cause is unknown. The server was only unavailable for the duration of the reboot.

Current operational status and issues.

  • Following a routine maintenance check, a problem was found on the 11kV feed into the computer building: an intermittent short was taking place. After some internal switching the discharge stopped. A transparent intervention made recently may have fixed this, but further tests are needed to confirm it.
  • Atlas report slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well (but between RAL and foreign Tier2s). The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue. The types of things being studied include the tcp/ip configuration of the disk servers (see the sketch below).
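
As an illustration of what comparing disk server tcp/ip configurations might involve, the short Python sketch below dumps a handful of TCP-related kernel settings via /proc/sys so they can be compared between hosts. The particular parameter list is an assumption chosen for illustration, not the actual checklist in use.

    #!/usr/bin/env python
    # Dump a few TCP settings relevant to wide-area transfer throughput.
    # The list of keys below is illustrative only (assumption).

    TCP_KEYS = [
        "net.core.rmem_max",
        "net.core.wmem_max",
        "net.ipv4.tcp_rmem",
        "net.ipv4.tcp_wmem",
        "net.ipv4.tcp_congestion_control",
        "net.ipv4.tcp_window_scaling",
    ]

    def read_sysctl(key):
        """Read a sysctl value via /proc/sys, returning None if absent."""
        path = "/proc/sys/" + key.replace(".", "/")
        try:
            with open(path) as f:
                return f.read().strip()
        except IOError:
            return None

    if __name__ == "__main__":
        for key in TCP_KEYS:
            print("%-35s %s" % (key, read_sysctl(key)))

Running the same dump on a well-performing and a poorly-performing server gives a quick first check for configuration differences between hosts.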

Ongoing Disk Server Issues

  • Thursday 13th Sep: FSPROBE reported a problem on gdss396 (CMSWanIn). This resulted in the loss of a single CMS file (since copied in from elsewhere). The server is still undergoing tests.
  • Last Wednesday (28th Sep) at around 07:00 monitoring found that gdss456 (AtlasDataDisk) had failed with a read-only file system. The system is being drained ahead of further investigation. However, this process is slow: Atlas have seen errors whilst reading files from a draining disk server, and over the weekend the server was in read-only mode.
  • On Thursday (29th Sep) FSPROBE reported a problem on gdss295 (CMSFarmRead). This server is out of production pending investigation.
  • gdss296 (CMSFarmRead) has been out of production since 20th August. This server is undergoing acceptance tests before being returned to production.

Notable Changes made this last week

  • Atlas data was switched to write to T10KC tapes on Thursday 29th September. This has essentially removed the possibility of running out of A/B media.

Forthcoming Work & Interventions

  • Tuesday 11th October. Re-configuration of FTS to spread agents across two machines rather than one.
  • Tuesday 18th October. Microcode updates for the tape libraries.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix the problem on the 11kV power feed into the building; this may already have been fixed, but we await confirmation. There are also plans to move part of the cooling system onto the UPS supply, which may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure. This will only proceed once the problem that caused the cancellation of the first stage of this work last week is understood and fixed.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Further updates to CEs (CE06 de-commissioning; gLite updates on CE09 outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 28th September and 5th October 2011.

None.