Tier1 Operations Report 2011-08-17

RAL Tier1 Operations Report for 17th August 2011

Review of Issues during the week from 10th to 17th August 2011.

  • On Friday (4th Aug) there was a problem running batch work for LHCb on any worker nodes that had been re-installed. This was fixed by a change to the variable defining the location of the LHCb software directory. However, there were problems rolling this fix out to some nodes. The cause (some nodes being set to ignore Quattor updates) was identified on Monday (8th) and the problem was believed fixed. However, by Wednesday (10th) it was clear that a few nodes were still missing the update; all nodes were finally checked out by Friday (12th).
  • On Friday (12th Aug) there were contention problems reading from and writing to the Atlas Scratch Disk, which was very full. This was referred back to Atlas. However, our investigations also revealed that the BDII information was not being updated from the CIP, and this was fixed.
  • Over the weekend (14/15 Aug) we received a GGUS ticket from Atlas about failing Atlas sonar test file transfers. These were fixed by reverting a change in the FTS (srmcopy was changed back to urlcopy for these channels). However, it is not clear whether this addressed the root cause of the problem or just a symptom.
  • On Tuesday (16th) three files were declared lost to LHCb. These all came from a single tape and were identified when attempts to read the data failed.
  • On Tuesday (16th) one file was declared lost to Atlas. This was picked up by a daily check of the checksums of files written the previous day (a minimal sketch of this kind of check follows this list).
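
To illustrate the kind of verification referred to above, the sketch below recomputes adler32 checksums for a list of recently written files and compares them against recorded values. The manifest format, the file locations and the choice of adler32 are assumptions made for this example only; it is not a description of the production check.

    # check_checksums.py - minimal sketch, not the production tool.
    # Assumes a manifest file with one '<hex checksum> <path>' entry per line.
    import sys
    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Recompute the adler32 checksum of a file, reading it in chunks."""
        checksum = 1  # adler32 starting value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        return checksum & 0xFFFFFFFF

    def verify(manifest_path):
        """Return (path, recorded, actual) for every file whose recomputed
        checksum differs from the recorded one."""
        mismatches = []
        with open(manifest_path) as manifest:
            for line in manifest:
                recorded, path = line.split(None, 1)
                actual = adler32_of(path.strip())
                if actual != int(recorded, 16):
                    mismatches.append((path.strip(), recorded, "%08x" % actual))
        return mismatches

    if __name__ == "__main__":
        for path, recorded, actual in verify(sys.argv[1]):
            print("MISMATCH %s recorded=%s actual=%s" % (path, recorded, actual))

Any mismatch reported in this way would then be investigated and, if the copy is unrecoverable, the file declared lost to the experiment as in the case above.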

Resolved Disk Server Issues

  • GDSS419 (AtlasDataDisk) had a problem with a read-only file system on the morning of Wednesday 10th Aug and was taken out of production. It was returned to production the same evening, initially in read-only mode, and to full (read/write) production the following day.
  • GDSS96 (CMSWanIn) crashed on Sunday evening (14th Aug). It was returned to production Monday afternoon.
  • GDSS565 (AtlasStripInput) was not responding correctly over the weekend (transfers were failing) although it was passing the monitoring tests. The system was rebooted and returned to full production on Monday morning (15th).
  • GDSS228 & GDSS229 (both AtlasScratchDisk) were removed from production on Tuesday (16th Aug) as both had double disk failures. They were returned to production this morning (Wednesday 17th Aug) after the first failed drive in each had been rebuilt.

Current operational status and issues.

  • During a routine maintenance check an intermittent short was found on the 11kV feed into the computer building. The fault has now been located and, following some internal switching, the discharge has stopped. However, an intervention on the building's power systems is required, although the extent of any power outage during this work is not yet known.
  • The problem of packet loss on the main network link from the RAL site remains. The RAL networking team is actively investigating, and a failover test of the RAL JANET link tomorrow morning (Thursday 18th Aug) is part of this work.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS appears to see this as well, but between RAL and foreign Tier2s. The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • None

Forthcoming Work & Interventions

Declared in the GOC DB

  • None.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Failover of the RAL Janet link on the morning of Thursday 18th August.
  • Intervention to fix the problem on the 11kV power feed to the building and to connect some parts of the cooling system to the UPS. This is being planned but may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking (routing) change relating to https traffic outbound.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs (CE06 de-commissioning; gLite updates on CE09 outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 10th and 17th August 2011.

There were no entries during this week.