Tier1 Operations Report 2011-08-24

RAL Tier1 Operations Report for 24th August 2011

Review of Issues during the week from 17th to 24th August 2011.

  • On Wednesday afternoon (17th) there was a backlog of tape migrations for the Castor GEN instance (SuperB). A second tape drive was allocated to the task to bring the backlog down.
  • There have been a number of issues for the Atlas Castor instance. On Friday there was very high load caused by a mixture of file transfers, batch work and disk servers draining; this was alleviated by stopping the draining of a couple of disk servers. On Saturday evening there were further problems caused by a deadlock in the Atlas SRM database, which was resolved late that evening. There were also further SAM test failures overnight Monday-Tuesday, following on from time-outs that were in turn caused by a disk server draining in the service class used for the tests. These issues have been compounded by a problem in how the Atlas SAM tests handle such time-outs, leaving the availability measure unclear.
  • There have been a couple of problems with one of the site BDII machines during the last week. These were picked up by a test for missing information in the top-BDII (see the sketch after this list).
  • There have been two instances (over the weekend and during the night of Tuesday-Wednesday) where the CMS software server has failed to serve files under load.
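
For context, the top-BDII check mentioned above amounts to verifying that the site is still publishing a sensible number of entries. The following is only an illustrative sketch of such a query, not the actual monitoring test used here: it uses the Python ldap3 library, the hostname is a placeholder, and only the port (2170) and the Glue 1.3 base DN are standard.

    # Query a top-BDII over LDAP and count the entries published for a site.
    from ldap3 import ALL, Connection, Server

    TOP_BDII = "top-bdii.example.org"   # placeholder, not a real RAL endpoint
    SITE_BASE = "Mds-Vo-Name=RAL-LCG2,Mds-Vo-Name=local,o=grid"

    server = Server(TOP_BDII, port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)      # anonymous bind
    conn.search(SITE_BASE, "(objectClass=*)", attributes=["objectClass"])
    # An empty (or unusually small) result indicates missing information.
    print(f"{len(conn.entries)} entries published for the site")
    conn.unbind()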

Resolved Disk Server Issues

  • On Thursday (18th Aug) GDSS416 (LHCbRawRDst) was unavailable for about an hour. It crashed with a problem on the system disk, which was then fsck'd.
  • On Sunday (21st Aug) GDSS230 (AtlasGroupDisk) was taken out of production following a double disk failure. It was returned to production on the morning of Tuesday (23rd).
  • On Monday (22nd Aug) GDSS211 (AtlasGroupDisk) was taken out of production following a double disk failure. It was returned to production on the morning of Tuesday (23rd).
  • On Monday (22nd Aug) GDSS233 (AtlasGroupDisk) was taken out of production. It had a single disk failure; however, a mis-wired backplane led to the wrong disk being removed from the system, precipitating a double disk failure. It took some time for the disks to rebuild, but once this was completed for the 'good' drive the system was taken down, the wiring fault was corrected and the correct disk replaced. The system was returned to production early yesterday afternoon (23rd Aug).
  • On Saturday (21st Aug) GDSS296 (cmsFarmRead) was taken out of production with a read-only file system. An fsck was run on its disks but the file structure on all three Castor partitions was badly damaged. The eight un-migrated files were declared lost to CMS this morning (24th Aug).

Current operational status and issues.

  • Following a routine maintenance check, a problem was found on the 11kV feed into the computer building, with an intermittent short taking place. The location of the fault has now been identified and, following some internal switching, the discharge has stopped. However, an intervention on the power systems in the building is required, although the extent of any power outage during this work is not yet known.
  • The problem of packet loss on the main network link from the RAL site remains. The RAL networking team is actively investigating it.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS seems to experience this as well, but between RAL and foreign T2s. The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Failover test of the RAL Janet link on the morning of Thursday 18th August went OK.

Forthcoming Work & Interventions

Declared in the GOC DB

  • None.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix problem on 11kV power feed to building and connect up some parts of the cooling system to the UPS. This is being planned but may require a complete outage (including systems on UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking (routing) change relating to https traffic outbound.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs (CE06 de-commissioning; gLite updates on CE09 outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 17th and 24th August 2011.

There were no entries during this week.