Tier1 Operations Report 2011-08-03

RAL Tier1 Operations Report for 3rd August 2011

Review of Issues during the week from 27th July to 3rd August 2011.

  • On Thursday and into Friday there was a problem with Atlas tape migrations that was traced to a mis-configuration of tape pools and service classes.
  • On Friday afternoon and evening there were problems with the batch scheduler, which was not starting enough jobs. A couple of minor changes were made and the batch system has run OK since, although some internal batch system issues remain to be followed up.

Resolved Disk Server Issues

  • On Wednesday (27th July) gdss434 (AtlasDataDisk D1T0) showed memory faults and was out of production until lunchtime the following day (Thursday).
  • On Thursday, just as gdss434 was recovering, gdss435 (also AtlasDataDisk D1T0) started showing memory faults as well. It was removed from production and returned to service that evening. It was then out of service for an hour on Monday (1st Aug) while the memory was replaced.
  • Gdss211 (AtlasGroupDisk) was out of production for around 7 hours on Monday (1st Aug) following memory problems.
  • Gdss208 (AtlasScratchDisk) had been drained and taken out of production on Friday 22nd July for investigations. Following the intervention, which included rebuilding the RAID array, it was returned to service on Tuesday (2nd August).
  • Following problems, gdss190 (AtlasScratchDisk) was drained and taken out of service last Friday (29th July). It was returned to production this morning (3rd Aug).

Current operational status and issues.

  • We are investigating seven LHCb files that date from December 2010 and appear to be missing from Castor.
  • Following a routine maintenance check, an intermittent short was found on the 11kV feed into the computer building. After some internal switching, the discharge has stopped. However, an intervention on the building's power systems is still required.
  • The problem of packet loss on the main network link from the RAL site remains. The RAL networking team is actively investigating, and the reboot of the site firewall (see below) is part of this work.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetrical performance). CMS appears to experience this as well, but between RAL and foreign Tier2s. The pattern of asymmetrical flows appears complex and is being actively investigated. We are in contact with people in the US investigating a similar issue.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • On Tuesday 26th July (actually last week, but missing from last week's report) transformer TX2 was brought back into service. It had been running for two months without load following the resolution of the problem that had caused it to trip some months ago. Since then, switching changes to investigate the 11kV discharge problem have meant that we are currently running on transformers TX1 & TX3 only.
  • Some blocks of batch workers have been drained in rotation in order to change their IP addresses.

Forthcoming Work & Interventions

Declared in the GOC DB

  • Tuesday 9th August: site outage for an hour around the reboot of the RAL site firewall (08:00 - 09:00 BST).

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Modifications to CVMFS mount point for Atlas planned for next week.
  • Intervention to fix the problem on the 11kV power feed to the building and connect up some parts of the cooling system to the UPS. This is being planned but will require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking (routing) change relating to outbound HTTPS traffic.
  • Address the permissions problem regarding Atlas user access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs (CE06 de-commissioning; gLite updates on CE09 still outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 27th July and 3rd August 2011.

  • None