Tier1 Operations Report 2011-09-07

RAL Tier1 Operations Report for 7th September 2011

Review of Issues during the week from 31st August to 7th September 2011.

  • Over the weekend (Sunday 4th September) there were load issues on the Castor Atlas instance (MCTape service class). The Atlas FTS channels to RAL were reduced, eventually to 25% of their nominal values. These were raised back to 50% of nominal on Monday, and to 100% yesterday (Tuesday) morning (a scaling sketch follows this list).
  • A failure of the RAL Site Access Router broke network connectivity into RAL from 01:10 to 08:10 on the morning of Monday 5th September. The call-out mechanisms that should have notified someone of this failure did not work, resulting in the long site outage (an illustrative reachability probe also follows this list).
  • We have seen intermittent SAM test errors on our one non-CREAM CE (lcgce06) for the last week or so. The cause is not yet understood.
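
As a rough illustration of the FTS throttling described in the first item above, the sketch below scales a channel's nominal concurrent-file and stream limits to a given percentage. This is not the actual FTS channel configuration: the channel name and nominal values are hypothetical, and the real settings are applied through the FTS admin tools.

    # Hypothetical sketch: scale an FTS channel's limits to a fraction of
    # nominal. The channel name and nominal values are invented for
    # illustration only.
    NOMINAL = {"CERN-RAL": {"files": 40, "streams": 10}}

    def throttled(channel, percent):
        """Return (files, streams) for `channel` at `percent` of nominal."""
        nominal = NOMINAL[channel]
        files = max(1, round(nominal["files"] * percent / 100))
        streams = max(1, round(nominal["streams"] * percent / 100))
        return files, streams

    for pct in (25, 50, 100):   # the steps used during this incident
        print(pct, throttled("CERN-RAL", pct))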
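
The router incident above also suggests the value of an independent, off-site reachability probe as a backstop to the normal call-out chain. The following is a minimal sketch of such a probe, not the production call-out system; the alert addresses are placeholders, and the probe target is simply one of the service hosts listed in the GOC DB table below.

    # Minimal sketch of an external reachability probe (not the production
    # call-out system). It must run from outside the site to be useful.
    import subprocess
    import smtplib
    from email.message import EmailMessage

    HOST = "lfc.gridpp.rl.ac.uk"   # probe target; any site endpoint would do

    def reachable(host, count=5):
        """True if at least one ICMP echo reply is received."""
        result = subprocess.run(["ping", "-c", str(count), host],
                                capture_output=True, text=True)
        return result.returncode == 0

    if not reachable(HOST):
        msg = EmailMessage()
        msg["Subject"] = "CALL-OUT: " + HOST + " unreachable"
        msg["From"] = "probe@example.org"    # placeholder address
        msg["To"] = "oncall@example.org"     # placeholder address
        msg.set_content("Independent probe could not reach the RAL site.")
        smtplib.SMTP("localhost").send_message(msg)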

Resolved Disk Server Issues

  • None.

Current operational status and issues

  • The Atlas migration queue (writing to tape) has been developing a large backlog since Sunday. Load is high, but we are investigating as there appears to be a further problem exacerbating it.
  • There is an ongoing problem today migrating CMS data to tape, triggered by a particular set of CMS files.
  • Following a routine maintenance check, an intermittent short was found on the 11kV feed into the computer building. The fault has been located and, following some internal switching, the discharge has stopped. However, an intervention on the building's power systems is still required, and the extent of any power outage during this work is not yet known.
  • The problem of packet loss on the main network link from the RAL site remains, and the RAL networking team continue to investigate it actively. The loss is currently at a low level and is not causing problems, although there is a concern that it may worsen if network load rises (a simple loss-measurement sketch follows this list).
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and CERN (i.e. asymmetric performance). CMS appears to see the same effect, but between RAL and foreign Tier2s. The pattern of asymmetric flows is complex and is being actively investigated; we are in contact with people in the US who are looking at a similar issue. Among the things being studied are the disk servers' TCP/IP configurations (see the tuning sketch after this list).
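
Low-level packet loss of the kind described above can be tracked with a simple periodic probe. The sketch below is illustrative only, not the networking team's tooling; the target host and the alert threshold are assumptions.

    # Illustrative packet-loss probe. Parses the loss percentage from the
    # ping summary line, e.g. "100 packets transmitted, 99 received,
    # 1% packet loss, time ...".
    import re
    import subprocess

    def packet_loss(host, count=100):
        """Return the percentage packet loss to `host`, or None."""
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True).stdout
        match = re.search(r"([\d.]+)% packet loss", out)
        return float(match.group(1)) if match else None

    loss = packet_loss("lfc.gridpp.rl.ac.uk")   # hypothetical target
    if loss is not None and loss > 0.5:         # illustrative threshold
        print("Warning: %.1f%% packet loss" % loss)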
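
On the asymmetric-transfer item: "disk server TCP/IP configurations" usually means the kernel's TCP buffer limits, since a receive buffer smaller than the path's bandwidth-delay product caps throughput in one direction only. The parameter names below are standard Linux sysctls; the script just reads the current values, and no particular target values are implied.

    # Read the Linux TCP tunables most often examined when WAN transfer
    # rates are asymmetric. Parameter names are standard; this reads the
    # running values rather than recommending settings.
    TUNABLES = [
        "net.core.rmem_max",
        "net.core.wmem_max",
        "net.ipv4.tcp_rmem",
        "net.ipv4.tcp_wmem",
        "net.ipv4.tcp_window_scaling",
    ]

    for name in TUNABLES:
        path = "/proc/sys/" + name.replace(".", "/")
        try:
            with open(path) as f:
                print(name, "=", f.read().strip())
        except OSError:
            print(name + ": not available on this kernel")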

Ongoing Disk Server Issues

  • GDSS233 (AtlasGroupDisk) suffered three disk failures, the second and third during last night (Tuesday-Wednesday 6/7 September). The server is still working. It was being drained, a process that was largely complete, before this happened. The system is currently out of production while it rebuilds one of its disks (a rebuild-status sketch follows).
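
For context on the rebuild: on a server using Linux software RAID, progress is reported in /proc/mdstat, which the minimal sketch below parses. This is an assumption for illustration; disk servers with hardware controllers expose the same information through vendor tools instead.

    # Minimal sketch: report RAID rebuild progress from /proc/mdstat.
    # Assumes Linux software RAID (md); hardware RAID needs vendor tools.
    import re

    device = None
    with open("/proc/mdstat") as f:
        for line in f:
            started = re.match(r"^(md\d+)\s*:", line)
            if started:
                device = started.group(1)
            progress = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", line)
            if progress and device:
                print(device, progress.group(1), progress.group(2) + "% done")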

Notable Changes made this last week

  • A networking (routing) change affecting outbound HTTPS traffic was made on Tuesday morning (6th September).

Forthcoming Work & Interventions

  • Wednesday 7th September: Apply Oracle security updates (Critical Patch Update, "CPU") to the databases behind the LFC, FTS & 3D services.
  • Provisionally: Wednesday 21st September:
    • First stage of Castor Database Migrations.
    • Routing change to extend 'OPN' address range.

Declared in the GOC DB

  • Wednesday 7th September: LFC, FTS & 3D: At Risk during rolling upgrade to apply Oracle Critical Patch Update.

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix problem on 11kV power feed to building and connect up some parts of the cooling system to the UPS. This is being planned but may require a complete outage (including systems on UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking change required to extend range of addresses that route over the OPN.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Further updates to CEs: de-commissioning of CE06; gLite updates on CE09 still outstanding.
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 31st August and 7th September 2011.

There was one unscheduled entry, when the failure of the Site Access Router disrupted network access to RAL.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgfts, lcgftm, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | WARNING | 07/09/2011 09:00 | 07/09/2011 17:00 | 8 hours | At Risk during rolling upgrade to apply Oracle Critical Patch Update.
Whole site | UNSCHEDULED | OUTAGE | 05/09/2011 01:10 | 05/09/2011 08:10 | 7 hours | Networking problem disconnected whole site.
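
The Duration column can be cross-checked from the Start and End timestamps (dd/mm/yyyy hh:mm), for example:

    # Cross-check the Duration column from Start/End timestamps.
    from datetime import datetime

    def duration_hours(start, end, fmt="%d/%m/%Y %H:%M"):
        delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
        return delta.total_seconds() / 3600

    print(duration_hours("05/09/2011 01:10", "05/09/2011 08:10"))  # 7.0
    print(duration_hours("07/09/2011 09:00", "07/09/2011 17:00"))  # 8.0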