Tier1 Operations Report 2011-08-10

RAL Tier1 Operations Report for 10th August 2011

Review of Issues during the week from 3rd to 10th August 2011.

  • As reported last week, we were investigating the loss of seven LHCb data files. A check of other similar files brought the total number declared lost to LHCb to nine. These files were written into Castor on two dates last year, 30th September and 14th October 2010, and were picked up following the tracing of failed reads. Although there were entries for these files in the Castor Nameserver database, there were no corresponding entries giving their locations on disk or tape.
  • We also reported two lost (corrupted) Atlas files: in each case the checksum of the file on disk did not match the checksum value stored by Atlas (a sketch of this kind of check is given after this list). These were also old files (written into Castor on 5th October 2010) and were found following an investigation into failing file transfers.
  • On Friday there was a problem running batch work for LHCb on any worker nodes that had been re-installed. This was fixed by a change to the variable defining the location of the LHCb software directory. However, there were problems rolling this fix out to some nodes. The cause (some nodes had been set to ignore Quattor updates) was identified on Monday (8th) and the problem was believed fixed; however, it became clear by today (Wednesday 10th) that a few nodes were still missing the update.
  • Over the weekend (6th-7th August) there were problems with the non-LHC WMS (WMS03). This system does not generate call-outs when there are problems, and these were resolved when staff returned to work on Monday (8th).
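
A minimal sketch of the kind of checksum comparison mentioned above is given below. The report does not say which algorithm or tooling was used; the choice of adler32 (common for grid file transfers), the file path and the catalogue value shown here are all illustrative assumptions, not the Tier1's actual procedure.

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Return the adler32 checksum of a file as an 8-digit hex string."""
        checksum = 1  # adler32's defined starting value
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                checksum = zlib.adler32(chunk, checksum)
        return "{0:08x}".format(checksum & 0xFFFFFFFF)

    # Placeholder values: a real check would take the path from the disk
    # server and the reference checksum from the experiment's file catalogue.
    on_disk = adler32_of("/exportstage/atlasdatadisk/example.data")
    catalogue_value = "0a1b2c3d"

    if on_disk != catalogue_value:
        print("checksum mismatch: {0} != {1}".format(on_disk, catalogue_value))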

Resolved Disk Server Issues

  • On Wednesday (3rd Aug) gdss487 (AtlasDataDisk) had a problem. It was returned to production during Thursday morning (4th).
  • On Thursday (4th) two machines, gdss490 (AtlasDataDisk) and gdss501 (CMSFarmRead), crashed following work to investigate a problem with the IPMI system on the servers. In both cases the servers ran 'fsck' on their disks and were returned to production the following morning.
  • gdss279 (CMSFarmRead) crashed on Saturday (6th August). It was returned to production early on Monday afternoon (8th).

Current operational status and issues.

  • Following a routine maintenance check, an intermittent short was found on the 11kV feed into the computer building. The source of the fault has now been located and, following some internal switching, the discharge has stopped. However, an intervention on the power systems in the building is still required.
  • The problem of packet loss on the main network link from the RAL site remains. The RAL networking team are actively investigating, and the reboot of the site firewall (see below) is part of this work.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and from CERN (i.e. asymmetric performance). CMS appears to see this as well, but between RAL and foreign Tier2s. The pattern of asymmetric flows appears complex and is being actively investigated. We are in contact with people in the US who are investigating a similar issue.

Ongoing Disk Server Issues

  • gdss419 (AtlasDataDisk) reported a read-only file system this morning (Wednesday 10th August) and was taken out of production (a sketch of how such a condition can be spotted follows).
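
As an illustration only, the sketch below flags any block device that the kernel reports as mounted read-only. It is a generic check under assumed conditions (a Linux host, data areas on /dev/* devices) and is not the monitoring actually run on the disk servers.

    import sys

    def readonly_block_mounts(proc_mounts="/proc/mounts"):
        """Return (device, mountpoint) pairs for block devices mounted read-only."""
        flagged = []
        with open(proc_mounts) as fh:
            for line in fh:
                device, mountpoint, _fstype, options = line.split()[:4]
                if device.startswith("/dev/") and "ro" in options.split(","):
                    flagged.append((device, mountpoint))
        return flagged

    if __name__ == "__main__":
        bad = readonly_block_mounts()
        for device, mountpoint in bad:
            print("read-only: {0} mounted on {1}".format(device, mountpoint))
        sys.exit(2 if bad else 0)  # non-zero so a monitoring wrapper can raise an alarm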

Notable Changes made this last week

  • RAL Site firewall reboot on Tuesday morning (9th Aug).
  • The move of worker nodes to new IP addresses, carried out as a rolling change, continues.
  • The modifications to the CVMFS mount point for Atlas have taken place, although some items remain to be done.
  • For internal purposes the daily 'cron' job on the Castor nodes (including disk servers) has been moved from 4am to 1pm.

Forthcoming Work & Interventions

Declared in the GOC DB

  • None.

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced:

  • Intervention to fix the problem on the 11kV power feed to the building and to connect some parts of the cooling system to the UPS. This is being planned but may require a complete outage (including systems on the UPS).
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.
  • Networking (routing) change relating to outbound https traffic.
  • Address the permissions problem regarding Atlas user access to all Atlas data.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Further updates to the CEs (CE06 de-commissioning; gLite updates on CE09 still outstanding).
  • Replace hardware running Castor Head Nodes (aimed for end of year).

Entries in GOC DB starting between 3rd and 10th August 2011.

There were no unscheduled entries during this week.

Service     Scheduled?   Outage/At Risk   Start              End                Duration     Reason
Whole site  SCHEDULED    OUTAGE           09/08/2011 08:00   09/08/2011 08:45   45 minutes   Site Outage During Work on Site Firewall.