Tier1 Operations Report 2011-05-25


RAL Tier1 Operations Report for 25th May 2011

Review of Issues during the week from 18th to 25th May 2011.

  • The requested Post Mortem (SIR) for the outage of the LFC on 10th May has been completed and is available at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110510_LFC_Outage_After_DB_Update

  • On Wednesday (18th) GDSS94 (CMSwanIn - D0T1) was taken out of production at the end of the afternoon. Its performance was poor following a double disk failure: the server was rebuilding its RAID array while coping with a second failed drive. The system was initially put into read-only mode while the remaining migration candidates were moved to tape. Once this had completed the server was disabled until the RAID array rebuild had finished. It was returned to production on Friday morning (20th).
  • On Wednesday (18th) there was a problem with the back-end database for the Atlas RALTAG service. As this is not a production service there was no operational effect.
  • Overnight Wednesday to Thursday (18th-19th) there was a problem with the backup OPN link. There was no operational effect.
  • There have been some performance issues with the AtlasScratchDisk service class. These are caused by an imbalance in disk server sizes within the service class: when the space is quite full, the remaining free space (and hence the write load) becomes concentrated on the one or two larger disk servers (see the sketch after this list). Starting on Friday (20th) seven of the smaller disk servers in the large AtlasDataDisk service class were drained. On Monday (23rd) these were moved into AtlasScratchDisk, allowing the two larger disk servers there to be drained ahead of their removal from AtlasScratchDisk. This leaves AtlasScratchDisk with approximately the same capacity as before but with disk servers of uniform size. Some additional small disk servers will also be moved in later to increase its capacity. The draining of the seven disk servers flushed out three corrupt files (on different servers), which have been reported to Atlas as lost.
  • On Friday (20th) GDSS515 (AtlasDataDisk) was out of production for around an hour for a fan to be swapped.
  • On Friday (20th) very few Alice jobs were running. This is because Alice jobs are very short (a few minutes) and the batch system is configured with a low job start rate for Alice.
  • On Monday afternoon (23rd) our monitoring reported problems with the Top-BDIIs. One of the Site-BDIIs was not returning information. This was fixed promptly and had no known operational effect on users.
  • Changes made this last week:
    • None.
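
As an aside, the AtlasScratchDisk issue above comes down to how write load follows free space when disk servers of very different sizes are mixed in one service class. The short Python sketch below is purely illustrative (the server names and capacities are hypothetical, not the actual RAL configuration): it assumes new files are placed on servers in proportion to their remaining free space, and shows that once the small servers are nearly full almost all new writes land on the one or two large servers.

  # Illustrative sketch only: hypothetical capacities, not the real AtlasScratchDisk layout.
  # Assumption: new files are placed on disk servers in proportion to their free space.
  servers = {
      # name: (capacity_tb, used_tb)
      "small-1": (8.0, 7.5),
      "small-2": (8.0, 7.5),
      "small-3": (8.0, 7.5),
      "large-1": (40.0, 25.0),
      "large-2": (40.0, 25.0),
  }

  free = {name: cap - used for name, (cap, used) in servers.items()}
  total_free = sum(free.values())

  for name in sorted(free):
      share = 100.0 * free[name] / total_free
      print("%-8s %5.1f TB free -> %4.1f%% of new writes" % (name, free[name], share))

With these made-up numbers the two large servers would take roughly 95% of new writes between them, which is the kind of concentration seen on AtlasScratchDisk before the rebalancing. Disk servers of uniform size spread the free space, and hence the write load, evenly.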

Current operational status and issues.

  • On Saturday (30th April) FSPROBE reported a problem on GDSS293 (CMSFarmRead - D0T1), which was removed from production. The server was put back in service on Sunday (1st May). On Tuesday (3rd May) it was put into draining mode ahead of further investigations. The draining completed on 4th May, when the server was removed from production. Following work on the hardware (it had a failed drive) the system has been rebuilt. The Castor software is being checked ahead of re-introduction into service.
  • GDSS294 (CMSFarmRead - D0T1) failed with a read-only file system on the evening of Monday 9th May. It is currently out of production.
  • On Friday (20th) GDSS365 (CMSTemp - D1T0) reported a read-only filesystem and was taken out of production. It was returned to production in 'draining' mode on Monday (23rd). By Tuesday morning (24th) the draining had completed and the system was taken out of production for further tests.
  • There are some ongoing intermittent problems with CVMFS. The main effect has been some failures of LHCb SAM Tests (on the CE). This is being investigated.
  • We are still seeing some intermittent problems with the site BDIIs. Until this is better understood the daemons are being restarted regularly.
  • Atlas reported slow data transfers into the RAL Tier1 from other Tier1s and from CERN (i.e. asymmetrical performance). CMS appears to experience this as well (but between RAL and foreign T2s). The investigation into this is still ongoing.

Declared in the GOC DB

  • None

Advance warning:

The following items are being discussed and are still to be formally scheduled:

  • Updates to Site Routers (the Site Access Router and the UKLight router) are required.
  • Upgrade Castor clients on the Worker Nodes to version 2.1.10.
  • Address permissions problem regarding Atlas User access to all Atlas data.
  • Minor Castor update to enable access to T10KC tapes.
  • Networking upgrade to provide sufficient bandwidth for T10KC tapes.
  • Microcode updates for the tape libraries are due.
  • Switch Castor and LFC/FTS/3D to new Database Infrastructure.

Entries in GOC DB starting between 18th and 25th May 2011.

There were no unscheduled entries in the GOCDB for this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgce07 | SCHEDULED | OUTAGE | 24/05/2011 09:00 | 07/06/2011 16:00 | 14 days, 7 hours | Drain and decommission as lcg-CE
lcgce03, lcgrbp01 | SCHEDULED | OUTAGE | 17/05/2011 09:00 | 19/05/2011 15:00 | 2 days, 6 hours | Drain and reinstallation of CE
lcgwms03 | SCHEDULED | OUTAGE | 12/05/2011 16:00 | 19/05/2011 12:10 | 6 days, 20 hours and 10 minutes | lcgwms03 (non-LHC WMS) drain and maintenance