Tier1 Operations Report 2010-03-17

RAL Tier1 Operations Report for 17th March 2010.

Review of Issues during week 10th to 17th March 2010.

There have been some relatively minor problems with tape migrations, which were resolved simply by restarting daemons.
There was a problem last week whereby the draining of LHCb RAID5 disk servers caused a problem for production jobs. It was agreed with LHCb that they would stop using the relevant space token (LHCb-Dst) while the draining was pushed through on Friday and over the weekend. This was done, with srm-lhcb being marked as At Risk.
There was a load problem on the Atlas software server from the early hours of Sunday morning through to Monday, similar to that reported a couple of weeks ago. It was traced to a particular set of batch jobs; around 60 of these were killed on Monday, which resolved the problem on the software server. (Note: these jobs represented a small fraction of the approximately 2000 jobs Atlas were running at the time.) Some ten or so of the jobs were allowed to complete (fail!) normally for diagnostic purposes, and Atlas have been contacted.
There was a problem with the Atlas SRM over the weekend. This was traced to jobs from a single user that were knocking out the Atlas SRMs. This is being followed up with Atlas to try to understand the root cause.
The Oracle January 'PSU' patches were applied to the LFC, FTS and 3D databases on Tuesday (16th). This was completed successfully, although there were some issues with the streams from CERN, which were stopped and started around the update.
There were no known faults that required a disk server to be taken out of production during the last week.

Current operational status and issues.

A rolling upgrade to the batch nodes to SL 5.4 is under way.
This morning (17th March) the FTS was upgraded to version 2.2.3.

Declared in the GOC DB:

None

Advance warning:

The following items remain to be scheduled:

Enabling GLEXEC on the Worker Nodes. Proposed for Monday 22nd March.
Clean-up of non-Atlas LFC schema. This is to remove redundant information from when the Atlas and non-Atlas LFCs were split. Owing to the FTS upgrade being scheduled for today (17th March), this is being delayed until next week (Tuesday 23rd or Wednesday 24th March, in the morning, T.B.C.).
Castor Oracle Database infrastructure. One change, the removal of an unstable node from the Oracle RAC and its replacement by another node, remains to be done.

Entries in GOC DB starting between 10th and 17th March 2010.

No UNSCHEDULED outages during the last week.

Service	Scheduled?	Outage/At Risk	Start	End	Duration	Reason
FTS & FTM	SCHEDULED	OUTAGE	17/03/2010 08:00	17/03/2010 13:00	5 hours	Upgrade of FTS to version 2.2.3.
srm-alice, srm-atlas, srm-cms, srm-lhcb	SCHEDULED	AT_RISK	17/03/2010 05:00	17/03/2010 07:00	2 hours	At Risk on LHC Castor end points during maintenance work on OPN link RAL - CERN.
FTS, FTM, lfc-atlas, lfc.gridpp.rl.ac.uk, lhcb-lfc, lugh.gridpp, ogma.gridpp	SCHEDULED	AT_RISK	16/03/2010 11:00	16/03/2010 14:00	3 hours	At Risk for application of patch to Oracle databases.
srm-lhcb	SCHEDULED	AT_RISK	11/03/2010 17:00	15/03/2010 08:00	3 days, 15 hours	To facilitate the draining of RAID5 disk servers, LHCb will blacklist the lhcbDst service class here at RAL while we do intensive draining over the weekend. Other LHCb services should work as usual.
