Tier1 Operations Report 2010-03-17

From GridPP Wiki

RAL Tier1 Operations Report for 17th March 2010.

Review of Issues during week 10th to 17th March 2010.

  • There have been some relatively minor problems with tape migrations that have been resolved simply (daemon restarts).
  • There was a problem last week whereby the draining of LHCb RAID5 disk servers caused a problem for production jobs. It was agreed with LHCb that they would stop using the relevant space token (LHCb-Dst) while the draining was pushed through on Friday and over the weekend. This was done, with srm-lhcb being marked as At Risk.
  • There was a load problem on the Atlas software server from the early hours of Sunday morning through to Monday. This was similar to the problem reported a couple of weeks ago. It was traced to a particular set of batch jobs, and around 60 of these were killed on Monday, which resolved the problem on the software server. (Note - these jobs represented a small fraction of the approximately 2000 jobs Atlas were running at the time). Some ten or so of the jobs were allowed to complete (fail!) normally for diagnostic purposes and Atlas have been contacted.
  • There was a problem with the Atlas SRM over the weekend. This was traced to jobs from a single user that were knocking out the Atlas SRMs. This is being followed up with Atlas to try to understand the root cause.
  • The Oracle January 'PSU' patches were applied to the LFC, FTS and 3D databases on Tuesday (16th). This was completed successfully, although there were some issues with the streams from CERN, which were stopped and started around the update.
  • There were no known faults that required a disk server to be taken out of production during the last week.

Current operational status and issues.

  • A rolling upgrade to the batch nodes to SL 5.4 is under way.
  • This morning (17th March) the FTS was upgraded to version 2.2.3.

Declared in the GOC DB:

None

Advance warning:

The following items remain to be scheduled:

  • Enabling GLEXEC on the Worker Nodes. Proposed for Monday 22nd March.
  • Clean-up of non-Atlas LFC schema. This is to remove redundant information from when the Atlas and non-Atlas LFCs were split. Owing to the FTS upgrade being scheduled for today (17th March) this is being delayed until next week (Tuesday 23rd or Wednesday 24th March, in the morning, T.B.C.).
  • Castor Oracle Database infrastructure. One change, the removal of an unstable node from the Oracle RAC and its replacement by another node, remains to be done.

Entries in GOC DB starting between 10th and 17th March 2010.

No UNSCHEDULED outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
FTS & FTM | SCHEDULED | OUTAGE | 17/03/2010 08:00 | 17/03/2010 13:00 | 5 hours | Upgrade of FTS to version 2.2.3.
srm-alice, srm-atlas, srm-cms, srm-lhcb | SCHEDULED | AT_RISK | 17/03/2010 05:00 | 17/03/2010 07:00 | 2 hours | At Risk on LHC Castor end points during maintenance work on OPN link RAL - CERN.
FTS, FTM, lfc-atlas, lfc.gridpp.rl.ac.uk, lhcb-lfc, lugh.gridpp, ogma.gridpp | SCHEDULED | AT_RISK | 16/03/2010 11:00 | 16/03/2010 14:00 | 3 hours | At Risk for application of patch to Oracle databases.
srm-lhcb | SCHEDULED | AT_RISK | 11/03/2010 17:00 | 15/03/2010 08:00 | 3 days, 15 hours | To facilitate the draining of RAID5 disk servers, LHCb will blacklist the lhcbDst service class here at RAL while we do intensive draining over the weekend. Other LHCb services should work as usual.
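
The Duration values in the entries above can be cross-checked from the Start and End timestamps. The following is a minimal sketch only, assuming the dd/mm/yyyy hh:mm format shown in the entries; the function name downtime_duration is hypothetical and is not part of any GOC DB tooling.

```python
from datetime import datetime

# Assumed timestamp format, matching the Start/End values in the entries above.
FMT = "%d/%m/%Y %H:%M"

def downtime_duration(start, end):
    """Return (days, hours) elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    # timedelta stores whole days separately from the remaining seconds.
    return delta.days, delta.seconds // 3600

# The srm-lhcb At Risk period listed above.
print(downtime_duration("11/03/2010 17:00", "15/03/2010 08:00"))  # (3, 15)
```

For the srm-lhcb entry this gives 3 days and 15 hours, which agrees with the listed duration.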