Tier1 Operations Report 2010-03-03

RAL Tier1 Operations Report for 3rd March 2010.

Review of Issues during the week 24th February to 3rd March 2010.

  • Last week's report noted load problems on the Atlas software server. These had been largely resolved by the time of last week's meeting (Atlas jobs had been flushed out and the software server restarted). On Thursday (25th Feb) the temporary limit on the number of Atlas production jobs was removed. A reasonable number of Atlas jobs have run since this problem, with a peak of around 1400 jobs running concurrently on the night of 24/25 February (before the limit was removed). However, since then there has been very little Atlas batch work run at the RAL Tier1. Work is ongoing to look at possible improvements to this software server.
  • On Friday 26th February there was a problem on GDSS160 (LHCbDst), which was out of production for some hours while a rebuild of the disk array was forced.
  • gdss347 (AtlasDataDisk) was out of production from late Thursday evening (25th) to Monday 1st March. The system had failed with a kernel panic, which was traced to a memory fault.
  • gdss208 (AtlasDataDisk) failed on Saturday (27th Feb.) with what appears to be a burnt component. The vendor's engineer arrived this morning and replaced the component, and the server was returned to production late this morning (Wednesday 3rd March).
  • A problem was found on five disk servers added to the Atlas MCDISK area on Friday 26th, resulting in some connection errors being seen in FTS. This was fixed by restarting xinetd on the affected systems.
  • There was a problem with the DNS server that hosts the gridpp.ac.uk domain over the weekend (fixed Monday morning). This had some effect on the UK cloud, with problems accessing the BDII.
  • On Sunday 28th February there was a failure of the primary link from RAL to JANET via Reading. Traffic failed over to the backup link via London for a few hours and there was no effect on our services.

Current operational status and issues.

  • The FTM (FTS monitor) is currently down owing to a hardware fault.
  • On 31st December a recurrence of the Castor/Oracle 'BigID' problem was seen. This is still under investigation.

Other Points.

A couple of detailed modifications to the Atlas Castor instance have been made this week:

  • Some LSF parameters were modified to reduce memory usage.
  • Some disk parameters were modified to improve handling of nearly-full disk servers.

Advance warning:

There are no items scheduled in the GOC DB.

The following items remain to be scheduled:

  • Castor Oracle Database infrastructure. One change, the removal of an unstable node from the Oracle RAC and its replacement by another node, remains to be done.
  • 'At Risk' periods (for Castor, LFC, FTS & 3D services) for roll-out of the Oracle January 'PSU' security patch.
  • Upgrade to FTS 2.2.
  • Rolling upgrade of the batch farm to the latest version of Scientific Linux (5.4).

Entries in GOC DB starting between 24th February and 3rd March 2010.

No UNSCHEDULED outages during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lfc, lfc-atlas, fts, ftm, All Castor | SCHEDULED | AT_RISK | 02/03/2010 10:00 | 02/03/2010 11:00 | 1 hour | At Risk during application of security patch to back-end Oracle databases for the Castor, LFC and FTS services.
3D (lugh, ogma) & lhcb-lfc | SCHEDULED | AT_RISK | 24/02/2010 11:00 | 24/02/2010 11:27 | 27 minutes | At Risk on 3D services and LHCb-LFC during application of Oracle security patch.
All Castor & Batch (CEs) | SCHEDULED | OUTAGE | 24/02/2010 10:00 | 24/02/2010 11:00 | 1 hour | Outage to Castor and Batch during a reconfiguration of the Castor Oracle back-end databases. Castor will be unavailable and the batch system will be paused at that time.
FTS | SCHEDULED | OUTAGE | 24/02/2010 09:00 | 24/02/2010 11:00 | 2 hours | Outage of FTS during the Castor downtime. This includes a drain of FTS transfers ahead of the intervention.