Tier1 Operations Report 2010-03-10

From GridPP Wiki

RAL Tier1 Operations Report for 10th March 2010.

Review of Issues during week 3rd to 10th March 2010.

  • The FTM (FTS monitor) was down owing to a hardware fault. This service was restored on Monday (8th March).
  • A batch of eleven disk servers was deployed for LHCb. A fault was found before any data was written to the disks, so these servers had to be taken out of service and then returned to it (Wednesday/Thursday 3/4 March).
  • There was a 'hot file' problem for CMS while processing data that had been generated by Monte Carlo. On Friday (5th March) a new Castor Service Class was set up to enable replication of these hot files to multiple servers. This was initially problematic, but the problems were resolved on Monday and the file replication is now in service.
  • There was a problem with Atlas tape migrations that was fixed on Monday (8th March).
  • Disk server GDSS203 (Atlas MCDISK) was unavailable from Monday to Wednesday owing to memory errors.
  • Monday 8th March: CPU/walltime limits were changed to 96 KSI2K hours on the grid3000M queue to meet ATLAS requirements.
  • A problem with the streaming of LHCb LFC data from CERN has been fixed, although the analysis of the root cause is ongoing. The data at RAL should be read-only but had been modified. A loophole by which this seems to have occurred has been closed. The actual data change was linked to the set-up of the new Nagios-based 'SAM' tests.

Current operational status and issues.

  • There has been a problem where the draining of LHCb data from RAID 5 disk arrays (in the LHCb_Dst space token) has conflicted with normal data access. The current proposal (in agreement with LHCb) is for LHCb to stop data access to that space token for four days while the draining is pushed forward.
  • There is an ongoing problem (late Wednesday morning 10th March) with tape migration. This is currently under investigation.

Declared in the GOC DB:

  • Wednesday 17th March. 05:00-07:00. At Risk on LHC Castor end points during maintenance work on OPN link RAL - CERN.

Advance warning:

The following items remain to be scheduled:

  • Clean-up of non-Atlas LFC schema. This is to remove redundant information from when the Atlas and non-Atlas LFCs were split. Proposed date 17th March am (t.b.c.).
  • Castor Oracle Database infrastructure. One change, the removal of an unstable node from the Oracle RAC and its replacement by another node, remains to be done.
  • At Risks (to Castor, LFC, FTS & 3D services) for roll-out of Oracle January 'PSU' security patch.
  • Upgrade to FTS 2.2.
  • Rolling upgrade of the batch farm to the latest version of Scientific Linux (5.4) is about to start.

Entries in GOC DB starting between 3rd and 10th March 2010.

One UNSCHEDULED outage during the last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgftm | UNSCHEDULED | OUTAGE | 03/03/2010 11:45 | 08/03/2010 14:52 | 5 days, 3 hours and 7 minutes | FTM service down while hardware fault being examined.
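
The duration quoted for the lcgftm outage can be cross-checked from its start and end timestamps. The following is a minimal sketch only: the function name `outage_duration` is hypothetical, and the timestamps are assumed to be in day/month/year order, which is consistent with the 3rd to 8th March dates given elsewhere in this report. It is not how the GOC DB itself computes durations.

```python
from datetime import datetime


def outage_duration(start: str, end: str, fmt: str = "%d/%m/%Y %H:%M") -> str:
    """Return the elapsed time between two timestamps as days, hours and minutes.

    The default format (day/month/year hour:minute) is an assumption based on
    the dates used in this report.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # timedelta stores whole days separately from the remaining seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours and {minutes} minutes"


# Timestamps taken from the lcgftm entry above.
print(outage_duration("03/03/2010 11:45", "08/03/2010 14:52"))
```

For the lcgftm entry this reproduces the figure of 5 days, 3 hours and 7 minutes.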