Tier1 Operations Report 2010-06-02

RAL Tier1 Operations Report for 2nd June 2010

Review of issues during the week from 26th May to 2nd June 2010.

  • Gdss380 (lhcbmdst), which had failed twice in recent weeks, was replaced with another server.
  • Two servers that form part of AtlasScratch had problems. Gdss213 had two disks showing problems; it was taken out of service on Thursday 27th May and returned on Friday 28th. Gdss272 had three drives showing problems; it was taken out of service on Thursday 27th and returned to production on Saturday 29th. This service class had been under high load.
  • A problem was found on a tape containing CMS data (tape CS6000) on Thursday 27th. 16 data files (monte-carlo generated data in this case) were lost.
  • Planned maintenance work on the UPS took place on Tuesday 1st June without problem.
  • The Oracle April PSU patch has been applied successfully to the 3D databases (OGMA, LUGH).
  • All CEs used by LHC VOs have been configured for glexec.

Current operational status and issues.

  • gdss67 (CMS farmRead) had problems (FSPROBE errors) and was removed from service on the morning of Thursday 20th. Expected back soon.
  • On Monday 24th gdss207 (AliceTape) was removed from service due to possible file system corruption.
  • We are still failing some SAM tests on the site BDII. This is not likely to be fixed in the current SAM setup - see GGUS ticket 58054.
  • Ongoing issues with the Atlas software server - lcg0617. A replacement machine is being brought into service now with batch workers being progressively moved over to use the new server.
  • Following application of the Oracle April PSU patch to the LFC & FTS databases (SOMNUS) this morning (2nd June), there are some issues with one of the Oracle RAC nodes, which are being investigated.

Declared in the GOC DB

None

Advance warning:

The following items remain to be scheduled:

  • Doubling of network link to network stack for tape robot and Castor head nodes. It had been hoped to do this at the same time as the UPS test on 1st June, but this was not possible.
  • Preventative maintenance work on transformers in R89 will require site 'At Risk' periods. These are being scheduled for the next two LHC technical stops, the first of which is 28-30 June.
  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.

Entries in GOC DB starting between 26th May and 2nd June 2010.

There were no unscheduled entries in the GOC DB for this last week.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
FTS, FTM, LFC, lfc-atlas | SCHEDULED | AT_RISK | 02/06/2010 10:00 | 02/06/2010 12:00 | 2 hours | At Risk for application of Oracle PSU patches.
Whole Site | SCHEDULED | AT_RISK | 01/06/2010 08:45 | 01/06/2010 12:00 | 3 hours and 15 minutes | Regular UPS Test. Site At Risk.
lhcb-lfc, lugh (LHCb 3D) | SCHEDULED | AT_RISK | 27/05/2010 10:00 | 27/05/2010 12:00 | 2 hours | At Risk during application of Oracle PSU patches.
lcgce08 | SCHEDULED | OUTAGE | 24/05/2010 10:00 | 26/05/2010 17:00 | 2 days, 7 hours | CE being reconfigured for glexec roles mapping.