Tier1 Operations Report 2010-07-28

From GridPP Wiki
Revision as of 12:18, 28 July 2010 by John kelly

RAL Tier1 Operations Report for 28th July 2010

Review of Issues during the week from 21st July to 28th July 2010.

  • Thursday 22nd July: scheduled downtime on Somnus cancelled.
  • Monday 26th July: gdss548-557 were deployed into atlasStripInput.
  • Monday 26th and Tuesday 27th July: At Risk on CEs to enable new CMS T1 production account.
  • Wednesday 28th July: At Risk on Somnus database to apply an ACL.
  • Wednesday 28th July: At Risk to apply glite software updates to RAL top-level BDIIs.

Current operational status and issues.

  • gdss187 (atlasFarm) was removed from service on 21st July and is still with fabric. It has fsprobe errors.
  • gdss207 (aliceTape) was removed from service 3 weeks ago and we are still awaiting parts.
  • As reported at previous meetings, one power supply (of two) for one of the (two) EMC disk arrays behind the LFC/FTS/3D services was moved to UPS power to test whether the electrical noise has been reduced sufficiently. The test is ongoing.
  • On Saturday 10th July transformer TX2 in R89 tripped. This is the same transformer that tripped some months ago, and for which remedial work was undertaken. Ops still have no further information on this.
  • Dust in the Computer Room - Remedial work on lagging pipes is ongoing. Only the pipes directly under the CRAC units remain to be done. All the work in the HPD room is complete.

Declared in the GOC DB

  • Today 28th July: At Risk on Somnus (LFC and FTS) to apply a security configuration change.
  • Today 28th July: At Risk to apply glite software updates to RAL top-level BDIIs.
  • 29th July 08:00 - 10:00: An outage to drain the FTS service so we can fail over to the standby FTS host.
  • Monday 2nd August - Tuesday 10th August: downtime for lcgce02.gridpp.rl.ac.uk to allow draining and decommissioning.

Advance warning:

The following items remain to be scheduled:

  • Closure of SL4 batch workers at RAL-LCG2 announced for the start of August.
  • Doubling of network link to network stack for tape robot and Castor head nodes.
  • Re-visit the SAN / multipath issue for the non-Castor databases.

Entries in GOC DB starting between 21st July and 28th July 2010.

There were 2 unscheduled entries during the last week. Both were At Risks on CEs to enable the CMS T1 production account.

Note that the outage scheduled for the 21st did not actually go ahead.

| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|
| lcgbdii.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 28/07/2010 10:00 | 28/07/2010 13:00 | 3 hours | At risk to apply glite software updates to RAL top-level BDIIs |
| lcgfts.gridpp.rl.ac.uk, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 28/07/2010 09:00 | 28/07/2010 11:00 | 2 hours | At risk on LFC and FTS while security configuration is applied. |
| lcgce06.gridpp.rl.ac.uk, lcgce07.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 27/07/2010 09:00 | 27/07/2010 12:00 | 3 hours | At risk on CEs, lcgce06 and lcgce07 while re-configuring to use new CMS accounts. |
| lcgce01.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 26/07/2010 13:00 | 26/07/2010 15:00 | 2 hours | At risk on lcgce01 while it is re-configured to use new CMS accounts. |
| lcgfts.gridpp.rl.ac.uk, lfc-atlas.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk, lhcb-lfc.gridpp.rl.ac.uk | SCHEDULED | AT_RISK | 21/07/2010 10:00 | 21/07/2010 11:57 | 1 hour and 57 minutes | At risk for Lugh database multipath reconfiguration. All database services on same SAN are being marked as at risk. |
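
The Duration values in the entries above follow directly from the Start and End timestamps. As a minimal sketch of that arithmetic, assuming the timestamps are in day/month/year 24-hour form as shown (the function name `downtime_duration` and the constant `GOCDB_FORMAT` are illustrative only and are not part of any GOC DB tool):

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the Start/End values in this report.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as readable text."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)

    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("s" if hours != 1 else ""))
    if minutes:
        parts.append(f"{minutes} minute" + ("s" if minutes != 1 else ""))
    return " and ".join(parts) or "0 minutes"


# The Lugh multipath At Risk on 21st July, which ended early at 11:57.
print(downtime_duration("21/07/2010 10:00", "21/07/2010 11:57"))  # 1 hour and 57 minutes
# The top-level BDII At Risk on 28th July.
print(downtime_duration("28/07/2010 10:00", "28/07/2010 13:00"))  # 3 hours
```

The same calculation can be used to cross-check any entry whose end time was changed after it was declared.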