Tier1 Operations Report 2009-09-16

RAL Tier1 Operations Report for 16th September 2009.

This is a review of issues since the last meeting on 2nd September.

Current operational status and issues.

  • The planned work to update the Castor nameserver to version 2.1.8 has run into unexpected problems. At the time of preparation of this report (lunchtime 16th September) Castor is still unavailable.
  • The ongoing problem of failing CMS batch jobs awaits the setting up of a substantial SL5 service.
  • Linux security issue (CVE-2009-2692 local root vulnerability). Most systems are now patched (see the list of outages declared in the GOC DB below). A small number of systems, e.g. those within the Oracle RACs, still await updated kernels.
  • Swine ‘Flu. We continue to track this and to ensure preparations are in place should a significant portion of our staff be away or have to work offsite. For now we have reduced the frequency of our assessments because the rates of infection seen nationally have dropped.
  • Although not an operational problem, note that much (around 75%) of the batch capacity is being migrated to SL5 at the moment. This is part of the strategy to update systems ahead of a Tier1 change freeze from the end of September.

Review of Issues during the weeks 2nd to 16th September.

  • Memory failures on disk server gdss107 (AtlasDataDisk) over the weekend (12th Sep.). The memory was swapped and the server returned to production on Monday 13th.
  • The Atlas 3D databases (OGMA) were unavailable in the early afternoon of 11th September for an internal database fix.
  • A failure of the Storage Area Network underneath the Oracle databases on 10th September caused Oracle to hang instead of carrying on without its mirror disk set. This led to a Castor outage from early afternoon until 20:00 that evening.
  • Outage on disk server gdss332 (LHCbDst) 2 - 4 September following kernel panic.
  • Outage on disk server gdss151 (Atlas MCdisk) 4 - 7 September following a failure to start a rebuild after a disk failure.
  • Castor GEN instance. Some problems on 9th September following on from the SRM upgrade that morning.
  • Double disk failure on gdss164 (BaBar) on Friday (21st August). This disk server was returned to operation on 15th September. The data was copied off, the RAID array rebuilt, and the data restored.
  • Air Conditioning Problems. Systems still running fine since services were restored. This incident is not yet closed for the Tier1. The cause of the first outage is now known (restart of BMS). Modifications have been made to counteract the effect of an overpressure which was the cause of the second outage.
  • Condensation water dripping into the tape robot. As last week, this incident is not yet closed for the Tier1. Measures have been, and are still being, put in place to ensure there is no repeat. We will also track error rates on media that may have been affected.

Advance warning:

The following outages have been declared in the GOC DB:

  • SRM-CMS Update to version 2.8.0. Thursday 17th September.
  • lcglb02 reinstallation to enable hot-swap of disks. Should be transparent. Monday 21st September.
  • SRM-Atlas Update to version 2.8.0. Monday 21st September.
  • R89 UPS tests. Castor down. Tuesday 22nd September.
  • WMS02 reinstallation to enable hot-swap of disks. Tuesday 22nd September.


Table showing entries in GOC DB starting between 2nd and 16th September.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-17 10:00 | 2009-09-17 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0.
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-16 10:00 | 2009-09-16 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0. (CANCELLED)
All Castor | UNSCHEDULED | AT_RISK | 2009-09-15 13:00 | 2009-09-16 17:00 | 1 days 4:00 | At Risk for Castor following nameserver upgrade. During this time will also make some kernel updates on database servers.
All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-15 12:06 | 2009-09-16 12:00 | 23:54 | Outage for upgrade of Castor Nameserver component to version 2.1.8. This is a continuation of the work that started at 08:00(UTC) earlier today.
All Castor & CEs | SCHEDULED | OUTAGE | 2009-09-15 09:00 | 2009-09-15 12:59 | 3:59 | Outage for upgrade of Castor Nameserver component to version 2.1.8.
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 2009-09-15 08:00 | 2009-09-15 13:00 | 5:00 | The channels to the RAL Tier1 will be drained ahead of the Castor intervention. (Transfers submitted for these channels will be held until the channels are restarted). Other channels will be left up. Declaring this as an At Risk.
lcgce07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-16 12:00 | 2 days 1:30 | CE unavailable while the RAL Tier1 batch system is reconfigured to move from SL4 to SL5
lcgce01 & lcgce02 | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-18 12:00 | 4 days 1:30 | CEs unavailable while the Tier1 batch system is reconfigured. This reconfiguration is to move the bulk of the capacity from SL4 to SL5.
lcgce03, 04 & 05 | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-28 12:00 | 14 days 1:30 | Migration of batch system to SL5: These CEs are being retired to be replaced by new CEs that will support the SL5 batch system.
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-14 10:00 | 2009-09-14 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0.
All Castor & CEs | UNSCHEDULED | AT_RISK | 2009-09-10 20:00 | 2009-09-11 13:00 | 17:00 | At-risk following the work on the oracle database hardware
All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-10 17:34 | 2009-09-10 20:00 | 2:26 | Outage to address the problems with the Oracle database hardware behind Castor.
All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-10 14:53 | 2009-09-10 18:00 | 3:07 | Problems on the CASTOR database. Under investigation
Castor GEN instance | UNSCHEDULED | AT_RISK | 2009-09-09 12:00 | 2009-09-09 14:38 | 2:38 | After an upgrade the SRM service appears to work correctly for manual transfers but it is failing SAM tests. Service in 'at risk' status while this is investigated/fixed.
Castor GEN instance | SCHEDULED | OUTAGE | 2009-09-09 10:00 | 2009-09-09 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0.
ogma.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-08 09:00 | 2009-09-08 15:00 | 6:00 | Migration of Oracle database to a 64-bit system. A 3-hour outage has been declared although it is expected that the database will only be unavailable for a short while at some point during this interval. Due to unforeseen circumstances we need to extend this downtime to 3:00 pm localtime.
lcgwms01.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 2009-09-07 12:00 | 2009-09-07 16:59 | 4:59 | upgrade to the latest glite-WMS; a reboot will be necessary
All CEs | UNSCHEDULED | AT_RISK | 2009-09-03 10:00 | 2009-09-03 11:00 | 1:00 | At risk while some nodes are rebooted to pick up a new kernel. This includes the batch scheduler system and NFS server that contains users' home directories. It is expected that this intervention will be transparent to grid work.
lfc, lfc-atlas, fts & ftm | SCHEDULED | AT_RISK | 2009-09-02 11:00 | 2009-09-02 11:30 | 0:30 | At Risk for minor reconfiguration of Oracle database.
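
The Duration column in the table above is simply End minus Start. As a cross-check, the arithmetic can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the GOC DB: the function name goc_duration is hypothetical, and the output format ("H:MM", with an "N days" prefix when needed) is an assumption chosen to mimic the table. It also assumes End is not earlier than Start.

```python
from datetime import datetime

# Assumed timestamp layout, matching the Start/End columns of the table.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def goc_duration(start: str, end: str) -> str:
    """Return end - start formatted like the table's Duration column.

    Hypothetical helper for checking the report; assumes end >= start.
    """
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    # timedelta keeps whole days separately from the remaining seconds.
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    prefix = f"{delta.days} days " if delta.days else ""
    return f"{prefix}{hours}:{minutes:02d}"


# Two rows taken from the table: the unscheduled Castor nameserver outage
# and the retirement of lcgce03, 04 & 05.
print(goc_duration("2009-09-15 12:06", "2009-09-16 12:00"))  # 23:54
print(goc_duration("2009-09-14 10:30", "2009-09-28 12:00"))  # 14 days 1:30
```

Both values agree with the durations recorded for those entries.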