Tier1 Operations Report 2009-09-16
From GridPP Wiki
RAL Tier1 Operations Report for 16th September 2009.
This is a review of issues since the last meeting on 2nd September.
Current operational status and issues.
- The planned work to update the Castor nameserver to version 2.1.8 has run into unexpected problems. At the time of preparation of this report (lunchtime, 16th September) Castor is still unavailable.
- The ongoing problem of CMS batch jobs failing awaits the setting up of a substantial SL5 service.
- Linux security issue (CVE-2009-2692 local root vulnerability). Most systems are now patched (see the list of outages declared in the GOC DB below). A small number of systems still await updated kernels (e.g. systems within the Oracle RACs).
- Swine ‘Flu. We continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite. At the moment we have reduced the frequency of our assessments as rates of infection seen nationally have dropped.
- Although not an operational problem, note that much (around 75%) of the batch capacity is being migrated to SL5 at the moment. This is part of the strategy to update systems ahead of a Tier1 change freeze from the end of September.
Review of issues during the period 2nd to 16th September.
- Memory failures on disk server gdss107 (AtlasDataDisk) over the weekend (12th September). The memory was swapped and the server was returned to production on Monday.
- Unavailability of Atlas 3D databases (OGMA) for internal database fix early afternoon 11th September.
- A failure of the Storage Area Network underneath the Oracle databases led to a Castor outage on 10th September. Oracle hung (instead of carrying on without its mirror disk set), causing a Castor outage from early afternoon until 20:00 that evening.
- Outage on disk server gdss332 (LHCbDst) 2 - 4 September following a kernel panic.
- Outage on disk server gdss151 (Atlas MCdisk) 4 - 7 September following a failure to start a rebuild after a disk failure.
- Castor GEN instance. Some problems on 9th September following on from the SRM upgrade that morning.
- Double disk failure on gdss164 (BaBar) on Friday 21st August. This disk server was returned to operation on 15th September. The data was copied off, the RAID array rebuilt, and the data restored.
- Air conditioning problems. Systems have continued to run fine since services were restored, but this incident is not yet closed for the Tier1. The cause of the first outage is now known (restart of the BMS). Modifications have been made to counteract the effect of an overpressure, which was the cause of the second outage.
- Condensation water dripping into the tape robot. As last week, this incident is not yet closed for the Tier1. Measures have been, and are being, put in place to ensure there is no repeat. We will also track error rates on media that may have been affected.
Advance warning:
The following outages have been declared in the GOC DB:
- SRM-CMS Update to version 2.8.0. Thursday 17th September.
- lcglb02 reinstallation to enable hot-swap of disks. Should be transparent. Monday 21st September.
- SRM-Atlas Update to version 2.8.0. Monday 21st September.
- R89 UPS tests. Castor down. Tuesday 22nd September.
- WMS02 reinstallation to enable hot-swap of disks. Tuesday 22nd September.
Table showing entries in GOC DB starting between 2nd and 16th September.
Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
---|---|---|---|---|---|---|
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-17 10:00 | 2009-09-17 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0. |
srm-cms.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-16 10:00 | 2009-09-16 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0. (CANCELLED) |
All Castor | UNSCHEDULED | AT_RISK | 2009-09-15 13:00 | 2009-09-16 17:00 | 1 days 4:00 | At Risk for Castor following nameserver upgrade. During this time we will also make some kernel updates on database servers. |
All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-15 12:06 | 2009-09-16 12:00 | 23:54 | Outage for upgrade of Castor Nameserver component to version 2.1.8. This is a continuation of the work that started at 08:00(UTC) earlier today. |
All Castor & CEs | SCHEDULED | OUTAGE | 2009-09-15 09:00 | 2009-09-15 12:59 | 3:59 | Outage for upgrade of Castor Nameserver component to version 2.1.8. |
lcgfts.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 2009-09-15 08:00 | 2009-09-15 13:00 | 5:00 | The channels to the RAL Tier1 will be drained ahead of the Castor intervention. (Transfers submitted for these channels will be held until the channels are restarted). Other channels will be left up. Declaring this as an At Risk. |
lcgce07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-16 12:00 | 2 days 1:30 | CE unavailable while the RAL Tier1 batch system is reconfigured to move from SL4 to SL5 |
lcgce01 & lcgce02 | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-18 12:00 | 4 days 1:30 | CEs unavailable while the Tier1 batch system is reconfigured. This reconfiguration is to move the bulk of the capacity from SL4 to SL5. |
lcgce03, 04 & 05 | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-28 12:00 | 14 days 1:30 | Migration of batch system to SL5: These CEs are being retired to be replaced by new CEs that will support the SL5 batch system. |
srm-lhcb.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-14 10:00 | 2009-09-14 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0. |
All Castor & CEs | UNSCHEDULED | AT_RISK | 2009-09-10 20:00 | 2009-09-11 13:00 | 17:00 | At Risk following the work on the Oracle database hardware. |
All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-10 17:34 | 2009-09-10 20:00 | 2:26 | Outage to address the problems with the Oracle database hardware behind Castor. |
All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-10 14:53 | 2009-09-10 18:00 | 3:07 | Problems on the Castor database. Under investigation. |
Castor GEN instance | UNSCHEDULED | AT_RISK | 2009-09-09 12:00 | 2009-09-09 14:38 | 2:38 | After an upgrade the SRM service appears to work correctly for manual transfers but it is failing SAM tests. Service in 'at risk' status while this is investigated/fixed. |
Castor GEN instance | SCHEDULED | OUTAGE | 2009-09-09 10:00 | 2009-09-09 12:00 | 2:00 | SRM end point down for upgrade to SRM version 2.8-0. |
ogma.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-08 09:00 | 2009-09-08 15:00 | 6:00 | Migration of Oracle database to a 64-bit system. A 3-hour outage has been declared although it is expected that the database will only be unavailable for a short while at some point during this interval. Due to unforeseen circumstances we need to extend this downtime to 3:00 pm localtime. |
lcgwms01.gridpp.rl.ac.uk | UNSCHEDULED | AT_RISK | 2009-09-07 12:00 | 2009-09-07 16:59 | 4:59 | Upgrade to the latest glite-WMS; a reboot will be necessary. |
All CEs | UNSCHEDULED | AT_RISK | 2009-09-03 10:00 | 2009-09-03 11:00 | 1:00 | At Risk while some nodes are rebooted to pick up a new kernel. This includes the batch scheduler system and the NFS server that contains users' home directories. It is expected that this intervention will be transparent to grid work. |
lfc, lfc-atlas, fts & ftm | SCHEDULED | AT_RISK | 2009-09-02 11:00 | 2009-09-02 11:30 | 0:30 | At Risk for minor reconfiguration of Oracle database. |