Tier1 Operations Report 2009-09-23

RAL Tier1 Operations Report for 23rd September 2009.

This is a review of issues since the last meeting on 16th September.

Current operational status and issues.

  • The migration of the bulk (around 75%) of the batch system to SL5 was completed last week. However, problems have since been encountered with the batch scheduler and these are under investigation.
  • A problem has been encountered on the Atlas SRM interface since the upgrade (on Monday 21st) to SRM version 2.8.0. The SRM daemons on all four systems within srm-atlas failed at about the same time. The cause of this is now understood and a fix to the SRM is scheduled for release on Monday. In the meantime we have added a workaround to stop the problem occurring.
  • Ongoing problem with CMS batch jobs failing: This is awaiting news back from CMS as to how well this type of job runs on the SL5 service.
  • Swine ’Flu. As reported last week: We continue to track this and ensure preparations are in place should a significant portion of our staff be away or have to work offsite. At the moment we have reduced the frequency of our assessments as rates of infection seen nationally have dropped.

Review of Issues during week 16 to 23 September.

  • The planned work to update the Castor nameserver to version 2.1.8 on Tuesday 15th September caused unexpected problems. Disk to Disk copies within Castor failed. This was traced to the LSF scheduler within Castor (and was unrelated to the Nameserver update itself). The planned upgrade was carried out. A workaround for the LSF issues has been put in place, although why this problem occurred (when no changes had been made to LSF for some time) is not understood. Details are in an incident report at http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090915
  • Some Atlas file transfers had failed with an 'invalid path' error. These were traced to some long-standing database inconsistencies within Castor that have been resolved.
  • Successful UPS tests on Tuesday 22nd September: The tests were successful and planned network changes were also made at the same time. However, there was an issue bringing the Castor databases back at the end of the tests. All nodes in the Oracle RAC were rebooted to resolve the problem. This item was added to the report after the meeting.

Advance warning:

The following outages have been declared in the GOC DB:

  • Castor At Risk for CIP upgrade. Tuesday 29th September.

Table showing entries in GOC DB starting between 16th and 23rd September.

| Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason |
|---|---|---|---|---|---|---|
| lcgwms02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-22 16:00 | 2009-09-30 17:00 | 8 days 1:00 | Outage for making the disks on the host hotswappable. This includes time for draining beforehand and subsequent re-installation. |
| Entire Site | SCHEDULED | AT_RISK | 2009-09-22 08:00 | 2009-09-22 11:59 | 3:59 | At Risk during and following test of UPS. |
| All Castor & CEs | SCHEDULED | OUTAGE | 2009-09-22 07:45 | 2009-09-22 10:00 | 2:15 | Outage for tests of UPS in new Computer Room. During the tests transfers to/from Castor will be suspended. |
| lcgfts | SCHEDULED | AT_RISK | 2009-09-22 07:00 | 2009-09-22 10:00 | 3:00 | Transfers to the RAL Tier1 will be drained out during this time. This is in connection with the scheduled outage of Castor while tests are made on the UPS in the new computer room. (Other transfers will continue as normal). |
| All Castor | SCHEDULED | AT_RISK | 2009-09-21 12:00 | 2009-09-21 14:00 | 2:00 | At Risk during patching of Oracle database behind Castor. This is to apply a patch to improve resilience in the event of a hardware failure underneath the Oracle system. |
| srm-atlas.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-21 10:00 | 2009-09-21 10:34 | 0:34 | SRM end point down for upgrade to SRM version 2.8-0. |
| lcglb02.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-21 09:00 | 2009-09-21 16:00 | 7:00 | This node will be re-installed to enable the hot-swapping of disks so as to improve resilience. The associated WMS has been reconfigured not to use this node and there should be no effect on the users. |
| All Castor | UNSCHEDULED | AT_RISK | 2009-09-17 09:14 | 2009-09-17 12:00 | 2:46 | Castor problems following the planned NameServer update have been resolved. Putting all Castor endpoints At Risk until midday local time (11:00 UTC). |
| All Castor CEs | UNSCHEDULED | OUTAGE | 2009-09-16 12:00 | 2009-09-17 09:17 | 21:17 | We are still seeing problems with Castor following a planned upgrade of the Castor nameserver yesterday. We are extending the downtime for Castor until 17:00 local time (16:00 UTC) tomorrow 17th September. |
| All Castor | UNSCHEDULED | AT_RISK | 2009-09-15 13:00 | 2009-09-16 17:00 | 1 days 4:00 | At Risk for Castor following nameserver upgrade. During this time will also make some kernel updates on database servers. |
| All Castor & CEs | UNSCHEDULED | OUTAGE | 2009-09-15 12:06 | 2009-09-16 12:00 | 23:54 | Outage for upgrade of Castor Nameserver component to version 2.1.8. This is a continuation of the work that started at 08:00 (UTC) earlier today. |
| lcgce07.gridpp.rl.ac.uk | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-16 12:00 | 2 days 1:30 | CE unavailable while the RAL Tier1 batch system is reconfigured to move from SL4 to SL5. |
| lcgce01 & 02 | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-17 15:24 | 3 days 4:54 | CEs unavailable while the Tier1 batch system is reconfigured. This reconfiguration is to move the bulk of the capacity from SL4 to SL5. |
| lcgce03, 04, 05 | SCHEDULED | OUTAGE | 2009-09-14 10:30 | 2009-09-28 12:00 | 14 days 1:30 | Migration of batch system to SL5: These CEs are being retired to be replaced by new CEs that will support the SL5 batch system. |
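
The Duration column in the table above appears to be simply the difference between the Start and End timestamps. As a minimal sketch of that arithmetic (the function name `downtime_duration` and the output formatting are illustrative assumptions, not part of the GOC DB), the values can be cross-checked like this:

```python
from datetime import datetime


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two 'YYYY-MM-DD HH:MM' timestamps.

    Hypothetical helper: the output mimics the table's style,
    e.g. '8 days 1:00' or '0:34'. It is not a GOC DB function.
    """
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    clock = f"{hours}:{minutes:02d}"
    # The table writes 'days' even for a single day, so the same is done here.
    return f"{days} days {clock}" if days else clock


# Cross-check two rows from the table (lcgwms02 and srm-atlas).
print(downtime_duration("2009-09-22 16:00", "2009-09-30 17:00"))
print(downtime_duration("2009-09-21 10:00", "2009-09-21 10:34"))
```

Both timestamps in a row are assumed to be in the same time zone, so no zone conversion is attempted.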