Tier1 Operations Report 2012-07-04

RAL Tier1 Operations Report for 4th July 2012

Review of Issues during the week 27th June to 4th July 2012
  • On Friday (29th June) and over the weekend (Sunday 1st July) a backlog of migrations to tape built up for CMS. This was not an operational problem and cleared by itself; it was caused by the high rate of tape access during that period.
Resolved Disk Server Issues
  • GDSS586 (AtlasGroupDisk - D1T0) was out of service for a couple of hours on Monday 2nd July for battery replacement.
  • Also on Monday 2nd July four disk servers from Alice Tape (D0T1) were taken out of production for a short while (less than an hour) for a disk controller firmware update.
Current operational status and issues
  • On 12th/13th June the first stage of electrical switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to last until 18th December: one half of the resilient supply is powered off for around three months while it is overhauled, and the process is then repeated with the other half.
  • There is still a problem with the reporting of disk capacity; this is to be followed up.
Ongoing Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) has been out of service for some time. It is being swapped for a different server.
Notable Changes made this last week
  • Moved Castor databases to Oracle 11. (Currently running without Data Guard, which we expect to reinstate tomorrow.)
  • FTS database moved back to the correct Oracle RAC (Somnus), which is at Oracle 11.
  • Disk servers and Castor headnodes rebooted to update kernels & errata.
  • EMI installation on WMS02. (This means all WMSs now at EMI WMS v3.3.5).
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • The FTS Agents are being progressively moved to virtual machines.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • A minor update to the Castor Information provider (CIP).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing layer for the Tier1 and update the way the Tier1 connects to the RAL network. (Planned to take place alongside the replacement of the UKLight Router.)
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNS servers to the Tier1 network.
  • Grid Services:
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)


Entries in GOC DB starting between 27th June and 4th July 2012

There were no unscheduled outages during the last week. We also note that although the batch system was declared as down during the Castor Oracle database move on 27th June, already running Atlas batch jobs were allowed to continue to run through the intervention. Other VOs' batch jobs were paused. (No batch jobs were allowed to start during this period.)
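
The batch handling above can be illustrated with a rough sketch. Assuming a Torque/Maui batch farm with per-VO queues (the queue names, and the use of a small Python wrapper around the standard Torque commands qmgr, qselect and qsig, are illustrative assumptions, not a record of the actual procedure used):

  #!/usr/bin/env python
  # Illustrative sketch only: pause a Torque batch farm for an intervention
  # while letting one VO's already-running jobs continue.
  import subprocess

  QUEUES = ["atlas", "cms", "lhcb", "alice"]   # hypothetical per-VO queues
  KEEP_RUNNING = "atlas"                       # running jobs here are left alone

  def run(cmd):
      # Echo and execute a batch-system command, raising on failure.
      print("+ " + " ".join(cmd))
      subprocess.check_call(cmd)

  # 1. Stop the scheduler from starting any new jobs, on every queue.
  for q in QUEUES:
      run(["qmgr", "-c", "set queue %s started = False" % q])

  # 2. Suspend jobs that are already running in the other VOs' queues.
  for q in QUEUES:
      if q == KEEP_RUNNING:
          continue
      jobs = subprocess.check_output(["qselect", "-q", q, "-s", "R"]).decode().split()
      for job in jobs:
          run(["qsig", "-s", "suspend", job])

Reversing such an intervention would be the mirror image: set "started = True" on each queue and resume the suspended jobs with "qsig -s resume".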

Service Scheduled? Outage/At Risk Start End Duration Reason
All Castor & Batch SCHEDULED OUTAGE 27/06/2012 08:45 27/06/2012 12:00 3 hours and 15 minutes Storage (Castor) and Batch (CEs) unavailable. Oracle database behind Castor being moved to Oracle 11.
lcgfts.gridpp.rl.ac.uk SCHEDULED OUTAGE 27/06/2012 07:45 27/06/2012 10:45 3 hours Service drained then unavailable while back end (Oracle) database moved back to correct Oracle RAC.
lcgwms02.gridpp.rl.ac.uk SCHEDULED OUTAGE 21/06/2012 12:00 27/06/2012 13:00 6 days, 1 hour EMI installation (WMS v3.3.5).
Open GGUS Tickets
GGUS ID Level Urgency State Creation Last Update VO Subject
83768 Green Urgent Waiting Reply 2012-07-02 2012-07-03 NA62 FTS channel from Liverpool to RAL
83578 Red Urgent Waiting Reply 2012-06-26 2012-06-26 MICE Tape space on Castor for mice reconstructed data
83564 Red Less Urgent Waiting Reply 2012-06-25 2012-07-02 MICE Software area for MICE data reconstruction
68853 Red Less Urgent On hold 2011-03-22 2012-06-25 N/A Retirement of SL4 and 32bit DPM Head nodes and Servers