Tier1 Operations Report 2012-03-28

From GridPP Wiki

RAL Tier1 Operations Report for 28th March 2012

Review of Issues during the week 21st to 28th March 2012.

  • Note: The outage of the Tier1 on Friday 16th March, caused by problems on the Tier1 network, is the subject of a post mortem. A link to the post mortem will be added here as soon as it is sufficiently complete.
  • On Friday 23rd March there was a problem in the Tier1 network when a connection became loose, cutting off equipment in the UPS room. This was resolved and services recovered quickly. A 40 minute outage of the Tier1 site was declared.
  • This morning (Wed. 28th March) we have been investigating a problem where CMS batch jobs are not starting.

Resolved Disk Server Issues

  • GDSS392 (CMSTape D0T1) was taken out of production on Sunday evening (18th March) and returned to service on the afternoon of Wednesday 21st.

Current operational status and issues.

  • There have been no further problems with the link between the UKLight and SAR routers since the last intervention over two weeks ago. The Network Team's monitoring is showing improved behaviour of the link. We will continue to track this issue here for a little while yet.
  • Last Wednesday (21st March) patches were applied to try to fix the problems that had been seen with FTS since the upgrade to version 2.2.8. These were a patch to Oracle and a patch to the FTS code itself. While we were experiencing problems, the FTS monitoring was disabled for a time as it was seen to increase load on the database. It has now been switched back on. The FTS problems have not been seen since these fixes were applied.

Ongoing Disk Server Issues

  • None

Notable Changes made this last week

  • Patches applied to the FTS Oracle database and FTS code to resolve problems. (Wed. 21st March).
  • Additional 480TB of disk space put into service for LHCb. (Monday 26th March).
  • The further intervention on a power board supplied by the UPS took place successfully yesterday (Tuesday 27th March).
  • MyProxy was updated to the UMD version on a virtual machine this morning (Wed. 28th March).

Forthcoming Work & Interventions

  • The second batch of worker nodes is expected to go into production within a few weeks.

Declared in the GOC DB

  • None

Advance warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNS servers into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.

Entries in GOC DB starting between 21st and 28th March 2012.

There were two unscheduled entries in the GOC DB for this last week. These relate to the Tier1 site outage on Friday (23rd) and the problems with the FTS (both reported above).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
lcgft-atlas.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/03/2012 13:00 | 28/03/2012 15:00 | 2 hours | Testing AGIS downtime calendar
lcgrbp01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/03/2012 10:45 | 28/03/2012 12:00 | 1 hour and 15 minutes | Upgrade of MyProxy to UMD version.
Whole Site | SCHEDULED | WARNING | 28/03/2012 10:00 | 28/03/2012 12:00 | 2 hours | At Risk during intervention on internal Tier1 network.
Whole Site | UNSCHEDULED | OUTAGE | 23/03/2012 15:20 | 23/03/2012 16:00 | 40 minutes | Declaring Site Outage - We are Investigating a problem somewhere within our Tier1 network.
lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk | UNSCHEDULED | WARNING | 21/03/2012 10:00 | 21/03/2012 12:00 | 2 hours | Warning (At Risk) while applying patch to Oracle database to fix ongoing problem with FTS.
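The durations and the scheduled/unscheduled split in the table above can be cross-checked from the start and end times. The following is only a minimal sketch: the `ENTRIES` list is typed in by hand from the table, and the helper names (`duration_minutes`, `summarise`) are hypothetical and not part of any GOC DB tool or API.

```python
from datetime import datetime

# Timestamp layout used in the table above (DD/MM/YYYY HH:MM).
FMT = "%d/%m/%Y %H:%M"

# Hand-copied from the table: (service, scheduled?, start, end).
ENTRIES = [
    ("lcgft-atlas.gridpp.rl.ac.uk", "SCHEDULED", "28/03/2012 13:00", "28/03/2012 15:00"),
    ("lcgrbp01.gridpp.rl.ac.uk", "SCHEDULED", "28/03/2012 10:45", "28/03/2012 12:00"),
    ("Whole Site", "SCHEDULED", "28/03/2012 10:00", "28/03/2012 12:00"),
    ("Whole Site", "UNSCHEDULED", "23/03/2012 15:20", "23/03/2012 16:00"),
    ("lcgfts.gridpp.rl.ac.uk, lfc.gridpp.rl.ac.uk", "UNSCHEDULED",
     "21/03/2012 10:00", "21/03/2012 12:00"),
]


def duration_minutes(start, end):
    """Return the whole number of minutes between two table timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def summarise(entries):
    """Return (service, scheduled?, minutes) for each table row."""
    return [
        (service, kind, duration_minutes(start, end))
        for service, kind, start, end in entries
    ]


if __name__ == "__main__":
    rows = summarise(ENTRIES)
    for service, kind, minutes in rows:
        print(f"{service}: {kind}, {minutes} minutes")
    unscheduled = sum(1 for _, kind, _ in rows if kind == "UNSCHEDULED")
    print(f"Unscheduled entries: {unscheduled}")
```

Run as a script, this prints one line per entry followed by the count of unscheduled entries, which can be compared against the Duration column and the summary sentence above.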

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-03-27 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)