Tier1 Operations Report 2012-04-04

From GridPP Wiki
Revision as of 11:25, 4 April 2012 by Gareth smith

RAL Tier1 Operations Report for 4th April 2012

Review of Issues during the week 28th March to 4th April 2012.

  • Note: The outage of the Tier1 on Friday 16th March, caused by problems on the Tier1 network, is the subject of a post-mortem review. The report can be seen at:

https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20120316_Network_Packet_Storm

  • In the early hours of Friday morning (30th March) one of the network switch stacks lost contact with the rest of the network. The stack connects a mixture of CPU and disk servers. Staff attended on-site to resolve the problem. The problem started at around 03:00 and was fixed at 05:40.
  • On the morning of Monday 2nd April a problem on the batch system, showing a lack of running jobs and errors in the log files, was investigated. A 'warning' was declared on the batch system while this was investigated, and the opportunity was taken to replace a failed disk (part of a mirrored pair) on the batch server.

Resolved Disk Server Issues

  • None

Current operational status and issues.

  • There have been no further problems with the link between the UKLight and SAR routers since the last intervention over three weeks ago, so this issue will be dropped from this report.
  • The problems seen with the FTS since its upgrade to version 2.2.8 have not recurred since the patch applied on 21st March, so this issue will be dropped from this report.

Ongoing Disk Server Issues

  • GDSS392 (CMSTape D0T1) crashed during the evening of Monday 2nd April. All un-migrated files have been removed from this server, and it will undergo acceptance testing again before being returned to production.

Notable Changes made this last week

  • We are in the process of rolling out cvmfs client 2.0.11-1 on worker nodes.
  • The second batch of 2011 worker nodes is now being rolled out into production.
  • The Castor "repack" instance has been upgraded to version 2.1.11-9 and now uses the new tapegateway service.

Forthcoming Work & Interventions

  • The FTS front ends will be moved to virtual machines next week.
  • Some modified WAN tuning settings are being rolled out across disk servers.
  • GenScratch (effective 160TB writeable) will be enabled for production next week. It is available for testing now.

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced. We are carrying out a significant amount of work during the current LHC stop.

  • Databases:
    • Regular Oracle "PSU" patches are pending for SOMNUS (LFC & FTS).
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update.)
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plan to work on the main site power supply for 6 months commencing 14th May. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half.

Entries in GOC DB starting between 28th March and 4th April 2012.

There was one unscheduled entry in the GOC DB as batch problems were investigated on Monday (2nd April).

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs (lcgce03, lcgce05, lcgce07, lcgce08, lcgce09) | UNSCHEDULED | WARNING | 02/04/2012 09:45 | 02/04/2012 14:00 | 4 hours and 15 minutes | Investigating problem with batch system and replacing faulty disk on batch server.
lcgft-atlas.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/03/2012 13:00 | 28/03/2012 15:00 | 2 hours | Testing AGIS downtime calendar
lcgrbp01.gridpp.rl.ac.uk | SCHEDULED | WARNING | 28/03/2012 10:45 | 28/03/2012 12:00 | 1 hour and 15 minutes | Upgrade of MyProxy to UMD version.
Whole Site | SCHEDULED | WARNING | 28/03/2012 10:00 | 28/03/2012 12:00 | 2 hours | At Risk during intervention on internal Tier1 network.
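
The Duration values listed for these entries follow directly from the Start and End timestamps. As a cross-check, here is a minimal Python sketch that recomputes them. The function name `downtime_duration` and the `entries` list are illustrative assumptions for this report only; they are not part of the GOC DB or any of its tooling. The timestamps are copied from the entries above and are assumed to be in day/month/year order.

```python
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute, as printed in the report.
GOCDB_FORMAT = "%d/%m/%Y %H:%M"


def downtime_duration(start: str, end: str) -> str:
    """Return the elapsed time between two timestamps as 'H hours and M minutes'."""
    delta = datetime.strptime(end, GOCDB_FORMAT) - datetime.strptime(start, GOCDB_FORMAT)
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour" + ("s" if hours != 1 else ""))
    if minutes:
        parts.append(f"{minutes} minute" + ("s" if minutes != 1 else ""))
    return " and ".join(parts) or "0 minutes"


# Hypothetical in-memory copy of the entries listed in this report.
entries = [
    ("All CEs", "02/04/2012 09:45", "02/04/2012 14:00"),
    ("lcgft-atlas.gridpp.rl.ac.uk", "28/03/2012 13:00", "28/03/2012 15:00"),
    ("lcgrbp01.gridpp.rl.ac.uk", "28/03/2012 10:45", "28/03/2012 12:00"),
    ("Whole Site", "28/03/2012 10:00", "28/03/2012 12:00"),
]

for service, start, end in entries:
    print(f"{service}: {downtime_duration(start, end)}")
```

The sketch treats the timestamps as naive local times, which is sufficient here because every entry starts and ends on the same day.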

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
80775 | Green | Urgent | On hold | 2011-03-31 | 2012-04-02 | LHCb | version of the LFC
80668 | Red | Urgent | On hold | 2011-03-27 | 2012-04-04 | SNO+ | Please can curl-config and bzip2-devel be installed
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-03-27 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)