Tier1 Operations Report 2012-05-23

RAL Tier1 Operations Report for 23rd May 2012

Review of Issues during the week 16th to 23rd May 2012
  • On Friday 18th May there was an issue with gdss374. This resulted in 90 lost files for ATLAS. This is in addition to the 34 files lost from this machine on Monday 14th May.
  • On Saturday 19th May there was a failure of FTS transfers over a period of a few hours. The cause is not fully understood.
  • On Saturday and Monday there were problems with various BDIIs, causing some job failures.
  • On Monday 21st May, there was a failure of a switch stack (stack 15). This affected the attached machines for approx half an hour. This switch stack failed again on Tuesday 22nd at approx 13:00.
  • On Monday 21st errata were applied to the FTS machines. This caused the problem with certificates within FTS for LHCb and CMS to recur (see current issues).
  • On Monday 21st one of the CMS squid machines failed to reboot after errata were applied. It is currently in hardware intervention.
  • On Tuesday 22nd, there was a site network intervention, which affected services from approx 08:00 until 09:45.
Resolved Disk Server Issues
  • gdss469 (lhcbUser) reported fsprobe errors on Monday 21st May. It was taken out of service and had a disk replaced. It was returned to service at approx 13:30 the same day.
Current operational status and issues
  • Investigations into an ongoing communications problem between the CEs and the batch server continue.
  • There have been no further problems in the last week on the UKLight-SAR link although we will continue to track this here.
  • There is a known problem with the handling of some certificates within FTS that is currently causing problems for outgoing CMS FTS transfers.
Ongoing Disk Server Issues
  • gdss644 (atlasStripInput) was found to have an incorrect installation and it is being drained and will be re-installed.
Notable Changes made this last week
  • Errata and kernel updates are being deployed.
Declared in the GOC DB
  • None declared, but see the advanced warnings section.
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.
  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11. There is a proposed date of 13th June, pending internal sign-off.
  • Castor:
    • Deploy Transfer Manager for Castor. We now have proposed dates for this:
      • 28 May 2012 10:00-11:00 LHCb
      • 30 May 2012 10:00-11:00 Gen
      • 31 May 2012 10:00-11:00 CMS
      • 07 Jun 2012 10:00-11:00 ATLAS
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer. There is now a firm date of 19th June for upgrading the Site Access Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plans to work on the main site power supply for 6 months commencing 18th June. This involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half. This work has been postponed from 14th May.


Entries in GOC DB starting between 2nd and 9th May 2012

There were no unscheduled outages during the last week.

There were no scheduled outages in the GOCDB in the past week.

Open GGUS Tickets
GGUS ID | Level  | Urgency     | State       | Creation   | Last Update | VO                | Subject
68853   | Red    | Less Urgent | On hold     | 2011-03-22 | 2012-04-20  |                   | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)
82100   | Yellow | Less Urgent | In progress | 2012-05-10 | 2012-05-14  | snoplus.snolab.ca | default se
81699   | Red    | Less Urgent | In progress | 2011-04-27 | 2012-05-21  | NA62              | FTS channel for na62.vo.gridpp.ac.uk
82376   | Green  | Very Urgent | In progress | 2012-05-21 | 2012-05-22  | T2K               | t2k.org jobs aborting due to failed delegation ID
82378   | Green  | Less Urgent | In progress | 2012-05-21 | 2012-05-23  | snoplus.snolab.ca | Software install job cancelled
82385   | Green  | Less Urgent | In progress | 2012-05-21 | 2012-05-22  | CMS               | RAL tape migration
82402   | Green  | Urgent      | In progress | 2012-05-22 | 2012-05-23  | snoplus.snolab.ca | WMS proxy renewal problems
