Difference between revisions of "Tier1 Operations Report 2012-07-18"

From GridPP Wiki

Latest revision as of 10:09, 18 July 2012


RAL Tier1 Operations Report for 18th July 2012

Review of Issues during the week 11th to 18th July 2012
  • A faulty PDU caused a number of machines to lose power on Friday evening (13th July), including two NIS servers, a CVMFS squid and a CMS squid. These systems all had redundancy, so the failure did not affect services. The PDU failed again on Monday evening (16th July). Critical machines have now been moved to other racks.
Resolved Disk Server Issues
  • GDSS447 (Atlas DataDisk - D1T0) was reported as having problems last week. It was recovered and returned to production on Thursday 12th July.
  • GDSS452 (Atlas StripInput - D1T0) had a system drive failure on Monday 16th July. On Tuesday morning (17th July) the data partitions then went read-only. The RAID card was reset, and the machine was checked and returned to production at 15:30 the same day.
Current operational status and issues
  • On 12th/13th June the first stage of switching took place in preparation for the work on the main site power supply. The work on the two transformers is expected to take until 18th December. It involves powering off one half of the resilient supply for 3 months while it is overhauled, then repeating the process with the other half.
Ongoing Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) has been out of service for some time. It is being swapped for a different server which is being acceptance tested ahead of deployment.
Notable Changes made this last week
  • On Wednesday 11th July, CMS batch work was switched to use CVMFS.
Declared in the GOC DB
  • None
Advance warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12.
  • Networking:
    • The site network team have scheduled an intervention on the site firewall on the 21st August.
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • The FTS Agents are being progressively moved to virtual machines.
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)


Entries in GOC DB starting between 4th and 11th July 2012

There were no GOC DB entries for this period.

Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
84307 | Green | Very Urgent | In Progress | 2012-07-17 | 2012-07-18 | LHCb | Some files missing from storage - probably they fa..
84270 | Amber | Less Urgent | In Progress | 2012-07-16 | 2012-07-17 | N/A | Recommended Top BDII List for WLCG -> lcgbdii.gridpp.rl.ac.uk
84166 | Green | Urgent | In Progress | 2012-07-12 | 2012-07-12 | t2k | FTS transfers RALLCG2-VICTORIALCG2 (t2k.org)
83927 | Green | Urgent | Waiting for Reply | 2012-07-06 | 2012-07-18 | snoplus.snolab.ca | glite-transfer permissions
68853 | Red | Less Urgent | On Hold | 2011-03-22 | 2012-06-25 | N/A | Retirement of SL4 and 32bit DPM Head nodes and Servers