Tier1 Operations Report 2012-08-01


RAL Tier1 Operations Report for 1st August 2012

Review of Issues during the week 25th July to 1st August 2012
  • Following patching for a security update, one of the software components on WMS01 & WMS02 (the WMSs used by the LHC experiments) crashed repeatedly. A workaround (a re-starter, sketched after this list) was in place until the issue was resolved last Thursday (26th July) by applying a fix provided by the WMS developers.
  • There were two fail-overs of the main CERN link to the backup: on Friday (27th July), owing to a problem with a card in a router at Reading, and again from 06:00 on Saturday (28th) until 08:00 the following day (Sunday), when the main link was working in only one direction.
  • There was a problem on Monday evening (30th July) when the disk buffer for CMS Tape storage filled up owing to a problem with garbage collection. The problem was traced to a misconfigured GC policy on the cmsTape service class and fixed late that evening.
  • Following a ticket from SNO+, an error was found in the published value of GlueHostMainMemoryVirtualSize, which has now been increased to 2000 for the grid2000M queue. (A query sketch for checking the published value follows this list.)
  • Seven files were reported as lost to Atlas. The losses were identified following the draining of gdss452 (AtlasDataDisk), which was carried out after the server failed on 17th July.
  • Two files have been reported lost to ALICE. These were picked up while investigating files that would not migrate to tape.
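The re-starter workaround mentioned above was essentially a watchdog that restarts the affected WMS component if its process disappears. The following is a minimal sketch of the idea only; the process pattern and restart command are illustrative assumptions, not the actual component names or scripts used on WMS01/WMS02.

 #!/usr/bin/env python
 # Minimal re-starter sketch: restart a daemon when its process is not found.
 # PROCESS_PATTERN and RESTART_CMD are hypothetical placeholders.
 import subprocess
 import time

 PROCESS_PATTERN = "glite-wms-workload_manager"           # assumed component name
 RESTART_CMD = ["/sbin/service", "glite-wms", "restart"]  # assumed restart command

 def is_running(pattern):
     # pgrep exits with status 0 when at least one matching process exists.
     return subprocess.call(["pgrep", "-f", pattern]) == 0

 while True:
     if not is_running(PROCESS_PATTERN):
         subprocess.call(RESTART_CMD)
     time.sleep(60)  # check once a minute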
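The value now published for GlueHostMainMemoryVirtualSize can be checked by querying the site BDII over LDAP. This is a minimal sketch assuming the ldap3 Python library and the usual GLUE 1.3 layout for the RAL-LCG2 site; the BDII hostname below is a placeholder, not necessarily the actual RAL site BDII.

 # Query the site BDII for entries publishing GlueHostMainMemoryVirtualSize.
 from ldap3 import Server, Connection, ALL

 BDII_HOST = "site-bdii.example.ac.uk"  # placeholder host
 server = Server(BDII_HOST, port=2170, get_info=ALL)
 conn = Connection(server, auto_bind=True)  # anonymous bind, as for normal BDII queries

 conn.search(
     search_base="mds-vo-name=RAL-LCG2,o=grid",
     search_filter="(GlueHostMainMemoryVirtualSize=*)",
     attributes=["GlueHostMainMemoryVirtualSize"],
 )
 for entry in conn.entries:
     # Print each entry's DN alongside the published value, so the
     # figure for the grid2000M queue can be confirmed.
     print(entry.entry_dn, entry.GlueHostMainMemoryVirtualSize)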
Resolved Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) had been out of service for some time. It was replaced by another similar server (gdss611) on Thursday (26th July).
Current operational status and issues
  • On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating with the other half.
Ongoing Disk Server Issues
  • None
Notable Changes made this last week
  • Migration of the FTS agents to virtual machines is almost complete. A problem seen with the network driver on the VMs has been fixed.
  • The authorization service for all WNs was switched from SCAS to Argus yesterday (Monday 30th July).
  • Torque server updated to 2.5.12 with munge API support patch (Monday 30th July).
  • Continuing the test of hyperthreading, one batch of worker nodes (the Dell 2011 batch) had the number of job slots per node increased further (from 14 to 16) on Thursday (26th July).
  • As stated before: CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • cvmfs client version 2.0.18-1 rolled out to Worker Nodes. (A verification sketch follows this list.)
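A quick check of the client roll-out on a worker node is to query the installed package and probe the configured repositories. The sketch below is illustrative and assumes an RPM-based worker node with the standard cvmfs_config tool installed.

 # Confirm the installed cvmfs client version and probe the repositories.
 import subprocess

 # Report the installed package version (expected: cvmfs-2.0.18-1).
 subprocess.call(["rpm", "-q", "cvmfs"])

 # 'cvmfs_config probe' mounts each configured repository and reports OK/FAILED.
 subprocess.call(["cvmfs_config", "probe"])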
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12.
  • Networking:
    • The site network team have scheduled an intervention on the site firewall on 21st August.
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan is to co-locate this with the replacement of the UKLight router.)
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • The FTS Agents are being progressively moved to virtual machines.
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)


Entries in GOC DB starting between 25th July and 1st August 2012

There were no entries in the GOC DB for this period.

Open GGUS Tickets
GGUS ID  Level  Urgency      State          Creation    Last Update  VO       Subject
84682    Green  Less Urgent  In Progress    2012-07-31  2012-07-31   snoplus  lcg-del not deleting files
84655    Green  Less Urgent  Waiting Reply  2012-07-30  2012-08-01   snoplus  wms not responding to job submit
84492    Red    Urgent       Waiting Reply  2012-07-24  2012-07-30   snoplus  Job time/memory requirements not provided
83927    Red    Urgent       In Progress    2012-07-06  2012-07-30   snoplus  glite-transfer permissions
68853    Red    Less Urgent  On Hold        2011-03-22  2012-07-30   N/A      Retirement of SL4 and 32-bit DPM Head nodes and Servers