Tier1 Operations Report 2012-07-25

From GridPP Wiki

RAL Tier1 Operations Report for 25th July 2012

Review of Issues during the week 18th to 25th July 2012
  • In the early hours of Sunday 22nd July a PSU on a disk server (gdss102) failed. This caused the PDU to trip, which in turn took out a network switch. Staff attended on site and the problem was resolved around 06:00.
  • At around 18:00 on Sunday 22nd one of the five nodes that make up the Top BDII service failed. It was removed from the DNS alias within an hour.
  • There has been a recurrence of the communication problems between the CEs and the batch server. An additional Nagios test (checking memory usage by one of the daemons on the batch server) gives notice of the problem, which is worked around by restarting the daemon.
  • The CMS disk servers reached 100% slot utilization, and a build-up of pending read requests led to failed jobs. In response, the total number of transfer manager slots for the CMS disk servers was doubled and the timeout for the CMS transfer manager was increased from 120 to 180 seconds.
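The Nagios memory test mentioned in the list above is not described in detail in this report. As a rough illustration only, a check of that kind could look like the sketch below. It assumes a Linux host where the daemon's resident memory can be read from the VmRSS line of /proc/<pid>/status; the function names and the thresholds are hypothetical and are not taken from the Tier1 configuration. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     2048 kB".
            return int(line.split()[1])
    return None


def check_memory(rss_kb, warn_kb, crit_kb):
    """Map a memory reading onto a Nagios (exit_code, message) pair.

    warn_kb and crit_kb are hypothetical thresholds chosen by the operator.
    """
    if rss_kb is None:
        return UNKNOWN, "UNKNOWN - could not read daemon memory usage"
    if rss_kb >= crit_kb:
        return CRITICAL, "CRITICAL - daemon RSS %d kB >= %d kB" % (rss_kb, crit_kb)
    if rss_kb >= warn_kb:
        return WARNING, "WARNING - daemon RSS %d kB >= %d kB" % (rss_kb, warn_kb)
    return OK, "OK - daemon RSS %d kB" % rss_kb
```

A real plugin would print the message and exit with the returned code. The daemon restart used as the workaround could then be carried out by hand on receiving the alert, or attached to the check as a Nagios event handler.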
Resolved Disk Server Issues
  • None
Current operational status and issues
  • Following patching for a security update we have a problem with one of the software components on WMS01 & WMS02 (the WMSs used by the LHC experiments) repeatedly crashing. A workaround (re-starter) is in place and the problem has been reported back.
  • On 12th/13th June the first stage of switching in preparation for the work on the main site power supply took place. The work on the two transformers is expected to take until 18th December and involves powering off one half of the resilient supply for three months while it is overhauled, then repeating the process with the other half.
Ongoing Disk Server Issues
  • GDSS607 (LHCbDst - D1T0) has been out of service for some time. It is being swapped for a different server which has undergone acceptance testing and is now in the final stages of deployment (expected to be available tomorrow).
Notable Changes made this last week
  • Following enabling of hyperthreading, one batch of worker nodes (the Dell 2011 batch) had the number of jobs increased from 12 to 14 on Wednesday (18th July).
  • CVMFS available for testing by non-LHC VOs (including "stratum 0" facilities).
  • The NFS server behind the CreamCEs was re-configured to support Argus today (Wednesday 25th July). A first tranche of worker nodes has been configured to use Argus rather than SCAS.
  • The number of nodes available in the 'whole node' job queue has been increased to ten.
Declared in the GOC DB
  • None
Advanced warning for other interventions
The following items are being discussed and are still to be formally scheduled and announced.

Listing by category:

  • Databases:
    • Switch LFC/FTS/3D to new Database Infrastructure.
  • Castor:
    • Upgrade to version 2.1.12.
  • Networking:
    • The site network team have scheduled an intervention on the site firewall on the 21st August.
    • Install new Routing layer for Tier1 and update the way the Tier1 connects to the RAL network. (Plan to co-locate with replacement of UKLight network).
    • Update Spine layer for Tier1 network.
    • Replacement of UKLight Router.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • The FTS Agents are being progressively moved to virtual machines.
    • Updates of Grid Services as appropriate. (Services are now on EMI/UMD versions unless there is a specific reason not to be.)


Entries in GOC DB starting between 18th and 25th July 2012

There were no unscheduled GOC DB entries for this period.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
All CEs | SCHEDULED | WARNING | 25/07/2012 10:00 | 25/07/2012 11:00 | 1 hour | At risk while the nfs server behind the CreamCEs is re-configured to support argus
Open GGUS Tickets
GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
84503 | Green | Urgent | In Progress | 2012-07-24 | 2012-07-24 | snoplus | python-devel
84492 | Green | Urgent | In Progress | 2012-07-24 | 2012-07-24 | snoplus | Job time/memory requirements not provided
84270 | Red | Less Urgent | In Progress | 2012-07-16 | 2012-07-18 | N/A | Recommended Top BDII List for WLCG -> lcgbdii.gridpp.rl.ac.uk
83927 | Red | Urgent | reopened | 2012-07-06 | 2012-07-24 | snoplus | glite-transfer permissions
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-07-17 | N/A | Retirenment of SL4 and 32bit DPM Head nodes and Servers