Tier1 Operations Report 2012-04-25

RAL Tier1 Operations Report for 25th April 2012

Review of Issues during the fortnight 11th to 25th April 2012.

  • Two additional FTS front end systems (implemented as virtual machines) were added to the FTS alias on the morning of Wednesday 11th April. However, this change was backed out later that day after a number of problems were encountered, not all of which were understood at the time. The change was re-applied on Thursday 19th April. A problem with the FTS incorrectly treating proxies as expired was traced to the versions of the Globus libraries in use. In addition, one of the CMS VOMS servers had not been configured in (now fixed). Note that any remote site that has not updated to the newer UK e-Science CA certificate cannot submit FTS transfers.
  • On Thursday 12th April a large number of disk servers dropped their network connections. The servers were all in the AtlasStripInput (AtlasDataDisk) service class and came from two batches of disk servers that share a common network interface card. The root cause is not understood; however, we rolled out an updated kernel (with an updated network driver) to the affected machines. (Four were updated on the Thursday afternoon and the remaining ten servers on the morning of Friday 13th April.) A similar server (gdss613) from one of these batches, but in the LHCbDst service class, showed the same problem over the weekend of 21/22 April. The updated kernel was rolled out to all disk servers in the affected batches this morning (Wednesday 25th April).
  • During the afternoon of Wednesday 18th April a problem with clock drift was found on the Castor Atlas stager. This caused the loss of two files. The problem has been fixed and a new Nagios check has been rolled out to look for clock drift (a minimal sketch of such a check is given after this list).
  • On Thursday 19th April a problem was found with xrootd access to AtlasStripDeg. It turned out that a mapping was missing for this service class.
  • On Thursday 19th April the limit on the number of jobs in the Grid4000M queues was increased (from 2000 to 3000). The need for this was picked up after LHCb jobs failed to be scheduled to run.
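
The clock drift check referred to above is not reproduced in this report. The following is a minimal, illustrative sketch of a Nagios-style check of the same kind: the NTP server name and the warning/critical thresholds are placeholders, not the values used in production.

```python
#!/usr/bin/env python
"""Minimal Nagios-style clock drift check (illustrative sketch only)."""
import socket
import struct
import sys
import time

NTP_SERVER = "ntp.example.org"   # placeholder - substitute a local time source
WARN_SECONDS = 1.0               # assumed warning threshold
CRIT_SECONDS = 5.0               # assumed critical threshold
NTP_EPOCH_OFFSET = 2208988800    # seconds between 1900-01-01 and 1970-01-01


def ntp_time(server, timeout=5):
    """Return the server's clock (Unix seconds) from a basic SNTP query."""
    packet = b'\x1b' + 47 * b'\0'        # LI=0, VN=3, Mode=3 (client)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
    finally:
        sock.close()
    # Transmit timestamp: seconds field at offset 40 of the 48-byte reply.
    seconds = struct.unpack('!I', data[40:44])[0]
    return seconds - NTP_EPOCH_OFFSET


def main():
    try:
        drift = abs(ntp_time(NTP_SERVER) - time.time())
    except (socket.error, socket.timeout, struct.error) as exc:
        print("UNKNOWN: could not query %s: %s" % (NTP_SERVER, exc))
        sys.exit(3)
    if drift >= CRIT_SECONDS:
        print("CRITICAL: clock drift %.1fs" % drift)
        sys.exit(2)
    if drift >= WARN_SECONDS:
        print("WARNING: clock drift %.1fs" % drift)
        sys.exit(1)
    print("OK: clock drift %.1fs" % drift)
    sys.exit(0)


if __name__ == "__main__":
    main()
```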

Resolved Disk Server Issues

  • GDSS392 (CMSTape D0T1) crashed during the evening of Monday 2nd April. All un-migrated files were removed from this server and it is being decommissioned.
  • GDSS445 (AtlasDataDisk) reported an FSPROBE problem on Friday evening (6th April). It was out of production for a little under two hours and then put into 'read-only' mode. It was taken out of service during the day on Wednesday 11th April for further investigation. However, no faults were found and it was decided to drain this system. Initially the draining of this server was very slow; this was fixed following a check of the Castor parameters on the system.
  • As referred to above, a large number of disk servers in AtlasDataDisk dropped their network connections on Thursday 12th April (a total of 18 cases spread over 14 disk servers during the day). On Monday evening (23rd April) GDSS613 (LHCbDst) also lost connectivity for the same reason.
  • GDSS610 was not working following its reboot on Friday 13th April. Over the weekend this was traced to the LSF daemon not having been started, and it was resolved on the Saturday afternoon (14th April).

Current operational status and issues.

  • Intermittent SUM test failures for the LHCb CE tests have been traced to a communications problem between the CEs and the batch server, which is still under investigation. So far a parameter change on the batch server does not appear to have resolved it. A workaround (restarting the pbs_server) provides a temporary fix (see the sketch after this list).
  • We have again seen two short breaks in the UKLight-SAR link. These occurred on the morning of Monday 16th April and the evening of Tuesday 24th April; both breaks were short (around 15 minutes).
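
The pbs_server restart workaround mentioned above is a manual operational action; the sketch below only illustrates the idea, probing the Torque server port and restarting the service if it does not respond. The host name, port and service name are typical Torque defaults, not site-specific values from this report.

```python
#!/usr/bin/env python
"""Illustrative sketch of the pbs_server restart workaround (assumptions only)."""
import socket
import subprocess
import sys

BATCH_SERVER = "batch.example.org"   # placeholder batch server host
PBS_SERVER_PORT = 15001              # default Torque pbs_server port
TIMEOUT = 10                         # seconds to wait for a response


def server_responding(host, port, timeout):
    """Return True if a TCP connection to pbs_server can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except socket.error:
        return False


if __name__ == "__main__":
    if server_responding(BATCH_SERVER, PBS_SERVER_PORT, TIMEOUT):
        print("pbs_server responding - no action taken")
        sys.exit(0)
    print("pbs_server not responding - restarting service")
    # Assumes the standard init script name shipped with Torque packages.
    sys.exit(subprocess.call(["service", "pbs_server", "restart"]))
```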

Ongoing Disk Server Issues

  • GDSS209 (AtlasScratchDisk - D1T0) crashed on the evening of Friday 20th April. It was put back in service on Sunday (22nd April) but awaits further investigation of the fault. Two Atlas files have been reported as lost (corrupt) following this.

Notable Changes made this last week

  • CVMFS client version 2.0.13-1 is being rolled out across the worker nodes.
  • Two additional FTS front ends running on virtual machines were re-added into the alias.
  • More nodes have been added to the "whole node" batch queue (GridWN), bringing the total to 18. The memory limit for this queue has also been increased to 8GB (see the configuration sketch after this list).
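
The report does not state how the queue limit changes (the Grid4000M job limit above and the GridWN memory limit here) were applied. Assuming a Torque batch system, they could be made with qmgr commands such as those below; the attribute names (max_queuable, resources_max.mem) and exact queue names are assumptions for illustration only.

```python
#!/usr/bin/env python
"""Illustrative sketch of the queue changes via Torque's qmgr (assumed attributes)."""
import subprocess

# Raise the job limit on the Grid4000M queue (2000 -> 3000).
# The actual queue names and attributes used at RAL may differ.
subprocess.check_call(
    ["qmgr", "-c", "set queue Grid4000M max_queuable = 3000"])

# Raise the memory limit on the whole-node queue to 8GB.
subprocess.check_call(
    ["qmgr", "-c", "set queue GridWN resources_max.mem = 8gb"])
```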

Forthcoming Work & Interventions

  • Some modified WAN tuning settings are being rolled out across the disk servers (an illustrative sketch follows this list).
  • A third intervention on a power board supplied by the UPS will be needed. This very low risk intervention will probably take place on Tuesday 1st May.
  • Castor will move to use the new "Tape Gateway" and "Transfer Manager" features.
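
The specific WAN tuning settings being rolled out are not listed in this report. As a rough illustration, such tuning typically adjusts kernel TCP buffer parameters; the keys and values below are common examples and are placeholders only.

```python
#!/usr/bin/env python
"""Illustrative sketch of applying WAN TCP tuning on a disk server (placeholder values)."""
import subprocess

TUNING = {
    "net.core.rmem_max": "16777216",        # max socket receive buffer (bytes)
    "net.core.wmem_max": "16777216",        # max socket send buffer (bytes)
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
}

for key, value in sorted(TUNING.items()):
    # 'sysctl -w' applies the setting immediately; persisting it would
    # also need an entry in /etc/sysctl.conf.
    subprocess.check_call(["sysctl", "-w", "%s=%s" % (key, value)])
```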

Declared in the GOC DB

  • None

Advanced warning for other interventions

The following items are being discussed and are still to be formally scheduled and announced.

  • Databases:
    • Regular Oracle "PSU" patches are pending.
    • Switch LFC/FTS/3D to new Database Infrastructure.
    • Update LFC/FTS databases to Oracle 11.
  • Castor:
    • Update the Castor Information Provider (CIP) (Need to re-schedule.)
    • Move to use Oracle 11g (requires a minor Castor update to version 2.1.11-9).
    • Upgrade to version 2.1.12.
  • Networking:
    • Install new Routing & Spine layers for Tier1 network.
    • Main RAL network updates - early summer.
    • Addition of caching DNSs into the Tier1 network.
  • Grid Services:
    • Updates of Grid Services (including WMS, LFC front ends) to EMI/UMD versions.
  • Infrastructure:
    • The electricity supply company plans to work on the main site power supply for six months commencing 14th May. This involves powering off one half of the resilient supply for three months while it is overhauled, then repeating with the other half.

Entries in GOC DB starting between 11th and 25th April 2012.

There were three unscheduled entries in the GOC DB during this period, all 'warnings'. These all relate to the problem with disk servers losing network connectivity.

Service | Scheduled? | Outage/At Risk | Start | End | Duration | Reason
Atlas, LHCb & GEN instances | UNSCHEDULED | WARNING | 25/04/2012 11:00 | 25/04/2012 13:00 | 2 hours | Warning while some disk servers are rebooted to pick up new kernels.
lcgfts.gridpp.rl.ac.uk | SCHEDULED | WARNING | 19/04/2012 10:00 | 19/04/2012 12:00 | 2 hours | Addition of two more nodes into the alias for the FTS front ends. (FTS submissions from sites that are not up to date with CA certificates may have problems.)
All Castor & Batch (CEs & SRMs) | UNSCHEDULED | WARNING | 12/04/2012 17:00 | 13/04/2012 12:00 | 19 hours | We are still investigating a problem where some disk servers lose network connectivity.
All Castor & Batch (CEs & SRMs) | UNSCHEDULED | WARNING | 12/04/2012 13:30 | 12/04/2012 17:00 | 3 hours 30 minutes | We have had a number of disk servers lose network connectivity this morning. Declaring a 'Warning' (on Castor and batch) while we investigate.

Open GGUS Tickets

GGUS ID | Level | Urgency | State | Creation | Last Update | VO | Subject
81542 | Green | Very Urgent | In Progress | 2012-04-24 | 2012-04-24 | | RAL-LCG2 still publishing as member of EGEE ROC UKI.
68853 | Red | Less Urgent | On hold | 2011-03-22 | 2012-04-20 | | Retirement of SL4 and 32bit DPM Head nodes and Servers (Holding Ticket for Tier2s)