SAM availability: October 2007 - May 2008


SAM Test Results - 4th analysis to the end of May 2008

The plots below show the SAM availability results for each GridPP site against the average across all sites. The SAM summary data is available from: http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/sam.html. The figure for each day is based on the hourly tests run that day, from midnight to midnight.

The title of each graph has the following format: Site name (% site availability for the last month: % average GridPP site availability for the comparison period: amount by which the site is above or below the average). The current target is >90%.
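
As a concrete illustration of the arithmetic behind these figures, the short Python sketch below turns a day's hourly test results into a daily availability percentage and formats a graph title in the style just described. It is only a sketch under the assumptions stated above; the function names, the example site name and the percentages are hypothetical and not taken from the SAM tooling.

    # Illustrative sketch only: derive a daily availability figure from hourly
    # SAM test results and format a graph title in the style described above.
    # All names and numbers here are hypothetical.

    def daily_availability(hourly_results):
        """Percentage of passed hourly tests for one day (midnight to midnight)."""
        passed = sum(1 for ok in hourly_results if ok)
        return 100.0 * passed / len(hourly_results)

    def graph_title(site, site_pct, gridpp_avg_pct):
        """Site name (site %: average GridPP site %: amount above/below average)."""
        diff = site_pct - gridpp_avg_pct
        return f"{site} ({site_pct:.1f}%: {gridpp_avg_pct:.1f}%: {diff:+.1f}%)"

    # Example day: 22 of the 24 hourly tests passed.
    day = [True] * 22 + [False] * 2
    print(f"{daily_availability(day):.1f}")             # 91.7, above the >90% target
    print(graph_title("UKI-EXAMPLE-SITE", 92.0, 88.0))  # UKI-EXAMPLE-SITE (92.0%: 88.0%: +4.0%)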

The KSI2K hrs of delivered CPU for the period will be added shortly.


London Tier2

UKI-LT2-Brunel

Site comments

  • We've had no-one in a formal support role since January, and therefore not enough people to provide a sensible level of cover over the Easter break. We therefore scheduled a shutdown over the holiday itself, followed by further downtime for various updates - hence the limited availability from mid-March to mid-April.
  • The other major issues are both ongoing - faulty air conditioning and instability of the DPM pool nodes under heavy load. The latter, at least, is being worked on...

File:SAM A Recent UKI-LT2-Brunel.png

UKI-LT2-IC-HEP

Site comments

All the failures are related to two major problems during this period:

  • Problems with site SRM (dCache) pool nodes going down.
  • Problems with the main disk server hosting the VOs' home directories, etc.

File:SAM A Recent UKI-LT2-IC-HEP.png

UKI-LT2-IC-LESC

Site comments

  • End of October: Unexpected failure due to problems with primary fileserver.
  • End of January: Problems with site SRM (dCache) pool nodes going down.

File:SAM A Recent UKI-LT2-IC-LeSC.png

UKI-LT2-QMUL

Site comments

  • Site came back on-line at the start of May; for May, the site availability (from GridMap) was 76%.
  • Problems caused by CRLs not updating on the worker nodes - no fetch-crl cron job in place, and the fetch-crl script thought to be missing from the tarball (see the sketch after this list).
  • Some problems caused by site DNS failures
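
The stale-CRL failure mode above can be caught with a simple freshness check on the worker nodes. The Python sketch below is a minimal illustration, assuming CRLs are kept as *.r0 files under /etc/grid-security/certificates (the usual layout on gLite nodes); the path and the 24-hour threshold are assumptions, and the script is not part of fetch-crl itself.

    # Minimal sketch: flag CRL files that have not been refreshed recently.
    # Assumes CRLs live as *.r0 files under /etc/grid-security/certificates;
    # the location and threshold are assumptions - adjust for the local setup.
    import os
    import time

    CERT_DIR = "/etc/grid-security/certificates"   # assumed location
    MAX_AGE_HOURS = 24                              # illustrative threshold

    def stale_crls(cert_dir=CERT_DIR, max_age_hours=MAX_AGE_HOURS):
        """Return CRL files whose modification time is older than the threshold."""
        cutoff = time.time() - max_age_hours * 3600
        return [name for name in os.listdir(cert_dir)
                if name.endswith(".r0")
                and os.path.getmtime(os.path.join(cert_dir, name)) < cutoff]

    if __name__ == "__main__":
        for crl in stale_crls():
            print(f"stale CRL: {crl}")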

File:SAM A Recent UKI-LT2-QMUL.png

UKI-LT2-RHUL

Site comments

  • Overall poor availability was the problem, but when we were up, reliability was measured at 97%.
  • The reasons for poor availability are as follows.
    • Big drop-outs in November: cause unknown, nothing was scheduled
    • 12-13 Dec 07: air-conditioning fault
    • End of year: shutdown over the holidays, 21 Dec - 3 Jan
    • 14-15 Feb 08: power failure
    • 16 March to 16 May: the RHUL machine-room air conditioning was broken, giving intermittent failures and regular scheduled downtime while the engineers attempted new fixes. Now finally solved.
    • A disk failure on the master node took out the cluster behind ce2 from around 13 to 22 May.
    • ce2 problems caused by running up-to-date lcg-CE software, solved by installing an older version (!) - see Alessandra's blog
  • More to come...

File:SAM A Recent UKI-LT2-RHUL.png

UKI-LT2-UCL-CENTRAL

Site comments

  • Site in downtime for much of this period


File:SAM A Recent UKI-LT2-UCL-CENTRAL.png

UKI-LT2-UCL-HEP

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent UKI-LT2-UCL-HEP.png


NorthGrid

UKI-NORTHGRID-LANCS-HEP

Site comments

  • Late November/early December 2007 dip caused by complications during the upgrade to dCache 1.8 and SRM 2.2.
  • Late January/early February problems caused by the dCache headnode blowing its system disk. Complications getting it back up and running, despite having the configs backed up, prompted the move to DPM.
  • Early March downtime caused by the host running the MON box and site BDII blowing its system disk. All future mission-critical services are going to have mirrored system disks!

File:SAM A Recent UKI-NORTHGRID-LANCS-HEP.png

UKI-NORTHGRID-LIV-HEP

Site comments

  • Early Feb 2008: site was in maintenance for major upgrades to the CE, SE, WNs and site fabric
  • Late Feb 2008 to early March 2008: continual problems with dCache configuration and information publishing, but the site was otherwise in full production
  • End of March 2008: whole site down for electrical repairs

File:SAM A Recent UKI-NORTHGRID-LIV-HEP.png

UKI-NORTHGRID-MAN-HEP

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent UKI-NORTHGRID-MAN-HEP.png

UKI-NORTHGRID-SHEF-HEP

Site comments

  • Mid February 2008 - Problem with DPM (dpmmgr) setup and publishing
  • Mid March 2008 - Upgrading WNs to SL4
  • Mid April 2008 - Installing SL4 and DPM 1.6.7 on the SE
  • End of April 2008 - Problem with the new certificate release (inconsistent between the gLite 3.0 version on the CE and gLite 3.1 on the SE)

File:SAM A Recent UKI-NORTHGRID-SHEF-HEP.png

ScotGrid

UKI-SCOTGRID-DURHAM

Site comments

  • Mid October -- Power failure to machine room
  • Late December -- 4 Nodes eating jobs
  • Mid March -- Planned upgrade of CE, SE and WN to SL4
  • Early April -- Issues following on from upgrade.
  • Early May -- RAID on SE failed
  • Late May -- lcg-CA-1.21 broke the site due to a mismatch between CA certificates and CRLs.

File:SAM A Recent UKI-SCOTGRID-DURHAM.png

UKI-SCOTGRID-ECDF

Site comments

  • Site availability was low until the start of 2008, due to issues with the virtual-memory limits on the ECDF queues. Because the Linux kernel counts virtual memory without allowing for shared libraries actually being shared, the measured virtual-memory footprint of massively forked jobs, such as the SAM tests, is huge compared to their real usage - which caused jobs to be killed by SGE. Worked around by introducing a "secret" set of queues that allow additional virtual-memory usage for Grid jobs (see the sketch after this list).
  • Since the start of 2008, we have had various issues, mostly with the ECDF's shared GPFS filesystem, which have caused additional downtime. In addition to these issues, there have been several periods of "official" downtime for upgrades (in late January and early February), and for essential work on the GPFS filesystem (in April and May).
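
To make the memory-accounting issue above concrete, the Python sketch below forks a few child processes from a parent holding a large buffer and sums the per-process VmSize reported by the kernel; the sum is several times the real footprint because copy-on-write pages are counted once per process, which is exactly what trips a per-job virtual-memory limit. It is an illustration only (Linux-specific, with arbitrary sizes), not the ECDF or SGE configuration itself.

    # Illustration of the virtual-memory accounting problem described above.
    # Each forked child reports the parent's full VmSize, so summing VmSize
    # over a heavily forking job vastly overstates the real memory in use.
    # Linux-only; the buffer size and process count are arbitrary.
    import os
    import time

    def vm_size_kb(pid):
        """VmSize (total virtual memory, in kB) of a process, read from /proc."""
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
        return 0

    payload = bytearray(200 * 1024 * 1024)      # ~200 MB allocated once, in the parent

    children = []
    for _ in range(4):
        pid = os.fork()
        if pid == 0:
            time.sleep(5)                       # child only shares the parent's pages
            os._exit(0)
        children.append(pid)

    pids = [os.getpid()] + children
    total = sum(vm_size_kb(p) for p in pids)
    print(f"summed VmSize over {len(pids)} processes: {total} kB")
    # Roughly five times the real usage: the copy-on-write pages are
    # counted once per process, even though only one copy exists.

    for p in children:
        os.waitpid(p, 0)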

File:SAM A Recent UKI-SCOTGRID-ECDF.png

UKI-SCOTGRID-GLASGOW

Site comments

Generally Q1 was hard for us - some planned downtimes, network problems and some user DoS attacks took us well below 90%. Things are much improved in Q2 (we're at 95%), and we have corrected issues and improved monitoring where we can.

  • End Jan 2008: Problem for Ops on SE. [1]
  • End Feb 2008: CE crash followed by upgrade of site to gLite 3.1 servers (CE, SE). [2], [3]
  • Feb - March 2008: General lower availability caused by Globus 79 errors on the CE, eventually tracked down to the campus networking people having arbitrary blocks on some ports in our Globus range (see the sketch after this list). [4]
  • Mid April 2008: Some worker nodes turned into black holes by an LHCb user. [5]
  • Start May 2008: Problems updating CRLs after CE disk filled up. [6] [7]
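
A blocked Globus port range of the kind described above can be spotted with a simple connectivity probe from outside the campus firewall. The Python sketch below is a hedged illustration: the host name and port range are placeholders (not the site's actual GLOBUS_TCP_PORT_RANGE), and a connect that times out usually indicates filtering, whereas a fast refusal means the port is reachable but nothing is listening on it.

    # Hedged sketch: probe a CE across an assumed Globus port range and
    # classify each port. A timeout usually means a firewall is dropping
    # the traffic; a refusal means the port is reachable but closed.
    # Host and range below are placeholders, not real site values.
    import socket

    CE_HOST = "ce.example.ac.uk"        # hypothetical CE host
    PORT_RANGE = range(20000, 20010)    # hypothetical Globus port range

    def probe(host, port, timeout=2.0):
        """Return 'open', 'closed' (refused) or 'filtered' (timed out)."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "open"
        except socket.timeout:
            return "filtered"
        except ConnectionRefusedError:
            return "closed"
        except OSError:
            return "filtered"

    if __name__ == "__main__":
        for port in PORT_RANGE:
            print(f"{CE_HOST}:{port} -> {probe(CE_HOST, port)}")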

File:SAM A Recent UKI-SCOTGRID-GLASGOW.png

SouthGrid

UKI-SOUTHGRID-BHAM-HEP

Site comments

  • Availability until January was overall very good, apart from a drop caused by an electrical shutdown on December 12. We had two further shutdowns in January, dictated by a campus-wide upgrade of our electrical infrastructure.
  • A broken GBIC card seriously affected our connectivity, and hence availability, at the end of January and in early February.
  • An air-conditioning failure caused a drop in availability on March 11.
  • Our instabilities from mid March until early May were caused by various problems with the new SL4 CE.

File:SAM A Recent UKI-SOUTHGRID-BHAM-HEP.png

UKI-SOUTHGRID-BRIS-HEP

Site comments

  • 21 Nov'07: CMS killed Bris DPM SE with massive simultaneous transfers. SE needs more RAM! Ordered it.
  • Mid-Dec'07: not sure, but Bris failed some tests because opssgm was trying to write to dteam storage space. Maarten Litmaath said it was a SAM problem, soon fixed.
  • 22 Jan'08: scheduled (in GOC DB) outage to add +2GB RAM to SE, 750GB disk to CE for experiment software space
  • 8 Apr'08: https://gus.fzk.de/pages/ticket_details.php "SARA wrongly publishes an LFC for the ops VO", SAM error affected all LCG
  • 19 Apr'08: Major power failure in HPC machine room
  • 25 Apr'08: https://gus.fzk.de/pages/ticket_details.php?ticket=35757 RAL top-level BDII broken around 8pm on 24 April. Fixed the next day.
  • 23 May'08: HPC shutdown due to cooling water leak, repairs took a week!

File:SAM A Recent UKI-SOUTHGRID-BRIS-HEP.png

UKI-SOUTHGRID-CAM-HEP

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent UKI-SOUTHGRID-CAM-HEP.png


EFDA-JET

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent EFDA-JET.png


UKI-SOUTHGRID-OX-HEP

Site comments

  • Installation of two SL4-based CEs during April-May caused some instabilities; this was followed by migration of the DPM head node to SL4. The old SL3-based CE (t2ce03) will be decommissioned in June, and the MON box upgraded to SL4.

File:SAM A Recent UKI-SOUTHGRID-OX-HEP.png

UKI-SOUTHGRID-RALPP

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent UKI-SOUTHGRID-RALPP.png


Tier1

RAL-LCG2-Tier-1

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent RAL-LCG2 Tier-1.png


Grid Ireland

csTCDie

Site comments

  • Observation 1
  • Observation 2

File:SAM A Recent csTCDie.png