Batch system status

From GridPP Wiki

Sites batch system status

This page has been set up to collect information from GridPP sites regarding their batch systems in February 2014. The information will help with wider considerations and strategy. The table seeks the following:

  1. Current product (local/shared) - What is the current batch system at the site? Is it locally managed or shared with other groups?
  2. Concerns - Has your site experienced any problems with the batch system in operation?
  3. Interest/Investigating/Testing - Does your site already have plans to change, and if so, to what? If not, are you actively investigating or testing any alternatives?
  4. CE type(s) - What CE type (gLite, ARC...) do you currently run, and do you plan to change this, perhaps in conjunction with a batch system move?
  5. glExec/pilot support for all VOs - Do you have glExec and pilot pool accounts for all VOs, as opposed to just the LHC VOs? These are used for the move to a DIRAC WMS.
  6. Multicore status for ATLAS and CMS
    1. ATLAS multicore jobs history for UK sites
  7. Machine/Job Features (MJF) enabled - "-" = not started; Fail = failing SAM tests; Warn = warnings from SAM tests; Pass = passing SAM tests
  8. Notes - Any other information you wish to share on this topic.

See Cloud & VM status for status of Vac/Cloud deployment by site.

Site | Current product (local/shared) | Concerns and observations | Interest/Investigating/Testing | CE type(s) & plans at site | Pilots for all | cgroups | Multicore ATLAS/CMS | MJF | CentOS7 WN | Notes
RAL Tier-1 HTCondor (local) None No reason ARC-CE Yes Yes Yes Warn Yes
UKI-LT2-Brunel ARC/Condor ARC-CE info system Spark cluster in test ARC-CE Yes Yes Yes -
UKI-LT2-IC-HEP Gridengine (local) None No reason CREAM, ARC Yes No Yes - Yes
UKI-LT2-QMUL SLURM SLURM does support MaxCPUTime for queues but it's complicated SLURM CREAM Yes Yes Yes No In local testing GPU and preempt queues also supported on the grid
UKI-LT2-RHUL Torque/Maui (local) Torque/Maui support non-existent Will follow the consensus CREAM Yes No Yes -
UKI-NORTHGRID-LANCS-HEP Son of Gridengine (HEC) Torque/Maui decommissioned CREAM, moving to ARC eventually Yes No Yes - Yes
UKI-NORTHGRID-LIV-HEP HTCondor/VAC (local) None CentOS7 ARC Yes Yes Yes Yes Yes None
UKI-NORTHGRID-MAN-HEP Torque/Maui (local)/ HTCondor (local) Maui is unsupported. HTCondor Started migration to ARC-CE/HTCondor Yes Yes Yes Pass Yes
UKI-NORTHGRID-SHEF-HEP Torque/Maui (local) Torque/Maui support non-existent HTCondor is in testing mode CREAM CE, ARC CE is in test Yes No Yes -
UKI-SCOTGRID-DURHAM SLURM (local) No reason ARC CE Yes Yes Yes -
UKI-SCOTGRID-ECDF Gridengine None No reason ARC-CE No Yes - Yes
UKI-SCOTGRID-GLASGOW HTCondor (local), Torque/Maui (local) Becomes unresponsive at times of high load or when nodes are uncontactable. Investigating HTCondor/SoGE/SLURM as a replacement. ARC-CE Yes Yes -
UKI-SOUTHGRID-BHAM-HEP Torque/Maui Maui sometimes fails to see new jobs and so nothing is scheduled HTCondor CREAM No No -
UKI-SOUTHGRID-BRIS HTCondor (shared) Cannot run modern workflows (e.g. Apache Spark) Kubernetes, Mesos ARC-CE, plan to add HTCondor CE once accounting is sorted. On roadmap Yes No - In local testing
UKI-SOUTHGRID-CAM-HEP Torque/Maui (local) Torque/Maui support non-existent Will follow the consensus CREAM CE Yes No Yes Pass
UKI-SOUTHGRID-OX-HEP HTCondor (local) None No reason ARC CE in production Yes Yes Yes -
UKI-SOUTHGRID-RALPP HTCondor None No reason ARC CE Yes Yes Yes Warn
UKI-SOUTHGRID-SUSX (Shared) Gridengine - (Univa Grid Engine) None No reason CREAM CE Looking into it Yes -