RAL Tier1 Workload Management


Important note: As of 1 December 2008, the lcg-RB service is no longer offered by the RAL Tier1 (the lcgrb01, lcgrb02 and lcgrb03.gridpp.rl.ac.uk servers have been decommissioned). The information below is out of date and will be replaced by proper gLite WMS/LB documentation in early January 2009. In the meantime, users may submit jobs to lcgwms01 and lcgwms02.gridpp.rl.ac.uk using the glite-wms-job-* tools.
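
A minimal sketch of a submission against those gLite WMS hosts, assuming a correctly configured UI pointing at the RAL endpoints and a valid VOMS proxy; the delegation ID, output file name and JDL file below are placeholders, not taken from this page:

 # delegate a proxy to the WMS, submit, then poll the job status
 # ("myid", jobids.txt and HelloWorld.jdl are illustrative names)
 $ glite-wms-job-delegate-proxy -d myid
 $ glite-wms-job-submit -d myid -o jobids.txt HelloWorld.jdl
 $ glite-wms-job-status -i jobids.txt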


Service Endpoints

The RAL Tier1 runs an LCG Workload Management System (Resource Broker) on three machines: lcgrb01.gridpp.rl.ac.uk, lcgrb02.gridpp.rl.ac.uk and lcgrb03.gridpp.rl.ac.uk.

A list of VOs that the RBs support can be obtained with:

  ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
      -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=ResourceBroker)' \
      GlueServiceAccessControlRule

As of 15 October 2008:

 GlueServiceAccessControlRule: atlas
 GlueServiceAccessControlRule: alice
 GlueServiceAccessControlRule: lhcb
 GlueServiceAccessControlRule: cms
 GlueServiceAccessControlRule: biomed
 GlueServiceAccessControlRule: zeus
 GlueServiceAccessControlRule: hone
 GlueServiceAccessControlRule: cdf
 GlueServiceAccessControlRule: dzero
 GlueServiceAccessControlRule: babar
 GlueServiceAccessControlRule: pheno
 GlueServiceAccessControlRule: t2k
 GlueServiceAccessControlRule: esr
 GlueServiceAccessControlRule: ilc
 GlueServiceAccessControlRule: magic
 GlueServiceAccessControlRule: minos.vo.gridpp.ac.uk
 GlueServiceAccessControlRule: mice
 GlueServiceAccessControlRule: dteam
 GlueServiceAccessControlRule: fusion
 GlueServiceAccessControlRule: geant4
 GlueServiceAccessControlRule: cedar
 GlueServiceAccessControlRule: manmace
 GlueServiceAccessControlRule: gridpp
 GlueServiceAccessControlRule: ngs.ac.uk
 GlueServiceAccessControlRule: camont
 GlueServiceAccessControlRule: totalep
 GlueServiceAccessControlRule: vo.southgrid.ac.uk
 GlueServiceAccessControlRule: vo.northgrid.ac.uk
 GlueServiceAccessControlRule: vo.scotgrid.ac.uk
 GlueServiceAccessControlRule: supernemo.vo.eu-egee.org
 GlueServiceAccessControlRule: na48
 GlueServiceAccessControlRule: vo.nanocmos.ac.uk
 GlueServiceAccessControlRule: vo.londongrid.ac.uk
 GlueServiceAccessControlRule: ops

Basic Usage

A user interface can be configured to use any of these Resource Brokers (in the example below 'lcgrb01' can be replaced with 'lcgrb02')

 # edg_wl_ui.conf
 [
    VirtualOrganisation = "dteam";
    NSAddresses = "lcgrb01.gridpp.rl.ac.uk:7772";
    LBAddresses = "lcgrb01.gridpp.rl.ac.uk:9000";
    MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"
 ]
 # edg_wl_ui_cmd_var.conf
 [
    rank = - other.GlueCEStateEstimatedResponseTime;
    requirements = other.GlueCEStateStatus == "Production";
    RetryCount = 3; 
    ErrorStorage = "/tmp";
    OutputStorage = "/tmp/jobOutput";
    ListenerPort = 44000;
    ListenerStorage = "/tmp";
    LoggingTimeout = 30;
    LoggingSyncTimeout = 30;
    LoggingDestination = "lcgrb01.gridpp.rl.ac.uk:9002";
    NSLoggerLevel = 0;
    DefaultLogInfoLevel = 0;
    DefaultStatusLevel = 0;
    DefaultVo = "unspecified";
 ]

Finally, check which resources match a job (submission works the same way with the analogous edg-job-submit command) with

 $ edg-job-list-match --config-vo edg_wl_ui.conf \
       --config edg_wl_ui_cmd_var.conf  HelloWorld.jdl
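
The HelloWorld.jdl file referenced above is not reproduced on this page; a minimal JDL along the following lines would do (the contents are illustrative, not the actual RAL example):

 # HelloWorld.jdl -- minimal illustrative job description
 [
    Executable = "/bin/echo";
    Arguments = "Hello World";
    StdOutput = "hello.out";
    StdError = "hello.err";
    OutputSandbox = {"hello.out","hello.err"};
 ]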

Service Monitoring


The Ganglia plots also indicate the number of jobs currently held within the Logging and Bookkeeping (L&B) service in various states.

 Job State   Plot Name        Description
 ABORTED     jobs_aborted     Aborted by the system (at any stage).
 CANCELLED   jobs_cancelled   Cancelled by the user.
 CLEARED     jobs_cleared     Output transferred back to the user and freed.
 DONE        jobs_done        Execution finished, output is available.
 READY       jobs_ready       Matching resources found.
 RUNNING     jobs_running     Executable is running.
 SCHEDULED   jobs_scheduled   Accepted by the LRMS queue.
 SUBMITTED   jobs_submitted   Entered by the user via the User Interface.
 WAITING     jobs_waiting     Accepted by the WMS, waiting for resource allocation.
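
On the user side, the L&B state of an individual job can be queried directly; the job identifier below is a placeholder for the one returned by edg-job-submit:

 # query the current state of a single job from the L&B server
 $ edg-job-status https://lcgrb01.gridpp.rl.ac.uk:9000/<unique_job_string>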


Alarms:

- If FD (the number of file descriptors opened by the edg-wl-log_monitor process) goes into the red (i.e. grows too large), the following procedure is needed:

1. Edit /etc/cron.d/edg-wl-check-daemons and comment out the cron job.

2. Then run:
 # stop the Log Monitor daemon
 /etc/init.d/edg-wl-lm stop
 # move CondorG log files not modified for more than 30 days into ./recycle/
 cd /var/edgwl/logmonitor/CondorG.log/
 find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
 cd /
 # restart the Log Monitor daemon
 /etc/init.d/edg-wl-lm start

3. Edit /etc/cron.d/edg-wl-check-daemons and uncomment the cron job.
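
The same procedure can be scripted. The sketch below disables the watchdog by moving the cron file aside rather than commenting it out (a variation on the literal instructions above); the paths are those from step 2:

 #!/bin/bash
 # Rotate old CondorG logs with the Log Monitor and its watchdog cron job stopped.
 CRONFILE=/etc/cron.d/edg-wl-check-daemons
 LOGDIR=/var/edgwl/logmonitor/CondorG.log

 mv "$CRONFILE" /root/edg-wl-check-daemons.disabled   # step 1: disable the watchdog cron job
 /etc/init.d/edg-wl-lm stop                           # step 2: stop the Log Monitor
 find "$LOGDIR" -maxdepth 1 -name 'CondorG.*.log' -mtime +30 \
     -print -exec mv {} "$LOGDIR/recycle/" \;         #         archive logs older than 30 days
 /etc/init.d/edg-wl-lm start                          #         restart the Log Monitor
 mv /root/edg-wl-check-daemons.disabled "$CRONFILE"   # step 3: re-enable the watchdog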


Local Deployment Information

The RAL Tier1 team has installed, configured and started operating a second LCG Resource Broker (lcgrb02.gridpp.rl.ac.uk) in addition to the existing one (lcgrb01.gridpp.rl.ac.uk). A load-balancing mechanism has also been implemented (with help from CERN specialists) and tested, and is now ready for use.

For this to work, some changes are needed at the UI level (or in any central job-submission mechanism), i.e. manual modification of some configuration files:

1. In $EDG_LOCATION/etc/edg_wl_ui_cmd_var.conf comment out the line that specifies the LoggingDestination.

Probably

  #LoggingDestination = "lcgrb01.gridpp.rl.ac.uk:9002";

2. For each supported VO, in $EDG_LOCATION/etc/$VO/edg_wl_ui.conf name the load-balanced RBs like this:

  NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
  LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};

Beware the exact syntax of the curly braces! (A complete per-VO file is sketched below, after these notes.)

On the RAL UIs, at least:

  $EDG_LOCATION=/opt/edg

No restart of services is needed.
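
Putting steps 1 and 2 together, a complete load-balanced per-VO file would look like the following sketch; it simply combines the dteam example from Basic Usage with the addresses above:

 # $EDG_LOCATION/etc/dteam/edg_wl_ui.conf
 [
    VirtualOrganisation = "dteam";
    NSAddresses = {"lcgrb01.gridpp.rl.ac.uk:7772","lcgrb02.gridpp.rl.ac.uk:7772"};
    LBAddresses = {{"lcgrb01.gridpp.rl.ac.uk:9000"},{"lcgrb02.gridpp.rl.ac.uk:9000"}};
    MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"
 ]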

In theory (and in our tests), the edg-job-submit command picks a random RB. If that RB fails to accept the job, the next RB is tried, and so on. Once the job has been submitted successfully, it stays tied to the RB that accepted it.

Similarly, if local configuration files are used (instead of the standard ones above) when submitting specific jobs, the same changes should be applied to those local files.


3 July 2007 - A new lcg-RB, lcgrb03.gridpp.rl.ac.uk, has been deployed for the sole use of ALICE. This experiment uses a very aggressive ("hammering") job-submission pattern which could make the RB unavailable for other users. LHCb might be asked to use lcgrb03 in the near future.

9 July 2007 - LHCb have been asked to use only lcgrb03 for their job submissions.

1 October 2008 - As the gLite WMS/LB service is now in place at RAL, the lcg-RBs will be decommissioned on 1 December 2008.

Known Problems

Rogue gahp_server processes

During normal operation of an lcg-RB, several gahp_server processes run as children of a condor_gridmanager process:

[root@lcgrb02 root]# pstree|grep gahp
     -condor_master---condor_schedd---condor_gridmana---12*[gahp_server]

Sometimes, however (usually during heavy load), one or two gahp_server processes become orphaned and drive CPU usage up towards 100%. The only (and simple) workaround is to kill these runaway processes. A newer Condor release for the lcg-RB is expected to improve stability.
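
A one-liner to spot and kill such processes is sketched below; the 90% CPU threshold is an arbitrary assumption, and in practice the list should be inspected before killing:

 # kill gahp_server processes currently using more than 90% CPU
 [root@lcgrb02 root]# ps -eo pid,pcpu,comm | awk '$3 == "gahp_server" && $2 > 90 {print $1}' | xargs -r kill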


Multiple cancellations of the same job

When one or more jobs are cancelled more than once, the edg-wl-renewd processes load the CPU significantly and the system appears overloaded. The only remedy is to keep restarting the edg-wl-proxyrenewal and edg-wl-wm services until the duplicate cancellations have disappeared from /var/edgwl/workload_manager/input.fl. This can take up to one restart of both services per repeated cancellation!
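
A sketch of that remedy as a loop; the format of input.fl is not described on this page, so detecting duplicates with sort/uniq is an assumption:

 # keep restarting the two services while input.fl still contains duplicate entries
 while [ -n "$(sort /var/edgwl/workload_manager/input.fl | uniq -d)" ]; do
     /etc/init.d/edg-wl-proxyrenewal restart
     /etc/init.d/edg-wl-wm restart
     sleep 60    # give the Workload Manager time to drain the queue
 done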

Other Resources