RAL Tier1 Workload Management
Important note: As of 1 December 2008, the RAL Tier1 no longer offers an lcg-RB service (the servers lcgrb01, lcgrb02 and lcgrb03.gridpp.rl.ac.uk have been decommissioned). The information below is out of date and will be replaced with proper gLite WMS/LB documentation in early January 2009. In the meantime, users can submit jobs to lcgwms01 and lcgwms02.gridpp.rl.ac.uk using the glite-wms-job-* tools.
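As a sketch of the interim workflow (the WMProxy endpoint below follows the standard gLite form on port 7443 and has not been verified for these hosts; -a requests automatic proxy delegation):

```
glite-wms-job-submit -a \
    -e https://lcgwms01.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server \
    HelloWorld.jdl
```

glite-wms-job-list-match accepts the same -a and -e options and can be used to check matching resources before submitting.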
Service Endpoints
The RAL Tier1 runs an LCG Workload Management System (Resource Broker) on three machines: lcgrb01.gridpp.rl.ac.uk, lcgrb02.gridpp.rl.ac.uk and lcgrb03.gridpp.rl.ac.uk.
A list of the VOs supported by the RBs can be obtained with:

ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
    -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=ResourceBroker)' \
    GlueServiceAccessControlRule
As of 15 October 2008, the supported VOs were:

GlueServiceAccessControlRule: atlas
GlueServiceAccessControlRule: alice
GlueServiceAccessControlRule: lhcb
GlueServiceAccessControlRule: cms
GlueServiceAccessControlRule: biomed
GlueServiceAccessControlRule: zeus
GlueServiceAccessControlRule: hone
GlueServiceAccessControlRule: cdf
GlueServiceAccessControlRule: dzero
GlueServiceAccessControlRule: babar
GlueServiceAccessControlRule: pheno
GlueServiceAccessControlRule: t2k
GlueServiceAccessControlRule: esr
GlueServiceAccessControlRule: ilc
GlueServiceAccessControlRule: magic
GlueServiceAccessControlRule: minos.vo.gridpp.ac.uk
GlueServiceAccessControlRule: mice
GlueServiceAccessControlRule: dteam
GlueServiceAccessControlRule: fusion
GlueServiceAccessControlRule: geant4
GlueServiceAccessControlRule: cedar
GlueServiceAccessControlRule: manmace
GlueServiceAccessControlRule: gridpp
GlueServiceAccessControlRule: ngs.ac.uk
GlueServiceAccessControlRule: camont
GlueServiceAccessControlRule: totalep
GlueServiceAccessControlRule: vo.southgrid.ac.uk
GlueServiceAccessControlRule: vo.northgrid.ac.uk
GlueServiceAccessControlRule: vo.scotgrid.ac.uk
GlueServiceAccessControlRule: supernemo.vo.eu-egee.org
GlueServiceAccessControlRule: na48
GlueServiceAccessControlRule: vo.nanocmos.ac.uk
GlueServiceAccessControlRule: vo.londongrid.ac.uk
GlueServiceAccessControlRule: ops
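To work with this list programmatically, the rule lines can be reduced to bare VO names. A minimal sketch (list_vos is an illustrative helper, not part of any LCG tool; pipe the ldapsearch command above into it):

```shell
# Print just the VO names, one per line, from ldapsearch output on stdin.
# (list_vos is an illustrative helper, not part of any LCG tool.)
list_vos() {
    awk '/^GlueServiceAccessControlRule:/ { print $2 }' | sort
}
```

For example, `ldapsearch ... | list_vos | grep -x t2k` checks whether a single VO is supported.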
Basic Usage
A user interface can be configured to use any of these Resource Brokers (in the examples below, 'lcgrb01' can be replaced with 'lcgrb02' or 'lcgrb03'):
# edg_wl_ui.conf
[
  VirtualOrganisation = "dteam";
  NSAddresses = "lcgrb01.gridpp.rl.ac.uk:7772";
  LBAddresses = "lcgrb01.gridpp.rl.ac.uk:9000";
  MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"
]
# edg_wl_ui_cmd_var.conf
[
  rank = - other.GlueCEStateEstimatedResponseTime;
  requirements = other.GlueCEStateStatus == "Production";
  RetryCount = 3;
  ErrorStorage = "/tmp";
  OutputStorage = "/tmp/jobOutput";
  ListenerPort = 44000;
  ListenerStorage = "/tmp";
  LoggingTimeout = 30;
  LoggingSyncTimeout = 30;
  LoggingDestination = "lcgrb01.gridpp.rl.ac.uk:9002";
  NSLoggerLevel = 0;
  DefaultLogInfoLevel = 0;
  DefaultStatusLevel = 0;
  DefaultVo = "unspecified";
]
And finally, check which resources match a job description with:

$ edg-job-list-match --config edg_wl_ui.conf \
    --config-vo edg_wl_ui_cmd_var.conf HelloWorld.jdl
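The HelloWorld.jdl referenced above is not shown on this page; a minimal sketch of such a job description (the executable and file names are illustrative):

```
# HelloWorld.jdl
[
  Executable = "/bin/echo";
  Arguments = "Hello World";
  StdOutput = "hello.out";
  StdError = "hello.err";
  OutputSandbox = {"hello.out", "hello.err"};
]
```

Once the list-match output looks sensible, the job can be submitted and tracked with the matching edg-job-* tools, e.g.:

```
$ edg-job-submit --config edg_wl_ui.conf \
    --config-vo edg_wl_ui_cmd_var.conf HelloWorld.jdl
$ edg-job-status <jobId>
$ edg-job-get-output <jobId>
```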
Service Monitoring
- Ganglia Host Level Monitoring lcgrb01
- Ganglia Host Level Monitoring lcgrb02
- Ganglia Host Level Monitoring lcgrb03
The ganglia plots also indicate the number of jobs currently held within the logging and bookkeeping service in each of the following states.
Job State | Plot Name | Description |
ABORTED | jobs_aborted | Aborted by system (at any stage). |
CANCELLED | jobs_cancelled | Cancelled by user. |
CLEARED | jobs_cleared | Output transferred back to user and freed. |
DONE | jobs_done | Execution finished, output is available. |
READY | jobs_ready | Matching resources found. |
RUNNING | jobs_running | Executable is running. |
SCHEDULED | jobs_scheduled | Accepted by LRMS queue. |
SUBMITTED | jobs_submitted | Submitted by the user via the User Interface. |
WAITING | jobs_waiting | Accepted by WMS, waiting for resource allocation. |
- RB/WMS Monitoring (thanks to Yvan Calas - CERN)
- RB/WMS Monitoring Tool HowTo
Alarms:
- If FD (the number of file descriptors opened by the edg-wl-log_monitor process) turns red (i.e. too large), the following procedure is needed:
1. Edit /etc/cron.d/edg-wl-check-daemons and comment out the cron job.
2. Run:

   /etc/init.d/edg-wl-lm stop
   cd /var/edgwl/logmonitor/CondorG.log/
   find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
   cd /
   /etc/init.d/edg-wl-lm start

3. Edit /etc/cron.d/edg-wl-check-daemons and uncomment the cron job.
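The log rotation in step 2 can be captured in a small function, e.g. to dry-run it on a scratch directory before touching the live log monitor (rotate_condorg_logs is an illustrative name; the edg-wl-lm daemon must still be stopped around the real run, as described above):

```shell
# Move CondorG logs untouched for more than 30 days into ./recycle.
# rotate_condorg_logs is an illustrative helper, not an LCG tool;
# run it only while the edg-wl-lm daemon is stopped.
rotate_condorg_logs() {
    logdir=$1
    mkdir -p "$logdir/recycle"
    # -maxdepth 1 keeps the search out of the recycle directory itself
    find "$logdir" -maxdepth 1 -name 'CondorG.*.log' -mtime +30 \
        -print -exec mv {} "$logdir/recycle/" \;
}
```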