The [[RAL Tier1]] runs a Workload Management System (WMS) and Logging and Bookkeeping (LB) service on 3 glite-WMS and 2 glite-LB servers. Each WMS server uses an internal load-balancing mechanism to access both LB servers.

==Service Endpoints==

A list of the VOs supported by the WMS servers can be obtained with

 ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
 -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=org.glite.wms.WMProxy)' \
 GlueServiceAccessControlRule

Similarly, for the LB servers:

 ldapsearch -x -H ldap://site-bdii.gridpp.rl.ac.uk:2170 \
 -b 'Mds-vo-name=RAL-LCG2,o=Grid' '(GlueServiceType=org.glite.lb.Server)' \
 GlueServiceAccessControlRule
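To check whether one particular VO is supported, the output of the queries above can be filtered. The snippet below is a sketch working on a saved, fabricated sample of the query output (the file <tt>acl.sample</tt> and its contents are illustrative); run the real ldapsearch against site-bdii.gridpp.rl.ac.uk for live data.

```shell
# Sketch only: 'acl.sample' is a fabricated stand-in for saved ldapsearch
# output; the attribute name matches the queries above.
vo="t2k"
cat > acl.sample <<'EOF'
GlueServiceAccessControlRule: t2k
GlueServiceAccessControlRule: biomed
GlueServiceAccessControlRule: dteam
EOF
if grep -q "^GlueServiceAccessControlRule: ${vo}$" acl.sample; then
  echo "$vo is supported"
else
  echo "$vo is not supported"
fi
```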

As of 15 April 2009:

WMS01 (lcgwms01.gridpp.rl.ac.uk) and WMS02 (lcgwms02.gridpp.rl.ac.uk) accept jobs only from the LHC VOs (ALICE, ATLAS, CMS and LHCb) plus dteam and ops.

WMS03 (lcgwms03.gridpp.rl.ac.uk) accepts jobs from non-LHC VOs only (biomed, zeus, hone, cdf, dzero, babar, pheno, t2k, esr, ilc, magic, minos.vo.gridpp.ac.uk, mice, fusion, geant4, cedar, manmace, gridpp, ngs.ac.uk, camont, totalep, vo.southgrid.ac.uk, vo.northgrid.ac.uk, vo.scotgrid.ac.uk, supernemo.vo.eu-egee.org, na48, vo.nanocmos.ac.uk, vo.londongrid.ac.uk) plus dteam and ops.

LB01 (lcglb01.gridpp.rl.ac.uk) and LB02 (lcglb02.gridpp.rl.ac.uk) are used by all three WMS servers mentioned above, and are therefore for general use.

==Basic Usage==
A user interface can be configured to use any of the WMS servers (in the example below, 'lcgwms01' can be replaced with 'lcgwms02' or 'lcgwms03'):

 # /opt/glite/etc/dteam/glite_wmsui.conf
 [
 NSAddresses = {"lcgwms01.gridpp.rl.ac.uk"};
 MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
 VirtualOrganisation = "dteam";
 LBAddresses = {};
 HLRLocation = "";
 ]

 # /opt/glite/etc/dteam/glite_wms.conf
 [
 OutputStorage = "/tmp/jobOutput";
 JdlDefaultAttributes = [
 RetryCount = 3;
 rank = - other.GlueCEStateEstimatedResponseTime;
 PerusalFileEnable = false;
 AllowZippedISB = true;
 requirements = other.GlueCEStateStatus == "Production";
 ShallowRetryCount = 10;
 SignificantAttributes = {"Requirements", "Rank", "FuzzyRank"};
 MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
 ];
 WMProxyServiceDiscoveryType = "org.glite.wms.wmproxy";
 virtualorganisation = "dteam";
 ErrorStorage = "/tmp";
 EnableServiceDiscovery = true;
 ListenerStorage = "/tmp";
 LBServiceDiscoveryType = "org.glite.lb.server";
 WMProxyEndpoints = {"https://lcgwms01.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server"};
 ]

Check which CEs match a JDL job with glite-wms-job-list-match, then submit the job with glite-wms-job-submit.
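As a minimal illustration, the sketch below writes a trivial JDL and shows the match/submit commands; the file name and executable are illustrative, and the submission commands are left as comments since they need a configured UI and a valid proxy.

```shell
# Write a minimal JDL for a trivial job (illustrative attribute values).
cat > hello.jdl <<'EOF'
[
Executable = "/bin/hostname";
StdOutput = "hello.out";
StdError = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
]
EOF
# On a configured UI with a valid proxy (not executed here):
#   glite-wms-job-list-match -a hello.jdl
#   glite-wms-job-submit -a hello.jdl
```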

==Service Monitoring==
* [http://ganglia.gridpp.rl.ac.uk/ganglia/?c=Services_Grid&h=lcgwms01.gridpp.rl.ac.uk Ganglia Host Level Monitoring lcgwms01]
* [http://ganglia.gridpp.rl.ac.uk/ganglia/?c=Services_Grid&h=lcgwms02.gridpp.rl.ac.uk Ganglia Host Level Monitoring lcgwms02]
* [http://ganglia.gridpp.rl.ac.uk/ganglia/?c=Services_Grid&h=lcgwms03.gridpp.rl.ac.uk Ganglia Host Level Monitoring lcgwms03]
* [http://ganglia.gridpp.rl.ac.uk/ganglia/?c=Services_Grid&h=lcglb01.gridpp.rl.ac.uk Ganglia Host Level Monitoring lcglb01]
* [http://ganglia.gridpp.rl.ac.uk/ganglia/?c=Services_Grid&h=lcglb02.gridpp.rl.ac.uk Ganglia Host Level Monitoring lcglb02]
* [http://ganglia.gridpp.rl.ac.uk/cgi-bin/ganglia-rbs/wms-page.pl?r=day WMS Metrics]

==Local Deployment Information==

The RAL Tier1 glite-WMS servers are deployed as follows:

lcgwms01 and lcgwms02.gridpp.rl.ac.uk for LHC VO job submissions only

lcgwms03.gridpp.rl.ac.uk for non-LHC VO job submissions only

LHC VO users can load-balance across the two WMS servers with the appropriate configuration when submitting jobs. At the UI level (or in a central job submission mechanism, if any):

1. <tt>/opt/glite/etc/$VO/glite_wmsui.conf</tt> should contain

 NSAddresses = {"lcgwms01.gridpp.rl.ac.uk lcgwms02.gridpp.rl.ac.uk"};
 LBAddresses = {};

2. <tt>/opt/glite/etc/$VO/glite_wms.conf</tt> should contain

 WMProxyEndpoints = {"https://lcgwms01.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server",
 "https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server"};

In theory (and as confirmed by tests), the glite-wms-job-submit command picks a WMS at random from the list. If that WMS fails to accept the job, the next one is tried, and so on. Once the job has been submitted successfully, it stays tied to the WMS that accepted it.
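The behaviour described above can be sketched as follows. This is a stub illustration only, not the real client code: <tt>try_submit</tt> is a hypothetical stand-in for the WMProxy call, with lcgwms01 simulated as refusing the job.

```shell
# Stub: pretend lcgwms01 refuses the job; any other endpoint accepts it.
try_submit() {
  case "$1" in
    *lcgwms01*) return 1 ;;                              # simulated refusal
    *) echo "submitted via $1" | tee submit.log; return 0 ;;
  esac
}
endpoints="https://lcgwms01.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server"
# Pick endpoints in random order; fall through to the next one on failure.
for ep in $(printf '%s\n' "$endpoints" | shuf); do
  try_submit "$ep" && break
done
```

Whatever order the shuffle produces, the job always ends up on lcgwms02 here, because the stub rejects lcgwms01.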

The LB servers (lcglb01 and lcglb02) are accessed in a load-balanced manner by all WMS servers, via an internal configuration.

'''How to 'drain' an LB?'''

If an LB suffers, say, a RAID disk failure, it can be useful to stop the WMSes from sending jobs to it, easing the load until the faulty disk is replaced. To do that, edit /opt/glite/etc/glite_wms.conf and remove the degraded LB server from the line

 LBServer = {"lcglb01.gridpp.rl.ac.uk:9000","lcglb02.gridpp.rl.ac.uk:9000"};

in the WorkloadManagerProxy section.

Then restart the WMProxy for the change to take effect immediately; otherwise existing WMProxy processes will keep using the old value until they exit after MaxServedRequests = 50.
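The edit can be done with a one-line sed. The sketch below works on a local sample file rather than the live /opt/glite/etc/glite_wms.conf, so it can be tried safely; verify the result before installing it on a WMS host.

```shell
# Recreate the relevant config line in a sample file (the real file is
# /opt/glite/etc/glite_wms.conf on the WMS hosts).
printf '%s\n' 'LBServer = {"lcglb01.gridpp.rl.ac.uk:9000","lcglb02.gridpp.rl.ac.uk:9000"};' \
  > glite_wms.conf.sample
# Remove the degraded LB (lcglb01 in this example) from the list.
sed -i 's/"lcglb01\.gridpp\.rl\.ac\.uk:9000",//' glite_wms.conf.sample
cat glite_wms.conf.sample
```

After installing the edited file, restart the WMProxy service on the WMS host so running processes pick up the change.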

==Known Problems==

'''Bulk submissions'''

Although bulk job submission was one of the main reasons for developing and deploying glite-WMS, bulk submissions of at most 50 jobs are currently recommended: a known bug ([https://savannah.cern.ch/bugs/index.php?32345 bug #32345]) causes submissions of more than 50 jobs to fail.
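Until the bug is fixed, larger workloads can be split into batches of at most 50. The sketch below is hypothetical: <tt>submit_batch</tt> is a stub standing in for the real per-batch submission, and nothing here talks to a WMS.

```shell
# Stub: a real script would call glite-wms-job-submit for each JDL in the batch.
submit_batch() { echo "submitting a batch of $1 jobs" >> batches.log; }
: > batches.log
total=120                              # e.g. 120 jobs to submit
while [ "$total" -gt 0 ]; do
  n=$(( total < 50 ? total : 50 ))     # never more than 50 jobs per batch
  submit_batch "$n"
  total=$(( total - n ))
done
cat batches.log                        # two batches of 50, then one of 20
```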


'''SandboxDir output file size'''

Currently no limit can be imposed on the size of job output files. Users are therefore asked to handle output files carefully, and to retrieve them (i.e. with glite-wms-job-output) once the job has terminated.


'''No compatible resources'''

If a job fails and the WMS finds the job's "token" file still present (in the job's sandbox area), it means the job exited before the WMS job wrapper was started. In that case the WMS can try a shallow resubmission (if allowed by the JDL), but that fails because of [https://savannah.cern.ch/bugs/?28235 this bug]. When a resubmission happens, previously used CEs are not considered at all, so if there are no "new" CEs the job is aborted because "no compatible resources were found".

* [http://glite.web.cern.ch/glite/packages/R3.1/deployment/glite-WMS/glite-WMS-known-issues.asp Other gLite WMS known issues]

==Other Resources==

* [http://glite.web.cern.ch/glite/wms The Workload Management Subsystem (WMS)]
* [http://glite.web.cern.ch/glite/lb The Logging and Bookkeeping Subsystem (LB)]
* [http://web.infn.it/gLiteWMS/ gLite WMS]
* [http://egee.cesnet.cz/en/JRA1/LB/ EGEE - Logging and Bookkeeping]
* [http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/lb_install.shtml LB Server Quick Installation Guide]
* [https://twiki.cern.ch/twiki/bin/view/FIOgroup/ScLCGWms31ConfigVO Specific configuration on gLite WMS and LB 3.1 nodes for each VO]
* [http://egee.cesnet.cz/cvsweb/LB/documentation.html Logging and Bookkeeping Documentation]
* [https://edms.cern.ch/file/572489/1/EGEE-JRA1-TEC-572489-WMS-guide-v0-3.pdf WMS User Guide]
* [https://twiki.cern.ch/twiki/bin/view/LCG/GLiteWMSTroubleshooting WMS Troubleshooting Guide]

[[Category:Workload Management]]
[[Category:RAL Tier1]]