This page is still only an outline towards the end, but it does include the changes you actually need to make, even where the detailed explanation of how they work is lacking.

Putting Existing Farms on the Testbed

This page describes a minimal set of modifications to an EDG Testbed Site and an existing Farm to make the Farm available on the Testbed. It assumes you are familiar with the way the EDG Testbed works, and with its installation procedure (the GridPP Testbed Support pages have an installation guide).

For our purposes, the key points about a Testbed Site are:

  • Installation and configuration is done from an LCFG installation server. This is installed first and then used to define the properties of the other testbed elements.
  • Jobs enter the site via the Computing Element. This provides a Globus gatekeeper, a PBS server and publishes information about the current state of the site into the MDS information service.
  • The CE also hosts the home directories of the pools of dynamic accounts allocated to users as jobs come in. Each pool account has a prefix dependent on its VO (eg atlas) and then three digits (so atlas001, atlas002, ...)
  • Bulk data is managed by the Storage Element, which can use GDMP to import and export replicas of files. The SE shares its store of files by NFS, in a hierarchy under /flatfiles.
  • The jobs are executed on the Worker Nodes. There can be any number of WNs, defined by those nodes to which PBS will send jobs.
  • During execution, jobs are shepherded by a Job Manager process on the CE, which can report on their status to the Resource Brokering system.
  • Jobs import their input sandbox from the Resource Broker, and export their output sandbox, using the globus-url-copy command.
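
The pool account naming described in the list above (a VO prefix followed by three digits) can be sketched in a few lines of shell. This is only an illustration: the function name pool_accounts and its count argument are assumptions made for the example, not part of the EDG software.

```shell
#!/bin/sh
# Print the pool account names for one VO: the VO prefix followed by
# three digits (atlas001, atlas002, ...).
# pool_accounts and its "count" argument are hypothetical names.
pool_accounts() {
    vo=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s%03d\n' "$vo" "$i"
        i=$((i + 1))
    done
}

# Example: the first three atlas pool accounts.
pool_accounts atlas 3
```

The same list of names is what later steps need when adding passwd entries and home directory links on the Farm.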

Due to the complex and sparsely documented configuration needed, it is usually impractical to install Testbed elements using anything other than the automated LCFG installation procedure. However, this requires that elements be installed from scratch.

For the existing BaBar and DZero/Atlas farms at Manchester, we developed a procedure for using our Testbed Site as a front end to the Farms, which required only minor modifications to the Farm and avoided the need to reinstall its PBS Server or any of the PBS Nodes.

Additionally, only a handful of changes were needed on the Testbed CE, and this makes upgrading the Testbed Site in time with new EDG releases straightforward.

Our procedure relies on sharing NFS and PBS spaces between the Farm and the Testbed Site.

NFS is readily and transparently shared between all the machines involved, and with careful use of symbolic links and automounting, this has very little impact on the existing configuration of the Farm.

The Farms have existing PBS Servers which have been configured with queues appropriate to the production use of their resources. It is important not to interfere with, or side-step, this scheduling, and for this reason we could not use the PBS server on the CE to directly send jobs to the Farm PBS Nodes, even though this is technically possible.

Instead, we use the remote job submission and querying features of the PBS commands that the Globus and EDG software use to manipulate jobs. We set up queues on the BaBar (bfq) and DZero/Atlas (dfq) Farms, as well as the existing queue on the Testbed Site (gfq). By modifying Globus and EDG scripts, we were able to arrange remote (bfq/dfq) or local (gfq) job submissions in a way that was transparent to the EDG Resource Management system.

Specific changes: Testbed Site

Share NFS disks
Add the domain names, wildcards or IP subnets of the Farm PBS Server and Nodes to the list of Worker Nodes in site-cfg.h on your LCFG server.

Specific changes: Existing Farm

Add pool users
Add the pool account lines from /etc/passwd on the CE to the /etc/passwd on the Farm PBS Server and Farm PBS Nodes (if you use NIS, then modify the NIS passwd map instead.) You should not change the /home directory names for the users.
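
As a rough sketch of this step, the pool account lines can be picked out of a passwd file by matching login names that consist of a VO prefix and exactly three digits. The function name pool_passwd_lines is hypothetical, and the VO prefixes you pass must match the pools actually defined on your CE.

```shell
#!/bin/sh
# Print the passwd(5) lines whose login name is one of the given VO
# prefixes followed by exactly three digits. Reads passwd-format text
# on standard input. pool_passwd_lines is a hypothetical helper name.
pool_passwd_lines() {
    pattern=""
    for vo in "$@"; do
        # Build an alternation such as "atlas|wpsix".
        pattern="${pattern:+$pattern|}$vo"
    done
    grep -E "^(${pattern})[0-9]{3}:"
}

# Typical use, run on the CE (review the output before appending it to
# /etc/passwd on the Farm machines or to the NIS passwd map):
#   pool_passwd_lines atlas wpsix < /etc/passwd
```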

Share NFS disks

If you have queues gfq, bfq, dfq etc to be advertised, create the following files for each queue:

/opt/edg/info/mds/etc/ldif/ce-static.ldif.gfq

Architecture: intel
OpSys: RH 6.2
MinPhysMemory: 512
MinLocalDiskSpace: 25000
NumSMPs: 6
MinSPUProcessors: 1
MaxSPUProcessors: 1
AverageSI00: 450
MinSI00: 450
MaxSI00: 450
AFSAvailable: FALSE
InboundIP: FALSE
OutboundIP: TRUE
RunTimeEnvironment: CMS-1.1.0
RunTimeEnvironment: ATLAS-3.2.1
RunTimeEnvironment: ALICE-3.07.01
RunTimeEnvironment: LHCb-1.1.1
RunTimeEnvironment: IDL-5.4
RunTimeEnvironment: CERN-MSS
RunTimeEnvironment: CMSIM-125
RunTimeEnvironment: EDG-TEST

/opt/edg/info/mds/etc/ldif/closese-gfq.ldif

dn: closeSE=gf19.hep.man.ac.uk
objectClass: CloseStorageElement
objectClass: DataGridTop
objectClass: DynamicObject
ceId: gf18.hep.man.ac.uk:2119/jobmanager-pbs-gfq
closeSE: gf19.hep.man.ac.uk
mountPoint: /flatfiles/
entryTtl: 3600
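
Since one such file is needed per queue, it may be convenient to generate them from a template. The sketch below simply reproduces the entry above with the queue name substituted; the function name closese_ldif is hypothetical, and it assumes the same SE is close to every queue, which may not be true at your site.

```shell
#!/bin/sh
# Print a CloseStorageElement entry for one queue, following the
# example entry shown above. Arguments: CE host, SE host, queue name.
# closese_ldif is a hypothetical helper name.
closese_ldif() {
    ce=$1
    se=$2
    queue=$3
    cat <<EOF
dn: closeSE=${se}
objectClass: CloseStorageElement
objectClass: DataGridTop
objectClass: DynamicObject
ceId: ${ce}:2119/jobmanager-pbs-${queue}
closeSE: ${se}
mountPoint: /flatfiles/
entryTtl: 3600
EOF
}

# Example: print the entry for the bfq queue.
closese_ldif gf18.hep.man.ac.uk gf19.hep.man.ac.uk bfq
```

The output for each queue would then be redirected to the matching closese file under /opt/edg/info/mds/etc/ldif/.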

Modify /opt/edg/info/mds/sbin/skel/ce-globus.skel so that args: includes additional %QUEUE% suffixes on the arguments to the -static and -auth-users-from-grid-mapfile options. (args: and its arguments should all be on one long line)

dn: ceId=%CEID%,hn=%HOST%,%SITEDN%
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: %WP3_DEPLOY%/bin
base: ce-%BATCHSYSTEM%
args: -globus-path %GLOBUSPATH% -static %CE_LDIF%.%QUEUE% 
-closeses %CLOSESE_LDIF% -globus-config-file 
%GLOBUS_JOBMANAGER% -auth-users-from-grid-mapfile 
%GRIDMAP%.%QUEUE% -queue %QUEUE% -dn 
ceId=%CEID%,hn=%HOST%,%SITEDN% 
-cluster-batch-system-bin-path %BATCH_SYS_PATH% -ttl 120 
-cluster %QUEUE%-server.localdomain
cachetime: 30
timelimit: 20
sizelimit: 10

Add lines like the following to /etc/hosts, with the IP addresses of the farm PBS servers (and of the CE, for the queues managed by the PBS server on the CE itself):

194.36.3.178    gfq-server.localdomain
194.36.3.93     bfq-server.localdomain
194.36.3.121    dfq-server.localdomain
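
Because every queue name must resolve through a QUEUE-server.localdomain alias, a small check for missing entries can save some debugging. The following is a sketch only; missing_queue_hosts is a hypothetical helper that inspects a hosts-format file and prints the queues that have no alias in it.

```shell
#!/bin/sh
# Print each queue that has no QUEUE-server.localdomain alias in the
# given hosts-format file. Comment lines are ignored.
# missing_queue_hosts is a hypothetical helper name.
missing_queue_hosts() {
    hosts_file=$1
    shift
    for queue in "$@"; do
        name="$queue-server.localdomain"
        # Match the alias as a whole field after the IP address column.
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "$queue"
        fi
    done
}

# Typical use on the CE:
#   missing_queue_hosts /etc/hosts gfq bfq dfq
```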

Wherever GRIDMAP in /etc/globus.conf on the CE points, add additional grid-mapfiles or symbolic links with queue names as suffixes: eg /share/grid-security/grid-mapfile.gfq
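
If every queue should accept the same users, the per-queue grid-mapfiles can simply be symbolic links to the main one. A minimal sketch, assuming that is what you want; link_gridmaps is a hypothetical helper name, and it leaves any existing per-queue file alone.

```shell
#!/bin/sh
# Create GRIDMAP.QUEUE symbolic links next to the main grid-mapfile.
# Arguments: path of the main grid-mapfile, then the queue names.
# link_gridmaps is a hypothetical helper name.
link_gridmaps() {
    gridmap=$1
    shift
    for queue in "$@"; do
        # Do not overwrite an existing per-queue file or link.
        if [ ! -e "$gridmap.$queue" ] && [ ! -L "$gridmap.$queue" ]; then
            # Relative link, resolved in the grid-mapfile's own directory.
            ln -s "$(basename "$gridmap")" "$gridmap.$queue"
        fi
    done
}

# Typical use on the CE:
#   link_gridmaps /share/grid-security/grid-mapfile gfq bfq dfq
```

A queue that needs a different set of users would instead get its own real file in place of the link.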

Changes to Globus wrapper script /opt/globus/libexec/globus-script-pbs-submit

status=`${qstat} -Q $grami_queue`
becomes (all one line)
status=`${qstat} -Q $grami_queue@$grami_queue-server.localdomain`

and

echo "#PBS -q $grami_queue" >> $PBS_JOB_SCRIPT
becomes (all one line)
echo "#PBS -q $grami_queue@$grami_queue-server.localdomain" >> $PBS_JOB_SCRIPT
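
Both modified lines rely on the same convention: the PBS destination for a queue is the queue name, an @ sign, and the QUEUE-server.localdomain alias defined in /etc/hosts. The sketch below isolates that convention; pbs_destination is a hypothetical function name and is not part of the Globus script.

```shell
#!/bin/sh
# Build the queue@server destination used by the modified qstat and
# "#PBS -q" lines. pbs_destination is a hypothetical helper name.
pbs_destination() {
    printf '%s@%s-server.localdomain\n' "$1" "$1"
}

# Prints: bfq@bfq-server.localdomain
pbs_destination bfq
```

Because the server is derived from the queue name alone, adding another Farm only needs a new queue name and a matching /etc/hosts alias, with no further changes to the script.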

Add all the queues to the LCFG site config, with the CE as the hostname: eg

#define SITE_CE_HOSTS_ CE_HOSTNAME:2119/jobmanager-pbs-gfq,CE_HOSTNAME:2119/jobmanager-pbs-bfq,CE_HOSTNAME:2119/jobmanager-pbs-dfq,
#define CE_QUEUE_               gfq,bfq,dfq
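
With more than a few queues, the two defines are easy to get out of step. As a sketch, they can be generated from a single list of queue names; site_ce_defines is a hypothetical helper, and it keeps the trailing comma on SITE_CE_HOSTS_ exactly as in the example above.

```shell
#!/bin/sh
# Print the SITE_CE_HOSTS_ and CE_QUEUE_ defines for a list of queues,
# in the form shown above. site_ce_defines is a hypothetical helper.
site_ce_defines() {
    hosts=""
    queues=""
    for q in "$@"; do
        # Each host entry ends with a comma, as in the original example.
        hosts="${hosts}CE_HOSTNAME:2119/jobmanager-pbs-$q,"
        queues="${queues:+$queues,}$q"
    done
    printf '#define SITE_CE_HOSTS_ %s\n' "$hosts"
    printf '#define CE_QUEUE_ %s\n' "$queues"
}

# Example: print both defines for the three queues used on this page.
site_ce_defines gfq bfq dfq
```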

If you force an LCFG update on the CE by stopping and starting /etc/rc.d/init.d/lcfg.init and then stopping and starting /etc/rc.d/init.d/globus-mds the changes should propagate to the configuration files.

No changes are needed to PBS on the CE for the queues not managed by the PBS Server on the CE (ie the farm queues.)

Make sure the farm PBS Server and PBS Nodes are acceptable to the /etc/exports entry for /home and /share/grid-security on the CE and /flatfiles on the SE. (Edit the #define SITE_WN_HOSTS line in the LCFG site-cfg.h file.) It's also worth manually adding them to the /etc/exports entry for /opt/local/linux on the LCFG server, since it makes it much easier to manually install RPMs on the farm worker nodes.

On the farm PBS Server, make sure that jobs can be submitted with qsub from the CE to the PBS Server in the normal PBS ways, eg with /etc/hosts.equiv.

For the PBS Server and all of the PBS Nodes, add the CE /home and /share/grid-security and SE /flatfiles either to /etc/fstab or much better, to your automount map:

/etc/auto.master

/nfs /etc/auto.nfs

/etc/auto.nfs

gf-home		-rw,suid	gf-home.hep.man.ac.uk:/home
gf-flatfiles	-rw,suid	gf-flatfiles.hep.man.ac.uk:/flatfiles
gf-optlocal		      gf-optlocal.hep.man.ac.uk:/opt/local/linux
where gf-home, gf-flatfiles and gf-optlocal are aliases for the CE, SE and LCFG server (you could just use their canonical hostnames, but using aliases avoids reconfiguring the farm nodes if you change the CE etc.)

Add the pool accounts from the CE /etc/passwd to the /etc/passwd on all PBS Nodes and the PBS Server (if you are sure only certain pools will be used on the farm, you can choose to only add those.)

For each pool account to be used, make a symbolic link under /home on the farm to /nfs/gf-home/USERNAME
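
Creating the links by hand is tedious for large pools, so a loop helps. The sketch below assumes the automount layout shown above (CE home directories visible under /nfs/gf-home); link_pool_homes is a hypothetical helper name, and the first argument lets you try it against a scratch directory before using /home.

```shell
#!/bin/sh
# For each pool account, create HOME_ROOT/USERNAME as a symbolic link
# to /nfs/gf-home/USERNAME. Arguments: home root, then account names.
# link_pool_homes is a hypothetical helper name.
link_pool_homes() {
    home_root=$1
    shift
    for user in "$@"; do
        # Never replace an existing directory, file or link.
        if [ ! -e "$home_root/$user" ] && [ ! -L "$home_root/$user" ]; then
            ln -s "/nfs/gf-home/$user" "$home_root/$user"
        fi
    done
}

# Typical use on a farm node, as root:
#   link_pool_homes /home atlas001 atlas002 atlas003
```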

Optional and fiddly: for tighter security, use the qmgr command on the PBS Server to set acl_users and deny everyone else (remember to include any static accounts owned by non-Grid users). This stops people with access to the CE doing qsubs on to the farm themselves unless they are a legitimate farm user.

#
# Create and define queue bfq
#
create queue bfq
set queue bfq queue_type = Execution
set queue bfq max_running = 80
set queue bfq acl_user_enable = True
set queue bfq acl_users = -
set queue bfq acl_users += wpsix001
set queue bfq acl_users += wpsix002
set queue bfq acl_users += jonnynogrid
set queue bfq enabled = True
set queue bfq started = True
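
The acl_users lines are the fiddly part, since there is one per allowed account. As a sketch, they can be generated from a list of user names in the same form as the example above; queue_acl_directives is a hypothetical helper name, and the output should be reviewed before it is given to qmgr.

```shell
#!/bin/sh
# Print the qmgr ACL directives for one queue, following the example
# above: enable the ACL, reset the list, then add each user.
# queue_acl_directives is a hypothetical helper name.
queue_acl_directives() {
    queue=$1
    shift
    printf 'set queue %s acl_user_enable = True\n' "$queue"
    printf 'set queue %s acl_users = -\n' "$queue"
    for user in "$@"; do
        printf 'set queue %s acl_users += %s\n' "$queue" "$user"
    done
}

# Example: pool accounts plus one static, non-Grid farm user.
queue_acl_directives bfq wpsix001 wpsix002 jonnynogrid
```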

On each farm node, add the CE as a clienthost to /usr/spool/PBS/mom_priv/config and make sure the usecp will work for the CE /home as well as the normal home directory of the farm:

$clienthost gf18
$clienthost gf18.hep.man.ac.uk
$usecp *:/ /

You should restart PBS on the nodes to make all this take effect.

With the LCFG /opt/local/linux area mounted via automount as above, you need to install the following RPMs on the PBS Nodes for the EDG WP1 job submission to work (specifically, the sandbox handling):

/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_common-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_openssl-gcc32dbg_rtl-0.9.6b-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_ssl_utils-gcc32dbg_rtl-2.1-21e.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gssapi_gsi-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gss_assist-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_io-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_ftp_control-gcc32dbg_rtl-1.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_ftp_client-gcc32dbg_rtl-1.2-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gass_transfer-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gass_copy-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gass_copy-gcc32dbg_pgm-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_user_env-noflavor_data-2.1-21b.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CERN-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CERN-new-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CESNET-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CNRS-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CNRS-DataGrid-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CNRS-Projets-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_DOESG-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_DOESG-Root-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_GermanGrid-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_Grid-Ireland-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_GridPP-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_INFN-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_LIP-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_NIKHEF-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_NorduGrid-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_Russia-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_Spain-0.10-1.noarch.rpm

Alternatively, you could configure sshd on the CE to allow remote ssh commands and make a wrapper that looks like globus-url-copy. Something like this:

#!/bin/sh
# Run the real globus-url-copy on the CE (gf18), passing on all arguments.
ssh gf18 /opt/globus/bin/globus-url-copy "$@"
and installed as /opt/globus/bin/globus-url-copy on the PBS Worker Nodes.

This also means the PBS Nodes do not need direct access to the internet (ie they do not need public IP addresses, and can sit on a private network or behind NAT).


Last modified Mon 21 April 2008