This is still only in outline towards the end but does include the changes you actually need to make, even if the detailed explanation of how it works is lacking.
Putting Existing Farms on the Testbed
This page describes a minimal set of modifications to an EDG Testbed Site and an existing Farm to make the Farm available on the Testbed. It assumes you are familiar with the way the EDG Testbed works, and with its installation procedure (the GridPP Testbed Support pages have an installation guide).
For our purposes, the key points about a Testbed Site are:
- Installation and configuration is done from an LCFG installation server. This is installed first and then used to define the properties of the other testbed elements.
- Jobs enter the site via the Computing Element. This provides a Globus gatekeeper and a PBS server, and publishes information about the current state of the site into the MDS information service.
- The CE also hosts the home directories of the pools of dynamic accounts allocated to users as jobs come in. Each pool account has a prefix dependent on its VO (eg atlas) and then three digits (so atlas001, atlas002, ...)
- Bulk data is managed by the Storage Element, which can use GDMP to import and export replicas of files. The SE shares its store of files by NFS, in a hierarchy under /flatfiles.
- The jobs are executed on the Worker Nodes. There can be any number of WNs, defined by those nodes to which PBS will send jobs.
- During execution, jobs are shepherded by a Job Manager process on the CE, which can report on their status to the Resource Brokering system.
- Jobs import their input sandbox from the Resource Broker, and export their output sandbox, using the globus-url-copy command.
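The pool account naming scheme in the list above (a VO prefix followed by three digits) can be sketched in shell. This is only an illustration: the function name pool_accounts and the count of 3 are assumptions, not part of the EDG software.

```shell
#!/bin/sh
# Print the pool account names for one VO: the VO prefix followed by a
# three-digit, zero-padded index (atlas001, atlas002, ...).
# pool_accounts is a hypothetical helper, not an EDG command.
# Usage: pool_accounts PREFIX COUNT
pool_accounts() {
    prefix=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s%03d\n' "$prefix" "$i"
        i=$((i + 1))
    done
}

# Example: list the first three accounts of the atlas pool
pool_accounts atlas 3
```

The same list is useful later on, when the pool accounts have to be added to the Farm and given home directory links.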
Due to the complex and sparsely documented configuration needed, it is usually impractical to install Testbed elements using anything other than the automated LCFG installation procedure. However, this requires that elements be installed from scratch.
For the existing BaBar and DZero/Atlas farms at Manchester, we developed a procedure for using our Testbed Site as a front end to the Farms, that required only minor modifications to the Farm and avoided the need to reinstall its PBS Server or any of the PBS Nodes.
Additionally, only a handful of changes were needed on the Testbed CE, which makes upgrading the Testbed Site in time with new EDG releases straightforward.
Our procedure relies on sharing NFS and PBS spaces between the Farm and the Testbed Site.
NFS is readily and transparently shared between all the machines involved, and with careful use of symbolic links and automounting, this has very little impact on the existing configuration of the Farm.
The Farms have existing PBS Servers which have been configured with queues appropriate to the production use of their resources. It is important not to interfere with, or side-step, this scheduling, and for this reason we could not use the PBS server on the CE to directly send jobs to the Farm PBS Nodes, even though this is technically possible.
Instead, we use the remote job submission and querying features of the PBS commands that the Globus and EDG software use to manipulate jobs. We set up queues on the BaBar (bfq) and DZero/Atlas (dfq) Farms, as well as the existing queue on the Testbed Site (gfq). By modifying Globus and EDG scripts, we were able to arrange remote (bfq/dfq) or local (gfq) job submissions in a way that was transparent to the EDG Resource Management system.
Specific changes: Testbed Site
- Share NFS disks
  - Add the domain names, wildcards or IP subnets of the Farm PBS Server and Nodes to the list of Worker Nodes in site-cfg.h on your LCFG server.
Specific changes: Existing Farm
- Add pool users
  - Add the pool account lines from /etc/passwd on the CE to the /etc/passwd on the Farm PBS Server and Farm PBS Nodes (if you use NIS, modify the NIS passwd map instead). You should not change the /home directory names for the users.
- Share NFS disks
If you have queues (gfq, bfq, dfq etc) to be advertised, create the following files for each queue:
/opt/edg/info/mds/etc/ldif/ce-static.ldif.gfq
Architecture: intel
OpSys: RH 6.2
MinPhysMemory: 512
MinLocalDiskSpace: 25000
NumSMPs: 6
MinSPUProcessors: 1
MaxSPUProcessors: 1
AverageSI00: 450
MinSI00: 450
MaxSI00: 450
AFSAvailable: FALSE
InboundIP: FALSE
OutboundIP: TRUE
RunTimeEnvironment: CMS-1.1.0
RunTimeEnvironment: ATLAS-3.2.1
RunTimeEnvironment: ALICE-3.07.01
RunTimeEnvironment: LHCb-1.1.1
RunTimeEnvironment: IDL-5.4
RunTimeEnvironment: CERN-MSS
RunTimeEnvironment: CMSIM-125
RunTimeEnvironment: EDG-TEST
/opt/edg/info/mds/etc/ldif/closese-gfq.ldif
dn: closeSE=gf19.hep.man.ac.uk
objectClass: CloseStorageElement
objectClass: DataGridTop
objectClass: DynamicObject
ceId: gf18.hep.man.ac.uk:2119/jobmanager-pbs-gfq
closeSE: gf19.hep.man.ac.uk
mountPoint: /flatfiles/
entryTtl: 3600
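Since one closese file is needed per queue and only the queue name in the ceId line differs, the files could be generated rather than edited by hand. The following is a minimal sketch: closese_ldif is a hypothetical helper, and it assumes every queue uses the same SE and the same /flatfiles/ mount point as the gfq example.

```shell
#!/bin/sh
# Print a closese LDIF entry for one queue, following the gfq example.
# closese_ldif is a hypothetical helper; it assumes all queues share the
# same close SE, mount point and TTL.
# Usage: closese_ldif QUEUE CE_HOST SE_HOST
closese_ldif() {
    queue=$1
    ce=$2
    se=$3
    cat <<EOF
dn: closeSE=${se}
objectClass: CloseStorageElement
objectClass: DataGridTop
objectClass: DynamicObject
ceId: ${ce}:2119/jobmanager-pbs-${queue}
closeSE: ${se}
mountPoint: /flatfiles/
entryTtl: 3600
EOF
}

# Example (run as root on the CE):
#   for q in gfq bfq dfq; do
#       closese_ldif "$q" gf18.hep.man.ac.uk gf19.hep.man.ac.uk \
#           > /opt/edg/info/mds/etc/ldif/closese-$q.ldif
#   done
```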
Modify /opt/edg/info/mds/sbin/skel/ce-globus.skel so that args: includes additional %QUEUE% suffixes on the arguments to the -static and -auth-users-from-grid-mapfile options. (args: and its arguments should all be on one long line)
dn: ceId=%CEID%,hn=%HOST%,%SITEDN%
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: %WP3_DEPLOY%/bin
base: ce-%BATCHSYSTEM%
args: -globus-path %GLOBUSPATH% -static %CE_LDIF%.%QUEUE% -closeses %CLOSESE_LDIF% -globus-config-file %GLOBUS_JOBMANAGER% -auth-users-from-grid-mapfile %GRIDMAP%.%QUEUE% -queue %QUEUE% -dn ceId=%CEID%,hn=%HOST%,%SITEDN% -cluster-batch-system-bin-path %BATCH_SYS_PATH% -ttl 120 -cluster %QUEUE%-server.localdomain
cachetime: 30
timelimit: 20
sizelimit: 10
Add lines like the following to /etc/hosts, with the IP numbers of the farm PBS servers (and the CE for the queues managed by the PBS server on the CE itself)
194.36.3.178 gfq-server.localdomain
194.36.3.93 bfq-server.localdomain
194.36.3.121 dfq-server.localdomain
Wherever GRIDMAP in /etc/globus.conf on the CE points, add additional grid-mapfiles or symbolic links with queue names as suffixes: eg /share/grid-security/grid-mapfile.gfq
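If all queues should accept the same users, the per-queue grid-mapfiles can simply be symbolic links to the main one. A minimal sketch, assuming that is what you want (link_gridmaps is a hypothetical helper, and existing per-queue files are left untouched):

```shell
#!/bin/sh
# Create GRIDMAP.QUEUE symbolic links next to the main grid-mapfile.
# link_gridmaps is a hypothetical helper. The links are relative, so they
# stay valid if the directory is NFS-mounted under a different path.
# Usage: link_gridmaps /path/to/grid-mapfile QUEUE...
link_gridmaps() {
    gridmap=$1
    shift
    for queue in "$@"; do
        link=$gridmap.$queue
        # Keep any existing per-queue mapfile or link as it is
        if [ -e "$link" ] || [ -L "$link" ]; then
            continue
        fi
        ln -s "$(basename "$gridmap")" "$link"
    done
}

# Example (run as root on the CE):
#   link_gridmaps /share/grid-security/grid-mapfile gfq bfq dfq
```

A queue that needs a restricted set of users can have a real file with the same name instead of a link.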
Changes to Globus wrapper script /opt/globus/libexec/globus-script-pbs-submit
status=`${qstat} -Q $grami_queue`
becomes (all one line)
status=`${qstat} -Q $grami_queue@$grami_queue-server.localdomain`
echo "#PBS -q $grami_queue" >> $PBS_JOB_SCRIPT
becomes (all one line)
echo "#PBS -q $grami_queue@$grami_queue-server.localdomain" >> $PBS_JOB_SCRIPT
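Both changes rely on the same convention: the queue name also selects the server, via the QUEUE-server.localdomain aliases in /etc/hosts. As a sketch of how the destination string is formed (pbs_destination is a hypothetical helper, not part of Globus or PBS):

```shell
#!/bin/sh
# Build the PBS destination "queue@server" for a queue, using the
# QUEUE-server.localdomain alias convention from /etc/hosts on the CE.
# pbs_destination is a hypothetical helper for illustration only.
# Usage: pbs_destination QUEUE
pbs_destination() {
    printf '%s@%s-server.localdomain\n' "$1" "$1"
}

# Example: check from the CE that a farm queue can be queried remotely
#   qstat -Q "$(pbs_destination bfq)"
```

Running the qstat check by hand for each queue is a quick way to confirm that the /etc/hosts aliases and the PBS permissions are in place before testing through Globus.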
Add all the queues to the LCFG site config, with the CE as the hostname: eg
#define SITE_CE_HOSTS_ CE_HOSTNAME:2119/jobmanager-pbs-gfq,CE_HOSTNAME:2119/jobmanager-pbs-bfq,CE_HOSTNAME:2119/jobmanager-pbs-dfq,
#define CE_QUEUE_ gfq,bfq,dfq
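With more than a few queues, the two lines can be generated from the list of queue names to avoid typing mistakes. This is a sketch only: site_queue_defines is a hypothetical helper, and it reproduces the format of the example above, including the trailing comma on the first line.

```shell
#!/bin/sh
# Print the SITE_CE_HOSTS_ and CE_QUEUE_ lines for a list of queues,
# in the same format as the hand-written example.
# site_queue_defines is a hypothetical helper, not part of LCFG.
# Usage: site_queue_defines QUEUE...
site_queue_defines() {
    hosts=
    queues=
    for queue in "$@"; do
        hosts="${hosts}CE_HOSTNAME:2119/jobmanager-pbs-${queue},"
        # Comma-separate the queue names, with no leading comma
        queues="${queues:+$queues,}$queue"
    done
    printf '#define SITE_CE_HOSTS_ %s\n' "$hosts"
    printf '#define CE_QUEUE_ %s\n' "$queues"
}

# Example: site_queue_defines gfq bfq dfq
```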
If you force an LCFG update on the CE by stopping and starting /etc/rc.d/init.d/lcfg.init and then stopping and starting /etc/rc.d/init.d/globus-mds the changes should propagate to the configuration files.
No changes are needed to PBS on the CE for the queues not managed by the PBS Server on the CE (ie the farm queues.)
Make sure the farm PBS Server and PBS Nodes are acceptable to the /etc/exports entry for /home and /share/grid-security on the CE and /flatfiles on the SE. (Edit the #define SITE_WN_HOSTS line in the LCFG site-cfg.h file.) It's also worth manually adding them to the /etc/exports on the LCFG server for /opt/local/linux, since it makes it much easier to manually install RPMs on the farm worker nodes.
On the farm PBS Server, make sure you can do PBS qsubs from the CE to the PBS Server in the normal PBS ways - eg with /etc/hosts.equiv
For the PBS Server and all of the PBS Nodes, add the CE /home and /share/grid-security and the SE /flatfiles either to /etc/fstab or, much better, to your automount map:
/etc/auto.master
/nfs /etc/auto.nfs
/etc/auto.nfs
gf-home -rw,suid gf-home.hep.man.ac.uk:/home
gf-flatfiles -rw,suid gf-flatfiles.hep.man.ac.uk:/flatfiles
gf-optlocal gf-optlocal.hep.man.ac.uk:/opt/local/linux
where gf-home, gf-flatfiles and gf-optlocal are aliases for the CE, SE and LCFG server (you could just use their canonical hostnames, but it avoids reconfiguring the farm nodes if you change CE etc.)
Add the pool accounts from the CE /etc/passwd to the /etc/passwd on all PBS Nodes and the PBS Server (if you are sure only certain pools will be used on the farm, you can choose to only add those.)
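One way to pick out just the pool account lines is to match the VO prefix followed by exactly three digits. A minimal sketch, assuming you have a copy of the CE passwd file to hand (pool_passwd_lines and the file name in the example are hypothetical):

```shell
#!/bin/sh
# Print the passwd lines of the pool accounts for the given VO prefixes.
# A pool account is matched as PREFIX followed by exactly three digits,
# so static accounts such as "atlasadm" are not selected.
# pool_passwd_lines is a hypothetical helper for illustration.
# Usage: pool_passwd_lines PASSWD_FILE PREFIX...
pool_passwd_lines() {
    passwd_file=$1
    shift
    for prefix in "$@"; do
        grep -E "^${prefix}[0-9]{3}:" "$passwd_file"
    done
}

# Example, with ce-passwd being a copy of /etc/passwd from the CE:
#   pool_passwd_lines ce-passwd atlas wpsix
```

Review the output before appending it to /etc/passwd on the farm machines, in case any UID is already in use there.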
For each pool account to be used, make a symbolic link under /home on the farm to /nfs/gf-home/USERNAME
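Creating the links is repetitive, so a small loop helps. The following is a sketch under the assumption that the CE /home is automounted at /nfs/gf-home as above; link_pool_homes is a hypothetical helper and it never replaces an existing directory or link.

```shell
#!/bin/sh
# Create HOME_DIR/USER -> NFS_HOME/USER symbolic links for pool accounts.
# link_pool_homes is a hypothetical helper. Existing entries are skipped,
# so real home directories on the farm are never replaced.
# Usage: link_pool_homes HOME_DIR NFS_HOME USER...
link_pool_homes() {
    home_dir=$1
    nfs_home=$2
    shift 2
    for user in "$@"; do
        link=$home_dir/$user
        if [ -e "$link" ] || [ -L "$link" ]; then
            continue
        fi
        ln -s "$nfs_home/$user" "$link"
    done
}

# Example (run as root on each farm machine):
#   link_pool_homes /home /nfs/gf-home atlas001 atlas002 atlas003
```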
Optional and fiddly: for tighter security, use the qmgr command on the PBS Server to set acl_users and deny everyone else (include any static accounts owned by non-Grid users). This stops people with access to the CE doing qsubs on to the farm themselves unless they are a legitimate farm user.
#
# Create and define queue bfq
#
create queue bfq
set queue bfq queue_type = Execution
set queue bfq max_running = 80
set queue bfq acl_user_enable = True
set queue bfq acl_users = -
set queue bfq acl_users += wpsix001
set queue bfq acl_users += wpsix002
set queue bfq acl_users += jonnynogrid
set queue bfq enabled = True
set queue bfq started = True
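With many pool accounts, the acl_users lines are tedious to type. As a sketch, they could be generated from a list of user names. queue_acl_commands is a hypothetical helper; it only prints the acl_user_enable line and the "+=" lines in the style of the listing above, and leaves the rest of the queue definition to you.

```shell
#!/bin/sh
# Print qmgr commands that enable the user ACL on a queue and add each
# named user to it. queue_acl_commands is a hypothetical helper; review
# its output before feeding it to qmgr.
# Usage: queue_acl_commands QUEUE USER...
queue_acl_commands() {
    queue=$1
    shift
    echo "set queue $queue acl_user_enable = True"
    for user in "$@"; do
        echo "set queue $queue acl_users += $user"
    done
}

# Example: print the commands for two pool accounts and one farm user
#   queue_acl_commands bfq wpsix001 wpsix002 jonnynogrid
```

The printed commands can then be entered in qmgr on the PBS Server once you are happy with them.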
On each farm node, add the CE as a clienthost to /usr/spool/PBS/mom_priv/config and make sure the usecp will work for the CE /home as well as the normal home directory of the farm:
$clienthost gf18
$clienthost gf18.hep.man.ac.uk
$usecp *:/ /
You should restart PBS on the nodes to make all this take effect.
With the LCFG /opt/local/linux mounted via automount as above, you need to install the following RPMs on the PBS Nodes for the EDG WP1 job submission to work (specifically, the sandbox handling):
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_common-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_openssl-gcc32dbg_rtl-0.9.6b-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_ssl_utils-gcc32dbg_rtl-2.1-21e.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gssapi_gsi-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gss_assist-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_io-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_ftp_control-gcc32dbg_rtl-1.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_ftp_client-gcc32dbg_rtl-1.2-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gass_transfer-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gass_copy-gcc32dbg_rtl-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_gass_copy-gcc32dbg_pgm-2.0-21.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/globus2_beta21/globus_user_env-noflavor_data-2.1-21b.i386.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CERN-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CERN-new-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CESNET-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CNRS-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CNRS-DataGrid-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_CNRS-Projets-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_DOESG-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_DOESG-Root-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_GermanGrid-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_Grid-Ireland-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_GridPP-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_INFN-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_LIP-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_NIKHEF-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_NorduGrid-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_Russia-0.10-1.noarch.rpm
/nfs/gf-optlocal/6.2/RPMS/security/ca_Spain-0.10-1.noarch.rpm
Alternatively, you can configure sshd on the CE to allow remote ssh commands and make a wrapper that looks like globus-url-copy. Something like this:
#!/bin/sh
ssh gf18 /opt/globus/bin/globus-url-copy $*
This should be installed as /opt/globus/bin/globus-url-copy on the PBS Worker Nodes.
This also means the PBS Nodes do not need direct access to the internet (ie a public IP address rather than a private IP or NAT).
Last modified Mon 21 April 2008