Cambridge Work :: Setting-up Condor for LCG in multi-cultural environment

From GridPP Wiki

Before we start

This document is entirely based on my work configuring our LCG farm, using Condor as our principal batch system, and I thank TsanLung Hsieh for helping me in various ways to do that.

The basic installation and configuration of Condor is outside the scope of this document; it can be done your own way. We use our lcg-CE as the Condor central manager, which also hosts the Collector, Negotiator and Scheduler services. The CE is also the submit host for LCG/EGEE jobs. Our cluster is not only for LCG users: we are also part of the CamGrid project, so it is used by Cambridge local users and CamGrid users at the same time. This is the baseline that we maintain for our cluster:

  • Condor runs on private network
  • Always use latest version from Stable series
  • Jobs run as dedicated condor VMx_USER
  • One VMx_USER per CPU core

Running Jobs

Setting up VMx_USER

Under Linux/Unix, Condor will by default run jobs as user "nobody", which is not really recommended, since forked processes from Vanilla universe jobs may remain active even after the parent Condor job has terminated. So we run jobs as dedicated users instead, which ensures that no forked processes survive once the parent process exits. These users are known as VMx_USERs, where "x" is an integer, and there should be one such user per virtual machine on the host. For a single-processor box, we just add VM1_USER to the local configuration file:

VM1_USER                   = condor_user1
EXECUTE_LOGIN_IS_DEDICATED = TRUE

NOTE: The syntax for dedicated user accounts has changed significantly in the Condor 7.0.x series. The new "slot" nomenclature has been introduced and the following syntax should be used instead.

SLOT1_USER                              = condor_user1
STARTER_ALLOW_RUNAS_OWNER               = False
DEDICATED_EXECUTE_ACCOUNT_REGEXP        = condor_user[0-9]+


First, we need to create those users on the Condor execute hosts. I created them as system users (i.e. uid less than 500), without a home directory and with the login shell set to /bin/false. Since the idea is to add one VMx_USER account per virtual machine [CPU core], this script will do the job:

#!/bin/bash

VM_USERS=`cat /proc/cpuinfo | grep processor | wc -l`

# Add group for Condor
NM_GRP="condor"
ID_GRP="499"
groupadd -r -g ${ID_GRP} ${NM_GRP}

# Add user for Condor
for ix in `seq ${VM_USERS}`; do
    user="condor_user${ix}"
    if ! (useradd -r -p "*NP*" -c "mapped user for group ${NM_GRP}" -u 10${ix} -g ${ID_GRP} -s /bin/false ${user}); then
        echo "User: $user (uid 10${ix}) could not be created."
        exit 1
    fi
done

# Modify the local configuration file 
for qx in `seq ${VM_USERS}`; do
cat << EOF >> `condor_config_val LOCAL_CONFIG_FILE`
VM${qx}_USER                    = condor_user${qx}
EOF

done

cat << EOF >> `condor_config_val LOCAL_CONFIG_FILE`
EXECUTE_LOGIN_IS_DEDICATED      = TRUE
EOF

condor_reconfig

NOTE: Every time we change something in the Condor local config file, we need to run condor_reconfig for those changes to take effect.

where condor_user1, condor_user2, ... are the names of the VMx_USERs. The name of the group and of the VMx_USERs can be anything you like; they don't have to be "condor" and/or "condor_user1", respectively.
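As a quick sanity check before running the script (a trivial sketch, nothing site-specific), the per-core count that the script derives from /proc/cpuinfo can be inspected on its own:

```shell
#!/bin/bash
# Count CPU cores the same way the script above does;
# this is how many VMx_USER accounts will be created on this host.
cores=$(grep -c '^processor' /proc/cpuinfo)
echo "This host needs ${cores} VMx_USER account(s)"
```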

Setting up the environment

NOTE: This probably isn't the best workaround and may not be recommended by the middleware; testing an alternative is in progress.

As I mentioned earlier, our goal is to configure the farm for a number of different types of users, so we don't want to set up the LCG/EGEE-specific environment variables for every single job the WN runs. Instead, we decided to set those variables at the user level via the LCG job wrappers. Let's see what that means, taking a real example from a running job. On our CE:

[root@serv03 root]# condor_q

-- Submitter: serv03.hep.phy.cam.ac.uk : <172.24.116.151:9579> : serv03.hep.phy.cam.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
91741.0   calice014      10/18 16:48   3+20:02:10 R  0   58.6 data              
91742.0   calice014      10/18 16:50   3+20:02:00 R  0   58.6 data              
.......
.......          
92676.0   atlas095       10/22 12:17   0+00:52:25 R  0   97.7 data              
92680.0   atlas095       10/22 12:58   0+00:06:25 R  0   9.8  data              

16 jobs; 1 idle, 15 running, 0 held

Using the -long option, we can get a description of the queried jobs by printing the entire job ClassAd. We take the last one (job id #92680.0) as an example and look for "Cmd" (the name of the executable):

[root@serv03 root]# condor_q 92680 -long | grep Cmd
Cmd = "/home/atlas095/.globus/.gass_cache/local/md5/a7/05bbcec57cde9681c373b86f8361f2/md5/06/cd8bbef6e4f1835f22fdf74bf84be8/data"

[root@serv03 root]# less /home/atlas095/.globus/.gass_cache/local/md5/a7/05bbcec57cde9681c373b86f8361f2/md5/06/cd8bbef6e4f1835f22fdf74bf84be8/data
#!/bin/bash
bootstrap=`mktemp /tmp/bootstrap.XXXXXX`; chmod 700 $bootstrap
cat >> $bootstrap <<EOFbs
#!/usr/bin/perl -w
use strict;
use Fcntl;
.........
.........
EOFbs
$bootstrap /home/atlas095/ serv03.hep.phy.cam.ac.uk /home/atlas095/.globus/....<A_LONG_COMMAND>....:LRMS=000000:APP=000000
rm $bootstrap

The script that sets up the LCG environment is glite_setenv.*sh (location: /etc/glite/profile.d/ on the WNs), and it gets sourced from the .bashrc of every pool user. If jobs run as their actual owner (as on most PBS/Torque sites) and all the pool accounts (configured by YAIM) are present on the WNs, then this happens automatically. Since that is not the case here, we need to set the LCG environment ourselves, by sourcing glite_setenv.sh (.csh for C shell), and only for the LCG/EGEE jobs.
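The mechanism can be demonstrated in isolation (a stand-alone sketch: the temporary env file and the VO_DTEAM_SW_DIR variable below are made up, standing in for glite_setenv.sh and the real LCG variables). Prepending "source <env-file>;" to the command line is all it takes to give the job its environment before it starts:

```shell
#!/bin/bash
# Stand-in for /etc/glite/profile.d/glite_setenv.sh:
envfile=$(mktemp /tmp/fake_setenv.XXXXXX)
echo 'export VO_DTEAM_SW_DIR=/experiment-software/dteam' > "$envfile"

# Shape of the command line the modified Helper.pm emits:
# "source <env-file>; <bootstrap> <args>"
sh -c ". $envfile; echo \$VO_DTEAM_SW_DIR"   # prints /experiment-software/dteam

rm -f "$envfile"
```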

First, copy glite_setenv.*sh from one of the WNs to the CE. I placed the scripts in the same location as on the WNs, i.e. "/etc/glite/profile.d/" (you may need to create those directories first). Then we need to modify Helper.pm (location: /opt/globus/lib/perl/Globus/GRAM/) on the CE, which creates the executable - "Cmd". In turn, this executable creates /tmp/bootstrap.XXX to run the user's job. What we are doing here is modifying these three lines:

[root@serv03 root]# cat /opt/globus/lib/perl/Globus/GRAM/Helper.pm | grep "\$my_hostname \$local_x509"
            $script->print("\$bootstrap ".$description->directory()." $my_hostname $local_x509 ".
        $script->print("\$bootstrap ".$description->directory()." $my_hostname $local_x509 ".
    $script->print("\\\$bootstrap ".$description->directory()." $my_hostname $local_x509 ".

The first two lines are in the helper_write_non_mpi_script function and the third is in helper_write_fake_mpi_script. Now we add "source /etc/glite/profile.d/glite_setenv.sh;" to each line so that it looks like this:

[root@serv03 root]# cat /opt/globus/lib/perl/Globus/GRAM/Helper.pm | grep "\$my_hostname \$local_x509"
            $script->print("source /etc/glite/profile.d/glite_setenv.sh; \$bootstrap ".$description->directory()." $my_hostname $local_x509 ".
        $script->print("source /etc/glite/profile.d/glite_setenv.sh; \$bootstrap ".$description->directory()." $my_hostname $local_x509 ".
    $script->print("source /etc/glite/profile.d/glite_setenv.sh; \\\$bootstrap ".$description->directory()." $my_hostname $local_x509 ".

When it's done, we take a look at the executable again:

[root@serv03 root]# condor_q 92695 -long | grep Cmd
Cmd = "/home/atlas095/.globus/.gass_cache/local/md5/0c/9b3a344d1c39478bef8a76fadf7ea5/md5/32/318e20342922b71b7b36aab8d0ff45/data"

[root@serv03 root]# cat /home/atlas095/.globus/.gass_cache/local/md5/0c/9b3a344d1c39478bef8a76fadf7ea5/md5/32/318e20342922b71b7b36aab8d0ff45/data | less
#!/bin/bash
bootstrap=`mktemp /tmp/bootstrap.XXXXXX`; chmod 700 $bootstrap
cat >> $bootstrap <<EOFbs
#!/usr/bin/perl -w
use strict;
use Fcntl;
.......
.......
EOFbs
source /etc/glite/profile.d/glite_setenv.sh; $bootstrap /home/atlas095/ serv03.hep.phy.cam.ac.uk /home/atlas095/.globus/.....<A_LONG_COMMAND>....:LRMS=000000:APP=000000
rm $bootstrap

Thus, we first set up the LCG environment and then create the executable that runs the job.

Publishing site-info

Max cpu / walltime

Condor doesn't actually have a parameter defining maximum cpu or walltime. If we want to set a maximum time a job can run, we do it as a policy. A simple policy allowing jobs to run for 36 hours (129600 seconds) of wall time would be something like:

WANT_SUSPEND                    = FALSE
PREEMPT                         = ( (Activity == "Busy") && (State == "Claimed") && ($(ActivityTimer) > 129600) )

Add these two lines somewhere in your local configuration file with the desired walltime value. Please consult the manual for further details on Condor policies.
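If you would rather keep the walltime value in one place, the limit can be pulled out into a macro (a sketch; MAX_WALLTIME is a name of my choosing, not a built-in Condor knob):

```
MAX_WALLTIME                    = 129600
WANT_SUSPEND                    = FALSE
PREEMPT                         = ( (Activity == "Busy") && (State == "Claimed") && ($(ActivityTimer) > $(MAX_WALLTIME)) )
```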

As I've mentioned, maximum cpu or walltime is not a Condor parameter, so there is nothing for the LCG software to look up straight away, as it does for PBS/Torque. It is up to the LCG middleware to figure out this information, and the script that collects it - lcg-info-dynamic-condor - doesn't publish anything for this by default. A quick-and-dirty workaround is to modify config_gip (location: /opt/glite/yaim/functions): look for GlueCEPolicyMaxWallClockTime and replace the existing value (usually 0) with the one set in the policy, i.e. 129600 in this case.
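The substitution itself is a one-liner. Here it is rehearsed on a scratch file, assuming the key appears in config_gip as "GlueCEPolicyMaxWallClockTime: 0" (on the real CE you would point sed at /opt/glite/yaim/functions/config_gip, after taking a backup):

```shell
#!/bin/bash
# Rehearse the config_gip edit on a scratch file first.
f=$(mktemp)
echo 'GlueCEPolicyMaxWallClockTime: 0' > "$f"

# Replace whatever value is published with the policy value (129600):
sed -i 's/^\(GlueCEPolicyMaxWallClockTime:\).*/\1 129600/' "$f"

cat "$f"   # GlueCEPolicyMaxWallClockTime: 129600
rm -f "$f"
```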


Running / Waiting jobs

As far as I can tell, the number of running/waiting jobs is calculated dynamically by the dynamic-scheduler wrapper, which has probably not yet been properly ported to Condor. The result is that GlueCEStateRunningJobs and GlueCEStateWaitingJobs are always wrong.

First, replace the original lcg-info-dynamic-condor (location: /opt/lcg/libexec) on the CE with lcg-info-dynamic-condor.py:

cd /opt/lcg/libexec
wget http://www.hep.phy.cam.ac.uk/~santanu/grid/script/lcg-info-dynamic-condor.py
chmod +x lcg-info-dynamic-condor.py
mv lcg-info-dynamic-condor lcg-info-dynamic-condor-ORG
ln -s lcg-info-dynamic-condor.py lcg-info-dynamic-condor
ls -l

Once that's done, you can try ldapsearch to see the result. If everything goes right, it should be okay now. (Thanks to TsanLung Hsieh for writing the original version.)


Job ResponseTime

The ownership of the directory /opt/lcg/var/gip/tmp varies depending on whether rgma or edginfo last ran the plugin - they overwrite each other's files!

[root@serv03 /]# watch -n 1 'ls -l'

Situation One
-rw-r--r--    1 rgma     rgma         2784 2007-10-25 13:17:16 +0100 lcg-info-dynamic-ce.ldif.4049
-rw-r--r--    1 rgma     rgma         6708 2007-10-25 13:17:16 +0100 lcg-info-dynamic-scheduler-wrapper.ldif.5622
-rw-r--r--    1 rgma     rgma         8735 2007-10-25 13:17:16 +0100 lcg-info-dynamic-software-wrapper.ldif.5538
-rw-r--r--    1 rgma     rgma        60165 2007-10-25 13:17:16 +0100 lcg-info-provider-software-wrapper.ldif.5892

Situation Two
-rw-r--r--    1 edginfo  edginfo      2784 2007-10-25 13:18:26+0100 lcg-info-dynamic-ce.ldif.4049
-rw-r--r--    1 edginfo  edginfo      6914 2007-10-25 13:18:26+0100 lcg-info-dynamic-scheduler-wrapper.ldif.5622
-rw-r--r--    1 edginfo  edginfo      8735 2007-10-25 13:18:26+0100 lcg-info-dynamic-software-wrapper.ldif.5538
-rw-r--r--    1 edginfo  edginfo     60165 2007-10-25 13:18:26+0100 lcg-info-provider-software-wrapper.ldif.5892

When it's run (and owned) by rgma, the result is fine:

[root@serv03 gip]# cd -
/opt/lcg/var/gip/tmp
[root@serv03 tmp]# ll
total 92
-rw-r--r--    1 rgma     rgma         2784 Oct 25 15:39 lcg-info-dynamic-ce.ldif.4049
-rw-r--r--    1 rgma     rgma         6708 Oct 25 15:39 lcg-info-dynamic-scheduler-wrapper.ldif.5622
-rw-r--r--    1 rgma     rgma         8735 Oct 25 15:39 lcg-info-dynamic-software-wrapper.ldif.5538
-rw-r--r--    1 rgma     rgma        60165 Oct 25 15:39 lcg-info-provider-software-wrapper.ldif.5892

[root@serv03 tmp]# ldapsearch -x -H ldap://serv03.hep.phy.cam.ac.uk:2170 -b mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid | grep ResponseTime
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 0
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 0
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 2146060842
GlueCEStateEstimatedResponseTime: 0

But when it's run by the edginfo user, it's all screwed up:

[root@serv03 tmp]# ll
total 92
-rw-r--r--    1 edginfo  edginfo      2784 Oct 25 15:40 lcg-info-dynamic-ce.ldif.4049
-rw-r--r--    1 edginfo  edginfo      6914 Oct 25 15:40 lcg-info-dynamic-scheduler-wrapper.ldif.5622
-rw-r--r--    1 edginfo  edginfo      8735 Oct 25 15:40 lcg-info-dynamic-software-wrapper.ldif.5538
-rw-r--r--    1 edginfo  edginfo     60165 Oct 25 15:40 lcg-info-provider-software-wrapper.ldif.5892

[root@serv03 tmp]# ldapsearch -x -H ldap://serv03.hep.phy.cam.ac.uk:2170 -b mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid | grep ResponseTime
GlueCEStateEstimatedResponseTime: 777777
GlueCEStateWorstResponseTime: 1555554
GlueCEStateEstimatedResponseTime: 777777
GlueCEStateWorstResponseTime: 1555554
GlueCEStateEstimatedResponseTime: 777777
GlueCEStateWorstResponseTime: 1555554

I'm not sure about the actual reason, but the problem was that /opt/lcg/libexec/lcg-info-wrapper wasn't sourcing the Condor environment properly, and somehow only edginfo was affected. So, first create a file, e.g. condor_env.sh, somewhere on the CE with all the Condor variables appropriate to your setup. I created the file in my Condor installation directory.

condor=/opt/condor

PATH=${condor}/bin:$PATH
PATH=${condor}/sbin:$PATH
export PATH

CONDOR_LOCATION=${condor}
export CONDOR_LOCATION

CONDOR_CONFIG=${condor}/etc/condor_config
export CONDOR_CONFIG

CONDOR_IDS=201.499
export CONDOR_IDS

Then edit lcg-info-wrapper so that it looks like the one listed below:

#!/bin/bash

# Added to set the Condor environment before running the script;
source /opt/condor/condor_env.sh
export LANG=C

/opt/lcg/bin/lcg-info-generic /opt/lcg/etc/lcg-info-generic.conf

[NOTE :: gLite 3.0] However, I noticed a problem with gLite 3.0: it looked like edginfo somehow fails to run the dynamic command that queries Condor, or fails to write the output, or maybe the problem is that edginfo does not have a login shell when it runs the info provider. I later figured out that the MDS runs as edginfo and the BDII runs as edguser, but the GIP is not called on gLite 3.0 - a mystery! The MDS uses /opt/globus/etc/grid-info-resource-ldif.conf, which actually refers to /opt/glite/libexec/glite-info-wrapper. So who is calling /opt/lcg/libexec/lcg-info-wrapper? Probably no one any more, and the result was all wrong again. This fixed the problem:

cd /opt/glite/libexec/
mv glite-info-wrapper glite-info-wrapper-ORG
ln -s /opt/lcg/libexec/lcg-info-wrapper glite-info-wrapper
/etc/init.d/globus-mds restart

Now, the result should be okay again.


GlueCEStateStatus

At this point we'll introduce condor_group to easily publish GlueCEStateStatus (i.e. Production, Closed etc.). condor_group, or AccountingGroup, allows multiple users to submit jobs under the same group: e.g. atlas001, prd_atlas001 and plt_atlas001 can all submit jobs as group_ATLAS, and this feature can be used to publish GlueCEStateStatus in a semi-dynamic way. [ details coming up......]
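As a sketch of the mechanism (the group name, user name and quota value below are invented for illustration), the grouping is set per job with the AccountingGroup attribute in the submit description file, with the groups declared on the central manager:

```
# In the job's submit description file:
+AccountingGroup = "group_ATLAS.atlas001"

# In the central manager's configuration:
GROUP_NAMES             = group_ATLAS
GROUP_QUOTA_group_ATLAS = 10
```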

Running *sgm Jobs

Some VOs, especially LHCb, require a site to have the *sgm pool accounts enabled for their operations, and the chances of fulfilling this requirement are very remote if jobs run as Condor VMx_USERs. For *sgm jobs to run successfully, the $VO_<VO_NAME>_SW_DIR area must be writable by the user the job runs as. But since all of the VMx_USERs belong to the same group (e.g. condor) and jobs can run as any of the VMx_USERs, you won't be stupid enough to make "$VO_<VO_NAME>_SW_DIR" writable by the group "condor", for obvious security reasons. It's also not possible to make it writable by a single VMx_USER, as we never know which of those predefined users is going to run the job. So *sgm jobs will fail, and that's exactly what was happening at our site.

So we came up with the idea of reserving a WN for running jobs as their actual owner (i.e. the *sgm users) and of forcing the *sgm jobs onto that particular node only, without affecting the existing set-up. There are ways to set up machines so that only specific users' jobs can run on them, but what we wanted is kind of the other way round: not just to reserve a specific machine for a specific job, but to restrict a specific job to a specific machine at the same time. So, this is what we have done.

To run a job as the real user, we need to configure the Condor UID_DOMAIN variable. If the UID_DOMAIN on the execute node is different from the UID_DOMAIN on the submit host, then the job runs as a VMx_USER; if not, Condor ignores the VMx_USER settings and the job runs as the real owner. First we need to find out the UID_DOMAIN on the submit host; serv03 is the submit host here and I'll use it in this example.

[root@serv03 /]# condor_config_val UID_DOMAIN
serv03.hep.phy.cam.ac.uk

Then add these lines to the local configuration file on that particular execute host:

UID_DOMAIN                      = serv03.hep.phy.cam.ac.uk
TRUST_UID_DOMAIN                = TRUE
SOFT_UID_DOMAIN                 = TRUE

ALLOWED_USER                    = FALSE
ALLOWED_USER                    = $(ALLOWED_USER) || ( TARGET.Owner =?= "atlassgm" )
ALLOWED_USER                    = $(ALLOWED_USER) || ( TARGET.Owner =?= "lhcbsgm" )
START                           = ($(ALLOWED_USER)) && ($(CPUIdle))
EXECUTE_LOGIN_IS_DEDICATED      = FALSE

SGM_NODE_TYPE                   = 1
STARTD_EXPRS                    = $(STARTD_EXPRS), SGM_NODE_TYPE

NOTE: Notice that EXECUTE_LOGIN_IS_DEDICATED is set to FALSE here, whereas it was set to TRUE earlier. It's very important to remember that it must not be set to "TRUE" if the UID_DOMAINs are the same on the submit and execute hosts. This appears to be a bug in the 6.8 series, which is fixed in 6.9.3 of the development series. This statement is from the Condor team:

"The problem is that it doesn't distinguish between jobs that ran with the dedicated batch-slot account vs. ones that ran as some other user id that is not necessarily dedicated to one batch slot. In 6.8, Condor on unix will only use the VMx_USER accounts if the UidDomains of the submit and execute machine are different, so you have to be careful never to specify EXECUTE_LOGIN_IS_DEDICATED=True if the UidDomains are the same."

By default Condor will try to ensure that the UID_DOMAIN of a given submit machine is a substring of that machine's fully-qualified host name.
TRUST_UID_DOMAIN, when set to "TRUE", skips this check; otherwise the job will fail.
SOFT_UID_DOMAIN set to "TRUE" will run the job under the remote user's UID if that particular user is not present (or present with a different UID) on that node.
SGM_NODE_TYPE is a macro that forces this particular node to only accept jobs carrying the "SGM_NODE_TYPE" requirement [default is 0].
ALLOWED_USER defines the list of users who are allowed to run jobs on this host. So even if someone messes around and sends a fake "SGM_NODE_TYPE = 1" requirement with a job, the job still can't run if its user is not mapped to one of the allowed users. Keep adding these lines as required.
[Don't forget to issue condor_reconfig when it's done.]
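For example, to admit a further (hypothetical) VO manager account, say cmssgm, on the same node, just append another line of the same shape before the START expression:

```
ALLOWED_USER                    = $(ALLOWED_USER) || ( TARGET.Owner =?= "cmssgm" )
```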

We need to specify the default value of SGM_NODE_TYPE in the Condor global config file on the CE (assuming the CE is the Condor "submit host" for LCG/EGEE jobs and also hosts the Negotiator, Collector and Scheduler services). This is the file that needs to be altered:

[root@serv03 /]# echo $CONDOR_CONFIG
/opt/condor/etc/condor_config

Open this file using your favorite editor; look for STARTD_EXPRS (should be line #883) and modify the file like this:

STARTD_EXPRS            = COLLECTOR_HOST_STRING, SGM_NODE_TYPE
SGM_NODE_TYPE           = 0

Then reconfigure the node like this:

[root@serv03 /]# for ix in negotiator collector master; do condor_reconfig -${ix}; done
[root@serv03 /]# condor_reconfig -schedd -all

At this point we are done with the Condor part. Now we need to tell the middleware to send SGM_NODE_TYPE = 1 as a requirement for the node(s) to accept the job, and only if the job comes from one of the *sgm users. To do that, we need to alter the jobmanager script, i.e. lcgcondor.pm [location: /opt/globus/lib/perl/Globus/GRAM/JobManager].

Open the file in the editor you feel comfortable with and add these lines:

$username = (getpwuid($<))[0];

    # add the *sgm users as required
    if( $username eq "atlassgm" || $username eq "lhcbsgm" || $username eq "camontsgm" ) {
        $script_file->print("Requirements = SGM_NODE_TYPE == 1\n");
    }

between the lines $script_file->print("$submit_attrs_string\n"); and $script_file->print("queue " . $description->count() . "\n");.
That's all; the system is now ready to run *sgm jobs successfully.

NOTE: Any changes to the Globus files (Helper.pm or lcgcondor.pm) are not permanent; YAIM will put back the originals every time it runs. lcgcondor.pm is generated from /opt/globus/setup/globus/lcgcondor.in by YAIM's config_globus function, so applying the changes to the source file will make them "sort of" permanent, until you upgrade the YAIM rpm.

Result from my Test

For a normal job, this is what happens when it runs:

[santanu@ui santanu]$ globus-job-run serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor -q dteam /usr/bin/id
uid=103(condor_user3) gid=499(cd677) groups=499(cd677)

I have four VMx_USER accounts on my hosts and the job could run as any of them; this time it's condor_user3, regardless of which LCG pool user I'm mapped to, i.e. the original owner of the job. To test the new set-up, a job needs to appear to come from one of the *sgm users. So, in order to map myself to atlassgm, I generate a grid proxy and modify /etc/grid-security/grid-mapfile on the CE: "/C=UK/O=eScience/OU=Cambridge/L=UCS/CN=santanu das" atlassgm (don't forget to switch back to the original when done) and then submit a test job with this script:

  • test-submit.jdl
Executable              = "job.sh";
StdOutput               = "job.out";
StdError                = "job.err";
InputSandbox            = {"job.sh"};
OutputSandbox           = {"job.out","job.err"};
Requirements            = RegExp("serv03",other.CEId);
RetryCount              = 3;
VirtualOrganisation     = "atlas";

The job comes straight to our CE: serv03.hep.phy.cam.ac.uk due to the Requirements set in the .jdl file.

  • job.sh
#!/bin/bash

sleep 60

# ------ 1st Half ------
cd "$VO_ATLAS_SW_DIR"
DT=`date +%T`
sleep 60

cat << EOF > testRun.${DT}
This test is running on `/bin/hostname`
at      : `/bin/date`
by      : `/usr/bin/id`
System  : `/bin/uname -sipro`
EOF

# ------ 2nd  Half ------ 
echo "======== HOST :: `/bin/hostname` ========="
echo ""

echo "--- ls -la ---"
ls -la

echo ""
echo "DATE :: `/bin/date`"
echo "USER :: `/usr/bin/id`"
echo ""

exit 0

The first half of the script (job.sh) simply goes to our "/experiment-software" area and creates a file (i.e. testRun) with a time stamp attached to it; the second half generates data for StdOutput (i.e. job.out). Submit the job:

[santanu@ui submit-script]$ edg-job-submit -o info.out test-submit.jdl

Selected Virtual Organisation name (from JDL): atlas
Connecting to host lcgrb01.gridpp.rl.ac.uk, port 7772
Logging to host lcgrb01.gridpp.rl.ac.uk, port 9002

================================ edg-job-submit Success =====================================
 The job has been successfully submitted to the Network Server.
 Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

 - https://lcgrb01.gridpp.rl.ac.uk:9000/YXf4dz_pevg8vtMMx1hS9A

 The edg_jobId has been saved in the following file:
 /U/usera/santanu/submit-script/info.out
=============================================================================================

The job gets scheduled:

[santanu@ui submit-script]$ edg-job-status -i info.out

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/YXf4dz_pevg8vtMMx1hS9A
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:        serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas
reached on:         Fri Oct 19 11:11:03 2007
*************************************************************

farm052.hep.phy.cam.ac.uk is configured to run *sgm jobs. On the submit host (serv03, which is the CE as well), this is what happened during the negotiation:

10/19 12:11:52 ---------- Started Negotiation Cycle ----------
10/19 12:11:52 Phase 1:  Obtaining ads from collector ...
10/19 12:11:52   Getting all public ads ...
10/19 12:11:52   Sorting 217 ads ...
10/19 12:11:52   Getting startd private ads ...
10/19 12:11:52 Got ads: 217 public and 142 private
10/19 12:11:52 Public ads include 11 submitter, 142 startd
10/19 12:11:52 Phase 2:  Performing accounting ...
10/19 12:11:52 Phase 3:  Sorting submitter ads by priority ...
10/19 12:11:52 Phase 4.1:  Negotiating with schedds ...
10/19 12:11:52   Negotiating with atlassgm@serv03.hep.phy.cam.ac.uk at <172.24.116.151:9579>
10/19 12:11:52 0 seconds so far
10/19 12:11:52     Request 92020.00000:
10/19 12:11:52       Matched 92020.0 atlassgm@serv03.hep.phy.cam.ac.uk <172.24.116.151:9579> preempting none <172.24.116.184:9504> vm1@farm052.hep.phy.cam.ac.uk
10/19 12:11:52       Successfully matched with vm1@farm052.hep.phy.cam.ac.uk
10/19 12:11:52     Got NO_MORE_JOBS;  done negotiating
10/19 12:11:52 ---------- Finished Negotiation Cycle ----------

As you can see, serv03 negotiated with farm052, which simply means farm052 agreed to run the job (job id #92020), and that's what we see when we query the batch system:

[root@serv03 experiment-software]# condor_q -r
-- Submitter: serv03.hep.phy.cam.ac.uk : <172.24.116.151:9579> : serv03.hep.phy.cam.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)        
91581.0   euindia005     10/18 03:22   1+08:39:30 vm1@farm020.hep.phy.cam.ac.uk
91582.0   euindia005     10/18 03:22   1+08:39:18 vm2@farm011.hep.phy.cam.ac.uk
...........
...........
91999.0   euindia010     10/19 10:58   0+01:10:26 vm2@farm015.hep.phy.cam.ac.uk
92019.0   atlas095       10/19 12:11   0+00:00:58 vm2@farm020.hep.phy.cam.ac.uk
92020.0   atlassgm       10/19 12:11   0+00:00:20 vm1@farm052.hep.phy.cam.ac.uk

35 jobs; 0 idle, 35 running, 0 held

The atlassgm job is running. The job is supposed to create a file in the "/experiment-software/atlas" (i.e. VO_ATLAS_SW_DIR) area, and it did:

[root@serv03 gip]# ll /experiment-software/atlas | grep testRun
-rw-r--r--    1 atlassgm atlas         187 Oct 19 12:14 testRun.12:13:14
-rw-r--r--    1 atlassgm atlas         187 Oct 19 09:54 testRun.09:54:05

That's the file - testRun.12:13:14, created at 12:14 - and this is what the file says:

[root@serv03 experiment-software]# cat /experiment-software/atlas/testRun.12\:13\:14
This test is running on farm052
at      : Fri Oct 19 12:14:14 BST 2007
by      : uid=12051(atlassgm) gid=1307(atlas) groups=1307(atlas)
System  : Linux 2.4.21-52.ELsmp i686 i386 GNU/Linux

Notice the user the job was running as - it's "atlassgm":

[root@serv03 experiment-software]# cat /etc/passwd | grep atlassgm
atlassgm:x:12051:1307:mapped user for group ID 1307:/home/atlassgm:/bin/bash

So the test job completed successfully, and that's what I see from the bookkeeping system as well:

[santanu@ui submit-script]$ edg-job-status -i info.out

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lcgrb01.gridpp.rl.ac.uk:9000/YXf4dz_pevg8vtMMx1hS9A
Current Status:     Done (Success)
Exit code:          0
Status Reason:      Job terminated successfully
Destination:        serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas
reached on:         Fri Oct 19 11:17:38 2007
*************************************************************

VOView

To be continued.............


APEL Accounting

To be continued.............