Tracing jobs on typical GridPP site config
This job tracing example is based on the HTCondor/ARC CE combination at Oxford.
The ARC CE logs do not provide much useful information, but HTCondor provides detailed logs which can easily be searched. I am assuming that we know the DN of the user.
HTCondor logs can be accessed through either condor_q or condor_history: condor_q provides information about the current jobs in the queue, and condor_history gives information about jobs which have finished.
Tracing Current Jobs
Given the DN of the user, we can find all jobs being run by that user, along with all other information about each job.
condor_q -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe"' -af:t x509userproxysubject GlobalJobId RemoteHost
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#264910.0#1542863626 email@example.com
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#266640.0#1542881001 firstname.lastname@example.org
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267721.0#1542893650 email@example.com
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267723.0#1542893653 firstname.lastname@example.org
From the above we can find the job number and the node where the job is running. (GlobalJobId gives the schedd name, the job id, and the submission time as a Unix epoch.)
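That GlobalJobId string can be pulled apart with plain POSIX parameter expansion; a minimal sketch using the last job from the output above (GNU date is assumed for the epoch conversion):

```shell
# A GlobalJobId has the form <schedd>#<cluster.proc>#<queue epoch>.
# Value taken from the condor_q output above.
gjid='t2arc00.physics.ox.ac.uk#267723.0#1542893653'

schedd=${gjid%%#*}    # text before the first '#'
rest=${gjid#*#}
jobid=${rest%%#*}     # cluster.proc
qtime=${rest#*#}      # submission time, Unix epoch

echo "schedd: $schedd"
echo "job id: $jobid"
echo "queued: $(date -u -d "@$qtime" +'%Y-%m-%d %H:%M:%S') UTC"
```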
Now ssh to the WN where the job is running and find the PID of the job, using job "267723" as an example.
Run from the WN
grep -r 267723 /var/log/condor/StarterLog*
/var/log/condor/StarterLog.slot1_7:11/22/18 13:34:32 (pid:24740) Job 267723.0 set to execute immediately
/var/log/condor/StarterLog.slot1_7:11/22/18 13:34:32 (pid:24740) Starting a VANILLA universe job with ID: 267723.0
This gives the PID of the starter process as "24740".
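The PID can also be pulled out of the StarterLog line automatically; a minimal sketch (the sample line is pasted in from the grep output above so the sketch stands alone; on a real WN, pipe the grep output in instead):

```shell
# Extract the starter PID from a StarterLog line.
line='/var/log/condor/StarterLog.slot1_7:11/22/18 13:34:32 (pid:24740) Job 267723.0 set to execute immediately'

# Match the "pid:NNNNN" token and keep only the number.
pid=$(printf '%s\n' "$line" | grep -o 'pid:[0-9]*' | cut -d: -f2)
echo "starter PID: $pid"
```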
pstree -pAla 24740
The above command will give details of all processes on the WN related to this PID (pstree may require the psmisc package if it is not already installed).
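If psmisc cannot be installed, a rough stand-in for pstree can be sketched with plain ps (show_tree is a hypothetical helper name; the -o/--ppid flags are from procps, which is standard on EL-based worker nodes):

```shell
# Rough stand-in for pstree: show a process and its direct children.
show_tree() {
    pid=${1:-$$}
    ps -o pid=,ppid=,args= -p "$pid"       # the process itself
    ps -o pid=,ppid=,args= --ppid "$pid"   # its direct children
}

show_tree "$$"   # demo: the current shell and its children
```

Only direct children are shown; call show_tree again on a child PID to go deeper.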
Run from the ARC CE
The ARC CE keeps a copy of the job submission files and executables in the controldir and sessiondir, as defined in /etc/arc.conf.
Find the controldir and sessiondir of the job
condor_q 267723.0 -af Iwd x509userproxy
/var/spool/arc/grid04/ARCJOBID /var/spool/arc/jobstatus/job.ARCJOBID.proxy
Here the first entry is the sessiondir; change to this directory.
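Given the two paths printed by condor_q, the ARC job ID and the locations of the other control files can be derived with basename/dirname; a minimal sketch (the paths are hard-coded from the output above):

```shell
# Derive the ARC job ID and the controldir from the two paths that
# condor_q printed above.
sessiondir='/var/spool/arc/grid04/ARCJOBID'
proxy='/var/spool/arc/jobstatus/job.ARCJOBID.proxy'

arcjobid=$(basename "$sessiondir")   # last path component = ARC job ID
controldir=$(dirname "$proxy")       # controldir holds all job.* files

echo "ARC job id: $arcjobid"
echo "grami file: $controldir/job.$arcjobid.grami"
```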
cd /var/spool/arc/grid04/ARCJOBID
[root@t2arc00 ARCJOBID]# ls -la
-rw------- 1 vo198 vo   977 Nov 22 05:13 condorjob.jdl
-rwxr-xr-x 1 vo198 vo  8569 Nov 22 05:13 condorjob.sh
-rwx------ 1 vo198 vo 51312 Nov 22 05:13 DIRAC_XRpdx6_pilotwrapper.py
-rw------- 1 vo198 vo  2076 Nov 22 06:59 log
The sessiondir contains the job JDL, the job executable, and so on.
The controldir keeps the files for all jobs in one directory; in our case it is /var/spool/arc/jobstatus/, and we are only interested in this particular ARCJOBID:
ls -la /var/spool/arc/jobstatus/job.ARCJOBID*
-rw------- 1 vo198 vo    749 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.description
-rw------- 1 vo198 vo     13 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.diag
-rw------- 1 vo198 vo   9670 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.errors
-rw------- 1 vo198 vo   1160 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.grami
-rw------- 1 vo198 vo      0 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.input
-rw------- 1 vo198 vo    729 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.local
-rw------- 1 vo198 vo     28 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.output
-rw------- 1 vo198 vo  17193 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.proxy
-rw------- 1 root  root    0 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.statistics
-rw------- 1 root  root 1606 Nov 22 15:49 /var/spool/arc/jobstatus/job.ARCJOBID.xml
The most interesting files are the .description and .grami files.
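The .grami file holds the parameters that were handed to the batch system as shell-style joboption_* assignments, so it can be grepped directly. A minimal sketch against a mock file (the contents here are illustrative; inspect a real .grami under your controldir for the actual field names):

```shell
# Sketch: pull fields out of a .grami control file. The file written
# here is a mock standing in for a real job.<ID>.grami.
grami=$(mktemp)
cat > "$grami" <<'EOF'
joboption_directory='/var/spool/arc/grid04/ARCJOBID'
joboption_jobid='ARCJOBID'
EOF

# grami files are shell-style variable assignments, so the batch
# submission parameters can be grepped out directly:
fields=$(grep '^joboption_' "$grami")
printf '%s\n' "$fields"
rm -f "$grami"
```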
Tracing old jobs
If the job has finished we cannot find any live processes, but we can still recover many details from the log files. The search is quicker if it is restricted to a limited time window; condor_history constraints work on epoch times. Assuming we want to check jobs which arrived after 13:00:00 on 22 November 2018, find the epoch time first:
date -d '2018-11-22 13:00:00' +"%s"
1542891600
condor_history -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe"' -constraint 'QDate>1542891600' -forward -af:t x509userproxysubject GlobalJobId LastRemoteHost
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267958.0#1542896955 email@example.com
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267959.0#1542896956 firstname.lastname@example.org
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267966.0#1542897001 email@example.com
The above command will show any job by John Doe which was queued after 13:00:00 on 22 November 2018. As the job has finished, the sessiondir will no longer exist, but we can still get some information from the controldir in the same way as in the 'Tracing Current Jobs' section.
condor_history 267958.0 -af x509userproxy
/var/spool/arc/jobstatus/job.OLDARCJOBID.proxy
In a CREAM/SGE setting, jobs will have both a batch system ID (a seven-digit number) and a CREAM ID (a prefix, usually cream_, followed by an 8-digit number).
The CREAM log file is:
It is quite hard to pull the DN-to-account mapping from here; the best way is to pull the mapping from the gridmapdir.
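The gridmapdir trick works because the URL-encoded DN file and the pool account file are hard links to the same inode, so find -samefile recovers the mapping. A sketch on a throw-away directory (the encoded DN and the vo198 account name are made up for the demo; point find at your real gridmapdir instead):

```shell
# Sketch of recovering the pool-account-to-DN mapping from a
# gridmapdir: the URL-encoded DN file and the pool account file are
# hard links to the same inode. Demonstrated on a mock directory.
gridmapdir=$(mktemp -d)
touch "$gridmapdir/%2fdc%3dch%2fdc%3dcern%2fcn%3djohn%20doe"
ln "$gridmapdir/%2fdc%3dch%2fdc%3dcern%2fcn%3djohn%20doe" "$gridmapdir/vo198"

# Which DN file shares an inode with pool account vo198?
mapped=$(find "$gridmapdir" -samefile "$gridmapdir/vo198" ! -name vo198)
echo "vo198 -> $(basename "$mapped")"
rm -rf "$gridmapdir"
```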
Areas of interest: the CREAM sandbox, where files get staged to and from.
This is set by your configs; the directory structure is of the form
To query a running job:
qstat -j $batchid (or $creamid)
To query a job that's no longer running:
qacct -j $batchid (or $creamid)
This will yield the host the job ran on.
On the host, the interesting directories are the user's home area and the job tmpdir: