Tracing Jobs

From GridPP Wiki
Revision as of 09:33, 5 March 2019 by David Crooks 46c33702b3 (Talk | contribs)


Tracing jobs on a typical GridPP site configuration

ARC CE/HTCondor

This job tracing example is based on the HTCondor/ARC CE combination at Oxford.

ARC CE logs do not provide much useful information, but HTCondor provides detailed logs which can be easily searched. We assume that the DN of the user is known.

HTCondor job information can be accessed through either condor_q or condor_history. condor_q provides information about the jobs currently in the queue, and condor_history gives information about jobs which have finished.

Tracing Current Jobs

Given the DN of the user, we can find all jobs being run by that user, along with other information about each job.

condor_q -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe"' -af:t x509userproxysubject GlobalJobId RemoteHost

/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#264910.0#1542863626 slot1_4@t2wn146.physics.ox.ac.uk 
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#266640.0#1542881001 slot1_9@t2wn123.physics.ox.ac.uk 
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267721.0#1542893650 slot1_7@t2wn135.physics.ox.ac.uk
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267723.0#1542893653 slot1_7@t2wn128.physics.ox.ac.uk

From the above, we can find the job number and the node where the job is running. (GlobalJobId gives the schedd name, the job ID, and the submission time.)

Now ssh to the WN where the job is running and find the PID associated with the job, using "267723" as an example.
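The three parts of a GlobalJobId can be pulled apart with plain shell parameter expansion. The following is only a sketch: the function name split_global_job_id is made up for this page, and the conversion of the submission time assumes GNU date (for the -d "@epoch" form).

```shell
# Split a GlobalJobId (<schedd>#<cluster>.<proc>#<submit epoch>) into its fields.
# The function body runs in a subshell so the helper variables do not leak.
split_global_job_id() (
    schedd=${1%%#*}          # text before the first '#'
    rest=${1#*#}             # everything after the first '#'
    job_id=${rest%%#*}       # cluster.proc, e.g. 267723.0
    submit_epoch=${rest#*#}  # submission time in seconds since the epoch
    printf 'schedd=%s\njob_id=%s\nsubmitted=%s\n' "$schedd" "$job_id" \
        "$(date -u -d "@${submit_epoch}" '+%Y-%m-%d %H:%M:%S UTC')"
)

# Example using one of the GlobalJobId values shown above.
split_global_job_id 't2arc00.physics.ox.ac.uk#267723.0#1542893653'
```

For the example value this reports a submission time of 2018-11-22 13:34:13 UTC, which is consistent with the StarterLog timestamps on the WN.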

Run from the WN

On t2wn128.physics.ox.ac.uk

grep -r 267723 /var/log/condor/StarterLog*

/var/log/condor/StarterLog.slot1_7:11/22/18 13:34:32 (pid:24740) Job 267723.0 set to execute immediately
/var/log/condor/StarterLog.slot1_7:11/22/18 13:34:32 (pid:24740) Starting a VANILLA universe job with ID: 267723.0

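This gives the PID "24740". This is the PID of the condor_starter process that wrote these log lines; the job's own processes run as its children.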

pstree -pAla 24740

The above command will give details of all processes running on the WN that descend from this PID (it may require the psmisc package if not already installed).
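The grep step and reading off the PID can be combined into a small helper. This is a sketch only: starter_pids_for_job is a hypothetical name, and it assumes the "(pid:NNNN)" log format shown in the StarterLog lines above.

```shell
# Print the unique PIDs found on log lines that mention the given job ID.
# $1 = job ID (cluster.proc); log lines are read from stdin.
starter_pids_for_job() {
    grep -F -- "$1" | sed -n 's/.*(pid:\([0-9][0-9]*\)).*/\1/p' | sort -u
}

# Example using the two StarterLog lines shown above.
starter_pids_for_job 267723.0 <<'EOF'
11/22/18 13:34:32 (pid:24740) Job 267723.0 set to execute immediately
11/22/18 13:34:32 (pid:24740) Starting a VANILLA universe job with ID: 267723.0
EOF
```

On the WN the same function could be fed a real log, for example starter_pids_for_job 267723.0 < /var/log/condor/StarterLog.slot1_7, and the resulting PID passed to pstree.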

Run from the ARC CE

The ARC CE keeps a copy of the job submission files and executables in the controldir and sessiondir, as defined in /etc/arc.conf.

Find the controldir and sessiondir of the job

condor_q 267723.0 -af Iwd x509userproxy

/var/spool/arc/grid04/ARCJOBID
/var/spool/arc/jobstatus/job.ARCJOBID.proxy

Here the first entry is the sessiondir. Change to this directory:

cd /var/spool/arc/grid04/ARCJOBID

[root@t2arc00 ARCJOBID ]# ls -la

-rw------- 1 vo198 vo 977 Nov 22 05:13 condorjob.jdl
-rwxr-xr-x 1 vo198 vo 8569 Nov 22 05:13 condorjob.sh
-rwx------ 1 vo198 vo 51312 Nov 22 05:13 DIRAC_XRpdx6_pilotwrapper.py
-rw------- 1 vo198 vo 2076 Nov 22 06:59 log

The sessiondir contains the job JDL, the job executable, and related files.

The controldir keeps the files for all jobs in the same directory. In our case it is /var/spool/arc/jobstatus/, and as we are only interested in this particular ARCJOBID:

ls -la /var/spool/arc/jobstatus/job.ARCJOBID*

-rw------- 1 vo198 vo 749 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.description
-rw------- 1 vo198 vo 13 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.diag
-rw------- 1 vo198 vo 9670 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.errors
-rw------- 1 vo198 vo 1160 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.grami
-rw------- 1 vo198 vo 0 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.input
-rw------- 1 vo198 vo 729 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.local
-rw------- 1 vo198 vo 28 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.output
-rw------- 1 vo198 vo 17193 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.proxy
-rw------- 1 root root 0 Nov 22 05:13 /var/spool/arc/jobstatus/job.ARCJOBID.statistics
-rw------- 1 root root 1606 Nov 22 15:49 /var/spool/arc/jobstatus/job.ARCJOBID.xml

The interesting files are the .description and .grami files.

Tracing Old Jobs

If the job has finished then we cannot find any live processes, but we can still find many details from the log files. It is quicker to search over a limited time range. The condor_history command works with epoch times. Assume we want to check jobs which arrived after 13:00:00 on 22 November 2018. Find the epoch time first:

date -d '2018-11-22 13:00:00' +"%s" 

1542891600

condor_history -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe"' -constraint 'QDate>1542891600' -forward -af:t x509userproxysubject GlobalJobId LastRemoteHost

/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267958.0#1542896955 slot1_1@t2wn125.physics.ox.ac.uk
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267959.0#1542896956 slot1_5@t2wn132.physics.ox.ac.uk
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=doe/CN=CNID/CN=John Doe t2arc00.physics.ox.ac.uk#267966.0#1542897001 slot1_17@t2wn136.physics.ox.ac.uk

The above command will show any job from John Doe which was queued after 13:00:00 on 22 November 2018. As the job has finished, the sessiondir will no longer be there, but we can still get some information from the controldir in the same way as in the 'Tracing Current Jobs' section.
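The epoch conversion and the constraint can be built in one step. The following is a sketch under a few assumptions: history_constraint is a hypothetical helper name, it relies on GNU date, and it joins the two conditions with && in a single ClassAd expression instead of passing two -constraint options. The example spells out "UTC" in the time string so the result does not depend on the local timezone of the machine.

```shell
# Build a condor_history constraint for one DN and a "queued after" time.
# $1 = user DN, $2 = time string understood by GNU date -d
history_constraint() (
    dn=$1
    since_epoch=$(date -d "$2" +%s)   # seconds since the epoch
    printf 'x509userproxysubject=="%s" && QDate>%s\n' "$dn" "$since_epoch"
)

# Example with the DN and time used above.
history_constraint \
    '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=Doe/CN=CNID/CN=John Doe' \
    '2018-11-22 13:00:00 UTC'
```

The output can then be passed on, for example condor_history -constraint "$(history_constraint "$DN" '2018-11-22 13:00:00 UTC')" -af:t GlobalJobId LastRemoteHost, where $DN is assumed to hold the user's DN.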

condor_history 267958.0 -af x509userproxy

/var/spool/arc/jobstatus/job.OLDARCJOBID.proxy
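The proxy path contains the ARC job ID, which is what we need to list the remaining controldir files for the job. A minimal sketch, assuming the job.<ARCJOBID>.proxy naming seen above (arc_job_id_from_proxy is a hypothetical helper name):

```shell
# Derive the ARC job ID from a controldir proxy path of the form
# <controldir>/job.<ARCJOBID>.proxy
arc_job_id_from_proxy() (
    name=${1##*/}        # strip the directory part: job.<ARCJOBID>.proxy
    name=${name#job.}    # strip the leading "job."
    printf '%s\n' "${name%.proxy}"   # strip the trailing ".proxy"
)

# Example with the path printed above.
arc_job_id_from_proxy /var/spool/arc/jobstatus/job.OLDARCJOBID.proxy
```

With the ID in a variable, say id, the controldir files can be listed as before with ls -la /var/spool/arc/jobstatus/job."$id".*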

SGE/CREAM

Matt's Notes

In a CREAM/SGE setting, jobs will have both a batch system ID (a seven-digit number) and a CREAM ID (a prefix, usually cream_, and an 8-digit number).

The CREAM log file is

/var/log/cream/glite-ce-cream.log

It is quite hard to pull the user mapping from here; the best way is to pull the mapping from the gridmapdir.

Areas of interest: the CREAM sandbox, where files get staged to and from.

This is set by your configs, with a directory structure of the form

/var/sandbox/$poolgroupname/$DN_$VOMS_$poolusername

To query a running job:

qstat -j $batchid (or $creamid)

To query a job that's no longer running:

qacct -j $batchid (or $creamid)

This will yield the host the job ran on.
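To pick the host out of the accounting record automatically, the qacct output can be filtered on its hostname field. This is only a sketch: qacct_hosts is a hypothetical helper name, and the sample record below is invented for illustration (real qacct -j output has many more fields).

```shell
# Print the execution host(s) from `qacct -j` output read on stdin.
# Matches lines whose first field is exactly "hostname".
qacct_hosts() {
    awk '$1 == "hostname" { print $2 }' | sort -u
}

# Example with a made-up, heavily shortened accounting record.
qacct_hosts <<'EOF'
qname        grid
hostname     node042.example.org
owner        pool001
jobnumber    1234567
EOF
```

On the batch server this would be used as qacct -j $batchid | qacct_hosts.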

On the host, the interesting directories are the user's home area and the job tmpdir:

$TMPDIR/$batchid.1.$queue