Better way to get the job stats after the PBS job completes


I was wondering if there is a better way to get job statistics (such as CPU time, walltime, memory usage, etc.) in a PBS job script once the job completes. In my current setup, I have a line at the end of my PBS script:

qstat -f "${PBS_JOBID}"

The problem is that if the job fails or gets killed for some reason, this line never runs. Please let me know what other options I can use.
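One common workaround (a sketch, not from the original post) is to run the qstat line from a shell EXIT trap, so it executes even when an earlier command fails or the job is deleted with a catchable signal:

```shell
#!/bin/bash
#PBS -l walltime=01:00:00

# Print the job's stats whenever the script exits, for any reason.
print_stats() {
    qstat -f "${PBS_JOBID}"
}
# EXIT fires on normal completion and after any failing command; converting
# SIGTERM (what qdel sends first) into an exit makes the EXIT trap fire for
# deletions too. A follow-up SIGKILL cannot be caught by any trap.
trap print_stats EXIT
trap 'exit 1' TERM

# ... actual job commands go here ...
```

This still cannot survive a hard SIGKILL (e.g. the final kill the scheduler sends after its grace period), which is why the answers below look at server-side records instead.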

I greatly appreciate any help or advice, thanks!


There are 2 answers below.

Answer 1

You may find the tracejob utility useful. It is available in PBS-derived batch scheduling systems such as TORQUE.

tracejob takes one argument, the job ID, and an option -n <days> that indicates how far back it should look in the log files for relevant records.

Note on split submission and server hosts

Note that tracejob only works if the logs are accessible on the host where it is invoked. On some installations, the PBS server runs on one host, jobs are submitted from another, and the log files live on a file system local to the PBS server. In that case, tracejob will not work from the submission host.
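A sketch of one way around that limitation is to run tracejob on the server host over SSH (assuming you have login access there; "pbs-server" is a placeholder for your site's actual server hostname):

```shell
# Run tracejob on the PBS server host from the submission host.
# "pbs-server" is a placeholder hostname; adjust for your site.
remote_tracejob() {
    local host="$1" jobid="$2" days="${3:-10}"
    ssh "$host" "tracejob -n $days $jobid" 2>/dev/null
}

# Example: remote_tracejob pbs-server 10082
```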

Example

$ qstat -f 10082
qstat: Unknown Job Id 10082.

qstat fails because the job has already completed, while tracejob still works:

$ tracejob -n 10 10082 2>/dev/null

Job: 10082.pbs.cl.localnet

12/14/2014 03:33:10  S    Job deleted at request of
                          USERNAME
12/14/2014 03:33:10  S    Job sent signal SIGTERM on delete
12/14/2014 03:33:10  S    Not sending email: User does not want mail of this
                          type.
12/14/2014 03:33:10  S    Exit_status=271 resources_used.cput=369:59:52
                          resources_used.mem=609672kb
                          resources_used.vmem=674112kb
                          resources_used.walltime=167:08:56
12/14/2014 03:33:10  A    requestor=USERNAME
12/14/2014 03:33:10  A    user=USERNAME group=users jobname=MYJOB
                          queue=simple ctime=1417901048 qtime=1417901048
                          etime=1417901048 start=1417901048
                          owner=USERNAME exec_host=HOST/CPU
                          Resource_List.walltime=90000:00:00 session=15324
                          end=1418502790 Exit_status=271
                          resources_used.cput=369:59:52
                          resources_used.mem=609672kb
                          resources_used.vmem=674112kb
                          resources_used.walltime=167:08:56
12/14/2014 03:43:11  S    dequeuing from simple, state COMPLETE

You can redirect stderr to /dev/null when executing tracejob to avoid multiple messages of the form

/var/lib/torque/sched_logs/DATE: No matching job records located

In the logs above, information not relevant to the question has been replaced with capitalized placeholders.

Answer 2

The best way to do this and have the stats placed with the job output is through an epilogue script; full configuration information is found here. The resource usage information is argument 7 to the epilogue script, and anything the epilogue writes to standard output is appended to the stdout file for your job.
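As a sketch (assuming the TORQUE epilogue argument convention described above, where $1 is the job ID and $7 holds the resources-used list), an epilogue script could look like this:

```shell
#!/bin/bash
# Sketch of an epilogue script. Under the TORQUE argument convention,
# $1 is the job id and $7 is the resources-used list; anything written
# to standard output here is appended to the job's stdout file.
print_job_stats() {
    local job_id="$1"
    local resources_used="$7"
    echo "---- job ${job_id} resource usage ----"
    echo "${resources_used}"
}

print_job_stats "$@"
```

The script has to be installed on the compute nodes and made executable with the permissions the scheduler expects; see the configuration documentation referenced above for the details.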