File can't be found in a small fraction of submitted jobs

I'm trying to run a very large set of batch jobs on a RHEL5 cluster that uses a Lustre file system. Roughly 1% of the jobs fail with a strange error: they can't find a text file that all of the jobs use for steering. A script that reproduces the error looks like this:

#!/usr/bin/env bash

#PBS -t 1-18792
#PBS -l mem=4gb,walltime=30:00
#PBS -l nodes=1:ppn=1
#PBS -q hep
#PBS -o output/fit/out.txt
#PBS -e output/fit/error.txt

cd "$PBS_O_WORKDIR"
mkdir -p output/fit
echo 'submitted from: ' "$PBS_O_WORKDIR"

files=($(ls ./*.txt | sort)) # <-- NOTE THIS LINE

cat batch/fits/fit-paths.txt

For a small fraction of the jobs, the error stream shows:

cat: batch/fits/fit-paths.txt: No such file or directory

Weird enough, but it gets stranger.


When I change the files=($(ls ./*.txt | sort)) line to

files=($(ls batch/fits/*.txt | sort))

the jobs run without errors! Needless to say, this is far from satisfying: I'd rather not have my jobs depend on black magic (although black magic is better than no magic).

Any idea what's going on here?

There is 1 best solution below

Try replacing

files=($(ls ./*.txt | sort))

with

files=(./*.txt)

Normally, the shell sorts glob results automatically, so the extra sort is not needed. Unlike parsing ls(1) output, which should never be done in portable shell scripts, a glob also handles file names containing special characters correctly.

That said, this is only an issue if you ever have files with certain shell metacharacters in their names. Candidates here are space, tab, newline, and possibly carriage return.
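To make the difference concrete, here is a minimal sketch that compares the two approaches in a scratch directory. The directory and the file names (b.txt, "a file.txt") are made up for illustration and have nothing to do with the original job. It only demonstrates the word-splitting problem and is not meant to reproduce the missing-file error on Lustre.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory and file names, for illustration only.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch "b.txt" "a file.txt"

# Parsing ls output: the unquoted command substitution is word-split,
# so "./a file.txt" ends up as two separate array elements.
ls_files=($(ls ./*.txt | sort))

# Plain glob: one array element per file, already sorted by the shell.
glob_files=(./*.txt)

echo "ls parsing: ${#ls_files[@]} entries"
echo "glob:       ${#glob_files[@]} entries"
printf 'glob entry: %s\n' "${glob_files[@]}"

# Leave the scratch directory and remove it.
cd - >/dev/null && rm -rf "$demo_dir"
```

With these two files, the ls version should report one entry more than the glob version, because the file name containing a space is split in two.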