Consider the following script, which runs on an AWS EC2 instance inside multiple Docker containers in parallel:
#!/bin/bash
echo 'Starting lock and download script.'

download_command () {
    project_version=$1
    # $src and $dest are set elsewhere (not shown in this excerpt)
    if test -f "$dest"; then
        echo 'dest exists, no download needed'
    else
        echo 'Downloading dest...'
        aws s3 cp "$src" "$dest"
        echo 'Download complete'
    fi
}
export -f download_command

flock -x -w 60 /opt/shared/.ta.lock -c "bash -c 'download_command \"$1\"'"
Here /opt/shared is a directory on the host that is mounted into every container as a Docker volume. Up to dozens of containers may start at roughly the same time and race to download a shared file, serialized by flock, and this works 99% of the time. A handful of times, however, every invocation on a particular host has timed out at 60 s; seemingly none of them could obtain the lock.
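To be clear about what a timeout looks like from the script's point of view, here is a minimal, self-contained reproduction of flock's -w behavior (temporary paths only, nothing from the real setup). When -w expires, flock exits with its conflict exit code, which is 1 by default:

```shell
#!/bin/bash
# Minimal demo of flock's -w timeout semantics (illustrative temp paths).
lockfile=$(mktemp)

# Hold the lock in the background for 3 seconds.
flock -x "$lockfile" -c 'sleep 3' &
holder=$!
sleep 0.5   # give the background holder time to acquire the lock

# This attempt should fail to acquire the lock within 1 second.
flock -x -w 1 "$lockfile" -c 'true'
status=$?
echo "flock exit status: $status"   # 1 is flock's default timeout/conflict exit code
wait "$holder"
```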
I'm trying to understand what aspect of this locking pattern could prevent any process from obtaining the lock. The rarity suggests some kind of race condition, but I expected flock to handle exactly that. Any ideas would be appreciated, thanks.
Note:
- The lockfile is never deleted
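For reference, a variant that takes the lock on a file descriptor rather than going through flock -c would look roughly like the sketch below. I'm not sure it changes anything about the timeout itself, but it removes the nested bash -c layer (and the exported function) entirely, which rules out one class of quoting problems. Here cp stands in for aws s3 cp and all paths are temp stand-ins so the sketch is runnable:

```shell
#!/bin/bash
# FD-based flock: the lock is held for as long as fd 200 stays open,
# so no sub-shell, nested bash -c, or exported function is needed.
# cp stands in for `aws s3 cp`; paths are temp stand-ins.
src=$(mktemp)       # stand-in for the S3 source object
dest=$(mktemp -u)   # download target; does not exist yet
lockfile=$(mktemp)  # stand-in for /opt/shared/.ta.lock

exec 200>"$lockfile"          # open the lockfile on fd 200
if flock -x -w 60 200; then   # lock the fd, waiting up to 60 s
    if [ -f "$dest" ]; then
        echo 'dest exists, no download needed'
    else
        echo 'Downloading dest...'
        cp "$src" "$dest"     # aws s3 cp "$src" "$dest" in the real script
        echo 'Download complete'
    fi
else
    echo 'Timed out waiting for lock' >&2
    exit 1
fi
```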