flock intermittently deadlocking parallel processes


See the following code, which runs on an AWS EC2 instance from multiple docker containers in parallel:

#!/bin/bash

echo 'Starting lock and download script.'

download_command () {
  project_version=$1   # currently unused
  # $src and $dest are assumed to be set in the environment

  if test -f "$dest"; then
    echo 'dest exists, no download needed'
  else
    echo 'Downloading dest...'
    aws s3 cp "$src" "$dest"
    echo 'Download complete'
  fi
}

export -f download_command

flock -x -w 60 /opt/shared/.ta.lock -c "bash -c 'download_command $1'"

Here /opt/shared is a directory local to the host that's mounted into all containers as a docker volume. Up to a few dozen docker containers may start around the same time and race to download a shared file, synchronized with flock, and this works 99% of the time. But a handful of times it's been observed that all invocations on a specific host time out at 60s; seemingly none are able to obtain the lock.
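For reference, here is the behavior I expect from this pattern, as a self-contained sketch: temp files stand in for the shared volume and the lockfile, and `touch` stands in for the `aws s3 cp` download. Several contenders start at once, flock serializes them, and only the first one performs the "download".

```shell
#!/bin/bash
# Sketch of the racing pattern, with mktemp paths standing in for the
# shared docker volume and `touch` standing in for `aws s3 cp`.
lock=$(mktemp)
dest=$(mktemp -u)   # a path that does not exist yet

worker() {
  flock -x -w 60 "$lock" -c "
    if test -f '$dest'; then
      echo 'dest exists, no download needed'
    else
      echo 'Downloading dest...'
      touch '$dest'
    fi
  "
}

# Three contenders start at roughly the same time; flock serializes
# them, so exactly one prints 'Downloading dest...'.
for i in 1 2 3; do worker & done
wait
rm -f "$lock" "$dest"
```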

I'm trying to understand what aspect of this locking pattern could prevent every process from obtaining the lock. The rarity makes me suspect some kind of race condition, but I was expecting flock to handle these. Any ideas would be great, thanks.
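For what it's worth, here is what the timeout path looks like in isolation, with shortened timings (this is just a sketch of flock's documented timeout behavior, not a claim about the actual cause): a waiter whose `-w` window expires while another process holds the lock fails with exit status 1, the same failure mode as the 60s timeouts described above.

```shell
#!/bin/bash
# Demonstration of flock's -w timeout with shortened timings: a waiter
# whose window expires while the lock is held exits with status 1.
lock=$(mktemp)

flock -x "$lock" -c 'sleep 2' &   # holder keeps the lock for 2 s
holder=$!
sleep 0.5                          # give the holder time to acquire it

if flock -x -w 1 "$lock" -c 'true'; then
  echo 'waiter acquired the lock'
else
  echo "waiter timed out (exit status $?)"
fi

wait "$holder"
rm -f "$lock"
```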

Note:

  • The lockfile is never deleted