Is it safe to use flock on AWS EFS to emulate a critical section?


According to the docs, AWS EFS (Amazon Elastic File System) supports file locking:

Amazon EFS provides a file system interface and file system access semantics (such as strong data consistency and file locking).

On a local file system (e.g., ext4), flock can be used in shell scripts to create a critical section. For example, this answer describes a pattern that I have used in the past:

#!/bin/bash
(
  # Wait for lock on /var/lock/.myscript.exclusivelock (fd 200) for 10 seconds
  flock -x -w 10 200 || exit 1

  # Do stuff

) 200>/var/lock/.myscript.exclusivelock

Can the same pattern be applied on EFS? Amazon mentions that they are using the NFSv4 protocol, but does it provide the same guarantees as flock on ext4?

If not, how can you enforce that an operation runs exclusively across all EC2 instances that are attached to the same EFS volume? It is sufficient if it works for processes, as I'm not planning to run multiple threads.

Or did I misunderstand the locking support provided in NFSv4? Unfortunately, I don't know the details of the protocol, but providing atomicity in a distributed system is a much harder problem than on a local machine.

Update: small scale experiment

Not a proof, of course, but in my tests it works across multiple instances. For now, I assume the pattern is safe to use. Still, it would be nice to know whether it is theoretically sound.
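An experiment of this kind could look roughly like the sketch below. It is not the exact test that was run: SHARED_DIR, the file names, and the worker counts are assumptions for illustration. Several workers do a read-modify-write on a shared counter inside the flock critical section. If the lock were not exclusive, some increments would be lost and the final count would be too small.

```shell
#!/bin/bash
# Hypothetical paths: point SHARED_DIR at the EFS mount to test across instances.
SHARED_DIR="${SHARED_DIR:-$(mktemp -d)}"
LOCK_FILE="$SHARED_DIR/.counter.lock"
COUNTER_FILE="$SHARED_DIR/counter"

increment() {
  (
    # Wait up to 10 seconds for the exclusive lock on fd 200
    flock -x -w 10 200 || exit 1
    # Read-modify-write: a lost update would show up as a too-small total
    n=$(cat "$COUNTER_FILE" 2>/dev/null || echo 0)
    echo $((n + 1)) > "$COUNTER_FILE"
  ) 200>"$LOCK_FILE"
}

# Start 4 workers that each increment the counter 25 times
for w in 1 2 3 4; do
  ( for i in $(seq 25); do increment; done ) &
done
wait

total=$(cat "$COUNTER_FILE")
echo "final count: $total"   # 100 on a fresh directory if no increment was lost
```

With SHARED_DIR set to the same EFS directory on several instances, each instance adds 100 to the shared counter, so the last one to finish should report 100 times the number of instances.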


There are 2 best solutions below

3
On BEST ANSWER

It should work.

The flock command as used in the pattern in the question should work on all NFS file systems. That means it will also work on EFS, which implements the NFSv4 protocol. In practice, I have not encountered any problems so far when using it to synchronize shell scripts on different EC2 instances.


Depending on your use case, you have to be aware of the gotchas of file locking on Linux, although most of them are not NFS specific. For instance, the pattern above operates on the process level and cannot be used if you want to synchronize multiple threads.

While reading up on this, I came across some old issues. In kernels prior to 2.6.12, there seemed to be problems with NFS and the flock system call (e.g., see flock vs lockf on Linux).

This should not apply here, as the behavior has been improved in newer kernels. Looking at the source code of the flock command, you can confirm that it still uses the flock system call, but on NFS that call may be implemented in terms of the safe fcntl system call:

while (flock(fd, type | block)) {
  ...
  case EBADF:       /* since Linux 3.4 (commit 55725513) */
        /* Probably NFSv4 where flock() is emulated by fcntl().
         * Let's try to reopen in read-write mode.
         */

Note: the workaround refers to this commit in the Linux kernel, whose message says:

Since we may be simulating flock() locks using NFS byte range locks, we can't rely on the VFS having checked the file open mode for us.
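In shell terms, this means the lock file should be opened for writing before asking for an exclusive lock, because an fcntl write lock requires a writable descriptor. The following is a minimal sketch under that assumption, not code from the flock source. The LOCK_FILE path is hypothetical, and append mode is used only so that an existing file is not truncated.

```shell
#!/bin/bash
# Hypothetical path; on EFS this would be a file on the shared mount.
LOCK_FILE="${LOCK_FILE:-$(mktemp)}"

# Open fd 200 for writing (append mode, so existing content is kept).
# A read-only open (200<"$LOCK_FILE") is what can lead to EBADF on NFSv4.
exec 200>>"$LOCK_FILE"

if flock -x -n 200; then
  status="acquired"
  # Do stuff
  flock -u 200        # release explicitly; closing the fd also releases it
else
  status="busy"
fi

exec 200>&-           # close the descriptor
echo "lock status: $status"
```

The pattern in the question already does this implicitly, since 200>/var/lock/.myscript.exclusivelock opens the file for writing.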

3
On

If the use case is just to make sure that another process (or instance/container) doesn't "take over" the job, I'd use a simpler lock file instead. It's called a lock file, but it's really just an ordinary file.

Something like

while true; do
    printf "Acquiring lock: "
    if [ ! -e "some_lock_file_somewhere" ]; then
        echo "done."
        touch some_lock_file_somewhere
        echo "doing stuff"
        sleep 60 # just because I couldn't come up with something that takes a while :D 
    else
        echo "waiting 60s for lock"
        sleep 60
    fi
done

You'd have to remove that lock file manually, or write the logic to do it, but that can then be run in multiple shells and only the first one will do the actual work.
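One caveat with the loop above is that the existence check and the touch are two separate steps, so two processes can both see the file as missing and both proceed. A possible variation, sketched here, uses bash's noclobber option so that creating the file fails if it already exists, and a trap to remove it on exit. The LOCK_FILE path and the acquire_lock helper are made up for illustration, and I have not verified how exclusive create behaves on EFS specifically.

```shell
#!/bin/bash
# Hypothetical path; use a file on the shared volume in practice.
LOCK_FILE="${LOCK_FILE:-$(mktemp -u)}"

# Hypothetical helper: succeeds only for the caller that creates the file.
acquire_lock() {
  # With noclobber, '>' fails if the file already exists, so checking
  # and creating happen in a single step.
  ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null
}

if acquire_lock; then
  trap 'rm -f "$LOCK_FILE"' EXIT   # remove the lock file when the script exits
  first="acquired"
  echo "doing stuff"
else
  first="busy"
fi

# A second attempt while the lock file still exists fails
if acquire_lock; then second="acquired"; else second="busy"; fi
echo "first attempt: $first, second attempt: $second"
```

Unlike flock, a lock file like this is not released automatically if the holder is killed with SIGKILL or the instance goes down, so stale files still need manual cleanup.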