How can I serialize access to a directory in Linux?


Let's say 4 simultaneous processes are running on a processor, and data needs to be copied from an HDFS file system (used with Spark) to a local directory. I want only one process to copy that data, while the other processes just wait until the first process has finished copying it.

So, basically, I want some kind of semaphore mechanism, where every process tries to obtain the semaphore in order to copy the data, but only one process gets it. All processes that fail to acquire the semaphore would then just wait for it to be released (the process that acquired the semaphore would release it once it's done copying), and when it's released they know the data has already been copied. How can I do that in Linux?


There are 2 best solutions below


There are many different ways to implement semaphores. The classical System V semaphore interface is described in man semop; the newer POSIX semaphores are described in man sem_overview.

You might still want to do something more easily scalable and modern. Many IPC frameworks (Apache has one or two of those, too!) have atomic IPC operations. These can be used to implement semaphores, but I'd be very careful.

Generally, I encourage people who write multi-process or multi-threaded applications to use C++ instead of C. It's often simpler to see where shared state must be protected if that state is nicely encapsulated in an object which might do its own locking. Hence, I urge you to have a look at Boost's IPC synchronization mechanisms.


In addition to Marcus Müller's answer, you could use some file locking mechanism to synchronize.

File locking might not work very well on networked or remote file systems. You should use it on a locally mounted file system (e.g. Ext4, BTRFS, ...), not on a remote one (e.g. NFS).

For example, you might adopt the convention that your directory contains a .lock file (creating it if it does not exist yet) and take an advisory lock with flock(2) (or the POSIX lockf(3)) on that .lock file before accessing the directory.
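As a rough sketch of that convention in Python (whose fcntl.flock wraps flock(2)): each process takes the lock, checks whether the copy has already been done, and copies only if it has not. The helper names directory_lock and copy_once and the ".copied" marker file are assumptions made for this illustration, not part of any standard API, and do_copy stands in for whatever actually fetches the data from HDFS.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def directory_lock(directory):
    """Hold an exclusive advisory flock(2) lock on <directory>/.lock."""
    os.makedirs(directory, exist_ok=True)
    lock_path = os.path.join(directory, ".lock")
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)


def copy_once(directory, do_copy):
    """Run do_copy(directory) in one process only; the others wait, then skip.

    Returns True if this call performed the copy, False if it was already done.
    """
    # Hypothetical marker file recording that the copy has completed.
    marker = os.path.join(directory, ".copied")
    with directory_lock(directory):
        if os.path.exists(marker):
            return False
        do_copy(directory)
        with open(marker, "w"):
            pass
        return True
```

Every process would call copy_once(local_dir, fetch_from_hdfs) with its own copy function; the first one in does the work, and the rest block in flock until it finishes and then see the marker.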

If using flock, you could even lock the directory directly, since on Linux a directory can be opened read-only and the resulting file descriptor passed to flock.
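A minimal Python sketch of locking the directory itself, assuming Linux (os.O_DIRECTORY is not available everywhere). The function names lock_directory, try_lock_directory and unlock_directory are made up for this example.

```python
import fcntl
import os


def lock_directory(directory):
    """Block until an exclusive advisory lock on the directory is obtained."""
    fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd


def try_lock_directory(directory):
    """Like lock_directory, but return None instead of waiting."""
    fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Somebody else currently holds the lock.
        os.close(fd)
        return None
    return fd


def unlock_directory(fd):
    """Release the lock and close the descriptor returned above."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The caller keeps the returned descriptor open for as long as it works in the directory and passes it to unlock_directory afterwards.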

The advantage of using such a file lock approach is that you could code shell scripts using flock(1).

And on Linux, you might also use inotify(7) (e.g. to be notified when some file is created in that directory).
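Python's standard library has no inotify wrapper, so the following sketch calls the libc functions inotify_init and inotify_add_watch through ctypes; treat it as an illustration rather than production code. The function name wait_for_file is an assumption for this example, and the file it waits for (say, a ".copied" marker written by the copying process) is a convention you would have to define yourself.

```python
import ctypes
import os
import select
import struct

# Event mask bits from <sys/inotify.h>.
IN_MOVED_TO = 0x00000080
IN_CREATE = 0x00000100

# Fixed part of struct inotify_event: wd, mask, cookie, len.
_EVENT_HEADER = struct.Struct("iIII")

# dlopen(NULL): the running program's symbols, which include libc on Linux.
_libc = ctypes.CDLL(None, use_errno=True)


def wait_for_file(directory, name, timeout=None):
    """Wait until `name` exists in `directory`.

    Returns True once the file is there, or False if `timeout` seconds
    pass without any event (the timeout applies to each wait).
    """
    fd = _libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    try:
        mask = IN_CREATE | IN_MOVED_TO
        if _libc.inotify_add_watch(fd, os.fsencode(directory), mask) < 0:
            raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
        # Check only after the watch is installed, so that a file created
        # between the check and the watch cannot be missed.
        if os.path.exists(os.path.join(directory, name)):
            return True
        wanted = os.fsencode(name)
        while True:
            readable, _, _ = select.select([fd], [], [], timeout)
            if not readable:
                return False
            buf = os.read(fd, 4096)
            offset = 0
            while offset < len(buf):
                _wd, _mask, _cookie, length = _EVENT_HEADER.unpack_from(buf, offset)
                offset += _EVENT_HEADER.size
                # The name field is padded with NUL bytes up to `length`.
                if buf[offset:offset + length].rstrip(b"\0") == wanted:
                    return True
                offset += length
    finally:
        os.close(fd)
```

A waiting process could call wait_for_file(local_dir, ".copied", timeout=60) instead of polling the directory in a loop.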

Notice that most of these solutions are advisory, so they presuppose that every process accessing that directory follows the same convention. In other words, a careless user who does not take the lock first (e.g. via flock(1)) could still access that directory or files under it (e.g. with a plain cp command) while your locking process is working in it. If you don't accept that, you might look into mandatory file locking, which was an optional feature of some Linux kernels & filesystems; AFAIK it is deprecated, and support for it was removed in Linux 5.15.

BTW, you might read more about ACID properties and consider using some database, etc.