I am seeing odd behavior from File::NFSLock under Perl v5.16. I use a stale lock timeout of 5 minutes. Say I have three processes. Process 1 holds the lock for more than 5 minutes before releasing it, and process 2 then gets the lock. However, even though process 2 has held the lock for less than 5 minutes, process 3 comes along and removes the lock file, which makes process 2 fail when it tries to remove the lock it holds itself.
My theory is that process 3 wrongly reads the lock's last-modified time as the one written by process 1 rather than by process 2. The lock files live on NFS-mounted partitions.
Does anyone have an idea, or has anyone faced a similar issue with File::NFSLock? Please refer to the snippet below:
    use Fcntl qw(LOCK_EX);
    use File::NFSLock;

    my $lock = File::NFSLock->new({
        file               => $file,
        lock_type          => LOCK_EX,
        blocking_timeout   => 50,        # 50 sec
        stale_lock_timeout => 5 * 60,    # 5 min
    });
    $DB::single = 1;    # debugger breakpoint while the lock is held
    if ($lock) {
        $lock->unlock();
    }
If I pause process 1 at the debugger breakpoint for more than 5 minutes, I observe this behavior.
From reviewing the code at https://metacpan.org/pod/File::NFSLock, I see that the lock is implemented simply as a physical file on the file system.
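To make the file-based idea concrete, here is a minimal sketch of how a stale check on a lock file can work. It assumes staleness is judged by comparing the file's modification time against the timeout; `lock_is_stale` is a hypothetical helper written for illustration, not the module's actual code, and the "current time" is passed in so the example is deterministic.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of File::NFSLock): a lock file counts as
# stale when its mtime is more than $timeout seconds before $now.
sub lock_is_stale {
    my ($lock_file, $timeout, $now) = @_;
    $now //= time();
    my @st = stat($lock_file) or return 0;    # no lock file, nothing to expire
    my $mtime = $st[9];                       # element 9 of stat() is the mtime
    return ($now - $mtime) > $timeout ? 1 : 0;
}

# A temporary file stands in for the lock file.
my ($fh, $lock_file) = tempfile(UNLINK => 1);
close $fh;

my $timeout = 5 * 60;                         # same 5 min as in the question
my $created = (stat $lock_file)[9];

# Simulate "now" being 4 and 7 minutes after the lock was written.
for my $minutes (4, 7) {
    my $stale = lock_is_stale($lock_file, $timeout, $created + $minutes * 60);
    printf "after %d min: %s\n", $minutes, $stale ? "stale" : "fresh";
}
# prints:
# after 4 min: fresh
# after 7 min: stale
```

The point is that the check only sees a timestamp on a file. It cannot tell whether the owner is still working, so a job that runs longer than the timeout looks exactly like a crashed one.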
I use the same process-lock logic in almost every project.
With a process lock it is crucial not to set stale_lock_timeout too tight, or a race condition will occur, as the in-code comments also mention.
As you mentioned, the 3 processes start to compete over the same lock because the job takes more than 5 min and you set stale_lock_timeout to 5 min. If you have a fixed-interval scheduler such as the crond service, it will launch a process every 5 min. Each process will treat the lock as outdated because 5 min have already passed, even though the job takes more than 5 min.
To describe a possible scenario:
Some DB job takes 4 min to complete, but on a congested system it can take up to 7 min or more. Now suppose the crond service launches a process every 5 min.
At 0 min the first process, process1, finds the job as new, sets the lock, and starts the job, which will take up to 7 min.
At 5 min the crond service launches process2, which finds the lock of process1 but decides that it is stale because 5 min have passed since the lock was created. So process2 releases the lock and reacquires it for itself.
At 7 min process1 has finished the job and, without checking whether the lock is still its own, releases the lock of process2 and finishes.
At 10 min process3 is launched and does not find any lock, because the lock of process2 was already released by process1, so it sets its own lock.
This scenario is really problematic because it leads to process accumulation, workload accumulation, and unpredictable results.
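The timeline above can be replayed with a small simulation. This is only a sketch under stated assumptions: the lock is an in-memory hash standing in for the lock file, the clock values are passed in by hand, `acquire` and `release_unchecked` are hypothetical names, and a lock that is exactly 5 min old is treated as stale to match the scenario. It does not call File::NFSLock.

```perl
use strict;
use warnings;

my $STALE = 5 * 60;    # stale_lock_timeout in seconds
my %lock;              # stand-in for the lock file: (owner, created)
my @log;               # what happened, in order

# Take the lock at simulated time $now, removing a stale one first.
sub acquire {
    my ($who, $now) = @_;
    if (%lock && $now - $lock{created} >= $STALE) {
        push @log, sprintf "%2d min: %s removes stale lock of %s",
            $now / 60, $who, $lock{owner};
        %lock = ();
    }
    return 0 if %lock;    # somebody holds a fresh lock
    %lock = (owner => $who, created => $now);
    push @log, sprintf "%2d min: %s acquires the lock", $now / 60, $who;
    return 1;
}

# Release without checking who owns the lock (the problematic step).
sub release_unchecked {
    my ($who, $now) = @_;
    return unless %lock;
    push @log, sprintf "%2d min: %s releases the lock held by %s",
        $now / 60, $who, $lock{owner};
    %lock = ();
}

acquire('process1', 0);                   # job will run for 7 min
acquire('process2', 5 * 60);              # cron tick, lock looks stale
release_unchecked('process1', 7 * 60);    # process1 finishes late
acquire('process3', 10 * 60);             # next cron tick finds no lock

print "$_\n" for @log;
```

In the printed log, the release at 7 min is done by process1 while the owner is process2, so process2 is left running without a lock and process3 starts alongside it.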
The suggestions to fix this issue are:
1. Set stale_lock_timeout to an amount far bigger than what the job would take (like 10 min or 15 min). The stale_lock_timeout must also be bigger than the interval of the execution schedule.
2. Combine process1, process2 and process3 into a single process_master which launches each process only after the former ones are finished.
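The single-master idea could look roughly like the sketch below. The job commands are placeholders invented for this example (each one just starts a short Perl one-liner); in practice they would be the real job scripts. Because `system` waits for the child to exit, the next job cannot start before the previous one is done, so the jobs never compete for a lock.

```perl
use strict;
use warnings;

# Hypothetical job commands; replace with the real job scripts.
# $^X is the path of the running perl interpreter.
my @jobs = (
    [ $^X, '-e', 'print "job 1 done\n"' ],
    [ $^X, '-e', 'print "job 2 done\n"' ],
    [ $^X, '-e', 'print "job 3 done\n"' ],
);

my @finished;
for my $i (0 .. $#jobs) {
    # system() blocks until the child exits and returns its wait status.
    my $status = system(@{ $jobs[$i] });
    if ($status != 0) {
        warn sprintf "job %d failed (wait status %d), stopping\n", $i + 1, $status;
        last;
    }
    push @finished, $i + 1;
}

print "finished jobs: @finished\n";
```

Stopping at the first failure is a design choice for this sketch; a real master might log the failure and carry on with the next job instead.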