Efficient way to walk directory tree containing link cycles

417 Views Asked by At

Is there a more efficient way to walk a directory tree that contains link cycles than tracking which files have already been visited?

For example consider walking a directory containing these files:

symlink "parent" -> ".."
symlink "uh_oh" -> "/"
regular file "reg"
symlink "reg2" -> "reg"
2

There are 2 best solutions below

0
On

The tree walk algorithm guarantees that you'll visit every file under a directory, so instead of tracking individual files you can maintain a list of search "roots":

  • Add the initial directory to the list of roots
  • Walk the directory tree for each search root
  • For every symlink you find, check if it's already contained by a search root. If it's not, add it as a new search root.

This way you'll visit every file and directory, will never get stuck in a loop, but may visit files and directories more than once. That can happen only when you find a symlink to an ancestor of an existing root. To avoid doing that, you can check if a directory is a search root before entering it.

2
On

You should also track which directories have been visited, as per your first example, but otherwise there is no better solution than maintaining visited flags for every file.

Maintaining the flags would be easier if there were a portable way of getting a short unique identifier for a mounted filesystem. Even then, you need to think through the consequences of mount and umount operations occurring during the scan, particularly since such a scan might take quite a long time if the filesystem tree includes remote filesystems.

In theory, you can get a "filesystem id" from the stafvfs interface, but in practice that is not totally portable. Quoting man statfs from a Linux distro:

Nobody knows what f_fsid is supposed to contain…

…The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.

This latter restriction -- that f_fsid is presented as 0 to non-privileged users -- does not violate the Posix standard cited above, because that standard includes a very general disclaimer: "It is unspecified whether all members of the statvfs structure have meaningful values on all file systems."