How to walk the directory tree of a huge directory and ignore files


I need to walk a directory on a network drive and create a map of child to parent in the hierarchy. One representative directory is 6 terabytes and has 900,000 files and 900 folders. I only care about the folders, not the files. For testing purposes, I copied the folders without files to another network drive and ran my code on the copy. Just iterating over the 900 folders takes maybe 10 seconds. However, iterating over the original directory structure takes 30 minutes. It appears that we are iterating through all 900,000 files even though we are just ignoring them.

Is there a way to speed this up by not even looking at the files? I would prefer to stick with pure Java if we can. When browsing this huge directory through Windows Explorer, it does not feel slow at all. My code is below.

public static Map<String, String> findFolderPaths(File parentFolder) throws IOException {
    Map<String, String> parentFolderMap = new HashMap<String, String>();
    Files.walkFileTree(parentFolder.toPath(), new FolderMappingFileVisitor(parentFolderMap));

    return parentFolderMap;
}

static class FolderMappingFileVisitor extends SimpleFileVisitor<Path> {
    private Map<String, String> mapping;

    FolderMappingFileVisitor(Map<String, String> map) {
        this.mapping = map;
    }

    @Override
    public FileVisitResult preVisitDirectory(Path dir,
            BasicFileAttributes attrs) throws IOException {
        File directory = dir.toFile();
        mapping.put(directory.getName(), directory.getParent());

        return FileVisitResult.CONTINUE;
    }
}

Edit:

An important piece of the puzzle that I did not mention is that we are running the app under Java Web Start. The times I reported were from production, not development. Running from Eclipse, the times are closer to what I would expect for the file walker.

There are 2 answers below.

Accepted answer:

The file walker appears to be working much faster than File.listFiles(). The problem appears to be Java Web Start. When I run the app in production under Java Web Start, it takes around 30 minutes. When I run the app from Eclipse, it takes a couple of minutes. Java Web Start is just killing us performance-wise.

This is a very data- and I/O-intensive app, and I have noticed other issues with it in the past when running under Web Start. The solution is to migrate away from Java Web Start.
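To compare the two environments directly, one option is to time the same directory-only walk in each. The following is a minimal sketch, not the application's actual code: the class name `WalkTimer` and the method `countDirectories` are made up for illustration, and the root path is taken from the command line as an assumption.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: times a walk so the same code can be run from the IDE
// and under Web Start, and the two numbers compared.
public class WalkTimer {

    // Counts every directory under root (root included); files are ignored.
    public static long countDirectories(Path root) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                count[0]++;
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to walk is the first argument, else ".".
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long start = System.nanoTime();
        long dirs = countDirectories(root);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(dirs + " directories in " + elapsedMs + " ms");
    }
}
```

Running this once from Eclipse and once under Web Start against the same network path should show whether the walk itself or the launcher accounts for the difference.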

Second answer:

The method you are using obtains the BasicFileAttributes for each entry, which I suspect means it is reading the file metadata of every file.
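One way to check this suspicion is to count how often the visitor is called back. This is only a sketch, and the class `VisitCounter` and its fields are hypothetical: `Files.walkFileTree` passes a `BasicFileAttributes` instance to `visitFile` for every file, even when the visitor does nothing with it.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical visitor that only counts callbacks.
public class VisitCounter extends SimpleFileVisitor<Path> {
    public long directories;
    public long files;

    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
        directories++;
        return FileVisitResult.CONTINUE;
    }

    // Called once per file, with attributes already read by the walker.
    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        files++;
        return FileVisitResult.CONTINUE;
    }

    public static VisitCounter count(Path root) throws IOException {
        VisitCounter counter = new VisitCounter();
        Files.walkFileTree(root, counter);
        return counter;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to walk is the first argument, else ".".
        VisitCounter c = count(Paths.get(args.length > 0 ? args[0] : "."));
        System.out.println(c.directories + " directories, " + c.files + " files visited");
    }
}
```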

If all you need is the names, I suggest you recursively call File.listFiles(), which should only obtain the information you ask for.

Something like

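import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public static Map<String, String> findFolderPaths(File parentFolder) throws IOException {
    Map<String, String> map = new HashMap<String, String>();
    findFolderPaths(parentFolder, map);
    return map;
}

// Records dir's name -> parent path, then recurses into its subdirectories.
public static void findFolderPaths(File dir, Map<String, String> map) throws IOException {
    map.put(dir.getName(), dir.getParent());
    File[] children = dir.listFiles();
    if (children == null)
        return; // listFiles() returns null if dir is not a directory or an I/O error occurs
    for (File file : children)
        if (file.isDirectory())
            findFolderPaths(file, map);
}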

As you can see, it does not do anything you don't need it to do.
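A variation on the code above, assuming the `File.listFiles(FileFilter)` overload is acceptable: the filter keeps only directories, so the loop never sees plain files, and a `null` result from `listFiles` is skipped. The class name `FolderLister` and the helper `collect` are invented for this sketch. Note that the filter still calls `isDirectory()` once per entry, so this tidies the loop rather than avoiding the per-file check.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around the recursive listFiles() approach.
public class FolderLister {

    // Accepts only directories, so files are dropped inside listFiles().
    private static final FileFilter DIRECTORIES_ONLY = new FileFilter() {
        @Override
        public boolean accept(File candidate) {
            return candidate.isDirectory();
        }
    };

    public static Map<String, String> findFolderPaths(File parentFolder) {
        Map<String, String> map = new HashMap<String, String>();
        collect(parentFolder, map);
        return map;
    }

    // Maps each folder name to its parent path, depth first.
    private static void collect(File dir, Map<String, String> map) {
        map.put(dir.getName(), dir.getParent());
        File[] children = dir.listFiles(DIRECTORIES_ONLY);
        if (children == null) {
            return; // not a directory, or an I/O error occurred
        }
        for (File child : children) {
            collect(child, map);
        }
    }

    public static void main(String[] args) {
        // Assumption: the directory to scan is the first argument, else ".".
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(findFolderPaths(root).size() + " folders mapped");
    }
}
```

As in the original code, folders are keyed by name only, so two folders with the same name in different places overwrite each other in the map.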