longest substrings file tree

73 Views Asked by At

Question

What is the shortest and most efficient (preferrable most efficient) way of reading in a single depth folder and creating a file tree that consists of the longest substrings of each file?

start with this

.
├── hello
├── lima_peru
├── limabeans
├── limes
├── limit
├── what_are_those
├── what_is_bofa
└── what_is_up

end with this

.
├── hello
├── lim
│   ├── lima
│   │   ├── lima_peru
│   │   └── limabeans
│   ├── limes
│   └── limit
└── what_
    ├── what_are_those
    └── what_is_
        ├── what_is_bofa
        └── what_is_up

Framing

I feel like I know how to do the naive version of this problem but I wanted to see what the best way to do this in python would be.

Format

def longest_substrings_filetree(old_directory: str, new_directory: str):
    ...
2

There are 2 best solutions below

0
On

You can use a trie, or prefix tree, which is a data structure that achieves exactly what you are looking for. Here is a link to numerous python implementations for a python trie.

How to create a trie in Python

0
On

Given:

$ ls -1
hello
lima_peru
limabeans
limes
limit
what_are_those
what_is_bofa
what_is_up

You can use groupby and commonprefix to get started:

import os 
from itertools import groupby 
from collections import Counter 

words=sorted(os.listdir(the_dir))
depth={}

for k,v in groupby(words, key=lambda w: w[0]):
    prefix=''
    these_words=list(v)
    cnt=0
    while pre:=os.path.commonprefix(these_words):
        prefix+=pre
        depth[prefix]=cnt
        cnt+=1
        these_words=[x[len(prefix):] for x in these_words]
        c=Counter(x[0] for x in these_words if x)
        these_words=[x for x in these_words if x and c[x[0]]>1]

seen=set()

levels=sorted([max((len(p),p,w) for p in depth.keys() 
                    if w.startswith(p)) for w in words], 
             key=lambda t: (t[1], len(t[1])))

for _,p,w in levels:
    if p==w:
        print(w)
    elif p in seen:
        lead='\t'*(depth[p]+1)
        print(f'{lead}{w}')
    else:
        lead=lead='\t'*(depth[p])
        print(f'{lead}{p}:')
        print(f'{lead}\t{w}')
        seen.add(p)

Prints:

hello
lim:
    limes
    limit
    lima:
        lima_peru
        limabeans
what_:
    what_are_those
    what_is_:
        what_is_bofa
        what_is_up

This is not a complete solution nor efficient. It does not handle whole words that are then prefixes (additional logic needed for that) and loops and sorts a lot leading to catastrophic complexity for larger file lists.

Consider a demo only. It is fine for the example -- perhaps a little larger list -- but I would not use on huge lists.

More efficient would be a appropriately modified prefix tree.