Python IntelliJ style 'search everywhere' algorithm

207 Views Asked by At

I have a list of file names in python like this:

HelloWorld.csv
hello_windsor.pdf
some_file_i_need.jpg
san_fransisco.png
Another.file.txt
A file name.rar

I am looking for an IntelliJ style search algorithm where you can enter whole words or simply the first letter of each word in the file name, or a combination of both. Example searches:

hw -> HelloWorld.csv, hello_windsor.pdf
hwor -> HelloWorld.csv
winds -> hello_windsor.pdf

sf -> some_file_i_need.jpg, san_francisco.png
sfin -> some_file_i_need.jpg
file need -> some_file_i_need.jpg
sfr -> san_francisco.png

file -> some_file_i_need.jpg, Another.file.txt, A file name.rar
file another -> Another.file.txt
fnrar -> A file name.rar

You get the idea.

Is there any Python packages that can do this? Ideally they'd also rank matches by 'frecency' (how often the files have been accessed, how recently) as well as by how strong the match is.

I know pylucene is one option but it seems very heavyweight given the list of file names is short and I have no interest in searching the contents of the file? Is there any other options?

2

There are 2 best solutions below

1
On BEST ANSWER

You can do this by using the regular expression (import re) in the python and creating the function. This is bit complex but is achievable using regular expression.

import re
def intellij_search(search_term, file_list):
    words = search_term.split()

    #empty list for storing name
    matching_files = []
    for file_name in file_list:
        # Initialize a variable to keep track.
        matches_all_words = True

        #Iterate over each word in the search term
        for word in words:
            # Create a regular expression pattern
            pattern = '.*'.join(word)

            # Check if the file name matches the pattern
            if not re.search(pattern, file_name, re.IGNORECASE):
                # If the file name does not match the pattern, set the 
                #variable to False and break the loop
                matches_all_words = False
                break

        # If the file name matches all words in the search term, add it to 
        #the list of matching file name
        if matches_all_words:
            matching_files.append(file_name)

    # Return the matche file
    return matching_files


files = ['HelloWorld.csv', 'hello_windsor.pdf', 'some_file_i_need.jpg', 'san_francisco.png', 'Another.file.txt', 'A file name.rar']
#print(intellij_search('hw', files)) 
#print(intellij_search('sf', files))
#print(intellij_search('Afn', files))

I am not sure if you are looking for something like this or else.

0
On

To implement your desired IntelliJ-style fuzzy filter we should first define its expected behavior in precise language:

  1. For each file name, split the name into a list of words by either non-word characters (so 'hello_windsor.pdf' becomes ['hello', 'windsor', 'pdf']) or positions that are followed by a capital letter and preceded by another word character (so 'HelloWorld' becomes ['Hello', 'World']). A word consists of only letters and/or digits. The list of words should be lowercased to allow case-insensitive matching.
  2. Split the given query string by space into a list of patterns. All patterns have to match a name for the name to be considered a match.
  3. For each pattern, match each character of the pattern one by one against the list of words, starting from the first character of the first word, in the following way:
    • if the current character of the pattern matches the current character of the current word:
      • try matching the next character of the pattern against the next character of the current word (so when the character w in pattern winds matches the first character of windstor, try matching i in the pattern against the i in windstor);
      • failing that, try matching the next character of the pattern against the first character of the next word (so when the character s in pattern sfr matches the first character of san in san francisco but the next characterf doesn't match a in san, try matching f against the f in the next word francisco);
    • if not, match the start of the pattern starting from the next word (so if file doesn't match the first word another in another file, try matching file against the next word file).
  4. A match is found if the end of the pattern is reached.
  5. No match is found if all words in the list are exhausted before reaching the end of the pattern.

Since there are multiple possible paths for a pattern to match a list of words, we can implement the behavior of exploration with a backtracking algorithm:

import re

def fuzzy_filter(patterns, names):
    def match(pattern, words_index=0, pattern_index=0, word_index=0):
        return pattern_index == len(pattern) or (
            words_index < len(words) and word_index < len(words[words_index])
        ) and (
            pattern[pattern_index] == words[words_index][word_index] and (
                match(pattern, words_index, pattern_index + 1, word_index + 1) or
                match(pattern, words_index + 1, pattern_index + 1)
            ) or match(pattern, words_index + 1)
        )

    for name in names:
        words = list(map(str.lower, re.split(r'[\W_]+|(?<=\w)(?=[A-Z])', name)))
        if all(map(match, patterns.split())):
            yield name

so that:

file_names = [
    'HelloWorld.csv',
    'hello_windsor.pdf',
    'some_file_i_need.jpg',
    'san_francisco.png',
    'Another.file.txt',
    'A file name.rar'
]

queries = '''\
hw
hwor
winds
sf
file need
sfr
file
file another
fnrar'''.splitlines()

for query in queries:
    print(f'{query} -> {", ".join(fuzzy_filter(query, file_names))}')

outputs:

hw -> HelloWorld.csv, hello_windsor.pdf
hwor -> HelloWorld.csv
winds -> hello_windsor.pdf
sf -> some_file_i_need.jpg, san_francisco.png
file need -> some_file_i_need.jpg
sfr -> san_francisco.png
file -> some_file_i_need.jpg, Another.file.txt, A file name.rar
file another -> Another.file.txt
fnrar -> A file name.rar

Demo: https://ideone.com/D1ITXQ

Finally, to rank the file names by last access time, you can sort the list by os.path.getatime in reverse order:

import os

os.chdir(path_to_files)
matches = sorted(fuzzy_filter(query, file_names), key=os.path.getatime, reverse=True)