One bloom filter for multiprocessing in Python


I need to use the largest bloom filter that fits in memory. I use "from pybloom_live import BloomFilter". I was able to create a bloom filter taking up about 49% of RAM (creating it requires roughly the same amount of memory again to save it). Then I wrote a couple of loops to check data against the bloom filter. The script works as intended, but it only uses about 10% of the CPU. What I need is for the script to load the bloom filter once and share it across worker processes; I only need membership checks. Every attempt to do this ends in memory exhaustion: each worker process makes its own copy of the bloom filter, so the second process consumes all remaining RAM. Is there a way in Python to share a single copy of the bloom filter between multiple processes? Or is there a way to make the single script use more CPU? Below is the furthest I got, but even here the second process never starts.

import multiprocessing

from pybloom_live import BloomFilter


def load_bloom_filter(file_path):
    try:
        # Read the saved filter back from disk; close the file promptly.
        with open(file_path, 'rb') as f:
            bloom_filter = BloomFilter.fromfile(f)
        print(f"[+] load successful {file_path}, size: {len(bloom_filter)}")
        return bloom_filter
    except Exception as e:
        print(f"[-] error: {e}")
        return None


def main(bloom_filter, start, end):
    current_process = multiprocessing.current_process().name
    print("[+]", current_process, "main bloom size:", len(bloom_filter.value))

    for value in range(start, end):
        calculated_value_txt = some_calc(value)
        if calculated_value_txt in bloom_filter.value:
            print('found', calculated_value_txt)
        else:
            print('not found', calculated_value_txt)


def process_task(bloom_filter, start, end):
    main(bloom_filter, start, end)


if __name__ == "__main__":
    loaded_bloom_filter = load_bloom_filter(input_bloom_file_path1)
    manager = multiprocessing.Manager()
    bloom_filter = manager.Value(BloomFilter, loaded_bloom_filter)

    task_params = [
        (bloom_filter, start1, end1),
        (bloom_filter, start2, end2),
    ]

    with multiprocessing.Pool() as pool:
        pool.starmap(process_task, task_params)
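
One workaround that keeps a single copy in RAM: on Linux, the "fork" start method lets child processes inherit any object the parent created before the pool was started, via copy-on-write, so nothing is pickled or duplicated up front. The sketch below is a minimal illustration of that pattern, using a plain `set` as a stand-in for the loaded BloomFilter (both support `x in obj`); in the real script the module-level global would be the result of `load_bloom_filter(...)`. Note that CPython reference-count updates can still gradually fault some pages into each child, though a bloom filter's large underlying bit buffer is not per-byte refcounted and stays shared in practice.

```python
import multiprocessing as mp

# Loaded ONCE in the parent before the pool is created.  With the "fork"
# start method (the default on Linux) every worker inherits this object
# via copy-on-write instead of pickling its own private copy.
# A plain set stands in for the BloomFilter here; both support `x in obj`.
SHARED_FILTER = {f"val-{i}" for i in range(1000)}

def count_hits(bounds):
    start, end = bounds
    # Read-only membership checks against the inherited parent object.
    return sum(1 for v in range(start, end) if f"val-{v}" in SHARED_FILTER)

def run_parallel(ranges, workers=2):
    ctx = mp.get_context("fork")  # copy-on-write sharing requires fork
    with ctx.Pool(workers) as pool:
        return pool.map(count_hits, ranges)

if __name__ == "__main__":
    print(run_parallel([(0, 500), (500, 1500)]))
```

This avoids `manager.Value` entirely: a Manager proxies every lookup through a separate server process, which is both slow and, with a non-picklable typecode, the likely reason the second process never started.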

UPD: Let me explain my task once more. I have a large amount of data against which I need to check values. Lookup speed is critical, so I cannot use a database. A good solution was to build a bloom filter, which takes several times less space. My code loads this bloom filter into RAM and checks for the presence of values in a loop. The problem is that this process does not use all of the CPU. I would like to speed up processing somehow. I thought about multiprocessing, but then a new problem appeared: the second process creates its own copy of the bloom filter, which occupies 49% of RAM, and the program dies. I need any solution that loads the CPU to 80% rather than the current 10%.

1 Answer

Stelios Koroneos:

Use https://github.com/prashnts/pybloomfiltermmap3. It uses memory-mapped (mmap) files, which means you can run parallel queries against the SAME bloom filter without reloading it into memory separately for each process.
Just an FYI, though: with large bloom filters the real bottleneck is usually not computation but memory access.
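
For reference, the page-sharing idea behind the mmap approach can be sketched with only the standard library: `multiprocessing.shared_memory` gives every process a view of the same bytes, so the bit array exists once no matter how many workers attach to it. The filter below is a deliberately tiny, hypothetical implementation (the names `build_filter` and `contains` and the SHA-256 bit-position scheme are all illustrative, not pybloomfiltermmap3's actual API):

```python
import hashlib
from multiprocessing import shared_memory

M = 1 << 16  # bits in the filter (deliberately tiny for illustration)
K = 4        # number of hash functions

def _positions(item: str):
    # Derive K bit positions from one SHA-256 digest (illustrative scheme,
    # not the hashing pybloomfiltermmap3 actually uses).
    digest = hashlib.sha256(item.encode()).digest()
    for i in range(K):
        yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % M

def build_filter(items, name="bloom_demo"):
    # One shared segment holds the bit array; it exists once system-wide.
    # Newly created shared memory is zero-filled, i.e. an empty filter.
    shm = shared_memory.SharedMemory(create=True, size=M // 8, name=name)
    for item in items:
        for pos in _positions(item):
            shm.buf[pos // 8] |= 1 << (pos % 8)
    return shm

def contains(name, item):
    # Any process can attach to the SAME bits by name: nothing is copied.
    shm = shared_memory.SharedMemory(name=name)
    try:
        return all(shm.buf[p // 8] & (1 << (p % 8)) for p in _positions(item))
    finally:
        shm.close()
```

Workers would call `contains(name, value)` in their loops; the operating system maps the same physical pages into each process, which is exactly why the mmap-backed library sidesteps the copy-per-process problem.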