Efficiently comparing list of dictionaries in JSONL file with list of keys

  • I have a JSONL file containing around 1,000,000 dictionaries.
  • I am interested in the dictionaries where the value of field_1 is a string from list_of_strings, which contains around 100,000 strings.

I can hold both in memory at the same time, and I'd like to compare them quickly and efficiently.

My first attempt was:

import jsonlines

matching_dicts = []
key = "field_1"

# Open the JSONL file and iterate over its lines
with jsonlines.open(file_path) as reader:
    for line_number, obj in enumerate(reader):
        # Check if the object has the target field and its value is in list_of_strings
        if key in obj and obj[key] in list_of_strings:
            # If so, append the object and its line number to the results
            matching_dicts.append((obj, line_number))

This is slow. What would be faster?


There is 1 best solution below

Obaskly On

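Convert the list of strings to a set once, before the loop. A membership test on a list scans its elements one by one, so with around 100,000 strings each lookup is O(n); a set uses hashing, so each lookup takes roughly constant time on average: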

import jsonlines

set_of_strings = set(list_of_strings)
key = "field_1"
matching_dicts = []

# Open the JSONL file and iterate over its lines
with jsonlines.open(file_path) as reader:
    for line_number, obj in enumerate(reader):
        # Check if the object has the target field and its value is in the set_of_strings
        if key in obj and obj[key] in set_of_strings:
            matching_dicts.append((obj, line_number))
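
For anyone who wants to try the idea without a JSONL file at hand, here is a minimal, self-contained sketch of the same set-based filter. It is not the code from the answer above: it swaps jsonlines for the standard-library json module and reads from any iterable of lines. The function name find_matches, the sample data, and the isinstance check (which skips non-string values so that a list or dict in field_1 cannot raise a TypeError during the set lookup) are all assumptions made for illustration.

```python
import json


def find_matches(lines, wanted, key="field_1"):
    """Return (obj, line_number) pairs whose `key` value is a string in `wanted`.

    `lines` is any iterable of JSON strings, e.g. an open file handle.
    """
    wanted_set = set(wanted)  # built once; lookups are hash-based afterwards
    matches = []
    for line_number, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines (they still count toward line_number)
        obj = json.loads(line)
        value = obj.get(key)
        # Only strings can match; this also skips unhashable values (lists, dicts)
        if isinstance(value, str) and value in wanted_set:
            matches.append((obj, line_number))
    return matches


# Hypothetical sample data standing in for the real JSONL file
sample_lines = [
    '{"field_1": "apple", "n": 1}',
    '{"field_1": "pear", "n": 2}',
    '{"other": "apple"}',
    '{"field_1": ["apple"]}',
    '{"field_1": "kiwi", "n": 3}',
]

result = find_matches(sample_lines, ["apple", "kiwi"])
print(result)
```

With a real file, the open handle can be passed directly, for example find_matches(open(file_path, encoding="utf-8"), list_of_strings), since iterating a text file yields one line at a time.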