How to efficiently find identical indices in multiple dataframes


I have a process that collects reports generated throughout the week and consolidates the collection to eliminate identical reports.

I've written a function that identifies identical reports by finding those with identical indices, then excludes all but one of each identical group and moves on. It works fine for 5,000-10,000 reports, but it starts to take a serious amount of time for, say, 50,000+ reports, which will become more and more common as time goes on.

It would be nice if I could eliminate the duplicate reports pre-emptively and avoid this step altogether, but the process that generates the reports doesn't allow for that. So I want to find a way to make this (or a similar) function more efficient.

The code is below:

def report_diff_index(self,dnc_data,folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data,master_report_dict)
    keys = [k for k in sorts.keys()]
    consolidated_sorts = keys
    print('Original Report Size: ', len(consolidated_sorts))
    for k in keys:
        if k in consolidated_sorts:
            for j in keys[keys.index(k)+1:]:
                if j in consolidated_sorts:
                    if len(list(set(sorts[k].index).symmetric_difference(sorts[j].index))) == 0:
                        consolidated_sorts.remove(j)
    print('Consolidated Report Size: ', len(consolidated_sorts))
    consolidated_report = {}
    consolidated_val = {}
    for s in consolidated_sorts:
        consolidated_report[s] = master_report_dict[s]
        consolidated_val[s] = master_val_dict[s]
    return consolidated_report, consolidated_val
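As an aside on the innermost test: the `symmetric_difference` check above is an order-insensitive comparison of index membership, which can be written more directly as a set equality. A small sketch with hypothetical toy indices:

```python
import pandas as pd

# Hypothetical toy indices, just to illustrate the membership test used above.
a = pd.Index(["x", "y", "z"])
b = pd.Index(["z", "x", "y"])  # same members, different order

# Equivalent to: len(list(set(a).symmetric_difference(b))) == 0
same_members = set(a) == set(b)
print(same_members)  # True
```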

There are 2 answers below.

Accepted Answer

I do not know if I understand your problem correctly, and even if I do, I do not know whether this is faster. But wouldn't it be possible to build a dict keyed on the unique report index (e.g. as a `frozenset`), with the report key as the value? It feels like a quicker way to build the unique list, but I may be off:

def report_diff_index(self,dnc_data,folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data,master_report_dict)
    print('Original Report Size: ', len(sorts))
    unique_reports = dict()
    for report_key, report in sorts.items():
        key = frozenset(report.index)
        # Alt 1. Replace with the newest (identical) report:
        # unique_reports[key] = report_key
        # Alt 2. Keep the first report seen:
        if key not in unique_reports:
            unique_reports[key] = report_key
    consolidated_sorts = unique_reports.values()
    print('Consolidated Report Size: ', len(consolidated_sorts))
    consolidated_report = {}
    consolidated_val = {}
    for s in consolidated_sorts:
        consolidated_report[s] = master_report_dict[s]
        consolidated_val[s] = master_val_dict[s]
    return consolidated_report, consolidated_val

As you can see, there are two options for the dict update, depending on whether you want to keep the first report found or the most recent one.

Dict insertion and lookup are O(1) on average, so I would expect this to be rather quick.
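To make the idea concrete, here is a minimal runnable sketch of the frozenset-keyed dedup on toy DataFrames (the report names and data are hypothetical). Note that `frozenset` ignores index order and duplicate labels, which matches the original `symmetric_difference`-on-sets check:

```python
import pandas as pd

# Toy data (hypothetical names): "mon" and "tue" share the same index set.
sorts = {
    "mon": pd.DataFrame({"v": [1, 2]}, index=["a", "b"]),
    "tue": pd.DataFrame({"v": [3, 4]}, index=["b", "a"]),  # duplicate of "mon"
    "wed": pd.DataFrame({"v": [5]}, index=["c"]),
}

unique_reports = {}
for report_key, report in sorts.items():
    key = frozenset(report.index)      # order-insensitive, hashable index key
    if key not in unique_reports:      # keep the first report seen ("Alt 2")
        unique_reports[key] = report_key

print(sorted(unique_reports.values()))  # ['mon', 'wed']
```

This is a single pass over the reports instead of a pairwise comparison, so the work grows linearly with the number of reports rather than quadratically.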

Answer

Correct me if I am wrong, but it looks like everything in:

consolidated_sorts = keys
print('Original Report Size: ', len(consolidated_sorts))
for k in keys:
    if k in consolidated_sorts:
        for j in keys[keys.index(k)+1:]:
            if j in consolidated_sorts:
                if len(list(set(sorts[k].index).symmetric_difference(sorts[j].index))) == 0:
                    consolidated_sorts.remove(j)

is just about finding unique reports. In fact, the iterations are redundant, because you first set consolidated_sorts equal to keys, then iterate those values and ask whether they are in consolidated_sorts, which is where they came from. (Note also that `consolidated_sorts = keys` does not copy the list, so removing from `consolidated_sorts` also mutates `keys` while you are iterating over it.)

If you simply want unique keys, you may try something like this:

def report_diff_index(self,dnc_data,folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data,master_report_dict)

    # New code to create unique set of keys
    unique_keys = set(sorts.keys())

    consolidated_report = {}
    consolidated_val = {}

    for key in unique_keys:
        consolidated_report[key] = master_report_dict[key]
        consolidated_val[key] = master_val_dict[key]

    return consolidated_report, consolidated_val
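The final filtering step, which appears in the question and in both answers, can also be written as dict comprehensions. A standalone sketch with hypothetical names and data:

```python
# Hypothetical standalone version of the final filtering step: dict
# comprehensions restrict both master dicts to the surviving keys.
def consolidate(keys, master_report_dict, master_val_dict):
    report = {k: master_report_dict[k] for k in keys}
    val = {k: master_val_dict[k] for k in keys}
    return report, val

rep, val = consolidate(["r1"], {"r1": "R1", "r2": "R2"}, {"r1": 10, "r2": 20})
print(rep, val)  # {'r1': 'R1'} {'r1': 10}
```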