Improve efficiency of sqlite3 query in python

81 Views Asked by At

I have a table with around 7 million rows of geographical regions where usage of a communication protocol results in channel conflicts. A simplified example of this table is as follows, where HexGeoBinId comes from HexGeoBins table and ChannelId is from Channels table.

Id HexGeoBinId ChannelId
1 450187 1
2 450187 2
3 1200447 7
4 1050395 7
5 1650603 16
6 1650603 18
7 1650603 19
8 1650603 20
9 450187 20
10 450187 22

ChannelId links to the table Channels with the following structure:

Id Protocol Frequency
0 protocol1 1005
1 protocol2 1005
2 protocol3 1005
3 protocol1 1010
4 protocol2 1010
5 protocol3 1010
6 protocol1 1015
7 protocol2 1015
8 protocol3 1015
9 protocol1 1020

I am using the below code to loop through all non-repetitive permutations of three potential frequencies, where the frequency can be 1000 to 1200 in steps of 5, to find the number of geographical regions that still have channel conflicts. This essentially counts the number of geographical bins that contain conflicts on all three frequencies.

I can get individual results with nested for loops and individual queries (query below). Each query takes over 60s to process.

    separation = 10
    f = list(map(int, np.arange(1000, 1200, 5)));
    protocol = 'protocol1'

    query = '''SELECT COUNT(DISTINCT("Hex1")) FROM 
    (SELECT HexGeoBins.HexId AS "Hex1", Channels.Frequency 
    FROM ChannelConflicts JOIN Channels ON ChannelConflicts.ChannelId == Channels.Id 
    JOIN HexGeoBins ON ChannelConflicts.HexGeoBinId == HexGeoBins.Id 
    WHERE Channels.Frequency = ? AND Channels.Protocol = ?) 
    JOIN 
    (SELECT HexGeoBins.HexId AS "Hex2", Channels.Frequency 
    FROM ChannelConflicts JOIN Channels ON ChannelConflicts.ChannelId == Channels.Id 
    JOIN HexGeoBins ON ChannelConflicts.HexGeoBinId == HexGeoBins.Id 
    WHERE Channels.Frequency = ? AND Channels.Protocol = ?) 
    ON "Hex1" = "Hex2" 
    JOIN 
    (SELECT HexGeoBins.HexId AS "Hex3", Channels.Frequency 
    FROM ChannelConflicts JOIN Channels ON ChannelConflicts.ChannelId == Channels.Id 
    JOIN HexGeoBins ON ChannelConflicts.HexGeoBinId == HexGeoBins.Id 
    WHERE Channels.Frequency = ? AND Channels.Protocol = ?) 
    ON "Hex1" = "Hex3";'''

    for f1 in f:
        for f2 in f:
            if (f1 <= f2 - separation):
                for f3 in f: 
                    if (f2 <= f3 - separation):
                        hexcount = cur.execute(query,[f1,protocol,f2,protocol,f3,protocol]).fetchone()[0];
                        dict1 = {"Protocol": protocol,
                                 "Frequency1": f1,
                                 "Frequency2": f2,
                                 "Frequency3": f3,
                                 "HexCount": hexcount,
                                 "UsPercent": hexcount/1800631.0*100};
                        rowList.append(dict1);
    df = pd.DataFrame(rowList);

A sample output of this df is:

Id Protocol Frequency1 Frequency2 Frequency3 HexCount UsPercent
0 protocol1 1005 1015 1025 1096 0.060867551
1 protocol1 1005 1015 1030 3185 0.176882437
2 protocol1 1005 1015 1035 1248 0.069309037
3 protocol1 1005 1015 1040 181 0.010052032
4 protocol1 1005 1015 1045 911 0.050593375
5 protocol1 1005 1015 1050 1052 0.058423964
6 protocol1 1005 1015 1055 2935 0.162998416
7 protocol1 1005 1015 1060 1751 0.097243688
8 protocol1 1005 1015 1065 2059 0.114348803
9 protocol1 1005 1015 1070 6043 0.335604574

This currently takes over 60s per query to run. With a significant number of permutations, the overall run time is unacceptable.

Any suggestions on how to expedite this effort?

In the short-term I'd prefer to not have to modify the database, but if that it brings significant performance gains, I will make the changes eventually. Ideally, there is a simpler, faster query that can be created to do this as-is.

UPDATE

I was able to speed up the query with the simple realization that I don't care about the actual HexGeoBinIds, just the count of distinct ones. So, that removes one JOIN. Also, I can avoid another JOIN by using ChannelId instead of using Protocol and Frequency from Channels table. This removes another JOIN.

This gives the current query:

    query = '''SELECT COUNT(DISTINCT("Hex1")) FROM 
    (SELECT ChannelConflicts.HexGeoBinId AS "Hex1", ChannelConflicts.ChannelId 
     FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?) 
    JOIN 
    (SELECT ChannelConflicts.HexGeoBinId AS "Hex2", ChannelConflicts.ChannelId 
     FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?) 
    ON "Hex1" = "Hex2" 
    JOIN 
    (SELECT ChannelConflicts.HexGeoBinId AS "Hex3", ChannelConflicts.ChannelId 
     FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?) 
    ON "Hex1" = "Hex3";'''

I have improved from 60s to 2s per query, but would like it much faster. Any thoughts?

1

There are 1 best solutions below

2
On

Not sure I understood, but FWIW one possible solution would be to treat different valueIds as "bits" and reduce their combinations into numbers ("bitmasks"), on which you can conveniently perform aggregate operations.

Basically, this

select binId, sum(1 << valueId) as bitmask from conflicts group by binId

will return something like

binId   bitmask
1       14
2       12
3       14

where "bitmask" uniquely identifies a specific set of "conflicts".

To find out how many "bins" are associated with each combination, just group once again:

select bitmask, count(*) from (select ... etc) group by bitmask

Here's a full transcript:

sqlite> create table conflicts(binId int, valueId int);
sqlite> insert into conflicts values
(1,2),
(1,3),
(1,1),
(2,2),
(2,3),
(3,1),
(3,3),
(3,2);
sqlite> select binId, sum(1 << valueId) as bitmask from conflicts group by binId;
1|14
2|12
3|14
sqlite> select bitmask, count(*) from (
    select binId, sum(1 << valueId) as bitmask from conflicts group by binId
) group by bitmask;
12|1
14|2

To convert these bitmasks back to valueIds you'll need some trivial code on the app side.