I have a table with around 7 million rows of geographical regions where usage of a communication protocol results in channel conflicts. A simplified example of this table is as follows, where HexGeoBinId comes from HexGeoBins table and ChannelId is from Channels table.
Id | HexGeoBinId | ChannelId |
---|---|---|
1 | 450187 | 1 |
2 | 450187 | 2 |
3 | 1200447 | 7 |
4 | 1050395 | 7 |
5 | 1650603 | 16 |
6 | 1650603 | 18 |
7 | 1650603 | 19 |
8 | 1650603 | 20 |
9 | 450187 | 20 |
10 | 450187 | 22 |
ChannelId links to the table Channels with the following structure:
Id | Protocol | Frequency |
---|---|---|
0 | protocol1 | 1005 |
1 | protocol2 | 1005 |
2 | protocol3 | 1005 |
3 | protocol1 | 1010 |
4 | protocol2 | 1010 |
5 | protocol3 | 1010 |
6 | protocol1 | 1015 |
7 | protocol2 | 1015 |
8 | protocol3 | 1015 |
9 | protocol1 | 1020 |
I am using the code below to loop through all combinations of three candidate frequencies (1000 to 1200 in steps of 5, with a minimum separation between them) and, for each combination, find the number of geographical regions that still have channel conflicts. This essentially counts the number of geographical bins that contain conflicts on all three frequencies.
I can get individual results with nested for loops and individual queries (query below). Each query takes over 60s to process.
import numpy as np
import pandas as pd

# cur is an open sqlite3 cursor on the database
separation = 10
f = list(map(int, np.arange(1000, 1200, 5)))
protocol = 'protocol1'
query = '''SELECT COUNT(DISTINCT("Hex1")) FROM
(SELECT HexGeoBins.HexId AS "Hex1", Channels.Frequency
FROM ChannelConflicts JOIN Channels ON ChannelConflicts.ChannelId == Channels.Id
JOIN HexGeoBins ON ChannelConflicts.HexGeoBinId == HexGeoBins.Id
WHERE Channels.Frequency = ? AND Channels.Protocol = ?)
JOIN
(SELECT HexGeoBins.HexId AS "Hex2", Channels.Frequency
FROM ChannelConflicts JOIN Channels ON ChannelConflicts.ChannelId == Channels.Id
JOIN HexGeoBins ON ChannelConflicts.HexGeoBinId == HexGeoBins.Id
WHERE Channels.Frequency = ? AND Channels.Protocol = ?)
ON "Hex1" = "Hex2"
JOIN
(SELECT HexGeoBins.HexId AS "Hex3", Channels.Frequency
FROM ChannelConflicts JOIN Channels ON ChannelConflicts.ChannelId == Channels.Id
JOIN HexGeoBins ON ChannelConflicts.HexGeoBinId == HexGeoBins.Id
WHERE Channels.Frequency = ? AND Channels.Protocol = ?)
ON "Hex1" = "Hex3";'''

rowList = []
for f1 in f:
    for f2 in f:
        if f1 <= f2 - separation:
            for f3 in f:
                if f2 <= f3 - separation:
                    hexcount = cur.execute(query, [f1, protocol, f2, protocol, f3, protocol]).fetchone()[0]
                    rowList.append({"Protocol": protocol,
                                    "Frequency1": f1,
                                    "Frequency2": f2,
                                    "Frequency3": f3,
                                    "HexCount": hexcount,
                                    "UsPercent": hexcount / 1800631.0 * 100})
df = pd.DataFrame(rowList)
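As a side note, the three nested loops plus the two separation checks are equivalent to filtering ascending combinations, which can be written more directly. A minimal sketch, assuming the same f and separation values as above (only the triple generation, not the queries):

```python
from itertools import combinations

separation = 10
f = list(range(1000, 1200, 5))

# Every (f1, f2, f3) triple with at least `separation` between neighbours.
# combinations() already yields f1 < f2 < f3, so no explicit ordering checks
# are needed beyond the two gap conditions.
triples = [(f1, f2, f3)
           for f1, f2, f3 in combinations(f, 3)
           if f2 - f1 >= separation and f3 - f2 >= separation]
```

This produces the same triples the nested loops do (given a positive separation, the loop conditions already force f1 < f2 < f3), and makes the number of query executions easy to inspect up front via len(triples).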
A sample output of this df is:
Id | Protocol | Frequency1 | Frequency2 | Frequency3 | HexCount | UsPercent |
---|---|---|---|---|---|---|
0 | protocol1 | 1005 | 1015 | 1025 | 1096 | 0.060867551 |
1 | protocol1 | 1005 | 1015 | 1030 | 3185 | 0.176882437 |
2 | protocol1 | 1005 | 1015 | 1035 | 1248 | 0.069309037 |
3 | protocol1 | 1005 | 1015 | 1040 | 181 | 0.010052032 |
4 | protocol1 | 1005 | 1015 | 1045 | 911 | 0.050593375 |
5 | protocol1 | 1005 | 1015 | 1050 | 1052 | 0.058423964 |
6 | protocol1 | 1005 | 1015 | 1055 | 2935 | 0.162998416 |
7 | protocol1 | 1005 | 1015 | 1060 | 1751 | 0.097243688 |
8 | protocol1 | 1005 | 1015 | 1065 | 2059 | 0.114348803 |
9 | protocol1 | 1005 | 1015 | 1070 | 6043 | 0.335604574 |
This currently takes over 60s per query to run. With a significant number of permutations, the overall run time is unacceptable.
Any suggestions on how to expedite this effort?
In the short term I'd prefer not to modify the database, but if that brings significant performance gains, I will make the changes eventually. Ideally, there is a simpler, faster query that can do this as-is.
UPDATE
I was able to speed up the query with the simple realization that I don't care about the actual HexGeoBinIds, just the count of distinct ones. So, that removes one JOIN. Also, I can avoid another JOIN by using ChannelId instead of using Protocol and Frequency from Channels table. This removes another JOIN.
This gives the current query:
query = '''SELECT COUNT(DISTINCT("Hex1")) FROM
(SELECT ChannelConflicts.HexGeoBinId AS "Hex1", ChannelConflicts.ChannelId
FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?)
JOIN
(SELECT ChannelConflicts.HexGeoBinId AS "Hex2", ChannelConflicts.ChannelId
FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?)
ON "Hex1" = "Hex2"
JOIN
(SELECT ChannelConflicts.HexGeoBinId AS "Hex3", ChannelConflicts.ChannelId
FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?)
ON "Hex1" = "Hex3";'''
I have improved from 60s to 2s per query, but would like it much faster. Any thoughts?
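For concreteness, here is the improved query run end to end against the sample rows from the top of the question, using an in-memory SQLite database (the channel choices 1, 2 and 20 are just an example; in the sample data only bin 450187 conflicts on all three):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ChannelConflicts "
            "(Id INTEGER PRIMARY KEY, HexGeoBinId INTEGER, ChannelId INTEGER)")
# Sample rows from the question
cur.executemany("INSERT INTO ChannelConflicts VALUES (?, ?, ?)", [
    (1, 450187, 1), (2, 450187, 2), (3, 1200447, 7), (4, 1050395, 7),
    (5, 1650603, 16), (6, 1650603, 18), (7, 1650603, 19), (8, 1650603, 20),
    (9, 450187, 20), (10, 450187, 22),
])

query = '''SELECT COUNT(DISTINCT("Hex1")) FROM
(SELECT ChannelConflicts.HexGeoBinId AS "Hex1", ChannelConflicts.ChannelId
FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?)
JOIN
(SELECT ChannelConflicts.HexGeoBinId AS "Hex2", ChannelConflicts.ChannelId
FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?)
ON "Hex1" = "Hex2"
JOIN
(SELECT ChannelConflicts.HexGeoBinId AS "Hex3", ChannelConflicts.ChannelId
FROM ChannelConflicts WHERE ChannelConflicts.ChannelId = ?)
ON "Hex1" = "Hex3";'''

# Bins that conflict on channels 1, 2 and 20 simultaneously
count = cur.execute(query, [1, 2, 20]).fetchone()[0]
```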
ANSWER

Not sure I understood, but FWIW one possible solution would be to treat the different valueIds (the ChannelIds in this case) as "bits" and reduce their combinations into numbers ("bitmasks"), on which you can conveniently perform aggregate operations. Basically, this produces one row per bin whose "bitmask" uniquely identifies a specific set of "conflicts". To find out how many "bins" are associated with each combination, just group once again, this time on the bitmask. To convert the bitmasks back to valueIds you'll need some trivial code on the app side.
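A minimal sketch of the bitmask idea in SQLite, using the sample rows from the question. The mapping of each ChannelId directly to a bit position is illustrative: with many channels the mask would overflow 64 bits, so in practice you would remap only the channels of interest to low bit positions first.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ChannelConflicts "
            "(Id INTEGER, HexGeoBinId INTEGER, ChannelId INTEGER)")
cur.executemany("INSERT INTO ChannelConflicts VALUES (?, ?, ?)", [
    (1, 450187, 1), (2, 450187, 2), (3, 1200447, 7), (4, 1050395, 7),
    (5, 1650603, 16), (6, 1650603, 18), (7, 1650603, 19), (8, 1650603, 20),
    (9, 450187, 20), (10, 450187, 22),
])

# One bitmask per bin: bit i is set iff the bin has a conflict on ChannelId i.
# SUM acts as a bit-OR here because the inner DISTINCT guarantees each
# (bin, channel) pair contributes its bit at most once.
cur.execute("""
    SELECT HexGeoBinId, SUM(1 << ChannelId) AS bitmask
    FROM (SELECT DISTINCT HexGeoBinId, ChannelId FROM ChannelConflicts)
    GROUP BY HexGeoBinId
""")
masks = dict(cur.fetchall())

# Counting bins that cover a given channel set is then a single pass over
# the precomputed masks, e.g. conflicts on channels 1, 2 and 20 at once:
want = (1 << 1) | (1 << 2) | (1 << 20)
count = sum(1 for m in masks.values() if m & want == want)
```

To count bins per distinct combination instead, group the same derived table once more: SELECT bitmask, COUNT(*) ... GROUP BY bitmask.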