Using joins on non-distributed columns in Citus distributed tables

1.5k Views Asked by At

I have succesfully created a Citus cluster with a couple of worker nodes. I get issues running a query using distributed tables.

This is a very simplified version of an indicative query I want to run:

SELECT a.origin, a.destination, b.label AS origin_label, c.label AS destination_label
FROM commuting_data a
LEFT JOIN geography_data b ON a.origin=b.sequence_id
LEFT JOIN geography_data c ON a.destination=c.sequence_id
ORDER BY a.origin, a.destination

commuting_data consists of about 20 million rows and among other information each row contains an origin and destination code. geography_data consists of about 100 thousand rows and contains information about each geography code included in the commuting_data table.

The sequence_id column in geography_data matches the origin and destination columns and provides labels and other information about these spatial points. I want to join the commuting_data to the geography_data table on both origin and destination columns. This is easily done in a single node PostgreSQL database, but it is slow to complete the query.

My problem is that Citus only supports one distribution column per table. Therefore, if I set the distribution column to be either 'origin' or 'destination' for table commuting_data I get the error 'complex joins are only supported when all distributed tables are co-located and joined on their distribution columns'.

How can I overcome this problem and produce the results given from the query above utilising all the power of database sharding that Citus can offer?

I am using PostgreSQL 12.4 with Citus 9.4 on Ubuntu 20.04.

1

There are 1 best solutions below

2
On

Citus cannot execute your query in a distributed fashion as your query does a LEFT JOIN on a non-distribution column.

If it wouldn't cost too much for you, you could wrap non-distribution column of geography_data in a cte and use it in your query.