Netezza - What is colocation in the perspective of SQL?

1.4k Views Asked by At

I know that colocation is important for distributed joins in Netezza. At a high level, it has the following definition:

All data for joins are located in the same SPU

I also talked to some Netezza employees in the past and they mentioned that a join is considered colocated if all tables distribute and join on the same columns.

However, I still feel that definition is a bit lacking ... Based on my understanding of 1-phase and 2-phase GROUP BY's, I suspect colocation really operates on the following definition:

A join is considered colocated if the set of columns used in the join condition is a superset of the distribution keys of all participating tables.

Is that a correct definition? I tried searching for a precise definition of colocation in NZ but all I got is a bunch of articles that kinda assume you know the definition already.

Input on this would be appreciated. Thanks!

Edit: Based on ScottMcG's suggestion, I reformulated the definition of colocated join as:

1. It must be a HASH or MERGE SORT JOIN
2. Set of columns in join conditions must be superset of all distribution keys of all participating tables
3. ?

The ? mark for #3 is an ambiguity I need to iron out. Accordng to ScottMcG, the distribution keys of each table must also be joined with each other.

Suppose Tables A, B, C are distributed on text columns A.C1, A.C2, B.C3, B.C4, C.C5, and C.C6 and we have the following join.

SELECT * FROM A 
    INNER JOIN B "Join1"
        ON A.C1=B.C3
    INNER JOIN C "Join2"
        ON A.C2=B.C4
        AND A.C2=C.C6
        AND [X]

Now, let us provide a few possible definitions of [X]. Then for which definitions of [X] will Join2 be colocated?

(1) [X] = A.C2 = 5

(2) [X] = A.C2 = B.C1 OR A.C2 = C.C5

(3) [X] = A.C1 IS NULL

(4) [X] = A.NonKeyColumn1 = B.NonKeyColumn2
1

There are 1 best solutions below

6
ScottMcG On

For Netezza, a join is considered to be colocated when the tables involved in the join do not need to be redistributed or broadcast from the data slices on which they permanently reside in order to perform the join.

This can only happen if:

  • The set of a columns required by the join are a superset of the columns in the distribution key of each table
  • Each table participating in the join has the same set of columns as their distribution key.
  • The join is an equi-join.

These conditions are pretty close to what you propose in your definition, and are necessary to allow, but not sufficient to insure, a colocated join. It is possible that the optimizer might decide to pre-broadcast one of the tables if it were small enough even though they are distributed on the same columns, and that would then technically not be a colocated join.

One caveat I should add is that for a column to be considered the "same" as another column, the column values should hash to the same value. Generally speaking this means that the column data types would be the same. An exception is that the integer family of datatypes (byteint, smallint, int, bigint) will hash to the same value as long as they are in the supported range.

With regard to the effect of types of joins, equijoins would be of this form. Note that this could either be a hash join or a merge sort join (if the data types were perhaps floating point) under the covers. In either case, we don't need to redistribute the data. In these examples, both tables are distributed on COL1.

SELECT ...
FROM TableA A
   JOIN TableB B
   ON A.COL1 = B.COL1

If the join is an expression based join like either of the following, then you will end up with a redistribution or broadcast of the data. For the "less than" join, you have to be able to determine that 8 is less than 9, but since they will both be hashed to different data slices, they can only be compared if one is relocated to the other.

SELECT ...
FROM TableA A
   JOIN TableB B
   ON A.COL1 < B.COL1

SELECT ...
FROM TableA A
   JOIN TableB B
   ON A.COL1 -  B.COL1 = 0