I have a problem with how to get my Hive code optimized. I have a huge table as follows:
Customer_id Product_id Date Value
1 1 02/28 100.0
1 2 02/02 120.0
1 3 02/10 144.0
2 2 02/15 120.0
2 3 02/28 144.0
... ... ... ...
I want to create a complex network where I link the products through the buyers. The graph does not have to be directed and I have to count the number of links between them. In the end I need this:
Product_x Product_y amount
1 2 1
1 3 1
2 3 2
Can anyone help me with this? I need an optimized way to do this. The join of the table with itself is not the solution. I really need an optimum way on this =/
CREATE TABLE X AS
SELECT
a.product_id as product_x,
b.product_id as product_y,
count(*) as amout
FROM table as a
JOIN table as b
ON a.customer_id = b.customer_id
WHERE a.product_id < b.product_id
GROUP BY product_x, product_y;