Lately I have been playing a bit with Hive. Most things have been progressing well, however, when I try to convert something like
2015-04-01 device1 traffic other start
2015-04-01 device1 traffic violation deny
2015-04-01 device1 traffic violation deny
2015-04-02 device1 traffic other start
2015-04-03 device1 traffic other start
2015-04-03 device1 traffic other start
Into
2015-04-01 1 2
2015-04-02 1
2015-04-03 2
I tried using the following query but for some reason the reduce stage of the query just gets stuck at 96% no matter how long I wait.
SELECT pass.date, COUNT(pass.type), COUNT(deny.deny_type) FROM firewall_logs as pass
JOIN (
SELECT date, type as deny_type FROM firewall_logs
WHERE device = 'device1'
AND date LIKE '2015-04-%'
AND type = 'traffic' AND subtype = 'violation' and status = 'deny'
) deny ON ( pass.date = deny.date )
WHERE pass.device = 'device1'
AND pass.date LIKE '2015-04-%'
AND pass.type = 'traffic' AND pass.subtype = 'other' AND pass.status = 'start'
GROUP BY pass.date ORDER BY pass.date ;
All MR2 logs show is:
2015-06-11 01:54:04,206 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 9028000 rows for join key [2015-04-26]
2015-06-11 01:54:04,423 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 9128000 rows for join key [2015-04-26]
2015-06-11 01:54:04,638 INFO [main] org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 9228000 rows for join key [2015-04-26]
2015-06-11 01:54:04,838 INFO [main] org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
Would anyone have an idea of why?
I try to avoid self-joins in Hive like the plague. You can do this by collecting and creating a map