Query Taking Ages to Run in Hive, any way to simplify it

234 Views Asked by At

I was looking for some help in regards to an SQL query that I am trying to run in Hive (Hue Platform) and was looking for some advice to see if the query could be shortened or written differently to make it run quicker. When I have ran this (see below) against a dummy data set (1500 rows) in SQL Developer this runs farily quickly and gives me what I need:

SELECT 
    P_AGENT,
    TRUNC(P_DATE) AS P_DATE,
    SUM(TOUCH_COUNT) AS TOUCH_COUNT,   
    MIN(P_DATE) AS START_TIME,
    MAX(P_DATE) AS END_TIME,
    MAX(P_DATE) - MIN(P_DATE) AS TIME_TAKEN
FROM 
    (
    SELECT 
        T1.*,
        SUM(COUNT1) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) AS GRP,
        CASE  
            WHEN P_DATE - LAG(P_DATE) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) <= 1 / 48  
            THEN NULL
            ELSE 1
         END AS TOUCH_COUNT
    FROM 
        (
          SELECT 
            T1.*,
            (CASE WHEN P_DATE - LAG(P_DATE) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) <= 1/48  
             THEN 0 ELSE 1
            END) AS COUNT1
          FROM 
            TABLE2 T1
        ) T1
    ) T1
GROUP BY
    P_AGENT, TRUNC(P_DATE), GRP
ORDER BY
    P_DATE
    ;

Unfortunately when I try to run the same query in Hive, the query takes ages to run and the max I have ran it for is 7 hours and I still dont have an output. The problem is that I am trying to run this against a table that has over 300 million rows hence the length of time it is taking.

I decided to break up the queries and put in the initial subquery into a table (dummy data shown below) and this was fine and the output I now have in a sub table with 347million rows (dummy data below only shows 10 rows)

CREATE TABLE

CREATE TABLE "TIME_ISSUES" 
   (    
    "P_DATE" DATE, 
    "P_AGENT" VARCHAR2(8 BYTE), 
    "COUNT1" NUMBER(5,0)
   )

INSERT STATEMENTS

Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('05-APR-16 20:20:12','DD-MON-RR HH24:MI:SS'),'CLQRC0',1); 
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('07-APR-16 15:06:09','DD-MON-RR HH24:MI:SS'),'SMCXF3',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('08-APR-16 04:33:00','DD-MON-RR HH24:MI:SS'),'EAQGH1',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('08-APR-16 12:17:53','DD-MON-RR HH24:MI:SS'),'JMENJDS',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('09-APR-16 13:06:53','DD-MON-RR HH24:MI:SS'),'JMENJDS',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('11-APR-16 10:41:00','DD-MON-RR HH24:MI:SS'),'NUKXY3',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('12-APR-16 10:15:21','DD-MON-RR HH24:MI:SS'),'JMRJADS',1);
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('12-APR-16 11:19:23','DD-MON-RR HH24:MI:SS'),'CLMXB1',1); 
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('15-APR-16 08:36:00','DD-MON-RR HH24:MI:SS'),'EMA0L1',1);  
Insert into TIME_ISSUES (P_DATE,P_AGENT,COUNT1) values (to_date('22-APR-16 16:22:00','DD-MON-RR HH24:MI:SS'),'EADBM1',1);

Unfortunately the initial table has 49 different columns but I have selected the 2 columns that I am using in the example above.

I have found that when running the second sub query, it looks the problem lies in the statement:

SUM(COUNT1) OVER (PARTITION BY P_AGENT, TRUNC(P_DATE) ORDER BY P_DATE) AS GRP,

The full query that I am running is shown here:

SELECT T1.*,
  SUM(COUNT1) OVER (PARTITION BY P_AGENT, TO_DATE(P_DATE) ORDER BY P_DATE) AS GRP,
  CASE
    WHEN P_DATE - LAG(P_DATE) over (PARTITION BY P_RECR, TO_DATE(P_DATE) ORDER BY P_DATE) <= 1 / 48
    THEN NULL
    ELSE 1
  END AS TOUCH_COUNT
FROM TIME_ISSUES T1 ;

but this takes ages to run and so I wanted to some advice to see if there is anything different that i could try as i am stuck on how to resolve this now.

The output that I am looking for will show all work that an agent has completed under a 30 minute timeframe.

An example of this from my data set is:

P_AGENT | P_DATE | TOUCH_COUNT | START_TIME | END _TIME | TIME_TAKEN
JMRJADS | 12-APR-16 | 1 | 12-APR-16 10:15:21 | 12-APR-16 10:15:21 | 0
1

There are 1 best solutions below

5
On

No need for TOUCH_COUNT when you already calculated COUNT1 at the previous stage