LEFT JOIN on multiple columns with unwanted duplicates

88 Views Asked by At

I have been running in circles with a query that is driving me nuts.

The background:
I have two tables, and unfortunately, both have duplicate records. (Dealing with activity logs if that puts it into perspective). Each table comes from a different system and I am trying to join the data together to get a sudo full picture (I realize that I won't get a perfect view because there is no "event key" shared between the two systems; I am attempting to match on a composite of meta data).

Here is what I am working with:

Table1 
------------
JobID      CustID      Name      ActionDate             IsDuplicate
12345      11111       Ryan      1/1/2015 01:20:20      False
12345      11112       Bob       1/1/2015 02:10:20      False
12345      11111       Ryan      1/1/2015 04:15:35      True
12346      11113       Jim       1/1/2015 05:10:40      False
12346      11114       Jeb       1/1/2015 06:10:40      False
12346      11111       Ryan      1/1/2015 07:10:30      False

Table2 
------------
ResponseID   CustID     ActionDate            Browser
11123        10110      12/1/2014 23:32:15    IE
12345        11111      1/1/2015 03:20:20     IE
12345        11112      1/1/2015 05:10:20     Firefox
12345        11111      1/1/2015 06:15:35     Firefox
12346        11113      1/1/2015 07:10:40     Chrome
12346        11114      1/1/2015 08:10:40     Chrome
12346        11111      1/1/2015 10:10:30     Safari
12213        11123      2/1/2015 01:10:30     Chrome

Please note a few things:
- JobID and ResponseID are the same thing
- JobID and ResponseID are indicators of an event on the site (people are responding to an event)
- Action date does not match (system 2 has about an inconsistent 2 hour delay on it but never more that 3 hours delay)
- Note Table2 doesnt have a duplicate flag
- table 1 (~2,000 records) is significantly smaller than table 2 (~16,000 records)
- Note Cust 11111 is bopping around on browsers, taking the same action twice on job 12345 at different times and only taking action once on job 12346

What I am looking for:

Result (ideal)
------------
t1.JobID   t1.CustID  t1.Name  t1.ActionDate      t2.Browser
12345      11111      Ryan     1/1/2015 01:20:20  IE
12345      11112      Bob      1/1/2015 02:10:20  Firefox
12345      11111      Ryan     1/1/2015 04:15:35  Firefox
12346      11113      Jim      1/1/2015 05:10:40  Chrome
12346      11114      Jeb      1/1/2015 06:10:40  Chrome
12346      11111      Ryan     1/1/2015 07:10:30  Safari

Note that I JUST want matches for records in Table1. I am getting tons of duplicates because of the join...Which is frustrating.

Here is what I have so far (which I can humbly can say; isn't really close):

SELECT
  t1.JobID,
  t1.CustID,
  t1.Name,
  t1.ActionDate,
  t2.Browser
FROM
  Table1 t1
LEFT OUTER JOIN
  Table2 t2
 ON
  t1.JobID=t2.ResponseID AND 
  t1.CustID=t2.CustID AND
  DATEPART(dd,t1.ActionDate)=DATEPART(dd,t2.ActionDate)
2

There are 2 best solutions below

0
On

Try changing the join condition for the date to check that t2.actiondate fulfills the condition t1.actiondate <= t2.actiondate <= t1.actiondate + 3 hours

SELECT 
    t1.JobID, t1.CustID, t1.Name, t1.ActionDate, t2.Browser
FROM
    Table1 t1 
LEFT JOIN Table2 t2
       ON t1.JobID = t2.ResponseID
      AND t1.CustID = t2.CustID
      AND t2.ActionDate >= t1.ActionDate
      AND t2.ActionDate <= DATEADD(hour, 3, t1.ActionDate)
ORDER BY t1.JobID , t1.ActionDate;

With your sample data the result of this query matches your desired result.

0
On

One method is to enumerate each table using row_number() and match on the sequence numbers as well:

select t1.JobID, t1.CustID, t1.Name, t1.ActionDate, t2.Browser
from (select t1.*,
             row_number() over (partition by JobId, CustId order by ActionDate) as seqnum
      from table t1
     ) t1 join
     (select t2.*
             row_number() over (partition by ResponseId, CustId order by ActionDate) as seqnum
      from table t2
     ) t2
     on t1.JobId = t2.ResponseId and
        t1.CustId = t2.CustId and
        t1.seqnum = t2.seqnum;

This works for your sample data. However, if there is not a response for every job, then the alignment might get out of whack. If that is a possibility, then date arithmetic might be the better solution.