Mysql deduplicate records in single query

1.2k Views Asked by At

I have the following table:

CREATE TABLE `relations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `relationcode` varchar(25) DEFAULT NULL,
  `email_address` varchar(100) DEFAULT NULL,
  `firstname` varchar(100) DEFAULT NULL,
  `latname` varchar(100) DEFAULT NULL,
  `last_contact_date` varchar(25) DEFAULT NULL,
  PRIMARY KEY (`id`)
)

In this table there are duplicates, these are relation with exact the same relationcode and email_address. They can be in there twice or even 10 times. I need a query that selects the id's of all records, but excludes the ones that are in there more than once. Of those records, I only would like to select the record with the most recent last_contact_id only.

I'm more into Oracle than Mysql, In Oracle I would be able to do it this way:

select * from (
    select row_number () over (partition by relationcode order by to_date(last_contact_date,'dd-mm-yyyy')) rank, 
           id, 
           relationcode,  
           email_address , 
           last_contact_date
    from RELATIONS)
where rank = 1

But I can't figure out how to modify this query to work in MySql. I'm not even dure it's possible to do the same thing in a single query in MySQl. Any ideas?

2

There are 2 best solutions below

1
On BEST ANSWER

Here is a method that will work in both MySQL and Oracle. It rephrases the question as: Get me all rows from relations where the relationcode has no larger last_contact_date.

It works something like this:

select r.*
from relations r
where not exists (select 1
                  from relations r2
                  where r2.relationcode = r.relationcode and
                        r2.last_contact_date > r.last_contact_date
                 );

With the appropriate indexes, this should be pretty efficient in both databases.

Note: This assumes that last_contact_date is stored as a date not as a string (as in your table example). Storing dates as strings is just a really bad idea and you should fix your data structure

2
On

Normal way to do this is a sub query to get the latest record and then join that against the table:-

SELECT id, relationcode, email_address, firstname, latname, last_contact_date
FROM RELATIONS
INNER JOIN
(
    SELECT relationcode, email_address, MAX(last_contact_date) AS latest_contact_date
    FROM RELATIONS
    GROUP BY relationcode, email_address
) Sub1
ON RELATIONS.relationcode = Sub1.relationcode
AND RELATIONS.email_address = Sub1.email_address
AND RELATIONS.last_contact_date = Sub1.latest_contact_date

It is possible to manually generate the kind of rank that your Oracle query uses using variables. Bit messy though!

SELECT id, relationcode, email_address, firstname, latname, last_contact_date
FROM
(
    SELECT id, relationcode, email_address, firstname, latname, last_contact_date, @seq:=IF(@relationcode = relationcode AND @email_address = email_address, @seq + 1, 1) AS seq, @relationcode := relationcode, @email_address := email_address
    (
        SELECT id, relationcode, email_address, firstname, latname, last_contact_date
        FROM RELATIONS
        CROSS JOIN (SELECT @seq:=0, @relationcode := '', @email_address :='') Sub1
        ORDER BY relationcode, email_address, last_contact_date DESC
    ) Sub2
) Sub3
WHERE seq = 1

This uses a sub query to initialise the variables. The sequence number is added to if the relation code and email address are the same as the previous row, if not they are reset to 1 and stored in a field. Then the outer select check the sequence number (as a field, not as the variable name) and records only returned if it is 1.

Note that I have done this as multiple sub queries. Partly to make it clearer to you, but also to try to force the order that MySQL executes it is. There are a couple of possible issues with how MySQL says it may order the execution of things that could cause an issue. They never have done for me, but with sub queries I would hope for force the order.