Postgres - Find duplicate values after lowering the values

1k Views Asked by At

Hello StackOverflow users... I have a tricky situation and I have yet to find an answer. Maybe you can help me.

Database: PostgreSQL 8.4 (can't upgrade)

In this database, there is a users table. Sadly, the usernames that users can provide when they create a user profile is case sensitive, so a username of Alex is not the same as a username of alex.

There is a new system going out and username is no longer case sensitive. I'm trying to find all of the usernames that would be considered duplicates in the old system. This way we can reach out and have them update the usernames manually and then migrate their users to a newer system (without conflict of username).

I have the following query which will show me the counts of each username matching another with the "lower()" function.

select count(*), lower(username)
  from users
  where deleted = false
  group by lower(username) having count(*) > 1

This returns results like the following:

|count|lower   |
|-----+--------+
|3    |alex    |
|2    |george  |

What I need to do is get this data into a temp table and display all of those duplicate users and other details so that we have a list to go through.

I have part of the temp table figured out, but my main issue is: How do I get the distinct values of all of these duplicates? So in the long run, I get results that look like the following (and maybe even without a temp table if possible):

|lower  |username|
|-------+--------+
|alex   |Alex    |
|alex   |alex    |
|george |georGe  |
|george |George  |

Restrictions:

  • I can't change the version of postgres from 8.4
  • Some duplicates will have more than 2 hits (the most I've seen so far is 3)
  • Since the users must be informed, there is no way to change the data other then to contact them prior (which is why the list is needed)

I appreciate any suggestions/feedback you may be able to provide.

2

There are 2 best solutions below

2
Gil Hamilton On BEST ANSWER

How about this. Just generate your above list as a CTE, then join with it in the main query:

WITH dups AS (
    SELECT lower(username) uname, count(*) ucount 
    FROM users 
    WHERE deleted = false 
    GROUP BY lower(username) HAVING count(*) > 1)
SELECT username, uname, ucount 
FROM users INNER JOIN dups ON lower(username) = uname 
WHERE deleted = false
ORDER BY ucount DESC, uname ASC;

 username | uname  | ucount
----------+--------+--------
 Alex     | alex   |      3
 alex     | alex   |      3
 ALEX     | alex   |      3
 GeorGe   | george |      2
 george   | george |      2
(5 rows)

Or even simpler if you only want a bare list of the affected users:

SELECT username
FROM users 
WHERE deleted = false AND lower(username) IN (
    SELECT lower(username)
    FROM users
    WHERE deleted = false
    GROUP BY lower(username) HAVING count(*) > 1)
ORDER BY lower(username) ASC;

 username
----------
 Alex
 alex
 ALEX
 GeorGe
 george
(5 rows)
11
dliff On

I would usually use string_agg, but it looks like it's not supported in 8.4. There appears to be a workaround, but note that I haven't tested due to not having a local copy of 8.4 handy. Something like this should work:

select
  (max(u1.username)),
  array_to_string(array_agg(u2.username), ',') as duplicates
  from users u1
         inner join users u2 on u1.id < u2.id
         and lower(u1.username) = lower(u2.username)
         left join users u3 on u1.id > u3.id
         and lower(u1.username) = lower(u3.username)
         and u3.deleted = false
 where u1.deleted = false
   and u2.deleted = false
   and u3.id is null
 group by u1.id;

This will get the "earliest" user by ID (assuming there is a primary key that isn't username. It could be modified to show the actual lowercase username, and then the rest in the duplicates column.

Edit: to show a row for each duplicate:

select
  lower(u1.username),
  u2.username
  from users u1
         inner join users u2 on u1.id < u2.id
         and lower(u1.username) = lower(u2.username)
         left join users u3 on u1.id > u3.id
         and lower(u1.username) = lower(u3.username)
         and u3.deleted = false
 where u1.deleted = false
   and u2.deleted = false
   and u3.id is null
order by u1.username;