De-duping by Self join

doris day

Let’s continue the series on getting rid of duplicate data.  Today we’re going to use self joins to get rid of them.  A self join is a join from a table back to itself.  It might seem weird to do that, but by the end of this article you’ll see how it’s useful.  I use self joins to de-dupe when the table has a primary key, yet that primary key didn’t prevent duplicates from being inserted.  Let’s set up a demo table now.

CREATE TABLE TroubledTableWithPK (
      ttPrimaryKey INT IDENTITY(1,1)
    , FirstName VARCHAR(50)
    , LastName VARCHAR(50)
    , email VARCHAR(255)
)

Now, let’s copy some data from the backup table we created before.

INSERT INTO TroubledTableWithPK
(FirstName, LastName, email)
SELECT * from OriginalTroubledTable

Take a look at the duplicates in that table.

SELECT email, COUNT(*)
FROM TroubledTableWithPK
GROUP BY email
HAVING COUNT(*) > 1

Yeah, we have duplicates!  Take a closer look at one of these duplicates and you can clearly see how the primary key, which was supposed to uniquely identify the row to the table, didn’t prevent duplicate email addresses from being added.

SELECT *
FROM TroubledTableWithPK
WHERE email ='elit.a.feugiat@Etiam.com'

Let’s propose a way to remove all the duplicate records that aren’t the “last” copy of the duplicate data.  The last copy would be the record with the highest ttPrimaryKey Value.

SELECT email, MAX(ttPrimaryKey)
FROM TroubledTableWithPK
WHERE email ='elit.a.feugiat@Etiam.com'
GROUP BY email

If we were to join TroubledTableWithPK to a sub select of TroubledTableWithPK, we could show all those records we want to remove.

SELECT tt.*
FROM TroubledTableWithPK tt
INNER JOIN (
    SELECT email, MAX(ttPrimaryKey) AS maxTTPrimaryKey
    FROM TroubledTableWithPK
    GROUP BY email) maxvals
    ON tt.email = maxvals.email
    AND tt.ttPrimaryKey != maxvals.maxTTPrimaryKey
WHERE tt.email ='elit.a.feugiat@Etiam.com'

Compare these results with the SELECT * statement we ran.  By joining on email, and where the ttPrimaryKey != MAX(ttPrimaryKey) shows us the “early” duplicates in the table.  Now all we have to do is cahnge the SELECT to a DELETE statement.

DELETE FROM tt
FROM TroubledTableWithPK tt
INNER JOIN (
    SELECT email, MAX(ttPrimaryKey) AS maxTTPrimaryKey
    FROM TroubledTableWithPK
    GROUP BY email) maxvals
    ON tt.email = maxvals.email
    AND tt.ttPrimaryKey != maxvals.maxTTPrimaryKey
WHERE tt.email ='elit.a.feugiat@Etiam.com'

So now when we run the SELECT *, no more duplicates!

SELECT *
FROM TroubledTableWithPK
WHERE email ='elit.a.feugiat@Etiam.com'

So all we have to do to delete all the duplicates, not just the ones for Elit Feugiat, just remove the where clause.

DELETE FROM tt
FROM TroubledTableWithPK tt
INNER JOIN (
    SELECT email, MAX(ttPrimaryKey) as maxTTPrimaryKey
    FROM TroubledTableWithPK
    GROUP BY email) maxvals
    ON tt.email = maxvals.email
    AND tt.ttPrimaryKey != maxvals.maxTTPrimaryKey

And All our duplicates not-so-magically disappear!

SELECT email, COUNT(*)
FROM TroubledTableWithPK
GROUP BY email
HAVING COUNT(*) > 1

Now, I would like to say, you might want to check out the execution plan of this method.  You could run into a situation where the table scan, the implicit order of the subquery and nested loops of the inner join to the sub table could give you troubles if you had a large number of records in TroubledTableWithPK.  These operations could cause you some blocking issues if any other process was trying to read records from this table during the delete.

I’d also like to point out, if you’re deleting a large number of records, you could also find your transaction log filling up.

Deletes are going to cost you more than a truncate if you inserted the distinct records to a temp table, then truncated the table.  It’s even more than selecting the distinct into a new table, then renaming the new table to the old name.

Now you have another tool available when you’re faced with duplicate data.  Just one more method to go in this series: CTEs.  Don’t worry, they’re not as hard as you might think.  Until then, if you have any questions, let me know.  I’m here to help!

1 Comment on "De-duping by Self join"


Leave a Reply

Your email address will not be published. Required fields are marked *