De-duping by CTE

Ok, today we wrap up our design patterns for removing duplicates from tables by turning to common table expressions (CTE’s). When I start talking CTE’s some people begin to cringe in fear. I’ll admit, I still have to refer to a reference when doing something I haven’t done before with a CTE, which is pretty much de-duping and building hierarchical lists. But they’re useful, and they can have a light load on your SQL server in some situations, and de-duping is one of those situations, so let’s dive right in!

First we’re going to make sure we have our troubled table exists. If you don’t have it, grab my script here to create it. Then take a look at the duplicates you have.

SELECT email, COUNT(*)
FROM TroubledTable
GROUP BY email
HAVING COUNT(*) > 1

We’re going to look at a email address that’s been duplicated in our table.

SELECT * 
FROM troubledtable
WHERE email = 'eget.odio.Aliquam@enimnec.ca'

I’m looking at 7 copies of that email address. Now we’re going to create a CTE that gives every copy of an email address a distinct row number. Then we can use that row number to identify the “extra” copies, in our case any record that has a row number greater than 1.

Let’s first look at the query that gives us the row numbers for each copy.

SELECT 
	ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN
	, Firstname, lastname, email
FROM   troubledtable
WHERE email = 'eget.odio.Aliquam@enimnec.ca' 

--RESULTS:
/*
RN  Firstname lastname  email
--- --------- --------- ----------------------------- 
1   Abra      Romero    eget.odio.Aliquam@enimnec.ca
2   Abra      Romero    eget.odio.Aliquam@enimnec.ca
3   Abra      Romero    eget.odio.Aliquam@enimnec.ca
4   Abra      Romero    eget.odio.Aliquam@enimnec.ca
5   Abra      Romero    eget.odio.Aliquam@enimnec.ca
6   Abra      Romero    eget.odio.Aliquam@enimnec.ca
7   Abra      Romero    eget.odio.Aliquam@enimnec.ca
8   Abra      Romero    eget.odio.Aliquam@enimnec.ca
*/

Now, if we define that query as our CTE we can add another where clause that will show us only the “extra” rows. Please notice I start with the semi-colon. Do that, so the interpreter will understand your query. If you miss it, the interpreter might try to include a previous query, and that just won’t work.

;WITH cte AS (
	SELECT 
		ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN
		, Firstname, lastname, email
    FROM   troubledtable
    WHERE email = 'eget.odio.Aliquam@enimnec.ca'  
)
SELECT * 
FROM cte
WHERE  
	RN > 1 
--RESULTS:
/*
RN  Firstname lastname  email
--- --------- --------- ----------------------------- 
2   Abra      Romero    eget.odio.Aliquam@enimnec.ca
3   Abra      Romero    eget.odio.Aliquam@enimnec.ca
4   Abra      Romero    eget.odio.Aliquam@enimnec.ca
5   Abra      Romero    eget.odio.Aliquam@enimnec.ca
6   Abra      Romero    eget.odio.Aliquam@enimnec.ca
7   Abra      Romero    eget.odio.Aliquam@enimnec.ca
8   Abra      Romero    eget.odio.Aliquam@enimnec.ca
*/

Those are the records we want to delete. So if we change our SELECT from CTE to A DELETE from the CTE, we’re good to go!

;WITH cte AS (
	SELECT 
		ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN
		, Firstname, lastname, email
    FROM   troubledtable
    WHERE
		email = 'eget.odio.Aliquam@enimnec.ca'	
)
DELETE 
FROM cte
WHERE  
	RN > 1 
--(7 row(s) Affected)

Here’s the really cool part of using a CTE, the delete was actually performed against the base table. Check out the execution plan for the proof!

SELECT email, COUNT(*)
FROM TroubledTable
WHERE email = 'eget.odio.Aliquam@enimnec.ca'
GROUP BY email
HAVING COUNT(*) > 1
--(0 row(s) affected)

Pretty awesome, right?
Now, let’s remove the where clause and de-dupe the whole table.

;WITH cte AS (
	SELECT 
		ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN
		--, Firstname, lastname, email -- not needed for the delete, just to show the data.
    FROM   troubledtable)
DELETE 
FROM cte
WHERE  
	RN > 1
--see they're all gone:
SELECT email, COUNT(*)
FROM TroubledTable
GROUP BY email
HAVING COUNT(*) > 1

Using CTE’s to remove dupes can be a pretty cool way of doing the job. If you’re only removing a small percentage of the rows in your table, the CTE can actually reduce your locking from a table lock to a page or row lock.

As with everything, you need to choose the method that works best for your situation. You’ll have to make the judgement call on when each solution will work best for you.

That’s it for the design patterns for removing duplicates. Next time we’re going to cover a way of preventing duplicates from being inserted in the first place by performing upserts, instead of just inserting new data!

Until then, if you have any questions, please let me know! I’m here to help.

Leave a Reply

Your email address will not be published. Required fields are marked *