Ok, today we wrap up our design patterns for removing duplicates from tables by turning to common table expressions (CTE’s). When I start talking CTE’s some people begin to cringe in fear. I’ll admit, I still have to refer to a reference when doing something I haven’t done before with a CTE, which is pretty much de-duping and building hierarchical lists. But they’re useful, and they can have a light load on your SQL server in some situations, and de-duping is one of those situations, so let’s dive right in!
First we’re going to make sure we have our troubled table exists. If you don’t have it, grab my script here to create it. Then take a look at the duplicates you have.
SELECT email, COUNT(*) FROM TroubledTable GROUP BY email HAVING COUNT(*) > 1
We’re going to look at a email address that’s been duplicated in our table.
SELECT * FROM troubledtable WHERE email = 'eget.odio.Aliquam@enimnec.ca'
I’m looking at 7 copies of that email address. Now we’re going to create a CTE that gives every copy of an email address a distinct row number. Then we can use that row number to identify the “extra” copies, in our case any record that has a row number greater than 1.
Let’s first look at the query that gives us the row numbers for each copy.
SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN , Firstname, lastname, email FROM troubledtable WHERE email = 'eget.odio.Aliquam@enimnec.ca' --RESULTS: /* RN Firstname lastname email --- --------- --------- ----------------------------- 1 Abra Romero eget.odio.Aliquam@enimnec.ca 2 Abra Romero eget.odio.Aliquam@enimnec.ca 3 Abra Romero eget.odio.Aliquam@enimnec.ca 4 Abra Romero eget.odio.Aliquam@enimnec.ca 5 Abra Romero eget.odio.Aliquam@enimnec.ca 6 Abra Romero eget.odio.Aliquam@enimnec.ca 7 Abra Romero eget.odio.Aliquam@enimnec.ca 8 Abra Romero eget.odio.Aliquam@enimnec.ca */
Now, if we define that query as our CTE we can add another where clause that will show us only the “extra” rows. Please notice I start with the semi-colon. Do that, so the interpreter will understand your query. If you miss it, the interpreter might try to include a previous query, and that just won’t work.
;WITH cte AS ( SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN , Firstname, lastname, email FROM troubledtable WHERE email = 'eget.odio.Aliquam@enimnec.ca' ) SELECT * FROM cte WHERE RN > 1 --RESULTS: /* RN Firstname lastname email --- --------- --------- ----------------------------- 2 Abra Romero eget.odio.Aliquam@enimnec.ca 3 Abra Romero eget.odio.Aliquam@enimnec.ca 4 Abra Romero eget.odio.Aliquam@enimnec.ca 5 Abra Romero eget.odio.Aliquam@enimnec.ca 6 Abra Romero eget.odio.Aliquam@enimnec.ca 7 Abra Romero eget.odio.Aliquam@enimnec.ca 8 Abra Romero eget.odio.Aliquam@enimnec.ca */
Those are the records we want to delete. So if we change our SELECT from CTE to A DELETE from the CTE, we’re good to go!
;WITH cte AS ( SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN , Firstname, lastname, email FROM troubledtable WHERE email = 'eget.odio.Aliquam@enimnec.ca' ) DELETE FROM cte WHERE RN > 1 --(7 row(s) Affected)
Here’s the really cool part of using a CTE, the delete was actually performed against the base table. Check out the execution plan for the proof!
SELECT email, COUNT(*) FROM TroubledTable WHERE email = 'eget.odio.Aliquam@enimnec.ca' GROUP BY email HAVING COUNT(*) > 1 --(0 row(s) affected)
Pretty awesome, right?
Now, let’s remove the where clause and de-dupe the whole table.
;WITH cte AS ( SELECT ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email ORDER BY ( SELECT 0)) RN --, Firstname, lastname, email -- not needed for the delete, just to show the data. FROM troubledtable) DELETE FROM cte WHERE RN > 1 --see they're all gone: SELECT email, COUNT(*) FROM TroubledTable GROUP BY email HAVING COUNT(*) > 1
Using CTE’s to remove dupes can be a pretty cool way of doing the job. If you’re only removing a small percentage of the rows in your table, the CTE can actually reduce your locking from a table lock to a page or row lock.
As with everything, you need to choose the method that works best for your situation. You’ll have to make the judgement call on when each solution will work best for you.
That’s it for the design patterns for removing duplicates. Next time we’re going to cover a way of preventing duplicates from being inserted in the first place by performing upserts, instead of just inserting new data!
Until then, if you have any questions, please let me know! I’m here to help.