Go ahead and set up your TroubledTable by running the Create Troubled Table T-SQl Script. Now take a look at your duplicate rows.
SELECT email, COUNT(*) FROM troubledTable GROUP BY email HAVING COUNT(*) > 1
You’ve got 200 rows that have been duplicated over and over. Not a pretty sight. Imagine how long it you would take you to retrieve each distinct row, and manually clear out the duplicate rows.
SELECT * FROM troubledtable WHERE email = 'elit.a.feugiat@Etiam.com'
Returns:
FirstName | LastName | |
Clarke | Glass | elit.a.feugiat@Etiam.com |
Clarke | Glass | elit.a.feugiat@Etiam.com |
Clarke | Glass | elit.a.feugiat@Etiam.com |
Clarke | Glass | elit.a.feugiat@Etiam.com |
You’d have to then construct two T-SQL statements for each duplicate (all 200 of them).
DELETE FROM troubledtable WHERE email = 'elit.a.feugiat@Etiam.com' INSERT INTO troubledtable (firstname, lastname, email) VALUES ('Clarke','Glass','elit.a.feugiat@Etiam.com')
That would take a very long time! Most of us would argue it would take too long, and look for another method of de-duping that data, right? Now if you had only one or two dupes, you might still consider this method. Let’s look at one more reason you might not want to do that. Look at the execution plan for that delete.
That table scan could be an issue if your table is quite large (1 million+ rows). While you could change the table scan into a index seek by adding an index, you’d then create an additional operation during the delete: an index update. If your table is a high transaction volume table, you may not want to trade the performance loss of an index operation for the gain in performance on your look up.
You need to keep performance concerns in mind when developing your solutions. That way you can have the best answer as well as the fastest answer! Next time we’re going to take a look at using temp tables to get rid of those pesky dupes. As always, if you have any questions, please let me know. I’m here to help!