String Manipulation — REPLACE, PATINDEX and Regular Expressions
In the previous SQL posts, I showed you SUBSTRING, then I showed you REPLACE. Now, with REPLACE, you had to know what characters you wanted to replace. What if you only knew what characters you wanted to keep? Well, let me introduce you to PATINDEX. This is an advanced command that is easy to learn, difficult to master. I’ll bring this tool out of my tool belt in future posts to provide you more ways to use it.
PATINDEX let’s you define a string pattern. Many strings you’ll store in your database will have a pattern. A phone number (in america) has a very distinct pattern 3 digits of area code, 3 digits of prefix, and 4 digits telroot. You may store this in several different ways, but you’ll always store the same pattern every time, otherwise, how will your users interpret the data your storing.
The example I’m going to use to introduce you to PATINDEX is based on the idea of a setting up a fixed format or pattern for your phone numbers.
But first, I need to include my standard disclaimer:
I would like to point out that you have to be careful when and how often you use the techniques below. A good rule to keep in mind is text manipulation is slow and painful to a server. If you can leave the text manipulation to your middle-ware or front end, that would be better. But we all have been stuck in a situation where we needed to alter a string before those two points, and so I bring you the following lesson.
96615551222 Ext. 249
+44 7930 555271
Looking at the examples in the column we know there characters in there other than 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0. How can we easily remove the non numeric values from this column? We can use regular expressions. We can say, if it isn’t 0-9, remove it.
Now, the solution to this problem will use a WHILE loop. This is something I’m not really going to go into too much detail on, but using a while loop can really eat up processor time and memory. This is one of those times where this solution should only be used as a last resort, and only during non-peak server usage. Before implementing this solution, I’d be sure to spend time putting together a fix on the front end, to scrub out these bad characters before they got into my database. But, let’s come off the soapbox and discuss the solution, shall we?
Let’s first look at the syntax for the PATINDEX function.
PATINDEX ('%pattern%', sourceString)
The PATINDEX function will search for a pattern in the SourceString, and return a bigint of where that pattern is found in the sourceString. If it’s not found, then you get 0. With this you can also provide a pattern ‘%[^0-9]%’ and it will search for characters that are not numbers.
But that’s only part of the solution. You then need to couple that with REPLACE, so you can remove anything that is not a number.
--variable to hold the current phone number being updated DECLARE @phone VARCHAR(20) --variable to hold the original value (so we can update it) DECLARE @originalPhone VARCHAR(20) --if there is a phone number with something other than a number in it WHILE EXISTS ( SELECT TOP 1 phone FROM sourceTable (READCOMMITTED) WHERE PATINDEX('%[^0-9]%', phone) ) > 0) BEGIN --store the original number SELECT TOP 1 @originalPhone = phone FROM sourceTable (READCOMMITTED) WHERE PATINDEX('%[^0-9]%', phone) > 0 --then copy it to a temporary variable SET @phone = @originalPhone --spin through the phone number one character at a time -- if there is a character other than a number WHILE(PATINDEX('%[^0-9]%', @phone) ) > 0 BEGIN --then remove that one character, then continue SET @phone = REPLACE(@phone , SUBSTRING(@phone, PATINDEX('%[^0-9]%', @phone), 1) , '') END --once all the non numeric characters are gone, update UPDATE sourceTable SET Phone = @phone WHERE Phone = @originalPhone END
If you’re having difficulties with the WHILE loop, search my blog for the entry on WHILE loops. They’re pretty easy once you get to know them.
If you wanted to get fancy you could make the replace portion of this code a scalar function, and call it in an update step. It would get rid of the explicit outer loop. In future versions of SQL this may improve, but for SQL 2000, this is the best I’ve come up with.
There are many more string manipulation techniques to cover, so stay tuned to this same bat-channel!