.NET, Programming

Normalizing Spaces with Regular Expressions

03.14.09

Back before I migrated from XML to regular expressions, I used XSL transforms to change various flavors of RSS and Atom feeds into a common format for importing. XSLT has a very nice function called normalize-space(). It takes a string and returns that same string with leading and trailing whitespace stripped and every run of whitespace characters reduced to a single space. This was pretty handy: I needed to count words so I could create a short extract, and it meant I'd only need to worry about a single space at a time.

When I moved the GetExtract functionality into Visual Basic, I figured I didn’t need to worry about this, since I would be using the String.Split function to create an array of words, and that function would be smart enough to deal with consecutive spaces, right? Turns out I was giving Bill Gates and his minions too much credit. When the Split function is confronted with two or more consecutive spaces, it does indeed count some of them as words*. A web search didn’t turn up a native .NET equivalent of normalize-space(), so I had to implement it myself.
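For anyone who wants to see the behavior for themselves, here is a minimal sketch of what String.Split does with consecutive spaces. The module and function names are made up for illustration. It also tries the Split overload that takes StringSplitOptions.RemoveEmptyEntries (available in .NET 2.0 and later), which skips the empty entries but does not clean up the string itself the way normalize-space() does.

```vbnet
Imports System

Module SplitDemo
    ' Splits on a single space and returns every piece, empty or not.
    Function SplitOnSpaces(ByVal text As String) As String()
        Return text.Split(" "c)
    End Function

    ' Same split, but asks the framework to drop the empty entries.
    Function SplitSkippingEmpties(ByVal text As String) As String()
        Return text.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Two spaces after "one", three spaces after "two".
        Dim sample As String = "one  two   three"
        Console.WriteLine(SplitOnSpaces(sample).Length)         ' prints 6
        Console.WriteLine(SplitSkippingEmpties(sample).Length)  ' prints 3
    End Sub
End Module
```

The three extra pieces in the first call are empty strings, one for each "gap" between adjacent spaces, which is enough to throw off a word count.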

And as it turns out it’s pretty simple. I just used the regular expression \s\s+ to match any sequence of more than a single whitespace character: \s matches one whitespace character, and + means one or more occurrences of whatever precedes it.
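To check what the pattern does and doesn't match, a quick sketch along these lines can help. The module and function names are hypothetical; the only real API used is Regex.Matches.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module PatternDemo
    ' Counts the runs of two or more whitespace characters in the text.
    Function CountWhitespaceRuns(ByVal text As String) As Integer
        Return Regex.Matches(text, "\s\s+").Count
    End Function

    Sub Main()
        ' Contains one single space, one double space, and one tab + newline pair.
        Dim sample As String = "a b  c" & vbTab & vbLf & "d"
        Console.WriteLine(CountWhitespaceRuns(sample))   ' prints 2
    End Sub
End Module
```

The single space between "a" and "b" is left alone, which is the point. One side effect worth knowing: a lone tab or newline is also left alone, since it is only one character. If those should become spaces too, as they would with normalize-space(), the pattern \s+ is a possible alternative.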

Here’s all the code required:

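' Requires Imports System.Text.RegularExpressions at the top of the file.
Public Shared Function NormalizeWhitespace(ByVal InputStr As String) As String

    ' Match any run of two or more whitespace characters.
    Dim NormRx As Regex = New Regex("\s\s+")
    ' Trim the ends, then collapse each run to a single space.
    Return NormRx.Replace(InputStr.Trim(), " ")

End Function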

That’s it, and it works like a champ.
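As a usage sketch, here is roughly how the function might feed a word-limited extract. This is not the original GetExtract code; the helper below is a hypothetical stand-in, written as a standalone module so it can be compiled on its own.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ExtractDemo
    ' Same logic as the NormalizeWhitespace function above.
    Function NormalizeWhitespace(ByVal InputStr As String) As String
        Dim NormRx As Regex = New Regex("\s\s+")
        Return NormRx.Replace(InputStr.Trim(), " ")
    End Function

    ' Hypothetical helper: returns the first maxWords words of the text.
    Function GetExtract(ByVal text As String, ByVal maxWords As Integer) As String
        ' After normalizing, a plain split on one space yields no empty entries.
        Dim words As String() = NormalizeWhitespace(text).Split(" "c)
        Dim count As Integer = Math.Min(maxWords, words.Length)
        Return String.Join(" ", words, 0, count)
    End Function

    Sub Main()
        Console.WriteLine(GetExtract("  The   quick brown    fox  jumps ", 3))
        ' prints: The quick brown
    End Sub
End Module
```

Because the input is normalized first, the word count matches what was asked for instead of coming up short.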

* As it happens I didn’t check to see if it was counting the spaces themselves as words, or if it was creating words that were empty strings (i.e. the text “between” consecutive spaces). Either way, I was getting extracts that had 10 or 12 words instead of the desired 25.
