.NET, Programming, XML

So Long, XML. Hello, RegEx.

03.06.09 | 1 Comment

So a few years ago, as I was clicking the links in my blogroll for the 10th time that day, I wondered if there was a better way of finding out when my favorite sites were updated. This was before feed readers were really popular, and anyway I wanted more of a Google News layout. Then I had an epiphany: site feeds were just XML files, and I had already created a bunch of XML-processing scripts in ASP for my Bleeker Books site (and God help me, it’s still using them).

So over a long weekend I cobbled together a program that would read these feeds into a database and spit them back out in a nice format, and CrimeSpot was born.

Over the next year or so I recoded the site in .NET, and I have to say those tools made it a lot easier. But I still ran into a problem from time to time, one that I couldn’t do anything about: every so often, I couldn’t import a feed. It would have some sort of formatting problem that made it an invalid XML document, and my program would throw up its hands and give up.

Usually this was because of Microsoft Word – if you copy the contents of a Word document and paste it as HTML, a lot of Word’s proprietary formatting comes along for the ride. In particular, you end up with a lot of tags that look like <o:p>. To an XML parser, that’s an undeclared namespace prefix, and the document can’t be read. More broadly, any error anywhere in an XML document makes the entire document unreadable.
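To make the failure concrete, here’s a quick sketch (in Python rather than my .NET code, and with an invented feed fragment): the stray <o:p> tag uses a namespace prefix that was never declared, so a strict XML parser rejects the whole document, not just the bad element.

```python
# Illustrates the failure mode: a Word-pasted <o:p> tag uses a namespace
# prefix that is never declared, so a strict XML parser rejects the
# entire document. The feed snippets are invented for this example.
import xml.etree.ElementTree as ET

def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

clean_feed = "<item><title>New Post</title></item>"
word_pasted = ("<item><title>New Post</title>"
               "<description><o:p>pasted from Word</o:p></description></item>")

print(parses_as_xml(clean_feed))   # True
print(parses_as_xml(word_pasted))  # False -- parser complains about an unbound prefix
```

One bad tag buried in a description is enough to make the entire feed unreadable to the parser.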

As I said, this has been going on for a while, but as I add feeds I can see that it’s going to be a more and more common problem. So I finally made a command decision. Processing these documents as XML is out. From now on I’m going to use regular expressions to extract the data I want.

For those of you not in the know, regular expressions are pattern matching tools that can find and extract information from a longer document. In practical terms, this means that as long as the tags surrounding the content are correct, I can retrieve the information I want. Any errors in the content itself I can clean up once I’ve got it.
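The idea is easy to sketch (again in Python, with an invented feed fragment): as long as the <title> tags themselves are intact, a non-greedy pattern pulls the title out even though the document as a whole is invalid XML.

```python
import re

# An invented feed fragment that is broken as XML (undeclared <o:p>
# prefix, undefined &mdash; entity) but has intact <title> tags.
feed = """<item>
  <title>Latest Review</title>
  <description>Pasted <o:p>from Word</o:p> &mdash; invalid as XML</description>
</item>"""

# Non-greedy (.*?) so we stop at the first closing tag;
# DOTALL lets a title span multiple lines.
titles = re.findall(r"<title>(.*?)</title>", feed, re.DOTALL)
print(titles)  # ['Latest Review']
```

The junk in the description never matters, because the pattern only ever looks at what sits between the tags I ask for.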

This goes back to Postel’s Law, “Be conservative in what you send, be liberal in what you receive.” In my case, this means I have to make my best effort to accept the data that I’m given, ignoring errors whenever possible. And regular expressions make that possible.

Now, I love XML (and XSLT, too), and I use it a lot. In fact, XML is my Golden Hammer – I can find a way to work it into just about every project. But in this case I’m working with data that’s not entirely under my control, and I need to be as flexible as I can. And hammers aren’t noted for flexibility!
