So a few years ago, as I was clicking the links in my blogroll for the 10th time that day, I wondered if there was a better way of finding out when my favorite sites were updated. This was before feed readers were really popular, and anyway I wanted more of a Google News layout. Then I had an epiphany: site feeds were just XML files, and I had already created a bunch of XML-processing scripts in ASP for my Bleeker Books site (and God help me, it’s still using them).
So over a long weekend I cobbled together a program that would read these feeds into a database and spit them back out in a nice format, and CrimeSpot was born.
Over the next year or so I recoded the site in .NET, and I have to say those tools made it a lot easier. But I still ran into a problem from time to time, one that I couldn’t do anything about: every so often, I couldn’t import a feed. It would have some sort of formatting problem that made it an invalid XML document, and my program would throw up its hands and give up.
Usually this was because of Microsoft Word: if you copy the contents of a Word document and paste it as HTML, a lot of the formatting information gets converted in a weird way. In particular, you end up with a lot of tags that look like <o:p>. To an XML parser, that looks like an element with an undeclared namespace prefix, and the document can’t be read. More broadly, a single well-formedness error anywhere in an XML document makes the entire document unreadable to a strict parser.
As I said, this has been going on for a while, but as I add feeds I can see that it’s going to be a more and more common problem. So I finally made a command decision. Processing these documents as XML is out. From now on I’m going to use regular expressions to extract the data I want.
For those of you not in the know, regular expressions are pattern matching tools that can find and extract information from a longer document. In practical terms, this means that as long as the tags surrounding the content are correct, I can retrieve the information I want. Any errors in the content itself I can clean up once I’ve got it.
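To make the difference concrete, here is a minimal sketch rather than the actual CrimeSpot code. The feed fragment, the ExtractElement helper, and the patterns are all made up for illustration. It first shows a strict XML parser rejecting a fragment that contains Word’s <o:p> tag, then pulls the same content out with regular expressions and cleans it up afterwards.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports System.Xml

Module RegexFeedDemo
    ' Hypothetical feed fragment: the tags around the content are fine,
    ' but the description contains Word's <o:p> with no namespace declared.
    Private Const Feed As String = _
        "<rss><channel><item>" & _
        "<title>New post</title>" & _
        "<description><p>Hello<o:p></o:p></p></description>" & _
        "</item></channel></rss>"

    ' Hypothetical helper: returns the text between the first <tag> and
    ' the next </tag>, or Nothing if the pair is not found.
    Function ExtractElement(ByVal source As String, ByVal tag As String) As String
        Dim name As String = Regex.Escape(tag)
        Dim pattern As String = "<" & name & "\b[^>]*>(.*?)</" & name & ">"
        Dim m As Match = Regex.Match(source, pattern, _
            RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    Sub Main()
        ' The strict route: the parser refuses the whole document.
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(Feed)
        Catch ex As XmlException
            Console.WriteLine("XML parser gave up: " & ex.Message)
        End Try

        ' The forgiving route: match on the surrounding tags only.
        Console.WriteLine(ExtractElement(Feed, "title"))

        ' Clean up the content once we have it: strip the Word-specific tags.
        Dim body As String = ExtractElement(Feed, "description")
        body = Regex.Replace(body, "</?o:p\b[^>]*>", "")
        Console.WriteLine(body)
    End Sub
End Module
```

The lazy (.*?) match stops at the first closing tag, so a sketch like this only suits elements that don’t nest inside themselves, which is normally the case for feed fields such as title and description.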
This goes back to Postel’s Law, “Be conservative in what you send, be liberal in what you accept.” In my case, this means I have to make my best effort to accept the data that I’m given, ignoring errors whenever possible. And using regular expressions makes that possible.
Now, I love XML (and XSLT, too), and I use it a lot. In fact, XML is my Golden Hammer – I can find a way to work it into just about every project. But in this case I’m working with data that’s not entirely under my control, and I need to be as flexible as I can. And hammers aren’t noted for flexibility!
When I posted a while back about importing XML documents as objects using serialization, one of the purposes I wanted to put this to was creating a list of objects that could be selected by a unique key value. For example, if you had a list of books, you could pick out the one you wanted by specifying its ISBN.
In .NET, this kind of lookup is handled by a type of collection called a dictionary. You give it a key and a value, and Presto! You can look up specific items by key, check whether a key exists, etc.
One problem: dictionaries don’t support XML serialization.
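Here is a small sketch of both halves of that story, using made-up ISBNs and titles. The lookup part is ordinary Dictionary usage. The last part asks for an XmlSerializer for the dictionary type, which I would expect to fail; the code simply prints whatever exception comes back rather than assuming its exact type.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Xml.Serialization

Module DictionaryDemo
    Sub Main()
        ' Hypothetical data: book titles keyed by ISBN.
        Dim titles As New Dictionary(Of String, String)
        titles.Add("9780000000001", "First Sample Title")
        titles.Add("9780000000002", "Second Sample Title")

        ' Look up one item by its key.
        Console.WriteLine(titles("9780000000002"))

        ' TryGetValue avoids an exception when the key is missing.
        Dim found As String = Nothing
        If Not titles.TryGetValue("9780000000003", found) Then
            Console.WriteLine("No such ISBN")
        End If

        ' Asking for an XmlSerializer for the dictionary type is where
        ' the trouble starts; report the exception if one is thrown.
        Try
            Dim serializer As New XmlSerializer(titles.GetType())
            Console.WriteLine("Serializer created")
        Catch ex As Exception
            Console.WriteLine(ex.GetType().Name & ": " & ex.Message)
        End Try
    End Sub
End Module
```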
I spent a while banging my head against this brick wall before I found a convenient way around it: a .NET class called KeyedCollection. KeyedCollection implements the IList interface rather than IDictionary, and is therefore serializable, but it also lets you specify a key. The class is abstract, so you must derive your own custom class, but as we’ll see in a second, that’s a snap.
KeyedCollection must be inherited because it has to be a list of a specific object type. Then, instead of specifying your own key value for each item in the list, you indicate which of the object’s fields you want to be used as the key. Here is a class I created this morning:
Public Class SourceTypeList
    Inherits System.Collections.ObjectModel.KeyedCollection(Of Long, SourceType)

    ' Tells the collection which field of each item serves as its key.
    Protected Overrides Function GetKeyForItem(ByVal item As SourceType) As Long
        Return item.SourceTypeID
    End Function

    Sub New()
        MyBase.New()
    End Sub
End Class
What does this code do? It tells Visual Basic to create a new collection type derived from KeyedCollection, where the key is a long integer and the value is an object of type SourceType (SourceType represents information about a type of syndication file, such as Atom 0.3 or RSS 2.0). You then override the function GetKeyForItem and tell VB which field you want to use as the key.
And it works beautifully. I had tested deserialization using a generic List(Of T) and I was able to swap out the code in maybe 5 minutes.
So if you need a keyed list and your objects include unique values, you can use KeyedCollection and get the benefits of serialization as well.
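For anyone who wants to see the lookup side, here is a short usage sketch. DemoSourceType and DemoSourceTypeList are trimmed-down stand-ins I made up so the example compiles on its own; they are not the real classes, and the ID values are arbitrary.

```vb
Imports System
Imports System.Collections.ObjectModel

' Trimmed-down stand-in for SourceType, for illustration only.
Public Class DemoSourceType
    Public SourceTypeID As Long
    Public Name As String
End Class

' Same shape as SourceTypeList, but for the stand-in class.
Public Class DemoSourceTypeList
    Inherits KeyedCollection(Of Long, DemoSourceType)

    Protected Overrides Function GetKeyForItem(ByVal item As DemoSourceType) As Long
        Return item.SourceTypeID
    End Function
End Class

Module KeyedCollectionDemo
    Sub Main()
        Dim types As New DemoSourceTypeList()

        Dim atom As New DemoSourceType()
        atom.SourceTypeID = 10
        atom.Name = "Atom 0.3"
        types.Add(atom)

        Dim rss As New DemoSourceType()
        rss.SourceTypeID = 20
        rss.Name = "RSS 2.0"
        types.Add(rss)

        ' Select an item by its key. The L suffix makes the argument a Long,
        ' which matches the key type declared for the collection.
        Console.WriteLine(types(20L).Name)

        ' Contains checks for a key without throwing.
        Console.WriteLine(types.Contains(30L))

        ' It is still a list, so it can be enumerated in insertion order.
        For Each item As DemoSourceType In types
            Console.WriteLine(item.SourceTypeID.ToString() & ": " & item.Name)
        Next
    End Sub
End Module
```

Because the collection also has the usual positional indexer that takes an Integer, I would pass the key explicitly as a Long whenever the key type is numeric, so there is no doubt about which one is being called.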
When I first started writing the back-end code for CrimeSpot.net, I was confronted with a dilemma: I had to import two different versions of Atom and three of RSS, all of which had slightly different formats. I had two options. I could create a separate routine within the program to import each of these formats, or I could transform each of them to a single format using XSL templates.
I decided to use templates and import a single, common XML format. Originally, I chose to do this because it made it very simple to separate program and data. Combining the two is one of my biggest pet peeves. By doing it this way, I could just create an entry in the database for each input type and include an XSL file to change it to the common form.
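As a rough sketch of the idea, and not the templates CrimeSpot actually uses, the following applies an XSL template held in a string to a feed held in a string. The stylesheet, the common Entries/Entry format, the sample feed, and the ToCommonFormat helper are all invented for the example.

```vb
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Xsl

Module TransformDemo
    ' Hypothetical stylesheet: maps RSS <item> elements to a made-up
    ' common <Entry> format.
    Private Const RssTemplate As String = _
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" & _
        "<xsl:output method='xml' omit-xml-declaration='yes'/>" & _
        "<xsl:template match='/'>" & _
        "<Entries><xsl:for-each select='rss/channel/item'>" & _
        "<Entry><Title><xsl:value-of select='title'/></Title>" & _
        "<Url><xsl:value-of select='link'/></Url></Entry>" & _
        "</xsl:for-each></Entries>" & _
        "</xsl:template></xsl:stylesheet>"

    ' Hypothetical one-item RSS feed.
    Private Const RssFeed As String = _
        "<rss version='2.0'><channel><item>" & _
        "<title>New post</title><link>http://example.com/1</link>" & _
        "</item></channel></rss>"

    ' Applies an XSL template (as a string) to a feed (as a string)
    ' and returns the transformed XML as a string.
    Function ToCommonFormat(ByVal feedXml As String, ByVal templateXsl As String) As String
        Dim transform As New XslCompiledTransform()
        Using templateReader As XmlReader = XmlReader.Create(New StringReader(templateXsl))
            transform.Load(templateReader)
        End Using

        Dim output As New StringWriter()
        Using feedReader As XmlReader = XmlReader.Create(New StringReader(feedXml))
            transform.Transform(feedReader, Nothing, output)
        End Using
        Return output.ToString()
    End Function

    Sub Main()
        Console.WriteLine(ToCommonFormat(RssFeed, RssTemplate))
    End Sub
End Module
```

Supporting another input format then means storing another template string alongside its database entry; the importing code stays the same.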
Subsequent events have shown this to be a wise decision.
Why? Well, I have been fooling around with one of the features of the .NET framework – the ability to take objects within programs and “serialize” them to XML files. Normally this is used so that you can retain the object’s value between runs of the program. If you need that object back at a later time, you can “deserialize” that XML file back into an object.
But when you’re deserializing, there’s no reason that the XML must come from an object that was previously serialized. You can use any XML file that matches the object’s format. With a little work, you can even import a collection of objects.
This helps me tremendously because the objects I will be importing need some processing before they can be saved. In particular, I need to inspect a date/time field and capture the offset from Universal time (UTC, aka GMT). This information is lost once the text is converted to a Date value, so I need to get it while the date is still just text.
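To illustrate the problem, here is a small sketch with a made-up timestamp. It uses DateTimeOffset, which is a different technique from reading the offset out of the raw text and assumes a version of .NET that includes that type (3.5 or later), so treat it as a demonstration of what gets lost rather than as the approach described here.

```vb
Imports System
Imports System.Globalization

Module OffsetDemo
    Sub Main()
        ' Hypothetical timestamp as it might appear in a feed.
        Dim raw As String = "2007-03-01T09:30:00-05:00"

        ' DateTimeOffset keeps the offset that was written in the text.
        Dim stamped As DateTimeOffset = _
            DateTimeOffset.Parse(raw, CultureInfo.InvariantCulture)
        Console.WriteLine(stamped.Offset)       ' the offset from the text
        Console.WriteLine(stamped.UtcDateTime)  ' the same instant in UTC

        ' Parsing to a plain Date adjusts the value to the machine's local
        ' time zone; the original offset cannot be recovered from it.
        Dim plain As Date = Date.Parse(raw, CultureInfo.InvariantCulture)
        Console.WriteLine(plain.Kind)
    End Sub
End Module
```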
And .NET supports saving XML directly into a database (via the DataSet object), so when I’m done, I can just serialize the object and save the resulting XML. This approach may have performance issues, but it’s simple and elegant, and I can always buy a faster computer.
UPDATE: Here’s a little source code to show how this works. This code will read XML from a DataSet and import it into a collection of objects. First, the object classes:
Public Class SourceTypes
    Private TypeList As New SourceTypeList

    ' Emits each list item as a <SourceType> element directly under the root.
    <System.Xml.Serialization.XmlElementAttribute("SourceType", Form:=System.Xml.Schema.XmlSchemaForm.Unqualified)> _
    Public Property Types() As SourceTypeList
        Get
            Return Me.TypeList
        End Get
        Set(ByVal TypeList As SourceTypeList)
            Me.TypeList = TypeList
        End Set
    End Property

    ' XmlSerializer needs a public parameterless constructor.
    Public Sub New()
    End Sub
End Class
This class is a serialization wrapper. It exists only to provide a convenient XML representation of the collection of SourceType objects. For information on the SourceTypeList class, please see this post. Incidentally, the XmlElementAttribute causes the list not to have an XML element of its own; instead it presents the list items directly below the root element.
Here is the SourceType class:
' Requires Imports System.Xml.Xsl at the top of the file.
Public Class SourceType
    Public SourceTypeID As Long
    Public Name As String
    Public Description As String
    Public ItemField As String
    Public UpdateField As String
    Public UpdateCheckRX As String
    Public UpdateSelectRX As String
    Public UpdateReplaceRX As String
    Private TemplateString As String

    ' Excluded from serialization.
    <System.Xml.Serialization.XmlIgnore()> _
    Public TemplateTransform As New XslCompiledTransform
    .
    .
    .
End Class
I have simplified the class a bit. The original had a property that accepted an XSL string and used it to initialize the TemplateTransform field. I can’t emphasize enough how helpful properties are when using serialization. They make it easy to do some processing without having to explicitly invoke any methods. Here, the XmlIgnore attribute prevents the TemplateTransform field from participating in serialization.
Now here’s the guts of the program, where we instantiate the class from the DataSet (which we will assume has already been filled):
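' Requires Imports System.IO and Imports System.Xml.Serialization.
' TypeSet is the DataSet, assumed to be filled already.
Dim TypeList As SourceTypes
Dim TypeSerializer As New XmlSerializer(GetType(SourceTypes))
Dim TypeReader As StringReader
' GetXml returns the DataSet's contents as an XML string.
TypeReader = New StringReader(TypeSet.GetXml)
TypeList = CType(TypeSerializer.Deserialize(TypeReader), SourceTypes)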
It may be more efficient to use DataSet.WriteXml and an XML reader here; I haven’t tested it. The result is an object that contains a collection of SourceType objects.
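If you want to try the whole round trip without a database, here is a self-contained sketch. DemoType and DemoTypes are hypothetical stand-ins for the classes above, and the DataSet is filled by hand instead of from a query. The one thing to watch is naming: GetXml uses the DataSet name as the root element and the table name as the element for each row, so both need to match what the serializer expects.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Data
Imports System.IO
Imports System.Xml.Serialization

' Hypothetical stand-ins for SourceType and SourceTypes.
Public Class DemoType
    Public DemoTypeID As Long
    Public Name As String
End Class

Public Class DemoTypes
    ' Items appear as <DemoType> elements directly under the root.
    <XmlElement("DemoType")> _
    Public Types As New List(Of DemoType)
End Class

Module DataSetDemo
    Sub Main()
        ' The DataSet name becomes the root element; the table name
        ' becomes the element name of each row.
        Dim typeSet As New DataSet("DemoTypes")
        Dim table As DataTable = typeSet.Tables.Add("DemoType")
        table.Columns.Add("DemoTypeID", GetType(Long))
        table.Columns.Add("Name", GetType(String))
        table.Rows.Add(1L, "Atom 0.3")
        table.Rows.Add(2L, "RSS 2.0")

        ' Deserialize the DataSet's XML straight into the wrapper object.
        Dim serializer As New XmlSerializer(GetType(DemoTypes))
        Dim result As DemoTypes
        Using reader As New StringReader(typeSet.GetXml())
            result = CType(serializer.Deserialize(reader), DemoTypes)
        End Using

        For Each item As DemoType In result.Types
            Console.WriteLine(item.DemoTypeID.ToString() & ": " & item.Name)
        Next
    End Sub
End Module
```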
As always, please drop a note in the comments if this helps.