I have to admit that I've been following fellow CodeSniper Ben Bryant's discussion of XML with just a bit of misplaced glee. I have been working with XML extensively since mid-2001, at organizations ranging from US Government departments to a handful of developers and a maniacal boss with improper desires for XML, and I have seen it all. No, I'm quite serious, I've seen it all. Or at least, every day, I pray I have.

While I was at the Library of Congress, I had the opportunity to build some XML schemas from scratch. It was a great experience as we had a literal “clean slate” that every developer dreams of along with people who had huge amounts of domain expertise. They were sharp people all the way around and it showed. The Schemas were huge, verbose, complex, but amazingly complete in terms of the data and metadata we were able to capture.

The main problem we faced was that the XML hype was just beginning and the tool support was mediocre at best. We found ourselves having to write much of our own infrastructure before we could implement our great ideas. Most of them worked, and we managed to develop what we believe were the first web services within the US Federal government. I believe my name/initials are still in some of the Schemas.

Fast forward a few years and the XML hype machine (aka Microsoft) is going full speed, and it turns out that XML is going to solve all your problems with legacy systems in addition to making your coffee and ending terrorism. I'll never forget when my boss said “we're going to build 100% XML applications!” I'm not even sure what that means. Do you mean that we'll be able to open up Notepad, draw some nifty brackets, and the system will understand and implement our business logic? This is from an organization that, when I left, wasn't even using source control. I put a copy of the Joel Test on the wall on my way out. Score: 0-1

XML or BPEL might be able to do that someday, but I simply don't see it yet. XML is reasonable for things such as Ant, but if we can't even agree on what RSS is, how can we figure out the best way to do real development? But I digress…

Fast forward a bit farther and *some* people have learned where XML is and isn't appropriate.

On a recent project, I was the lead in building a system which would poll a number of different content providers and download all the new content. There are two well-established XML standards for this, so from the five providers, how many XML formats would you expect?

Any guesses? Come on, I know you have one.

How about seven? Yes, five organizations present the same type of information in seven different ways, despite the fact that there are two established standards. This was a perfect place for standardization and collaboration, since it's the same TYPE of information with all of the same data represented in pretty much the same ways, but alas… And it gets even worse, as three of the organizations regularly provide documents that are missing required elements or are ill-formed XML. Simply stunning.

So what did we do? First, we dump the ill-formed XML. Next, since it made little sense to build code to rip each of those formats apart individually, we chose a simple name/value pair structure (which is quite similar to one of the standards) and perform an XSL transform on all of the documents using a variation of the Two-Step View. Finally, we rip apart the name/value pairs, scrub the data, and insert it into the database. This provides the XML to the importer in exactly one way and allows us to support additional XML standards by simply writing a new XSL transform.
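For the curious, here is a rough sketch of those three steps. It is not our actual code. It is a minimal Python illustration, and it makes a few assumptions: the `<record>`/`<field name="…">` layout, the function names, and the provider element names (`headline`, `content/text`) are all made up, and a plain Python function stands in for the XSL transform because Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET


def parse_if_well_formed(raw):
    """Step 1: return the parsed root, or None if the XML is ill-formed."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        return None


def provider_a_to_pairs(root):
    """Step 2 stand-in: plays the role of one provider's XSL stylesheet.

    Maps the provider's (hypothetical) element paths onto the common
    name/value structure. In the real system this is an XSL transform.
    """
    record = ET.Element("record")
    for name, path in {"title": "headline", "body": "content/text"}.items():
        node = root.find(path)
        if node is not None:
            field = ET.SubElement(record, "field", name=name)
            field.text = node.text
    return record


def rip_pairs(record):
    """Step 3: flatten the normalized record into a dict, trimming whitespace."""
    pairs = {}
    for field in record.iter("field"):
        name = field.get("name")
        if name:
            pairs[name] = (field.text or "").strip()
    return pairs


def import_documents(raw_documents, transform):
    """Run every document through the same pipeline; only `transform` varies."""
    rows = []
    for raw in raw_documents:
        root = parse_if_well_formed(raw)
        if root is None:
            continue  # dump the ill-formed XML
        rows.append(rip_pairs(transform(root)))
    return rows  # a real importer would insert these rows into the database
```

The point of the shape is that `rip_pairs` and `import_documents` never see a provider's native format. Supporting another provider means writing another transform, nothing more.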

This is where XML serves a great purpose. It allows us to pull together otherwise disconnected datasets and do something with them. It allows our code to ignore their changing XML structures. If they move a field, we tweak the corresponding XSL and move on with life. If we get a new structure, we write some new XSL and move on with life.

Would I call the system an “XML Application”? No, but I would say that it makes reasonable use of XML where it makes the most sense.
