Tuesday, March 24, 2009

XML Madness in March

If we truly want to cleanse ourselves of XML, we must not use SOAP.
- Anonymous

It's March Madness time. One thing familiar to basketball fans is the timeout. When things get a bit out of control -- shot selection, turnovers, tempers, etc. -- you need to stop the game and get grounded again.

This is something that never happened with XML Madness. Nobody ever called a timeout to take a breather. It was all momentum, and people just kept adopting it. I adopted it. Microsoft adopted it. Sun adopted it. Why? It was convenient, and it looked like HTML in a web-based world.

Let's go back in the day, to the early years of the Internet. What did the first XML people set out to solve? What were the basic use cases? To break it down into something really simple, we needed a way to store data hierarchically. The data should be fast and easy to parse. You should be able to store numbers, dates, and higher level objects as aggregations of the more primitive types.

Anything jump off the screen at you? Maybe the part about fast and easy to parse? The words easy and parse, when used together, form an oxymoron! How about numbers and dates? Numbers are easy if you compose and consume XML from the same locale, because both sides agree on the decimal character. Otherwise, you have to add an extra bit of information to the XML to tell the reader what locale to shift into. And converting text to numbers: is that fast and easy? Well, compared to reading the binary representation of a number from a stream, it's going to be pretty darned slow. And there is the round-off thing: write a floating-point value as decimal text with too few digits, and you will lose precision.

And date/time values... these are even more interesting. If every programming language and API just assumed that Date.toString() should return something like "2009-02-10 23:10:02:995", we would have a text-sortable, standard text representation of dates. Unfortunately, toString() methods tend to return verbose, localized strings that don't parse correctly in other cultures. Less experienced developers and testers won't notice these subtleties, and nothing will show up when testing under laboratory conditions. But send an XML document from the US to Germany, and these text conversion issues will declare themselves, embarrassingly enough, in a release version of your software.
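To make those three complaints concrete, here is a minimal Python sketch. The sample values and the helper name try_parse are made up for illustration. It shows a decimal-comma string failing a dot-based parse, a value losing precision when written as fixed-digit text but surviving a binary round trip, and ISO-style timestamps sorting correctly as plain strings.

```python
import struct
from datetime import datetime

def try_parse(text: str):
    """Parse a number that uses '.' as the decimal character; None on failure."""
    try:
        return float(text)
    except ValueError:
        return None

# A decimal-comma string, as a German-locale writer might emit it (sample value).
german_text = "3,14"
parsed_german = try_parse(german_text)   # None: the reader expected "3.14"

# Round-off: text written with a fixed number of digits drops precision,
# while the 8-byte binary form round-trips exactly.
value = 0.1 + 0.2                        # 0.30000000000000004
as_text = f"{value:.6f}"                 # '0.300000'
from_text = float(as_text)               # 0.3, no longer equal to value
as_binary = struct.pack("<d", value)     # little-endian IEEE 754 double
from_binary = struct.unpack("<d", as_binary)[0]

# Dates: an ISO-style layout sorts as text in the same order as the dates.
stamps = [
    datetime(2009, 2, 10, 23, 10, 2, 995000),
    datetime(2008, 12, 31, 8, 0, 0),
]
as_strings = [s.isoformat(sep=" ", timespec="milliseconds") for s in stamps]
sorted_strings = sorted(as_strings)
```

To be fair to text, a writer that emits enough digits (Python's repr does) will round-trip a double exactly. The loss shows up when someone picks a fixed display format, which is exactly the kind of thing that slips past testing.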

As if XML wasn't bad enough, enter XPath and namespaces. These are the kinds of things that take all the fun out of programming. Open an XML document with namespaces and the inexplicable URI associations, and it feels like the oxygen has just been sucked out of the room. You find yourself sending IMs to the guy two cubes away: "Dood, lunch?" XPath is another amazing outcome of a standards committee in action. How do you forget that you have JavaScript, Python, and dozens of other bindings available for query logic? Somehow, the committee came out of the room with a new query language that had deviant escaping rules. This led me to doubt the credibility of the standard-bearers for XML.

I originally penned this post with an illustration of how to alternately store hierarchical data as nested binary, variant structures. Naively, I thought nobody else had ever considered this, but it turns out a couple of startups named Google and Facebook were already releasing APIs. Facebook Thrift and Google Protocol Buffers store compact, binary data, bypassing the "parsing" problem -- XML's biggest bottleneck. These toolkits are also type-safe: fields are declared with types in a schema rather than being recovered from text.
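For a feel of what "nested binary, variant structures" could look like, here is a rough Python sketch. It is not the Thrift or Protocol Buffers wire format. The type tags, the 7-byte record header, and the function names are all assumptions made up for this example. Each field is a numeric key, a type tag, a length, and a raw payload, and a payload may itself be a nested node.

```python
import struct

# Type tags for this sketch only (hypothetical; not a real wire format).
T_INT, T_DOUBLE, T_STR, T_NODE = 1, 2, 3, 4

# Record header: 2-byte key, 1-byte type tag, 4-byte payload length.
HEADER = struct.Struct("<HBI")

def encode(fields: dict) -> bytes:
    """Encode {numeric_key: value} as a sequence of header + payload records."""
    out = bytearray()
    for key, value in fields.items():
        if isinstance(value, bool):
            raise TypeError("bool is not supported in this sketch")
        if isinstance(value, int):
            tag, payload = T_INT, struct.pack("<q", value)
        elif isinstance(value, float):
            tag, payload = T_DOUBLE, struct.pack("<d", value)
        elif isinstance(value, str):
            tag, payload = T_STR, value.encode("utf-8")
        elif isinstance(value, dict):
            tag, payload = T_NODE, encode(value)   # nested node
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
        out += HEADER.pack(key, tag, len(payload)) + payload
    return bytes(out)

def decode(data: bytes) -> dict:
    """Rebuild the nested dict; numbers are read directly, not converted from text."""
    fields, offset = {}, 0
    while offset < len(data):
        key, tag, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        payload = data[offset:offset + length]
        offset += length
        if tag == T_INT:
            fields[key] = struct.unpack("<q", payload)[0]
        elif tag == T_DOUBLE:
            fields[key] = struct.unpack("<d", payload)[0]
        elif tag == T_STR:
            fields[key] = payload.decode("utf-8")
        elif tag == T_NODE:
            fields[key] = decode(payload)
        else:
            raise ValueError(f"unknown type tag: {tag}")
    return fields

# Sample document: keys are numbers, values keep their types.
doc = {1: "Chris", 2: 2009, 3: 0.1 + 0.2, 4: {1: "nested", 2: -5}}
round_tripped = decode(encode(doc))
```

The length prefix is the design choice that matters here: a reader can skip any field, or a whole subtree, without looking inside it.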

The existence of these toolkits is very strong evidence that XML fails in performance and type safety. Google is especially known for seeking out any possible performance gains, even something as small as a 1% boost. They apparently never believed that bigger, faster, better XML parsers would ever add up to the gains they would achieve by going old-school (binary).

So what's missing with these toolkits? Part of what made XML popular is the ability to view it in a text editor. It's easy to perceive binary information as "not portable". This is just a matter of perception -- JPEGs are binary, and they're pretty portable. Heck, UTF-8 text can look like an encrypted mess when viewed in a binary editor!

What we need is a way to load the binary nodes of these structured documents and display them in a ubiquitous, free, familiar editor. Each field of data is associated with a numeric key; these keys can be mapped to readable text of the user's choosing so that the editor can display useful information. If this sounds a bit abstract, it's basically the same concept as using #define or enum in code. Note that mappings are completely arbitrary, since they only affect what's displayed in the editor. This means that you could actually localize the editing of raw data.
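Here is a small Python sketch of that mapping idea, assuming the binary document has already been loaded into a nested dict keyed by numbers. The label tables, the sample document, and the render function are all hypothetical. Swapping the label table changes only what the editor shows, never the data.

```python
# Hypothetical label tables: numeric field keys -> display names per language.
LABELS = {
    "en": {1: "name", 2: "year", 3: "address", 4: "city"},
    "de": {1: "Name", 2: "Jahr", 3: "Adresse", 4: "Stadt"},
}

def render(node: dict, labels: dict, indent: int = 0) -> list:
    """Return display lines for a nested {numeric_key: value} document."""
    lines = []
    pad = "  " * indent
    for key, value in node.items():
        # Unmapped keys fall back to the raw number, like an unnamed enum value.
        label = labels.get(key, f"#{key}")
        if isinstance(value, dict):
            lines.append(f"{pad}{label}:")
            lines.extend(render(value, labels, indent + 1))
        else:
            lines.append(f"{pad}{label}: {value!r}")
    return lines

# Sample data as it might look after loading (values are made up).
doc = {1: "Chris", 3: {4: "Berlin"}, 9: 5}

english_view = render(doc, LABELS["en"])
german_view = render(doc, LABELS["de"])
```

Printing "\n".join(english_view) and "\n".join(german_view) gives two views of the same bytes, one per label table.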

Let's get way out there on a limb with this editor concept. What if the structured data represented programming code? This opens up all kinds of new possibilities. One of the biggest roadblocks to maintaining a programming language is its grammar. The parsing of a language's grammar is a formidable task, even using YACC or ANTLR. With a structured data format, we could even program with the variable/function names translated to the developer's native tongue (again, mapping). The idea of parsing and altering programming code, as if it were data, is even more daunting. It has been tried, with ANT scripts using XML as a basic grammar, with MFC dispatch maps, and so on.
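Python's standard ast module gives a rough taste of treating code as data. This is only a sketch under a big assumption: here the tree still comes from parsing text, whereas the idea above is that the tree itself would be the stored form. The sample function, the name table, and the Rename class are made up for illustration, and ast.unparse needs Python 3.9 or later.

```python
import ast

# Sample source; in the scheme described above, the tree would be the stored form.
SOURCE = "def area(width, height):\n    return width * height\n"

# Hypothetical English -> German display names.
NAMES = {"area": "flaeche", "width": "breite", "height": "hoehe"}

class Rename(ast.NodeTransformer):
    """Rewrite function, parameter, and variable names using a mapping table."""

    def __init__(self, names: dict):
        self.names = names

    def visit_FunctionDef(self, node):
        node.name = self.names.get(node.name, node.name)
        self.generic_visit(node)   # continue into parameters and body
        return node

    def visit_arg(self, node):
        node.arg = self.names.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.names.get(node.id, node.id)
        return node

def translate(source: str, names: dict) -> str:
    """Return the source with identifiers swapped according to names."""
    tree = Rename(names).visit(ast.parse(source))
    return ast.unparse(tree)       # requires Python 3.9+

german_source = translate(SOURCE, NAMES)

# The translated code still runs and behaves the same.
namespace = {}
exec(german_source, namespace)
result = namespace["flaeche"](3, 4)
```

Once the code is a tree, the renaming is a plain data transformation. No grammar is touched after the initial parse.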

There is a precedent for code/data duality. ANT is a great example, but the use of XML makes it so verbose, it's not really very "programmable". JSON (JavaScript Object Notation) is another example of code/data duality. Resource strings are another possible example. What if they were smart enough to apply or override plurality rules for a language, on the fly? They would need a bit of programming logic to make them behave as something more than a chunk of data.

But I digress. Replacing XML with better format(s) offers some exciting possibilities. Until then, we're going to build slower, buggier software while we cope with DOM, SAX, XPath, XSL spaghetti logic, and exceptions related to namespace resolution.

In conclusion, I want to present a little analogy. What if our country had a Secretary of Data post in the president's cabinet? We would surely run a deep background check on candidates for that job. We would try to predict how they would perform under great pressure. We would try to prevent an unfit candidate from filling the post. If the candidate fails to perform later on, even after being confirmed, we would have hearings where the Secretary gets grilled. Well, we don't have a Department of Data. What we do have is terabytes of critical information being passed around in a format that impedes performance and type safety. Should this be worthy of hearings? XML was never vetted like the Secretary of Data would be! As our processing and storage needs continue to grow at a phenomenal rate, at some point, I believe the widespread use of XML is going to impede productivity in all software niches.

Why do we passively adopt half-baked standards that lead to big, long-term problems? We created the Y2K problem with full knowledge that it could lead to big trouble. It did indeed lead to trouble -- vast amounts of time and money were spent on preventing catastrophe. XML will prove itself unable to stand up to the sheer volume of data we generate, so we're looking at eventually redesigning a great deal of infrastructure. Stop the XML madness! Let's talk about this today, before another 10 years of investment in these shaky standards goes by!

Cheers,
Chris
