Saturday, March 28, 2009

Smart Enums in Python

What numbers are meaningful to you? I can think of a few. My Social Security number is committed to memory. 11 is "loud" if you're in Spinal Tap. 8675309 is Jenny from the famous '80s song. All of these numbers are pretty arbitrary. They were chosen to uniquely identify something else that they map to. If my name changes, my SSN does not. Phone numbers are getting more and more like SSNs, in that they stay with us wherever we go. We can even keep them if we move nearby.

We use arbitrary numbers in software all the time. In relational databases, we use 8- to 128-bit numbers as unique identifiers, much like SSNs. If the text changes in the table row, the ID value stays the same, and relational integrity is preserved. We also use numbers to indicate status. They are often used in C/C++ as return values from functions, where every possible return value indicates a distinctly different result. Integers are suitable for this purpose because they are far more efficient to return than, say, strings.

Since numbers used as status codes are pretty arbitrary, the numbers themselves have very little significance outside the programmer's mind. If a function call fails and logs an error code of 1002 to an event log, it means very little to anyone lacking source code. If you use a function written by someone else, and it returns 1002, it likewise means very little. Enter enumerations, wherein numeric values have a text name in code. An example:


enum MyEnum {
    Ok,
    NotSoGood,
    QuiteBadActually
};

MyEnum someFunction()
{
    //...blah, blah, blah
    return Ok;
}

If you call someFunction(), you can test the return value against the names defined in MyEnum. Though the values in MyEnum map to 0, 1, and 2, respectively, you compare by name, so you get a sense of the meaning attached to each return value.

Unfortunately, if you do the following in C++:

MyEnum result = someFunction();
cout << result;

... you get numeric output to the console. So when you try to log status codes defined in enumerations, you're still stuck with totally arbitrary information that's only meaningful to the original programmer. Even worse, since enumerated values can be defined without explicit assignment, you can't search source code and always find the definition of the error code.

.Net helps with this, somewhat. If you convert an enumerated value to a string, it extracts the name rather than the numeric value -- much better for event logging! Programmers are never satisfied, though, and they always want to come up with better ways of doing things. So if you get a return value of NotSoGood, what does that mean to a person in China viewing the event log? Frankly, it doesn't say much even in English. What we need is for status codes to be more than just a key/value pair -- they should also supply descriptive information in the user's native language.

Not all functions return status codes. I write a lot of void functions/methods that throw exceptions when they get upset. If we're going to make a better enum, we should make it throwable. In Python 2, an instance of any classic (old-style) class can be raised, and EnumValue below is one, so our code can be used for exception tossing.
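Here is a minimal sketch of a throwable status, written for Python 3, where anything raised must derive from BaseException. The names StatusError, some_function and run are made up for illustration and are not part of the enumdef module shown later.

```python
class StatusError(Exception):
    """Hypothetical exception carrying a status name, number and description."""

    def __init__(self, name, value, description=""):
        super().__init__("%s=%d" % (name, value))
        self.name = name
        self.value = value
        self.description = description


def some_function(fail):
    # A void-style function: it returns nothing and raises when upset.
    if fail:
        raise StatusError("NotSoGood", 1, "The operation did not complete.")


def run(fail):
    # The caller gets the name and description, not just a bare number.
    try:
        some_function(fail)
    except StatusError as err:
        return "%s (%d): %s" % (err.name, err.value, err.description)
    return "Ok"
```

The point of the design is that the handler can log a name and a readable description without having the source code at hand.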

I can't think of a language where it's not possible to implement what I will show in Python. I'm creating 'status' objects called EnumValue, and instances belong to a containing enumeration. EnumValue encapsulates a name, number, and localized description. I've done this in C++, and it required code generation to be practical (in C/C++, you need smart pointer classes if you want to return class objects, and this is outside the scope of this discussion). I've seen similar exploits in Java. It's probably trivial in .Net.

There are a lot of techniques available for making enums in Python, and a lot of great blog entries showing various ways to define them. I chose to use dynamic definition of classes as my secret sauce, and by doing this, I got around having to implement a singleton class around each enumeration. I'll come back to this later.

Alas, my code:

from new import classobj
import sys
import inspect
import types

class EnumValue:
    def __init__(self, ownerClass, name, value):
        v = value + 0 #test that it's an int without having to know what to throw
        v = name + 'test' #same sort of thing here
        if not issubclass(ownerClass, Enumeration):
            #string exceptions are not supported from Python 2.6 on
            raise TypeError('Owner must be an Enumeration')
        self._owner = ownerClass
        self._value = value
        self._name = name

    @property
    def Name(self):
        return self._name

    @property
    def Value(self):
        return self._value

    @property
    def Owner(self):
        return self._owner

    def __str__(self):
        s = self._owner.__name__ + "." + self._name + "=" + str(self._value)
        if len(self.Description) > 0:
            s = s + ";" + self.Description
        return s

    _descript = None #demand-loaded member
    @property
    def Description(self):
        if not self._descript:
            #Try to load a resource string...
            #... default to "" if we don't have a localized resource
            self._descript = ""
        return self._descript

class EnumScope:
    def __init__(self, name, initValue=0):
        fcur = frame = None #so the cleanup below is safe if the lookup fails
        try: #Must clean up the frame references per inspect module docs
            fcur = inspect.currentframe()
            frame = fcur.f_back
            self._locals = frame.f_locals
        finally:
            if frame:
                del frame
            if fcur:
                del fcur
        self._name = name
        self._counter = initValue - 1

    @property
    def Locals(self):
        return self._locals

    @property
    def Name(self):
        return self._name

    @property
    def Next(self):
        #each read of Next advances the counter and returns the new value
        self._counter = self._counter + 1
        return self._counter

#Just a tag class
class Enumeration:

    def __str__(self):
        s = "Enumeration " + self.__class__.__name__
        for key in self._keyOrder:
            if len(s) > 0:
                s = s + "\n"
            s = s + ' ' + str(getattr(self, key))
        return s

def __makeenum(scope, members, keyOrderList):
    cls = classobj("EnumClass_" + scope.Name, (Enumeration,), {})
    cls.__shared_state = {'_keyOrder':keyOrderList, '__class__': cls}

    #Change the defining module to that of the calling scope.
    cls.__module__ = scope.Locals['__name__']
    lastVal = 0
    for key in members:
        numval = members[key]
        if type(numval) != types.IntType:
            numval = lastVal + 1
        lastVal = numval
        cls.__shared_state[key] = EnumValue(cls, key, numval)

    #The single instance shares its state with the class and is
    #bound to the enum's name in the caller's scope.
    inst = cls()
    inst.__dict__ = cls.__shared_state
    scope.Locals[scope.Name] = inst

def makeenum(scope, **members):
    mems = {}
    for key in members:
        mems[key] = members[key]
    __makeenum(scope, mems, members.keys())

def makeenum2(scope, *members):
    mems = {}
    nextVal = scope.Next #first auto-assigned value comes from the scope's counter
    keyOrder = []
    for key in members:
        ar = key.split('=')
        if len(ar) > 1:
            #"name=value" form: take the explicit value
            key = ar[0]
            val = int(ar[1])
        else:
            val = nextVal
        mems[key] = val
        keyOrder.append(key)
        nextVal = val + 1
    __makeenum(scope, mems, keyOrder)

enumdef.py


import enumdef

#Make an enum named MyEnum...
scope = enumdef.EnumScope("MyEnum", 1)
enumdef.makeenum(scope, a=scope.Next, b=scope.Next, c=scope.Next)
#Did it really set the module scope to THIS module?
assert MyEnum.__module__ == globals()['__name__']


#Now use a different syntax to create an enum OtherEnum...
enumdef.makeenum2(enumdef.EnumScope("OtherEnum",0), "x=4", "y", "z")

print MyEnum.a
print MyEnum.b
print MyEnum.c

print OtherEnum.x
print OtherEnum.y
print OtherEnum.z

print 'Test out string coercion:'
print MyEnum #Note the key order is random.
print OtherEnum #Keys are well-ordered here.


PyEnumHarness.py -- Test harness application.

You can run the code simply by typing PyEnumHarness.py from a command prompt, or just create a shortcut to it. The PY extension should map to Python.exe in Windows. Here's the output of the program:


EnumClass_MyEnum.a=1
EnumClass_MyEnum.b=2
EnumClass_MyEnum.c=3
EnumClass_OtherEnum.x=4
EnumClass_OtherEnum.y=5
EnumClass_OtherEnum.z=6
Test out string coercion:
Enumeration EnumClass_MyEnum
 EnumClass_MyEnum.a=1
 EnumClass_MyEnum.c=3
 EnumClass_MyEnum.b=2
Enumeration EnumClass_OtherEnum
 EnumClass_OtherEnum.x=4
 EnumClass_OtherEnum.y=5
 EnumClass_OtherEnum.z=6


I created two ways to define an enum. First, MyEnum is defined using named arguments. Named args have the disadvantage that they do not preserve key order: Python 2 collects them into a dict, which is an unordered hashtable. Because of this, I created another option, makeenum2, which I used in the definition of OtherEnum. The second function allows you to preserve key order. By key, I am referring to the name associated with each EnumValue.
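The ordered form boils down to parsing strings such as "x=4" and "y" into a list of name/value pairs. A minimal Python 3 sketch of just that step follows; parse_members is a hypothetical helper, not a function from enumdef.py, and it takes a plain start value where makeenum2 reads scope.Next.

```python
def parse_members(specs, start=0):
    """Turn specs like "x=4", "y" into ordered (name, value) pairs."""
    pairs = []
    next_val = start
    for spec in specs:
        # partition() yields an empty separator when there is no '='
        name, sep, text = spec.partition("=")
        val = int(text) if sep else next_val
        pairs.append((name.strip(), val))
        # members without an explicit value continue from the last one
        next_val = val + 1
    return pairs


print(parse_members(["x=4", "y", "z"]))  # [('x', 4), ('y', 5), ('z', 6)]
```

Because the result is a list, the order in which the members were written is kept no matter how the dictionary type behaves.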

The EnumScope class was necessary for a couple of reasons. It caches away the context in which it was instantiated, and a variable will be created within that scope. In this case, it's the scope of the actual module. Notice that you never actually declare the variable MyEnum, but I use it a couple of lines afterward. This is a bit of a magic step that can baffle anyone looking at the code, but it illustrates how you can manipulate a variable scope in Python non-declaratively.
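The scope trick on its own can be sketched in a few lines of Python 3. define_in_caller is a hypothetical name used only for this illustration. It writes to the caller's f_globals rather than f_locals: at module level the two are the same dictionary, while writing to the f_locals of a function frame is not a reliable way to create a local variable.

```python
import inspect


def define_in_caller(name, value):
    """Bind name to value in the global scope of whoever called us."""
    frame = inspect.currentframe()
    try:
        caller = frame.f_back
        caller.f_globals[name] = value
    finally:
        # Drop the frame reference to avoid a reference cycle,
        # as the inspect module documentation recommends.
        del frame


define_in_caller("Answer", 42)
print(Answer)  # Answer is never assigned anywhere in this source text
```

As in the enum code, the variable appears without a visible assignment, which is exactly why this technique should be used sparingly.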

I have left it as an exercise for the reader to implement the mathematical operations on EnumValue. Doing so would allow you to add, subtract, or perform bitwise operations on EnumValues as if they were integers.
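One possible shape for that exercise, sketched in Python 3 with a cut-down EnumValue that holds only a name and a value. This is an assumption about how it could be done, not the code from enumdef.py, and every operator here deliberately returns a plain int.

```python
class EnumValue:
    """Simplified stand-in for the post's EnumValue: a name and an int."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __int__(self):
        return self.value

    # Each operator converts both operands with int() and returns an int.
    def __add__(self, other):
        return int(self) + int(other)

    __radd__ = __add__

    def __sub__(self, other):
        return int(self) - int(other)

    def __rsub__(self, other):
        return int(other) - int(self)

    def __or__(self, other):
        return int(self) | int(other)

    __ror__ = __or__

    def __and__(self, other):
        return int(self) & int(other)

    __rand__ = __and__


READ = EnumValue("Read", 1)
WRITE = EnumValue("Write", 2)
print(READ | WRITE)  # 3
print(10 + READ)     # 11
```

Returning plain integers keeps the sketch short; a fuller version might look the result up in the owning enumeration and return a named value when one exists.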

One member worth noting is EnumValue.Description. This can be beefed up to perform a resource lookup for localized text. Remember, part of the use case was to be able to supply localized, descriptive information about the EnumValue.
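A resource lookup of that kind might look like the following Python 3 sketch. The CATALOGS table and the describe function are hypothetical, and the sample texts are invented; a real application would more likely load its strings from resource files, for example through the standard gettext module.

```python
# Hypothetical message catalogs: language code -> {qualified name: text}.
CATALOGS = {
    "en": {"MyEnum.NotSoGood": "The operation finished with warnings."},
    "de": {"MyEnum.NotSoGood": "Der Vorgang wurde mit Warnungen beendet."},
}


def describe(owner, name, lang="en"):
    """Return a localized description, falling back to English, then ''."""
    key = owner + "." + name
    for code in (lang, "en"):
        text = CATALOGS.get(code, {}).get(key)
        if text:
            return text
    return ""


print(describe("MyEnum", "NotSoGood", "de"))
```

The empty-string fallback mirrors the default in EnumValue.Description, so a missing translation never breaks logging.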

I mentioned the singleton pattern earlier. This is a point of much discussion in the Pythonic blogosphere. I actually danced around having to implement singleton by basing my enum definitions on function calls. An instance of a dynamically created class is returned, but the rest of your code will be blissfully unaware of the existence of that class. Your code will be happy to just use the Enumeration instance variable as if it were a singleton.
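The dynamic-class idea can be sketched in Python 3 with the built-in type(), which does the job of the classobj call from the Python 2 new module (that module does not exist in Python 3). make_enum is a hypothetical helper that returns the instance instead of injecting it into the caller's scope.

```python
def make_enum(name, **members):
    """Build a one-off class with type() and return its only instance."""
    cls = type("EnumClass_" + name, (object,), dict(members))
    return cls()


Color = make_enum("Color", red=1, green=2)
print(Color.red)             # 1
print(type(Color).__name__)  # EnumClass_Color
```

Since each call creates its own class and hands back one instance, no separate singleton machinery is needed for the caller to treat Color as unique.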

Cheers,
Chris

Tuesday, March 24, 2009

XML Madness in March

If we truly want to cleanse ourselves of XML, we must not use SOAP.
- Anonymous

It's March Madness time. One thing familiar to basketball fans is the timeout. When things get a bit out of control -- shot selection, turnovers, tempers, etc. -- you need to stop the game and get grounded again.

This is something that never happened with XML Madness. Nobody ever called timeout to take a breather. It was all momentum, and people just kept adopting it. I adopted it. Microsoft adopted it. Sun adopted it. Why? It was convenient, and it looked like HTML in a web-based world.

Let's go back in the day, to the early years of the Internet. What did the first XML people set out to solve? What were the basic use cases? To break it down into something really simple, we needed a way to store data hierarchically. The data should be fast and easy to parse. You should be able to store numbers, dates, and higher level objects as aggregations of the more primitive types.

Anything jump off the screen at you? Maybe the part about fast and easy to parse? The words easy and parse, when used together, form an oxymoron! How about numbers and dates? Numbers are easy if you compose and consume XML in the same locale, because both sides then agree on the decimal separator. Still, you have to add an extra bit of information into the XML to tell the reader what locale to shift into. And converting text to numbers; is that fast and easy? Well, compared to reading the binary representation of a number from a stream, it's going to be pretty darned slow. And there is the round-off thing: write a floating-point value with too few digits, and you lose precision.

Date/time values are even more interesting. If every programming language and API just assumed that Date.toString() should return something like "2009-02-10 23:10:02.995", we would have a text-sortable, standard text representation of dates. Unfortunately, toString() methods tend to return verbose, localized strings that don't serialize into other cultures. Less experienced developers and testers won't notice these subtleties, and nothing will show up when testing under laboratory conditions. But send an XML document from the US to Germany, and these text conversion issues will declare themselves, embarrassingly enough, in a release version of your software.
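These pitfalls are easy to reproduce. The Python 3 sketch below assumes a writer that formats numbers with six decimal places, which is only one of many possible choices; the parses helper and the sample values are made up for illustration.

```python
import struct

value = 0.1 + 0.2  # stored as the double 0.30000000000000004

# Text with a fixed number of digits: readable, but the round trip is lossy.
as_text = "%.6f" % value
print(float(as_text) == value)  # False

# Binary: 8 bytes as a little-endian double, and the round trip is exact.
as_bytes = struct.pack("<d", value)
print(struct.unpack("<d", as_bytes)[0] == value)  # True


def parses(text):
    """Report whether float() accepts the text as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# A decimal comma, as a German locale would write it, is rejected.
print(parses("3.14"), parses("3,14"))  # True False

# Year-first timestamps sort chronologically as plain strings.
stamps = ["2009-02-10 23:10:02.995", "2008-12-31 08:00:00.000"]
print(sorted(stamps)[0])  # 2008-12-31 08:00:00.000
```

A writer that emits enough digits avoids the precision loss, but then both sides still pay for formatting and parsing on every value.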

As if XML weren't bad enough, enter XPath and namespaces. These are the kinds of things that take all the fun out of programming. Open an XML document with namespaces and the inexplicable URI associations, and it feels like the oxygen has just been sucked out of the room. You find yourself sending IMs to the guy two cubes away: "Dood, lunch?" XPath is another amazing outcome of a standards committee in action. How do you forget that you have JavaScript, Python, and dozens of other bindings available for query logic? Somehow, the committee came out of the room with a new query language that had deviant escaping rules. This led me to doubt the credibility of the standard bearers for XML.

I originally penned this post with an illustration of how to alternately store hierarchical data as nested binary, variant structures. Oddly, I thought nobody else had ever thought about this, but it turns out, a couple of startups named Google and Facebook were already releasing APIs. Facebook Thrift and Google Protocol Buffers store compact, binary data, bypassing the "parsing" problem -- XML's biggest bottleneck. These toolkits are also type-safe, since each field is declared with a type in a schema instead of being stored as free text.

The existence of these toolkits is very strong evidence that XML fails in performance and type safety. Google is especially known for seeking out any possible performance gains, even something as small as a 1% boost. They apparently never believed that bigger, faster, better XML parsers would ever add up to the gains they would achieve by going old-school (binary).

So what's missing with these toolkits? Part of what made XML popular is the ability to view it in a text editor. It's easy to perceive binary information as "not portable". This is just a matter of perception -- JPEGs are binary, and they're pretty portable. Heck, UTF-8 text can look like an encrypted mess when viewed in a binary editor!

What we need is a way to load the binary nodes of these structured documents and display them in a ubiquitous, free, familiar editor. Each field of data is associated with a numeric key; these keys can be mapped to readable text of the user's choosing so that the editor can display useful information. If this sounds a bit abstract, it's basically the same concept as using #define or enum in code. Note that mappings are completely arbitrary, since they only affect what's displayed in the editor. This means that you could actually localize the editing of raw data.
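To make the idea concrete, here is a small Python 3 sketch. The record layout (a 2-byte key, a 1-byte type tag, a 4-byte length, then the payload) is invented for this example and is not the wire format of Thrift or Protocol Buffers; LABELS, encode, decode and display are likewise hypothetical names.

```python
import struct

HEADER = "<HcI"  # numeric key, type tag, payload length
HEADER_SIZE = struct.calcsize(HEADER)

# Hypothetical label maps: numeric field key -> display text, per language.
LABELS = {
    "en": {1: "name", 2: "age"},
    "de": {1: "Name", 2: "Alter"},
}


def encode(fields):
    """Pack {key: int or str} into a flat sequence of binary records."""
    out = bytearray()
    for key, val in fields.items():
        if isinstance(val, int):
            tag, payload = b"i", struct.pack("<q", val)
        else:
            tag, payload = b"s", val.encode("utf-8")
        out += struct.pack(HEADER, key, tag, len(payload)) + payload
    return bytes(out)


def decode(data):
    """Read the records back into a {key: value} dictionary."""
    fields, pos = {}, 0
    while pos < len(data):
        key, tag, size = struct.unpack_from(HEADER, data, pos)
        pos += HEADER_SIZE
        payload = data[pos:pos + size]
        pos += size
        if tag == b"i":
            fields[key] = struct.unpack("<q", payload)[0]
        else:
            fields[key] = payload.decode("utf-8")
    return fields


def display(data, lang):
    """Render the records the way an editor might, with mapped labels."""
    labels = LABELS[lang]
    return ["%s: %s" % (labels.get(k, k), v) for k, v in decode(data).items()]


blob = encode({1: "Chris", 2: 40})
print(display(blob, "en"))  # ['name: Chris', 'age: 40']
print(display(blob, "de"))  # ['Name: Chris', 'Alter: 40']
```

The stored bytes never change; only the label map does, which is what makes it possible to localize the editing of raw data.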

Let's get way out there on a limb with this editor concept. What if the structured data represented programming code? This opens up all kinds of new possibilities. One of the biggest roadblocks to maintaining a programming language is its grammar. The parsing of a language's grammar is a formidable task, even using YACC or ANTLR. With a structured data format, we could even program with the variable/function names translated to the developer's native tongue (again, mapping). The idea of parsing and altering programming code, as if it were data, is even more daunting. It has been tried, with ANT scripts using XML as a basic grammar, with MFC dispatch maps, and so on.

There is a precedent for code/data duality. ANT is a great example, but the use of XML makes it so verbose, it's not really very "programmable". JSON (JavaScript Object Notation) is another example of code/data duality. Resource strings are another possible example. What if they were smart enough to apply or override plurality rules for a language, on the fly? They would need a bit of programming logic to make them behave as something more than a chunk of data.

But I digress. Replacing XML with better format(s) offers some exciting possibilities. Until then, we're going to build slower, buggier software while we cope with DOM, SAX, XPath, XSL spaghetti logic, and exceptions related to namespace resolution.

In conclusion, I want to present a little analogy. What if our country had a Secretary of Data post in the president's cabinet? We would surely run a deep background check on candidates for that job. We would try to predict how they would perform under great pressure. We would try to prevent an unfit candidate from filling the post. If the candidate fails to perform later on, even after being confirmed, we would have hearings where the Secretary gets grilled. Well, we don't have a Department of Data. What we do have is terabytes of critical information being passed around in a format that impedes performance and type safety. Should this be worthy of hearings? XML was never vetted like the Secretary of Data would be! As our processing and storage needs continue to grow at a phenomenal rate, at some point, I believe the widespread use of XML is going to impede productivity in all software niches.

Why do we passively adopt half-baked standards that lead to big, long-term problems? We created the Y2K problem with full knowledge that it could lead to big trouble. It did indeed lead to trouble -- vast amounts of time and money were spent on preventing catastrophe. XML will prove itself unable to stand up to the sheer volume of data we generate, so we're looking at eventually redesigning a great deal of infrastructure. Stop the XML madness! Let's talk about this today, before another 10 years of investment in these shaky standards goes by!

Cheers,
Chris