Sunday, May 4, 2008

Complex Data Simple Code

This lesson is probably best expressed by Fred Brooks in The Mythical Man Month, Chapter 9, Section "Representation Is the Essence of Programmer":

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.


In modern parlance that would be "Explain your data structures, and figuring out the code is easy. Show me the code and figuring out your data structures will be hard". The corollary of this that if you push as much code into a data structure as possible, the code will be even easier to read.

I've been proud at many points of the fact that I'd written lots and lots of code. However, now I'm more proud when I cleverly pack more functionality into smaller, more maintainable code.

C/C++ code that looks like this:


setValue("x","y");
setValue("foo", "bar");
setValue("bar", "baz");


Should always be translated into something like this:


const static struct
{
const char *key;
const char *value;
} list[] =
{
{ "x", "y" },
{ "foo", "bar" },
{ "bar", "baz" },
};

for (unsigned iii = 0; iii < sizeof(list) / sizeof(list[0]); ++iii)
{
setValue(list[iii].key, list[iii].value);
}


While this code looks far more complex, imagine if you need to edit it. What if after every "setValue" call another call needed to be made? In the first one, you are adding 3 very repetitive lines. You'll be tempted to cut and paste, which is a very good way to introduce a bug. Imagine if that list grows to 100! At least for me, it's vastly easier to say: "There's 100 sets of data that are handled identically" than to read 100 lines of code and examine them to ensure that nothing changed. If one of those 100 lines is slightly different, it's hard to spot.

By pushing this into a data structure, the complexity is split. In this problem, there is the issue of "What data am I working on", versus the issue of "What am I doing with that data". In the "table" driven method it is easy to check the data orthogonally to checking what is done with that data. With the "write it out method", any errant typo in either the code or the data is a problem. It's also difficult to focus on hundreds of items to ensure that no single minor textual difference is the cause of handling "foo" differently then "bar".

Now, imagine if there are 5 lines of nearly identical code repeated 20 times. Which is easier to read? The 20 lines of config and 5 lines in a for loop, or 100 lines of code? It is also far easier to well comment the mere five lines of code, and the data structure itself can serve as ample documentation.


const static struct
{
char *key;
char *value;
bool globalValue;
} list[] =
{
{ "key", "value", true },
// ... more data goes here...
}

for (/* iterate over list */)
{
map.setValue(list[iii].key, list[iii].value);
if (list[iii].globalValue)
{
global.addKey(list[iii].key);
}
}


Avoiding a typo between the map.setValue(...) and global.add(...) values when you're cutting and pasting isn't an issue. It is also obvious by inspection which ones are global and which ones aren't.

Another common example is the dreaded if/else or switch statement. Personally I abhor both, and I attempt to avoid them if possible. Those are good places to push complexity out of code and into a data structure. A complex switch statement can normally be turned into a C++/Java map data structure where the key is case value and the second is a function pointer or functor that contains the body of the specific case. If/else chains, especially nested and symmetric cases, can a lot of times be re-factored into a more table driven system.

So in summary; try and push similar, structured and repetitious code into a data structure that simple code can operate on. Depending on the compiler and the tool chain, it might be much faster to use other control structures, as compiler writers spend lots of time optimizing those. So be sure and profile the code to see if a performance critical section needs this undone. I've seen the other case, where it runs faster due to caching effects, or more commonly, it compiles much faster, especially with older compilers.

No comments: