Coding Hard Knocks

Monday, November 10, 2008

There is no magic, only technology.

A common mistake I saw while tutoring folks in Algebra was that they thought that there was magic involved. They thought that there was some secret trick that nobody told them. I see the same problem with a lot of computer users, programmers, and software engineers.

A quote of Arthur C. Clark seems apropos:

Any sufficiently advanced technology is indistinguishable from magic.

The corollary to walk away with is that if you understand the advanced nature, it turns from magic to technology. It is my responsibility to build technology, not magic. This means that anything I do not understand I need to get to the bottom of. Mysterious behavior and apparent magic are something that need to be understood. Never allow something to be magic. Always track down the secret of the weird behavior.

When your software crashes intermittently and the behavior stops, check it out. Get to the bottom of it. There is no magic. There is a well founded solid reason for it. Get to the bottom of it. Sometimes it is your software, sometimes it is the tools you use, sometimes it is a failing piece of hardware.

Things I did not understand used to trip me up and cause me problems. Re-entrant safe code, threadsafety, and heap corruption were all issues that caused me problems. One of the nastier issues I remember reading about was in the Linux Kernel dealing with PCI bus and speculative instruction re-ordering. Inside of the Kernel, there would be a chunk of memory that was mapped to the PCI bus. Thus writing to that memory would transmit over the PCI bus to the card. What would happen was that the compiler would translate the instructions and realize that if the order of the writes were modified, the register usage could be optimized and the code would run faster, but the behavior of the program would be identical. C/C++ compilers do this sort of thing all the time, and it leads to huge performance gains. The problem with this case was that with a lot of hardware, you must do the writes in the proper order or the hardware will lock up or behave erratically.

People would write new Linux drivers, and they would work a lot of the time, but crash intermittently. They would examine the code, and it would all make sense and look perfect. They would eventually ask for help, and people would explain that "magic" of the C/C++ compiler, and then show them the PCI interfaces for writing to PCI memory mapped to physical memory. Thus the "magic" turned into advanced technology. The PCI interface had what are referred to as "memory barriers" across which read/write operations could not be re-ordered. This nicely encapsulated all of the nastier bits into a simple to use interface as long as you remembered to use it. It was highly portable across multiple target architectures. The problem was eventually found when someone disassembled the instructions and read the lower level "magic". Writing the new API that was safe turned it into mere technology, because it was well documented what it was and why it existed.

On a more personal problem, I was using the Eclipse IDE and building an RCP application. I was the "new guy", most everyone else understood better how the RCP system worked. While trying to run a product we had developed, it would crop up an error about "Could not start plug-in 'foo' due to unresolved dependencies." The standard accepted solution to this was to go to the Run configuration for the product and click "Add all Required Dependencies". After doing that the program would run just fine. Development would proceed, and everything was fine - right up until we attempted to export the product and run it outside of the IDE. When the product was exported, it never worked. The problem was that the product had the list of all of the plug-ins that would be exported. When we clicked "Add all Required Dependencies", we had just given in to the "magic", and moved away from technology. By looking at the list of plug-ins before and after "Add all Required Dependencies", we deduced what plug-ins were missing. We would evaluate why those were needed, and then add them, or remove the dependency. The problem was that nobody understood the technology.

The folks I had learned from learned about Eclipse from a couple of "How to get started with Eclipse" on the internet or other aspects. They had learned a lot about how to interact with the Eclipse IDE, how to generate Eclipse RCP products, and lots of other mid-level and up pieces of technology, which was a great thing. The Eclipse RCP platform is really slick and handles a lot of nasty issues that crop up when building a plug-in based system. What had not been done was a close inspection of the lower level technology. No one had looked into the OSGi level and how that technology worked. Understanding this was critical to comprehending why exported products did not work. Learning about that was critical to learning how to debug specific low level problems. It was critical to troubleshooting why standard Java libraries failed to work correctly inside of an RCP application. In the end, it was critical that OSGi bundles and classloader segmentation were causing numerous problems. Once the way those pieces operated was understood, numerous "magic" problems turned into simple technology issues to overcome. It was critical that these pieces be well understood before well engineered software could come out of the system.

The lesson is that magic cannot be allowed to exist. (AJD: I can't make heads or tails of the following partial sentence) If that magic is some bug that is exhibited, or something you do not understand in the framework or libraries you are using. Nothing that you do not understand can be left as magic. All of it must be turned into technology. Sometimes this will mean a lot of research into areas and levels of technology that are unnecessary to understand 99.9% of the time. However, in the 0.1% case it is absolutely critical that this be understood. Most people never need to worry about re-ordered writes in C/C++, and it is very infrequent that understanding how OSGi works is needed while implementing an Eclipse RCP application. Understanding and comprehension of the magic is required - there is no substitute. Weird intermittent crashes, code that does not act as it should, and technology or tools that fail for some unknown reasons must be tracked down. Learning more about those tools will be enlightening to the problem at hand, and about future problems.

When you do not understand the underpinnings of technology and how things works, you are doomed to repeat history. History like the comprehension of epicycles, or even of Newtonian Mechanics. The anomalies, the problems that were going wrong were an opportunity and clue that something we did not understand was going on. Digging deeper, keeping at it until there was a better explanation was critical to the translation of magic into science and technology. One of the harder learned lessons was that anything I realized I did not understand had a very good chance to become a problem, or more likely was a problem that I had yet to identify.

A craftsman never blames his tools.

Long, long ago, I was using the Borland C/C++ v3.1 compiler. It was a nice little compiler that did exactly what you told it to. I was still in high school, and didn't have anybody who could help me with it. I wrote code, and it would work. I wrote more code, and it would work. I wrote yet more code, and it would work. Eventually, I would write enough code that it would crash. I would try and figure out what I'd done wrong, and it all compiled, and ran. I would try to back out code that I had written recently (if only I'd known about version control back then). I'd get frustrated. I'd get really frustrated. I'd give up.

I eventually stopped writing code for it, and for years I swore it was a busted compiler. Eventually, I learned that the problem was stack corruption or abuse of pointers or any number of other beginning mistakes I had made. At the time, I'd always assume it was somebody else's fault. It was never my fault; I understood my code, I knew what it was supposed to do. What I did not know was how a compiler worked. I did not know was how assembly code worked. I did not know how to run a debugger. I did not know that a mistake made 10,000 CPU cycles ago would cause my program to crash here. I did not realize how a memory corruption bug acted. All I knew was that it should have worked and it did not work.

I always blamed the tools I was using. The problem with this is that while not all tools are infallible, in general they are more reliable than my code. One day, I was telling our the VP of Engineering all about this problem I was having in some C code. The function 'strncpy' was not copying all of the bytes it was supposed to. I was sure the library routine had a bug in it. Sure, sure, sure... Right up until the VP of Engineering asked the simple question: "Could the code you are copying have a binary '\0' in it?"... Dumbfounded that I had never considered it, that was precisely the problem. The function I wanted was memcpy. Yes, it was not the really standard library function, it was the idiot calling it.

Another time, a co-worker named Dave and I were chasing down a segmentation fault that happened inside of some std::map code in C++. It was very strange, I had failed to believe my user for 3 months that this crash happened. Once we finally caught it red-handed, I could not figure it out. It had to be data dependent, so we captured all of that. We had all of the code involved, it was clearly segmentation faulting inside of the STL code, not my code. It could not have been my fault, it was not crashing in my code. I looked and looked, and could not see how it could be my fault. So I go follow a standard best practice of duplicating the problem in the smallest test case I could. Still no joy. In the aftermath, I found out the reason I could not duplicate the problem was a failure of imagination. I interpreted the failure as a sign that something else was wrong. I finally was ready to give up and blame the tools. It was an infrequent problem, and generally only cropped up every couple of weeks. Not a huge deal. Dave was not giving up. He kept after it, and eventually figured out that the reason it was crashing was because the custom time the std::map was keying off of had a buggy 'operator<'. The order of a std::map is completely dictated by the implementation of 'operator<'. The STL code assumed and optimized around the fact that the 'operator<' would be a proper partial ordering (or is it total ordering?). The type T that was being operated on was a two element tuple. The case that it mishandled was (x, y) < (a, b) when x equals a. In that case it mis-sorted the results. My failure of imagination in reproducing the test case was that I re-implemented all of the data using only the first element of the tuple. While the problem exhibited itself inside of the STL code, it was a bug in my code that was causing all of the problems.

98% of the times I have blamed a tool, ultimately the fault has been mine. I have found a handful of bugs in tools, but what I have learned is that those bugs are far less likely then me being a bonehead. Blaming my tools is generally a cop out - it is a sign that I have given up when I should have kept pressing on.

The key here is knowing what your tools are: understanding fundamentally how the tools are supposed to work, and how problems I created could affect my tools. The Borland C/C++ tools only had problems because I had corrupted the stack or heap unwittingly. So while the code crashed in an arbitrary piece of code that was "just fine", the problem was that I did not understand what a stack or heap was at the time. I surely had no idea that my mistakes could cause such horrible problems. In the case of memcpy vs. strncpy, the problem was that I never validated that the tools' assumptions and pre-conditions matched mine. The final case was me not thinking clearly. The std::map problem was exactly the same, I had violated a precondition of using the 'std::map', and ultimately the fault was mine.

When you find yourself blaming the tools, double check everything. Then double check it all again. Most of the time, tools are of reasonable quality. Most of the time, it is not the tool. If the tool is your catch all for bugs, pick better tools or look really closely in a mirror. I know it took me a long, long time to realize that the safest assumption I can make is: "I am an idiot". Never blame your tools until you can demonstrate how and where the tool fell down.

Sunday, May 4, 2008

Complex Data Simple Code

This lesson is probably best expressed by Fred Brooks in The Mythical Man Month, Chapter 9, Section "Representation Is the Essence of Programmer":

Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.

In modern parlance that would be "Explain your data structures, and figuring out the code is easy. Show me the code and figuring out your data structures will be hard". The corollary of this that if you push as much code into a data structure as possible, the code will be even easier to read.

I've been proud at many points of the fact that I'd written lots and lots of code. However, now I'm more proud when I cleverly pack more functionality into smaller, more maintainable code.

C/C++ code that looks like this:


setValue("x","y");
setValue("foo", "bar");
setValue("bar", "baz");

Should always be translated into something like this:


const static struct
{
 const char *key;
 const char *value;
} list[] =
{
 { "x", "y" },
 { "foo", "bar" },
 { "bar", "baz" },
};

for (unsigned iii = 0; iii < sizeof(list) / sizeof(list[0]); ++iii)
{
 setValue(list[iii].key, list[iii].value);
}

While this code looks far more complex, imagine if you need to edit it. What if after every "setValue" call another call needed to be made? In the first one, you are adding 3 very repetitive lines. You'll be tempted to cut and paste, which is a very good way to introduce a bug. Imagine if that list grows to 100! At least for me, it's vastly easier to say: "There's 100 sets of data that are handled identically" than to read 100 lines of code and examine them to ensure that nothing changed. If one of those 100 lines is slightly different, it's hard to spot.

By pushing this into a data structure, the complexity is split. In this problem, there is the issue of "What data am I working on", versus the issue of "What am I doing with that data". In the "table" driven method it is easy to check the data orthogonally to checking what is done with that data. With the "write it out method", any errant typo in either the code or the data is a problem. It's also difficult to focus on hundreds of items to ensure that no single minor textual difference is the cause of handling "foo" differently then "bar".

Now, imagine if there are 5 lines of nearly identical code repeated 20 times. Which is easier to read? The 20 lines of config and 5 lines in a for loop, or 100 lines of code? It is also far easier to well comment the mere five lines of code, and the data structure itself can serve as ample documentation.


const static struct
{
 char *key;
 char *value;
 bool globalValue;
} list[] =
{
 { "key", "value", true },
 // ... more data goes here...
}

for (/* iterate over list */)
{
 map.setValue(list[iii].key, list[iii].value);
 if (list[iii].globalValue)
 {
   global.addKey(list[iii].key);
 }
}

Avoiding a typo between the map.setValue(...) and global.add(...) values when you're cutting and pasting isn't an issue. It is also obvious by inspection which ones are global and which ones aren't.

Another common example is the dreaded if/else or switch statement. Personally I abhor both, and I attempt to avoid them if possible. Those are good places to push complexity out of code and into a data structure. A complex switch statement can normally be turned into a C++/Java map data structure where the key is case value and the second is a function pointer or functor that contains the body of the specific case. If/else chains, especially nested and symmetric cases, can a lot of times be re-factored into a more table driven system.

So in summary; try and push similar, structured and repetitious code into a data structure that simple code can operate on. Depending on the compiler and the tool chain, it might be much faster to use other control structures, as compiler writers spend lots of time optimizing those. So be sure and profile the code to see if a performance critical section needs this undone. I've seen the other case, where it runs faster due to caching effects, or more commonly, it compiles much faster, especially with older compilers.

Sunday, April 27, 2008

Change Whitespace, or change functionality but not both.

When changing software, do yourself a huge favor. For every commit, pick one of these two things to do:

Modify the implementation and supporting documentation
Modify nothing about the implementation, but instead prettify the code

It is a huge mental effort for humans to process a patch with both style and functional modifications.

This rule has three corollaries:

Patches that modify both induce extra effort for reviewers
Applying whitespace changes in a place that is likely to be diff'ed across often should be avoided if at all possible.
Whitespace should be done correctly the first time, when the code is implemented.

The simple problem is that computers are great at comparing two blocks of flat text, and humans are really awful at it. When diff'ing the code, it's very difficult to spot the functionality changes that are embedded inside of the areas that also have whitespace changes. In my experience, it is easier for the person making the changes to modify the whitespace first, and then make the necessary functionality changes. That is the case because the whitespace changes should make the code easier to read (otherwise, why change it?). However, it makes it harder for someone to merge their changes in.

Let's setup a scenario:

  r2 - r3 - r4
   \
    wc1

Developer 1 checked out r2, and made a functional change r3, and whitespace changes r4. Meanwhile wc1 is r2 plus the functional changes by developer 2. There are a couple of ways to approach this. Generally the easiest to do is the following:

cd wc1
svn diff > /tmp/patch.tmp
cd ..
mkdir wc2
cd wc2
svn co -r 3 http://repo/project
patch -p1 -i /tmp/patch.tmp

At this point, there are no whitespace changes in the merge, so all of the merge conflicts are real problems. Next, resolve any merge conflicts. The scenario will look like:

r2 - r3 - r4
      \
       wc2

Then continue the standard workflow:

cd wc2
svn update

At this point, resolve all merge conflicts by choosing the one from the workspace. Choose the one from the workspace, you will want to modify all of the new code to conform to the new whitespace rules. Before committing, review the diff to ensure that there are only functionality differences.

When making a whitespace-only change, clearly mark it, and ensure that using diff with the ignoring whitespace option shows no changes. Reviewers should validate that it is a whitespace only change.

This will save a great deal of trouble when reviewing. It should be stated that prettifying the code should be avoided - the whitespace being wrong should be something that reviewers catch and enforce prior to committing.

In code that is likely to be merged or branched, it is generally best to avoid prettyifing the code. It is always a hassle to read a diff that crosses a significant amount of whitespace changes. It is not uncommon for the author to diff up to the whitespace change, and diff after whitespace changes, and then apply both changes (by hand if necessary).

"Listen To Your Tools"

This is the first post to this blog. Never written one, so this will be a bit of an adventure.

My first thought is:

Listen to your tools.

So, let me tell you a little story. There I was, about a year into working on C code professionally. The project I was working on was an Airbus A-320 simulator that was originally written in QNX 2.0 and was being forward ported to QNX 4.2x. It was a really nice little project. The problem I was solving was the last in a long line of problems; one of our senior developers had run out of time, and it was left to me. We were modifying a video driver that used DMA and I/O ports. All that was done, and everything worked except for the fonts. The screen display had circles, boxes, lines, and raster images. The problem was that all the letters showed up as vertical lines.

So I compiled up the code, and 30 screens of warnings go flying by. It was a really large project, with hundreds of files. Most of the files were modified by an automated tool called MIG24 (Migrate from QNX 2 to QNX 4). So I listened to everybody who told me, "Don't worry about those warnings, they're all over the place". One of the warnings in the files I was was working on was: "No prototype for 'round()'". This concerned me as I knew we used the round function. I asked around, and was told 'round' was a standard C function and not to worry about it.

I ground up against this problem for about three weeks. I assumed we were sending the wrong commands to the hardware. I spent a great deal of time looking into that. And then more and more time; poring over that area of code, using the debugger, spending lots of time writing little test drivers, compiling and re-compiling that code over and over again. I watched the screens full of errors, happily ignoring them for a week or two. Eventually I finally figured the hardware commands were right, otherwise all the other stuff wouldn't work. So I started to debug the font rendering code directly. After a day or two of that, I checked the coordinates of the fonts. Huh? The coordinates that we are pushing to the hardware should draw vertical lines. Another important lesson:

Computers almost always do exactly what you tell them to.

So I start poking around some more, and look at the coordinate computations. They are all clearly correct. They are using the scaling ratio to take a standard size letter and scale it to the right size. So I go check the code that's reading in the fonts, and I check the source data of the font. Hmmm, that all looks good. So I go look at that "obviously correct" scaling code:

float scaleFactor = /* floating point computations. */
int scaledX = round(initX * scaleFactor);

So after printing out scaledX, initX and scaleFactor, I finally figure out that my whole problem was in "round". I tool around some more and find out that C compilers (or at least this one), generate a prototype that assumes all arguments are ints. So it would take "initX * scaleFactory", which produces a float, and pass it using the integer calling conventions, because there was no prototype. So after three weeks of chasing DMA, reading hardware specs, re-confirming the I/O bit-twiddling, chasing file I/O and validating input, I find out that all I had to do was go to the header file for the helper math routines and add

int round(float value);

That was it. Three whole weeks of hard work assuming it was the dirtiest, nastiest problem it could be. The whole time, from the very first compile, my tools told me exactly what the problem was. I just didn't want to listen.

This was my hard knock lesson in listening to my tools. The people who write tools are generally fairly clever folks. They do not warn about things for no good reason, especially at the default warning level. When a tool generates a warning, the first order of business is to research why it does, what the implications of that warning are, and to then fix that warning.

When a tool, especially a high quality tool, prints out a warning - no matter what the tools, be it a C/C++ compiler, Python, the make build system, etc - take the time to track it down. Who knows? It might save you three weeks of hard knocks.