Monday, November 10, 2008

A craftsman never blames his tools.

Long, long ago, I was using the Borland C/C++ v3.1 compiler. It was a nice little compiler that did exactly what you told it to. I was still in high school, and didn't have anybody who could help me with it. I wrote code, and it would work. I wrote more code, and it would work. I wrote yet more code, and it would work. Eventually, I would write enough code that it would crash. I would try to figure out what I'd done wrong, but it all compiled and ran. I would try to back out code that I had written recently (if only I'd known about version control back then). I'd get frustrated. I'd get really frustrated. I'd give up.

I eventually stopped writing code for it, and for years I swore it was a busted compiler. Eventually, I learned that the problem was stack corruption or abuse of pointers or any number of other beginner mistakes I had made. At the time, I'd always assume it was somebody else's fault. It was never my fault; I understood my code, I knew what it was supposed to do. What I did not know was how a compiler worked. I did not know how assembly code worked. I did not know how to run a debugger. I did not know that a mistake made 10,000 CPU cycles ago would cause my program to crash here. I did not realize how a memory corruption bug acted. All I knew was that it should have worked and it did not work.

I always blamed the tools I was using. The problem with this is that while no tool is infallible, in general they are more reliable than my code. One day, I was telling our VP of Engineering all about a problem I was having in some C code. The function 'strncpy' was not copying all of the bytes it was supposed to. I was sure the library routine had a bug in it. Sure, sure, sure... Right up until the VP of Engineering asked the simple question: "Could the code you are copying have a binary '\0' in it?"... I was dumbfounded that I had never considered it; that was precisely the problem. 'strncpy' stops copying at the first '\0', so the function I wanted was memcpy. Yes, it really was not the standard library function; it was the idiot calling it.
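Here is a minimal sketch of that surprise, not the original code. The helper names (matching_prefix, bytes_preserved_by_strncpy, bytes_preserved_by_memcpy) and the sample payload are made up for illustration; only 'strncpy' and 'memcpy' are the real library routines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: number of leading bytes on which two buffers agree.
static std::size_t matching_prefix(const char* a, const char* b, std::size_t n) {
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// Copy n bytes with strncpy and report how many bytes of src survived intact.
std::size_t bytes_preserved_by_strncpy(const char* src, std::size_t n) {
    assert(src != nullptr && n > 0);
    std::vector<char> dst(n, '#');
    // strncpy treats src as a C string: it stops at the first '\0' and
    // fills the remainder of dst with '\0' bytes.
    std::strncpy(dst.data(), src, n);
    return matching_prefix(dst.data(), src, n);
}

// Same experiment with memcpy, which copies exactly n raw bytes.
std::size_t bytes_preserved_by_memcpy(const char* src, std::size_t n) {
    assert(src != nullptr && n > 0);
    std::vector<char> dst(n, '#');
    std::memcpy(dst.data(), src, n);
    return matching_prefix(dst.data(), src, n);
}
```

Given the five bytes 'a', 'b', '\0', 'c', 'd', the strncpy version preserves only the first three, while the memcpy version preserves all five. Neither routine is wrong; they simply make different assumptions about what the data is.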

Another time, a co-worker named Dave and I were chasing down a segmentation fault that happened inside of some std::map code in C++. It was very strange; for 3 months I had failed to believe my user that this crash happened. Once we finally caught it red-handed, I could not figure it out. It had to be data dependent, so we captured all of that. We had all of the code involved, and it was clearly segmentation faulting inside of the STL code, not my code. It could not have been my fault; it was not crashing in my code. I looked and looked, and could not see how it could be my fault. So I followed the standard best practice of duplicating the problem in the smallest test case I could. Still no joy. In the aftermath, I found out the reason I could not duplicate the problem was a failure of imagination. I interpreted the failure as a sign that something else was wrong. I finally was ready to give up and blame the tools. It was an infrequent problem, and generally only cropped up every couple of weeks. Not a huge deal.

Dave was not giving up. He kept after it, and eventually figured out that the reason it was crashing was that the custom type the std::map was keying off of had a buggy 'operator<'. The order of a std::map is completely dictated by the implementation of 'operator<'. The STL code assumed and optimized around the fact that 'operator<' would be a proper strict weak ordering, which is what std::map requires of its comparison. The type T that was being operated on was a two element tuple. The case that it mishandled was (x, y) < (a, b) when x equals a. In that case it mis-sorted the results. My failure of imagination in reproducing the test case was that I re-implemented all of the data using only the first element of the tuple. While the problem exhibited itself inside of the STL code, it was a bug in my code that was causing all of the problems.

98% of the time I have blamed a tool, ultimately the fault has been mine. I have found a handful of bugs in tools, but what I have learned is that those bugs are far less likely than me being a bonehead. Blaming my tools is generally a cop-out - it is a sign that I have given up when I should have kept pressing on.

The key here is knowing what your tools are: understanding fundamentally how the tools are supposed to work, and how problems I created could affect my tools. The Borland C/C++ tools only had problems because I had corrupted the stack or heap unwittingly. So while the code crashed in an arbitrary piece of code that was "just fine", the problem was that I did not understand what a stack or heap was at the time. I surely had no idea that my mistakes could cause such horrible problems. In the case of memcpy vs. strncpy, the problem was that I never validated that the tool's assumptions and pre-conditions matched mine. The std::map problem was exactly the same: I was not thinking clearly, I had violated a precondition of using 'std::map', and ultimately the fault was mine.

When you find yourself blaming the tools, double check everything. Then double check it all again. Most of the time, tools are of reasonable quality. Most of the time, it is not the tool. If the tool is your catch-all for bugs, pick better tools or look really closely in a mirror. I know it took me a long, long time to realize that the safest assumption I can make is: "I am an idiot". Never blame your tools until you can demonstrate how and where the tool fell down.
