Monday, November 10, 2008

There is no magic, only technology.

A common mistake I saw while tutoring folks in Algebra was the belief that there was magic involved, some secret trick that nobody had told them. I see the same problem with a lot of computer users, programmers, and software engineers.

A quote from Arthur C. Clarke seems apropos:

Any sufficiently advanced technology is indistinguishable from magic.


The corollary to walk away with is that once you understand the advanced technology, it turns from magic back into technology. It is my responsibility to build technology, not magic. This means that anything I do not understand, I need to get to the bottom of. Mysterious behavior and apparent magic need to be understood. Never allow something to remain magic. Always track down the secret of the weird behavior.

When your software crashes intermittently and then the behavior stops, check it out. Get to the bottom of it. There is no magic; there is a well-founded, solid reason for it. Sometimes it is your software, sometimes it is the tools you use, sometimes it is a failing piece of hardware.

Things I did not understand used to trip me up and cause me problems. Reentrancy, thread safety, and heap corruption were all issues that caused me trouble. One of the nastier issues I remember reading about was in the Linux kernel, dealing with the PCI bus and write re-ordering. Inside the kernel, there would be a chunk of memory mapped to the PCI bus, so writing to that memory would transmit over the PCI bus to the card. The compiler would analyze the instructions and realize that if the order of the writes were modified, the register usage could be optimized and the code would run faster, while the behavior of the program (as far as the compiler could tell) would be identical. C/C++ compilers do this sort of thing all the time, and it leads to huge performance gains. The problem in this case was that with a lot of hardware, you must do the writes in the proper order or the hardware will lock up or behave erratically.

People would write new Linux drivers, and they would work a lot of the time, but crash intermittently. They would examine the code, and it would all make sense and look perfect. They would eventually ask for help, and people would explain the "magic" of the C/C++ compiler, and then show them the PCI interfaces for writing to memory-mapped PCI regions. Thus the "magic" turned into advanced technology. The PCI interface had what are referred to as "memory barriers", across which read/write operations could not be re-ordered. This nicely encapsulated all of the nastier bits into a simple-to-use interface, as long as you remembered to use it, and it was highly portable across multiple target architectures. The original problem had been found when someone disassembled the instructions and read the lower-level "magic". Writing the new, safe API turned it into mere technology, because it was well documented what it was and why it existed.

On a more personal note, I was using the Eclipse IDE and building an RCP application. I was the "new guy"; most everyone else understood the RCP system better than I did. While trying to run a product we had developed, it would report an error: "Could not start plug-in 'foo' due to unresolved dependencies." The standard accepted solution was to go to the Run configuration for the product and click "Add all Required Dependencies". After doing that, the program would run just fine. Development would proceed, and everything was fine, right up until we attempted to export the product and run it outside of the IDE. When the product was exported, it never worked. The product had the list of all of the plug-ins that would be exported, and when we clicked "Add all Required Dependencies", we had just given in to the "magic" and moved away from technology. By comparing the list of plug-ins before and after "Add all Required Dependencies", we deduced which plug-ins were missing. We would evaluate why each was needed, and then add it, or remove the dependency. The root problem was that nobody understood the technology.

The folks I had learned from had learned about Eclipse from a couple of "How to get started with Eclipse" tutorials on the internet and similar sources. They had learned a lot about how to interact with the Eclipse IDE, how to generate Eclipse RCP products, and lots of other mid-level-and-up pieces of technology, which was a great thing. The Eclipse RCP platform is really slick and handles a lot of nasty issues that crop up when building a plug-in based system. What had not been done was a close inspection of the lower-level technology. No one had looked into the OSGi level and how that technology worked. Understanding this was critical to comprehending why exported products did not work. It was critical to learning how to debug specific low-level problems, and to troubleshooting why standard Java libraries failed to work correctly inside of an RCP application. In the end, it turned out that OSGi bundles and classloader segmentation were causing numerous problems. Once the way those pieces operated was understood, numerous "magic" problems turned into simple technology issues to overcome. These pieces had to be well understood before well-engineered software could come out of the system.

The lesson is that magic cannot be allowed to exist. Whether that magic is some bug being exhibited, or something you do not understand in the framework or libraries you are using, nothing that you do not understand can be left as magic. All of it must be turned into technology. Sometimes this will mean a lot of research into areas and levels of technology that are unnecessary to understand 99.9% of the time. However, in the 0.1% case it is absolutely critical that they be understood. Most people never need to worry about re-ordered writes in C/C++, and it is very infrequent that understanding how OSGi works is needed while implementing an Eclipse RCP application. Understanding and comprehension of the magic is required; there is no substitute. Weird intermittent crashes, code that does not act as it should, and technology or tools that fail for unknown reasons must be tracked down. Learning more about those tools will shed light on the problem at hand, and on future problems.

When you do not understand the underpinnings of technology and how things work, you are doomed to repeat history: history like the epicycles of pre-Copernican astronomy, or even Newtonian mechanics. The anomalies, the things going wrong, were an opportunity and a clue that something we did not understand was happening. Digging deeper, and keeping at it until there was a better explanation, was critical to the translation of magic into science and technology. One of my harder-learned lessons was that anything I realized I did not understand had a very good chance of becoming a problem, or more likely was a problem that I had yet to identify.



A craftsman never blames his tools.

Long, long ago, I was using the Borland C/C++ v3.1 compiler. It was a nice little compiler that did exactly what you told it to. I was still in high school and didn't have anybody who could help me with it. I wrote code, and it would work. I wrote more code, and it would work. I wrote yet more code, and it would work. Eventually, I would write enough code that it would crash. I would try to figure out what I'd done wrong, and it all compiled and ran. I would try to back out code that I had written recently (if only I'd known about version control back then). I'd get frustrated. I'd get really frustrated. I'd give up.

I eventually stopped writing code for it, and for years I swore it was a busted compiler. Eventually, I learned that the problem was stack corruption, or abuse of pointers, or any number of other beginner mistakes I had made. At the time, I'd always assume it was somebody else's fault. It was never my fault; I understood my code, and I knew what it was supposed to do. What I did not know was how a compiler worked. I did not know how assembly code worked. I did not know how to run a debugger. I did not know that a mistake made 10,000 CPU cycles ago could cause my program to crash here. I did not realize how a memory corruption bug acted. All I knew was that it should have worked, and it did not.

I always blamed the tools I was using. The problem with this is that while no tool is infallible, in general tools are more reliable than my code. One day, I was telling our VP of Engineering all about this problem I was having in some C code. The function 'strncpy' was not copying all of the bytes it was supposed to. I was sure the library routine had a bug in it. Sure, sure, sure... right up until the VP of Engineering asked a simple question: "Could the code you are copying have a binary '\0' in it?" Dumbfounded that I had never considered it, I realized that was precisely the problem. The function I wanted was memcpy. No, it was not the standard library function that was broken; it was the idiot calling it.

Another time, a co-worker named Dave and I were chasing down a segmentation fault that happened inside of some std::map code in C++. It was very strange; for three months I had refused to believe my user that this crash happened. Once we finally caught it red-handed, I could not figure it out. It had to be data dependent, so we captured all of the data. We had all of the code involved, and it was clearly segfaulting inside of the STL code, not my code. It could not have been my fault; it was not crashing in my code. I looked and looked, and could not see how it could be my fault. So I followed the standard best practice of duplicating the problem in the smallest test case I could. Still no joy. I interpreted the failure as a sign that something else was wrong, and I finally was ready to give up and blame the tools. It was an infrequent problem, and generally only cropped up every couple of weeks. Not a huge deal. Dave was not giving up. He kept after it, and eventually figured out that the crash happened because the custom type the std::map was keying off of had a buggy 'operator<'. The order of a std::map is completely dictated by the implementation of 'operator<'. The STL code assumed, and optimized around the fact, that 'operator<' would be a strict weak ordering. The type being operated on was a two-element tuple, and the case it mishandled was (x, y) < (a, b) when x equals a; in that case it mis-sorted the results. In the aftermath, I found out the reason I could not duplicate the problem was a failure of imagination: I had re-implemented all of the data using only the first element of the tuple. While the problem exhibited itself inside of the STL code, it was a bug in my code that was causing all of the problems.

98% of the time I have blamed a tool, the fault has ultimately been mine. I have found a handful of bugs in tools, but what I have learned is that those bugs are far less likely than my being a bonehead. Blaming my tools is generally a cop-out; it is a sign that I have given up when I should have kept pressing on.

The key here is knowing your tools: understanding fundamentally how they are supposed to work, and how problems you create can affect them. The Borland C/C++ tools only had problems because I had unwittingly corrupted the stack or heap. So while the code crashed in an arbitrary piece of code that was "just fine", the real problem was that I did not understand what a stack or heap was at the time; I surely had no idea that my mistakes could cause such horrible problems. In the case of memcpy vs. strncpy, the problem was that I never validated that the tool's assumptions and preconditions matched mine. The std::map problem was the same: I had violated a precondition of 'std::map', and ultimately the fault was mine.

When you find yourself blaming the tools, double-check everything. Then double-check it all again. Most of the time, tools are of reasonable quality. Most of the time, it is not the tool. If the tool is your catch-all for bugs, pick better tools or look really closely in a mirror. I know it took me a long, long time to realize that the safest assumption I can make is "I am an idiot". Never blame your tools until you can demonstrate how and where the tool fell down.