Monday, November 10, 2008

There is no magic, only technology.

A common mistake I saw while tutoring folks in algebra was the belief that magic was involved. They thought there was some secret trick that nobody had told them. I see the same problem with a lot of computer users, programmers, and software engineers.

A quote from Arthur C. Clarke seems apropos:

Any sufficiently advanced technology is indistinguishable from magic.


The corollary to walk away with is that once you understand the advanced technology, it turns from magic into technology. It is my responsibility to build technology, not magic. This means that I need to get to the bottom of anything I do not understand. Mysterious behavior and apparent magic need to be understood. Never allow something to remain magic. Always track down the secret behind the weird behavior.

When your software crashes intermittently and then the behavior stops, check it out anyway. There is no magic. There is a solid, well-founded reason for it. Get to the bottom of it. Sometimes it is your software, sometimes it is the tools you use, and sometimes it is a failing piece of hardware.

Things I did not understand used to trip me up and cause me problems. Re-entrancy, thread safety, and heap corruption all caused me trouble. One of the nastier issues I remember reading about was in the Linux kernel, and it involved the PCI bus and instruction re-ordering. Inside the kernel, there would be a chunk of memory that was mapped to the PCI bus, so writing to that memory would transmit over the PCI bus to the card. What would happen was that the compiler would translate the code and realize that if the order of the writes were changed, the register usage could be optimized and the code would run faster, while the behavior of the program would, as far as the compiler could tell, be identical. C/C++ compilers do this sort of thing all the time, and it leads to huge performance gains. The problem in this case was that with a lot of hardware, you must do the writes in the proper order or the hardware will lock up or behave erratically.

People would write new Linux drivers that worked a lot of the time but crashed intermittently. They would examine the code, and it would all make sense and look perfect. Eventually they would ask for help, and people would explain the "magic" of the C/C++ compiler and then show them the PCI interfaces for writing to memory mapped onto the PCI bus. Thus the "magic" turned into advanced technology. The PCI interface had what are referred to as "memory barriers", across which read/write operations could not be re-ordered. This nicely encapsulated all of the nastier bits in a simple-to-use interface, as long as you remembered to use it, and it was highly portable across multiple target architectures. The problem itself was eventually found when someone disassembled the generated instructions and read the lower level "magic". Writing a new, safe API turned it into mere technology, because what it was and why it existed were well documented.
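To make the idea of a barrier concrete, here is a minimal sketch in C. It is not the kernel's actual interface: the register layout `demo_regs`, the macro `demo_compiler_barrier`, and the function `demo_start_transfer` are names made up for illustration, and the required write order is an assumption about an imaginary device. The barrier uses GCC/Clang inline-assembly syntax and only constrains the compiler. A real driver would use the kernel's own helpers (for example `wmb()`), which also deal with re-ordering by the CPU and the bus.

```c
#include <stdint.h>

/* Hypothetical register block for an imaginary device (not a real card). */
struct demo_regs {
    uint32_t addr;     /* assumed: must be written first */
    uint32_t data;     /* assumed: must be written second */
    uint32_t command;  /* assumed: writing 1 starts the transfer */
};

/*
 * Compiler barrier in GCC/Clang syntax. The "memory" clobber tells the
 * compiler that memory may be read or written here, so it may not move
 * loads or stores across this point. It emits no CPU instruction, so it
 * does not stop the processor or the bus from re-ordering anything.
 */
#define demo_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Program the imaginary device, keeping the three stores in source order. */
void demo_start_transfer(volatile struct demo_regs *regs,
                         uint32_t addr, uint32_t data)
{
    regs->addr = addr;
    demo_compiler_barrier();
    regs->data = data;
    demo_compiler_barrier();
    regs->command = 1;
}
```

The point of wrapping the writes in one function is the same as in the story above: the ordering rule lives in a single documented place instead of in every driver author's head.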

On a more personal note, I was using the Eclipse IDE and building an RCP application. I was the "new guy"; most everyone else had a better understanding of how the RCP system worked. When we tried to run a product we had developed, it would throw up an error saying "Could not start plug-in 'foo' due to unresolved dependencies." The standard accepted solution was to go to the Run configuration for the product and click "Add all Required Dependencies". After doing that, the program would run just fine. Development would proceed, and everything was fine - right up until we attempted to export the product and run it outside of the IDE. When the product was exported, it never worked. The problem was that the product definition held its own list of all the plug-ins that would be exported, and the button only changed the Run configuration. When we clicked "Add all Required Dependencies", we had just given in to the "magic" and moved away from technology. By comparing the list of plug-ins before and after "Add all Required Dependencies", we deduced which plug-ins were missing. We would evaluate why each one was needed, and then either add it to the product or remove the dependency. The underlying problem was that nobody understood the technology.
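The before-and-after comparison is just a set difference, and it is easy to automate once the two lists have been copied out. Here is a small sketch in C, assuming the plug-in names are already available as arrays of strings; the function names `list_contains` and `added_plugins` are hypothetical and have nothing to do with Eclipse's own APIs.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if name is present in list[0..count), otherwise 0. */
static int list_contains(const char *const *list, size_t count,
                         const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(list[i], name) == 0) {
            return 1;
        }
    }
    return 0;
}

/*
 * Store in out every name that is in after but not in before, keeping the
 * order of after. out must have room for after_count entries.
 * Returns the number of names stored.
 */
size_t added_plugins(const char *const *before, size_t before_count,
                     const char *const *after, size_t after_count,
                     const char **out)
{
    size_t found = 0;
    for (size_t i = 0; i < after_count; i++) {
        if (!list_contains(before, before_count, after[i])) {
            out[found++] = after[i];
        }
    }
    return found;
}
```

Each name this returns is a plug-in the product definition was silently missing, and each one deserves the question of why it is needed at all.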

The folks I had learned from had learned about Eclipse from a couple of "How to get started with Eclipse" tutorials on the internet and similar sources. They had learned a lot about how to interact with the Eclipse IDE, how to generate Eclipse RCP products, and lots of other mid-level and higher pieces of technology, which was a great thing. The Eclipse RCP platform is really slick and handles a lot of nasty issues that crop up when building a plug-in based system. What had not been done was a close inspection of the lower level technology. No one had looked into the OSGi level and how that technology worked. Understanding this was critical to comprehending why exported products did not work. It was critical to learning how to debug specific low level problems. It was critical to troubleshooting why standard Java libraries failed to work correctly inside an RCP application. In the end, it turned out that OSGi bundles and classloader segmentation were causing numerous problems. Once the way those pieces operated was understood, numerous "magic" problems turned into simple technology issues to overcome. These pieces had to be well understood before well engineered software could come out of the system.

The lesson is that magic cannot be allowed to exist. Whether that magic is a bug that shows up in your own code or something you do not understand in the framework or libraries you are using, nothing that you do not understand can be left as magic. All of it must be turned into technology. Sometimes this will mean a lot of research into areas and levels of technology that are unnecessary to understand 99.9% of the time. However, in the 0.1% case it is absolutely critical that they be understood. Most people never need to worry about re-ordered writes in C/C++, and understanding how OSGi works is very infrequently needed while implementing an Eclipse RCP application. Still, understanding the magic is required - there is no substitute. Weird intermittent crashes, code that does not act as it should, and technology or tools that fail for some unknown reason must be tracked down. Learning more about those tools will shed light on the problem at hand, and on future problems.

When you do not understand the underpinnings of technology and how things work, you are doomed to repeat history: history like the long struggle to understand epicycles, or even Newtonian mechanics. The anomalies, the things that kept going wrong, were an opportunity and a clue that something we did not understand was going on. Digging deeper and keeping at it until there was a better explanation was critical to the translation of magic into science and technology. One of my harder learned lessons was that anything I realized I did not understand had a very good chance of becoming a problem or, more likely, already was a problem that I had yet to identify.


