Finding Software Bugs

March 13, 2010

In my initial post about the problems Toyota has been having recently, I expressed some skepticism about the company’s claim that it could rule out any problem related to the electronic throttle controls it uses in its vehicles.  These controls employ just a few of the many microprocessors in the vehicle, which collectively run a very large body of code; some estimates put the total at several million lines.  Finding intermittent bugs in a code base that size is notoriously difficult.

The Los Angeles Times is now running an Op-Ed article by David M. Cummings, which makes much the same point.  Mr. Cummings spent nine years as a consultant to NASA’s Jet Propulsion Laboratory, where he worked on the software for the Mars Pathfinder spacecraft, and he has more than three decades of experience developing software for other complex devices.

As anyone with experience in embedded systems will tell you, there are nasty software bugs that can be extremely difficult to reproduce in a laboratory test environment.

He goes on to describe his team’s experience in tracking down a subtle bug in the Pathfinder software.  Because the software was intended to run in a hostile environment (space), the team practiced highly defensive programming.  Part of their code included a sanity check along the lines of the following (in C):

i = 2 + 2;
if (i != 4) errorexit(i);
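
Fleshed out into a self-contained program, such a sanity check might look like the sketch below.  The errorexit handler here is my own stand-in; the article does not say how the flight software actually responded when a check failed.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error handler, standing in for whatever logging
   and recovery the real flight software performed. */
static void errorexit(int value)
{
    fprintf(stderr, "sanity check failed: 2 + 2 yielded %d\n", value);
    exit(EXIT_FAILURE);
}

int main(void)
{
    int i = 2 + 2;
    if (i != 4)        /* should never fire; if it does, something */
        errorexit(i);  /* outside this code has corrupted the state */
    return 0;
}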

Much to their surprise, the error was triggered just once in a large number of test runs.  After considerable digging, they found that it was caused by a bug in the operating system’s interrupt handling, activated by a race condition (that is, a situation in which the outcome depends on the exact sequence and timing of events).
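
Race conditions are easier to appreciate with a toy example.  The following sketch is my own illustration, not the Pathfinder code: two threads increment a shared counter with no synchronization, so each read-modify-write sequence can be interleaved with the other thread’s.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;            /* shared, unprotected state */

static void *worker(void *arg)
{
    (void)arg;                      /* unused */
    for (int i = 0; i < 1000000; i++)
        counter++;                  /* load, add, store: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expected 2000000 */
    return 0;
}

Built with something like gcc -std=c99 -pthread, this usually prints a total well short of 2,000,000, and the value varies from run to run; that run-to-run variability is precisely what makes such bugs so hard to reproduce in a laboratory.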

Mr. Cummings says, and I completely agree, that it is almost impossible to catch this kind of bug via standard software testing procedures.  Quite apart from the logical difficulty associated with trying to prove that something does not occur, the conditions that trigger the bug can be so specialized that they might occur only once in 10,000 test runs, or even less frequently.  Standard software testing, reasonably enough, is aimed at verifying that the software performs the functions that it is supposed to perform.  Finding these errors is really a different problem: trying to discover if there is any way to make the software fail.
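
A little arithmetic makes the point concrete.  If a bug manifests with probability 1/10,000 on each run, the chance of seeing it at least once in N independent runs is 1 - (1 - 1/10000)^N.  Even after 10,000 runs, that is only about 63%, and reaching a 99% chance of catching it would take roughly 46,000 runs; a bug that has never been observed is indistinguishable, in testing, from a bug that is not there.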

A number of years ago, I had a somewhat similar experience in the development of an interactive mainframe-based application for investment portfolio management.  The initial versions of the system worked as promised and were generally well received by the users; that was particularly gratifying, since most of the users had never used a computer before (this was around 1980).  But there was one pesky problem.  Every so often, the application would hang when a request was made to load a new or different portfolio.  Despite a great deal of testing, we could not determine the cause.  We had been very careful, in building the system, to protect against data loss, especially in view of our neophyte user base, and we had no problems on that score.  If the application was restarted, the user could simply pick up where he had left off; and since the average user might encounter the problem once or twice a month, everyone decided that we could live with it.

The problem really irritated me, though, and I kept trying to devise a way to reproduce it at will.  I never succeeded, but I did form the opinion that the cause was probably a bug in the operating system’s memory allocation routines, most likely involving a race condition.  About 18 months later, we got a call from our technical support rep: a bug had just been found in the allocation routines.  We got a copy of the revised code, and the problem disappeared, never to recur.

The point of all this, to reiterate, is that finding all  defects by testing is close to impossible.  As Mr. Cummings says,

…  even if the Toyota engineers do everything we did on Pathfinder and more, I’m still skeptical when I hear an engineer declare a complex software system to be bug-free based on laboratory testing. It is extremely difficult to make such a determination through laboratory tests.

And there is a larger issue, as Mr. Cummings also points out.  The use of software systems to augment or replace mechanical or electro-mechanical controls is increasingly common.  A common if unintended side effect of this change is to remove or obscure direct feedback to the user or operator of the system, a potentially dangerous effect made worse by the complex new interactions such systems can introduce.  Mr. Cummings’s conclusion is worth repeating here:

Whatever the final outcome of the Toyota saga, this should serve as a wake-up call to all industries that increasingly rely on software for safety. It is probably only a matter of time before a software error results in injury or death, if it has not happened already (there are some who say it has). We need to minimize that possibility by enforcing extremely stringent standards on the development and testing of software in all safety-critical systems, including, but not limited to, automobiles.

When someone begins a program of exercise, such as weight training, one of the most dangerous periods, in terms of potential injury, occurs a few weeks after he starts, because then he has gained enough strength to hurt himself without a commensurate development of skill.  There is a risk of something similar in turning every type of device into a software-based system: we can, if we are not careful, build systems that we don’t have the skill to make free of defects.

