## Rounding and Other Errors, Part 2

In my previous post on this subject, I talked about some of the potential pitfalls of computer calculations: inappropriate use of integer arithmetic, real rounding error, and representational limitations. Now I want to discuss, briefly, two remaining problem areas: loss of significance, and perverse formulations, which are particularly perilous in floating-point arithmetic.  There are two caveats I have to mention first: both of these areas are inter-related, as I think will become apparent; and to even attempt to cover them completely is far beyond the scope of this article.  Nonetheless, I’ll try to give at least an overview of the potential hazards.

##### 4.  Loss of Significance

Despite the fact that programming languages may make floating-point arithmetic look like ordinary arithmetic (and FORTRAN even called the floating data type REAL), floating-point calculations do not necessarily work according to the rules of arithmetic as we all learned them for hand calculation.  There are a few different formats used for floating-point numbers, but they all have the general form:

N EEE...EEE SSS...SSS

where each letter represents a bit. N is the sign bit, generally 0 for plus and 1 for minus.  The EEE… bits represent the exponent of the number (analogous to the power of 10 in scientific notation), and the SSS… bits represent the significant figures of the value (this is sometimes called the significand).  In a floating-point number in IEEE 754 double binary format, there is one sign bit, eleven exponent bits, and 52 significand bits, for a total length of 64 bits.

There are various details of exactly how the number is stored that we don’t need to worry about here.  The key things to note are that the number is stored with finite precision, and that every floating-point value, contrary to FORTRAN’s implication, is a rational number.  In other words, over any given interval of real numbers, the set of possible floating-point values is a “grainy” subset of rational numbers; no irrational number can possibly be represented precisely.
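To make the format above concrete, here is a short Python sketch (my own, not from the original post) that unpacks the bit pattern of a 64-bit double into the sign, exponent, and significand fields:

```python
import struct

def fp_fields(x):
    """Split a 64-bit IEEE 754 double into its sign, exponent, and significand bits."""
    # Reinterpret the double's bytes as a big-endian 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign        = bits >> 63                  # N: 1 bit
    exponent    = (bits >> 52) & 0x7FF        # EEE...EEE: 11 bits (biased by 1023)
    significand = bits & ((1 << 52) - 1)      # SSS...SSS: 52 bits
    return sign, exponent, significand

# 1.0 is stored exactly: biased exponent 1023 (i.e. 2^0), significand all zeros.
print(fp_fields(1.0))    # (0, 1023, 0)
# 0.1 cannot be stored exactly; its significand is a long, truncated binary fraction.
print(fp_fields(0.1))
```

The second line of output shows the “graininess” directly: the decimal value 0.1 lands between two representable points, and what is stored is merely the nearest one.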

The finite precision of machine arithmetic means that some problems will require more precision to calculate accurately than the hardware can deliver.  For example, if we try the subtraction:

1,000,000,000,000,000,000
- 999,999,999,999,999,999

the correct answer is obviously 1, but we are unlikely to get exactly that, even using double arithmetic.  We don’t worry about this kind of difficulty in manual computation, since it is clearly visible, but when compounded at gigaFLOP speeds, the resulting errors can be, um, embarrassing.
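Python’s float is an IEEE 754 double, so this is easy to try (a quick sketch of my own). The smaller operand rounds, on conversion, to the same double as the larger one, and the difference vanishes entirely:

```python
# Both operands are converted to 64-bit doubles before the subtraction happens.
a = float(1_000_000_000_000_000_000)   # exactly representable as a double
b = float(  999_999_999_999_999_999)   # rounds to the nearest double, which is a itself
print(a - b)    # 0.0, not the mathematically correct 1
print(a == b)   # True: the two values are indistinguishable after conversion
```

Near 10¹⁸ the representable doubles are 128 apart, so a unit difference in the last decimal digit simply cannot survive the conversion.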

On a more subtle, and more dangerous, note, the finite precision of machine arithmetic also means that there are equations that are true mathematically, but not necessarily computationally.  For example, both of the following equations are true:

(x + y) (x – y) = x² − y²

sin² θ + cos² θ = 1

but you would be well advised not to depend on exact equality for the correct operation of your program.  (There are software packages, such as Mathematica, which can handle these correctly, because they have embedded “knowledge” of the underlying mathematics.)  This leads us naturally to the next topic.
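The practical consequence is that exact equality tests on floating-point results are fragile; comparisons should allow a tolerance. A minimal Python illustration (mine, not from the post):

```python
import math

x = 0.1 + 0.2
print(x)                     # 0.30000000000000004, off in the last place
print(x == 0.3)              # False: exact equality fails
print(math.isclose(x, 0.3))  # True: compare with a relative tolerance instead
```

The same caution applies to the identities above: compute both sides and they may differ by a few units in the last place, so test `isclose`, not `==`.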

##### 5.  Perverse Formulations

One problem that has cropped up repeatedly in all kinds of software for calculations is what I am calling “perverse formulation”.  This usually occurs when a software developer gets the formula for doing some computation out of a textbook or handbook.  If the formula is one that was originally intended for manual calculation, the results can be embarrassing.  For example, early versions of the Lotus 1-2-3 spreadsheet had a built-in function for calculating the standard deviation of a set of numbers.  However, when one used a sample of just three numbers:

1,000,000    1,000,001    999,999

the function returned a negative number, which is mathematically impossible by definition of the standard deviation.  (The correct answer is 1.)

The definition of the sample standard deviation is

$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} $$

where $\bar{x}$ is the sample mean, and $n$ is the number of observations.  Note that computing the standard deviation using this “recipe” will require two passes through the data; the first pass is needed to compute the mean $\bar{x}$, and the second to compute the summation.  This is tedious for manual calculation, so there is an alternative formula that is mathematically equivalent and needs only one pass:

$$ s = \sqrt{\frac{\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2}{n-1}} $$

Although this calculation is mathematically equivalent, it is not at all computationally equivalent when performed in finite precision.  If the dispersion of the sample is small relative to the magnitude of the mean, the second formula will require the subtraction of two large, nearly equal values, which is a classic prescription for losing significance.  (This was in fact the source of the Lotus 1-2-3 problem I mentioned earlier.)  The formula for the roots of a quadratic equation that you learned in Algebra I has a similar problem.
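The difference between the two approaches is easy to demonstrate. The Python sketch below (my own illustration, not Lotus’s actual code) implements both; with a sample like the one above, scaled up so that the cancellation bites even in 64-bit doubles, the one-pass shortcut loses every significant figure:

```python
import math

def stdev_two_pass(xs):
    """Definitional (two-pass) sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n                                  # first pass
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # second pass

def stdev_one_pass(xs):
    """Hand-calculation shortcut: one pass, but it subtracts two huge near-equal sums."""
    n = len(xs)
    s = sq = 0.0
    for x in xs:
        s += x
        sq += x * x
    return math.sqrt((sq - s * s / n) / (n - 1))        # large minus large

sample = [1e8, 1e8 + 1, 1e8 - 1]   # true standard deviation is exactly 1
print(stdev_two_pass(sample))      # 1.0
print(stdev_one_pass(sample))      # 0.0: all significance lost in the subtraction
```

With values around 10⁸, the sums of squares are around 3 × 10¹⁶, where neighboring doubles are 4 apart, so the entire dispersion of the sample disappears into the rounding of `sq`.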

An in-depth discussion of this would take a book — in fact, there are several on my shelves.  A good starting point for further information, which contains many worked examples, is Numerical Recipes in C (2nd edition), by Press, Teukolsky, Vetterling, and Flannery, ISBN 0-521-43108-5.  (I believe there are also FORTRAN and Pascal “flavors” of the book.)  I hope at least that I have made you a bit wary of copying formulas without thinking.
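To make the quadratic-formula remark concrete, here is one more sketch of mine (the rearrangement is a standard numerical-analysis technique, not something from the original post). When $b^2 \gg 4ac$, the textbook formula computes one root as the difference of two nearly equal quantities, while the rearranged version avoids that subtraction entirely:

```python
import math

def roots_textbook(a, b, c):
    """Algebra I formula: fine on paper, cancels badly when b*b >> 4*a*c."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Rearranged so nearly equal values are never subtracted."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -0.5 * (b + math.copysign(d, b))   # b and copysign(d, b) always ADD
    return q / a, c / q                     # large root, then small root from c/q

a, b, c = 1.0, 1e8, 1.0          # exact small root is very nearly -1e-8
print(roots_textbook(a, b, c))   # small root comes out around -7.45e-9: ~25% off
print(roots_stable(a, b, c))     # small root -1e-8, correct to full precision
```

Both functions are “the quadratic formula”; only the order of operations differs, which is exactly the point of this section.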

##### A Closing Note

The large standard deviation formulas above were pasted in as graphics (you don’t want to know how).  Although WordPress is generally an excellent blogging platform, and supports embedded LaTeX math code, I could not manage to get it to recognize those formulas as formulas.  Still working on it — if I gain any insights, I’ll let you know.