14. Floating Point

comp.lang.c FAQ list · Question 14.1

Q: When I set a float variable to, say, 3.1, why is printf printing it as 3.0999999?


A: Most computers use base 2 for floating-point numbers as well as for integers, and just as for base 10, not all fractions are representable exactly in base 2. It's well-known that in base 10, a fraction like 1/3 = 0.333333... repeats infinitely. It turns out that in base 2, one tenth is also an infinitely-repeating fraction (0.0001100110011...), so exact decimal fractions such as 3.1 cannot be represented exactly in binary. Depending on how carefully your compiler's binary/decimal conversion routines (such as those used by printf) have been written, you may see discrepancies when numbers not exactly representable in base 2 are assigned or read in and then printed (i.e. converted from base 10 to base 2 and back again). [footnote] See also question 14.6.




comp.lang.c FAQ list · Question 14.2

Q: I'm trying to take some square roots, and I've simplified the code down to

	main()
	{
		printf("%f\n", sqrt(144.));
	}
but I'm still getting crazy numbers.


A: Make sure that you have #included <math.h>, and correctly declared other functions returning double. (Another library function to be careful with is atof, which is declared in <stdlib.h>.) See also questions 1.25, 14.3, and 14.4a.

References: CT&P Sec. 4.5 pp. 65-6




comp.lang.c FAQ list · Question 14.3

Q: I'm trying to do some simple trig, and I am #including <math.h>, but the linker keeps complaining that functions like sin and cos are undefined.


A: Make sure you're actually linking with the math library. For instance, due to a longstanding bug in Unix and Linux systems, you usually need to use an explicit -lm flag, at the end of the command line, when compiling/linking. See also questions 13.25, 13.26, and 14.2.




comp.lang.c FAQ list · Question 14.4a

Q: My floating-point calculations are acting strangely and giving me different answers on different machines.


A: First, see question 14.2.

If the problem isn't that simple, recall that digital computers usually use floating-point formats which provide a close but by no means exact simulation of real number arithmetic. Among other things, the associative and distributive laws do not hold completely; that is, order of operation may be important, and repeated addition is not necessarily equivalent to multiplication. Underflow, cumulative precision loss, and other anomalies are often troublesome.

Don't assume that floating-point results will be exact, and especially don't assume that floating-point values can be compared for equality. (Don't throw haphazard ``fuzz factors'' in, either; see question 14.5.) Beware that some machines have more precision available in floating-point computation registers than in double values stored in memory, which can lead to floating-point inequalities when it would seem that two values just have to be equal.

These problems are no worse for C than they are for any other computer language. Certain aspects of floating-point are usually defined as ``however the processor does them'' (see also questions 11.33 and 11.34), otherwise a compiler for a machine without the ``right'' model would have to do prohibitively expensive emulations.

This document cannot begin to list the pitfalls associated with, and workarounds appropriate for, floating-point work. A good numerical programming text should cover the basics; see also the references below. (Beware, though, that subtle problems can occupy numerical analysts for years.)

References: Kernighan and Plauger, The Elements of Programming Style Sec. 6 pp. 115-8
Knuth, Volume 2 chapter 4
David Goldberg, ``What Every Computer Scientist Should Know about Floating-Point Arithmetic''




comp.lang.c FAQ list · Question 14.4b

Q: I'm sure I've got the trig functions declared correctly, but they're still giving me wrong answers.


A: You weren't handing them angles in degrees, were you? C's trig functions (like FORTRAN's and most other languages) accept angles in radians. The conversion from degrees to radians is simple enough:

	sin(degrees * pi / 180)



comp.lang.c FAQ list · Question 14.5

Q: What's a good way to check for ``close enough'' floating-point equality?


A: Since the absolute accuracy of floating point values varies, by definition, with their magnitude, the best way of comparing two floating point values is to use an accuracy threshold which is relative to the magnitude of the numbers being compared. Rather than

	double a, b;
	...
	if(a == b)	/* WRONG */
use something like
	#include <math.h>

	if(fabs(a - b) <= epsilon * fabs(a))
where epsilon is a value chosen to set the degree of ``closeness'' (and where you know that a will not be zero). The precise value of epsilon may still have to be chosen with care: its appropriate value may be quite small and related only to the machine's floating-point precision, or it may be larger if the numbers being compared are inherently less accurate or are the results of a chain of calculations which compounds accuracy losses over several steps. (Also, you may have to make the threshold a function of b, or of both a and b.)

A decidedly inferior approach, not generally recommended, would be to use an absolute threshold:

	if(fabs(a - b) < 0.001)		/* POOR */
Absolute ``fuzz factors'' like 0.001 never seem to work for very long, however. As the numbers being compared change, it's likely that two small numbers that should be taken as different happen to be within 0.001 of each other, or that two large numbers, which should have been treated as equal, differ by more than 0.001 . (And, of course, the problems merely shift around, and do not go away, when the fuzz factor is tweaked to 0.005, or 0.0001, or any other absolute number.)

Doug Gwyn suggests using the following ``relative difference'' function. It returns the relative difference of two real numbers: 0.0 if they are exactly the same, otherwise the ratio of the difference to the larger of the two.

#define Abs(x)    ((x) < 0 ? -(x) : (x))
#define Max(a, b) ((a) > (b) ? (a) : (b))

double RelDif(double a, double b)
{
	double c = Abs(a);
	double d = Abs(b);

	d = Max(c, d);

	return d == 0.0 ? 0.0 : Abs(a - b) / d;
}
Typical usage is
	if(RelDif(a, b) <= TOLERANCE) ...

References: Knuth Sec. 4.2.2 pp. 217-8




comp.lang.c FAQ list · Question 14.6

Q: How do I round numbers?


A: The simplest and most straightforward way is with code like

(int)(x + 0.5)
C's floating to integer conversion truncates (discards) the fractional part, so adding 0.5 before truncating arranges that fractions >= 0.5 will be rounded up. (This technique won't work properly for negative numbers, though, for which you could use something like (int)(x < 0 ? x - 0.5 : x + 0.5), or play around with the floor and ceil functions.)

You can round to a certain precision by scaling:

	(int)(x / precision + 0.5) * precision
Handling negative numbers, or implementing even/odd rounding, is slightly trickier.

Note that because truncation is otherwise the default, it's usually a good idea to use an explicit rounding step when converting floating-point numbers to integers. Unless you're careful, it's quite possible for a number which you thought was 8.0 to be represented internally as 7.999999 and to be truncated to 7.

Additional links: further reading




comp.lang.c FAQ list · Question 14.7

Q: Why doesn't C have an exponentiation operator?


A: One reason is probably that few processors have a built-in exponentiation instruction. C has a pow function (declared in <math.h>) for performing exponentiation, although explicit multiplication is usually better for small positive integral exponents. [footnote] In other words, pow(x, 2.) is probably inferior to x * x. (If you're tempted to make a Square() macro, though, check question 10.1 first.)

References: ISO Sec. 7.5.5.1
H&S Sec. 17.6 p. 393




comp.lang.c FAQ list · Question 14.8

Q: The predefined constant M_PI seems to be missing from my machine's copy of <math.h>.


A: That constant (which is apparently supposed to be the value of pi, accurate to the machine's precision), is not standard; in fact a standard-conforming copy of <math.h> should not #define a symbol M_PI. [footnote] If you need pi, you'll have to define it yourself, or compute it with 4*atan(1.0) or acos(-1.0). (You could use a construction like

	#ifndef M_PI
	#define M_PI 3.1415926535897932385
	#endif
to provide your own #definition only if some system header file has not.)

References: PCS Sec. 13 p. 237




comp.lang.c FAQ list · Question 14.9

Q: How do I set variables to, or test for IEEE NaN (``Not a Number'') and other special values?


A: Many systems with high-quality IEEE floating-point implementations provide facilities (e.g. predefined constants, and functions like isnan(), either as nonstandard extensions in <math.h> or perhaps in <ieee.h> or <nan.h>) to deal with these values cleanly, and work is being done to formally standardize such facilities. A crude but usually effective test for NaN can be written based on the fact that IEEE NaN's never compare equal to anything, even themselves; therefore a number that doesn't compare equal to itself must be a NaN:

	#define isnan(x) ((x) != (x))
Beware, though, that non-IEEE-aware compilers may optimize the test away. (Note also that even if you do have a predefined constant like NAN, you cannot use it in comparisons like if(x == NAN), again because NaN's do not compare equal to themselves.)

C99 provides isnan(), fpclassify(), and several other classification routines.

Another possibility is to format the value in question using sprintf: on many systems it generates strings like "NaN" and "Inf" which you could compare for in a pinch.

To initialize variables with these values (and if your system does not provide cleaner solutions), you may be able to get away with some compile-time ``arithmetic'':

	double nan = 0./0.;
	double inf = 1./0.;
Don't be too surprised, though, if these don't work (or if they abort the compiler with a floating-point exception).

(The most reliable way of setting up these special values would use a hex representation of their internal bit patterns, but initializing a floating-point value with a bit pattern would require using a union or some other type punning mechanism and would obviously be machine-dependent.)

See also question 19.39.

References: C9X Sec. 7.7.3




comp.lang.c FAQ list · Question 14.10

Q: How can I handle floating-point exceptions gracefully?


A: See question 19.39.




comp.lang.c FAQ list · Question 14.11

Q: What's a good way to implement complex numbers in C?


A: It is straightforward to define a simple structure and some arithmetic functions to manipulate them. C99 supports complex as a standard type. [footnote] Here is a tiny example, to give you a feel for it:

typedef struct {
	double real;
	double imag;
	} complex;

#define Real(c) (c).real
#define Imag(c) (c).imag

complex cpx_make(double real, double imag)
{
	complex ret;
	ret.real = real;
	ret.imag = imag;
	return ret;
}

complex cpx_add(complex a, complex b)
{
	return cpx_make(Real(a) + Real(b), Imag(a) + Imag(b));
}
You can use these routines with code like
	complex a = cpx_make(1, 2);
	complex b = cpx_make(3, 4);
	complex c = cpx_add(a, b);
or, even more simply,
	complex c = cpx_add(cpx_make(1, 2), cpx_make(3, 4));

See also questions 2.7, 2.10, and 14.12.

References: C9X Sec. 6.1.2.5, Sec. 7.8




comp.lang.c FAQ list · Question 14.12

Q: I'm looking for some code to do:

Fast Fourier Transforms (FFT's)
matrix arithmetic (multiplication, inversion, etc.)
complex arithmetic


A: Ajay Shah has prepared a nice index of free numerical software which has been archived pretty widely; one URL is ftp://ftp.math.psu.edu/pub/FAQ/numcomp-free-c . See also questions 18.9b, 18.13, 18.15c, and 18.16.




comp.lang.c FAQ list · Question 14.13

Q: I'm having trouble with a Turbo C program which crashes and says something like ``floating point formats not linked.''


A: Some compilers for small machines, including Turbo C (and Ritchie's original PDP-11 compiler), leave out certain floating point support if it looks like it will not be needed. In particular, the non-floating-point versions of printf and scanf save space by not including code to handle %e, %f, and %g. It happens that Borland's heuristics for determining whether the program uses floating point are insufficient, and the programmer must sometimes insert a dummy call to a floating-point library function (such as sqrt; any will do) to force loading of floating-point support. (See the comp.os.msdos.programmer FAQ list for more information.)

A partially-related problem, resulting in a similar error message (perhaps ``floating point not loaded'') can apparently occur under some MS-DOS compilers when an incorrect variant of the floating-point library is linked. Check your compiler manual's description of the various floating-point libraries.

Additional links: another possibility





Read sequentially: prev next up



about this FAQ list   about eskimo   search   feedback   copyright

Hosted by Eskimo North