Binary Numbers & IEEE 754 — How Computers Represent Integer and Floating-Point Data

Every number a computer handles, whether an integer, a pixel color, a game coordinate, or a bank balance, is ultimately stored as a sequence of binary digits. Understanding base-2 representation, two's complement for negative integers, bitwise operations, and the IEEE 754 floating-point standard is foundational to programming and debugging, and it explains why 0.1 + 0.2 ≠ 0.3 in any language that uses binary floating point.

1. Binary and Hexadecimal Basics

Positional number systems:

  Base 10 (decimal): 1027 = 1×10³ + 0×10² + 2×10¹ + 7×10⁰
  Base 2 (binary):   1101 = 1×2³ + 1×2² + 0×2¹ + 1×2⁰ = 8+4+0+1 = 13
  Base 16 (hex):     1A2F = 1×16³ + 10×16² + 2×16¹ + 15×16⁰ = 6703

Decimal → Binary (divide-by-2 method):

  13 ÷ 2 = 6 r 1
   6 ÷ 2 = 3 r 0
   3 ÷ 2 = 1 r 1
   1 ÷ 2 = 0 r 1
  → read remainders bottom to top: 1101

Hex uses digits 0-9, A-F (10-15). Each hex digit represents exactly 4 bits:

  0x1A2F = 0001 1010 0010 1111

Unsigned integer ranges:

  Unsigned 8-bit:  0 to 255 (2⁸ - 1)
  Unsigned 16-bit: 0 to 65,535
  Unsigned 32-bit: 0 to 4,294,967,295
  Unsigned 64-bit: 0 to 18,446,744,073,709,551,615
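The divide-by-2 method can be sketched in a few lines of JavaScript. This is a minimal illustration, not a library routine: `toBinary` is a hypothetical helper name, and the built-in `Number.prototype.toString(2)` and `parseInt(str, 16)` do the same job in practice.

```javascript
// Convert a non-negative integer to a binary string by repeated division by 2.
// toBinary is a hypothetical helper name used only for this illustration.
function toBinary(n) {
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("expected a non-negative integer");
  }
  if (n === 0) return "0";
  let bits = "";
  while (n > 0) {
    bits = (n % 2) + bits;     // prepend, so the last remainder ends up leftmost
    n = Math.floor(n / 2);
  }
  return bits;
}

console.log(toBinary(13));                            // 1101
console.log((13).toString(2));                        // 1101 (built-in equivalent)
console.log(parseInt("1A2F", 16));                    // 6703
console.log((0x1A2F).toString(2).padStart(16, "0"));  // 0001101000101111
```

Prepending each remainder is what implements "read remainders bottom to top".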

2. Two's Complement

Two's complement is the dominant encoding for signed integers. For an n-bit number, the most significant bit has weight −2^(n-1) instead of +2^(n-1):

8-bit two's complement range: -128 to +127

Examples:

  0000 0001 = +1
  0111 1111 = +127
  1000 0000 = -128   (the only negative value with no positive counterpart!)
  1111 1111 = -1

Converting a positive integer n to -n: flip all bits, then add 1.

  Example: +5 = 0000 0101
  flip:         1111 1010
  +1:           1111 1011 = -5 ✓
  Check: 5 + (-5) = 0000 0101 + 1111 1011 = (1)0000 0000 ✓ (carry discarded)

Why two's complement:

• The addition/subtraction circuit is identical for signed and unsigned values
• Zero has only one representation (unlike sign-magnitude, which has +0 and -0)
• Testing for a negative value: just check the sign bit

Overflow in n-bit arithmetic:

  (+127) + 1 = 1000 0000 = -128   (wraps around in 8 bits)

Whether wrapping is guaranteed depends on the language. Java integer arithmetic always wraps modulo 2ⁿ. In C and C++, unsigned arithmetic is always modulo 2ⁿ, but signed overflow is undefined behavior. JavaScript numbers are doubles, so only the bitwise operators wrap (to 32 bits).
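The flip-and-add-one rule can be tried directly in JavaScript. The sketch below assumes an 8-bit width and relies on the fact that JavaScript bitwise operators work on 32-bit integers. The helper names `toInt8`, `negate8`, and `bits8` are hypothetical.

```javascript
// Interpret the low 8 bits of x as a two's complement value.
function toInt8(x) {
  return (x << 24) >> 24;   // move the byte to the top, then sign-extend back down
}

// Negate within 8 bits: flip all bits, add 1, keep only the low byte.
function negate8(x) {
  return toInt8((~x + 1) & 0xFF);
}

// Render the low 8 bits of x as a binary string.
function bits8(x) {
  return (x & 0xFF).toString(2).padStart(8, "0");
}

console.log(bits8(5));            // 00000101
console.log(bits8(negate8(5)));   // 11111011
console.log(toInt8(0b11111011));  // -5
console.log(toInt8(127 + 1));     // -128 (wraps around in 8 bits)
console.log(negate8(-128));       // -128 (no positive counterpart)
```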

3. Bitwise Operations

Basic operators:

  AND (&):  0b1100 & 0b1010 = 0b1000   (both bits must be 1)
  OR (|):   0b1100 | 0b1010 = 0b1110   (at least one bit is 1)
  XOR (^):  0b1100 ^ 0b1010 = 0b0110   (exactly one bit is 1)
  NOT (~):  ~0b1100 = 0b0011           (flip all bits; shown for a 4-bit width)
  Left shift (<<n):            x << n = x × 2ⁿ     (as long as no bits overflow)
  Right shift (>>n):           x >> n = ⌊x / 2ⁿ⌋   (arithmetic, preserves sign)
  Unsigned right shift (>>>n): zero-fill, always non-negative (Java and JavaScript)

Common bit manipulation tricks:

  ─────────────────────────────────────────────────────
  Test bit k:          (x >> k) & 1
  Set bit k:           x | (1 << k)
  Clear bit k:         x & ~(1 << k)
  Toggle bit k:        x ^ (1 << k)
  Check power of 2:    n && !(n & (n-1))
  Lowest set bit:      x & (-x)              [isolate LSB]
  Clear lowest set:    x & (x-1)             [used in popcount loops]
  Round to next pow2:  --n; n|=n>>1; n|=n>>2; n|=n>>4; n|=n>>8; n|=n>>16; ++n   [32-bit n]
  Swap without temp:   a^=b; b^=a; a^=b      [XOR swap trick]
  Absolute value:      mask = n>>31; (n^mask) - mask   [32-bit n]
  ─────────────────────────────────────────────────────

Applications: hash table sizing (next power of 2), GPU thread group sizes, compression, encryption, fast pixel manipulation.
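Several of these tricks translate directly into JavaScript, where bitwise operators act on 32-bit integers. The function names below are hypothetical, chosen for illustration. As written, `nextPowerOfTwo` is only meant for inputs from 1 up to 2³⁰.

```javascript
// Single-bit helpers; k is a bit index from 0 (least significant) to 31.
function testBit(x, k)   { return (x >> k) & 1; }
function setBit(x, k)    { return x | (1 << k); }
function clearBit(x, k)  { return x & ~(1 << k); }
function toggleBit(x, k) { return x ^ (1 << k); }

// A positive power of two has exactly one set bit, so clearing it leaves 0.
function isPowerOfTwo(n) { return n > 0 && (n & (n - 1)) === 0; }

// Isolate the lowest set bit.
function lowestSetBit(x) { return x & -x; }

// Count set bits by clearing the lowest set bit until none remain.
function popcount(x) {
  let count = 0;
  while (x !== 0) {
    x &= x - 1;
    count++;
  }
  return count;
}

// Round up to the next power of two (intended for 1 <= n <= 2^30).
function nextPowerOfTwo(n) {
  n--;
  n |= n >> 1; n |= n >> 2; n |= n >> 4; n |= n >> 8; n |= n >> 16;
  return n + 1;
}

console.log(testBit(0b1010, 1));    // 1
console.log(setBit(0b1000, 0));     // 9
console.log(clearBit(0b1010, 1));   // 8
console.log(lowestSetBit(12));      // 4
console.log(popcount(0b1011));      // 3
console.log(nextPowerOfTwo(1000));  // 1024
```

The loop in `popcount` runs once per set bit rather than once per bit position, which is why the "clear lowest set bit" trick is popular for sparse values.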

4. IEEE 754 Floating-Point

The IEEE 754 standard (1985, revised 2008 and 2019) defines how real numbers are approximated in binary. Two primary formats:

Single precision (32-bit, float):

  bit 31:      sign s (0 = positive, 1 = negative)
  bits 30-23:  exponent e (8 bits, biased by 127)
  bits 22-0:   mantissa m (23 bits, implicit leading 1)
  Value (normal): (-1)^s × 1.m × 2^(e-127)

Double precision (64-bit, double / JS Number):

  bit 63:      sign s
  bits 62-52:  exponent e (11 bits, biased by 1023)
  bits 51-0:   mantissa m (52 bits, implicit leading 1)
  Value (normal): (-1)^s × 1.m × 2^(e-1023)

Example: 0.1 in double:

  s = 0
  0.1 = 1.1001100110011... × 2^(-4)   [binary mantissa; 1/10 is non-terminating in base 2!]
  Biased exponent: -4 + 1023 = 1019 = 01111111011
  Stored as: 0 01111111011 1001100110011001100110011001100110011001100110011010
  Actual stored value: 0.1000000000000000055511151231257827021181583404541015625

This is why 0.1 + 0.2 = 0.30000000000000004 in most languages.

Double precision range and precision:

  Smallest positive normal: ~2.225 × 10⁻³⁰⁸
  Largest normal:           ~1.798 × 10³⁰⁸
  Precision:                ~15-17 significant decimal digits
  "Machine epsilon":        ε = 2⁻⁵² ≈ 2.22 × 10⁻¹⁶ (1 ULP at magnitude 1)
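The bit fields of a double can be inspected from JavaScript with a `DataView`. The sketch below is one possible way to do it. `decodeDouble` is a hypothetical helper name, while `DataView`, `setFloat64`, `getBigUint64`, and BigInt are standard (ES2020 or later).

```javascript
// Split a double into its sign, biased exponent, and 52-bit fraction fields.
// decodeDouble is a hypothetical helper name used only for this illustration.
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);              // written big-endian by default
  const bits = view.getBigUint64(0);  // read back the raw 64-bit pattern
  const exponent = Number((bits >> 52n) & 0x7FFn);
  const fraction = bits & 0xFFFFFFFFFFFFFn;   // low 52 bits
  return {
    sign: Number(bits >> 63n),
    exponent,                                  // biased by 1023
    unbiasedExponent: exponent - 1023,         // meaningful for normal numbers only
    fraction: fraction.toString(2).padStart(52, "0"),
    hex: bits.toString(16).padStart(16, "0"),
  };
}

const decoded = decodeDouble(0.1);
console.log(decoded.sign);              // 0
console.log(decoded.exponent);          // 1019
console.log(decoded.unbiasedExponent);  // -4
console.log(decoded.fraction);  // 1001100110011001100110011001100110011001100110011010
console.log(decoded.hex);               // 3fb999999999999a
console.log((0.1).toFixed(20));         // 0.10000000000000000555
console.log(0.1 + 0.2);                 // 0.30000000000000004
```

The trailing `...1010` in the fraction is the result of rounding the infinite `1001` pattern to 52 bits, which is where the small excess in the stored value comes from.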

5. Special Values: NaN, Inf, Denormals

Special bit patterns in IEEE 754:

  Exponent = all 1s (255 for float, 2047 for double):
    mantissa = 0:  ±Infinity            (e.g., 1.0/0.0 = +Inf)
    mantissa ≠ 0:  NaN (Not a Number)   (e.g., 0.0/0.0, sqrt(-1.0))

  Exponent = all 0s:
    mantissa = 0:  ±Zero   (both +0 and -0 exist; +0 == -0 but 1/+0 ≠ 1/-0)
    mantissa ≠ 0:  Denormal (subnormal), which extends the range near zero
                   Value: (-1)^s × 0.m × 2^(-1022) for double, 2^(-126) for float   [no implicit leading 1]
                   Trades off speed for gradual underflow

NaN properties (IEEE 754):

• NaN ≠ NaN (the only value that isn't equal to itself)
• Checking: Number.isNaN(x) in JS (the global isNaN(x) coerces its argument to a number first) / std::isnan(x) in C++
• Almost every arithmetic operation involving NaN produces NaN (NaN "propagates")
• NaN is not ordered: neither NaN < x nor NaN >= x is true

Infinity arithmetic:

  +Inf + finite = +Inf ✓
  +Inf - +Inf   = NaN    (indeterminate form ∞ - ∞)
  +Inf × 0      = NaN    (indeterminate form 0 × ∞)
  finite / +Inf = ±0 ✓

Denormals are important for numerical stability near zero. On some hardware, operations on denormals can be around 100× slower than on normal numbers; flush-to-zero mode avoids that cost by giving up gradual underflow.
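The special values can be told apart from JavaScript without touching the bits. The following is a rough sketch in the spirit of C's `fpclassify`. `classifyDouble` and `MIN_NORMAL` are hypothetical names introduced here, and the category strings are arbitrary.

```javascript
// Smallest positive normal double (2^-1022); smaller non-zero magnitudes are subnormal.
const MIN_NORMAL = 2.2250738585072014e-308;

// Classify a double into one of the IEEE 754 categories described above.
function classifyDouble(x) {
  if (Number.isNaN(x)) return "nan";
  if (!Number.isFinite(x)) return x > 0 ? "+infinity" : "-infinity";
  if (x === 0) return Object.is(x, -0) ? "-zero" : "+zero";
  return Math.abs(x) < MIN_NORMAL ? "subnormal" : "normal";
}

console.log(classifyDouble(0 / 0));             // nan
console.log(classifyDouble(1 / 0));             // +infinity
console.log(classifyDouble(-0));                // -zero
console.log(classifyDouble(Number.MIN_VALUE));  // subnormal
console.log(classifyDouble(0.1));               // normal

console.log(NaN === NaN);          // false
console.log(0 === -0);             // true
console.log(1 / 0 === 1 / -0);     // false (Infinity vs -Infinity)
console.log(Infinity - Infinity);  // NaN
console.log(Infinity * 0);         // NaN
```

`Object.is` is used for the zero check because `===` treats +0 and -0 as equal.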

6. Precision and Rounding

Not all integers are exactly representable in double precision!

  Doubles have 53 significant bits → every integer up to 2⁵³ = 9,007,199,254,740,992 is exact
  Problem: 9007199254740992 + 1 = 9007199254740992   (wrong: the +1 is lost, because 2⁵³ + 1 is not representable)

Catastrophic cancellation:

  x = 12345678901234568.5
  y = 12345678901234568.0
  x - y = ?  → should be 0.5, but above 2⁵³ adjacent doubles are 2 apart, so both round to the same double → 0.0 !

Kahan compensated summation: when summing many floats, track the "lost" low-order part in c:

  float sum = 0, c = 0;
  for each x:
      float y = x - c;       // apply the correction carried from the previous step
      float t = sum + y;     // low-order bits of y may be lost here
      c = (t - sum) - y;     // recover what was lost
      sum = t;

  Error is roughly O(ε), nearly independent of n, instead of O(n·ε) for naive summation.

Rounding modes (IEEE 754):

• Round to nearest, ties to even (default, "banker's rounding")
• Round toward +∞
• Round toward -∞
• Round toward zero (truncation)

Interval arithmetic uses the +∞ and -∞ rounding modes to obtain guaranteed bounds.
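Kahan summation is short enough to write out in JavaScript. The sketch below compares it against a plain loop on one million copies of 0.1. The function names `naiveSum` and `kahanSum` are hypothetical, and the input is only an illustrative test case.

```javascript
// Plain left-to-right summation; rounding error can accumulate with every add.
function naiveSum(values) {
  let sum = 0;
  for (const x of values) sum += x;
  return sum;
}

// Kahan compensated summation: c carries the low-order bits lost by each add.
function kahanSum(values) {
  let sum = 0;
  let c = 0;
  for (const x of values) {
    const y = x - c;     // apply the correction from the previous step
    const t = sum + y;   // low-order bits of y may be lost here
    c = (t - sum) - y;   // (t - sum) is what was actually added; subtract y to get the error
    sum = t;
  }
  return sum;
}

const samples = new Array(1000000).fill(0.1);
console.log(naiveSum(samples));  // visibly off from 100000 in the trailing digits
console.log(kahanSum(samples));  // much closer to 100000
```

The compensation depends on the operations being evaluated exactly in the order written; in languages whose compilers may reassociate floating-point arithmetic under aggressive optimization flags, the correction term can be optimized away.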

7. JavaScript Number Gotchas

JavaScript has a single number type: IEEE 754 double. This leads to some surprising behavior:

// All JavaScript numbers are 64-bit IEEE 754 doubles
0.1 + 0.2 === 0.3      // false  (0.30000000000000004)
Number.MAX_SAFE_INTEGER  // 9007199254740991 (2^53 - 1)
Number.EPSILON           // 2.220446049250313e-16
Number.MAX_VALUE         // 1.7976931348623157e+308
Number.MIN_VALUE         // 5e-324 (smallest positive, denormal)

// Safe integer check
Number.isSafeInteger(9007199254740991)  // true
Number.isSafeInteger(9007199254740992)  // false!

// Floating point comparison
function approxEqual(a, b, eps = Number.EPSILON * 4) {
  return Math.abs(a - b) <= eps * Math.max(Math.abs(a), Math.abs(b), 1);
}

// BigInt for arbitrary precision integers (ES2020+)
const big = 9007199254740993n;  // exact!
big + 1n === 9007199254740994n;  // true ✓

// Bitwise ops convert to int32 then back to double!
2147483648 | 0   // -2147483648  (2^31 wraps to negative int32)
0xFFFFFFFF | 0  // -1           (same)
0xFFFFFFFF >>> 0 // 4294967295   (unsigned -- use >>> for uint32)

Use BigInt for large exact integers and decimal.js / bignumber.js for exact decimal arithmetic in financial calculations. Avoid comparing computed floating-point values for exact equality; use a tolerance proportional to the magnitude of the operands instead.
