Binary Numbers & IEEE 754 — How Computers Represent Integer and Floating-Point Data
Every number a computer handles, whether an integer, a pixel color, a game coordinate, or a bank balance, is ultimately stored as a sequence of binary digits. Understanding base-2 representation, two's complement for negative integers, bitwise operations, and the IEEE 754 floating-point standard is foundational to programming and debugging, and it explains why 0.1 + 0.2 ≠ 0.3 in any language that uses binary floating point.
1. Binary and Hexadecimal Basics
Positional number systems:
Base 10 (decimal): 1027 = 1×10³ + 0×10² + 2×10¹ + 7×10⁰
Base 2 (binary): 1101 = 1×2³ + 1×2² + 0×2¹ + 1×2⁰ = 8+4+0+1 = 13
Base 16 (hex): 1A2F = 1×16³ + 10×16² + 2×16¹ + 15×16⁰ = 6703
Decimal → Binary (divide-by-2 method):
13 ÷ 2 = 6 r 1
6 ÷ 2 = 3 r 0
3 ÷ 2 = 1 r 1
1 ÷ 2 = 0 r 1
Read the remainders bottom to top: 1101
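The divide-by-2 procedure can be sketched in JavaScript as follows. The helper name `toBinary` is made up for illustration and assumes a non-negative safe integer; in practice the built-in `Number.prototype.toString(2)` does the same job.

```javascript
// Convert a non-negative integer to a binary string by repeated division by 2.
// `toBinary` is an illustrative helper, not a standard function.
function toBinary(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError("expected a non-negative safe integer");
  }
  if (n === 0) return "0";
  let digits = "";
  while (n > 0) {
    digits = String(n % 2) + digits; // each remainder is the next-higher bit
    n = Math.floor(n / 2);
  }
  return digits;
}

console.log(toBinary(13));         // "1101"
console.log((13).toString(2));     // "1101" (built-in equivalent)
console.log(parseInt("1A2F", 16)); // 6703
```

Prepending each remainder is what "read bottom to top" means: the first remainder produced is the least significant bit.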
Hex uses digits 0–9, A–F (10–15).
Each hex digit represents exactly 4 bits:
0x1A2F = 0001 1010 0010 1111
Unsigned 8-bit integers: 0 to 255 (2⁸ - 1)
Unsigned 16-bit: 0 to 65535
Unsigned 32-bit: 0 to 4,294,967,295
Unsigned 64-bit: 0 to 18,446,744,073,709,551,615
2. Two's Complement
The dominant encoding for signed integers. For an n-bit number, the most significant bit has weight −2^(n-1) instead of +2^(n-1):
8-bit two's complement range: -128 to +127
Examples:
0000 0001 = +1
0111 1111 = +127
1000 0000 = -128 (only negative value with no positive counterpart!)
1111 1111 = -1
Converting positive integer n to -n:
Method: flip all bits, then add 1.
Example: +5 = 0000 0101
flip: 1111 1010
+1: 1111 1011 = -5 ✓
Check: 5 + (-5) = 0000 0101 + 1111 1011 = (1)0000 0000 ✓ (carry discarded)
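As a rough sketch, the flip-and-add-1 rule can be tried in JavaScript by masking results to 8 bits. The helper names `negate8` and `toSigned8` are hypothetical; JavaScript's bitwise operators actually work on 32-bit integers, so the 8-bit width is simulated here.

```javascript
// Interpret the low 8 bits of `byte` as a two's complement value.
function toSigned8(byte) {
  return (byte << 24) >> 24; // move bit 7 into the int32 sign bit, then sign-extend
}

// Negate an 8-bit pattern: flip all bits, add 1, keep the low 8 bits.
function negate8(byte) {
  return (~byte + 1) & 0xff;
}

const five = 0b00000101;
const minusFive = negate8(five);
console.log(minusFive.toString(2));     // "11111011"
console.log(toSigned8(minusFive));      // -5
console.log((five + minusFive) & 0xff); // 0 (carry out of bit 7 discarded)
console.log(toSigned8(negate8(0x80)));  // -128: negating -128 gives -128 again
```

The last line shows the asymmetry of the range: -128 has no positive counterpart in 8 bits, so negating it returns the same bit pattern.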
Why two's complement:
• Addition/subtraction circuit is identical for signed and unsigned
• Zero has only one representation (unlike sign-magnitude: +0 and -0)
• Comparison with zero: just check the sign bit
Overflow in n-bit arithmetic:
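(+127) + 1 = 1000 0000 = -128 in 8-bit two's complement (the value wraps around)
Where wrapping is defined, fixed-width arithmetic is modulo 2ⁿ: Java int/long always wrap; C/C++ unsigned types always wrap, but signed overflow is undefined behavior; JavaScript wraps only in bitwise operators and typed arrays, because ordinary Numbers are doubles.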
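A small JavaScript sketch of where wraparound does and does not occur. It uses only standard features (`Int8Array`, `| 0`, `Math.imul`); the specific values are chosen for illustration.

```javascript
// Typed arrays store fixed-width integers and wrap modulo 2^n on assignment.
const i8 = new Int8Array(1);
i8[0] = 127;
i8[0] += 1;
console.log(i8[0]);                   // -128

// Plain Number arithmetic does not wrap: it is double-precision floating point.
console.log(2147483647 + 1);          // 2147483648

// Coercing to int32 with `| 0` wraps modulo 2^32.
console.log((2147483647 + 1) | 0);    // -2147483648

// Math.imul multiplies with int32 wraparound.
console.log(Math.imul(65536, 65536)); // 0
```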
3. Bitwise Operations
AND (&): 0b1100 & 0b1010 = 0b1000 (both bits must be 1)
OR (|): 0b1100 | 0b1010 = 0b1110 (at least one bit is 1)
XOR (^): 0b1100 ^ 0b1010 = 0b0110 (exactly one bit is 1)
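NOT (~): ~0b1100 = 0b0011 (flip all bits; shown for a 4-bit value)
Left shift (<<n): x << n = x × 2ⁿ (as long as no bits overflow)
Right shift (>>n): x >> n = ⌊x / 2ⁿ⌋ (arithmetic, preserves sign)
Unsigned right shift (>>>n): x >>> n (Java/JS; zero-fill, always non-negative)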
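The operators can be checked directly in JavaScript, keeping in mind that they act on 32-bit integers, so NOT flips all 32 bits rather than only the 4 shown in the table. This is an illustrative sketch with arbitrarily chosen operands.

```javascript
const a = 0b1100, b = 0b1010;
console.log((a & b).toString(2)); // "1000"
console.log((a | b).toString(2)); // "1110"
console.log((a ^ b).toString(2)); // "110"
console.log(~a);                  // -13: all 32 bits flipped
console.log(~a & 0b1111);         // 3 (0b0011) once masked to 4 bits

console.log(5 << 3);              // 40 (5 × 2^3)
console.log(-8 >> 1);             // -4 (sign bit copied in)
console.log(-7 >> 1);             // -4 (rounds toward -infinity, unlike Math.trunc(-7 / 2))
console.log(-8 >>> 1);            // 2147483644 (zero-fill)
```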
Common bit manipulation tricks:
─────────────────────────────────────────────────────
Test bit k: (x >> k) & 1
Set bit k: x | (1 << k)
Clear bit k: x & ~(1 << k)
Toggle bit k: x ^ (1 << k)
Check power of 2: n && !(n & (n-1))
Lowest set bit: x & (-x) [isolate LSB]
Clear lowest set: x & (x-1) [used in popcount loops]
Round to next pow2: --n; n|=n>>1; n|=n>>2; n|=n>>4; n|=n>>8; n|=n>>16; ++n
Swap without temp: a^=b; b^=a; a^=b [XOR swap trick]
Absolute value: mask = n>>31; (n^mask) - mask
─────────────────────────────────────────────────────
Applications: hash table sizing (next power of 2), GPU thread group sizes,
compression, encryption, fast pixel manipulation.
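Several of the tricks above can be written as small JavaScript helpers. This is a sketch: the function names are illustrative, and all of them assume 32-bit integer inputs because JavaScript's bitwise operators convert their operands to int32.

```javascript
const testBit   = (x, k) => (x >> k) & 1;
const setBit    = (x, k) => x | (1 << k);
const clearBit  = (x, k) => x & ~(1 << k);
const toggleBit = (x, k) => x ^ (1 << k);

const isPowerOfTwo = (n) => n > 0 && (n & (n - 1)) === 0;
const lowestSetBit = (x) => x & -x;

// Count set bits by clearing the lowest one until none remain.
function popcount(x) {
  let count = 0;
  while (x !== 0) {
    x &= x - 1;
    count++;
  }
  return count;
}

// Round up to the next power of two (intended for 1 <= n <= 2^30).
function nextPow2(n) {
  n--;
  n |= n >> 1; n |= n >> 2; n |= n >> 4; n |= n >> 8; n |= n >> 16;
  return n + 1;
}

// Branch-free absolute value for int32 inputs.
function abs32(n) {
  const mask = n >> 31; // 0 for non-negative n, -1 (all ones) for negative n
  return (n ^ mask) - mask;
}

console.log(isPowerOfTwo(64), isPowerOfTwo(96)); // true false
console.log(lowestSetBit(12));                   // 4
console.log(popcount(0b1011));                   // 3
console.log(nextPow2(17), nextPow2(16));         // 32 16
console.log(abs32(-5));                          // 5
```

The `popcount` loop runs once per set bit rather than once per bit position, which is why clearing the lowest set bit is a common building block.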
4. IEEE 754 Floating-Point
The IEEE 754 standard (1985, revised 2008 and 2019) defines how real numbers are approximated in binary floating point. Two primary formats:
Single precision (32-bit, float):
bit 31: sign s (0=positive, 1=negative)
bits 30–23: exponent e (8 bits, biased by 127)
bits 22–0: mantissa m (23 bits, implicit leading 1)
Value (normal): (-1)^s × 1.m × 2^(e-127)
Double precision (64-bit, double / JS Number):
bit 63: sign s
bits 62–52: exponent e (11 bits, biased by 1023)
bits 51–0: mantissa m (52 bits, implicit leading 1)
Value (normal): (-1)^s × 1.m × 2^(e-1023)
Example: 0.1 in double:
s = 0
1.1001100110011... × 2^(-4) [1/10 is non-terminating in base 2!]
Stored as: 0 01111111011 1001100110011001100110011001100110011001100110011010
Actual stored value: 0.1000000000000000055511151231257827021181583404541015625
This is why: 0.1 + 0.2 = 0.30000000000000004 in most languages.
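To inspect these fields directly, one possible approach in JavaScript is to write the double into a `DataView` and read the same 8 bytes back as a 64-bit unsigned BigInt. The helper name `decodeDouble` is hypothetical; `DataView.prototype.getBigUint64` requires an engine with BigInt support.

```javascript
// Split a double into its IEEE 754 fields. `decodeDouble` is an illustrative helper.
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);             // DataView is big-endian by default
  const bits = view.getBigUint64(0); // the same 64 bits as an unsigned integer
  return {
    sign: Number(bits >> 63n),
    exponent: Number((bits >> 52n) & 0x7ffn), // biased exponent
    mantissa: bits & ((1n << 52n) - 1n),      // 52 stored fraction bits
  };
}

const f = decodeDouble(0.1);
console.log(f.sign);                        // 0
console.log(f.exponent, f.exponent - 1023); // 1019 -4
console.log(f.mantissa.toString(2).padStart(52, "0"));
// 1001100110011001100110011001100110011001100110011010
console.log((0.1).toFixed(20));             // "0.10000000000000000555"
console.log(0.1 + 0.2);                     // 0.30000000000000004
```

The final `1010` in the mantissa is where the repeating `1001` pattern was rounded up, which is why the stored value is slightly larger than 0.1.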
Double precision range and precision:
Smallest positive normal: ~2.225 × 10⁻³⁰⁸
Largest normal: ~1.798 × 10³⁰⁸
Precision: ~15–17 significant decimal digits
"Machine epsilon": ε = 2⁻⁵² ≈ 2.22 × 10⁻¹⁶ (1 ULP at magnitude 1)
5. Special Values: NaN, Inf, Denormals
Special bit patterns in IEEE 754:
Exponent = all 1s (255 for float, 2047 for double):
mantissa = 0: ±Infinity (e.g., 1.0/0.0 = +Inf)
mantissa ≠ 0: NaN (Not a Number) (e.g., 0.0/0.0, sqrt(-1.0))
Exponent = all 0s:
mantissa = 0: ±Zero (both +0 and -0 exist; +0 == -0 but 1/+0 ≠ 1/-0)
mantissa ≠ 0: Denormal (subnormal) — extends range near zero
Value (double): (-1)^s × 0.m × 2^(-1022) [no implicit leading 1; for float the factor is 2^(-126)]
Trades off speed for gradual underflow
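The bit-pattern rules above can be turned into a small classifier. The following is a sketch for doubles only; `classify` is a made-up helper name, and it reads the fields through a `DataView` and BigInt.

```javascript
// Classify a double by its exponent and mantissa fields.
// `classify` is an illustrative helper, not a standard function.
function classify(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const exponent = Number((bits >> 52n) & 0x7ffn);
  const mantissa = bits & ((1n << 52n) - 1n);
  if (exponent === 0x7ff) return mantissa === 0n ? "infinity" : "nan";
  if (exponent === 0) return mantissa === 0n ? "zero" : "subnormal";
  return "normal";
}

console.log(classify(1));                       // "normal"
console.log(classify(-0));                      // "zero"
console.log(classify(-Infinity));               // "infinity"
console.log(classify(NaN));                     // "nan"
console.log(classify(Number.MIN_VALUE));        // "subnormal"
console.log(classify(2.2250738585072014e-308)); // "normal" (smallest positive normal)
```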
NaN properties (IEEE 754):
• NaN ≠ NaN (the only value that isn't equal to itself)
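• Checking: Number.isNaN(x) in JS (the global isNaN coerces its argument to a number first) / std::isnan(x) in C++
• Almost every arithmetic operation involving NaN produces NaN (NaN "propagates"); a few exceptions exist, such as pow(NaN, 0) = 1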
• NaN is not ordered: neither NaN < x nor NaN >= x is true
Infinity arithmetic:
+Inf + finite = +Inf ✓
+Inf - +Inf = NaN (indeterminate form ∞ - ∞)
+Inf × 0 = NaN (indeterminate form 0 × ∞)
finite / +Inf = ±0 ✓
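These NaN and infinity rules can be observed directly in JavaScript, whose Number type follows IEEE 754 double semantics. The lines below are a quick illustrative check rather than an exhaustive list.

```javascript
console.log(NaN === NaN);                       // false
console.log(Number.isNaN(0 / 0));               // true
console.log(isNaN("abc"), Number.isNaN("abc")); // true false (global isNaN coerces first)
console.log(NaN < 1, NaN >= 1);                 // false false
console.log(Infinity - Infinity);               // NaN
console.log(Infinity * 0);                      // NaN
console.log(1 / Infinity, -1 / Infinity);       // 0 -0
console.log(0 === -0, Object.is(0, -0));        // true false
console.log(1 / 0, 1 / -0);                     // Infinity -Infinity
console.log(Math.sqrt(-1));                     // NaN
```

`Object.is` distinguishes +0 from -0 and treats NaN as equal to itself, which makes it handy for tests involving these special values.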
Denormals: important for numerical stability near zero.
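On some hardware, operations on denormals can be roughly 100× slower than on normal numbers; flush-to-zero mode avoids that cost by treating denormals as zero, giving up gradual underflow.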
6. Precision and Rounding
Not all integers are representable exactly in double precision!
Doubles have 53 significant bits → exact ints up to 2⁵³ = 9,007,199,254,740,992
Problem: 9007199254740992 + 1 = 9007199254740992 (wrong: the +1 is lost, because 9007199254740993 is not representable)
Catastrophic cancellation:
x = 12345678901234566.5
y = 12345678901234566.0
x - y = ? → should be 0.5, but above 2⁵³ doubles are spaced 2 apart, so both round to the same double → 0.0 !
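The effect is easy to reproduce in JavaScript. This sketch uses values above 2^53, where adjacent doubles are 2 apart, so a fractional part of 0.5 cannot survive the initial rounding; the specific numbers are chosen only for illustration.

```javascript
const x = 12345678901234566.5; // not representable: rounds to 12345678901234566
const y = 12345678901234566.0; // representable exactly (an even integer below 2^54)
console.log(x === y);          // true
console.log(x - y);            // 0, although the decimal inputs differ by 0.5

// At 2^53 even adding 1 is lost.
console.log(2 ** 53 + 1 === 2 ** 53);                     // true
console.log(Number.isSafeInteger(2 ** 53));               // false
console.log(BigInt(2 ** 53) + 1n === 9007199254740993n);  // true: BigInt stays exact
```

The subtraction itself is exact here; the information was already lost when the inputs were rounded to doubles, and the subtraction merely exposes it.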
Kahan compensated summation:
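For summing many floats, track the "lost" low-order part in a compensation term c:
#include <stddef.h>
float kahan_sum(const float *x, size_t n) {
    float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - c;   /* apply the correction from the previous step */
        float t = sum + y;    /* low-order bits of y may be rounded away here */
        c = (t - sum) - y;    /* recover what was rounded away (negated) */
        sum = t;
    }
    return sum;
}
Worst-case error is O(ε) (plus an O(n·ε²) term) instead of O(n·ε) for naive summation.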
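For comparison, here is a minimal JavaScript sketch of the same idea. It works on doubles rather than 32-bit floats, since that is JavaScript's only Number type, and the function names and the test data (one million copies of 0.1) are assumptions chosen for illustration.

```javascript
function naiveSum(values) {
  let sum = 0;
  for (const x of values) sum += x;
  return sum;
}

function kahanSum(values) {
  let sum = 0;
  let c = 0; // running compensation for lost low-order bits
  for (const x of values) {
    const y = x - c;   // fold the previous correction into the next term
    const t = sum + y; // low-order bits of y may be rounded away here
    c = (t - sum) - y; // what was rounded away, negated
    sum = t;
  }
  return sum;
}

const values = new Array(1e6).fill(0.1);
const naive = naiveSum(values);
const kahan = kahanSum(values);
console.log(naive, Math.abs(naive - 100000)); // accumulated rounding error
console.log(kahan, Math.abs(kahan - 100000)); // expected to be much closer to 100000
```

The compensated version costs three extra floating-point operations per element, which is usually a small price when the input is long or the result feeds further computation.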
Rounding modes (IEEE 754):
• Round to nearest, ties to even (default, "banker's rounding")
• Round toward +∞
• Round toward -∞
• Round toward zero (truncation)
Interval arithmetic uses +∞ and -∞ rounding for guaranteed bounds.
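JavaScript does not expose the directed rounding modes, but the default mode (round to nearest, ties to even) can be observed with exact halfway cases. The following is an illustrative sketch using standard built-ins only.

```javascript
// 1 + EPSILON is the next double after 1, so 1 + EPSILON/2 is an exact tie.
const eps = Number.EPSILON;                  // 2^-52
console.log(1 + eps / 2 === 1);              // true: tie goes to the even mantissa (1)
console.log(1 + 1.5 * eps === 1 + 2 * eps);  // true: tie goes to the even mantissa (1 + 2ε)

// Above 2^53 doubles are 2 apart, so odd integers are exact ties as well.
const base = 2 ** 53;                        // 9007199254740992
console.log(base + 1 === base);              // true (rounds down)
console.log(base + 3 === base + 4);          // true (rounds up)

// Math.fround rounds a double to the nearest float32.
console.log(Math.fround(0.1));               // 0.10000000149011612
```

Ties go up in some cases and down in others, which is the point of ties-to-even: rounding errors do not accumulate a systematic bias in one direction.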
7. JavaScript Number Gotchas
JavaScript has a single number type: IEEE 754 double. This leads to some surprising behavior:
// All JavaScript numbers are 64-bit IEEE 754 doubles
0.1 + 0.2 === 0.3          // false (0.30000000000000004)
Number.MAX_SAFE_INTEGER    // 9007199254740991 (2^53 - 1)
Number.EPSILON             // 2.220446049250313e-16
Number.MAX_VALUE           // 1.7976931348623157e+308
Number.MIN_VALUE           // 5e-324 (smallest positive, denormal)

// Safe integer check
Number.isSafeInteger(9007199254740991)  // true
Number.isSafeInteger(9007199254740992)  // false!

// Floating point comparison
function approxEqual(a, b, eps = Number.EPSILON * 4) {
  return Math.abs(a - b) <= eps * Math.max(Math.abs(a), Math.abs(b), 1);
}

// BigInt for arbitrary precision integers (ES2020+)
const big = 9007199254740993n;   // exact!
big + 1n === 9007199254740994n;  // true ✓

// Bitwise ops convert to int32 then back to double!
2147483648 | 0     // -2147483648 (2^31 wraps to negative int32)
0xFFFFFFFF | 0     // -1 (same)
0xFFFFFFFF >>> 0   // 4294967295 (unsigned -- use >>> for uint32)
Use BigInt for large exact integers and a decimal library such as decimal.js or bignumber.js for exact decimal arithmetic in financial calculations. Avoid comparing computed floating-point values for exact equality; use a tolerance proportional to their magnitude instead.
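A short usage sketch of both recommendations. The tolerance of 4 × `Number.EPSILON` is a starting point, not a universal constant, and storing money as integer cents in a BigInt is one possible assumption-laden alternative to a decimal library; the price and quantity are invented for the example.

```javascript
// Relative-tolerance comparison, scaled by the larger magnitude (at least 1).
function approxEqual(a, b, eps = Number.EPSILON * 4) {
  return Math.abs(a - b) <= eps * Math.max(Math.abs(a), Math.abs(b), 1);
}

console.log(0.1 + 0.2 === 0.3);           // false
console.log(approxEqual(0.1 + 0.2, 0.3)); // true
console.log(approxEqual(1, 1.001));       // false: a real difference, not rounding noise

// Money as integer cents: every intermediate value is exact.
const priceCents = 1999n;                 // hypothetical price of 19.99
const totalCents = priceCents * 3n;
const formatted = `${totalCents / 100n}.${String(totalCents % 100n).padStart(2, "0")}`;
console.log(formatted);                   // "59.97"
```

Longer computations accumulate more rounding error than a few ULPs, so the tolerance generally needs to be chosen with the algorithm in mind.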