processor | year | frequency | transistors (billions) |
---|---|---|---|
Pentium 4 | 2000 | 3.8 GHz | 0.040 |
Intel Haswell | 2013 | 4.4 GHz | 1.4 |
Apple M1 | 2020 | 3.2 GHz | 16 |
Apple M2 | 2022 | 3.49 GHz | 20 |
Apple M3 | 2024 | 4.05 GHz | 25 |
Apple M4 | 2024 | 4.5 GHz | 28 |
AMD Zen 5 | 2024 | 5.7 GHz | 50 |

processor | year | arithmetic logic units | SIMD units |
---|---|---|---|
Pentium 4 | 2000 | 2 | |
AMD Zen 2 | 2019 | 4 | |
Apple M* | 2019 | 6+ | |
Intel Lion Cove | 2024 | 6 | |
AMD Zen 5 | 2024 | 6 | |
Moving to up to 4 loads/stores per cycle
Parsing the string 1.3321321e-12 to a double:
double result;
fast_float::from_chars(
input.data(), input.data() + input.size(), result);
Reference: Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021
Modern processors execute nearly as many instructions per cycle as you can supply.
In computational workloads (batches), minimizing instruction count is critical for achieving optimal performance.
We massively reduced the number of CPU instructions required.
function | instructions |
---|---|
strtod | |
our parser | |
Reference: Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021
In ASCII/UTF-8, the digits 0, 1, ..., 9 have values
0x30, 0x31, ..., 0x39.
To recognize a digit:
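One common way to do this, sketched here as an illustration rather than as the exact code from the referenced paper, is to subtract `'0'` (0x30) and compare once in unsigned arithmetic. The function name `is_digit` is an assumption of this sketch.

```cpp
#include <cstdint>

// Returns true when c is an ASCII digit ('0'..'9', i.e., 0x30..0x39).
// After subtracting '0', digits map to 0..9. Every other byte wraps around
// to a value above 9 once converted to an unsigned 8-bit integer, so a
// single comparison replaces the two-sided range check.
inline bool is_digit(char c) {
  return static_cast<std::uint8_t>(c - '0') <= 9;
}
```

The single comparison saves one instruction and one branch compared with `c >= '0' && c <= '9'`, although many compilers apply this transformation on their own.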
6 to 7 instructions per multiplication
for (size_t i = 0; i < length; i++)
sum += x[i] * y[i];
3 to 5 instructions per multiplication
// i + 3 < length avoids unsigned underflow when length < 3
for (; i + 3 < length; i += 4)
    sum += x[i] * y[i]
         + x[i + 1] * y[i + 1]
         + x[i + 2] * y[i + 2]
         + x[i + 3] * y[i + 3];
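The unrolled loop only covers groups of four elements, so a complete function needs a second loop for the leftovers. The following is a minimal sketch: the function name `dot_unrolled` and the `double` element type are assumptions, since the slides do not state them.

```cpp
#include <cstddef>

// Dot product unrolled by four, followed by a scalar loop for the tail.
// Note: with floating-point inputs, the grouping of additions differs from
// the simple loop, so results may differ in the last bits.
double dot_unrolled(const double* x, const double* y, std::size_t length) {
  double sum = 0;
  std::size_t i = 0;
  // Main loop: four multiplications per iteration, so the loop overhead
  // (increment, compare, branch) is paid once per four elements.
  for (; i + 3 < length; i += 4) {
    sum += x[i] * y[i]
         + x[i + 1] * y[i + 1]
         + x[i + 2] * y[i + 2]
         + x[i + 3] * y[i + 3];
  }
  // Tail loop: at most three remaining elements.
  for (; i < length; i++) {
    sum += x[i] * y[i];
  }
  return sum;
}
```

Writing the loop condition as `i + 3 < length` keeps it safe for short inputs, where `length - 3` would wrap around for an unsigned `length`.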
PROCEDURE shuffle(array)
    FOR j FROM $$|array| - 1$$ DOWN TO 1
        k ← random_integer(0, j)
        SWAP array[j] WITH array[k]
    END FOR
END PROCEDURE
Reference: Batched Ranged Random Integer Generation, Software: Practice and Experience 55 (1), 2025
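For reference, the shuffle pseudocode translates to C++ roughly as follows. This sketch draws one random index per iteration with `std::uniform_int_distribution`; it is the conventional version, not the batched technique from the referenced paper. The function name `shuffle_in_place` is an assumption.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Conventional shuffle: walk from the last position down to position 1 and
// swap each element with a uniformly chosen element at or before it.
template <typename T, typename Rng>
void shuffle_in_place(std::vector<T>& array, Rng& rng) {
  if (array.size() < 2) return;  // nothing to shuffle
  for (std::size_t j = array.size() - 1; j >= 1; j--) {
    // One ranged random integer in [0, j] per iteration.
    std::uniform_int_distribution<std::size_t> pick(0, j);
    std::size_t k = pick(rng);
    std::swap(array[j], array[k]);
  }
}
```

A call such as `std::mt19937 rng(42); shuffle_in_place(values, rng);` permutes `values` in place. Each iteration costs one ranged random integer, which is the part of the loop that a batched generator would aim to speed up.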
Hard-to-predict branches can derail performance
In UTF-16, a high surrogate [U+D800 to U+DBFF] must be followed by a low surrogate [U+DC00 to U+DFFF].
PROCEDURE validate_utf16(code_units)
    i ← 0
    WHILE i < $$|code_units|$$
        unit ← code_units[i]
        IF unit ≤ 0xD7FF OR unit ≥ 0xE000 THEN
            INCREMENT i
            CONTINUE
        END IF
        IF unit ≥ 0xD800 AND unit ≤ 0xDBFF THEN
            IF i + 1 ≥ $$|code_units|$$ THEN
                RETURN false
            END IF
            next_unit ← code_units[i + 1]
            IF next_unit < 0xDC00 OR next_unit > 0xDFFF THEN
                RETURN false
            END IF
            i ← i + 2 // Valid surrogate pair
            CONTINUE
        END IF
        RETURN false // Lone low surrogate
    END WHILE
    RETURN true
END PROCEDURE
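A direct C++ rendering of the validation pseudocode might look like this. It is a sketch that takes a pointer and a length to stay portable across C++ versions; that signature is an assumption, not something stated in the slides.

```cpp
#include <cstddef>
#include <cstdint>

// Branch-based UTF-16 validation: every surrogate must be part of a
// well-formed (high, low) pair.
bool validate_utf16(const std::uint16_t* code_units, std::size_t length) {
  std::size_t i = 0;
  while (i < length) {
    std::uint16_t unit = code_units[i];
    if (unit <= 0xD7FF || unit >= 0xE000) {
      i++;  // not a surrogate: a complete character on its own
      continue;
    }
    if (unit <= 0xDBFF) {  // high surrogate, 0xD800..0xDBFF
      if (i + 1 >= length) return false;  // input ends mid-pair
      std::uint16_t next_unit = code_units[i + 1];
      if (next_unit < 0xDC00 || next_unit > 0xDFFF) return false;
      i += 2;  // valid surrogate pair
      continue;
    }
    return false;  // low surrogate without a preceding high surrogate
  }
  return true;
}
```

Whether a given code unit is a surrogate depends on the data, so on text that mixes surrogate pairs with ordinary characters these branches can be hard for the processor to predict.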
1 character per cycle might be just 4 GB/s (slower than disk)
We are now barely at 1 GB/s!
static const uint8_t transition_table[3][256] = {
    {...},
    {...},
    {...}
};
bool is_valid_utf16_ff(std::span<const uint16_t> code_units) {
    uint8_t state = 0; // Start in Initial state
    for (auto code_unit : code_units) {
        uint8_t high_byte = code_unit >> 8; // The high byte is enough to identify surrogates
        state = transition_table[state][high_byte];
    }
    return state == 0; // Valid only if we end in Initial state
}
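The slides elide the contents of the transition table. One plausible way to fill it in is sketched below, assuming three states (initial, expecting a low surrogate, error). The state numbering, the `TransitionTable` helper, and the function name `is_valid_utf16_table` are assumptions of this sketch, not the author's code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical state numbering for this sketch.
enum : std::uint8_t { kInitial = 0, kNeedLow = 1, kError = 2 };

// Builds the 3 x 256 table indexed by [state][high byte of the code unit].
struct TransitionTable {
  std::uint8_t next[3][256];
  TransitionTable() {
    for (int b = 0; b < 256; b++) {
      bool high = (b >= 0xD8 && b <= 0xDB);  // high surrogate 0xD800..0xDBFF
      bool low = (b >= 0xDC && b <= 0xDF);   // low surrogate 0xDC00..0xDFFF
      next[kInitial][b] = high ? kNeedLow : (low ? kError : kInitial);
      next[kNeedLow][b] = low ? kInitial : kError;
      next[kError][b] = kError;  // once in error, stay in error
    }
  }
};

// Table-driven validation: one table lookup per code unit and no
// data-dependent branch inside the loop.
bool is_valid_utf16_table(const std::uint16_t* code_units, std::size_t length) {
  static const TransitionTable table;
  std::uint8_t state = kInitial;
  for (std::size_t i = 0; i < length; i++) {
    state = table.next[state][code_units[i] >> 8];
  }
  return state == kInitial;  // valid only if no pair is left open
}
```

The error state absorbs all further input, so the loop never needs an early exit. The price is that each lookup depends on the previous state, which forms a serial dependency chain across iterations.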
How does the processor manage to validate one UTF-16 character per cycle
when it takes many cycles just to load the character?
cycle | action | action | pizza en route |
---|---|---|---|
1 | order pizza A | | |
2 | order pizza B | | A |
3 | order pizza C | | A |
4 | order pizza D | eat pizza A | B |
5 | order pizza E | eat pizza B | C |
6 | order pizza F | eat pizza C | D |
8 hash functions (Intel Ice Lake processor, out-of-cache filter)
Less than half the cache misses
For each character c
    If c - 'A' <= 'Z' - 'A' (unsigned comparison) then
        c = c + 'a' - 'A'
    EndIf
EndFor
__m512i ca = _mm512_sub_epi8(c, _mm512_set1_epi8('A'));
__mmask64 is_upper = _mm512_cmple_epu8_mask(ca, _mm512_set1_epi8('Z' - 'A'));
__m512i to_lower = _mm512_mask_add_epi8(c, is_upper, c, _mm512_set1_epi8('a' - 'A'));
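The same range test works one byte at a time without any vector instructions. The following portable sketch (the function name `to_lower_ascii` is an assumption) mirrors the AVX-512 logic: subtract `'A'`, compare once as unsigned, then add the offset only where the comparison held.

```cpp
#include <cstdint>
#include <string>

// Converts ASCII uppercase letters to lowercase in place. All other bytes,
// including non-ASCII bytes, are left unchanged.
void to_lower_ascii(std::string& s) {
  for (char& ch : s) {
    std::uint8_t c = static_cast<std::uint8_t>(ch);
    // Unsigned subtraction maps 'A'..'Z' to 0..25; anything else wraps
    // around or lands above 25.
    bool is_upper = static_cast<std::uint8_t>(c - 'A') <= 'Z' - 'A';
    // Multiplying by the 0/1 flag lets the compiler avoid a branch.
    ch = static_cast<char>(c + is_upper * ('a' - 'A'));
  }
}
```

Because the loop body has no data-dependent branch, compilers can often vectorize it automatically. The explicit AVX-512 version processes 64 bytes per instruction.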
successive difference:
for (size_t i = 1; i < n; ++i) {
dst[i] = src[i] - src[i - 1];
}
prefix sum:
for (size_t i = 1; i < n; ++i) {
dst[i] = dst[i - 1] + src[i];
}
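Both loops start at index 1, so complete functions also need to handle the first element. A minimal sketch, assuming 32-bit unsigned elements and the convention `dst[0] = src[0]` (neither is stated in the slides; the function names are also assumptions):

```cpp
#include <cstddef>
#include <cstdint>

// Successive difference: each output depends only on two inputs, so all
// iterations are independent and can be executed in parallel.
void successive_difference(const std::uint32_t* src, std::uint32_t* dst,
                           std::size_t n) {
  if (n == 0) return;
  dst[0] = src[0];  // assumed convention: keep the first value as-is
  for (std::size_t i = 1; i < n; ++i) {
    dst[i] = src[i] - src[i - 1];
  }
}

// Prefix sum: each output depends on the previous output, which creates a
// loop-carried dependency that limits how fast the loop can run.
void prefix_sum(const std::uint32_t* src, std::uint32_t* dst, std::size_t n) {
  if (n == 0) return;
  dst[0] = src[0];
  for (std::size_t i = 1; i < n; ++i) {
    dst[i] = dst[i - 1] + src[i];
  }
}
```

Under these conventions the two functions are inverses: applying `prefix_sum` to the output of `successive_difference` recovers the original values.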
simdjson: The fastest JSON parser in the world https://simdjson.org
simdutf: Unicode routines (UTF8, UTF16, UTF32) and Base64 https://github.com/simdutf/simdutf

---
# With unrolling, you can get close to 1 store per cycle
---
# How much can your processor learn?

| size | ns/value | GHz | cycles/value | instr/value | i/c |
|------|---------:|-----:|-------------:|------------:|-----:|
| 1048576 | 1.59 | 4.51 | 7.20 | 8.01 | 1.11 |
| 524288 | 1.50 | 4.51 | 6.76 | 8.01 | 1.19 |
| 262144 | 1.31 | 4.51 | 5.90 | 8.01 | 1.36 |
| 131072 | 0.76 | 4.52 | 3.43 | 8.01 | 2.34 |
| 65536 | 0.49 | 4.52 | 2.20 | 8.01 | 3.64 |
| 32768 | 0.49 | 4.52 | 2.19 | 8.02 | 3.66 |
https://lemire.me/blog/2023/04/27/hotspot-performance-engineering-fails/