Algorithms for Modern Processor Architectures

| processor | year | frequency | transistors (billions) |
|-----------|-----:|----------:|-----------------------:|
| Pentium 4 | 2000 | 3.8 GHz | 0.040 |
| Intel Haswell | 2013 | 4.4 GHz | 1.4 |
| Apple M1 | 2020 | 3.2 GHz | 16 |
| Apple M2 | 2022 | 3.49 GHz | 20 |
| Apple M3 | 2024 | 4.05 GHz | 25 |
| Apple M4 | 2024 | 4.5 GHz | 28 |
| AMD Zen 5 | 2024 | 5.7 GHz | 50 |

| processor | year | arithmetic logic units |
|-----------|-----:|-----------------------:|
| Pentium 4 | 2000 | 2 |
| AMD Zen 2 | 2019 | 4 |
| Apple M* | 2019 | 6+ |
| Intel Lion Cove | 2024 | 6 |
| AMD Zen 5 | 2024 | 6 |

| function | instructions |
|----------|-------------:|
| strtod | |
| our parser | |

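| cycle | order | eat | pizzas en route |
|------:|-------|-----|-----------------|
| 1 | order pizza A | | |
| 2 | order pizza B | | A |
| 3 | order pizza C | | A, B |
| 4 | order pizza D | eat pizza A | B, C |
| 5 | order pizza E | eat pizza B | C, D |
| 6 | order pizza F | eat pizza C | D, E |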

![center](simdjsonlogo.png)

---

# With unrolling, you can get close to 1 store per cycle

---

# How much can your processor learn?

| size | ns/value | GHz | cycles/value | instr/value | i/c |
|------|---------:|-----:|-------------:|------------:|-----:|
| 1048576 | 1.59 | 4.51 | 7.20 | 8.01 | 1.11 |
| 524288 | 1.50 | 4.51 | 6.76 | 8.01 | 1.19 |
| 262144 | 1.31 | 4.51 | 5.90 | 8.01 | 1.36 |
| 131072 | 0.76 | 4.52 | 3.43 | 8.01 | 2.34 |
| 65536 | 0.49 | 4.52 | 2.20 | 8.01 | 3.64 |
| 32768 | 0.49 | 4.52 | 2.19 | 8.02 | 3.66 |

https://lemire.me/blog/2023/04/27/hotspot-performance-engineering-fails/

Disk at gigabytes per second

High Bandwidth Memory

Some numbers

Frequencies and transistors

Where do the transistors go?

Where do the transistors go?

Superscalar execution

Parsing a number

Parsing a number

Lemire's Rule 1

Lemire's Corollary 1

Lemire's Tips

Going back to number parsing

SWAR

Check whether we have a digit
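
The slide's own code is not part of this outline, so the following is only a sketch of one common SWAR ("SIMD within a register") way to test eight characters at once. The function name `is_eight_digits` is an assumption made for illustration. The idea is to load eight bytes into a 64-bit word and use two word-wide operations so that any byte outside `'0'..'9'` ends up with its high bit set.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Returns true when the 8 bytes at `chars` are all ASCII digits ('0'..'9').
// Hypothetical helper name; the caller must guarantee 8 readable bytes.
bool is_eight_digits(const char *chars) {
  uint64_t val;
  memcpy(&val, chars, sizeof(val)); // portable unaligned 8-byte load
  // Bytes 0x3A..0xB9 reach 0x80 or more once 0x46 is added.
  // Bytes below 0x30 wrap around when 0x30 is subtracted, and bytes
  // 0xB0 and above keep their high bit after the subtraction.
  // A digit (0x30..0x39) sets the high bit in neither result.
  return (((val + 0x4646464646464646ULL) | (val - 0x3030303030303030ULL)) &
          0x8080808080808080ULL) == 0;
}
```

For instance, `is_eight_digits("12345678")` is true while `is_eight_digits("1234567a")` is false. The check costs a handful of arithmetic instructions for eight characters, with no per-character branch.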

Batching (unrolling)

Knuth's random shuffle
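
As a reminder of the baseline, here is a minimal sketch of the classic in-place shuffle (often attributed to Fisher, Yates, and Knuth). The generator `next_random` is a hypothetical stand-in in the style of splitmix64, and the modulo reduction is a simplification that introduces a slight bias; neither is taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in generator (splitmix64-style); any 64-bit source works.
static uint64_t next_random(uint64_t *state) {
  uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
  return z ^ (z >> 31);
}

// Shuffle `array` in place: for i = n down to 2, swap element i-1
// with an element chosen at random from positions [0, i).
void shuffle(uint32_t *array, size_t n, uint64_t *state) {
  for (size_t i = n; i > 1; i--) {
    // One random index per element; the modulo is slightly biased
    // and is kept here only for brevity.
    size_t j = (size_t)(next_random(state) % i);
    uint32_t tmp = array[i - 1];
    array[i - 1] = array[j];
    array[j] = tmp;
  }
}
```

Each iteration draws one random number and reduces it to one index, so the cost of generating bounded random indexes is paid once per element.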

Batched random shuffle

Results (Apple M4): Use a large array (8 MB).

Branching

Unicode (UTF-16)

Validate

Validate
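
For reference, a conventional branching validator might look like the sketch below. The function name `validate_utf16` is an assumption, and the input is taken to be an array of native-endian 16-bit code units. The rule it enforces is that every high surrogate (0xD800..0xDBFF) must be immediately followed by a low surrogate (0xDC00..0xDFFF), and that a low surrogate never appears on its own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Returns true when `data` holds well-formed UTF-16 (native endianness).
// Hypothetical helper name, written for illustration.
bool validate_utf16(const uint16_t *data, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint16_t w = data[i];
    if ((w & 0xF800) != 0xD800) {
      i++; // not a surrogate: a complete character on its own
    } else {
      if ((w & 0xFC00) != 0xD800) return false; // low surrogate comes first
      if (i + 1 >= len) return false;           // truncated pair
      if ((data[i + 1] & 0xFC00) != 0xDC00) return false; // missing low half
      i += 2; // valid surrogate pair
    }
  }
  return true;
}
```

The outcome of these branches depends on the data, so on input where surrogates appear unpredictably the processor has to guess them.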

Performance results (Apple M4)

Performance results (Apple M4)

Speculative execution

How much can your processor learn?

Finite state machine to the rescue

Performance results (Apple M4)

The finite-state approach can be faster!

Rules of thumb

Apple M4 can learn 10,000 random (0/1) branches.

Pipelining

Little's Law
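
As a reminder, Little's Law ties together the average number of requests in flight, the throughput, and the latency. The symbols below ($L$ for requests in flight, $\lambda$ for throughput, $W$ for latency) and the numbers in the example are chosen purely for illustration; they are not measurements from the slides.

$$
L = \lambda W
\qquad\Longleftrightarrow\qquad
\lambda = \frac{L}{W}
$$

$$
\text{e.g. } W = 100\ \text{ns},\ L = 10
\ \Rightarrow\
\lambda = \frac{10}{100\ \text{ns}} = 0.1\ \text{requests/ns}
$$

With latency fixed, the only way to raise throughput is to keep more requests in flight at once.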

Memory-level parallelism

1 lane

2 lanes

Consequence

Bloom filter

Bloom filter

Bloom filter

Data-level parallelism

SIMD

ASCII to lower case

ASCII to lower case: 64 characters in 3 instructions
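
The SIMD code behind this slide is not part of the outline. As a point of comparison, here is a scalar sketch of the same transformation, one byte at a time and without a data-dependent branch; the name `ascii_to_lower` is an assumption. It is not the 64-character version the title refers to.

```c
#include <stddef.h>
#include <stdint.h>

// Lowercase ASCII letters in place; all other bytes are left unchanged.
// Hypothetical helper name, written for illustration.
void ascii_to_lower(char *s, size_t n) {
  for (size_t i = 0; i < n; i++) {
    uint8_t c = (uint8_t)s[i];
    // 1 when c is in 'A'..'Z', otherwise 0: bytes below 'A' wrap
    // around to large values after the subtraction.
    uint8_t is_upper = (uint8_t)(c - 'A') < 26;
    // 'a' - 'A' == 32, so adding (is_upper << 5) lowercases letters only.
    s[i] = (char)(c + (is_upper << 5));
  }
}
```

A range check followed by a conditional add is the kind of per-byte work that maps naturally onto wide registers, where one instruction applies it to many bytes at once.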

Deltas (C)
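
The slide's code is not included in this outline. Assuming "deltas" means successive differences, a plain C loop could be sketched as follows; the name `compute_deltas` and the choice of 32-bit unsigned values are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

// Write successive differences of `data` into `deltas`:
// deltas[0] = data[0], deltas[i] = data[i] - data[i - 1].
// Hypothetical helper name, written for illustration.
void compute_deltas(const uint32_t *data, uint32_t *deltas, size_t n) {
  if (n == 0) return;
  deltas[0] = data[0];
  for (size_t i = 1; i < n; i++) {
    // Unsigned arithmetic wraps around when the sequence decreases.
    deltas[i] = data[i] - data[i - 1];
  }
}
```

Each output depends only on two input values, not on earlier outputs, so the iterations are independent of one another.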

Apple M4

Now allow SIMD! (Autovectorization)

Need to learn SIMD design magic!

UTF-16

UTF-16, random (adversarial), Apple M4

Interested? Check these projects

Measurements

Measurements

Sigma events

What if we dealt with log-normal distributions?

Real-world measurements

Conclusion