During my introduction to algorithms and complexity analysis at UC Berkeley, we were given a simple scale for understanding how long a function takes to complete as its input grows, ranked from least to most costly to run, with the most costly being catastrophic and effectively intractable. Anyone who has studied algorithms will be familiar with a similar rubric (a short sketch after the list makes these gaps concrete):
- 1 - constant time
- log* n - log star
- log n - logarithmic
- n - linear
- n log n - loglinear
- n^2 - quadratic
- n^3 … n^c - cubic, n raised to an arbitrary constant, etc.
- 2^n - exponential
- n! - factorial, exhaustive search over every possible ordering
- n^n - impossible
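To make the gaps concrete, here is a minimal Python sketch that prints the raw value of each class at a modest n = 1,000,000. These are just function values, not timings, and the digit counts for the last three rows are approximations:

```python
import math

def log_star(x: float) -> int:
    """Iterated logarithm: how many times log2 must be applied before the value drops to 1 or below."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

n = 1_000_000
print("1        :", 1)
print("log* n   :", log_star(n))                 # 5
print("log n    :", round(math.log2(n), 1))      # ~19.9
print("n        :", n)                           # 1,000,000
print("n log n  :", round(n * math.log2(n)))     # ~19.9 million
print("n^2      :", n ** 2)                      # 10^12
print("n^3      :", n ** 3)                      # 10^18
print("2^n      : a number with ~%d digits" % int(n * math.log10(2)))
print("n!       : a number with ~%d digits" % int(math.lgamma(n + 1) / math.log(10)))
print("n^n      : a number with ~%d digits" % (int(n * math.log10(n)) + 1))
```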
In part 1, The Binary Tree of the Universe, and part 2, Earth-scale log n vs Cosmological log n, of this series, we built our intuition for how slowly logarithmic functions grow as inputs get larger and larger. And yet! When I look at this scale, my mind still tends to place O(n log n) roughly in the middle between linear time O(n) and quadratic O(n^2). Maybe nine-tenths of the way toward one side or the other, but it feels right for it to sit somewhere reasonably in between.
But now that we know a Binary Tree of the Universe would empower us to search for any atom in the observable universe in 266 steps, was my old intuition actually profoundly wrong?
Let’s use our handy trick of extrapolating these two functions to the most extreme cosmological scales - every atom in the observable universe:
every atom in the universe = 2^266
n = every atom in the universe
n^2 = every atom in the universe * every atom in the universe
n log n = 266 * every atom in the universe
This means the difference between O(n log n) and O(n^2) is the difference between a multiplicative factor of 266 and a multiplicative factor of every single atom in the universe. O(n log n) isn’t sitting neatly at a rest stop halfway between O(n) and O(n^2); it’s like you haven’t even left your doorstep. The distance traversed is so infinitesimally small that, for all intents and purposes, you haven't moved. Working through the math for the approximate ratio between n^2 and n log n where n = 2^266:
n^2 / (n log n) = n / log n = 2^266 / 266
= 2^266 / 2^8.055
= 2^257.945 ≈ 4.458 × 10^77
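Python's arbitrary-precision integers make this easy to sanity-check (the simplification n^2 / (n log n) = n / log n does most of the work):

```python
# Sanity check: n^2 / (n * log n) simplifies to n / log n.
n = 2 ** 266      # roughly every atom in the observable universe
log_n = 266       # log2(n)

ratio = n // log_n
print(f"{float(ratio):.3e}")   # ~4.458e+77
```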
The ratio is incomprehensibly, cosmically, and comically large. So let’s see if we can further build our intuition through other analogies. Imagine we built a god machine that circumvents all causal speed limits and can linearly scan every atom in the universe in a single second. Where n is every atom in the universe, what would the difference be between O(n log n) and O(n^2)?
- O(n log n) algorithm on that input completes in ~4.43 minutes
- O(n^2) version finishes in ~3.76x10^72 years
The last hydrogen-burning star is expected to extinguish in 10^12 - 10^14 years. By the time our quadratic algorithm finishes, the universe will have watched the last star die ten-thousand-octillion-octillion (10^58) times over.
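A quick back-of-the-envelope check of both figures, assuming the god machine performs exactly n = 2^266 operations per second:

```python
n = 2 ** 266          # input size, and operations per second
log_n = 266           # log2(n)

SECONDS_PER_YEAR = 60 * 60 * 24 * 365.25

# O(n log n): n * log_n operations at n operations/second -> log_n seconds.
print(log_n / 60, "minutes")                     # ~4.43 minutes

# O(n^2): n * n operations at n operations/second -> n seconds.
print(float(n) / SECONDS_PER_YEAR, "years")      # ~3.76e72 years
```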
Same input. Same hardware. A god machine capable of exploring the entire observable universe in a single second. And yet due to a single, small exponent change, even our already impossible machine is incapable of ever completing our program.
I love this insight! Once again logarithms show how they can take the entire universe and place it in the palm of your hand. This raises the question: how can we practically apply these insights to computer science today? Almost every company chasing the AI gold rush has to wrestle with a single constraint on the throughput of its reasoning models: the token context window.
Attention unfortunately requires O(n^2) computation because every token in the context window attends to every other token. The feed-forward layers are expensive too, but they scale as O(n × model_dimension^2), which is linear in n, so attention's quadratic term dominates as the context grows. Using some rough approximations, we can see what this tells us about current LLM architectures:
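To see where the n × n term comes from, here is a toy NumPy sketch of scaled dot-product attention. It's a bare-bones illustration (single head, no masking, made-up sizes), not any production kernel:

```python
import numpy as np

n, d_model = 8, 16   # tiny sequence length and embedding size, purely for illustration
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d_model))
K = rng.standard_normal((n, d_model))
V = rng.standard_normal((n, d_model))

scores = Q @ K.T / np.sqrt(d_model)   # shape (n, n): every token scores every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
output = weights @ V                  # shape (n, d_model)

print(scores.shape)   # (8, 8) -- this matrix grows as n * n with context length
```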
Computational Costs by Context Length
| Context Length | Tokens (n) | O(n²) Attention Ops | O(n log n) Ops | Ratio (n²/n log n) |
|---|---|---|---|---|
| 128K (GPT-4) | 2^17 (131K) | 16 billion (2^34) | 2.2 million (2^17 × 17) | 7,500× |
| 200K (Opus 4) | 2^17.6 (200K) | 39 billion (2^35.2) | 3.5 million (2^17.6 × 17.6) | 11,000× |
| 400K (GPT-5.1) | 2^18.6 (400K) | 156 billion (2^37.2) | 7.4 million (2^18.6 × 18.6) | 21,000× |
| 1M (Claude 4.5) | 2^20 | 1 trillion (2^40) | 21 million (2^20 × 20) | 50,000× |
| 10M | 2^23 | 70 trillion (2^46) | 193 million (2^23 × 23) | 363,000× |
| 100M | 2^27 | 18 quadrillion (2^54) | 3.4 billion (2^27 × 27) | 5,400,000× |
| 1B | 2^30 | 1 quintillion (2^60) | 32 billion (2^30 × 30) | 34,000,000× |
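The op counts and ratios above can be reproduced to rough precision with a few lines of Python; the table rounds the larger context lengths to nearby powers of two, so its figures differ somewhat:

```python
import math

def compare(n_tokens: int) -> None:
    quadratic = n_tokens ** 2                   # O(n^2) attention op count (unitless)
    loglinear = n_tokens * math.log2(n_tokens)  # hypothetical O(n log n) op count
    print(f"n={n_tokens:>13,}  n^2={quadratic:.2e}  "
          f"n log n={loglinear:.2e}  ratio={quadratic / loglinear:,.0f}x")

for context_length in (131_072, 200_000, 400_000, 1_048_576,
                       10_000_000, 100_000_000, 1_000_000_000):
    compare(context_length)
```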
Memory Requirements (Attention Matrix Only)
| Context Length | Model | Attention Matrix Size | Memory (fp16) | Memory (fp32) | Fits in? |
|---|---|---|---|---|---|
| 128K | GPT-4 | 128K × 128K | 32 GB | 64 GB | High-end GPU ✅ |
| 200K | Opus 4 | 200K × 200K | 78 GB | 156 GB | 2× A100s (80GB each) |
| 400K | GPT-5.1 | 400K × 400K | 313 GB | 625 GB | 4× A100s (tight) |
| 1M | Claude 4.5 | 1M × 1M | 2 TB | 4 TB | ❌ Not in GPU RAM |
| 10M | - | 10M × 10M | 200 TB | 400 TB | ❌ Not even on disk |
| 100M | - | 100M × 100M | 20 PB | 40 PB | ❌ Data center scale |
| 1B | - | 1B × 1B | 2 EB | 4 EB | ❌ Apocalyptic |
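The memory column is simply n^2 entries times bytes per entry. A quick sketch, counting a single dense score matrix and ignoring heads, layers, and the tiling tricks real kernels use:

```python
def attention_matrix_gb(n_tokens: int, bytes_per_value: int = 2) -> float:
    """Size of one dense n x n attention score matrix in GB (fp16 = 2 bytes per value)."""
    return n_tokens ** 2 * bytes_per_value / 1e9

for n_tokens in (128_000, 200_000, 400_000, 1_000_000, 10_000_000):
    print(f"{n_tokens:>12,} tokens -> {attention_matrix_gb(n_tokens):,.0f} GB (fp16)")
```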
Total FLOPs for Full Forward Pass
| Context Length | Model | Attention FLOPs | A100 Time (50% util) | H100 Time (50% util) |
|---|---|---|---|---|
| 128K | GPT-4 | 200 TFLOPs | 1.3 seconds | 0.4 seconds |
| 200K | Opus 4 | 480 TFLOPs | 3.1 seconds | 1.0 seconds |
| 400K | GPT-5.1 | 1.9 PFLOPs | 12 seconds | 3.8 seconds |
| 1M | Claude 4.5 | 12.3 PFLOPs | 22 hours | 6.8 hours |
| 10M | - | 8.4 exaFLOPs | 625 years | 195 years |
| 100M | - | 2.2 zettaFLOPs | 175,000 years | 54,000 years |
| 1B | - | 12 yottaFLOPs | 940M years | 293M years |
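The time columns are just a FLOP count divided by sustained throughput. A minimal sketch for the first row, assuming roughly 312 TFLOP/s of dense bf16 peak for an A100 and roughly 990 TFLOP/s for an H100 (approximate spec-sheet numbers, my assumption rather than something taken from the table), each at 50% utilization:

```python
A100_PEAK = 312e12    # approx. dense bf16 FLOP/s (assumed)
H100_PEAK = 990e12    # approx. dense bf16 FLOP/s (assumed)
UTILIZATION = 0.5

def seconds_for(flop_count: float, peak_flops_per_sec: float, util: float = UTILIZATION) -> float:
    return flop_count / (peak_flops_per_sec * util)

flops_128k = 200e12   # 200 TFLOPs of attention work, from the 128K row above
print(seconds_for(flops_128k, A100_PEAK))   # ~1.3 seconds
print(seconds_for(flops_128k, H100_PEAK))   # ~0.4 seconds
```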
A qualitative change in what is possible with modern hardware takes place as we approach a million tokens! Distributing the computation across multiple GPUs becomes a must, and completion times rule out almost every practical use case. Going beyond a million tokens moves from the impractical to the infeasible. Each layer in the neural network compounds the cost, which makes clear why Opus 4.5, the current state of the art, restricts its context window to 200,000 tokens. The quadratic curve of O(n^2) is an unforgiving master and demands its pound of flesh.
So what if we built a model that achieved state-of-the-art performance while running in O(n log n) instead? Call it a TreeFormer. How fast would our full forward pass be?
| Context Length | Model | TreeFormer FLOPs | A100 Time | vs O(n²) Speedup |
|---|---|---|---|---|
| 128K | GPT-4 | 33 GFLOPs | 0.2 seconds | 6× faster |
| 200K | Opus 4 | 53 GFLOPs | 0.3 seconds | 9× faster |
| 400K | GPT-5.1 | 113 GFLOPs | 0.7 seconds | 17× faster |
| 1M | Claude 4.5 | 300 GFLOPs | 1.9 seconds | 40,000× faster |
| 10M | - | 2.8 TFLOPs | 18 seconds | 1,250,000× faster |
| 100M | - | 41 TFLOPs | 4.4 minutes | 40,000,000× faster |
| 1B | - | 450 TFLOPs | 48 minutes | 19,500,000× faster |
The infeasible becomes practical, and the nigh-impossible becomes feasible! This is precisely why so much work has gone into reducing the computational complexity of transformer architectures, spanning Mamba, Longformer, hierarchical attention, ring and sparse attention, Mixture of Experts (a massive brain that knows a lot, but only a subset of which activates for any given answer), and many others.
That brings us to one final question: do we even need 1 million tokens? All the works of Shakespeare make up roughly 1.2 million tokens. And consider massively long novels like the New York Times best-selling epic fantasy series The Stormlight Archive: the most recent book clocks in at roughly 490,000 words, so at roughly 1.33 tokens per word:
490,000 words x 1.33 tokens/word ≈ 651,700 tokens
The problem compounds because, with each turn in the conversation, we send the entire conversation back to the model. That doesn’t even include the massive unseen system prompts the AI model providers prepend to every conversation. If we asked Gemini, ChatGPT, Claude, Grok, etc. to sum up their feelings on the book, they would be incapable of doing so without the optimizations and tricks used to stretch context windows.
For programming, the issue is exacerbated further: many files, including AGENTS.md and CLAUDE.md, may influence how a refactor or feature implementation should proceed. And not every device has the memory and compute to attend over so many tokens; many use cases need long contexts to be handled reasonably by small, on-device models.
More context means more history, and more history means better results for engineering and scientific objectives; it also means richer and more fulfilling conversations and interactions with the current suite of LLMs. If an LLM didn’t simply rely on summarization and RAG to retrieve segments of old conversations, how would the interaction feel if every chat the two of you had ever had sat within the context window? What emergent property or enrichment would we unearth in our conversations?
There is a singular, universal satisfaction in being seen, being known, and being remembered. That is what we’re really after.
---------
To future self, potential follow-up blog posts: further implications for smaller on-device models, how we could architect an O(n log n) transformer model.