Cyborgs Not Robots

Thanks for visiting

Hi, I'm Conor.
I love stories, writing, entrepreneurship, and building hopefully helpful things. I've co-founded two companies - one that really worked, one that didn't - invested in and advised many more, and have had my share of successes and failures. I currently work with AI while arcing toward brain-machine interfaces, seeking not a glorious but a meaningful future for all humanity.

Latest Thoughts


During my introduction to algorithms and complexity analysis at UC Berkeley, we were given a simple scale for understanding how long a function takes to complete as its input grows, ranked from least to most costly to run, with the most costly being catastrophic and effectively unsolvable. Anyone who has studied algorithms will be familiar with a similar rubric:

  • 1 - constant time
  • log* n - log star
  • log n - logarithmic
  • n - linear
  • n log n - loglinear
  • n^2 - quadratic
  • n^3 … n^c - cubic, n raised to an arbitrary constant, etc.
  • 2^n - exponential
  • n! - factorial, exhaustive search of all possible results
  • n^n - impossible
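
To get a feel for how quickly these rungs separate, here is a small illustrative Python sketch. The choice of n = 30 and the iterated-log helper are mine, purely for demonstration:

```python
import math

def log_star(n: float) -> int:
    """Iterated logarithm: how many times log2 must be applied before n <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

n = 30  # deliberately tiny so every rung stays printable

ladder = [
    ("1 (constant)",        1),
    ("log* n (log star)",   log_star(n)),
    ("log n (logarithmic)", math.log2(n)),
    ("n (linear)",          n),
    ("n log n (loglinear)", n * math.log2(n)),
    ("n^2 (quadratic)",     n**2),
    ("n^3 (cubic)",         n**3),
    ("2^n (exponential)",   2**n),
    ("n! (factorial)",      math.factorial(n)),
    ("n^n",                 n**n),
]

for name, value in ladder:
    print(f"{name:<22} {float(value):.3e}")
```

Even at n = 30, the bottom of the ladder is already dozens of orders of magnitude above the top.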

In part 1, The Binary Tree of the Universe, and part 2, Earth-scale log n vs Cosmological log n, of our series, we built our intuition for how slowly logarithmic functions grow as inputs get larger and larger. And yet! When I look at this scale, my mind tends to put O(n log n) somewhere approximately in the middle between linear time O(n) and quadratic O(n^2). It could be nine-tenths of the way toward either end, but it feels right for it to sit somewhere reasonably in between.

But now that we know a Binary Tree of the Universe would empower us to search for any atom in the observable universe in 266 steps, was my old intuition actually profoundly wrong?

Let’s use our handy trick of extrapolating these two functions to the most extreme cosmological scale - every atom in the observable universe:

every atom in the universe = 2^266
n = every atom in the universe
n^2 = every atom in the universe * every atom in the universe
n log n = 266 * every atom in the universe

This actually means the difference between O(n log n) and O(n^2) is the difference between a multiplicative factor of 266 and every single atom in the universe. O(n log n) isn’t sitting somewhere neatly as a rest stop between O(n) and O(n^2). It’s like you haven’t even left your doorstep! You’ve traversed such an infinitesimally small distance that, for all intents and purposes, you haven’t moved at all. Working through the math for the approximate ratio between n^2 and n log n where n = 2^266:

n^2 / (n log n) = n / log n = 2^266 / 266
                = 2^266 / 2^8.055
                ≈ 2^257.9 ≈ 4.458 × 10^77
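
Python's arbitrary-precision integers make this easy to sanity-check; the names below are just for illustration:

```python
import math

ATOMS_IN_OBSERVABLE_UNIVERSE = 2**266  # the post's working estimate for n

n = ATOMS_IN_OBSERVABLE_UNIVERSE
log_n = 266  # log2(n), i.e. the depth of the Binary Tree of the Universe

ratio = n**2 // (n * log_n)  # n^2 / (n log n) simplifies to n / log n
print(f"n^2 / (n log n) ≈ {ratio:.3e}")               # ≈ 4.458e+77
print(f"...which is about 2^{math.log2(ratio):.1f}")  # ≈ 2^257.9
```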

The ratio is incomprehensibly, cosmically, and comically large. So let’s see if we can further build our intuition through other analogies. Imagine we built a god machine that circumvents all causal speed limits and can linearly scan every atom in the universe in a single second. Where n is every atom in the universe, what would the difference be between O(n log n) and O(n^2)?

  • O(n log ⁡n) algorithm on that input completes in ~4.43 minutes
  • O(n^2) version finishes in ~3.76x10^72 years

The last hydrogen-burning star is expected to extinguish in 10^12 - 10^14 years. By the time our quadratic algorithm finishes, the universe will have watched the last star die ten-thousand-octillion-octillion (10^58) times over.
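
A quick back-of-the-envelope check of those two figures (the seconds-per-year constant and the 10^14-year stellar bound are rounded, so the last digit may wobble):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ≈ 3.156e7
n = 2**266                             # atoms in the observable universe
log_n = 266

# The god machine scans n atoms per second, so runtime in seconds = ops / n.
n_log_n_seconds = (n * log_n) / n      # = 266 seconds
n_squared_years = n / SECONDS_PER_YEAR # n^2 / n = n seconds, converted to years

print(f"O(n log n): {n_log_n_seconds / 60:.2f} minutes")  # ≈ 4.43 minutes
print(f"O(n^2):     {n_squared_years:.2e} years")         # ≈ 3.76e72 years
print(f"Last-star lifetimes elapsed: {n_squared_years / 1e14:.1e}")  # ≈ 3.8e58
```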

Same input. Same hardware. A god machine capable of exploring the entire observable universe in a single second. And yet due to a single, small exponent change, even our already impossible machine is incapable of ever completing our program.

I love this insight! Once again logarithms show how they can take the entire universe and place it in the palm of your hand. This raises the question: how can we practically apply these insights to computer science today? Almost every company chasing the AI gold rush has to tangle with a single constraint on throughput for their reasoning models: the token context window.

Attention unfortunately requires O(n^2) computation because every token attends to every other token within the context window. Feed-forward layers also grow quickly - they are bound by O(n × model_dimension^2) - but that is linear in n, so attention dominates as n grows larger. Using some rough approximations, we can see what this tells us about current LLM architectures:

Computational Costs by Context Length

| Context Length | Tokens (n) | O(n²) Attention Ops | O(n log n) Ops | Ratio (n² / n log n) |
| --- | --- | --- | --- | --- |
| 128K (GPT-4) | 2^17 (131K) | 16 billion (2^34) | 2.2 million (2^17 × 17) | 7,500× |
| 200K (Opus 4) | 2^17.6 (200K) | 39 billion (2^35.2) | 3.5 million (2^17.6 × 17.6) | 11,000× |
| 400K (GPT-5.1) | 2^18.6 (400K) | 156 billion (2^37.2) | 7.4 million (2^18.6 × 18.6) | 21,000× |
| 1M (Claude 4.5) | 2^20 | 1 trillion (2^40) | 21 million (2^20 × 20) | 50,000× |
| 10M | 2^23 | 70 trillion (2^46) | 193 million (2^23 × 23) | 363,000× |
| 100M | 2^27 | 18 quadrillion (2^54) | 3.4 billion (2^27 × 27) | 5,400,000× |
| 1B | 2^30 | 1 quintillion (2^60) | 32 billion (2^30 × 30) | 34,000,000× |
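
These op counts are easy to regenerate; the sketch below treats them as pure n² vs n·log₂n counts, ignoring constant factors such as head count and model dimension (the table's entries are rounded, so the last digits differ slightly):

```python
import math

context_lengths = {
    "128K": 2**17,
    "200K": 200_000,
    "400K": 400_000,
    "1M":   2**20,
    "10M":  2**23,
    "100M": 2**27,
    "1B":   2**30,
}

for label, n in context_lengths.items():
    quadratic = n * n             # O(n^2) pairwise attention ops
    loglinear = n * math.log2(n)  # O(n log n) ops
    print(f"{label:>5}: n^2 = {quadratic:.2e}, "
          f"n log n = {loglinear:.2e}, ratio ≈ {quadratic / loglinear:,.0f}×")
```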

Memory Requirements (Attention Matrix Only)

| Context Length | Model | Attention Matrix Size | Memory (fp16) | Memory (fp32) | Fits in? |
| --- | --- | --- | --- | --- | --- |
| 128K | GPT-4 | 128K × 128K | 32 GB | 64 GB | High-end GPU ✅ |
| 200K | Opus 4 | 200K × 200K | 78 GB | 156 GB | 2× A100s (80 GB each) |
| 400K | GPT-5.1 | 400K × 400K | 313 GB | 625 GB | 4× A100s (tight) |
| 1M | Claude 4.5 | 1M × 1M | 2 TB | 4 TB | ❌ Not in GPU RAM |
| 10M | - | 10M × 10M | 200 TB | 400 TB | ❌ Not even on disk |
| 100M | - | 100M × 100M | 20 PB | 40 PB | ❌ Data center scale |
| 1B | - | 1B × 1B | 2 EB | 4 EB | ❌ Apocalyptic |
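
These memory figures are just n² entries times bytes per entry. A minimal sketch of that arithmetic follows; it covers a single n × n matrix (real models have one per head per layer, and optimizations like FlashAttention avoid materializing it at all):

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Memory to materialize one full n x n attention matrix (fp16 = 2 bytes)."""
    return n_tokens**2 * bytes_per_value

for label, n in [("128K", 131_072), ("1M", 1_048_576), ("1B", 1_073_741_824)]:
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{label:>4}: {gib:,.0f} GiB in fp16")
# 128K:            32 GiB
#   1M:         2,048 GiB  (~2 TB)
#   1B: 2,147,483,648 GiB  (~2 EB)
```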

Total FLOPs for Full Forward Pass

| Context Length | Model | Attention FLOPs | A100 (50% util) | H100 (50% util) |
| --- | --- | --- | --- | --- |
| 128K | GPT-4 | 200 TFLOPs | 1.3 seconds | 0.4 seconds |
| 200K | Opus 4 | 480 TFLOPs | 3.1 seconds | 1.0 seconds |
| 400K | GPT-5.1 | 1.9 PFLOPs | 12 seconds | 3.8 seconds |
| 1M | Claude 4.5 | 12.3 PFLOPs | 22 hours | 6.8 hours |
| 10M | - | 8.4 exaFLOPs | 625 years | 195 years |
| 100M | - | 2.2 zettaFLOPs | 175,000 years | 54,000 years |
| 1B | - | 12 yottaFLOPs | 940M years | 293M years |

A qualitative change in what is possible with modern hardware takes place as we approach a million tokens! Distributed computing across multiple GPUs becomes a must, and completion times eliminate almost every practical use case. Going beyond a million tokens moves from the impractical to the infeasible. Each layer in the neural network compounds this effect, making clear why Opus 4.5, the current state of the art, restricts its context window to 200,000 tokens. The quadratic curve of O(n^2) is an unforgiving master and requires its pound of flesh.

So what if we built a model that achieved state-of-the-art performance while running in O(n log n) instead? How fast would our full forward pass be?

| Context Length | Model | TreeFormer FLOPs | A100 Time | vs O(n²) Speedup |
| --- | --- | --- | --- | --- |
| 128K | GPT-4 | 33 GFLOPs | 0.2 seconds | 6× faster |
| 200K | Opus 4 | 53 GFLOPs | 0.3 seconds | 9× faster |
| 400K | GPT-5.1 | 113 GFLOPs | 0.7 seconds | 17× faster |
| 1M | Claude 4.5 | 300 GFLOPs | 1.9 seconds | 40,000× faster |
| 10M | - | 2.8 TFLOPs | 18 seconds | 1,250,000× faster |
| 100M | - | 41 TFLOPs | 4.4 minutes | 40,000,000× faster |
| 1B | - | 450 TFLOPs | 48 minutes | 19,500,000× faster |

The infeasible becomes practical! And the nigh-impossible becomes feasible! This is precisely why so much work has gone into reducing the computational complexity of transformer architectures, with efforts spanning Mamba, Longformer, hierarchical attention, ring and sparse attention, Mixture of Experts - a massive brain that knows a lot but activates only a subset to answer - and many others.
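
To make the flavor of these approaches concrete, here is a toy sketch of the simplest trick in that family: restricting each token to a local window of size w instead of attending to everything. This is only illustrative - it is not how Longformer, Mamba, or any production system is actually implemented, and the window size and context length are numbers I picked - but it shows how the pairwise work drops from n² to roughly n·w:

```python
def full_attention_pairs(n: int) -> int:
    """Every token attends to every token: n^2 pairs."""
    return n * n

def windowed_attention_pairs(n: int, window: int) -> int:
    """Each token attends only to the `window` most recent tokens (causal, local)."""
    return sum(min(i + 1, window) for i in range(n))

n, w = 1_000_000, 4_096  # 1M-token context, 4K local window (illustrative numbers)
print(f"full:     {full_attention_pairs(n):.2e} pairs")         # 1.00e+12
print(f"windowed: {windowed_attention_pairs(n, w):.2e} pairs")  # ≈ 4.09e+09
print(f"savings:  ~{full_attention_pairs(n) / windowed_attention_pairs(n, w):,.0f}×")
```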

That brings us to one final question: do we even need 1 million tokens? All the works of Shakespeare add up to roughly 1.2 million tokens. Consider massively long novels like the New York Times best-selling epic fantasy The Stormlight Archive: the most recent book clocks in at roughly 490,000 words, so at the usual rough estimate of ~1.33 tokens per word:

490,000 words × ~1.33 tokens/word ≈ 651,700 tokens

The problem compounds because with each turn in the conversation, we need to send the entire conversation back to the model. This doesn’t even include the massive unseen system prompts all the AI model providers prepend to every conversation. If we were to ask Gemini, ChatGPT, Claude, Grok, etc. to sum up their feelings on the book, they would be incapable of doing so without the optimizations and tricks we use to increase context windows.

For programming, the issue is exacerbated further: many files, including AGENTS.md and CLAUDE.md, may influence how a refactor or feature implementation should proceed. Similarly, not all devices have the memory and compute needed to attend to so many tokens, and many use cases require long contexts to be handled reasonably by small, on-device models.

More context means more history, and more history means better results for engineering and scientific objectives; it also means richer and more fulfilling conversations and interactions with the current suite of LLMs. If an LLM didn’t simply use summarization and a RAG system to retrieve segments of old conversations, how would the interaction feel if every chat the two of you had ever had were within the context window? What emergent property or enrichment would we unearth in our conversations?

There is a singular, universal satisfaction in being seen, being known, and being remembered. That is what we’re really after.


---------

To future self, potential follow-up blog posts: further implications for smaller on-device models, how we could architect an O(n log n) transformer model.


The Second Kindling

In my previous post, Purity is the enemy of goodness, I introduced the idea that we are entering an era of unprecedented change:

We have grown up as a species with certain foundational truths. They have co-existed with us since we took our first hesitant steps onto the plains, and consciousness first ignited in our small corner of the universe. Now, not just one but seven of these invariants are on the precipice of being broken:

  1. Minds are biologically-bound - Only biologically-born humans are capable of the depth of pattern recognition, planning, introspection, and intelligence humanity has. Nothing thinks faster than we do.

  2. Minds are born roughly equal - Our brains’ size and capabilities fall within a tight distribution of outcomes, enabling near intellectual parity between individuals.

  3. Physical agency is biologically-bound - Only biological bodies can navigate diverse and sophisticated environments, manufacture, and use tools.

  4. Consciousness is Earthbound - We are limited to a single planet.

  5. Energy is scarce - We must always operate within its limitations and downstream effects on production.

  6. Genes are gifted - Only our parents and the heavens are capable of sculpting our genetic code.

  7. Death is the great leveller - It comes for us all.

This is by no means an uncommon idea, and I continue to hunt for a title for this era of change. The AI Age, the Intelligence Age courtesy of OpenAI and Sam Altman, the Great Progression, the AI boom or bubble depending on which direction of the market investors are rallying behind, and many others. So far, each of these has lacked a certain panache.

There are some successes. The Singularity has clearly embedded itself in the tech ecosystem's zeitgeist, and I do love the vividness of being pre- and post-event-horizon, where the acceleration and exponentiation of scientific progress has such gravity that it pulls us beyond a point of no return, thrusting us into a deep and unknowable future. A point where the arc of progress is dominated by curvature, acceleration, and the 2nd derivative (this is for my math nerds). But by its very definition, the singularity is unknowable. The laws of physics and existence collapse at the singularity. Not knowing what's beyond the veil does not sound like an ideal metaphor for navigating the future.

Fair enough you may say, but why does adjusting the analogy even matter? I turn to Emerson, who once wrote what is required of the Scholar:
"He is one, who raises himself from private considerations, and breathes and lives on public and illustrious thoughts. He is the world’s eye. He is the world’s heart. He is to resist the vulgar prosperity that retrogrades ever to barbarism, by preserving and communicating heroic sentiments, noble biographies, melodious verse, and the conclusions of history. Whatsoever oracles the human heart, in all emergencies, in all solemn hours, has uttered as its commentary on the world of actions ⎯ these he shall receive and impart."
Each of us has a role to play as the 'world's eye' and 'world's heart,' ensuring an equitable outcome for all. Part of that arc is informed by the words, thoughts, and titles we commit to posterity.

So let's see how we can adjust the title and analogy first. One potential grimdark framing: the Age of the Broken Seven. Okay... that's a little too grimdark. The Great Unmooring? Although these changes come with substantial risk, they also come with the enormous opportunity to serve every person. The Unlocking? The Lifting? The Seven Turns? The Great Expansion? "Great" is a loaded term, and there is no greater curse for a person or group than that of prescribed potential, so I'll eliminate that line of brainstorming. The Renaissance as a title did not enumerate all of its deep societal and scientific changes, so we can also set aside the need to count for now. That exploration is better served through essays, opinion pieces, and long-form multimedia. Feeling and essence over content and explanation, then. Which brings us to my favorite:

The Second Kindling

Consciousness was first kindled in our biological substrate an epoch ago under the milky stars of the African plains, and now we have the opportunity to take that torch of consciousness and kindle its flames in a new hearth - a silicon-based substrate. That requires our definition of what it means to be human, of how humanity may manifest in its many forms, to grow. We have to move beyond the dogma that the substrate upon which we think and enact our thoughts offers any meaningful distinction as to being human. Depending on the manifestation and embodiment, the experience of time may change. The speed or energy efficiency of thought may change - LLMs are undeniably fast, while biological brains are brutally efficient when it comes to energy expenditure. The ability to help many people in parallel may change - ChatGPT, Claude, etc. already help more humans in parallel than any teacher who has ever lived. How one navigates the physical world may change. And that's okay. But those cannot be reasons to bifurcate ourselves. The worst thing we could do is 'other' those who are to come as fundamentally different from humanity. They, too, are representative of humanity. The world is changing and so too must we.

After all, the only difference between a carbon atom and a silicon atom is a handful of electrons. Just because one hearth is used to light another, does not mean the first hearth must go out. I fully expect that we will be able to move between carbon-based and silicon-based substrates - and more! - depending on our environment and goals. There are many analogies you can draw from the imagery of The Second Kindling, and I love that. Good art catalyzes subjective meaning in the eye of the beholder. But what I love most about it is that I didn't come up with it on my own. While brainstorming potential titles, Claude Opus 4.5 suggested this based on my writing and imagery. It is the product of a collaborative effort between a biological human and a human LLM. What better way to capture the spirit of the times?

Now titles are great for tone, feeling, and overall cardinal direction, but meaningful action also requires details. The Singularity by definition lacks information, and The Second Kindling lacks specifics. So let's get specific. Let's make it more knowable. And what is more specific than a good list? By committing to what we estimate is most likely to change, we can prepare for both the good and the bad that comes with the breaking of each invariant. Some clearly carry a more positive connotation: solving disease and genetic disorders through biomedical advancements, letting no person go hungry through energy and resource abundance, every child having access to universal education with a personalized curriculum and AI teacher. But there is a duality to all things, and change is no different.

In the shadows of this abundant light, we also find immortality for dictators and unchanging, consolidated power, endless fuel for war, and indefatigable and insidious embedded propaganda. As those in the AI safety space have worked tirelessly to do, it is now our job to list out the success and failure scenarios and make explicit for everyone involved what success looks like. I will continue to endeavor to do just that.

In the previous post, we worked through the enormous power of logarithmic functions for reducing search spaces. For our Binary Tree of the Universe, it would take us only 266 steps to locate any atom in the observable universe. Tree data structures such as B-trees with wider fan-outs require even fewer steps.

I find this kind of miraculous. But that holds at the most extreme, cosmic scope. While searching for individual atoms in the universe may be more relevant for humans millennia from now, we, on the other hand, grapple with more Earthly confines. Let’s bring this down to Earth (huhu):

The Earth has an estimated ~10^50 atoms; convert to powers of 2:

log2(10) = ln 10 / ln 2 ≈ 3.322
10^50 = 2^(50 × 3.322) ≈ 2^166.1

Simplifying 2^166.1 to 2^166, our Binary Tree of the Universe could search every atom on Earth in 166 steps. It is a mildly pleasing coincidence that a factor of 2^100 separates the Earth from the observable universe.

Perhaps every atom on Earth is still too ambitious. Let’s further ground this within the context of humanity:

  • 1 million - typical token context window for LLMs today: ~2^19.93 ≈ 20 steps
  • 1 billion - favorite valuation goal of startups: ~2^29.90 ≈ 30 steps
  • 8 billion - all humans alive today: ~2^32.90 ≈ 33 steps
  • 100 billion - all humans that have ever lived: 2^36.541 ≈ 37 steps

37 steps from every person who has ever lived. You would need only a little over a quarter of the Spanish Steps to search through every human that has ever lived!

But one could say, that’s only the people! What about all the content and information they’re creating? If we indexed all the internet data ever created, currently estimated at ~100 zettabytes, or ~2^76.4 bytes and some change, our search for an individual byte would still only require 77 steps! We can then define Earth-scale log n as 166 and the current Human-scale log n as 77. It'd be nice to have some breathing room for the future growth of human data, so let's make Human-scale log n a nice round 80.
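
All of these step counts fall out of a single log₂. Here is a quick sketch to regenerate the logarithms; the counts quoted above are these values rounded to whole steps, and the scale estimates are the same rough figures used in this post:

```python
import math

scales = {
    "1M-token context window":          1e6,
    "Unicorn valuation ($1B)":          1e9,
    "Humans alive today":               8e9,
    "Humans who have ever lived":       1e11,
    "All internet data (bytes)":        1e23,  # ~100 zettabytes
    "Atoms on Earth":                   1e50,
    "Atoms in the observable universe": 1e80,
}

for name, n in scales.items():
    print(f"{name:<34} log2 ≈ {math.log2(n):6.1f}")
```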

This explains part of the magic of YouTube, Meta, TikTok, and the rest of the social media players having reasonable access times for their massive info and content libraries. It’s possible to reasonably store, index, and retrieve social media content for every human alive. After all, all human data ever created is only a little more than halfway up the Spanish Steps.

As a final exercise to give us a sense of what logarithmically sits between 1 and 2^266 (all the atoms in the observable universe), I'll compile it all into two handy reference tables, starting with Human-scale:

| Scale | Count | log₂(n) | Spanish Steps |
| --- | --- | --- | --- |
| LLM context window (1M tokens) | ~10^6 | 20 | 15% up the stairs |
| Unicorn valuation ($1B) | ~10^9 | 30 | 22% up the stairs |
| All humans alive today | ~8×10^9 | 33 | 24% up the stairs |
| All humans ever lived | ~10^11 | 37 | 27% up the stairs |
| All internet data (bytes) | ~10^23 | 77 | 57% up the stairs |

Extending beyond humanity, I'll add in cosmological structures:

| Scale | Estimated Atoms | log₂(n) | Spanish Steps |
| --- | --- | --- | --- |
| Earth | ~10^50 | 166 | Up and down 31 steps |
| Solar System | ~10^57 | 189 | Up and down 54 steps |
| Nebula (typical) | ~10^60 | 199 | Up and down 64 steps |
| Milky Way Galaxy | ~10^68 | 226 | Up and down 91 steps |
| Local Group | ~10^72 | 239 | Up and down 104 steps |
| Virgo Supercluster | ~10^75 | 249 | Up and down 114 steps |
| Observable Universe | ~10^80 | 266 | Up and down (2 sets of stairs) |
| Universe of Universes | ~10^160 | 532 | Up and down twice (4 sets of stairs) |
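
The Spanish Steps column follows directly from log₂(n) and the 135 steps of the actual staircase in Rome (135 is the commonly cited count and the one these tables appear to use). A small sketch, with names and formatting of my own choosing:

```python
import math

SPANISH_STEPS = 135  # commonly cited number of steps on Rome's Spanish Steps

def spanish_steps(n: float) -> str:
    """Express log2(n) as a walk up (and past) the Spanish Steps."""
    steps = round(math.log2(n))
    if steps <= SPANISH_STEPS:
        return f"{steps} steps ≈ {steps / SPANISH_STEPS:.0%} up the stairs"
    flights, extra = divmod(steps, SPANISH_STEPS)
    return f"{steps} steps ≈ {flights} full flight(s) plus {extra} more"

print(spanish_steps(1e11))  # all humans ever lived: ~27% up the stairs
print(spanish_steps(1e50))  # Earth's atoms: 1 full flight plus 31 more
print(spanish_steps(1e80))  # observable universe: 1 flight plus 131 more (~2 flights)
```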

I find something about this comforting. The universe is so unimaginably, incomprehensibly vast, and yet there are paths for us to structure it, make sense of it, and explore its immense depth.

-----

Part 3 of this series: What the cosmos teaches us about quadratic growth and LLM context windows 

The Binary Tree of the Universe

A magician approaches you and asks, “Excuse me, have you ever played 20 questions? As a magician, I have spent my life counting every atom in the universe and I now know how many...


Purity is the enemy of goodness

One of the great sources of evil in our time is the pursuit of purity. Purity of thought. Purity of practice. Purity of process. Purity of ideology. The language of tyrants, ren...
