"Through the Looking Glass: From MaxEnt to Transformers"
March 05, 2026
I spent a decade building dialogue systems that couldn't talk.
That's dramatic, but it's true. From roughly 2006 to 2015, I worked at USC's Institute for Creative Technologies, an FFRDC in Playa Vista where the military, Hollywood, and academia collided in the pursuit of "Virtual Humans." Our mission was immersive simulation and training: avatar-driven systems that could hold conversations with soldiers, medical learners, and anyone who needed to practice a difficult human interaction.
The technology under the hood? Maximum entropy classifiers. Supervised learning. Hand-crafted intent categories. Static dialogue trees with enough branching logic to feel responsive but trapped in a cage of predefined paths. We were sorcerers in the old tradition, inscribing runes by hand, hoping the incantation would cover enough cases to pass as magic.
It worked. Sort of. Well enough to get funded, well enough to publish, well enough to help people practice hard conversations. But every practitioner in that building knew the walls. You could feel them when a user said something your classifier had never seen in training. When the dialogue manager hit a state it didn't have a transition for. When meaning slipped through the cracks of your ontology like water through a sieve.
The Symbolic Detour
After ICT, I joined a startup spun out of Caltech and JPL. Their thesis was bold: take the symbolic reasoning systems that had guided the Mars Rover (formal logic, knowledge graphs, structured inference) and apply them to conversational AI. Real AI, they said. Not statistical pattern matching. Reasoning from first principles.
I wanted to believe. The elegance of symbolic systems is seductive. Clean axioms. Provable properties. The satisfying click of logic locking into place.
But language doesn't click. Language oozes. It's ambiguous by design, contextual by nature, and hostile to formalization. The symbolic approach produced systems that could reason about narrow, well-defined worlds and shattered on contact with the messy, contradictory, creative thing that humans do when they open their mouths.
That startup didn't go anywhere. And then the Big Bang struck.
The Big Bang
Between 2017 and 2020, the field I'd spent my career in underwent a phase transition. Not gradual improvement. A discontinuity. The rules changed so that expertise accumulated under the old regime became both invaluable context and a potential liability.
The transformer architecture, introduced in a 2017 paper titled "Attention Is All You Need," didn't improve on existing approaches. It dissolved the bottlenecks that had defined the field for decades. Sequential processing, gone. Fixed-context windows, gone. The compression-to-a-single-vector problem that haunted encoder-decoder models, replaced with a mechanism where every token could attend to every other token, in parallel, learning what to pay attention to as part of training.
And then they scaled it.
GPT-2 in 2019 could write coherent paragraphs. GPT-3 in 2020 could learn new tasks from a handful of examples in the prompt, no retraining required. By 2023, models were writing code, passing bar exams, and generating prose that fooled human evaluators. By 2025, they were reasoning through multi-step problems with what can only be described as something that looks like thought.
I watched from the practitioner's side of the glass. I could use these systems. I could build with them. I could evaluate their outputs with the hard-won intuition of someone who'd spent years wrestling with language understanding. But I couldn't explain how they worked. Not in the way I could explain a MaxEnt classifier or a dialogue state tracker.
That gap bothered me.
Asking the Oracle
I did something that would've sounded like science fiction in my ICT days: I asked the technology itself to teach me its own origin story.
I sat down with Claude, an instance of the architecture I wanted to understand, and laid out my background. The years at ICT. The symbolic AI dead end. The dual CS/Neuroscience degree that gives me enough cognitive science to be dangerous. The two and a half decades of engineering practice. And the honest admission that my near-fifty-year-old brain isn't running the same gradient descent it did at twenty-two.
Claude returned a structured learning plan: eight phases covering the conceptual arc from basic neural network foundations through the current frontier of mechanistic interpretability. Not a curriculum for a PhD student. A practitioner's path. Emphasis on intuition over proofs, on "why did this work" over "derive this from first principles."
Here's the shape of the revolution.
The Four Leaps
The journey from MaxEnt classifiers to modern LLMs isn't one leap. It's four, stacked. Each one makes incremental sense. The miracle is what happens when you stack them and add scale.
Leap One: From Symbols to Vectors. Word2Vec (2013) showed that training a simple neural network to predict words from their neighbors produces internal representations that capture semantic meaning as geometric relationships. "King minus man plus woman equals queen" wasn't programmed. It emerged from the training objective. Meaning doesn't need to be hand-encoded. It can be learned from patterns of co-occurrence.
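The arithmetic behind that famous analogy is easy to see with toy vectors. The embeddings below are hand-picked so the analogy holds (real Word2Vec vectors are learned, hundreds of dimensions wide, and nobody authors them by hand); the point is only the mechanics: subtract, add, find the nearest neighbor by cosine similarity.

```python
import numpy as np

# Toy 3-d embeddings, authored so the analogy works. In real Word2Vec
# these geometric relationships emerge from training, not from design.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "banana": np.array([0.2, 0.3, 0.2]),  # a distractor
}

def nearest(vec, exclude):
    """Find the word whose vector has highest cosine similarity to vec."""
    best, best_sim = None, -2.0
    for word, v in emb.items():
        if word in exclude:
            continue
        sim = vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# "king" - "man" + "woman" lands closest to "queen".
analogy = emb["king"] - emb["man"] + emb["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))
```

The astonishing part isn't the arithmetic. It's that real training objectives produce vector spaces where this arithmetic works without anyone asking for it.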
Leap Two: From Static to Sequential. Recurrent Neural Networks and LSTMs introduced the ability to process sequences, to maintain a hidden state that accumulates context as tokens flow through. The word "bank" could mean different things depending on what came before it. But RNNs process tokens one at a time, and their memory degrades over distance. There was a horizon to what they could hold in mind.
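The whole recurrence fits in a few lines. This is a sketch with random weights standing in for learned parameters and arbitrary toy dimensions; the essential move is that every token's information must pass through the single hidden state `h`:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8  # toy sizes, chosen arbitrarily

# Random matrices stand in for learned weights.
W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))

def rnn_step(h, x):
    # One step of recurrence: the new state mixes the previous state
    # with the incoming token's representation.
    return np.tanh(W_hh @ h + W_xh @ x)

h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):  # a five-"token" sequence
    h = rnn_step(h, x)  # context accumulates, strictly one token at a time
```

Everything the model knows about the sequence lives in `h`, repeatedly squashed through that `tanh`. That is exactly why memory degrades with distance: early tokens survive only as faint traces of many successive transformations.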
Leap Three: Attention as Soft Lookup. Bahdanau attention (2014) broke the bottleneck. Instead of forcing the entire input sequence into a single fixed-size vector, the model could learn to look back at any part of the input at each step of generation. A learned, differentiable, soft dictionary lookup, and the conceptual ancestor of everything that followed.
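The "soft dictionary lookup" framing translates almost literally into code. In this minimal sketch the keys, values, and query are toy constants; in a real model all three are learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# A hard dictionary returns exactly one value. A soft lookup blends
# ALL values, weighted by how well each key matches the query.
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # one per input token
values = np.array([[10.0], [20.0], [30.0]])
query  = np.array([1.0, 0.0])  # the decoder's current "question"

weights = softmax(keys @ query)  # higher dot product = better match
context = weights @ values       # a weighted average, fully differentiable
```

Because every step is differentiable, the model can learn *what to look for* by gradient descent, instead of being told where to look.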
Leap Four: Self-Attention and Parallelism. The transformer made attention the entire mechanism. Every token attends to every other token. No recurrence. No sequential processing. Massive parallelism on GPUs. And O(1) path length between any two tokens in a sequence, versus O(n) for RNNs. Information doesn't traverse a gauntlet of hidden states. It flows.
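Scaled dot-product self-attention, the core of the transformer, is a remarkably short function. This sketch uses random toy weights and a single head; real implementations add causal masking, multiple heads, and far larger dimensions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Every token derives a query, a key, and a value from itself.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # All token pairs scored at once: no recurrence, no sequential pass.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output mixes information from every token

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(6, d))  # six "tokens" in a toy embedding space
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Note what the single matrix product `Q @ K.T` buys you: every token-to-token interaction computed in one parallel shot, which is the O(1) path length (and the GPU-friendliness) the paragraph above describes.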
The Unreasonable Effectiveness of Compression
This is the part that still breaks my brain.
A language model's training objective is almost comically simple: predict the next token. Given a sequence of text, assign a probability distribution over what comes next. Get better at this prediction. Repeat, billions of times, over trillions of tokens of human-generated text.
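You can see the objective in miniature with a bigram model, the simplest possible next-token predictor. This toy counts transitions in a nine-word corpus; a transformer does the same job with billions of parameters in place of a count table, but the loss it minimizes has the same shape:

```python
import math

# A bigram "language model": P(next | current), counted from a tiny corpus.
corpus = "the cat sat on the mat the cat ran".split()
counts = {}
for cur, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(cur, {}).setdefault(nxt, 0)
    counts[cur][nxt] += 1

def next_token_probs(word):
    # Normalize counts into a probability distribution over next tokens.
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

# Training minimizes exactly this quantity: the negative log probability
# the model assigned to the token that actually came next.
probs = next_token_probs("the")   # here: {"cat": 2/3, "mat": 1/3}
loss = -math.log(probs["cat"])
```

That's the entire objective. Everything else in the essay is about what has to happen inside the model for this number to keep going down across all of human text.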
The thesis, validated by experiment but not yet explained by theory, is that to become good at next-token prediction across all of human language, a model must build internal representations that capture something deep about the world that language describes. You can't predict what comes next in a physics textbook without modeling physics. You can't predict the next move in a negotiation transcript without modeling human psychology. You can't predict the resolution of a logic puzzle without doing logic.
It's compression across all of human language. And what falls out of that compression, at sufficient scale, functions as reasoning.
Whether it is reasoning, in the way my neuroscience half wants to define that word, remains one of the most fascinating open questions I've encountered in twenty-five years of building things.
The Road Ahead
I'm working through this material at a pace appropriate for someone with a day job, a family, and a brain that needs more repetitions than it used to. The plan spans eight weeks, from linear algebra refreshers through the transformer architecture itself, into the scaling laws that produced emergent capabilities, and into alignment: how you turn a text-completion engine into something that tries to be helpful, honest, and safe.
My key resources: Jay Alammar's illustrated blog posts for visual intuition. Andrej Karpathy's "Let's build GPT from scratch" for the implementer's understanding. 3Blue1Brown for the mathematical foundations. And Claude itself as a patient, always-available study partner, an arrangement that is as philosophically strange as it sounds.
I'll write more as I go. This is the beginning of the journey, not the end.
For Fellow Practitioners
If you recognize yourself here, if you built NLP systems before the revolution, if you wrestled with feature engineering and classification pipelines and hand-authored grammars, your experience isn't obsolete. It's context.
You understand why these problems are hard in a way that someone who's only fine-tuned a pretrained model never will. You've felt the walls. You've built systems that worked well enough to reveal how much further there was to go.
The transformer didn't solve AI. It solved representation learning and sequence processing well enough that, combined with massive scale and self-supervised learning, capabilities emerged that nobody predicted, including the people who built them.
I'm through the looking glass now. Come along if you want.
This article was born from a conversation with Claude about my own learning journey. The irony of asking an AI to help me understand how AI works is not lost on me, and is rather the point.