Solomonoff Induction
When you leave your room 15 minutes before class starts, you're implicitly predicting how long the walk will take, whether the elevator will be prompt, and if there will be a red light at the crossing. When you check your phone based on a feeling, you're predicting there might be a notification. When you bring an umbrella despite the clear sky because the weather app says 40% chance of rain, you're weighing competing prediction models.
Some of these predictions work out; others don't. The NUS shuttle bus that was supposed to arrive at 8:45 am according to the tracking app somehow vanishes from existence. The Japanese stall at Flavours that's never crowded suddenly has a 30-minute queue the one day you're running late. The professor who never checks attendance decides today is the day.
It might be nice to have a universal algorithm for making optimal predictions.
What is Solomonoff Induction?
Solomonoff induction, developed by Ray Solomonoff in the 1960s, is a mathematical formalization of Occam's razor. It provides a universal way to predict future observations based on past observations, given absolutely no prior domain knowledge.
The core idea is brilliantly simple and frustratingly impractical:
- Consider all possible computer programs that could generate your observed data
- Weight each program by 2^(-length of the program in bits)
- Use this weighted mixture to predict what comes next
In other words, shorter programs get exponentially more weight than longer ones. This makes intuitive sense - simpler explanations are more likely to be correct.
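To make that weighting precise, here is the standard formulation from the literature (the notation below is the usual one, not something defined earlier in this post): writing U for a universal prefix machine, the Solomonoff prior of a string x sums 2^(-length) over every program whose output starts with x, and prediction is just a ratio of priors:

```latex
M(x) \;=\; \sum_{p \;:\; U(p)\ \text{starts with}\ x} 2^{-|p|},
\qquad
M(b \mid x) \;=\; \frac{M(xb)}{M(x)}
```

Short programs dominate the sum, which is exactly the "shorter programs get exponentially more weight" rule stated above.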
A Concrete Example
Let's say you observe the sequence: 2, 4, 6, 8, ...
What comes next? A human would say 10, because we recognize this as "add 2 each time."
Solomonoff induction formalizes this intuition. It considers:
- Program 1: "Print 2, 4, 6, 8, 10, ..." (long, explicit listing)
- Program 2: "For n=1 to infinity, print 2n" (short, captures the pattern)
- Program 3: "Print 2, 4, 6, 8, 9, ..." (long, arbitrary sequence)
- ...and infinitely many other possibilities
Since Program 2 is much shorter than the others, it gets much more weight in the prediction. So Solomonoff induction predicts 10, just like we would.
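A toy version of this weighting can be sketched in code. To be clear, the hypothesis class and the bit lengths below are hand-picked for illustration; real Solomonoff induction enumerates every program, which is exactly what makes it uncomputable (more on that shortly).

```python
# Toy Solomonoff-style prediction over a hand-picked hypothesis class.
# The "programs" and their bit lengths are made up for illustration only.
from collections import defaultdict

observed = [2, 4, 6, 8]

# Each hypothesis: (description, k-th term generator, assumed length in bits).
hypotheses = [
    ("for n=1..: print 2n",  lambda k: 2 * (k + 1),          20),
    ("for n=1..: print 2^n", lambda k: 2 ** (k + 1),         22),
    ("print 2,4,6,8,9,...",  lambda k: [2, 4, 6, 8, 9][k],   45),
    ("print 2,4,6,8,10,...", lambda k: [2, 4, 6, 8, 10][k],  47),
]

# Keep only hypotheses consistent with the data, each weighted by 2^-length.
posterior = defaultdict(float)
total = 0.0
for name, gen, bits in hypotheses:
    if all(gen(k) == v for k, v in enumerate(observed)):
        weight = 2.0 ** -bits
        total += weight
        posterior[gen(len(observed))] += weight   # this program's next term

for nxt, w in sorted(posterior.items()):
    print(f"P(next = {nxt}) ~ {w / total:.3g}")
# "print 2^n" is eliminated (its third term would be 8, not 6); the short
# "print 2n" program dominates, so P(next = 10) comes out essentially 1.
```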
Why It Matters
Solomonoff induction has some remarkable properties:
- It converges on the true generating process: the total expected prediction error over all time is bounded by a constant that depends only on the complexity of the source, so predictions improve rapidly as you observe more data (a formal version of this bound is sketched just after this list)
- It's optimal in a specific mathematical sense (no other prediction method can do better in general)
- It provides a principled way to handle both deterministic and probabilistic sequences
- It's deterministic (I love determinism)
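To pin down the first bullet, the usual statement of Solomonoff's convergence result (paraphrased from the standard literature, not from anything earlier in this post) is: if the observations are drawn from any computable distribution μ, the universal mixture M's total expected squared prediction error is finite and bounded by the complexity of μ:

```latex
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[\big(M(x_t{=}1 \mid x_{<t}) - \mu(x_t{=}1 \mid x_{<t})\big)^{2}\right]
\;\le\; \frac{\ln 2}{2}\, K(\mu)
```

A finite total error over infinitely many predictions forces the per-step error to zero, and simpler sources (smaller K(μ), the length of the shortest program computing μ) give tighter bounds.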
It's essentially the theoretical upper limit of what machine learning can achieve without domain-specific knowledge.
The Catch
Remember when I said "brilliantly simple and frustratingly impractical"? Here's the rub: Solomonoff induction is completely uncomputable.
The problem is that to actually implement it, you'd need to:
- Enumerate all possible computer programs (infinite)
- Check which of them actually reproduce your observations and weight their predictions (impossible in general, thanks to the halting problem: you can't even tell which programs will ever finish running)
This is what happens when theoreticians design algorithms without worrying about trivial details like "computational feasibility."
Practical Approximations
Despite being uncomputable, Solomonoff induction serves as an ideal that practical algorithms try to approximate:
- Minimum Description Length (MDL) principle (a crude sketch follows this list)
- Bayesian model averaging
- Deep learning (in some sense)
- The entire field of algorithmic information theory
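As a crude illustration of the MDL idea from the list above, the sketch below scores polynomial models by a two-part code length: bits to state the model plus bits to encode the residuals. The 32-bits-per-coefficient cost and the Gaussian residual term (which drops additive constants shared by all models) are simplifying assumptions of mine, not a standard recipe.

```python
# Crude two-part MDL model selection: description length of the model
# plus description length of the data given the model (up to constants).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)   # truly linear data

def mdl_score(degree):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    model_bits = 32 * (degree + 1)                            # cost of the parameters
    data_bits = 0.5 * x.size * np.log2(resid.var() + 1e-12)   # Gaussian residual code
    return model_bits + data_bits

for d in range(6):
    print(f"degree {d}: MDL score ~ {mdl_score(d):.1f}")
# Degree 1 (a straight line) minimizes the score: extra coefficients
# shrink the residuals too little to pay for their own description.
```

The scores can be negative because shared constants are dropped; only the comparison between models matters.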
Physics as Compression
Newton's laws of physics represent one of the most efficient compression algorithms in scientific history. With just three laws and a handful of equations, we can predict the motion of objects across many orders of magnitude in scale - from a dust particle to an A380.
Consider what these laws compress: interactions between multitudes of individual bodies pushing and pulling against each other. The compression ratio is staggering. Billions of possible physical scenarios are reduced to equations simple enough to fit on a freshman's cheat sheet. And the decompression algorithm is remarkably efficient: give an engineering student these laws and initial conditions, and they can simulate the entire system.
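A toy example of that decompression (the numbers, time step, and simple update rule here are mine, purely for illustration): Newton's second law plus initial conditions is enough to reconstruct an entire trajectory.

```python
# "Decompressing" F = m*a: simulate a projectile from the law plus
# initial conditions, using a simple fixed-step update (illustrative values).
g, dt = 9.81, 0.01                     # m/s^2, s
pos, vel = [0.0, 0.0], [20.0, 20.0]    # m, m/s
t = 0.0

while pos[1] >= 0.0:
    acc = [0.0, -g]                    # the entire "physics model" in one line
    vel = [v + a * dt for v, a in zip(vel, acc)]
    pos = [p + v * dt for p, v in zip(pos, vel)]
    t += dt

print(f"Range ~ {pos[0]:.1f} m after {t:.2f} s")   # roughly 82 m, about 4.1 s
```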
Of course, this compression has lossy elements. At very small scales, quantum effects emerge that Newton never accounted for. At very high speeds, relativistic effects distort the Newtonian predictions. At extreme gravitational fields, spacetime curvature matters. But for the middle scales where humans typically operate, from millimeters to kilometers and from grams to tons, the compression is nearly perfect.
Maxwell's equations achieve an even more impressive compression. The entire domain of classical black magic (read: electromagnetism - every electric motor, generator, antenna, and optical device) is compressed into four elegant equations: ∇·E = ρ/ε₀, ∇·B = 0, ∇×E = -∂B/∂t, and ∇×B = μ₀J + μ₀ε₀ ∂E/∂t.
These few dozen symbols compress countless engineering applications. Every time your phone connects to a cell tower, when your microwave heats food, or when light refracts through your glasses, you're experiencing the decompression of Maxwell's equations into physical reality.
The compression isn't perfect either - quantum electrodynamics reveals where the approximations break down. But the compression ratio and fidelity within their domain of applicability are remarkable.
What makes these physical laws powerful (beautiful) is their compression efficiency.
Cultural and Social Compression
Heritage and tradition compress the life experiences of our ancestors. Proverbs, folktales, and cultural practices are compressed wisdom: heuristics that worked well enough to be preserved. "A stitch in time saves nine" compresses generations of experience about the value of preventative maintenance into six words.
Even language itself is a compression algorithm. When I say "chair," I'm compressing an infinite set of possible chair-like objects into a four-letter word that you can decompress in your mind.
The Closest Real-World Implementation: Context-Mixing Compressors
The closest practical equivalents to Solomonoff induction are context-mixing compressors like PAQ8 and its descendants.
These compressors work by building probabilistic models of the data and combining them to predict the next bit or byte. The better they predict, the better they compress. This is fundamentally the same task as Solomonoff induction: finding patterns that let you make accurate predictions.
What makes these compressors special is how they approach the problem:
- They maintain multiple models of the data (order-n context models, match models, specialized models for particular file formats, and so on)
- Each model makes predictions about what comes next
- The predictions are weighted and combined based on past performance
- The system effectively learns which models work best for different contexts (a toy sketch of this loop follows below)
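Here is that sketch. It shows the idea only, not PAQ8's actual model set or mixer: the two models and the multiplicative weight update below are simplifications chosen for brevity.

```python
# Toy context mixing: two bit-level models vote on the next bit, and their
# mixing weights adapt according to how well each has predicted so far.
import math

bits = [int(b) for b in "1010101010101011101010101010"]   # data with a pattern

counts0 = [1, 1]                      # order-0 model: global 0/1 counts (smoothed)
counts1 = {0: [1, 1], 1: [1, 1]}      # order-1 model: counts keyed by previous bit
weights = [0.5, 0.5]
prev, code_length = 0, 0.0

for b in bits:
    p0 = counts0[1] / sum(counts0)                  # P(next = 1) under model 0
    p1 = counts1[prev][1] / sum(counts1[prev])      # P(next = 1) under model 1
    p = weights[0] * p0 + weights[1] * p1           # mixed prediction
    code_length += -math.log2(p if b else 1.0 - p)  # bits needed to encode b

    # Reweight: each model is scaled by the probability it gave the actual bit.
    scores = [q if b else 1.0 - q for q in (p0, p1)]
    total = sum(w * s for w, s in zip(weights, scores))
    weights = [w * s / total for w, s in zip(weights, scores)]

    counts0[b] += 1
    counts1[prev][b] += 1
    prev = b

print(f"{len(bits)} input bits -> about {code_length:.1f} bits of code length")
```

The better the mixed prediction, the fewer bits the (implicit) arithmetic coder would need, which is exactly the compression-prediction link described next.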
In fact, the best compressor is also the best predictor. Compression and prediction are two sides of the same problem: a model that assigns accurate probabilities to the next symbol can be turned into a compressor through arithmetic coding, and any compressor implicitly defines such a model. Finding regularities to compress data and finding patterns to make predictions are mathematically equivalent.
Conclusion
By formalizing Occam's razor, Solomonoff gives us a mathematical definition of simplicity. It tells us that the best explanation for observed data is the shortest computer program that generates it. This is a possible description of the nature of knowledge itself.
Is the universe actually running on the shortest possible program? Probably not. But the remarkable success of simple physical laws in explaining complex phenomena suggests that reality is at least approximately compressible.
Maybe that's why we find beauty in simplicity. Perhaps aesthetic appreciation of elegant solutions is just our intuitive recognition of good compression. The equations that govern our reality could have been incomprehensibly complex, yet they fit on t-shirts and coffee mugs.
The gap between Solomonoff's ideal and our practical approximations isn't just a limitation of computational resources. It's a reminder of the gap between the map and the territory, between our models and reality itself.