Distilling Math Reasoning into a 7.44 KB Cheatsheet

Can mathematical reasoning be compressed into a short, human-readable cheatsheet that makes a small open model reason like a much larger one? That is the bet behind the SAIR Foundation Mathematics Distillation Challenge. I entered Stage 1 and finished 29th of 1000+ entrants (1,061 participants), with a cheatsheet that was both smaller and cheaper than the first-place entry: 7.44 KB at about $0.00037 per problem, versus 9.83 KB at $0.00060.

This is my research thesis in miniature: efficiency times reliability, and how cheaply you can preserve real reasoning. Here is what the challenge was, and how I built the cheatsheet.

The Challenge

The competition is organized by the SAIR Foundation, with Damek Davis (Associate Professor of Statistics and Data Science, University of Pennsylvania) and Terence Tao (Fields Medalist, UCLA, co-founder of SAIR). It is inspired by Honda, Murakami, and Zhang (2025), Distilling Many-Shot In-Context Learning into a Cheat Sheet. The twist: instead of a single model query producing the cheatsheet, the cheatsheets are discovered through an open competition.

The core question is simple to state: can strong mathematical reasoning be distilled into a compact, human-readable cheatsheet that improves LLM performance on formal tasks?

The Task: Equational Implication Over Magmas

A magma is a set with an arbitrary binary operation *. Nothing else is assumed: no associativity, no commutativity, no identity, no cancellation. The pilot task, drawn from Terence Tao's Equational Theories Project (a list of 4,694 laws and their implication graph), is:

Given two equational laws, does Equation 1 imply Equation 2 over all magmas?

Example: E_4 (x = x * y)  implies  E_3 (x = x * x).

This is deceptively hard. The answer is a universally quantified statement over every possible magma, so a single counterexample magma settles a FALSE, but proving TRUE means the implication holds in structures you never explicitly checked.

What Makes It a Distillation Problem

The constraints are where the research lives:

So the cheatsheet must make a general open model decide a genuinely hard logic question, cheaply, with no external help, on problems it has never seen. Dumping in-context examples does not survive that. The model needs a procedure.

My Approach: A Two-Phase Decision Procedure

I wrote the cheatsheet as an explicit algorithm the model executes, not a pile of examples. It has two phases: a strict rule layer that commits only when a rule fires cleanly, and a structural classifier used as a fallback when the rules are silent.

Phase 1: Strict rules (commit only on a clean firing)

Phase 2: Structural fallback (when Phase 1 is silent)

When no rule fires and no witness refutes, I compute a few variable-occurrence features of Eq1, for example: does some variable appear exactly once; is the left-hand side a bare variable, a single product, or a nested product. These feed a five-rule heuristic prior:

Q1. bare LHS  + a once-occurring variable        -> TRUE
Q2. simple LHS + a once-occurring variable        -> TRUE
Q3. nested (complex) LHS                           -> FALSE
Q4. every variable occurs twice or more           -> FALSE
Q5. otherwise                                      -> FALSE

The Q-rules are an honest prior, not a proof. They are applied only after Phase 1 produces no verdict, and the cheatsheet says so explicitly so the model does not over-trust them.

The output contract keeps the model disciplined and cheap: reasoning under 200 words, name the rule that fired, then end with a parseable verdict.

VERDICT: TRUE or VERDICT: FALSE
If TRUE: BASIS: <rule> brief reason.
If FALSE: COUNTEREXAMPLE: <magma + failing assignment>, or BASIS: <fallback>.

The Accuracy, Cost, Size Frontier

The design philosophy was to sit on the accuracy-cost-size frontier. Deterministic rules go first because they are cheap and exact when they fire. The witness library is the only "search," and it is tiny and constant-time. The heuristic prior mops up the residual cases without an expensive chain of thought. Everything fits in 7.44 KB.

That combination is what produced the result: 29th of 1000+, with an artifact that beat the first-place entry on both axes I care about most: it was smaller (7.44 KB vs 9.83 KB) and cheaper ($0.00037 vs $0.00060 per problem).

The Numbers

On the Stage 1 leaderboard, my submission (EQT01-T00201) scored an overall 60.2% accuracy and 68.3% F1, at 99.7% mean parse success and $0.00037 per problem, all from one 7.44 KB cheatsheet. The per-model, per-difficulty breakdown across the three evaluation models:

ModelSetAccuracyF1ParseCost
GPT-OSS 120Bnormal85.2%87.0%99.5%$0.00023
GPT-OSS 120Bhard62.7%71.1%99.8%$0.00023
GPT-OSS 120Bextra_hard36.3%52.7%99.8%$0.00025
Gemma 4 31B ITnormal78.3%81.8%100.0%$0.00047
Gemma 4 31B IThard61.8%71.6%100.0%$0.00046
Gemma 4 31B ITextra_hard44.0%61.1%100.0%$0.00047
Llama 3.3 70Bnormal74.3%72.2%99.8%$0.00042
Llama 3.3 70Bhard56.2%62.4%99.5%$0.00042
Llama 3.3 70Bextra_hard43.2%55.1%99.0%$0.00040

The shape is what I wanted. Strong on the normal set (GPT-OSS-120B at 85.2%), holding on hard, and degrading gracefully on extra_hard, while parse success stays near-perfect (99 to 100%) so the verdicts are almost always machine-readable, at roughly two to five hundredths of a cent per problem. The cheatsheet is model-agnostic: the same 7.44 KB drives all three open models.

Why I Cared

My research is about efficiency times reliability for foundation models, with a deep focus on the reliability of quantized models. This challenge is the same question in a different medium: not "how few bits can hold the weights," but "how few kilobytes of instruction can hold the reasoning." Distilling a decision procedure into a cheatsheet that a small open model can execute, cheaply and correctly, is cost-efficient reasoning made concrete. It is the kind of frontier I keep coming back to.

Stage 1 prompts may be published for community learning, and Stage 2 (harder benchmark, larger cheatsheets) is where this goes next. I am looking forward to it.

Competition: SAIR Foundation Mathematics Distillation Challenge, Equational Theories, Stage 1. Based on Terence Tao's Equational Theories Project.