Can mathematical reasoning be compressed into a short, human-readable cheatsheet that makes a small open model reason like a much larger one? That is the bet behind the SAIR Foundation Mathematics Distillation Challenge. I entered Stage 1 and finished 29th of 1000+ entrants (1,061 participants), with a cheatsheet that was both smaller and cheaper than the first-place entry: 7.44 KB at about $0.00037 per problem, versus 9.83 KB at $0.00060.
This is my research thesis in miniature: efficiency times reliability, and how cheaply you can preserve real reasoning. Here is what the challenge was, and how I built the cheatsheet.
The Challenge
The competition is organized by the SAIR Foundation, with Damek Davis (Associate Professor of Statistics and Data Science, University of Pennsylvania) and Terence Tao (Fields Medalist, UCLA, co-founder of SAIR). It is inspired by Honda, Murakami, and Zhang (2025), Distilling Many-Shot In-Context Learning into a Cheat Sheet. The twist: instead of a single model query producing the cheatsheet, the cheatsheets are discovered through an open competition.
The core question is simple to state: can strong mathematical reasoning be distilled into a compact, human-readable cheatsheet that improves LLM performance on formal tasks?
The Task: Equational Implication Over Magmas
A magma is a set with an arbitrary binary operation *. Nothing else is assumed: no associativity, no commutativity, no identity, no cancellation. The pilot task, drawn from Terence Tao's Equational Theories Project (a list of 4,694 laws and their implication graph), is:
Given two equational laws, does Equation 1 imply Equation 2 over all magmas?
Example: E_4 (x = x * y) implies E_3 (x = x * x).
This is deceptively hard. The answer is a universally quantified statement over every possible magma, so a single counterexample magma settles a FALSE, but proving TRUE means the implication holds in structures you never explicitly checked.
What Makes It a Distillation Problem
The constraints are where the research lives:
- Cheatsheet cap: 10 KB. The whole strategy has to fit in a few thousand tokens.
- Small, general models. Stage 1 evaluates on GPT-OSS-120B, Llama-3.3-70B-Instruct, and Gemma, weighted equally. These are not reasoning-specialized.
- No tools. Offline evaluation gives the model no browser, no web search, no Lean, no code execution.
- Balanced and held out. The evaluation set is 50% TRUE and 50% FALSE, and is different from the 1,669 public problems, so you cannot memorize.
- Cost and latency are scored. The recommendation is under $0.01 and under 10 minutes per problem; exceeding either hurts your rank.
So the cheatsheet must make a general open model decide a genuinely hard logic question, cheaply, with no external help, on problems it has never seen. Dumping in-context examples does not survive that. The model needs a procedure.
My Approach: A Two-Phase Decision Procedure
I wrote the cheatsheet as an explicit algorithm the model executes, not a pile of examples. It has two phases: a strict rule layer that commits only when a rule fires cleanly, and a structural classifier used as a fallback when the rules are silent.
Phase 1: Strict rules (commit only on a clean firing)
- Immediate TRUE (syntactic). Eq1 and Eq2 are the same law up to consistent renaming or side-swapping; or Eq2 is just
t = t. - Trivialization. If Eq1 forces the magma to have a single element (for example
x = y * z, where the leftmost and rightmost leaves of the right-hand side are both different fromx), then everything follows and the answer is TRUE. - Canonical projection shortcuts. The literal forms
x * y = z * w(constant),x = x * y(left-projection), andx = y * x(right-projection) have closed-form implication tests that read off the leaf structure of Eq2. A crucial discipline: these apply only to the exact literal form. A look-alike likex = y * (y * x)is strictly stronger and must fall through. - Witness-magma counterexample search. A curated library of size-2 and size-3 magmas (left and right projection, AND-to-zero, negate-left, negate-right, XOR, and the Z/3 add, subtract, and idempotent tables), each with an O(1) shortcut for "does this law hold here." If a witness satisfies Eq1 but breaks Eq2, the answer is FALSE with an explicit counterexample.
Phase 2: Structural fallback (when Phase 1 is silent)
When no rule fires and no witness refutes, I compute a few variable-occurrence features of Eq1, for example: does some variable appear exactly once; is the left-hand side a bare variable, a single product, or a nested product. These feed a five-rule heuristic prior:
Q1. bare LHS + a once-occurring variable -> TRUE
Q2. simple LHS + a once-occurring variable -> TRUE
Q3. nested (complex) LHS -> FALSE
Q4. every variable occurs twice or more -> FALSE
Q5. otherwise -> FALSE
The Q-rules are an honest prior, not a proof. They are applied only after Phase 1 produces no verdict, and the cheatsheet says so explicitly so the model does not over-trust them.
The output contract keeps the model disciplined and cheap: reasoning under 200 words, name the rule that fired, then end with a parseable verdict.
VERDICT: TRUE or VERDICT: FALSE
If TRUE: BASIS: <rule> brief reason.
If FALSE: COUNTEREXAMPLE: <magma + failing assignment>, or BASIS: <fallback>.
The Accuracy, Cost, Size Frontier
The design philosophy was to sit on the accuracy-cost-size frontier. Deterministic rules go first because they are cheap and exact when they fire. The witness library is the only "search," and it is tiny and constant-time. The heuristic prior mops up the residual cases without an expensive chain of thought. Everything fits in 7.44 KB.
That combination is what produced the result: 29th of 1000+, with an artifact that beat the first-place entry on both axes I care about most: it was smaller (7.44 KB vs 9.83 KB) and cheaper ($0.00037 vs $0.00060 per problem).
The Numbers
On the Stage 1 leaderboard, my submission (EQT01-T00201) scored an overall 60.2% accuracy and 68.3% F1, at 99.7% mean parse success and $0.00037 per problem, all from one 7.44 KB cheatsheet. The per-model, per-difficulty breakdown across the three evaluation models:
| Model | Set | Accuracy | F1 | Parse | Cost |
|---|---|---|---|---|---|
| GPT-OSS 120B | normal | 85.2% | 87.0% | 99.5% | $0.00023 |
| GPT-OSS 120B | hard | 62.7% | 71.1% | 99.8% | $0.00023 |
| GPT-OSS 120B | extra_hard | 36.3% | 52.7% | 99.8% | $0.00025 |
| Gemma 4 31B IT | normal | 78.3% | 81.8% | 100.0% | $0.00047 |
| Gemma 4 31B IT | hard | 61.8% | 71.6% | 100.0% | $0.00046 |
| Gemma 4 31B IT | extra_hard | 44.0% | 61.1% | 100.0% | $0.00047 |
| Llama 3.3 70B | normal | 74.3% | 72.2% | 99.8% | $0.00042 |
| Llama 3.3 70B | hard | 56.2% | 62.4% | 99.5% | $0.00042 |
| Llama 3.3 70B | extra_hard | 43.2% | 55.1% | 99.0% | $0.00040 |
The shape is what I wanted. Strong on the normal set (GPT-OSS-120B at 85.2%), holding on hard, and degrading gracefully on extra_hard, while parse success stays near-perfect (99 to 100%) so the verdicts are almost always machine-readable, at roughly two to five hundredths of a cent per problem. The cheatsheet is model-agnostic: the same 7.44 KB drives all three open models.
Why I Cared
My research is about efficiency times reliability for foundation models, with a deep focus on the reliability of quantized models. This challenge is the same question in a different medium: not "how few bits can hold the weights," but "how few kilobytes of instruction can hold the reasoning." Distilling a decision procedure into a cheatsheet that a small open model can execute, cheaply and correctly, is cost-efficient reasoning made concrete. It is the kind of frontier I keep coming back to.
Stage 1 prompts may be published for community learning, and Stage 2 (harder benchmark, larger cheatsheets) is where this goes next. I am looking forward to it.
Competition: SAIR Foundation Mathematics Distillation Challenge, Equational Theories, Stage 1. Based on Terence Tao's Equational Theories Project.