# Verifiable LLM inference: a short summary

*by Einar Holt, Founder & Partner at tenki | May 2026*

## What problem are we solving?

When a language model gives medical advice, scores a loan, or drafts a regulatory filing, someone has to take the answer at face value. There is no good way for a regulator, an auditor, or a downstream system to confirm that the response really came from the model that was promised, run on the input that was provided, without re-executing the model and seeing both the weights and the prompt. Existing cryptographic tools (zero-knowledge proofs over LLM forward passes) can do this for *one token at a time*, but they do not yet compose to a full chat completion. A 1000-token answer on a 13B model would take roughly **ten days of prover work** under naive composition, which is not a service anyone can ship.
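The ten-day figure is easy to sanity-check. The per-token proving time below is an illustrative assumption chosen to reproduce the stated total, not a measured number:

```python
# Back-of-envelope check on the naive-composition cost quoted above.
# 14.4 min/token is an assumed per-token ZK proving cost for a 13B model,
# picked to be consistent with the stated ~10-day total for 1000 tokens.
PER_TOKEN_PROVER_MINUTES = 14.4
TOKENS = 1000

total_minutes = PER_TOKEN_PROVER_MINUTES * TOKENS
total_days = total_minutes / (60 * 24)
print(f"{total_days:.1f} days of prover work")  # → 10.0 days
```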

This research proposes three new constructions and a new threat model that close that gap, and validates them on real Qwen 2.5 and Qwen 3.5 models.

## What did we build?

**PSI-LM** is the simplest of the three. It is a *cut-and-choose* protocol: the prover commits to per-token logits, the verifier randomly challenges a small subset of positions, and only those positions get full cryptographic proofs. Soundness comes from the union bound. The result is roughly a **32 times speedup** over naive autoregressive ZK at the cost of probabilistic, not absolute, soundness.
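The soundness of cut-and-choose can be sketched with the standard hypergeometric calculation (our own illustration, not the paper's exact bound): if a cheating prover corrupts c of n positions and the verifier challenges k uniformly at random, the cheat is caught unless all k challenges miss.

```python
from math import comb

def detection_prob(n: int, k: int, c: int) -> float:
    """Probability that at least one of k uniformly chosen challenge
    positions (out of n total) lands on one of c cheated positions."""
    if c == 0:
        return 0.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A prover who corrupts 50 of 1000 tokens is caught more than half the
# time with only 16 challenges...
bulk = detection_prob(n=1000, k=16, c=50)
# ...but a single flipped token slips through 98.4 percent of the time.
single = detection_prob(n=1000, k=16, c=1)
```

The second case is exactly the single-token weakness that motivates VEWC later in this summary.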

**AsymVZK** breaks a foundational assumption. Existing systems treat the verifier as fully blind. We argue that for many real applications the verifier may reasonably hold a *small public draft model* and run it locally for free. We formalize this as a new threat model called **asymmetric-knowledge zero-knowledge (AKZK)**, a strict generalization of the classical Goldwasser-Micali-Rackoff framework. AsymVZK then uses *speculative decoding* (a plaintext serving optimization from 2023) under a cryptographically committed accept/reject test. The verifier replays the draft model at every position, free of cryptographic cost, and only invokes a full zero-knowledge proof at a small handful of accepted and rejected positions. Verifier work scales with how often the draft and large models *disagree*, not with how long the answer is.
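A toy sketch of the committed accept/reject transcript, under stated assumptions: the 0.78 acceptance rate is one point inside the reported 70-86 percent range, the SHA-256 commitment stands in for whatever commitment scheme the real protocol uses, and for simplicity we count only rejected positions as needing full proofs (the protocol also proves a sampled subset of accepted ones).

```python
import hashlib
import random

random.seed(0)

N_TOKENS = 1000
ACCEPT_RATE = 0.78  # assumed; within the 70-86% range reported for Qwen 2.5 pairs

# Prover: run speculative decoding, record accept/reject per position, and
# commit to the transcript so it cannot be altered after seeing challenges.
transcript = [random.random() < ACCEPT_RATE for _ in range(N_TOKENS)]
commitment = hashlib.sha256(bytes(transcript)).hexdigest()

# Rejected positions are where draft and target disagree; accepted ones the
# verifier replays for free on its local draft model.
proofs_needed = transcript.count(False)
print(commitment[:16], proofs_needed)
```

The point of the simulation is the scaling: proof count tracks the disagreement rate, not the answer length.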

**VEWC** is the deepest of the three and addresses a known weakness of cut-and-choose. Against an adversary that targets a single semantically loaded token (flipping a "not", changing a number), uniform challenges are weak. VEWC introduces a new *semantic-budget adversary* class measured in bits of Shannon entropy, a new theorem giving a strictly tighter soundness bound under entropy-weighted challenges, and a new primitive class called *verifiable entropy commitments*. The adaptive challenge distribution is itself cryptographically committed, so an adversary cannot manipulate it.
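The core idea of entropy weighting can be sketched in a few lines (our own illustration; VEWC's committed challenge distribution and soundness theorem are richer, and the toy transcript below is an assumed example, not data):

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy transcript: nine near-deterministic positions plus one semantically
# loaded, high-entropy position at the end.
positions = [[0.99, 0.01]] * 9 + [[0.5, 0.5]]
weights = [entropy(p) for p in positions]

# Sample k challenges proportionally to per-position entropy.
random.seed(1)
k = 4
challenges = random.choices(range(len(positions)), weights=weights, k=k)

# Chance of hitting the single high-entropy lie at least once, with
# entropy-weighted versus uniform challenges (i.i.d. draws).
p_hit_weighted = 1 - (1 - weights[-1] / sum(weights)) ** k
p_hit_uniform = 1 - (1 - 1 / len(positions)) ** k
```

Even in this ten-position toy, the entropy-weighted distribution concentrates the challenge budget where a semantic-budget adversary has room to lie.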

## What did we find empirically?

We tested all three on real HuggingFace models running on a laptop CPU.

- On **Qwen 2.5-1.5B paired with Qwen 2.5-0.5B**, the speculative-acceptance rate sits between **70 and 86 percent** across prompt domains spanning code, translation, reasoning, and definitions. This means AsymVZK invokes cryptography at only roughly a quarter of positions.

- Empirical detection rates against a cheating prover match the theoretical hybrid soundness bound across the full grid.
- On **Qwen 3.5-4B paired with Qwen 3.5-0.8B** (released March 2026), VEWC catches a worst-case semantic-content adversary at **11 times the rate of uniform challenges** with the same challenge budget at low k. At k=4 challenges against a single high-entropy lie, VEWC reaches 68.8 percent detection where uniform reaches 6.2 percent.
- VEWC comes with an honest cost: against an adversary that strategically lies at *low*-entropy positions, its detection rate drops to zero. Entropy-weighted and uniform challenges are therefore complementary, and a hybrid that splits the challenge budget between the two gets the best of both.
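The hybrid split can be put in numbers with a toy calculation (all figures below are illustrative assumptions, not the paper's measurements): half the budget is drawn entropy-weighted, half uniform, against a single-position lie.

```python
# Toy model of the hybrid budget split: k challenges, half entropy-weighted
# and half uniform, over ten positions with one high-entropy outlier.
def p_catch(lie_weight, total_weight, draws):
    """Prob. that at least one of `draws` i.i.d. challenges hits the lie."""
    return 1 - (1 - lie_weight / total_weight) ** draws

n = 10
entropies = [0.08] * 9 + [1.0]   # nine low-entropy positions, one high
total = sum(entropies)
k = 4                            # total challenge budget, split 2 + 2

# High-entropy lie: the entropy-weighted half does most of the work.
hi = 1 - (1 - p_catch(1.0, total, k // 2)) * (1 - p_catch(1, n, k // 2))
# Low-entropy lie: only the uniform half has a chance of catching it.
lo = 1 - (1 - p_catch(0.08, total, k // 2)) * (1 - p_catch(1, n, k // 2))
```

The hybrid keeps a nonzero floor against the low-entropy adversary while retaining most of the entropy-weighted gain against the high-entropy one.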

## Where can this be used?

The constructions target settings where a third party needs to verify model output without re-running the model and without access to weights or input. Concretely:

- Regulatory audit of on-prem medical or legal LLM deployments.
- DORA-driven oversight of AI-based credit and underwriting decisions.
- Verifiable outsourced AI for Nordic public sector procurement.
- On-chain LLM services where smart contracts verify responses.
- Forensic re-verification of historical inferences after an incident.

## What we did not build

The cryptographic primitives are stubbed in the proof of concept: the per-position consistency proof is implemented as model re-evaluation rather than a full SNARK. We rely on published zkLLM cost figures for the production-cost projections. Folding of lookup arguments, which would tighten the construction further, is an open research problem in its own right and is out of scope.

## Why now?

Three forces meet at this point in 2026: AI deployments in regulated sectors are scaling fast, the EU AI Act and DORA are entering force, and ZKML-for-transformers has just become practical at billion-parameter scale. The remaining gap, autoregressive composition, is exactly the gap our constructions close.
