Plonky3 recursion book

This book introduces the Plonky3 recursion stack, a Polygon Labs project that aims to provide modularity and flexibility for developers while maintaining high efficiency through its fixed recursive verifier design.

The material is organized to help potential contributors and users understand the motivation, construction and extensions of the recursive verifier.

Introduction

Plonky3 offers a comprehensive toolbox of cryptographic building blocks (hash functions, finite fields, polynomial commitment schemes, and more) for building tailored STARK proof systems. However, its adoption has been limited by the lack of native recursion, which would allow proof systems to replicate arbitrary program execution and would alleviate proof sizes and the related on-chain verification costs.

This project aims to address this limitation by proposing a minimal, fixed recursive verifier for Plonky3, whose conceptual simplicity allows for blazing fast recursion performance. A key distinction from its predecessor plonky2 is that, rather than wrapping a STARK proof in a separate Plonkish SNARK, the Plonky3 recursion stack is itself built from Plonky3’s STARK primitives.

The source code is open-source, available at Plonky3 recursion, and dual-licensed MIT/Apache-2.

NOTE: This project is under active development and unaudited; as such, it is not ready for production use. We welcome all external contributors who would like to support the development effort.

Recursion Approach and Construction

High-level architecture

Recursion in zero-knowledge proofs means using one proof to verify another: an (outer) prover will generate a proof to assert validity of an (inner) STARK proof. By applying this recursively, one obtains a (possibly compact) outer proof that attests to arbitrarily deep chains of computation.

Our approach to recursion for Plonky3 differs from a traditional zkVM approach: there is no program counter, instruction set, or branching logic. Instead, a fixed program is chosen, and the verifier circuit is specialized to this program only.

Why fix the program shape?

  • Performance: without program counter logic, branching, or instruction decoding, the verifier’s constraints are much lighter.

  • Recursion efficiency: since the shape of the trace is predetermined, the recursion circuit can be aggressively optimized.

  • Simplicity: all inputs follow the same structural pattern, which keeps implementation complexity low.

By fixing the program to execute (here, proving the correct verification of some known AIR program(s)), prover and verifier can agree on the entire execution flow of the program in advance. As such, each step corresponds to an instruction known at compile time, with operands that are either known at compile time (constants) or defined by the prover at runtime. This removes all the overhead of handling arbitrary control flow, and makes the resulting AIR statement(s) effectively tailored to the program they represent, as opposed to regular VMs.

Limitations

  • Rigidity: only the supported program(s) can be proven.

  • No variable-length traces: input size must fit the circuit’s predefined structure.

  • Reusability: adapting to a new program requires a new circuit.

The rest of this book explains how this approach is built, how to soften its rigidity, and why it provides a powerful foundation for recursive proof systems.

Execution IR

An Execution IR (intermediate representation) is defined to describe the steps of the verifier. This IR is not itself proved, but serves as the source of truth between prover and verifier to guide trace population. The actual soundness comes from the constraints inside the operation-specific STARK chips, along with an aggregated lookup argument ensuring consistency of the common values they operate on. The lookups can be seen as representing the READ/WRITE operations from/to the Witness table.

The example below represents the (fixed) IR associated with the statement 37 * x - 111 = 0, where x is a public input. It can be reproduced by running:

cargo test --package p3-circuit --lib -- tables::tests::test_toy_example_37_times_x_minus_111 --exact --show-output

A given row of the represented IR contains an operation and its associated operands.

=== CIRCUIT PRIMITIVE OPERATIONS ===
0: Const { out: WitnessId(0), val: 0 }
1: Const { out: WitnessId(1), val: 37 }
2: Const { out: WitnessId(2), val: 111 }
3: Public { out: WitnessId(3), public_pos: 0 }
4: Mul { a: WitnessId(1), b: WitnessId(3), out: WitnessId(4) }
5: Add { a: WitnessId(2), b: WitnessId(0), out: WitnessId(4) }

That is, operation 4 performs w[4] <- w[1] * w[3], and operation 5 encodes the subtraction check as the addition w[2] + w[0] = w[4] (verifying 37 * x - 111 = 0).

In order to generate the IR, the first step is to create all operations symbolically.

In the symbolic executor, the computation is represented as a graph where nodes are called either ExprId (since they represent the index of an expression) or Target in the code. Each Target can be:

  • a constant,
  • a public input,
  • the output of an operation.

The computation graph that represents all operations in the IR is called Circuit.

The CircuitBuilder provides convenient helper functions and macros for representing and defining operations within this graph. See the Building Circuits section for more details on how to build a circuit.

Witness Table

The Witness table can be seen as a central memory bus that stores values shared across all operations. It is represented as pairs (index, value), where the indices are accessed by the different chips via lookups to enforce consistency.

  • The index column is preprocessed, or read-after-preprocess (RAP): it is known to both prover and verifier in advance, requiring no online commitment.1
  • The Witness table values are represented as extension field elements directly (where base field elements are padded with 0 on higher coordinates) for addressing efficiency.
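
For illustration, this embedding can be sketched as follows. The block below is a hypothetical stand-in (a degree-4 extension with plain u64 coordinates is assumed; the actual field types come from Plonky3):

// Illustrative only: a base field element occupies coordinate 0 of the
// extension element, and the higher coordinates are zero-padded.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Ext4([u64; 4]);

fn embed_base(b: u64) -> Ext4 {
    Ext4([b, 0, 0, 0])
}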

From the fixed IR of the example above, we can deduce an associated Witness table as follows:

=== WITNESS TRACE ===
Row 0: WitnessId(w0) = 0
Row 1: WitnessId(w1) = 37
Row 2: WitnessId(w2) = 111
Row 3: WitnessId(w3) = 3
Row 4: WitnessId(w4) = 111

Note that the initial version of the recursion machine contains a Witness table for the sake of simplicity and ease of iteration. However, because the verifier effectively knows the order of each operation and the interactions between them, the Witness table could be removed entirely, with global consistency still enforced at the cost of additional (smaller) lookups between the different chips.

Operation-specific STARK Chips

Each operation family (e.g. addition, multiplication, MMCS path verification, FRI folding) has its own chip.

A chip contains:

  • Local columns for its variables.
  • Lookup ports into the Witness table.
  • An AIR that enforces its semantics.

We distinguish two kinds of chips: those representing native, i.e. primitive, operations, and additional non-primitive ones, defined at runtime, that serve as precompiles to optimize certain operations. The recursion machine contains 4 primitive chips: CONST, PUBLIC_INPUT, ADD, and MUL, with SUB and DIV being emulated via the ADD and MUL chips.

Given only the primitive operations, one should be able to carry out most operations necessary in circuit verification. Primitive operations have the following properties:

  • They operate on elements of the Witness table, through their WitnessId (index within the Witness table).
  • The representation can be heavily optimized. For example, every time a constant is added to the IR, we either create a new WitnessId or return an already existing one. We could also carry out common subexpression elimination.
  • They are executed in topological order during the circuit evaluation, and they form a directed acyclic graph of dependencies.
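
To make this concrete, the sketch below models the primitive IR and its evaluation, mirroring the debug printout from the Execution IR section. All names are illustrative rather than the crate's actual definitions, and plain u64 arithmetic stands in for field arithmetic:

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct WitnessId(u32);

#[derive(Clone, Debug)]
enum PrimitiveOp {
    Const { out: WitnessId, val: u64 },
    Public { out: WitnessId, public_pos: usize },
    Add { a: WitnessId, b: WitnessId, out: WitnessId },
    Mul { a: WitnessId, b: WitnessId, out: WitnessId },
}

// Evaluate the ops in topological order, filling the Witness table.
// Writing to an already-set slot must agree with the existing value
// (write-once semantics); otherwise evaluation fails.
fn execute(ops: &[PrimitiveOp], publics: &[u64], n: usize) -> Option<Vec<u64>> {
    fn set(w: &mut [Option<u64>], id: WitnessId, v: u64) -> bool {
        match w[id.0 as usize] {
            None => {
                w[id.0 as usize] = Some(v);
                true
            }
            Some(old) => old == v,
        }
    }
    let mut w: Vec<Option<u64>> = vec![None; n];
    for op in ops {
        let ok = match *op {
            PrimitiveOp::Const { out, val } => set(&mut w, out, val),
            PrimitiveOp::Public { out, public_pos } => set(&mut w, out, publics[public_pos]),
            PrimitiveOp::Add { a, b, out } => {
                let v = w[a.0 as usize]? + w[b.0 as usize]?;
                set(&mut w, out, v)
            }
            PrimitiveOp::Mul { a, b, out } => {
                let v = w[a.0 as usize]? * w[b.0 as usize]?;
                set(&mut w, out, v)
            }
        };
        if !ok {
            return None;
        }
    }
    w.into_iter().collect()
}

Running this sketch on the toy IR above with public input x = 3 reproduces the WITNESS TRACE shown earlier, including the agreeing double write to w4 by the Mul and Add operations.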

But relying only on primitive operations for the entire verification would lead to the introduction of many temporary values in the IR. In turn, this would lead to enlarged Witness and primitive tables. To reduce the overall surface area of our AIRs, we can introduce non-primitive specialized chips that carry out specific (non-primitive) operations. We can offload repeated computations to these non-primitive chips to optimize the overall proving flow.

These non-primitive operations use not only Witness table elements (including public inputs), but may also require the use of private data. For example, when verifying a Merkle path, hash outputs are not stored in the Witness table.

This library aims to provide a number of non-primitive chips so that projects can natively obtain full recursive verifiers; this implies chips for FRI, MMCS path verification, etc. Specific applications can also build their own non-primitive chips and plug them in at runtime.

Going back to the previous example, prover and verifier can agree on the following logic for each chip:

=== CONST TRACE ===
Row 0: WitnessId(w0) = 0
Row 1: WitnessId(w1) = 37
Row 2: WitnessId(w2) = 111

=== PUBLIC TRACE ===
Row 0: WitnessId(w3) = 3

=== MUL TRACE ===
Row 0: WitnessId(w1) * WitnessId(w3) -> WitnessId(w4) | 37 * 3 -> 111

=== ADD TRACE ===
Row 0: WitnessId(w2) + WitnessId(w0) -> WitnessId(w4) | 111 + 0 -> 111

Note that because we started from a known, fixed program that has been lowered to a deterministic IR, we can have the CONST chip's table entirely preprocessed (i.e. known to the verifier), as well as all index columns of the other primitive chips.

Lookups

All chip interactions are performed via a lookup argument. Enforcing multiset equality between all chip ports and the Witness table entries ensures correctness without proving the execution order of the entire IR itself. Lookups can be seen as READ/WRITE or RECEIVE/SEND interactions between tables, which allow global consistency over local AIRs.

Cross-table lookups (CTLs) ensure that every chip interaction happens through the Witness table: producers write a (index, value) pair into Witness and consumers read the same pair back. No chip talks directly to any other chip; the aggregated LogUp argument enforces multiset equality between the writes and reads.

For the toy example the CTL relations are:2

(index 0, value 0)   : CONST → Witness → ADD
(index 1, value 37)  : CONST → Witness → MUL
(index 2, value 111) : CONST → Witness → ADD
(index 3, value 3)   : PUBLIC → Witness → MUL
(index 4, value 111) : MUL → Witness ← ADD
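
Conceptually, these relations enforce that every interaction with a given Witness row carries the same value. The standalone sketch below illustrates that invariant; the real enforcement is the aggregated LogUp argument over field elements, not an explicit runtime check:

use std::collections::HashMap;

// Check that all (index, value) interactions touching the same Witness
// index agree on the value (write-once memory bus semantics).
fn interactions_consistent(interactions: &[(u32, u64)]) -> bool {
    let mut seen: HashMap<u32, u64> = HashMap::new();
    interactions
        .iter()
        .all(|&(idx, val)| *seen.entry(idx).or_insert(val) == val)
}

// Toy example: the duplicate interaction on index 4 (written by both MUL
// and ADD) is consistent because both carry the value 111:
// interactions_consistent(&[(0, 0), (1, 37), (2, 111), (3, 3), (4, 111), (4, 111)]) == true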

  1. Preprocessed columns / polynomials can be reconstructed manually by the verifier, removing the need for the prover to commit to them and later run the FRI protocol on them. However, the verifier incurs extra work when these columns are not structured, as it still needs to interpolate them. To alleviate this, the Plonky3 recursion stack performs an offline commitment of unstructured preprocessed columns, so that only one instance of the FRI protocol is needed to verify all preprocessed column evaluations.

  2. The ADD and MUL tables both issue CTL writes of their outputs to the same Witness row. Because the Witness table is a read-only / write-once memory bus, the aggregated lookup forces those duplicate writes w4 = 111 to agree, which is exactly the constraint 37 * 3 = 111 = 0 + 111.

Building Circuits

This section explains how the CircuitBuilder is used to build a concrete Circuit for a given program. We’ll use a simple Fibonacci example throughout this page to ground the ideas behind circuit building:

let mut builder = CircuitBuilder::<F>::new();

// Public input: expected F(n)
let expected_result = builder.add_public_input();

// Compute F(n) iteratively
let mut a = builder.add_const(F::ZERO); // F(0)
let mut b = builder.add_const(F::ONE);  // F(1)

for _i in 2..=n {
    let next = builder.add(a, b); // F(N) <- F(N-1) + F(N-2)
    a = b;
    b = next;
}

// Assert computed F(n) equals expected result
builder.connect(b, expected_result);

let circuit = builder.build()?;
let mut runner = circuit.runner();

Building Pipeline

In what follows, WitnessId denotes the identifier of a value in the global Witness storage bus, and ExprId denotes a position identifier in the ExpressionGraph (with the hardcoded constant ZERO always stored at position 0).

Building a circuit proceeds in three successive stages, after which the final circuit is emitted:

Stage 1 — Lower to primitives

This stage will go through the ExpressionGraph in successive passes and emit primitive operations.

  • when going through emitted Const nodes, the builder ensures that no identical constants appear in distinct nodes of the circuit, seen as a DAG (Directed Acyclic Graph), by performing witness aliasing, i.e. tracking node equivalence classes. This makes it possible to prune duplicated Const nodes by replacing further references with the single equivalence-class representative that remains in the DAG, so that equality constraints are enforced structurally, without requiring extra gates (see the sketch after this list).

  • public inputs and arithmetic operations may also reuse pre-allocated slots if connected to some existing node.
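
As a rough sketch of the witness aliasing described above (names hypothetical, not the crate's API), constant deduplication can be implemented with an interning map, so that every occurrence of the same constant resolves to the single equivalence-class representative:

use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct WitnessId(u32);

#[derive(Default)]
struct ConstPool {
    interned: HashMap<u64, WitnessId>, // u64 stands in for a field element
    next_id: u32,
}

impl ConstPool {
    // Return the existing WitnessId for `val`, or allocate a fresh one.
    fn get_or_alloc(&mut self, val: u64) -> WitnessId {
        if let Some(&id) = self.interned.get(&val) {
            return id; // duplicate Const: reuse the class representative
        }
        let id = WitnessId(self.next_id);
        self.next_id += 1;
        self.interned.insert(val, id);
        id
    }
}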

Stage 2 — Lower non-primitives

This stage translates the ExprIds of the logged non-primitive operations' inputs (from the set of non-primitive operations allowed at runtime) into WitnessIds, similarly to Stage 1.

Stage 3 — Optimize primitives

This stage aims at optimizing the generated circuit by removing or optimizing redundant operations within the graph. For instance, if the output of a primitive operation is never used elsewhere in the circuit, its associated node can be pruned away from the graph, and the operation removed.

Once all the nodes have been assigned, and the circuit has been fully optimized, we output it.

Building recursive AIR constraints

In order to recursively verify an AIR, its constraints need to be added to the circuit and folded together. In Plonky3, we can get an AIR's constraints in symbolic form. Since our primitive chips (see section Execution IR) encompass the various entries in the symbolic representation, we can simply map each symbolic operation to its circuit counterpart. The symbolic_to_circuit function does exactly that for a given symbolic constraint.

We can consider a small example to show how operations are mapped. Given public inputs a and b, and a constant c, we have the following symbolic constraint: Mul{ a, Sub {b, Const{ c }}} (which corresponds to: a * (b - c)).

// We get the `ExprId` corresponding to Const{ c } by adding a constant to the circuit.
let x = builder.add_const(c);
// We use the previously computed `x` to compute the subtraction in the circuit.
let y = builder.sub(b, x);
// We use the previously computed `y` to compute the multiplication in the circuit.
let z = builder.mul(a, y);

z is then the output ExprId of the constraint in the circuit.

Using this function, we have implemented, for all AIRs, the automatic translation from their set of symbolic constraints to the circuit version of the folded constraints:

// Transforms an AIR's symbolic constraints into its counterpart circuit version, 
// and folds all the constraints in the circuit using the challenge `alpha`.
fn eval_folded_circuit(
        // The AIR at hand.
        &self,
        builder: &mut CircuitBuilder<F>,
        // Circuit version of Lagrange selectors.
        sels: &RecursiveLagrangeSelectors,
        // Folding challenge.
        alpha: &ExprId,
        // All kind of columns that could be involved in constraints.
        columns: ColumnsTargets,
    ) -> Target {
        // Get all the constraints in symbolic form.
        let symbolic_constraints = 
            get_symbolic_constraints(self, 0, columns.public_values.len());

        // Fold all the constraints using the folding challenge.
        let mut acc = builder.add_const(F::ZERO);
        for s_c in symbolic_constraints {
            let mul_prev = builder.mul(acc, *alpha);

            // Get the current constraint in circuit form.
            let constraints = 
                symbolic_to_circuit(sels.row_selectors, &columns, &s_c, builder);

            // Fold the current constraint with the previous value.
            acc = builder.add(mul_prev, constraints);
        }

        acc
    }
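
The loop above is a Horner-style accumulation: after processing constraints c_0, ..., c_{k-1}, the accumulator holds c_0 * alpha^(k-1) + c_1 * alpha^(k-2) + ... + c_{k-1}. The standalone sketch below shows the same recurrence over plain integers instead of circuit targets (wrapping arithmetic stands in for field arithmetic):

// Fold constraints with the challenge `alpha`: acc <- acc * alpha + c.
fn fold_constraints(constraints: &[u64], alpha: u64) -> u64 {
    constraints
        .iter()
        .fold(0u64, |acc, &c| acc.wrapping_mul(alpha).wrapping_add(c))
}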

This facilitates the integration of any AIR verification into our circuit.

Proving

Calling circuit.runner() returns an instance of CircuitRunner, which can execute the represented program and generate the associated execution traces needed for proving:

let mut runner = circuit.runner();

// Set public input
let expected_fib = compute_fibonacci_classical(n);
runner.set_public_inputs(&[expected_fib])?;

// Instantiate prover instance
let config = build_standard_config_koalabear();
let prover = BatchStarkProver::new(config);

// Generate traces
let traces = runner.run()?;

// Prove the program
let proof = prover.prove_all_tables(&traces)?;

Key takeaways

  • “Free” equality constraints: by leveraging witness aliasing, we obtain essentially free equality constraints for the prover, removing the need for additional arithmetic constraints.

  • Deterministic layout: The ordered primitive lowering combined with equivalence class allocation yields predictable WitnessIds.

  • Minimal primitive set: With Sub/Div being effectively translated as equivalent Add/Mul operations, the IR stays extremely lean, consisting only of Const, Public, Add and Mul, simplifying the design and implementation details.
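
To illustrate the last point, the sketch below (types hypothetical, reusing the IR modeling from the chips section) encodes out = a - b with a single Add row whose result slot is a, i.e. out + b = a. This is exactly the pattern of the 37 * x - 111 = 0 example, where the ADD row w2 + w0 = w4 encodes the subtraction check; division is handled analogously with a Mul row.

#[derive(Clone, Copy, Debug)]
struct WitnessId(u32);

enum PrimitiveOp {
    Add { a: WitnessId, b: WitnessId, out: WitnessId },
}

// Emit `result = a - b` using only the Add primitive: allocate `result`
// and constrain result + b = a. The prover fills `result` at runtime.
fn emit_sub(
    ops: &mut Vec<PrimitiveOp>,
    next_id: &mut u32,
    a: WitnessId,
    b: WitnessId,
) -> WitnessId {
    let result = WitnessId(*next_id);
    *next_id += 1;
    // Note the slot layout: the Add row's output slot is `a`.
    ops.push(PrimitiveOp::Add { a: result, b, out: a });
    result
}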

Trace Generation Pipeline

After building a circuit, the next step is execution: running the program with concrete inputs and generating the execution traces needed for proving. This section describes the complete flow from a static Circuit specification to the final Traces structure.

Overview

The trace generation pipeline consists of three distinct phases:

  1. Circuit compilation — Transform high-level circuit expressions into a fixed intermediate representation.
  2. Circuit execution — Populate the witness table by evaluating operations with concrete input values.
  3. Trace extraction — Generate operation-specific traces from the populated witness table.

Each phase has a clear responsibility and produces well-defined outputs that feed into the next stage.

Phase 1: Circuit Compilation

Circuit compilation happens when calling builder.build(). This phase translates circuit expressions into a deterministic sequence of primitive and non-primitive operations.

The compilation process is described in detail in the Circuit Building section. In summary, it performs three successive passes to lower expressions to primitives, handle non-primitive operations, and optimize the resulting graph.

The output is a Circuit<F> containing the primitive operations in topological order, non-primitive operation specifications, and witness allocation metadata. This circuit is a static, serializable specification that can be executed multiple times with different inputs.

Phase 2: Circuit Execution

Circuit execution happens when calling runner.run(). This phase populates the witness table by evaluating each operation with concrete field element values.

The runner is initialized with a Circuit<F> and receives:

  • Public input values via runner.set_public_inputs()
  • Private data for non-primitive operations via runner.set_non_primitive_op_private_data()

The runner iterates through primitive operations in topological order, executing each one to populate witness slots. Operations can run in forward mode (computing outputs from inputs) or backward mode (inferring missing inputs from outputs), allowing bidirectional constraint solving.
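
For instance, an Add row can be solved in either direction depending on which of its three slots are already populated. The sketch below is a hypothetical illustration over plain integers, not the crate's API:

// Given an Add row a + b = out, infer the missing slot from the two known
// ones; return None if the row is underdetermined. (A consistency check
// against a pre-set `out` in the forward case is omitted for brevity.)
fn solve_add(a: Option<i64>, b: Option<i64>, out: Option<i64>) -> Option<(i64, i64, i64)> {
    match (a, b, out) {
        (Some(a), Some(b), _) => Some((a, b, a + b)),          // forward
        (Some(a), None, Some(out)) => Some((a, out - a, out)), // backward: infer b
        (None, Some(b), Some(out)) => Some((out - b, b, out)), // backward: infer a
        _ => None,
    }
}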

The output is a fully populated witness table where every slot contains a concrete field element. Any unset witness triggers a WitnessNotSet error.

Phase 3: Trace Extraction

Trace extraction happens internally within runner.run() after execution completes. This phase delegates to specialized trace builders that transform the populated witness table into operation-specific trace tables.

Each primitive operation has a dedicated builder that extracts its operations from the IR and produces trace columns:

  • WitnessTraceBuilder — Generates the central witness table with sequential (index, value) pairs
  • ConstTraceBuilder — Extracts constants (both columns preprocessed)
  • PublicTraceBuilder — Extracts public inputs (index preprocessed, values at runtime)
  • AddTraceBuilder — Extracts additions with six columns: (lhs_index, lhs_value, rhs_index, rhs_value, result_index, result_value)
  • MulTraceBuilder — Extracts multiplications with the same six-column structure

Non-primitive operations require custom trace builders. For example, MmcsTraceBuilder validates and extracts MMCS path verification traces. Custom trace builders follow the same pattern, operating independently in a single pass to produce isolated trace tables. All index columns are preprocessed since the IR is fixed and known to the verifier.

The output is a Traces<F> structure containing all execution traces needed by the prover to generate STARK proofs for each operation-specific chip.

Example: Fibonacci Circuit

Consider a simple Fibonacci circuit computing F(5):

let mut builder = CircuitBuilder::new();

let expected = builder.add_public_input();
let mut a = builder.add_const(F::ZERO);
let mut b = builder.add_const(F::ONE);

for _ in 2..=5 {
    let next = builder.add(a, b);
    a = b;
    b = next;
}

builder.connect(b, expected);
let circuit = builder.build()?;

Phase 1: Compilation produces:

primitive_ops: [
  Const { out: w0, val: 0 },
  Const { out: w1, val: 1 },
  Public { out: w2, public_pos: 0 },
  Add { a: w0, b: w1, out: w3 },  // F(2) = 0 + 1
  Add { a: w1, b: w3, out: w4 },  // F(3) = 1 + 1
  Add { a: w3, b: w4, out: w5 },  // F(4) = 1 + 2
  Add { a: w4, b: w5, out: w2 },  // F(5) = 2 + 3 (connects to expected)
]
witness_count: 6

Phase 2: Execution with runner.set_public_inputs(&[F::from(5)]):

witness[0] = Some(0)
witness[1] = Some(1)
witness[2] = Some(5)
witness[3] = Some(1)
witness[4] = Some(2)
witness[5] = Some(3)

Phase 3: Trace Extraction produces:

const_trace: [(w0, 0), (w1, 1)]
public_trace: [(w2, 5)]
add_trace: [
  (w0, 0, w1, 1, w3, 1),
  (w1, 1, w3, 1, w4, 2),
  (w3, 1, w4, 2, w5, 3),
  (w4, 2, w5, 3, w2, 5),
]
mul_trace: []

The witness table acts as the central bus, with each operation table containing lookups into it. The aggregated lookup argument enforces that all these lookups are consistent.

Key Properties

Determinism — Given the same circuit and inputs, trace generation is completely deterministic, ensuring reproducible proofs.

Separation of concerns — Each phase has a single responsibility: compilation handles expression lowering, execution populates concrete values, and trace builders format data for proving.

Builder pattern efficiency — Trace builders operate in a single pass using only the data they need. No builder depends on another's output, enabling future parallelization.

Preprocessed columns — All index columns in operation traces are preprocessed. Since the IR is fixed, the verifier can reconstruct these columns without online commitments, significantly reducing proof size.

Handling arbitrary programs

The fixed recursive verifier described in this book supports only fixed, predetermined programs.
This design choice maximizes performance but raises the question: how can one prove statements of varying size or complexity?

We highlight below two distinct approaches to alleviate this limitation and allow for arbitrary recursion.

Tree-style recursion for variable-length inputs

One can split a large computation into chunks and prove each piece in parallel using a fixed inner circuit. These proofs can then be recursively aggregated in a tree structure, where each leaf of the tree corresponds to a proven portion of the computation. The tree root yields a single proof attesting to the validity of the entire computation.

A formal description of this tree-style recursion for STARKs can be seen in zkTree.
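
A conceptual sketch of this aggregation (all types and functions below are hypothetical placeholders): each internal node proves the verification of its children, halving the number of proofs per layer until a single root proof remains.

struct Proof; // placeholder for an actual STARK proof

// Hypothetical: produce a proof attesting that the child proofs verify.
fn prove_children_verify(children: &[Proof]) -> Proof {
    let _ = children; // run the fixed recursive verifier circuit here
    Proof
}

// Aggregate chunk proofs layer by layer into a single root proof.
fn aggregate(mut layer: Vec<Proof>) -> Proof {
    assert!(!layer.is_empty());
    while layer.len() > 1 {
        layer = layer.chunks(2).map(prove_children_verify).collect();
    }
    layer.pop().expect("non-empty")
}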

Flexible FRI verification

To support proofs with different FRI shapes, one can:

  • Lift the proofs to a larger domain, as described in Lifting plonky3.
    Lifting allows a fixed circuit to efficiently verify proofs of varying trace sizes by projecting smaller domains into larger ones, reusing the original LDE and commitments without recomputation.

  • Verify distinct proof shapes together inside a fixed FRI verifier circuit. Instead of having a single proof size verifiable by a given FRI verifier circuit, one can extend it to cover a range of sizes at a minimal overhead cost. See a related implementation in plonky2 recursion (Plonky2 PR #1635) for more details.

Debugging

The CircuitBuilder provides built-in debugging tools to help identify wiring issues and unsatisfied constraints.

Allocation Logging

The CircuitBuilder supports an allocation logger that records the allocations performed during circuit building. These logs can then be analyzed at runtime and leveraged to detect issues in circuit construction.

Enabling Debug Logging

Allocation logging is automatically enabled in debug builds (or more generally whenever debug_assertions are enabled). Logs can be dumped to stdout by calling builder.dump_allocation_log(), provided the logging level is set to DEBUG or lower.

Allocation Log Format

By default, the CircuitBuilder automatically logs all allocations with no specific labels. One can attach a descriptive label to ease debugging, like so:

let mut builder = CircuitBuilder::<F>::new();

// Allocating with custom labels
let input_a = builder.alloc_public_input("input_a");
let input_b = builder.alloc_public_input("input_b");
let input_c = builder.alloc_public_input("input_c");

let b_times_c = builder.alloc_mul(input_b, input_c, "b_times_c");
let a_plus_bc = builder.alloc_add(input_a, b_times_c, "a_plus_bc");
let a_minus_bc = builder.alloc_sub(input_a, b_times_c, "a_minus_bc");

// Default allocation
let x = builder.add_public_input(); // unlabelled
let y = builder.add(x, b_times_c);  // unlabelled

The CircuitBuilder also allows for nested scoping of allocation logs, so that users can debug a specific context within a larger circuit. Scoping can be defined arbitrarily by users as follows:

fn complex_function(builder: &mut CircuitBuilder<F>) {
    builder.push_scope("complex function");

    // Do something
    inner_function(builder); // <- this will create a nested scope within the inner function

    builder.pop_scope();
}

fn inner_function(builder: &mut CircuitBuilder<F>) {
    builder.push_scope("inner function");

    // Do something else

    builder.pop_scope();
}

Debugging constraints

When debugging constraint satisfaction issues, the system relies on Plonky3's internal check_constraints feature to evaluate AIR constraints, available in debug mode. This ensures that all constraints are properly satisfied before proceeding to the next proving phases.

Benchmarks

This section will present empirical performance results for the Plonky3 recursion system, including instructions for reproducibility across target machines.

To be filled soon.