Evidence-gated development

Autonomy ends where verification ends

lark-rs is built with coding agents. That sentence is now unremarkable; many projects can say it. The interesting claim is narrower and harder: authority follows evidence, not confidence. An agent may decide what it can independently check; everything else it must escalate. This essay explains how that line is drawn, why parsing is an unusually good place to draw it, and where the method stops.

The trust problem, stated plainly

A reasonable engineer, told that a context-free parser — LALR table construction, an SPPF-based Earley engine, a CYK dynamic program, a contextual lexer — was written substantially by coding agents, should be sceptical. Parsers are exactly the kind of software where a plausible-looking implementation can be subtly, silently wrong on the input you didn't try.

The answer to that scepticism is not “the agent is capable.” Capability is precisely what you cannot verify by inspection. The answer is structural:

The claims that matter are independently checked, and the agent is not authorised to guess outside those checks.

That is the whole proposition. Everything below is the machinery that makes it true rather than aspirational.

The thesis: falsifiability is the boundary

The project began with one move that worked — oracle-first testing. It converted a question that would otherwise need human review (“is this code correct?”) into one an agent can settle by itself (“does it match Python Lark?”). The constitution generalises that single move into a rule for every class of decision:

The boundary of safe autonomy is the boundary of what we have made falsifiable.

An agent may decide anything it can ground against an oracle, a gate, or a written rule. Everything else it escalates — or the human architect builds the missing gate. This reframes the human's job. At this altitude the architect is not reviewing every commit or merging every change; the architect is building gates and handling escalations. The constitution is itself the gate for the decisions that don't have a test-shaped one.

A corollary makes the whole thing durable: all state that matters lives in git and GitHub — the constitution, the decision log, the issues, the compliance banks. A session is a stateless operator over that state. It reads, acts, writes back, and leaves nothing important in conversational memory. That is what lets ordinary, unprivileged sessions add up to a coherent project instead of a pile of context that evaporates when the window closes.

The decision boundary

Every decision is routed by a single question: what grounds it? There are three answers, and they map to three different authorities.

01 · Oracle, gate or bank

The result can be independently checked: a parse tree against Python Lark, a regression against a committed bank, a complexity envelope against a work-counter.

→ Decide freely, and self-check.

02 · Written principle + judgement

Evidence narrows the choice but does not settle it: an architecture call, a priority ordering, whether a divergence from the oracle is intended contract or implementation leakage.

→ Decide, explain, and record the reasoning as an ADR.

03 · Nothing falsifiable

Product direction, taste, a genuine trade-off with no objective test.

→ Escalate. Keep the decision human.

The load-bearing rule is the tie-breaker: when you are unsure which row you are in, you are in the bottom row. Uncertainty routes to the human, never to the easiest implementation.
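The routing rule is small enough to write down as a function. The sketch below is illustrative only: the names `Grounding`, `Authority` and `route` are hypothetical and are not taken from lark-rs or its constitution. It shows the one property that matters, namely that an unknown grounding lands in the same arm as "nothing falsifiable".

```rust
/// What a decision can be grounded against. All names here are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grounding {
    /// An oracle, gate or bank can check the result independently.
    Checkable,
    /// A written principle narrows the choice, but judgement settles it.
    PrincipleAndJudgement,
    /// Product direction, taste, a trade-off with no objective test.
    NothingFalsifiable,
}

/// What the agent is authorised to do with the decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Authority {
    DecideAndSelfCheck,
    DecideAndRecordAdr,
    EscalateToHuman,
}

/// Route a decision by what grounds it. `None` models "unsure which row
/// this is", which the tie-breaker sends to the bottom row.
pub fn route(grounding: Option<Grounding>) -> Authority {
    match grounding {
        Some(Grounding::Checkable) => Authority::DecideAndSelfCheck,
        Some(Grounding::PrincipleAndJudgement) => Authority::DecideAndRecordAdr,
        Some(Grounding::NothingFalsifiable) | None => Authority::EscalateToHuman,
    }
}
```

Modelling uncertainty as `None` rather than as a fourth variant is a deliberate choice in this sketch: the caller has to state a grounding explicitly to get anything other than escalation.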

Why parsing is an unusually good place to try this

Oracle-gated autonomy is only as good as the oracle. Parsing happens to come with a near-perfect one.

  • A reference implementation already exists. Python Lark is mature, widely used, and defines the behaviour lark-rs wants. The oracle is not a model of correctness we invented — it is a second implementation we can run.
  • The output is structured and comparable. A parse is a tree. Two trees are equal or they are not; an ambiguous result is a set you can compare. There is little room for “close enough.”
  • It is hard to write but easy to check. This is the asymmetry that makes the whole approach pay: generating the expected answer (run Python Lark) is cheap and trustworthy, even though producing it natively in Rust is intricate.
  • The expensive failure mode is deterministic. Super-linear blow-up in an Earley completer or a CYK table is a property of work done per byte: something you count, not a stopwatch reading.

“Traditional computers automate what you can specify in code. AI/LLMs automate what you can verify.” (Andrej Karpathy)
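The "equal or not" property is what makes the comparison mechanical. As a minimal sketch, assuming a simplified tree shape (the `Tree` type, `same_parse` and `same_ambiguity` below are hypothetical and far smaller than anything a real parser would use), structural equality covers the unambiguous case and set equality covers the ambiguous one. Treating an ambiguous result as an unordered set without duplicates is an assumption of this sketch.

```rust
use std::collections::BTreeSet;

/// A deliberately small parse tree: a rule node with children, or a token leaf.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tree {
    Node { rule: String, children: Vec<Tree> },
    Token { kind: String, text: String },
}

/// Build a token leaf.
pub fn tok(kind: &str, text: &str) -> Tree {
    Tree::Token { kind: kind.to_string(), text: text.to_string() }
}

/// Build a rule node.
pub fn node(rule: &str, children: Vec<Tree>) -> Tree {
    Tree::Node { rule: rule.to_string(), children }
}

/// Unambiguous case: the trees are structurally equal or they are not.
pub fn same_parse(ours: &Tree, oracle: &Tree) -> bool {
    ours == oracle
}

/// Ambiguous case: compare the derivations as sets, so the order in which
/// each side happened to enumerate them does not matter.
pub fn same_ambiguity(ours: &[Tree], oracle: &[Tree]) -> bool {
    let a: BTreeSet<&Tree> = ours.iter().collect();
    let b: BTreeSet<&Tree> = oracle.iter().collect();
    a == b
}
```

Child order inside a node still matters here, since `Vec` equality is positional; only the order of alternative derivations is ignored.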

Five practices, and the evidence each leaves behind

1 · Oracle before implementation

A new grammar feature starts with a case added to the oracle generator and a tree produced by Python Lark — a test that fails first. Only then is the feature written. The oracle JSON and the implementation are committed together. The expected values are Python-derived; hand-authoring them to match the Rust output is forbidden, because a test you tuned to pass proves nothing.

2 · Demonstrate before fixing

A bug becomes a failing reproduction before it becomes a fix. A suspected performance pathology becomes a committed, size-parametrised workload that exhibits it before any code changes. This sounds like ordinary discipline until it bites: in the engine's perf history, one hypothesis named the completer as the cost; the profiler found the cost in the forest-to-tree walk instead. The rule that saved it — fix the phase the profiler indicts, not the one the hypothesis names — only works because the demonstration came first. “Couldn't reproduce the pathology” is itself a valid, documented outcome: it closes a suspicion with evidence rather than a guess.

3 · Gates over assurances

Correctness and complexity become executable checks, not review opinions. Compliance is a set of banks strip-mined from Python Lark's own test suite and replayed on every build; the build fails only on a regression, and the known-failure lists are allowed to shrink, never grow. Complexity is gated on deterministic work-counters — per-byte Earley work, the CYK cubic envelope, the lexer's linear scan — asserting a flat envelope, never wall-clock. Wall-clock on shared runners is too noisy to enforce, and a flaky red gate gets muted; a muted gate is worse than no gate, because it looks like coverage.
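A work-counter gate can be sketched in a few lines. The code below is not the lark-rs gate; `WorkCounter`, the two stand-in workloads and `check_flat_envelope` are hypothetical names chosen for illustration. It shows the idea of asserting a per-byte bound on counted steps across several input sizes, which gives the same answer on every machine.

```rust
/// Counts abstract work units instead of measuring time.
#[derive(Debug, Default)]
pub struct WorkCounter {
    pub steps: u64,
}

impl WorkCounter {
    pub fn tick(&mut self) {
        self.steps += 1;
    }
}

/// Stand-in for an instrumented linear phase: one step per input byte.
pub fn linear_scan(input: &[u8], work: &mut WorkCounter) {
    for _ in input {
        work.tick();
    }
}

/// Stand-in for a pathological phase: every byte revisits its whole prefix.
pub fn quadratic_scan(input: &[u8], work: &mut WorkCounter) {
    for i in 0..input.len() {
        for _ in 0..=i {
            work.tick();
        }
    }
}

/// Run `workload` at each size and require steps <= max_per_byte * size.
/// Returns the first offending (size, steps) pair if the envelope is broken.
pub fn check_flat_envelope(
    workload: fn(&[u8], &mut WorkCounter),
    sizes: &[usize],
    max_per_byte: u64,
) -> Result<(), (usize, u64)> {
    for &n in sizes {
        let input = vec![b'a'; n];
        let mut work = WorkCounter::default();
        workload(&input, &mut work);
        if work.steps > max_per_byte * n as u64 {
            return Err((n, work.steps));
        }
    }
    Ok(())
}
```

With sizes 16, 256 and 4096 and a bound of 2 steps per byte, the linear stand-in passes and the quadratic one fails at the first size, having counted 136 steps against an allowance of 32. Nothing in the check depends on how fast the runner is.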

4 · Human judgement stays explicit

The method does not pretend everything is gateable. Product direction and genuine trade-offs are escalated through a needs-decision queue — the architect's inbox — and an agent that finds only such work surfaces it rather than inventing groundable work to avoid asking. There is a named anti-pattern: never resolve a needs-decision by picking the option that is easiest to implement. And the merge authority follows the same line — an agent may merge a change whose correctness an existing gate fully captures; new public API, new semantics, and every governance change are the architect's to merge.
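The merge rule reads naturally as a predicate over a change. This is a sketch under stated assumptions: the `Change` fields and the `merge_authority` function are hypothetical, and how a real project would decide that a gate "fully captures" a change is exactly the part this code does not model.

```rust
/// Who may merge a change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Merger {
    Agent,
    Architect,
}

/// Illustrative facts about a change; the field names are assumptions.
#[derive(Debug, Clone, Copy, Default)]
pub struct Change {
    pub fully_captured_by_existing_gate: bool,
    pub adds_public_api: bool,
    pub changes_semantics: bool,
    pub touches_governance: bool,
}

/// An agent may merge only when an existing gate fully captures correctness
/// and the change touches none of the categories reserved for the architect.
pub fn merge_authority(change: &Change) -> Merger {
    let reserved =
        change.adds_public_api || change.changes_semantics || change.touches_governance;
    if change.fully_captured_by_existing_gate && !reserved {
        Merger::Agent
    } else {
        Merger::Architect
    }
}
```

Because `Change::default()` sets every flag to `false`, a change about which nothing has been established goes to the architect, which mirrors the tie-breaker in the decision boundary.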

5 · Durable evidence

Tests, decisions, failures, exceptions and rationale live in the repository, not in a chat log. When an agent makes a semi-groundable call, it records an Architecture Decision Record. The architect reads those in arrears — batched, not per-PR — and promotes any correction into the invariants, so every future session is grounded by the sharpened rule. The decision log is the audit trail that lets the human stay in control without being in the loop.

The checks check themselves

A gate an agent can quietly weaken is not a gate. So the oracle layer is built to be honest by construction. The generators exit non-zero on any contradiction, meaning a committed case whose declared expectation disagrees with what Python Lark actually did, unless that case sits in a reasoned allow-list that, like the known-failure ledgers, is only ever allowed to shrink. Replays hold lark-rs to Python's recorded result, not to the author's annotation, with no silent skips.
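Two checks are packed into that paragraph: a contradiction check and a ratchet on the allow-list. The sketch below shows one way they could fit together. It is not the actual generator; `Case`, `Outcome`, `contradictions`, `only_shrinks` and `exit_code` are hypothetical names, and a case is reduced to an accept/reject outcome for brevity.

```rust
use std::collections::BTreeSet;

/// Whether a parser accepts or rejects a case's input.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Accepts,
    Rejects,
}

/// A committed case: what its author declared, and what Python Lark did.
#[derive(Debug, Clone)]
pub struct Case {
    pub id: String,
    pub declared: Outcome,
    pub python_actual: Outcome,
}

/// Ids of cases whose declaration contradicts Python and are not allow-listed.
pub fn contradictions(cases: &[Case], allow_list: &BTreeSet<String>) -> Vec<String> {
    cases
        .iter()
        .filter(|c| c.declared != c.python_actual && !allow_list.contains(&c.id))
        .map(|c| c.id.clone())
        .collect()
}

/// Ratchet: the current list may drop entries but must not add any.
pub fn only_shrinks(previous: &BTreeSet<String>, current: &BTreeSet<String>) -> bool {
    current.is_subset(previous)
}

/// Exit code a generator could return: zero only when both checks pass.
pub fn exit_code(
    cases: &[Case],
    previous_allow: &BTreeSet<String>,
    current_allow: &BTreeSet<String>,
) -> i32 {
    let clean = contradictions(cases, current_allow).is_empty();
    if clean && only_shrinks(previous_allow, current_allow) {
        0
    } else {
        1
    }
}
```

The ratchet is what stops the allow-list from becoming an escape hatch: adding an entry to silence a contradiction fails the second check even though it satisfies the first.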

The sharpest rule is about permissiveness. Being more permissive than the oracle — accepting input Python rejects — is unfalsifiable, so it fails unless a documented reason says otherwise. The same logic generalises to scope: behaviour with no Python counterpart at all is not an autonomous default, because there is nothing to check it against. You cannot earn authority by exceeding the oracle; you earn it by matching one.
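The asymmetry between the two kinds of divergence can be made explicit in the replay verdict. In this sketch the types `Parse` and `Verdict` and the functions `replay` and `passes` are hypothetical, and a tree is stood in for by its printed form. The point is only that "lark-rs accepts, Python rejects" gets its own verdict and passes solely when a non-empty documented reason accompanies it.

```rust
/// What a parser did with one input. The tree is kept as a printed string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Parse {
    Tree(String),
    Rejected,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Match,
    TreeMismatch,
    /// Python accepts, lark-rs rejects.
    TooStrict,
    /// lark-rs accepts, Python rejects: there is no recorded tree to check.
    TooPermissive,
    /// Too permissive, but covered by a documented reason.
    DocumentedDivergence,
}

/// Compare lark-rs against Python's recorded result for one input.
pub fn replay(python: &Parse, rust: &Parse, documented_reason: Option<&str>) -> Verdict {
    match (python, rust) {
        (Parse::Rejected, Parse::Rejected) => Verdict::Match,
        (Parse::Tree(p), Parse::Tree(r)) if p == r => Verdict::Match,
        (Parse::Tree(_), Parse::Tree(_)) => Verdict::TreeMismatch,
        (Parse::Tree(_), Parse::Rejected) => Verdict::TooStrict,
        (Parse::Rejected, Parse::Tree(_)) => match documented_reason {
            Some(reason) if !reason.trim().is_empty() => Verdict::DocumentedDivergence,
            _ => Verdict::TooPermissive,
        },
    }
}

/// Only an exact match or a documented divergence passes the replay.
pub fn passes(verdict: Verdict) -> bool {
    matches!(verdict, Verdict::Match | Verdict::DocumentedDivergence)
}
```

A blank reason string is treated the same as no reason, so an empty annotation cannot be used to wave a divergence through.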

Where the method stops

A methodology that claimed to dissolve all judgement would be making exactly the unfalsifiable promise it forbids. The honest version names its own gaps. The project's own decision lens marks several dimensions as gated and two as not:

Dimension                                       Status
Correctness & Python parity                     Gated: oracle + compliance/wild banks
Performance & complexity class                  Gated: deterministic work-counters
Portability across targets                      Gated: CI binding jobs
Security / resource bounds on untrusted input   Largely ungated: judgement + a flagged gap
Grammar-author ergonomics & error quality       Ungated: judgement-only
What to build next; API design; taste           Human: escalated by default

Algorithmic-complexity denial-of-service is caught by the scaling gates, but general resource bounds on adversarial input through the public API are not yet gated. The quality of an error message when a grammar author makes a mistake has no oracle at all — Python Lark's diagnostics are not a specification to match. These are tracked as gaps to turn into gates, not quietly assumed solved.

Not every valuable property has an oracle. When the project cannot make a decision falsifiable, it records the uncertainty and keeps the decision human.

Where it generalises — and where it doesn't

Parsing is a fortunate case: a trusted reference implementation already existed. The method transfers to any domain that has, or can cheaply build, an independent check — a compiler against a reference, a numerical kernel against a slow-but-correct version, a protocol implementation against a conformance suite, a data migration against its source. The pattern is always the same: find the asymmetry where the answer is expensive to produce but cheap to verify, and put the agent on the producing side of it.
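The "numerical kernel against a slow-but-correct version" case is the easiest to show in miniature. The example below is unrelated to lark-rs and uses an integer square root purely as a stand-in: `isqrt_slow` is the obviously correct reference, `isqrt_fast` is the implementation under test, and `first_disagreement` is the differential check. All three names are made up for this sketch.

```rust
/// Reference: linear search for the largest r with r * r <= n.
/// Slow, but simple enough to trust by inspection.
pub fn isqrt_slow(n: u64) -> u64 {
    let mut r = 0u64;
    while ((r + 1) as u128) * ((r + 1) as u128) <= n as u128 {
        r += 1;
    }
    r
}

/// Implementation under test: binary search.
/// Invariant: lo * lo <= n < hi * hi, with hi starting at 2^32.
pub fn isqrt_fast(n: u64) -> u64 {
    let (mut lo, mut hi) = (0u64, 1u64 << 32);
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if (mid as u128) * (mid as u128) <= n as u128 {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    lo
}

/// Differential check: the first input on which the two disagree, if any.
pub fn first_disagreement(inputs: impl IntoIterator<Item = u64>) -> Option<u64> {
    inputs.into_iter().find(|&n| isqrt_fast(n) != isqrt_slow(n))
}
```

The reference is only practical on small inputs, which is typical of this pattern: the oracle bounds what can be checked, and inputs beyond its reach need a different argument or a different gate.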

It does not transfer to domains where correctness is a matter of taste, or where no independent check can be built without first solving the very problem you're checking. In those domains the disciplined move is the same one the constitution makes for product direction: don't fabricate a gate to license autonomy you haven't earned — keep the human in the loop and say so. The contribution here is not a way to remove judgement from software. It is a way to be precise about how much of it you have actually replaced.

The claim, restated

lark-rs is not interesting because an agent wrote a parser. It is interesting because the project systematically turns parser correctness, compatibility and complexity into evidence an agent cannot talk its way around.

Every significant claim on this site carries one of four statuses, for exactly this reason:

Verified: enforced by an oracle, bank or gate
Measured: observed in a named benchmark
Goal: a stated direction, not yet shown
Open: a known limitation

The full constitution — invariants, the decision taxonomy and lens, the Definition of Done, the merge tiers, and how the document itself evolves — is PRINCIPLES.md. The dated reasoning behind individual calls lives in docs/decisions/.