A claim-check result is deliberately not one score. It’s three independent axes — how well-checked the claim is, how much was checkable, and how sure Shipmoor is about the intent itself — because a high-drift result built on weak evidence and a vague intent must never read like a confident failure.
Anatomy of the block
Intent: Add a Stripe webhook handler for failed payments (payment_intent.payment_failed)
Source: manual:--intent (manual_string) · agreement: single source · Confidence: low
Claim check VERIFIED · maturity: verified · coverage: 100%
✓ A handler is bound to the Stripe payment_intent.payment_failed event.
Not yet checked:
∅ Failure-path handling
∅ Webhook signature verification
Library 0.1.0 · Policy 0.1.0
- Intent: the resolved goal text, masked (a secret pasted into the intent never survives here).
- Source: which inputs resolved it, whether they agreed, and the intent confidence. See Providing intent.
- The badge: the maturity state as the loud headline word, plus coverage.
- ✓ / ✗ lines: one per probe that applied; ✓ is satisfied, ✗ is a disclosed gap.
- Not yet checked (∅): expectations Shipmoor recognizes for this intent but has no probe for yet. Honest silence, not a pass.
- Library / Policy: the probe-library and policy versions that produced the result, part of the reproducibility fingerprint.
The three axes
| Axis | Question it answers | Where it shows |
|---|---|---|
| Maturity | What kind of evidence stands behind this result? | The badge headline |
| Coverage | What fraction of applicable checks produced a definite answer? | The badge |
| Confidence | How sure are we about what the change was meant to do? | The Source: line |
Maturity: the five states
| State | What it means | Terminal cue |
|---|---|---|
| verified | Deterministic probes fired and were satisfied; the claim is earned on evidence. | green |
| partial | Some expectations were satisfied; others were unmet or couldn’t be checked. | yellow |
| gap_disclosed | A required expectation is openly unmet: an honest, located negative. | red |
| unprobed | No probe applied; there is no deterministic evidence either way. | dim/grey |
| inferred | Only an advisory opinion exists; it carries no deterministic weight. | dim/italic |
The weak states are styled to look weak. partial is not “wrong” — it means some expectations were checked and some were not; read the per-expectation lines to see which. Only gap_disclosed can ever earn a block, and blocking is a separate opt-in feature — see Turning on the gate.
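As a reading aid, the five states and their terminal cues can be modeled as a small enum. This is a minimal sketch and not Shipmoor's own code: the `Maturity` class, the `TERMINAL_CUE` mapping, and the `eligible_to_block` helper with its `gate_enabled` parameter are hypothetical names. The helper only encodes the rule stated above; the real gate may apply further conditions.

```python
from enum import Enum


class Maturity(Enum):
    """The five maturity states, using the names from the table above."""
    VERIFIED = "verified"
    PARTIAL = "partial"
    GAP_DISCLOSED = "gap_disclosed"
    UNPROBED = "unprobed"
    INFERRED = "inferred"


# Terminal cues copied from the table above.
TERMINAL_CUE = {
    Maturity.VERIFIED: "green",
    Maturity.PARTIAL: "yellow",
    Maturity.GAP_DISCLOSED: "red",
    Maturity.UNPROBED: "dim/grey",
    Maturity.INFERRED: "dim/italic",
}


def eligible_to_block(maturity: Maturity, gate_enabled: bool) -> bool:
    """Hypothetical helper: only gap_disclosed can block, and only with the opt-in gate on."""
    return gate_enabled and maturity is Maturity.GAP_DISCLOSED
```

With this model, `partial` is never eligible to block, even with the gate enabled, and `gap_disclosed` stays advisory while the gate is off.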
Two kinds of “not checked”
- Not yet checked (∅): expectations with no shipped probe yet. These do not lower coverage; there was nothing to run.
- The ⚠ footer: probes that did apply but returned cannot_check (an unsupported language, say). These do lower coverage, and the footer aggregates the count and reasons:
Claim check NOT CHECKED · maturity: unprobed · coverage: 0%
? A Kubernetes Deployment is present in the change. — no relevant files in this change
⚠ We could not check 3 of 3 expectations (no relevant files in this change). Coverage 0%.
So coverage: 100% next to two Not yet checked lines reads: everything I probe, I could check — and here’s what I don’t probe yet.
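The arithmetic behind those two cases can be sketched as follows. This is an illustration of the rule as described here, not Shipmoor's implementation: the `coverage` function and its parameter names are hypothetical, and returning 0.0 when no probe applied at all is an assumption.

```python
def coverage(satisfied: int, unsatisfied: int, cannot_check: int,
             not_yet_checked: int = 0) -> float:
    """Fraction of applied probes that produced a definite answer.

    `not_yet_checked` is accepted only to show that it plays no part:
    expectations without a shipped probe never enter the denominator.
    """
    applied = satisfied + unsatisfied + cannot_check
    if applied == 0:
        # Assumption: with nothing applied, report zero coverage.
        return 0.0
    # Both satisfied and unsatisfied count as definite answers.
    return (satisfied + unsatisfied) / applied


# One satisfied probe plus two "Not yet checked" expectations: full coverage.
full = coverage(satisfied=1, unsatisfied=0, cannot_check=0, not_yet_checked=2)

# Three probes applied and all returned cannot_check: zero coverage.
none = coverage(satisfied=0, unsatisfied=0, cannot_check=3)
```

Here `full` is 1.0, matching the first example block, and `none` is 0.0, matching the ⚠ footer example.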
Useful flags
- --explain: expands every expectation with per-probe detail, including which fact matched, why a check was cannot_check, and the judge’s rationale if one ran.
- --quiet-intent: collapses the claim check to a single badge line, for busy CI logs.
Plan drift (from a session)
When you pass --session <transcript>, Shipmoor also compares the agent’s own plan against the diff it produced. That’s a separate question from the claim check: not “did the diff do what the developer asked,” but “did the agent do what it said it would.”
Three conservative probes report it: plan.drift.goal_substitution (the plan and the task share no concept), plan.drift.scope_creep (the diff implements the plan plus unrelated files), and plan.drift.partial_implementation (a planned step only partly realized). Each errs toward silence — a false plan-drift is reviewer noise.
Plan-drift findings land in the normal findings list with category: intent_integrity at severity info, and they never change the exit code — not through the structural gate, not through the claim-check gate. The agent’s plan is never the standard of judgment; the resolved intent is.
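A consumer that wants to show plan-drift notes separately could split them out by category. The sketch below assumes each finding is a dict with `category` and `severity` keys, mirroring the fields named above; the exact finding shape, the `split_plan_drift` helper, and the `security` category in the sample data are hypothetical. It also assumes no other kind of finding shares the intent_integrity category.

```python
def split_plan_drift(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate plan-drift findings (category intent_integrity) from the rest."""
    drift = [f for f in findings if f.get("category") == "intent_integrity"]
    rest = [f for f in findings if f.get("category") != "intent_integrity"]
    return drift, rest


# Hypothetical findings list for illustration only.
sample = [
    {"category": "intent_integrity", "severity": "info"},
    {"category": "security", "severity": "high"},
]
drift, rest = split_plan_drift(sample)
```

Because plan-drift findings never affect the exit code, any pass/fail logic in such a consumer would look at `rest` only.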
In JSON and SARIF
--json carries the claim check as change_results[] — additive, absent entirely on a no-intent scan:
{
"verdict": "major_gap",
"maturity": "gap_disclosed",
"coverage": 1.0,
"gate_decision": "not_evaluated",
"resolved_intent": { "goal_text": "…", "confidence": "medium" },
"evidence": [ { "result": "unsatisfied", "basis": "deterministic" } ],
"per_probe_summary": { "satisfied": 3, "unsatisfied": 0, "cannot_check": 0, "unmatched": 1 },
"fingerprint": "sha256:…"
}
gate_decision is one of not_evaluated (advisory), passed, would_block, or blocked. unmatched counts probes that were considered but didn’t apply to this change; it is not an error. --sarif emits SARIF 2.1.0; plan-drift findings appear in the regular findings[] with category: intent_integrity.
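A script reading the JSON output might reduce each entry to a one-line summary of the three axes. This is a minimal sketch that assumes change_results is a top-level key of the `--json` document and that each entry has the fields shown in the example above; the `summarize` function is hypothetical. It treats a missing change_results as a no-intent scan rather than an error.

```python
import json


def summarize(report_json: str) -> list[str]:
    """Return one summary line per claim-check result in a --json report."""
    report = json.loads(report_json)
    lines = []
    # change_results is absent entirely on a no-intent scan, so default to [].
    for result in report.get("change_results", []):
        confidence = result["resolved_intent"]["confidence"]
        lines.append(
            f"{result['maturity']} · coverage {result['coverage']:.0%} · "
            f"intent confidence {confidence} · gate {result['gate_decision']}"
        )
    return lines


# Sample input built from the fields in the example above.
sample_report = json.dumps({
    "change_results": [{
        "maturity": "gap_disclosed",
        "coverage": 1.0,
        "gate_decision": "not_evaluated",
        "resolved_intent": {"goal_text": "…", "confidence": "medium"},
    }]
})
summary = summarize(sample_report)
```

For the sample, `summary` holds a single line that keeps maturity, coverage, and intent confidence side by side instead of folding them into one score.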
Next
- Turning on the gate: when a gap_disclosed result should block.
- BYO-Judge: where inferred results come from.
- Providing intent: raising confidence with agreeing sources.