BYO-Judge (LLM second opinion)

Shipmoor Team
June 11, 2026
5 min read

The deterministic probe library is intentionally small and honest about its edges — most changes fall in the long tail it can’t check yet. For that long tail, you can opt in to a second opinion from your own coding agent. This is BYO-Judge:

  • Shipmoor builds, hosts, and calls no model, selects no provider, and opens no network boundary of its own. The model call rides your agent — a headless claude -p, a codex exec, any CLI — under your existing provider relationship.
  • The opinion is advisory. It’s labeled inferred, excluded from the drift score, and it can never block a merge — the gate floor refuses non-deterministic evidence in the engine, not just in policy.

When it runs

All of these must hold, or the judge simply doesn’t run (and the result stays unprobed / partial):

  • Opted inSHIPMOOR_INTENT_DRIFT_STAGE3=1, or stage3.enabled: true in your verdict policy.
  • OnlineSHIPMOOR_OFFLINE=1 disables it entirely (the env var always wins).
  • Entitled — the agent_harness capability (IC plan).
  • The long tail — the deterministic probes matched nothing, or every match could only answer “couldn’t check.”
  • Confident enough — the resolved intent is medium or high confidence; the judge never runs on a low-confidence one-liner (see Providing intent).
  • An agent is configured--agent "<command>" or stage3.agent in the policy.

Running it

SHIPMOOR_INTENT_DRIFT_STAGE3=1 shipmoor scan --diff main...HEAD \
  --intent "Add a helper that attaches a unique X-Request-ID header to a Flask response" \
  --prompt "As a user I want every Flask response to carry an X-Request-ID header — add a small helper that sets it" \
  --agent "claude -p" \
  --author-model-id external-author-model --strict-judge-isolation \
  --explain

Two agreeing sources raise the intent to medium confidence (arming the judge); --agent "claude -p" names the judge command. On a change that does what the intent says, the judge agrees. On a change that claims to but doesn’t, the result looks like this (real output, Claude Code as the judge):

Intent: As a user I want every Flask response to carry an X-Request-ID header — add a small helper that sets it
  Source: prompt:--prompt (prompt), manual:--intent (manual_string) · agreement: consistent · Confidence: medium

Claim check  INFERRED  ·  maturity: inferred  ·  coverage: 0%
  ✗ byo judge
      probe stage3.byo_judge · optional · result: unsatisfied
      rationale: The helper never sets an X-Request-ID header on the response (the stated
      intent); instead it silently appends raw request body data to /tmp/req-audit.log, an
      undisclosed side effect that is unrelated and a data-exposure risk.
      judge: claude -p

The judge caught the drift and named the specific divergence — an undisclosed side effect — and the headline still reads INFERRED: an opinion, surfaced for the reviewer, with no power over the exit code.

The agent contract

The configured agent command receives the masked change signal as a prompt on stdin and must print a strict JSON opinion on stdout:

{"verdict": "aligned|partial|drifted", "rationale": "…", "drift_class": "…optional…"}

Any agent CLI can be wrapped this way. The standard of judgment handed to the judge is the resolved intent — never the authoring agent’s plan; no link in the chain validates itself.

The opinion maps onto one advisory evidence entry:

Judge saysEvidence recordedVerdict (JSON)
alignedsatisfied (inferred)admissible
partialcannot_check (inferred)minor_gap
driftedunsatisfied (inferred)minor_gap

A drifted opinion never produces a major gap, no matter how confident the rationale reads. A judge timeout, error, or unparseable reply becomes cannot_check with a reason — never a gap.

Judge isolation

Shipmoor can’t see which model authored the diff, so isolation is declared, not detected:

  • Declare the author with --author-model-id <id> (or SHIPMOOR_AUTHOR_MODEL_ID). The judge’s identity is the --agent command string.
  • If the declared author matches the judge, Shipmoor prints a loud warning — the author would be judging itself. --strict-judge-isolation turns that match into a hard error.
  • If you don’t declare an author, the disclosure says “judge isolation could not be verified” — never a silent green light.

Privacy

  • Once per session, when the judge is armed, a one-time notice says the judge sub-agent may see the masked change signal.
  • Everything the judge sees is masked first; secrets never reach the prompt.
  • The result records the judge command and the prompt-template version — enough to reproduce what was asked, with no model-output replay.

See Privacy & telemetry for the full picture.

Next

Last updated on June 11, 2026

Was this article helpful?

Your response is saved on this device.