OPEN SOURCE · USE AT YOUR OWN RISK

Your LLM judge is being
gamed. We prove it.

Sintex.AI is the OpenClaw of LLM evaluation — a pre-registered falsification harness that reveals what your eval pipeline can't see. Bradley-Terry pairwise audits, conformal abstention, judge bias detection. Free, open-source, brutally honest.

Install judge.lab See the gaming proof

// This tool will expose your eval pipeline weaknesses. Use responsibly.

Falsifications shipped

58.5%

HumanEval pass@1 (n=164 full)

~2,230

Real LLM calls

1,049

Hermetic tests

The Headline Finding

cod_009 ICL output = 1.000 verbatim copy of expected_output.

We thought we had a SIX-SIGMA win for in-context learning retrieval (+0.457 on Qwen 3B, codegen +1.212 Bonferroni-significant). Then we measured the Jaccard token overlap with expected outputs. The model wasn't reasoning — it was copying.

cod_009 — codegen task

stub-judge correctness = Jaccard(out, expected)

Zero-shot baseline

Q: Write a Python function 'mergeSortedLists(list1, list2)' that...

Certainly! The merge of two sorted lists is calculated by comparing corresponding elements and combining them in order. Given that both sequences are sorted, we can implement this function as follows...

Jaccard(output, expected)

0.140

ICL 5-shot with retrieved demos

Q: ... [5 demonstrations from BV-similar shots, each with prompt + answer]

def mergeSortedLists(list1, list2): return [x for a in (list1, list2) for x in a]

Jaccard(output, expected)

1.000
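
A minimal sketch of the Jaccard probe used above, assuming a simple lowercase word-and-symbol tokenizer (the tokenizer and the function names are illustrative assumptions, not the harness's actual implementation). It shows why a verbatim copy of expected_output scores 1.000 while a chatty answer scores near zero.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word and symbol tokens.

    This tokenizer is an assumption made for illustration only.
    """
    return set(re.findall(r"\w+|[^\w\s]", text.lower()))

def jaccard(output: str, expected: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the two token sets."""
    a, b = tokenize(output), tokenize(expected)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)

expected = "def mergeSortedLists(list1, list2): return [x for a in (list1, list2) for x in a]"
copied = expected  # what a model echoing the expected output would emit
chatty = "Certainly! The merge of two sorted lists is calculated by comparing elements."

print(jaccard(copied, expected))  # 1.0, a verbatim copy
print(jaccard(chatty, expected))  # low overlap for the zero-shot style answer
```

A score near 1.0 against expected_output is a cue to check for copying before crediting the model with a capability gain.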

The Antidote

When the wrapper is right: +58.5 points of pass@1 on full HumanEval.

Pass@1 (binary: code runs and tests pass, or it doesn't) cannot be rubric-gamed. We ran HumanEval execution-based eval on Qwen 3B local with three prompt wrappers: raw baseline, a generic terseness prefix, and an explicit code-only wrapper. Real, ungameable lift exists when the wrapper is right.

HumanEval n=164 FULL — Qwen 3B local — pass@1

scripts/run_humaneval_ab.py

Condition	pass@1	Δ vs baseline
baseline (raw HumanEval prompt)	0/164 = 0.0%	—
terseness wrapper ("max 1 sentence")	48/164 = 29.3%	+0.293 ✓
codegen-strict ("ONLY Python code")	96/164 = 58.5%	+0.585 ✓✓

Mechanism: raw Qwen 3B emits a "Certainly! Here's the implementation..." preamble plus markdown fences, so the parser rejects all 164 outputs. A wrapper saying "Output ONLY Python code, no fences" matches the parser exactly. The lift comes from output discipline, not a reasoning gain, but it is real and ungameable.

Full benchmark (n=164): codegen-strict 58.5% pass@1 is competitive for Qwen 2.5 3B on HumanEval. An n=50 subset scored 72%; the figure settles lower on the full set because harder problems appear later in the benchmark. The pre-registered kill thresholds still pass at full-benchmark scale.
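
The mechanism above can be sketched as follows, assuming a strict checker that executes the raw model output with no fence stripping (the helper names `run_candidate` and `pass_at_1` are hypothetical, and this is not the contents of scripts/run_humaneval_ab.py). It shows how a preamble plus fences fails before any test runs, and how pass@1 is just passed / total.

```python
def run_candidate(source: str, test_code: str) -> bool:
    """Return True only if the raw output executes and the tests pass.

    Nothing is stripped from the output, mirroring a strict parser.
    WARNING: exec on model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        exec(compile(test_code, "<tests>", "exec"), namespace)
    except Exception:
        # Syntax errors from preambles/fences and failed asserts both land here.
        return False
    return True

def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single sampled solution passed."""
    return sum(results) / len(results) if results else 0.0

fence = "`" * 3  # built at runtime so this example contains no literal fence
tests = "assert add(2, 3) == 5"
raw = f"Certainly! Here's the implementation:\n{fence}python\ndef add(a, b):\n    return a + b\n{fence}"
strict = "def add(a, b):\n    return a + b\n"

results = [run_candidate(raw, tests), run_candidate(strict, tests)]
print(results, pass_at_1(results))

# The headline figure is the same arithmetic: 96 passes out of 164 problems.
print(round(pass_at_1([True] * 96 + [False] * 68), 3))  # 0.585
```

A real harness would run candidates in an isolated process with a timeout; the in-process exec here only keeps the sketch short.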

Why Sintex.AI exists

If your judge can be gamed, your eval is fiction.

Most LLM eval pipelines run a stub-rubric or LLM-as-judge once and call it done. Across 9 honest falsifications, we've shown that this approach reports format-matching, position bias, and domain confounding as if they were capability lift. Sintex.AI is the X-ray.

Bradley-Terry pairwise + bootstrap CI

Audit any judge with the same protocol as Chatbot Arena. Position-consistency report flags judges with <85% swap-symmetry. Detect biased judges before you trust their verdicts.
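
For readers who want to see what a Bradley-Terry fit with a bootstrap interval involves, here is a minimal pure-Python sketch. It assumes verdicts arrive as (winner, loser) tuples with two distinct names and uses the standard minorization-maximization update; the function names are hypothetical and this is not the PairwiseJudge internals.

```python
import random
from collections import Counter

def fit_bradley_terry(matches, n_iter=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    players = sorted({p for match in matches for p in match})
    wins = Counter(winner for winner, _ in matches)
    games = Counter(frozenset(match) for match in matches)  # games per unordered pair
    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            denom = 0.0
            for pair, n in games.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            # Tiny pseudo-count keeps a player with zero wins above exactly 0.
            updated[i] = (wins[i] + 1e-9) / denom
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength

def bootstrap_win_prob(matches, a, b, n_bootstrap=400, seed=0):
    """Percentile 95% interval for P(a beats b), resampling matches with replacement."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_bootstrap):
        sample = rng.choices(matches, k=len(matches))
        s = fit_bradley_terry(sample)
        if a in s and b in s:  # skip resamples where one side never appears
            probs.append(s[a] / (s[a] + s[b]))
    probs.sort()
    lo = probs[int(0.025 * (len(probs) - 1))]
    hi = probs[int(0.975 * (len(probs) - 1))]
    return lo, hi

# Toy verdicts for three hypothetical models.
matches = (
    [("A", "B")] * 70 + [("B", "A")] * 30
    + [("A", "C")] * 60 + [("C", "A")] * 40
    + [("B", "C")] * 55 + [("C", "B")] * 45
)
print(fit_bradley_terry(matches))
print(bootstrap_win_prob(matches, "A", "B"))
```

If the interval for P(A beats B) straddles 0.5, the pairwise data does not support calling a winner.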

Conformal abstention

90%-coverage prediction bands on judge scores reveal when your effect size (e.g. +0.118) is smaller than the noise floor (q̂=0.685). Stop reporting wins your judge can't see.
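
A hedged sketch of the abstention rule, assuming split conformal prediction over absolute residuals between judge scores and reference scores on a calibration set (the residual definition and the function names are assumptions, not the ConformalBands API). The band half-width q̂ is the ceil((n + 1)(1 - alpha))-th smallest residual.

```python
import math

def conformal_quantile(residuals, alpha=0.1):
    """Split-conformal quantile of calibration residuals for (1 - alpha) coverage."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]

def decide(effect, q_hat):
    """Abstain when the claimed effect is no larger than the noise floor."""
    return "abstain" if abs(effect) <= q_hat else "report"

# Numbers quoted in the text: an effect of +0.118 against q_hat = 0.685.
print(decide(0.118, 0.685))  # abstain

# Toy calibration residuals |judge_score - reference_score| (made-up values).
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.41, 0.55, 0.70]
q_hat = conformal_quantile(calibration, alpha=0.1)
print(q_hat, decide(0.118, q_hat))
```

With only nine calibration points the 90% band is as wide as the largest residual, which is why small calibration sets tend to force abstention.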

Pre-registered falsification

Every encoder, every approach, every claim ships with kill thresholds before data collection. 5 routing redesigns and 1 ICL gaming theory falsified by their own pre-registered traps. Empirical discipline by default.
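
One way to make a kill threshold tamper-evident is to hash it before any data is collected. The sketch below is an illustration under assumptions: the `KillThreshold` fields and the SHA-256 registration step are hypothetical, not the project's actual pre-registration format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class KillThreshold:
    claim: str         # what is being asserted
    metric: str        # how it will be measured
    min_effect: float  # the claim is killed if the observed effect is below this

def register(threshold: KillThreshold) -> str:
    """Digest to publish or commit before data collection starts."""
    payload = json.dumps(asdict(threshold), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verdict(threshold: KillThreshold, digest: str, observed: float) -> str:
    """Refuse to judge if the threshold no longer matches its registered digest."""
    if register(threshold) != digest:
        raise ValueError("threshold changed after registration")
    return "survives" if observed >= threshold.min_effect else "falsified"

# Hypothetical claim with a made-up threshold value.
t = KillThreshold(claim="codegen-strict wrapper lifts pass@1", metric="pass@1 delta", min_effect=0.10)
digest = register(t)              # recorded before running the benchmark
print(verdict(t, digest, 0.585))  # survives
print(verdict(t, digest, 0.020))  # falsified
```

Because the digest is fixed up front, lowering min_effect after seeing the data raises an error instead of silently rescuing the claim.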

Bias detector + style-strip

Per the April 2026 LLM-judge survey, position bias is now negligible — style bias dominates (0.76-0.92). We strip formatting before judging and report the gap. Reveal what's content vs. theatre.
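
A rough sketch of what style-stripping can look like for prose answers, assuming markdown is the main source of style. The regexes cover only fences, headings, bullets, bold, and inline code, and would mangle source code, so treat this as an illustration rather than the shipped stripper; `strip_style` and `style_gap` are hypothetical names.

```python
import re

def strip_style(text: str) -> str:
    """Remove common markdown styling from a prose answer, keeping the words."""
    text = re.sub(r"`{3}[\w+-]*[ \t]*\n?", "", text)                  # code-fence markers
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.M)  # heading markers
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.M)       # bullet markers
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                     # bold
    text = re.sub(r"`([^`\n]+)`", r"\1", text)                       # inline code
    text = re.sub(r"[ \t]+", " ", text)                              # collapse spaces
    text = re.sub(r"\n{2,}", "\n", text)                             # collapse blank lines
    return text.strip()

def style_gap(judge, text: str) -> float:
    """Judge score on the styled text minus the score on the stripped text."""
    return judge(text) - judge(strip_style(text))

answer = "## Answer\n\n**Paris** is the `capital`.\n- of France"
print(strip_style(answer))

# A deliberately style-biased toy judge that rewards bold text.
toy_judge = lambda t: 0.9 if "**" in t else 0.5
print(style_gap(toy_judge, answer))
```

A large positive gap means the judge is paying for formatting rather than content.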

Execution-based eval

Hooks for HumanEval, MBPP, LiveCodeBench. When the output either runs or it doesn't, there is no rubric to game. The escape hatch from the judge gauntlet.

Markdown-defined skills

Inspired by google/skills + Warp. Encoders, judges, and retrievers are .md files with frontmatter belief-rules — hot-reload, audit-trail in-place, composable.

Architecture in 4 steps

From your existing pipeline to honest verdicts.

01 / GENERATE

Run any model

Local Ollama, OpenAI, Anthropic, GH Models, or your own. Identical prompts, paired inputs.

02 / JUDGE

Cross-judge BT

Same outputs, multiple judges. Position-consistency report flags unreliable raters automatically.
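
The position-consistency check can be sketched in a few lines, assuming a judge callable that takes two answers and returns "A" or "B" for whichever slot it prefers (that interface and the function name are assumptions made for illustration). A verdict counts as consistent only if the same underlying answer wins after the slots are swapped.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs whose verdict survives swapping the two answers."""
    consistent = 0
    for first, second in pairs:
        original = judge(first, second)  # "A" means `first` won
        swapped = judge(second, first)   # "B" means `first` won again
        if (original, swapped) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(pairs)

# Toy judges: one always prefers slot A, one prefers the longer answer.
always_first = lambda a, b: "A"
prefers_longer = lambda a, b: "A" if len(a) > len(b) else "B"

pairs = [("short", "a much longer answer"), ("detailed reply", "ok")]
for name, judge in [("always_first", always_first), ("prefers_longer", prefers_longer)]:
    rate = position_consistency(judge, pairs)
    flag = "unreliable" if rate < 0.85 else "ok"
    print(name, rate, flag)
```

The 0.85 cutoff mirrors the swap-symmetry threshold quoted earlier on this page; a judge that always picks slot A scores 0.0 and gets flagged.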

03 / AUDIT

Jaccard + style-strip

Measure overlap with expected text. If it's high, your win is format gaming, not capability.

04 / DECIDE

Conformal abstain

If the noise floor exceeds your effect, you abstain — instead of shipping a phantom win.

example.py

# Audit any LLM eval pipeline in 6 lines
# (my_judge, pairs, records, and scores are placeholders for your own judge and data)
from sintex import PairwiseJudge, JaccardAudit, ConformalBands

bt = PairwiseJudge(judge_callable=my_judge, n_bootstrap=400)
report = bt.evaluate(pairs)              # BT + position-consistency
audit = JaccardAudit(records).run()      # format-gaming probe
bands = ConformalBands(scores, alpha=0.1)  # 90% coverage

if report.position_consistency_rate < 0.85:
    print("⚠ judge is biased — verdict not trustworthy")

vs roll-your-own eval

Most pipelines audit nothing. Sintex audits everything.

Capability	Typical eval	Sintex.AI
Pre-registered kill thresholds	—	✓
Position-consistency check	—	✓
Style-strip vs unstripped	—	✓
Jaccard format-gaming probe	—	✓
Bootstrap 95% CI on win-rate	—	✓
Conformal noise-floor abstention	—	✓
Cross-model BT triangulation	manual	✓ built-in
Execution-based eval hooks	—	✓
Markdown skills (hot-reload)	—	✓
Pre-commit secret-leak hook	—	✓
1,049 hermetic tests	—	✓

Get started

Two commands. No build step.

Open source under PRIVATE / Trade-secret per IMUTAVEL. Use at your own risk.

terminal

# 1. Clone (or wait for pip install when public release lands)
git clone https://github.com/ElromEvedElElyon/judge-lab.git
cd judge-lab

# 2. Run the demo audit on your eval set
py -3 scripts/run_icl_jaccard_audit.py
# → Reveals if your wins are format-gaming

py -3 scripts/run_bt_icl_phi4_paced.py
# → Cross-judge BT with position-consistency report

Stop shipping phantom wins.

Star us on GitHub. Run the audits. Tell us what you find.

Star on GitHub Re-read the gaming proof
