OPEN SOURCE · USE AT YOUR OWN RISK

Your LLM judge is being
gamed. We prove it.

Sintex.AI is the OpenClaw of LLM evaluation — a pre-registered falsification harness that reveals what your eval pipeline can't see. Bradley-Terry pairwise audits, conformal abstention, judge bias detection. Free, open-source, brutally honest.

// This tool will expose your eval pipeline weaknesses. Use responsibly.
9 · Falsifications shipped
58.5% · HumanEval pass@1 (n=164 full)
~2,230 · Real LLM calls
1,049 · Hermetic tests
The Headline Finding

cod_009: the ICL output scored Jaccard 1.000, a verbatim copy of expected_output.

We thought we had a SIX-SIGMA win for in-context learning retrieval (+0.457 on Qwen 3B, codegen +1.212 Bonferroni-significant). Then we measured the Jaccard token overlap with expected outputs. The model wasn't reasoning — it was copying.

cod_009 — codegen task
stub-judge correctness = Jaccard(out, expected)

Zero-shot baseline

Q: Write a Python function 'mergeSortedLists(list1, list2)' that...
Certainly! The merge of two sorted lists is calculated by comparing corresponding elements and combining them in order. Given that both sequences are sorted, we can implement this function as follows...
Jaccard(output, expected)
0.140

ICL 5-shot with retrieved demos

Q: ... [5 demonstrations from BV-similar shots, each with prompt + answer]
def mergeSortedLists(list1, list2): return [x for a in (list1, list2) for x in a]
Jaccard(output, expected)
1.000
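For readers who want to reproduce the overlap check, here is a minimal sketch of a token-set Jaccard score. The tokenizer (lowercased words plus single punctuation symbols) and the function names are assumptions for illustration; the stub judge's actual tokenization is not shown on this page.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word and symbol tokens.

    Assumed tokenizer: the real stub judge may split differently.
    """
    return set(re.findall(r"\w+|[^\w\s]", text.lower()))


def jaccard(output: str, expected: str) -> float:
    """Return |A ∩ B| / |A ∪ B| over token sets (1.0 if both are empty)."""
    a, b = tokenize(output), tokenize(expected)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A score near 1.000 means the output and the expected text share almost every token, which is what copying a retrieved demonstration looks like.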
The Antidote

When the wrapper is right: +58.5 points of pass@1 on full HumanEval.

Pass@1 (binary: code runs and tests pass, or it doesn't) cannot be rubric-gamed. We ran HumanEval execution-based eval on Qwen 3B local with three prompt wrappers: raw baseline, a generic terseness prefix, and an explicit code-only wrapper. Real, ungameable lift exists when the wrapper is right.

HumanEval n=164 FULL — Qwen 3B local — pass@1
scripts/run_humaneval_ab.py
Condition                              pass@1            Δ vs baseline
baseline (raw HumanEval prompt)        0/164 = 0.0%
terseness wrapper ("max 1 sentence")   48/164 = 29.3%    +0.293 ✓
codegen-strict ("ONLY Python code")    96/164 = 58.5%    +0.585 ✓✓
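The pass@1 numbers above come from running code, not scoring text. A rough sketch of that idea follows. It is not the contents of scripts/run_humaneval_ab.py; the helper names `passes` and `pass_at_1` are hypothetical, and a fresh subprocess gives a timeout but is not a security sandbox.

```python
import subprocess
import sys


def passes(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code plus its tests in a fresh interpreter.

    A pass means the process exits with code 0 before the timeout.
    Assumption: tests are plain assert statements appended to the candidate.
    """
    program = candidate_src + "\n\n" + test_src + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single sampled solution passed."""
    return sum(results) / len(results) if results else 0.0
```

With one sample per problem, 96 passes out of 164 gives the 58.5% in the table.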
Why Sintex.AI exists

If your judge can be gamed, your eval is fiction.

Most LLM eval pipelines run a stub-rubric or LLM-as-judge once and call it done. We've shown — across 9 honest falsifications — that this surfaces format-matching, position bias, and domain confounding as if they were capability lift. Sintex.AI is the X-ray.

Bradley-Terry pairwise + bootstrap CI

Audit any judge with the same protocol as Chatbot Arena. Position-consistency report flags judges with <85% swap-symmetry. Detect biased judges before you trust their verdicts.
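As a rough illustration of what a Bradley-Terry audit computes, the sketch below fits strengths with the standard minorization-maximization update and puts a percentile bootstrap interval on a win-rate. The function names are hypothetical and this is not the `PairwiseJudge` implementation; only `n_bootstrap=400` is borrowed from the example on this page.

```python
import random


def fit_bradley_terry(wins: list[tuple[int, int]], n_items: int, iters: int = 200) -> list[float]:
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    then renormalizes so the strengths sum to 1.
    """
    w = [0.0] * n_items                            # total wins per item
    n = [[0] * n_items for _ in range(n_items)]    # comparisons per pair
    for winner, loser in wins:
        w[winner] += 1
        n[winner][loser] += 1
        n[loser][winner] += 1
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = sum(
                n[i][j] / (p[i] + p[j])
                for j in range(n_items)
                if j != i and n[i][j]
            )
            new_p.append(w[i] / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p


def bootstrap_win_rate_ci(outcomes: list[int], n_bootstrap: int = 400,
                          alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a win-rate; outcomes are 1 (A wins) or 0 (B wins)."""
    rng = random.Random(seed)
    k = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=k)) / k for _ in range(n_bootstrap))
    lo = rates[int((alpha / 2) * n_bootstrap)]
    hi = rates[min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))]
    return lo, hi
```

If the interval on a win-rate straddles 0.5, the judge has not separated the two systems.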

Conformal abstention

90%-coverage prediction bands on judge scores reveal when your effect size (e.g. +0.118) is smaller than the noise floor (q̂=0.685). Stop reporting wins your judge can't see.
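A minimal sketch of the abstention rule, assuming split conformal prediction over calibration residuals (for example, the absolute gap between a judge score and a trusted reference score). The residual definition and both function names are assumptions; `ConformalBands` may work differently.

```python
import math


def conformal_quantile(residuals: list[float], alpha: float = 0.1) -> float:
    """Return q-hat, the split-conformal quantile for 1 - alpha coverage.

    Takes the ceil((n + 1) * (1 - alpha))-th smallest calibration residual.
    Returns infinity when there are too few calibration points.
    """
    n = len(residuals)
    if n == 0:
        raise ValueError("need at least one calibration residual")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(residuals)[rank - 1]


def should_abstain(effect_size: float, q_hat: float) -> bool:
    """Abstain when the measured effect is smaller than the noise floor."""
    return abs(effect_size) < q_hat
```

With the figures quoted above, `should_abstain(0.118, 0.685)` is true: the effect sits inside the band, so no win is reported.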

Pre-registered falsification

Every encoder, every approach, every claim ships with kill thresholds before data collection. 5 routing redesigns and 1 ICL gaming theory falsified by their own pre-registered traps. Empirical discipline by default.
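One way to make a kill threshold tamper-evident is to freeze it and hash it before any data is collected. The sketch below is an assumption about how that could look, not the format Sintex.AI uses; the class name, fields, and the 0.9 threshold in the usage note are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KillThreshold:
    """A claim plus the observation that would falsify it (hypothetical schema)."""

    claim: str
    metric: str
    kill_if_above: float

    def fingerprint(self) -> str:
        """SHA-256 of the threshold, to be published before data collection."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def is_falsified(self, observed: float) -> bool:
        """True when the observed metric crosses the pre-registered threshold."""
        return observed > self.kill_if_above
```

For example, `KillThreshold("ICL lift is reasoning, not copying", "jaccard_with_expected", 0.9)` would be falsified by the observed overlap of 1.000 and would survive the zero-shot overlap of 0.140.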

Bias detector + style-strip

Per the April 2026 LLM-judge survey, position bias is now negligible — style bias dominates (0.76-0.92). We strip formatting before judging and report the gap. Reveal what's content vs. theatre.
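To make the style-strip idea concrete, here is a small sketch that removes common Markdown formatting and reports how much a judge's score moves. The regex set is an assumption covering fences, headings, bullets, bold, and inline code only, and `judge` is any callable you supply that maps text to a score.

```python
import re
from typing import Callable


def strip_style(text: str) -> str:
    """Remove common Markdown styling while keeping the words (assumed rule set)."""
    text = re.sub(r"^`{3}.*$", "", text, flags=re.MULTILINE)                # code-fence marker lines
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)       # heading markers
    text = re.sub(r"^\s*(?:[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list bullets
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)                            # bold
    text = re.sub(r"`([^`]*)`", r"\1", text)                                # inline code
    return re.sub(r"\s+", " ", text).strip()                                # collapse whitespace


def style_gap(judge: Callable[[str], float], output: str) -> float:
    """Score difference between the styled output and its stripped version."""
    return judge(output) - judge(strip_style(output))
```

A gap near zero suggests the judge is scoring content; a large positive gap suggests it is rewarding formatting.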

Execution-based eval

Hooks for HumanEval, MBPP, LiveCodeBench. Your output either runs or it doesn't, so there is no rubric to game. The escape hatch from the judge gauntlet.

Markdown-defined skills

Inspired by google/skills + Warp. Encoders, judges, and retrievers are .md files with frontmatter belief-rules — hot-reload, audit-trail in-place, composable.

Architecture in 4 steps

From your existing pipeline to honest verdicts.

01 / GENERATE

Run any model

Local Ollama, OpenAI, Anthropic, GH Models, or your own. Identical prompts, paired inputs.

02 / JUDGE

Cross-judge BT

Same outputs, multiple judges. Position-consistency report flags unreliable raters automatically.

03 / AUDIT

Jaccard + style-strip

Measure overlap with expected text. If it's high, your win is format gaming, not capability.

04 / DECIDE

Conformal abstain

If the noise floor exceeds your effect, you abstain — instead of shipping a phantom win.

example.py
# Audit any LLM eval pipeline in 6 lines
# Assumes my_judge, pairs, records, and scores are already defined by your pipeline.
from sintex import PairwiseJudge, JaccardAudit, ConformalBands

bt = PairwiseJudge(judge_callable=my_judge, n_bootstrap=400)
report = bt.evaluate(pairs)              # BT + position-consistency
audit = JaccardAudit(records).run()      # format-gaming probe
bands = ConformalBands(scores, alpha=0.1)  # 90% coverage

if report.position_consistency_rate < 0.85:
    print("⚠ judge is biased — verdict not trustworthy")
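The 0.85 cut-off in the example is a swap-symmetry check: show the judge each pair in both orders and count how often the same output wins. A minimal sketch follows, assuming a judge callable that returns the string "first" or "second"; that return convention and the function name are assumptions, not the Sintex.AI API.

```python
from typing import Callable, Sequence


def position_consistency_rate(
    judge: Callable[[str, str], str],
    pairs: Sequence[tuple[str, str]],
) -> float:
    """Fraction of pairs where the same output wins in both presentation orders.

    Assumed convention: judge(x, y) returns "first" if x wins, else "second".
    """
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for a, b in pairs:
        a_wins_forward = judge(a, b) == "first"    # a shown first
        a_wins_swapped = judge(b, a) == "second"   # a shown second
        if a_wins_forward == a_wins_swapped:
            consistent += 1
    return consistent / len(pairs)
```

A judge that always prefers whichever output it sees first scores 0.0 here, well under the 0.85 bar.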
vs roll-your-own eval

Most pipelines audit nothing. Sintex audits everything.

Capability Typical eval Sintex.AI
Pre-registered kill thresholds
Position-consistency check
Style-strip vs unstripped
Jaccard format-gaming probe
Bootstrap 95% CI on win-rate
Conformal noise-floor abstention
Cross-model BT triangulation manual ✓ built-in
Execution-based eval hooks
Markdown skills (hot-reload)
Pre-commit secret-leak hook
1,049 hermetic tests
Get started

Two commands. No build step.

Open source under PRIVATE / Trade-secret per IMUTAVEL. Use at your own risk.

terminal
# 1. Clone (or wait for pip install when public release lands)
git clone https://github.com/ElromEvedElElyon/judge-lab.git
cd judge-lab

# 2. Run the demo audit on your eval set
py -3 scripts/run_icl_jaccard_audit.py
# → Reveals if your wins are format-gaming

py -3 scripts/run_bt_icl_phi4_paced.py
# → Cross-judge BT with position-consistency report

Stop shipping phantom wins.

Star us on GitHub. Run the audits. Tell us what you find.