Sintex.AI is the OpenClaw of LLM evaluation — a pre-registered falsification harness that reveals what your eval pipeline can't see. Bradley-Terry pairwise audits, conformal abstention, judge bias detection. Free, open-source, brutally honest.
We thought we had a SIX-SIGMA win for in-context-learning retrieval (+0.457 on Qwen 3B; codegen +1.212, Bonferroni-significant). Then we measured the Jaccard token overlap with the expected outputs. The model wasn't reasoning; it was copying.
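
The overlap check itself is simple. Here is a minimal sketch of a token-level Jaccard probe, assuming whitespace/word tokenization and lowercase matching; the function names are illustrative and are not the Sintex.AI API.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over token sets (0.0 if both are empty)."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mean_overlap(outputs: list[str], expected: list[str]) -> float:
    """Average Jaccard overlap between each model output and its expected output."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    if not outputs:
        return 0.0
    return sum(jaccard(o, e) for o, e in zip(outputs, expected)) / len(outputs)
```

If the mean overlap rises together with the score, the lift may be coming from copied surface text rather than from better answers. What counts as "high" is a threshold you would pre-register, not a constant this sketch provides.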
Pass@1 (binary: code runs and tests pass, or it doesn't) cannot be rubric-gamed. We ran execution-based HumanEval on a local Qwen 3B with three prompt wrappers: the raw baseline, a generic terseness prefix, and an explicit code-only wrapper. Real, ungameable lift exists when the wrapper is right.
| Condition | pass@1 | Δ vs baseline |
|---|---|---|
| baseline (raw HumanEval prompt) | 0/164 = 0.0% | — |
| terseness wrapper ("max 1 sentence") | 48/164 = 29.3% | +0.293 ✓ |
| codegen-strict ("ONLY Python code") | 96/164 = 58.5% | +0.585 ✓✓ |
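
For readers who want to see what "execution-based" means in practice, here is a hedged sketch of a pass@1 check using only the Python standard library. It assumes each problem supplies the model's reply and a test snippet made of `assert` statements; `passes` and `pass_at_1` are hypothetical names, and the sketch does no sandboxing, so untrusted model output should only be run inside an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if candidate + tests run to completion with exit code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
        return result.returncode == 0


def pass_at_1(samples: list[tuple[str, str]]) -> float:
    """Fraction of (candidate, tests) pairs that pass, one sample per problem."""
    if not samples:
        return 0.0
    return sum(passes(code, tests) for code, tests in samples) / len(samples)
```

Under this kind of check, a reply that wraps correct code in a sentence of prose fails to parse and scores zero, which is one way a prompt wrapper can move pass@1 without any rubric involved.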
Most LLM eval pipelines run a stub-rubric or LLM-as-judge once and call it done. We've shown — across 9 honest falsifications — that this surfaces format-matching, position bias, and domain confounding as if they were capability lift. Sintex.AI is the X-ray.
Audit any judge with the same protocol as Chatbot Arena. Position-consistency report flags judges with <85% swap-symmetry. Detect biased judges before you trust their verdicts.
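
As a rough illustration of the two ingredients, the sketch below computes a swap-symmetry rate and fits Bradley-Terry strengths with the standard minorization-maximization update. The judge signature (returns 0 if the first argument wins, 1 if the second) and all function names are assumptions for this example, not the Sintex.AI interface.

```python
from typing import Callable

# Hypothetical judge signature: 0 means the first argument wins, 1 the second.
Judge = Callable[[str, str], int]


def position_consistency_rate(judge: Judge, pairs: list[tuple[str, str]]) -> float:
    """Share of pairs where the same item wins in both presentation orders."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for a, b in pairs:
        forward = judge(a, b)  # 0 -> a wins
        swapped = judge(b, a)  # 1 -> a wins
        if forward != swapped:  # the winner followed the item, not the slot
            consistent += 1
    return consistent / len(pairs)


def bradley_terry(wins: list[list[int]], iters: int = 200) -> list[float]:
    """Fit BT strengths from wins[i][j] = number of times i beat j (MM updates)."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalize so strengths sum to 1
    return p
```

A judge that always prefers the first slot scores 0.0 on the consistency check, well under the 85% line, no matter how plausible its individual verdicts look.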
90%-coverage prediction bands on judge scores reveal when your effect size (e.g. +0.118) is smaller than the noise floor (q̂=0.685). Stop reporting wins your judge can't see.
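
One common way to get such a noise floor is split conformal prediction. The sketch below assumes the residuals are absolute differences between judge scores and reference scores on a held-out calibration set; that definition, and the function names, are assumptions for illustration rather than a description of how Sintex.AI computes q̂.

```python
import math


def conformal_quantile(residuals: list[float], alpha: float = 0.1) -> float:
    """Split-conformal q_hat: the ceil((n + 1) * (1 - alpha))-th smallest residual."""
    n = len(residuals)
    if n == 0:
        raise ValueError("need at least one calibration residual")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]


def should_abstain(effect: float, q_hat: float) -> bool:
    """Abstain when the measured effect does not clear the conformal noise floor."""
    return abs(effect) <= q_hat
```

With the numbers quoted above, `should_abstain(0.118, 0.685)` returns `True`: the effect sits inside the band, so the defensible report is "no detectable difference".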
Every encoder, every approach, every claim ships with kill thresholds before data collection. 5 routing redesigns and 1 ICL gaming theory falsified by their own pre-registered traps. Empirical discipline by default.
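
A kill threshold only has teeth if it is fixed before the data arrives. The following is a small sketch of that idea, assuming a frozen record plus a SHA-256 fingerprint you could commit or timestamp ahead of collection; the `KillThreshold` class and its fields are hypothetical, not the project's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KillThreshold:
    """A claim plus the minimum effect it must reach to survive."""

    claim: str
    metric: str
    min_effect: float

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the record, suitable for committing up front."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def verdict(self, observed_effect: float) -> str:
        """Compare the observed effect against the pre-registered threshold."""
        return "survives" if observed_effect >= self.min_effect else "falsified"
```

Because the dataclass is frozen, the threshold cannot be edited after construction, and any change to the record produces a different fingerprint.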
Per the April 2026 LLM-judge survey, position bias is now negligible — style bias dominates (0.76-0.92). We strip formatting before judging and report the gap. Reveal what's content vs. theatre.
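
To make the idea concrete, here is a deliberately crude sketch of style stripping, assuming the outputs use markdown-like formatting and that the judge is a callable returning a numeric score. The regexes and names are illustrative; a production stripper would need to be more careful, for example around asterisks inside code.

```python
import re
from typing import Callable


def strip_style(text: str) -> str:
    """Remove common markdown decoration and collapse whitespace."""
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+[.)])[ \t]+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"[*`]+", "", text)  # bold, italic, and code marks
    return re.sub(r"\s+", " ", text).strip()


def style_gap(judge: Callable[[str], float], raw: str) -> float:
    """Score difference the judge assigns to formatting alone."""
    return judge(raw) - judge(strip_style(raw))
```

A gap near zero suggests the judge is scoring content; a large positive gap suggests it is rewarding presentation.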
Hooks for HumanEval, MBPP, LiveCodeBench. When your output runs or it doesn't, no rubric can be gamed. The escape hatch from the judge gauntlet.
Inspired by google/skills + Warp. Encoders, judges, and retrievers are .md files with frontmatter belief-rules — hot-reload, audit-trail in-place, composable.
Local Ollama, OpenAI, Anthropic, GitHub Models, or your own. Identical prompts, paired inputs.
Same outputs, multiple judges. Position-consistency report flags unreliable raters automatically.
Measure overlap with expected text. If it's high, your win is format gaming, not capability.
If the noise floor exceeds your effect, you abstain — instead of shipping a phantom win.
    # Audit any LLM eval pipeline in 6 lines
    from sintex import PairwiseJudge, JaccardAudit, ConformalBands
    bt = PairwiseJudge(judge_callable=my_judge, n_bootstrap=400)
    report = bt.evaluate(pairs)                 # BT + position-consistency
    audit = JaccardAudit(records).run()         # format-gaming probe
    bands = ConformalBands(scores, alpha=0.1)   # 90% coverage
    if report.position_consistency_rate < 0.85:
        print("⚠ judge is biased — verdict not trustworthy")
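
The `n_bootstrap=400` argument above points at resampling. As a standalone illustration, here is a sketch of a percentile bootstrap interval on a win-rate, assuming outcomes are coded 1 for a win and 0 for a loss; the function name and defaults are assumptions, and the library's own procedure may differ.

```python
import random


def bootstrap_win_rate_ci(
    outcomes: list[int],
    n_bootstrap: int = 400,
    level: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    # Resample with replacement and record the win-rate of each resample.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_bootstrap))
    lo_idx = int((1 - level) / 2 * n_bootstrap)
    hi_idx = min(n_bootstrap - 1, int((1 + level) / 2 * n_bootstrap))
    return rates[lo_idx], rates[hi_idx]
```

If the interval for a head-to-head win-rate straddles 0.5, the comparison has not separated the two systems yet.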
| Capability | Typical eval | Sintex.AI |
|---|---|---|
| Pre-registered kill thresholds | — | ✓ |
| Position-consistency check | — | ✓ |
| Style-strip vs unstripped | — | ✓ |
| Jaccard format-gaming probe | — | ✓ |
| Bootstrap 95% CI on win-rate | — | ✓ |
| Conformal noise-floor abstention | — | ✓ |
| Cross-model BT triangulation | manual | ✓ built-in |
| Execution-based eval hooks | — | ✓ |
| Markdown skills (hot-reload) | — | ✓ |
| Pre-commit secret-leak hook | — | ✓ |
| 1,049 hermetic tests | — | ✓ |
Open source under PRIVATE / Trade-secret per IMUTAVEL. Use at your own risk.
    # 1. Clone (or wait for pip install when public release lands)
    git clone https://github.com/ElromEvedElElyon/judge-lab.git
    cd Rex-26
    # 2. Run the demo audit on your eval set
    py -3 scripts/run_icl_jaccard_audit.py     # → Reveals if your wins are format-gaming
    py -3 scripts/run_bt_icl_phi4_paced.py     # → Cross-judge BT with position-consistency report
Star us on GitHub. Run the audits. Tell us what you find.