RRM-1 BENCHMARKS
We ran the numbers. Then we ran them again.
The numbers surprised us enough to make us cautious, because the only number worth publishing is one that has been stress-tested for failure in the real world.
We submit the models our customers actually deploy. We did not optimise for benchmark testing.
We are a small company. We are preparing for independent third-party verification before we publish. A number without independent verification is not useful to anyone making a serious decision.
Benchmark Results
| BENCHMARKS | COMFREY |
| --- | --- |
| GPQA DIAMOND | Run complete. We have chosen not to publish our scores at this time. |
| MMLU PRO | |
| SWE BENCH VERIFIED | |
| HLE | |
GPQA Diamond measures expert-level question answering across scientific domains. That is a meaningful bar. It is not our bar.
The Universal Agentic Reasoning Layer was not built to answer graduate-level chemistry questions. It was built to reason honestly over your documents, flag what it doesn’t know, and never quietly substitute a fabricated answer for a real one. The benchmark that measures what we actually built for does not yet exist in standard form. That is part of what we are working on.
NOTE: We are a small company. We do not have a team dedicated to optimising for leaderboard performance. When we submit, it will be once and clearly dated, and what you see is what you get. We think that is more useful to you than a score that was engineered for this page.
How we assess hallucination prevention
Definition of hallucination: Incorrect retrieval based on the prompt. The retrieved content is not grounded in what the prompt actually asks for. Truthfulness is therefore defined as retrieval that is genuinely relevant and accurate to the prompt, not just semantically similar.
Dataset: HaluBench by Patronus AI, a hallucination evaluation benchmark of 15,000 samples consisting of Context-Question-Answer triplets, each annotated for whether it contains a hallucination. It draws from real-world domains including finance and medicine, sourcing from FinanceBench, PubMedQA, CovidQA, HaluEval, DROP, and RAGTruth.
Baseline measurement: 15,000 samples.
Our measurement method: Cosine similarity is used but not relied upon as the sole metric. We layer our own geometric metric that measures contradiction versus coherence, the relationship between what was retrieved and what the prompt actually requires. That tension between contradiction and coherence is what cosine similarity alone cannot capture.
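To make the cosine similarity part of this concrete, here is a minimal sketch in Python. It is not our production metric. The `accept_retrieval` gate, its thresholds, and the toy vectors are assumptions made up for illustration, and the `coherence` value is simply passed in as a number because the geometric metric itself is not described on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def accept_retrieval(similarity, coherence, min_similarity=0.8, min_coherence=0.5):
    """Hypothetical gate: a passage passes only if BOTH signals clear their thresholds.

    `coherence` stands in for a second, separately computed score.
    The thresholds are illustrative assumptions, not published values.
    """
    return similarity >= min_similarity and coherence >= min_coherence


# Toy embeddings, invented for illustration only.
prompt_vec = [1.0, 0.0, 1.0]
passage_vec = [1.0, 0.1, 0.9]
sim = cosine_similarity(prompt_vec, passage_vec)

# The passage is very similar to the prompt, yet a low coherence score still rejects it.
print(sim > 0.99, accept_retrieval(sim, coherence=0.2), accept_retrieval(sim, coherence=0.9))
```

The point of the sketch is the second argument: a passage can sit very close to the prompt in embedding space and still be rejected when the second signal disagrees.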
The 93% figure therefore means that 7 out of every 100 answers return an “I don’t know” response, with references where applicable, based on gaps in the model’s knowledge. We have only tested this on smaller models to date.
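As a worked example of that arithmetic, the short Python sketch below converts the 93% figure into a count of “I don’t know” responses. The function name is hypothetical, and applying the rate to all 15,000 samples is an assumption made purely for illustration, not a reported result.

```python
def abstention_count(total_answers, answered_rate_percent):
    """Number of "I don't know" responses, given the answered rate in whole percent."""
    if not 0 <= answered_rate_percent <= 100:
        raise ValueError("rate must be between 0 and 100")
    abstained_percent = 100 - answered_rate_percent
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_answers * abstained_percent // 100


print(abstention_count(100, 93))     # 7
print(abstention_count(15_000, 93))  # 1050, if the rate held across the full dataset
```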
Our approach to AI hallucination prevention
Most evaluation systems score what the model said. We monitor what the model is doing while it decides.
Inside UARL, every response is shaped by two internal signals: confidence and consistency. Confidence measures how certain the model is about its answer. Consistency measures how stable that certainty is across the reasoning process.
When these signals align, the response is sound. When they diverge, with high confidence paired with low consistency, the model is about to produce an answer it believes but cannot support. That pattern is a pre-hallucination signature. The text has not gone wrong yet. But internally, the model is already in an unreliable state.
We detect that state before it reaches the output, reducing the AI hallucination rate significantly.
Why this matters
Conventional hallucination detection waits for the model to finish and then checks whether the answer looks right. That approach has two problems. It is reactive: the AI hallucination has already been generated. And it depends on an external judge or ground truth to evaluate the output, which introduces its own errors.
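The pattern described above, high confidence paired with low consistency, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the UARL implementation: the per-step confidence values, the way `consistency` is derived from their spread, and both thresholds are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Signals:
    confidence: float   # mean certainty across steps, in [0, 1]
    consistency: float  # 1.0 means perfectly stable; lower means certainty swings more


def summarise(step_confidences):
    """Reduce a sequence of per-step confidence values (each in [0, 1]) to two signals."""
    if not step_confidences:
        raise ValueError("need at least one confidence value")
    confidence = mean(step_confidences)
    # The population standard deviation of values in [0, 1] is at most 0.5,
    # so doubling it keeps consistency inside [0, 1].
    consistency = 1.0 - 2.0 * pstdev(step_confidences)
    return Signals(confidence, consistency)


def is_pre_hallucination(signals, min_confidence=0.7, min_consistency=0.6):
    """Flag the divergent case: confident on average, but unstable along the way.

    Both thresholds are illustrative assumptions.
    """
    return signals.confidence >= min_confidence and signals.consistency < min_consistency


stable = summarise([0.90, 0.88, 0.91, 0.90])          # confident and steady
unstable = summarise([0.99, 0.45, 0.98, 0.50, 0.99])  # confident on average, erratic
unsure = summarise([0.30, 0.30, 0.30])                # steady but not confident

print(is_pre_hallucination(stable), is_pre_hallucination(unstable), is_pre_hallucination(unsure))
```

Only the second trace is flagged. A model that is steadily unsure is a different case, one that would more naturally lead to an “I don’t know” response than to a confident fabrication.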
What we have built requires neither. It reads the internal state of the model’s reasoning, specifically the relationship between confidence and consistency across layers, and identifies structural misalignment before it becomes a factual error.
The output may still look fluent. The language may still sound authoritative. But the internal signals have already told us the answer is at risk. To reduce AI hallucinations, we act on that signal, not on the text the model would have produced.
What this means for your deployment
Hallucination risk is surfaced earlier, flagged more reliably and caught without requiring a second model to judge the first. For enterprises running UARL on high-stakes compliance, legal or financial documents, this is not an evaluation feature. It is trust infrastructure.
Evals shouldn’t just score outputs. They should expose internal integrity.