Published: February 23, 2026

CoherenceScore Is Now Running Code

The Chapter 3 math is now a tested Python implementation, including protocol thresholds, geometric mean behavior, and explicit invariants.

What got built

The implementation now lives in poc/convergence with three core modules: embeddings.py for thread-safe normalized embeddings, coherence.py for the whitepaper formulas, and __init__.py exporting compute_coherence_score() plus threshold constants.

The formulas implemented are the whitepaper definitions of CS_semantic, CC, and the final CoherenceScore, so the protocol math is no longer just prose: it is executable, testable code.

  • embeddings.py: sentence-transformers + L2 normalization
  • coherence.py: CS_semantic, CC, and CoherenceScore
  • __init__.py: public API and threshold exports

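A minimal sketch of how these pieces fit together, assuming pairwise cosine similarities have already been computed from the normalized embeddings. The exported name compute_coherence_score comes from the post; its signature and the cs_semantic helper here are assumptions, not the actual poc/convergence code:

```python
import math

# Weights from the whitepaper: 40% semantic coherence, 60% causal coherence.
ALPHA, BETA = 0.4, 0.6

def cs_semantic(pairwise_sims):
    """Geometric mean of pairwise cosine similarities between outputs."""
    clamped = [max(s, 1e-9) for s in pairwise_sims]  # guard against log(0)
    return math.exp(sum(math.log(s) for s in clamped) / len(clamped))

def compute_coherence_score(pairwise_sims, cc):
    """CoherenceScore = ALPHA * CS_semantic + BETA * CC."""
    return ALPHA * cs_semantic(pairwise_sims) + BETA * cc
```

With three architectures that agree strongly (all pairwise similarities at 0.9) and a CC of 0.8, this yields 0.4 · 0.9 + 0.6 · 0.8 = 0.84, which would land in the standard band.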
Why geometric mean matters

CS_semantic uses the geometric mean of pairwise similarities because it punishes outlier disagreement much harder than an arithmetic average.

If two architectures agree strongly but one diverges, arithmetic mean can hide the disagreement. Geometric mean preserves that disagreement as signal, which is exactly what a convergence protocol should do.

  • Arithmetic mean example: 0.633 can still look acceptable
  • Geometric mean example: 0.448 correctly falls into low confidence
  • Outlier disagreement should not be averaged away

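The gap is easy to reproduce. With illustrative pairwise similarities of 0.9, 0.9, and 0.1 (chosen for this sketch; the post's exact input values behind 0.633 and 0.448 are not given), the two means diverge sharply:

```python
import math

sims = [0.9, 0.9, 0.1]  # two strong agreements, one outlier

arith = sum(sims) / len(sims)             # ≈ 0.633, sits in the standard band
geo = math.prod(sims) ** (1 / len(sims))  # ≈ 0.433, falls into low confidence

print(f"arithmetic mean: {arith:.3f}")
print(f"geometric mean:  {geo:.3f}")
```

The arithmetic mean clears the 0.60 standard threshold despite the outlier; the geometric mean does not, which is the behavior the protocol wants.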
Why CC carries more weight

The current weighting is 40% CS_semantic and 60% CC because causal coherence is harder to game than surface-level semantic similarity.

Two outputs can sound similar while reaching different conclusions. CC is meant to capture premise consistency and conclusion alignment, which makes it the more important signal.

Phase 0 still uses an approximation for CC by embedding the first three sentences of each output. That limitation is explicit in the implementation and will be replaced in Devnet.
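A rough sketch of what that Phase 0 approximation could look like. The sentence splitting, the cosine helper, and especially the bag-of-words toy_embed stand-in are assumptions made to keep this self-contained; the actual coherence.py embeds the sentence prefixes with the sentence-transformers model from embeddings.py:

```python
import math
import re
from collections import Counter

def first_sentences(text, n=3):
    """Take the first n sentences as a crude proxy for premises and conclusion."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

def toy_embed(text):
    """Stand-in for the real sentence-transformers embedding: word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cc_approx(output_a, output_b):
    """Phase 0 CC approximation: similarity of each output's first three sentences."""
    return cosine(toy_embed(first_sentences(output_a)),
                  toy_embed(first_sentences(output_b)))
```

The point of the sketch is the shape of the limitation: only the opening sentences participate, so two outputs that diverge late in their reasoning can still score as causally coherent.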

Protocol thresholds and tests

The implementation currently enforces four convergence categories: rejected below 0.30, low confidence from 0.30 to 0.60, standard from 0.60 to 0.85, and high coherence above 0.85.

These values are treated as protocol invariants. The tests are intentionally written so changing a threshold requires a whitepaper amendment, not a silent code edit.

  • THRESHOLD_REJECT = 0.30
  • THRESHOLD_STANDARD = 0.60
  • THRESHOLD_HIGH = 0.85
  • ALPHA = 0.4 and BETA = 0.6 are also enforced

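Sketched as a categorization function with the invariants asserted alongside it. The constant names match the post; the function name and the handling of scores that land exactly on a threshold are assumptions:

```python
THRESHOLD_REJECT = 0.30
THRESHOLD_STANDARD = 0.60
THRESHOLD_HIGH = 0.85
ALPHA, BETA = 0.4, 0.6

def categorize(score):
    """Map a CoherenceScore to one of the four protocol convergence categories."""
    if score < THRESHOLD_REJECT:
        return "rejected"
    if score < THRESHOLD_STANDARD:
        return "low_confidence"
    if score < THRESHOLD_HIGH:
        return "standard"
    return "high_coherence"

# Protocol invariants: changing these should require a whitepaper amendment.
assert THRESHOLD_REJECT < THRESHOLD_STANDARD < THRESHOLD_HIGH
assert abs(ALPHA + BETA - 1.0) < 1e-12
```

Writing the invariants as assertions next to the constants mirrors the intent described above: a threshold change fails loudly instead of slipping through as a silent code edit.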
What this does not prove yet

The current PoC validates the formula and the invariant boundaries, but it does not yet prove how often real-world queries exceed the standard threshold at scale.

It also does not prove that the current CC approximation is close enough to the full Devnet implementation, or that all-MiniLM-L6-v2 is the ideal similarity model for every domain RAXION will target.

Those questions remain open until the broader benchmark runs.
