When AI systems search for information to answer complex questions, they often use two different approaches: one that looks for semantic similarity in text and another that follows connections in knowledge graphs. These systems produce scores that are fundamentally incompatible, like comparing temperatures in Celsius and Fahrenheit without a conversion chart. A new study reveals that simply calibrating these scores to a common scale before combining them can significantly improve an AI system's ability to retrieve the right information for multi-step reasoning tasks.
The researchers discovered that their calibration method, called PHASE GRAPH, improves last-hop retrieval performance on two standard benchmarks for multi-hop question answering. On the MuSiQue benchmark, using a strong HippoRAG2-style pipeline with NV-Embed-v2 embeddings and Llama 3.3 70B for entity extraction, the method increased last-hop retrieval accuracy from 75.1% to 76.5% on held-out test data. This improvement was statistically significant, with 8 wins and 1 loss compared to vector-only retrieval. On the 2WikiMultiHopQA benchmark, performance rose from 51.7% to 53.6% with 11 wins and 2 losses. PHASE GRAPH consistently outperformed rank-based fusion approaches that discard score magnitude information.
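The significance claims above rest on paired win/loss counts. The paper does not spell out which test it uses, but an exact two-sided sign test is one standard way to check whether such a record could arise by chance. The sketch below (stdlib only; the function name `sign_test_p` is an illustrative choice, not from the paper) applies it to the reported counts:

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test: probability of a record at least this
    lopsided if wins and losses were equally likely (ties excluded)."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Win/loss records reported above; whether the paper used a sign test
# is an assumption made for illustration.
print(round(sign_test_p(8, 1), 4))    # MuSiQue: 8 wins, 1 loss
print(round(sign_test_p(11, 2), 4))   # 2WikiMultiHopQA: 11 wins, 2 losses
```

Both records come out below the conventional 0.05 threshold under this test, consistent with the paper's significance claim.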
The core innovation lies in how the system handles the fundamental mismatch between different retrieval scores. Vector retrieval using cosine similarity produces scores that cluster in a narrow Gaussian distribution with a median around 0.29, while graph retrieval using Personalized PageRank follows a power-law distribution with most values around 0.001 and rare peaks near 0.3. The researchers address this through percentile-rank normalization, which maps each system's scores to a common unit-free scale between 0 and 1 using the probability integral transform. This calibrated approach preserves within-system ordering while making cross-system values comparable. The system then applies Boltzmann weighting with temperature calibration, converting percentile ranks to energies and combining them with mixing parameters that control vector-graph weighting and consensus boosting.
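A minimal sketch of this two-step recipe, under simplifying assumptions: the energy definition, the `alpha` and `temperature` values, and the omission of the consensus-boost term are all illustrative choices here, not the paper's exact parameterization.

```python
import numpy as np

def percentile_normalize(scores):
    """Map raw scores to (0, 1] percentile ranks (probability integral
    transform). Preserves within-system ordering while discarding the
    raw scale, so cosine similarities and PageRank masses line up."""
    s = np.asarray(scores, dtype=float)
    ranks = s.argsort().argsort()  # 0-based rank of each score
    return (ranks + 1) / len(s)

def boltzmann_fuse(vec_pct, graph_pct, alpha=0.5, temperature=0.1):
    """Boltzmann-weighted fusion of two percentile-rank arrays.
    Percentile ranks become energies (1 - rank), mixed by alpha,
    then exponentiated and renormalized into fusion weights."""
    energy = alpha * (1.0 - vec_pct) + (1.0 - alpha) * (1.0 - graph_pct)
    weights = np.exp(-energy / temperature)
    return weights / weights.sum()

# Raw scores on incompatible scales for five candidate passages:
# narrow cosine similarities vs power-law-like PageRank masses.
cos_sim = [0.29, 0.31, 0.275, 0.27, 0.30]
pagerank = [0.001, 0.0008, 0.25, 0.0012, 0.002]

fused = boltzmann_fuse(percentile_normalize(cos_sim),
                       percentile_normalize(pagerank))
print(fused.argsort()[::-1])  # candidate indices, best first
```

Note how candidate 4, which is strong but not top-ranked in both systems, beats candidate 2, which dominates the graph ranking but sits low in the vector ranking: after calibration, agreement across systems carries weight.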
The data shows clear evidence that calibration matters more than the specific fusion formula. In a theory-guided ablation study on 2WikiMultiHopQA, percentile-based normalization proved directionally more robust than min-max normalization on both tune and test splits; min-max recorded 1 win and 6 losses against the baseline. When comparing fusion formulas, Boltzmann weighting performed comparably to linear fusion after calibration, with 0 wins and 3 losses. The researchers found that once scores are made commensurable, the exact post-calibration operator appears to matter less than in uncalibrated fusion. Figure 2 from the paper illustrates how raw score distributions occupy incomparable scales, while Figure 3 demonstrates how percentile normalization maps both distributions to approximately uniform ranges.
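The intuition for why percentile normalization is more robust than min-max can be seen on a synthetic heavy-tailed score list. This is a sketch: the score values and the 0.05 threshold below are illustrative, not taken from the paper.

```python
import numpy as np

def minmax_normalize(scores):
    """Linearly rescale scores to [0, 1] using the observed min and max."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def percentile_normalize(scores):
    """Map scores to (0, 1] percentile ranks, as in the calibration step."""
    s = np.asarray(scores, dtype=float)
    return (s.argsort().argsort() + 1) / len(s)

# 99 tiny heavy-tailed graph masses plus one dominant peak, mimicking the
# Personalized PageRank shape described above.
graph_scores = np.append(1e-3 / np.arange(1, 100), 0.3)

mm = minmax_normalize(graph_scores)
pct = percentile_normalize(graph_scores)

# Under min-max the lone peak pins the scale: 99 of 100 scores land below
# 0.05 and carry almost no weight in any score-based fusion.
print((mm < 0.05).mean())
# Percentile ranks are uniform by construction, so mid-ranked passages
# still contribute after fusion.
print((pct < 0.05).mean())
```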
This work has important implications for building more reliable AI systems that need to answer complex questions requiring multiple steps of reasoning. By treating graph-vector retrieval as a calibration problem first, developers can create more robust systems without needing increasingly elaborate fusion rules. The research shows that practical considerations like entity disambiguation, pool capping, and database isolation can matter as much as the fusion formula itself. On a curated 66-query subset selected for knowledge graph coverage, all percentile-normalized strategies reached 26 last-hop retrieval wins, and adding Ising reranking provided one additional win. However, the researchers caution that these gains require careful tuning and don't transfer automatically to all scenarios.
The study acknowledges several limitations. The primary gains are confirmed only on the MuSiQue and 2WikiMultiHopQA benchmarks, with a HotpotQA cross-check showing no effect due to a strong vector baseline. On 2WikiMultiHopQA using a legacy pipeline, score-based fusion requires per-corpus calibration to transfer effectively, and low loss counts require conservative tuning. The 66-query hard slice is filtered for knowledge graph coverage, making it favorable to graph methods by construction. Testing 660+ configurations on 66 queries creates degrees of freedom that require careful interpretation. Full-corpus gains, while statistically significant, are relatively small at +2.5 percentage points for last-hop retrieval at top-10, and improvements in standard recall metrics are not significant. The research also notes that the database isolation experiments provide strong evidence of cross-dataset contamination but not a tightly controlled causal ablation.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.