
AI Models Collaborate Privately Using Hidden Memory

A new federated inference method allows different AI models to share internal memory states, boosting accuracy by over 20% without exposing sensitive user data.

AI Research
April 01, 2026
4 min read

Large language models (LLMs) on devices like smartphones often struggle with accuracy and speed compared to their cloud counterparts, but a new approach enables them to work together effectively while protecting privacy. Researchers from the Singapore University of Technology and Design and Zhejiang University have developed a framework called Federated Refinement (FedRefine), which allows heterogeneous LLMs to collaborate by sharing key-value (KV) caches—internal memory states that capture semantic knowledge—instead of raw text. This addresses critical challenges in latency, privacy, and system heterogeneity, offering a scalable solution for edge computing where models vary in architecture and size. By leveraging cache-to-cache communication, FedRefine skips the delays associated with text-based interactions, enabling faster and more accurate inference without compromising user data.

The key finding of this research is that sharing KV caches between different LLMs can significantly enhance inference performance while maintaining privacy. In experiments, when four transmitter models collaborated with a receiver model (Qwen3-0.6B) using non-private KV cache sharing, accuracy improved by 21.2% over standalone inference, as shown in Figure 3(a). Even with privacy-preserving measures—where input tokens are rephrased to obscure intent—the accuracy drop was only 3%, demonstrating that the approach effectively balances performance and data protection. Compared to traditional text-to-text communication, which requires rebuilding KV caches and incurs high latency, cache-to-cache communication achieved about 15% higher accuracy in full-participation settings, though it demands more bandwidth (88 KB per token versus 16 bytes for text). This highlights the trade-off between communication efficiency and inference gains, with FedRefine optimizing for low-latency, high-accuracy outcomes in resource-constrained environments.
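The scale of that bandwidth trade-off is easy to make concrete with a back-of-the-envelope calculation. The 88 KB and 16 byte per-token figures come from the article; the prompt length and helper function below are illustrative assumptions, not part of the paper:

```python
# Back-of-the-envelope comparison of per-token communication cost
# for cache-to-cache vs. text-to-text collaboration, using the
# per-token figures reported in the article.
KV_BYTES_PER_TOKEN = 88 * 1024   # cache-to-cache: ~88 KB per token
TEXT_BYTES_PER_TOKEN = 16        # text-to-text:   ~16 bytes per token

def transfer_cost(num_tokens: int, bytes_per_token: int) -> int:
    """Total bytes moved to share num_tokens tokens with one peer."""
    return num_tokens * bytes_per_token

prompt_len = 256  # hypothetical prompt length in tokens

kv_cost = transfer_cost(prompt_len, KV_BYTES_PER_TOKEN)
text_cost = transfer_cost(prompt_len, TEXT_BYTES_PER_TOKEN)

print(f"KV cache: {kv_cost / (1024 * 1024):.1f} MiB")  # 22.0 MiB
print(f"Text:     {text_cost} bytes")                   # 4096 bytes
print(f"Ratio:    {kv_cost // text_cost}x")             # 5632x
```

Even for a modest 256-token prompt, cache sharing moves tens of mebibytes where plain text moves a few kilobytes, which is why the paper flags bandwidth as the main cost of its accuracy gains.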

The methodology behind FedRefine builds on two core ideas: self-refinement, where an LLM iteratively improves its own output, and cache-to-cache communication, which allows models to exchange KV caches directly. As illustrated in Figure 1, this involves training fuser networks—such as three-layer MLPs—to project KV caches from one model to another, enabling seamless collaboration even between architectures with different tokenization schemes or embedding dimensions. For instance, in a bidirectional setup (Co-C2C), fusers like Fuser12 and Fuser21 facilitate mutual refinement, allowing smaller models to assist larger ones and vice versa. The framework, depicted in Figure 2, scales to multiple LLMs by maintaining pre-trained fusers for all possible pairs, with a gating network to select relevant cache data. Input tokens are rephrased using the receiver model to preserve privacy, ensuring that sensitive information remains local while still enabling effective knowledge transfer across the network.
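The fuser idea—a small three-layer MLP that projects one model's KV cache into another model's dimension—can be sketched in a few lines of NumPy. The hidden size, ReLU activation, random initialization, and the 128/64-dimensional shapes below are all illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def make_fuser(src_dim: int, dst_dim: int, hidden: int = 256, seed: int = 0):
    """Build a three-layer MLP mapping KV vectors from a transmitter
    model's cache dimension to a receiver model's cache dimension."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((src_dim, hidden)) * 0.02
    W2 = rng.standard_normal((hidden, hidden)) * 0.02
    W3 = rng.standard_normal((hidden, dst_dim)) * 0.02

    def fuser(kv: np.ndarray) -> np.ndarray:
        # kv: (num_tokens, src_dim) slice of the transmitter's KV cache
        h = np.maximum(kv @ W1, 0.0)   # layer 1 + ReLU
        h = np.maximum(h @ W2, 0.0)    # layer 2 + ReLU
        return h @ W3                  # layer 3: project to receiver space

    return fuser

# Project a 10-token KV slice from a 128-dim transmitter
# into a 64-dim receiver space (dimensions are hypothetical).
fuser12 = make_fuser(src_dim=128, dst_dim=64)
projected = fuser12(np.ones((10, 128)))
print(projected.shape)  # (10, 64)
```

In the full framework, one such fuser would be pre-trained per model pair (Fuser12, Fuser21, and so on), with the gating network deciding which projected caches the receiver actually consumes.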

Results from the case study, detailed in Figure 3, validate the efficacy of FedRefine across various metrics. Accuracy consistently increased with more participating models, rising from standalone baselines as additional sharer clients joined, with the KV-based approach significantly outperforming text-based alternatives. Latency evaluations in Figure 3(c) showed that privacy-preserving cache communication, despite added overhead from query rewriting, remained much lower than text-to-text communication, making it suitable for real-time applications. Figure 3(b) further revealed that the intrinsic capabilities of transmitter models, such as Qwen2.5-1.5B or Llama3.2-1B, directly influenced collaborative performance, emphasizing the importance of model selection in heterogeneous systems. These results underscore that FedRefine enables efficient knowledge transfer with minimal performance degradation, even under privacy constraints, paving the way for more adaptive AI deployments on edge devices.

In terms of real-world impact, this research offers a new paradigm for deploying AI in privacy-sensitive areas like healthcare, finance, and personal assistants, where data cannot be shared openly. By allowing devices to collaborate without exposing raw text, FedRefine could enable smarter, faster applications on smartphones and IoT devices, reducing reliance on cloud servers and enhancing user trust. The framework's model-agnostic design means it can integrate diverse AI systems, from small on-device models to larger edge servers, fostering more flexible and scalable intelligent networks. Future directions highlighted in the paper include iterative local refinement, continuous global federation, and extensions to multi-modal LLMs, which could further improve collaborative capabilities. However, the current approach requires significant communication resources for cache transmission, suggesting a need for adaptive strategies that balance bandwidth usage with task requirements.

Limitations of FedRefine, as noted in the paper, include the high communication load associated with KV cache sharing—88 KB per token compared to 16 bytes for text—which may strain network resources in bandwidth-constrained scenarios. The framework also relies on pre-trained fusers for each model pair, adding complexity to deployment and maintenance, especially as the number of models grows. While privacy is enhanced through token rephrasing, there may be residual risks if rephrasing strategies are not robust enough to fully obscure sensitive intent. Additionally, the case study focused on specific models and datasets (e.g., OpenBookQA), and performance may vary with different architectures or tasks. Future research will need to address these limitations by exploring dynamic communication schemes, optimizing fuser training, and expanding evaluations to broader applications, ensuring that federated inference can scale sustainably in diverse real-world settings.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
