AIResearch
Science

AI Learns to Remember by Forgetting Selectively

A new method helps AI models retain old knowledge while learning new tasks, reducing memory use by 90% and offering a more balanced approach to continual learning.

AI Research
April 01, 2026
4 min read

Artificial intelligence systems often struggle with a fundamental flaw: when they learn new information, they tend to forget what they previously knew. This problem, known as catastrophic forgetting, limits the deployment of AI in dynamic real-world environments where models must adapt to evolving data without losing prior expertise. Researchers have now developed a method called Selective Forgetting-Aware Optimization (SFAO), which addresses this issue by allowing AI to selectively forget or retain information based on gradient alignment, achieving a 90% reduction in memory usage on benchmarks like MNIST while maintaining competitive performance. This advancement could enable more efficient AI applications in areas such as autonomous driving and medical diagnostics, where continuous learning is essential but resources are constrained.

The key finding from the research is that SFAO dynamically regulates gradient updates through a gating mechanism based on cosine similarity, which measures the alignment between new and past learning directions. The mechanism operates by accepting, projecting, or discarding updates at each layer of a neural network, depending on whether the cosine similarity exceeds certain thresholds. If the alignment is high (above an acceptance threshold), the update is accepted; if it is moderate, it is projected to avoid interference; and if it is low, it is discarded. This approach allows the model to balance plasticity (learning new tasks) and stability (retaining old knowledge) without relying on large memory buffers or fixed regularization, as demonstrated in experiments where SFAO achieved average accuracy close to or better than baseline methods such as Orthogonal Gradient Descent (OGD) on datasets like Split CIFAR-10.
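To make the accept/project/discard logic concrete, here is a minimal sketch of the per-update decision. The function name, threshold values, and the use of a single past-gradient direction are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def gate_update(grad, past_grad, lam_accept=0.5, lam_proj=0.0):
    """Gate a gradient update by its cosine alignment with a past
    gradient direction (threshold values here are illustrative)."""
    eps = 1e-12
    cos = np.dot(grad, past_grad) / (
        np.linalg.norm(grad) * np.linalg.norm(past_grad) + eps
    )
    if cos >= lam_accept:
        # High alignment: accept the update as-is.
        return grad
    if cos >= lam_proj:
        # Moderate alignment: project out the component along the past
        # direction so the update does not interfere with old knowledge.
        u = past_grad / (np.linalg.norm(past_grad) + eps)
        return grad - np.dot(grad, u) * u
    # Low alignment: discard the update entirely.
    return np.zeros_like(grad)
```

Tuning the two thresholds shifts the balance: a higher acceptance threshold favors stability (more projections and discards), while a lower one favors plasticity.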

The methodology behind SFAO involves a per-layer gating rule that uses cosine similarity to decide how to handle gradient updates. The researchers maintain a buffer of past gradients and compute the maximum cosine alignment between the current gradient and a randomly sampled subset of this buffer, a Monte Carlo approximation that reduces computational cost. Based on two tunable thresholds (λproj and λaccept), the system then accepts the gradient as is, projects it onto the orthogonal complement of past gradients to prevent interference, or discards it entirely. This process integrates into standard stochastic gradient descent (SGD) optimization with minimal overhead (training time increased by less than 6-8% compared to vanilla SGD), and it works with various neural network architectures, from simple MLPs to more complex models like Wide ResNet-28×10.
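The buffer-sampling step described above can be sketched as follows. This assumes the buffer is a simple list of flattened past gradients; the function name, sample size, and buffer handling are illustrative, not the paper's implementation:

```python
import numpy as np

def max_cosine_alignment(grad, grad_buffer, sample_size=8, rng=None):
    """Monte Carlo approximation: estimate the maximum cosine alignment
    between the current gradient and the past-gradient buffer by
    checking only a random subset, reducing computational cost."""
    rng = rng or np.random.default_rng()
    eps = 1e-12
    k = min(sample_size, len(grad_buffer))
    idx = rng.choice(len(grad_buffer), size=k, replace=False)
    g = grad / (np.linalg.norm(grad) + eps)
    return max(
        float(np.dot(g, past / (np.linalg.norm(past) + eps)))
        for past in (grad_buffer[i] for i in idx)
    )
```

The returned alignment would then be compared against λproj and λaccept to choose accept, project, or discard before applying the SGD weight update for that layer.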

Results from the paper show that SFAO performs competitively across multiple continual learning benchmarks. On Split MNIST, SFAO achieved accuracies such as 93.6% on Task 1 and 86.8% on Task 5, outperforming EWC and SGD in retention while being memory-efficient. On Permuted MNIST, it reached 76.0% on Task 1 and 82.8% on Task 3, narrowing the gap with OGD at higher thresholds. For more complex datasets like Split CIFAR-100, SFAO demonstrated consistent performance across tasks, with accuracies ranging from 10.1% on Task 1 to 58.1% on Task 10 when using a Wide ResNet backbone, showing a balanced trade-off between stability and plasticity. Memory usage was also drastically reduced: on Split MNIST, SFAO used only 153.71 MB compared to OGD's 1441.82 MB, and projection frequency remained low, indicating computational efficiency.

The implications of this research are significant for real-world AI applications where models must learn continuously without forgetting. By reducing memory costs and providing a tunable mechanism for controlling forgetting, SFAO makes continual learning more feasible in resource-constrained scenarios, such as edge devices or systems with limited computational power. It also offers a more generalizable solution than methods like EWC and SI, which required architectural adjustments for stability; SFAO maintained performance across different model backbones without such modifications. This could lead to more robust AI in fields like cybersecurity, where models need to adapt to new threats while remembering old ones, or in healthcare, where diagnostic tools must update with new data without losing accuracy on previous cases.

However, the study acknowledges limitations, including the instability of some baseline methods like EWC and SI, which necessitated switching to more complex architectures for stable training. This highlights a broader challenge in continual learning: the need for methods that are robust across diverse architectures and model capacities. While SFAO showed architecture-agnostic stability, future work must focus on developing techniques that maintain consistent performance without architectural workarounds, especially for deployment in resource-constrained environments. The paper also notes that task-ordering effects and dynamic threshold adaptation are areas for further exploration to enhance SFAO's adaptability and robustness across varying learning sequences.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn