Machine Learning Must Evolve for Biology

Machine learning has transformed fields like physics by embedding known laws into models, but it struggles with the messy realities of biology. A new position paper argues that this mismatch is not a barrier but an opportunity to evolve the field, and it proposes Biology-Informed Machine Learning (BIML) as the next step. Such a shift could yield more accurate models for drug discovery, disease understanding, and ecological forecasting, which makes it relevant to both healthcare and environmental science.

The paper's central claim is that Physics-Informed Machine Learning (PIML), which builds known governing equations such as differential equations into models, excels in domains with clear, observable systems but falls short in biology, where uncertainty and complexity dominate. The authors propose BIML, which adapts PIML's principles to the inherent noise, unobserved variables, and context-dependence of biological data. The approach retains the interpretability of physics-based methods while embracing probabilistic reasoning and knowledge drawn from multiple sources.
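To make the PIML idea concrete, here is a minimal sketch of a physics-informed loss. It is not taken from the paper: the decay equation dx/dt = -k x, the function name physics_informed_loss, and all numerical values are assumptions chosen for illustration. The loss adds a data-misfit term to the mean squared residual of the assumed equation, so a candidate trajectory is penalized both for missing the observations and for violating the known dynamics.

```python
import numpy as np

def physics_informed_loss(x_pred, t, x_obs, obs_idx, k, weight=1.0):
    """Data misfit plus the residual of the assumed ODE dx/dt = -k * x."""
    # Misfit at the few time points where observations exist.
    data_loss = np.mean((x_pred[obs_idx] - x_obs) ** 2)
    # Finite-difference estimate of dx/dt on the time grid.
    dxdt = np.gradient(x_pred, t)
    # Residual of the assumed governing equation; zero if the ODE holds exactly.
    residual = dxdt + k * x_pred
    physics_loss = np.mean(residual ** 2)
    return data_loss + weight * physics_loss

# Hypothetical setup: exponential decay observed at five time points.
t = np.linspace(0.0, 2.0, 201)
k = 1.5
obs_idx = np.array([0, 50, 100, 150, 200])
x_true = np.exp(-k * t)
x_obs = x_true[obs_idx]

# Candidate 1: the exact solution, consistent with data and ODE.
loss_consistent = physics_informed_loss(x_true, t, x_obs, obs_idx, k)

# Candidate 2: a straight line through the first and last observation,
# which matches two data points but violates the ODE.
x_line = x_obs[0] + (x_obs[-1] - x_obs[0]) * t / t[-1]
loss_violating = physics_informed_loss(x_line, t, x_obs, obs_idx, k)

print(loss_consistent, loss_violating)
```

The sketch presumes that the governing equation and its rate constant are known and that the state is directly observed. Those are exactly the assumptions the paper argues rarely hold in biological systems.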

Methodologically, the paper describes BIML as building on PIML through four pillars: uncertainty quantification, contextualization, constrained latent structure inference, and scalability. Uncertainty quantification uses probabilistic frameworks to manage conflicting biological data, such as varying confidence in molecular interactions. Contextualization disentangles mechanisms that vary across individuals or conditions, while constrained latent structure inference recovers unobserved species such as intracellular signals. Scalability means that models must handle high-dimensional systems efficiently, drawing on techniques like Bayesian last layers and modular architectures.
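One of the techniques named above, the Bayesian last layer, can be sketched in a few lines. This is an illustrative assumption rather than the paper's implementation: a fixed random tanh feature map stands in for the body of a trained network, and only the final linear layer is given a Gaussian prior with a closed-form posterior. The names features, fit_last_layer, and predict, and the values of alpha and beta, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1, 20))  # hypothetical frozen weights of the network body
b = rng.normal(size=20)

def features(x):
    """Fixed feature map standing in for a trained network body."""
    return np.tanh(x[:, None] * W + b)

def fit_last_layer(phi, y, alpha=1.0, beta=25.0):
    """Gaussian posterior over last-layer weights.

    alpha is the prior precision, beta the assumed noise precision.
    """
    precision = alpha * np.eye(phi.shape[1]) + beta * phi.T @ phi
    cov = np.linalg.inv(precision)
    mean = beta * cov @ phi.T @ y
    return mean, cov

def predict(phi, mean, cov, beta=25.0):
    """Predictive mean and variance (noise variance plus weight uncertainty)."""
    mu = phi @ mean
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", phi, cov, phi)
    return mu, var

# Hypothetical noisy measurements of a smooth response.
x_train = rng.uniform(-2.0, 2.0, size=40)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=40)
x_test = np.linspace(-3.0, 3.0, 7)

# Fit once on 5 points and once on all 40 (a superset of the 5).
mean_few, cov_few = fit_last_layer(features(x_train[:5]), y_train[:5])
mean_all, cov_all = fit_last_layer(features(x_train), y_train)
_, var_few = predict(features(x_test), mean_few, cov_few)
mu_all, var_all = predict(features(x_test), mean_all, cov_all)

print(var_few)
print(var_all)
```

Because only the last layer is treated probabilistically, the posterior is available in closed form and its cost grows with the number of features rather than with the size of the whole network. That is one reason such layers are attractive when scalability matters. The predictive variance shrinks as measurements are added but never falls below the assumed noise level.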

The paper points out that current PIML benchmarks, such as ODEBench, focus on low-dimensional, synthetic systems and do not reflect biological realities. In contrast, its illustrative BIML examples, such as inferring gene regulatory networks, show how the approach can integrate data from diverse sources, handle noise, and generalize to unseen interventions. The paper warns that without such evolution, machine learning risks overstating its applicability in biology, as suggested by the limited uptake of these methods in the field despite their conceptual appeal.

BIML matters because the complexities of biological data, such as sparse measurements and heterogeneous sources, are common in real-world applications like personalized medicine and climate science. By improving model accuracy and interpretability, BIML could lead to better predictions of drug responses or ecosystem dynamics, with direct consequences for public health and environmental policy. The paper emphasizes that the goal is not to replace PIML but to refine it for high-stakes, ambiguous scenarios.

One limitation is the lack of benchmarks that stress-test models under biological conditions; the paper considers current evaluations inadequate. It also acknowledges that BIML is a proposal requiring community effort, with open questions about how to integrate knowledge systematically and how to scale without compromising fidelity. It calls for interdisciplinary collaboration to validate and implement these ideas, noting that incremental fixes to PIML may not suffice for biology's unique challenges.

About the Author

Guilherme A.