A new approach in artificial intelligence is making it possible to analyze complex data types, such as graphs and time series, with greater flexibility and efficiency. Researchers have developed the generalized Proximity Forest model, which extends the capabilities of existing machine learning techniques to handle data that traditional methods like Random Forests struggle with. This advancement enables applications like supervised outlier detection and missing data imputation across a wider range of domains, from financial analysis to scientific research, by leveraging distance-based similarities rather than feature-based splits.
The key finding from this work is that the generalized Proximity Forest model can match or outperform k-nearest neighbors in various tasks while offering unique benefits. For instance, in experiments, the model achieved accuracy comparable to k-nearest neighbors on the Palmer Penguin dataset, with a mean difference of only 0.0044 across trials. More notably, it demonstrated superior performance in specific scenarios: in graph classification, it achieved a test accuracy of 0.7768 compared to 0.6250 for k-nearest neighbors, and in multivariate time series analysis it reached an accuracy of 0.9757 versus 0.9486. These results highlight the model's adaptability and effectiveness in diverse data contexts.
The methodology behind this innovation involves modifying the original Proximity Forest model, which was designed for univariate time series, to accept custom distance measures. Instead of splitting on features as traditional Random Forests do, the model uses pairwise distances, such as Dynamic Time Warping for time series or the Weisfeiler-Lehman distance for graphs, to split data based on proximity to exemplars. This allows it to handle data types like graph-valued data or multivariate time series of unequal length. Additionally, the researchers introduced a variant for regression tasks and a meta-learning framework that enables any pre-trained classifier to perform supervised imputation by using model-informed distances.
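The splitting idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one randomly chosen exemplar per class and routes each instance to the branch of its nearest exemplar under whatever distance measure is supplied.

```python
import random

def proximity_split(items, distance, rng=random.Random(0)):
    """Illustrative proximity-based split: pick one exemplar per class,
    then route every (instance, label) pair to the branch of the
    exemplar nearest to the instance under the given distance."""
    labels = sorted({y for _, y in items})
    # One exemplar sampled per class (the real model evaluates many
    # candidate exemplar sets and keeps the best split).
    exemplars = {y: rng.choice([x for x, yy in items if yy == y]) for y in labels}
    branches = {y: [] for y in labels}
    for x, y in items:
        nearest = min(labels, key=lambda c: distance(x, exemplars[c]))
        branches[nearest].append((x, y))
    return exemplars, branches

# Toy 1-D data with absolute difference standing in for a real
# distance measure such as Dynamic Time Warping.
data = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.1, "b")]
exemplars, branches = proximity_split(data, lambda u, v: abs(u - v))
```

Because the split only ever calls `distance`, the same tree-growing logic works unchanged for time series under Dynamic Time Warping or graphs under the Weisfeiler-Lehman distance.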
Analysis of the results reveals several advantages of the generalized Proximity Forest model. In outlier detection, as shown in Figure 1, the model's Geometry- and Accuracy-Preserving (GAP) proximities produced meaningful outlier scores that aligned with visual patterns in multidimensional scaling embeddings. For imputation, experiments on simulated 2-sphere data showed that GAP-based imputation led to a post-imputation test accuracy of 0.8800, compared to 0.8467 for k-nearest neighbors. The model also scaled more efficiently, with O(log(N)) inference complexity versus O(N) for brute-force k-nearest neighbors, making it suitable for larger datasets where distance measures like Dynamic Time Warping admit no acceleration structures because they violate the triangle inequality.
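The imputation side of this can be illustrated with a generic proximity-weighted average. Note the hedge: in the paper the weights are GAP proximities produced by a trained forest, whereas this sketch simply takes nonnegative weights as given and shows how observed training values are combined into an imputed value.

```python
def proximity_impute(observed_values, proximities):
    """Impute a missing value as the proximity-weighted mean of the
    values observed in the training set. The weights here stand in
    for forest-derived GAP proximities; any nonnegative weighting
    that sums to a positive total works the same way."""
    total = sum(proximities)
    if total <= 0:
        raise ValueError("proximities must have a positive sum")
    return sum(v * w for v, w in zip(observed_values, proximities)) / total

# A test instance is most proximate to the first training instance,
# so the imputed value is pulled toward 1.0.
imputed = proximity_impute([1.0, 2.0, 4.0], [0.5, 0.25, 0.25])  # -> 2.0
```

Plain k-nearest-neighbor imputation is the special case where the k nearest training instances get equal weight and all others get zero.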
The implications of this research are significant for real-world applications where data comes in varied and complex forms. By extending Random Forest proximities to new data types, the generalized Proximity Forest model enables more accurate analysis in fields like bioinformatics, where graph data is common, or finance, where time series vary in length. The meta-learning aspect allows existing AI models to gain imputation capabilities, potentially improving data quality in sensitive areas without accessing original information. This could streamline workflows in industries reliant on robust data handling, from healthcare to environmental monitoring.
However, the study acknowledges limitations that point to areas for future work. The model's performance on small, vector-valued datasets, as shown in Figure 2, indicates that Random Forests may still dominate in accuracy for such cases, with the generalized Proximity Forest ranking comparably to k-nearest neighbors. Additionally, the current implementation assumes test instances align over the same time points, leaving variable-length sequences for future exploration. The researchers also note that the distance measure used in meta-learning, such as the class-matching distance defined in equation (1), is not a true metric, which could affect certain applications. These constraints suggest that while the model broadens AI's reach, further refinements are needed to optimize it across all data scenarios.
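The non-metric caveat about the class-matching distance is easy to see concretely. The sketch below is one plausible reading of such a distance, not the paper's equation (1): it treats two instances as distance 0 whenever a pre-trained classifier predicts the same label for both, which immediately violates the identity of indiscernibles (distinct points can be at distance zero).

```python
def class_matching_distance(predict, x, y):
    """Hypothetical class-matching distance: 0 when a pre-trained
    classifier assigns x and y the same label, 1 otherwise. This is
    an assumption for illustration only; the paper's equation (1)
    may be defined differently. Either way, d(x, y) = 0 with x != y
    shows such a distance is not a true metric."""
    return 0.0 if predict(x) == predict(y) else 1.0

# Toy pre-trained "classifier": predicts whether the input is nonnegative.
predict = lambda v: v >= 0
same_class = class_matching_distance(predict, 1.5, 2.5)   # distinct points, distance 0
diff_class = class_matching_distance(predict, -1.0, 2.0)  # distance 1
```

This is exactly the property that rules out metric-tree style accelerations, which is why the tree-based O(log(N)) inference of the Proximity Forest is valuable here.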
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.