Data scientists face a daunting challenge when building machine learning models: with countless combinations of preprocessing steps, feature engineering, and algorithms to choose from, finding the best pipeline can feel like searching for a needle in a haystack. This combinatorial explosion means that evaluating every possible option is computationally infeasible, especially for large datasets. A new automated machine learning (AutoML) system, developed by researchers at Cornell University, tackles this problem head-on by using matrix and tensor factorization to efficiently predict which pipelines will perform best, dramatically reducing the time and resources needed for model selection.
The key finding is that the performance of different machine learning pipelines across datasets can be accurately modeled using low-rank matrix and tensor factorizations. Because the performance data has this low-rank structure, the system can predict how well untested pipelines will perform from a small number of initial evaluations. For example, it can infer the effectiveness of 23,424 possible pipeline combinations after running just a fraction of them.
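To make the low-rank idea concrete, here is a minimal sketch on synthetic data. It is not the paper's code: the matrix sizes, the rank of 5, and the names `E`, `X`, `Y`, and `e_new` are all assumptions chosen for illustration. It builds a dataset-by-pipeline error matrix that is exactly low rank, factors it with a truncated SVD, and then predicts every entry for a new dataset after observing only 10 of its 200 pipelines.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_pipelines, rank = 40, 200, 5  # illustrative sizes (assumed)

# Synthetic error matrix of exact rank 5: rows are datasets, columns are pipelines.
A_true = rng.normal(size=(n_datasets, rank))
B_true = rng.normal(size=(rank, n_pipelines))
E = A_true @ B_true

# Truncated SVD gives dataset embeddings X and pipeline embeddings Y with E ~ X @ Y.
U, s, Vt = np.linalg.svd(E, full_matrices=False)
X = U[:, :rank] * s[:rank]
Y = Vt[:rank, :]
rel_err = np.linalg.norm(E - X @ Y) / np.linalg.norm(E)

# A new dataset whose pipeline errors follow the same low-rank structure.
e_new = rng.normal(size=rank) @ B_true

# Observe only 10 pipelines, fit the new dataset's embedding by least squares,
# then predict the error of all 200 pipelines.
observed = rng.choice(n_pipelines, size=10, replace=False)
x_hat, *_ = np.linalg.lstsq(Y[:, observed].T, e_new[observed], rcond=None)
e_pred = x_hat @ Y
max_abs_err = np.max(np.abs(e_pred - e_new))

print(f"relative reconstruction error: {rel_err:.2e}")
print(f"max prediction error on new dataset: {max_abs_err:.2e}")
```

On this noiseless toy data both errors should be close to zero. Real performance matrices are only approximately low rank, so predictions there would carry some error.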
The methodology involves two main phases: an offline phase and an online phase. In the offline phase, the system collects performance data from a corpus of training datasets, building a surrogate model that captures the relationships between datasets and pipelines. This model uses matrix factorization for simpler cases and tensor factorization for more complex scenarios involving multiple preprocessing components. In the online phase, when presented with a new dataset, the system actively selects a small subset of pipelines to run, using their results to infer the performance of all others through linear regression and optimization techniques. The approach includes a greedy algorithm for time-constrained experimental design, which maximizes information gain while staying within computational budgets.
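The online phase can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names `greedy_design` and `predict_all`, the regularizer `delta`, and the scoring rule (log-determinant information gain divided by runtime) are stand-ins for a time-constrained greedy design. The pipeline embeddings `Y` would come from the offline factorization, and `runtimes` from a runtime predictor; here both are random.

```python
import numpy as np

def greedy_design(Y, runtimes, budget, delta=1e-3):
    """Greedily pick pipelines (columns of Y) that fit within a time budget.

    Each step scores every affordable, unchosen pipeline by its gain in
    log det(A), where A = delta*I + sum of y_j y_j^T over chosen pipelines,
    divided by that pipeline's runtime. (Scoring rule is an assumption.)
    """
    k, n = Y.shape
    A = delta * np.eye(k)
    chosen, remaining = [], float(budget)
    while True:
        candidates = [j for j in range(n)
                      if j not in chosen and runtimes[j] <= remaining]
        if not candidates:
            break
        A_inv = np.linalg.inv(A)
        # Matrix determinant lemma: log det(A + y y^T) - log det(A) = log(1 + y^T A^-1 y)
        gains = [np.log1p(Y[:, j] @ A_inv @ Y[:, j]) / runtimes[j]
                 for j in candidates]
        best = candidates[int(np.argmax(gains))]
        chosen.append(best)
        remaining -= runtimes[best]
        A += np.outer(Y[:, best], Y[:, best])
    return chosen

def predict_all(Y, chosen, observed_errors):
    """Fit the new dataset's embedding by least squares, then predict every pipeline."""
    x_hat, *_ = np.linalg.lstsq(Y[:, chosen].T, observed_errors, rcond=None)
    return x_hat @ Y

# Toy demo with random embeddings and runtimes (all values are made up).
rng = np.random.default_rng(1)
Y = rng.normal(size=(4, 60))               # 4 latent factors, 60 pipelines
runtimes = rng.uniform(1.0, 10.0, size=60)
budget = 40.0
e_true = rng.normal(size=4) @ Y            # noiseless errors for the new dataset

chosen = greedy_design(Y, runtimes, budget)
e_pred = predict_all(Y, chosen, e_true[chosen])
total_time = float(np.sum(runtimes[chosen]))
print(f"ran {len(chosen)} of {Y.shape[1]} pipelines in {total_time:.1f} time units")
```

Dividing the information gain by runtime favors cheap, informative pipelines, which is one simple way to trade off what is learned against what it costs to learn it.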
Results from experiments on real-world classification problems demonstrate the system's effectiveness. In tests involving 215 OpenML datasets and 183 estimators, the method outperformed existing AutoML systems such as auto-sklearn and TPOT. For instance, it achieved lower average rankings in pipeline performance (a lower rank is better), meaning it consistently identified better-performing pipelines faster. The system was also robust when meta-training data was limited, maintaining strong performance even with only 3% of pipeline-dataset combinations observed. Visualizations of hyperparameter landscapes confirmed that predictions closely matched actual performance trends, capturing both broad patterns and fine-grained variations.
This research matters because it addresses a fundamental bottleneck in machine learning workflows. By automating pipeline selection, it empowers data scientists to focus on interpretation and deployment rather than tedious trial-and-error. In practical terms, this could accelerate model development in fields like healthcare, finance, and science, where rapid iteration is crucial. The system's ability to work with sparse data also makes it suitable for resource-constrained environments, broadening access to advanced machine learning techniques.
Limitations noted in the paper include the assumption that pipeline performances lie in a low-dimensional space, which may not hold for all datasets. Additionally, the current implementation does not optimize hyperparameters beyond a fixed grid, and the runtime predictions, while generally accurate, can deviate by a factor of two or more in some cases. Future work could explore extending the approach to neural architecture search or incorporating domain-specific knowledge to further improve recommendations.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.