AIResearch
Hardware

SparOA: A Breakthrough in Edge AI Efficiency Through Smart CPU-GPU Scheduling


AI Research
March 26, 2026
3 min read

As artificial intelligence continues its relentless march toward the edge, a critical bottleneck has emerged: how to efficiently run complex deep neural networks on resource-constrained devices without sacrificing performance or draining batteries. Traditional approaches have forced developers to choose between accuracy-robbing model compression, expensive specialized hardware, or suboptimal hybrid inference systems that fail to account for the nuanced characteristics of individual neural network operators. Now, a research team from Politecnico di Milano and Harbin Institute of Technology has unveiled SparOA, a framework that fundamentally rethinks how edge devices should schedule computations between their CPU and GPU components, achieving speedups of up to 50.7× over CPU-only execution while consuming significantly less energy than state-of-the-art alternatives.

At the heart of SparOA lies a crucial insight that previous hybrid inference systems have overlooked: sparsity and computational intensity are orthogonal dimensions that must be considered jointly for optimal scheduling. The researchers discovered that operators in popular models like MobileNetV3-Small distribute across four distinct quadrants—high sparsity/high intensity, low sparsity/low intensity, and the two mixed combinations—with each requiring a different hardware allocation strategy. According to the paper, this distribution reveals that neither sparsity nor computational intensity alone is sufficient for effective CPU-GPU hybrid operator scheduling, explaining why existing systems that rely on fixed rules or single metrics consistently underperform.
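To make the quadrant idea concrete, the sketch below classifies an operator by its weight sparsity and its computational intensity (FLOPs per byte of data moved). The threshold values and operator statistics here are illustrative placeholders, not figures from the paper—SparOA learns its thresholds per operator rather than fixing them by hand.

```python
def sparsity(weights):
    """Fraction of zero-valued weights in an operator."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

def quadrant(sp, ci, sp_thresh=0.5, ci_thresh=10.0):
    """Place an operator into one of the four sparsity/intensity quadrants.

    sp: weight sparsity in [0, 1]
    ci: computational intensity (FLOPs per byte), hypothetical units
    Thresholds are illustrative; SparOA predicts them per operator.
    """
    hi_sp = sp >= sp_thresh
    hi_ci = ci >= ci_thresh
    if hi_sp and hi_ci:
        return "high-sparsity / high-intensity"
    if hi_sp:
        return "high-sparsity / low-intensity"
    if hi_ci:
        return "low-sparsity / high-intensity"
    return "low-sparsity / low-intensity"

# Example: a pruned convolution with many zeroed weights
w = [0.0, 0.3, 0.0, 0.0, 1.2, 0.0, 0.0, -0.7]
sp = sparsity(w)          # 0.625
ci = 24.0                 # hypothetical FLOPs/byte for this operator
print(quadrant(sp, ci))   # prints "high-sparsity / high-intensity"
```

An operator in the high-sparsity/high-intensity quadrant is exactly the ambiguous case the paper highlights: sparsity argues for the CPU (which can skip zeros cheaply), while intensity argues for the GPU, so neither metric alone settles the assignment.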

The SparOA framework addresses this complexity through three tightly integrated components. First, a lightweight threshold predictor combines Transformer and LSTM architectures to accurately determine optimal sparsity and computational intensity thresholds for each operator, achieving 92.3% and 90.6% prediction accuracy respectively while maintaining a compact 4MB footprint suitable for edge deployment. Second, a reinforcement learning-based scheduler using the Soft Actor-Critic algorithm dynamically optimizes resource allocation based on real-time hardware states, learning complex mappings between operator characteristics and optimal device assignments through trial-and-error rather than relying on fixed heuristics. Third, a hybrid inference engine coordinates execution through asynchronous operations and dynamic batching optimization, minimizing data transfer overhead while maximizing resource utilization.
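The scheduling flow these components implement can be sketched as follows: for each operator, a policy maps its features plus the current device load to a CPU or GPU assignment. SparOA learns this policy with Soft Actor-Critic; in this minimal sketch a hand-written rule stands in for the learned policy, and the load model, thresholds, and operator numbers are all made-up illustrations.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    sparsity: float    # fraction of zero weights
    intensity: float   # FLOPs per byte moved (hypothetical units)

def policy(op, gpu_load, cpu_load):
    """Stand-in for the learned SAC scheduler: dense, compute-heavy
    operators favor the GPU; highly sparse or memory-bound ones favor
    the CPU—unless the GPU is already far more loaded than the CPU."""
    prefer_gpu = op.intensity >= 10.0 and op.sparsity < 0.5
    if prefer_gpu and gpu_load - cpu_load < 0.3:
        return "gpu"
    return "cpu"

def schedule(ops):
    """Assign operators in order, tracking a crude per-device load
    (a uniform toy cost per operator, unlike SparOA's real-time state)."""
    load = {"gpu": 0.0, "cpu": 0.0}
    plan = {}
    for op in ops:
        dev = policy(op, load["gpu"], load["cpu"])
        plan[op.name] = dev
        load[dev] += 0.1
    return plan

ops = [
    Operator("conv1", sparsity=0.1, intensity=32.0),
    Operator("dwconv2", sparsity=0.7, intensity=4.0),
    Operator("fc3", sparsity=0.2, intensity=18.0),
]
print(schedule(ops))  # prints {'conv1': 'gpu', 'dwconv2': 'cpu', 'fc3': 'gpu'}
```

The point of replacing such a fixed rule with reinforcement learning, as SparOA does, is that the real mapping from operator features and hardware state to the best device is nonlinear and workload-dependent, which is precisely what the trial-and-error training is meant to capture.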

Extensive evaluations on NVIDIA Jetson platforms demonstrate SparOA's transformative impact. The framework achieves an average speedup of 1.22×∼1.31× compared to state-of-the-art compilers like TensorRT and TVM, and co-execution frameworks like CoDL, while consuming 7%∼16% less energy per inference. On the high-end Jetson AGX Orin, SparOA delivers up to 50.7× speedup over CPU-only execution for MobileNet-v3, and maintains impressive 1.24×∼11.43× acceleration on the more constrained Orin Nano. The reinforcement learning scheduler proves particularly effective, increasing GPU operator load share to 72.6% compared to 55.6% with greedy algorithms, while dynamic batching reduces overhead to just 2.3%∼8.6% compared to 15.4%∼28.7% in static frameworks.

Despite these impressive results, the researchers acknowledge several limitations and directions for future work. Extending SparOA to support more heterogeneous AI accelerators like NPUs and TPUs would require learning more complex allocation policies across hardware types with vastly different characteristics. When DNN architectures change significantly, the predictor may require complete retraining, though the team suggests transfer learning could enable quick adaptation to new models. Additionally, edge devices typically run multiple concurrent tasks with varying service-level objectives, so SparOA's scheduling approach would need extensions to prioritize critical applications while managing shared resources effectively across competing workloads.

Reference: Zhang, Z., Liu, J., & Mottola, L. (2025). SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference. ACM Conference.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn