AIResearch

Punching Through Data Barriers: AI Dataset Transforms Boxing Analysis

A new dataset of 6,915 punch clips aims to revolutionize sports analytics and automated coaching through computer vision.

AI Research
March 26, 2026
4 min read

In the high-stakes world of combat sports, where split-second decisions and precise technique can determine victory, the quest for data-driven insights has long been hindered by a critical bottleneck: the lack of robust, publicly available datasets tailored to the dynamic, unstructured nature of actions like boxing punches. While mainstream sports such as basketball and football have benefited from extensive video analytics resources, combat sports have remained underrepresented, relying on intrusive sensor-based systems or limited, coarse-grained datasets. A new breakthrough, detailed in a recent paper titled "BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization," aims to shatter this barrier by introducing a comprehensive, well-annotated video dataset specifically designed for punch detection and classification. This dataset, comprising 6,915 high-quality punch clips extracted from real-world YouTube videos, promises to accelerate research in computer vision for sports analytics, offering a rich benchmark that captures the nuances of boxing techniques across diverse athletes and environments.

The methodology behind this dataset is meticulously crafted to address the unique challenges of combat sports analysis. Researchers sourced 20 publicly available YouTube videos featuring 18 different athletes (11 male, 7 female) engaged in sparring sessions, shadow boxing, and bag training, ensuring a wide range of motion styles, camera angles, and physiques. Each video was manually segmented to extract punch clips, with precise temporal boundaries annotated for start and end frames, and categorized into six distinct punch types: Cross, Jab, Lead Hook, Lead Uppercut, Rear Hook, and Rear Uppercut. To enhance the dataset's utility for pose-based modeling, the team employed AlphaPose for automatic 2D human pose estimation, extracting keypoints for each frame. A systematic tracking approach was implemented to consistently identify the person of interest across frames, using Euclidean distance calculations based on the center of mass from shoulder and hip keypoints, ensuring robust tracking despite multiple individuals in scenes.
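The tracking idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`center_of_mass`, `track_person`) and the dict-of-keypoints input format are assumptions for the example, and a real pipeline would consume AlphaPose's JSON output directly.

```python
import numpy as np

# Torso joints used to compute the center of mass, per the paper's
# description (shoulders and hips).
TORSO_JOINTS = ["left_shoulder", "right_shoulder", "left_hip", "right_hip"]

def center_of_mass(keypoints):
    """Mean (x, y) of the shoulder and hip keypoints.

    keypoints: dict mapping joint name -> (x, y) pixel coordinates
    (a simplified stand-in for one person's pose in one frame).
    """
    pts = np.array([keypoints[j] for j in TORSO_JOINTS], dtype=float)
    return pts.mean(axis=0)

def track_person(prev_center, candidates):
    """Return the index of the detection whose center of mass is
    closest (Euclidean distance) to the tracked person's center
    in the previous frame."""
    dists = [np.linalg.norm(center_of_mass(c) - prev_center)
             for c in candidates]
    return int(np.argmin(dists))
```

Matching on a torso-based center of mass rather than, say, the head keypoint makes the association robust to fast arm motion, which is exactly the signal a punch clip contains.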

The results of this effort stand out in both scale and granularity compared to existing resources. With 5,513 clips allocated for training and 1,402 for validation, the dataset provides temporal segmentation, class labels, and 2D pose annotations, a combination lacking in prior works like the 3DCG dataset (which offers 6,900 clips but lacks pose data and temporal demarcation) or the BoxMAC dataset (with 2,314 clips but no skeletal information). Statistical analysis revealed that punches in 30 fps videos require up to 25 frames for completion, leading to standardized sequence lengths through zero-padding. This structured approach enables fine-grained analysis of motion execution, supporting tasks such as action localization, technique classification, and inter-frame dynamics modeling, all derived from real-world, monocular video footage that mirrors the variability of amateur and professional training settings.
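The zero-padding step mentioned above is straightforward to reproduce. The sketch below assumes pose sequences are stored as arrays of shape (frames, keypoints, 2); the 25-frame cap comes from the paper's observation about punch duration at 30 fps, but the function name and array layout are illustrative assumptions.

```python
import numpy as np

MAX_FRAMES = 25  # punches in 30 fps video complete within ~25 frames

def pad_sequence(seq):
    """Zero-pad a (T, K, 2) keypoint sequence to (MAX_FRAMES, K, 2).

    Shorter clips are padded with all-zero frames at the end so every
    sample in a batch has the same temporal length.
    """
    seq = np.asarray(seq, dtype=float)
    t, k, d = seq.shape
    if t > MAX_FRAMES:
        raise ValueError(f"clip has {t} frames, expected <= {MAX_FRAMES}")
    padded = np.zeros((MAX_FRAMES, k, d), dtype=seq.dtype)
    padded[:t] = seq
    return padded
```

Fixed-length sequences let downstream models (temporal CNNs, transformers, GCNs over skeletons) consume clips in uniform batches, with the zero frames acting as an implicit mask.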

The implications of this dataset extend far beyond academic research, potentially revolutionizing applied domains in sports technology and healthcare. By providing a benchmark for vision-based models, it lays the groundwork for automated coaching systems that can offer real-time feedback on punch technique, aiding athletes in refining their skills without invasive sensors. In broadcast analytics, it could enable intelligent highlight generation and performance assessment during live events. The dataset also holds promise for sports biomechanics and rehabilitation monitoring, allowing for detailed motion analysis to prevent injuries or track recovery progress. As the authors note, future expansions could incorporate multi-person combat scenarios and opponent interactions, paving the way for next-generation AI models in strategic behavior modeling and personalized training in dynamic, multi-agent environments.

Despite its advancements, the dataset comes with inherent limitations that reflect the realities of real-world data collection. The reliance on YouTube videos, while providing diversity, introduces variability in recording quality, lighting conditions, and camera motion, which may affect model generalization. The manual annotation process, though ensuring precision, is labor-intensive and could limit scalability for larger datasets. Additionally, the focus on single-person scenarios excludes the complexities of opponent interactions in full combat settings, a gap the authors acknowledge for future work. Ethical considerations around data usage under YouTube's Fair Use Policy are addressed, but broader copyright issues may arise if the dataset is expanded. These constraints highlight the need for continued innovation in automated annotation and multi-modal data integration to fully unlock the potential of AI in combat sports analytics.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn