PHILADELPHIA — Researchers from Drexel University and Michigan State University developed a prototype artificial intelligence program that analyzes video to provide real-time exercise form coaching. The researchers published their work ahead of presenting the prototype at the Conference on Computer Vision and Pattern Recognition in June 2026. The conference is hosted by the Institute of Electrical and Electronics Engineers and the Computer Vision Foundation.
Feng Liu, an assistant professor in Drexel University's College of Engineering and Computing, led the research on the prototype exercise coaching program. The program integrates biomechanical modeling, computer vision, and a vision-language model to provide live feedback during physical activity. "Many people who exercise at home with videos and apps don't get high-quality assessment of their movements," Liu said. "Our goal with BioCoach is to provide timely, specific cues grounded in body motion, closer to the kind of guidance a knowledgeable coach would give."
The research team used the publicly available Qualcomm Exercise Video Dataset to prepare the training data. The dataset contains hundreds of hours of exercise footage with time-stamped coaching feedback. Researchers re-annotated the dataset to include detailed biomechanical targets and rationales for the guidance, adding more than 2,400 notes to over 200 videos. The annotations were used to train a large language model to generate coaching guidance. The team preserved the dataset's time stamps so it could evaluate whether the system responds at the correct moment.
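One way to picture a timing check against preserved time stamps is sketched below. This is an illustrative assumption, not the metric used by the researchers: the function name `timely_fraction`, the one-second tolerance, and the greedy matching rule are all hypothetical choices made for the example.

```python
def timely_fraction(reference_times, predicted_times, tolerance_s=1.0):
    """Fraction of reference feedback moments matched by a prediction.

    Hypothetical metric: each reference time stamp (seconds) may be matched
    by at most one predicted time stamp that falls within tolerance_s.
    """
    if not reference_times:
        return 0.0
    preds = sorted(predicted_times)
    used = [False] * len(preds)
    hits = 0
    for ref in sorted(reference_times):
        best = None
        for i, p in enumerate(preds):
            if used[i] or abs(p - ref) > tolerance_s:
                continue
            # Keep the closest unused prediction inside the tolerance window.
            if best is None or abs(p - ref) < abs(preds[best] - ref):
                best = i
        if best is not None:
            used[best] = True
            hits += 1
    return hits / len(reference_times)


# Made-up time stamps: two of the three reference cues are matched.
score = timely_fraction([2.0, 5.0, 9.0], [2.4, 7.5, 9.2], tolerance_s=1.0)
```

A stricter or looser tolerance would change the score, so any real evaluation would need to state the window it uses.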
The program uses a 3D convolutional neural network to capture visual appearance and motion patterns. A secondary component estimates 3D skeletal movements and body shape to identify joint angles, ranges of motion, and exercise phases. The researchers published their work in a paper titled "From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching." The paper has the Digital Object Identifier 10.48550/arXiv.2603.26938.
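As a rough illustration of how joint angles and ranges of motion can be derived from estimated 3D skeleton points, the sketch below computes the angle at a joint from three keypoints. It is a generic geometric calculation, not the researchers' implementation; the function names and the sample hip, knee, and ankle coordinates are hypothetical.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b, formed by 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    norm = math.hypot(*ba) * math.hypot(*bc)
    if norm == 0:
        raise ValueError("joint and neighbor keypoints must be distinct")
    dot = sum(x * y for x, y in zip(ba, bc))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def range_of_motion_deg(angles):
    """Spread between the largest and smallest angle seen in a repetition."""
    return max(angles) - min(angles)


# Hypothetical keypoints (meters) for a knee at two moments of a squat.
standing = joint_angle_deg((0, 1.0, 0), (0, 0.5, 0), (0, 0.0, 0))
bent = joint_angle_deg((0, 0.5, -0.5), (0, 0.5, 0), (0, 0.0, 0))
rom = range_of_motion_deg([standing, bent])
```

In this made-up example the knee is fully extended when standing and bent to a right angle at the bottom of the squat, so the range of motion is the difference between the two angles.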
The team compared the program against video-language artificial intelligence models developed by NVIDIA, ByteDance, Alibaba, Salesforce, OpenAI, Massachusetts Institute of Technology, Shanghai Jiao Tong University, Chinese University of Hong Kong, Peking University, and Peng Cheng Laboratory. Evaluation metrics for the comparison included timeliness, accuracy, and detail of the provided feedback. When tested on the original dataset, the program outperformed Stream-VLM, a model developed by MIT and NVIDIA, in text quality and judged correctness. When evaluated against the researcher-annotated dataset, the program achieved higher scores than Stream-VLM across all tested metrics.
The research team intends to modify the software to estimate joint reaction forces and muscle activation patterns from video. Drexel's Visual Intelligence Lab applies computer vision, machine learning, and 3D human-body modeling to exercise coaching, clinical gait assessment, and classroom education.
No independent assessment was available for this report.