FITT-VP Study: Which AI Writes the Better Training Plan?

Before the headline misleads you: this is not about which AI model writes the best training plans. All four LLMs in this study — GPT-4o, Claude 3.7, DeepSeek R1 and Grok-3 — are outdated by now. The interesting part is how the testing was done: a study published in Frontiers in Physiology in May 2026 had AI-generated training plans systematically rated against FITT-VP — an established framework from sports medicine. That test methodology stays relevant long after the model names have been swapped out. And it offers a starting point for a question I have been chewing on for months: how do you objectively measure whether an AI training plan is any good?

At a glance

A study in Frontiers in Physiology had certified experts rate training plans from GPT-4o, Claude 3.7, Grok-3 and DeepSeek R1 against the FITT-VP framework — Claude 3.7 won with 50.2 out of 60 points. More important than the ranking is the method: all four models are outdated by now, but FITT-VP from sports medicine makes AI training plans objectively comparable for the first time. The test only measures plan structure on paper, though — no real training outcomes, no periodization, no sport-specific quality. You can still use the six FITT-VP questions today to check your own AI plan.

What the study actually did

Three certified exercise specialists rated training plans from four LLMs for 30 fictional patient profiles — dimension by dimension instead of by gut feeling. The researchers Huan Feng and Xiaojun Wang built the profiles from epidemiological data and clinical guidelines, then had them checked in three stages by sports medicine students and experts. Every model got the same 30 cases. Each generated plan was scored independently: zero to ten points per FITT-VP dimension, 60 points max per plan.
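To make the scoring arithmetic concrete, here is a minimal Python sketch of such a rubric. The function names and the step of averaging the three raters' totals are my assumptions for illustration; the description above only fixes zero to ten points per dimension and 60 points max per plan.

```python
from statistics import mean

# The six FITT-VP dimensions, each scored from 0 to 10.
DIMENSIONS = ("frequency", "intensity", "time", "type", "volume", "progression")
MAX_PER_DIMENSION = 10


def plan_total(rating: dict[str, int]) -> int:
    """Sum one rater's six dimension scores into a total between 0 and 60."""
    missing = set(DIMENSIONS) - set(rating)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dimension in DIMENSIONS:
        if not 0 <= rating[dimension] <= MAX_PER_DIMENSION:
            raise ValueError(f"{dimension} must be between 0 and {MAX_PER_DIMENSION}")
    return sum(rating[dimension] for dimension in DIMENSIONS)


def mean_plan_score(ratings: list[dict[str, int]]) -> float:
    """Average the totals of several raters (assumed aggregation, not from the study)."""
    return mean(plan_total(rating) for rating in ratings)
```

With three hypothetical raters who give a plan 8, 10 and 6 points on every dimension, the totals are 48, 60 and 36, and the averaged plan score is 48 out of 60.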

The setup is reminiscent of the marathon study from February — humans judged the AI plans there too. The difference: this time there was a fixed, dimension-by-dimension rubric. That is exactly what makes results comparable — across models, across studies, over time.

The results — and why they are already history

Claude 3.7 won with 50.2 out of 60 points, DeepSeek R1 trailed at 40.3 — and both models are obsolete today. In between: Grok-3 at 47.4 and GPT-4o at 44.0. The differences were statistically clear-cut, not noise: which model you picked explained almost 90% of the score variance.
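For a feel of what "explained almost 90% of the score variance" means, here is a minimal sketch of eta squared, one common effect size for comparing group means: the share of the total score variation that lies between the models rather than within them. I am assuming this kind of measure purely for illustration; nothing above says which statistic the authors actually used.

```python
from statistics import mean


def eta_squared(groups: dict[str, list[float]]) -> float:
    """Share of total variation explained by group membership (between / total)."""
    all_scores = [score for scores in groups.values() for score in scores]
    grand_mean = mean(all_scores)
    # Total variation: squared distance of every score from the grand mean.
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    if ss_total == 0:
        raise ValueError("all scores are identical; eta squared is undefined")
    # Between-group variation: squared distance of each group mean from the
    # grand mean, weighted by the number of scores in that group.
    ss_between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2 for scores in groups.values()
    )
    return ss_between / ss_total


# Made-up toy numbers, NOT data from the study.
toy_scores = {"model_a": [50.0, 51.0, 49.0], "model_b": [40.0, 41.0, 39.0]}
print(round(eta_squared(toy_scores), 2))
```

In this toy case the two groups sit ten points apart while the scores within each group differ by only a point, so the value comes out at about 0.97. A figure near 0.9 tells the same story: most of the spread in scores comes from which model wrote the plan.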

Looking closer: Claude 3.7 scored highest on session duration and on progression, the logic of how a plan builds over the weeks. DeepSeek R1 stumbled on intensity and exercise type, of all things: two levers on which any plan stands or falls. Grok-3 was solid on volume but struggled with complex pre-existing conditions.

But: all four models date from 2024 and early 2025. In AI time, that is an eternity. If your takeaway from this study is "use Claude for your plan", you are drawing the wrong conclusion from solid work. The ranking is the perishable part. The framework is not.

FITT-VP: a yardstick from sports medicine

The study's authors did not invent FITT-VP — it is an established standard for exercise prescription, shaped by the American College of Sports Medicine (ACSM). Its home turf is sports medicine, prevention and rehab: anywhere exercise gets dosed like a medication.

Quick explainer: FITT-VP

Frequency: how often per week.
Intensity: how hard.
Time: how long per session.
Type: which form of training.
Volume: total weekly amount.
Progression: how the plan ramps up over time.

As a test criterion for LLMs, that is a smart choice. Human raters can score every dimension on its own, and they can do so reproducibly across all models. The big AI benchmarks for math or coding work the same way: defined tasks, fixed rubric, clear score. And like those benchmarks, FITT-VP measures only one facet; more on that in a moment.

What the test measures — and what it does not

The study measures whether a plan is well built on paper, not whether it makes you faster or stronger. The authors name three limits themselves. The profiles were synthetic: no real person trained on these plans. Each model was queried only once per case, even though LLMs can answer the same question differently every time. And what got scored was expert judgment, not training outcomes.

For you as an athlete there is a fourth limit: FITT-VP comes from rehab and health contexts. Periodization, sport-specific exercise selection, race timing: none of that shows up in the framework. FITT-VP checks a plan's foundation, not the whole house. I have broken down elsewhere what studies comparing AI against coaches actually show, and where AI plans fail in practice.

None of this is a knockout argument against the study. It just means: a high FITT-VP score shows a plan is well built — not that it is the right one for you.

Why the method interests me more than the winner

The study works — as a blueprint for how to evaluate AI training plans seriously in the first place. The hype version would be the headline "Claude is the best AI coach"; the data does not support that. What it does support: human raters plus a fixed rubric beat gut feeling, and the quality gaps between models are real and measurable.

I am currently building an evaluation framework for AI training plans myself — one that puts athletes, not patients, at the center. FITT-VP will be one layer in it: the structure check. Periodization, sport-specific logic and coaching quality need their own criteria, and honestly, I will not know whether my weighting holds up until I have run it against real plans. More on that when it is ready.

Until then, you can use the framework yourself: take your last AI plan and put the six FITT-VP questions to it. Does it say how often, how hard, how long, what exactly, how much per week, and what happens once week four is over? Any dimension without a concrete number is worth a follow-up prompt. Those are exactly the details where even the best models dropped points in this study. And if you do not have an AI plan yet, I have already written up how to create a training plan with AI, step by step.
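If you would rather run that check mechanically, here is a rough Python sketch. Everything in it is my own illustration, not something from the study: you note down in your own words what the plan says for each dimension, and the hypothetical helper `follow_up_questions` returns the questions that are still open. I am assuming that "type" only needs a named activity, while every other dimension needs at least one number.

```python
import re

# The six FITT-VP questions, in framework order.
QUESTIONS = {
    "frequency": "How often per week?",
    "intensity": "How hard?",
    "time": "How long per session?",
    "type": "Which form of training?",
    "volume": "How much in total per week?",
    "progression": "How does the plan ramp up over time?",
}

# Assumption: "type" is answered by naming an activity, so it needs text
# but no number. All other dimensions need a concrete figure.
NEEDS_NUMBER = {"frequency", "intensity", "time", "volume", "progression"}


def follow_up_questions(plan_answers: dict[str, str]) -> list[str]:
    """Return the questions whose answer is missing or lacks a concrete number."""
    open_questions = []
    for dimension, question in QUESTIONS.items():
        answer = plan_answers.get(dimension, "").strip()
        has_number = re.search(r"\d", answer) is not None
        if not answer or (dimension in NEEDS_NUMBER and not has_number):
            open_questions.append(question)
    return open_questions


# Hypothetical notes on an AI plan.
notes = {
    "frequency": "3 runs per week",
    "intensity": "easy",
    "time": "45 min",
    "type": "running",
}
for question in follow_up_questions(notes):
    print(question)
```

For these notes the sketch flags intensity (no number), volume and progression (both missing). A digit check is a crude proxy, of course: it cannot tell whether the number makes sense, only whether the plan committed to one.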

Sources

Feng H, Wang X. Comparative performance of four large language models in generating evidence-based exercise prescriptions using FITT-VP framework. Front Physiol. 2026;17:1846567. doi.org/10.3389/fphys.2026.1846567
ACSM's Health & Fitness Journal: Developing the P (for Progression) in a FITT-VP Exercise Prescription (2018). journals.lww.com
