Athlete standing in front of three different AI-generated training plans, wondering which one to trust

AI Training Plans Under the Microscope: When Experts Disagree

Christopher Klenk · 4 min read

You've asked AI for a training plan, or at least thought about it. The real question isn't whether it works, but how you can tell if the result is any good. A new study delivers an uncomfortable finding: even experts barely agree on that.

At a glance

Researchers had Gemini 2.5 create training plans for complex patient cases and asked three experts to evaluate them. The experts were internally consistent but barely agreed with each other. The reason: each one set different priorities, for example safety versus progression. That's not a flaw in the study; it's a finding about training planning in general. And it applies just as much to your own self-coached routine with AI.

Three experts, three different verdicts on the same plan

You know the situation: one coach says more volume, the next says more intensity, and both are convinced they're right. That's not an isolated case; it's structural. This study made exactly that visible in a controlled setting. Three sports physicians and exercise scientists evaluated the same AI-generated plans and arrived at fundamentally different verdicts. One prioritised strict safety limits, another placed more weight on realistic progression and everyday practicality.

Each one was completely consistent on their own, but they barely agreed with each other. The study calls this the "inherently expert-dependent nature of training recommendation." Put more simply: there is no objective "right" in training planning. There are experts with different experiences and priorities, and AI mirrors exactly that state. A methodological caveat: with only three raters, this is a strong signal but not a precise measurement.
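To make "barely agreed with each other" concrete, here is a minimal sketch of how agreement between pairs of raters can be quantified. It uses Cohen's kappa as a simple stand-in; the article does not say which statistic the study used. The rater labels and all ratings below are invented for illustration and are not data from the study.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long lists of categorical ratings."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Share of plans on which both raters gave the same rating.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Invented ratings (1 = unacceptable, 2 = acceptable, 3 = good) for six plans.
# Hypothetical rater labels; NOT taken from the study.
ratings = {
    "safety_first": [1, 1, 2, 1, 2, 1],
    "progression_first": [3, 2, 3, 3, 2, 3],
    "practicality_first": [2, 3, 2, 1, 3, 2],
}

pairwise = {
    (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
    for r1, r2 in combinations(ratings, 2)
}
for pair, kappa in pairwise.items():
    print(pair, round(kappa, 2))
```

A kappa near 1 means two raters agree far more often than chance would predict; values near or below 0 mean they do not. With three raters there are only three pairs to compare, which is one way to see why the result is a signal rather than a precise measurement.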

"AI needs expert oversight." But which expert exactly?

The standard answer to AI-generated training plans is: have an expert check it. Sounds reasonable. But if three experts reach three different verdicts on the same plan, which one do you trust? The demand for oversight just shifts the problem. It doesn't solve it.

I always frame it like this: there is a zone that is clearly wrong and dangerous, and experts mostly agree on that. And there is an optimum, which nobody actually knows. The goal is to get as close to it as possible. Fortunately, there are often several valid paths there. Experts almost always argue about this middle ground, not about the absolute boundaries.

That doesn't mean expert knowledge is worthless. It means there is no neutral authority that can tell you whether your AI plan is good. Even an experienced coach brings their own school of thought, their own priorities. You only notice the difference if you know enough yourself to judge the answer.

More prompt structure helps, up to a point

The study tested three levels of prompt structuring: from a simple baseline prompt to a detailed schema with fixed output formats. Moving from Level 1 to Level 2 produced measurably better results in safety and guideline adherence. Level 3, which was even more detailed, didn't improve the scores further; in one of the cases they actually got worse. This is a familiar pattern, one that earlier studies on ChatGPT training plans have shown in similar ways.

The study offers no direct explanation. One plausible reason: if a prompt is defined too narrowly, the model follows the schema rigidly and loses the flexibility that makes a good plan. For you as a self-coached athlete, that means: provide context, yes, but don't force the model into a template. Anyone who doesn't know what progressive overload or a deload means can't ask for it, and the LLM won't fill in missing training knowledge out of thin air.
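As a rough illustration of "provide context, but don't force a template", the sketch below assembles a prompt from a few facts about the athlete and names the training principles explicitly, while leaving the output format open. The class, the function, the field names and the example athlete are all hypothetical; this is not one of the prompts used in the study.

```python
from dataclasses import dataclass, field


@dataclass
class AthleteContext:
    """Hypothetical container for what you tell the model about yourself."""
    goal: str
    sessions_per_week: int
    experience: str
    constraints: list[str] = field(default_factory=list)


def build_prompt(ctx: AthleteContext) -> str:
    """Assemble a context-rich prompt that names principles, not a fixed schema."""
    constraints = "; ".join(ctx.constraints) if ctx.constraints else "none stated"
    lines = [
        "Create a training plan for the athlete described below.",
        f"Goal: {ctx.goal}",
        f"Available sessions per week: {ctx.sessions_per_week}",
        f"Training experience: {ctx.experience}",
        f"Constraints: {constraints}",
        # Principles are named explicitly, following the article's point that
        # you can only ask for what you know exists.
        "Apply progressive overload and schedule deload weeks where appropriate.",
        # Deliberately no fixed output format, so the plan can adapt to the case.
        "Explain the reasoning behind the structure you choose.",
    ]
    return "\n".join(lines)


# Invented example athlete, for illustration only.
prompt = build_prompt(
    AthleteContext(
        goal="first half marathon in 16 weeks",
        sessions_per_week=4,
        experience="two years of recreational running",
        constraints=["desk job", "no running on Mondays"],
    )
)
print(prompt)
```

Asking the model to explain its reasoning is a design choice rather than something the study tested: it gives you something to check against your own knowledge of the fundamentals.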

What remains when you strip away the clinical framing

The study looks at complex patient cases: diabetes, osteoarthritis, cancer aftercare. In exactly these groups, flawed training planning isn't just suboptimal, it's potentially dangerous. That makes the study ethically relevant and, paradoxically, makes the finding even more interesting for healthy self-coached athletes: if even in this high-risk context no expert consensus emerges, what does that say about training planning in general? The core finding still holds outside the clinic: AI-generated training plans are only as good as the prompt behind them. If you want to approach this systematically, with a structured method instead of trial and error, the Claude training plan skill is a good starting point.

What interests me about this study: it shows that the question "Is this AI plan good?" has no objective answer, not even from experts. That's not a reason to reject AI tools. It's a reason to use them with your own judgement instead of trusting them blindly. If you know the fundamentals, you can recognise when a plan makes sense and when it doesn't. That applies to AI output just as it does to a coach's advice.

The study is available open access at MDPI.