AI marathon study 2026: researchers test LLMs

A new study in the British Medical Bulletin looked at whether current AI models can generate usable marathon training plans. The researchers had Claude, ChatGPT, Gemini and DeepSeek each build a six-month plan for three performance levels. The result: broadly solid plans with familiar weaknesses. Things get interesting when you look at what the abstract reveals about the methodology — and what it doesn't.

At a glance

Researchers test eight AI models for marathon training plans — according to the abstract, with a simple prompt and no personal details. The plans hit the sports-science fundamentals but struggle to differentiate between performance levels. The model selection mixes flagship and budget models with no stated rationale. Central question: does the study test the AI — or poor prompting? Full text is not openly accessible.

1. What the study examined

Montaruli et al. (2026) tested eight large language models: Claude 3.5 Sonnet, Claude 3.5 Haiku, ChatGPT 4.0, ChatGPT o1, ChatGPT 4 (Free), Gemini 2.0 Flash, Gemini 2.0 Flash Thinking and DeepSeek R1. Each model was asked to generate a six-month marathon training plan for three athlete levels — beginner, intermediate and advanced. The outputs were then compared qualitatively against the sports-science literature on marathon training.

The study is descriptive in design. There is no real-world intervention, no participants and no performance data. The researchers assess only whether the AI-generated plans follow established principles: progressive volume increases, intensity distribution (more than 80% in the low-intensity zone), periodisation and a taper before the race.

2. The results: solid fundamentals, familiar weaknesses

According to the abstract, most AI plans contained the essential building blocks of evidence-based marathon training. Progressive weekly mileage, a clear taper, and an intensity distribution dominated by low-intensity work — none of that is surprising. LLMs have learned these principles from thousands of specialist articles, books and forum posts. That they know the basic structure of a training plan is to be expected.

The weaknesses are more interesting. Some models gave no concrete weekly mileage. The distinction between intermediate and advanced was effectively absent in several models — both levels were treated as if they were identical. Pace targets for advanced runners were inconsistent or not very practical. This is precisely where training planning gets complex — and where the question of how detailed the prompt actually was becomes relevant.

3. Transparency problem: the full text stays behind a paywall

One important caveat up front: I was only able to read the abstract. The full text is behind a paywall at Oxford University Press in the British Medical Bulletin. Single-article access costs 53 euros. This isn't a one-off — it's a structural problem of the academic publishing landscape.

The irony: much of the research that ends up in these journals is publicly funded. Universities working with taxpayer money produce knowledge — and publishers then sell it back to the public that already paid for it. For a study examining whether AI can build training plans for recreational runners, that's especially absurd. The target audience — that is, you — is supposed to benefit from it, but is not allowed to read the results.

What that means for this article: all the analysis below is based on the freely available abstract. Whether the methodology is described in more detail in the full text, whether the exact prompt wording is documented, whether model-specific results are broken out — that remains unclear to me, and probably to most readers. If you have institutional access or search platforms like ResearchGate, you may find an author-shared version there.

OPEN ACCESS AND SCIENCE

Many journals offer open-access options — but these have to be paid for by the authors (often 2,000–4,000 €). That creates a system in which publicly funded research is expensive either for readers or for authors. Preprint servers like arXiv or medRxiv are an alternative, but are still used too rarely in sports science.

4. The open question: how was the prompting done?

This is where it gets interesting — and, at the same time, uncertain, because we only have the abstract. According to the abstract, the models were asked to build a six-month marathon training plan for three performance levels. Whether the prompt also contained further details — age, current training status, weekly mileage, time budget, injury history, target time — is not clear from the abstract. None of that is mentioned. If the prompt really was as minimalist as it appears, that would be a methodological problem.

Because that would be exactly the mistake many beginners make — and it fits the prompt paradox: if you don't know enough about a topic, you don't ask good questions. The AI still answers confidently — just not well. Whether the researchers actually made this mistake or whether the full text documents a more nuanced prompt, there's no way to judge from the outside.

The underlying principle holds regardless of the specific prompt — and has been backed up by expert evaluations in another study on prompt quality for AI training plans: an LLM that only receives "build a marathon plan for advanced runners" has to guess. Is the runner at 40 km/week or 80? Have they been running for 2 years or 10? Is the target time 3:15 or 2:50? Without that information, any model will struggle to meaningfully differentiate between intermediate and advanced. That exactly this was flagged as a weakness could therefore be less of an AI problem and more a consequence of the study design.

4.1 A striking model selection

Alongside the prompt, the choice of models tested also stands out. The study was submitted in June 2025 and revised in September 2025. The model list features Claude 3.5 Sonnet next to Claude 3.5 Haiku (Free), three different ChatGPT variants, Gemini 2.0 Flash, Gemini 2.0 Flash Thinking and DeepSeek R1.

At first glance that's unusual. Haiku is Anthropic's smallest, cheapest model — deliberately designed as a fast, efficient all-rounder, not as a flagship for complex tasks. Having it in the same test as Sonnet without classifying the performance tiers is a bit like comparing a city-running shoe with a race-day racer and not mentioning it. Similar story with Gemini 2.0 Flash — that's the lightweight variant, not the strongest Gemini model, while 2.5 was already available. Whether the researchers consciously chose across model tiers or simply tested whatever was freely available is not clear from the abstract.

Perhaps the full text explains this selection. But if the researchers really just took whatever was accessible, that's understandable — but it weakens the comparability of the results. A more rigorous test would either have compared only flagship models or explicitly treated differences between performance tiers as a variable — I drew exactly that line in my own AI Fitness Benchmark across 15 models.

Garbage in, garbage out applies not only to data — it applies to prompts too. When a study evaluates AI training plans, the quality of the prompt is at least as decisive as the quality of the model.

Same AI, different prompt, different result — the study is most likely testing the first variant.

Whether the full text provides more context here, I can't say. It's possible the researchers deliberately chose a minimalist prompt to represent the "worst case" — that would be methodologically defensible and even useful. But then the study would have to communicate that clearly and frame the results accordingly: not as "what can AI do?", but as "what happens when you use AI without context?". The abstract doesn't make that distinction. That's a relevant finding — but a different one from what the study title suggests.

→ Deep dive: Create a training plan with AI — how to do it right

5. What the study still gets right

The open methodological questions don't change the fact that the study's core idea is valuable. AI-generated training plans are increasingly being used — from hobby runners to ambitious athletes. Asking whether these plans are grounded in sports science is a legitimate question, and one that has barely been investigated so far.

The study also contextualises its results appropriately. The authors stress that individualisation is missing, that professional supervision is not replaced, and that randomised studies with real runners and physiological data are needed. That's the right call. Measuring AI training plans against the specialist literature is a first step — but it's not proof that these plans work in practice.

6. What's missing — and what should come next

As far as the abstract shows, the study tests LLMs as a black box: one prompt in, one plan out. That doesn't reflect how most experienced users actually work with AI. In practice you iterate: you add context, you assess the answer, you correct, you refine. Good AI training plans emerge through dialogue, not in a single shot.

Future studies should do three things differently. First: use optimised prompts that include personal data such as age, current training volume, injury history and target time. Second: test the plans in a real-world intervention — with runners who train according to them and whose performance development is measured. Third: integrate physiological data such as VO2max, lactate threshold (the point at which your body produces more lactate than it can clear) or HRV (heart rate variability — a measure of the recovery capacity of your autonomic nervous system) to check whether AI plans can be individualised.

The authors call for exactly this. Let's hope future studies deliver on that demand — with more precise prompts, transparent methodology and open access to the results.

Verdict: good approach, open questions

This study asks the right question — and delivers a usable snapshot. The results line up with what practitioners have long been observing: AI models know the fundamentals of evidence-based training. Where things get weaker is individualisation. Whether that's down to the models or to the study design can't be settled without the full text. The abstract does, however, suggest that prompt quality and model selection could have been reflected on more rigorously from a methods perspective.

For you as a user, this confirms the basic rule: your training plan will only be as good as your prompt. Without domain knowledge and without personal context you'll get a plan that "kind of fits" — but doesn't fit you. The tool is powerful. But you need to know how to use it.

Source: Montaruli, G., Nikolaidis, P. T., Racil, G., Maffulli, N., Migliaccio, G. M., & Padulo, J. (2026). Artificial intelligence–generated marathon training programs: reliable tools in exercise prescription for athletic performance? British Medical Bulletin, 157(1), ldag010. DOI: 10.1093/bmb/ldag010 | PubMed

Study tests AI marathon training plans — but how good was the test itself?

1. What the study examined

2. The results: solid fundamentals, familiar weaknesses

3. Transparency problem: the full text stays behind a paywall

OPEN ACCESS AND SCIENCE

4. The open question: how was the prompting done?

4.1 A striking model selection

5. What the study still gets right

6. What's missing — and what should come next

Verdict: good approach, open questions

You might also like

HRV Apps: What a Study of 206 Apps Really Reveals

ACSM Strength Training Guidelines 2026: What Changed After 17 Years

Sweat Biomarkers in Smart Textiles: Lab Values Without Blood?