Study: prompt quality shapes ChatGPT training plans

If you want to coax a training plan out of ChatGPT, you face a basic question: how much information do you need to put in to get something usable out? A new study from sports science now delivers hard numbers – and in doing so confirms a problem many users underestimate.

At a glance

Researchers had GPT-4 generate eight training plans – four from a vague prompt, four from a detailed one. Eleven sports scientists rated the results blind. The upshot: detailed prompts produce measurably better plans – more personal, safer, more actionable. But even identical prompts generate a structurally different plan every single time.

What the study investigated

The study "More details, less variability?" was published in the journal Biology of Sport (Yang et al., 2025) and is freely available. The research team – based at the universities of Granada, Fuyang and Sichuan – set out to measure two things. First, how strongly the quality of ChatGPT training plans depends on the level of detail in the prompt. Second, how stable the results are when the same prompt is entered repeatedly.

The scenario was deliberately realistic. A real mother – with no background in sports science – was asked to have a training plan generated for her 15-year-old son. Goal: weight loss and improving general fitness. The son is 175 cm tall, weighs 75 kg, likes playing basketball and goes running.

Two prompts, eight plans, eleven experts

The researchers compared two prompts with different levels of detail. The simple prompt contained only the age, the goal and the time frame. The detailed prompt additionally provided height, weight, health status, hobbies and – crucially – a concrete methodological instruction: plan according to the FITT principle (frequency, intensity, time, type), offer exercise alternatives, and output the result as a table.

Prompt – Simple (Protocol 1)

Please design a one-month training program for my 15-year-old son aimed at weight loss and general fitness.

Prompt – Detailed (Protocol 2)

My son is 15 years old, 175 cm tall, and weighs 75 kg. He is healthy, with no history of surgery or chronic illness. At school, he enjoys playing basketball and running. Please create a one-month training program focusing on weight reduction and physical fitness enhancement. The plan should follow the FITT principle, specifying frequency, intensity, time, and type of exercise. Make sure the exercise types are age-appropriate and suitable for his health status, and include 2–3 alternative exercise options to ensure variety. Present the plan in a table format, seamlessly integrating weekdays and rest days, along with relevant annotations where necessary.

Each prompt was entered four times in fresh GPT-4 sessions, 10 minutes apart. That yielded eight training plans, which were anonymised and handed to 11 sports scientists. The experts – 35 years old on average with more than 18 years of practical experience – did not know which prompt had produced which plan. They rated each plan on a scale from 1 to 5 across four categories: How individualised is the plan? How effective? How safe? And how realistic in practice?

The results: better, but never the same

Detailed prompts scored higher in all four categories. For personalisation, safety, feasibility and the overall score, the differences were not just noticeable, they held up statistically. The one exception was effectiveness: the trend pointed the same way, but the sample was too small to confirm the difference statistically.

Category	Simple prompt	Detailed prompt	Difference robust?
Personalisation	3.7 / 5	4.2 / 5	Yes
Effectiveness	3.7 / 5	4.1 / 5	Trend, but not confirmed
Safety	3.3 / 5	4.0 / 5	Yes
Feasibility	4.1 / 5	4.6 / 5	Yes
Total	14.8 / 20	16.8 / 20	Yes

Average expert rating per category (scale 1–5, total = sum across all four categories, max. 20). Source: Yang et al. (2025), Biology of Sport.
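
As a quick cross-check, the per-category gains and the column sums can be recomputed from the rounded means in the table. The sketch below uses only the values printed above; the function names are illustrative. Summing the rounded category means gives 16.9 for the detailed prompt against the reported total of 16.8, a gap that is presumably a rounding effect.

```python
# Category means as printed in the table above (scale 1-5):
# (simple prompt, detailed prompt)
ratings = {
    "Personalisation": (3.7, 4.2),
    "Effectiveness": (3.7, 4.1),
    "Safety": (3.3, 4.0),
    "Feasibility": (4.1, 4.6),
}


def gains(table):
    """Return {category: detailed - simple}, rounded to one decimal."""
    return {
        name: round(detailed - simple, 1)
        for name, (simple, detailed) in table.items()
    }


def totals(table):
    """Sum the rounded category means for each prompt type."""
    simple = round(sum(s for s, _ in table.values()), 1)
    detailed = round(sum(d for _, d in table.values()), 1)
    return simple, detailed


def biggest_gain(table):
    """Name of the category with the largest improvement."""
    g = gains(table)
    return max(g, key=g.get)


for name, gain in gains(ratings).items():
    print(f"{name}: +{gain}")
print("Largest gain:", biggest_gain(ratings))
print("Sums of rounded means (simple, detailed):", totals(ratings))
```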

The researchers also examined how much the outputs fluctuate when you enter the same prompt several times. Result: ratings for plans from the detailed prompt scattered less, above all for safety and feasibility. Overall, however, the difference in variability was not large enough to be statistically confirmed.

And here lies the point that is likely to surprise many users: even with an identical, detailed prompt, GPT-4 delivered a structurally different plan every time. Different exercises, different intensities, different splits across the training days. LLMs don't work like a calculator that always returns the same thing for the same input. They sample each next word from a set of probable candidates, so the same input can lead to a different result on every run.
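
The dice-rolling can be illustrated with a toy sampler. This is a minimal sketch, not GPT-4's actual decoding: the exercise list, the scores and the function names are invented for illustration, and "temperature" is the usual name for the parameter that controls how strongly sampling favours the top candidate. Real models do the same thing over tens of thousands of tokens rather than four exercises.

```python
import math
import random

# Hypothetical toy "vocabulary" and model scores, invented for illustration.
EXERCISES = ["running", "basketball", "cycling", "bodyweight circuit"]
SCORES = [2.0, 1.5, 1.0, 0.5]


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_week(rng, days=5, temperature=1.0):
    """Draw one exercise per day at random, weighted by probability."""
    probs = softmax(SCORES, temperature)
    return rng.choices(EXERCISES, weights=probs, k=days)


def greedy_week(days=5):
    """Calculator-style alternative: always take the top-scoring option."""
    best = EXERCISES[SCORES.index(max(SCORES))]
    return [best] * days


if __name__ == "__main__":
    for run in range(3):
        # Unseeded generator: each run is an independent roll of the dice.
        print(f"run {run + 1}:", sample_week(random.Random()))
    print("greedy:", greedy_week())
```

The sampled weeks will usually differ from run to run, while the greedy variant repeats itself exactly. Chat interfaces sample by default, which is consistent with the run-to-run differences the study observed.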

What this means in practice

The study backs up with numbers what is obvious in practice: garbage in, garbage out. The less context you give an LLM, the more generic and inconsistent the result becomes. The simple prompt in some cases led ChatGPT to come back with follow-up questions rather than deliver a plan at all. The detailed prompt, by contrast, gave the model enough structure to work in a targeted and safer way.

THE PROMPT PARADOX IN ACTION

The study illustrates exactly the problem we call the prompt paradox. The mother in the experiment had no training in sports science. Without the researchers' guidance she would probably have used the simple prompt – and received a more generic, less safe plan. If you know little about training, you don't know what to ask for: requesting the FITT principle in a detailed prompt assumes you have heard of it in the first place.

The safety aspect is particularly relevant. The safety rating made the biggest jump of any category – from 3.3 to 4.0 out of 5 points. So detailed prompts produced plans that stuck much more closely to sports science guidelines. For a 15-year-old this is not an academic exercise but a matter of injury prevention.

At the same time, the study shows a limit that no prompt, however good, can remove: LLMs roll the dice on every output – at a high level, but they still roll. Generate a plan today and enter the identical prompt tomorrow, and you get something different. That isn't a bug, it's architecture. For users this means: never take an AI-generated training plan at face value. Treat it as a draft and cross-check it against your own knowledge or with a qualified professional.

Assessment and limitations

The study is methodologically well set up: the experts didn't know which prompt they were rating, and the results rest on expert judgement rather than self-assessment. A few caveats still deserve mention. The scenario is confined to a single use case: a teenage beginner with a weight-loss goal. Whether the findings transfer to advanced athletes, rehabilitation or strength training remains open. On top of that, only GPT-4 was tested, a model that was already no longer the latest at the time of the study (March 2025). Newer models like GPT-4o, reasoning models (o1, o3) or competitors such as Claude and Gemini might perform differently, and possibly better, on structured tasks like training planning. The study therefore says something about GPT-4, not about "AI training plans" in general. The prompts were in English, which limits transferability to German-language use. And with four runs per prompt the evidence base is thin, even if the differences were measurable.

Even so: as one of the few studies to have AI-generated training plans systematically rated by an expert panel, it delivers an important data point. It confirms what experienced prompters know intuitively and backs that intuition with blinded expert ratings.

Bottom line

The uncomfortable truth remains: a good prompt is no substitute for training knowledge – but it makes the difference between a generic weekly plan and a result that experts rate as personalised, safe and actionable. If you use AI for training plans, at minimum feed in age, body metrics, health status, goals, preferences and a methodological structure like the FITT principle. And then review the result critically anyway – because next time you'll get something different.
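
One way to make sure none of these inputs gets forgotten is a small prompt template. The following is a minimal sketch modelled on the study's detailed prompt (Protocol 2); the class name, field names and wording are illustrative assumptions, not something taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AthleteProfile:
    """Hypothetical container for the inputs listed above."""
    age: int
    height_cm: int
    weight_kg: int
    health_status: str
    preferences: list[str]
    goals: list[str]
    duration: str = "one-month"


def build_prompt(profile: AthleteProfile) -> str:
    """Assemble a detailed prompt; refuse to build one with missing context."""
    if not profile.preferences or not profile.goals:
        raise ValueError("preferences and goals must not be empty")
    return (
        f"The athlete is {profile.age} years old, {profile.height_cm} cm tall, "
        f"and weighs {profile.weight_kg} kg. "
        f"Health status: {profile.health_status}. "
        f"Preferred activities: {', '.join(profile.preferences)}. "
        f"Please create a {profile.duration} training program focusing on "
        f"{' and '.join(profile.goals)}. "
        "The plan should follow the FITT principle, specifying frequency, "
        "intensity, time, and type of exercise. "
        "Include 2-3 alternative exercise options and present the plan "
        "in a table format, including rest days."
    )


if __name__ == "__main__":
    # Values taken from the scenario described in the study.
    son = AthleteProfile(
        age=15,
        height_cm=175,
        weight_kg=75,
        health_status="healthy, no history of surgery or chronic illness",
        preferences=["basketball", "running"],
        goals=["weight reduction", "physical fitness enhancement"],
    )
    print(build_prompt(son))
```

The template only guarantees that the context is complete. It does not make the output deterministic, so the resulting plan still needs a critical review.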

Source: Yang, Z., Zhang, X., Li, H. & Ye, J. (2025). More details, less variability? A crossover design study on the impact of information granularity on ChatGPT's training program stability. Biology of Sport, 43, 379–392. Full text (Open Access)
