What LLM training data actually is
LLMs like ChatGPT, Claude or Gemini were not trained on training plans; they were trained on text. The difference is crucial.
Pre-training means: the model reads huge amounts of text and learns patterns within it. It learns what follows what, which concepts belong together, how texts are structured. Fitness knowledge does not emerge through its own training experience, but through reading texts about training.
Concretely, that means the LLM has read thousands of articles on progressive overload, hundreds of studies on periodisation, dozens of books on strength training. It has derived a model of the concept of "training" from this, but it has never touched a barbell or felt a training session. The model knows what is written in texts. What is in no text, it does not know.
The model does not learn from you
LLMs do not learn from your prompts. If you tell ChatGPT today that its answer was bad, that changes nothing about the model tomorrow. The knowledge base is frozen. The date on which training was completed is called the knowledge cutoff, comparable to the publication date of a book: everything published after it is simply not in there.
Where the fitness knowledge comes from
The three main sources for LLM knowledge about training are websites and blogs, scientific studies and books, all three with different strengths and different biases.
Websites and blogs
Websites and blogs make up the largest share of the training dataset by volume: estimates put this at 60–80% of the entire pre-training corpus. The largest single dataset is Common Crawl, an automated crawl of billions of webpages that is regularly updated and used as raw material for LLM training. Fitness blogs, forums, advice sites: it all ends up in there. That means topics discussed a lot on the web are well known to an LLM. Niche methods or new sport science approaches are barely known.
The real problem is not the volume, but the structure behind it. I know this industry and how content is often produced in it. This is not about whether someone uses AI as a writing tool or not. The problem arises when people write about topics they have simply not mastered. Training articles are researched by looking at what competitors write, then lightly reworded. The result: hundreds of articles with the same core claim, mutually reinforcing each other, whether that claim is technically correct or not.
To an LLM, that looks like consensus. The model cannot distinguish whether a text comes from someone with 15 years of coaching experience or from someone who spent three hours on Google. It sees: the pattern appears on many pages, on pages with many inbound links, so it must be correct. Popularity is not the same as correctness, but LLMs cannot reliably resolve this difference during training.
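A toy illustration of why repetition can look like consensus: the sketch below simply counts how often a normalized claim appears across pages. It is not how pre-training works internally. The page list, the claims and the function names are hypothetical and exist only for this example.

```python
import re
from collections import Counter


def normalize(claim: str) -> str:
    """Lowercase and strip punctuation so lightly reworded copies collapse together."""
    return re.sub(r"[^a-z0-9 ]", "", claim.lower()).strip()


def claim_counts(pages: list[tuple[str, str]]) -> Counter:
    """Count how many pages repeat each normalized claim.

    `pages` holds (author_background, claim) pairs. The background is
    deliberately ignored: frequency alone says nothing about expertise.
    """
    return Counter(normalize(claim) for _background, claim in pages)


# Hypothetical data: three near-copies of one claim, one dissenting expert.
pages = [
    ("content writer", "Claim A is always true."),
    ("content writer", "claim A is always true"),
    ("content writer", "Claim A is ALWAYS true!"),
    ("coach, 15 years", "Claim A only holds for beginners."),
]

counts = claim_counts(pages)
most_common_claim, frequency = counts.most_common(1)[0]
print(most_common_claim, frequency)
```

The copied claim wins on frequency three to one, although the single dissenting entry is the only one with stated coaching experience. A pure frequency signal cannot see that.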
Scientific studies
Scientific literature makes up around 10–20% of the LLM training dataset, drawn from sources like PubMed, arXiv and other academic repositories. Sport science is only a small fraction of that: estimates based on domain shares in PubMed/PMC put it below one percent of the overall dataset. There is no exact figure, but the order of magnitude is clear. This is not an attack on sport science, it is simply proportionality: medicine, physics and computer science produce many times more publications.
On top of that comes the demographic bias that runs through the entire literature: for decades, sport science research was carried out mainly on young, male, Western test subjects. Women, people over 50 and beginners without a training background are systematically underrepresented, not because study leads wanted it that way, but because these groups were less relevant to classical elite performance research.
In addition, not all studies are openly accessible. Paywalled articles behind academic licences are more poorly represented in LLM training than open-access publications. What the model knows about sport science is therefore also a function of which studies were available on the net without a paywall, and open-access publishing is not yet a given in sport science.
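The "below one percent" order of magnitude can be sanity-checked with simple arithmetic that uses only the shares quoted in this section. The sketch computes how large sport science would have to be within the scientific slice to reach 1% of the whole corpus. The function name is illustrative, and the inputs are the article's estimates, not measured values.

```python
from fractions import Fraction


def max_domain_share(overall_limit: Fraction, slice_share: Fraction) -> Fraction:
    """Largest share of a slice a domain may hold while staying under
    `overall_limit` of the whole corpus: overall_limit / slice_share."""
    return overall_limit / slice_share


limit = Fraction(1, 100)  # "below one percent" of the overall dataset

# Scientific literature is estimated at roughly 10-20% of the corpus.
for slice_share in (Fraction(10, 100), Fraction(20, 100)):
    threshold = max_domain_share(limit, slice_share)
    print(f"science = {float(slice_share):.0%} of corpus -> "
          f"sport science must stay under {float(threshold):.0%} of science")
```

In other words, sport science stays below 1% of the total as long as it accounts for less than roughly 5 to 10% of the scientific slice.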
Books and specialist literature
Books are often the most reliable source in terms of quality, provided they have been digitised and are present in the training dataset. Classics such as "Science and Practice of Strength Training" (Zatsiorsky & Kraemer) or "Periodization" (Bompa & Haff) have presumably made their way into the training data of large LLMs. Their influence on answer quality for strength topics is real.
The downside: books have publication dates. Older standard works may contain concepts that newer research has revised. The LLM does not notice that; it weights text by pattern, not by recency.
| Source | Share (estimate)* | Strength | Weakness |
|---|---|---|---|
| Websites / blogs (Common Crawl) | ~60–80% | Broad topic coverage, current discussions | Quality uncontrolled, copy cascades, training myths often rank just as well as serious sources |
| Scientific studies (PubMed, arXiv) | ~10–20% | Methodically structured, peer-reviewed | Demographic bias (young/male), paywall filter, sport science <1% of total |
| Books / specialist literature | ~5–10% | Highest quality, structured knowledge | Only digitised works, potentially outdated, newer revisions missing |
* Estimates based on publicly documented pre-training corpora (including The Pile, DOLMA, C4). Exact shares vary by model and are not fully disclosed by providers.
Figure: Where does LLM knowledge about training come from? The three main sources and their gaps: websites and blogs (~60–80%), scientific studies (~10–20%), books and specialist literature (~5–10%). Knowledge cutoff: everything after the cutoff date is missing from the model (new studies, new methods, new tools). Workaround: supply relevant sources yourself as context.
Which model knows what: the comparison (April 2026)
This is the point almost no one explains: not all LLMs have the same sources. And the differences matter for fitness questions. Cutoff, data basis and web search access co-determine what answer quality you can realistically expect.
Snapshot: April 2026
Model versions and cutoff dates change fast. The table below shows the state as of April 2026. The version numbers mentioned (GPT-5.2, Claude 4.6 etc.) may already be outdated by the time you read this. You can find the current cutoff for your model directly at the provider: OpenAI, Anthropic, Google. The principles behind them (source distribution, demographic gaps, cutoff effect) remain stable regardless of version numbers.
| Model | Knowledge cutoff | Notable data sources | Fitness implication |
|---|---|---|---|
| GPT-5.2 | August 2025 | CommonCrawl, books, Wikipedia, proprietary | Freshest sport knowledge; Bing search available |
| Claude 4.6 | August 2025 | CommonCrawl + Anthropic-curated | Similarly current to GPT-5.2; data composition not fully disclosed |
| Gemini 3 Pro | January 2025 | Google web-crawl snapshot, Google Books, likely YouTube transcripts | Broad fitness YouTube coverage; web search as separate inference tool (not training data); 7 months behind GPT/Claude |
| Grok 4 | November 2024 | X/Twitter public posts + web | Strong on trend fitness (Zone 2, Carnivore etc.), but more prone to bro science |
| Llama 4 | August 2024 | CommonCrawl + Facebook/Instagram posts (public) | Strong on lay fitness discourse; weak on sport science; 1 year behind GPT |
| Perplexity | Real-time | Live search, cites sources | No cutoff problem, but quality depends on source selection |
GPT-5.2 and Claude 4.6 know studies published in 2024–2025, such as newer findings on protein timing, zone 2 training or HRV measurement. Llama 4 does not know these. Grok 4 has deeply internalised the Twitter fitness context, which helps with trend topics but calls for caution in evidence-based training planning. Llama's Facebook/Instagram data basis makes it strong on practical community knowledge, and at the same time vulnerable to myths that circulate especially fast there.
Rule of thumb: For fundamental training planning, GPT-5.2 or Claude 4.6 are enough. For time-critical questions on new supplementation research or current sports nutrition recommendations: Perplexity or a model with web search enabled. Anyone who wants to use Claude in a structured way will find a ready-made starting point in the Claude Training & Fitness Skill. For Gemini there is the Gemini Fitness Coach as a Gem.
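The cutoff comparison can be turned into a quick check. The following is a minimal sketch that hard-codes the cutoffs from the April 2026 table above; the function name `could_know` and the example publication date are hypothetical, and the values go stale as soon as providers ship new versions.

```python
from datetime import date
from typing import Optional

# (year, month) cutoffs copied from the April 2026 table; None = live search.
CUTOFFS: dict[str, Optional[tuple[int, int]]] = {
    "GPT-5.2": (2025, 8),
    "Claude 4.6": (2025, 8),
    "Gemini 3 Pro": (2025, 1),
    "Grok 4": (2024, 11),
    "Llama 4": (2024, 8),
    "Perplexity": None,
}


def could_know(model: str, published: date) -> bool:
    """True if `published` falls in or before the model's cutoff month.

    This is only a necessary condition: a text from before the cutoff can
    still be missing (paywall, niche topic, never crawled).
    """
    cutoff = CUTOFFS[model]
    if cutoff is None:  # real-time search, no fixed cutoff
        return True
    return (published.year, published.month) <= cutoff


study_date = date(2025, 3, 1)  # hypothetical publication date
for model in CUTOFFS:
    print(f"{model}: {could_know(model, study_date)}")
```

A `False` here is a clear signal to supply the source yourself or switch to a tool with web search; a `True` only means the model had the chance to see the text.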
Demographic gaps: who is missing from the data
Sport science studies have a standard problem: they mainly test young men between 18 and 30, often sport students. Women, people over 50, beginners without a training background and people with pre-existing conditions are systematically underrepresented.
For an LLM that means: recommendations on strength training for women in perimenopause rest on a significantly narrower knowledge base than recommendations for young men. That is not an LLM problem. It is a problem of sport science that is reflected in the LLM. The model has read what was published. And what was published was mostly research on a specific group.
In practice this looks as follows: if you, a 54-year-old woman who has been training for 8 months, submit a prompt without giving your context, you get the recommendation optimised for the implicit default user: the 28-year-old man with 2 years of training experience. That is not malice, that is statistics.
That does not mean an LLM is useless for these groups. It means you have to supply more context when prompting, and you should assess the answers more critically. That includes explicitly asking about the limitations: "Which aspects of this recommendation might fit my situation less well?"
Sport-specific gaps and the knowledge cutoff
Marathon and cycling are well documented on the German- and English-language web. Powerlifting and Olympic weightlifting too. But for sports like climbing, martial arts or rowing the data basis already gets thinner, and for genuine niche disciplines, LLM knowledge is often thin or haphazard.
That is not because LLMs ignore these sports. It is because simply less text exists about them. An LLM can only learn from what was published on the web β and niche sport simply has a smaller community writing about it.
Mind the knowledge cutoff
Every LLM has a fixed cutoff. Everything after that is completely missing, no matter how confident the answer sounds. What date that concretely is for your model changes with every new version. You can find the current cutoff at the provider, or ask the model directly: "What is your knowledge cutoff?" The answer is there in seconds and usually more current than any list in an article.
New training methods have it particularly hard. A method needs time to be documented on the web, and then more time to be taken up into LLM pre-training at all. Sport science concepts that appeared in specialist journals this year or last year practically do not exist for most current models.
This also applies to training concepts that have spread quickly in the community. An approach that has gone viral but was barely published before could be missing from LLM knowledge or misrepresented, even if everyone in your training group knows it.
What this means for your prompt
Whoever knows where an LLM's knowledge comes from can query it more purposefully, and knows when to be sceptical.
You cannot ask your app the right questions if you cannot assess the answers. That is the prompt paradox.
The more you know about training, the better you can compensate for the weaknesses of the LLM. You recognise when an answer is too generic. You know which recommendations need to be specific for your situation. You can point the LLM specifically to missing information, or supply your own sources.
If you belong to a group that is underrepresented in the training data, say so explicitly in the prompt. Age, gender, training experience, goal: that is context that makes the difference. Without this context, you get the standard recommendation for the implicit default user.
The same applies to the cutoff problem. If you know your topic concerns newer developments, name that directly in the prompt, for example like this:
Prompt
Your knowledge cutoff is [date]. If you do not know newer developments on this topic after that date, say so explicitly. I will continue with a primary source. Also consider that sport science studies have a known bias towards young, male test subjects. My profile: [age, gender, training experience, goal, sport].
If you practise a niche sport or work with a newer method, give the LLM the relevant basics. Copy/paste from studies or a brief explanation of the approach compensates for the knowledge gap for this conversation. The LLM does not store anything permanently, but it can work with it in this context.
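If you reuse this prompt often, the template can be filled in programmatically. The following is a minimal sketch, assuming you assemble the text yourself and paste it into the chat; the `Profile` class, `build_prompt` function and the example values are hypothetical and not part of any provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Profile:
    age: int
    gender: str
    training_experience: str
    goal: str
    sport: str


def build_prompt(question: str, cutoff: str, profile: Profile,
                 excerpts: Optional[list[str]] = None) -> str:
    """Fill the article's prompt template and append optional source excerpts."""
    parts = [
        f"Your knowledge cutoff is {cutoff}. If you do not know newer "
        "developments on this topic after that date, say so explicitly. "
        "I will continue with a primary source.",
        "Also consider that sport science studies have a known bias "
        "towards young, male test subjects.",
        f"My profile: {profile.age} years old, {profile.gender}, "
        f"{profile.training_experience} of training experience, "
        f"goal: {profile.goal}, sport: {profile.sport}.",
    ]
    if excerpts:
        # Pasted study passages or method descriptions close the knowledge
        # gap for this conversation only; nothing is stored permanently.
        parts.append("Use the following source excerpts as context:")
        parts.extend(f"[{i}] {text}" for i, text in enumerate(excerpts, start=1))
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


# Example values are illustrative.
profile = Profile(54, "female", "8 months", "build strength", "strength training")
prompt = build_prompt(
    "How should I progress my squat over the next 12 weeks?",
    cutoff="August 2025",
    profile=profile,
    excerpts=["<paste the relevant passage from your study here>"],
)
print(prompt)
```

Keeping the profile in one place makes it harder to forget a field, which matters most for exactly the groups the default answer fits least.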
Web search as a workaround: which model can do what
The most direct way to get around the cutoff problem: web search. If the model can pull current content, that at least partially compensates for the frozen training knowledge. How strongly this workaround helps depends on which tool you use, and whether you have enabled web search at all.
| Tool | Web search | Activation | Note |
|---|---|---|---|
| Perplexity | Yes (core feature) | Always on | Web search is the main product, no switching required |
| Gemini | Yes (Google-powered) | Automatic | Direct Google Search integration, on by default |
| ChatGPT | Yes | Automatic (Free & Plus) | Used independently on relevant prompts |
| Claude | Yes | Enable manually | Toggle in the interface, not on by default |
Status: April 2026, subject to change
Features like web search are continually adjusted by providers. What is on automatically today may disappear behind a toggle tomorrow, or the other way around. This table reflects the state at the time of publication. When in doubt, just ask the respective model directly whether it can access current web content.
Web search solves the cutoff problem, but immediately creates a new one. The model searches in the same web we already identified above as problematic: blogs without expertise, copy cascades, training myths that rank well. Only now in real time. If ChatGPT with web search enabled searches for "best strength training for women over 50" and the first three hits are advice sites without a sport science background, exactly this half-knowledge ends up in your answer: freshly crawled, attractively packaged.
Garbage in, garbage out, just in real time now.
Web search is still useful, but only if you know what you are looking for. Anyone who names concrete sources to the model gets better results than someone who lets it search freely. You can steer this directly in the prompt:
Prompt
Search exclusively in PubMed (pubmed.ncbi.nlm.nih.gov) for peer-reviewed studies on [topic]. Give me title, year and DOI. No free web browsing.
You are the filter, not the model.
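Being the filter can also mean checking PubMed yourself rather than trusting the model's citations. As a minimal sketch, the snippet below builds a query URL for NCBI's public E-utilities ESearch endpoint without sending any request; the helper name `pubmed_search_url` and the example topic are illustrative assumptions.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint for querying Entrez databases.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(topic: str, max_results: int = 20) -> str:
    """Build an ESearch URL for a PubMed query (no request is sent here)."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": topic,           # free-text query
        "retmode": "json",       # ask for a JSON response
        "retmax": max_results,   # cap the number of returned IDs
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("resistance training postmenopausal women")
print(url)
```

Opening the URL returns a list of PubMed IDs, which you can look up on pubmed.ncbi.nlm.nih.gov to verify that a title, year and DOI quoted by the model actually exist.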
Having knowledge vs. applying knowledge correctly
Even if an LLM knows a study, that does not mean it applies it correctly. It can cite a meta-analysis with full conviction and in doing so distort author, year or core finding. There is no built-in uncertainty indicator. The model sounds just as self-assured when the answer is wrong as when it is correct.
That is a topic in its own right that goes beyond the data source question. Where exactly this happens, with concrete fitness examples and how you recognise it, is shown in Where AI training plans fail.
Frequent questions on AI training data
Which AI model has the best fitness knowledge?
For current, evidence-based training: GPT-5.2 or Claude 4.6. Both have a cutoff of August 2025 and know newer sport science. For trend topics and community knowledge: Grok 4. For time-critical research where new studies matter: Perplexity.
Can I trust an AI training plan?
For general structures (progression logic, basic exercises, recovery planning), yes. LLMs are solid on established training principles. For specific medical situations, injury history or strongly underrepresented groups (women 50+, niche sports): always combine with your own expertise or real coach input.
What is a knowledge cutoff and why is it relevant for fitness?
The knowledge cutoff is the date after which the model knows nothing new. For fitness that means: studies, methods or findings that appeared after it do not exist for the model, no matter how confident the answer sounds. GPT-5.2 and Claude 4.6 are current as of August 2025. Llama 4 ends at August 2024, a relevant difference in fast-moving fields like sports nutrition or supplementation research.
Verdict: The LLM knows what is in the text
LLM knowledge about training is not evenly distributed. It is a mirror of what was published, with all the emphases, biases and gaps that brings. On top of that, the model does not always apply existing knowledge correctly. Whoever understands this can use the tool sensibly and build a perfect training plan with AI. Whoever ignores it gets standard answers for standard situations, phrased with the self-assurance of an Olympic coach.
The LLM knows what is in the text. What is in no text, it does not know. Your job is to know the gaps and to fill them yourself.