A pilot study, a chatbot and 51 out of 100 points
Norwegian researchers developed FysBot – a mobile app with a GPT-4-based chatbot designed to motivate adults with obesity to become more active. The pilot study at a rehabilitation clinic in Tromsø ran for six weeks with 36 participants. The result is both sobering and instructive.
The System Usability Scale (SUS) – a standardised questionnaire for assessing usability – came in at 51.3 out of 100. For context: 68 is the commonly cited average, so anything below it counts as below average. Engagement declined steadily over the course of the study, with a clear drop already showing after week two. Of the 36 participants who started, 17 (47%) stuck with it until the end (Larbi et al., 2026, Digital Health).
What is the SUS?
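The System Usability Scale is a ten-question usability survey that has been in use since 1986. Each question is answered on a five-point agreement scale, and the answers are combined into a score between 0 and 100. The score is not a percentage: 68 is the commonly cited average, values above it are considered acceptable, and values above 80 good. The SUS does not measure individual features but the user's subjective overall experience.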
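For readers who want to see how a single SUS score comes about, here is a minimal sketch of the standard scoring rule: odd-numbered questions are positively worded and contribute (answer - 1), even-numbered questions are negatively worded and contribute (5 - answer), and the sum is multiplied by 2.5. The function name and the sample answers are illustrative only and are not taken from the FysBot study, which reports the score averaged across participants.

```python
def sus_score(responses):
    """Return the SUS score (0-100) for one participant.

    `responses` holds the ten answers in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for number, answer in enumerate(responses, start=1):
        if number % 2 == 1:
            # Odd-numbered items are positively worded.
            total += answer - 1
        else:
            # Even-numbered items are negatively worded.
            total += 5 - answer
    # Ten items with 0-4 points each give 0-40; scale to 0-100.
    return total * 2.5


# Hypothetical participant who answers "neutral" to every question.
print(sus_score([3] * 10))
```

A participant who picks the middle option everywhere ends up at 50, which puts the study average of 51.3 into perspective.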
What specifically went wrong
Participants reported three core problems. First: technical issues – the app showed limitations in functionality. Second: lack of personalisation. The chatbot delivered generic recommendations instead of individually tailored advice. And third: poor fit with everyday life. FysBot could not be meaningfully integrated into existing routines.
Particularly striking: participants' self-efficacy – their confidence in their own ability to exercise – dropped from 58 to 49 points during the study. The authors note that this warrants further research, so the drop should not be read as proven harm. Even so, the figure suggests that usability is not a nice-to-have for AI interventions: if the experience frustrates rather than motivates, it can actively work against the intended goal.
To be fair: that was the whole point
Before anyone writes this off as FysBot failing: this was a pilot study. The explicit goal was to test feasibility and usability. The researchers wanted to find out where the rough edges were before pushing a product to market. That is exactly the right approach. Many fitness apps skip this step and land straight in the app store – with the same problems, but without the honesty to name them.
The study concretely identified what needs to improve: iterative co-design with users, better personalisation and integration that actually fits the target group's daily life. These are not surprising insights – but they are now backed by data.
What this means for AI fitness apps
FysBot is unlikely to be an isolated case. Many AI-driven fitness apps are built from the developers' point of view – not from that of the people they are actually meant to serve. A GPT-4 backend alone does not make a good product. If the interface is off, the answers stay generic and the app does not fit into everyday life, even the best language model is of little use.
For anyone wanting to use AI in fitness themselves, this study points to a decisive lesson: a good user interface and well-crafted prompts matter at least as much as the model behind them. If you work directly with ChatGPT or Claude, at least you have full control over the input – and you don't have to wait for an app team to retrofit the personalisation.
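What "full control over the input" can look like in practice is sketched below: a small helper that turns personal details into a prompt you could paste into a chat assistant. This is an illustration under assumptions, not how FysBot builds its prompts; the function name, the profile fields and the wording are all hypothetical.

```python
def build_prompt(profile):
    """Assemble a personalised coaching prompt from a profile dict.

    All keys used here are hypothetical examples of the kind of
    context a generic chatbot answer would otherwise be missing.
    """
    limitations = ", ".join(profile["limitations"]) or "none stated"
    lines = [
        "You are a supportive physical-activity coach.",
        f"Goal: {profile['goal']}",
        f"Available time: {profile['minutes_per_day']} minutes per day",
        f"Limitations: {limitations}",
        f"Daily routine: {profile['routine']}",
        "Suggest one realistic activity for today and explain "
        "how it fits into this routine.",
    ]
    return "\n".join(lines)


# Hypothetical profile; replace the values with your own situation.
example = {
    "goal": "walk more on workdays",
    "minutes_per_day": 20,
    "limitations": ["knee pain"],
    "routine": "office job, commute by bus",
}
print(build_prompt(example))
```

The point of the sketch is the structure: goal, time budget, limitations and routine are exactly the details participants found missing in the generic recommendations.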
Source
Larbi, D., Zanaboni, P., Årsand, E., Randine, P., Trondsen, M. V., Denecke, K., Wynn, R. & Gabarron, E. (2026). Feasibility and usability of a ChatGPT-based app to support physical activity: A pilot study. Digital Health, 12, 20552076261417860. doi:10.1177/20552076261417860



