Serious Data From Testing LLMs

Published: October 15, 2025 by James Bach

One of the differences between NAIFs (non-critical AI fanboys… please don’t be one) and me is that I believe in gathering evidence before making strong claims about how a system is likely to behave. I realize that AI is exciting and that it’s fun to think of all the things it might do for us. But we are living in a world where hundreds or thousands of CEOs and CTOs are pushing their people to use AI without a shred of evidence that it won’t corrupt their products, ruin their data, and harm their customers. They live in faith, and that faith can do terrible damage to the industry before they realize what they’ve done to themselves and their employees.

Can AI be used responsibly? I think so. That means we need to test it and understand how it fails.

So, here’s some data for you. I performed a LARC experiment on four LLMs, at three different temperatures, with two styles of prompts. It involved retrieving ingredients for an apple pie recipe from a loosely structured text that included four recipes. This task should be easy as pie itself for these models. Is it? Look at the data and find out.
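
A minimal sketch of the kind of harness such an experiment involves, using the Ollama REST API for local models (the model names, prompt wording, source file, and trial count below are illustrative placeholders, not the actual setup):

```python
# Illustrative sketch of a LARC-style extraction experiment: every combination
# of model, temperature, and prompt style is run repeatedly against the same
# loosely structured recipe text. Model names, prompts, file name, and trial
# count are assumptions, not the actual configuration used.
import itertools
import requests

SOURCE_TEXT = open("four_recipes.txt").read()  # loosely structured text containing four recipes

PROMPT_STYLES = {
    "plain": "List the ingredients of the apple pie recipe in the text below.\n\n",
    "strict": ("Return ONLY a JSON array of the ingredient lines for the apple pie "
               "recipe in the text below. No commentary.\n\n"),
}
MODELS = ["gemma2:9b", "llama3.1:8b", "mistral:7b"]  # a fourth model (GPT-4.1) would go through the OpenAI client instead
TEMPERATURES = [0.0, 0.5, 1.0]
TRIALS = 10

def run_trial(model: str, temperature: float, prompt: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False,
              "options": {"temperature": temperature}},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()  # completion text plus timing/token metadata

results = []
for model, temp, (style, preamble) in itertools.product(MODELS, TEMPERATURES, PROMPT_STYLES.items()):
    for trial in range(TRIALS):
        data = run_trial(model, temp, preamble + SOURCE_TEXT)
        results.append({"model": model, "temperature": temp, "prompt_style": style,
                        "trial": trial, "output": data.get("response"), "raw": data})
```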

The raw data is stored in a MongoDB on my hard drive. I can produce it on demand.
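
For the storage side, a short pymongo sketch (the database and collection names here are invented for illustration) shows the kind of record-keeping involved:

```python
# Hypothetical sketch: persist each trial record in a local MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["larc_experiments"]["apple_pie_trials"]

collection.insert_many(results)          # `results` as built in the loop sketched above
print(collection.count_documents({}))    # quick sanity check on what was stored
```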

The somewhat cooked data is on my website here.

And my report that analyzes the data is on my website here.

I claim that this represents one kind of respectable testing that we need to do with AI. I will be following up with similar experiments soon. I may also post my code on GitHub. We’ll see. Meanwhile, I would welcome critique. I want to improve and extend this work to other domains and explore different prompting styles.

If you want to perform experiments like this yourself, I can help you set them up. In fact, Michael Bolton and I did this work as part of developing our new Testers and AI class.

Filed Under: AI and Testing, Automation, Buggy Products, Quality, Rapid Software Testing Methodology, Test Strategy

Comments

  1. Hari Prasad says

    15 October 2025 at 2:37 am

    I want to perform the experiment in this way. Could you please help or guide me on how to set it up and carry it out?

    [James’ Reply: Sure. I recommend you contact me on Telegram or Signal. I’ll walk you through what you need to do.]

  2. Johnson says

    3 November 2025 at 11:13 pm

    Is there a reason why all the “prompt performance” sections of GPT4.1 are N/A?

    [James’ Reply: Yes, those numbers are based on data that comes back in the Ollama response object. I don’t get that data from the OpenAI response object. However, there’s an easy workaround that I’ve been meaning to implement.]
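
    For context on the fields involved: when called with stream=False, Ollama’s /api/generate response includes token counts and nanosecond timings (prompt_eval_count, eval_count, eval_duration, and so on), while an OpenAI chat completion only reports token counts under its usage field, so timing has to be measured around the call. The sketch below is a guess at the kind of workaround meant, not the actual fix:

    ```python
    # Hypothetical sketch: normalize "prompt performance" data across backends.
    # Ollama reports token counts and nanosecond timings directly in its response;
    # the OpenAI response object only carries token counts in `usage`, so the
    # caller would time the request itself (e.g. with time.monotonic()).
    def perf_from_ollama(data: dict) -> dict:
        return {
            "prompt_tokens": data.get("prompt_eval_count"),
            "output_tokens": data.get("eval_count"),
            "generation_seconds": data.get("eval_duration", 0) / 1e9,  # ns -> s
        }

    def perf_from_openai(completion, elapsed_seconds: float) -> dict:
        return {
            "prompt_tokens": completion.usage.prompt_tokens,
            "output_tokens": completion.usage.completion_tokens,
            "generation_seconds": elapsed_seconds,  # measured by the caller
        }
    ```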

  3. Viktoriia says

    15 November 2025 at 3:56 pm

    [James’ Reply: Sorry I did not see this earlier!]

    Hi James,

    Thank you for sharing the report. I found the results interesting and kind of funny in a good way if that makes sense. I feel like we as people tend to assign personalities (or maybe it’s just a neurodivergent thing, or some other subgroup) to things that do not have them, as it makes it easier and more pleasant to weave them into our narratives about the world and to interact with them. Gen AIs are subjected to it more than simpler systems for obvious reasons. Gemma not liking eggs or sugar was such a good way to show and remember the flaws of using chat AIs for analysis. 🙂

    On a more serious note, I wonder how much it even makes sense to try to develop some reliable approach to testing AIs in this way, at least outside of training models. I might just be misunderstanding the whole thing, as my AI engine expertise is limited, but they are, at their core, statistics-based engines that spit out whatever they determine to be the likeliest thing collective humanity (or a subset of humanity, when you give them a role to play and they try to slice their DB to what they think people in that role said) would say in that context. There isn’t any understanding baked in, even at the level of simple arithmetic. I keep bumping into this: ChatGPT, when “analyzing” fiction, loves to say something like “Just three words, but they hit hard!” while completely miscounting the actual number of words in the sentence. It is amazing that they do as well as they do, frankly, but my bar is pretty low.

    [James’ Reply: Well, they are not being sold as mere text spitter-outers. They are sold as “word calculators” as Sam Altman put it. They are sold as serious productivity engines. Therefore, it is reasonable to wonder if they are reliable at doing serious things.]

    I guess what I’m trying to say is, despite people trying to rely on Gen AI and getting all excited about it, these systems are intrinsically unreliable and require an independent expert verifying the results. I like the idea of AIs checking their own work in some shape or form (i.e., one model checking the output of another) as the first line of defense, similar to what you did in your experiment, but that can’t be enough for anything where results actually matter. We simply do not have a true Gen AI yet. We have specialized AIs unavailable to the public and doing very well in their respective areas (like playing Go or identifying proteins to aid biochemistry research), and we have extremely fancy chat bots that fake expertise and personalities well enough to get everyone (me included, tbh) excited. I don’t think any amount of testing would do much more than demonstrate the fakeness of that expertise and maybe reveal some of the curious quirks of different models.

    [James’ Reply: Yet you see from the Apple Pie results that ChatGPT 4.1 is substantially reliable at this task, compared to the others. This is worth knowing. If I were a developer creating a system that used AI to scrape unstructured text, I would be more comfortable using that model rather than Gemma.]

    On one episode of the “Skeptics’ Guide to the Universe” podcast they cited research showing that AI models also tend to adjust their answers to satisfy the shallow requirements of the prompter, without actually improving the quality of the output. The research was, from memory, about training models using penalties for unsatisfying answers, and the result was that the model adjusts to whatever criteria you give it, even if that means inventing “facts” and confidently presenting them as real. Forgive the vagueness; if you are interested, I can try to dig it up. It’s from maybe a year ago. But basically, even if the models did well in your experiment, it would just mean they do well on that particular task on that particular data, and we would still have no idea at which point that effect wears off (i.e., how different the text or the prompt needs to be for the quality to drop sharply). More experiments expand the certainty field, but I am skeptical that you could safely infer from the results into the unknown, especially given that AI models and the surrounding software are constantly in flux and we have little control over them.

    As for responsible use, I’m leaning towards starting from the assumption that you cannot trust AI output without testing it every single time, and therefore pushing back against people trying to speed up the SDLC by blindly plugging AI tools into it. The only task I’d trust AI to do without checking the results is one where I do not care about the quality of the output (e.g., some formal-process-satisfying documentation I know for sure nobody would ever read), at which point I’d rather fight to get rid of that task altogether.

    [James’ Reply: I agree that test results on GenAI systems are hard to generalize from.]

