One of the differences between NAIFs (non-critical AI fanboys… please don’t be one) and me is that I believe in gathering evidence before making strong claims about how a system is likely to behave. I realize that AI is exciting and that it’s fun to think of all the things it might do for us. But we are living in a world where hundreds or thousands of CEOs and CTOs are pushing their people to use AI without a shred of evidence that it won’t corrupt their products, ruin their data, and harm their customers. They live in faith, and that faith can do terrible damage to the industry before they realize what they’ve done to themselves and their employees.
Can AI be used responsibly? I think so. That means we need to test it and understand how it fails.
So, here’s some data for you. I performed a LARC experiment on four LLMs, at three different temperatures, with two styles of prompts. The task was to retrieve the ingredients for an apple pie recipe from a loosely structured text that included four recipes. That should be as easy as pie for these models. Is it? Look at the data and find out.
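For anyone who wants to see the general shape of such a harness, here is a minimal Python sketch. It is not my actual code (which I may publish later); the model names, the trial count, the prompt wordings, the `ask_model` stub, and the MongoDB collection name are all illustrative assumptions. It just shows the cross-product of models, temperatures, and prompt styles, with every raw response recorded for later analysis.

```python
import itertools
from pymongo import MongoClient

# Illustrative values only -- the real experiment used its own model list,
# temperatures, prompt wordings, and trial counts.
MODELS = ["model-a", "model-b", "model-c", "model-d"]
TEMPERATURES = [0.0, 0.5, 1.0]
PROMPT_STYLES = {
    "terse": "List the ingredients of the apple pie recipe in the text below.\n\n{text}",
    "structured": (
        "You are given a document containing several recipes. "
        "Return only the ingredients of the apple pie recipe, one per line.\n\n{text}"
    ),
}
TRIALS_PER_CELL = 10  # assumed number of repeated trials per combination


def ask_model(model: str, prompt: str, temperature: float) -> str:
    """Stand-in for a real LLM call (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError("Wire this up to your LLM provider of choice.")


def run_experiment(source_text: str) -> None:
    # Local MongoDB instance; database and collection names are placeholders.
    collection = MongoClient()["larc"]["apple_pie_trials"]
    for model, temp, (style, template) in itertools.product(
        MODELS, TEMPERATURES, PROMPT_STYLES.items()
    ):
        for trial in range(TRIALS_PER_CELL):
            response = ask_model(model, template.format(text=source_text), temp)
            collection.insert_one({
                "model": model,
                "temperature": temp,
                "prompt_style": style,
                "trial": trial,
                "response": response,  # raw output, judged for correctness later
            })
```

The point of the design is the systematic sweep: every model sees every temperature and every prompt style, repeatedly, so that failures can be attributed to specific conditions rather than to luck.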
The raw data is stored in a MongoDB on my hard drive. I can produce it on demand.
The somewhat cooked data is on my website here.
And my report that analyzes the data is on my website here.
I claim that this represents one kind of respectable testing that we need to do with AI. I will be following up with similar experiments soon. I may also post my code on GitHub. We’ll see. Meanwhile, I would welcome critique. I want to improve this work, extend it to other domains, and explore different prompting styles.
If you want to perform experiments like this yourself, I can help you set them up. In fact, Michael Bolton and I did this work as part of developing our new Testers and AI class.