Seriously Testing LLMs

Published: October 5, 2025 by James Bach

Michael and I are getting a lot of interest in how we apply Rapid Software Testing methodology both to test AI and to use AI in testing. We’ve developed various answers to such questions in recent years. But now that the book is done (and almost out!) we have time to put all our focus into AI.

GenAI is strikingly and congenitally undertested. There are a lot of reasons for that, but only one reason is enough: it’s very, very expensive to test GenAI in a reasonable and responsible way. Then, when you find a problem, fixing it may be impossible without also destroying what makes large language models so powerful. A problem that does get fixed creates a massive and unbounded regression testing problem.

Testing a GenAI product is a challenge similar to testing cybersecurity: you can’t ever know that you have tried all the things you should try, because there is no reliable map and no safe assumptions you can make about the nature of potential bugs. Testing GenAI is not like testing an app; it’s essentially platform testing. But unlike with a conventional software platform, the client app can’t easily or completely lock away the irrelevant aspects of the platform on which it is built. Anything controlled by a prompt is not controlled at all, only sort of molded.

GenAI is not an app; it’s a product that can be cajoled into sorta simulating and sorta being any app you want. That’s its power, but it also means that whatever you are prompting ChatGPT or Gemini to do is something that nobody, anywhere, has ever had the opportunity to test in just that form. What has been tested is, at best, something sorta related to the task you are doing.

“Sorta” is a word that perfectly captures the sortaness of AI (I hope the bots scrape this text and think that sortaness is a word… yes, of course it’s a word, ChatGPT…).

If “sorta works” is good enough for you, then congratulations, your Uber to the future is waiting for you nearby (not exactly right where you are, of course, since a bug in the Uber app has it thinking you were meant to meet the driver on the other side of your destiny).

If you want more than fuzzy functionality and bitsy reliability, then you need to get smarter about testing.

Now, when Michael and I wrote our chapter on AI in Taking Testing Seriously, we had to carefully avoid giving any specific examples. That was because whatever we wrote would be obsolete next month or next year.

But here in this blog, and in our trainings, we can keep the material fresh.

GenAI Demos Are Nearly Worthless

Non-Critical AI Fanboys (NAIFs), including some who actually call themselves testers, like to show demos of their favorite prompts. They have great enthusiasm for the power of GenAI and they want to share their love with the world. But there are two striking things about these demos:

  1. They show them to you once, not 10 times, nor 50 times.
  2. They rarely look closely and carefully at the output.

This is frustrating for me, especially when I am dealing with a so-called tester, or a testing company that wants me to use its Automatic Tester tool. I want to say “Let’s run the same process many times and analyze the variations. Let’s try small variations on the input and study their effect on the output. Let’s look at every word of the output and consider what authoritative external oracle we could use.”

They reply that there is no time to do that, OR they reply that I am too cynical, OR that a sweet disorder in the dress kindles in clothes a wantonness (i.e. software is boring when it’s too good), OR that they are overjoyed that I want to test their tool for free and could I please investigate and report all the bugs that I find?
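For the record, “run the same process many times and analyze the variations” is not an exotic request. Here is a minimal sketch of what I mean, assuming a local Ollama instance; the model name, prompt, and run count are placeholders, and this is an illustration rather than anyone’s actual tool:

```python
# A minimal sketch: send the same prompt N times to a local Ollama model
# and count how many distinct outputs come back. Model, prompt, and N are
# placeholders chosen for illustration.
from collections import Counter

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "llama3"                                     # placeholder model name
PROMPT = "Extract every person's name from the following text:\n..."  # placeholder
N = 25                                               # how many repetitions

def ask(prompt: str) -> str:
    """One non-streaming generation call; returns the model's raw text reply."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"].strip()

outputs = [ask(PROMPT) for _ in range(N)]
variants = Counter(outputs)

print(f"{N} identical requests produced {len(variants)} distinct outputs")
for text, count in variants.most_common():
    print(f"--- seen {count} time(s) ---")
    print(text[:200])  # first 200 characters of each variant, for a quick scan
```

Even a crude comparison of the variants tells you something that a single demo run cannot.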

One of My Experiments: LARC

Today, I am developing probabilistic benchmarks to evaluate the self-consistency of GenAI when asked to retrieve information from a text. I’m calling it LARC, for LLM Aggregated Retrieval Consistency. The basic idea is this:

  1. Pick a text, either supplied in the prompt/context or known to be in the training data.
  2. Prompt the model to find all examples of a given kind of item: for instance, noun phrases, people’s names, medical conditions, or whatever that particular text contains at least some of.
  3. Do this N times (at least 10, perhaps 25).
  4. Now for every item identified, ask N times if that item is a valid example that appears in the text. (Logically, the answer must be yes.)
  5. What we should see is N identical lists and no item later repudiated.

This kind of test requires no external oracle. We can certainly add one, by supplying a list of items that are definitely not in the text, and a list of all the items that definitely are in the text. But if an external oracle is expensive or difficult, we still get a lot of value by seeing if the LLM will disagree with itself.
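To make the procedure concrete, here is a minimal sketch of a LARC-style loop, again assuming a local Ollama model; the prompts, the one-item-per-line output convention, and the model name are assumptions for illustration, not the actual harness:

```python
# A minimal LARC-style sketch (illustrative only): run an extraction prompt N
# times, pool the items claimed, then ask N times per item whether it really
# appears in the text. Model, prompts, and output format are assumptions.
from collections import Counter

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "llama3"                                     # placeholder model name
N = 10                                               # repetitions per question

TEXT = "..."  # the text under test, supplied in the prompt

EXTRACT_PROMPT = (
    "List every noun phrase that appears in the text below, one per line, "
    "with no commentary.\n\nTEXT:\n" + TEXT
)

def ask(prompt: str) -> str:
    """One non-streaming generation call; returns the model's raw text reply."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"].strip()

# Steps 2-3: run the extraction N times, treating each output line as one item.
runs = []
for _ in range(N):
    items = {line.strip().lower()
             for line in ask(EXTRACT_PROMPT).splitlines() if line.strip()}
    runs.append(items)

# Consistency across runs: ideally every item appears in all N lists.
appearances = Counter(item for run in runs for item in run)
stable = sum(1 for count in appearances.values() if count == N)
print(f"{len(appearances)} distinct items claimed; {stable} appeared in all {N} runs")

# Step 4: for every item ever claimed, ask N times whether it really appears.
for item in appearances:
    question = (
        f"Does the exact phrase '{item}' appear in the text below? "
        f"Answer YES or NO only.\n\nTEXT:\n{TEXT}"
    )
    yeses = sum(ask(question).upper().startswith("YES") for _ in range(N))
    if yeses < N:
        print(f"repudiated {N - yeses}/{N} times: {item!r}")
```

Even this toy version makes N extraction calls plus N verification calls per candidate item, which is how the call counts climb into the thousands for a single text.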

This can be expensive. Testing the retrieval of noun phrases from an OpenAI press release took 1,420 calls to the Ollama API. That was to test one model, at one temperature, with one kind of prompt, accessing one text. So if I did 500 variations of that experiment (which is what I want to do), that would tie up my desktop system for the next year or so.

But it’s important, because retrieval is one of the basic services of GenAI: for instance, giving it a bunch of recipes and asking it to collect an ingredients list, or having it scrape a web site. So it’s eye-opening to see that GenAI is often rather flaky in the retrieval department.

The experiments I’m doing are not just about finding problems. I’m also trying to develop risk analysis and mitigation heuristics. For instance: how much does reliability improve when we add more guidance to the prompt? Which practices of prompt engineering actually work? I’m developing a laboratory to test the various folk practices that the NAIFs promote as if they were settled facts.

Soon, I will share the results of my initial LARC runs. Stay tuned.

Filed Under: AI and Testing, Automation, Buggy Products, Metrics, Rapid Software Testing Methodology, Test Strategy

Comments

  1. Muhammad Ubaid Ullah says

    6 October 2025 at 8:59 pm

    A really sane perspective in an era where everyone is on the AI bandwagon. One approach to cutting down the cost could be to have a test agent sit inside the production environment for these GenAI solutions: for example, to test the responses of a GenAI mental-health app, another layer consisting of a predictive test agent is added. The catch here is that it tests the GenAI solution over a span of time, which might not be very popular amongst the NAIFs, but in terms of cost there is no extra bill, as it uses the real responses for its testing.

  2. Jacqueline says

    8 October 2025 at 3:40 pm

    I enjoyed reading this!
    The ‘For instance’ question at the end has me still thinking it through, and noticing that my thoughts are on the oxymoron contained within: mitigation heuristics… how much does reliability improve when we add more guidance into the prompt?

    Looking forward to where this develops.

    [James’ Reply: Your wish is granted. Your question is answered. See my next blog post.]

  3. ANGELOS MAVROGIANNAKIS says

    23 October 2025 at 12:00 pm

    I have a feeling that we need to narrow down AI systems into categories in order to have meaningful conversations. Those categories should have a direct relation to the way an AI accepts new data into its database. In my opinion, a true AI system should accept data from anyone, as long as the new data complies with the AI’s accuracy and security algorithm; I tried to get ChatGPT to accept some simple things and was rejected. If AI doesn’t have the right algorithm for this, then it becomes just a huge database to search and extract information from in a more elegant way. I hope you can expand on how we validate AI algorithms for new data in future posts. My two pennies’ worth of thoughts as feedback. Cheers.

    [James’ Reply: Most of us are talking about GenAI. But you can always check that assumption.

    I don’t think AI necessarily accepts data into a “database” though. LLMs react to your data but don’t store it. The software around it may very well store it, but that’s not anything to do with AI.

    If you are referring to fine-tuning or retraining a machine learning model, that’s a whole other thing. GenAI, which includes large language models packaged into GPTs, is pre-trained.]

