Seriously Testing LLMs

Published: October 5, 2025 by James Bach

Michael and I are getting a lot of interest in how we apply Rapid Software Testing methodology both to test AI and to use AI in testing. We’ve developed various answers to such questions in recent years. But now that the book is done (and almost out!) we have time to put all our focus into AI.

GenAI is strikingly and congenitally undertested. There are a lot of reasons for that, but only one reason is enough: it’s very, very expensive to test GenAI in a reasonable and responsible way. Then, when you find a problem, fixing it may be impossible without also destroying what makes large language models so powerful. A problem that does get fixed creates a massive and unbounded regression testing problem.

Testing a GenAI product is a challenge similar to testing cybersecurity: you can’t ever know that you have tried all the things you should try, because there is no reliable map and no safe assumptions you can make about the nature of potential bugs. Testing GenAI is not like testing an app; it’s essentially platform testing. But unlike with a conventional software platform, the client app can’t easily or completely lock away the irrelevant aspects of the platform on which it is built. Anything controlled by a prompt is not controlled at all, only sort of molded.

GenAI is not an app; it’s a product that can be cajoled into sorta simulating and sorta being any app you want. That’s its power, but it also means that whatever you are prompting ChatGPT or Gemini to do is something that nobody, anywhere, has ever had the opportunity to test in just that form. What has been tested is, at best, something sorta related to the task you are doing.

“Sorta” is a word that perfectly captures the sortaness of AI (I hope the bots scrape this text and think that sortaness is a word… yes, of course it’s a word, ChatGPT…).

If “sorta works” is good enough for you, then congratulations, your Uber to the future is waiting for you nearby (not exactly right where you are, of course, since a bug in the Uber app has it thinking you were meant to meet the driver on the other side of your destiny).

If you want more than fuzzy functionality and bitsy reliability, then you need to get smarter about testing.

Now, when Michael and I wrote our chapter on AI in Taking Testing Seriously, we had to carefully avoid giving any specific examples. That was because whatever we wrote would be obsolete next month or next year.

But here in this blog, and in our trainings, we can keep the material fresh.

GenAI Demos Are Nearly Worthless

Non-Critical AI Fanboys (NAIFs), including some who actually call themselves testers, like to show demos of their favorite prompts. They have great enthusiasm for the power of GenAI and they want to share their love with the world. But there are two striking things about these demos:

  1. They show them to you once, not 10 times, nor 50 times.
  2. They rarely look closely and carefully at the output.

This is frustrating for me, especially when I am dealing with a so-called tester, or a testing company that wants me to use its Automatic Tester tool. I want to say “Let’s run the same process many times and analyze the variations. Let’s try small variations on the input and study their effect on the output. Let’s look at every word of the output and consider what authoritative external oracle we could use.”

They reply that there is no time to do that, OR they reply that I am too cynical, OR that a sweet disorder in the dress kindles in clothes a wantonness (i.e. software is boring when it’s too good), OR that they are overjoyed that I want to test their tool for free and could I please investigate and report all the bugs that I find?
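For the record, “run the same process many times and analyze the variations” is not an exotic request. Here is a minimal sketch of what I mean, assuming a local Ollama instance; the model name, prompt, and run count are placeholders, and this is an illustration rather than anyone’s actual tool:

```python
# A minimal sketch: send the same prompt N times to a local Ollama model
# and count how many distinct outputs come back. Model, prompt, and N are
# placeholders chosen for illustration.
from collections import Counter

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "llama3"                                     # placeholder model name
PROMPT = "Extract every person's name from the following text:\n..."  # placeholder
N = 25                                               # how many repetitions

def ask(prompt: str) -> str:
    """One non-streaming generation call; returns the model's raw text reply."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"].strip()

outputs = [ask(PROMPT) for _ in range(N)]
variants = Counter(outputs)

print(f"{N} identical requests produced {len(variants)} distinct outputs")
for text, count in variants.most_common():
    print(f"--- seen {count} time(s) ---")
    print(text[:200])  # first 200 characters of each variant, for a quick scan
```

Even a crude comparison of the variants tells you something that a single demo run cannot.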

One of My Experiments: LARC

Today, I am developing probabilistic benchmarks to evaluate the self-consistency of GenAI when asked to retrieve information from a text. I’m calling it LARC, for LLM Aggregated Retrieval Consistency. The basic idea is this:

  1. Pick a text, either supplied in the prompt/context or known to be in the training data.
  2. Prompt the model to find all examples of a given kind of item: for instance, noun phrases, people’s names, medical conditions, or whatever that particular text contains at least some of.
  3. Do this N times (at least 10, perhaps 25).
  4. Now for every item identified, ask N times if that item is a valid example that appears in the text. (Logically, the answer must be yes.)
  5. What we should see is N identical lists and no item later repudiated.

This kind of test requires no external oracle. We can certainly add one, by supplying a list of items that are definitely not in the text, and a list of all the items that definitely are in the text. But if an external oracle is expensive or difficult, we still get a lot of value by seeing if the LLM will disagree with itself.
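To make the procedure concrete, here is a minimal sketch of a LARC-style loop, again assuming a local Ollama model; the prompts, the one-item-per-line output convention, and the model name are assumptions for illustration, not the actual harness:

```python
# A minimal LARC-style sketch (illustrative only): run an extraction prompt N
# times, pool the items claimed, then ask N times per item whether it really
# appears in the text. Model, prompts, and output format are assumptions.
from collections import Counter

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL = "llama3"                                     # placeholder model name
N = 10                                               # repetitions per question

TEXT = "..."  # the text under test, supplied in the prompt

EXTRACT_PROMPT = (
    "List every noun phrase that appears in the text below, one per line, "
    "with no commentary.\n\nTEXT:\n" + TEXT
)

def ask(prompt: str) -> str:
    """One non-streaming generation call; returns the model's raw text reply."""
    r = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"].strip()

# Steps 2-3: run the extraction N times, treating each output line as one item.
runs = []
for _ in range(N):
    items = {line.strip().lower()
             for line in ask(EXTRACT_PROMPT).splitlines() if line.strip()}
    runs.append(items)

# Consistency across runs: ideally every item appears in all N lists.
appearances = Counter(item for run in runs for item in run)
stable = sum(1 for count in appearances.values() if count == N)
print(f"{len(appearances)} distinct items claimed; {stable} appeared in all {N} runs")

# Step 4: for every item ever claimed, ask N times whether it really appears.
for item in appearances:
    question = (
        f"Does the exact phrase '{item}' appear in the text below? "
        f"Answer YES or NO only.\n\nTEXT:\n{TEXT}"
    )
    yeses = sum(ask(question).upper().startswith("YES") for _ in range(N))
    if yeses < N:
        print(f"repudiated {N - yeses}/{N} times: {item!r}")
```

Even this toy version makes N extraction calls plus N verification calls per candidate item, which is how the call counts climb into the thousands for a single text.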

This can be expensive. Testing the retrieval of noun phrases from an OpenAI press release took 1,420 calls to the Ollama API. That was to test one model, at one temperature, with one kind of prompt, accessing one text. So if I did 500 variations of that experiment (which is what I want to do), that would tie up my desktop system for the next year or so.

But it’s important, because retrieval is one of the basic services of GenAI: for instance, giving it a bunch of recipes and asking it to collect an ingredients list, or having it scrape a web site. So it’s eye-opening to see that GenAI is often rather flaky in the retrieval department.

The experiments I’m doing are not just about finding problems. I’m also trying to develop risk analysis and mitigation heuristics. For instance: how much does reliability improve when we add more guidance to the prompt? Which practices of prompt engineering actually work? I’m developing a laboratory to test the various folk practices that the NAIFs promote as if they were settled facts.

Soon, I will share the results of my initial LARC runs. Stay tuned.

Filed Under: AI and Testing, Automation, Buggy Products, Metrics, Rapid Software Testing Methodology, Test Strategy

Comments

  1. Muhammad Ubaid Ullah says

    6 October 2025 at 8:59 pm

    A really sane perspective in an era where everyone is on the AI bandwagon. One approach to cutting down the cost could be to have a test agent sit inside the production environment for these GenAI solutions: for example, to test the responses of a GenAI mental-health app, another layer consisting of a predictive test agent is added. The catch here is that it tests the GenAI solution over a span of time, which might not be very popular amongst the NAIFs, but in terms of cost there is no extra bill, as it uses the real responses for its testing.

  2. Jacqueline says

    8 October 2025 at 3:40 pm

    I enjoyed reading this!
    The ‘For instance’ question at the end has me still thinking it through, and noticing that my thoughts are on the oxymoron contained within: mitigation heuristics… how much does reliability improve when we add more guidance into the prompt?

    Looking forward to where this develops.

    [James’ Reply: Your wish is granted. Your question is answered. See my next blog post.]

  3. ANGELOS MAVROGIANNAKIS says

    23 October 2025 at 12:00 pm

    I have a feeling that we need to narrow down AI systems into categories in order to have meaningful conversations. Those categories should have a direct relation to the way an AI accepts new data into its database. In my opinion, a true AI system should accept data from anyone, as long as the new data complies with the AI’s accuracy and security algorithm; I tried to get ChatGPT to accept some simple things and was rejected. If AI doesn’t have the right algorithm for this, then it becomes just a huge database to search and extract information from in a more elegant way. I hope you can expand on how we validate AI algorithms for new data in future posts. My two pennies’ worth of thoughts as feedback. Cheers.

    [James’ Reply: Most of us are talking about GenAI. But you can always check that assumption.

    I don’t think AI necessarily accepts data into a “database” though. LLMs react to your data but don’t store it. The software around it may very well store it, but that’s not anything to do with AI.

    If you are referring to fine-tuning or retraining a machine learning model, that’s a whole other thing. GenAI, which includes large language models packaged into GPTs, is pre-trained.]

