Serious Data From Testing LLMs

Published: October 15, 2025 by James Bach

One of the differences between NAIFs (non-critical AI fanboys… please don’t be one) and me is that I believe in gathering evidence before making strong claims about how a system is likely to behave. I realize that AI is exciting and that it’s fun to think of all the things it might do for us. But we are living in a world where hundreds or thousands of CEOs and CTOs are pushing their people to use AI without a shred of evidence that it won’t corrupt their products, ruin their data, and harm their customers. They live in faith, and that faith can do terrible damage to the industry before they realize what they’ve done to themselves and their employees.

Can AI be used responsibly? I think so. That means we need to test it and understand how it fails.

So, here’s some data for you. I performed a LARC experiment on four LLMs, at three different temperatures, with two styles of prompts. It involved retrieving ingredients for an apple pie recipe from a loosely structured text that included four recipes. This task should be easy as pie itself for these models. Is it? Look at the data and find out.
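
A minimal sketch of the kind of harness such an experiment involves, using the Ollama REST API for local models (the model names, prompt wording, source file, and trial count below are illustrative placeholders, not the actual setup):

```python
# Illustrative sketch of a LARC-style extraction experiment: every combination
# of model, temperature, and prompt style is run repeatedly against the same
# loosely structured recipe text. Model names, prompts, file name, and trial
# count are assumptions, not the actual configuration used.
import itertools
import requests

SOURCE_TEXT = open("four_recipes.txt").read()  # loosely structured text containing four recipes

PROMPT_STYLES = {
    "plain": "List the ingredients of the apple pie recipe in the text below.\n\n",
    "strict": ("Return ONLY a JSON array of the ingredient lines for the apple pie "
               "recipe in the text below. No commentary.\n\n"),
}
MODELS = ["gemma2:9b", "llama3.1:8b", "mistral:7b"]  # a fourth model (GPT-4.1) would go through the OpenAI client instead
TEMPERATURES = [0.0, 0.5, 1.0]
TRIALS = 10

def run_trial(model: str, temperature: float, prompt: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False,
              "options": {"temperature": temperature}},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()  # completion text plus timing/token metadata

results = []
for model, temp, (style, preamble) in itertools.product(MODELS, TEMPERATURES, PROMPT_STYLES.items()):
    for trial in range(TRIALS):
        data = run_trial(model, temp, preamble + SOURCE_TEXT)
        results.append({"model": model, "temperature": temp, "prompt_style": style,
                        "trial": trial, "output": data.get("response"), "raw": data})
```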

The raw data is stored in a MongoDB on my hard drive. I can produce it on demand.
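
For the storage side, a short pymongo sketch (the database and collection names here are invented for illustration) shows the kind of record-keeping involved:

```python
# Hypothetical sketch: persist each trial record in a local MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["larc_experiments"]["apple_pie_trials"]

collection.insert_many(results)          # `results` as built in the loop sketched above
print(collection.count_documents({}))    # quick sanity check on what was stored
```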

The somewhat cooked data is on my website here.

And my report that analyzes the data is on my website here.

I claim that this represents one kind of respectable testing that we need to do with AI. I will be following up with similar experiments soon. I may also post my code on GitHub. We’ll see. Meanwhile, I would welcome critique. I want to improve and extend this work to other domains and explore different prompting styles.

If you want to perform experiments like this yourself, I can help you set them up. In fact, Michael Bolton and I did this work as part of developing our new Testers and AI class.

Filed Under: AI and Testing, Automation, Buggy Products, Quality, Rapid Software Testing Methodology, Test Strategy

Comments

  1. Hari Prasad says

    15 October 2025 at 2:37 am

    I want to perform the experiment in this way. Could you please help or guide me on how to set it up and carry it out?

    [James’ Reply: Sure. I recommend you contact me on Telegram or Signal. I’ll walk you through what you need to do.]

  2. Johnson says

    3 November 2025 at 11:13 pm

    Is there a reason why all the “prompt performance” sections of GPT4.1 are N/A?

    [James’ Reply: Yes, those numbers are based on data that comes back in the Ollama response object. I don’t get that data from the OpenAI response object. However, there’s an easy workaround that I’ve been meaning to implement.]
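
    For context on the fields involved: when called with stream=False, Ollama’s /api/generate response includes token counts and nanosecond timings (prompt_eval_count, eval_count, eval_duration, and so on), while an OpenAI chat completion only reports token counts under its usage field, so timing has to be measured around the call. The sketch below is a guess at the kind of workaround meant, not the actual fix:

    ```python
    # Hypothetical sketch: normalize "prompt performance" data across backends.
    # Ollama reports token counts and nanosecond timings directly in its response;
    # the OpenAI response object only carries token counts in `usage`, so the
    # caller would time the request itself (e.g. with time.monotonic()).
    def perf_from_ollama(data: dict) -> dict:
        return {
            "prompt_tokens": data.get("prompt_eval_count"),
            "output_tokens": data.get("eval_count"),
            "generation_seconds": data.get("eval_duration", 0) / 1e9,  # ns -> s
        }

    def perf_from_openai(completion, elapsed_seconds: float) -> dict:
        return {
            "prompt_tokens": completion.usage.prompt_tokens,
            "output_tokens": completion.usage.completion_tokens,
            "generation_seconds": elapsed_seconds,  # measured by the caller
        }
    ```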

  3. Viktoriia says

    15 November 2025 at 3:56 pm

    [James’ Reply: Sorry I did not see this earlier!]

    Hi James,

    Thank you for sharing the report. I found the results interesting and kind of funny in a good way if that makes sense. I feel like we as people tend to assign personalities (or maybe it’s just a neurodivergent thing, or some other subgroup) to things that do not have them, as it makes it easier and more pleasant to weave them into our narratives about the world and to interact with them. Gen AIs are subjected to it more than simpler systems for obvious reasons. Gemma not liking eggs or sugar was such a good way to show and remember the flaws of using chat AIs for analysis. 🙂

    On a more serious note, I wonder how much it even makes sense to try to develop some reliable approach to testing AIs in this way, at least outside of training models. I might just be misunderstanding the whole thing, as my AI engine expertise is limited, but they are, at their core, statistics-based engines that spit out whatever they determine to be the likeliest thing collective humanity (or a subset of humanity, when you give them a role to play and they try to slice their DB to what they think people in that role said) would say in that context. There isn’t any understanding baked in, even at the level of simple arithmetic. I keep bumping into this: ChatGPT, when “analyzing” fiction, loves to say something like “Just three words, but they hit hard!” while completely miscounting the actual number of words in the sentence. It is amazing that they do as well as they do, frankly, but my bar is pretty low.

    [James’ Reply: Well, they are not being sold as mere text spitter-outers. They are sold as “word calculators” as Sam Altman put it. They are sold as serious productivity engines. Therefore, it is reasonable to wonder if they are reliable at doing serious things.]

    I guess what I’m trying to say is, despite people trying to rely on Gen AI and getting all excited about it, these systems are intrinsically unreliable and require an independent expert verifying the results. I like the idea of AIs checking their own work in some shape or form (i.e., one model checking the output of another) as the first line of defense, similar to what you did in your experiment, but that can’t be enough for anything where results actually matter. We simply do not have a true Gen AI yet. We have specialized AIs unavailable to the public and doing very well in their respective areas (like playing Go or identifying proteins to aid biochemistry research), and we have extremely fancy chat bots that fake expertise and personalities well enough to get everyone (me included, tbh) excited. I don’t think any amount of testing would do much more than demonstrate the fakeness of that expertise and maybe reveal some of the curious quirks of different models.

    [James’ Reply: Yet you see from the Apple Pie results that ChatGPT 4.1 is substantially reliable at this task, compared to the others. This is worth knowing. If I were a developer creating a system that used AI to scrape unstructured text, I would be more comfortable using that model rather than Gemma.]

    On one episode of the “Skeptics’ Guide to the Universe” podcast they cited research showing that AI models also tend to adjust their answers to satisfy the shallow requirements of the prompter, without actually improving the quality of the output. The research was, from memory, about training models using penalties for unsatisfying answers, and the result was that the model adjusts to whatever criteria you give it, even if that means inventing “facts” and confidently presenting them as real. Forgive the vagueness; if you are interested, I can try to dig it up. It’s from maybe a year ago. But basically, even if the models did well in your experiment, it would just mean they do well on that particular task on that particular data, and we would still have no idea at which point that effect wears off (i.e., how different the text or the prompt needs to be for the quality to drop sharply). More experiments expand the certainty field, but I am skeptical that you could safely infer from the results into the unknown, especially given that AI models and the surrounding software are constantly in flux and we have little control over them.

    As for responsible use, I’m leaning towards starting from the assumption that you cannot trust AI output without testing it every single time, and therefore pushing back against people trying to speed up the SDLC by blindly plugging AI tools into it. The only task I’d trust AI to do without checking the results is one where I do not care about the quality of the output (e.g., some formal-process-satisfying documentation I know for sure nobody would ever read), at which point I’d rather fight to get rid of that task altogether.

    [James’ Reply: I agree that test results on GenAI systems are hard to generalize from.]

