Satisfice, Inc.

Software Testing for Serious People

Serious Data From Testing LLMs

Published: October 15, 2025 by James Bach

One of the differences between NAIFs (non-critical AI fanboys… please don’t be one) and me is that I believe in gathering evidence before making strong claims about how a system is likely to behave. I realize that AI is exciting and that it’s fun to think of all the things it might do for us. But we are living in a world where hundreds or thousands of CEOs and CTOs are pushing their people to use AI without a shred of evidence that it won’t corrupt their products, ruin their data, and harm their customers. They live in faith, and that faith can do terrible damage to the industry before they realize what they’ve done to themselves and their employees.

Can AI be used responsibly? I think so. That means we need to test it and understand how it fails.

So, here’s some data for you. I performed a LARC experiment on four LLMs, at three different temperatures, with two styles of prompts. It involved retrieving ingredients for an apple pie recipe from a loosely structured text that included four recipes. This task should be easy as pie itself for these models. Is it? Look at the data and find out.
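For readers who want to picture the shape of such an experiment, here is a minimal sketch of the grid: four models, three temperatures, and two prompt styles give 24 conditions. Everything in it is an assumption made for illustration, not my actual code: the model names, temperature values, prompt style labels, the `ask_model` callable, the `repetitions` parameter, and the scoring rule (set comparison after normalizing case and whitespace) are all hypothetical placeholders.

```python
from itertools import product

# Placeholder values for illustration only; the real models, temperatures,
# and prompt styles are not named in this post.
MODELS = ["model-a", "model-b", "model-c", "model-d"]
TEMPERATURES = [0.0, 0.5, 1.0]
PROMPT_STYLES = ["terse", "detailed"]


def build_conditions(models, temperatures, prompt_styles):
    """Return every (model, temperature, prompt style) combination."""
    return [
        {"model": m, "temperature": t, "prompt_style": p}
        for m, t, p in product(models, temperatures, prompt_styles)
    ]


def normalize(item):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(item.lower().split())


def score_retrieval(expected, retrieved):
    """Compare a retrieved ingredient list against the expected one."""
    expected_set = {normalize(i) for i in expected}
    retrieved_set = {normalize(i) for i in retrieved}
    return {
        "missing": sorted(expected_set - retrieved_set),
        "extra": sorted(retrieved_set - expected_set),
        "exact": expected_set == retrieved_set,
    }


def run_experiment(conditions, ask_model, expected, repetitions=1):
    """Call ask_model once per condition per repetition and score each answer.

    ask_model is a hypothetical callable taking model, temperature, and
    prompt_style keyword arguments and returning a list of ingredient strings.
    """
    results = []
    for condition in conditions:
        for run in range(repetitions):
            retrieved = ask_model(**condition)
            score = score_retrieval(expected, retrieved)
            results.append({**condition, "run": run, **score})
    return results
```

To try it without calling any real model, pass a stub such as `lambda model, temperature, prompt_style: ["apples", "butter"]` as `ask_model`. Recording what is missing and what is extra for each run, rather than a single pass/fail flag, keeps the different kinds of retrieval failure visible in the results.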

The raw data is stored in a MongoDB database on my hard drive. I can produce it on demand.

The somewhat cooked data is on my website here.

And my report that analyzes the data is on my website here.

I claim that this represents one kind of respectable testing that we need to do with AI. I will follow up with similar experiments soon. I may also post my code on GitHub. We’ll see. Meanwhile, I would welcome critique. I want to improve and extend this work to other domains and explore different prompting styles.

If you want to perform experiments like this yourself, I can help you set them up. In fact, Michael Bolton and I did this work as part of developing our new Testers and AI class.

 

 
