How to Investigate Intermittent Problems

The ability and the confidence to investigate an intermittent bug is one of the things that marks an excellent tester. The most engaging stories about testing I have heard have been stories about hunting a “white whale” sort of problem in an ocean of complexity. Recently, a thread on the SHAPE forum made me realize that I had not yet written about this fascinating aspect of software testing.

Unlike a mysterious non-intermittent bug, an intermittent bug is more of a testing problem than a development problem. A lot of programmers will not want to chase that white whale, when there’s other fishing to do.

Intermittent behavior itself is no big deal. It could be said that digital computing is all about the control of intermittent behavior. So, what are we really talking about?

We are not concerned about intermittence that is both desirable and non-mysterious, even if it isn’t exactly predictable. Think of a coin toss at the start of a football game, or a slot machine that comes up all 7’s once in a long while. We are not even concerned about mysterious intermittent behavior if we believe it can’t possibly cause a problem. For the things I test, I don’t care much about transient magnetic fields or minor random power spikes, even though they are happening all the time.

Many intermittent problems have not yet been observed at all, perhaps because they haven’t manifested yet, or perhaps because they have manifested but have not yet been noticed. The only thing we can do about that is to get the best test coverage we can and keep at it. No algorithm can exist for automatically detecting or preventing all intermittent problems.

So, what we typically call an intermittent problem is: a mysterious and undesirable behavior of a system, observed at least once, that we cannot yet manifest on demand.

Our challenge is to transform the intermittent bug into a regular bug by resolving the mystery surrounding it. After that it’s the programmer’s headache.

Some Principles of Intermittent Problems:

  • Be comforted: the cause is probably not evil spirits.
  • If it happened once, it will probably happen again.
  • If a bug goes away without being fixed, it probably didn’t go away for good.
  • Be wary of any fix made to an intermittent bug. By definition, a fixed bug and an unfixed intermittent bug are indistinguishable over some period of time and/or input space.
  • Any software state that takes a long time to occur, under normal circumstances, can also be reached instantly, by unforeseen circumstances.
  • Complex and baffling behavior often has a simple underlying cause.
  • Complex and baffling behavior sometimes has a complex set of causes.
  • Intermittent problems often teach you something profound about your product.
  • It’s easy to fall in love with a theory of a problem that is sensible, clever, wise, and just happens to be wrong.
  • The key to your mystery might be resting in someone else’s common knowledge.
  • An intermittent problem in the lab might be easily reproducible in the field.
  • The Pentium Principle of 1994: an intermittent technical problem may pose a *sustained and expensive* public relations problem.
  • The problem may be intermittent, but the risk of that problem is ever present.
  • The more testability is designed into a product, the easier it is to investigate and solve intermittent problems.
  • When you have eliminated the impossible, whatever remains, however improbable, could have done a lot of damage by then! So, don’t wait until you’ve fully researched an intermittent problem before you report it.
  • If you ever get in trouble over an intermittent problem that you could not lock down before release, you will fare a lot better if you made a faithful, thoughtful, vigorous effort to find and fix it. The journey can be the reward, you might say.
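One of the principles above deserves a concrete illustration: any state that takes a long time to occur can be reached instantly by unforeseen circumstances, and a tester can exploit that deliberately. If the product takes its notion of time from an injectable clock instead of reading the system clock directly, states that normally take weeks to arise can be provoked in milliseconds. A minimal Python sketch, under the assumption of a hypothetical `FakeClock` interface (not from any particular library):

```python
import datetime

class FakeClock:
    """A clock that tests can control. Production code would depend on a
    clock object with this interface; tests swap in this fake to jump
    straight to 'distant future' states."""

    def __init__(self, start):
        self.current = start

    def now(self):
        return self.current

    def advance(self, **kwargs):
        # Jump the clock forward, e.g. clock.advance(days=31).
        self.current += datetime.timedelta(**kwargs)
```

With a clock like this, a session-expiry bug that normally needs thirty days of wall-clock time to manifest can be reached in a single test step.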

Some General Suggestions for Investigating Intermittent Problems:

  • Recheck your most basic assumptions: are you using the computer you think you are using? are you testing what you think you are testing? are you observing what you think you are observing?
  • Eyewitness reports leave out a lot of potentially vital information. So listen, but DO NOT BECOME ATTACHED to the claims people make.
  • Invite more observers and minds into the investigation.
  • Create incentives for people to report intermittent problems.
  • If someone tells you what the problem can’t possibly be, consider putting extra attention into those possibilities.
  • Check tech support websites for each third party component you use. Maybe the problem is listed.
  • Seek tools that could help you observe and control the system.
  • Improve communication among observers (especially with observers who are users in the field).
  • Establish a central clearinghouse for mystery bugs, so that patterns among them might be easier to spot.
  • Look through the bug list for any other bug that seems like the intermittent problem.
  • Make more precise observations (consider using measuring instruments).
  • Improve testability: Add more logging and scriptable interfaces.
  • Control inputs more precisely (including sequences, timing, types, sizes, sources, iterations, combinations).
  • Control state more precisely (find ways to return to known states).
  • Systematically cover the input and state spaces.
  • Save all log files. Someday you’ll want to compare patterns in old logs to patterns in new ones.
  • If the problem happens more often in some situations than in others, consider doing a statistical analysis of the variance between input patterns in those situations.
  • Consider controlling things that you think probably don’t matter.
  • Simplify. Try changing only one variable at a time; try subdividing the system. (helps you understand and isolate the problem when it occurs)
  • Complexify. Try changing more variables at once; let the state get “dirty”. (helps you make a lottery-type problem happen)
  • Inject randomness into states and inputs (possibly by loosening controls) in order to reach states that may not fit your typical usage profile.
  • Create background stress (high loads; large data).
  • Set a trap for the problem, so that the next time it happens, you’ll learn much more about it.
  • Consider reviewing the code.
  • Look for interference among components created by different organizations.
  • Celebrate and preserve stories about intermittent problems and how they were resolved.
  • Systematically consider the conceivable causes of the problem (see below).
  • Beware of burning huge time on a small problem. Keep asking, is this problem worth it?
  • When all else fails, let the problem sit a while, do something else, and see if it spontaneously recurs.
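The “set a trap” suggestion above can be sketched in code. Here is a minimal Python harness (the names and the seed-capture scheme are illustrative assumptions, not a standard tool) that hammers a suspect operation and, on failure, preserves enough context to replay the failing run instead of just a stack trace:

```python
import random
import traceback

def run_with_trap(operation, attempts=1000):
    """Call a suspect operation repeatedly; when it fails, capture the
    context needed to replay the exact failing run."""
    for attempt in range(attempts):
        seed = random.randrange(2**32)  # record the randomness up front...
        random.seed(seed)               # ...so the run can be replayed later
        try:
            operation()
        except Exception:
            # The trap springs: preserve the attempt, the seed, and the trace.
            print(f"caught on attempt {attempt} with seed {seed}")
            traceback.print_exc()
            return seed
    return None  # never failed; widen the inputs or run longer
```

The key move is recording the seed before each attempt, so that “cannot manifest on demand” becomes “replay with seed N” the moment the trap springs.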

Considering the Causes of Intermittent Problems

When investigating an intermittent problem, it may be worth considering the kinds of things that cause such problems. The list of guideword heuristics below may help you systematically do that analysis. There is some redundancy among the items in the list, because causes can be viewed from different perspectives.

Possibility 1: The system is NOT behaving differently. The apparent intermittence is an artifact of the observation.

  • Bad observation: The observer may have made a poor observation. (e.g. “Inattentional Blindness” is a phenomenon whereby an observer whose mind is occupied may not see things that are in plain view. When presented with the scene a second time, the observer may see new things in the scene and assume that they weren’t there before. Also, certain optical illusions cause apparently intermittent behavior in an unchanging scene. See “the scintillating grid”)
  • Irrelevant observation: The observer may be looking at differences that don’t matter. The things that matter may not be intermittent. This can happen when an observation is too precise for its purpose.
  • Bad memory: The observer may have mis-remembered the observation, or records of the observation could have been corrupted. (There’s a lot to observe when we observe! Our minds immediately compact the data and relate it to other data. Important data may be edited out. Besides, a lot of system development and testing involves highly repetitive observations, and we sometimes get them mixed up.)
  • Misattribution: The observer may have mis-attributed the observation. (“Microsoft Word crashed” might mean that *Windows* crashed for a reason that had nothing whatsoever to do with Word. Word didn’t “do” anything. This is a phenomenon also known as “false correlation” and often occurs in the mind of an observer when one event follows hard on the heels of another event, making one appear to be caused by the other. False correlation is also chiefly responsible for many instances whereby an intermittent problem is mistakenly construed to be a non-intermittent problem with a very complex and unlikely set of causes)
  • Misrepresentation: The observer may have misrepresented the observation. (There are various reasons for this. An innocent reason is that the observer is so confident in an inference that they have the honest impression that they did observe it and report it as such. I once asked my son if his malfunctioning Playstation was plugged in. “Yes!” he said impatiently. After some more troubleshooting, I had just concluded that the power supply was shot when I looked down and saw that it was obviously not plugged in.)
  • Unreliable oracle: The observer may be applying an intermittent standard for what constitutes a “problem.” (We may get the impression that a problem is intermittent only because some people, some of the time, don’t consider the behavior to be a problem, even if the behavior is itself predictable. Different observers may have different tolerances and sensitivities; and the same observer may vary in that way from one hour to the next.)
  • Unreliable communication: Communication with the observer may be inconsistent. (We may get the impression that a problem is intermittent simply because reports about it don’t consistently reach us, even if the problem is itself quite predictable. “I guess people aren’t seeing the problem anymore” may simply mean that people no longer bother to complain.)

Possibility 2: The system behaved differently because it was a different system.

  • Deus ex machina: A developer may have changed it on purpose, and then changed it back. (This can occur easily when multiple developers or teams are simultaneously building or servicing different parts of an operational server platform without coordinating with each other. Another possibility, of course, is that the system has been modified by a malicious hacker.)
  • Accidental change: A developer may be making accidental changes. (The changes may have unanticipated side effects, leading to the intermittent behavior. Also, a developer may be unwittingly changing a live server instead of a sandbox system.)
  • Platform change: A platform component may have been swapped or reconfigured. (An administrator or user may have changed, intentionally or not, a component on which the product depends. Common sources of these problems include Windows automatic updates and memory or disk space reconfigurations.)
  • Flakey hardware: A physical component may have transiently malfunctioned. (Transient malfunctions may be due to factors such as inherent natural variation, magnetic fields, excessive heat or cold, low-battery conditions, poor maintenance, or physical shock.)
  • Trespassing system: A foreign system may be intruding. (For instance, in web testing, I might get occasionally incorrect results due to a proxy server somewhere at my ISP that provides a cached version of pages when it shouldn’t. Other examples are background virus scans, automatic system updates, other programs, or other instances of the same program.)
  • Executable corruption: The object code may have become corrupted. (One of the worst bugs I ever created in my own code (in terms of how hard it was to find) involved machine code in a video game that occasionally wrote data over a completely unrelated part of the same program. Because of the nature of that data, the system didn’t crash, but rather the newly corrupted function passed control to the function that immediately followed it in memory. Took me days (and a chip emulator) to figure it out.)
  • Split personality: The “system” may actually be several different systems that perform as one. (For instance, I may get inconsistent results from Google depending on which Google server I happen to get; or I might not realize that different machines in the test lab have different versions of some key component; or I might mistype a URL and accidentally test on the wrong server some of the time.)
  • Human element: There may be a human in the system, making part of it run, and that human is behaving inconsistently.
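Several of these possibilities, executable corruption and split personality in particular, can be checked cheaply by fingerprinting the deployed binaries and comparing them against a known-good build before debugging further. A sketch using Python’s standard hashlib module:

```python
import hashlib

def file_digest(path):
    """SHA-256 of a file, read in chunks so large binaries are cheap to hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

If the digest on the misbehaving machine differs from the one on the build server, you are not testing the system you think you are testing, which resolves the mystery before any deeper investigation begins.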

Possibility 3: The system behaved differently because it was in a different state.

  • Frozen conditional: A decision that is supposed to be based on the status of a condition may have stopped checking that condition. (It could be stuck in an “always yes” or “always no” state.)
  • Improper initialization: One or more variables may not have been initialized. (The starting state of a computation would therefore depend on the state of some previous computation of the same or other function.)
  • Resource denial: A critical file, stream, or other variable may not be available to the system. (This could happen either because the object does not exist, has become corrupted, or is locked by another process.)
  • Progressive data corruption: A bad state may have slowly evolved from a good state by small errors propagating over time. (Examples include timing loops that are slightly off, or rounding errors in complicated or reflexive calculations.)
  • Progressive destabilization: There may be a classic multi-stage failure. (The first part of the bug creates an unstable state – such as a wild pointer – when a certain event occurs, but without any visible or obvious failure. The second part precipitates a visible failure at a later time based on the unstable state in combination with some other condition that occurs down the line. The lag time between the destabilizing event and the precipitating event makes it difficult to associate the two events to the same bug.)
  • Overflow: Some container may have filled to beyond its capacity, triggering a failure or an exception handler. (In an era of large memories and mass storage, overflow testing is often shortchanged. Even if the condition is properly handled, the process of handling it may interact with other functions of the system to cause an emergent intermittent problem.)
  • Occasional functions: Some functions of a system may be invoked so infrequently that we forget about them. (These include exception handlers, internal garbage collection functions, auto-save, and periodic maintenance functions. These functions, when invoked, may interact in unexpected ways with other functions or conditions of the system. Be especially wary of silent and automatic functions.)
  • Different mode or option setting: The system can be run in a variety of modes and the user may have set a different mode. (The new mode may not be obviously different from the old one.)
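The “improper initialization” heuristic above can be illustrated with a toy Python class (entirely hypothetical, not from any real product) in which state left over from one computation silently becomes the starting state of the next:

```python
class RunningStats:
    """Toy running-average with an initialization bug: the totals live on
    the class, so they are shared by every instance instead of starting
    fresh. The bug only shows itself once more than one computation has run."""

    total = 0.0
    count = 0

    def add(self, x):
        # Bug: mutates class-level state rather than per-instance state.
        RunningStats.total += x
        RunningStats.count += 1

    def mean(self):
        return RunningStats.total / RunningStats.count
```

A fresh-looking object inherits the previous run’s totals, so the “same” computation gives different answers depending on history, which is exactly the shape of many intermittent bugs.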

Possibility 4: The system behaved differently because it was given different input.

  • Accidental input: The user may have provided input, or changed the input, in a way that shouldn’t have mattered, yet did. (This might also be called the Clever Hans syndrome, after the mysteriously repeatable ability of Clever Hans, the horse, to perform math problems. It was eventually discovered by Oskar Pfungst that the horse was responding to subtle physical cues that its owner was unintentionally conveying. In the computing world, I once experienced an intermittent problem due to sunlight coming through my office window and hitting an optical sensor in my mouse. The weather conditions outside shouldn’t have constituted different input, but they did. Another more common example is different behavior that may occur when using the keyboard instead of the mouse to enter commands. The accidental input might be invisible unless you use special tools or recorders. For instance, two identical texts, one saved in RTF format from Microsoft Word and one saved in RTF format from Wordpad, will be very similar on the disk but not exactly identical.)
  • Secret boundaries and conditions: The software may behave differently in some parts of the input space than it does in others. (There may be hidden boundaries, or regions of failure, that aren’t documented or anticipated in your mental model of the product. I once tested a search routine that invoked different logic when the total returned hits crossed 1,000 and 50,000. Only by accident did I discover these undocumented boundaries.)
  • Different profile: Some users may have different profiles of use than other users. (Different biases in input will lead to different experiences of output. Users with certain backgrounds, such as programmers, may be systematically more or less likely to experience, or notice, certain behaviors.)
  • Ghost input: Some other machine-based source than the user may have provided different input. (Such input is often invisible to the user. This includes variations due to different files, different signals from peripherals, or different data coming over the network.)
  • Deus Ex Machina: A third party may be interacting with the product at the same time as the user. (This may be a fellow tester, a friendly user, or a malicious hacker.)
  • Compromised input: Input may have been corrupted or intercepted on its way into the system. (Especially a concern in client-server systems.)
  • Time as input: Intermittence over time may be due to time itself. (Time is the one thing that constantly changes, no matter what else you control. Whenever time and date, or time and date intervals, are used as input, bugs in that functionality may appear at some times but not others.)
  • Timing lottery: Variations in input that normally don’t matter may matter at certain times or at certain loads. (The Mars Rover suffered from a problem like this, involving a three-microsecond window of vulnerability during which a write operation could write to a protected part of memory.)
  • Combination lottery: Variations in input that normally don’t matter may matter when combined in a certain way.
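The “timing lottery” entry is the classic race condition. The sketch below (illustrative, not drawn from any real product) performs an unsynchronized read-modify-write; the `pause` parameter artificially widens the window of vulnerability so that the normally rare interleaving can be provoked on demand, turning a lottery into a regular bug:

```python
import threading
import time

class Counter:
    """A counter whose unsafe path loses updates under the wrong interleaving."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def unsafe_increment(self, pause=0.0):
        # Read-modify-write with no lock. `pause` widens the window of
        # vulnerability so the race can be demonstrated deterministically.
        v = self.value
        if pause:
            time.sleep(pause)
        self.value = v + 1

    def safe_increment(self):
        # The same operation under a lock never loses an update.
        with self.lock:
            self.value += 1
```

In normal operation the unsafe path almost always “works,” which is precisely why such bugs look intermittent; stretching the timing makes the failure repeatable.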

Possibility 5: The other possibilities are magnified because your mental model of the system and what influences it is incorrect or incomplete in some important way.

  • You may not be aware of each variable that influences the system.
  • You may not be aware of sources of distortion in your observations.
  • You may not be aware of available tools that might help you understand or observe the system.
  • You may not be aware of all the boundaries of the system and all the characteristics of those boundaries.
  • The system may not actually have a function that you think it has; or maybe it has extra functions.
  • A complex algorithm may behave in a surprising way, intermittently, that is entirely correct (e.g. mathematical chaos can look like random behavior).
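That last point, that entirely correct deterministic code can look random, is easy to demonstrate with the logistic map, a standard one-line example of mathematical chaos:

```python
def logistic(x, r=4.0):
    """One step of the logistic map; at r=4 its orbits are chaotic."""
    return r * x * (1.0 - x)

def orbit(x0, n, r=4.0):
    """The first n iterates starting from x0 -- fully deterministic."""
    xs = [x0]
    for _ in range(n):
        xs.append(logistic(xs[-1], r))
    return xs
```

Run it twice from the same starting point and you get identical output; nudge the starting point by a billionth and the orbit soon bears no resemblance to the original. An observer sampling such a system could easily report “intermittent” behavior that is in fact entirely correct.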

8 Responses to “How to Investigate Intermittent Problems”

  1. Trevor Hopkinson Says:

    But how to handle these intermittent faults?

    Appearance on test results reports – how to handle an 80% test pass rate
    what to do with bug reports raising – clearing – raising again
    Is it really a bug? –

    Any ideas welcome

    Trevor

    [James' Reply: What's the problem? What is an 80% pass rate? Why is the bug report being raised and cleared?]

  2. chris Says:

    The essence of intermittent bugs is chaotic in nature – multiple factors (at least three) come into play. So there is no direct "cause" for such bugs, versus regular bugs. It is impossible for over 99% of people to accept that there is no direct cause in a chaotic world. The famous three-body problem is a good start to understanding the essence of intermittency.

    It is NEVER possible to simplify a true intermittent bug into a regular bug

    [James' Reply: The whole notion of "cause" is artificial. You could as well say that anything that happens has an infinite number of causes. You're right, the world is chaotic. But I guess we could say that one reason a bug can be intermittent is because of a set of causes that are uncorrelated and unknown.

    But when I'm investigating a bug, some of its causes may be temporarily correlated or uncorrelated. There is no "true intermittence" because there is no way to be sure about the nature of the causes, and no way to prove that the causal structure won't change.

    In other words, give me an example of an intermittent bug, and I can give you one new fact that, if true, would reveal it to be not intermittent (or vice versa). Therefore, you might say "there is no way to be sure that a regular bug isn't secretly an intermittent bug, and no way to know that an intermittent bug isn't secretly a regular bug." Hence I call something an intermittent bug if, at the time I'm investigating it, it is intermittent. But I don't need to say it's "truly intermittent."]

  3. Gaurav Says:

    As per my understanding, it is easier (and cheaper) to find problems between two components than to find problems among more than two components. I gather this is the basis of pairwise testing, and I understand that in AllPairs the same logic has been implemented.

    I was wondering if you would be able to provide me some examples of how best to isolate defects among more than two components. (Assuming that such defects may in some way be considered intermittent.)

    Regards
    Gaurav Pandey

    [James' Reply: Can you present that as a more natural problem? What would it look like in real life?

    If you see something that is intermittent, you won't know which variables make it intermittent, or how many they are.]

  4. Wade Wachs Says:

    It seems to me that what is being said is that the term ‘intermittent’ just means ‘I don’t know how to reproduce it yet’.

    [James' Reply: Pretty much.]

    If a bug exists with a certain set of inputs/states, it seems the more possible combinations of inputs and states the more possible outputs you have.

    [James' Reply: Not necessarily. The possibility of outputs is unconnected to input. Consider a random number generator.]

    Consider a simple circuit with a switch, power source, and light bulb. When the switch is turned on, the bulb illuminates. When the switch is turned off, the bulb goes dark. If we turn the switch on, and the bulb does not light up, we call this a bug, and we diagnose what we know, for example the power source is dead. This is not considered an ‘intermittent’ bug, we found the problem and can easily reproduce it.

    [James' Reply: It's intermittent if it happens some times and not others.]

    Now, consider a circuit with hundreds of switches, bulbs, resistors, power sources, capacitors, etc. all wired in parallel all with changing input and output as we are trying to test the effect of our switch on our light bulb. If changing our switch only illuminates the bulb some of the time, we call this an intermittent bug, when in reality it is a reproducible bug that we just haven’t figured out how to do so yet.

    [James' Reply: We call a bug intermittent based not on its own nature but on our view of it.]

    Our white whale exists somewhere in the hundreds of other components that we weren’t looking at in our test. Just because we don’t know about all the inputs doesn’t make the bug intermittent, it just means we haven’t learned everything about it yet.

    [James' Reply: It *is* intermittent-- to us.]

    If we were to try to defend the term ‘intermittent’, we would have to somehow define the level of complexity before a bug can graduate from a reproducible bug to an intermittent bug. In the first example, if our power source (which is unknown to the tester) was a hamster running on a wheel with a generator attached, it would seem that intermittently the switch would turn the light on. But since we don’t know the power source, we call this an intermittent bug even though it is seemingly simple system. How can we ever define the total set of our unknowns? We can’t, therefore ‘intermittent’ simply becomes a way of saying ‘there is something I don’t know about the system that is affecting my results’.

    [James' Reply: Yes.]

    (I like where this is going, I’m thinking I will make this into a blog post for my own blog out of this. If you’re interested in following this line of thought go check out wadewachs.com)

    [James' Reply: I'm already following your blog.]

  5. sb Says:

    I think the most important part of debugging an intermittent error is to NEVER POWER OFF the device!! The state responsible for the error is likely to be completely wiped out by doing a restart. As soon as you notice the intermittent error, try to probe test points, examine status words, and record as much as possible about all the signals on the system. Work out possibilities in your head later.

  6. Atul Says:

    I want to ask a different question: “If I need to declare that a bug is not reproducible, what should the criteria be?” For example, after how many tries should I declare the defect not reproducible?

    [James' Reply: There is no such thing as an unreproducible bug. But there are mysteries. Some mysteries we never solve.]

  7. Andrew Jasienski Says:

    @Atul, I think that depends on the process. Probably the best approach to that kind of bug is a statistical one. If you want to describe the process statistically, then you should collect some basic features of it. The features are: minimum values, maximum values, range of the values, standard deviation. If the observation is long enough and you still don’t see any pattern, or you can feel (believe me, you will) that there is no sense in the data, then you can be sure that the bug in the process is that kind of a bug.

    Going back to the root of your question: if, under the same conditions, the results are different every single time, then you know that the bug is not easy to reproduce. Therefore the model that you’re testing is incomplete or simply wrong (please find the YouTube video of Richard Feynman talking about the difference between math and physics – it’s very beneficial to logical thinking :) ).
    Every time, it’s important to measure features with a proper device. You can test whether the device is adequate for the trial with an MSA R&R study.

  8. István Forgács Says:

    It’s really a very comprehensive description of intermittent problems; I like it. We have a solution and a tool for these problems. The method is based on execution differences: we compare the executions – I mean the failed and the passed ones – and based on the differences we can usually find the fault. More information on jidebug.com

    My question is: do you agree with case studies reporting 5-14% of bugs as intermittent?

    [James' Reply: Of course I don't agree. Such a metric doesn't even make sense. Unless you know "all the bugs," have an objective way of counting them, and have a sure way to determine whether a bug is intermittent, you can't even establish this percentage for a specific product. And even if you did establish it for one product, you couldn't generalize it to any other product.

    It's not important to know this percentage. What's important is to know how to deal with an apparently intermittent bug.]
