CNN Believes Whatever Computers Say

(CNN) — Early evidence points to driver error as the reason a 2005 Prius sped into a stone wall on March 9, according to federal investigators.

“Information retrieved from the vehicle’s onboard computer systems indicated there was no application of the brakes and the throttle was fully open,” according to a statement from the National Highway Traffic Safety Administration.

The statement suggests the driver may have been stepping on the accelerator, instead of the brake, as she told police.

Note to CNN: Information retrieved from an onboard computer is not reliable in cases where the computer itself is a suspect in the crime. No one is claiming that ghosts are causing the cars to go out of control. And there is no evidence yet that a mechanical failure is the culprit, despite the NHTSA looking hard for that evidence. Alcohol is also not a factor in these specific cases (or else we wouldn’t even be talking about them). That leaves two big possibilities: sober experienced people are suddenly forgetting how to drive OR something is wrong with the software or electronics that controls the cars.

If the software that reads the control inputs has the right kind of fault in it, it may occasionally lose the ability to read and react to control inputs. It is easy for a computer programmer to imagine a situation where the part of the system that records the control settings is working fine, while the part that acts on them is failing. That would be consistent with the facts of this case. Moreover, the problem may be transient, leaving no evidence that it happened.
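To make that concrete, here is a minimal sketch of the kind of split I mean. Everything in it is hypothetical: the class `HypotheticalController`, the `actuator_stuck` flag, and the pedal values are invented for illustration and are not based on any real vehicle's software. It only shows that a recording path can log pedal inputs faithfully while a separate acting path ignores them.

```python
from dataclasses import dataclass, field


@dataclass
class PedalInputs:
    brake: float      # 0.0 (released) to 1.0 (fully pressed)
    throttle: float   # 0.0 (released) to 1.0 (fully pressed)


@dataclass
class HypotheticalController:
    """Toy model only; not based on any real vehicle's software."""
    actuator_stuck: bool = False      # hypothetical transient fault
    last_command: float = 0.0         # last throttle command sent to the motor
    log: list = field(default_factory=list)

    def step(self, inputs: PedalInputs) -> float:
        # Recording path: always logs what the pedal sensors report.
        self.log.append((inputs.brake, inputs.throttle))
        # Acting path: separate logic that can fail independently.
        if not self.actuator_stuck:
            # Intended design: any brake pressure overrides the throttle.
            self.last_command = 0.0 if inputs.brake > 0 else inputs.throttle
        # While stuck, the old command is simply repeated.
        return self.last_command


def run_scenario():
    ctrl = HypotheticalController()
    ctrl.step(PedalInputs(brake=0.0, throttle=1.0))   # driver accelerates to pass
    ctrl.actuator_stuck = True                        # transient fault begins
    command = ctrl.step(PedalInputs(brake=1.0, throttle=0.0))  # driver brakes hard
    return ctrl.log, command
```

In this toy model the last log entry shows the brake fully pressed, yet the command sent to the motor is still full throttle. Clear the flag and the next step behaves normally, leaving nothing in the log to show that the acting path ever misbehaved.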

We don’t know what happened. We may never know. But there is no reason to assume that the computer is infallible. Computers are designed by fallible people.

Toyota Story Analysis

I’m not against Toyota, really. I’m against manufacturers who try to weasel out of their responsibilities after they put people at risk with poor design. I do not know if Toyota is guilty of that, here.

I also don’t know what the real story is with this Sikes fellow who was in the car that he claims went out of control. Maybe he’s a gold digger out to exploit Toyota’s bad fortune.

Finally, I don’t know much about the design of a Prius, except that my understanding is that there are no physical cable linkages. It’s software driven. When you press the brakes you are essentially double clicking on the “brake” icon with your foot mouse, hoping that the operating system agrees to apply the real brakes. A Prius is basically a video game console connected to a car. You drive the console, not the car.

What I’m trying to do is use this situation to teach testing. So, I want to break down a news story I found on cnn.com and show you how a tester would think about it:

Toyota takes aim at California runaway Prius story
by Peter Valdes-Dapena, senior writer, March 15, 2010: 8:24 PM ET

NEW YORK (CNNMoney.com) — Toyota challenged a California driver’s story of an out-of-control Prius at a press conference Monday afternoon.

Toyota held a press conference about this. That is interesting. Of course that means Toyota’s marketing people are actively involved in this investigation, pacing and fretting for good news to send out. This does not create a conducive atmosphere for an investigation.

Public relations people want to rush out with good news, but keep bad news longer to study it and make very sure it’s really correct. This creates a sort of chromatic distortion of the truth in the near term. We must beware of that. You might say that it’s likely to be “green shifted” truth.

Company executives detailed preliminary findings of a joint investigation conducted by Toyota and the National Highway Traffic Safety Administration into the incident.

I like hearing that the NHTSA is involved in the investigation. Are they leading it, I wonder? You know there are several ongoing investigations into Toyota vehicles by the NHTSA (see their website), including one involving the momentary loss of braking on uneven surfaces (which my father has experienced several times on the dirt road going to his house).

I feel there’s more credibility if the NHTSA is involved, but the press conference seems to have been a Toyota thing, not an NHTSA thing.

Prius owner Jim Sikes made national headlines last week with claims that his car’s accelerator got stuck as he sped up to pass a car while traveling on California’s I-8 highway outside of San Diego, and that he was unable to stop the car.

“As I was going, I was trying the brakes … and it just kept speeding up,” he said.

Reports from non-technical, non-expert users must always be taken with skepticism, even leaving aside the possibility that the guy is just telling lies. Perhaps he suffers from Munchausen syndrome. You never know.

But let’s say he’s not lying, just for a moment. The phrase “it just kept speeding up” suggests that the brakes were completely inoperative. I wouldn’t use that phrase, as a driver, if the brakes had engaged and were fighting the motor.

Someone needs to sit the guy down (I assume they’ve done this) and walk through the whole incident moment by moment. Several times. Get him to clarify this.

Inconsistencies alone should not worry us too much. It happened quickly, it was a traumatic event, and his personal account may not be reliable just for that reason. But we still need to wring every bit of information we can from his memory.

Sikes’ story is at odds with the findings of the investigation, according to Toyota and to a draft congressional memo obtained by CNN.

“While a final report is not yet complete, there are strong indications that the driver’s account of the event is inconsistent with the findings of the preliminary analysis,” Toyota said in a prepared statement.

Sikes said he called 9-1-1 for help as he was traveling in excess of 90 mph on a winding, hilly portion of the interstate. He said dispatchers tried to talk him through ways to stop the car, but nothing helped.

I’d like to hear that 911 call in its entirety.

Eventually, a California Highway Patrol officer was able to catch up to Sikes and used the patrol car’s public address system to instruct Sikes to apply the brakes and the emergency brake at the same time. That tactic worked, and he was able to stop the car.

I’d like to know if Sikes had tried the emergency brake alone, before this. Had he tried both at once before this?

I can understand if he did not try both at once, because a normal driver would think: if they don’t work individually, why would they suddenly work together? But in a drive-by-wire system, everything is mediated by software, and software can get into strange states. It’s technically plausible that the software could be bounced out of whatever strange trance it got into only when both brake controls were activated.

However, because driving a hybrid car like the Prius with both the gas pedal and the brakes simultaneously depressed would cause serious damage to the car’s electric motor and, possibly other systems, Toyota says the Prius is designed to prevent that from happening.

Of course it’s designed to prevent that. But here’s a testing lesson: designing for prevention is not the same as preventing, because your design may have a bug in it.

All Toyota can say is that it was their intent to design the system to prevent that, and to the best of their knowledge that is how the system works… except in this case it didn’t work– unless the guy is simply lying or insane.

If the brake is pressed at the same time as the gas pedal, power to the engine will be reduced just as if the gas pedal had been released, the automaker said.

Unless, of course, there’s a malfunction of the system, which is exactly the issue under consideration.

During driving tests on Sikes’ Prius and on an identical 2008 Prius, the system operated as expected, according to the report, preventing the car from pushing forward while braking.

“The system operating as expected” is not actually possible to determine, because they can’t see inside all the software and hardware to detect that every bit and electron is in the right place.

What they can say is that they detected no problems. Problems may be there, they just did not detect them.

If the problem is triggered by a transient event, such as a specific combination of foot presses or a two-microsecond window in which two software events must happen simultaneously (timing problems of this kind have dogged the Mars rover missions), then of course driving the car around a parking lot is probably not going to reproduce it.
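Some back-of-the-envelope arithmetic shows why. The sketch below assumes, purely for illustration, that the failure needs one event to land inside a short window within a repeating cycle, and that each test attempt is an independent shot at hitting it. The function name and the numbers in the usage note are my own assumptions, not facts about the Prius.

```python
def chance_of_reproducing(window_s: float, period_s: float, trials: int) -> float:
    """Probability of hitting a timing window at least once in `trials` attempts.

    Assumes each attempt independently lands at a uniformly random point
    in a cycle of length `period_s`, and fails only if it lands inside
    a window of length `window_s`.
    """
    if period_s <= 0 or not 0 <= window_s <= period_s:
        raise ValueError("need 0 <= window_s <= period_s and period_s > 0")
    p_single = window_s / period_s
    # Complement of "missed the window on every single attempt".
    return 1.0 - (1.0 - p_single) ** trials
```

With a two-microsecond window in a hypothetical 10 millisecond cycle, `chance_of_reproducing(2e-6, 10e-3, 1000)` comes out at roughly 18 percent. A thousand deliberate attempts would still miss the problem most of the time, and a few laps of a parking lot are nowhere near a thousand attempts.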

It is also possible that part of the problem involves a piece of physical equipment that was lodged or worn in a certain way at a specific temperature, and that condition no longer exists on the car in question.

When we try to reproduce problems, we often have to guess at the causes, and we may guess wrong.

If I were Toyota, I would treat this like an epidemiology problem. You interview people and make a list of absolutely everything that was going on. Did they have a cell phone? What kind? Where was it in the car? Were they using the cup holder? What drink was in the cup? Hot or cold? Was the air conditioner on or off? What was the setting?

Then you put all the data into a database and mine it for patterns.
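A minimal sketch of that kind of mining, assuming each interview is boiled down to a dictionary of factors. The function names, the `min_gap` threshold, and the factor names in the usage note are all made up for illustration; a real analysis would need far more records and proper statistics.

```python
from collections import Counter


def factor_rates(records):
    """Fraction of records in which each (factor, value) pair appears."""
    if not records:
        return {}
    counts = Counter()
    for rec in records:
        for factor, value in rec.items():
            counts[(factor, value)] += 1
    return {pair: n / len(records) for pair, n in counts.items()}


def standout_factors(incidents, controls, min_gap=0.5):
    """(factor, value) pairs much more common among incidents than controls."""
    incident_rates = factor_rates(incidents)
    control_rates = factor_rates(controls)
    return sorted(
        pair
        for pair, rate in incident_rates.items()
        if rate - control_rates.get(pair, 0.0) >= min_gap
    )
```

Feed it a list of incident interviews and a list of interviews with drivers of similar cars that had no trouble. If, say, a hypothetical `phone_in_cup_holder` factor is true in every incident record but rare among the controls, it comes back as a standout. That is a lead to chase, not a proof of cause.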

Investigators are extremely meticulous when taking apart a car in a case like this, said Ed Higgins, a Michigan personal injury attorney who has been involved in automobile defect cases. They are aware their work will be subject to intense scrutiny, so they measure and document everything, he said.

That kind of care takes a lot of time. But it hasn’t been very long since the incident occurred. Have they also taken the software apart? Have they comprehensively reviewed the code? I seriously doubt that.

“I would think that any mechanical defect that would have allowed something to happen that otherwise could not have happened would have stood out like a sore thumb,” he said.

Unless it’s a transient interaction between a mechanical defect and an invisible state within the software.

The car also did not show damage consistent with the engine having been running at full throttle while the brakes were on, according to the report.

That suggests the brakes weren’t on, but not that Sikes wasn’t pushing on the brakes.

“Toyota engineers believe that it would be extremely difficult for the Prius to be driven at a continuous high speed with more than light brake-pedal pressure, and that the assertion that the vehicle could not be stopped with the brakes is fundamentally inconsistent with basic vehicle design and the investigation observations,” Toyota said in a statement.

Again, this all assumes normal circumstances and no transient failures. For the purposes of investigation, that belief is irrelevant.

It is already fundamentally inconsistent with the design of the product that ANY failure could occur. We’ve crossed that bridge, guys.

Remember, when USAir Flight 427 crashed, Boeing maintained for years that their rudder mechanism could not possibly have failed, until a new form of failure was discovered (“thermal shock”) and the specific failure reproduced in that very rudder assembly.

The car’s front brakes showed significant wear and overheating, Toyota said. That kind of wear and heat would be consistent with the brakes being lightly applied over a long period of time, executives said at the press event.

Data from on-board computers indicated that Sikes had applied the brakes, to some degree, at least 250 times during the 23 mile incident, Toyota executives said, and that the brakes worked normally each time.

Ooh, I love log files. I wonder what other patterns they can mine from that log file?

If the computers indicate that Sikes had applied the brakes, that shows they were getting some kind of signal from the brake mechanism, but not necessarily the correct signal. Therefore saying “the brakes worked normally each time” is completely unwarranted. Part of the system may have been working normally while another part was going haywire. There’s no way to tell after the fact, because “working normally” is not a detectable condition that gets logged in a computer.

Every time you experience a problem in your software, your software, on some level, thinks it is doing the right thing. Software doesn’t “know” it’s misbehaving. It just does what it is told.

Edmunds has independently tested Prius cars similar to Sikes’ and confirmed that the engine would stay engaged if the brakes were only pressed lightly, but not hard enough to actually stop or slow the car, said Dan Edmunds, head of auto testing for the automotive Web site Edmunds.com.

He says “would”, but he should say “would, assuming that there is nothing wrong with the car that would cause it not to.”

“If you’re just riding the brakes, it will ride the brakes,” he said.

“These findings certainly raise new questions surrounding the veracity of the sequence of events that has been reported by Mr. Sikes,” said Kurt Bardella, spokesman for Rep. Darrell Issa, R-Calif., and ranking member of the committee.

Sikes’ attorney, John Gomez, denied that the report proves his client was wrong about what happened to his car.

“The notion that they weren’t able to replicate it in this particular case tells us nothing,” he said. “They haven’t been able to replicate a single one of these.”

That’s right. Also, learn the phrase “transient failure mode” and press that point. There are plenty of examples in space missions and airliners of such failures.

Sikes has no plans to sue Toyota, Gomez said.

Gomez is also representing the family of Mark Saylor, a California Highway Patrolman who was killed, along with members of his family, in a Lexus sedan that accelerated out of control. A preliminary investigation has found that the accelerator pedal in that car probably became trapped on an all-weather floor mat that had been incorrectly installed in the vehicle.

Toyota has issued a recall for several models, including Sikes’ Prius, to address possible floor mat entrapment. Sikes’ floor mat was not interfering with the accelerator, investigators found, and there were no signs the pedal had become stuck in any way, according to the report.

The investigators’ findings “suggest that there should be further examination of Mr. Sikes’ account of the events of March 8,” Toyota said in its statement.

Toyota spokesman Mike Michels also took issue with media coverage of the Sikes incident. Journalists sensationalized an admittedly dramatic event, he said, but the public would have been better served had reporters waited for all the facts.

“We need to let investigations take their course,” he said.

Yes indeed. And this investigation has not done that. Be careful what you wish for, Mr. Michels.

Advice to Lawyers Suing Toyota

A press release by Toyota recently stated:

Toyota’s electronic systems have multiple fail-safe mechanisms to shut off or reduce engine power in the event of a system failure. Extensive testing of this system by Toyota has not found any sign of a malfunction that could lead to unintended acceleration.

Here are some notes for the lawyers suing Toyota. Here is what your testing experts should be telling you:

  • Whoever wrote this, even if he is being perfectly honest, is not in a position to know the status of the testing of Toyota’s acceleration, braking, or fault handling systems. The press release was certainly not written by the lead tester on the project. Toyota would be crazy to let the lead tester anywhere near a keyboard or a microphone.
  • Complete testing of complex hardware/software systems is not possible. But it is possible to do a thorough and responsible job of testing, in conjunction with hazard analysis, risk mitigation, and post-market surveillance. It is also quite expensive, difficult, and time consuming. So it is normal for management in large companies to put terrible pressure on the technical staff to cut corners. The more management levels between the testers and the CEO, the more likely this is to occur.
  • “Extensive testing” has no fixed meaning. To management, and to anyone not versed in testing, ALL testing LOOKS extensive. This is because testing bores the hell out of most people, and even a little of it seems like a lot. You need to find out exactly what the testing was. Look at the production-time testing but focus on the design-time testing. That’s where you’ll be most likely to find the trouble.
  • Even if testing is extensive in general, you need to find out the design history of the software and hardware, because the testing that was done may have been limited to older versions of the product. Inadequate retesting is a common problem in the industry.
  • If Toyota is found to have used an automated “regression suite” of tests, then you need to look for the problem of inadequate sampling. What happens is that the tests are only covering a tiny fraction of the operational space of the product (a fraction of the states it can be in), and then they just run those over and over. It looks like a lot of testing, but it’s really just the same test again and again. Excellent testing requires active inquiry at all times, not just recycling old actions.
  • If Toyota is found not to have used test automation at all, look for a different kind of sampling problem: limited human resources not being able to retest very extensively.
  • Most testers are not very ambitious and not well trained in testing. No university teaches a comprehensive testing curriculum. Testing is an intellectually demanding craft. In some respects it is an art. Examine the training and background of the testing staff.
  • Examine the culture of testing, too. If the corporate environment is one in which initiative is discouraged or all actions are expected to be explicitly justified (especially using metrics such as test case counts, pass/fail rates, cyclomatic complexity, or anything numerical), then testing will suffer. During discovery, subpoena the actual test reports and test documentation and evaluate that.
  • Any argument Toyota makes about extensiveness of testing that is based on numbers can be easily refuted. Numbers are a smoke-screen.
  • Examine the internal defect tracking systems and specifically look to see how intermittent bugs were handled. A lack of intermittent bug reports certainly would indicate something fishy going on.
  • Examine how the design team handled reports from the field of unintended acceleration. Were they systematically reviewed and researched?
  • Depositions of the testers will be critical (especially testers who left the company). It is typical in large organizations for testers to feel intimidated into silence on critical quality matters. It is typical for them to be cut off from the development team. You want to specifically look for the “normalization of risk” problem that was identified in both the Columbia and Challenger shuttle disasters.
  • If the depositions or documentation show that no one raised any concerns about the acceleration or braking systems, that is a potential smoking gun. What you expect in a healthy organization is a lot of concerns being raised and then dealt with forthrightly.
  • Find out what specific organizational mechanisms were used for “bug triage”, which is the process of examining reported problems and deciding what to do about them. If Toyota says there was no triage process, that is either a lie or a gross form of negligence.
  • If Toyota claims to have used “proofs of correctness” in their development of the software controllers, that means nothing. First, obviously they would have to have correctly used proofs of correctness. But secondly, proofs of correctness are simply the modern Maginot line of software safety: defects drive right around them. Imagine that the makers of the Titanic provided “proof” that water cannot penetrate steel plates, and therefore the Titanic cannot sink. Yes steel isn’t porous, but so what? It’s the same with proofs of correctness. They rely on confusing a very specific kind of correctness with the general notion of “things done right.”
  • The anecdotal evidence surrounding unintended acceleration is that it does not only involve acceleration, but also a failure of braking. Furthermore, it’s a very rare occurrence, therefore it’s probably a combination of factors that work together to cause the problem. It’s not surprising at ALL that internal testing under controlled conditions would not reproduce the problem. Look at the history of the crash of USAir Flight 427, which for years went unsolved until the transient mechanism of thermal shock was discovered.
  • You need to get hold of their code and have it independently inspected. Look at the comments in the code, and examine any associated design documentation.
  • Look at how the engineering team was constituted. Were there dedicated full-time testers? Were they co-located with the development team or stuffed off in another location? How often did the testers and developers speak?
  • What were the change control and configuration management processes? How was the code and design modified over time? Were components of it outsourced? Is it possible that no one was responsible for testing all the systems as a whole?
  • What about testability? Was the system designed with testing in mind? If it wasn’t, the expense and difficulty of comprehensive testing would have been much, much higher. Ask if simulators, log files, or any other testability interfaces were used.
  • How did their testing process relate to applicable standards? Was the technical team aware of any such standards?
  • In medical device development, manufacturers are required to do “single-fault condition” testing, where specific individual faults are introduced into the product, and then the product is tested. Did Toyota do this?
  • What specific test techniques and tools did Toyota employ? Compare that to the corpus of commonly known techniques.
  • Toyota cars have “black box” logs that record crucial information. Find out what those logs contain, how to read them, and then subpoena the logs from all cars that may have experienced this problem. Compare with logs from similar unaffected cars.
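
The regression-suite sampling problem in the list above is easy to put in numbers. This is only a sketch under loud assumptions: the operational space is pretended to be a handful of discrete variables, and the variable names and level counts in the usage note are invented, not taken from any Toyota system. The point it illustrates is that rerunning the same tests adds executions, not coverage.

```python
from math import prod


def coverage_fraction(levels_per_variable, executed_tests):
    """Fraction of the discrete state space touched by the executed tests.

    `levels_per_variable` maps each variable name to its number of levels.
    `executed_tests` is a list of dicts, one per test execution, each
    assigning a level to every variable. Repeats count only once.
    """
    total_states = prod(levels_per_variable.values())
    distinct_states = {tuple(sorted(test.items())) for test in executed_tests}
    return len(distinct_states) / total_states
```

Take a hypothetical space of 10 brake levels, 10 throttle levels, 20 speeds, and 4 gear states: 8,000 combinations. A suite of 50 distinct tests run 200 times yields 10,000 executions, which sounds extensive, but `coverage_fraction` still reports 50 out of 8,000, well under one percent. A real system has vastly more states than this toy one.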

The best thing would be to reproduce the problem in an unmodified Toyota vehicle, of course. In order to do that, you not only need an automotive engineer and an electrical engineer and a software engineer, you need someone who thinks like a tester.

The unfortunate fact of technological progress is that companies are gleefully plunging ahead with technologies that they can’t possibly understand or fully control. They hope they understand them, of course, but only a few people in the whole company are even competent to decide if that understanding is adequate for the task at hand. Look at the crash of Swissair Flight 111, for instance: a modern aircraft brought down by its onboard entertainment system, killing all aboard. The pilots had no idea it was even possible for an electrical fire to occur in the entertainment system. Nothing on their checklists warned them of it, and they had no way in the cockpit to disable it even if they’d had the notion to. This was a failure of design; a failure of imagination.

Toyota’s future depends on how they take seriously the possibility of novel, multivariate failure modes, and aggressively update their ideas of safe design and good testing. Sue them. Sue their pants off. This is how they will take these problems seriously. Let’s hope other companies learn from no-pants Toyota.