
Really, every engineer should read the report[1] from the Therac-25 investigation. I would hope anybody that is working on anything that could be potentially dangerous has already read it.

The problems in the Therac-25 went a lot further than just the bad design of the target selection, which had a (badly designed) interlock. It checked that the rotating beam target was in the correct position to match the high/low power setting (and NOT the 3rd "light window" position without any beam attenuator).

While many design choices contributed to the machine's problems, you could probably say that two big design failures led to the deaths associated with the Therac-25. One was this interlock, which failed if you didn't put the target exactly in place (there was no locking mechanism, either, just a friction stopper). If the target was turned slightly, the 3 micro-switches would sense the wrong pattern (bit shift)... which was the pattern for one of the OTHER positions.
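To make the bit-shift failure concrete, here is a minimal sketch. The 3-bit codes are made up for illustration (I don't know the real switch encoding); the point is only that if one valid pattern is a shifted copy of another, a decoder that trusts any known pattern can't tell a misaligned target from a correctly placed one.

```python
# Hypothetical 3-bit micro-switch codes, NOT the real Therac-25 encoding.
POSITION_CODES = {
    0b011: "electron",
    0b110: "xray",
    0b101: "field_light",
}


def rotate_left_3bit(pattern: int) -> int:
    """Rotate a 3-bit pattern left by one, as if each switch read its neighbour."""
    return ((pattern << 1) | (pattern >> 2)) & 0b111


def decode(pattern: int) -> str:
    """Naive decoder: accepts any pattern that appears in the table."""
    return POSITION_CODES.get(pattern, "invalid")


actual = 0b011                      # target nominally in the "electron" position
misread = rotate_left_3bit(actual)  # target nudged slightly off its stop
print(decode(actual), "->", decode(misread))
```

With these made-up codes every shifted reading is still a "valid" position, so the check never reports an error. That is the failure described above.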

There was also a race condition in the software that would turn on the beam at a power MUCH higher than it is ever used. This race was only triggered when you typed in the treatment settings very quickly, which is why the manufacturer denied there was a problem: when they tried to recreate the bug by carefully - that is, very slowly - following the reported conditions, it never failed.
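Here's a rough, deterministic model of why typing speed mattered. This is a simplification under my own assumptions, not the actual Therac-25 code: I'm assuming the hardware setup copies the mode once and is busy for a fixed window (`SETUP_TICKS` is a made-up number), and that an edit made inside that window changes the screen but is never re-read.

```python
from typing import Tuple

SETUP_TICKS = 8  # hypothetical: how long hardware setup is busy after data entry


def simulate(edit_tick: int) -> Tuple[str, str]:
    """Return (screen_mode, hardware_mode) after the operator edits the mode
    from "xray" to "electron" at the given tick."""
    screen_mode = "xray"
    snapshot = screen_mode  # setup copies the mode once, at tick 0

    if edit_tick < SETUP_TICKS:
        # Fast edit: lands while setup is busy. The screen updates, but the
        # flawed logic never re-reads it, so the hardware keeps the stale copy.
        screen_mode = "electron"
        hardware_mode = snapshot
    else:
        # Slow edit: setup has finished, the edit is noticed, and setup
        # runs again with the new mode.
        screen_mode = "electron"
        hardware_mode = screen_mode
    return screen_mode, hardware_mode


print(simulate(2))   # fast typist: screen and hardware disagree
print(simulate(20))  # careful, slow reproduction: everything matches
```

In this model a careful, slow reproduction never hits the stale snapshot, which is consistent with the manufacturer not being able to recreate the bug.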

Therac-25 is an incredibly powerful lesson in what we mean by "Fail Safe", and why it is absolutely necessary to have defense in depth. Fixing the target wouldn't have fixed the race condition power-level bug. Fixing any of the software wouldn't have fixed the bad target design that could be turned out of alignment. Oh, and they had a radiation sensor (which could shut off the machine as another independent layer of defense)... but they mounted it on the turnable target, so the micro-switch problem allowed the sensor to be moved away from the beam path.

The really telling thing, though, is how the previous model acted. It was not software controlled, and was an old-style electromechanical device. It turns out the micro-switch problem existed there as well (among other problems)... and it would blow fuses regularly. Which was yet another layer of safety. It turns out that when they upgraded it to a software-based control system, they got cheap and took out all those "unnecessary" hardware interlocks and "redundant" features. There is a lot of blame to go around, but this is where I put most of the responsibility. You never assume that one (or even a few) safety features will work - the good engineer assumes it will all break at any moment, and makes sure that it will still Fail Safe.

> (although it's possible for interlocks to go wrong too)

If there is one lesson to learn from the Therac-25, this was it. Things break, mistakes happen, and when you're building a device that shoots high-energy x-rays at people, you need to assume that everything did go wrong, and make sure the rest of the device can safely handle that situation.
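As a sketch of what "Fail Safe" plus defense in depth looks like in code (the function and check names are hypothetical, and real machines need hardware interlocks that software can't override): the beam is allowed only if every independent check positively says yes. A "no", an error, or having no checks at all means the beam stays off.

```python
from typing import Callable, Iterable


def beam_permitted(checks: Iterable[Callable[[], bool]]) -> bool:
    """Permit the beam only if every independent check explicitly returns True."""
    checks = list(checks)
    if not checks:
        return False  # no evidence of safety is treated as unsafe
    for check in checks:
        try:
            if check() is not True:
                return False  # any layer saying "no" is enough to stop
        except Exception:
            return False  # a broken sensor must not look like a passing one
    return True


def target_in_position() -> bool:
    return True  # hypothetical: the position check passes (perhaps wrongly)


def dose_rate_in_range() -> bool:
    return False  # hypothetical: an independent monitor sees too much power


def broken_sensor() -> bool:
    raise RuntimeError("sensor disconnected")


print(beam_permitted([target_in_position, dose_rate_in_range]))  # False
print(beam_permitted([target_in_position, broken_sensor]))       # False
```

The design choice is that the default answer is "off": one fooled layer can't turn the beam on by itself, and a failed layer counts as a "no".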

[1] http://sunnyday.mit.edu/papers/therac.pdf



"The really telling thing, though, is how the previous model acted…it would blow fuses regularly"

Good. When a fuse blows, it shows something is wrong and needs fixing. Replacing the fuse with a nail or something else that doesn't blow is a sure-fire way to set the thing on fire. Bad enough for a desk lamp, a little worse for a radiotherapy machine.

Sounds like people were irritated by fuses blowing, and decided to simply short-circuit the fuses instead.


The people using the machine at the hospital would replace the (expensive) fuses when they blew. It was the manufacturer that made the later model (the Therac-25) without the fuses (and other "old" hardware features).

Obviously, something was still very wrong. User error (or other bugs? I'm not sure) in the older hardware and the infamous race condition in the software-controlled Therac-25 were causing the beam to turn on at some shockingly high power. The better design of the older models saved people's lives by simply blowing fuses when the power went too high.

You could, perhaps, blame the poor communication between the hospitals and the manufacturer, because the fuse problem should have caused a bit of a panic among the engineers who designed the machine.



