Another excerpt, below, from Systems Architecting: Creating and Building Complex Systems, Eberhardt Rechtin, Prentice Hall, 1991. What's interesting to the modern reader (who may be most interested in software design) is that the application of Lean manufacturing ideas to software isn't new; the approach was at the root of successes in space exploration, nuclear power, defence, and satellite communication systems. The section discusses:
- Zero defect approaches
- Progressive redesign and its role in US space vehicle reliability
- Rigour of engineering disciplines for ultraquality systems
- Over-spec or accelerated testing and failure modes
- The limit of practicable and/or commercially feasible quality levels
THE BASIC RESPONSE: MURPHY'S LAW AND ZERO DEFECTS
Murphy's law and zero-defect implementation share a notable and important omission: neither mentions statistics or acceptable quality limits (AQL). If something can fail, it will. Perfection, not acceptable imperfection, is essential at every step.
No matter that perfection is an unreachable absolute. It remains the objective. Eliminate failure modes by design. Qualify every supplier. Stop the production line for every anomaly. Give every participant an incentive through pride, understanding, team spirit, peer pressure, and perhaps profit sharing and patriotism: whatever it takes. It worked for military systems in World War II. It worked when statistical analysis predicted that the Apollo mission could not succeed without the loss of 30 astronauts in flight to and from the moon. And it is the foundation of most of the recent total quality management systems (TQMS). Although its focus is on the production floor, it applies equally well to engineers in the loft and managers in the executive offices.
When fully implemented as a disciplined process system, it results in just-in-time (JIT) supply, minimum inventory, minimum rework and turnback, minimum accounting, and minimum cycle (throughput) time.
Its nemesis is the acceptance of failure, that is, acceptable quality limits with statistical lot sampling and rejection used as post facto control.
MANAGERIAL RESPONSE I: PROGRESSIVE REDESIGN*
*For an analytic treatment of this subject, see Selected References on Reliability Growth, Institute of Environmental Sciences, 1988.
There are three intertwined approaches to the ultraquality challenge: managerial, technical, and architectural. The architect must understand them all.
One of the oldest and most successful quality-assurance techniques is that of W. Edwards Deming and J. M. Juran, made famous by its application by the Japanese in consumer, automotive, and electronic products: tally the defects, analyze them, trace them to the source, make corrections, and keep a record of what happens afterward.
To which should be added: and keep repeating the cycle.
One Japanese expression for the technique (Hayes, 1981) is "The nail that sticks up gets hammered down." At the beginning of the process, the technique is managerial. As J. M. Juran expresses it, a company's "quality problems are planned that way" (Juran, 1988). At the end, as we shall see, it devolves to economics and science.
A major U.S. manufacturer of TV sets decided that the business was none too profitable and sold it to a Japanese firm. Prior to the sale, final inspection found 1.4 defects per set. Under Japanese management, although standards were set higher, that defect rate was cut by a factor of 100. To accomplish this near-zero defect rate, the product had to be perfect at every step, which meant that each step both inspected itself and demanded perfection from those ahead of it. Inspection at the end of the production line became, for each line supervisor, a matter of pride rather than control. The company is now profitable.
Progressive redesign takes numbers and time to succeed, but, by designing out each failure, a continuing improvement to an asymptotically high success rate is virtually assured. The technique works particularly well for components and systems with short turnover times, from integrated circuits to automobiles. To be effective, it requires a well-thought-out combination of careful design, minimal materials defects, replication by precision machinery, well-instrumented process control, tight tolerance design, and alert detection of product and process weaknesses.
Modified versions of the progressive redesign technique also can be seen at work in very large-scale systems—space launch vehicles, spacecraft, and software.
Current expendable launch vehicles, based on historical flight-failure data, have a present success rate of about 0.94 (Figure 8-1). This rate was achieved over a 30-year period of continued improvements. It was less than 0.70, however, for the first 10 years. Studies show that the present figure could be improved by another 0.044 if all single-point failures were eliminated, 0.018 if workmanship and human errors were reduced by 50%, 0.014 with engine-out capability, 0.01 with redundancy in avionics, and so on.
Figure 8-1. U.S. space launches, 1957-1987. (Source: The Aerospace Corporation.)
At this advanced stage, still further improvement is neither easy nor cheap. The space shuttle is following much the same redesign path, steady improvement based on lessons learned (Stever, 1988).
(Egan, 1987) Over the last 10 years, there have been 81 electronic part failures in a representative set of military spacecraft. Sixteen were traveling-wave-tube amplifier failures, 6 were tape recorders, 5 were transmitters, 3 were reaction wheels, 3 were receivers, 2 were power switching, and 38 were "other." The probable causes of failure were parts (55), design (45), quality (40), and environment (17). Software accounted for 4 failures. Attempts to improve the failing components have had only limited success: quantities are small, accommodations for the possible failures have been made in the designs, and the cost effectiveness of further improvement is marginal.
(Brooks, 1982, 1987; Musa, 1989) Software maintenance is primarily the continuing discovery and correction of software errors. The process results in a steadily declining occurrence of errors. Musa reports that by the use of models and by monitoring the error discovery rate, one can predict when requirements will finally be met. But there seems to be a limit. Brooks maintains that a point is reached where correcting errors generates still more errors and the error rate starts to climb again. He also concluded that rushing the discovery process by adding more software programmers may instead lengthen it.
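Musa's prediction idea can be illustrated with a minimal sketch of his basic execution-time reliability-growth model. The parameter values below (`lambda0`, `nu0`) and the target intensity are illustrative assumptions, not figures from the text:

```python
import math

def failure_intensity(tau, lambda0, nu0):
    """Failure intensity after tau units of execution time (Musa basic model)."""
    return lambda0 * math.exp(-(lambda0 / nu0) * tau)

def time_to_reach(target_intensity, lambda0, nu0):
    """Execution time needed for the intensity to decay to the target."""
    return (nu0 / lambda0) * math.log(lambda0 / target_intensity)

# Assumed parameters: 10 failures/unit time initially, 100 total expected failures.
lambda0, nu0 = 10.0, 100.0
t = time_to_reach(0.5, lambda0, nu0)  # time until intensity falls to 0.5
print(round(t, 2))                    # ~29.96
```

Fitting `lambda0` and `nu0` to the observed error-discovery rate is what lets a project estimate when a failure-intensity requirement will finally be met.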
It seems to be characteristic of the progressive redesign process that in the beginning, a few major problems dominate the development. Once they are solved, performance stabilizes at a tolerable, if not yet satisfactory, level. To raise that level by reducing the failure rate requires solving a significantly larger number of smaller problems, and so on, level by level. That is,
- The number of problems encountered in development is inversely related to their magnitudes.
If we assume that the difficulty of solving a problem is related to its magnitude, then it takes about the same amount of effort to reduce the failure rate by each successive factor. Resource managers in the aircraft industry have expressed the result this way:
- Reducing the failure rate by each factor of 2 takes as much effort as the original development.
Thus, if the original development cost X and resulted in a failure rate of 4%, then to reach 2% would require another X, to reach 1% still another X, and so on. S. W. Golomb consequently suggests that quality levels might better be expressed as the logarithm of the failure rate than as success percentages (Golomb, 1990).
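The factor-of-2 heuristic and Golomb's logarithmic suggestion can be sketched numerically; the starting failure rate and the unit cost X below are illustrative assumptions:

```python
import math

def effort_to_reach(initial_rate, target_rate, unit_cost=1.0):
    """Extra effort to drive the failure rate down to target_rate,
    assuming each halving costs as much as the original development."""
    halvings = math.log2(initial_rate / target_rate)
    return halvings * unit_cost

def log_quality(failure_rate):
    """Golomb-style quality level: the (negative) log of the failure rate."""
    return -math.log10(failure_rate)

print(effort_to_reach(0.04, 0.01))   # 2.0: two more X beyond the original
print(round(log_quality(0.001), 1))  # 3.0: each unit is a factor-of-10 improvement
```

On this scale, equal increments of quality correspond to roughly equal increments of effort, which is the point of logging the failure rate rather than quoting success percentages.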
The same heuristic seems to apply to reliability, survivability, safety, software engineering, and electronic countermeasures.
Some of the highest-quality complex systems in the world are programs of NASA and the Department of Defense. Special procedures were developed for them (Leverton, 1981). The essential factors are strong discipline, accurate documentation, faithful reporting of any and all anomalies, independent reviews of all steps and decisions, and extraordinary attention to detail. Perhaps most important for future use by others, these techniques have proved to be cost-effective. Using the criterion of data produced per dollar spent, the high-performance, long-lived U.S. spacecraft have consistently outperformed their design goals.
Based in part on NASA and DOD efforts, the Department of Energy (DOE) and the nuclear power industry have further advanced the drive to ultraquality (Floyd Culler in discussion with the author, 1989).
The technique, design against maximum credible accident, begins with the failure modes and effects analyses (FMEAs) pioneered by the DOD. Each possible failure mode is identified and eliminated by design. A new system FMEA is then constructed, with further redesign, until a point is reached where issues of basic physics, chemistry, materials, and structural dynamics become the limit. Experiments are then devised and run until all concerned parties, including the Nuclear Regulatory Commission, are satisfied that a design is safe. One indication of a decade of progress is the reduction in unresolved issues from over 400 to fewer than 25, none of which would produce catastrophic failure. But as the DOE recognizes, even this is not enough. The public must still be convinced. Accomplishing this means that: "The engineering task is to design reactors whose safety is so transparent that the skeptical elite is convinced, and through them, the general public" (A.M. Weinberg, 1989).
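The iterate-until-physics-limits loop can be sketched with a toy model; the failure modes, the severity-times-likelihood risk scoring, and the "reducible by design" flag below are illustrative assumptions, not an actual FMEA procedure:

```python
# Toy FMEA worklist: (name, severity 1-10, likelihood 1-10, reducible_by_design).
# The third entry stands in for a basic-physics limit that redesign cannot remove.
failure_modes = [
    ("valve sticks open", 9, 4, True),
    ("seal degrades", 7, 5, True),
    ("material creep at limit", 8, 2, False),
]

def risk(mode):
    _, severity, likelihood, _ = mode
    return severity * likelihood

# Progressive redesign: repeatedly design out the worst reducible mode,
# rebuilding the (toy) FMEA each pass, until only irreducible modes remain.
while any(m[3] for m in failure_modes):
    worst = max((m for m in failure_modes if m[3]), key=risk)
    failure_modes.remove(worst)  # a redesign eliminates this mode

print([m[0] for m in failure_modes])  # ['material creep at limit']
```

What survives the loop is exactly what the text describes: the residue of scientific unknowns that must then be settled by experiment rather than by redesign.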
The Department of Energy design process is not inexpensive. It is not unusual for the cost of certification of a nuclear power plant design to be more than $100 million. (A typical nuclear power plant cost is in the billions.) And, for various reasons, there has been little opportunity to apply the new design techniques to new plants.
Designing against maximum credible accident is probably the ultimate progressive redesign. It forces quality issues to the most basic level, the scientific unknowns, which are then pursued to a conclusion. Not every system needs to dig that deep on every issue. But one issue that does call for more fundamental effort across the board is the value of accelerated testing and burn-in of ultraquality components and systems. Accelerated testing is a widely used method of attempting to certify, in a short time, a part or element intended for many years of operation. It has three important applications. The first is in early development, when go/no-go decisions have to be made on the use of new technologies. The second is in manufacture, when product lots are screened prior to incorporation into systems. The third is in system test.
In accelerated testing, the device is subjected to environmental stresses (usually temperature and vibration) far greater than expected in operation, but for a correspondingly shorter time. The technique is controversial, however. It can harm otherwise good components, and its results can be deceptive in predicting, or failing to predict, future failures. It has been implicated in damaging high-quality semiconductors, although it can find mechanical flaws like poor bearings, loose parts, and poor connectors. The technique accurately emulates only those failure mechanisms for which high stress over a short period produces effects similar to those of low stress over a long term. Mechanisms that take time alone, like electrochemical corrosion, molecular changes in plastics, or exposure to radiation in space, may not surface. The architect and client may conclude that the system is better than it is.
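For thermally activated mechanisms where that high-stress/short-time equivalence does hold, the Arrhenius model is the usual basis for the time compression. A minimal sketch follows; the activation energy and temperatures are illustrative assumptions, and the model says nothing about the mechanical or time-only mechanisms just mentioned:

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def acceleration_factor(ea_ev, t_use_c, t_stress_c):
    """Ratio of time-to-failure at use temperature vs. stress temperature,
    for a thermally activated mechanism with activation energy ea_ev (eV)."""
    t_use = t_use_c + 273.15      # convert Celsius to Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1 / t_use - 1 / t_stress))

# Assumed example: a 0.7 eV mechanism, 55 C use vs. 125 C stress.
af = acceleration_factor(0.7, 55.0, 125.0)
print(round(af))  # roughly 78: one stress-hour stands in for ~78 use-hours
```

The factor is exquisitely sensitive to the assumed activation energy, which is one reason the results can mislead when the dominant failure mechanism is not the one being accelerated.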
The key, as with any test, is to understand which failure mechanisms are being induced by each test, which ones might be inadvertently activated, and which ones are not tested at all. That understanding comes from research, especially for ultraquality systems:
- Testing, without understanding the multiple failure mechanisms to which a system is susceptible, can be both deceptive and harmful.
As a consequence, untested or untestable mechanisms should be specified and accounted for in the design and in the acceptance criteria.
Progressive redesign runs into another limit: profitability to the manufacturers. Once it becomes evident that the necessary further investments can never be recouped, the projected losses will almost always force the maker to stop improving the product. Major system builders, needing the components, consequently risk the initial lack or the loss of suppliers for whom participation would be unprofitable. The result is a limit to the practical quality that a given supplier can produce at a reasonable price and profit. A client needing small numbers of still better components may have to make them in house; that is, vertical corporate integration of the complete production process may be the only management solution. Vertical integration changes the economics of the component. Although the cost of making the component may be much greater than its general market would bear, its contribution to a particular system may more than justify limited production by the system builder.
Excerpted from Systems Architecting, Creating and Building Complex Systems, Eberhardt Rechtin, Prentice Hall, 1991.