Rechtin on Ultraquality - The Challenge
Excellence Beyond Measure and the Need for Ultraquality Systems
Continuing the complex systems architecture theme and the discussion of ultraquality systems, the excerpt below is taken from Eberhardt Rechtin, Systems Architecting: Creating and Building Complex Systems, 1991, Chapter 8.
Several issues spring to mind today, particularly how the need for some ultraquality systems continues to be driven by public demand - examples might include the certification of AI systems, in particular when mass-produced, e.g. self-driving vehicles. Also relevant is how unforeseen failure modes of nuclear power plants lead us towards safer forms of renewable energy.
In discussing progressive redesign, Rechtin reminds us that Lean manufacturing practices were adopted long ago and were fundamental to the continued success of programs within NASA and the US Dept of Defense from the 1950s onwards; in those scientific contexts, Lean ideas have been applied to software engineering for many decades. Further, his reference to Fred Brooks brings to mind those software teams whose architects naively saw fit to expose them to unsafe background counts of defect radiation.
Lastly, Rechtin reminds us of the limits of the technique, and presents us with a rule of thumb for defining acceptance criteria -
untested or untestable mechanisms should be specified and accounted for in the design and in the acceptance criteria.
THE CHALLENGE: ULTRAQUALITY - EXCELLENCE BEYOND MEASURE
The greatest single challenge in architecting complex systems is ultraquality*—a level of excellence so high that measuring it with high confidence is close to impossible. Yet, measurable or not, it must be achieved or the system will be judged a failure.
*Quality: a measure of excellence, that is, freedom from deficiencies. Ultra: beyond or greater than. Ultraquality: excellence beyond measure.
As has been noted in Part One, there is a strong correspondence between a system's complexity and the quality demanded of its elements. As systems become more complex, their elements must be more reliable or survivable, as the case may be, or system performance will suffer. This relationship is the basis of the *not good enough* and *cost/failure rate* heuristics. For example, component failure rates of less than 0.001% may be necessary just to keep the failure rate of a 1000-part system in the range of 1-5%. But demonstrating 0.001% with reasonable confidence requires testing of 10,000 units, which may be greater than the total intended production. Without certification of the components, certification of the system as a whole will be open to question.
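To see the arithmetic at work, here is a minimal Python sketch (an editorial aside, not Rechtin's) of the simplest series-system model, in which the failure of any single part fails the whole system; the 0.001% component rate and 1000-part count are simply the figures quoted above:

```python
# Simplest series-system model: the system fails if any one of its
# n independent parts fails. Real systems add redundancy and share
# common-cause failures, so this is only an illustration of scaling.

def system_failure_rate(part_failure_rate: float, n_parts: int) -> float:
    """Probability that at least one of n independent parts fails."""
    return 1.0 - (1.0 - part_failure_rate) ** n_parts

if __name__ == "__main__":
    part_rate = 0.00001   # 0.001% per part, the figure quoted in the excerpt
    n_parts = 1000        # a 1000-part system
    print(f"System failure rate: {system_failure_rate(part_rate, n_parts):.3%}")
    # -> roughly 1%, the low end of the 1-5% range mentioned above
```

For small rates the result is approximately the part rate multiplied by the part count, which is why a 0.001% part rate lands a 1000-part system at roughly 1%.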
Ultraquality systems, containing millions of components, yet mandated to have failure rates well below 1%, raise the challenge almost beyond reach - gaining client acceptance for systems that for all practical purposes cannot be certified by test and demonstration.
THE NEED FOR ULTRAQUALITY SYSTEMS
Half a dozen ultraquality cases come to mind, all subjects of direct concern to government, industry, and the general public. The fact that their requisite ultraquality cannot be well certified is moot. Credible ultraquality must be produced nonetheless.
Case 1:
*Unacceptable loss of life.* It has been pointed out many times that 50,000 lives are lost on U.S. roads and highways each year. Not too much is done about that. But a very infrequent crash of a large passenger airliner with the loss of hundreds of lives causes great public concern and immediate investigation and correction. Nuclear power plants, demonstrably the safest of the major energy suppliers in casualties per year, are nonetheless perceived by many people as unsafe. The nation's manned space program, though no risk levels were stated publicly, evidently was expected to be ultraquality, that is, considerably better than a 1% system failure rate per mission. The lowest demonstrated launch failure rate prior to the Shuttle was about 4%, which makes the Shuttle, as of 1989, roughly comparable in quality to the lunar Apollo.
Case 2:
*Leveraged dissatisfaction.* Calling for a success rate of 99% would seem to be a call for high quality, yet a 1% failure rate, multiplied by the total number of customers, can lead to an uproar when the customers are in the millions and the products are worth a significant fraction of a customer's yearly income. Automobiles, computers, and consumer electronics are prime examples. Ten thousand dissatisfied customers, each expressing dissatisfaction to 10 friends, who pass it on, can damage a company's reputation for years. It is perhaps not surprising that the quality specifications on computer chips for automobiles are tougher than for most military procurements.
Case 3:
*Small numbers.* What is the meaning of specifying a 99.9% success rate, or a 99.0% one for that matter, when the total number of units to be built ranges from 1 to 10? How can such a specification be demonstrated without testing far more units, or the same units for far longer times, than makes sense? Yet this situation is the normal one for space flight, for major ground installations, and for major naval combatants - carriers, cruisers, and submarines.
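Rechtin does not spell out the statistics here, but the bind can be made concrete with the standard zero-failure confidence bound: if n units are tested and none fails, the tightest failure-rate claim supportable at confidence level C is 1 - (1 - C)^(1/n). A short illustrative sketch (my numbers, not the book's):

```python
def demonstrated_failure_bound(n_tests: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the failure rate after n_tests with zero failures.

    Derived from requiring (1 - p) ** n_tests <= 1 - confidence: if the true
    failure rate exceeded p, seeing no failures at all would be unlikely.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)

if __name__ == "__main__":
    for n in (1, 5, 10):
        bound = demonstrated_failure_bound(n)
        print(f"n={n:2d}, zero failures -> demonstrated failure rate < {bound:.0%}")
    # n= 1 -> < 95%, n= 5 -> < 45%, n=10 -> < 26% (all at 95% confidence)
```

Even a flawless record on ten units rules out only failure rates above roughly 26% at 95% confidence - nowhere near the 99.9% success rate being specified.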
Case 4:
*Smart systems.* Modern smart systems, particularly those in the aerospace, energy, and manufacturing fields, control forces that could destroy the system itself. The calculations to do so are many and complex. Fly-by-wire aircraft (F-16, B-1, B-2), launch vehicles (Shuttle), and highly maneuverable spacecraft (Voyager and Solar Max) are well-known examples. A failure in these systems can wreak great damage in a very short time. A mission can be aborted when it should not be, or not aborted when it should. A spacecraft can lose all its control gas from a miscommand. A launch vehicle can be broken apart by commanding a too-violent maneuver or failing to accommodate a severe wind shear. In comparable business transaction systems, an irreplaceable data base can be lost or crucial deadlines missed.
Case 5:
*Systems of extremely high value per unit.* The cost of a single complex system now can be so high that a catastrophic failure, particularly one that could have been avoided, is worse than unacceptable; it is unthinkable. Its loss could bankrupt a major company, smash a program, generate virulent public criticism, ruin careers, and create widespread disruption in related areas. A nuclear plant costs billions. The Shuttle ready to launch costs $3 billion and its cargo may add another $1 billion. A Stealth aircraft costs $500 million; a B-1, $250 million; and a planetary spacecraft, hundreds of millions. Consequently, huge financial risks are taken with every operational decision. As one pilot put it, "How would you like to bail out of a $500 million aircraft when there was even the faintest chance that it might be saved?" Imagine, too, the pressures on the controller of a $700 million spacecraft in issuing a propulsion command that, if wrong or not communicated, could ruin the mission.
Case 6:
*High-value systems under attack.* What failure rate would be acceptable for the defense of a city under nuclear attack? For a critical satellite under antisatellite attack? For a strategic command and control system under jamming? For a critical data base against penetration?
In cases like these, any failure at all means that the objective cannot be met—and failure should be so rare that the system may well become obsolete before it fails. At the same time, lack of failure to date gives little confidence, statistically, that all is well.
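Turning that last point into numbers (again an illustration, not part of the excerpt): the count of consecutive failure-free trials needed to support a given failure-rate claim at 95% confidence grows roughly as 3 divided by that rate, which is why 'no failures so far' says so little about an ultraquality requirement:

```python
import math

def tests_needed(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Failure-free trials required to demonstrate a failure rate below
    max_failure_rate at the given one-sided confidence level."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_failure_rate))

if __name__ == "__main__":
    for rate in (0.01, 0.001, 0.0001):
        print(f"failure rate < {rate:.2%}: {tests_needed(rate):,} flawless trials")
    # < 1.00%:    299 trials
    # < 0.10%:  2,995 trials
    # < 0.01%: 29,956 trials -- far beyond the handful of units actually built
```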
Excerpted from Systems Architecting: Creating and Building Complex Systems, Eberhardt Rechtin, Prentice Hall, 1991.
Posted in: agile architecture, complex systems architecture, continuous delivery, deming, design, distributed systems testing, lean, rechtin, software architecture, testability, testing