Notes and key insights from Barry O’Reilly’s "Residuality Theory", exploring how stressors, residues, and random simulation can create resilient software architecture in complex business systems.

These are my notes on Residues: Time, Uncertainty, and Change in Software Architecture by Barry O’Reilly.

Probably one of the most interesting books in software architecture in the last few years.

Key Insights

Random simulation is better than requirements, risks, and predictions.
Flows are better than process or use case mapping.
Residues replace components or patterns as the unit of architecture.
Traditional architecture approaches have no way of capturing time, uncertainty, or change.
The human systems that we work with are complex, and the SW systems (mostly) complicated.
Hyperliminal system: a system where a complicated system executes inside a complex context.
SW Engineering is the engineering of hyperliminal systems.
- There is nothing in the history of engineering that is like this.
Predicting the future of the complex context is impossible.
The SW architecture is a collection of residues.
- The residue is a unit of change that allows the architecture to change in a particular way without being entirely certain about when the change will occur or even if it will.
The goal of the architecture is criticality, not correctness. Correctness is the goal of the programmer.
- Criticality: a system is resilient to unexpected changes and at the same time not so complicated that it collapses under the weight of managing its own resources.
Architecture approaches criticality when it is difficult to come up with new stressors, and even more difficult to come up with stressors that aren’t already solved by existing residues.
- The architecture then must make it possible for the SW to move between residues easily.
For the mythical 10X developer, 9X is lateral thinking.
SW architecture is the practice of being consistently wrong until you reach a point of being a little bit less wrong.
The presence of a lot of edge cases may indicate linear thinking replacing architecture work.
Tools such as Business Model Canvas, PESTLE analysis, and Porter’s 5 Forces can be used to generate lists of stressors quite quickly.
The major blocker to producing stressors is not a lack of knowledge but a lack of confidence.
Focus on being able to do something correctly before you focus on doing it cheaply.
The key technique is to constantly assume that our structures … are wrong and try to find coherent stories that could lead them to break.

Introduction

A random simulation of stress is a better way of generating a software architecture than prediction, requirements analysis, reuse of patterns, or reactive change management by coding.
Despite the huge number of failing projects in the enterprise SW industry, these ideas, with their roots in traditional engineering, are inherited without question by each new generation.
What is then necessary for an architecture to be successful, since no two architects arrive at it in the same way?

Architecture

Architecture is the structure of a SW application that drives the behaviour of a SW application in its environment. That is, the way the system responds to scale, load, changes, integration, and things we normally refer to as non-functional concerns.
Decision making around SW structure happens implicitly and explicitly in many different places.
Traditional architecture approaches have no way of capturing time, uncertainty, or change.
Most reports on SW engineering failures will blame “poor or incomplete requirements”. This is a symptom rather than a cause - business environments are inherently uncertain. SW requires certainty in order to be written. This conflict lies at the heart of what makes architecture difficult.
Ergodic system: you can predict the future states of the system based on past behaviour.
The human systems that we work with are non-ergodic, and the SW systems (mostly) ergodic.
Ergodic ~= ordered ~= complicated.
Non-ergodic ~= disordered ~= complex.
Hyperliminal system: a system where an ergodic system executes inside a non-ergodic context.
It is the job of the architect to design a structure for the complicated system that will allow it to survive as the complex context around it changes and moves.
Predicting the future of the complex context is impossible.
SW Engineering is the engineering of hyperliminal systems. There is nothing in the history of engineering that is like this, and no grounds for importing historical concerns.

Residuality

A residue is what is left over when the SW system is stressed.
Residuality is very restrictive in what we can say about the hyperliminal system and helps us to know our own limitations:
- We cannot successfully describe the entirety of a complex, hyperliminal system.
- We cannot successfully predict what exactly will happen or when in the hyperliminal system.
- The future of the system will be a function of the residue.
- As we design the system, we can change it to produce better residues.
- The SW architecture is a collection of residues.
The straitjacket of structuralism:
- When we engage with complex contexts and ask stakeholders questions about their environment, our efforts are very often random simulations, but the methodologies we use give the impression that we are following a structured method of interrogation.
- Our seemingly structured methods are in fact fairly unstructured, yet they are not random enough to be truly useful as a random simulation.
Understanding of the system is sought through the discovery and mapping of this structure. In a complex system, this kind of mapping will often lead to analysis paralysis.
Kauffman networks:
- Kauffman observed that creating dependencies between nodes reduced the number of phase states by many orders of magnitude, and that the system seemed to constantly return to the same group of phase states. He called these recurring phase states attractors.
- Attractors depend on the number of nodes (N), the number of links between them (K), and the predictability of a node's behaviour (P).
The purpose of the methods in this book is to find the attractors in a business system.
Criticality: at a certain level of N and K, a system is resilient to unexpected changes and at the same time not so complicated that it collapses under the weight of managing its own resources.
P is applied to SW architectures by restricting the way in which components interact with the rest of the system.
As N and K rise, the number of attractors rises, and as P rises, the number of attractors falls.
What is difficult is knowing what the right levels of these are.
- This problem is solved by randomly simulating the environment until the architecture shows signs of criticality.
The goal of the architecture is criticality, not correctness. Correctness is the goal of the programmer.
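To make the N, K, and P discussion concrete, here is a minimal sketch of a random Boolean network of the kind Kauffman studied. It is not taken from the book: the function names are hypothetical, and P is modelled here as the bias of each node's truth table (the probability that an output is 1), which is an assumption of this sketch. It samples random starting states and collects the cycles the network settles into.

```python
import random
from itertools import product


def random_boolean_network(n, k, p, rng):
    """Build a network of n nodes. Each node reads k randomly chosen nodes
    and gets a random truth table whose outputs are 1 with probability p."""
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [
        {combo: int(rng.random() < p) for combo in product((0, 1), repeat=k)}
        for _ in range(n)
    ]
    return inputs, tables


def step(state, inputs, tables):
    """Advance every node synchronously by one time step."""
    return tuple(
        tables[i][tuple(state[j] for j in inputs[i])]
        for i in range(len(state))
    )


def find_attractor(state, inputs, tables):
    """Follow the trajectory until a state repeats and return the cycle,
    rotated to start at its smallest state so equal cycles compare equal."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, inputs, tables)
    cycle = trajectory[seen[state]:]
    start = cycle.index(min(cycle))
    return tuple(cycle[start:] + cycle[:start])


def count_attractors(n, k, p, samples=200, seed=0):
    """Sample random starting states and collect the distinct attractors."""
    rng = random.Random(seed)
    inputs, tables = random_boolean_network(n, k, p, rng)
    attractors = set()
    for _ in range(samples):
        start = tuple(rng.randint(0, 1) for _ in range(n))
        attractors.add(find_attractor(start, inputs, tables))
    return attractors


if __name__ == "__main__":
    found = count_attractors(n=12, k=2, p=0.5)
    print(f"{2 ** 12} possible states, {len(found)} attractors found by sampling")
```

Varying n, k, and p in this toy model is one way to get a feel for how the number of attractors responds, although sampling only finds the attractors that the chosen starting states happen to reach.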

Applying Residuality Theory

The naive architecture is the starting point: an architecture that will solve the problem as it has been stated.
The business system is considered as a network of attractors which it shifts between over time.
The architect will randomly simulate the business environment by coming up with stressors (found by involving many stakeholders).
Each time a stressor is described, it is possible to reason about what the attractor might look like.
Each residue is a simple description of the changes necessary to the naive architecture.
The residue is a unit of change that allows the architecture to change in a particular way without being entirely certain about when the change will occur or even if it will.
Architecture approaches criticality when it is difficult to come up with new stressors, and even more difficult to come up with stressors that aren’t already solved by existing residues.
The architecture then must make it possible for the SW to move between residues easily.
The number of potential stressors is orders of magnitude greater than the number of potential attractors. Therefore, every time the architect identifies a particular attractor and amends the residue to survive in it, the architecture will survive all of the potential stressors or unknown combinations of stressors that push the business system to that attractor.
Hyperliminal coupling: Components are coupled when one stressor affects both.

How we learn a domain

Deleuzian Walk:
- Walking around the problem repeatedly, observing each time what is different about this particular walk.
- This kind of approach leads to a richer understanding of the business environment, better appreciation of customer and competitor concerns, and makes it easier to reason about the possible futures.
- This kind of walking is constantly stressing the current structure that the architect has built up in order to understand the domain.
We can even consider the journey of the SW engineering industry as a number of repetitive walks:
- The industry is constantly repeating the same cycles over and over again, but every repetition is in a shifted context, and every generation is learning (and forgetting) something new.
Mathematical models or network science do little to help the architect navigate complex social systems.
In residuality the preferred approach to complexity is to allow the collapse of structure and force the creation of new concepts that are more useful in a changing context.
The way that we have advanced beyond what seemed complex in the past was not through iterative, planned experimentation but through sudden bursts of imagination.
Linear thinking is mathematical and exact.
A linear thinker will simply propose a solution and treat all anomalies as edge cases, adjusting the solution slightly each time.
Linear thinkers make excellent programmers.
For the mythical 10X developer, 9X is lateral thinking.
Anybody who has mastered lateral thinking without linear thinking will not be able to produce good architecture.
SW architecture is the practice of being consistently wrong until you reach a point of being a little bit less wrong.

Walking with Stressors

A stressor is any fact about the context that is currently unknown to you.
Every time a system is stressed, you trigger a walk.
The stressor does not need a high likelihood or even an assigned probability; it only needs a coherent narrative that describes how the wider business system will move to a different attractor.
Requirements should always be tested with a number of stressors that investigate how the requirement might change in different attractors.
Constraints are also changing constantly as the context changes.
The presence of a lot of edge cases may indicate linear thinking replacing architecture work.
Scenarios are examples of attractors thought to be likely.
Resilience:
- Residuality seeks to ease the transition of the SW between attractors - not to remain in a single attractor or a limited set of attractors.
- Resilience approaches tend to see SW as complex.
- For a complicated SW system, the appropriate goal is actually robustness, not resilience.
- SW should be robust but its architecture should be residual.
A stressor is not just something that goes wrong or breaks the system; it is anything outside your current understanding of the system.
Identifying attractors requires a good level of understanding of how the business works.
Some rules:
- No use of probability.
- All stressors must go in the list, no matter how ridiculous.
- Use as many sources as possible.
Figuring out the probability of a certain event happening in a complex system is incredibly difficult and expensive.
Tools such as Business Model Canvas, PESTLE analysis, and Porter’s 5 Forces can be used to generate lists of stressors quite quickly.
The major blocker to producing stressors is not a lack of knowledge but a lack of confidence.
The result of a stressor analysis is a spreadsheet listing each stressor, attractor, and changes made to preserve the residue. Each row in the spreadsheet represents a residue.
Once the stressor analysis has been completed the architect must now integrate all of these residues into a coherent architecture.
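The stressor spreadsheet described above can be sketched as a small data structure with one record per residue. The field names and the example rows are hypothetical, invented only for illustration; the book does not prescribe a file format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Residue:
    stressor: str   # a fact about the context that is currently unknown
    attractor: str  # the state the business system is pushed towards
    change: str     # the change needed to the naive architecture


def to_csv(residues):
    """Render the stressor analysis as CSV text, one row per residue."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Residue)])
    writer.writeheader()
    for residue in residues:
        writer.writerow(asdict(residue))
    return buffer.getvalue()


# Hypothetical rows for an imaginary web shop.
analysis = [
    Residue("payment provider outage", "customers abandon carts",
            "queue orders and retry payment later"),
    Residue("new tax regulation", "prices differ per region",
            "move tax rules into configuration"),
]

print(to_csv(analysis))
```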
Incidence matrix: allows us to describe the interface between the context and the architecture as a network and investigate the value of N and K that our integrated residual architecture results in.
- Potential components as columns.
- Stressors as rows.
- Cells: 1 if the stressor affects the component, 0 if it doesn’t.
- Quick impression of attractors which are most dangerous to the system and what components are most vulnerable to stress.
- Two 1’s in the same row indicate hidden coupling.
- Two components with the same pattern of response to stress can be merged into a single component.
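As a rough sketch of the incidence matrix described above, the following builds the matrix from a stressor list and reads off the last two observations: rows with more than one 1 (hidden coupling) and components with identical columns (merge candidates). All names and the example data are hypothetical.

```python
def build_incidence_matrix(stressors, components, affects):
    """Rows are stressors, columns are components; a cell is 1 when the
    stressor affects the component and 0 otherwise."""
    return [
        [1 if component in affects.get(stressor, set()) else 0
         for component in components]
        for stressor in stressors
    ]


def hidden_couplings(matrix, stressors, components):
    """Map each stressor that hits two or more components to those components."""
    return {
        stressor: [c for c, cell in zip(components, row) if cell]
        for stressor, row in zip(stressors, matrix)
        if sum(row) >= 2
    }


def merge_candidates(matrix, components):
    """Group components whose columns (response to stress) are identical.
    Components no stressor touches are skipped, since they carry no signal."""
    groups = {}
    for j, component in enumerate(components):
        column = tuple(row[j] for row in matrix)
        if any(column):
            groups.setdefault(column, []).append(component)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical example data, invented for illustration only.
components = ["Checkout", "Billing", "Catalog", "Notifications"]
affects = {
    "payment provider outage": {"Checkout", "Billing"},
    "new tax regulation": {"Checkout", "Billing"},
    "supplier changes product codes": {"Catalog"},
}
stressors = list(affects)

matrix = build_incidence_matrix(stressors, components, affects)
print(hidden_couplings(matrix, stressors, components))
print(merge_candidates(matrix, components))
```

In this made-up data, Checkout and Billing respond to exactly the same stressors, so they show up both as coupled and as candidates for living in one component.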
Failure Mode Effects Analysis:
- Manage the stress of technical failure.
- Catches technical issues introduced by the addition of components.
Architecture Trade-off Analysis Method:
- Trade-off and balance between different stakeholder opinions about which residues to include in the final architecture.
- Catches political and business misunderstandings.

Empirical Test

Residual index Ri = (# of stressors the residual architecture survives - # of stressors the naive architecture survives) / # of stressors
If Ri > 0, the residual architecture is better.
As Ri approaches 0 across iterations, there is less and less return in doing further architectural work.
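The residual index is simple arithmetic; a minimal sketch, with a hypothetical function name and made-up counts, looks like this:

```python
def residual_index(residual_survived, naive_survived, total_stressors):
    """Ri = (stressors survived by the residual architecture
             - stressors survived by the naive architecture) / total stressors."""
    if total_stressors <= 0:
        raise ValueError("total_stressors must be positive")
    return (residual_survived - naive_survived) / total_stressors


# Hypothetical counts: out of 100 stressors, the naive architecture
# survives 20 and the residual architecture survives 65.
ri = residual_index(residual_survived=65, naive_survived=20, total_stressors=100)
print(f"Ri = {ri:.2f}")
```

A positive value means the residual architecture survives more stressors than the naive one; tracking the value across iterations shows when further architectural work stops paying off.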

Over-engineering, Cost, and Agility

Focus on being able to do something correctly before you focus on doing it cheaply.

A Worked Example

Many people approaching the idea for the first time become overwhelmed by the fear of being wrong or performing the tasks incorrectly.
The inclusion of case studies is marketing rather than science.
Business process as the starting point for understanding a system:
- These processes tend to obscure the actual action that happens in the business; they are an abstraction that sometimes hides a lot of information.
- Once they have been drawn however, they will shape the architecture of the system for the rest of its life cycle.
- The concept of use cases is exactly the same.
Instead, we describe our structure in terms of flows: a flow is the movement of information between two actors in a system.
A standard stressor analysis should produce a few hundred stressors.
In order to get a good, broad range of stressors it is necessary to involve a lot of people.
The key technique is to constantly assume that our structures … are wrong and try to find coherent stories that could lead them to break.
Special attention should be paid to any component that uses nouns in its naming. These concepts can change and tear an architecture apart very quickly.
Incidence matrix:
- Represents a network formed by the relationships between stressors and components.
- The number of 1’s in the network gives an approximation of K.
- The number of stressors and components represents N.
- Tells us where to focus:
  1. Where the row totals are the highest: most impact and also reveals the highest levels of hyperliminal coupling.
  2. Where the column totals are highest: components most sensitive to stress, either because they are doing too many things (split them), or because they are important.
  3. Where there is more than one 1 in a row: coupling.
  4. Where two components have the same or similar responses to stress: combine them into the same component.
  5. Where there are many high numbers: complete refactoring of the architecture.
  6. Combinations of stressors: harder to see vulnerabilities.
  7. Untouched components: you have not stressed this part enough.
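Several of the focus points above reduce to row and column totals. The sketch below computes them for a small hand-written matrix; the function name, the dictionary keys, and the example data are hypothetical, and N and K follow the approximations given in these notes (N as stressors plus components, K as the number of 1's).

```python
def summarize_incidence_matrix(matrix, stressors, components):
    """Summarize an incidence matrix (rows = stressors, columns = components)."""
    row_totals = {s: sum(row) for s, row in zip(stressors, matrix)}
    column_totals = {
        c: sum(row[j] for row in matrix) for j, c in enumerate(components)
    }
    return {
        "N": len(stressors) + len(components),
        "K": sum(row_totals.values()),
        # Highest row totals first: stressors with the most impact.
        "stressors_by_impact": sorted(
            row_totals, key=row_totals.get, reverse=True),
        # Highest column totals first: components most sensitive to stress.
        "components_by_sensitivity": sorted(
            column_totals, key=column_totals.get, reverse=True),
        # All-zero columns: parts of the design that were never stressed.
        "untouched_components": [
            c for c, total in column_totals.items() if total == 0],
    }


# Hypothetical example data, invented for illustration only.
stressors = ["payment provider outage", "new tax regulation",
             "viral marketing spike"]
components = ["Checkout", "Billing", "Catalog", "Notifications"]
matrix = [
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

summary = summarize_incidence_matrix(matrix, stressors, components)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Combinations of stressors (point 6) are not covered by this sketch, since they do not show up in simple totals.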

Heuristics

You cannot map or control hyperliminal systems.
Random simulation is better than requirements, risks, and predictions.
Flows are better than process or use case mapping.
Residues replace components or patterns as the unit of architecture.
Matrices are better than component decomposition.
No probability or cost until the architecture is explored for weaknesses.
