Book notes on "Building Microservices: Designing Fine-Grained Systems" by Sam Newman

These are my notes on Building Microservices: Designing Fine-Grained Systems by Sam Newman.

When an author lists more drawbacks than benefits (nine vs six), you know he knows what he is talking about.

Key Insights

Independent deployability is key.
Clear, stable service boundaries.
Smaller teams working on smaller codebases tend to be more productive.

Benefits	Pain Points
Tech heterogeneity	Tech overload
Robustness	Latency
Scalability	Data Consistency
Ease of deployment	Dev experience
Org alignment	Cost
Composability	Monitoring and troubleshooting
	Reporting
	Security
	Testing

When to avoid:
- Start ups:
  - Because no stable service boundaries.
  - Small teams.
- SW deployed by customers, due to the operational burden.
Keep your middleware dumb, and keep the smarts in the endpoints.
Prefer to put into an event what you would be happy to share via an API.
Ideal technology should:
1. Make backwards compatibility easy.
2. Make interfaces explicit.
3. Keep your APIs technology-agnostic.
4. Make services simple for consumers.
5. Hide internal implementation details.
Communication:
- RPC:
  - gRPC good, other bad.
  - Use when you have a good deal of control over both client and server.
- REST:
  - Sensible default choice.
  - Good for external API: wide support + caching.
  - HATEOAS: No practical advantage.
- GraphQL:
  - Dynamic queries can cause stability issues similar to a big/slow/resource intensive SQL query affecting the whole system.
  - CDN caching more difficult.
  - Write requests are awkward.
- Message brokers:
  - What any given broker technology means by guaranteed delivery can vary. Read the documentation very carefully.
  - Pay really careful attention to how exactly once delivery is implemented (if it is).
    - Even better, expect duplicates.
Author still a fan of XML.
Client libraries should be created by team other than the one that owns the server API.
Sagas:
- Saga gives enough information to reason about what state it is in.
- Sagas assume the underlying components are reliable, as they do not deal with technical failures (5xx, timeouts).
- General rule:
  - If a team owns the implementation of the entire saga: orchestration.
  - If multiple teams: choreographed.
We want to build artifacts once and only once, and use them for all deployments.
Source code organization:
- Multirepo:
  - If you are constantly making changes across multiple services, your microservices boundaries are wrong.
- Monorepo:
  - Eases changes across multiple services:
    - Atomic commits (but not atomic deployments).
  - Needs loads of tooling:
    - Bazel.
- Per team monorepo.
3 services per dev is not an uncommon ratio.
The smarter PaaS solutions try to be, the more they go wrong.
If you need to do a lot of fine tuning around resources available to your functions, then FaaS is not the best option.
Explore FaaS before Kubernetes.
Future: Kubernetes hidden under a more developer friendly experience.
If your teams work independently, it follows that they should be able to test independently:
- It also follows that they should own their test environments.
- It is a reason against end-to-end tests.
Contract testing and consumer-driven contracts:
- Pair consumer and producer team members to write them.
- Make existing communication channels more explicit.
- Pact Broker.
- Spring Cloud Contract. Only JVM.
Monitoring:
1. Log aggregation:
  - Avoid log forwarding agents reformatting the logs. Make services log in the right format.
2. Metrics aggregation:
  - Look for a product that is built with high cardinality in mind, so you can attach more metadata/tags to the metrics.
3. Distributed tracing:
  - Pick a tool that supports OpenTelemetry.
4. Semantic monitoring:
  - Are we selling “as usual”?
  - Implemented with:
    1. Real user monitoring: In the past.
    2. Synthetic transactions: Catch issues before clients are aware.
- Be sceptical of ML/AI.
Good alert:
- Relevant.
- Unique.
- Timely.
- Prioritized.
- Understandable.
- Diagnostic: clear what is wrong.
- Advisory: help understand what actions to take.
- Focusing.
Standardization:
- Make it easy to do the right thing.
You are only as secure as your least secure aspect.
Credentials of users and operators are often the weakest point of our system:
- In 2020, 80% of hacks due to it.
Cost of any security implementation should be justified by your threat model.
Failure is everywhere:
- Spend less time trying to stop the inevitable and more dealing with it gracefully.
Root cause analysis: it is surprising how often we want that root cause to be a human.
CQRS: one of the hardest forms of scaling.
The ideal number of places to cache is zero.
When customizing a product built by someone else, you have to work in their world.
One experience, one backend for frontend (BFF).
Collective ownership can allow for more standardization, hence easier to move people/work around.
The biggest cost to working efficiently at scale is the need for coordination.
Coming up with a vision without considering how your staff will feel about it and without considering what capabilities they have is likely to lead to a bad place.
Architecture is what happens, not what is planned.
Make Product Manager accountable for the technical quality of the system.

Part I - Foundation

Chapter 1 - What are Microservices?

Independent deployability is key.
Clear, stable service boundaries.
Monoliths:
- Single-process monolith.
- Modular monolith.
- Distributed monolith.
Tech that enables microservices:
- Log aggregation and distributed tracing.
- Containers and Kubernetes.
- Streaming.
- Public Cloud and Serverless.
Smaller teams working on smaller codebases tend to be more productive.

Benefits	Pain Points
Tech heterogeneity	Tech overload
Robustness	Latency
Scalability	Data Consistency
Ease of deployment	Dev experience
Org alignment	Cost
Composability	Monitoring and troubleshooting
	Reporting
	Security
	Testing

When to avoid:
- Start ups:
  - Because no stable service boundaries.
  - Small teams.
- SW deployed by customers, due to the operational burden.

Chapter 2 - How to Model Microservices

Boundaries:
- Information hiding.
- Strong cohesion:
  - Code that changes together, stays together.
- Low coupling.
Types of coupling, from low to high:
1. Domain coupling:
  - Unavoidable: one service calling another.
  - Beware of one service talking to too many downstream services:
    - Too much logic centralized.
    - God “class”.
2. Pass-through coupling:
  - One service passes data to another purely because the data is needed further downstream.
  - Fix by:
    1. Bypass the intermediary:
      - Increase the domain coupling of calling service.
    2. Required information to become part of the intermediary contract:
      - Intermediary to collect all/part of that data.
    3. Intermediary to treat required information as a blob:
      - The Clojure way!
3. Common coupling:
  - Two services using the same data.
  - Okish for reference data (low change, read only).
  - Services that are thin wrappers around CRUD are a sign of weak cohesion and high coupling.
4. Content coupling:
  - Upstream service reaches into the internals of a downstream service and changes its internal state.
  - Other services touching your DB directly.
To model:
1. Mainly, use DDD:
  - Both aggregates and bounded contexts give us units of cohesion with well-defined interfaces: candidates for microservices.
  - A coarser-grained bounded context can contain/hide other bounded contexts.
2. Event Storming.
3. Other forces:
  - Volatility: things that change frequently should go into their own service.
  - Data: PII/PCI.
  - Technology: use the right tool for the job.
  - Organizational: Conway’s law.
Onion architecture: it has lots of layers and makes you cry when you have to cut through it.

Chapter 3 - Splitting the Monolith

Skipped. See book notes on Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith.

Chapter 4 - Microservices Communication Styles

When you buy into a specific tech choice, you are buying into a set of ideas and constraints that come with it.

[Figure: communication styles]

Sync blocking: beware of long call chains.
Common data:
- Large volumes.
- Universal.
- High latency.
Keep your middleware dumb, and keep the smarts in the endpoints.
Prefer to put into an event what you would be happy to share via an API.

Part II - Implementation

Chapter 5 - Implementing Microservices Communication

Ideal technology should:
1. Make backwards compatibility easy.
2. Make interfaces explicit.
3. Keep your APIs technology-agnostic.
4. Make services simple for consumers.
5. Hide internal implementation details.
Choices:
- RPC:
  - gRPC good, other bad.
  - Use when you have a good deal of control over both client and server.
- REST:
  - Open API concern: a spec used for documenting now being used for a more explicit contract.
  - Lot more complex than protobuf.
  - Less performant than gRPC.
  - HTTP/3: based on QUIC, less overhead than TCP.
  - Sensible default choice.
  - Good for external API: wide support + caching.
  - HATEOAS:
    - Many of the ideas in REST are predicated on creating distributed hypermedia systems, and this isn’t what most people end up building.
    - More chatty.
    - No practical advantage.
- GraphQL:
  - Dynamic queries can cause stability issues similar to a big/slow/resource intensive SQL query affecting the whole system.
  - CDN caching more difficult.
  - Write requests are awkward.
- Message brokers:
  - What any given broker technology means by guaranteed delivery can vary. Read the documentation very carefully.
  - Pay really careful attention to how exactly once delivery is implemented (if it is).
    - Even better, expect duplicates.
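One way to "expect duplicates" is to make the consumer idempotent. The following is a minimal sketch, assuming every message carries a stable, unique `message_id` (a hypothetical field name) and keeping the seen IDs in memory, where a real service would use durable storage:

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Applies each message at most once, keyed by its message ID.

    Assumption: the producer attaches a stable, unique `message_id`.
    """

    seen_ids: set = field(default_factory=set)
    balance: int = 0

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        message_id = message["message_id"]
        if message_id in self.seen_ids:
            return False  # duplicate delivery: safe to ignore
        self.balance += message["amount"]
        self.seen_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"message_id": "m-1", "amount": 10},
    {"message_id": "m-2", "amount": 5},
    {"message_id": "m-1", "amount": 10},  # redelivered by the broker
]
results = [consumer.handle(m) for m in deliveries]
```

With this in place, the broker's delivery guarantee matters less: a redelivered message changes nothing.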
Serialization formats:
- Author still a fan of XML.
Use schemas to catch structural breakages.
Avoid breaking changes:
- Just add, never remove.
- Tolerant reader.
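A minimal sketch of a tolerant reader, assuming JSON payloads. The field names (`id`, `email`, `loyalty_tier`) are hypothetical. The consumer picks out only what it needs, so fields added by the producer do not break it:

```python
import json


def read_customer(payload: str) -> dict:
    """Extract only the fields this consumer needs and ignore the rest."""
    data = json.loads(payload)
    return {
        "id": data["id"],                # required field
        "email": data.get("email", ""),  # optional: default if absent
    }


old_payload = '{"id": 7, "email": "a@example.com"}'
# The producer later adds a field; the reader does not care.
new_payload = '{"id": 7, "email": "a@example.com", "loyalty_tier": "gold"}'
```

This only protects against additions. Removing or renaming `id` would still be a breaking change, which is why "just add, never remove" goes hand in hand with it.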
Explicit interfaces:
- Events:
  - https://www.asyncapi.com.
  - https://cloudevents.io.
Breaking changes:
- Lockstep deployments:
  - Avoid.
  - More palatable if service + all consumers owned by the same team.
- Coexisting incompatible microservice versions:
  - Use sparingly and for short periods of time.
- Emulate the old interface:
  - Preferred approach.
  - Author torn between encoding versions in urls (simpler, obvious) or specify in header (less coupling).
- Social contract:
  - How will breaking changes be notified?
  - How will they be agreed upon?
  - Who will update the consumers?
  - How long before old endpoint is removed?
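A rough sketch of "emulate the old interface", assuming versions are encoded in the URL. The handlers, field names and paths are all hypothetical. The old version survives as a thin adapter over the new one, inside the same deployment:

```python
def get_customer_v2(customer_id: int) -> dict:
    """Current interface: the name is split into two fields."""
    return {"id": customer_id, "first_name": "Ada", "last_name": "Lovelace"}


def get_customer_v1(customer_id: int) -> dict:
    """Old interface, kept alive as a thin adapter over v2."""
    current = get_customer_v2(customer_id)
    return {
        "id": current["id"],
        "name": f'{current["first_name"]} {current["last_name"]}',
    }


# Hypothetical URL-based routing: both versions are served by one deployment.
ROUTES = {
    "/v1/customers": get_customer_v1,
    "/v2/customers": get_customer_v2,
}
```

Once the consumers have moved to v2, as agreed in the social contract, the v1 route and its adapter can be deleted.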
If you are using libraries for code reuse across microservices boundaries, you have to accept that multiple versions of that library will be out there.
Client libraries should be created by team other than the one that owns the server API:
- To avoid tight coupling.
Service discovery:
- DNS:
  - Slow to propagate changes.
  - Okish with a load balancer.
- Dynamic service registries:
  - Avoid Zookeeper.
  - Consul: good, especially if you also need Vault.
  - etcd/kubernetes.
When using an API gateway or service mesh, it is essential that their behaviour is generic, agnostic of specific microservices.
In API gateway, avoid:
- Call aggregation.
- Protocol rewriting.
- Use between your microservices.
BizOps.

Chapter 6 - Workflow

Sagas do not give atomicity in ACID terms:
- Saga gives enough information to reason about what state it is in.
Saga failure mode:
- Sagas assume the underlying components are reliable, as they do not deal with technical failures (5xx, timeouts).
- Backwards recovery: compensating transactions.
- Forward recovery: retry and keep processing.
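Backwards recovery can be sketched as follows. This is an illustration under assumptions, not the book's implementation: a business failure is modelled as a `ValueError`, and the step names are hypothetical. Completed steps are compensated in reverse order:

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on a business failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except ValueError:  # assumption: business failures are raised as ValueError
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
    return True


def fail_out_of_stock() -> None:
    raise ValueError("out of stock")


def noop() -> None:
    pass


saga_log: List[str] = []
ok = run_saga(
    [
        ("take_payment", noop, noop),   # compensation would refund
        ("award_points", noop, noop),   # compensation would revoke the points
        ("reserve_stock", fail_out_of_stock, noop),
    ],
    saga_log,
)
```

The log is what lets you reason about which state the saga is in, which is the point made above about sagas not giving ACID atomicity.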
Orchestrated sagas:
- One coordinator.
- Easy to understand: one place to look at how a process works.
- High coupling.
- Risk of “god” service and anemic ones:
  - Avoid by having different services play the orchestrator role for different flows.
Choreographed sagas:
- More decoupled.
- Harder to understand the process and the saga state.
- Saga state:
  - Events should have a correlationID or sagaID.
  - A process consumes all events to show the actual state.
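A minimal sketch of such a process, assuming every event carries a `saga_id` and a `type` (both hypothetical field names). It consumes all events and groups them per saga, so the state of any one saga can be read off:

```python
from collections import defaultdict
from typing import Dict, List


def saga_states(events: List[dict]) -> Dict[str, List[str]]:
    """Group event types by saga ID, preserving arrival order."""
    states: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        states[event["saga_id"]].append(event["type"])
    return dict(states)


events = [
    {"saga_id": "order-1", "type": "OrderPlaced"},
    {"saga_id": "order-2", "type": "OrderPlaced"},
    {"saga_id": "order-1", "type": "PaymentTaken"},
    {"saga_id": "order-1", "type": "StockReserved"},
]
states = saga_states(events)
```

Here `order-2` is visibly stuck after `OrderPlaced`, which would be hard to tell in a choreographed saga without the shared ID.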
General rule:
- If a team owns the implementation of the entire saga: orchestration.
- If multiple teams: choreographed.

Chapter 7 - Build

We want to build artifacts once and only once, and use them for all deployments.
Source code organization:
- One giant repo, one giant build:
  - Avoid.
- Multirepo:
  - Hard to work with several services at the same time.
  - If you are constantly making changes across multiple services, your microservices boundaries are wrong.
  - Most straightforward.
- Monorepo:
  - Eases:
    - Fine grained code reuse.
    - Changes in multiple services:
      - Atomic commits (but not atomic deployments).
  - Needs loads of tooling:
    - Bazel.
- Per team monorepo.

Chapter 8 - Deployment

Principles of microservices deployment:
1. Isolated execution.
2. Focus on automation:
  - Aim for self-service.
  - 3 services per dev is not an uncommon ratio.
3. Infrastructure as code:
  - How automation should be done.
  - Terraform / Pulumi.
4. Zero-downtime deployment:
  - Avoid coordination.
5. Desired state management:
  - Declarative state.
  - Platform automatically makes changes to arrive at the desired state.
  - Kubernetes / Nomad.
  - GitOps: Flux.
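The idea behind desired state management can be sketched as a reconciliation step. This is a toy illustration, not how Kubernetes or Nomad are implemented. State is assumed to be a plain mapping from service name to replica count, and the service names are made up:

```python
from typing import Dict, List


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[str]:
    """Return the actions needed to move `actual` replica counts to `desired`."""
    actions: List[str] = []
    for service in sorted(set(desired) | set(actual)):
        want = desired.get(service, 0)
        have = actual.get(service, 0)
        if want > have:
            actions.append(f"start {want - have} x {service}")
        elif want < have:
            actions.append(f"stop {have - want} x {service}")
    return actions


desired = {"orders": 3, "payments": 2}
actual = {"orders": 1, "payments": 2, "legacy": 1}
plan = reconcile(desired, actual)
```

A platform would run this comparison continuously. You declare `desired` and never issue the start/stop commands yourself.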
Deployment options:
1. Physical machine.
2. Virtual machine:
  - Stricter isolation than containers.
3. Containers:
  - More lightweight than virtual machines.
4. Application containers.
5. Platform as a service (PaaS):
  - When PaaS solutions work well for your context, they work very well indeed.
  - The smarter PaaS solutions try to be, the more they go wrong.
6. Function as a Service (FaaS):
  - If you need to do a lot of fine tuning around resources available to your functions, then FaaS is not the best option.
  - Azure Durable Functions.
  - Cold start is optimized in most platforms.
  - One function per microservice or per aggregate:
    - Avoid more fine-grained.
Explore FaaS before Kubernetes.
Kubernetes has limited multitenancy capabilities:
1. Use OpenShift.
2. Use federated model: multiple Kubernetes clusters with a layer on top.
Knative aims to provide FaaS-style workflows to developers:
- Risky to adopt.
Future: Kubernetes hidden under a more developer friendly experience.
Progressive delivery:
- Separate deployment from release.
- Blue/Green deployments.
- Feature toggles.
- Canary releases:
  - Spinnaker.
- Parallel run: send to old and new and compare.
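A parallel run can be sketched like this. The pricing functions are hypothetical stand-ins for the old and new implementations. The old result is always the one served, and disagreements are only recorded:

```python
from typing import Any, Callable, List, Tuple


def parallel_run(
    old: Callable[[int], int],
    new: Callable[[int], int],
    request: int,
    mismatches: List[Tuple[int, int, Any]],
) -> int:
    """Serve the old result; record any disagreement from the new implementation."""
    trusted = old(request)
    try:
        candidate = new(request)
        if candidate != trusted:
            mismatches.append((request, trusted, candidate))
    except Exception as exc:  # the candidate must never break the live path
        mismatches.append((request, trusted, repr(exc)))
    return trusted


def old_price(quantity: int) -> int:
    return quantity * 100


def new_price(quantity: int) -> int:
    # Hypothetical new behaviour that differs for large quantities.
    return quantity * 100 if quantity < 10 else quantity * 90


mismatches: List[Tuple[int, int, Any]] = []
served = [parallel_run(old_price, new_price, q, mismatches) for q in (1, 5, 10)]
```

The new code is deployed and exercised with real traffic but not released: users only ever see the old result.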

Chapter 9 - Testing

Who owns an end-to-end test suite?
- Everybody / nobody.
- Dedicated team: this can be disastrous.
- Split suite and assign subsets to teams.
Actively remove tests that are no longer needed.

“Why not use a version number for the whole system? Now you have 2.1.0 problems.” (Brandon Byars)
If your teams work independently, it follows that they should be able to test independently:
- It also follows that they should own their test environments.
- It is a reason against end-to-end tests.
Contract testing and consumer-driven contracts:
- Pair consumer and producer team members to write them.
- Make existing communication channels more explicit.
- Pact Broker.
- Spring Cloud Contract. Only JVM.
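To show the idea behind a consumer-driven contract, here is a hand-rolled sketch. It does not use the Pact or Spring Cloud Contract APIs, and the contract format, field names and handler are invented for illustration. The consumer team states which fields it reads, and the producer runs that expectation as part of its own build:

```python
from typing import Dict, List

# Contract written by the consumer team: the fields it reads and their types.
ORDER_CONTRACT: Dict[str, type] = {"order_id": str, "total_cents": int}


def contract_violations(response: dict, contract: Dict[str, type]) -> List[str]:
    """List every way the response fails the consumer's expectations."""
    problems: List[str] = []
    for field_name, expected_type in contract.items():
        if field_name not in response:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(response[field_name], expected_type):
            problems.append(f"wrong type for {field_name}")
    return problems


def producer_get_order() -> dict:
    """Stand-in for the producer's real handler."""
    return {"order_id": "o-42", "total_cents": 1999, "currency": "EUR"}
```

Extra fields such as `currency` pass, because the contract only covers what the consumer actually uses. A failing check tells the producer team exactly which consumer they are about to break.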

Chapter 10 - From Monitoring to Observability

Observability: the extent to which you can understand the internal state of the system from external outputs.
Monitoring and observability systems are production systems.
Building blocks:
1. Log aggregation:
  - Prerequisite for microservices.
  - Pick a common format.
  - Avoid log forwarding agents reformatting the logs. Make services log in the right format.
  - Log correlation ID.
  - Careful with timestamps and clock skew:
    - Tracing does not have this issue.
2. Metrics aggregation:
  - Look for a product that is built with high cardinality in mind, so you can attach more metadata/tags to the metrics.
  - Prometheus is low-cardinality.
  - Honeycomb or Lightstep.
3. Distributed tracing:
  - Pick a tool that supports OpenTelemetry.
4. Are we doing ok?
  - SLA.
  - SLO: at the team level.
  - SLI: indicator data to know if we are meeting an SLO.
  - Error budgets.
5. Alerting:
  - Biggest question: Should this problem cause someone to be woken up at 3am?
  - Avoid overalerting.
  - Good alert:
    - Relevant.
    - Unique.
    - Timely.
    - Prioritized.
    - Understandable.
    - Diagnostic: clear what is wrong.
    - Advisory: help understand what actions to take.
    - Focusing.
6. Semantic monitoring:
  - Are we selling “as usual”?
  - Are users logging in as usual?
  - Implemented with:
    1. Real user monitoring:
      - In the past.
      - Noisy.
    2. Synthetic transactions:
      - Catch issues before clients are aware.
7. Testing in production:
  1. Synthetic transactions.
  2. A/B testing.
  3. Canary releases.
  4. Parallel run.
  5. Smoke test.
  6. Chaos engineering.
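For the log aggregation points above (common format, services logging in the right format themselves, correlation ID), a minimal sketch follows. It assumes one JSON object per line as the common format, and the field names are a made-up convention rather than a standard:

```python
import json
from datetime import datetime, timezone


def log_line(service: str, correlation_id: str, message: str, now: datetime) -> str:
    """Render one log entry as a single JSON line."""
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "service": service,
            "correlation_id": correlation_id,  # passed along on every downstream call
            "message": message,
        },
        sort_keys=True,
    )


fixed_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
line = log_line("orders", "req-123", "order placed", fixed_time)
entry = json.loads(line)
```

Because every service emits the same shape, the forwarding agent can ship lines untouched, and a query on `correlation_id` reconstructs a call chain across services.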
Standardization:
- Important in monitoring and observability.
- Make it easy to do the right thing.
Select tools that:
- Democratic: everybody can use them.
- Easy to integrate: OpenTelemetry.
- Provide temporal, relative, relational and proportional context.
- Realtime.
- Suitable for your scale.
Be sceptical of ML/AI.
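As a worked example of the error budget idea listed under "Are we doing ok?" above: a sketch assuming an availability SLO measured over a fixed period. The 99.9% target, the 30-day period and the 12.5 minutes of downtime are illustrative numbers:

```python
def error_budget_minutes(slo: float, period_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo)


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
budget = error_budget_minutes(0.999, 30)

# Hypothetical: 12.5 minutes were already spent on an incident this period.
remaining = budget - 12.5
```

While `remaining` is positive, the team has room to take risks such as releasing changes. Once it reaches zero, the focus shifts to stability.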

Chapter 11 - Security

You are only as secure as your least secure aspect.
Core principles:
1. Least privilege.
2. Defense in depth:
  - Multiple protections.
  - Microservices provide more defense in depth than monoliths as:
    - They have smaller scope.
    - Can be segmented in networks.
3. Automation:
  - Recover.
  - Rotate keys.
Five functions of Cybersecurity:
1. Identify potential attackers, their targets and where you are most vulnerable:
  - As human beings, we are quite bad at understanding risks.
  - Threat modeling:
    - Focus on the whole system, not a subset.
2. Protect.
3. Detect.
4. Respond.
5. Recover.
Foundations of application security:
1. Credentials:
  - Credentials of users and operators are often the weakest point of our system:
    - In 2020, 80% of hacks due to it.
  - Troy Hunt, Passwords evolved:
    - Use password managers.
    - Avoid complex password rules.
    - Avoid mandated regular password changes.
  - git-secret and gitleaks.
2. Patching:
  - Containers: Aqua.
  - Libraries: Snyk or GitHub.
3. Backups.
4. Rebuilds:
  - Including restoration of backed up data.
Cost of any security implementation should be justified by your threat model.
Data in transit:
1. Server identity: HTTPS.
2. Client identity:
  - Mutual TLS.
  - API keys.
3. Visibility of data: HTTPS.
4. Manipulation of data:
  - HTTPS.
  - HMAC.
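A minimal sketch of using an HMAC to detect manipulated data, with Python's standard `hmac` module. The shared secret and the payload are placeholders. A real secret would come from a secrets store, not from source code:

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature of the body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Check the signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature)


SECRET = b"shared-secret-for-illustration-only"
body = b'{"order_id": "o-42", "total_cents": 1999}'
signature = sign(SECRET, body)
```

The sender transmits `body` together with `signature`. The receiver recomputes the signature and rejects the message if it does not match. Note that an HMAC protects integrity only: the body itself stays readable.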
Data at rest:
- Where do you store the encryption keys?
Do fine-grained authorization in service. It is ok for coarse-grained in the gateway.
Gateway generating a JWT per request.
Agile Application Security by Laura Bell.

Chapter 12 - Resilience

Resilience:
1. Robustness:
  - The ability to absorb expected perturbation.
  - As we increase robustness, we increase complexity.
2. Rebound:
  - Ability to recover after a traumatic event.
3. Graceful extensibility:
  - How well we deal with a situation that is unexpected.
  - Flatter orgs will often be better prepared.
  - Optimizations can increase brittleness.
4. Sustained adaptability:
  - Ability to continually adapt to changing envs, stakeholders and demands.
  - Requires a holistic view of the system.
  - Share information freely on an incident.
  - Culture to find time to learn from incidents.
Failure is everywhere:
- Spend less time trying to stop the inevitable and more dealing with it gracefully.
Responding very slowly is one of the worst failure modes you can experience.
CP (from CAP theorem) systems can’t fix all your problems, especially if you keep records of the real world.
Blame:
- Root cause analysis: it is surprising how often we want that root cause to be a human.
- If it is true that one person making a mistake can really bring an entire telco to its knees, you’d think that would say more about the telco than the individual.

Chapter 13 - Scaling

Four axes:
1. Vertical scaling:
  - In a public cloud provider, it is very fast to apply.
  - Little risk.
  - Does not improve robustness.
2. Horizontal duplication:
  - It does improve robustness.
3. Data partitioning:
  - It does not improve robustness.
4. Functional decomposition:
  - It does improve robustness.
  - Can choose tech most appropriate to the microservice load.
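Data partitioning (axis 3) can be sketched as routing each key to a partition by hashing it. This is one simple scheme among several, assuming a fixed number of partitions. The key format is hypothetical:

```python
import hashlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition index in a stable, process-independent way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


# Every request for the same customer lands on the same partition.
assignments = {f"customer-{i}": partition_for(f"customer-{i}", 4) for i in range(8)}
```

It also illustrates why partitioning alone does not improve robustness: if one partition is down, every key that maps to it is unavailable unless partitioning is combined with horizontal duplication.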
CQRS: one of the hardest forms of scaling.
Caching:
- Improves robustness.
- Client-side:
  - Main issue is invalidation and consistency.
  - Better latency.
- Server-side:
  - Main issue is that clients still need to make a network trip.
  - Latency improved for all clients.
- Server to return TTL.
- Conditional GET (ETag).
- Notification (pub/sub):
  - Consider adding a heartbeat.
- The ideal number of places to cache is zero.
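The conditional GET item can be illustrated with an in-memory simulation. There is no real HTTP involved: the `Server` and `CachingClient` classes are invented for this sketch, and the ETag is assumed to be a hash of the body:

```python
import hashlib
from typing import Optional, Tuple


class Server:
    def __init__(self, body: str) -> None:
        self.body = body

    def get(self, if_none_match: Optional[str] = None) -> Tuple[int, Optional[str], str]:
        """Return (status, body, etag); 304 with no body if the ETag still matches."""
        etag = hashlib.sha256(self.body.encode("utf-8")).hexdigest()
        if if_none_match == etag:
            return 304, None, etag
        return 200, self.body, etag


class CachingClient:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.cached_body: Optional[str] = None
        self.cached_etag: Optional[str] = None

    def fetch(self) -> Tuple[int, Optional[str]]:
        """Revalidate with the server; reuse the cached body on a 304."""
        status, body, etag = self.server.get(self.cached_etag)
        if status == 200:
            self.cached_body, self.cached_etag = body, etag
        return status, self.cached_body


server = Server("price=100")
client = CachingClient(server)
first = client.fetch()       # nothing cached yet: full response
second = client.fetch()      # unchanged: 304, body served from the cache
server.body = "price=120"
third = client.fetch()       # changed: full response again
```

The client still makes a network trip every time, but an unchanged resource costs only a small 304 response, and the client never serves stale data.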
At certain scaling points, you will need to rearchitect.

Part III - People

Chapter 14 - User Interfaces

Traditional reasons for dedicated FE teams:
1. Scarcity of specialists:
  - Share between teams:
    - Do the hard bits and teach the team to do the easy ones.
    - Enabling teams.
2. UI consistency:
  - Enabling teams.
  - Design System.
3. Technical challenge:
  1. Monolith FE.
  2. Micro FE:
    - Key problem: unlikely that 1 micro FE == 1 microservice, so micro FE ends up making multiple API calls to multiple BE services.
    - 2 types:
      1. Page-based decomposition:
        - We’ve lost a lot by automatically assuming that a web-based UI means a single-page app.
        - Forward traffic to microservice/micro FE depending on URL.
      2. Widget-based decomposition:
        - Needs an assembly layer to pull the parts together.
        - Options:
          - Iframes: issues with comms between widgets.
          - Server-side templating.
          - Dynamically inserted by the client.
        - Issues:
          - Dependencies:
            - Great to upgrade one widget’s dependencies at a time.
            - Duplication of libraries to download.
          - Communication using custom events:
            - Same as event-driven communication patterns.
  3. Central aggregating gateway:
    - Often the central aggregating gateway does so much that it ends up being owned by a dedicated team:
      - Potential contention and bottleneck.
    - When customizing a product built by someone else, you have to work in their world.
    - Strongly advise against using the filtering and aggregation capabilities of a dedicated API gateway.
  4. Backend for frontend (BFF):
    - Like (3) but one BFF per FE.
    - FE and its BFF owned by same team and very coupled.
    - One experience, one BFF.
    - Consider a BFF for external parties as a way to isolate APIs.
  5. GraphQL.

Chapter 15 - Organizational Structures

The biggest cost to working efficiently at scale is the need for coordination.
Conway’s law, team API, platform team, enabling teams: Team Topologies.
Automation is key.
Dunbar numbers.
Strong ownership can allow for more local variation than collective ownership.
Collective ownership can allow for more standardization, hence easier to move people/work around.
When creating an internal framework, it all starts with the best intentions.
Avoid internal framework unless you’ve exhausted your other options.
Do code reviews promptly: pair programming FTW!
Coming up with a vision without considering how your staff will feel about it and without considering what capabilities they have is likely to lead to a bad place.

Chapter 16 - The Evolutionary Architect

Architects and engineers have a rigor and discipline we could only dream of.
Many forms of IT certification are worthless, as we know little about what “good” looks like.
SW architects should focus on helping create a framework in which the right systems can emerge and continue to grow as we learn more.
SW arch as a town planner. (Same analogy as in Elements of Clojure).
Architecture is what happens, not what is planned.
Habitability: the characteristic of source code that enables programmers coming to the code later in its life to understand its construction and intentions and to change it comfortably and confidently.
Strategic goals -> Technical vision -> Principles -> Practices.
Architects as an enabling team:
- 2-3 full time architects plus tech leads.
Make Product Manager accountable for the technical quality of the system.
Key things to standardize:
- Monitoring.
- Interfaces between microservices.
- Architectural safety (error codes, circuit breakers, …).
- Governance:
  - Group activity, mostly people executing the work being governed.
  - Examples.
  - Microservices templates.
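As an example of the kind of architectural safety measure worth standardizing, here is a minimal circuit breaker sketch. The thresholds, the injected clock and the `CircuitOpenError` name are assumptions made for illustration. A production service would normally use an existing library:

```python
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float, clock: Callable[[], float]) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or reopen) the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result


def failing() -> None:
    raise TimeoutError("downstream too slow")


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(max_failures=2, reset_after=30.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("failed")
now[0] = 31.0
outcomes.append(breaker.call(lambda: "ok"))
```

After two failures the third call is rejected immediately instead of waiting on a slow downstream service. Once the reset period has passed, a trial call is let through and its success closes the circuit.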

Book notes: Building Microservices - Second edition