The Decomposition — Amazon's Monolith-to-Microservices Migration

Act I: The Monolith

A single block that everyone could reach into.

By 2001 Amazon was not a bookshop with a website. It was a fast-growing platform whose one shared codebase had become the bottleneck for everything it wanted to do next.

The retail engine had a name inside the company: Obidos. It was a monolith — one large program, backed by a shared database that almost any team could read directly. In the early days that was a feature. If the recommendations team needed catalogue data, the catalogue tables were right there. No integration, no waiting. Ship by lunchtime.

The trouble is what that convenience hides. When any team can read any other team's tables, every team's internal design silently becomes every other team's dependency — and nobody can see the web of those dependencies. The cost doesn't show up when you write the code. It shows up months later, when a team renames a column to support a new feature and a system three floors away breaks in production, for reasons no one can immediately explain.

This is the monolith's real tax, and it is not performance. It is coordination. To change anything safely, you have to know who depends on it, and in a shared-everything system that is everyone, invisibly. So teams coordinate to stay safe, and every change slows down. As the company adds teams, the coordination required to ship a single change grows faster than the team count itself.

obidos · the shared-everything monolith

flowchart TB
  subgraph M["Obidos · one process, one database"]
    direction TB
    C[Catalogue logic]:::c
    O[Ordering logic]:::c
    R[Recommendations]:::c
    P[Pricing]:::c
    DB[(Shared database<br/>every team reads every table)]:::db
    C --- DB
    O --- DB
    R --- DB
    P --- DB
    R -. reads catalogue tables directly .-> DB
    O -. reads pricing tables directly .-> DB
  end
  classDef c fill:#1E283A,stroke:#3A465E,color:#E8EBF1;
  classDef db fill:#2a1d12,stroke:#D9842A,color:#F2A24B;

The hidden coupling. The dotted lines are the problem: cross-team reads straight into shared tables. They never appear on an architecture diagram, yet each one is a dependency through which a routine change can break a distant team without warning.

Fact: Why is the coordination cost super-linear, not just linear?

Because the number of potential dependency relationships between teams grows roughly with the square of the team count: n teams have n(n - 1)/2 possible pairs. Ten teams have up to 45 pairwise coupling paths; fifty teams have up to 1,225. In a shared database, any of those pairs can become a real, invisible dependency. So the work required to change something safely climbs faster than headcount, which is exactly why "just add more engineers" stops working past a certain size.

Framing drawn from Conway's Law and team-topology analysis; the n² intuition is standard in distributed-systems teaching.
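The arithmetic is easy to check. A minimal sketch, assuming every pair of teams counts as one potential coupling path; the function name is illustrative and not taken from any Amazon source.

```python
def pairwise_paths(teams: int) -> int:
    """Number of distinct team pairs: n * (n - 1) / 2."""
    if teams < 0:
        raise ValueError("team count must be non-negative")
    return teams * (teams - 1) // 2


for n in (10, 50):
    print(n, "teams ->", pairwise_paths(n), "potential coupling paths")
# 10 teams -> 45 potential coupling paths
# 50 teams -> 1225 potential coupling paths
```

Headcount grew fivefold between the two cases; the number of paths grew roughly twenty-seven-fold.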

cost-of-change · drag to grow the company

Organisation size: 8 teams

The whole argument, in one chart. Services cost more up front (that flat overhead on the left) and pay you back only once the monolith's coordination curve overtakes them. Below the crossover, the monolith wins. Amazon in 2002 was well to the right of it — which is the single most important thing to understand before copying the decision.
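The crossover can be sketched with a toy model. Everything numeric here is an assumption chosen for illustration: the monolith's coordination cost is taken as proportional to the number of team pairs, the services cost as a flat platform overhead plus a linear per-team cost, and the constants are invented rather than measured.

```python
from typing import Optional


def monolith_cost(teams: int, cost_per_pair: float = 1.0) -> float:
    """Coordination cost, assumed proportional to the number of team pairs."""
    return cost_per_pair * teams * (teams - 1) / 2


def services_cost(teams: int, platform_overhead: float = 100.0,
                  cost_per_team: float = 5.0) -> float:
    """Assumed flat platform overhead plus a linear per-team cost."""
    return platform_overhead + cost_per_team * teams


def crossover(max_teams: int = 1000) -> Optional[int]:
    """Smallest team count at which the monolith becomes the costlier option."""
    for teams in range(1, max_teams + 1):
        if monolith_cost(teams) > services_cost(teams):
            return teams
    return None


print("crossover at", crossover(), "teams")
print("8 teams:", monolith_cost(8), "vs", services_cost(8))
```

With these made-up constants the curves cross at 21 teams, and an 8-team organisation sits firmly on the monolith side. Real constants would move the crossover; the shape of the argument is what carries over.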

Act II: The Decree

Six rules, and the threat of being fired.

Sometime around 2002, Jeff Bezos issued a mandate to every engineering team. It did not specify a single line of architecture. It specified how teams were allowed to communicate — and let Conway's Law do the rest.

The mandate survives publicly through Steve Yegge's 2011 "Platforms Rant" — a post he accidentally shared with the world. In his retelling, the rules ran roughly like this:

Rule 01

Expose through interfaces

All teams expose their data and functionality through service interfaces — nothing else counts as "available."

Rule 02

Talk only through them

Teams communicate with each other only through those interfaces. No side channels.

Rule 03

No back doors

No direct linking, no reading another team's data store, no shared-memory tricks. The interface or nothing.

Rule 04

Any technology

Protocol-agnostic — HTTP, an RPC layer, pub/sub, anything — as long as it's over the network, through the interface.

Rule 05

Design as if external

Every interface must be built to be exposed to outside developers. No assuming the caller is a friend.

Rule 06

Or you're fired

Compliance was mandatory. Not a recommendation. A condition of employment, enforced from the top.

Read the list again and notice what is not in it. There is no mention of microservices, no reference architecture, no diagram. Five of the six rules are a single idea stated five ways — stop reaching around the interface — and the sixth is the enforcement that made the other five actually happen.

That enforcement is the part most retellings underplay because it is the least comfortable. The technical content wasn't new; service-oriented architecture had been written about for a decade. What was new was the willingness to impose it absolutely, on a large organisation already succeeding with a monolith, and to make ignoring it career-ending. Yegge's line about that existing SOA lore: it was "about as useful as telling Indiana Jones to look both ways before crossing the street."

"Anyone who doesn't do this will be fired. Thank you; have a nice day!" — the mandate's closing line, as recalled by Steve Yegge, 2011

Why a mandate, and not a suggestion

Here is the uncomfortable lesson, stated plainly: some architectural changes cannot emerge bottom-up, because no single team is rewarded for absorbing a short-term cost that only pays off for the system as a whole, years later. Left optional, the hardest and most valuable decoupling — the kind that hurts this quarter — simply never happens. The mandate supplied the will from the top that no individual team could supply for itself.

But how do you migrate a live, revenue-critical business?

Not with a big-bang rewrite: that is among the highest-risk migration strategies there is, and the retail site could not stop earning money for a year. The answer is the strangler fig pattern: put a routing layer in front of the monolith, then peel off one capability at a time into its own service. Each request either hits a new service or falls through to Obidos until, capability by capability, the monolith is hollowed out and quietly retired. The transformation was still underway years later; in Yegge's telling it was "pretty far advanced" by mid-2005.

strangler fig · peel capabilities off a running monolith

flowchart LR
  R([Incoming requests]):::r --> RT{Routing layer}:::rt
  RT -->|catalogue| S1[Catalogue service]:::s
  RT -->|pricing| S2[Pricing service]:::s
  RT -->|not yet moved| M[Obidos monolith]:::m
  M -.-> S1
  M -.-> S2
  classDef r fill:#2a1d12,stroke:#F2A24B,color:#F2A24B;
  classDef rt fill:#171a2e,stroke:#9B8CFF,color:#cabfff;
  classDef s fill:#0f2a27,stroke:#54C7BD,color:#9fe9e1;
  classDef m fill:#241318,stroke:#EC6A52,color:#f4b3a7;

No flag day. The routing layer lets new services and the old monolith coexist. Risk is spread across years of small cutovers instead of concentrated in one terrifying weekend.
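The routing layer in the diagram can be sketched in a few lines. This is an illustration of the pattern, not Amazon's implementation: the handler names, the path prefixes, and the `StranglerRouter` class are all hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def obidos_monolith(path: str) -> str:
    return f"monolith handled {path}"


def catalogue_service(path: str) -> str:
    return f"catalogue service handled {path}"


def pricing_service(path: str) -> str:
    return f"pricing service handled {path}"


class StranglerRouter:
    """Sends migrated capabilities to new services, everything else to the monolith."""

    def __init__(self, fallback: Handler) -> None:
        self._fallback = fallback
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Cut one capability over; reversible by deleting the entry."""
        self._migrated[prefix] = handler

    def route(self, path: str) -> str:
        # Longest prefix first, so a more specific cutover wins.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        return self._fallback(path)  # not yet moved


router = StranglerRouter(fallback=obidos_monolith)
router.migrate("/catalogue", catalogue_service)
router.migrate("/pricing", pricing_service)
print(router.route("/catalogue/123"))  # catalogue service handled /catalogue/123
print(router.route("/orders/42"))      # monolith handled /orders/42
```

Each `migrate` call is one small cutover. Nothing outside the routing table changes, which is what keeps every individual step low-risk.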

Provenance: How much of this is verified, and how much is lore?

The existence and thrust of the mandate are well attested, both by many Amazon engineers and in Werner Vogels' public talks. The exact wording, the precise date, and the famous closing line come from Yegge's 2011 recollection, written years after the fact; Bezos never officially published the memo. Treat the rules as a faithful reconstruction of intent, not a photographed document. The principles don't depend on the punctuation being exact.

Primary source: Steve Yegge, "Google Platforms Rant" (2011). Corroboration: Werner Vogels, ACM Queue (2006) and AWS-era talks.

Act III: The Boundary

Where you draw the line is the whole design.

The mandate's true edge wasn't "use services." It was "a service's data is private to that service." The line between systems became the line between who owns which data — and the org chart followed.

Conway's Law, made physical

In 1967 Melvin Conway observed that systems end up mirroring the communication structure of the organisations that build them. Most companies experience this as an accident. Amazon used it as a tool: change the permitted communication structure — only via interfaces — and the architecture is forced to follow. Pair that with the "two-pizza team" (small enough to be fed by two pizzas) and you get a one-to-one mapping: one capability, one team, one service, one private data store.

conway's law · the org chart and the system are the same drawing

flowchart LR
  subgraph ORG["The organisation"]
    direction TB
    TA([Two-pizza team A]):::t
    TB([Two-pizza team B]):::t
    TC([Two-pizza team C]):::t
  end
  subgraph SYS["The system"]
    direction TB
    SA[Service A<br/>+ private store]:::s
    SB[Service B<br/>+ private store]:::s
    SC[Service C<br/>+ private store]:::s
    SA <-->|interface| SB
    SB <-->|interface| SC
  end
  TA ==> SA
  TB ==> SB
  TC ==> SC
  classDef t fill:#2a1d12,stroke:#D9842A,color:#F2A24B;
  classDef s fill:#0f2a27,stroke:#54C7BD,color:#9fe9e1;

Owns, builds, runs. Each team maps to exactly one service. Teams talk through interfaces; so do their services. The architecture is a reflection of the org — by design, not by accident.

The rule that did the real work

You can compress the famous six rules down to one and keep most of the value: no team may read another team's data store. Everything else enforces it or follows from it. A shared database is the back door that turns a "service" architecture into a costume — if two services read the same tables, they are coupled through the schema no matter how clean their interfaces look, and a schema change still breaks a distant consumer.

Make the data store private and the interface becomes the only coupling point. And interfaces, unlike schemas, can be versioned, documented, and evolved without breaking the people on the other side. That is what lets a team rip out its database, denormalise, or re-shard at will — none of it is visible across the boundary, because none of it is reachable.
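A small sketch of why a private store buys that freedom. The service, both storage classes, and the response shape are hypothetical; the point is only that callers see the same answer before and after the team re-shards its data.

```python
from typing import Dict, List, Optional, Protocol


class PriceStore(Protocol):
    """What the service needs from storage. Never exposed to callers."""
    def get(self, sku: str) -> Optional[int]: ...


class SingleTableStore:
    def __init__(self, prices: Dict[str, int]) -> None:
        self._prices = dict(prices)

    def get(self, sku: str) -> Optional[int]:
        return self._prices.get(sku)


class ShardedStore:
    """The same data re-sharded by SKU; a purely internal change."""
    def __init__(self, prices: Dict[str, int], shards: int = 4) -> None:
        self._shards: List[Dict[str, int]] = [{} for _ in range(shards)]
        for sku, cents in prices.items():
            self._shard_for(sku)[sku] = cents

    def _shard_for(self, sku: str) -> Dict[str, int]:
        # Deterministic toy hash; a real system would choose something better.
        return self._shards[sum(sku.encode()) % len(self._shards)]

    def get(self, sku: str) -> Optional[int]:
        return self._shard_for(sku).get(sku)


class PricingService:
    """The only thing other teams may call."""
    def __init__(self, store: PriceStore) -> None:
        self._store = store

    def get_price_v1(self, sku: str) -> dict:
        cents = self._store.get(sku)
        if cents is None:
            return {"version": 1, "sku": sku, "error": "unknown_sku"}
        return {"version": 1, "sku": sku, "price_cents": cents}


prices = {"B001": 1999, "B002": 450}
before = PricingService(SingleTableStore(prices))
after = PricingService(ShardedStore(prices))
print(before.get_price_v1("B001") == after.get_price_v1("B001"))  # True
```

Had another team been reading the table directly, the move to `ShardedStore` would have broken it. Behind the interface, the change is invisible.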

real vs fake · the difference is the database

flowchart TB
  subgraph BAD["✕ Distributed monolith — services over a shared DB"]
    direction TB
    a1[Service A]:::bad --> sdb[(Shared DB)]:::baddb
    a2[Service B]:::bad --> sdb
    a3[Service C]:::bad --> sdb
  end
  subgraph GOOD["✓ Real services — private stores, interfaces only"]
    direction TB
    b1[Service A]:::good --> d1[(A's store)]:::gooddb
    b2[Service B]:::good --> d2[(B's store)]:::gooddb
    b3[Service C]:::good --> d3[(C's store)]:::gooddb
    b1 <-->|interface| b2
    b2 <-->|interface| b3
  end
  classDef bad fill:#241318,stroke:#EC6A52,color:#f4b3a7;
  classDef baddb fill:#241318,stroke:#EC6A52,color:#EC6A52;
  classDef good fill:#0f2a27,stroke:#54C7BD,color:#9fe9e1;
  classDef gooddb fill:#0f2a27,stroke:#54C7BD,color:#54C7BD;

The boundary is the database, not the diagram. Teams love to draw clean service boxes and then quietly share one database underneath — and wonder why a schema change still breaks three teams. Privacy of data, not tidiness of boxes, is what makes services real.
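One way to look underneath the boxes is to list which tables each service actually touches and flag any table reached by more than one. A minimal sketch; the access map below is invented for illustration, and in practice it would have to come from query logs or database grants.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_tables(access: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each table touched by more than one service to those services."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for service, tables in access.items():
        for table in tables:
            users[table].add(service)
    return {
        table: sorted(services)
        for table, services in sorted(users.items())
        if len(services) > 1
    }


# Hypothetical access map: service name -> tables it reads or writes.
access = {
    "ordering": {"orders", "prices"},
    "pricing": {"prices"},
    "catalogue": {"items"},
}
print(shared_tables(access))  # {'prices': ['ordering', 'pricing']}
```

Every entry in the result is a schema-level coupling that no interface diagram will show.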

The single most consequential decision

It is the ban on the shared database. "Use services" is advice anyone can nod at and quietly ignore. "No team may read another team's data store" is an enforceable rule that makes the boundary real — and every durable benefit of the architecture traces back to it.

The architecture this produces

Follow the rule across a whole company and a recognisable shape emerges: many small services, each owning its data, each behind a hardened interface, composed into request paths, sitting on a shared platform substrate — discovery, messaging, storage, compute, observability — that makes distribution survivable. The quiet twist: that substrate was itself built as services, to the same "design as if external" standard. Which is exactly why it could later be sold.

system architecture · services on a shared substrate

flowchart TB
  U([Customer request]):::u --> FE[Front-end / page composition]:::svc
  FE --> CAT[Catalogue]:::svc
  FE --> PR[Pricing]:::svc
  FE --> INV[Inventory]:::svc
  FE --> REC[Recommendations]:::svc
  FE --> REV[Reviews]:::svc
  CAT --> cdb[(cat store)]:::store
  PR --> pdb[(price store)]:::store
  INV --> idb[(inv store)]:::store
  subgraph SUB["Platform substrate — also built as services"]
    direction LR
    DISC{{Service discovery}}:::sub
    MSG{{Messaging · queues}}:::sub
    OBS{{Observability · tracing}}:::sub
    STOR{{Storage · compute}}:::sub
  end
  FE -.-> OBS
  CAT -.-> DISC
  REC -.-> MSG
  classDef u fill:#2a1d12,stroke:#F2A24B,color:#F2A24B;
  classDef svc fill:#1E283A,stroke:#54C7BD,color:#cfeeea;
  classDef store fill:#162033,stroke:#3A465E,color:#9fb0c8;
  classDef sub fill:#171a2e,stroke:#9B8CFF,color:#cabfff;

Five calls to draw one page. The front-end composes a product page from independent services, each with a private store. The dotted lines into the substrate are what make it operable — without discovery, messaging, and tracing, this graph is unrunnable.

The interface is the product — even internally

"Design every internal interface as if a stranger will use it" sounds like a style note. It is actually the highest-leverage rule in the mandate, for a non-obvious reason: it removes the option to rely on shared context. An internal-only interface can quietly assume the caller shares your types, your database, your assumptions. An externalisable one cannot — it must hide its implementation, be explicit about errors and timeouts, and be versioned so it can change without breaking anyone.

Apply that discipline everywhere and you end up with interfaces robust enough that exposing one to the outside world is an act of flipping a switch, not a rewrite. Hold that thought. It is the seed of Act V.

Trade-off: What does "private data store" cost you?

A lot, and it must be taught honestly. Data that lived in one schema and could be joined in a single query is now split across services — so you lose the cross-service join and the cross-service transaction. A change spanning two services' data can't be made atomic with a database transaction; you reach for sagas, idempotency, compensating actions, and eventual consistency. Services also end up keeping local, slightly-stale copies of data they need from others. You're trading consistency for autonomy and availability — the same CAP-shaped bargain that runs through every distributed system.

See: Kleppmann, Designing Data-Intensive Applications, on consistency and the limits of distributed transactions.
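Of the tools named above, idempotency is the simplest to show. A sketch under stated assumptions: `PaymentService` and its idempotency key are hypothetical, and results are kept in memory where a real service would persist them.

```python
from typing import Dict


class PaymentService:
    """Remembers each request by key so a retry cannot charge twice."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Same request seen before: return the stored outcome, do nothing new.
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


payments = PaymentService()
first = payments.charge("order-42", 1999)
retry = payments.charge("order-42", 1999)  # e.g. the caller timed out and retried
print(first == retry, payments.charges_made)  # True 1
```

Without a transaction spanning both sides, the caller cannot know whether a timed-out request took effect. Making the retry safe is what lets it simply try again.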

Act IV: The Cost

You don't escape complexity. You relocate it.

The monolith's risk was "one change can break everything in one process." Services trade that for "any dependency can fail on its own, and failures can climb the graph." Decomposition doesn't delete systemic risk — it reshapes it, and hands you a distributed-systems bill.

The moment two components talk only over the network, you inherit the whole catalogue of distributed-systems problems the monolith never had: partial failure, accumulating latency, versioning, debugging that spans dozens of machines, and a new signature failure mode — the cascade.

The cascade is the one to fear

It works like this. A service deep in the graph slows down. Its callers block, waiting, holding threads and connections. Those callers then slow, so their callers block. The stall climbs the dependency graph until something the customer can see falls over — even though the original fault was small and far away. Nobody designed the outage; it emerged from independently reasonable components under load.

The fix isn't the failing service's job. It's every caller's job — to assume its dependencies will fail and to have decided, in advance, what happens when they do. Try it below.

failure simulator · tap a service to break it

Legend: healthy · failing · degraded but up · essential

Resilience lives in the callers. With patterns on — short timeouts, circuit breakers, fallbacks — a failing non-essential service (reviews, recommendations) just drops off the page. Turn them off and the same small failure takes the whole page down. Break an essential service (catalogue, pricing) and you see the difference between a controlled error and an outage.

Graceful degradation, decided in advance

The healthy pattern is to sort every dependency on a request path into essential and non-essential before anything breaks. A product page must have catalogue and price; it can live without recommendations and reviews. Encode that — short timeouts and fallbacks on the non-essential — and the page renders something useful even when half its dependencies are sick. It is the services-world version of Apollo's load-shedding: when you can't do everything, protect the critical path and drop the rest.
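That sorting can be encoded directly in the page composer. A minimal sketch, assuming four hypothetical dependency calls, a 0.2-second budget, and simulated failures; none of the names come from a real Amazon interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Hypothetical dependency calls standing in for network requests.
def fetch_catalogue(sku: str) -> dict:
    return {"title": f"Item {sku}"}


def fetch_price(sku: str) -> dict:
    return {"price_cents": 1999}


def fetch_reviews(sku: str) -> dict:
    time.sleep(1.0)  # simulate a slow, sick dependency
    return {"reviews": ["great"]}


def fetch_recommendations(sku: str) -> dict:
    raise ConnectionError("recommendations unavailable")


ESSENTIAL = {"catalogue": fetch_catalogue, "price": fetch_price}
# Each non-essential dependency carries the default to show when it fails.
NON_ESSENTIAL = {
    "reviews": (fetch_reviews, {"reviews": []}),
    "recommendations": (fetch_recommendations, {"recommendations": []}),
}


def render_page(sku: str, timeout_s: float = 0.2) -> dict:
    pool = ThreadPoolExecutor(max_workers=len(ESSENTIAL) + len(NON_ESSENTIAL))
    try:
        required = {n: pool.submit(fn, sku) for n, fn in ESSENTIAL.items()}
        optional = {n: pool.submit(fn, sku) for n, (fn, _) in NON_ESSENTIAL.items()}
        page: dict = {"degraded": []}
        for name, future in required.items():
            # Essential: a timeout or error propagates to the caller,
            # which can then return a controlled error page.
            page[name] = future.result(timeout=timeout_s)
        for name, future in optional.items():
            try:
                page[name] = future.result(timeout=timeout_s)
            except Exception:  # timed out, or the dependency raised
                page[name] = NON_ESSENTIAL[name][1]
                page["degraded"].append(name)
        return page
    finally:
        pool.shutdown(wait=False)  # do not hold the response for stragglers


page = render_page("B001")
print(page["catalogue"], page["price"], page["degraded"])
```

The page still carries title and price, with reviews and recommendations replaced by empty defaults. A production version would more likely share one deadline across all calls than give each its own timeout.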

Circuit breaker

Stop calling a failing dependency for a while, so you don't pile up requests on something already on fire.

Bulkhead

Isolate resources per dependency, so one drowning call can't exhaust the threads the rest of the page needs.

Timeout + fallback

Give up fast and show a sensible default, instead of blocking the whole request on one slow hop.
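The first of these, the circuit breaker, fits in one small class. This is a sketch rather than a production implementation: the threshold, the cool-down, and the class names are assumptions, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("dependency skipped while circuit is open")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency in its own breaker and treat `CircuitOpenError` like any other failure: fall back for a non-essential dependency, return a controlled error for an essential one.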

Consistency without a transaction

Once each service owns its own database, you lose the one tool that made multi-step changes safe: the cross-service transaction. There is no BEGIN … COMMIT that spans two services. So you replace it with a saga — a sequence of local steps, each with a compensating action that undoes it if a later step fails. It is eventual consistency, made operational. Watch what happens when the inventory step fails:

saga · compensate instead of roll back

sequenceDiagram
  autonumber
  participant O as Order
  participant P as Payment
  participant I as Inventory
  O->>P: reserve funds
  P-->>O: reserved ✓
  O->>I: reserve stock
  I-->>O: OUT OF STOCK ✗
  Note over O,P: no cross-service transaction to undo
  O->>P: compensate — release funds
  P-->>O: released ✓
  Note over O,I: order fails cleanly — each store stays consistent on its own

The cost of autonomy, made concrete. What a monolith did in one atomic transaction now takes an explicit, failure-aware choreography. More moving parts — but each service keeps full control of its own data.
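The same sequence can be written as code. A minimal sketch in which the order side drives the steps, as in the diagram; `Payment`, `Inventory`, and `run_saga` are hypothetical names, and the calls are in-process stand-ins for network requests.

```python
from typing import Callable, List, Tuple


class OutOfStock(Exception):
    pass


class Payment:
    def __init__(self) -> None:
        self.reserved = 0

    def reserve(self, cents: int) -> None:
        self.reserved += cents

    def release(self, cents: int) -> None:
        self.reserved -= cents


class Inventory:
    def __init__(self, stock: int) -> None:
        self.stock = stock

    def reserve(self, qty: int) -> None:
        if qty > self.stock:
            raise OutOfStock("out of stock")
        self.stock -= qty

    def release(self, qty: int) -> None:
        self.stock += qty


# (name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run local steps in order; on failure, undo completed ones in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return False
        log.append(f"{name}: done")
        completed.append((name, compensate))
    return True


payment, inventory, log = Payment(), Inventory(stock=0), []
ok = run_saga([
    ("reserve funds", lambda: payment.reserve(1999), lambda: payment.release(1999)),
    ("reserve stock", lambda: inventory.reserve(1), lambda: inventory.release(1)),
], log)
print(ok, payment.reserved, log)
```

The sketch assumes compensations always succeed. Over a real network they can fail too, which is why they are usually made idempotent and retried.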

And the bill keeps coming after launch

The cascade is the dramatic cost. The quiet ones are just as real. Every team now runs its own services in production — Werner Vogels' famous "you build it, you run it." That aligns incentives beautifully: the people who feel a bad design at 3 a.m. are the people who can fix it. But it only works if the platform makes each team's slice of operations small. Where the substrate is missing, "you build it, you run it" becomes "you build it, you suffer," and teams quietly rebuild back doors to escape the pain — which is how a services architecture rots back into a monolith.

The quiet worst case

The distributed monolith. Services on paper; a shared database and lock-step releases underneath. You pay every cost of distribution and collect none of the benefits — and you usually don't notice until one schema change breaks five teams at once. It is the most common way "doing microservices" goes wrong.

Act V: The By-Product

Strategy fell out of the architecture.

Nobody set out to build a cloud business. The "design as if external" rule meant the internal plumbing was already shaped like a product. Exposing it was a switch, not a project.

Here is the chain of consequence, and it is worth sitting with because it is one of the great examples of strategy emerging from a technical decision. Rule 05 forced every internal service — including the substrate of storage, compute, and queues — to be built to an externalisable standard. Once storage was already a clean, documented, network-addressable service that assumed nothing about its caller, the distance between "our internal storage service" and "S3, a thing anyone on earth can buy" was an act of exposure, not engineering.

So AWS is, in a real sense, the platform substrate turned outward. SQS appeared in 2004; S3 and EC2 in 2006. The mandate that looked like an internal tax turned out to be the R&D for a business that now underwrites much of modern computing.

The interface is the product

The discipline you would only ever apply to an external product, applied internally, is what turned plumbing into a platform. That is the counter-intuitive heart of the whole story: the rule that cost the most up front — "treat every internal call as if a stranger makes it" — is the one that made AWS thinkable.

timeline · from a monolith to a cloud business

1994–95

Obidos is born

Amazon grows on a single retail monolith and a shared database — fast to build on, until it isn't.

2001

The bottleneck bites

Explosive growth turns the shared codebase into the coordination chokepoint for every new initiative.

~2002

The mandate

Bezos forbids back doors and shared databases; CIO Rick Dalzell drives enforcement. Date approximate — known publicly only via later recollection.

2002–05

Incremental decomposition

Service by service, the monolith is strangled. Two-pizza teams formalise the one-team-one-service model.

2004

SQS — the first tell

An externalised, internal-style queue service ships publicly. The by-product is becoming visible.

2006

S3 & EC2 — AWS is a business

Storage (March) and compute (August) go public. The substrate, turned outward, becomes an industry. Vogels articulates "you build it, you run it."

2011

The story goes public

Steve Yegge's "Platforms Rant" accidentally ships to the world, becoming the canonical account of the 2002 decree.

Ember marks the decree; teal marks the consequences. The roughly four-year gap between the mandate and AWS is the cost-of-change curve being paid down, and then monetised.

~2002

the mandate (± a year)

6

rules, five saying one thing

2 pizzas

max team size

2006

S3 & EC2 go public

Act VI: The Verdict

The right answer depends entirely on your scale.

Amazon's decision was correct — for Amazon. The most valuable thing a working engineer can take from it is not "do this," but "know the scale at which this becomes correct," and the honesty to admit when you're not there yet.

Flip the comparison below. The same architecture is a triumph in one column and a self-inflicted wound in the other; the only thing that changed is the organisation looking at it.

trade-off matrix · choose a lens

Legend: strength · weakness · neutral

Axis	Monolith	What it means
Delivery as teams grow		Can teams ship in parallel, or do they queue behind each other?
In-process performance		Fast function calls vs network hops that accumulate latency.
Failure isolation		Does one fault take down everything, or just one capability?
Consistency		One database and one transaction vs sagas and eventual consistency.
Operational burden		Centralised ops vs every team running its own service.
Debugging		One process and a debugger vs tracing across many machines.
New products / optionality		Compose existing services into something new — the AWS move.
Right scale		Small org / early product vs large org / many teams / sustained growth.

No column is all green. Services don't dominate monoliths; they trade immediate, legible costs (latency, ops, consistency) for delayed, diffuse benefits (parallel delivery, isolation, optionality). The trade only pays once you're past the crossover from Act I.

The strongest argument against copying Amazon

It's called monolith-first, and it's right far more often than microservice enthusiasm admits. For most organisations, starting with microservices is a mistake: you pay the operational and distributed-systems costs immediately, while the benefits only arrive at a scale most teams never reach. A twelve-person startup that copies the 2002 mandate gets all of the latency, ops burden, and eventual-consistency pain, and approximately none of the parallel-delivery payoff, because it doesn't have enough teams to be slowed by a monolith in the first place.

The reconciliation is simply scale. Amazon was already far past the point where monolith coordination cost dominated. The lesson for an engineer with two to ten years of experience is to locate your own organisation on that curve honestly, and to be willing to say "we are not Amazon, and a well-structured modular monolith is the wiser call here."

What to actually carry away

Principle

Boundaries beat intentions

Soft module boundaries inside a shared process always erode. An enforceable boundary — the network plus a private store — is what makes decoupling stick.

Principle

Conway's Law is a tool

You can't choose an architecture independently of the org that builds it. Change how teams may communicate and the system reorganises itself.

Principle

Resilience lives in callers

Assume every dependency will fail, and decide in advance what happens when it does. That habit is the whole difference between degradation and outage.

Principle

Cost now, benefit later

Judge the decision on the slope of the cost-of-change curve as you grow — never on this quarter's velocity, or you'll kill the right call in year one.

Take-home challenge

Run a coupling & boundary audit of a system you work on. Find where teams reach across boundaries — shared databases, direct links, lock-step releases. Pick one such coupling and predict which change would break a distant team; propose the interface that removes the risk. Mark where a service boundary disagrees with data ownership. Then decide, with explicit reference to your organisation's scale, whether more decomposition is actually warranted — or whether you're at the size where a modular monolith is wiser. Grade yourself on whether data ownership drove the boundary, and whether your scale argument is honest rather than cargo-culted from Amazon.

Amazon changed its architecture by changing the rules of how its teams were allowed to talk — and got microservices and a cloud business as the same act. — the one-sentence version

The mandate that fractured the monolith.
