Act I: The Monolith
A single block that everyone could reach into.
By 2001 Amazon was not a bookshop with a website. It was a fast-growing platform whose one shared codebase had become the bottleneck for everything it wanted to do next.
The retail engine had a name inside the company: Obidos. It was a monolith — one large program, backed by a shared database that almost any team could read directly. In the early days that was a feature. If the recommendations team needed catalogue data, the catalogue tables were right there. No integration, no waiting. Ship by lunchtime.
The trouble is what that convenience hides. When any team can read any other team's tables, every team's internal design silently becomes every other team's dependency — and nobody can see the web of those dependencies. The cost doesn't show up when you write the code. It shows up months later, when a team renames a column to support a new feature and a system three floors away breaks in production, for reasons no one can immediately explain.
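The failure described above can be reproduced in miniature. The sketch below is illustrative only: the table, column, and function names are hypothetical, and an in-memory SQLite database stands in for the shared database. One "team" reads another team's table directly, and an innocent rename breaks it.

```python
import sqlite3

# Hypothetical shared database: the catalogue team owns this table,
# but any other team can query it directly.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE catalogue (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO catalogue (id, title) VALUES (1, 'Dune')")


def recommend_titles(conn):
    """Recommendations team: reads the catalogue table directly."""
    return [row[0] for row in conn.execute("SELECT title FROM catalogue")]


def rename_title_column(conn):
    """Catalogue team: renames `title` to `display_title` for a new feature.

    The table is rebuilt rather than altered in place, so the sketch does
    not depend on the SQLite version supporting RENAME COLUMN.
    """
    conn.execute(
        "CREATE TABLE catalogue_new (id INTEGER PRIMARY KEY, display_title TEXT)"
    )
    conn.execute("INSERT INTO catalogue_new SELECT id, title FROM catalogue")
    conn.execute("DROP TABLE catalogue")
    conn.execute("ALTER TABLE catalogue_new RENAME TO catalogue")


before = recommend_titles(db)   # works: the column exists

rename_title_column(db)         # a reasonable local change...

try:
    recommend_titles(db)        # ...breaks a consumer the owner never knew about
    broke = False
except sqlite3.OperationalError as exc:
    broke = True
    error_message = str(exc)
```

Nothing in the catalogue team's own code or tests would have flagged the problem, because the dependency lives entirely in the other team's query.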
This is the monolith's real tax, and it is not performance. It is coordination. To change anything safely, you have to know who depends on it — and in a shared-everything system, that is everyone, invisibly. So changes slow down. To stay safe, teams coordinate. As the company adds teams, the coordination required to ship a single change grows faster than the team count itself.
flowchart TB
subgraph M["Obidos · one process, one database"]
direction TB
C[Catalogue logic]:::c
O[Ordering logic]:::c
R[Recommendations]:::c
P[Pricing]:::c
DB[(Shared database<br/>every team reads every table)]:::db
C --- DB
O --- DB
R --- DB
P --- DB
R -. reads catalogue tables directly .-> DB
O -. reads pricing tables directly .-> DB
end
classDef c fill:#1E283A,stroke:#3A465E,color:#E8EBF1;
classDef db fill:#2a1d12,stroke:#D9842A,color:#F2A24B;
Fact: Why is the coordination cost super-linear, not just linear?
Because the number of potential dependency relationships between teams grows roughly with the square of the team count. Ten teams have up to 45 pairwise coupling paths; fifty teams have 1,225. In a shared database, any of those pairs can become a real, invisible dependency. So the work required to change something safely climbs faster than headcount, which is exactly why "just add more engineers" stops working past a certain size.
Framing drawn from Conway's Law and team-topology analysis; the n² intuition is standard in distributed-systems teaching.
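The arithmetic behind that claim is the count of distinct pairs among n teams, n(n-1)/2. A minimal sketch, with a hypothetical function name:

```python
def pairwise_paths(teams: int) -> int:
    """Number of distinct team pairs: n * (n - 1) / 2."""
    return teams * (teams - 1) // 2


for n in (10, 50, 100):
    print(n, pairwise_paths(n))
# prints:
# 10 45
# 50 1225
# 100 4950
```

Multiplying the team count by five, from ten to fifty, multiplies the potential coupling paths by about twenty-seven.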
Act II: The Decree
Six rules, and the threat of being fired.
Sometime around 2002, Jeff Bezos issued a mandate to every engineering team. It did not specify a single line of architecture. It specified how teams were allowed to communicate — and let Conway's Law do the rest.
The mandate survives publicly through Steve Yegge's 2011 "Platforms Rant" — a post he accidentally shared with the world. In his retelling, the rules ran roughly like this:
1. All teams expose their data and functionality through service interfaces; nothing else counts as "available."
2. Teams communicate with each other only through those interfaces. No side channels.
3. No direct linking, no reading another team's data store, no shared-memory tricks. The interface or nothing.
4. Protocol-agnostic: HTTP, an RPC layer, pub/sub, anything, as long as it's over the network, through the interface.
5. Every interface must be built to be exposed to outside developers. No assuming the caller is a friend.
6. Compliance was mandatory. Not a recommendation. A condition of employment, enforced from the top.
Read the list again and notice what is not in it. There is no mention of microservices, no reference architecture, no diagram. Five of the six rules are a single idea stated five ways — stop reaching around the interface — and the sixth is the enforcement that made the other five actually happen.
That enforcement is the part most retellings underplay because it is the least comfortable. The technical content wasn't new; service-oriented architecture had been written about for a decade. What was new was the willingness to impose it absolutely, on a large organisation already succeeding with a monolith, and to make ignoring it career-ending. Yegge's line about that existing SOA lore: it was "about as useful as telling Indiana Jones to look both ways before crossing the street."
"Anyone who doesn't do this will be fired. Thank you; have a nice day!" — the mandate's closing line, as recalled by Steve Yegge, 2011
Why a mandate, and not a suggestion
Here is the uncomfortable lesson, stated plainly: some architectural changes cannot emerge bottom-up, because no single team is rewarded for absorbing a short-term cost that only pays off for the system as a whole, years later. Left optional, the hardest and most valuable decoupling — the kind that hurts this quarter — simply never happens. The mandate supplied the will from the top that no individual team could supply for itself.
But how do you migrate a live, revenue-critical business?
Not with a big-bang rewrite: that is among the riskiest migration strategies there is, and the retail site could not stop earning money for a year. The answer is the strangler fig pattern: put a routing layer in front of the monolith, then peel off one capability at a time into its own service. Each request either hits a new service or falls through to Obidos until, capability by capability, the monolith is hollowed out and quietly retired. The transformation was still underway years later; in Yegge's telling it was "pretty far advanced" by mid-2005.
flowchart LR
R([Incoming requests]):::r --> RT{Routing layer}:::rt
RT -->|catalogue| S1[Catalogue service]:::s
RT -->|pricing| S2[Pricing service]:::s
RT -->|not yet moved| M[Obidos monolith]:::m
M -.-> S1
M -.-> S2
classDef r fill:#2a1d12,stroke:#F2A24B,color:#F2A24B;
classDef rt fill:#171a2e,stroke:#9B8CFF,color:#cabfff;
classDef s fill:#0f2a27,stroke:#54C7BD,color:#9fe9e1;
classDef m fill:#241318,stroke:#EC6A52,color:#f4b3a7;
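The routing layer in the diagram can be sketched in a few lines. This is an illustration of the strangler fig idea under simple assumptions, not a description of Amazon's actual router: the class, the handler functions, and the URL shapes are all hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def obidos_monolith(path: str) -> str:
    """Stand-in for the legacy monolith, which can still handle everything."""
    return f"monolith handled {path}"


def catalogue_service(path: str) -> str:
    return f"catalogue service handled {path}"


def pricing_service(path: str) -> str:
    return f"pricing service handled {path}"


class StranglerRouter:
    """Routes by capability; anything not yet migrated falls through."""

    def __init__(self, fallback: Handler) -> None:
        self._fallback = fallback
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, capability: str, handler: Handler) -> None:
        """Cut one capability over to its new service."""
        self._migrated[capability] = handler

    def handle(self, path: str) -> str:
        # Assumption: the first path segment names the capability,
        # e.g. "/pricing/42" belongs to "pricing".
        capability = path.strip("/").split("/", 1)[0]
        handler = self._migrated.get(capability, self._fallback)
        return handler(path)


router = StranglerRouter(fallback=obidos_monolith)
router.migrate("catalogue", catalogue_service)
router.migrate("pricing", pricing_service)

print(router.handle("/catalogue/123"))  # catalogue service handled /catalogue/123
print(router.handle("/orders/9"))       # monolith handled /orders/9
```

Each call to `migrate` is one small, reversible cut-over, which is the property that makes the pattern safer than a single rewrite.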
Provenance: How much of this is verified, and how much is lore?
The existence and thrust of the mandate are well attested across many Amazon engineers and Werner Vogels' public talks. The exact wording, the precise date, and the famous closing line come from Yegge's 2011 recollection, written years after the fact — Bezos never officially published the memo. Treat the rules as a faithful reconstruction of intent, not a photographed document. The principles don't depend on the punctuation being exact.
Primary source: Steve Yegge, "Google Platforms Rant" (2011). Corroboration: Werner Vogels, ACM Queue (2006) and AWS-era talks.
Act III: The Boundary
Where you draw the line is the whole design.
The mandate's true edge wasn't "use services." It was "a service's data is private to that service." The line between systems became the line between who owns which data — and the org chart followed.
Conway's Law, made physical
In 1967 Melvin Conway observed that systems end up mirroring the communication structure of the organisations that build them. Most companies experience this as an accident. Amazon used it as a tool: change the permitted communication structure — only via interfaces — and the architecture is forced to follow. Pair that with the "two-pizza team" (small enough to be fed by two pizzas) and you get a one-to-one mapping: one capability, one team, one service, one private data store.
flowchart LR
subgraph ORG["The organisation"]
direction TB
TA([Two-pizza team A]):::t
TB([Two-pizza team B]):::t
TC([Two-pizza team C]):::t
end
subgraph SYS["The system"]
direction TB
SA[Service A<br/>+ private store]:::s
SB[Service B<br/>+ private store]:::s
SC[Service C<br/>+ private store]:::s
SA <-->|interface| SB
SB <-->|interface| SC
end
TA ==> SA
TB ==> SB
TC ==> SC
classDef t fill:#2a1d12,stroke:#D9842A,color:#F2A24B;
classDef s fill:#0f2a27,stroke:#54C7BD,color:#9fe9e1;
The rule that did the real work
You can compress the famous six rules down to one and keep most of the value: no team may read another team's data store. Everything else enforces it or follows from it. A shared database is the back door that turns a "service" architecture into a costume — if two services read the same tables, they are coupled through the schema no matter how clean their interfaces look, and a schema change still breaks a distant consumer.
Make the data store private and the interface becomes the only coupling point. And interfaces, unlike schemas, can be versioned, documented, and evolved without breaking the people on the other side. That is what lets a team rip out its database, denormalise, or re-shard at will — none of it is visible across the boundary, because none of it is reachable.
flowchart TB
subgraph BAD["✕ Distributed monolith — services over a shared DB"]
direction TB
a1[Service A]:::bad --> sdb[(Shared DB)]:::baddb
a2[Service B]:::bad --> sdb
a3[Service C]:::bad --> sdb
end
subgraph GOOD["✓ Real services — private stores, interfaces only"]
direction TB
b1[Service A]:::good --> d1[(A's store)]:::gooddb
b2[Service B]:::good --> d2[(B's store)]:::gooddb
b3[Service C]:::good --> d3[(C's store)]:::gooddb
b1 <-->|interface| b2
b2 <-->|interface| b3
end
classDef bad fill:#241318,stroke:#EC6A52,color:#f4b3a7;
classDef baddb fill:#241318,stroke:#EC6A52,color:#EC6A52;
classDef good fill:#0f2a27,stroke:#54C7BD,color:#9fe9e1;
classDef gooddb fill:#0f2a27,stroke:#54C7BD,color:#54C7BD;
The rule that does the real work is the ban on the shared database. "Use services" is advice anyone can nod at and quietly ignore. "No team may read another team's data store" is an enforceable rule that makes the boundary real, and every durable benefit of the architecture traces back to it.
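One way to picture the private-store rule in code is a service whose storage layout is invisible to callers, with a small published contract at the edge. Everything named here (`CatalogueService`, `ProductV1`, `ProductNotFound`, the row layout) is a hypothetical illustration. In a real system the boundary is enforced by the network and by credentials, not by a leading underscore.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProductV1:
    """The published contract. Callers depend on this, not on storage."""
    product_id: str
    title: str
    price_cents: int


class ProductNotFound(Exception):
    """An explicit, documented error instead of a leaked storage exception."""


class CatalogueService:
    def __init__(self) -> None:
        # Private store: nothing outside this class reads it, so the layout
        # (nested dicts, internal field names) is free to change.
        self._rows: Dict[str, dict] = {
            "b-1": {"display_title": "Dune", "price": {"amount": 999, "ccy": "USD"}},
        }

    def get_product(self, product_id: str) -> ProductV1:
        row = self._rows.get(product_id)
        if row is None:
            raise ProductNotFound(product_id)
        # Translate the internal layout to the public contract at the boundary.
        return ProductV1(
            product_id=product_id,
            title=row["display_title"],
            price_cents=row["price"]["amount"],
        )


product = CatalogueService().get_product("b-1")
```

If the team later renames `display_title` or re-shards the store, only the translation inside `get_product` changes; callers holding a `ProductV1` see nothing.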
The architecture this produces
Follow the rule across a whole company and a recognisable shape emerges: many small services, each owning its data, each behind a hardened interface, composed into request paths, sitting on a shared platform substrate — discovery, messaging, storage, compute, observability — that makes distribution survivable. The quiet twist: that substrate was itself built as services, to the same "design as if external" standard. Which is exactly why it could later be sold.
flowchart TB
U([Customer request]):::u --> FE[Front-end / page composition]:::svc
FE --> CAT[Catalogue]:::svc
FE --> PR[Pricing]:::svc
FE --> INV[Inventory]:::svc
FE --> REC[Recommendations]:::svc
FE --> REV[Reviews]:::svc
CAT --> cdb[(cat store)]:::store
PR --> pdb[(price store)]:::store
INV --> idb[(inv store)]:::store
subgraph SUB["Platform substrate — also built as services"]
direction LR
DISC{{Service discovery}}:::sub
MSG{{Messaging · queues}}:::sub
OBS{{Observability · tracing}}:::sub
STOR{{Storage · compute}}:::sub
end
FE -.-> OBS
CAT -.-> DISC
REC -.-> MSG
classDef u fill:#2a1d12,stroke:#F2A24B,color:#F2A24B;
classDef svc fill:#1E283A,stroke:#54C7BD,color:#cfeeea;
classDef store fill:#162033,stroke:#3A465E,color:#9fb0c8;
classDef sub fill:#171a2e,stroke:#9B8CFF,color:#cabfff;
The interface is the product — even internally
"Design every internal interface as if a stranger will use it" sounds like a style note. It is actually the highest-leverage rule in the mandate, for a non-obvious reason: it removes the option to rely on shared context. An internal-only interface can quietly assume the caller shares your types, your database, your assumptions. An externalisable one cannot — it must hide its implementation, be explicit about errors and timeouts, and be versioned so it can change without breaking anyone.
Apply that discipline everywhere and you end up with interfaces robust enough that exposing one to the outside world is an act of flipping a switch, not a rewrite. Hold that thought. It is the seed of Act V.
Trade-off: What does "private data store" cost you?
A lot, and it must be taught honestly. Data that lived in one schema and could be joined in a single query is now split across services — so you lose the cross-service join and the cross-service transaction. A change spanning two services' data can't be made atomic with a database transaction; you reach for sagas, idempotency, compensating actions, and eventual consistency. Services also end up keeping local, slightly-stale copies of data they need from others. You're trading consistency for autonomy and availability — the same CAP-shaped bargain that runs through every distributed system.
See: Kleppmann, Designing Data-Intensive Applications, on consistency and the limits of distributed transactions.
Act IV: The Cost
You don't escape complexity. You relocate it.
The monolith's risk was "one change can break everything in one process." Services trade that for "any dependency can fail on its own, and failures can climb the graph." Decomposition doesn't delete systemic risk — it reshapes it, and hands you a distributed-systems bill.
The moment two components talk only over the network, you inherit the whole catalogue of distributed-systems problems the monolith never had: partial failure, accumulating latency, versioning, debugging that spans dozens of machines, and a new signature failure mode — the cascade.
The cascade is the one to fear
It works like this. A service deep in the graph slows down. Its callers block, waiting, holding threads and connections. Those callers then slow, so their callers block. The stall climbs the dependency graph until something the customer can see falls over — even though the original fault was small and far away. Nobody designed the outage; it emerged from independently reasonable components under load.
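A rough way to see why a slowdown turns into exhaustion is Little's Law: the average number of calls in flight equals the arrival rate multiplied by the time each call takes. The sketch below uses made-up numbers (pool size, request rate, latencies) and assumes blocking calls with no timeout, so every in-flight call holds one thread.

```python
def threads_in_use(requests_per_second: float, latency_ms: float) -> float:
    """Little's Law: average calls in flight = arrival rate x time in system."""
    return requests_per_second * latency_ms / 1000.0


POOL_SIZE = 50   # hypothetical worker threads in the calling service
RATE = 100       # hypothetical requests per second

healthy = threads_in_use(RATE, latency_ms=50)      # dependency answers in 50 ms
degraded = threads_in_use(RATE, latency_ms=2000)   # dependency slows to 2 s

print(healthy, healthy <= POOL_SIZE)    # 5.0 True
print(degraded, degraded <= POOL_SIZE)  # 200.0 False
```

At the same traffic level, the slow dependency asks for four times more threads than the caller has. The caller now stalls too, and the same arithmetic repeats one level up the graph.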
The fix isn't the failing service's job. It's every caller's job: to assume its dependencies will fail and to have decided, in advance, what happens when they do.
Graceful degradation, decided in advance
The healthy pattern is to sort every dependency on a request path into essential and non-essential before anything breaks. A product page must have catalogue and price; it can live without recommendations and reviews. Encode that — short timeouts and fallbacks on the non-essential — and the page renders something useful even when half its dependencies are sick. It is the services-world version of Apollo's load-shedding: when you can't do everything, protect the critical path and drop the rest.
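A minimal sketch of that sorting, with hypothetical names throughout. It assumes each dependency call enforces its own short timeout and surfaces a failure as an exception; the page builder only decides what to do about it.

```python
from typing import Callable, Dict


def render_product_page(
    essential: Dict[str, Callable[[], str]],
    optional: Dict[str, Callable[[], str]],
    defaults: Dict[str, str],
) -> Dict[str, str]:
    """Build a page; essential failures propagate, optional ones degrade."""
    page: Dict[str, str] = {}
    for name, fetch in essential.items():
        page[name] = fetch()              # no fallback: without this, no page
    for name, fetch in optional.items():
        try:
            page[name] = fetch()
        except Exception:
            page[name] = defaults[name]   # fallback chosen before the incident
    return page


def failing_recommendations() -> str:
    raise TimeoutError("recommendations timed out")


page = render_product_page(
    essential={"catalogue": lambda: "Dune", "price": lambda: "$9.99"},
    optional={
        "recommendations": failing_recommendations,
        "reviews": lambda: "4.5 stars",
    },
    defaults={"recommendations": "", "reviews": "Reviews unavailable"},
)
```

Here the page still carries catalogue, price, and reviews, with an empty recommendations slot. The important design choice is that the essential and optional lists exist in code before anything breaks.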
Circuit breaker: stop calling a failing dependency for a while, so you don't pile up requests on something already on fire.
Bulkhead: isolate resources per dependency, so one drowning call can't exhaust the threads the rest of the page needs.
Timeout with fallback: give up fast and show a sensible default, instead of blocking the whole request on one slow hop.
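The first of these habits, stopping calls to a failing dependency for a while, is commonly called a circuit breaker. The sketch below is a deliberately small, single-threaded version with hypothetical names and thresholds; production libraries add thread safety, metrics, and finer-grained half-open behaviour.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_after` seconds."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock              # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: T) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback          # open: do not touch the dependency
            self._opened_at = None       # cool-down over: allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback
        self._failures = 0               # a success closes the circuit
        return result
```

A caller wraps each remote call, for example `breaker.call(fetch_reviews, fallback="Reviews unavailable")`, where `fetch_reviews` is whatever function makes the request. While the breaker is open the dependency receives no traffic at all, which gives it room to recover.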
Consistency without a transaction
Once each service owns its own database, you lose the one tool that made multi-step changes safe: the cross-service transaction. There is no BEGIN … COMMIT that spans two services. So you replace it with a saga — a sequence of local steps, each with a compensating action that undoes it if a later step fails. It is eventual consistency, made operational. Watch what happens when the inventory step fails:
sequenceDiagram
autonumber
participant O as Order
participant P as Payment
participant I as Inventory
O->>P: reserve funds
P-->>O: reserved ✓
O->>I: reserve stock
I-->>O: OUT OF STOCK ✗
Note over O,P: no cross-service transaction to undo
O->>P: compensate, release funds
P-->>O: released ✓
Note over O,I: order fails cleanly, each store stays consistent on its own
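The same flow can be sketched as code. This is an in-memory illustration with hypothetical names. A real saga also needs durable state so it can resume after a crash, and compensations that are idempotent and can themselves be retried.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class OutOfStock(Exception):
    pass


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"{prev_name} compensated")
            return False
        log.append(f"{name} ok")
        done.append(step)
    return True


funds = {"reserved": False}


def reserve_funds() -> None:
    funds["reserved"] = True


def release_funds() -> None:
    funds["reserved"] = False


def reserve_stock() -> None:
    raise OutOfStock("no units left")


log: List[str] = []
ok = run_saga(
    [
        ("payment", reserve_funds, release_funds),
        ("inventory", reserve_stock, lambda: None),
    ],
    log,
)
print(ok, funds["reserved"])  # False False
print(log)  # ['payment ok', 'inventory failed: no units left', 'payment compensated']
```

Between the payment step and its compensation the funds really were reserved. That visible intermediate state is what distinguishes a saga from a database transaction.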
And the bill keeps coming after launch
The cascade is the dramatic cost. The quiet ones are just as real. Every team now runs its own services in production — Werner Vogels' famous "you build it, you run it." That aligns incentives beautifully: the people who feel a bad design at 3 a.m. are the people who can fix it. But it only works if the platform makes each team's slice of operations small. Where the substrate is missing, "you build it, you run it" becomes "you build it, you suffer," and teams quietly rebuild back doors to escape the pain — which is how a services architecture rots back into a monolith.
The distributed monolith. Services on paper; a shared database and lock-step releases underneath. You pay every cost of distribution and collect none of the benefits — and you usually don't notice until one schema change breaks five teams at once. It is the most common way "doing microservices" goes wrong.
Act V: The By-Product
Strategy fell out of the architecture.
Nobody set out to build a cloud business. The "design as if external" rule meant the internal plumbing was already shaped like a product. Exposing it was a switch, not a project.
Here is the chain of consequence, and it is worth sitting with because it is one of the great examples of strategy emerging from a technical decision. The fifth rule (build every interface as if for outside developers) forced every internal service, including the substrate of storage, compute, and queues, to be built to an externalisable standard. Once storage was already a clean, documented, network-addressable service that assumed nothing about its caller, closing the distance between "our internal storage service" and "S3, a thing anyone on earth can buy" was an act of exposure, not engineering.
So AWS is, in a real sense, the platform substrate turned outward. SQS appeared in 2004; S3 and EC2 in 2006. The mandate that looked like an internal tax turned out to be the R&D for a business that now underwrites much of modern computing.
The discipline you would only ever apply to an external product, applied internally, is what turned plumbing into a platform. That is the counter-intuitive heart of the whole story: the rule that cost the most up front — "treat every internal call as if a stranger makes it" — is the one that made AWS thinkable.
Before 2001: Amazon grows on a single retail monolith and a shared database, fast to build on until it isn't.
2001: Explosive growth turns the shared codebase into the coordination chokepoint for every new initiative.
c. 2002: Bezos forbids back doors and shared databases; CIO Rick Dalzell drives enforcement. Date approximate; known publicly only via later recollection.
2002 onward: Service by service, the monolith is strangled. Two-pizza teams formalise the one-team-one-service model.
2004: SQS, an externalised, internal-style queue service, ships publicly. The by-product is becoming visible.
2006: S3 storage (March) and EC2 compute (August) go public. The substrate, turned outward, becomes an industry. Vogels articulates "you build it, you run it."
2011: Steve Yegge's "Platforms Rant" accidentally ships to the world, becoming the canonical account of the 2002 decree.
Act VI: The Verdict
The right answer depends entirely on your scale.
Amazon's decision was correct — for Amazon. The most valuable thing a working engineer can take from it is not "do this," but "know the scale at which this becomes correct," and the honesty to admit when you're not there yet.
Compare the two approaches along the axes below. The same architecture is a triumph for one organisation and a self-inflicted wound for another; the only thing that changed is the organisation looking at it.
| Axis | What it means |
|---|---|
| Delivery as teams grow | Can teams ship in parallel, or do they queue behind each other? |
| In-process performance | Fast function calls vs network hops that accumulate latency. |
| Failure isolation | Does one fault take down everything, or just one capability? |
| Consistency | One database and one transaction vs sagas and eventual consistency. |
| Operational burden | Centralised ops vs every team running its own service. |
| Debugging | One process and a debugger vs tracing across many machines. |
| New products / optionality | Compose existing services into something new (the AWS move). |
| Right scale | Small org / early product vs large org / many teams / sustained growth. |
The strongest argument against copying Amazon
It's called monolith-first, and it's right far more often than microservice enthusiasm admits. For most organisations, starting with microservices is a mistake: you pay the coordination and distributed-systems costs immediately, while the benefits only arrive at a scale most teams never reach. A twelve-person startup that copies the 2002 mandate gets all of the latency, ops burden, and eventual-consistency pain, and approximately none of the parallel-delivery payoff — because it doesn't have enough teams to be slowed by a monolith in the first place.
The reconciliation is simply scale. Amazon was already far past the point where monolith coordination cost dominated. The lesson for an engineer two to ten years into a career is to locate your own organisation on that curve honestly, and to be willing to say "we are not Amazon, and a well-structured modular monolith is the wiser call here."
What to actually carry away
Soft module boundaries inside a shared process always erode. An enforceable boundary — the network plus a private store — is what makes decoupling stick.
You can't choose an architecture independently of the org that builds it. Change how teams may communicate and the system reorganises itself.
Assume every dependency will fail, and decide in advance what happens when it does. That habit is the whole difference between degradation and outage.
Judge the decision on the slope of the cost-of-change curve as you grow — never on this quarter's velocity, or you'll kill the right call in year one.
Run a coupling & boundary audit of a system you work on. Find where teams reach across boundaries — shared databases, direct links, lock-step releases. Pick one such coupling and predict which change would break a distant team; propose the interface that removes the risk. Mark where a service boundary disagrees with data ownership. Then decide, with explicit reference to your organisation's scale, whether more decomposition is actually warranted — or whether you're at the size where a modular monolith is wiser. Grade yourself on whether data ownership drove the boundary, and whether your scale argument is honest rather than cargo-culted from Amazon.
Amazon changed its architecture by changing the rules of how its teams were allowed to talk — and got microservices and a cloud business as the same act. — the one-sentence version