The most expensive product bugs I have worked on did not throw exceptions. They passed QA. They shipped. Everyone involved could point to working code.
They were still wrong.
A domain bug is what happens when the software is internally consistent, but the organization never agreed on what the thing actually is. Legal means one thing. Product means another. Engineering encodes a third. Support or operations end up discovering the gap after the workflow starts breaking in real life.
At Zendesk, “agreement” was one of those words. Legal, product, and engineering were all working in good faith, but not from the same definition. That mattered because the workflow crossed policy, approvals, system states, and real customer commitments. We were fully capable of shipping technically correct changes that still optimized the wrong part of the process. The cost showed up as rework, ambiguous handoffs, and more translation than a healthy product system should need.
That is why I care about Domain-Driven Design. Not as architecture theatre. As a way to stop teams from confidently shipping the wrong thing.
The bug is upstream of the code
CPOs usually notice domain bugs late because nothing is visibly on fire. There is no outage. There is just a growing pile of signals that do not quite line up:
- roadmap items that keep getting re-scoped after kickoff
- support or operations teams doing manual interpretation between systems
- the same noun meaning different things in planning docs, UI copy, and data models
- engineering asking “which version of this do we actually mean?” halfway through delivery
When those signals show up together, the problem is rarely backlog hygiene. It is usually semantic debt.
Semantic debt compounds faster than most technical debt because it spreads through decisions. One vague noun in a planning meeting becomes a misleading API name. That API name becomes a data field. That data field becomes reporting logic. Then a quarter later you are debating a KPI built on a definition nobody would defend out loud.
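That propagation chain is easy to sketch. The example below is hypothetical (the field and record names are mine, not from any real system): one "approved" flag quietly serves two definitions, and a KPI built on top of it counts both without anyone deciding that it should.

```python
# Hypothetical sketch of semantic debt: one vague noun doing double duty.
# Legal reads "approved" as "contract signed"; operations reads it as
# "workflow step cleared". Both write the same field.
records = [
    {"id": 1, "approved": True},   # legal's meaning: contract signed
    {"id": 2, "approved": True},   # ops' meaning: step cleared, NOT signed
    {"id": 3, "approved": False},
]

# A quarter later, a KPI counts "approved agreements" and silently
# mixes both definitions.
kpi_approved = sum(1 for r in records if r["approved"])
print(kpi_approved)  # 2 -- but only one is approved in the legal sense
```

Nothing here throws an exception, which is exactly the point: the bug lives in the definition, not the code.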
The smell test I actually use
Instead of teaching DDD as a set of patterns, this is the diagnostic I use before a workflow gets expensive:
| Early smell | What it usually means | Business cost | Product move |
|---|---|---|---|
| One noun needs different hidden definitions in the same meeting | Two contexts are pretending to be one | Roadmap churn and misbuilds | Force one explicit definition or split the workflow |
| A workflow crosses legal, support, product, and engineering with no obvious owner | Ownership boundary is unclear | Slow approvals, manual workarounds, trust loss | Define the boundary and owning team before scoping features |
| Teams keep sharing the same table, event, or status for convenience | Different lifecycles are being collapsed together | Silent regressions and reporting drift | Put an explicit interface or handoff between contexts |
| Metrics look stable while the frontline says the flow is broken | Reporting model and operational reality disagree | False confidence and delayed fixes | Reconcile the domain language before changing the dashboard |
That is the part of DDD I find most useful. It gives the team a way to name the ambiguity before it leaks into delivery.
Three checks matter more than the jargon
I do not start with architecture patterns. I start with three questions.
1. Can one term survive every surface without changing meaning?
If a term cannot survive legal copy, UI labels, roadmap docs, API names, and reporting without changing meaning, the team is not ready to ship around it. That is usually the first sign the domain model is still fuzzy.
This is what people mean by ubiquitous language, but the operational point is simpler: shared language reduces translation work. Translation work is where expensive misunderstandings hide. It is also why I treat documentation as a product surface, not an afterthought.
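One cheap way to make a term survive every surface is to give it exactly one encoding that every layer imports. This is an illustrative sketch (the enum and its states are assumptions, not a real schema): UI labels, API payloads, and reports all derive from the same definition, so a renamed or added state surfaces everywhere at once instead of drifting per layer.

```python
from enum import Enum

# Hypothetical single source of truth for one domain term.
# Every surface imports this; no layer re-derives its own meaning.
class AgreementStatus(Enum):
    DRAFT = "draft"
    PENDING_APPROVAL = "pending_approval"
    ACTIVE = "active"
    TERMINATED = "terminated"

def ui_label(status: AgreementStatus) -> str:
    # UI copy is derived from the canonical value, not hand-maintained.
    return status.value.replace("_", " ").title()

print(ui_label(AgreementStatus.PENDING_APPROVAL))  # Pending Approval
```

The design point is not the enum itself; it is that adding a fifth status forces a conversation in one place instead of four silent divergences.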
2. Where does this workflow stop being the same thing?
A “user” in access control is not the same as a “user” in billing. An “agreement” in a contractual system is not automatically the same thing as an agreement step inside an operational workflow. Good teams stop pretending everything belongs in one universal model.
This is what bounded contexts are for. Not because engineers love boxes. Because product teams need cleaner ownership, cleaner handoffs, and fewer accidental dependencies.
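A minimal sketch of that split, with hypothetical names throughout: each context keeps its own model of "user", and the only shared knowledge is an explicit translation at the boundary.

```python
from dataclasses import dataclass
from typing import Callable

# Access-control context: a user is an identity with permissions.
@dataclass
class AccessUser:
    user_id: str
    roles: list

# Billing context: a "user" is really an account that gets invoiced.
@dataclass
class BillingAccount:
    account_id: str
    payer_email: str

# The deliberate handoff: one translation function at the boundary,
# instead of both contexts sharing a universal User model.
def billing_account_for(user: AccessUser,
                        email_lookup: Callable[[str], str]) -> BillingAccount:
    return BillingAccount(account_id=user.user_id,
                          payer_email=email_lookup(user.user_id))

acct = billing_account_for(AccessUser("u1", ["admin"]),
                           lambda uid: f"{uid}@example.com")
print(acct.payer_email)  # u1@example.com
```

The payoff is organizational, not technical: billing can change what an account means without a meeting with the access-control team.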
3. What has to stay true together or the user stops trusting the system?
Engineers would describe this as an aggregate boundary. I think about it in more practical terms: what must change together, and what absolutely cannot drift apart?
That question matters in internal platforms, compliance workflows, and trust-sensitive systems because partial truth is often worse than visible failure. A broken page gets escalated quickly. A system that looks right while carrying the wrong status spreads bad decisions.
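In code, "what must change together" means one object owns both facts and exposes no way to mutate them separately. A hypothetical sketch (the two-approval rule is an invented example, not a real policy):

```python
# Hypothetical aggregate: an agreement's status and its approval record
# must move together, or the audit trail and reporting drift apart.
class Agreement:
    def __init__(self, agreement_id: str):
        self.agreement_id = agreement_id
        self.status = "draft"
        self.approvals: list[str] = []

    def approve(self, approver: str) -> None:
        # One method mutates both facts. Callers cannot flip the status
        # without recording who approved, so partial truth is impossible.
        self.approvals.append(approver)
        if len(self.approvals) >= 2:  # assumed rule: two sign-offs required
            self.status = "active"

a = Agreement("AG-1")
a.approve("legal")
a.approve("product")
print(a.status)  # active
```

If `status` were a plain field any service could write, the system could look "active" while carrying zero approvals, which is exactly the silent wrongness described above.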
What changed once I started treating this as product work
The fix at Zendesk was not memorizing DDD terminology. It was forcing the shared language to hold across the workflow and refusing to keep scoping around fuzzy definitions.
In practice, that meant fewer arguments about whether engineering had “implemented the requirement” and better conversations about whether we were modeling the right thing in the first place. It meant less rework after discovery. It meant the work moved out of translation overhead and into clearer ownership.
I use the same discipline in Meitheal. The domain areas live separately: tasks, auth, strategy, observability. More importantly, the language and responsibilities stay separate too. Cross-domain communication happens through deliberate interfaces instead of deep imports and implicit coupling. The useful outcome is not architectural purity. It is being able to evolve one area without casually breaking another.
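The shape of a deliberate interface is small. This is an illustrative sketch, not Meitheal's actual code: one domain depends on a narrow protocol it defines, and the other domain plugs in behind it, so neither imports the other's internals.

```python
from typing import Protocol

# Narrow contract the tasks domain owns; observability implements it.
class EventSink(Protocol):
    def record(self, event: str) -> None: ...

# A concrete sink (here in-memory; a real one might ship to a log pipeline).
class InMemorySink:
    def __init__(self) -> None:
        self.events: list[str] = []

    def record(self, event: str) -> None:
        self.events.append(event)

def complete_task(task_id: str, sink: EventSink) -> None:
    # ... tasks-domain logic here ...
    # Then one deliberate, narrow handoff across the boundary.
    sink.record(f"task_completed:{task_id}")

sink = InMemorySink()
complete_task("t-42", sink)
print(sink.events)  # ['task_completed:t-42']
```

Swapping the observability implementation now touches one class, not every call site inside the tasks domain.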
The operating rules I keep coming back to are straightforward:
- If the team cannot agree on a term, we are not ready to write code.
- If a workflow crosses contexts, the handoff needs to be explicit.
- If ownership is vague, the architecture will not rescue us later.
- If reporting, UI language, and operational behavior disagree, I treat that as a product bug.
Where this rigor earns its keep
I care most about this in the kinds of product areas I keep gravitating toward: internal platforms, trust and safety, compliance-heavy workflows, AI in production, and data systems leadership uses to make real decisions.
Those domains punish confident ambiguity.
On an internal platform, a vague model turns into more manual support, slower contribution flow, and teams working around the system, which is one reason I apply a domain-aware lens when prioritizing internal platform work. In a trust-sensitive workflow, it turns into audit risk and inconsistent decisions. In AI or data-heavy surfaces, it turns into dashboards or automations that look precise while resting on the wrong definitions.
That is why DDD matters to product leaders. It is not an engineering preference. It is a way to reduce rework, protect trust, and keep different parts of the organization from shipping different interpretations of the same product.
When the overhead is not worth it
I would not use this level of rigor on a marketing microsite or a simple one-team feature with obvious rules. Sometimes the right move is to ship the page, learn, and move on.
But once a workflow spans multiple teams, carries policy or trust implications, or feeds downstream reporting and automation, the overhead stops being overhead. It becomes the cost of not lying to yourself about what the system means.
If one noun needs three hidden definitions in the same meeting, I stop calling that a naming problem. It is usually two contexts pretending to be one.
If your team keeps arguing over the same noun, the model is usually wrong before the roadmap is. Get in touch.
Related thinking
- Documentation is a Product Surface - the documentation discipline that keeps shared language from drifting across code, docs, and handoffs
- Why RICE Fails on Internal Platforms - where I use domain-weighted prioritization when ambiguous internal work gets scored badly
- Your AI Demo Is Not Production Ready - why trust-sensitive systems fail in production when the operational model is fuzzier than the demo suggests
Further Reading
- Eric Evans - Domain-Driven Design - the original book
- Martin Fowler - DDD bliki - concise summary
- Vaughn Vernon - Implementing Domain-Driven Design - practical companion