Your CI pipeline runs 22 minutes on a good day.

On a bad day, three browser tests fail for no obvious reason, one snapshot breaks because somebody changed a button label, and a production bug still slips through because the one integration that mattered was never exercised end to end.

This is the uncomfortable truth about testing: many teams with a testing problem do not have it because they write too few tests. They have it because they keep buying confidence from the wrong place.

A large test suite is not the same thing as a trustworthy one.

If your tests are slow, brittle, over-mocked, or duplicated across too many layers, they stop protecting change and start taxing it. Engineers hesitate to refactor. Pull requests stay open longer. Broken builds become background noise. Eventually the team has hundreds or thousands of tests that feel impressive on paper and weak in practice.

This guide is about fixing that. Not with slogans like "just follow the test pyramid" or "get coverage above 80%," but with a practical way to decide what to test, what to mock, and what to stop testing altogether.

The Short Answer

If you only want the operating model, it is this:

  • Unit test business rules, data transformations, validation logic, and branching behavior.
  • Integration test the boundaries that actually break in production: database access, auth, queues, caches, file storage, and third-party APIs.
  • End-to-end test only a small number of high-value user journeys.
  • Mock systems you do not control in narrow tests, but keep real interactions where integration risk is the point of the test.
  • Stop testing framework behavior, trivial pass-through code, and low-signal snapshots that fail more often than they teach.

That is the short version. The rest of the article is about applying it without fooling yourself.

Why Most Testing Advice Breaks Down In Real Teams

The internet is full of clean testing advice for messy systems.

It usually sounds like one of these:

  • write more unit tests
  • follow the test pyramid
  • keep end-to-end tests minimal
  • hit a coverage threshold

None of that is entirely wrong. It is just incomplete.

Real codebases are not tutorial codebases. They have legacy abstractions, awkward boundaries, flaky infrastructure, vendor SDKs, and product behavior that leaks across layers. The problem is not understanding what a unit test is. The problem is deciding where confidence should come from in a system that is already imperfect.

For example:

  • A pure unit test cannot tell you whether your auth cookie settings are correct in production. If you have worked through OAuth and session flows, you know the boundary behavior matters as much as the logic itself, which is exactly why Authentication Explained gets complicated so quickly.
  • A database-heavy service can have beautiful coverage and still melt down if the query shape is wrong. That is the same lesson behind SQL Performance Pitfalls: correctness in isolation is not enough when the system boundary is the risk.
  • A cache integration can pass mocked tests while still serving stale data because invalidation was never exercised against the real layers involved, which is one of the core traps in Caching Beyond Redis.

Testing advice breaks down when it ignores where failures actually come from.

[Figure: Testing pyramid compared with production reality]

The Real Mental Model: Tests Answer Questions

Stop organizing your testing strategy around categories alone. Organize it around questions.

Each layer earns its keep by answering a different question about the system:

| Test layer | Main question | Best at catching | Common mistake |
| --- | --- | --- | --- |
| Unit | Does this logic behave correctly in isolation? | branching bugs, edge cases, validation mistakes | testing implementation details instead of behavior |
| Integration | Do these components work together correctly? | DB, auth, queue, cache, serialization, contract issues | mocking away the exact boundary that can fail |
| End-to-end | Does the user-visible system actually work? | broken flows, wiring issues, config mistakes | overusing browser tests for everything |
| Non-test checks | Is change safe enough to ship? | typing, linting, observability, canaries, feature flags | pretending tests are the only confidence mechanism |

That last row matters more than many teams admit. Strong types, static analysis, feature flags, runtime monitoring, and canary rollouts do not replace tests, but they absolutely participate in the same job: reducing the chance that a change hurts users.

[Figure: Map of unit, integration, and end-to-end testing scope]

Here is the practical implication:

  • Unit tests are cheap and precise.
  • Integration tests are slower but often more truthful.
  • End-to-end tests are expensive and broad, so they should be few and valuable.

The right mix is not ideological. It is economic.

What To Test At Each Level

Unit Tests: Test Decision-Making, Not Plumbing

Unit tests shine when the thing under test has real logic and a small surface area.

Good candidates:

  • pricing and discount rules
  • permission checks
  • input validation
  • formatting and transformation logic
  • retry backoff calculations
  • feature-flag branching

Bad candidates:

  • trivial getters and setters
  • wrappers that just forward parameters
  • framework code you do not own
  • behavior that only matters once multiple real components interact

The best unit tests tend to target code where a wrong branch produces a wrong outcome.

describe("calculateDiscount", () => {
  it("caps promotional discount at 25% for enterprise plans", () => {
    expect(
      calculateDiscount({
        plan: "enterprise",
        isPromotional: true,
        baseDiscount: 40,
      }),
    ).toBe(25);
  });
});

That test is useful because it protects a business rule. If the implementation changes tomorrow, the test still matters.
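To make the rule concrete, here is a minimal sketch of what a function behind that test might look like. The input fields come from the test; the `Plan` type, the `ENTERPRISE_PROMO_CAP` constant, and the clamping behavior are assumptions for illustration, not the actual implementation.

```typescript
// Hypothetical input shape; field names mirror the test, the rest is assumed.
type Plan = "free" | "pro" | "enterprise";

interface DiscountInput {
  plan: Plan;
  isPromotional: boolean;
  baseDiscount: number; // percentage, e.g. 40 means 40%
}

// Assumed business rule: promotional discounts on enterprise plans cap at 25%.
const ENTERPRISE_PROMO_CAP = 25;

function calculateDiscount(input: DiscountInput): number {
  // Clamp to 0..100 first so bad input cannot produce a nonsensical discount.
  const discount = Math.min(Math.max(input.baseDiscount, 0), 100);

  if (input.plan === "enterprise" && input.isPromotional) {
    return Math.min(discount, ENTERPRISE_PROMO_CAP);
  }
  return discount;
}
```

Every branch here changes what a customer pays, which is what makes it worth a unit test. The function could be rewritten as a lookup table tomorrow and the same test would still be valid.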

Unit tests are bad at telling you whether the whole request path works. They are not supposed to. The mistake is expecting them to.

Integration Tests: Test The Boundaries That Break In Production

Integration tests are where many teams should spend more of their energy.

These tests answer questions like:

  • does this handler write the right rows to the database?
  • does this auth middleware accept, reject, and refresh sessions correctly?
  • does this queue consumer behave correctly when the real payload shape arrives?
  • does this cache invalidation path actually remove stale state?

These are not hypothetical concerns. They are where production incidents live.

Good candidates:

  • API handler plus database
  • auth/session logic plus cookie handling
  • queue consumer plus persistence layer
  • repository layer plus real database schema
  • cache write/read/invalidate flows
  • external API adapter with realistic response contracts

For a backend service, this is often the highest return-on-investment layer because it exercises real seams without paying the full cost of browser automation.

it("creates a session row and sets the secure auth cookie", async () => {
  const response = await request(app)
    .post("/login")
    .send({ email: "user@example.com", password: "correct-password" });

  expect(response.status).toBe(200);
  expect(response.headers["set-cookie"][0]).toContain("HttpOnly");
  expect(await db.session.count()).toBe(1);
});

That test tells you far more than five mocked unit tests for the same login flow.
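The test above checks only for `HttpOnly` with a substring match. If cookie attributes are part of the risk, a small parsing helper lets the test assert on each flag by name. The sketch below is an assumption for illustration: `parseSetCookie` and `CookieFlags` are hypothetical names, and the parser handles only the attributes shown, not the full cookie grammar.

```typescript
// Hypothetical summary of the attributes an auth test usually cares about.
interface CookieFlags {
  name: string;
  httpOnly: boolean;
  secure: boolean;
  sameSite: string | null; // lower-cased value, or null when absent
}

// Parses one Set-Cookie header value such as
// "sid=abc; Path=/; HttpOnly; Secure; SameSite=Lax".
function parseSetCookie(header: string): CookieFlags {
  const [pair, ...attributes] = header.split(";").map((part) => part.trim());
  const flags: CookieFlags = {
    name: pair.split("=")[0],
    httpOnly: false,
    secure: false,
    sameSite: null,
  };

  for (const attribute of attributes) {
    const [key, value] = attribute.split("=");
    const lower = key.toLowerCase(); // attribute names are case-insensitive
    if (lower === "httponly") {
      flags.httpOnly = true;
    } else if (lower === "secure") {
      flags.secure = true;
    } else if (lower === "samesite") {
      flags.sameSite = (value ?? "").toLowerCase();
    }
  }
  return flags;
}
```

In the login test, this would allow assertions such as `expect(parseSetCookie(response.headers["set-cookie"][0]).secure).toBe(true)`, which fail with a clearer message than a substring check.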

End-to-End Tests: Test Journeys, Not Every Branch

End-to-end tests are the final proof that the system a user sees still works. They are necessary. They are also expensive and fragile if you use them for the wrong job.

Good end-to-end targets:

  • sign up, login, logout
  • checkout or payment completion
  • file upload and processing completion
  • a core admin workflow that drives revenue or operations
  • one or two high-risk regression paths per major feature area

Bad end-to-end targets:

  • every possible validation branch
  • every error state already covered at lower levels
  • presentational details that can be checked more cheaply elsewhere

The question to ask is not "can this be tested in the browser?" It is "is this user journey important enough to justify the cost of browser-level testing?"

If the answer is no, move the test down a layer.

Contract Tests: Use Them Where Teams Or Services Drift

If your system depends heavily on APIs between services or external vendors, contract tests can pull their weight. They are especially useful when two systems evolve independently and misunderstandings are expensive.

Common uses:

  • validating event payload shape between producer and consumer
  • asserting request and response contracts for internal APIs
  • pinning assumptions around third-party integrations

Not every team needs a formal contract testing setup. But many teams would benefit from recognizing that boundary drift is a distinct failure mode and deserves dedicated coverage.
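A lightweight version of this does not require a dedicated tool. The sketch below shows a hand-rolled shape check that both a producer test and a consumer test could run against the same sample payload. The `OrderCreatedEvent` contract and `checkOrderCreated` function are hypothetical examples, not part of any contract testing library.

```typescript
// Hypothetical event contract shared by a producer and a consumer.
interface OrderCreatedEvent {
  type: "order.created";
  orderId: string;
  totalCents: number;
  currency: string; // three-letter uppercase code, e.g. "USD"
}

// Returns a list of violations so a failing test says which assumption drifted.
// An empty list means the payload satisfies the OrderCreatedEvent contract.
function checkOrderCreated(payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not an object"];
  }
  const p = payload as Record<string, unknown>;
  const errors: string[] = [];

  if (p["type"] !== "order.created") {
    errors.push("type must be 'order.created'");
  }
  const orderId = p["orderId"];
  if (typeof orderId !== "string" || orderId.length === 0) {
    errors.push("orderId must be a non-empty string");
  }
  const totalCents = p["totalCents"];
  if (typeof totalCents !== "number" || !Number.isInteger(totalCents) || totalCents < 0) {
    errors.push("totalCents must be a non-negative integer");
  }
  const currency = p["currency"];
  if (typeof currency !== "string" || !/^[A-Z]{3}$/.test(currency)) {
    errors.push("currency must be a three-letter uppercase code");
  }
  return errors;
}
```

The producer's test feeds its real serializer output into the check; the consumer's test feeds the same fixtures into its handler. If either side changes the shape, one of the two tests fails before the drift reaches production.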

What To Mock

The internet produces two kinds of bad testing advice about mocks:

  • mock everything to keep tests fast
  • mock nothing because mocks are lies

Both positions collapse under contact with real software.

The better rule is simpler: mock where isolation helps you learn something, and avoid mocks where they erase the risk you actually care about.

[Figure: Mocking trade-offs across system boundaries]

Good Uses Of Mocks

  • a payment provider in a unit test for your billing decision logic
  • a clock when time-dependent behavior is the variable under test
  • a random number generator
  • a third-party email or SMS vendor in lower-level tests
  • a flaky external dependency when your goal is to test your own retry or fallback behavior
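The clock case from the list above is the easiest to sketch. Instead of patching global time, the code under test accepts a clock as a parameter, and the test passes in a fixed one. The names `Clock`, `Session`, `isExpired`, and `fixedClock` are hypothetical and used only for illustration.

```typescript
// A clock is just a function that returns the current time in milliseconds.
type Clock = () => number;

// Hypothetical session record; only the fields needed for expiry are shown.
interface Session {
  issuedAtMs: number;
  ttlMs: number;
}

// Production callers omit `now` and get the real clock; tests inject a fake.
function isExpired(session: Session, now: Clock = Date.now): boolean {
  return now() >= session.issuedAtMs + session.ttlMs;
}

// Test helper: a clock frozen at a specific instant.
function fixedClock(ms: number): Clock {
  return () => ms;
}
```

With this shape, a test can check the exact boundary, for example one millisecond before and exactly at expiry, without sleeping or depending on wall-clock time. The fake is trivial because it replaces something with no behavior of its own, which is the kind of mock that rarely misleads.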

Bad Uses Of Mocks

  • mocking your own repository layer and then claiming the handler is tested
  • mocking every internal function call until the test mirrors the implementation exactly
  • mocking the cache, queue, or database in a test whose whole purpose is to validate that boundary
  • creating complex fake objects that duplicate real production behavior badly

As a rule of thumb:

  • mock external systems you do not control in unit tests
  • prefer real collaborators inside the boundary you own when integration risk is what matters
  • if your test breaks every time you refactor internals but the public behavior is unchanged, you are probably mocking too low in the stack

Consider these two tests.

// Low-value test: mostly verifies mock setup
it("calls repo.save", async () => {
  repo.save.mockResolvedValue({ id: 1 });

  await createUser(serviceInput, { repo });

  expect(repo.save).toHaveBeenCalledWith({ email: "a@b.com" });
});

// Higher-value test: verifies behavior at the boundary that matters
it("returns 409 when email already exists", async () => {
  await seedUser({ email: "a@b.com" });

  const response = await request(app).post("/users").send({ email: "a@b.com" });

  expect(response.status).toBe(409);
});

The first test will pass even if your real persistence layer is broken. The second one exercises the rule the user cares about.

What To Stop Testing

This is the part many teams avoid, because deleting tests feels reckless. Often it is the opposite.

Some tests create less confidence than they cost.

Stop Testing Framework Behavior

Do not write tests to prove React updates state correctly, that your ORM maps fields the way its own maintainers document, or that Next.js routing works as advertised. Test your usage and assumptions, not the framework's existence.

Stop Testing Trivial Pass-Through Code

If a function just forwards arguments to another function without adding decisions, validation, transformation, or risk, a dedicated test is usually wasted effort.

Stop Over-Testing Implementation Details

Tests that assert private method calls, exact hook ordering, internal helper invocation counts, or DOM structure that users never observe are refactor traps. They make the code harder to change without making regressions meaningfully less likely.

Stop Treating Snapshot Volume As Quality

Snapshot tests are not useless, but large uncontrolled snapshot suites tend to decay into approval theater. The team scrolls, shrugs, and presses accept. If a snapshot is too big to review carefully, it is too big to trust.

Stop Duplicating The Same Confidence At Three Layers

If a rule is thoroughly exercised in unit tests and verified once through an integration path, you probably do not also need four browser tests for the same branches.

Redundant coverage feels safe until it slows every pull request and multiplies flake surface area.

A Sane Testing Strategy For A Real Team

If I were setting a default testing strategy for a typical product engineering team, it would look something like this:

  1. Put fast unit tests around business logic, validation, and branching code.
  2. Put integration tests around the seams where production failures actually happen: database, auth, cache, queue, storage, and external APIs.
  3. Keep a small, explicitly curated set of end-to-end flows for the journeys that matter most to the business.
  4. Mock third-party dependencies in lower-level tests, but do not mock your own core boundaries when the point is to validate integration behavior.
  5. Enforce ownership for flaky tests: if a test flakes twice, somebody fixes or deletes it.
  6. Track CI runtime like a product metric. Slow feedback is an engineering tax.

That strategy scales better than "test everything" because it accepts that every test has a maintenance cost.
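Rule 5 only works if "flaked" has a definition that can be computed. One possible definition, assumed here, is that a test flaked on a commit if it both failed and passed against that same commit. The sketch below applies that definition to a list of CI results; the `TestRun` shape and the `findFlakyTests` name are hypothetical, and a real pipeline would read this data from its CI provider.

```typescript
// One execution of one test against one commit. Shape is hypothetical.
interface TestRun {
  testName: string;
  commit: string;
  passed: boolean;
}

interface Outcome {
  passed: boolean;
  failed: boolean;
}

// Returns the names of tests that flaked on at least `threshold` commits,
// where "flaked" means the same commit saw both a pass and a failure.
function findFlakyTests(runs: TestRun[], threshold: number = 2): string[] {
  const byTest = new Map<string, Map<string, Outcome>>();

  for (const run of runs) {
    const byCommit = byTest.get(run.testName) ?? new Map<string, Outcome>();
    const seen = byCommit.get(run.commit) ?? { passed: false, failed: false };
    if (run.passed) {
      seen.passed = true;
    } else {
      seen.failed = true;
    }
    byCommit.set(run.commit, seen);
    byTest.set(run.testName, byCommit);
  }

  const flaky: string[] = [];
  byTest.forEach((byCommit, testName) => {
    let flakes = 0;
    byCommit.forEach((seen) => {
      if (seen.passed && seen.failed) flakes += 1;
    });
    if (flakes >= threshold) flaky.push(testName);
  });
  return flaky.sort();
}
```

A test that fails consistently on a commit is not counted, because that is a real failure rather than noise. The output is a short list that can be assigned to an owner, which is the point of the rule.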

Here is a useful team-level question to ask during code review:

What is the cheapest test that would catch the most important failure here?

That question usually leads to better decisions than arguing abstractly about best practices.

Coverage, Flakiness, And CI Economics

Coverage is useful, but only as a smoke alarm.

If a critical module has 3% coverage, that is a signal. If your whole repository sits at 86%, that number alone tells you almost nothing about whether the important behaviors are protected.
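One way to use coverage as a smoke alarm is to check only the modules the team has declared critical, rather than gating on the repository-wide number. The sketch below assumes a simplified per-file report; the `ModuleCoverage` shape, the path prefixes, and the function name are hypothetical and do not match any particular coverage tool's output format.

```typescript
// Simplified per-file coverage entry. Shape is assumed for illustration.
interface ModuleCoverage {
  path: string;
  linePercent: number; // 0..100
}

// Returns critical modules whose line coverage falls below `minPercent`.
// A module is critical if its path starts with one of `criticalPrefixes`.
function findUnderCoveredCriticalModules(
  report: ModuleCoverage[],
  criticalPrefixes: string[],
  minPercent: number,
): ModuleCoverage[] {
  return report.filter(
    (entry) =>
      entry.linePercent < minPercent &&
      criticalPrefixes.some((prefix) => entry.path.startsWith(prefix)),
  );
}
```

A report where most files sit above 90% can still return a billing module at 3% from this check, which is the signal the aggregate number hides. The threshold says nothing about whether the covered lines are tested well; it only flags where nobody has looked.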

High coverage can coexist with:

  • over-mocked tests that never hit real boundaries
  • missing regression protection for critical user flows
  • massive snapshot suites no one reviews carefully
  • slow builds that train engineers to ignore failures

Flaky tests are their own category of damage. A flaky test does not just waste CI minutes. It teaches the team that the suite is negotiable. Once people assume failures might be noise, the value of every passing and failing test drops.

[Figure: Confidence gained versus cost and flakiness in CI]

The economics matter:

  • the first few high-value tests usually buy a lot of confidence cheaply
  • later tests often buy less confidence and more maintenance burden
  • every unstable test increases the background cost of shipping

This is why deleting a test is sometimes the right move. If a test is flaky, redundant, low-signal, and expensive to maintain, keeping it is not discipline. It is inertia.

A Better Way To Judge Test Quality

Instead of asking "how many tests do we have?" ask better questions:

  • Which failures in this system are most likely and most expensive?
  • Which layer can catch each one most cheaply?
  • Which tests fail for reasons users would actually care about?
  • Which parts of the suite are mostly ceremony?
  • If this test broke tomorrow, would anyone learn something important?

Those questions push teams toward confidence, not volume.

Closing Thought

Good testing is not about performing discipline for its own sake.

It is about making change safer.

That means you should be willing to add tests where the system is exposed, deepen tests where real risk lives, and delete tests that create cost without protection.

The best teams do not try to prove that everything is tested. They build a system where the important things fail loudly, cheaply, and early.

That is what testing in the real world looks like.