Building ESLint Rules to Prevent Tests That Lie

From Taxonomy to Tooling

Discovering that 60% of our agent-generated tests provided zero signal was sobering. Fixing them was labour-intensive but straightforward. The harder problem was preventing the patterns from returning, especially when AI agents write tests at high velocity across parallel workstreams.

We needed automated enforcement at two layers. Not all 8 false-confidence patterns (FC-A through FC-H) are detectable through static analysis; some require semantic reasoning about intent. But two of the most prevalent patterns, FC-A (multi-status acceptance) and FC-E (tautological/range assertions), have distinctive syntactic signatures that are visible in the AST. Those two patterns belong to the first layer: lint rules that run before every commit.

We built a three-rule ESLint plugin that runs at "error" severity on all test files. Here is how each rule works.

Rule 1: `no-false-confidence-patterns`

This is the workhorse rule, catching both FC-A and FC-E patterns across five distinct syntactic forms.

What It Catches

FC-A multi-status arrays:

// Flagged: FC-A multi-status array
expect([200, 404]).toContain(res.status)

FC-A Array.includes:

// Flagged: FC-A includes pattern
[200, 404].includes(res.status)

FC-A OR-chains:

// Flagged: FC-A OR-chain
if (res.status === 200 || res.status === 201) { /* ... */ }

FC-E range checks in expect:

// Flagged: FC-E range assertion
expect(res.status).toBeLessThan(500)
expect(res.status).toBeGreaterThanOrEqual(200)

FC-E raw binary comparisons:

// Flagged: FC-E binary comparison
if (res.status < 500) { /* ... */ }
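Both FC-A array forms share one shape: an array literal of two or more HTTP-range numbers paired with a status-like subject. The following is a minimal sketch of how that match could be written. It is illustrative, not the plugin's source. `isMultiStatusArray()` and `isMultiStatusArrayPattern()` are hypothetical names, and the status helper repeats the looksLikeStatusCode() logic described in the next section. The AST nodes are built by hand in the ESTree shape that ESLint parsers produce, so the code runs without ESLint.

```javascript
// Sketch only: hypothetical helpers for the two FC-A array forms
//   expect([200, 404]).toContain(res.status)
//   [200, 404].includes(res.status)

// Same logic as looksLikeStatusCode() in the next section.
function looksLikeStatusCode(node) {
  if (!node) return false
  if (node.type === "MemberExpression") {
    const prop = node.property.name || node.property.value
    if (typeof prop === "string" && prop.toLowerCase().includes("status")) return true
    return looksLikeStatusCode(node.object)
  }
  if (node.type === "Identifier") return node.name.toLowerCase().includes("status")
  return false
}

function isStatusLiteral(value) {
  return typeof value === "number" && value >= 100 && value <= 599
}

// An array literal of two or more HTTP-range numbers, e.g. [200, 404].
function isMultiStatusArray(node) {
  return (
    node?.type === "ArrayExpression" &&
    node.elements.length >= 2 &&
    node.elements.every((el) => el?.type === "Literal" && isStatusLiteral(el.value))
  )
}

// Checks one CallExpression for either FC-A array form.
function isMultiStatusArrayPattern(call) {
  const callee = call.callee
  if (callee.type !== "MemberExpression") return false
  const method = callee.property.name
  const subject = call.arguments[0]

  if (method === "includes") {
    // [200, 404].includes(res.status)
    return isMultiStatusArray(callee.object) && looksLikeStatusCode(subject)
  }
  if (method === "toContain") {
    // expect([200, 404]).toContain(res.status)
    const expectCall = callee.object
    return (
      expectCall.type === "CallExpression" &&
      expectCall.callee.type === "Identifier" &&
      expectCall.callee.name === "expect" &&
      isMultiStatusArray(expectCall.arguments[0]) &&
      looksLikeStatusCode(subject)
    )
  }
  return false
}

// Usage with hand-built ESTree nodes
const id = (name) => ({ type: "Identifier", name })
const resStatus = { type: "MemberExpression", object: id("res"), property: id("status") }
const statusArray = {
  type: "ArrayExpression",
  elements: [{ type: "Literal", value: 200 }, { type: "Literal", value: 404 }],
}
const includesCall = {
  type: "CallExpression",
  callee: { type: "MemberExpression", object: statusArray, property: id("includes") },
  arguments: [resStatus],
}
const toContainCall = {
  type: "CallExpression",
  callee: {
    type: "MemberExpression",
    object: { type: "CallExpression", callee: id("expect"), arguments: [statusArray] },
    property: id("toContain"),
  },
  arguments: [resStatus],
}
console.log(isMultiStatusArrayPattern(includesCall), isMultiStatusArrayPattern(toContainCall)) // true true
```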

How It Works: Context-Aware Status Detection

The key design decision was making the rule context-aware. We do not flag every toBeLessThan() call, only those where the subject looks like an HTTP status and the threshold is a numeric literal between 100 and 599.

The looksLikeStatusCode() function walks the AST node recursively:

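function looksLikeStatusCode(node) {
  if (!node) return false
  // res.status, response.statusCode, res["status"]: check the property name,
  // then recurse into the object (so res.status.code also matches)
  if (node.type === "MemberExpression") {
    const prop = node.property.name || node.property.value
    if (typeof prop === "string" && prop.toLowerCase().includes("status")) {
      return true
    }
    return looksLikeStatusCode(node.object)
  }
  // Bare identifiers such as status or httpStatus
  if (node.type === "Identifier") {
    return node.name.toLowerCase().includes("status")
  }
  // Anything else (calls, literals, optional chains such as res?.status) is not treated as a status
  return false
}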

Why Context Matters

This catches res.status, response.statusCode, httpStatus, and any other identifier or property name containing "status" (case-insensitively). It does not fire on expect(count).toBeLessThan(500), because count does not contain "status". Without this context-awareness, the rule would be too noisy to keep enabled.

The threshold check uses a simple range validation:

function isStatusLiteral(value) {
  return typeof value === "number" && value >= 100 && value <= 599
}

This means expect(res.status).toBeLessThan(1000) is not flagged (threshold outside HTTP range), while expect(res.status).toBeLessThan(500) is (threshold is a valid HTTP status code).
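Combining the two helpers, a range-assertion check needs three pieces: the matcher name, the expect() subject, and the matcher's argument. The sketch below makes some assumptions:

- The matcher names are the four Jest/Vitest range matchers.
- `RANGE_MATCHERS`, `isStatusRangeAssertion()` and `expectMatcher()` are hypothetical names.
- The nodes are hand-built in ESTree shape, so the code runs without ESLint.

It deliberately ignores `.not`, which the Negation Handling section covers.

```javascript
// Sketch only: FC-E range-assertion check on hand-built ESTree nodes.

// Same logic as looksLikeStatusCode() and isStatusLiteral() above.
function looksLikeStatusCode(node) {
  if (!node) return false
  if (node.type === "MemberExpression") {
    const prop = node.property.name || node.property.value
    if (typeof prop === "string" && prop.toLowerCase().includes("status")) return true
    return looksLikeStatusCode(node.object)
  }
  if (node.type === "Identifier") return node.name.toLowerCase().includes("status")
  return false
}

function isStatusLiteral(value) {
  return typeof value === "number" && value >= 100 && value <= 599
}

// Assumed set of Jest/Vitest range matchers.
const RANGE_MATCHERS = new Set([
  "toBeLessThan",
  "toBeLessThanOrEqual",
  "toBeGreaterThan",
  "toBeGreaterThanOrEqual",
])

// Matches expect(<status-like>).<rangeMatcher>(<literal in 100..599>).
function isStatusRangeAssertion(call) {
  const callee = call.callee
  if (callee.type !== "MemberExpression" || !RANGE_MATCHERS.has(callee.property.name)) {
    return false
  }
  const expectCall = callee.object // with .not in between, this is a MemberExpression
  if (
    expectCall.type !== "CallExpression" ||
    expectCall.callee.type !== "Identifier" ||
    expectCall.callee.name !== "expect"
  ) {
    return false
  }
  const threshold = call.arguments[0]
  return (
    looksLikeStatusCode(expectCall.arguments[0]) &&
    threshold?.type === "Literal" &&
    isStatusLiteral(threshold.value)
  )
}

// Hypothetical builder for expect(<subject>).<matcher>(<n>).
const id = (name) => ({ type: "Identifier", name })
function expectMatcher(subject, matcher, n) {
  return {
    type: "CallExpression",
    callee: {
      type: "MemberExpression",
      object: { type: "CallExpression", callee: id("expect"), arguments: [subject] },
      property: id(matcher),
    },
    arguments: [{ type: "Literal", value: n }],
  }
}

const resStatus = { type: "MemberExpression", object: id("res"), property: id("status") }

console.log(isStatusRangeAssertion(expectMatcher(resStatus, "toBeLessThan", 500))) // true
console.log(isStatusRangeAssertion(expectMatcher(resStatus, "toBeLessThan", 1000))) // false: outside 100..599
console.log(isStatusRangeAssertion(expectMatcher(id("count"), "toBeLessThan", 500))) // false: not status-like
```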

OR-Chain Flattening

The FC-A OR-chain detection required flattening nested logical expressions. Because || is left-associative, JavaScript's AST represents res.status === 200 || res.status === 201 || res.status === 204 as nested binary LogicalExpression nodes:

LogicalExpression(||)
  left: LogicalExpression(||)
    left: res.status === 200
    right: res.status === 201
  right: res.status === 204

We flatten this into a list of leaf nodes and check that each is a status equality comparison against the same subject:

function flattenOr(node) {
  if (node.type === "LogicalExpression" && node.operator === "||") {
    return [...flattenOr(node.left), ...flattenOr(node.right)]
  }
  return [node]
}

The rule only fires at the root of an OR chain (not intermediate nodes) to avoid duplicate reports. It also verifies that all comparisons reference the same status expression: res.status === 200 || res.status === 201 is flagged, but res.status === 200 || req.method === "POST" is not (different subjects).
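The root check and the same-subject check can be sketched as follows, under three assumptions:

- ESLint sets `node.parent` during traversal; the hypothetical `or()` builder imitates that by hand.
- `subjectKey()` is a hypothetical stand-in for comparing `sourceCode.getText()` output. It only understands identifiers and non-computed member access.
- Only `subject === literal` leaves are recognised, not `literal === subject`.

```javascript
// Sketch only: root detection and same-subject check for FC-A OR-chains.
function flattenOr(node) {
  if (node.type === "LogicalExpression" && node.operator === "||") {
    return [...flattenOr(node.left), ...flattenOr(node.right)]
  }
  return [node]
}

// Serialises identifiers and non-computed member access: res.status -> "res.status".
function subjectKey(node) {
  if (node.type === "Identifier") return node.name
  if (node.type === "MemberExpression" && !node.computed) {
    const objectKey = subjectKey(node.object)
    return objectKey === null ? null : `${objectKey}.${node.property.name}`
  }
  return null
}

// Only the outermost || of a chain reports; inner ones have a || parent.
function isOrChainRoot(node) {
  const parent = node.parent
  return !(parent && parent.type === "LogicalExpression" && parent.operator === "||")
}

// Key of a `<subject> === <100..599>` leaf, or null for any other leaf.
function statusComparisonKey(leaf) {
  if (leaf.type !== "BinaryExpression" || (leaf.operator !== "===" && leaf.operator !== "==")) {
    return null
  }
  const value = leaf.right.type === "Literal" ? leaf.right.value : undefined
  if (typeof value !== "number" || value < 100 || value > 599) return null
  return subjectKey(leaf.left)
}

// Returns the shared status subject if every leaf compares it to a status code.
function sharedStatusSubject(root) {
  const keys = flattenOr(root).map(statusComparisonKey)
  const first = keys[0]
  if (keys.length < 2 || !first || !/status/i.test(first)) return null
  return keys.every((key) => key === first) ? first : null
}

// Hand-built AST for res.status === 200 || res.status === 201 || res.status === 204
const id = (name) => ({ type: "Identifier", name })
const member = (object, name) => ({ type: "MemberExpression", computed: false, object, property: id(name) })
const eq = (left, value) => ({ type: "BinaryExpression", operator: "===", left, right: { type: "Literal", value } })
function or(left, right) {
  const node = { type: "LogicalExpression", operator: "||", left, right }
  left.parent = node // what ESLint's traversal would set
  right.parent = node
  return node
}

const inner = or(eq(member(id("res"), "status"), 200), eq(member(id("res"), "status"), 201))
const root = or(inner, eq(member(id("res"), "status"), 204))

console.log(isOrChainRoot(root), isOrChainRoot(inner)) // true false
console.log(sharedStatusSubject(root)) // res.status
```

Checking the parent keeps the visitor simple. Every LogicalExpression is visited, but only the one whose parent is not another || does any work.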

Negation Handling

An interesting edge case: expect(res.status).not.toBeGreaterThanOrEqual(400). This is a negated range check that means "status is less than 400". It is still a range check that passes for many codes. The rule detects .not. in the member expression chain and flags negated range assertions as well.

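// node: the matcher CallExpression, e.g. expect(x).not.toBeGreaterThanOrEqual(400)
const isNegated =
  node.callee.type === "MemberExpression" &&
  node.callee.object.type === "MemberExpression" &&
  node.callee.object.property.name === "not"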
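The isNegated check inspects exactly one level of the callee chain. A more general variant is sketched here with a hypothetical `parseExpectChain()` helper. It walks back from the matcher to the `expect()` call and records every modifier on the way, such as `not` or Jest's `resolves`/`rejects`. That lets a range check read the subject, the matcher and the negation flag in one pass. The AST is hand-built in ESTree shape.

```javascript
// Sketch only: hypothetical parseExpectChain() for matcher calls such as
// expect(res.status).not.toBeGreaterThanOrEqual(400).
// Returns null when the call is not rooted in expect(...).
function parseExpectChain(call) {
  if (call.type !== "CallExpression" || call.callee.type !== "MemberExpression") {
    return null
  }
  const matcher = call.callee.property.name
  const modifiers = []
  let node = call.callee.object
  // Walk back from the matcher toward expect(...), keeping modifiers in source order.
  while (node.type === "MemberExpression") {
    modifiers.unshift(node.property.name)
    node = node.object
  }
  if (node.type !== "CallExpression" || node.callee.type !== "Identifier" || node.callee.name !== "expect") {
    return null
  }
  return {
    subject: node.arguments[0],
    matcher,
    modifiers,
    negated: modifiers.includes("not"),
  }
}

// Hand-built AST for expect(res.status).not.toBeGreaterThanOrEqual(400)
const id = (name) => ({ type: "Identifier", name })
const member = (object, name) => ({ type: "MemberExpression", object, property: id(name) })
const resStatus = member(id("res"), "status")
const expectCall = { type: "CallExpression", callee: id("expect"), arguments: [resStatus] }
const negatedCall = {
  type: "CallExpression",
  callee: member(member(expectCall, "not"), "toBeGreaterThanOrEqual"),
  arguments: [{ type: "Literal", value: 400 }],
}

const parsed = parseExpectChain(negatedCall)
console.log(parsed.matcher, parsed.modifiers, parsed.negated)
// toBeGreaterThanOrEqual [ 'not' ] true
```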

Rule 2: `no-conditional-status-expect`

This rule catches a pattern that overlaps FC-A and FC-F: expect() calls guarded by an HTTP status condition.

// Flagged: conditional expect guarded by status check
if (res.status === 200) {
  expect(res.body).toEqual({ id: 1 })
}

The problem: when the endpoint returns 500 instead of 200, the if branch does not execute, the expect() never runs, and the test passes with zero assertions. This is FC-F (silent skip) triggered by a status-based condition.

AST Walking Strategy

The rule visits IfStatement nodes and performs two recursive searches:

1. Does the condition reference an HTTP status? The conditionHasStatusCheck() function recursively walks the test expression, checking for status-like identifiers.
2. Does the consequent or alternate contain an expect() call? The nodeContainsExpectCall() function walks the entire subtree of each branch.

IfStatement(node) {
  if (!conditionHasStatusCheck(node.test)) return
 
  const consequentHasExpect = nodeContainsExpectCall(node.consequent)
  const alternateHasExpect = node.alternate
    ? nodeContainsExpectCall(node.alternate)
    : false
 
  if (consequentHasExpect || alternateHasExpect) {
    context.report({
      node,
      messageId: "conditionalStatusExpect",
    })
  }
}
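The two helpers are named but not shown above. One plausible implementation is sketched below, under stated assumptions:

- A hypothetical generic `childNodes()` walker visits every child node and skips the `parent` back-reference that ESLint adds. A production rule might use ESLint's visitor keys instead.
- `conditionHasStatusCheck()` applies the looksLikeStatusCode() test at every level of the condition.
- `nodeContainsExpectCall()` looks for any call to the bare identifier `expect`.

```javascript
// Sketch only: possible implementations of the two helpers used above.

// Hypothetical generic walker: returns every child that looks like an AST node,
// skipping the parent back-reference that ESLint attaches to each node.
function childNodes(node) {
  const children = []
  for (const [key, value] of Object.entries(node)) {
    if (key === "parent" || !value || typeof value !== "object") continue
    for (const item of Array.isArray(value) ? value : [value]) {
      if (item && typeof item.type === "string") children.push(item)
    }
  }
  return children
}

// Same logic as looksLikeStatusCode() in Rule 1.
function looksLikeStatusCode(node) {
  if (!node) return false
  if (node.type === "MemberExpression") {
    const prop = node.property.name || node.property.value
    if (typeof prop === "string" && prop.toLowerCase().includes("status")) return true
    return looksLikeStatusCode(node.object)
  }
  if (node.type === "Identifier") return node.name.toLowerCase().includes("status")
  return false
}

// True if any sub-expression of the if-condition looks like an HTTP status.
function conditionHasStatusCheck(test) {
  return looksLikeStatusCode(test) || childNodes(test).some(conditionHasStatusCheck)
}

// True if the subtree calls expect(...) anywhere, including nested blocks.
function nodeContainsExpectCall(node) {
  if (node.type === "CallExpression" && node.callee.type === "Identifier" && node.callee.name === "expect") {
    return true
  }
  return childNodes(node).some(nodeContainsExpectCall)
}

// Hand-built AST for: if (res.status === 200) { expect(res.body).toEqual({}) }
const id = (name) => ({ type: "Identifier", name })
const member = (object, name) => ({ type: "MemberExpression", object, property: id(name) })
const ifNode = {
  type: "IfStatement",
  test: {
    type: "BinaryExpression",
    operator: "===",
    left: member(id("res"), "status"),
    right: { type: "Literal", value: 200 },
  },
  consequent: {
    type: "BlockStatement",
    body: [
      {
        type: "ExpressionStatement",
        expression: {
          type: "CallExpression",
          callee: member({ type: "CallExpression", callee: id("expect"), arguments: [member(id("res"), "body")] }, "toEqual"),
          arguments: [{ type: "ObjectExpression", properties: [] }],
        },
      },
    ],
  },
  alternate: null,
}

console.log(conditionHasStatusCheck(ifNode.test), nodeContainsExpectCall(ifNode.consequent)) // true true
```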

The Fix

Split the conditional into separate tests: one for the happy path that seeds data producing 200, one for the 404 case that seeds conditions producing 404. Each test asserts exactly one expected status.
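A framework-free simulation makes the difference concrete. `getItem()`, `fakeExpect()` and `runTest()` are hypothetical stand-ins. The handler's `broken` flag forces a 500. Like Jest's default behaviour, the harness passes any test that does not throw, even one that made zero assertions. The status-guarded test stays green against the broken handler, while the two split tests, each asserting one exact status, both fail.

```javascript
// Sketch only: hypothetical handler and mini test harness, no framework needed.
function getItem(db, id, { broken = false } = {}) {
  if (broken) return { status: 500, body: null } // simulated regression
  return id in db ? { status: 200, body: db[id] } : { status: 404, body: null }
}

// Counts assertions; a test passes if it does not throw.
let assertionCount = 0
function fakeExpect(actual) {
  return {
    toBe(expected) {
      assertionCount++
      if (actual !== expected) throw new Error(`expected ${expected}, got ${actual}`)
    },
  }
}
function runTest(name, fn) {
  assertionCount = 0
  try {
    fn()
    return { name, passed: true, assertions: assertionCount }
  } catch {
    return { name, passed: false, assertions: assertionCount }
  }
}

const seeded = { 1: { id: 1 } }
const results = [
  // Flagged by no-conditional-status-expect: the expect() silently never runs.
  runTest("conditional", () => {
    const res = getItem(seeded, 1, { broken: true })
    if (res.status === 200) fakeExpect(res.body.id).toBe(1)
  }),
  // The fix: one test per expected status, each asserting it exactly.
  runTest("happy path", () => {
    const res = getItem(seeded, 1, { broken: true })
    fakeExpect(res.status).toBe(200)
  }),
  runTest("not found", () => {
    const res = getItem({}, 1, { broken: true })
    fakeExpect(res.status).toBe(404)
  }),
]

for (const r of results) console.log(r.name, r.passed, r.assertions)
// conditional true 0
// happy path false 1
// not found false 1
```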

Rule 3: `no-tautological-expect`

The simplest and most surgical rule. It catches assertions where the subject and expected value are textually identical:

// Flagged: subject and argument are the same expression
expect(404).toBe(404)
expect(foo).toBe(foo)
expect(result.value).toEqual(result.value)

Text Comparison Approach

We considered deep AST comparison but chose source text comparison instead. It is simpler, handles every expression type uniformly, and avoids the false negatives a hand-written AST comparator produces when it meets a node type it does not handle:

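// Inside the matcher CallExpression visitor: subject is expect()'s first argument,
// argument is the matcher's, method is the matcher name, and sourceCode is
// context.sourceCode (context.getSourceCode() before ESLint v8.40)
const subjectText = sourceCode.getText(subject).trim()
const argumentText = sourceCode.getText(argument).trim()

if (subjectText === argumentText) {
  context.report({
    node,
    messageId: "tautologicalExpect",
    data: { text: subjectText, method },
  })
}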

This does mean that expect(a).toBe(a) is flagged but expect(a).toBe(b) is not, even if a and b reference the same value at runtime. That is acceptable: the rule catches syntactic tautologies, not semantic ones.
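ESLint's `sourceCode.getText(node)` returns the slice of the original source covered by `node.range`. The sketch below imitates that with a plain string, so the comparison runs without ESLint. `getText()`, `checkTautology()` and the `buildExpectCall()` builder are hypothetical; the builder sets only the `range` fields the check reads.

```javascript
// Sketch only: source-text tautology check on hand-built nodes.

// Stand-in for ESLint's sourceCode.getText(node): the source slice at node.range.
function getText(source, node) {
  return source.slice(node.range[0], node.range[1])
}

// Returns report data for expect(X).<method>(X), or null otherwise.
function checkTautology(source, call) {
  const method = call.callee.property.name
  const subject = call.callee.object.arguments[0]
  const argument = call.arguments[0]
  if (!subject || !argument) return null
  const subjectText = getText(source, subject).trim()
  const argumentText = getText(source, argument).trim()
  return subjectText === argumentText ? { text: subjectText, method } : null
}

// Hypothetical builder: source text plus a node carrying only the fields read above.
function buildExpectCall(subjectText, method, argumentText) {
  const source = `expect(${subjectText}).${method}(${argumentText})`
  const subjectStart = "expect(".length
  const argumentStart = source.length - 1 - argumentText.length
  const node = {
    type: "CallExpression",
    callee: {
      type: "MemberExpression",
      object: { type: "CallExpression", arguments: [{ range: [subjectStart, subjectStart + subjectText.length] }] },
      property: { type: "Identifier", name: method },
    },
    arguments: [{ range: [argumentStart, source.length - 1] }],
  }
  return { source, node }
}

for (const [subject, method, argument] of [
  ["404", "toBe", "404"],
  ["result.value", "toEqual", "result.value"],
  ["a", "toBe", "b"],
]) {
  const { source, node } = buildExpectCall(subject, method, argument)
  console.log(source, "->", checkTautology(source, node))
}
// expect(404).toBe(404) -> { text: '404', method: 'toBe' }
// expect(result.value).toEqual(result.value) -> { text: 'result.value', method: 'toEqual' }
// expect(a).toBe(b) -> null
```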

Plugin Architecture

The three rules are packaged as a custom ESLint plugin with a recommended flat config:

// eslint.config.js
import testingPlugin from "eslint-plugin-your-testing-rules"
 
export default [
  testingPlugin.configs.recommended,
]

The recommended config applies all three rules at "error" severity to test files:

files: [
  "**/__tests__/**/*.{js,ts,jsx,tsx}",
  "**/*.test.{js,ts,jsx,tsx}",
  "**/*.spec.{js,ts,jsx,tsx}",
  "**/*.e2e.{js,ts,jsx,tsx}",
  "tests/**/*.{js,ts,jsx,tsx}",
]

All three rules use CommonJS (.cjs) for maximum ESLint compatibility. The plugin has zero dependencies beyond ESLint itself.
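The following is a minimal sketch of a CommonJS entry point for such a plugin. It follows the flat-config pattern in ESLint's plugin documentation: `configs.recommended` is attached after the plugin object exists, so it can list the plugin under `plugins`. The `testing-rules` namespace, the metadata values and the `makeRule()` placeholder are hypothetical. In a real plugin, each `rules` entry would be the full rule module.

```javascript
// index.cjs (sketch): a CommonJS entry point for a flat-config ESLint plugin.

// Hypothetical placeholder; the real plugin would require() each rule module.
function makeRule(description) {
  return {
    meta: { type: "problem", docs: { description }, schema: [], messages: {} },
    create() {
      return {} // a real rule returns its AST visitors here
    },
  }
}

const plugin = {
  meta: { name: "eslint-plugin-your-testing-rules", version: "0.0.0" }, // placeholder values
  rules: {
    "no-false-confidence-patterns": makeRule("Disallow FC-A and FC-E status assertions"),
    "no-conditional-status-expect": makeRule("Disallow expect() guarded by a status check"),
    "no-tautological-expect": makeRule("Disallow expect(x).toBe(x)"),
  },
  configs: {},
}

// Attached after creation so the config can reference the plugin object itself.
plugin.configs.recommended = {
  files: [
    "**/__tests__/**/*.{js,ts,jsx,tsx}",
    "**/*.test.{js,ts,jsx,tsx}",
    "**/*.spec.{js,ts,jsx,tsx}",
    "**/*.e2e.{js,ts,jsx,tsx}",
    "tests/**/*.{js,ts,jsx,tsx}",
  ],
  plugins: { "testing-rules": plugin },
  rules: Object.fromEntries(
    Object.keys(plugin.rules).map((name) => [`testing-rules/${name}`, "error"]),
  ),
}

module.exports = plugin
```

With this shape, consumers use the plugin exactly as in the eslint.config.js above. Rule IDs then appear as testing-rules/<rule-name>, matching the key under `plugins`.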

CI Gate Integration

The rules run as part of our pre-commit parallel gate (alongside typecheck and unit tests). Because they are set to "error", a single FC-A or FC-E violation blocks the commit:

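# Pre-commit hook runs these in parallel (npm script names are illustrative):
npm run lint & pids="$!"            # includes custom testing rules
npm run typecheck & pids="$pids $!"
npm test & pids="$pids $!"
# A bare `wait` exits 0 even if a job failed, so wait on each PID
rc=0
for pid in $pids; do wait "$pid" || rc=1; done
exit "$rc"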

The error messages include the FC code for traceability:

FC-A: Multi-status array [200, 404] accepts failure as success.
      Use an exact status assertion: expect(res.status).toBe(200).

FC-E: .toBeLessThan(500) is a range check. It passes for many status codes.
      Use an exact assertion: expect(res.status).toBe(<code>).
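In ESLint, text like this lives in a rule's `meta.messages`. A message is chosen with `messageId`, and its `{{ placeholder }}` slots are filled from the `data` passed to `context.report()`. The sketch below shows that wiring for two messages. The message IDs and placeholder names are assumptions, and `format()` is a simplified stand-in for ESLint's own interpolation so the example runs on its own.

```javascript
// Rule metadata sketch: message IDs and {{ }} placeholder names are hypothetical.
const meta = {
  type: "problem",
  messages: {
    multiStatusArray:
      "FC-A: Multi-status array {{ array }} accepts failure as success. " +
      "Use an exact status assertion: expect(res.status).toBe(200).",
    rangeCheck:
      "FC-E: .{{ method }}({{ threshold }}) is a range check. It passes for many status codes. " +
      "Use an exact assertion: expect(res.status).toBe(<code>).",
  },
  schema: [],
}

// Simplified stand-in for ESLint's placeholder interpolation.
function format(messageId, data) {
  return meta.messages[messageId].replace(/\{\{\s*(\w+)\s*\}\}/g, (whole, key) =>
    key in data ? String(data[key]) : whole,
  )
}

// Inside a rule: context.report({ node, messageId: "rangeCheck", data: { method, threshold } })
console.log(format("rangeCheck", { method: "toBeLessThan", threshold: 500 }))
```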

What the Rules Cannot Catch, and How Agent Scaffolding Fills the Gap

Static analysis has limits. The three ESLint rules cover FC-A and FC-E at the pre-commit layer, but the remaining patterns require semantic understanding:

| Pattern | Why AST Cannot Catch It |
| --- | --- |
| FC-B (Shape-only) | `typeof body === "object"` is syntactically valid. The rule would need to know that `body` should have specific fields |
| FC-C (Mock-only) | Detecting that an assertion tests mock output requires understanding the mock setup and assertion relationship |
| FC-D (Route never reached) | Requires knowing which routes are mounted (a type-system question, not an AST question) |
| FC-G (Graceful degrade) | Requires knowing whether a mock DB was injected: runtime context |
| FC-H (Wrong constant) | Requires understanding which security constant maps to which test: domain knowledge |

This is where a second enforcement layer comes in: structured context injection. The FC taxonomy, written as rules and checklists, loads into every AI agent's context window during development. When an agent writes or modifies a test, it has the full taxonomy in its working context and reasons about each pattern as it goes. The agent checks its own work against the FC-B through FC-H checklist before committing, catching semantic patterns that no linter can detect.

The combination is deliberate: ESLint rules (FC-A, FC-E) provide hard gates at commit time; structured context rules (FC-B through FC-H, COND) provide reasoning-based enforcement during development. Between the two layers, the full taxonomy is covered.

Lessons for Plugin Authors

Context-awareness prevents false positives

The looksLikeStatusCode() function is what makes no-false-confidence-patterns usable in practice. Without it, every toBeLessThan() in every test would flag, making the rule too noisy to keep enabled.

Source text comparison beats deep AST comparison

For no-tautological-expect, comparing sourceCode.getText() output is simpler and more robust than recursively comparing AST node structures.

Error messages should include the taxonomy code

The FC-A: and FC-E: prefixes in error messages let developers look up the pattern in the documentation immediately. The message becomes a teaching moment, not just a lint error.

CommonJS for maximum compatibility

ESLint plugins need to work across projects with different module systems. The .cjs extension makes Node.js load the plugin as CommonJS regardless of the "type" setting in the consuming project's package.json, and ESM configs such as eslint.config.js can still import it.

Test the rules with fixtures

Each rule has a test file with valid and invalid code samples. Running the rules against known-good and known-bad patterns prevents the rules themselves from having false confidence, a meta-problem worth guarding against.

The three rules described here are directly implementable. If your team uses AI agents to write tests, or writes HTTP tests yourself and suspects your assertions are too loose, the patterns above give you everything you need to build this plugin for your own codebase. The results may be uncomfortable, but that discomfort is the point: it means the rules are finding tests that were lying to you.

For the patterns ESLint cannot catch, consider building structured context rules that load into your AI agents' context during development, giving them the vocabulary and checklists to reason about test quality as they write. Static analysis catches the syntactic patterns; agent context rules catch the semantic ones. Together, they close the gap.

60% of Our Tests Had Zero Signal: How We Discovered False Confidence covers the audit that produced the FC taxonomy in the first place.
Agent-Native Shift-Left CI for High-Velocity Solo Engineering shows how these linting rules fit into a broader local quality perimeter.