March 18, 2026

Self-healing UI tests with Playwright: a primer

playwrighttestingself-healing

If you've owned an end-to-end suite for more than a quarter, you already know the failure mode: a designer renames a button from "Save" to "Save changes," a CSS module hash shifts, an A/B test wraps the checkout CTA in an extra div, and twenty tests go red overnight. None of those are bugs. They're churn — and they're the reason "flaky e2e tests" is a category of human suffering with its own Slack channel at most companies.

This post is about a specific mitigation: self-healing UI tests. The idea is older than it sounds, but Playwright's locator model and the availability of cheap LLM inference have made it newly practical. We'll walk through why selectors break, what "healing" actually means in the runner, a minimal Playwright recipe you can drop into a project today, and the tradeoffs that should keep you honest about when to use it.

Why selectors break

Selectors break for boring reasons, not interesting ones:

Refactors. A `<button>` becomes a `<Button>` component, which renders an extra wrapper, which kills your `parent > .btn` chain.
Design system swaps. Migrating from Bootstrap to a Tailwind-based system rewrites every class name in the app.
A/B tests. Half your users see one DOM tree, half see another, and the test only knows about one.
i18n changes. A copy team fixes "Sign Up" to "Sign up" — your `getByText('Sign Up')` is now a flake.
Generated class names. CSS-in-JS or CSS modules produce hashes that change between builds.

None of these change what the user does. The button still saves, the form still submits, the page still navigates. The test fails because the path it took to find an element no longer leads there — even though the element is sitting right next to where it was, doing the same job.

The brittle-test pyramid

Not all selectors are equally fragile. A rough ranking, brittlest at the bottom:

Role and accessible name — `getByRole('button', { name: 'Save changes' })`. Tied to what the element means. Survives most refactors. Breaks on copy changes.
Test IDs — `getByTestId('save-btn')`. Stable as long as engineers remember to add them. Invisible to the user, which means designers and product can refactor freely. The downside: every component needs the discipline of someone adding `data-testid`.
Text content — `getByText('Save')`. Robust to DOM changes, fragile to copy changes and i18n.
XPath / structural CSS — `//div[2]/button[1]` or `.modal-footer > button.primary`. Tied to layout. Breaks the moment anything wraps or reorders.
Generated class names — `.css-1a2b3c4`. Change every build. Don't.

Playwright's whole API is designed to push you toward the top of that pyramid: `getByRole`, `getByLabel`, `getByText`, `getByTestId`. Use them. Healing is not a substitute for selector hygiene — it's a safety net for the cases where hygiene wasn't enough.

What self-healing actually means

Healing kicks in when a primary locator misses. Instead of failing the assertion immediately, the runner pauses, snapshots the current DOM, and asks: given what I knew about the element I was looking for — its role, its nearby text, its position in the flow — which element on this page is the closest semantic match? If the answer is confident enough, the test continues against the new element. If it isn't, the test fails as it would have anyway, and now you have a DOM snapshot to debug with.

The "what I knew about the element" part is the load-bearing piece. A naive locator stores nothing but the selector string. A healing-aware locator stores semantic context: the role, the accessible name, the nearest heading, surrounding text, maybe a visual bounding box. When the selector misses, that context is what powers the match.

There are two flavors of matcher:

Heuristic. Fuzzy text match, role + name match, position-in-flow scoring. Cheap, deterministic, no API call. Catches maybe 60–70% of the easy cases — copy changes, single-attribute drifts.
LLM-assisted. Send the snapshot plus the original context to a model and ask it to return a new locator. Slower, costs tokens, catches structural changes that heuristics miss. Best paired with a confidence threshold and a heuristic fallback.

In practice you want both: heuristics first, LLM only when heuristics fail or score below threshold.

A minimal recipe

Here's the wrapper pattern. The shape is: a `findOrHeal` function that tries a primary locator, falls back to a candidate finder built from stored semantic context, and logs every healed match for review.

import { Page, Locator, expect } from '@playwright/test';

type Hint = { role?: string; name?: string; nearText?: string };

export async function findOrHeal(
  page: Page,
  primary: string,
  hint: Hint,
): Promise<Locator> {
  const direct = page.locator(primary);
  if (await direct.count()) return direct.first();

  // Heuristic pass: role + accessible name.
  if (hint.role && hint.name) {
    const byRole = page.getByRole(hint.role as any, { name: hint.name });
    if (await byRole.count()) {
      console.warn(`[heal] ${primary} -> role=${hint.role} name=${hint.name}`);
      return byRole.first();
    }
  }

  // Fallback: anchor on nearby text, then find the nearest interactive element.
  if (hint.nearText) {
    const anchor = page.getByText(hint.nearText, { exact: false }).first();
    const candidate = anchor.locator('xpath=following::button[1]');
    if (await candidate.count()) {
      console.warn(`[heal] ${primary} -> nearText anchor`);
      return candidate;
    }
  }

  throw new Error(`Could not heal locator: ${primary}`);
}

Used in a test:

test('user can save profile', async ({ page }) => {
  await page.goto('/settings/profile');
  const save = await findOrHeal(page, '[data-testid="save-btn"]', {
    role: 'button',
    name: 'Save changes',
    nearText: 'Profile information',
  });
  await save.click();
  await expect(page.getByText('Saved')).toBeVisible();
});

That's the skeleton. To make it production-grade you'd add: a real confidence score on heuristic matches, an LLM tier behind a flag, persistence so a healed selector becomes the new primary on next run (with human approval), and a structured log so a teammate can review what healed and why.

Caveats — read this part

Healing is a sharp tool. A few rules that have saved us pain:

Healing is for navigation and setup, not assertions. If your test is asserting "the Save button exists and reads 'Save changes'" — don't heal that. The whole point of that assertion is to catch the change you're trying to paper over. Heal the click that gets you to the next step. Assert with normal locators.
Log every healed selector. A test that quietly heals forever is a test that's slowly drifting from reality. Every heal should be a warning in CI output and ideally a row in a dashboard.
A human owns every heal. Treat heals like dependabot PRs: they propose a change, a person confirms it. After confirmation, the new locator becomes the primary. No confirmation, no auto-merge into the canonical test.
Set a budget. If more than 10–15% of locators in a run are healing, your selector strategy is the problem, not the selectors. Healing should be the exception.
Healing has false positives. A confident-looking match on the wrong element is worse than a clean failure, because it'll pass the test and lie about coverage. Confidence thresholds and review queues exist for this reason.

Roll your own, or use a tool

If you have one suite, a few hundred tests, and engineers who like writing infra — roll your own. The code above is most of the way there. The hard part isn't the wrapper, it's the review workflow around it.

If you have many suites across many teams, want a UI for reviewing heals, need cross-run learning so the same heal doesn't get proposed twice, or you're integrating with a recorder/walkthrough product that already snapshots DOM context — at that point a dedicated tool earns its keep.

We built one of these for walkthroughs at Heal Demo: when a recorded user flow breaks because the underlying app shipped, our healing agent inspects the new DOM and proposes a patched selector for the demo author to confirm. The same primitives — semantic context, heuristic-then-LLM matching, human-in-the-loop confirmation — apply directly to test suites. If you're working in this space, the architectural decisions look almost identical.

Wherever you land, the principle is the same: don't paper over flakiness, but don't pretend selectors are stable either. Stable selectors first, healing as a safety net, humans on every change.