Two interpreters, one trace: how we test cross-language parity

smartordercapture has two implementations of the same workflow interpreter. One lives in packages/workflow-engine (TypeScript) and powers the builder's live preview. The other lives in android/.../engine (Kotlin) and runs the actual workflow on your phone.

Whenever a user clicks "Run preview" in the builder, the trace they see comes from the TypeScript interpreter against a no-op simulator adapter. When the workflow then runs on their phone, the trace comes from the Kotlin interpreter against a real AccessibilityService adapter. Those two traces must be byte-identical — if they diverge, the preview is misleading and trust evaporates.

The fixture-runner

Under packages/workflow-engine/__fixtures__/ we keep JSON files of the form:

{
  "workflow": { ... },
  "expectedTrace": [
    { "nodeId": "t1", "kind": "trigger.manual", "outcome": "ok" },
    { "nodeId": "a1", "kind": "action.openApp", "outcome": "ok" },
    { "nodeId": "a2", "kind": "action.wait", "outcome": "ok" }
  ]
}

The TypeScript Vitest suite walks every fixture and asserts runWorkflow(workflow).trace == expectedTrace (ignoring timing-derived fields). The Kotlin test suite walks the same fixtures and asserts the same equality. CI runs both on every PR.
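
As a rough sketch of what the fixture walk amounts to, the snippet below compares each fixture's expected trace against an interpreter's output and reports the first event where they diverge. The names Fixture, Interpreter, firstDivergence and runFixtures are made up for illustration, the event shape is reduced to the three fields shown in the sample fixture, and fixtures are passed in already parsed rather than read from disk. The real suite wraps this kind of check in Vitest assertions.

```typescript
// Hypothetical shapes: the real fixture schema may carry more fields.
interface TraceEvent {
  nodeId: string;
  kind: string;
  outcome: string;
}

interface Fixture {
  name: string;
  workflow: unknown;
  expectedTrace: TraceEvent[];
}

// Stand-in for runWorkflow: anything that turns a workflow into a trace.
type Interpreter = (workflow: unknown) => { trace: TraceEvent[] };

function sameEvent(a: TraceEvent | undefined, b: TraceEvent | undefined): boolean {
  // A missing event on one side only (traces of different length) is a mismatch.
  if (a === undefined || b === undefined) return a === b;
  return a.nodeId === b.nodeId && a.kind === b.kind && a.outcome === b.outcome;
}

// Index of the first event where the traces disagree, or -1 if they match.
function firstDivergence(actual: TraceEvent[], expected: TraceEvent[]): number {
  const length = Math.max(actual.length, expected.length);
  for (let i = 0; i < length; i++) {
    if (!sameEvent(actual[i], expected[i])) return i;
  }
  return -1;
}

// Runs every fixture through the interpreter and collects failure messages.
function runFixtures(fixtures: Fixture[], run: Interpreter): string[] {
  const failures: string[] = [];
  for (const fixture of fixtures) {
    const { trace } = run(fixture.workflow);
    const index = firstDivergence(trace, fixture.expectedTrace);
    if (index !== -1) {
      failures.push(`${fixture.name}: traces diverge at event ${index}`);
    }
  }
  return failures;
}
```

Reporting the index of the first mismatch, rather than a bare pass/fail, tends to make ordering bugs much quicker to spot.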

What "byte-identical" actually means

We strip three things before comparing: the wall-clock "at" timestamp on each event, the per-node durationMs, and the run-level start/end times. Those are environment-dependent; everything else must match exactly.
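
The stripping step could look something like the sketch below. The event fields at and durationMs come from the description above; the run-level names startedAt and endedAt are assumptions, as are the helper names omitKeys and normalizeRun.

```typescript
type Json = Record<string, unknown>;

// Per-event fields that depend on the environment.
const EVENT_TIMING_KEYS = new Set(["at", "durationMs"]);
// Run-level fields; these two names are assumed for illustration.
const RUN_TIMING_KEYS = new Set(["startedAt", "endedAt"]);

// Shallow copy of obj without the listed keys; key order is preserved.
function omitKeys(obj: Json, keys: Set<string>): Json {
  const out: Json = {};
  for (const [key, value] of Object.entries(obj)) {
    if (!keys.has(key)) out[key] = value;
  }
  return out;
}

// Removes timing fields from the run and from every event in its trace.
function normalizeRun(run: { trace: Json[]; [field: string]: unknown }): Json {
  return {
    ...omitKeys(run, RUN_TIMING_KEYS),
    trace: run.trace.map((event) => omitKeys(event, EVENT_TIMING_KEYS)),
  };
}
```

Two runs that differ only in timing then normalize to the same value, while any other difference (an outcome, an error message, the event order) survives and fails the comparison.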

That includes things like rounding behavior in numeric branch conditions, regex case-folding, the order in which a loop interleaves with downstream nodes, and the wording of error messages. The last one bit us a few times — we ended up extracting both interpreters' error strings into a shared constants file so the wording can't drift.
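
To make the shared-constants idea concrete, here is one possible shape for it on the TypeScript side. Everything in it is hypothetical: the error codes, the wording, the {placeholder} syntax and the formatError helper are illustrative, not the project's actual file. The point is only that each interpreter fills in a template from one shared catalog instead of writing its own string.

```typescript
// Hypothetical catalog: codes and wording are invented for this sketch.
const ERROR_TEMPLATES = {
  "node.notFound": "Node {nodeId} not found",
  "wait.invalidDuration": "Invalid wait duration: {value}",
} as const;

type ErrorCode = keyof typeof ERROR_TEMPLATES;

// Fills {name} placeholders from params; unknown placeholders are left as-is
// so a missing parameter shows up visibly in the trace.
function formatError(code: ErrorCode, params: Record<string, string | number>): string {
  return ERROR_TEMPLATES[code].replace(/\{(\w+)\}/g, (match: string, key: string) =>
    key in params ? String(params[key]) : match
  );
}
```

A Kotlin counterpart would read the same templates and apply the same substitution rule, so the two interpreters can differ only in the parameters they pass.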

What it caught

A real example: an early version of the Kotlin action.branch implementation evaluated the condition before recording the trace event. The TypeScript version recorded the event first. The same workflow produced two different trace orderings, and the parity test caught it on the first run. Five minutes to fix once you know what to look for.

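Another: the cron matcher in Kotlin used Java's DayOfWeek.value (1=Mon..7=Sun) while the TypeScript test fixtures assumed Vixie's convention (0=Sun..6=Sat). The two numberings agree from Monday to Saturday (1 to 6) and differ only on Sunday (7 versus 0), so the bug stayed hidden on six days out of seven. The parity test caught it on the seventh. A simple modulo (DayOfWeek.value % 7) maps one convention onto the other.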
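
The arithmetic is small enough to check by hand. The sketch below converts the ISO numbering that DayOfWeek.value uses (1=Mon..7=Sun) to the 0=Sun..6=Sat numbering with a modulo, and lists the days on which using the ISO value unconverted would be wrong. The function names isoToCronDow and mismatchedIsoDays are invented for this example.

```typescript
// ISO-8601 numbering, as in java.time.DayOfWeek: 1 = Monday ... 7 = Sunday.
// Cron numbering assumed by the fixtures:        0 = Sunday ... 6 = Saturday.
function isoToCronDow(isoDow: number): number {
  if (!Number.isInteger(isoDow) || isoDow < 1 || isoDow > 7) {
    throw new RangeError(`ISO day of week must be 1..7, got ${isoDow}`);
  }
  // Monday..Saturday (1..6) map to themselves; Sunday (7) wraps to 0.
  return isoDow % 7;
}

// ISO days for which passing the value through unconverted gives the wrong
// cron day of week.
function mismatchedIsoDays(): number[] {
  const days: number[] = [];
  for (let iso = 1; iso <= 7; iso++) {
    if (iso !== isoToCronDow(iso)) days.push(iso);
  }
  return days;
}
```

Only Sunday ends up in the mismatch list, which is why a fixture suite without a Sunday schedule would never have noticed the difference.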

Why two implementations at all?

Because the builder needs to give users instant feedback in the browser — no round trip — and the phone needs to run workflows offline on its own engine. Embedding a JS engine inside the Android app was an option, but a 2k-line Kotlin interpreter is more reliable than V8 inside an APK and ships smaller too.