Workflows run on the phone. The server is involved at save time and at analytics time, but not at run time. That choice — offline-first as the default rather than the fallback — shapes the entire engine in ways we keep being surprised by, even four years in.
This post walks through the queue that holds run events between the phone and the server: what it is, why it exists, what failure modes it absorbs, and the one architectural mistake we made that took six months to fully unwind.
The shape of the problem
A typical smartordercapture user runs a workflow somewhere their network is unreliable: in a basement office on a fading 4G signal, on a campus Wi-Fi that demands a captive-portal re-login every two hours, in a warehouse with thick concrete walls, on a bus. If the engine needed an HTTP round trip to dispatch each UI action, the product would simply not work for half of our users. The phone has to be self-sufficient for execution, and the server only finds out what happened when the phone next has a chance to tell it.
That's a familiar pattern — Notion, Linear, and every modern note-taking app run on the same principle. The twist for us is that the run isn't a document the user edits; it's a sequence of dispatched actions against other apps, plus a trace of what they returned. The trace is the analytics. If the trace is lost, we have no way to bill, no way to debug, no way to flag misuse. So the queue has to be more durable than a draft buffer; it has to be the source of truth for everything that happened on the device, full stop.
The queue, concretely
Inside the Android app, every workflow run produces a stream of RunEvent objects: RunStarted, NodeStarted, NodeFinished, NodeFailed, RunFinished. The engine emits these synchronously as it walks the workflow graph, and the queue layer writes each one to a Room-backed SQLite table called queued_events before any further processing happens. Write-ahead logging is on; the events table is the boundary between "the engine ran a step" and "the rest of the world might find out about that step."
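For concreteness, a minimal sketch of that hierarchy as a sealed type; only the five type names are fixed, the field sets are simplified for illustration:

```kotlin
// The five event types, as a sealed hierarchy. Field sets here are
// illustrative; only the type names are fixed.
sealed interface RunEvent {
    val runId: String
    val sequence: Long   // monotonic within the run
}

data class RunStarted(override val runId: String, override val sequence: Long) : RunEvent
data class NodeStarted(override val runId: String, override val sequence: Long, val nodeId: String) : RunEvent
data class NodeFinished(override val runId: String, override val sequence: Long, val nodeId: String) : RunEvent
data class NodeFailed(override val runId: String, override val sequence: Long, val nodeId: String, val error: String) : RunEvent
data class RunFinished(override val runId: String, override val sequence: Long) : RunEvent
```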
Each row carries a stable client-generated UUID, a runId, a sequence number that's monotonic within the run, the event's typed payload as a JSON blob, and a nullable sentAt timestamp. The uploader is a separate coroutine that wakes when WorkManager tells it there's connectivity, batches up to 200 rows per request, and posts them to /v1/runs/events:batchAppend with the UUIDs as idempotency keys. On a successful response the rows get their sentAt stamped and stay in the table for another 72 hours before a janitor coroutine deletes them.
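A sketch of the row and the upload pass in Room terms; QueuedEventDao, EventsApi, and BatchAck are illustrative names rather than our real ones, and the confirmation check foreshadows the captive-portal story below:

```kotlin
import androidx.room.Dao
import androidx.room.Entity
import androidx.room.PrimaryKey
import androidx.room.Query

@Entity(tableName = "queued_events")
data class QueuedEventRow(
    @PrimaryKey val eventUuid: String,   // stable, client-generated
    val runId: String,
    val sequence: Long,                  // monotonic within the run
    val payloadJson: String,             // the event's typed payload
    val sentAt: Long? = null,            // stamped only on server confirmation
)

@Dao
interface QueuedEventDao {
    @Query("SELECT * FROM queued_events WHERE sentAt IS NULL ORDER BY runId, sequence LIMIT 200")
    suspend fun nextBatch(): List<QueuedEventRow>

    @Query("UPDATE queued_events SET sentAt = :now WHERE eventUuid IN (:uuids)")
    suspend fun markSent(uuids: List<String>, now: Long)
}

data class BatchAck(val received: List<String>)  // UUIDs the server accepted

interface EventsApi {
    // POST /v1/runs/events:batchAppend; event UUIDs double as idempotency keys.
    suspend fun batchAppend(events: List<QueuedEventRow>): BatchAck
}

// Woken by WorkManager on connectivity; drains the table in batches of 200.
suspend fun uploadPending(dao: QueuedEventDao, api: EventsApi) {
    while (true) {
        val batch = dao.nextBatch()
        if (batch.isEmpty()) return
        val ack = api.batchAppend(batch)
        if (ack.received.isEmpty()) return               // nothing confirmed; retry later
        dao.markSent(ack.received, System.currentTimeMillis())
    }
}
```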
72 hours is on purpose. Users notice problems with a run within a day at the outside; if they email support and we need to ask them to "try the same workflow again so we can see the trace", we've already lost. Keeping the events around lets the app re-upload them on demand from a debug screen, and lets us answer "what did this run actually do" questions without re-running anything. Storage on a modern Android device is cheap; a busy power user generates maybe two megabytes of event data per week, which is invisible next to their photo library.
What replay looks like
The server side is built around the assumption that any given event UUID may arrive any number of times in any order, and the answer has to be the same. The handler for events:batchAppend does an INSERT … ON CONFLICT (event_uuid) DO NOTHING against a Postgres table partitioned by week, then enqueues a downstream job to update derived state: per-run summaries, monthly usage counters, the abuse review feed. The derived-state updates are themselves keyed by run-id plus sequence number, so re-running them on already-applied events is also a no-op.
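A sketch of the handler's core, eliding the weekly partitioning and with the JDBC plumbing and the job queue reduced to illustrative stand-ins:

```kotlin
import java.sql.Connection

data class IncomingEvent(
    val eventUuid: String,
    val runId: String,
    val sequence: Long,
    val payloadJson: String,
)

fun enqueueDerivedStateJob(runId: String, sequence: Long) { /* downstream worker, stubbed */ }

// Idempotent append: a replayed batch inserts zero rows, and the derived-state
// jobs it re-enqueues are no-ops because they're keyed by (runId, sequence).
fun appendBatch(conn: Connection, events: List<IncomingEvent>) {
    val sql = """
        INSERT INTO run_events (event_uuid, run_id, sequence, payload)
        VALUES (?::uuid, ?, ?, ?::jsonb)
        ON CONFLICT (event_uuid) DO NOTHING
    """.trimIndent()
    conn.prepareStatement(sql).use { stmt ->
        for (e in events) {
            stmt.setString(1, e.eventUuid)
            stmt.setString(2, e.runId)
            stmt.setLong(3, e.sequence)
            stmt.setString(4, e.payloadJson)
            stmt.addBatch()
        }
        stmt.executeBatch()
    }
    events.forEach { enqueueDerivedStateJob(it.runId, it.sequence) }
}
```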
This is unglamorous. It is also where most of the bugs live, because it's where most of the assumptions live. We learned to write tests that explicitly replay every batch from a fixture twice and assert byte-equal output between the two runs. Catching duplicate-tolerance regressions in CI is much cheaper than finding them in a user's monthly usage report where their counter has doubled, and a single duplicated billing event has a way of generating a support ticket that takes a half-day to chase down to its origin.
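The skeleton of one of those tests, with the harness helpers stubbed:

```kotlin
import org.junit.Assert.assertArrayEquals
import org.junit.Test

// Hypothetical harness helpers: load fixture batches, feed them through the
// real ingest handler, and dump all derived state as bytes.
fun loadFixtureBatches(name: String): List<ByteArray> = TODO("fixture loader")
fun applyBatch(batch: ByteArray): Unit = TODO("calls the events:batchAppend handler")
fun dumpDerivedState(): ByteArray = TODO("summaries + usage counters + review feed")

class ReplayDeterminismTest {
    @Test
    fun `replaying every batch leaves derived state byte-identical`() {
        val batches = loadFixtureBatches("runs/busy-week.json")

        batches.forEach(::applyBatch)
        val once = dumpDerivedState()

        batches.forEach(::applyBatch)     // replay everything a second time
        val twice = dumpDerivedState()

        assertArrayEquals(once, twice)    // byte-equal, not merely logically equal
    }
}
```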
The bug we lived with for six months
Early on, our reconnect logic was: if the device transitions from "no network" to "any network", schedule an upload. That worked on most phones, most of the time. It did not work for one specific failure mode that turned out to be common: captive-portal Wi-Fi networks that report "connected" before the user has actually authenticated through the portal page.
What would happen: phone joins café Wi-Fi, Android reports the network as available, our uploader fires, the TCP connection succeeds (it goes to the captive portal), the captive portal returns an HTTP 200 with a redirect HTML body, our client treats that as a successful upload, and the rows get sentAt stamped. Then the user authenticates, the real network comes up, and the events are never re-uploaded because we'd already marked them as sent. We discovered this when one user filed an abuse report that referenced a workflow run we had no record of.
The fix in retrospect is obvious: only mark sentAt on a response that contains the server's parsed acknowledgment payload — a JSON object with a received field listing the UUIDs the server accepted. We added that check, backfilled a one-time re-upload pass for any run with a server-side gap, and added a CI test that mocks a captive-portal response and asserts no events are marked sent. That test has caught two regressions since.
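A sketch of that check, assuming kotlinx.serialization for the parse; the received field is the real contract, the helper name is illustrative:

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

// The acknowledgment schema is part of the contract: a JSON object whose
// `received` field lists the UUIDs the server actually accepted.
@Serializable
data class BatchAck(val received: List<String>)

// A captive portal's 200 carries an HTML login page, not this JSON, so the
// parse fails and nothing gets marked sent.
fun confirmedUuids(status: Int, body: String): List<String> {
    if (status != 200) return emptyList()
    val ack = runCatching { Json.decodeFromString<BatchAck>(body) }.getOrNull()
    return ack?.received ?: emptyList()
}
```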
The lesson is not "captive portals are annoying", though they are. The lesson is that "the upload returned 200" is not the same statement as "the server has the data", and our queue cannot be allowed to confuse them. We rewrote three other request paths after this incident under the same principle: the client only believes the server when the server explicitly confirms what it has, and the schema of that confirmation is part of the contract, not a courtesy.
Ordering, or its absence
We made one early decision that's served us well: events within a run are ordered, events across runs are not. The only ordering the server enforces is that, for any single runId, sequence number n is processed before sequence number n+1. Across runs, batches can arrive in any order; a run that started two days ago can land after a run that started this morning, and the UI handles that gracefully because run summaries are sorted by their client-supplied start timestamp, not by ingest order.
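A sketch of that per-run gate, with the high-water-mark store reduced to an in-memory map for illustration:

```kotlin
// Per-run ordering gate: within a run, event n+1 is applied only after n.
// Cross-run order is deliberately unconstrained. The in-memory map stands in
// for whatever store holds the per-run high-water marks.
class RunSequenceGate(
    private val highWaterMarks: MutableMap<String, Long> = mutableMapOf(),
) {
    /** True if this event is next in line for its run; advances the mark. */
    fun admit(runId: String, sequence: Long): Boolean {
        val seen = highWaterMarks[runId] ?: -1L
        return when {
            sequence <= seen -> false          // duplicate or replay: already applied
            sequence == seen + 1 -> {          // next in line: apply and advance
                highWaterMarks[runId] = sequence
                true
            }
            else -> false                      // gap: hold until the missing event lands
        }
    }
}
```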
This matters because the alternative — global ordering, server-assigned timestamps, "logical clocks", and so on — is the kind of thing that sounds reasonable in a design doc and turns into a six-month migration when reality lands. Per-run ordering plus client-supplied wall-clock timestamps is enough to render a run trace correctly, and we've never wanted anything more. The bug categories we'd have inherited by going stricter (clock skew, sequence-gap retries, ordering buffers in the receiver) simply don't exist in our system. When clock skew shows up at all — usually because someone has their phone set to manual time and is off by hours — it manifests as a visibly weird timestamp in the UI, which is recoverable, rather than as a stuck queue, which isn't.
Testing the queue under failure
We have a small integration harness called queue-fuzz that boots the engine in-process with a fake network adapter and runs a Markov chain of network-state transitions: connected, disconnected, captive-portal, intermittent-loss, half-open. It fires through a workflow fixture under each transition pattern and asserts that, no matter what the network did, the server-side state at the end of the run matches the server-side state of a "clean" run with no network interference. Twelve hours of fuzzing each night in CI; a green run gets us to the morning standup, a red run gets a screenshot of the failing transition pattern in the team channel.
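A compressed sketch of the harness's core loop; the five states are the real ones, the transition weights and helpers are stand-ins:

```kotlin
import kotlin.random.Random

enum class Net { CONNECTED, DISCONNECTED, CAPTIVE_PORTAL, INTERMITTENT_LOSS, HALF_OPEN }

// Transition weights per state; each row sums to 1.0. These are stand-ins.
val transitions: Map<Net, List<Pair<Net, Double>>> = mapOf(
    Net.CONNECTED to listOf(Net.CONNECTED to 0.7, Net.DISCONNECTED to 0.1,
                            Net.CAPTIVE_PORTAL to 0.1, Net.INTERMITTENT_LOSS to 0.1),
    Net.DISCONNECTED to listOf(Net.DISCONNECTED to 0.5, Net.CONNECTED to 0.4, Net.HALF_OPEN to 0.1),
    Net.CAPTIVE_PORTAL to listOf(Net.CAPTIVE_PORTAL to 0.5, Net.CONNECTED to 0.5),
    Net.INTERMITTENT_LOSS to listOf(Net.INTERMITTENT_LOSS to 0.5, Net.CONNECTED to 0.5),
    Net.HALF_OPEN to listOf(Net.HALF_OPEN to 0.3, Net.DISCONNECTED to 0.4, Net.CONNECTED to 0.3),
)

fun step(rng: Random, from: Net): Net {
    var roll = rng.nextDouble()
    for ((next, p) in transitions.getValue(from)) {
        roll -= p
        if (roll <= 0) return next
    }
    return from
}

data class ServerState(val digest: String)   // end-of-run server state, summarized

// Boots the engine in-process with a fake network adapter driven by `pattern`.
fun runFixtureUnder(pattern: List<Net>): ServerState = TODO("harness entry point")

fun fuzzOnce(rng: Random) {
    val clean = runFixtureUnder(List(200) { Net.CONNECTED })          // reference run
    var state = Net.CONNECTED
    val pattern = List(200) { step(rng, state).also { state = it } }  // random walk
    check(runFixtureUnder(pattern) == clean) { "divergence under $pattern" }
}
```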
Two of the three nastiest bugs we shipped were caught by this harness before they hit a user. The third was a Heisenbug that only reproduced on a Pixel 6 with battery-saver enabled, and we caught it only after a beta user sent us a logcat. We added battery-saver state to the fuzz chain that week.
What we still don't do well
Two things, both honest.
First, we don't compress event bodies before upload. In principle the math favors compression: trace payloads for a long-running workflow can be tens of kilobytes each, and on a slow mobile network the upload time matters far more than the CPU we'd burn on gzip. But we've prototyped it twice and shipped neither version, because the measured wins were within the margin of error on our internal devices. The right next step is to instrument it on real users' phones with a tiny rollout, not to keep arguing about it in the abstract.
Second, the janitor pass that deletes rows older than 72 hours is naive: it scans the table on a timer. On phones with extremely long uptime and heavy use, that scan has been measured to take 300+ ms — not enough to block anything, but enough to show up in tracing. The fix is straightforward (index the table on sentAt and do an indexed range delete) and is on the next maintenance milestone.
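Sketched as a Room DAO, that fix is small; the index itself would go on the entity via indices = [Index("sentAt")], and the names here are illustrative:

```kotlin
import androidx.room.Dao
import androidx.room.Query

@Dao
interface JanitorDao {
    // With the entity declaring indices = [Index("sentAt")], this becomes an
    // indexed range delete over expired rows, not a timed full-table scan.
    @Query("DELETE FROM queued_events WHERE sentAt IS NOT NULL AND sentAt < :cutoff")
    suspend fun deleteSentBefore(cutoff: Long): Int
}

suspend fun sweep(dao: JanitorDao) {
    val cutoff = System.currentTimeMillis() - 72L * 60 * 60 * 1000   // the 72-hour retention window
    dao.deleteSentBefore(cutoff)
}
```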
What we'd do the same way next time
All of the rest of it. Writing the event to durable storage before doing anything else with it is the discipline that's saved us most often. Treating the server as a read-only mirror of phone-side truth, rather than the other way around, has made our abuse-investigation tooling much simpler than it would otherwise be — we always have the raw events to look at, and the derived state is always reconstructible from them.
Offline-first isn't a feature we ship. It's the only stance we can take given who our users are and where they are when they run their workflows. The queue is where that stance lives in code, and every line in it has been written under the assumption that the network is the last thing you can trust and the device is the first.
