
Why we chose Cloudflare Workers for kavachOS Cloud

A post-mortem on picking Workers over Node, AWS Lambda, and Fly. Cost math, latency wins, and the trade-offs that kept us up a few nights.


Gagan Deep Singh

Founder, GLINCKER

Published March 27, 2026 · 6 min read

We spent a month prototyping on four runtimes. Workers won, but it was closer than you would think. This is not a Cloudflare sponsorship post. It is a record of what we measured, what surprised us, and where we are still paying the cost of that decision.

Auth is a latency-sensitive workload. Every token validation or session check sits in the critical path of a user request. We knew from the start that geography mattered: an auth call that routes to us-east-1 from a user in Singapore adds 200ms before any application logic runs. That shaped everything about the evaluation.


01

The four candidates

Node on Render was the default we started from. We had a working prototype in about a week. Render is genuinely pleasant to operate: deploy from git, managed Postgres, decent observability out of the box. The limitation is single-region unless you pay for their global tier, and their auto-scaling story was inconsistent in testing. Cold starts after idle periods added about 800ms to the first request.

AWS Lambda with a Node runtime was the next candidate. Lambda is the obvious pick for a team that already runs things on AWS. The operational model is familiar, IAM gives you fine-grained control, and Lambda@Edge theoretically solves the geography problem. In practice, Lambda@Edge has a 1MB compressed deployment package limit, no persistent storage at the edge, and cold starts on the order of 300 to 500ms for Node runtimes. Not disqualifying, but not clean.

Fly Machines were compelling on paper. You get real VMs at the edge, persistent volumes, private networking between machines, and a straightforward pricing model. The developer experience is good. What we ran into was Fly's database story: Fly Postgres is managed by you, not by Fly. Replication lag across regions was something we would have had to instrument and manage ourselves. For an auth workload where stale session state is a security concern, that was a burden we did not want.

Cloudflare Workers was the fourth candidate. We already knew Hono worked well on Workers from a previous project. D1 for relational data, KV for session tokens, Durable Objects for per-tenant rate limiting. The primitives were a natural fit for multi-tenant auth.
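
To make the fit concrete, here is a minimal sketch of that shape, not kavachOS's actual code: Hono routing with a KV read on the token-validation hot path. The binding names and route are illustrative.

```ts
import { Hono } from "hono";

// Illustrative binding names; the real ones live in wrangler.toml.
type Bindings = {
  SESSIONS: KVNamespace; // session tokens, readable from every data center
  DB: D1Database;        // relational data: tenants, users, policies
};

const app = new Hono<{ Bindings: Bindings }>();

// Token validation, the operation the benchmarks below measure.
app.get("/v1/validate", async (c) => {
  const token = c.req.header("Authorization")?.replace("Bearer ", "");
  if (!token) return c.json({ valid: false }, 401);

  // Hot path: a single KV read served from the nearest location.
  const session = await c.env.SESSIONS.get(`session:${token}`, "json");
  if (!session) return c.json({ valid: false }, 401);

  return c.json({ valid: true, session });
});

export default app;
```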


02

What we measured

We ran 10,000 simulated auth calls from four regions: us-east, eu-west, ap-southeast, and us-west. The calls were token validations, the most common operation in a real deployment. We measured p50, p95, and cold start behavior.
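
For anyone reproducing the test, a nearest-rank percentile is enough to reduce the raw samples to the figures below. A sketch of that aggregation step (the load generator itself is omitted):

```ts
// Nearest-rank percentile over raw latency samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// latenciesMs holds one round-trip time per simulated call, per region.
declare const latenciesMs: number[];
console.log("p50:", percentile(latenciesMs, 50));
console.log("p95:", percentile(latenciesMs, 95));
```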

  • 34ms: Workers p95 globally (median across 4 regions, token validation)
  • 0ms: Workers cold start (V8 isolates boot in microseconds, not seconds)
  • 5x: cost reduction vs Lambda (at 10k auth calls per day, about $0.09/month on Workers vs $4 to $6 on Lambda)

The 34ms p95 is the number we kept coming back to. Lambda@Edge in the same test came in at 89ms p95, and that is the regional edge version. Standard Lambda from a single region was 240ms p95 for users in Asia Pacific. Node on Render was 310ms p95 globally on the free tier. Fly was better on p95 in its deployed regions at around 55ms, but only in the two regions we ran machines in.



03

The trade-offs we are still living with

Local development on Workers is imperfect. Wrangler's dev --local mode is close but not identical to production. D1 in local mode uses a SQLite file on disk, which is convenient until you hit a query that behaves differently under D1's actual execution plan. We have caught two bugs that only reproduced in production. Both were query planner issues. The debugging experience is slower than a local Postgres instance with full stack traces.

The bundle size ceiling is real. Workers has a 10MB compressed bundle limit. Keeping kavachOS's full dependency tree under that limit required some deliberate choices about what goes in the Worker versus what lives in a separate service. We had to move the SCIM payload validation library out of the main Worker and into a service binding. Not a crisis, but the constraint is there.
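
The service binding pattern, for anyone unfamiliar, looks roughly like this. The SCIM_VALIDATOR binding and service names here are illustrative, not our production names:

```ts
// wrangler.toml (illustrative):
//   [[services]]
//   binding = "SCIM_VALIDATOR"
//   service = "scim-validator"

interface Env {
  SCIM_VALIDATOR: Fetcher; // the sibling Worker holding the heavy library
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const payload = await req.text();
    // Worker-to-Worker call over the binding; it never leaves Cloudflare's network.
    return env.SCIM_VALIDATOR.fetch("https://validator.internal/scim/validate", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: payload,
    });
  },
};
```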

There are no filesystem APIs. Workers runs in a V8 isolate with no fs. If your auth code loads certificates from disk, you need to rethink that. We store all secrets in Workers Secrets, which works well, but it is a porting cost if you are moving an existing Node service.
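
The porting move is mechanical once you see it: key material comes in through a secret binding instead of the filesystem. A sketch, where JWT_SIGNING_KEY is an illustrative name set with wrangler secret put JWT_SIGNING_KEY:

```ts
interface Env {
  JWT_SIGNING_KEY: string; // injected by the runtime; no fs.readFile equivalent
}

export default {
  async fetch(_req: Request, env: Env): Promise<Response> {
    // Import the secret as a signing key instead of loading a PEM from disk.
    const key = await crypto.subtle.importKey(
      "raw",
      new TextEncoder().encode(env.JWT_SIGNING_KEY),
      { name: "HMAC", hash: "SHA-256" },
      false,
      ["sign", "verify"],
    );
    const signature = await crypto.subtle.sign(
      "HMAC",
      key,
      new TextEncoder().encode("payload-to-sign"),
    );
    return new Response(btoa(String.fromCharCode(...new Uint8Array(signature))));
  },
};
```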


04

Where Workers won decisively

The zero-ops story is genuine. We deploy with wrangler deploy, Cloudflare propagates the update to 300 data centers in about 30 seconds, and it is done. No ECS task definitions, no ALB health check windows, no blue-green deployment state to track. For a two-person team, that is a meaningful reduction in operational surface.

Cost at scale came out strongly in Workers' favor. Cloudflare charges $0.30 per million requests after the included 10 million. At 10,000 auth calls per day across a customer base, that is about $0.09 per month. Lambda at the same volume with provisioned concurrency to keep cold starts manageable costs closer to $4 to $6 per month. The gap widens as volume grows. See the adapters guide if you are deploying kavachOS on your own Workers setup rather than using our cloud.
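
The arithmetic behind that $0.09, for anyone checking the math:

```ts
// 10k calls/day priced at Workers' $0.30 per million requests.
const callsPerMonth = 10_000 * 30;              // 300,000 requests
const usd = (callsPerMonth / 1_000_000) * 0.30; // marginal price per month
console.log(usd.toFixed(2));                    // "0.09" -- and in practice the
// paid plan's included 10M requests would absorb this volume entirely.
```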

Global latency was the deciding factor. Auth is in the critical path of every logged-in request in your application. Shaving 200ms off token validation for a user in Singapore is not a vanity metric. It is 200ms off every page load that requires auth. The MCP OAuth endpoints benefit from this too: agent-to-agent auth flows that chain multiple token exchanges are noticeably faster from any geography.


05

What we would do differently

We would evaluate D1 more carefully before committing. D1 has improved significantly in the past year, but it is still maturing. If we were starting today we would prototype with Neon's serverless driver over HTTP as an alternative, given its read replica story and the absence of query planner surprises. D1 has been fine in production, but the debugging gap cost us time early on.
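
What that alternative would look like: Neon's serverless driver speaks HTTP, so it runs inside a Worker without TCP sockets. A hedged sketch; the connection string binding and query are illustrative:

```ts
import { neon } from "@neondatabase/serverless";

export default {
  async fetch(_req: Request, env: { DATABASE_URL: string }): Promise<Response> {
    // Each query is a single HTTPS round trip; no connection pool to manage.
    const sql = neon(env.DATABASE_URL);
    const rows = await sql`select id, tenant_id from sessions limit 10`;
    return Response.json(rows);
  },
};
```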

We would also write the local development environment setup documentation before onboarding the first external contributor, not after. The wrangler setup with local D1 and KV bindings has a few non-obvious steps. We learned this when the first contributor opened a PR and could not get the dev server to connect to the local database.
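
For anyone who hits the same wall, the bindings a contributor needs before wrangler dev --local will connect look roughly like this (names and IDs are illustrative):

```ts
// wrangler.toml (illustrative):
//   [[d1_databases]]
//   binding = "DB"
//   database_name = "kavach-dev"
//   database_id = "<from: wrangler d1 create kavach-dev>"
//
//   [[kv_namespaces]]
//   binding = "SESSIONS"
//   id = "<from: wrangler kv namespace create SESSIONS>"
//
// `wrangler dev --local` then backs DB with a SQLite file on disk and SESSIONS
// with a local store; close to production, but not identical (see section 03).

// The Env type the Worker compiles against must match those bindings:
interface Env {
  DB: D1Database;
  SESSIONS: KVNamespace;
}
```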

The short version: Workers was the right call for auth specifically because geography matters and the cold start problem is real. If you are building something where all your users are in one region and you already run AWS infrastructure, Lambda is a reasonable choice and the operational familiarity is worth something. For a global auth platform from day one, the decision was not close once we saw the latency numbers. Read the quickstart guide if you want to see the final architecture in practice.

Topics

  • #Cloudflare Workers
  • #edge compute
  • #kavachOS architecture
  • #serverless auth
  • #Workers vs Lambda
