How did Coinbase migrate to Temporal Cloud without downtime?

Coinbase used two migration strategies depending on workflow type. For short-lived workflows, they used drain-and-switch: stop new workflow starts on the old cluster, wait for existing workflows to drain, then flip traffic to Temporal Cloud. For critical long-lived workflows where pausing was not an option, they used the dual Worker strategy: a multi-client manager routes a configurable percentage of new workflows to Temporal Cloud while Workers on both clusters continue executing until the old cluster fully drains. This allowed zero planned downtime on critical trading and payment paths.

What is the dual Worker migration strategy for Temporal?

The dual Worker strategy lets a single application call route traffic across two Temporal clusters simultaneously using a multi-client manager. Constructed with a primary client (Temporal Cloud) and a secondary client (the legacy cluster), the manager accepts a configurable routing percentage (cutoverPct) to gradually shift new workflow starts to the new cluster while existing workflows drain from the old one. Workers run on both clusters throughout the transition. The application code requires minimal changes — it constructs the multi-client manager in place of the standard Temporal client.
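The source does not say how the routing percentage is applied to each start. A minimal sketch, assuming the decision is made by hashing the workflow ID into 100 buckets so that a given ID always lands on the same cluster and raising cutoverPct only ever moves IDs toward the new cluster. The names `bucketOf` and `routeFor` are hypothetical and not part of Coinbase's manager.

```typescript
// Hypothetical sketch: names and hashing scheme are assumptions, not Coinbase's code.
type Target = "primary" | "secondary";

// FNV-1a-style 32-bit hash over the ID's UTF-16 code units, reduced to a bucket in [0, 100).
function bucketOf(workflowId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < workflowId.length; i++) {
    hash ^= workflowId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash % 100;
}

// Route to the primary (new) cluster when the ID's bucket falls below cutoverPct.
function routeFor(workflowId: string, cutoverPct: number): Target {
  if (cutoverPct < 0 || cutoverPct > 100) {
    throw new RangeError("cutoverPct must be between 0 and 100");
  }
  return bucketOf(workflowId) < cutoverPct ? "primary" : "secondary";
}
```

With this scheme, a cutoverPct of 0 sends every new start to the legacy cluster and 100 sends every new start to Temporal Cloud, which also gives a simple rollback lever.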

What is the drain-and-switch migration strategy for Temporal?

Drain-and-switch is the simplest Temporal cluster migration pattern. The new cluster is provisioned but receives no traffic. New workflow starts are stopped on the old cluster. Existing workflows are allowed to drain to completion. Once the old cluster is empty, traffic is flipped to the new cluster. It works well for short-lived workflows or any namespace that can tolerate a brief maintenance window, requiring very few code changes — often just a configuration update.
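The sequence above can be sketched as a small state machine. This is an illustration only: the `ClusterState` shape and the idea of reading an open-workflow count from the old cluster are assumptions, not details from the talk.

```typescript
// Hypothetical sketch of the drain-and-switch sequence; field names are assumptions.
type Phase = "steady" | "draining" | "switched";

interface ClusterState {
  phase: Phase;
  acceptingStartsOnOld: boolean;
  openWorkflowsOnOld: number; // assumed to come from a count of open executions
}

function nextState(s: ClusterState): ClusterState {
  switch (s.phase) {
    case "steady":
      // Stop new workflow starts on the old cluster and begin draining.
      return { ...s, phase: "draining", acceptingStartsOnOld: false };
    case "draining":
      // Flip traffic only once the old cluster has no open workflows left.
      return s.openWorkflowsOnOld === 0 ? { ...s, phase: "switched" } : s;
    case "switched":
      // Terminal state: traffic already points at the new cluster.
      return s;
  }
}
```

The time spent in the `draining` phase is the maintenance window, which is why the pattern suits short-lived workflows.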

Why did Coinbase migrate from self-hosted Temporal to Temporal Cloud?

Coinbase ran a single shared Temporal cluster on a custom in-house persistence layer. As more teams and workloads piled in, four problems compounded: (1) unpredictable P95 latency and noisy-neighbor risk under mixed workloads; (2) operational overhead that was making the cluster its own internal product; (3) blast radius — one cluster failure affected all workloads with no easy per-namespace tuning; (4) insufficient security and compliance controls around service identity, access control, auditability, and payload encryption.

How did Coinbase secure its Temporal Cloud deployment?

Coinbase secured Temporal Cloud across three layers. Client-to-cluster communication uses a private link so traffic never traverses the public internet, mTLS for authentication, and certificate filters to restrict which services can reach each namespace. Developer access uses LDAP group approval for namespace provisioning and a custom consensus admin service requiring peer approval before any workflow operation executes in production. Payload security uses the Temporal data converter to encrypt all payloads, with teams choosing between a centralized codec server (faster to migrate, small latency addition) or an in-pod codec running entirely in memory (no added latency, no single point of failure).

How long did Bitovi's engagement with Coinbase take?

The Bitovi engagement was a year-long, white-glove migration. Bitovi embedded engineers directly onto each Coinbase service team, handling communication, Slack channel setup, Jira and Confluence project management, runbooks, dashboards, and incident playbooks — all agreed upon before cutover began. The migration covered 300+ namespaces across 300+ services.

What workloads does Coinbase run on Temporal?

Coinbase runs a wide range of workloads on Temporal: trading and payment flows (short-lived, bursty, critical), settlements and payouts (hours to days, critical), KYC refresh and travel rule checks (audit-heavy, regulatory), GDPR deletion workflows (long-tail, regulatory), ETL and backfill jobs (hours to weeks), ML-driven notifications (seconds), and infrastructure provisioning automation (minutes).

What are the key lessons from migrating hundreds of services to Temporal Cloud?

Five takeaways from the Coinbase migration: (1) Design for zero planned downtime and easy rollback from day one — good canaries, routing controls, and rollback paths are invaluable under real production conditions. (2) Treat security as a first-class requirement, not an afterthought. (3) Start with drain-and-switch; reach for dual Workers only where you genuinely need continuous operation. (4) Invest early in shared runbooks, dashboards, and dedicated migration time — those assets pay off on every service. (5) Migrations are about trust as much as technology — being transparent and data-driven with service teams matters as much as the mechanics.

Case Study

Migrating 300+ namespaces to Temporal Cloud

Bitovi and Coinbase migrated 300+ namespaces across 300+ services off a custom self-hosted Temporal cluster and onto Temporal Cloud, with zero planned downtime on critical trading and payment paths and a rollback strategy available at every stage.

This content is adapted from a joint talk given at Temporal Replay.

Temporal · Cloud Migration · Infrastructure · Security

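// migration.config.ts

const manager = NewMultiClientManager(
  WithPrimaryClient(cloud),    // new cluster: Temporal Cloud
  WithSecondaryClient(legacy), // old cluster: legacy self-hosted
  WithCustomExecutionStrategy(
    PercentRouting(cfg.cutoverPct), // share of new workflow starts routed to the primary
  ),
);

// Same call the application would make on a standard Temporal client.
manager.ExecuteWorkflow(ctx, opts, MyWorkflow);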

cutoverPct: 72% · errors: 0 · stranded: 0

Coinbase's Mission

Increasing economic freedom in the world

Coinbase is a secure global crypto platform for buying, selling, transferring, and storing digital assets. Security and infrastructure sit at the center of its mission: protecting users' assets, data, and privacy depends directly on the reliability and security of the Temporal-based workflows running behind the scenes.

Consumer & Wallet

Buy, sell, transfer, store digital assets.

Coinbase Prime

Trading and custody for institutions.

Developer Tools

APIs and SDKs for the platform.

Base

Coinbase's Ethereum L2 network.

Temporal at Coinbase

The shared layer underneath the platform

Temporal sits underneath a wide range of Coinbase systems. It powers core business workflows like trading, payments and payout flows, settlements, and back-office jobs that need to be reliable and auditable.

On the compliance and regulatory side, it runs workflows for GDPR deletion, travel rule checks, KYC refresh, and other audit-heavy flows where correctness and traceability are non-negotiable. It is also used heavily for internal platform and data workflows, including ETLs and backfills, ML-driven notifications, and automation around infrastructure provisioning and operations.

| Workflow class | Shape | Risk |
| --- | --- | --- |
| Trading & payments | Short-lived, bursty | Critical |
| Settlements & payouts | Hours to days | Critical |
| KYC / Travel rule | Audit-heavy | Regulatory |
| GDPR deletion | Long tail | Regulatory |
| ETL & backfills | Hours to weeks | Recoverable |
| ML notifications | Seconds | Recoverable |
| Infra provisioning | Minutes | Operational |

Why Temporal Cloud

Four pressure points the custom stack couldn't keep absorbing

Coinbase originally ran a single shared Temporal cluster backed by a custom, in-house persistence layer. Over time, more independent teams and workloads piled into the one cluster, amplifying the usual problems.

Scaling & Performance

P95 latency became unpredictable under mixed workloads, and noisy neighbors across teams were a constant risk.

Operational Overhead

Running a single Temporal cluster at this scale on-prem was on track to become its own Coinbase product, and upgrades to custom components were difficult.

Blast Radius

One cluster to rule them all, and one cluster to be very nervous about: a single failure could affect every workload, and there was no easy way to tune the server for the unique traffic patterns of specific namespaces.

Security & Compliance

Coinbase needed stronger service identity, access control, auditability, and encrypted workflow data, and the custom stack made each of those changes difficult and risky.

What Needs Securing

Security was a first-class requirement, not an afterthought

The work broke down into three categories: how clients and clusters talk to each other, how developers interact with the system, and how the payloads themselves are handled.

Traffic between Coinbase services and Temporal Cloud should never flow over the public internet, and the connection should be one-way from services to Temporal Cloud. Only an allow-list of services should be able to talk to each namespace.

All services communicate with Temporal Cloud over a private link, establishing a private connection between Coinbase's VPC and Temporal Cloud so no data flows over the public internet. Temporal Cloud authenticates with mTLS out of the box, and Coinbase layered certificate filters on top so only approved services could reach a given namespace.

Coinbase VPC → private link · mTLS → Temporal Cloud
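To make the allow-list idea concrete, here is a minimal sketch of matching a client certificate's subject against per-namespace filters. It is a simplified model for illustration, not Temporal Cloud's implementation: the two subject fields, the exact-match rule, and the deny-by-default behaviour for namespaces with no filters are all assumptions, as are the names `matches` and `isAllowed`.

```typescript
// Hypothetical, simplified model of per-namespace certificate filters.
interface CertSubject {
  commonName: string;
  organizationalUnit?: string;
}

interface CertFilter {
  commonName?: string;
  organizationalUnit?: string;
}

// A certificate matches a filter when every field the filter specifies is equal.
function matches(cert: CertSubject, filter: CertFilter): boolean {
  const cnOk = filter.commonName === undefined || filter.commonName === cert.commonName;
  const ouOk =
    filter.organizationalUnit === undefined ||
    filter.organizationalUnit === cert.organizationalUnit;
  return cnOk && ouOk;
}

// A service may reach a namespace when at least one of that namespace's filters matches.
// Namespaces with no filters configured are denied in this sketch.
function isAllowed(
  cert: CertSubject,
  namespace: string,
  filtersByNamespace: Map<string, CertFilter[]>,
): boolean {
  const filters = filtersByNamespace.get(namespace) ?? [];
  return filters.some((f) => matches(cert, f));
}
```

The filter check sits on top of mTLS: the certificate must first be valid, and only then is its subject compared against the namespace's list.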

Developers need to interact with the cluster securely: provisioning namespaces with the right certificates, and starting, stopping, resetting, or otherwise operating on workflows during incident response. In production, the default Temporal CLI is not enabled for security reasons.

Namespace provisioning was integrated with LDAP groups and an access portal, so getting access to Temporal Cloud required approval to join the right group. For interacting with workflows, the team built and maintains a Temporal admin service with a consensus model: a developer raises a request to perform an operation and another developer approves it, so no single person can unilaterally stop or reset a workflow.

1. Request → 2. LDAP check → 3. Peer approval → 4. Execute
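The consensus rule can be sketched as two small checks. This is not the admin service's actual code: the request shape, the function names, and the use of a plain set of group members in place of an LDAP lookup are assumptions made for illustration.

```typescript
// Hypothetical sketch of a two-person approval rule; shapes and names are assumptions.
interface OperationRequest {
  operation: "start" | "stop" | "reset";
  workflowId: string;
  requestedBy: string;
  approvedBy?: string;
}

// Record an approval, rejecting self-approval and approvers outside the group.
function approve(
  req: OperationRequest,
  approver: string,
  groupMembers: Set<string>,
): OperationRequest {
  if (!groupMembers.has(approver)) {
    throw new Error("approver is not in the required group");
  }
  if (approver === req.requestedBy) {
    throw new Error("requester cannot approve their own request");
  }
  return { ...req, approvedBy: approver };
}

// An operation may run only with a distinct, authorized requester and approver.
function canExecute(req: OperationRequest, groupMembers: Set<string>): boolean {
  return (
    req.approvedBy !== undefined &&
    req.approvedBy !== req.requestedBy &&
    groupMembers.has(req.requestedBy) &&
    groupMembers.has(req.approvedBy)
  );
}
```

Checking the rule again in `canExecute`, rather than trusting the stored approval, keeps a tampered or stale request from slipping through at execution time.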

Temporal Cloud persists payloads in order to maintain things like event history. Coinbase uses the data converter option in the Temporal client to encrypt payloads as they flow in and out of the cluster.

Traditionally, that encoding runs in memory inside the Worker pod, but to ease the initial migration the team stood up a centralized codec server. That added a small amount of latency that was acceptable for most teams; latency-sensitive teams, or teams that flagged the centralized codec server as a single point of failure, were offered a more traditional codec model where the certs are loaded into the Worker pod and the encoding runs entirely in memory there.

Centralized codec: shared server; faster to migrate; adds a small amount of latency.

In-pod codec: certs on the Worker; in-memory encoding; no added latency.
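Either model wraps the same idea: payloads are transformed on the way into the cluster and reversed on the way out. The sketch below shows that encode/decode shape only. The `Payload` and `Cipher` types are simplified stand-ins defined here, the functions are synchronous for brevity, and the XOR cipher is a placeholder so the example runs; it is not encryption, and a real codec would use an authenticated cipher and would also protect the metadata.

```typescript
// Hypothetical sketch of a payload codec; types and names are simplified assumptions.
interface Payload {
  metadata: Record<string, string>;
  data: Uint8Array;
}

interface Cipher {
  encrypt(plain: Uint8Array): Uint8Array;
  decrypt(encrypted: Uint8Array): Uint8Array;
}

const ENCRYPTED = "binary/encrypted";

function makeCodec(cipher: Cipher) {
  return {
    // Outbound: encrypt the data and remember the original metadata.
    encode(payloads: Payload[]): Payload[] {
      return payloads.map((p) => ({
        metadata: { encoding: ENCRYPTED, original: JSON.stringify(p.metadata) },
        data: cipher.encrypt(p.data),
      }));
    },
    // Inbound: reverse the transform; payloads this codec did not encode pass through.
    decode(payloads: Payload[]): Payload[] {
      return payloads.map((p) =>
        p.metadata["encoding"] === ENCRYPTED
          ? {
              metadata: JSON.parse(p.metadata["original"] ?? "{}") as Record<string, string>,
              data: cipher.decrypt(p.data),
            }
          : p,
      );
    },
  };
}

// Placeholder cipher for the demo only. XOR with a fixed byte is NOT secure.
function xorCipher(key: number): Cipher {
  const flip = (bytes: Uint8Array): Uint8Array => bytes.map((b) => b ^ key);
  return { encrypt: flip, decrypt: flip };
}
```

In the in-pod model a codec like this would run inside the Worker process with the key material loaded locally; in the centralized model the same transform would sit behind a shared service, which is where the extra network hop and latency come from.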

Building Trust

A lot of the difficulty is operational rather than technical

It comes down to coordinating with teams, building trust, and earning the room to make the changes you need to make. Coinbase and Bitovi ran what they call white-glove migrations: an engineer was embedded onto each service team. That engineer was responsible for communicating what the migration entailed, standing up dedicated Slack channels for the work, and using Jira and Confluence for project management and documentation.

Rollout plans were templatized and the team aligned on patterns up front, so engineers could step in and out of migrations without losing momentum. Each team got a shared set of runbooks, dashboards, and incident playbooks, agreed upon before the migration even began. The team also deliberately chose early high-visibility wins, which proved out the tooling and gave them concrete successes to point to whenever a service team got nervous later on.

Observability equals transparency, and transparency is what builds trust in a migration process.

The goal was to make the answer to "is it safe to move to the next step?" a data-driven decision rather than a gut call.

Embedded engineer: on every service team
Runbook & playbook: agreed before cutover
Dashboards: latency, errors, queues
Reconcile scripts: no duplicates, no stranded workflows
Trust earned: then accelerate
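The talk does not show the reconcile scripts themselves. A minimal sketch of what such a check might compute, assuming a "duplicate" is a workflow ID open on both clusters and a "stranded" workflow is an ID the application expects to be open but that appears on neither cluster. The function name and inputs are hypothetical.

```typescript
// Hypothetical sketch of a reconcile check; definitions of duplicate and stranded are assumptions.
interface ReconcileReport {
  duplicates: string[]; // open on both the old and the new cluster
  stranded: string[];   // expected to be open, but found on neither cluster
}

function reconcile(expected: string[], onOld: string[], onNew: string[]): ReconcileReport {
  const oldSet = new Set(onOld);
  const newSet = new Set(onNew);
  const duplicates = Array.from(oldSet)
    .filter((id) => newSet.has(id))
    .sort();
  const stranded = Array.from(new Set(expected))
    .filter((id) => !oldSet.has(id) && !newSet.has(id))
    .sort();
  return { duplicates, stranded };
}
```

An empty report on both counts is the kind of evidence that turns "is it safe to move to the next step?" into a data-driven decision.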

The Migration Journey

A deliberate staging phase

Coinbase didn't go straight from the custom self-hosted setup to Temporal Cloud. The journey ran through a staging ground, with real migrations at each step to prove out the patterns before moving to Cloud.

Start: Custom self-hosted Temporal. Durable execution, but scaling, reliability, and security problems were getting worse.

Phase 1: Aurora-backed staging clusters. Multiple Coinbase-operated clusters acted as a staging ground to validate capacity, isolation, and runbooks.

Phase 2: Migration to Temporal Cloud. The same drain-and-switch and dual-Worker patterns proven in Phase 1, now pointed at Cloud.

Phase 3: Temporal Cloud. Managed infrastructure, stronger isolation between namespaces, and scaling and security wins delivered.

Migration Approach 01 · Drain and Switch

The strategy most people picture

Drain and switch is the strategy most people picture when they imagine switching Temporal clusters. The strengths are exactly what they sound like: a very simple mental model, very little infrastructure change, and very little impact on customer code, often just a couple of config values. It works well when workflows are short-lived or can drain inside a maintenance window.

Step 01 of 04: Day-1 steady state

The application starts the day running workflows on the old cluster. New workflows route there as they always have, and existing ones keep executing.

The new cluster is provisioned but receives no traffic.

Migration Approach 02 · Dual Workers

When you can't pause, run both clusters at once

For namespaces running critical, long-lived workflows, Coinbase couldn't pause new workflow creation for long, and some workflows take a very long time to drain naturally. The dual-Worker strategy handles those cases by letting Temporal Clients and Workers talk to both the old cluster and the new one.

Part 01 of 07: One API call, on the surface

The application calls StartWorkflowExecution into the SDK.

From the application point of view, nothing really changes. It still makes a single API call. The multi-client manager presents the same interface as a native Temporal client.

Adopting it is mostly a matter of constructing it with WithPrimaryClient and WithSecondaryClient, configuring the execution strategy, and calling ExecuteWorkflow the normal way.
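A rough sketch of how a manager can present the same interface as a single client while delegating to one of two. Everything here is illustrative: the `WorkflowStarter` interface is a simplified, synchronous stand-in for a Temporal client, `fakeClient` exists only so the example runs, and the random-draw `percentRouting` strategy is an assumption about how a percentage could be applied.

```typescript
// Hypothetical sketch of a multi-client facade; not Coinbase's implementation.
type Target = "primary" | "secondary";

interface StartOptions {
  workflowId: string;
  taskQueue: string;
}

// Simplified stand-in for the part of a Temporal client the application uses.
interface WorkflowStarter {
  executeWorkflow(opts: StartOptions, workflowType: string): string;
}

type Strategy = (opts: StartOptions) => Target;

// Send roughly cutoverPct percent of starts to the primary; rand is injectable for tests.
function percentRouting(cutoverPct: number, rand: () => number = Math.random): Strategy {
  return () => (rand() * 100 < cutoverPct ? "primary" : "secondary");
}

class MultiClientManager implements WorkflowStarter {
  private readonly primary: WorkflowStarter;
  private readonly secondary: WorkflowStarter;
  private readonly strategy: Strategy;

  constructor(primary: WorkflowStarter, secondary: WorkflowStarter, strategy: Strategy) {
    this.primary = primary;
    this.secondary = secondary;
    this.strategy = strategy;
  }

  // Same signature as the single-client call; the routing decision stays internal.
  executeWorkflow(opts: StartOptions, workflowType: string): string {
    const client = this.strategy(opts) === "primary" ? this.primary : this.secondary;
    return client.executeWorkflow(opts, workflowType);
  }
}

// In-memory fake that reports which cluster received the start.
function fakeClient(name: string): WorkflowStarter {
  return {
    executeWorkflow: (opts, workflowType) => `${name}:${workflowType}:${opts.workflowId}`,
  };
}
```

Because the manager implements the same interface as the clients it wraps, application code that depends on `WorkflowStarter` does not need to know a migration is in progress.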

Lessons & Takeaways

Five things to bring with you on the next migration

Design for no planned downtime

Aim for no planned downtime on critical paths, and build in an easy rollback from day one. Even when a migration looks simple on paper, good canaries, routing controls, and rollback paths are invaluable under real production conditions.

Treat security as a first-class requirement

For Coinbase that meant strong identity, private network paths, encrypted payloads, and controlled operator access by way of an admin service with consensus-based reviews.

Start with the simplest strategy

Drain and switch works well where you can drain quickly and accept a brief pause. Reach for dual Workers and more advanced routing only where you genuinely need them.

Invest early in tooling

Invest in runbooks, dashboards, and dedicated migration time. Those shared assets pay off on every migration and build trust with the product teams.

Migrations are about trust as much as technology

Being transparent, data-driven, and present with the service teams during cutover matters just as much as the mechanics of any strategy.
