Handling Cross-Shard Transactions in Vitess

When a single write must touch rows on more than one shard, Vitess has to turn what MySQL treats as one local transaction into a coordinated distributed commit — and getting that coordination wrong strands row locks, leaves partial commits, and starves connection pools.

Where This Fits

A cross-shard transaction is a write path problem that sits directly on top of the VTGate routing layer: the same router that resolves a SELECT to one shard also decides, at COMMIT time, whether a multi-statement transaction stayed on a single shard or fanned across several. The physical layout it dispatches across — key ranges, tablet roles, replica hierarchy — comes from Vitess Sharding Architecture & Topology Design, and whether a given write can stay single-shard is largely decided by your keyspace partitioning model and vindex choices. This page is the narrow, high-stakes case: the write genuinely cannot be kept on one shard, so VTGate must run a distributed commit protocol and you must operate it.

The default and correct answer for most workloads is to avoid this path — keep writes single-shard by design. Reach for the two-phase commit machinery only when atomicity across shards is a hard business requirement (a funds transfer that debits one customer’s shard and credits another), not a convenience.

How Vitess Coordinates a Cross-Shard Commit

Under the default --transaction_mode=MULTI, VTGate will happily let a transaction touch several shards and commit them best-effort, one shard at a time. If VTGate dies between shard commits, you get a partial commit and no automatic repair, which is acceptable only when the application can tolerate or reconcile it.

Atomic cross-shard commit requires --transaction_mode=TWOPC, which turns on Vitess’s distributed transaction manager. It implements a two-phase commit in the spirit of the XA model, but driven by Vitess’s own coordinator and state tables rather than MySQL XA statements. One shard is designated the metadata manager (MM) and durably records the transaction’s intent; the others act as resource managers (RMs). On COMMIT, VTGate assigns a distributed transaction ID (dtid), writes the participant list and a PREPARE-state record to the MM’s _vt.dt_state and _vt.dt_participant tables, asks every shard to PREPARE (write its redo log and hold its row locks), and only then broadcasts the real COMMIT. If any shard fails to prepare, the whole transaction rolls back; if the coordinator dies after the decision is durable, a background resolver replays it to completion.

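The sequence has two branches. On the happy path, every shard prepares, the decision becomes durable, and the commits go out. If any shard fails to prepare, a compensating rollback releases the shards that already prepared; this is the branch that prevents orphaned PREPARED states from stranding row locks.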
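To make the two branches concrete, here is a minimal in-memory sketch of the flow described above. The Shard and Coordinator classes are hypothetical stand-ins invented for illustration, not Vitess's implementation, and the model is simplified so that every shard prepares, as the description states.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical in-memory stand-in for one participating shard."""
    name: str
    fail_prepare: bool = False
    state: str = "OPEN"  # OPEN -> PREPARED -> COMMITTED, or ROLLED_BACK

    def prepare(self) -> bool:
        if self.fail_prepare:
            return False
        self.state = "PREPARED"  # redo log written, row locks held
        return True

    def commit(self) -> None:
        self.state = "COMMITTED"

    def rollback(self) -> None:
        self.state = "ROLLED_BACK"  # locks released


@dataclass
class Coordinator:
    """Toy coordinator; dt_state mimics the MM's per-dtid state record."""
    dt_state: dict = field(default_factory=dict)

    def commit(self, dtid: str, shards: list[Shard]) -> str:
        # Record intent before asking anyone to prepare.
        self.dt_state[dtid] = "PREPARE"
        # Phase 1: every shard must prepare.
        for shard in shards:
            if not shard.prepare():
                # Compensating rollback: release every shard, including
                # those that already prepared, so no locks are stranded.
                self.dt_state[dtid] = "ROLLBACK"
                for s in shards:
                    s.rollback()
                del self.dt_state[dtid]
                return "ROLLBACK"
        # Decision point: once recorded, the outcome is final.
        self.dt_state[dtid] = "COMMIT"
        # Phase 2: broadcast the real commit.
        for shard in shards:
            shard.commit()
        del self.dt_state[dtid]  # concluded: nothing left to resolve
        return "COMMIT"
```

In this toy model a crash between the decision point and the final delete would leave a COMMIT record behind, which is the kind of leftover a resolver would replay to completion.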

Enabling Two-Phase Commit

Two-phase commit is opt-in and has a real latency and lock-hold cost, so enable it deliberately and scope it to the keyspaces that need it.

1. Turn on TWOPC at the router. Set the transaction mode on every VTGate; a mixed fleet where some routers default to MULTI gives you non-atomic commits under the same client code:

vtgate \
  --transaction_mode TWOPC \
  --grpc_max_message_size 67108864 \
  # ...other flags

2. Give the tablets a prepared-transaction watchdog. Each VTTablet needs the abandoned-transaction resolver enabled so that a dtid left PREPARED by a dead coordinator is retried rather than parked forever holding locks:

vttablet \
  --twopc_enable \
  --twopc_abandon_age 1h \
  # ...other flags

--twopc_abandon_age is how long a distributed transaction may sit unresolved before the resolver forces it to a conclusion. Set it comfortably above your worst-case commit latency but low enough that stuck locks do not outlive an on-call response window.
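The sizing rule above can be written down as a small sanity check. This is only a sketch: the function name, the safety_factor default, and the idea of expressing everything in seconds are assumptions for illustration, not anything Vitess provides.

```python
def check_abandon_age(abandon_age_s: float,
                      worst_commit_latency_s: float,
                      oncall_window_s: float,
                      safety_factor: float = 10.0) -> list:
    """Return warnings for a candidate --twopc_abandon_age, in seconds.

    safety_factor is an assumed margin for "comfortably above" the
    worst-case commit latency; tune it to your own tolerance.
    """
    warnings = []
    if abandon_age_s < worst_commit_latency_s * safety_factor:
        warnings.append("too low: may resolve transactions that are merely slow")
    if abandon_age_s > oncall_window_s:
        warnings.append("too high: stuck locks can outlive the on-call window")
    return warnings
```

For example, with a 2-second worst-case commit and a 15-minute on-call window, a 5-minute abandon age passes both checks, while 1 hour trips the second.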

3. Let the client use ordinary transactions. The application still speaks plain MySQL: on a TWOPC-enabled router, a normal transaction that happens to span shards is coordinated atomically, with no special SQL. Because VTGate speaks the MySQL wire protocol, any MySQL driver works unmodified, including drivers that implement the Python Database API Specification v2.0 such as PyMySQL:

import pymysql

conn = pymysql.connect(host="vtgate.internal", port=3306, db="commerce")

def transfer(from_customer: int, to_customer: int, amount: int) -> None:
    # These two rows may live on different shards. Under TWOPC, VTGate
    # runs a 2PC across both participants; the debit and credit either
    # both land or neither does.
    try:
        with conn.cursor() as cur:
            cur.execute("BEGIN")
            cur.execute(
                "UPDATE accounts SET balance = balance - %s WHERE customer_id = %s",
                (amount, from_customer),
            )
            cur.execute(
                "UPDATE accounts SET balance = balance + %s WHERE customer_id = %s",
                (amount, to_customer),
            )
            cur.execute("COMMIT")
    except pymysql.MySQLError:
        # A failed PREPARE surfaces here as a driver error. Roll back so the
        # session is clean, then let the caller retry the whole logical
        # operation; never re-send a bare COMMIT for a dead transaction.
        conn.rollback()
        raise

Keep the retry idempotent and bounded: because the router is stateless any VTGate can serve the retry, but an unbounded retry loop against a shard set that is already failing to prepare simply amplifies lock contention.
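A bounded retry could look like the sketch below. The helper name retry_bounded, its defaults, and the full-jitter backoff are illustrative choices, not part of any Vitess or PyMySQL API; the sleep parameter is injectable only so the helper can be exercised without real delays.

```python
from __future__ import annotations

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_bounded(op: Callable[[], T],
                  retryable: tuple[type[BaseException], ...],
                  max_attempts: int = 3,
                  base_delay_s: float = 0.05,
                  sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying on `retryable` errors at most max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except retryable:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the error to the caller
            # Exponential backoff with full jitter, so retries from many
            # clients do not hit the struggling shards in lockstep.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

A call such as retry_bounded(lambda: transfer(1, 2, 100), (pymysql.MySQLError,)) would wrap the function above. Note that the retry is only safe if the operation itself is idempotent: if a COMMIT succeeded but its acknowledgement was lost, re-running a plain debit and credit would apply them twice, so a real transfer would need something like an idempotency key recorded in the same transaction.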

Edge Cases and Gotchas

PREPARED transactions strand row locks. If a PREPARE succeeds on some shards and the coordinator then dies (or a network partition isolates the MM), those shards hold their locks until the resolver or --twopc_abandon_age fires. A too-high abandon age turns a brief blip into a lock pileup, because every later write to those rows queues behind the prepared transaction; a too-low one can prematurely resolve a transaction that was merely slow.
A reparent mid-transaction changes the participant. If VTOrc promotes a new primary on a participating shard between PREPARE and COMMIT, the redo log written by the old primary must be present on the new one. This is why every TWOPC participant must be running with durable, semi-sync replication — an async replica promoted after losing the prepared log breaks atomicity silently.
Scatter writes are not the same as atomic writes. A DELETE with no sharding-key predicate fans out to every shard, but under MULTI mode it is not atomic. Do not assume a multi-shard statement is transactional just because it ran; only TWOPC makes it so.
DDL overlapping a live dtid deadlocks. A schema change entering cutover takes metadata locks that collide with a prepared transaction’s held row locks. Sequence Online DDL orchestration so cutovers do not overlap active distributed commits — see coordinating multi-shard schema migrations for the windowing pattern.
ER_LOCK_DEADLOCK amplifies across shards. Local deadlocks that MySQL would resolve in milliseconds become distributed stalls when the lock waiters are spread over shards holding prepared state. Order your writes by a stable key to reduce cross-shard lock cycles.
Cross-tenant leakage during resolution. In a multi-tenant keyspace, a manually resolved transaction must respect tenant boundaries at the vindex layer; a lookup vindex that maps a secondary key to the wrong shard can route a compensating write to a neighbouring tenant’s rows.
--grpc_max_message_size truncation. Large multi-shard result sets or redo logs can exceed the default gRPC message size and abort the commit mid-flight; size it to your largest expected transaction.
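The advice to order writes by a stable key can be sketched for the accounts example used earlier. The helper name is hypothetical; the point is only that two concurrent transfers between the same pair of customers acquire their row locks in the same order, whichever direction the money moves.

```python
def ordered_transfer_statements(from_customer: int, to_customer: int, amount: int):
    """Return (sql, params) pairs sorted by customer_id, ascending.

    Sorting by the row's key means opposing transfers (A to B and B to A)
    lock the lower customer_id first in both cases, which removes the
    classic lock cycle between them.
    """
    debit = (
        "UPDATE accounts SET balance = balance - %s WHERE customer_id = %s",
        (amount, from_customer),
    )
    credit = (
        "UPDATE accounts SET balance = balance + %s WHERE customer_id = %s",
        (amount, to_customer),
    )
    # params[1] is the customer_id in both statements.
    return sorted([debit, credit], key=lambda stmt: stmt[1][1])
```

Each returned pair can be passed straight to cur.execute(sql, params) inside the transaction.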

Verification

Confirm no distributed transaction is stuck before declaring a TWOPC deployment healthy. Query the metadata manager’s state tables and look for any dtid lingering in PREPARE beyond expected completion time:

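-- Run against the MM shard's underlying MySQL.
-- Rows here that are older than a few seconds are stuck and hold locks.
-- Check SHOW CREATE TABLE _vt.dt_state first: depending on the Vitess
-- version, state may be stored as an integer enum rather than a string,
-- in which case the predicate below needs the numeric value instead.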
SELECT dtid, state, time_created
FROM _vt.dt_state
WHERE state = 'PREPARE'
ORDER BY time_created;

A healthy cluster returns an empty set (or only transactions younger than your commit latency). Anything older should be driven to a conclusion explicitly:

vtctldclient DistributedTransaction ... # inspect, then:
# resolve a specifically identified stuck dtid so its locks release

Wire the same check into observability: alert when the count of PREPARE-state rows exceeds zero for longer than a chosen threshold, and track distributed-commit latency percentiles and VTTablet lock-wait times so a rising trend pages you before it becomes a lock pileup. When a participating shard is genuinely unavailable, keep non-critical read paths serving with fallback routing for shard outages while the coordinator resolves the outstanding transactions.
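The alert condition can be expressed as a small filter over the rows returned by the verification query. This is a sketch under assumptions: the DtRow shape and function names are hypothetical, and created_epoch_s assumes you have already converted time_created to epoch seconds (check the column's actual unit in your Vitess version before relying on it).

```python
from typing import Iterable, List, NamedTuple


class DtRow(NamedTuple):
    """Hypothetical shape of one row from the verification query."""
    dtid: str
    state: str
    created_epoch_s: float  # assumed: time_created converted to seconds


def stuck_transactions(rows: Iterable[DtRow],
                       now_epoch_s: float,
                       max_age_s: float) -> List[DtRow]:
    """Return PREPARE-state rows older than max_age_s, oldest first."""
    stuck = [
        r for r in rows
        if r.state == "PREPARE" and now_epoch_s - r.created_epoch_s > max_age_s
    ]
    return sorted(stuck, key=lambda r: r.created_epoch_s)


def should_alert(rows: Iterable[DtRow],
                 now_epoch_s: float,
                 max_age_s: float) -> bool:
    """True when at least one transaction has been PREPARE for too long."""
    return bool(stuck_transactions(rows, now_epoch_s, max_age_s))
```

Keeping the threshold as a parameter lets the same check serve both a tight dashboard view and a looser paging alert.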

Related

VTGate Routing Architecture Deep Dive: the query-resolution pipeline that decides single-shard vs. scatter before a transaction ever commits.
Understanding Vitess Keyspace Partitioning Models: choosing boundaries that keep most writes single-shard and out of 2PC entirely.
Configuring Lookup Vindexes for Cross-Shard Joins: secondary vindexes so related rows co-locate and a write stays atomic on one shard.
