Tracking Migration Progress and State Machines

In distributed MySQL environments, schema evolution behaves as a long-running, fault-tolerant workflow rather than an atomic statement. A single ALTER TABLE against a sharded keyspace fans out into an independent row-copy job on every shard, each with its own throttle state, replica lag, and cutover barrier. Without an authoritative record of where every shard sits in that workflow, an operator cannot answer the two questions that matter during an incident: "Is it safe to cut over now?" and "Is it safe to retry?" This page shows how to model migration progress as a persisted, idempotent state machine so that both questions have deterministic answers. That control-plane discipline is what the broader practice of Online DDL Orchestration & Migration Coordination depends on.

The state machine is the authoritative control plane. It bridges low-level InnoDB and VReplication mechanics with high-level orchestration policy, so that every schema change stays observable, auditable, and reversible even when a controller crashes mid-migration, a network partition strands a shard, or a topology rebalance moves a primary underneath a running job.

Prerequisites

Before implementing progress tracking against live shards, confirm the following:

Vitess 16.0+ (18.0+ recommended): the vtctldclient OnlineDDL command surface, --ddl-strategy=vitess, and the _vt.schema_migrations progress columns are stable from these releases onward.
A managed Topology Server — etcd, ZooKeeper, or Consul reachable by every vtctld and vttablet. State is persisted here (and in _vt.schema_migrations), so treat it as a production dependency, not a bootstrap detail.
VReplication enabled on the target keyspace, which is the default execution engine for native Online DDL. If you run external tooling instead, review the trade-offs in Vitess Native Online DDL vs External Tools before wiring up telemetry — the two paths expose progress very differently.
Working knowledge of shard topology — how keyspaces, shards, and tablet roles map to physical MySQL instances. If that model is not yet solid, ground it first in Vitess Sharding Architecture & Topology Design and the practice of designing horizontal shard topologies.
Metrics scraping — a Prometheus (or compatible) collector already pulling vttablet and vtgate /metrics endpoints, so migration counters land in the same store as your latency and lag dashboards.
vtctldclient access to the control plane and read access to the _vt sidecar database on each shard’s primary.

The Migration State Machine

A production-grade migration advances through discrete, verifiable phases. Naming them explicitly — rather than tracking a boolean running flag — is what makes progress queryable and recovery deterministic:

QUEUED — accepted and persisted, not yet scheduled.
INITIALIZING — shadow (ghost) table created, VReplication stream provisioned.
COPYING_ROWS — bulk row copy in flight; the phase where rows_copied and throttle state move.
WAITING_FOR_CATCHUP — row copy done, VReplication draining the binlog backlog until replica lag falls under threshold.
CUTOVER_PENDING — caught up and holding at the barrier, waiting for every participating shard to reach the same point.
SWITCHING_TRAFFIC — atomic rename / routing swap in progress.
COMPLETED / FAILED — terminal states, with FAILED retaining enough context to retry or roll back.

Each transition requires strict precondition validation (replica lag thresholds, binary log position alignment, and a vttablet topology refresh) before the controller is allowed to advance. The full lifecycle consists of the phases above, the precondition gates between them, and idempotent retry and abort edges out of FAILED.
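As a minimal sketch of the phase list, the allowed transitions can be written as a lookup table that a controller consults before advancing. The phase names come from this page; the specific edges (in particular, a retry out of FAILED re-entering QUEUED) and the `can_advance` helper are illustrative assumptions, not Vitess API.

```python
from enum import Enum


class MigrationPhase(Enum):
    QUEUED = "QUEUED"
    INITIALIZING = "INITIALIZING"
    COPYING_ROWS = "COPYING_ROWS"
    WAITING_FOR_CATCHUP = "WAITING_FOR_CATCHUP"
    CUTOVER_PENDING = "CUTOVER_PENDING"
    SWITCHING_TRAFFIC = "SWITCHING_TRAFFIC"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


P = MigrationPhase

# Assumed edges: each working phase moves one step forward or fails.
ALLOWED = {
    P.QUEUED: {P.INITIALIZING, P.FAILED},
    P.INITIALIZING: {P.COPYING_ROWS, P.FAILED},
    P.COPYING_ROWS: {P.WAITING_FOR_CATCHUP, P.FAILED},
    P.WAITING_FOR_CATCHUP: {P.CUTOVER_PENDING, P.FAILED},
    P.CUTOVER_PENDING: {P.SWITCHING_TRAFFIC, P.FAILED},
    P.SWITCHING_TRAFFIC: {P.COMPLETED, P.FAILED},
    P.COMPLETED: set(),          # terminal
    P.FAILED: {P.QUEUED},        # assumption: a retry re-enters the queue
}


def can_advance(current, target, preconditions_ok):
    """Allow a transition only if the edge exists and its gate has passed."""
    if target not in ALLOWED[current]:
        return False
    # Moving to FAILED is always permitted; every other move needs the gate.
    return target is P.FAILED or preconditions_ok
```

Keeping the table as data rather than scattered `if` statements means an illegal jump, such as QUEUED straight to SWITCHING_TRAFFIC, is rejected in one place.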

Core Mechanism: How State Is Persisted and Advanced

Native Online DDL persists every migration’s state in the _vt.schema_migrations table on each shard’s primary, keyed by a globally unique migration UUID. This table is the source of truth for a single shard; the Topology Server aggregates UUID → keyspace/shard mappings so the control plane can enumerate active jobs cluster-wide. Because both stores survive a controller restart, an orchestration process is stateless: on startup it reloads the current phase for every in-flight UUID and resumes from there rather than from memory.

The columns that drive progress tracking are migration_status, progress (a 0–100 percentage during row copy), eta_seconds, rows_copied, table_rows (the estimate), and migration_context (the caller-supplied grouping key that ties a fan-out set of shard migrations to one logical change). During COPYING_ROWS, vttablet updates rows_copied and recomputes progress and eta_seconds on a fixed interval; during WAITING_FOR_CATCHUP it publishes the VReplication lag so the controller can decide when the catch-up precondition is satisfied.
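To make these columns concrete, here is a sketch of a typed per-shard snapshot built from one row of status output. The field names follow the columns named above; the `ShardSnapshot` class, its `from_row` helper, and the assumption that each row arrives as a dictionary keyed by column name are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSnapshot:
    shard: str
    status: str
    progress: float      # 0-100 during row copy
    eta_seconds: int
    rows_copied: int
    table_rows: int      # an estimate, so rows_copied can exceed it
    context: str

    @classmethod
    def from_row(cls, row: dict) -> "ShardSnapshot":
        # Missing or empty values fall back to zero / empty string.
        return cls(
            shard=str(row.get("shard", "")),
            status=str(row.get("migration_status", "")),
            progress=float(row.get("progress") or 0),
            eta_seconds=int(row.get("eta_seconds") or 0),
            rows_copied=int(row.get("rows_copied") or 0),
            table_rows=int(row.get("table_rows") or 0),
            context=str(row.get("migration_context", "")),
        )

    def copy_ratio(self) -> float:
        """rows_copied / table_rows, clamped to [0, 1] because table_rows is an estimate."""
        if self.table_rows <= 0:
            return 0.0
        return min(1.0, self.rows_copied / self.table_rows)
```

Parsing once at the edge keeps the rest of a controller free of string-to-number conversions and makes the estimate's imprecision explicit.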

Idempotency is the non-negotiable property. Network partitions, MySQL restarts, and rebalancing are routine, so every state mutation must be safe to apply more than once. Two rules enforce this:

Read-before-write. A controller queries the persisted status before issuing any resume, retry, or cancel directive. If the shard already reports COMPLETED, a duplicate retry is a no-op instead of a second ghost table.
UUID-scoped commands. Every mutation targets a specific migration UUID, never “the current migration on this shard.” This prevents a delayed retry from acting on a migration that has since been superseded.
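The two rules can be sketched as a pure decision function: it takes the status just read for one specific UUID and returns what, if anything, is safe to do. The lowercase status strings mirror the ones used by the tracker later on this page; the `Action` enum and `decide` function are hypothetical names, and the grouping of statuses into retryable and in-flight sets is an assumption.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NOOP = "noop"
    RETRY = "retry"
    CANCEL = "cancel"


TERMINAL_OK = {"complete"}
RETRYABLE = {"failed", "cancelled"}
IN_FLIGHT = {"queued", "ready", "running"}


def decide(requested: Action, persisted_status: Optional[str]) -> Action:
    """Map a requested directive to a safe one, given the status read for this UUID."""
    if persisted_status is None:
        # The UUID is unknown on this shard: there is nothing to act on.
        return Action.NOOP
    if persisted_status in TERMINAL_OK:
        # Already complete: a duplicate retry or cancel must do nothing.
        return Action.NOOP
    if requested is Action.RETRY:
        return Action.RETRY if persisted_status in RETRYABLE else Action.NOOP
    if requested is Action.CANCEL:
        return Action.CANCEL if persisted_status in IN_FLIGHT else Action.NOOP
    return Action.NOOP
```

Because the function depends only on the persisted status, a retry that is delivered twice resolves to the same outcome both times: the second call sees the state left by the first.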

The single-shard state machine composes into a multi-shard one through a global barrier at CUTOVER_PENDING. No shard is allowed to enter SWITCHING_TRAFFIC until all participating shards in the migration_context group have reached the barrier. This is what keeps the VTGate routing layer consistent: the atomic rename happens within a narrow window across every shard, so query planners never observe a mix of old and new table definitions. The mechanics of holding and releasing that barrier across many shards are covered in depth in Coordinating Multi-Shard Schema Migrations.

Step-by-Step Implementation

The following steps build a progress tracker that launches a migration, polls per-shard state, and enforces the cutover barrier. Each step is independently verifiable.

1. Launch the migration and capture the UUID

Submit the DDL with an explicit strategy and a migration_context so the fan-out set can be tracked as one logical change:

```
vtctldclient ApplySchema \
  --ddl-strategy "vitess --postpone-completion" \
  --migration-context "checkout-idx-2026q3" \
  --sql "ALTER TABLE orders ADD INDEX idx_customer (customer_id)" \
  commerce
```

--postpone-completion is deliberate: it lets every shard reach CUTOVER_PENDING and hold there, handing the cutover decision to your orchestrator instead of letting each shard cut over independently. The command returns one UUID per DDL statement, and every shard tracks its part of the migration under that same UUID; capture it.

2. Poll per-shard status

vtctldclient OnlineDDL show returns the persisted state for a UUID or a whole migration_context, as JSON:

```
vtctldclient OnlineDDL show --json commerce checkout-idx-2026q3
```

Each row carries shard, migration_status, progress, eta_seconds, and rows_copied. Poll on a fixed interval (5–15s is typical) rather than in a tight loop — the row-copy counters only refresh periodically, so faster polling adds topology load without new information.
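A fixed-interval poll can be sketched as below. The `fetch` callable stands in for whatever wraps the show command, and `poll`, `is_done`, and `max_ticks` are hypothetical names; the sleep function is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, Optional


def poll(
    fetch: Callable[[], dict],
    is_done: Callable[[dict], bool],
    interval_s: float = 10.0,
    max_ticks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() every interval_s seconds until is_done(states) is true.

    Returns the final states, or None if max_ticks polls pass without finishing.
    """
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        states = fetch()
        if is_done(states):
            return states
        ticks += 1
        sleep(interval_s)
    return None
```

Bounding the loop with `max_ticks` gives the caller a natural place to escalate when a migration runs past its expected duration instead of polling forever.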

3. Model the state machine in Python

Wrap the CLI (or the gRPC vtctld service directly) in an idempotent controller. The read-before-write pattern lives here:

```
import json
import subprocess
from enum import Enum


class Phase(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"          # covers INITIALIZING/COPYING_ROWS/WAITING_FOR_CATCHUP
    READY = "ready"              # postponed at the cutover barrier
    COMPLETE = "complete"
    FAILED = "failed"
    CANCELLED = "cancelled"


class MigrationFailed(RuntimeError):
    """Raised when at least one shard reports a failed migration."""


def shard_states(keyspace: str, context: str) -> dict[str, dict]:
    # Read the persisted state on every call; nothing is cached in memory.
    out = subprocess.check_output(
        ["vtctldclient", "OnlineDDL", "show", "--json", keyspace, context]
    )
    return {row["shard"]: row for row in json.loads(out)}


def barrier_reached(states: dict[str, dict]) -> bool:
    # every shard must be holding at the cutover barrier (READY), none FAILED
    phases = {s["migration_status"] for s in states.values()}
    if Phase.FAILED in phases:
        raise MigrationFailed(states)
    # an empty result compares unequal, so it never releases the barrier
    return phases == {Phase.READY}
```

Because shard_states reads authoritative persisted state on every call, the controller can crash and restart at any point without losing track of the migration — it simply re-reads.

4. Enforce the global cutover barrier, then complete

Only once every shard reports READY do you release the barrier. Completing the postponed migration triggers the atomic rename on each shard:

```
def complete_when_ready(keyspace: str, context: str, uuid: str) -> None:
    states = shard_states(keyspace, context)
    if not barrier_reached(states):
        return  # poll again next tick
    subprocess.check_call(
        ["vtctldclient", "OnlineDDL", "complete", keyspace, uuid]
    )
```

For a heavily loaded keyspace, sequence the completion inside an approved change window — see Scheduling DDL Windows Across Multiple Timezones for aligning that window with regional traffic troughs across a global fleet.

5. Export progress as metrics

Turn each poll into gauges so the migration shows up on the same dashboards as replica lag and QPS. Normalize per-shard progress by table_rows so a large shard cannot be hidden behind several small, already-complete ones:

```
from prometheus_client import Gauge

shard_progress = Gauge("ddl_shard_progress_pct", "row-copy %", ["keyspace", "shard"])
weighted_progress = Gauge("ddl_migration_progress_pct", "row-weighted %", ["keyspace", "context"])

def publish(keyspace: str, context: str, states: dict[str, dict]) -> None:
    total_rows = sum(int(s["table_rows"]) or 1 for s in states.values())
    weighted = 0.0
    for shard, s in states.items():
        pct = float(s["progress"])
        shard_progress.labels(keyspace, shard).set(pct)
        weighted += pct * (int(s["table_rows"]) or 1) / total_rows
    weighted_progress.labels(keyspace, context).set(weighted)
```

Correlate these against MySQL replication lag metrics on the same dashboard so an operator can see progress stalling and the lag spike that caused it in one view.
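To see why the row weighting matters, the following sketch compares a plain average with the row-weighted figure on made-up numbers. The shard names and row counts are invented for illustration, and both helper functions are hypothetical; the weighting rule matches the one described for the gauge, including treating an empty estimate as one row.

```python
def row_weighted_progress(shards: dict) -> float:
    """shards maps shard name -> (progress_pct, table_rows)."""
    # Treat an empty row estimate as 1 so the shard still counts.
    total = sum(rows or 1 for _, rows in shards.values())
    return sum(pct * (rows or 1) / total for pct, rows in shards.values())


def unweighted_progress(shards: dict) -> float:
    """Plain average of per-shard percentages, shown only for comparison."""
    return sum(pct for pct, _ in shards.values()) / len(shards)


# Three small shards are finished; one very large shard has barely started.
example = {
    "-40": (100.0, 1_000_000),
    "40-80": (100.0, 1_000_000),
    "80-c0": (100.0, 1_000_000),
    "c0-": (10.0, 97_000_000),
}
```

On this data the plain average reports 77.5 percent, while the row-weighted value is about 12.7 percent, which is the figure that reflects how much copying is actually left.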

Configuration Reference

The flags and thresholds below govern how the state machine advances. Tune them per keyspace based on shard size and primary headroom.

| Flag / parameter | Type | Default | Recommended (production) |
| --- | --- | --- | --- |
| `--ddl-strategy` | string | `direct` | `vitess` (native VReplication path) |
| `--postpone-completion` | bool | `false` | `true` for coordinated multi-shard cutover |
| `--cut-over-threshold` | duration | `10s` | `10s`–`30s`; lower needs a very quiet primary |
| `--online-ddl-throttle-ratio` (via throttler) | float | `1.0` | `0.7`–`0.8` on latency-sensitive keyspaces |
| `--throttle-metrics-threshold` | float | `1.0` (s of lag) | `1.5`–`2.0` to tolerate normal replica lag |
| `--migration-context` | string | auto-generated | explicit, human-readable per change |
| `--retain-online-ddl-tables` | duration | `24h` | `24h`–`72h` for post-cutover rollback safety |
| poll interval (orchestrator) | duration | n/a | `5s`–`15s` |

The --retain-online-ddl-tables window matters for recovery: the old table is renamed aside, not dropped, so a regression caught minutes after cutover can be reverted without a restore. Do not shorten it below your rollback decision time.
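The retention rule can be expressed as two small checks, sketched here under the assumption that the orchestrator records the cutover timestamp itself. Both function names are hypothetical, and the exact moment Vitess reclaims a retained table may differ from this simple window arithmetic.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention_is_sufficient(retain_for: timedelta, rollback_decision_time: timedelta) -> bool:
    """The retained table must outlive the time it takes to decide on a rollback."""
    return retain_for >= rollback_decision_time


def can_still_revert(
    cutover_at: datetime, retain_for: timedelta, now: Optional[datetime] = None
) -> bool:
    """True while the renamed-aside table is still inside its retention window."""
    now = now or datetime.now(timezone.utc)
    return now < cutover_at + retain_for
```

Running the first check at submission time catches a retention setting that is too short before the migration starts, rather than after a rollback is already needed.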

Failure Modes

Each named scenario lists the observable symptom and the mitigation the state machine should encode.

Stalled row copy. progress frozen and eta_seconds climbing while migration_status stays running. Almost always throttling: the primary’s replicas exceeded the lag threshold and VReplication paused the copy. Check the throttler status; if lag is legitimately high, either widen --throttle-metrics-threshold or move the window to a quieter period rather than forcing the copy and risking a replication cascade. A frozen copy on a specific shard often traces back to that shard’s primary being under-provisioned — validate the setup against configuring VTTablet for high availability.
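The stalled-copy symptom can be detected from consecutive poll samples. This is a sketch only: `is_stalled` is a hypothetical helper, the three-sample window is an arbitrary choice, and each sample is assumed to be a `(progress, eta_seconds)` pair taken from successive polls of one shard.

```python
from typing import Sequence, Tuple


def is_stalled(samples: Sequence[Tuple[float, float]], window: int = 3) -> bool:
    """True if progress is frozen and the ETA has climbed over the last `window` samples."""
    if len(samples) < window:
        return False  # not enough history to judge
    recent = samples[-window:]
    progress_frozen = all(p == recent[0][0] for p, _ in recent)
    eta_climbing = recent[-1][1] > recent[0][1]
    return progress_frozen and eta_climbing
```

A positive result is a prompt to check the throttler, not a trigger for automatic action: the copy is expected to resume by itself once lag recovers.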

Barrier deadlock. Most shards report READY but one is stuck in running. Completing now would cut over a subset of shards and split query routing. The barrier logic must refuse to release: block on barrier_reached() returning True for all shards, and page if a shard sits below the barrier past its expected ETA. Never hand-complete individual shards to “unstick” the set.
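A sketch of that refusal logic, separated from any CLI call: it reports whether the barrier may be released and which shards have sat below it long enough to page on. The `evaluate_barrier` name, the `seconds_past_eta` input, and the five-minute paging default are assumptions; the `"ready"` string follows the status value this page's tracker uses for a shard holding at the barrier.

```python
from typing import Dict, List, Tuple


def evaluate_barrier(
    statuses: Dict[str, str],
    seconds_past_eta: Dict[str, float],
    page_after_s: float = 300.0,
) -> Tuple[bool, List[str]]:
    """Return (release, shards_to_page).

    release is True only when every shard is holding at the barrier.
    shards_to_page lists shards below the barrier that have overrun their ETA.
    """
    below = sorted(s for s, st in statuses.items() if st != "ready")
    release = bool(statuses) and not below
    to_page = [s for s in below if seconds_past_eta.get(s, 0.0) > page_after_s]
    return release, to_page
```

The function never returns a partial release: one shard below the barrier keeps the whole set waiting, and the only escalation path is a page.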

Cutover contention. The rename times out because it cannot acquire the metadata lock within --cut-over-threshold — a long-running transaction is holding the table. Symptom: repeated transition attempts from CUTOVER_PENDING that fall back. For external-tool migrations this manifests as lock waits; the same class of problem and its mitigations are detailed in resolving gh-ost lock contention in sharded MySQL. Kill or wait out the blocking transaction; do not lower the threshold blindly.

Orphaned ghost table after controller crash. A migration left running in _vt.schema_migrations with a shadow table present but no active VReplication stream. Because state is persisted, the correct recovery is a read-before-write retry or cancel on the UUID — never a manual DROP TABLE, which the retention/cleanup machinery expects to own.

Post-cutover latency spike. Migration COMPLETED, but p99 climbs as connection pools repopulate against a cold InnoDB buffer pool and query planners rebuild execution plans. Mitigate by warming the buffer pool and shifting traffic gradually rather than all at once. This is an operational tail of the migration, not a separate incident — track it as part of the same run.

Verification

Confirm the tracker and the migration itself are healthy with three concrete checks:

State is terminal and consistent across shards. Every shard should agree:
```
vtctldclient OnlineDDL show --json commerce checkout-idx-2026q3 \
  | jq '[.[] | {shard, status: .migration_status}]'
```
Every status should read complete. A mix of complete and running means the barrier was released early — investigate before declaring success.
The schema actually changed on each primary. Progress metadata is not proof the DDL applied. Confirm the structural change directly:
```
SHOW INDEX FROM commerce.orders WHERE Key_name = 'idx_customer';
```
Run it against each shard’s primary; every shard must return the new index.
Metrics reflect completion. ddl_migration_progress_pct should read 100 for the context and then stop updating, and no ddl_shard_progress_pct series should be stuck below 100. A gauge frozen mid-range is a stalled shard that never reached the barrier.

Once all three pass and the --retain-online-ddl-tables window has elapsed without a rollback decision, the migration is genuinely done and the retained tables can be reclaimed.
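The first and third checks lend themselves to automation. The sketch below assumes the per-shard statuses and progress values have already been collected into dictionaries; `verification_problems` is a hypothetical helper, and the schema check still has to be run against each primary separately.

```python
from typing import Dict, List


def verification_problems(
    statuses: Dict[str, str], progress_pct: Dict[str, float]
) -> List[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    for shard in sorted(statuses):
        if statuses[shard] != "complete":
            problems.append(f"{shard}: status is {statuses[shard]}, expected complete")
    for shard in sorted(progress_pct):
        if progress_pct[shard] < 100:
            problems.append(f"{shard}: progress gauge stuck at {progress_pct[shard]}")
    return problems
```

Returning a list of findings rather than a boolean lets the same function feed both a pass/fail gate and the incident notes when something is off.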

Related

Coordinating Multi-Shard Schema Migrations: how the global cutover barrier is held and released across an entire keyspace.
Scheduling DDL Windows Across Multiple Timezones — aligning cutover with regional traffic troughs for a global fleet.
Vitess Native Online DDL vs External Tools — how progress and state are exposed differently by native VReplication versus gh-ost/pt-osc.
Resolving gh-ost Lock Contention in Sharded MySQL — diagnosing the metadata-lock waits that stall cutover.
Handling Cross-Shard Transactions in Vitess — why atomic, barrier-gated routing swaps matter for consistency.

For custom controllers, the Vitess schema change documentation provides reference implementations for topology-aware transitions and progress polling.
