Vitess Sharding Architecture & Topology Design: Foundational Principles for Distributed MySQL

Modern data platforms demand horizontal scalability without sacrificing transactional integrity or operational predictability. Vitess addresses this by abstracting traditional MySQL deployments into a distributed, cloud-native architecture where topology design becomes the primary determinant of system resilience. This reference is written for database platform engineers, MySQL SREs, Python orchestration builders, and distributed systems teams who need to reason about how keyspaces, shards, and routing layers interact under production load — and who own the operational consequences when they do not. The architecture decouples the query-routing control plane from the storage tier, enabling independent scaling while preserving MySQL’s ACID guarantees within shard boundaries. Everything downstream on this site — keyspace partitioning models, horizontal shard design, VTGate routing, and Online DDL orchestration — builds on the structural principles established here.

The control plane and data plane fit together as follows: clients talk only to the stateless VTGate router, which consults the topology server’s serving graph to dispatch queries to the VTTablet/MySQL pairs that own each shard, while VTOrc watches tablet health and drives failover.

The Vitess Logical Model: Keyspaces, Shards, and Tablet Roles

Every topology decision in Vitess is expressed through four core abstractions, and misusing any one of them propagates directly into query latency and failover behavior.

A keyspace is a logical database. It is the unit clients connect to, and it may be unsharded (a single backing MySQL) or sharded (partitioned across many). The keyspace name is the stable identity that survives resharding — applications target commerce, not commerce/-80, and the routing layer resolves the physical destination.

A shard is a contiguous slice of a keyspace’s keyspace-ID space, named by its lower and upper key range in hexadecimal: -80 covers keyspace IDs from zero up to (but excluding) those whose leading byte is 0x80, and 80- covers 0x80 to the maximum. Because each bound is a byte prefix, a range can always be bisected cleanly (-80 becomes -40 and 40-80), which is why practitioners provision shard counts as powers of two.
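
The bisection arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not Vitess code: the helper name bisect_shard is hypothetical, and it assumes shard bounds are written as single-byte (two hex digit) prefixes, which covers layouts of up to 256 shards.

```python
def bisect_shard(name: str) -> tuple[str, str]:
    """Split a shard name such as '-80' into its two halves.

    Assumes each bound is empty (open) or exactly two hex digits.
    """
    lo, hi = name.split("-")
    for bound in (lo, hi):
        if bound and len(bound) != 2:
            raise ValueError(f"unsupported bound {bound!r}: expected two hex digits")
    start = int(lo, 16) if lo else 0x00
    end = int(hi, 16) if hi else 0x100  # open upper bound is one past 0xff
    if end - start < 2:
        raise ValueError(f"shard {name!r} cannot be bisected at one-byte precision")
    mid = f"{(start + end) // 2:02x}"
    return f"{lo}-{mid}", f"{mid}-{hi}"


print(bisect_shard("-80"))  # ('-40', '40-80')
print(bisect_shard("80-"))  # ('80-c0', 'c0-')
```

Applying the function repeatedly from the unsharded range "-" yields 2, 4, 8, and so on shards of equal width, which is the power-of-two convention in practice.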

A tablet is the pairing of a VTTablet process with exactly one mysqld. Tablets carry a type that determines what traffic they serve:

PRIMARY — the single writable tablet per shard; all INSERT/UPDATE/DELETE and transactional reads route here.
REPLICA — read replicas eligible for promotion; serve @replica reads and act as failover candidates.
RDONLY — batch/analytics replicas, never promoted, used for VReplication sources and OLAP scatter queries so they never steal capacity from transactional replicas.

The topology server — an etcd or Consul cluster — is the source of truth binding these together. It stores keyspace and shard records, tablet records, the VSchema, and the serving graph: the per-cell mapping of (keyspace, shard, tablet_type) to the currently healthy tablet that should receive traffic. VTGate caches this graph and re-plans routing whenever the topology changes. Inspecting it is the first move in almost any incident:

# Enumerate shards in a keyspace and the current serving tablets
vtctldclient GetShards commerce
vtctldclient GetTablets --keyspace commerce --tablet-type primary

How the keyspace ID is derived from application data — the choice of range, hash, or lookup distribution — is the single most consequential decision in the whole system; it is examined in depth in Understanding Vitess Keyspace Partitioning Models, and the arithmetic for sizing the initial shard count is worked through in How to Calculate Optimal Shard Count for MySQL.

Logical Partitioning & the VSchema

The mapping between logical tables and physical shards is not stored in MySQL — it lives in the VSchema, a JSON document in the topology server that VTGate reads to build execution plans. The VSchema names each table’s primary vindex, the function that converts a column value into a keyspace ID and therefore into a shard. Without a correct VSchema, every query degrades to a scatter across all shards.

{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        { "column": "customer_id", "name": "hash" }
      ]
    }
  }
}

Here customer_id is the sharding key: rows for a given customer deterministically resolve to one shard, so a customer’s orders are co-located and single-shard reads stay cheap. When a table must be queried by a secondary column (say, looking an order up by order_id rather than customer_id), a secondary lookup vindex maintains a mapping table so VTGate can still target a single shard instead of fanning out. The full grammar of vindex definitions, sequence tables, and materialized rules is covered in Mastering VSchema Syntax and Structure, and traffic-shaping overlays are handled by dynamic routing rules and query rewriting. A VSchema that is inconsistent with the physical shard ranges is the most common root cause of silent scatter-gather regressions, so it is treated as production configuration and versioned accordingly.
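
Because a missing or dangling primary vindex is easy to introduce in review, a pre-merge lint can catch it before the VSchema reaches the topology server. The sketch below is one possible check under stated assumptions: the function name is hypothetical, it treats the first entry in column_vindexes as the primary vindex, and it skips tables declared with "type": "reference".

```python
def tables_without_primary_vindex(vschema: dict) -> list[str]:
    """Return sharded tables whose primary vindex is missing or undefined."""
    if not vschema.get("sharded"):
        return []  # an unsharded keyspace needs no vindexes
    defined = set(vschema.get("vindexes", {}))
    problems = []
    for name, table in vschema.get("tables", {}).items():
        if table.get("type") == "reference":
            continue  # reference tables are copied to every shard
        column_vindexes = table.get("column_vindexes", [])
        # The first column vindex is the primary one.
        if not column_vindexes or column_vindexes[0].get("name") not in defined:
            problems.append(name)
    return sorted(problems)


vschema = {
    "sharded": True,
    "vindexes": {"hash": {"type": "hash"}},
    "tables": {
        "orders": {"column_vindexes": [{"column": "customer_id", "name": "hash"}]},
        "events": {},  # deliberately broken: no vindex declared
    },
}
print(tables_without_primary_vindex(vschema))  # ['events']
```

Running a check like this in CI is a cheap complement to versioning the VSchema as production configuration.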

Horizontal Shard Topology & Failure Domain Mapping

Horizontal scaling relies on deterministic shard mapping and on placing tablets so that no single infrastructure failure can take a shard’s entire quorum offline. Topology design must account for the initial shard count, growth projections, and the operational cost of future splits or merges. Provisioning with a power-of-two convention keeps binary range calculations trivial and streamlines VReplication when a shard is later bisected.

The physical layout matters as much as the count. Each shard’s PRIMARY, REPLICA, and RDONLY tablets must be distributed across distinct failure domains — availability zones or racks — so that losing one zone leaves a promotable replica and a surviving quorum in the topology server. A shard whose primary and only replica share a rack has no real availability, regardless of how many tablets the topology reports.
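
The placement rule above lends itself to an automated audit. The following sketch assumes a hypothetical inventory format in which each tablet is a dict with keyspace, shard, type, and zone fields; Vitess does not emit this exact shape, so the records would need to be assembled from your own tablet and infrastructure metadata.

```python
from collections import defaultdict

# Only these tablet types can hold or take over the primary role.
PROMOTABLE = {"PRIMARY", "REPLICA"}


def shards_at_risk(tablets: list[dict]) -> list[str]:
    """Return shards whose promotable tablets occupy fewer than two zones."""
    zones = defaultdict(set)
    shards = set()
    for tablet in tablets:
        key = f'{tablet["keyspace"]}/{tablet["shard"]}'
        shards.add(key)
        if tablet["type"] in PROMOTABLE:
            zones[key].add(tablet["zone"])
    return sorted(shard for shard in shards if len(zones[shard]) < 2)


inventory = [
    {"keyspace": "commerce", "shard": "-80", "type": "PRIMARY", "zone": "az-a"},
    {"keyspace": "commerce", "shard": "-80", "type": "REPLICA", "zone": "az-b"},
    {"keyspace": "commerce", "shard": "80-", "type": "PRIMARY", "zone": "az-a"},
    {"keyspace": "commerce", "shard": "80-", "type": "REPLICA", "zone": "az-a"},
    {"keyspace": "commerce", "shard": "80-", "type": "RDONLY", "zone": "az-b"},
]
print(shards_at_risk(inventory))  # ['commerce/80-']
```

Note that the RDONLY tablet in az-b does not rescue shard 80-: it is never promoted, so the shard still has no failover candidate outside az-a.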

# Bisect an overloaded shard: -80 becomes -40 and 40-80 via VReplication
vtctldclient Reshard --workflow split80 --target-keyspace commerce create \
  --source-shards '-80' --target-shards '-40,40-80'
# Switch traffic only after the copy phase has finished and the targets have caught up
vtctldclient Reshard --workflow split80 --target-keyspace commerce switchtraffic

The end-to-end procedure — capacity planning, range assignment, and aligning shard boundaries to infrastructure — is laid out in Designing Horizontal Shard Topologies. Choosing a key with enough cardinality and entropy to avoid write hotspots is its own discipline, worked through for a concrete workload in Shard Key Selection Best Practices for E-commerce, while tablet-level durability settings are covered in Configuring VTTablet for High Availability. Because MySQL foreign keys cannot span shards, referential integrity across shard boundaries is enforced at the application or orchestration layer rather than by the storage engine.

Query Routing & VTGate Data Plane Mechanics

VTGate is the stateless data-plane proxy: it parses client SQL, consults the VSchema and serving graph, builds a shard-aware plan, and either targets a single shard or scatters. Routing efficiency is overwhelmingly a function of whether the sharding key appears as a predicate in the WHERE clause.

When the sharding key is present and equality-constrained, VTGate computes the keyspace ID and issues a targeted query to exactly one shard:

-- Targeted: customer_id resolves to a single shard
SELECT * FROM orders WHERE customer_id = 42;

When the predicate is absent, ranged, or on a non-vindex column, VTGate performs scatter-gather: it fans the query to every shard, then aggregates, sorts, and applies LIMIT on the results:

-- Scatter-gather: no sharding-key predicate, fans out to all shards
SELECT COUNT(*) FROM orders WHERE status = 'shipped';
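
The two routing outcomes above can be modeled in a few lines of Python. This is a toy model rather than VTGate's planner: the shard list is an assumed four-shard layout, and keyspace_id uses SHA-256 as a stand-in, which is not the function the Vitess hash vindex applies. What it does reproduce is the byte-wise comparison of a keyspace ID against each shard's key range.

```python
import hashlib

# Assumed layout for illustration: four equal shards.
SHARDS = ["-40", "40-80", "80-c0", "c0-"]


def keyspace_id(value: int) -> bytes:
    # Stand-in hash for illustration only; NOT the Vitess hash vindex.
    return hashlib.sha256(str(value).encode()).digest()[:8]


def owns(shard: str, ksid: bytes) -> bool:
    """True if ksid falls in the shard's [lower, upper) byte range."""
    lo, hi = shard.split("-")
    above_lower = not lo or ksid >= bytes.fromhex(lo)
    below_upper = not hi or ksid < bytes.fromhex(hi)
    return above_lower and below_upper


def route(sharding_key=None) -> list[str]:
    """Return the shards a query must visit."""
    if sharding_key is None:
        return list(SHARDS)  # scatter: no usable sharding-key predicate
    ksid = keyspace_id(sharding_key)
    return [shard for shard in SHARDS if owns(shard, ksid)]


print(len(route(42)))  # 1 (targeted)
print(len(route()))    # 4 (scatter-gather)
```

Every key resolves to exactly one shard because the ranges are contiguous and non-overlapping, which is the invariant that makes targeted routing possible at all.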

Scatter latency is bounded by the slowest shard, and its cost grows with shard count, so the operational goal is to keep hot-path queries targeted. The internal plan cache, prepared-statement handling, connection pooling, and plan types are dissected in the VTGate Routing Architecture Deep Dive. Writes that unavoidably span shards escalate into distributed-transaction territory, whose commit semantics and --transaction_mode trade-offs are covered in Handling Cross-Shard Transactions in Vitess. Inspecting a plan before it reaches production is done offline against the VSchema:

# Show the routing plan without touching a live tablet
vtexplain --vschema-file vschema.json --schema-file schema.sql \
  --shards 4 --sql "SELECT * FROM orders WHERE customer_id = 42"

Schema Evolution Across Shards

A schema change in a sharded keyspace is not one ALTER — it is N coordinated migrations that must reach cutover together or not at all. Vitess supports Online DDL natively, applying changes through VReplication-backed workflows (or delegating to gh-ost/pt-online-schema-change) so reads and writes continue during the copy. Coordinating the change across every shard, enforcing a global barrier before any shard switches traffic, and preserving a rollback path is the substance of Online DDL Orchestration & Migration Coordination.

# Launch a non-blocking, revertible schema change across all shards
vtctldclient ApplySchema --ddl-strategy vitess \
  --sql "ALTER TABLE orders ADD COLUMN fulfilled_at DATETIME NULL" commerce
vtctldclient OnlineDDL show commerce

The decision of whether to use the native executor or an external tool — and the operational differences in throttling, progress reporting, and cutover — is analyzed in Vitess Native Online DDL vs External Tools. Sequencing the change so shards advance in lockstep is handled in Coordinating Multi-Shard Schema Migrations, and the observable state model that controllers poll is defined in Tracking Migration Progress and State Machines.
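
The "together or not at all" requirement amounts to a barrier over per-shard migration state. The sketch below shows one way a controller might express that decision; the state names ("running", "ready", "failed") and the function name are hypothetical simplifications, not the status values Vitess reports.

```python
def barrier_decision(shard_states: dict[str, str]) -> str:
    """Decide the next controller action from per-shard migration states.

    Returns 'abort' if any shard failed, 'cutover' only when every shard
    is ready, and 'wait' otherwise.
    """
    states = set(shard_states.values())
    if "failed" in states:
        return "abort"  # one failure invalidates the whole change
    if shard_states and states == {"ready"}:
        return "cutover"  # barrier reached: all shards can switch together
    return "wait"


print(barrier_decision({"-80": "ready", "80-": "running"}))  # wait
print(barrier_decision({"-80": "ready", "80-": "ready"}))    # cutover
print(barrier_decision({"-80": "ready", "80-": "failed"}))   # abort
```

Checking for failure before readiness means a partially failed migration never reaches cutover, even if the remaining shards have finished copying.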

Operational Considerations: Tuning Knobs & Common Misconfigurations

Most topology incidents trace back to a handful of VTGate and VTTablet flags left at defaults that suit a demo, not a production fleet. The values below are starting points to tune against measured load, not universal constants.

Flag	Component	Type	Default	Recommended (production)
`--transaction_mode`	`VTGate`	enum	`MULTI`	`SINGLE` unless 2PC is required; force cross-shard writes to fail fast
`--normalize_queries`	`VTGate`	bool	`true`	`true` — literals become bind vars so the plan cache stays warm
`--max_memory_rows`	`VTGate`	int	`300000`	Lower toward `100000` to fail runaway scatters before they OOM
`--warn_sharded_only`	`VTGate`	bool	`false`	`true` in staging to surface unintended scatter-gather
`--mysql_server_query_timeout`	`VTGate`	duration	`0` (off)	`30s`–`60s` to bound client-visible tail latency
`--queryserver-config-pool-size`	`VTTablet`	int	`16`	Size to the backing MySQL `max_connections` budget per tablet
`--queryserver-config-transaction-cap`	`VTTablet`	int	`20`	Raise for write-heavy shards; cap to protect `mysqld`
`--queryserver-config-query-timeout`	`VTTablet`	duration	`30s`	Align just under the `VTGate` timeout to shed work at the tablet
`--health_check_interval`	`VTTablet`	duration	`20s`	`5s`–`10s` so `VTOrc` detects failure faster
`--degraded_threshold`	`VTTablet`	duration	`30s`	Tune to replication SLA; governs when a replica stops serving

Recurring misconfigurations to audit for:

Silent scatter-gather. A VSchema missing a table’s primary vindex routes every query to all shards. Catch it with --warn_sharded_only and by watching the VTGate scatter metrics rather than waiting for a latency page.
Mismatched timeouts. When the tablet query timeout exceeds the VTGate timeout, VTGate gives up while mysqld keeps executing, burning connections. Keep the tablet timeout below the gateway’s.
Co-located quorum. Tablets that are nominally separate but share a failure domain give the illusion of HA without the substance.
Undersized pools. A pool-size far below MySQL max_connections throttles throughput; far above it, a traffic spike exhausts mysqld and cascades.
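
Two of these checks, the timeout ordering and the pool ceiling, are mechanical enough to automate. The sketch below assumes the relevant values have already been extracted from your deployment configuration as plain numbers; the function and parameter names are hypothetical, and the wording of each finding is illustrative.

```python
def audit_flags(
    gate_timeout_s: float,
    tablet_timeout_s: float,
    pool_size: int,
    max_connections: int,
) -> list[str]:
    """Return findings for the timeout and pool misconfigurations."""
    findings = []
    if gate_timeout_s == 0:
        # A value of 0 leaves the VTGate-side timeout switched off.
        findings.append("VTGate query timeout is disabled")
    elif tablet_timeout_s >= gate_timeout_s:
        findings.append("tablet query timeout is not below the VTGate timeout")
    if pool_size > max_connections:
        findings.append("pool size exceeds MySQL max_connections")
    return findings


print(audit_flags(gate_timeout_s=30, tablet_timeout_s=30, pool_size=16, max_connections=500))
# ['tablet query timeout is not below the VTGate timeout']
```

An empty list means both checks passed; it does not say the values are well tuned, only that they are not mutually contradictory.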

Failure Modes & Recovery Patterns

Distributed topologies fail partially, and the recovery path depends on which layer degraded. The named scenarios below pair a root cause with a mitigation checklist; graceful degradation of the routing layer specifically is expanded in Implementing Fallback Routing for Shard Outages.

Primary failure on a single shard. VTOrc detects the unhealthy PRIMARY, elects a replica, and updates the serving graph.

Confirm VTOrc is running and has topology write access.
Verify --health_check_interval is low enough to detect within your RTO.
Watch vtctldclient GetTablets until a new PRIMARY appears; check EmergencyReparentShard logs if promotion stalls.
Ensure the promoted replica had semi-sync acknowledgments to avoid data loss.
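
The "watch until a new PRIMARY appears" step can be scripted as a bounded poll. The sketch below is deliberately decoupled from the CLI: fetch_tablets is a hypothetical callable that returns tablet records as dicts with shard and type fields, for example by wrapping vtctldclient GetTablets and normalizing its output. The record shape is an assumption, not the format Vitess emits.

```python
import time
from typing import Callable, Optional


def wait_for_primary(
    fetch_tablets: Callable[[], list[dict]],
    shard: str,
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll until the shard reports a PRIMARY tablet or the timeout elapses.

    Returns the primary's record, or None if none appeared in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for tablet in fetch_tablets():
            is_primary = str(tablet.get("type", "")).upper() == "PRIMARY"
            if tablet.get("shard") == shard and is_primary:
                return tablet
        if time.monotonic() >= deadline:
            return None  # caller should escalate, e.g. inspect reparent logs
        sleep(interval_s)
```

Setting timeout_s to the shard's RTO turns a None result into a direct signal that promotion has stalled and manual investigation is due.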

Topology server partition (etcd/Consul quorum loss). VTGate serves from its cached serving graph but cannot learn new changes.

Do not restart VTGate fleet-wide — a cold cache during a topo outage is worse than a stale one.
Restore etcd/Consul quorum before attempting any reparent.
Freeze schema changes and reshards until the topology store is healthy.

Scatter-gather latency storm. A missing predicate or bad plan fans hot-path traffic to all shards, saturating pools.

Identify the offending query via VTGate query logs and per-plan metrics.
Apply a temporary dynamic routing rule or block the pattern while the VSchema/predicate is fixed.
Lower --max_memory_rows so runaway scatters fail fast instead of exhausting memory.

Replica lag breaching threshold. Replicas exceed --degraded_threshold and stop serving @replica reads, shifting load to the primary.

Check for a long-running migration or backup saturating I/O.
Throttle the active Online DDL workflow; confirm throttle_status.
Scale out RDONLY capacity if analytics traffic is the source.

Python Orchestration Integration

For platform and automation engineers, Vitess exposes two distinct surfaces, and conflating them is a common early mistake. The data plane is spoken over the MySQL wire protocol: VTGate presents as a MySQL server, so any standard driver connects and routing is transparent to the application.

import mysql.connector

# VTGate speaks the MySQL protocol; the app never names a shard.
conn = mysql.connector.connect(
    host="vtgate.internal", port=15306, database="commerce"
)
cur = conn.cursor(dictionary=True)
cur.execute("SELECT * FROM orders WHERE customer_id = %s", (42,))
rows = cur.fetchall()  # targeted single-shard read

The control plane — provisioning, reparents, reshards, schema changes, and topology inspection — is driven through vtctldclient and the vtadmin API, which orchestration controllers call rather than the SQL port. A robust controller reads persisted state before acting so that retries stay idempotent:

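import json
import subprocess

# Statuses treated as finished; anything else counts as in flight.
# Assumption: each migration carries a string "status" field. Verify against your Vitess version.
TERMINAL = {"complete", "failed", "cancelled"}

def active_migrations(keyspace: str) -> list[dict]:
    """Return migrations for the keyspace that have not reached a terminal status."""
    out = subprocess.run(
        ["vtctldclient", "OnlineDDL", "show", keyspace, "--json"],
        capture_output=True, text=True, check=True,
    )
    data = json.loads(out.stdout or "[]")
    # Accept either a bare list or an object wrapping the list under "migrations".
    rows = data.get("migrations", []) if isinstance(data, dict) else data
    return [m for m in rows if str(m.get("status", "")).lower() not in TERMINAL]

# Gate a new change on there being no in-flight migration for the keyspace.
# launch_schema_change is a placeholder for the controller's own launch routine.
if not active_migrations("commerce"):
    launch_schema_change("commerce")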

These patterns compose into the broader automation surface used across the site — async VSchema validation, retry logic, and progress polling — and they are the connective tissue between the topology described here and the migration workflows under Online DDL Orchestration. Mastering Vitess topology means treating the control plane as a living system that evolves alongside workload demands: align keyspace design, routing mechanics, and schema coordination with rigorous SRE practice, and the result is a horizontally scalable MySQL deployment that holds its transactional guarantees under production load.

Explore the topology in depth

Understanding Vitess Keyspace Partitioning Models — range, hash, and lookup distribution and how each shapes routing.
Designing Horizontal Shard Topologies — capacity planning, range assignment, and failure-domain placement.
VTGate Routing Architecture Deep Dive — plan types, targeted vs scatter routing, and connection pooling internals.
Implementing Fallback Routing for Shard Outages — read-only fallback, stale-read tolerance, and circuit breaking.
Securing Multi-Tenant Sharded Databases — tenant-aware routing and isolation across shared infrastructure.
