How to Calculate Optimal Shard Count for MySQL

Sizing a sharded keyspace comes down to one number — how many shards to provision — and getting it wrong on either side is expensive: too few and a shard saturates under peak write load, too many and routing overhead and migration wall-clock time balloon for no throughput gain.

Where This Fits

Shard count is the quantitative half of the partitioning decision. The keyspace partitioning model decides how a sharding key maps to a shard; the shard count decides how many shards that mapping spreads across. The two are chosen together: a hash model tolerates a higher count before scatter-gather overhead bites, while a range model wants a tighter one. The count then flows into every later design choice: it sets how wide the stateless VTGate routing layer has to fan a scatter query, how many primaries a coordinated schema change must sequence through, and how much blast radius each failure domain carries when you lay the shards out in Designing Horizontal Shard Topologies. This page is the arithmetic those later decisions assume you have done. It is written for MySQL SREs and Python orchestration builders who own the capacity model.

The Core Formula

Shard count is bounded by whichever physical resource each MySQL instance exhausts first — sustained write QPS, storage footprint, or I/O bandwidth. Size against all three and take the largest result:

$$\text{Shard Count} = \left\lceil \frac{\text{Total Peak Load}}{\text{Target per Instance} \times \text{Headroom}} \right\rceil$$

The Headroom factor is the piece most first-cut estimates omit. A shard’s primary is never doing only foreground query work: it is also shipping binlog to replicas, absorbing the row-copy phase of an Online DDL migration, and occasionally serving traffic redirected from a failed peer. Target 60–70% of a primary’s measured ceiling as the usable budget (Headroom = 0.65), leaving the rest for those background and failover surges. Sizing to 100% guarantees a latency incident the first time a maintenance window and a traffic peak coincide.

Work the calculation for each constraint independently:

Write QPS: peak_write_qps / (per_instance_write_qps × 0.65). Use write QPS, not total QPS — reads scale out onto replicas, writes do not.
Storage: total_dataset_bytes / (per_instance_target_bytes × 0.65), where the per-instance target already leaves room for the shadow (ghost) table that an Online DDL rebuild creates alongside the original.
IOPS: peak_write_iops / (per_instance_iops × 0.65), which usually dominates for write-heavy OLTP on network-attached storage.
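
As an illustration, the three constraints can be worked through with the figures from the sizing example later on this page (48,000 peak write QPS, 1.8 TB, and 90,000 peak write IOPS, against per-instance ceilings of 9,000 QPS, 350 GB, and 18,000 IOPS). The symbols $n_{\text{qps}}$, $n_{\text{size}}$, $n_{\text{iops}}$, and $n_{\text{raw}}$ are labels introduced here for convenience:

```latex
\begin{aligned}
n_{\text{qps}}  &= \frac{48\,000}{9\,000 \times 0.65} \approx 8.2 \\
n_{\text{size}} &= \frac{1.8 \times 10^{12}}{350 \times 10^{9} \times 0.65} \approx 7.9 \\
n_{\text{iops}} &= \frac{90\,000}{18\,000 \times 0.65} \approx 7.7 \\
n_{\text{raw}}  &= \max(8.2,\ 7.9,\ 7.7) = 8.2, \qquad \lceil n_{\text{raw}} \rceil = 9
\end{aligned}
```

Write QPS is the binding constraint here, but only narrowly; a modest change in dataset size or IOPS ceiling would hand the lead to a different resource, which is why all three are computed.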

Then round up to the next power of two. This is not cosmetic. Because every Vitess shard boundary is a binary keyspace-ID prefix, a power-of-two count lets a future Reshard bisect one shard into two children that cover exactly its old range — a mechanical copy rather than a hand-computed redistribution. A count of 6 forces uneven ranges and turns every split into a bespoke migration; 8 splits cleanly forever.
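
To make the bisection concrete, here is a minimal sketch that generates evenly spaced shard names for a power-of-two count, using the hexadecimal key-range notation Vitess uses for shard names (`-80`, `80-`). The helper name `shard_ranges` is hypothetical and not part of Vitess; it only illustrates why power-of-two boundaries halve cleanly.

```python
def shard_ranges(count: int) -> list[str]:
    """Return evenly spaced key-range shard names for a power-of-two count.

    Hypothetical helper for illustration only; not a Vitess API.
    """
    if count < 1 or count & (count - 1):
        raise ValueError("count must be a power of two")
    bits = count.bit_length() - 1          # log2(count)
    nbytes = max(1, (bits + 7) // 8)       # whole bytes needed per boundary
    width = nbytes * 2                     # hex digits per boundary
    step = (1 << (8 * nbytes)) // count    # distance between boundaries
    bounds = [format(i * step, f"0{width}x") for i in range(1, count)]
    edges = [""] + bounds + [""]           # open-ended first and last range
    return [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]


print(shard_ranges(4))  # ['-40', '40-80', '80-c0', 'c0-']
```

Going from 4 to 8 shards splits `40-80` into `40-60` and `60-80`, and both children appear verbatim in `shard_ranges(8)`: each parent maps onto exactly two children with no range recomputation.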

A Sizing Helper

The following captures the method as a side-effect-free function that an orchestration controller can call before it provisions a keyspace, so the count is derived from measured inputs rather than guessed:

import math

def optimal_shard_count(
    peak_write_qps: float,
    dataset_bytes: float,
    peak_write_iops: float,
    per_instance_qps: float,
    per_instance_bytes: float,
    per_instance_iops: float,
    headroom: float = 0.65,
) -> int:
    """Size a keyspace against the binding constraint, rounded up to a power of two."""
    by_qps  = peak_write_qps  / (per_instance_qps   * headroom)
    by_size = dataset_bytes   / (per_instance_bytes * headroom)
    by_iops = peak_write_iops / (per_instance_iops  * headroom)
    raw = max(by_qps, by_size, by_iops)
    # Round up to the next power of two so a future Reshard bisects cleanly.
    return 1 << math.ceil(math.log2(max(raw, 1)))

# Example: 48k write QPS, 1.8 TB, 90k write IOPS at peak
count = optimal_shard_count(
    peak_write_qps=48_000, dataset_bytes=1.8e12, peak_write_iops=90_000,
    per_instance_qps=9_000, per_instance_bytes=350e9, per_instance_iops=18_000,
)
# by_qps ≈ 8.2, by_size ≈ 7.9, by_iops ≈ 7.7  → raw 8.2 → 16 shards

In the worked example the write-QPS constraint binds at 8.2, and the power-of-two rounding lifts the answer from a naive 9 to 16. The jump looks wasteful until you note that 8 shards would put each primary at roughly 67% of its ceiling at peak, already past the 65% budget before a DDL copy starts. The rounding is the safety margin.
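
The utilisation behind that comparison can be checked directly. This is a small sketch, assuming perfectly uniform load across shards; `per_shard_utilisation` is a name introduced here for illustration.

```python
def per_shard_utilisation(
    peak_write_qps: float, shard_count: int, per_instance_qps: float
) -> float:
    """Fraction of one primary's benchmarked ceiling used at peak (uniform load assumed)."""
    return peak_write_qps / shard_count / per_instance_qps


# Same inputs as the worked example: 48k peak write QPS, 9k QPS per primary.
for n in (8, 9, 16):
    u = per_shard_utilisation(48_000, n, 9_000)
    print(f"{n:>2} shards: {u:.1%} of ceiling")
# 8 shards: 66.7%, 9 shards: 59.3%, 16 shards: 33.3%
```

At 16 shards each primary sits near a third of its ceiling, which is comfortable for DDL and failover surges but not far above the 30% consolidation line discussed under Edge Cases and Gotchas, so it is worth re-measuring once real traffic arrives.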

Parameters That Dominate the Answer

Feed the formula measured numbers, not vendor spec sheets. The per-instance ceilings below are the ones to benchmark on your own hardware before trusting any count.

| Input | What it measures | How to obtain | Sizing note |
| --- | --- | --- | --- |
| `per_instance_qps` | Sustained write QPS one primary holds at target latency | Load-test a single shard to its p99 knee | Use the knee, not the point of collapse |
| `headroom` | Fraction of ceiling usable in steady state | Policy, informed by DDL + failover surge | `0.60`–`0.70`; lower if failover fans onto peers |
| `per_instance_bytes` | Usable dataset per primary | Provisioned disk minus rebuild + binlog overhead | Reserve ~2× the largest table for Online DDL |
| `peak_write_iops` | Write IOPS at the daily peak, not the mean | Prometheus, p99 over the busiest window | Usually the binding constraint on cloud disks |
| Replication fan-out | Replicas per shard multiplying binlog load | Topology definition | More replicas raise per-primary binlog cost |
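
One way to derive `per_instance_bytes` from a provisioned disk is sketched below. It assumes that an Online DDL rebuild needs one extra full copy of the largest table on disk next to the original, which is one reading of the "~2×" note; the function name and the example sizes are illustrative, not measured values.

```python
def usable_dataset_bytes(
    disk_bytes: float, largest_table_bytes: float, binlog_reserve_bytes: float
) -> float:
    """Steady-state dataset budget for one primary.

    Assumption: the rebuild's shadow table is a full second copy of the
    largest table, so that much space is held back along with the binlogs.
    """
    usable = disk_bytes - largest_table_bytes - binlog_reserve_bytes
    if usable <= 0:
        raise ValueError("disk too small for the shadow copy plus binlog reserve")
    return usable


# Illustrative sizes: 500 GB disk, 100 GB largest table, 50 GB binlog reserve.
print(usable_dataset_bytes(500e9, 100e9, 50e9) / 1e9)  # 350.0
```

The result is the value to pass as `per_instance_bytes`; the headroom factor is then applied on top of it by the sizing formula.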

Two adjustments sit on top of the table. First, extra shards are not free capacity. Any query the partitioning model cannot keep single-shard becomes a scatter or a cross-shard transaction, whose cost climbs with shard count, so raising the count to solve a write-throughput problem can quietly regress your scatter-gather latency. Second, the count multiplies migration wall-clock time. A full-keyspace schema change touches every primary; at high shard counts the coordinated migration must batch shards to avoid a fleet-wide replication-lag spike, so operational overhead, not just hardware, sets a soft ceiling on how many shards you actually want.
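
The batching effect on migration time can be sketched as follows. `migration_batches` and the concurrency limit are hypothetical; a real controller would pick the limit from observed replication lag rather than a fixed number.

```python
from typing import Iterator, Sequence


def migration_batches(
    shards: Sequence[str], max_concurrent: int
) -> Iterator[list[str]]:
    """Yield groups of shards to migrate together, at most max_concurrent at a time."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    for start in range(0, len(shards), max_concurrent):
        yield list(shards[start:start + max_concurrent])


shards = [f"shard-{i:02d}" for i in range(16)]
rounds = list(migration_batches(shards, max_concurrent=4))
print(len(rounds))  # 4 sequential rounds for 16 shards
```

With the concurrency limit held constant, doubling the shard count doubles the number of sequential rounds, which is the soft ceiling described above.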

Edge Cases and Gotchas

Sizing on total QPS instead of write QPS. Reads offload to replicas; only writes are bounded by primary count. Mixing them inflates the shard count and wastes primaries.
Skewed key distribution defeats the average. The formula assumes uniform load. A low-cardinality or monotonic sharding key concentrates writes on one shard regardless of count — fix the shard key selection before adding shards, or you are provisioning idle hardware next to one hotspot.
Settling for a non-power-of-two. A count of 12 or 20 “to save a node” makes the next Reshard a manual range recomputation. Always round up to the next power of two: 2, 4, 8, 16, 32.
Ignoring the Online DDL storage double. A rebuild holds the original and the shadow table simultaneously; a disk sized to the steady-state dataset fills mid-migration and aborts the copy. Bake the ~2× table headroom into per_instance_bytes.
Forgetting failover redistribution. When a primary fails and its traffic redirects, surviving peers absorb it. If headroom is too tight, one failure cascades — size headroom with the VTTablet high-availability failover fan-out in mind.
Treating the count as permanent. Peak load drifts. Re-derive the count when sustained utilisation crosses 70% (expand) or sits under 30% for a full cycle (consolidate); the number is a running measurement, not a one-time provisioning constant.
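
Two of the checks above lend themselves to a small sketch: detecting skew before adding shards, and turning the 70% and 30% thresholds into a resize decision. Both function names are hypothetical, and the caller is assumed to pass sustained figures averaged over a full traffic cycle rather than instantaneous samples.

```python
from statistics import mean


def skew_ratio(per_shard_write_qps: list[float]) -> float:
    """Hottest shard's load relative to the mean; 1.0 means perfectly uniform."""
    return max(per_shard_write_qps) / mean(per_shard_write_qps)


def resize_action(
    utilisation: float, expand_at: float = 0.70, consolidate_at: float = 0.30
) -> str:
    """Map sustained utilisation (fraction of ceiling) to a resize decision."""
    if utilisation > expand_at:
        return "expand"
    if utilisation < consolidate_at:
        return "consolidate"
    return "hold"


print(skew_ratio([3000, 3100, 2900, 9000]))  # 2.0: one shard carries twice the mean
print(resize_action(0.75))                   # expand
```

A skew ratio well above 1.0 suggests that the shard key, not the shard count, is the thing to fix first.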

Verification

Never ship a computed count untested — validate it against production-like traffic before it carries real load. Replay the real read/write ratio, including the worst-case scatter-gather and concurrent Online DDL, then watch per-shard saturation against the 70% trigger line:

# List the primary tablet of every shard in the keyspace: these are the instances to watch.
# Per-shard write QPS and CPU come from your metrics, not from this command; any shard
# sustaining >70% of its benchmarked ceiling means the count is too low.
vtctldclient GetTablets --keyspace commerce --tablet-type primary

Pair that with the throughput and lag signals in Prometheus: sustained per-primary write QPS above per_instance_qps × 0.70, or replication lag that never drains during the row-copy phase, both say the count is undersized and a Reshard to the next power of two is due. Utilisation parked below 30% across a full traffic cycle says the opposite — consolidate to cut routing and migration overhead. A correctly sized keyspace shows every shard inside a narrow utilisation band with headroom intact under a simulated DDL.
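
The two undersizing signals can be expressed as simple checks over samples exported from the load test. This is a sketch only: the function names, the 5-second lag threshold, and the three-sample tail are assumptions chosen for illustration, and the inputs are presumed to have been pulled from your metrics store already.

```python
def over_trigger(
    per_shard_write_qps: dict[str, float],
    per_instance_qps: float,
    trigger: float = 0.70,
) -> list[str]:
    """Shards whose sustained write QPS exceeds trigger x the benchmarked ceiling."""
    limit = per_instance_qps * trigger
    return [shard for shard, qps in per_shard_write_qps.items() if qps > limit]


def lag_drains(
    lag_seconds: list[float], threshold_s: float = 5.0, tail: int = 3
) -> bool:
    """True if the last `tail` lag samples are all at or below the threshold."""
    if len(lag_seconds) < tail:
        raise ValueError("need at least `tail` samples")
    return all(s <= threshold_s for s in lag_seconds[-tail:])


print(over_trigger({"-80": 6500, "80-": 5900}, per_instance_qps=9_000))  # ['-80']
print(lag_drains([1, 20, 45, 30, 8, 3, 2, 1]))  # True: lag recovered after the copy
print(lag_drains([1, 20, 45, 60, 70, 80]))      # False: lag never drained
```

Any shard returned by `over_trigger`, or a `False` from `lag_drains` during the row-copy phase, points toward the undersized case described above.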

Related

Understanding Vitess Keyspace Partitioning Models: the model the shard count is applied over, and why hash and range models tolerate different counts.
Shard Key Selection Best Practices for E-Commerce: keeping load uniform so the count’s uniform-distribution assumption holds.
Designing Horizontal Shard Topologies: laying the computed count out across cells and failure domains.
