Shard Key Selection Best Practices for E-commerce

The shard key you pick for an e-commerce order keyspace decides whether a customer’s cart, order history, and payment records live on one VTTablet or scatter across every shard in the fleet — and that single decision is the hardest one to reverse once real traffic is on it.

Where This Fits

The primary vindex column is the routing primitive: the stateless VTGate routing layer hashes it to choose a shard for every query. When you lay out a keyspace in Designing Horizontal Shard Topologies, the shard key is what turns an abstract shard range into a concrete “which rows land where” rule. Pick a column that matches your dominant access pattern and most queries resolve to a single shard. Pick against it and VTGate falls back to scatter-gather, fanning the same query out to all shards and merging the results, which drives up checkout latency. Writes that span more than one shard likewise become cross-shard transactions.

This page is about that column choice specifically for an online store: high-volume, session-bound reads and writes against orders, payments, and fulfillment, punctuated by flash-sale write bursts. The related decision of how many shards to provision is covered in how to calculate optimal shard count for MySQL; here we assume the shard count is fixed and the question is which key rides on top of it.

The Concept: Route the Session, Not the Row

E-commerce access is sharply asymmetric. The customer-facing hot path — read the cart, append a line item, list this account’s orders, take a payment — is always scoped to one account. Reporting and catalog analytics span the whole dataset but run off the critical path and tolerate scatter. So the key that keeps the hot path single-shard is the one that appears in the WHERE clause of nearly every latency-sensitive query: customer_id.

Co-locating orders, payments, and fulfillment on the same shard as their owning customer means a checkout touches exactly one VTTablet. That eliminates cross-shard joins for session workflows, keeps a multi-table checkout inside a single-shard transaction (no two-phase commit), and lets VTGate build a deterministic routing plan instead of a scatter. The alternatives fail on exactly these axes:

order_id as the shard key distributes orders evenly, but every “show me this customer’s orders” query no longer knows which shard to hit, so it scatters to all of them. The most common read in the store becomes the most expensive.
product_sku as the shard key concentrates all writes for a hot SKU on one shard. During a flash sale or an inventory replenishment run, that shard becomes a write hotspot while the rest of the fleet idles — the opposite of what sharding is for.
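
The trade-off between the three candidate keys can be sketched with a small simulation. This is only an illustration under stated assumptions: the shard count, the synthetic order data, and the `shard_for` helper are all hypothetical, and the MD5-based mapping is a stand-in rather than the function a Vitess vindex actually uses.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumed fixed shard count for the illustration


def shard_for(value) -> int:
    """Map a key value to a shard with a stand-in hash (not Vitess's hash vindex)."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Hypothetical orders as (order_id, customer_id, product_sku).
# Two thirds of the orders are for one hot SKU, as in a flash sale.
orders = [
    (oid, oid % 50, "SKU-HOT" if oid % 3 else f"SKU-{oid}")
    for oid in range(1, 601)
]


def shards_for_customer(key_index: int, customer_id: int) -> set:
    """Shards holding this customer's orders when sharding on the given column."""
    return {shard_for(o[key_index]) for o in orders if o[1] == customer_id}


def write_load(key_index: int) -> Counter:
    """Rows written per shard when sharding on the given column."""
    return Counter(shard_for(o[key_index]) for o in orders)


by_customer = shards_for_customer(1, customer_id=7)  # keyed on customer_id: one shard
by_order = shards_for_customer(0, customer_id=7)     # keyed on order_id: usually several
sku_load = write_load(2)                             # keyed on product_sku: hot shard

print("shards read, customer_id key:", len(by_customer))
print("shards read, order_id key:", len(by_order))
print("busiest shard, product_sku key:", sku_load.most_common(1))
```

Keying on customer_id always yields a single shard for the account's order list, while the hot SKU's 400 rows all land on one shard when keyed on product_sku.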

Solution: A Customer-Keyed VSchema With a Secondary Order Lookup

Choosing customer_id does not mean you lose the ability to fetch an order by order_id. You keep the primary vindex on customer_id for routing and add a lookup vindex on order_id so a bare order lookup still resolves to a single shard instead of scattering.
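
A toy in-memory model may make the two-step resolution easier to picture. It assumes a two-shard layout and uses hypothetical helpers (`keyspace_id`, `shard_of`, `insert_order`, `find_order`); the SHA-256 mapping is a placeholder for the primary vindex, not what Vitess computes.

```python
import hashlib

SHARDS = ["-80", "80-"]  # assumed two-shard layout, named by key range


def keyspace_id(customer_id: int) -> bytes:
    """Stand-in for the primary vindex: customer_id -> 8-byte keyspace id."""
    return hashlib.sha256(str(customer_id).encode()).digest()[:8]


def shard_of(ksid: bytes) -> str:
    """Pick the shard whose key range contains the keyspace id."""
    return SHARDS[0] if ksid[0] < 0x80 else SHARDS[1]


lookup = {}                           # order_id -> keyspace_id (the lookup table)
rows = {name: [] for name in SHARDS}  # per-shard order rows


def insert_order(order_id: int, customer_id: int) -> None:
    """Write the order to its customer's shard and record the lookup entry."""
    ksid = keyspace_id(customer_id)
    lookup[order_id] = ksid
    rows[shard_of(ksid)].append((order_id, customer_id))


def find_order(order_id: int):
    """Resolve order_id through the lookup, then read only the resolved shard."""
    shard = shard_of(lookup[order_id])
    return shard, [r for r in rows[shard] if r[0] == order_id]


insert_order(9001, 42)
insert_order(9002, 42)
print(find_order(9001))
```

Both orders for customer 42 resolve to the same shard, and a by-order read touches only that shard plus the lookup entry.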

For an integer surrogate customer_id, the idiomatic primary vindex is hash (or xxhash); these give a uniform spread across the shard range. String vindexes such as unicode_loose_md5 exist for natural-language string keys where case-insensitive, Unicode-normalized distribution matters — they are not the right choice for an integer ID, a common mis-selection worth calling out explicitly.

{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    },
    "order_lookup": {
      "type": "consistent_lookup_unique",
      "params": {
        "table": "commerce.order_lookup",
        "from": "order_id",
        "to": "keyspace_id"
      },
      "owner": "orders"
    }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        { "column": "customer_id", "name": "hash" },
        { "column": "order_id", "name": "order_lookup" }
      ]
    },
    "payments": {
      "column_vindexes": [
        { "column": "customer_id", "name": "hash" }
      ]
    },
    "fulfillment": {
      "column_vindexes": [
        { "column": "customer_id", "name": "hash" }
      ]
    }
  }
}

The first column_vindex on a table is the primary vindex — the one that decides which shard a row is written to. Listing customer_id first on orders, payments, and fulfillment is what co-locates the whole customer record. The order_lookup on orders is a secondary vindex: VTGate consults its backing table to turn an order_id into a keyspace_id, so SELECT * FROM orders WHERE order_id = ? routes to one shard. Because it is consistent_lookup_unique and owned by orders, Vitess keeps the lookup table transactionally consistent with inserts and deletes on the owner. Product catalog tables that every shard needs to join against should be modelled as reference tables (materialized on every shard), not sharded by product_sku, which sidesteps the hotspot entirely.
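
Because co-location depends on every table listing the same primary vindex first, a small pre-deploy check can catch a misordered entry. The following is a minimal sketch, assuming the VSchema is available as parsed JSON; `check_colocation` is a hypothetical helper and it only inspects the single-column `column` form used above.

```python
import json


def check_colocation(vschema: dict, tables: list, shard_column: str) -> list:
    """Return a list of problems; an empty list means the tables are co-located."""
    problems = []
    primaries = set()
    for name in tables:
        vindexes = vschema.get("tables", {}).get(name, {}).get("column_vindexes", [])
        if not vindexes:
            problems.append(f"{name}: no column_vindexes")
            continue
        first = vindexes[0]  # the first entry is the primary vindex
        if first.get("column") != shard_column:
            problems.append(f"{name}: primary vindex column is {first.get('column')!r}")
        primaries.add(str(first.get("name")))
    if len(primaries) > 1:
        problems.append(f"tables use different primary vindexes: {sorted(primaries)}")
    return problems


# Trimmed copy of the VSchema above, used only to exercise the check.
vschema = json.loads("""
{
  "tables": {
    "orders": {"column_vindexes": [
      {"column": "customer_id", "name": "hash"},
      {"column": "order_id", "name": "order_lookup"}]},
    "payments": {"column_vindexes": [{"column": "customer_id", "name": "hash"}]},
    "fulfillment": {"column_vindexes": [{"column": "customer_id", "name": "hash"}]}
  }
}
""")

problems = check_colocation(vschema, ["orders", "payments", "fulfillment"], "customer_id")
print(problems or "all tables co-located on customer_id")
```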

Apply the VSchema without a downtime window using the same workflow described for any routing change — see how to deploy VSchema changes without downtime.

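Platform teams building automation on top of this should validate that real query predicates actually resolve to a single shard before promoting a key. VTGate's VEXPLAIN QUERIES statement lists the queries it sends to each shard, which a Python orchestration check can assert against. VEXPLAIN QUERIES runs the statement in order to capture those queries, so point the check at SELECTs or at test data: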

import MySQLdb  # any MySQL DB-API driver connects to VTGate's MySQL protocol port

def assert_single_shard(cur, sql):
    """Fail if VTGate sends the statement to more than one shard."""
    # VEXPLAIN QUERIES returns one row per query sent to MySQL;
    # the columns are (#, keyspace, shard, query).
    cur.execute(f"VEXPLAIN QUERIES {sql}")
    plan = cur.fetchall()
    shards = {row[2] for row in plan}  # shard column from the vexplain output
    assert len(shards) == 1, f"scatter query, touched {len(shards)} shards: {sql}"

conn = MySQLdb.connect(host="vtgate.internal", port=15306, db="commerce")
cur = conn.cursor()
assert_single_shard(cur, "SELECT * FROM orders WHERE customer_id = 42")
# A lookup-vindex query may also list its read of the lookup table; if that read
# lands on a different shard, compare only the rows that query orders.
assert_single_shard(cur, "SELECT * FROM orders WHERE order_id = 9001")  # via order_lookup

Wiring this into CI catches a predicate that silently regressed to a scatter — for example a report that started filtering on a non-vindex column — before it reaches production checkout traffic.

Edge Cases and Gotchas

Whale-account write skew. hash(customer_id) distributes accounts evenly, but a single B2B customer generating a disproportionate share of orders makes its shard hot regardless. Model per-customer volume before committing; genuinely oversized tenants may need to be isolated onto their own shard rather than hashed in with everyone else.
String vindex on an integer key. Selecting unicode_loose_md5 (or another string vindex) for an integer customer_id still yields a working keyspace, but it pushes integer IDs through string normalization they do not need. Use hash or xxhash for integer columns and match the vindex type to the column type.
Cross-shard foreign keys. A foreign key that spans shards forces an implicit distributed transaction on every write and cannot be enforced by MySQL. Keep referential integrity inside the customer’s shard, or move it to the application layer — never define an FK across the shard boundary.
Guest-checkout rows with no customer_id. Anonymous orders have no natural customer key. Assign a synthetic account id at order creation rather than leaving the vindex column null, or those rows route unpredictably and break the single-shard guarantee.
Reshard splittability. A hash primary vindex bisects cleanly when you split a shard, because the key space is uniform. A lookup or range vindex chosen as the primary can leave a Reshard unable to divide the range evenly — confirm the primary vindex’s key space is uniform before you need to grow.
Stale lookup rows. If the order_lookup table drifts from orders (e.g. an out-of-band delete that bypassed the owner), lookups by order_id return wrong or empty shards. Use the consistent_lookup* types and let the owner table maintain it; do not populate the lookup table by hand.
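
The whale-account check above can be modelled before committing to a key. This sketch assumes you can export order counts per customer; the volumes, the shard count, and the `shard_skew` helper are hypothetical, and the MD5 mapping only approximates a uniform hash rather than reproducing Vitess's.

```python
import hashlib


def shard_for(customer_id: int, num_shards: int) -> int:
    """Stand-in uniform hash from customer_id to a shard index."""
    digest = hashlib.md5(str(customer_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_skew(orders_per_customer: dict, num_shards: int) -> float:
    """Ratio of the busiest shard's order count to the mean across shards."""
    load = [0] * num_shards
    for customer_id, count in orders_per_customer.items():
        load[shard_for(customer_id, num_shards)] += count
    mean = sum(load) / num_shards
    return max(load) / mean


# Hypothetical volumes: 1,000 retail accounts with 10 orders each.
volumes = {cid: 10 for cid in range(1, 1001)}
even = shard_skew(volumes, num_shards=8)

# Add one B2B account that places as many orders as everyone else combined.
volumes[5000] = 10_000
with_whale = shard_skew(volumes, num_shards=8)

print(f"skew without whale: {even:.2f}, with whale: {with_whale:.2f}")
```

A skew near 1.0 means the load is even. With the single large account added, the busiest shard carries at least four times the mean regardless of how well the remaining accounts hash.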

Verification

Confirm that the primary vindex column actually produces single-shard routing. VEXPLAIN reports how many shards a plan touches, and a customer-scoped query must touch exactly one:

vtctldclient VExplain --keyspace commerce \
  "SELECT o.*, p.status FROM orders o JOIN payments p \
   ON o.customer_id = p.customer_id WHERE o.customer_id = 42"
# Expect a single-shard plan: ShardsQueried = 1, no scatter/gather operators.

If the customer-scoped join resolves to ShardsQueried = 1 and the order_id lookup resolves to one shard too, the key is doing its job. If either scatters, the predicate does not match a vindex and the plan will not hold up under checkout load. For sustained high-QPS lookups, tune the secondary vindex as described in optimizing vindex performance for high QPS.
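
If the plan is captured as JSON, the scatter check can be automated. This is a sketch under assumptions: the `OperatorType`, `Variant`, and `Inputs` field names reflect my understanding of Vitess plan output and should be checked against your version, the two sample plans are hypothetical and abbreviated, and `scatter_routes` is an illustrative helper.

```python
import json


def iter_operators(node):
    """Yield every operator dict found anywhere in a nested plan structure."""
    if isinstance(node, dict):
        if "OperatorType" in node:
            yield node
        for value in node.values():
            yield from iter_operators(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_operators(item)


def scatter_routes(plan) -> list:
    """Return the Route operators whose variant is Scatter."""
    return [
        op for op in iter_operators(plan)
        if op.get("OperatorType") == "Route" and op.get("Variant") == "Scatter"
    ]


# Hypothetical, abbreviated plans used only to exercise the helper.
single_plan = json.loads("""
{"OperatorType": "Route", "Variant": "EqualUnique",
 "Keyspace": {"Name": "commerce", "Sharded": true}}
""")
scatter_plan = json.loads("""
{"OperatorType": "Limit", "Inputs": [
  {"OperatorType": "Route", "Variant": "Scatter",
   "Keyspace": {"Name": "commerce", "Sharded": true}}]}
""")

print(len(scatter_routes(single_plan)), len(scatter_routes(scatter_plan)))
```

The helper only flags scatter routes. It does not prove that a plan is single-shard, since other route variants can also reach more than one shard.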

Related

Configuring Lookup Vindexes for Cross-Shard Joins: how the secondary order_id vindex keeps by-order reads single-shard.
How to Calculate Optimal Shard Count for MySQL: sizing the shard range the chosen key rides on.
Handling Cross-Shard Transactions in Vitess: what a mis-chosen key costs you when a write has to span shards.
