Implementing Fallback Routing for Shard Outages

When a primary shard goes dark — a failed MySQL instance, a network partition, or a stalled failover — the routing tier must decide, per query, whether to buffer, reroute, or fail fast, all without violating the consistency contract the application depends on. This page shows how to implement deterministic fallback routing in Vitess: how VTGate detects an unhealthy shard, how request buffering masks a primary transition, how consistency requirements gate which replica tier is eligible, and how to keep fallback behaviour aligned with in-flight schema changes. It sits within the broader Vitess Sharding Architecture & Topology Design reference and assumes you are already routing production traffic through a sharded keyspace.

The goal is not “route somewhere” — it is to route to a tablet that owns the correct key range, satisfies the query’s consistency requirement, and reports a schema version compatible with the statement, or else to fail in a controlled, observable way. Blind broadcast or naive replica promotion turns a single-shard outage into cross-shard phantom reads and rollback storms.

Prerequisites

Before configuring fallback routing, confirm the following are in place:

Vitess 16.0 or later. Buffer-based primary failover masking (--enable_buffer and the --buffer_* flags) and reliable VTOrc-driven promotion are stable from v16 onward. Buffering semantics in older releases differ.
A healthy topology service. VTGate’s health signals derive from the topology store (etcd, Consul, or ZooKeeper). Confirm vtctldclient GetTablets returns live state before relying on any fallback path.
VTOrc deployed for automated reparent. Fallback routing masks the window of a primary transition; it does not perform the transition. Without VTOrc or a PlannedReparentShard runbook, buffered requests eventually drain into errors.
A defined consistency model per workload. You must know which query classes require read-after-write (route only to the primary or a synchronous replica) versus which tolerate replica lag. This decision drives every eligibility rule below.
Familiarity with the routing tier. Read the VTGate routing architecture deep dive first — fallback routing is an extension of normal plan execution, not a separate subsystem.

How Fallback Routing Fits the Topology

The effectiveness of any fallback strategy is constrained by how the keyspace was partitioned in the first place. A range-sharded keyspace and a hash-sharded keyspace expose completely different fallback surfaces, because the mapping from a query’s key value to an owning shard is fixed by the keyspace partitioning model. When shard -80 becomes unreachable, VTGate cannot silently redistribute its key range to 80-; those rows do not exist there. Fallback therefore operates within a shard’s replica set (primary → replica) or across a geographic copy of the same shard, never by spraying a single-shard query across siblings that do not own the data.
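To make the ownership constraint concrete, here is a minimal sketch of how a keyspace ID resolves to exactly one range-named shard. The function names are hypothetical and the parsing only covers hex range names of the kind used on this page (-80, 80-); it illustrates the idea and is not VTGate's actual resolver.

```python
def parse_shard_range(name: str) -> tuple[bytes, bytes]:
    """Split a shard name like '-80' or '80-' into (start, end) byte bounds.

    An empty bound means the range is unbounded on that side.
    """
    start_hex, end_hex = name.split("-")
    return bytes.fromhex(start_hex), bytes.fromhex(end_hex)


def owning_shard(keyspace_id: bytes, shard_names: list[str]) -> str:
    """Return the single shard whose key range contains keyspace_id."""
    for name in shard_names:
        start, end = parse_shard_range(name)
        # The range includes its start bound and excludes its end bound.
        if keyspace_id >= start and (end == b"" or keyspace_id < end):
            return name
    raise LookupError("no shard owns this keyspace id")
```

Because every keyspace ID lands in one range only, a query for a row in -80 has no meaningful answer on 80-, which is why fallback stays inside the owning shard's replica set.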

This is also why replica placement decided in Designing Horizontal Shard Topologies is a fallback-routing concern: the number, location, and replication mode of each shard’s replicas defines exactly which tablets are candidates when the primary drops.

The decision path VTGate walks on every query during a degraded state runs in a fixed order: health and schema-version checks come first, then the consistency requirement selects which replica tier (if any) is eligible, and only then does the query escalate to a cross-region fallback.
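The ordering of that decision path can be sketched in a few lines of Python. This is an illustrative model, not VTGate code: the Tablet fields schema_ok and remote, and the choose_tablet function, are assumptions made for the example, and read-after-write traffic is simplified to "primary only" (a synchronous replica would be a further eligible case).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tablet:
    alias: str
    tablet_type: str        # "PRIMARY", "REPLICA", or "RDONLY"
    serving: bool
    lag_seconds: float
    schema_ok: bool         # hypothetical: schema is compatible with the statement
    remote: bool = False    # hypothetical: tablet lives in another region


def choose_tablet(tablets: list[Tablet], needs_read_after_write: bool,
                  max_lag_seconds: float) -> Optional[Tablet]:
    """Return the tablet to route to, or None to signal 'buffer or fail fast'."""
    # 1. Health and schema-version checks come first.
    healthy = [t for t in tablets if t.serving and t.schema_ok]
    # 2. The consistency requirement selects the eligible tier.
    if needs_read_after_write:
        tier = [t for t in healthy if t.tablet_type == "PRIMARY"]
    else:
        tier = [t for t in healthy
                if t.tablet_type in ("REPLICA", "RDONLY")
                and t.lag_seconds <= max_lag_seconds]
    # 3. Prefer in-region candidates; escalate cross-region only if none exist.
    local = [t for t in tier if not t.remote]
    candidates = local or tier
    return min(candidates, key=lambda t: t.lag_seconds) if candidates else None
```

A None result is the point where the caller either buffers (primary-targeted traffic during a transition) or returns a controlled error.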

Core Mechanism: Health, Buffering, and Tablet Selection

Fallback routing in Vitess is the product of three cooperating mechanisms inside VTGate.

1. Continuous health monitoring. VTGate maintains a HealthCheck stream to every VTTablet it can route to. Each tablet reports its type (PRIMARY, REPLICA, RDONLY), a serving flag, replication lag, and a query-serving state. A tablet flips out of the eligible set the moment it reports NOT_SERVING, enters DRAINING, or exceeds the configured lag/latency threshold. Selection is not a periodic poll — it reacts to the stream, so a primary that stops serving is removed within the health-check interval rather than after a timeout on the next query.

2. Request buffering during primary transitions. The single most valuable outage-masking feature is VTGate’s in-memory buffer. When the primary for a shard becomes unavailable, VTGate can hold affected primary-targeted requests in a bounded buffer instead of failing them immediately. While requests are buffered, VTOrc (or a manual PlannedReparentShard) promotes a replica. The instant a new primary reports SERVING, buffered requests drain to it. To the application, a sub-10-second failover looks like a brief latency spike rather than a wave of errors. Buffering is per-shard and bounded — this is deliberate, and tuning those bounds is the heart of the configuration table below.

3. Consistency-gated tablet selection. For read-after-write workloads, fallback is restricted to the primary or a synchronous/semi-sync replica whose GTID position is guaranteed to include the last acknowledged write; a lagging async replica is never eligible. For read-only or analytical traffic that tolerates staleness, VTGate may route to REPLICA/RDONLY tablets, and query buffering is typically disabled for these because serving slightly stale data beats adding latency. This gate is expressed through the tablet type in the query’s target (@primary, @replica, @rdonly) combined with the buffering policy.
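The stream-driven eligibility described in mechanism 1 can be modelled as a set that is updated on every health message rather than on a polling timer. The sketch below is a simplified illustration; HealthUpdate and EligibleSet are hypothetical names, and the real health stream carries more fields than shown here.

```python
from dataclasses import dataclass


@dataclass
class HealthUpdate:
    alias: str
    serving: bool
    lag_seconds: float


class EligibleSet:
    """Tracks routable tablets by reacting to each health update as it arrives."""

    def __init__(self, max_lag_seconds: float) -> None:
        self.max_lag_seconds = max_lag_seconds
        self.aliases: set[str] = set()

    def on_update(self, update: HealthUpdate) -> None:
        if update.serving and update.lag_seconds <= self.max_lag_seconds:
            self.aliases.add(update.alias)
        else:
            # Dropped as soon as the update arrives, not when a later query times out.
            self.aliases.discard(update.alias)
```

Feeding a NOT_SERVING update for a tablet removes it from the set before the next query is planned, which is the behaviour the health-check paragraph relies on.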

Fallback selection also respects schema version. If an Online DDL orchestration migration has a tablet in a transitional state, that tablet is excluded from the eligible set until it reports SERVING with a schema that satisfies the statement — see the failure modes section.
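The bounded buffering in mechanism 2 can be illustrated with a small per-shard queue. This is a sketch under stated assumptions: FailoverBuffer is a hypothetical class, it rejects new requests when full and expires old ones at drain time, and VTGate's real eviction and timeout handling may differ in detail.

```python
import time
from collections import deque
from typing import Callable


class FailoverBuffer:
    """Holds requests for one shard while its primary is unavailable."""

    def __init__(self, size: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.size = size
        self.window_seconds = window_seconds
        self.clock = clock
        self._queue: deque[tuple[float, str]] = deque()

    def offer(self, request: str) -> bool:
        """Buffer a request; return False (fail fast) when the buffer is full."""
        if len(self._queue) >= self.size:
            return False
        self._queue.append((self.clock(), request))
        return True

    def drain(self) -> tuple[list[str], list[str]]:
        """Call once a new primary serves; returns (to_send, expired)."""
        now = self.clock()
        to_send, expired = [], []
        while self._queue:
            enqueued_at, request = self._queue.popleft()
            if now - enqueued_at <= self.window_seconds:
                to_send.append(request)
            else:
                expired.append(request)
        return to_send, expired
```

The two bounds correspond to the two ways buffering can fail: too many requests (size) or a promotion that takes too long (window).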

Step-by-Step Implementation

Each step below is independently verifiable — run its check before moving on.

1. Enable primary-failover buffering on VTGate

Turn on buffering with bounds sized to your promotion time. These are VTGate process flags:

vtgate \
  --enable_buffer \
  --buffer_size 20000 \
  --buffer_window 10s \
  --buffer_max_failover_duration 20s \
  --buffer_min_time_between_failovers 1m \
  --buffer_keyspace_shards commerce/-80,commerce/80- \
  --healthcheck_timeout 2s \
  # ... existing routing flags

--buffer_keyspace_shards scopes buffering to the shards that carry write-critical traffic; omit it to buffer all keyspaces. Verify the buffer is active by listing the buffer-related variables VTGate exports (exact variable names can vary between Vitess versions):

curl -s http://<vtgate-host>:15001/debug/vars | \
  python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps({k: v for k, v in d.items() if k.startswith('Buffer')}, indent=2))"

2. Confirm replica eligibility and replication mode

Fallback quality depends on how far a replica can lag before it is ineligible. Inspect the shard’s tablets and their reported lag:

vtctldclient GetTablets --keyspace commerce --shard -80
vtctldclient ShardReplicationPositions commerce/-80

For read-after-write shards, ensure semi-synchronous replication is enabled so a promoted replica is guaranteed durable. Set the low-lag threshold VTGate uses to treat replicas as fresh via --discovery_low_replication_lag 5s.

3. Pin write and consistency-sensitive reads with routing rules

Use routing rules so consistency-sensitive tables never resolve to a lagging replica during a fallback. Routing rules are the same mechanism covered in dynamic routing rules and query rewriting; here they encode the fallback policy declaratively:

{
  "rules": [
    {
      "from_table": "commerce.orders",
      "to_tables": ["commerce.orders@primary"]
    },
    {
      "from_table": "commerce.order_history",
      "to_tables": ["commerce.order_history@replica"]
    }
  ]
}

Apply and confirm:

vtctldclient ApplyRoutingRules --rules-file routing_rules.json
vtctldclient GetRoutingRules

orders (read-after-write) stays on the primary and is protected by buffering; order_history (tolerates staleness) is allowed to fall back to replicas immediately with no buffering penalty.
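If the consistency model is already recorded per table, the rules file can be generated rather than hand-edited. The sketch below is one way to do that; build_routing_rules and the policy dictionary are hypothetical, and the output simply reproduces the rule shape shown above for the two example tables.

```python
import json


def build_routing_rules(keyspace: str, table_targets: dict[str, str]) -> dict:
    """Map each table to '<keyspace>.<table>@<tablet_type>' in rule form."""
    rules = []
    for table, tablet_type in table_targets.items():
        if tablet_type not in ("primary", "replica", "rdonly"):
            raise ValueError(f"unknown tablet type for {table}: {tablet_type}")
        qualified = f"{keyspace}.{table}"
        rules.append({
            "from_table": qualified,
            "to_tables": [f"{qualified}@{tablet_type}"],
        })
    return {"rules": rules}


if __name__ == "__main__":
    # Hypothetical policy: read-after-write tables pin to the primary.
    policy = {"orders": "primary", "order_history": "replica"}
    print(json.dumps(build_routing_rules("commerce", policy), indent=2))
```

Redirecting the output to routing_rules.json keeps the declared consistency model and the applied rules from drifting apart.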

4. Programmatically drain a known-bad tablet

When observability flags a degraded primary before it fully fails, remove it from the serving set deterministically rather than waiting for the health check. Automation engineers typically drive this through vtctldclient; the snippet below wraps it with exponential backoff so a flapping tablet is not thrashed in and out of rotation:

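import subprocess
import time

def drain_tablet(alias: str, keyspace_shard: str, max_attempts: int = 5) -> bool:
    """Make a degraded primary read-only, then reparent its shard.

    alias is expected to be the current primary of keyspace_shard
    (for example "commerce/-80"). Returns True only if both steps succeed.
    """
    for attempt in range(max_attempts):
        result = subprocess.run(
            ["vtctldclient", "SetWritable", alias, "false"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            # Promote a healthy replica and report failure instead of ignoring it.
            reparent = subprocess.run(
                ["vtctldclient", "PlannedReparentShard", keyspace_shard],
                capture_output=True, text=True,
            )
            return reparent.returncode == 0
        if attempt < max_attempts - 1:
            # Exponential backoff (1s, 2s, 4s, ...) capped at 30s.
            time.sleep(min(2 ** attempt, 30))
    return False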

SetWritable false sheds writes immediately; PlannedReparentShard promotes a healthy replica, and buffered requests from step 1 drain onto the new primary as soon as it serves.

5. Define the cross-region escalation

If no in-region replica is eligible, escalation to a remote copy of the shard must be explicit, because it trades latency and replication lag for availability. Gate promotion of a remote tablet behind a synthetic read probe that measures live lag, and only allow it for workloads whose consistency model tolerates the staleness. Encode the fail-fast boundary so that when even the cross-region path is ineligible, VTGate returns a typed error the application circuit-breaker can catch, rather than hanging on a buffer that will time out anyway.
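The escalation gate and its fail-fast boundary can be sketched as a single function. Everything named here is an assumption for illustration: NoEligibleTabletError stands in for whatever typed error your client library surfaces, and remote_lag_by_alias is assumed to be populated by your own synthetic read probe.

```python
from typing import Optional


class NoEligibleTabletError(Exception):
    """Hypothetical typed error for the application circuit breaker to catch."""


def pick_remote_tablet(remote_lag_by_alias: dict[str, float],
                       tolerated_staleness_seconds: Optional[float]) -> str:
    """Return the freshest remote tablet that fits the workload's tolerance.

    remote_lag_by_alias holds lag measured by a synthetic read probe.
    tolerated_staleness_seconds is None for read-after-write workloads,
    which must never escalate to a lagging remote copy.
    """
    if tolerated_staleness_seconds is None:
        raise NoEligibleTabletError("workload requires read-after-write")
    fresh = {alias: lag for alias, lag in remote_lag_by_alias.items()
             if lag <= tolerated_staleness_seconds}
    if not fresh:
        raise NoEligibleTabletError("no remote tablet within staleness bound")
    return min(fresh, key=fresh.get)
```

Raising a dedicated exception type, instead of returning None or waiting on a timeout, is what lets a circuit breaker distinguish "no eligible target" from an ordinary slow query.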

Configuration Reference

Flag	Type	Default	Recommended (prod)	Notes
`--buffer_size`	int (requests)	`1000`	`20000`	sized to peak write QPS × failover window; the limit is shared across all ongoing failovers
`--buffer_window`	duration	`10s`	`10s`	max time a single request waits in the buffer
`--buffer_max_failover_duration`	duration	`20s`	`20s`	cap on total buffering per failover event
`--buffer_min_time_between_failovers`	duration	`1m`	`1m`	suppresses buffering during replication flapping
`--buffer_keyspace_shards`	csv	(all)	explicit write-critical shards only	restricts buffering to the listed keyspaces or shards
`--healthcheck_timeout`	duration	`1m`	`2s`	how long before a silent tablet is unhealthy
`--discovery_low_replication_lag`	duration	`30s`	`5s`	lag ceiling for a replica to count as fresh
`--min_number_serving_vttablets`	int	`2`	`2`	minimum tablets per replicating type kept in rotation even when their lag exceeds the low-lag ceiling
`--gateway_initial_tablet_timeout`	duration	`30s`	`30s`	startup wait for first healthy tablet

Size --buffer_size from data, not intuition: peak_writes_per_second × buffer_max_failover_duration_seconds, then add headroom. Under-sizing produces overflow errors mid-failover; grossly over-sizing risks memory pressure on VTGate under a sustained outage.
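As a worked instance of that formula, the helper below multiplies peak write rate by the failover cap and adds fractional headroom. The function name and the 25% default headroom are illustrative assumptions, not Vitess guidance.

```python
import math


def recommended_buffer_size(peak_writes_per_second: float,
                            max_failover_duration_seconds: float,
                            headroom: float = 0.25) -> int:
    """Peak writes/s times failover duration, plus fractional headroom, rounded up."""
    base = peak_writes_per_second * max_failover_duration_seconds
    return math.ceil(base * (1 + headroom))
```

For a hypothetical peak of 800 writes per second and the 20s failover cap, the base is 16000 requests, and 25% headroom brings it to 20000.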

Failure Modes Specific to Fallback Routing

Buffer overflow during a slow reparent. Symptom: VTGate BufferFailoverDurationSumMs climbs and clients see buffer full errors while buffer_max_failover_duration is exceeded. Root cause: promotion is taking longer than the buffer window — usually a VTOrc reparent stuck on a replica that cannot catch up. Mitigation: fix the promotion path (semi-sync durability, replica lag), and raise --buffer_max_failover_duration only after confirming reparent time; buffering cannot outlast a failover that never completes.

Stale reads after fallback to a lagging replica. Symptom: read-after-write violations reported by the application immediately after an outage. Root cause: a consistency-sensitive table was allowed to resolve to a REPLICA target, or --discovery_low_replication_lag is too generous. Mitigation: pin the table to @primary via routing rules (step 3) and tighten the lag ceiling.

Routing into a tablet mid-migration. Symptom: column not found / unknown table errors during a schema change, or metadata-lock contention. Root cause: fallback selected a DRAINING/NOT_SERVING tablet undergoing an online schema migration; its schema version does not match the statement. Mitigation: rely on version-aware selection — VTGate already excludes non-SERVING tablets, so ensure the migration engine sets tablet state correctly and never manually re-add a draining tablet to rotation.

Cross-tenant leakage during escalation. Symptom: a tenant observes another tenant’s rows after a failover. Root cause: an over-broad fallback route bypassed tenant-scoped keyspace boundaries. Mitigation: enforce tenant isolation in the fallback rules exactly as in steady state — see Securing Multi-Tenant Sharded Databases. Fallback must never widen the routing scope.

Buffering masks a total shard loss. Symptom: buffered requests drain into a wall of errors after buffer_window. Root cause: there was no healthy replica to promote — buffering delayed, not prevented, the failure. Mitigation: treat buffering as a bridge over a transition, not insurance against losing an entire replica set; alert on any failover that exhausts the buffer without a successful reparent.

Verification

Confirm the configuration behaves correctly before you rely on it in an incident.

Inspect live buffer state. VTGate exposes /debug/vars; watch BufferState, BufferLastFailoverDurationMs, and BufferRequestsBuffered. A healthy steady state shows zero buffered requests.
Run a controlled failover in staging. Trigger a planned reparent and confirm masking:
```
vtctldclient PlannedReparentShard commerce/-80
```
During the reparent, a tight write loop against commerce.orders should show a brief latency bump with no errors, and BufferRequestsBuffered should rise then return to zero.
Verify eligibility gates. With a replica artificially lagged past --discovery_low_replication_lag, confirm a read-after-write query still resolves to the primary and never the lagging replica:
```
mysql -h <vtgate-host> -P <vtgate-mysql-port> \
  -e "USE commerce@primary; SELECT id FROM orders WHERE id = 1"
```
Watch the SRE-facing metrics. In your dashboards, alert on vtgate_buffer_requests_buffered, vtgate_buffer_failover_duration_sum, per-shard vtgate_errors by error code, and vtgate_query_latency p99. A failover that does not resolve inside buffer_max_failover_duration must page.

Fault-inject deliberately — kill a primary VTTablet under load with a Kubernetes chaos framework and confirm the whole chain (health drop → buffer → reparent → drain) executes within budget. A fallback path that has never been exercised under real load is a fallback path you do not have.

Related

VTGate Routing Architecture Deep Dive: the routing tier that fallback selection extends, including plan execution and scatter-gather.
Understanding Vitess Keyspace Partitioning Models: why the partitioning scheme fixes which tablets can ever be a fallback target.
Designing Horizontal Shard Topologies: replica placement and replication mode that define your fallback surface.
Securing Multi-Tenant Sharded Databases: keeping tenant isolation intact during failover and escalation.
Coordinating Multi-Shard Schema Migrations: how in-flight DDL interacts with tablet serving state and version-aware routing.
