Why cloning a MariaDB schema is so goddamn slow (and how to make it 200× faster)

You’d think duplicating a 110 MB MariaDB schema in 2026 would be a sub-second operation. After all:

  • cp -r 110 MB of files on tmpfs: ~0.3s
  • btrfs subvolume snapshot: ~0.2s
  • NVMe writes at 5 GB/s

Yet a naive mariadb-dump | mariadb of a 51-table real schema takes ~15 seconds. Need 200 of those for a parallel CI test suite? 50 minutes straight-line, or ~17 min even with W=8 parallelism. Production CI shops we’ve worked with sit at 3.5 hours per pool refresh on shared HDD-backed runners.

In this post we’ll walk through what MariaDB is actually doing for those 15 seconds per clone, why every “obvious” shortcut fails, and the architectural decisions that take us from 3.5 hours to under one minute on commodity modern hardware. Easily 200× faster.

All code, scripts, and benchmarks are open-source: github.com/AIMFIRST-VN/mariadb-mysql-fast-clone-fork-same-instance.

Why we needed this

We run parallel parity tests against a real MariaDB schema (51 tables, ~110 MB) for a Laravel migration. Tests must NOT share state — each one needs a pristine copy. With ~200 parallel tests, that’s 200 fresh clones per CI run.

The naive options all hit walls:

  • Shared DB + transaction rollback per test — fragile, slow, fails on DDL tests
  • mariadb-dump | mariadb per test — 15s per clone × 200 = 50 min serial
  • Testcontainers spinning up fresh containers — minutes per container × hundreds of tests
  • mariabackup + IMPORT TABLESPACE — better, but bottlenecked by dict_sys.latch, still ~13 min for 200

We needed something faster. So we ran a benchmark series and documented every wall we hit.

Bottleneck #1: dict_sys.latch — MariaDB’s secret single-threaded path

InnoDB has one global latch protecting its in-memory data dictionary. Every DDL operation acquires it exclusively:

  • CREATE DATABASE → takes latch
  • CREATE TABLE (×51 per clone) → 51 latch acquisitions
  • ALTER TABLE DISCARD TABLESPACE (×51) → 51 more
  • ALTER TABLE IMPORT TABLESPACE (×51) → 51 more
  • ALTER TABLE ADD INDEX (×N) → N more

You can throw 32 parallel workers at it. They’ll all queue. We measured: W=8 inside one mariadbd gives ~2.6× effective parallelism vs single-threaded — not 8×. W=16 is no better than W=8.

This is why:

  • MariaDB 11.x is actually slower for this workload (~14% slower in our tests — more InnoDB validation per DDL)
  • MyISAM “looks” fast (no latch — also no transactions, defeats our parity goal)
  • myloader –optimize-keys doesn’t beat dump|load at our scale — drops + rebuilds indexes still hit the latch
  • More CPU doesn’t help — the latch is the wall

We confirmed the latch is the bottleneck by varying buffer pool, switching to faster storage, and measuring at W=1 vs W=8 vs W=16. The shape was identical.

Related upstream history: – MDEV-25506 (10.6.5) removed dict_sys_mutex; dict_sys.latch remains the wall – MDEV-28804: increased lock objects in 10.6+ slowing small-buffer-pool deployments

The fix that would actually unblock this would be MariaDB-internals work: shard dict_sys.latch by (schema, table) so independent DDL operations on independent objects truly parallelize. Until upstream lands it, the rest of this post is the pragmatic workaround.

Bottleneck #2: Inline secondary index maintenance during INSERTs

mariadb-dump | mariadb does INSERT INTO after INSERT INTO. Each INSERT updates every secondary index on the table. For a table with 3 secondary keys: 4× write amplification for every row.

Most of the 15-second load time per clone isn’t data — it’s index churn. The fix that should work but doesn’t (at small scale): defer indexes. Strip secondary keys from CREATE TABLE, load data with only the PRIMARY KEY, then ALTER TABLE ADD INDEX after. The sort-merge index build is much cheaper than 1000 INSERT-time updates per index.

But mariadb-dump | mariadb doesn’t support this natively. myloader –optimize-keys tries — drops indexes before load, recreates after — but at our small per-table size (~2000 rows average), the ALTER TABLE ADD INDEX overhead (each one takes the latch) eats the gain.

The combination that worked: strip secondary indexes from the source SCHEMA first, then use IMPORT TABLESPACE to attach the pre-built .ibd files, then ADD INDEX after. Index data lives in the .ibd; we don’t rebuild during load. 2.77× faster per clone vs dump|load.

The code we wrote for this is in strip_secondary_indexes() — a small regex-based CREATE TABLE rewriter. ~30 lines of Python.

Bottleneck #3: Every “obvious” shortcut doesn’t actually work

The internet is full of “just copy the files” advice. We tried every variant:

Approach Result
Copy /var/lib/mysql/source_db/ to /var/lib/mysql/clone_X/ SHOW DATABASES lists it, SELECT fails: “Table doesn’t exist in engine”
Same + mariadb-upgrade –force Same failure (upgrade doesn’t register orphan tablespaces)
Same + minor version bump (10.6.22 → 10.6.27) Same failure
Same + major version bump (10.6 → 10.11) mariadbd refuses to start (“Upgrade after a crash is not supported”)
Symlink the schema dir SHOW DATABASES doesn’t even list it (mariadbd skips symlinks since CVE-2017-3265)
FLUSH TABLES WITH READ LOCK + cp + UNLOCK + CREATE DATABASE INNODB_SYS_TABLES confirms zero entries registered

The cause is always the same: InnoDB’s data dictionary lives in ibdata1. Without entries there, a .ibd file is just bytes mariadbd refuses to open. The only two ways to write the dictionary: CREATE TABLE and ALTER TABLE … IMPORT TABLESPACE. Both take the latch. Both are unavoidable.

mysqld_multi? Doesn’t help — each process has its own dict_sys.latch, same single-threaded bottleneck per process. We measured this; ruled it out.

This is the days-of-struggle saver: if you try any of the shortcuts in that table, you’ll burn time and discover the same failure. Skip directly to the real architecture.

What actually works: 5 levers that compound

These five decisions compound. Each adds 1.5-3× speedup; together they get us from a 3.5-hour CI baseline to under one minute on a modern host.

1. IMPORT TABLESPACE on a no-secondary-index source + ADD INDEX after

The combination of strip-then-import-then-rebuild. 2.77× faster per clone vs dump|load at single-thread.

2. Shard across N mariadbd processes

Each mariadbd has its own dict_sys.latch. 200 mariadbds = 200 independent latches. Workers in shard A don’t queue behind workers in shard B. Linear scaling up to ~32 single-host shards (docker daemon ceiling) — and across multiple hosts after that.

3. btrfs subvolume snapshot on /dev/shm

We bake clones on ONE mariadbd (shard 0). Then we snapshot its datadir to the other 199 shards. btrfs snapshot is O(1) — metadata only, no data copy. ~190ms per snapshot, 199 snapshots in under 2 seconds (parallel W=8).

Critical detail: the btrfs loopback must be on /dev/shm (RAM), not on disk. We initially put the loopback on /tmp (host filesystem) and the bake was 2× slower because every .ibd read went through the disk loopback layer. Moving the loopback to /dev/shm matched pure tmpfs speed for the bake AND kept the free-snapshot replication.

4. MariaDB 10.6.22 (not 11, not earlier)

We benchmarked mariadb:11 head-to-head: 14% slower on the IMPORT TABLESPACE path. The newer InnoDB does extra validation during IMPORT. 10.6.22 is also the LTS-stable target.

5. S = total_clones, N = 1 per shard

The architectural insight: when N=1, per-clone wall is dominated by the single-clone bake time — there’s no contention within a shard. Push S as high as docker daemon will tolerate (around S=200 single-host on modern silicon, more if you go multi-host or native processes).

Putting it together: under one minute

Source mariadbd → mariadb-dump (~3s) → Python strip-indexes ↓ Load into bench-mariadb-0 on btrfs subvolume (~5s on modern host) FLUSH TABLES FOR EXPORT → stage .ibd + .cfg (~2s) ↓ Bake 1 clone on shard 0 (~2s) per clone: CREATE TABLE LIKE × 51, DISCARD × 51, cp .ibd × 51, IMPORT × 51, ADD INDEX × ~33 ↓ Stop shard 0 cleanly (~2s) btrfs subvolume snapshot × 199 (parallel) — ~1s ↓ Start 200 mariadbds in parallel (~50s on modern docker host) ↓ 200 shards × 1 clone = 200 pool slots READY Total: ~55s

The final benchmark on a single modern host (Ryzen 7950X / Threadripper Pro class):

Architecture 200-clone wall clones/min
Single-shard dump\|load ~8 min 25
Single-shard IMPORT no-idx ~4 min 50
4-shard btrfs-on-disk replica ~2:40 75
8-shard tmpfs cp -a replica ~1:15 160
200-shard btrfs-on-/dev/shm + snapshot ~55s 218

vs naive shared-CI baselines (HDD-backed multi-tenant runners): easily 200×+ speedup.

The math compounds: ~84× from the architecture (IMPORT TABLESPACE + sharded mariadbds + btrfs snapshot) × ~3× from modern silicon vs typical CI runner CPUs = staggering 200-250× total speedup. Same MariaDB. Same hardware. Just a different code path.

Storage compression — useful at scale

Two orthogonal compression layers can apply:

btrfs compress=zstd:9 on the loopback datadir — enables transparent block-level compression on the btrfs subvolume backing the mariadbd datadirs. At zstd:9 you get ~40% storage reduction with ~10-15% extra bake wall. Useful when /dev/shm is tight or persisting the btrfs image to disk for reuse.

InnoDB ROW_FORMAT=COMPRESSED — native page-level compression at the InnoDB layer. ~30-50% smaller .ibd files for text-heavy schemas. Pages stay compressed in the buffer pool AND in IMPORT TABLESPACE (the .cfg metadata includes compression info, so the bench needs zero changes).

Stack both: ~65-70% reduction (NOT 80% — InnoDB compression makes pages entropy-dense, btrfs adds less on top). For 10 GB schemas this matters: raw ~130 GB per shard datadir → ~40 GB with both compressions.

Open source + reproduce locally

Everything is in github.com/AIMFIRST-VN/mariadb-mysql-fast-clone-fork-same-instance:

  • The Python benchmark harness (assert_safe_path, alpine-cleanup, HARD-ceilings, pre-flight projection)
  • All 6 variant scripts (dump|load baseline, myloader, btrfs-on-disk, btrfs-on-shm, tmpfs-cp-a, single-instance reference)
  • Comparison tables with measured numbers
  • GitHub Actions workflow that runs a smoke test on github-hosted runners — verified green
  • MIT license

View the repo on GitHub

Try it on your schema:

git clone https://github.com/AIMFIRST-VN/mariadb-mysql-fast-clone-fork-same-instance.git cd mariadb-mysql-fast-clone-fork-same-instance export SOURCE_CONTAINER=your-source-mariadbd export SOURCE_DB=your_schema export SUDO_PW='your-sudo-pw' # or unset on NOPASSWD systems SHARDS=200 N=1 W=4 START_PARALLEL=32 BTRFS_SIZE_GB=20 \ python3 scripts/test_btrfs_on_shm.py

On a modern desktop you’ll have 200 pool slots in about a minute. Sub-minute on Threadripper Pro.

What’s NOT solved

The architectural ceiling is now at ~55 seconds for 200 clones on a single modern host. To go faster, the levers left:

  • Patch dict_sys.latch upstream — shard by (schema, table). Would unlock another 5-10×. This is real MariaDB-internals work; we’re not the team to do it, but documenting that this wall exists is the first step.
  • Patch the W=8 ceilingdict_sys.latch isn’t the only internal serialization. The buffer-pool mutex, page-cleaner, and purge threads all converge around W=8 in practice. Same upstream story.
  • Bypass docker startup — the ~50-second parallel start of 200 mariadbds is the new bottleneck. Native mariadbd processes (no container) would shave ~30 seconds. Multi-instance (mysqld_multi inside one container) doesn’t help because each instance has its own latch anyway.

For most workloads, sub-minute is fast enough that further optimization isn’t worth the engineering. We’re shipping.

About AIMFIRST VN

AIMFIRST VN is an AI consultancy and infrastructure-deep-work practice. We publish the patterns we discover building production systems. If you’re hitting walls like this and want help, reach out.

Code, scripts, benchmarks: github.com/AIMFIRST-VN/mariadb-mysql-fast-clone-fork-same-instance (MIT licensed)

If this saved you time, consider buying me a coffee.