Performance

HedgeDB vs RocksDB on the same machine, same workload, same harness.

Setup


CPU	13th Gen Intel i7-13700H (14 cores / 20 threads)
RAM	32 GB DDR5
Storage	Samsung 980 Pro 1TB NVMe
Records	100M, 24-byte keys, 100-byte values (~12 GB raw)
Key space	uniformly distributed random

Both RocksDB and HedgeDB were tested in O_DIRECT I/O mode. Each used 12 foreground threads plus 8 background threads (for flush and compaction), which reflects the test CPU's layout: 6 P-cores with SMT (12 threads) and 4+4 E-cores.

In the HedgeDB benchmarks, operations are submitted through the TooManyCooks coroutine-based thread pool. In the RocksDB benchmarks, they are submitted from plain std::thread workers.

RocksDB was tested with Universal Compaction (size-tiered), 1 GB of cache, and pin_l0_filter_and_index_blocks_in_cache enabled.

RocksDB was configured to match HedgeDB's features as closely as possible. For the exact configurations, see src/benchtool/utils.cc and rocksdb/benchtool.cc.

Throughput

Workload	HedgeDB	RocksDB	HedgeDB / RocksDB
Load (100M puts)	3.97M ops/s	1.14M ops/s	3.5×
Load + compactions drained	3.59M ops/s	1.13M ops/s	3.2×
Read (100M random gets)	1.03M ops/s	194K ops/s	5.3×
Mixed 50/50 read-write	1.33M ops/s	262K ops/s	5.1×

Latency

Read (read-only workload)

HedgeDB’s per-request latency is higher than RocksDB’s, even though its throughput is 5.3× higher. This is the expected tradeoff of the batching model. Each thread runs its own io_uring ring at queue depth 16 (QD16), keeping multiple I/O requests in flight at once. More requests in flight means higher aggregate throughput, but each individual request spends more time waiting in the queue. See the Queue-depth effect section below for a direct QD8 vs QD16 comparison.

Percentile	HedgeDB	RocksDB
avg	185 µs	60 µs
p50	155 µs	61 µs
p90	298 µs	112 µs
p99	632 µs	198 µs
p99.9	1.05 ms	295 µs
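Little's law helps explain why latency rises with queue depth. It states that the average number of requests in flight equals throughput times average latency. The Python sketch below applies it to the read-only numbers above.

The function names are illustrative. We also assume that every one of the 12 foreground threads keeps its queue full: 16 requests each for HedgeDB, and one blocking request each for RocksDB. This assumption is ours and was not measured.

```python
# Little's law: in_flight = throughput * avg_latency.
# Inputs are the read-only measurements from the table above.

def in_flight(throughput_ops_s: float, avg_latency_us: float) -> float:
    """Average number of requests outstanding at any moment."""
    return throughput_ops_s * avg_latency_us * 1e-6

def predicted_latency_us(threads: int, queue_depth: int, throughput_ops_s: float) -> float:
    """Average latency implied if each thread always has `queue_depth` requests in flight.

    Assumption (not measured): every thread keeps its queue full.
    """
    return threads * queue_depth / throughput_ops_s * 1e6

THREADS = 12  # foreground threads used in the benchmark

hedgedb = in_flight(1.03e6, 185.0)  # HedgeDB at QD16
rocksdb = in_flight(194e3, 60.0)    # RocksDB, one blocking get per thread (assumed)

print(f"HedgeDB QD16: ~{hedgedb:.0f} in flight (threads x 16 = {THREADS * 16})")
print(f"RocksDB:      ~{rocksdb:.0f} in flight (threads x 1  = {THREADS})")
print(f"Predicted HedgeDB avg latency: {predicted_latency_us(THREADS, 16, 1.03e6):.0f} us")
```

The implied in-flight counts, about 191 and about 12, land close to threads × queue depth. This suggests the rings stay nearly full. At a roughly fixed device throughput, more outstanding requests translate directly into more time spent waiting per request. Treat this as a back-of-the-envelope check, not a model of HedgeDB's scheduler.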

Write (memtable insert + WAL append)

Percentile	HedgeDB	RocksDB
avg	2.73 µs	10.28 µs
p50	2.0 µs	9.5 µs
p99	6.0 µs	17.0 µs
p99.9	23.5 µs	25.5 µs

Read latency under the mixed workload

Percentile	HedgeDB	RocksDB
avg	285 µs	84 µs
p50	237 µs	72 µs
p90	430 µs	136 µs
p99	1.09 ms	281 µs

Range scans

Range size (keys per scan)	Metric	HedgeDB	RocksDB	HedgeDB / RocksDB
Small (1-100)	scans/s	87.5K	26.3K	3.3×
Small (1-100)	keys/s	4.38M	1.32M	3.3×
Medium (512-1024)	scans/s	24.9K	6.7K	3.7×
Medium (512-1024)	keys/s	19.2M	5.12M	3.7×
Large (114K-131K)	scans/s	240	192	1.25×
Large (114K-131K)	keys/s	29.5M	23.7M	1.25×

Small and medium scans favor HedgeDB by ~3.3-3.7×. Very large scans converge: at that range size both engines are bottlenecked by sequential SSD bandwidth, not the index structure.
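The keys/s and scans/s rows are linked by the average range length. Multiplying keys/s by the 124-byte record size (24-byte key + 100-byte value) gives a rough logical payload rate. The sketch below does that arithmetic.

It deliberately ignores on-disk format overhead such as blocks and index data, so the bytes actually read from the SSD will differ. The helper names are illustrative.

```python
RECORD_BYTES = 24 + 100  # key + value size from the Setup table

# (scans/s, keys/s) per engine, copied from the range-scan table
results = {
    "small":  {"HedgeDB": (87.5e3, 4.38e6), "RocksDB": (26.3e3, 1.32e6)},
    "medium": {"HedgeDB": (24.9e3, 19.2e6), "RocksDB": (6.7e3, 5.12e6)},
    "large":  {"HedgeDB": (240.0, 29.5e6),  "RocksDB": (192.0, 23.7e6)},
}

def keys_per_scan(scans_s: float, keys_s: float) -> float:
    """Average number of keys returned by one scan."""
    return keys_s / scans_s

def payload_gb_s(keys_s: float) -> float:
    """Logical key+value bytes returned per second, in GB/s (10^9 bytes)."""
    return keys_s * RECORD_BYTES / 1e9

for size, engines in results.items():
    for engine, (scans_s, keys_s) in engines.items():
        print(f"{size:6} {engine:7} {keys_per_scan(scans_s, keys_s):9.0f} keys/scan "
              f"{payload_gb_s(keys_s):5.2f} GB/s")
```

The keys-per-scan values (about 50, 770 and 123K) match the midpoints of the three ranges. HedgeDB's payload rate climbs from roughly 0.5 GB/s to roughly 3.7 GB/s as scans get longer. This fits the point that long scans are limited by how fast bytes can be streamed, rather than by how many lookups can be issued.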

Memory (peak RSS)

Workload	HedgeDB	RocksDB
Load (100M puts)	1.53 GB	1.03 GB
Read (100M gets)	455 MB	1.30 GB
Range scans	633 MB	1.30 GB
Mixed 50/50 read-write	1.82 GB	1.89 GB

HedgeDB uses more memory during load, since the memtable holds pending writes before they flush to SSTs. On the read path it is significantly lighter: the SST index cache is demand-filled and shares nothing with the OS page cache (all reads go through O_DIRECT), so memory usage tracks actual working set rather than page-cache accumulation.
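A demand-filled cache stores an entry only after something actually asks for it, so its size follows the working set rather than the dataset. The sketch below illustrates the idea with a small LRU-bounded dictionary.

The LRU eviction policy, the capacity, the class name and the `load_index_block` loader are all hypothetical. They are not taken from HedgeDB's code, and the text above does not say which eviction policy HedgeDB uses.

```python
from collections import OrderedDict
from typing import Callable, Hashable

class DemandFilledCache:
    """Hypothetical LRU-bounded cache that loads an entry only on a miss."""

    def __init__(self, capacity: int, loader: Callable[[Hashable], bytes]):
        self.capacity = capacity
        self.loader = loader                  # e.g. an O_DIRECT read of an SST index block
        self.entries: OrderedDict = OrderedDict()
        self.misses = 0

    def get(self, key: Hashable) -> bytes:
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.loader(key)              # fill on demand, never ahead of time
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value

# Usage: only the SSTs that are actually touched ever occupy memory.
def load_index_block(sst_id: int) -> bytes:   # hypothetical loader
    return f"index-block-{sst_id}".encode()

cache = DemandFilledCache(capacity=4, loader=load_index_block)
for sst_id in [1, 1, 2, 1]:
    cache.get(sst_id)
print(len(cache.entries), cache.misses)  # 2 2
```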

`io_uring` Queue-depth effect on read latencies

All the tests above were run with each thread-local io_uring instance configured at queue depth 16 (QD16).

For very latency-sensitive workloads, the io_uring queue depth can be lowered while still keeping bandwidth utilization high.

Let’s see what happens if we reduce the QD to 8 instead:

Measurement	HedgeDB QD8	HedgeDB QD16	RocksDB
Throughput (reads/s)	881K	1.03M	193K
avg	108 µs	185 µs	60 µs
p50	99 µs	155 µs	61 µs
p90	153.5 µs	298 µs	112 µs
p99	237.5 µs	632 µs	198 µs
p99.9	331.5 µs	1025 µs	295 µs

With QD8, HedgeDB no longer maximizes the device bandwidth: throughput is 14.5% lower than at QD16. In exchange, latency improves substantially at every percentile. The average drops by about 42%, and p99 drops by about 62% (from 632 µs to 237.5 µs). HedgeDB now behaves much closer to RocksDB while still delivering about 4.6× its read throughput. This shows that HedgeDB can be adapted to latency-sensitive scenarios as well.
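The percentages quoted above come from comparing the QD8 and QD16 columns of the table. The short script below recomputes the relative change for every row, so the tradeoff can be read per percentile rather than as a single figure. The `pct_change` helper is illustrative.

```python
# (QD8, QD16) pairs from the queue-depth table; latencies in microseconds.
rows = {
    "throughput": (881e3, 1.03e6),
    "avg": (108.0, 185.0),
    "p50": (99.0, 155.0),
    "p90": (153.5, 298.0),
    "p99": (237.5, 632.0),
    "p99.9": (331.5, 1025.0),
}

def pct_change(qd8: float, qd16: float) -> float:
    """Relative change going from QD16 to QD8, in percent (negative = lower)."""
    return (qd8 - qd16) / qd16 * 100

for name, (qd8, qd16) in rows.items():
    print(f"{name:10} {pct_change(qd8, qd16):+6.1f}%")
```

Throughput falls by about 14.5%. Every latency percentile improves by between roughly 36% (p50) and 68% (p99.9), with p99 improving by about 62%.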

Q: Did you try RocksDB’s MultiGet? It even supports io_uring!

A: I did try it, but I did not observe any meaningful throughput gain, only higher latencies.

Conclusions

The results indicate that HedgeDB's multi-core, NVMe-aware architecture delivers what it was designed for.

Writes reach about 3.2-3.5× RocksDB's throughput with lower latencies, thanks to the high degree of parallelism, fast synchronization structures and the per-thread WAL.
Random reads can finally saturate the NVMe bandwidth thanks to the io_uring integration. However, the maximum throughput comes at the cost of higher latency.
Short and medium range scan workloads are IOPS-bound, and here the asynchronous architecture shines the most.
Long range scans are bandwidth-intensive rather than IOPS-intensive, so the concurrent model is less of a differentiator.

Reproducing

The benchtool and rocksdb_benchtool binaries that produced these numbers live in src/benchtool* in the repo. See Getting started for the build steps and CLI flags.
