
Systems Mini-DB / Cache / KV Store

A collection of storage and caching components: LRU cache, log-structured KV, and B+Tree variants.

2024-08-20
C, Python, Linux

A collection of from-scratch storage system implementations built to understand the fundamentals of databases and caching layers.

What I Built

Three storage components of increasing complexity:

1. LRU Cache

A fixed-size in-memory cache with O(1) get and put, built from a hash map combined with a doubly linked list. Implemented in both C (to understand memory layout) and Python (for rapid prototyping and comparison).
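
A minimal sketch of the structure (illustrative, not the project's exact code); Python's OrderedDict plays the role of the hash map plus linked list that the C version wires up by hand:

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()  # insertion order doubles as recency order

        def get(self, key):
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]

        def put(self, key, value):
            if key in self.entries:
                self.entries.move_to_end(key)
            self.entries[key] = value
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used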

2. Log-Structured Key-Value Store

A simple LSM-tree-inspired storage engine:

  • Append-only write log for durability
  • In-memory memtable (sorted map) for recent writes
  • Background compaction to merge and garbage-collect old entries
  • Bloom filters to avoid unnecessary disk reads (sketched below)
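
To make the Bloom filter concrete, here is a toy version; the bit-array size, hash count, and SHA-256-based probing are illustrative choices, not the project's actual parameters:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=8192, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            # Derive k probe positions by hashing the key with k salts.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(bytes([i]) + key).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            # False means definitely absent; True may be a false positive.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))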

3. B+Tree Index

A disk-oriented B+Tree implementation:

  • Page-based storage with configurable page size
  • Leaf nodes form a linked list for range scans (see the sketch after this list)
  • Interior nodes for logarithmic search
  • Simple buffer pool for caching hot pages
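
A sketch of why the leaf linked list makes range scans cheap; these are illustrative in-memory node shapes, whereas the real implementation serializes nodes onto fixed-size pages:

    import bisect

    class Interior:
        def __init__(self, keys, children):
            self.keys, self.children = keys, children  # children[i] holds keys < keys[i]

    class Leaf:
        def __init__(self, keys, values, next_leaf=None):
            self.keys, self.values, self.next = keys, values, next_leaf

    def find_leaf(node, key):
        # Logarithmic descent through interior nodes.
        while isinstance(node, Interior):
            node = node.children[bisect.bisect_right(node.keys, key)]
        return node

    def range_scan(root, lo, hi):
        # Locate the starting leaf, then follow next pointers.
        leaf = find_leaf(root, lo)
        while leaf is not None:
            for i in range(bisect.bisect_left(leaf.keys, lo), len(leaf.keys)):
                if leaf.keys[i] > hi:
                    return
                yield leaf.keys[i], leaf.values[i]
            leaf = leaf.next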

Technical Details

Write Path (LSM)

Writes follow the standard LSM pattern:

  1. Append to write-ahead log (WAL) for crash recovery
  2. Insert into memtable
  3. When memtable reaches threshold, flush to sorted run on disk
  4. Background compaction merges sorted runs

The implementation handles the awkward edge cases: partial writes, recovery after a crash, and concurrent reads during compaction.
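
A condensed sketch of steps 1-3; file names and formats are illustrative, and the compaction, checksumming, and recovery logic (where the real complexity lives) are omitted:

    import json, os

    class MiniLSM:
        def __init__(self, wal_path="wal.log", memtable_limit=1024):
            self.wal = open(wal_path, "a", encoding="utf-8")
            self.memtable = {}  # stand-in for a sorted map
            self.memtable_limit = memtable_limit
            self.run_count = 0

        def put(self, key, value):
            # 1. Append to the WAL and fsync so the write survives a crash.
            self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
            self.wal.flush()
            os.fsync(self.wal.fileno())
            # 2. Insert into the memtable.
            self.memtable[key] = value
            # 3. Flush a full memtable to a sorted run on disk.
            if len(self.memtable) >= self.memtable_limit:
                self._flush()

        def _flush(self):
            path = f"run-{self.run_count:04d}.json"
            with open(path, "w", encoding="utf-8") as f:
                json.dump(sorted(self.memtable.items()), f)
            self.run_count += 1
            self.memtable.clear()
            self.wal.truncate(0)  # entries are now durable in the run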

Read Path

Reads check (in order):

  1. Memtable (most recent writes)
  2. Bloom filter for each sorted run
  3. Binary search within matching runs (see the sketch below)
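
The same order of checks in code, assuming each sorted run is a list of (key, value) pairs paired with a Bloom filter over its keys (the might_contain interface from the sketch above):

    import bisect

    def read(key, memtable, runs):
        # 1. Most recent writes live in the memtable.
        if key in memtable:
            return memtable[key]
        # 2. Skip any run whose Bloom filter rules the key out...
        for bloom, pairs in runs:  # runs ordered newest to oldest
            if not bloom.might_contain(key.encode()):
                continue
            # 3. ...and binary-search the runs that may contain it.
            i = bisect.bisect_left(pairs, (key,))
            if i < len(pairs) and pairs[i][0] == key:
                return pairs[i][1]
        return None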

Benchmarking

Built a simple benchmarking harness to measure performance characteristics:

  • Throughput under different read/write ratios (harness sketched below)
  • Latency characteristics for reads and writes
  • Space amplification from compaction lag
  • Memory usage vs. hit rate trade-offs (cache)
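
The harness boils down to something like this (names and parameters are illustrative; it assumes a store exposing get and put):

    import random, time

    def run_benchmark(store, num_ops=100_000, read_ratio=0.9):
        keys = [f"key-{i}" for i in range(10_000)]
        start = time.perf_counter()
        for _ in range(num_ops):
            key = random.choice(keys)
            if random.random() < read_ratio:
                store.get(key)             # read path
            else:
                store.put(key, "x" * 100)  # 100-byte values
        elapsed = time.perf_counter() - start
        print(f"{read_ratio:.0%} reads: {num_ops / elapsed:,.0f} ops/s")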

Results / Learnings

Running benchmarks on different workloads revealed:

  • Write-heavy: LSM excels—sequential I/O is fast
  • Read-heavy with locality: LRU cache + B+Tree wins
  • Mixed random: Trade-offs depend heavily on memory budget

Key insights:

  • The gap between memory and disk speed spans orders of magnitude (DRAM reads in ~100 ns vs. tens of microseconds for SSDs and milliseconds for spinning disks), so caching matters
  • Compaction scheduling is surprisingly complex for real workloads
  • Understanding memory layout and cache lines significantly improved C code performance
  • Simple systems are easier to debug and reason about; start simple, optimize with data