Building BeachDB: A Database from Scratch (in Go)
A new rabbit hole
In my last blog post “On Watershed Moments: Boredom, Growth, and Curiosity”, I reflected on moments in my career when I hit a repeating pattern: productivity, then stability, then boredom. These moments led me to get out of my comfort zone, change roles, shed responsibilities and dig into new domains.
One of the most rewarding things I’ve been doing lately is working on technical products: systems and tools that help other engineers do their work.
At HubSpot, that was building and operating the Apache HBase distributed database at massive scale (HubSpot’s Data Infrastructure team). Now at Grafana Labs, it’s building observability products in the cloud so other engineers get answers fast.
Before all of this, I was on the other side of the table: using databases and observability tools to ship features to end users.
Stateless vs. Stateful Systems
Working on the internals of database systems is a rewarding experience, especially for software engineers who constantly work on web products.
In a way, our jobs teach us to design services in stateless terms. Or rather, they nudge us in that direction. We offload state to the database or storage system. A stateless API is, in most (all?) cases, easier to scale horizontally than a stateful one. State complicates things.
What makes studying these systems rewarding is that you get to see the turtles, and it’s turtles all the way down.
Project BeachDB
BeachDB is the name of the database I’ll be building in Go to learn more about storage, database architecture, and the parts of distributed systems I’ve mostly only met in books (so far).
BeachDB is a toy, educational project. I’m building a small, inspectable LSM-based1 storage engine in Go.
Then I will deliberately grow it into a server and, eventually, a Raft-replicated distributed NoSQL system, with progress proven through concrete artifacts like dump tools, crash tests, benchmarks, and clear semantics, rather than feature breadth or optimization theater.
Cassandra and HBase are distributed database systems too, but they’re not Raft2-based. HBase is multi-writer per table (single-writer per region) and relies on HDFS replication within the same cluster3. Cassandra has a host of other replication strategies4.
To keep this project from exploding beyond what’s humanly possible to build for the sake of learning new things, I set the following goals and non-goals for the project.
You can follow the project on GitHub: github.com/aalhour/beachdb. It’s still pretty empty for now (mostly plans in markdown), but I’ll be filling it in, and writing about it here, one milestone at a time.
Goals
- Learn and document the fundamentals by building them end to end, in public.
- Make correctness and durability first-class (WAL5, recovery, explicit invariants).
- Keep the system small, understandable, and easy to inspect with tooling.
- Define crisp semantics for writes, deletes, snapshots, and iteration before tuning.
- Layer the system intentionally: engine first, then server API, then Raft2 replication.
- Use a small set of fixed workloads to guide measurement and explain tradeoffs.
Non-goals (by design)
- Production readiness, multi-year maintenance guarantees, or compatibility promises.
- Multi-writer concurrency in the engine early on (single-writer initially).
- Background compaction early on (only after invariants are rock-solid).
- SQL, query planning, joins, or secondary indexes.
- Full transactions or serializable isolation.
- Auto sharding, region split/merge, rebalancing, quorum reads, gossip/repair.
The Architecture: A 10,000 ft. view
The architecture of BeachDB is built in three layers, intentionally kept simple so that building for the sake of learning stays manageable (a rough sketch of the engine-facing surface follows the list):
- An LSM-Tree1 storage engine inspired by Facebook’s RocksDB.
- A single-node server that embeds the storage engine and exposes an API protocol for clients.
- A distributed cluster of many nodes that replicate state over the Raft2 protocol.
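To make the layering a little more concrete, here’s a rough sketch of the kind of key-value surface the engine might expose; the single-node server would embed something like it, and the Raft layer would later replicate the mutations it accepts. The name and signatures below are illustrative placeholders, not BeachDB’s actual API.

```go
package beachdb

// Engine is a rough, hypothetical sketch of the surface the storage engine
// could expose. The single-node server would embed something like it, and
// the Raft layer would later replicate the mutations it accepts. Names and
// signatures are illustrative placeholders, not BeachDB's actual API.
type Engine interface {
	Put(key, value []byte) error    // durable write (WAL append + memtable update)
	Get(key []byte) ([]byte, error) // read the newest visible version of key
	Delete(key []byte) error        // logical delete via a tombstone record
	Close() error                   // flush in-memory state, release file handles
}
```

Keeping the surface this small is deliberate: it matches the non-goals above (no SQL, no transactions, no secondary indexes) and makes the later layers easier to reason about.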
Given that we don’t have a storage engine yet, it’s appropriate to start there and park the other two layers of the onion until it’s time to peel them.
Well, what is a storage engine? How does it ensure that data is stored? Assuming that it promises durability (BeachDB does!), how can it recover from crashes? If I kill -9 the database process, how does it guarantee NOT losing data?
I could write a long essay to answer all of this but let me show you a simulation instead.
A demo is worth 10k pages
The following is an interactive visualization that demonstrates how LSM-Tree1 databases handle writes, reads, deletes and compaction. Try adding some key-value pairs with the PUT API, run some GET and DELETE operations on them, or click ▶ Play on the Demo to see it in action! Play the demo a few times to see compactions happen.
This is a simplified visualization for educational purposes. Real LSM-Tree implementations (RocksDB, Cassandra, HBase) include additional optimizations like bloom filters, block-based SST storage, background compaction threads, write stalls, and sophisticated compaction strategies (leveled, size-tiered, FIFO).
How It Works
The visualization shows a generic LSM-Tree1 storage architecture with two planes (there’s a small Go sketch of the same layout right after the list):
- Memory Plane, which contains:
  - Memtable: An in-memory sorted map that receives writes.
- Disk Plane, which contains:
  - WAL: An append-only log that records every write before it is acknowledged.
  - SST files: Immutable, sorted files produced by flushes, organized into levels (L0, L1, L2).
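To make that layout concrete, here’s how I picture the two planes as Go types. This is a back-of-the-napkin sketch with made-up names (a plain map standing in for a sorted memtable, a bare file handle for the WAL, plain file paths for the SSTs), not BeachDB’s actual code.

```go
package engine

import "os"

// A toy sketch of the two planes in the demo. Every name here is made up
// for illustration; none of it is BeachDB's real code.
type DB struct {
	// Memory plane
	memtable map[string]entry // stand-in for a sorted map (a real engine uses a skiplist or balanced tree)

	// Disk plane
	wal    *os.File   // append-only write-ahead log
	levels [][]string // levels[0] = L0 SST file paths (newest), levels[1] = L1, and so on
}

// entry lets the memtable distinguish live values from tombstones (deletes),
// which the operation sketches further below rely on.
type entry struct {
	value     []byte
	tombstone bool
}
```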
Operations
Before the bullets, here’s the mental model I keep in my head: writes are cheap and sequential, reads are a scavenger hunt across a few places, and compaction is the janitor that stops the whole thing from becoming a landfill. (The write, flush, read, and delete paths are sketched in Go right after the list.)
- Put: Writes go to both the Memtable (memory) and WAL5 (disk) for durability.
  - The Memtable gives you a fast sorted view for reads.
  - The WAL is your “I swear I saw this write” receipt when the process crashes at the worst possible time.
- Flush: When the Memtable fills up (4 entries7), it’s flushed to disk as an SST6 file in L0.
  - Flushing turns volatile memory into an immutable, sorted file.
  - Now the Memtable can be reset and keep taking writes, while the disk holds the history.
- Get: Reads check the Memtable first, then SST6 files from newest (L0) to oldest (L2).
  - “Newest wins” is the big rule here. If the same key exists in multiple places, the freshest version is the truth.
  - Real systems add bloom filters and indexing to avoid touching every file (the demo keeps it simple on purpose).
- Delete: Deletes don’t remove data immediately; they write a TOMBSTONE record.
  - Tombstones are basically “this key is gone as of sequence X”.
  - The actual space reclaim happens later, during compaction, when older values get merged away.
- Compact: When there are 2+ files in L0, they can be merged into L1 (deduplicating keys and removing deleted ones); the merge itself is sketched in Go further below.
  - Compaction is where the engine pays for those cheap writes: it rewrites data to keep read paths sane and storage bounded.
  - This is also where some of the nastiest engineering tradeoffs live (write amplification, read amplification, space amplification).
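To tie those bullets together, here’s how the write, flush, read, and delete paths could look on the toy DB sketch from earlier. Everything below is illustrative: the same made-up package and entry type as before (plus fmt, os, and sort imports), a throwaway text format instead of a real SST or WAL encoding, and no fsync policy, batching, or background work. It’s a sketch of the ordering guarantees, not BeachDB’s implementation.

```go
// (continues the toy engine sketch from above: same package, same DB and
// entry types; assumes the imports "fmt", "os", and "sort" are present)

// Put appends the mutation to the WAL first, then updates the memtable.
// The ordering is the point: if the process dies right after the WAL write,
// the record can be replayed on startup and nothing acknowledged is lost.
func (db *DB) Put(key, value []byte) error {
	if _, err := fmt.Fprintf(db.wal, "PUT %q %q\n", key, value); err != nil {
		return err // if we can't log it, we can't promise durability
	}
	db.memtable[string(key)] = entry{value: value}
	return nil
}

// Flush writes the memtable, sorted by key, to a new L0 "SST" file and then
// resets the memtable. A real SST is a block-based binary file with an index
// and bloom filter; this is a throwaway line-per-record stand-in.
func (db *DB) Flush(path string) error {
	keys := make([]string, 0, len(db.memtable))
	for k := range db.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close() // a real engine would also fsync before trusting the file
	for _, k := range keys {
		e := db.memtable[k]
		if _, err := fmt.Fprintf(f, "%q %q %v\n", k, e.value, e.tombstone); err != nil {
			return err
		}
	}

	if len(db.levels) == 0 {
		db.levels = append(db.levels, nil) // make sure L0 exists
	}
	db.levels[0] = append(db.levels[0], path) // newest files live in L0
	db.memtable = make(map[string]entry)      // memory is free to take writes again
	return nil
}

// Get checks the memtable first ("newest wins"); if the key isn't there, a
// real engine searches SST files from newest (L0) to older levels, using
// bloom filters and indexes to avoid touching every file.
func (db *DB) Get(key []byte) ([]byte, bool) {
	if e, ok := db.memtable[string(key)]; ok {
		if e.tombstone {
			return nil, false // the newest version of this key is a delete
		}
		return e.value, true
	}
	// SST lookup omitted in this sketch: iterate db.levels from L0 down and
	// return the first version found.
	return nil, false
}

// Delete doesn't remove anything; it records a tombstone in the WAL and the
// memtable. The space comes back later, during compaction.
func (db *DB) Delete(key []byte) error {
	if _, err := fmt.Fprintf(db.wal, "DEL %q\n", key); err != nil {
		return err
	}
	db.memtable[string(key)] = entry{tombstone: true}
	return nil
}
```

The detail worth staring at is the order in Put and Delete: the WAL append happens before the in-memory update, which is exactly what makes the kill -9 question from earlier answerable.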
If you’re playing with the demo, watch for two things:
- How fast the write path stays even as you add keys (it mostly just appends and updates one in-memory structure).
- How the read path grows as more files accumulate (and why compaction exists in the first place).
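Since compaction keeps coming up, here’s roughly what that merge does, reusing the same toy entry type from the sketches above: walk the runs from newest to oldest, keep only the newest version of each key, and drop keys whose newest version is a tombstone. Maps stand in for sorted files to keep it short; it’s a sketch of the rule, not BeachDB’s compactor.

```go
// compact merges runs of entries (ordered newest first) into a single run.
// A real compactor streams a k-way merge over sorted SST files; maps keep
// this sketch short without changing the rule being illustrated.
func compact(runs []map[string]entry) map[string]entry {
	seen := make(map[string]bool)    // keys we've already decided on
	merged := make(map[string]entry) // the output run
	for _, run := range runs {       // newest run first, so "newest wins"
		for key, e := range run {
			if seen[key] {
				continue // shadowed by a newer version of the same key
			}
			seen[key] = true
			if e.tombstone {
				// The newest version is a delete: drop the key entirely.
				// (A real engine only drops tombstones when merging into
				// the bottom level, where nothing older can hide below.)
				continue
			}
			merged[key] = e
		}
	}
	return merged
}
```

This is also where the amplification tradeoffs from the list above show up: every merge rewrites data the engine already wrote once (write amplification), in exchange for fewer places to look on reads (read amplification) and less dead data on disk (space amplification).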
If you want to go deeper
I recommend watching the following lecture from the CMU Database course about LSM-Tree1 storage:
Closing thoughts (and what I’m building next)
BeachDB is my excuse to stop hand-waving storage engines and actually build one: define the semantics, write the invariants down, crash it on purpose, and make the on-disk state inspectable.
Right now the repo (github.com/aalhour/beachdb) is still empty (besides plans in markdown files), and that’s fine. The point of this series isn’t to drop a polished codebase on day one. It’s to show the messy, mechanical steps that turn “I understand LSM trees” into “I can implement one and explain where it breaks.”
Next up, I’ll start putting real code on the page: the first slices of the storage engine (WAL + memtable + flushing into an SST format), plus some tiny tools/tests to prove it survives restarts and doesn’t lie about what it stored.
Until we meet again.
Adios! ✌🏼
Notes & references
“LSM” or “LSMT” is short for Log-Structured Merge-Tree, a data structure introduced by O’Neil, Cheng, and Gawlick in 1996 that has been widely used in well-known database systems such as Google Bigtable, HBase, Cassandra, Dynamo, and RocksDB. See: the Wikipedia article, the original 1996 paper, and the “LSM techniques survey” paper by Luo & Carey.
Raft is a consensus protocol designed to make distributed consensus easier to understand and implement in fault-tolerant distributed systems. Rumor has it that it was created after the industry collectively had nightmares about Paxos xD. See the Raft docs & demo to learn more about Raft. For context on Paxos, see: The Part-Time Parliament (the original paper), Paxos Made Simple, Paxos Made Really Simple, and Just Say NO to Paxos Overhead.
HBase runs on top of the Hadoop Distributed File System (HDFS), which handles sharding and replication of files for HBase. On top of that, different companies and deployments use different cross-cluster replication setups. See: The HBase Book and the Cloudera docs.
“WAL” is short for Write-Ahead Log (or write-ahead logging), a technique used by database systems that promise durability and crash recovery: mutations to the database state are appended to a log file before they are applied internally. Appends are cheap and sequential, and the log can be replayed after a crash to recover anything that hadn’t been persisted yet (see: the Wikipedia article).
SST is short for Sorted String Table, the file format the storage engine uses when writing the contents of Memtables to disk. SSTables are sorted by key. See: the Bigtable paper and ScyllaDB’s article on SSTables.
To keep the demo and explanation simple, the memtable size is measured in number of entries. In practice, the size is measured in bytes (or kilobytes, megabytes, and so on).