The syscall I forgot: directory fsync
TL;DR: I fsynced the WAL file on every write. Crash tests passed. SIGKILL tests passed. Then someone asked: “What if the directory doesn’t know the file exists?” I was one missing syscall away from losing (almost) everything.
This is part of an ongoing series — see all posts tagged #beachdb.
Our first bug!
In my last post, I walked through BeachDB’s Write-Ahead Log: the record format, the crash recovery, the SIGKILL tests. I was feeling pretty good about durability. Then a Discord discussion on “Software Internals” pulled on a loose thread.
We ended up talking about something I completely missed:
fsync-ing the WAL file isn’t enough when the file is new. You also need to fsync the directory.

Then people started sharing production horror stories of data loss from storage engines that forgot this.
I stared at the thread for a while. Then I stared at my code. Then I stared at the fsync(2) man page1:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
It was right there the whole time.
Two pieces of metadata, not one
When you create a new file, two things happen at the filesystem level:
- The file’s inode is allocated (its metadata, data blocks, etc.)
- A directory entry is added to the parent directory’s inode — a mapping from filename to inode number
These are two separate mutations, and they live independently in the kernel’s page cache. Calling fsync on the WAL file flushes the file’s data and metadata to disk — but the directory entry? That’s the directory’s metadata. It sits in the page cache until the kernel gets around to flushing it, or until someone fsyncs the directory itself.
So my WAL data was on disk. My WAL file’s inode was on disk. But the directory might not know the file exists.
A crash at the wrong moment and the WAL becomes an orphan: data blocks on disk with no name, invisible until fsck (or journal repair) reclaims them. The database opens, sees no WAL, and starts fresh. All your data is gone, despite every single fsync call succeeding.
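To make the failure window concrete, here’s a minimal sketch of the pattern that looks durable but isn’t. The file name, helper name, and package layout are illustrative, not BeachDB’s actual implementation:

```go
package engine // illustrative package name, not the real BeachDB layout

import (
    "os"
    "path/filepath"
)

// createWALUnsafe shows the "fsync the file and call it a day" pattern.
func createWALUnsafe(dir string, record []byte) error {
    f, err := os.OpenFile(filepath.Join(dir, "wal.log"),
        os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()

    if _, err := f.Write(record); err != nil {
        return err
    }
    // fsync(2) on the file: data blocks and the file's inode reach disk.
    if err := f.Sync(); err != nil {
        return err
    }
    // Missing step: fsync the parent directory. The entry mapping "wal.log"
    // to the inode may still sit only in the page cache, so a power loss
    // here can leave the data on disk with no name pointing at it.
    return nil
}
```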
When would this actually happen?
The crash tests and SIGKILL tests I added to the project in v0.0.1 only cover the case where the process embedding BeachDB crashes or gets killed.

In those cases the kernel survives, so any pending directory metadata eventually gets flushed on its own. But what happens if the kernel itself goes down?

Imagine the following scenario: a process starts, opens a BeachDB database in an existing directory, and creates a new WAL file. Writes are synced to the WAL. Moments later the machine crashes, before the kernel has flushed the new directory entry. That’s exactly when this bug manifests.
Other examples include:
- Actual power loss
- Kernel panic
- Hard reboot (echo b > /proc/sysrq-trigger)
The fix
Almost embarrassingly simple:
```go
func syncDir(path string) error {
    dir, err := os.Open(path)
    if err != nil {
        return fmt.Errorf("beachdb: failed to open directory for sync: %w", err)
    }
    defer dir.Close()
    return dir.Sync()
}
```
Open the directory, call Sync(), close it. One call in engine.Open(), right after the WAL file is created. That’s it.
In BeachDB, this happens when we create the WAL the first time (fresh DB dir). If you’re curious, here’s the patch commit2.
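For context, here’s roughly how the call can sit in the open path. This is a sketch with names I made up (openWAL, the wal.log file name); it assumes the os and path/filepath imports plus the syncDir helper above, and the real change is in the linked patch commit:

```go
// Sketch: open or create the WAL, and sync the directory only when the
// file is new. Names are illustrative, not BeachDB's actual code.
func openWAL(dir string) (*os.File, error) {
    path := filepath.Join(dir, "wal.log") // hypothetical WAL file name
    _, statErr := os.Stat(path)
    created := os.IsNotExist(statErr)

    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return nil, err
    }
    if created {
        // The WAL is brand new: persist its directory entry too, or a
        // crash can leave a database that opens cleanly but sees no WAL.
        if err := syncDir(dir); err != nil {
            f.Close()
            return nil, err
        }
    }
    return f, nil
}
```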
You only need to sync the directory when its metadata changes: file creation, rename, or delete. Appending to an existing file doesn’t touch the directory, so db.Put() doesn’t need it.
I’m not the first to miss this
LevelDB has a function called SyncDirIfManifest() that does exactly this — opens the parent directory with O_RDONLY and fsyncs it whenever a new MANIFEST file is created. RocksDB has an FSDirectory abstraction to fsync directories, and they’ve had bugs where directory fsync was accidentally skipped due to lifecycle/close issues. If the RocksDB team can get this wrong, I don’t feel too bad.
Jeff Moyer’s LWN article “Ensuring data reaches disk” spells this out clearly:
A newly created file may require an fsync() of not just the file itself, but also of the directory in which it was created (since this is where the file system looks to find your file).
The lesson
I had fsync everywhere. I had crash tests. I had SIGKILL tests. And I was still one missing syscall away from losing everything.
Durability is a promise you keep in layers, and I’d missed a layer. The code is patched2, the tests are written, and I won’t forget this one.
Durable file ops checklist (yes, it’s annoying)
If you care about durability across power loss / kernel panic / hard reset, here’s the checklist I wish I had tattooed on my forehead:
- Appending to an existing file: write() → fsync(file). This makes the file durable. It does not necessarily make the name of the file durable (directory entries are their own thing).1
- Creating a new file (or making it appear under a new name): write() → fsync(file) → fsync(parent directory)
- Atomic replace (tmp → rename): write the tmp file → fsync(tmp) → rename(tmp, final) → fsync(parent directory)
If this feels like overkill: it is. Until the day it isn’t.
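To make the last two checklist items concrete, here’s a minimal Go sketch of the atomic-replace pattern. It assumes the syncDir helper from the fix above plus the os and path/filepath imports; the helper name is mine, not part of BeachDB:

```go
// atomicWriteFile durably replaces dir/name with data: write a tmp file,
// fsync it, rename it over the target, then fsync the directory.
func atomicWriteFile(dir, name string, data []byte) error {
    final := filepath.Join(dir, name)
    tmp := final + ".tmp"

    f, err := os.OpenFile(tmp, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o644)
    if err != nil {
        return err
    }
    if _, err := f.Write(data); err != nil {
        f.Close()
        return err
    }
    // 1. Make the tmp file's contents durable before it can be renamed.
    if err := f.Sync(); err != nil {
        f.Close()
        return err
    }
    if err := f.Close(); err != nil {
        return err
    }
    // 2. rename(2) swaps the name atomically in the namespace.
    if err := os.Rename(tmp, final); err != nil {
        return err
    }
    // 3. Make the new directory entry durable.
    return syncDir(dir)
}
```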
Notes & references
1. fsync(2) explicitly notes that fsyncing a file does not necessarily persist the containing directory entry: man2/fsync.2.html
2. The patch: engine/fs.go and engine/fs_test.go with simple tests, plus its call in engine/db.go; full patch commit here.
3. Jeff Moyer’s “Ensuring data reaches disk” (LWN) discusses the directory fsync requirement for newly created files and portable best practices: lwn.net/Articles/457667
4. rename(2) documents rename’s atomic namespace behavior; durability still requires syncing the directory for crash safety: man2/rename.2.html