Chapter 6: Version & VersionSet - The Database Catalog
In the previous chapter, Chapter 5: WriteBatch, we learned how LevelDB groups multiple Put
and Delete
operations together to apply them atomically and efficiently. We saw that writes go first to the Write-Ahead Log (WAL) for durability, and then to the in-memory MemTable.
Eventually, the MemTable gets full and is flushed to an SSTable file on disk. Over time, LevelDB also runs compactions, which read data from existing SSTables and write new ones, deleting the old ones afterwards. This means the set of SSTable files that represent the database’s current state is constantly changing!
What’s the Problem? Tracking a Changing Set of Files
Imagine our library again. Books (SSTables) are constantly being added (from MemTable flushes), removed (after compaction), and sometimes even moved between sections (levels during compaction). How does the librarian know which books are currently part of the official collection and where they are located? If a reader asks for information, the librarian can’t just guess which books to look in – they need an accurate, up-to-date catalog.
Similarly, LevelDB needs a system to track:
- Which SSTable files exist and are currently “live” (contain valid data)?
- Which “level” each live SSTable file belongs to? (Levels are important for compaction, see Chapter 8: Compaction).
- What’s the overall state of the database, like the next available file number or the sequence number of the last operation?
- How can reads see a consistent snapshot of the database, even while background tasks are adding and removing files?
The Solution: Versions, VersionEdits, and the VersionSet
LevelDB uses a trio of concepts to manage this state:
-
Version: Think of a
Version
object as one specific edition of the library’s catalog. It represents a complete, consistent snapshot of the database state at a single point in time. Specifically, it contains lists of all the live SSTable files for each level. Once created, aVersion
object is immutable – it never changes, just like a printed catalog edition. Reads (Get
operations or Iterators) use a specificVersion
to know which files to consult. - VersionEdit: This is like a list of corrections and updates to get from one catalog edition to the next. It describes the changes between two versions. A
VersionEdit
might say:- “Add file number 15 to Level-0.” (Because a MemTable was flushed).
- “Remove files 8 and 9 from Level-1.” (Because they were compacted).
- “Add file number 25 to Level-2.” (The result of the compaction).
- “Update the next available file number to 26.”
- “Update the last sequence number.” These edits are small descriptions of changes. They are stored persistently in a special file called the
MANIFEST
.
- VersionSet: This is the chief librarian or the cataloguing department. It’s the central manager for all database state related to the set of live files. The
VersionSet
performs several critical tasks:- Keeps track of the single
current
Version (the latest catalog edition). - Reads the
MANIFEST
file during startup to reconstruct the database state. - Applies
VersionEdit
s to thecurrent
Version to create newVersion
s. - Manages essential metadata like the
next_file_number_
,log_number_
, andlast_sequence_
. - Decides which compactions are needed (Chapter 8: Compaction).
- Manages the lifecycle of
Version
objects (using reference counting) so that old versions needed by iterators or snapshots aren’t deleted prematurely.
- Keeps track of the single
In short: VersionSet
uses VersionEdit
s (from the MANIFEST
) to create a sequence of immutable Version
s, each representing the database state at a point in time. The current
Version
tells LevelDB which files to read from.
How Reads Use Versions
When you perform a Get(key)
operation, the DBImpl needs to know which SSTables to check (after checking the MemTables). It does this by consulting the current
Version
held by the VersionSet
.
// --- Simplified from db/db_impl.cc Get() ---
Status DBImpl::Get(const ReadOptions& options, const Slice& key,
std::string* value) {
// ... check MemTable, Immutable MemTable first ...
// If not found in memory, check SSTables:
else {
MutexLock l(&mutex_); // Need lock to get current Version pointer safely
Version* current = versions_->current(); // Ask VersionSet for current Version
current->Ref(); // Increment ref count (important!)
mutex_.Unlock(); // Unlock for potentially slow disk I/O
LookupKey lkey(key, snapshot_sequence_number); // Key to search for
Version::GetStats stats;
// Ask the Version object to perform the lookup in its files
Status s = current->Get(options, lkey, value, &stats);
mutex_.Lock(); // Re-acquire lock for cleanup
current->Unref(); // Decrement ref count
// ... maybe trigger compaction based on stats ...
mutex_.Unlock();
return s;
}
// ...
}
The key step is versions_->current()->Get(...)
. The DBImpl
asks the VersionSet
(versions_
) for the pointer to the current
Version
. It then calls the Get
method on that Version
object.
How does Version::Get
work?
// --- Simplified from db/version_set.cc ---
Status Version::Get(const ReadOptions& options, const LookupKey& k,
std::string* value, GetStats* stats) {
Slice ikey = k.internal_key();
Slice user_key = k.user_key();
// We search level-by-level
for (int level = 0; level < config::kNumLevels; level++) {
const std::vector<FileMetaData*>& files = files_[level]; // Get list for this level
if (files.empty()) continue; // Skip empty levels
if (level == 0) {
// Level-0 files might overlap, search newest-first
std::vector<FileMetaData*> tmp;
// Find potentially overlapping files in level 0
// ... logic to find relevant files ...
// Sort them newest-first
std::sort(tmp.begin(), tmp.end(), NewestFirst);
// Search each relevant file
for (uint32_t i = 0; i < tmp.size(); i++) {
FileMetaData* f = tmp[i];
// Use TableCache to search the actual SSTable file
Status s = vset_->table_cache_->Get(options, f->number, f->file_size,
ikey, /* saver state */, SaveValue);
// ... check if found/deleted/error and update stats ...
if (/* found or deleted */) return s;
}
} else {
// Levels > 0 files are sorted and non-overlapping
// Binary search to find the single file that might contain the key
uint32_t index = FindFile(vset_->icmp_, files, ikey);
if (index < files.size()) {
FileMetaData* f = files[index];
// Check if user_key is within the file's range
if (/* user_key is within f->smallest/f->largest range */) {
// Use TableCache to search the actual SSTable file
Status s = vset_->table_cache_->Get(options, f->number, f->file_size,
ikey, /* saver state */, SaveValue);
// ... check if found/deleted/error and update stats ...
if (/* found or deleted */) return s;
}
}
}
} // End loop over levels
return Status::NotFound(Slice()); // Key not found in any SSTable
}
Explanation:
- The
Version
object has arrays (files_[level]
) storingFileMetaData
pointers for each level.FileMetaData
contains the file number, size, and smallest/largest keys for an SSTable. - It iterates through the levels.
- Level 0: Files might overlap, so it finds all potentially relevant files, sorts them newest-first (by file number), and checks each one using the Table / SSTable & TableCache.
- Levels > 0: Files are sorted and non-overlapping. It performs a binary search (
FindFile
) to quickly locate the single file that might contain the key. It checks that file’s key range and then searches it using theTableCache
. - The search stops as soon as the key is found (either a value or a deletion marker) in any file. If it searches all relevant files in all levels without finding the key, it returns
NotFound
.
The Version
object acts as the map, guiding the search to the correct SSTable files.
How State Changes: Applying VersionEdits
The database state doesn’t stand still. MemTables are flushed, compactions happen. How does the VersionSet
update the state? By applying VersionEdit
s.
When a background task (like flushing the immutable MemTable or running a compaction) finishes, it creates a VersionEdit
describing the changes it made (e.g., “add file X, remove file Y”). It then asks the VersionSet
to apply this edit.
The core logic is in VersionSet::LogAndApply
:
// --- Simplified from db/version_set.cc ---
Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {
// 1. Fill in metadata in the edit (log number, sequence number etc.)
// ... set edit->log_number_, edit->last_sequence_, etc. ...
// 2. Create a new Version based on the current one + the edit
Version* v = new Version(this);
{
Builder builder(this, current_); // Builder starts with 'current_' state
builder.Apply(edit); // Apply the changes described by 'edit'
builder.SaveTo(v); // Save the resulting state into 'v'
}
Finalize(v); // Calculate compaction score/level for the new version
// 3. Write the edit to the MANIFEST file (for persistence)
std::string record;
edit->EncodeTo(&record); // Serialize the VersionEdit
// Unlock mutex while writing to disk (can be slow)
mu->Unlock();
Status s = descriptor_log_->AddRecord(record); // Append edit to MANIFEST log
if (s.ok()) {
s = descriptor_file_->Sync(); // Ensure MANIFEST write is durable
}
// ... handle MANIFEST write errors ...
mu->Lock(); // Re-lock mutex
// 4. Install the new version as the 'current' one
if (s.ok()) {
AppendVersion(v); // Make 'v' the new current_ version
// Update VersionSet's metadata based on the edit
log_number_ = edit->log_number_;
prev_log_number_ = edit->prev_log_number_;
} else {
delete v; // Discard the new version if MANIFEST write failed
}
return s;
}
Explanation:
- Prepare Edit: Fills in missing metadata fields in the
VersionEdit
(like the current log number and last sequence number). - Build New Version: Creates a temporary
Builder
object, initialized with the state of thecurrent_
version. It applies the changes from theedit
to this builder and then saves the resulting state into a completely newVersion
object (v
). - Log to MANIFEST: Serializes the
VersionEdit
into a string (record
) and appends it to theMANIFEST
log file (descriptor_log_
). This step makes the state change persistent. If the database crashes and restarts, it can replay theMANIFEST
file to recover the state. - Install New Version: If the
MANIFEST
write succeeds, it callsAppendVersion(v)
. This crucial step updates thecurrent_
pointer in theVersionSet
to point to the newly createdVersion
v
. Future read operations will now use this new version. It also updates theVersionSet
’s own metadata (likelog_number_
).
This process ensures that the database state transitions atomically: a new Version
only becomes current
after the changes it represents have been safely recorded in the MANIFEST
.
sequenceDiagram
participant BG as Background Task (Flush/Compact)
participant VE as VersionEdit
participant VS as VersionSet
participant VSCur as Current Version
participant VSBld as VersionSet::Builder
participant V as New Version
participant Manifest as MANIFEST Log File
BG ->> VE: Create edit (add file X, remove Y)
BG ->> VS: LogAndApply(edit)
VS ->> VSCur: Get current state
VS ->> VSBld: Create Builder(based on VSCur)
VSBld ->> VE: Apply(edit)
VSBld ->> V: Save resulting state to New Version
VS ->> V: Finalize()
VE ->> VE: EncodeTo(record)
VS ->> Manifest: AddRecord(record)
Manifest -->> VS: Write Status OK
VS ->> V: AppendVersion(V) // Make V the new 'current'
VS ->> VS: Update log_number etc.
VS -->> BG: Return OK
Version Lifecycle and Snapshots
Why keep old Version
objects around if we have a current
one? Because ongoing read operations or snapshots might still need them!
- Reference Counting: Each
Version
has a reference count (refs_
). WhenDBImpl::Get
uses a version, it callsRef()
(increment count) before starting the lookup andUnref()
(decrement count) when finished. - Snapshots: When you request a snapshot (
db->GetSnapshot()
), LevelDB essentially gives you a pointer to thecurrent
Version
at that moment and increments its reference count. As long as you hold onto that snapshot, the correspondingVersion
object (and the SSTable files it refers to) won’t be deleted, even if thecurrent
version advances due to subsequent writes and compactions. This provides a consistent point-in-time view of the data. - Cleanup: When a
Version
’s reference count drops to zero (meaning no reads or snapshots are using it anymore), it can be safely deleted. TheVersionSet
also keeps track of which underlying SSTable files are no longer referenced by any activeVersion
and can trigger their deletion from disk (DBImpl::RemoveObsoleteFiles).
The MANIFEST File
The MANIFEST
file is crucial for durability. It’s a log file (like the WAL, but for metadata changes) that stores the sequence of VersionEdit
records.
When LevelDB starts (DB::Open
), the VersionSet::Recover
method reads the MANIFEST
file from beginning to end. It starts with an empty initial state and applies each VersionEdit
it reads, step-by-step, rebuilding the database’s file state in memory. This ensures that LevelDB knows exactly which SSTable files were live when it last shut down (or crashed).
Occasionally, the MANIFEST
file can grow large. LevelDB might then write a snapshot of the entire current state (all files in all levels) as a single large record into a new MANIFEST
file and then switch subsequent edits to that new file. This prevents the recovery process from becoming too slow.
Conclusion
Version
, VersionEdit
, and VersionSet
form the core cataloguing system of LevelDB.
- Version: An immutable snapshot of which SSTable files exist at each level. Used by reads to find data.
- VersionEdit: A description of changes (files added/deleted, metadata updated) between versions. Persisted in the
MANIFEST
log. - VersionSet: Manages the
current
Version, applies edits to create new versions, handles recovery from theMANIFEST
, and manages metadata like file numbers and sequence numbers.
Together, they allow LevelDB to manage a constantly changing set of files on disk while providing consistent views for read operations and ensuring the database state can be recovered after a restart.
Now that we understand how LevelDB finds data (checking MemTables, then using the current Version
to check SSTables via the TableCache
), how does it provide a way to scan through data, not just get single keys?
Next up: Chapter 7: Iterator
Generated by AI Codebase Knowledge Builder