Chapter 1: Table / SSTable & TableCache
Welcome to your LevelDB journey! This is the first chapter where we’ll start exploring the fundamental building blocks of LevelDB.
Imagine you’re building a system to store a massive amount of data, like user profiles or product information. You need a way to save this data permanently (so it doesn’t disappear when the computer turns off) and retrieve it quickly. How does LevelDB handle this?
The core idea we’ll explore in this chapter is how LevelDB stores the bulk of its data on disk in special files and how it accesses them efficiently.
What’s the Problem? Storing Lots of Data Permanently
Databases need to store key-value pairs (like user_id
-> user_data
) persistently. This means writing the data to disk. However, disks are much slower than computer memory (RAM). If we just wrote every tiny change directly to a file, it would be very slow. Also, how do we organize the data on disk so we can find a specific key quickly without reading everything?
LevelDB’s solution involves files called SSTables (Sorted String Tables), often just called Tables in the code.
SSTable: The Sorted, Immutable Book on the Shelf
Think of an SSTable as a permanently bound book in a library.
- Stores Key-Value Pairs: Just like a dictionary or an encyclopedia volume, an SSTable contains data entries, specifically key-value pairs.
- Sorted: The keys inside an SSTable file are always stored in sorted order (like words in a dictionary). This is crucial for finding data quickly later on. If you’re looking for the key “zebra”, you know you don’t need to look in the “A” section.
- Immutable: Once an SSTable file is written to disk, LevelDB never changes it. It’s like a printed book – you can’t erase or rewrite a page. If you need to update or delete data, LevelDB writes new information in newer SSTables. (We’ll see how this works in later chapters like Compaction). This immutability makes many things simpler and safer.
- It’s a File: At the end of the day, an SSTable is just a file on your computer’s disk. LevelDB gives these files names like
000005.ldb
or000010.sst
.
Here’s how LevelDB determines the filename for an SSTable:
// --- File: filename.cc ---
// Creates a filename like "dbname/000005.ldb"
std::string TableFileName(const std::string& dbname, uint64_t number) {
assert(number > 0);
// Uses a helper to format the number with leading zeros
// and adds the '.ldb' or '.sst' suffix.
return MakeFileName(dbname, number, "ldb"); // or "sst"
}
This simple function takes the database name (e.g., /path/to/my/db
) and a unique number and creates the actual filename used on disk. The .ldb
or .sst
extension helps identify it as a LevelDB table file.
Creating SSTables: BuildTable
How do these sorted, immutable files get created? This happens during processes like “flushing” data from memory or during “compaction” (which we’ll cover in later chapters: MemTable and Compaction).
The function responsible for writing a new SSTable file is BuildTable
. Think of BuildTable
as the printing press and binding machine for our book analogy. It takes data (often from memory, represented by an Iterator
) and writes it out to a new, sorted SSTable file on disk.
Let’s look at a simplified view of BuildTable
:
// --- File: builder.cc ---
// Builds an SSTable file from the key/value pairs provided by 'iter'.
Status BuildTable(const std::string& dbname, Env* env, const Options& options,
TableCache* table_cache, Iterator* iter, FileMetaData* meta) {
Status s;
// ... setup: determine filename, open the file for writing ...
std::string fname = TableFileName(dbname, meta->number);
WritableFile* file;
s = env->NewWritableFile(fname, &file);
// ... handle potential errors ...
// TableBuilder does the heavy lifting of formatting the file
TableBuilder* builder = new TableBuilder(options, file);
// Find the first key to store as the smallest key in metadata
iter->SeekToFirst();
meta->smallest.DecodeFrom(iter->key());
// Loop through all key-value pairs from the input iterator
Slice key;
for (; iter->Valid(); iter->Next()) {
key = iter->key();
// Add the key and value to the table being built
builder->Add(key, iter->value());
}
// Store the last key as the largest key in metadata
if (!key.empty()) {
meta->largest.DecodeFrom(key);
}
// Finish writing the file (adds index blocks, etc.)
s = builder->Finish();
// ... more steps: update metadata, sync file to disk, close file ...
if (s.ok()) {
meta->file_size = builder->FileSize();
s = file->Sync(); // Ensure data is physically written
}
if (s.ok()) {
s = file->Close();
}
// ... cleanup: delete builder, file; handle errors ...
return s;
}
Explanation:
- Input:
BuildTable
receives data via anIterator
. An iterator is like a cursor that lets you go through key-value pairs one by one, already in sorted order. It also gets other necessary info like the database name (dbname
), environment (env
), options, theTableCache
(we’ll see this next!), and aFileMetaData
object to store information about the new file (like its number, size, smallest key, and largest key). - File Creation: It creates a new, empty file using
env->NewWritableFile
. - TableBuilder: It uses a helper object called
TableBuilder
to handle the complex details of formatting the SSTable file structure (data blocks, index blocks, etc.). - Iteration & Adding: It loops through the
Iterator
. For each key-value pair, it callsbuilder->Add()
. Because the inputIterator
provides keys in sorted order, theTableBuilder
can write them sequentially to the file. - Metadata: It records the very first key (
meta->smallest
) and the very last key (meta->largest
) it processes. This is useful later for quickly knowing the range of keys stored in this file without opening it. - Finishing Up: It calls
builder->Finish()
to write out the final pieces of the SSTable (like the index). Then itSync
s the file to ensure the data is safely on disk andClose
s it. - Output: If successful, a new
.ldb
file exists on disk containing the sorted key-value pairs, and themeta
object is filled with details about this file.
Accessing SSTables Efficiently: TableCache
Okay, so we have these SSTable files on disk. But reading from disk is slow. If we need to read from the same SSTable file multiple times (which is common), opening and closing it repeatedly, or re-reading its internal index structure, would be inefficient.
This is where the TableCache
comes in. Think of the TableCache
as a smart librarian.
- Keeps Files Open: The librarian might keep the most popular books near the front desk instead of running to the far shelves every time someone asks for them. Similarly, the
TableCache
keeps recently used SSTable files open. - Caches Structures: Just opening the file isn’t enough. LevelDB needs to read some index information within the SSTable file to find keys quickly. The
TableCache
also keeps this parsed information in memory (RAM). It uses a specific caching strategy called LRU (Least Recently Used) to decide which table information to keep in memory if the cache gets full. - Provides Access: When LevelDB needs to read data from a specific SSTable (identified by its file number), it asks the
TableCache
. The cache checks if it already has that table open and ready in memory. If yes (a “cache hit”), it returns access quickly. If no (a “cache miss”), it opens the actual file from disk, reads the necessary index info, stores it in the cache for next time, and then returns access.
Let’s see how the TableCache
finds a table:
// --- File: table_cache.cc ---
// Tries to find the Table structure for a given file number.
// If not in cache, opens the file and loads it.
Status TableCache::FindTable(uint64_t file_number, uint64_t file_size,
Cache::Handle** handle) {
Status s;
// Create a key for the cache lookup (based on file number)
char buf[sizeof(file_number)];
EncodeFixed64(buf, file_number);
Slice key(buf, sizeof(buf));
// 1. Try looking up the table in the cache
*handle = cache_->Lookup(key);
if (*handle == nullptr) { // Cache Miss!
// 2. If not found, open the actual file from disk
std::string fname = TableFileName(dbname_, file_number);
RandomAccessFile* file = nullptr;
Table* table = nullptr;
s = env_->NewRandomAccessFile(fname, &file); // Open the file
// ... handle errors, potentially check for old .sst filename ...
if (s.ok()) {
// 3. Parse the Table structure (index etc.) from the file
s = Table::Open(options_, file, file_size, &table);
}
if (s.ok()) {
// 4. Store the opened file and parsed Table in the cache
TableAndFile* tf = new TableAndFile;
tf->file = file;
tf->table = table;
*handle = cache_->Insert(key, tf, 1 /*charge*/, &DeleteEntry);
} else {
// Error occurred, cleanup
delete file;
// Note: Errors are NOT cached. We'll retry opening next time.
}
} // else: Cache Hit! *handle is already valid.
return s;
}
Explanation:
- Lookup: It first tries
cache_->Lookup
using thefile_number
. - Cache Miss: If
Lookup
returnsnullptr
, it means the table isn’t in the cache. It then proceeds to open the file (env_->NewRandomAccessFile
). - Table::Open: It calls
Table::Open
, which reads the file’s footer, parses the index block, and sets up aTable
object ready for lookups. - Insert: If opening and parsing succeed, it creates a
TableAndFile
struct (holding both the file handle and theTable
object) and inserts it into the cache usingcache_->Insert
. Now, the next timeFindTable
is called for thisfile_number
, it will be a cache hit. - Cache Hit: If
Lookup
initially returned a valid handle,FindTable
simply returnsStatus::OK()
, and the caller can use the handle to get theTable
object.
When LevelDB needs to read data, it often gets an Iterator
for a specific SSTable via the TableCache
:
// --- File: table_cache.cc ---
// Returns an iterator for reading the specified SSTable file.
Iterator* TableCache::NewIterator(const ReadOptions& options,
uint64_t file_number, uint64_t file_size,
Table** tableptr) {
// ... setup ...
Cache::Handle* handle = nullptr;
// Use FindTable to get the Table object (from cache or by opening file)
Status s = FindTable(file_number, file_size, &handle);
if (!s.ok()) {
return NewErrorIterator(s); // Return an iterator that yields the error
}
// Get the Table object from the cache handle
Table* table = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;
// Ask the Table object to create a new iterator for its data
Iterator* result = table->NewIterator(options);
// Important: Register cleanup to release the cache handle when iterator is done
result->RegisterCleanup(&UnrefEntry, cache_, handle);
// Optionally return the Table object itself
if (tableptr != nullptr) {
*tableptr = table;
}
return result;
}
This function uses FindTable
to get the Table
object (either from the cache or by loading it from disk) and then asks that Table
object to provide an Iterator
to step through its key-value pairs. It also cleverly registers a cleanup function (UnrefEntry
) so that when the iterator is no longer needed, the cache handle is released, allowing the cache to potentially evict the table later if needed.
Here’s a diagram showing how a read might use the TableCache
:
sequenceDiagram
participant Client as Read Operation
participant TableCache
participant Cache as LRUCache
participant OS/FileSystem as FS
participant TableObject as In-Memory Table Rep
Client->>TableCache: Get("some_key", file_num=5, size=1MB)
TableCache->>Cache: Lookup(file_num=5)?
alt Cache Hit
Cache-->>TableCache: Return handle for Table 5
TableCache->>TableObject: Find "some_key" within Table 5 data
TableObject-->>TableCache: Return value / not found
TableCache-->>Client: Return value / not found
else Cache Miss
Cache-->>TableCache: Not found (nullptr)
TableCache->>FS: Open file "000005.ldb"
FS-->>TableCache: Return file handle
TableCache->>TableObject: Create Table 5 representation from file handle + size
TableObject-->>TableCache: Return Table 5 object
TableCache->>Cache: Insert(file_num=5, Table 5 object)
Note right of Cache: Table 5 now cached
TableCache->>TableObject: Find "some_key" within Table 5 data
TableObject-->>TableCache: Return value / not found
TableCache-->>Client: Return value / not found
end
Conclusion
In this chapter, we learned about two fundamental concepts in LevelDB:
- SSTable (Table): These are the immutable, sorted files on disk where LevelDB stores the bulk of its key-value data. Think of them as sorted, bound books. They are created using
BuildTable
. - TableCache: This acts like an efficient librarian for SSTables. It keeps recently used tables open and their index structures cached in memory (RAM) to speed up access, avoiding slow disk reads whenever possible. It provides access to table data, often via iterators.
These two components work together to provide persistent storage and relatively fast access to the data within those files.
But where does the data come from before it gets written into an SSTable? Often, it lives in memory first. In the next chapter, we’ll look at the in-memory structure where recent writes are held before being flushed to an SSTable.
Next up: Chapter 2: MemTable
Generated by AI Codebase Knowledge Builder