Architecture
Kifa is an LSM-tree (Log-Structured Merge-tree) optimized for append-only workloads. Entries flow through four stages: write-ahead log, memtable, SSTables, and compaction.
Data Flow
Section titled “Data Flow”Components
Section titled “Components”Write-Ahead Log (WAL)
Section titled “Write-Ahead Log (WAL)”The WAL is the durability layer. Every entry is written to the WAL before it reaches the memtable. WAL segments are fixed at 16 MiB by default.
On Linux, the WAL uses Direct I/O (O_DIRECT) to bypass the kernel page cache. This means fsync writes directly to storage media rather than to a kernel buffer that may not survive power loss.
Each WAL entry is validated with:
- Header CRC: protects the length field
- Data CRC: protects timestamp and payload
- Magic trailer: detects incomplete writes from a crash mid-entry
Memtable
Section titled “Memtable”The memtable is an in-memory buffer holding recent writes, sorted by timestamp. Because Kifa is append-only (entries are never updated or deleted), the memtable is a Vec with entries appended in timestamp order. Entries are iterated sequentially during flush and query.
When the memtable exceeds the flush threshold (default: 4 MiB), it is written to an SSTable and cleared.
SSTables
Section titled “SSTables”SSTables (Sorted String Tables) are immutable files on disk. Each SSTable contains entries sorted by timestamp. Once written, an SSTable is never modified.
Reads search the memtable first, then SSTables from newest to oldest.
Compaction
Section titled “Compaction”Background compaction merges multiple SSTables into fewer, larger files. This reclaims disk space from obsolete WAL segments and keeps read performance bounded.
Compaction triggers when the number of SSTables exceeds the compaction threshold (default: 4). In Emergency flush mode, compaction is paused to minimize disk activity.
Manifest
Section titled “Manifest”The manifest tracks database state: which SSTables exist, the checkpoint timestamp (the point up to which entries have been flushed to SSTables), and metadata checksums. On startup, the manifest determines where WAL replay begins.
Entry Format
Section titled “Entry Format”Each log entry contains two fields:
| Field | Size | Description |
|---|---|---|
| Timestamp | 8 bytes | Unix timestamp in nanoseconds (monotonically increasing) |
| Data | Variable | Raw payload (max 1 MiB) |
Crash Recovery
Section titled “Crash Recovery”On startup, Kifa:
-
Read manifest. Determine the checkpoint timestamp.
-
Replay WAL. Replay all entries after the checkpoint timestamp into the memtable.
-
Validate CRCs. Entries with invalid CRCs or incomplete trailers are discarded. They represent writes that did not complete before the crash.
-
Report statistics. Entries replayed and time range covered.
In Cautious mode, no entries are lost. Every entry that append() confirmed is on disk. In Normal mode, up to 49 entries between syncs may be missing after a crash.