Skip to content

Architecture

Kifa is an LSM-tree (Log-Structured Merge-tree) optimized for append-only workloads. Entries flow through four stages: write-ahead log, memtable, SSTables, and compaction.

Kifa Data Flow Log entries flow from input sources through the ingester channel, write-ahead log, memtable, SSTables, and compaction. stdin File TCP UDP Ingester Channel Write-Ahead Log 16 MiB segments Durability layer Memtable In-memory buffer Sorted by timestamp Flush SSTables Immutable files Sorted on disk Compaction Merged SSTables

The WAL is the durability layer. Every entry is written to the WAL before it reaches the memtable. WAL segments are fixed at 16 MiB by default.

On Linux, the WAL uses Direct I/O (O_DIRECT) to bypass the kernel page cache. This means fsync writes directly to storage media rather than to a kernel buffer that may not survive power loss.

Each WAL entry is validated with:

  • Header CRC: protects the length field
  • Data CRC: protects timestamp and payload
  • Magic trailer: detects incomplete writes from a crash mid-entry

The memtable is an in-memory buffer holding recent writes, sorted by timestamp. Because Kifa is append-only (entries are never updated or deleted), the memtable is a Vec with entries appended in timestamp order. Entries are iterated sequentially during flush and query.

When the memtable exceeds the flush threshold (default: 4 MiB), it is written to an SSTable and cleared.

SSTables (Sorted String Tables) are immutable files on disk. Each SSTable contains entries sorted by timestamp. Once written, an SSTable is never modified.

Reads search the memtable first, then SSTables from newest to oldest.

Background compaction merges multiple SSTables into fewer, larger files. This reclaims disk space from obsolete WAL segments and keeps read performance bounded.

Compaction triggers when the number of SSTables exceeds the compaction threshold (default: 4). In Emergency flush mode, compaction is paused to minimize disk activity.

The manifest tracks database state: which SSTables exist, the checkpoint timestamp (the point up to which entries have been flushed to SSTables), and metadata checksums. On startup, the manifest determines where WAL replay begins.

Each log entry contains two fields:

FieldSizeDescription
Timestamp8 bytesUnix timestamp in nanoseconds (monotonically increasing)
DataVariableRaw payload (max 1 MiB)

On startup, Kifa:

  1. Read manifest. Determine the checkpoint timestamp.

  2. Replay WAL. Replay all entries after the checkpoint timestamp into the memtable.

  3. Validate CRCs. Entries with invalid CRCs or incomplete trailers are discarded. They represent writes that did not complete before the crash.

  4. Report statistics. Entries replayed and time range covered.

In Cautious mode, no entries are lost. Every entry that append() confirmed is on disk. In Normal mode, up to 49 entries between syncs may be missing after a crash.