Thread: Re: Proposal: Exploring LSM Tree‑Based Storage Engine for PostgreSQL (Inspired by MyRocks)

On Sun, May 11, 2025 at 11:06:00AM -0700, Manish Rai Jain wrote:
> The motivation stems from the well‑known write‑amplification issues
> with B‑trees under high write throughput. An LSM‑based engine could
> offer:

Consider the ZFS copy-on-write b-tree-ish design, which has _even more_
write amplification than any normal b-tree.  The ZFS solution to that is
LSM-ish: specifically, the ZFS intent log (ZIL), which lets the
filesystem write w/o amplification for a while at the cost of the
following (sketched below):

a) having to keep an in-memory representation of the filesystem as
   modified by the ZIL.

b) having to spend time on boot after power failure reconstructing (a)
   from the on-disk ZIL.
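
A minimal C sketch of these two costs (all names and structures here
are hypothetical, not actual ZFS code): the write path appends one
record to the intent log and updates an in-memory view, which is cost
(a); zil_replay() reconstructs that view from the log after a power
failure, which is cost (b).

    #include <stddef.h>
    #include <string.h>

    #define NKEYS 1024                 /* keys are small ints, for brevity */

    struct log_rec { unsigned key; long val; };

    static struct log_rec zil[4096];   /* stands in for the on-disk ZIL */
    static size_t zil_len;

    static long in_mem[NKEYS];         /* cost (a): in-memory view of the */
    static int  in_mem_set[NKEYS];     /* filesystem as modified by the ZIL */

    /* Write path: a single log append, no b-tree rewrite, hence no
     * write amplification for now. */
    static void zil_write(unsigned key, long val)
    {
        zil[zil_len++] = (struct log_rec){ key, val };
        in_mem[key] = val;
        in_mem_set[key] = 1;
    }

    /* Cost (b): after power failure the in-memory view is gone and must
     * be reconstructed at boot by replaying the whole extant ZIL. */
    static void zil_replay(void)
    {
        memset(in_mem_set, 0, sizeof in_mem_set);
        for (size_t i = 0; i < zil_len; i++) {
            in_mem[zil[i].key] = zil[i].val;
            in_mem_set[zil[i].key] = 1;
        }
    }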

The obvious way to reduce this cost of the ZIL is to periodically write
a simple index for all of the extant ZIL so that reading the filesystem
can be done by first searching the latest ZIL index, then the
b-tree-ish filesystem itself.  This begins to resemble LSMs.
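
To make the resulting read path concrete, a hedged sketch in C (the
lookup functions are stand-ins, not real ZFS or LSM interfaces): the
newest data is consulted first and older data only on a miss, which is
exactly how an LSM read works.

    #include <stdbool.h>

    /* Stubs standing in for the real structures. */
    static bool zil_index_lookup(unsigned key, long *val)
    { (void)key; (void)val; return false; }          /* stub: always miss */

    static long btree_lookup(unsigned key)
    { (void)key; return -1; }                        /* stub */

    static long lookup(unsigned key)
    {
        long val;
        if (zil_index_lookup(key, &val))   /* search the latest ZIL index */
            return val;
        return btree_lookup(key);          /* then the b-tree-ish fs */
    }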

But what are the benefits of a copy-on-write design with write
amplification mitigated by an indexed "intent log" / LSM?  These:

 - you get to do content-addressed storage (which ZFS isn't quite, but
   any greenfield thing could be)

 - you get one root hash to identify the whole content (same as Git and
   such), which gets you:

    - hardware bit flip fault detection and recovery (if you add
      mirroring or raid-z)

    - read-only snapshots, which gets you:

       - MVCC semantics, w/o MVCC

       - O(1) cloning of read-only snapshots

       - zero-lock reads w/ high concurrency!

         (Either you lock the txid you find at the root when you start
         reading -and lock nothing else- so that pages freed in later
         transactions are not reused while your read tx is still going,
         or you lock nothing and, whenever you find checksums that don't
         match [most likely because the block you're reading has been
         freed and reused], you restart the read operation, much as with
         SERIALIZABLE transactions one has to be prepared to restart.
         The lock-nothing option is sketched below.)
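
A hedged C sketch of the lock-nothing option (names and stubs are
hypothetical): every attempt re-fetches the root, and a mismatch
between the checksum the parent recorded and the block actually read
restarts the whole operation.

    #include <stdint.h>
    #include <stdbool.h>

    struct block { uint64_t data[512]; };

    /* Hypothetical stubs for the real I/O and hashing paths. */
    static uint64_t root_addr(uint64_t *expected_sum)
    { *expected_sum = 0; return 0; }
    static bool read_block(uint64_t addr, struct block *b)
    { (void)addr; (void)b; return false; }
    static uint64_t checksum(const struct block *b)
    { (void)b; return 0; }

    /* Zero-lock read: no lock is ever taken; a stale block is detected
     * by checksum mismatch and the read restarts from a fresh root,
     * much like retrying a SERIALIZABLE transaction. */
    static bool read_root_block(struct block *out)
    {
        for (;;) {
            uint64_t sum;
            uint64_t addr = root_addr(&sum);  /* fresh root each attempt */

            if (!read_block(addr, out))
                return false;                 /* real I/O error */
            if (checksum(out) == sum)
                return true;                  /* consistent view */
            /* Mismatch: the block was likely freed by a later tx and
             * reused; go around and restart the read. */
        }
    }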

But if you want to build a distributed DB then you almost certainly
have to go back to MVCC and lose the read-only snapshots and clones.
Still, you could keep the zero-lock reads w/ high concurrency, which is
not nothing!

But for PG it's too late to even consider such a thing unless... one
were to:

 - first add support for something other than heaps for table storage

   I'm sure many would like to have this.

 - then follow the SQLite4 (yes, 4, not 3) model so you can store the
   whole DB in one LSM (or one b-tree w/ write amplification
   amortization)

   This strikes me as a lot of work.

Nico
--