LSM tree for Postgres - Mailing list pgsql-hackers

From: Konstantin Knizhnik
Subject: LSM tree for Postgres
Date:
Msg-id: 315b7ce8-9d62-3817-0a92-4b20519d0c51@postgrespro.ru
Responses:
  Re: LSM tree for Postgres (Alexander Korotkov <aekorotkov@gmail.com>)
  Re: LSM tree for Postgres (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
  Re: LSM tree for Postgres (Dmitry Dolgov <9erthalion6@gmail.com>)
List: pgsql-hackers
Hi hackers,

I want to share the results of my recent research on implementing an LSM index in Postgres.
Many modern databases (RocksDB, MongoDB, Tarantool, ...) use an LSM tree instead of the classical B-Tree.

On one side, the amount of RAM in modern servers makes it possible to keep the whole database in memory.
This leads to the anti-caching approach proposed by Michael Stonebraker:
https://15721.courses.cs.cmu.edu/spring2016/papers/hstore-anticaching.pdf

On the other side, index maintenance is one of the most important factors limiting database performance.
Postgres can insert records into a table without indexes at almost linear disk write speed (hundreds of megabytes per second).
But if the table has indexes, the inserted keys are random, and the indexes do not fit in memory, we observe a dramatic
degradation of performance. Average HDD access time is about 10 msec, which corresponds to 100 random reads per second.
If a table has several indexes and is large enough not to fit in memory, then each insert requires roughly one random read
per index, so insert speed can drop to tens of TPS. Certainly an SSD provides much better random access times,
but random reads are still slow.

The LSM approach tries to address this problem.
First of all, I started my experiments with RocksDB (perhaps the most popular LSM-based key-value storage, used in many databases).
There is an FDW for RocksDB from the VidarDB project: https://github.com/vidardb/pgrocks
Since RocksDB is a multithreaded embedded database engine and Postgres is based on a multiprocess architecture,
they used an interesting "server inside server" approach: a bgworker process works with RocksDB,
and backends send requests to it through a shared memory queue.
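
As an illustration, here is a minimal C sketch of that pattern, assuming a pair of shm_mq queues between a backend and the worker; LsmRequest, the message layout, and lsm_client_call are hypothetical names, not the actual pgrocks protocol:

/*
 * Hypothetical sketch (not the actual pgrocks code): a backend serializes
 * a key-value request and ships it to the RocksDB bgworker through a pair
 * of Postgres shared memory queues (shm_mq), then waits for the reply.
 */
#include "postgres.h"
#include "storage/shm_mq.h"

typedef struct LsmRequest
{
    int32   op;                             /* e.g. 0 = get, 1 = put */
    int32   key_len;
    int32   value_len;
    char    data[FLEXIBLE_ARRAY_MEMBER];    /* key bytes, then value bytes */
} LsmRequest;

/* Send one request to the worker and block until its reply arrives. */
static void
lsm_client_call(shm_mq_handle *outq, shm_mq_handle *inq,
                LsmRequest *req, Size req_size)
{
    Size        reply_size;
    void       *reply;

    /* PG13-era shm_mq_send() signature (no force_flush argument). */
    if (shm_mq_send(outq, req_size, req, false) != SHM_MQ_SUCCESS)
        elog(ERROR, "RocksDB worker detached");

    if (shm_mq_receive(inq, &reply_size, &reply, false) != SHM_MQ_SUCCESS)
        elog(ERROR, "RocksDB worker detached");

    /* ... deserialize and use the reply here ... */
}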

I have significantly rewritten their FDW implementation: the original RocksDB server was single-threaded.
I made it multithreaded, so that multiple RocksDB requests can run in parallel.
My implementation can be found here:
https://github.com/postgrespro/lsm

Some benchmark results.
The benchmark is just the insertion of 250 million records with random keys into an inclusive (covering) index containing one bigint key and 8 bigint fields.
The size of the index is about 20GB and the target system has 16GB of RAM:


Index             Clients   TPS
Inclusive B-Tree  1         9387
Inclusive B-Tree  10        18761
RocksDB FDW       1         138350
RocksDB FDW       10        122369
RocksDB           1         166333
RocksDB           10        141482


As you can see, there is about a 10x difference.
Maybe somebody will find this idea of an IOT (index-organized table) based on RocksDB in Postgres useful.
But this approach violates all the ACID properties of Postgres:
there is no atomicity or consistency (in principle RocksDB supports 2PC, but it is not used here),
isolation corresponds to something like "read uncommitted",
and as for durability, it is all up to RocksDB, and I have serious doubts that it will survive a failure, especially with sync write mode disabled.
So I consider this project mostly a prototype for estimating the efficiency of the LSM approach.

Then I thought about implementing the ideas of LSM using the standard Postgres nbtree.

We need two indexes: a small one for fast inserts, and a big (main) index. The top index is small enough to fit in memory,
so inserts into it are very fast.
Periodically we merge the data from the top index into the base index and truncate the top index. To avoid blocking inserts into the table
while we are merging the indexes, we can add one more top index, which will be used during the merge.

So the final architecture of Lsm3 is the following:
two top indexes, used in a cyclic way, and one main index. When the active top index reaches some threshold size,
we initiate a merge with the main index, performed by a bgworker, and switch inserts to the other top index.
Since merging is done in the background, it doesn't affect insert speed.
Unfortunately the Postgres index AM has no bulk insert operation, so we have to perform normal inserts.
But the inserted data is already sorted by key, which should improve access locality and partly solve the random read problem for the base index.
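
To make this concrete, here is a rough C sketch of the insert path and the background merge, under stated assumptions: Lsm3State, IndexEntry, btree_index_insert, top_index_scan_next, truncate_index and wake_merge_bgworker are all hypothetical names, not the actual lsm3 sources:

#include "postgres.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

typedef struct IndexEntry           /* hypothetical: one decoded index tuple */
{
    Datum           values[INDEX_MAX_KEYS];
    bool            isnull[INDEX_MAX_KEYS];
    ItemPointerData tid;
} IndexEntry;

typedef struct Lsm3State
{
    Relation    top[2];             /* two small top B-trees */
    Relation    base;               /* large main B-tree */
    int         active;             /* top index currently receiving inserts */
    bool        merge_in_progress;
    uint64      top_entries;        /* entries in the active top index */
    uint64      threshold;          /* top index size that triggers a merge */
} Lsm3State;

/* Hypothetical helpers standing in for the real nbtree calls: */
extern void btree_index_insert(Relation idx, Datum *values,
                               bool *isnull, ItemPointer tid);
extern bool top_index_scan_next(Relation idx, IndexEntry *entry);
extern void truncate_index(Relation idx);
extern void wake_merge_bgworker(Lsm3State *st);

static void
lsm3_insert(Lsm3State *st, Datum *values, bool *isnull, ItemPointer tid)
{
    /* Fast path: insert only into the small, memory-resident top index. */
    btree_index_insert(st->top[st->active], values, isnull, tid);

    if (++st->top_entries >= st->threshold && !st->merge_in_progress)
    {
        /* Freeze the full top index, redirect inserts to the other one,
         * and let the bgworker merge the frozen one into the base. */
        st->merge_in_progress = true;
        st->active ^= 1;
        st->top_entries = 0;
        wake_merge_bgworker(st);
    }
}

/* Executed by the bgworker. */
static void
lsm3_merge(Lsm3State *st)
{
    int         frozen = 1 - st->active;
    IndexEntry  entry;

    /*
     * Scan the frozen top index in key order and insert every entry into
     * the base index with ordinary inserts (the index AM has no bulk
     * insert); key-ordered inserts at least give good access locality.
     */
    while (top_index_scan_next(st->top[frozen], &entry))
        btree_index_insert(st->base, entry.values, entry.isnull, &entry.tid);

    truncate_index(st->top[frozen]);
    st->merge_in_progress = false;
}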

Certainly, to perform a search in Lsm3 we have to do lookups in all three indexes and merge the search results
(in the case of a unique index, we can avoid the extra lookups if the searched value is found in a top index).
During a merge of the top and base indexes, the same TID can be found in both of them.
But such duplicates are easily eliminated while merging the search results.
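
For example, the merge-and-deduplicate step for an equality lookup could look like this sketch (lsm3_collect_tids and the fixed-size output array are illustrative assumptions, not the actual lsm3 scan code; index_getnext_tid and ItemPointerCompare are stock Postgres calls):

#include "postgres.h"
#include "access/genam.h"
#include "storage/itemptr.h"

static int
tid_cmp(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/*
 * Collect matching TIDs from the already-positioned scans of the two top
 * indexes and the base index, then sort and deduplicate: during a merge
 * the same entry may temporarily exist in both a top and the base index.
 */
static int
lsm3_collect_tids(IndexScanDesc scans[3], ItemPointerData *out, int max)
{
    int         n = 0;
    int         m = 0;

    for (int i = 0; i < 3; i++)
    {
        ItemPointer tid;

        while (n < max &&
               (tid = index_getnext_tid(scans[i], ForwardScanDirection)) != NULL)
            out[n++] = *tid;
    }

    qsort(out, n, sizeof(ItemPointerData), tid_cmp);

    /* Collapse adjacent equal TIDs. */
    for (int i = 0; i < n; i++)
    {
        if (m == 0 || ItemPointerCompare(&out[i], &out[m - 1]) != 0)
            out[m++] = out[i];
    }
    return m;                       /* number of distinct TIDs */
}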

Since we are using standard nbtree indexes, there is no need to worry about WAL logging:
we need neither inefficient "generic WAL records" nor patching the core by adding our own WAL record types.

The implementation of the Lsm3 index AM as a standard Postgres extension is available here:
https://github.com/postgrespro/lsm3

I performed the same random-insert benchmark (described above) for Lsm3:

Index             Clients   TPS
Inclusive B-Tree  1         9387
Inclusive B-Tree  10        18761
RocksDB FDW       1         138350
RocksDB FDW       10        122369
RocksDB           1         166333
RocksDB           10        141482
Lsm3              1         151699
Lsm3              10        65997

The size of the nbtree is about 29GB; single-client performance is even higher than that of the RocksDB FDW, but the parallel results are significantly worse.
So Lsm3 can provide a significant performance improvement for large indexes that do not fit in main memory.
And the larger the ratio between index size and RAM size, the larger the benefit in insertion speed.
Lsm3 is just a standard Postgres extension, fully integrated into the Postgres infrastructure (MVCC, WAL, backups, ...).
So I hope it can be useful when standard indexes become a bottleneck.


I will be glad to receive any feedback, and maybe some change requests or proposals.

Best regards,
Konstantin
