Thread: LSM tree for Postgres

LSM tree for Postgres

From
Konstantin Knizhnik
Date:
Hi hackers,

I want to share the results of my recent research on implementing an LSM index in Postgres.
Most modern databases (RocksDB, MongoDB, Tarantool, ...) are using LSM trees instead of the classical B-Tree.

On the one hand, the RAM capacity of modern servers allows keeping the whole database in memory.
This leads to the anti-caching approach proposed by Michael Stonebraker:
https://15721.courses.cs.cmu.edu/spring2016/papers/hstore-anticaching.pdf

On the other hand, maintaining indexes is one of the most important factors limiting database performance.
Postgres can insert records into a table without indexes at almost sequential disk write speed (hundreds of megabytes per second).
But if the table contains indexes, the inserted keys have random values and the indexes don't fit in memory, then we observe a dramatic degradation of performance. Average HDD access time is about 10 msec, which corresponds to 100 random reads per second. If the table has several indexes
and is large enough not to fit in memory, then insert speed can be as low as tens of TPS. Certainly an SSD can provide much better random access time,
but random reads are still slow.

The LSM approach tries to address this problem.
First of all I started my experiments with RocksDB (perhaps the most popular LSM-based key-value storage, used in many databases).
There was an FDW for RocksDB from the VidarDB project: https://github.com/vidardb/pgrocks
As RocksDB is a multithreaded embedded database engine and Postgres is based on a multiprocess architecture,
they used an interesting "server inside server" approach: there is a bgworker process which works with RocksDB, and
backends send requests to it through a shared memory queue.

I have significantly rewritten their FDW implementation: the original RocksDB server implementation was single-threaded.
I have made it multithreaded, making it possible to run multiple RocksDB requests in parallel.
My implementation can be found here:
https://github.com/postgrespro/lsm

Some results of benchmarking.
The benchmark is just insertion of 250 million records with random keys into an inclusive index containing one bigint key and 8 bigint fields.
Size of the index is about 20GB and the target system has 16GB of RAM:


Index               Clients   TPS
Inclusive B-Tree    1         9387
Inclusive B-Tree    10        18761
RocksDB FDW         1         138350
RocksDB FDW         10        122369
RocksDB             1         166333
RocksDB             10        141482
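
For reference, the setup for this benchmark looks roughly like this (the table and index definitions match the schema I use for the measurements later in this thread; the pgbench script is only an approximation of the actual load generator):

-- schema for the "Inclusive B-Tree" case
create table t2(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint,
                v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index on t2(k) include (v1,v2,v3,v4,v5,v6,v7,v8);

-- insert.sql for pgbench: one row with a random bigint key per transaction
\set k random(1, 1000000000)
insert into t2 values (:k, 0, 0, 0, 0, 0, 0, 0, 0);

-- run with 1 and 10 clients, e.g.:
--   pgbench -n -c 1 -f insert.sql postgres
--   pgbench -n -c 10 -j 10 -f insert.sql postgres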


As you can see, there is about a 10 times difference.
Maybe somebody will find this idea of using an IOT (index organized table) based on RocksDB in Postgres useful.
But this approach violates all ACID properties of Postgres:
there is no atomicity and consistency (in principle RocksDB supports 2PC, but it is not used here),
isolation corresponds to something like "read uncommitted",
and concerning durability - it is all up to RocksDB, and I have serious doubts that it will survive a failure, especially with sync write mode disabled.
So I considered this project mostly as a prototype for estimating the efficiency of the LSM approach.

Then I thought about implementing the ideas of LSM using standard Postgres nbtree.

We need two indexes: one small for fast inserts and another - big (main) index. This top index is small enough to fit in memory,
so inserts in this index are very fast.
Periodically we will merge data from the top index into the base index and truncate the top index. To prevent blocking of inserts into the table
while we are merging indexes, we can add one more index, which will be used during the merge.

So the final architecture of Lsm3 is the following:
two top indexes used in a cyclic way and one main index. When the top index reaches some threshold value
we initiate a merge with the main index, done by a bgworker, and switch to the other top index.
As the merging of indexes is done in the background, it doesn't affect insert speed.
Unfortunately the Postgres Index AM has no bulk insert operation, so we have to perform normal inserts.
But the inserted data is already sorted by key, which should improve access locality and partly solve the random reads problem for the base index.

Certainly, to perform a search in Lsm3 we have to make lookups in all three indexes and merge the search results
(in the case of a unique index we can avoid extra searches if the searched value is found in the top index).
It can happen that during a merge of the top and base indexes the same TID can be found in both of them.
But such duplicates can be easily eliminated during the merge of search results.

As we are using standard nbtree indexes, there is no need to worry about logging information in WAL.
There is no need to use inefficient "generic WAL records" or to patch the kernel by adding our own WAL records.

Implementation of the Lsm3 Index AM as a standard Postgres extension is available here:
https://github.com/postgrespro/lsm3
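
A minimal usage sketch (the index definition below mirrors the benchmark schema used later in this thread; the CREATE EXTENSION step is the usual extension setup):

create extension lsm3;
create table t(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint,
               v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index lsm_index on t using lsm3(k) include (v1,v2,v3,v4,v5,v6,v7,v8);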

I have performed the same benchmark with random inserts (described above) for Lsm3:

Index               Clients   TPS
Inclusive B-Tree    1         9387
Inclusive B-Tree    10        18761
RocksDB FDW         1         138350
RocksDB FDW         10        122369
RocksDB             1         166333
RocksDB             10        141482
Lsm3                1         151699
Lsm3                10        65997

Size of the nbtree is about 29GB; single client performance is even higher than that of the RocksDB FDW, but parallel results are significantly worse.
So Lsm3 can provide a significant improvement of performance for large indexes not fitting in main memory.
And the larger the ratio between index size and RAM size is, the larger benefit in insertion speed you get.
Lsm3 is just a standard Postgres extension, fully integrated in the Postgres infrastructure (MVCC, WAL, backups, ...).
So I hope it can be useful when standard indexes become a bottleneck.


I will be glad to receive any feedback, maybe some change requests or proposals.

Best regards,
Konstantin

Re: LSM tree for Postgres

From
Alexander Korotkov
Date:
Hi!

On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I want to share the results of my recent research on implementing an LSM index in Postgres.
> Most modern databases (RocksDB, MongoDB, Tarantool, ...) are using LSM trees instead of the classical B-Tree.

I wouldn't say it that way.  I would say they are providing LSM in
addition to the B-tree. For instance WiredTiger (which is the main
engine for MongoDB) provides both B-tree and LSM.  As far as I know, Tarantool
provides at least an in-memory B-tree.  If RocksDB is used as an
engine for MySQL, then it's also an addition to the B-tree, which is
provided by InnoDB.  Also, the implementations of B-trees in the mentioned
DBMSes are very different.  I would say none of them is purely
classical.

> LSM approach tries to address this problem.

LSM has great use-cases for sure.

> I have significantly rewritten their FDW implementation: the original RocksDB server implementation was single-threaded.
> I have made it multithreaded, making it possible to run multiple RocksDB requests in parallel.
> My implementation can be found there:
> https://github.com/postgrespro/lsm

Great, thank you for your work.

> Some results of benchmarking.
> Benchmark is just insertion of 250 million records with random keys in an inclusive index containing one bigint key and 8 bigint fields.
> Size of the index is about 20GB and the target system has 16GB of RAM:

What storage do you use?

> As you can see there is about a 10 times difference.
> Maybe somebody will find this idea of using an IOT (index organized table) based on RocksDB in Postgres useful.
> But this approach violates all ACID properties of Postgres:
> there is no atomicity and consistency (in principle RocksDB supports 2PC, but it is not used here),
> isolation corresponds to something like "read uncommitted",
> and concerning durability - it is all up to RocksDB and I have serious doubts that it will survive a failure, especially with sync write mode disabled.
> So I considered this project mostly as a prototype for estimating the efficiency of the LSM approach.

Yes, integration of WAL and snapshots between Postgres and RocksDB is
problematic.  I also doubt that RocksDB can use the full power of
the Postgres extensible type system.

> Then I thought about implementing the ideas of LSM using standard Postgres nbtree.
>
> We need two indexes: one small for fast inserts and another - big (main) index. This top index is small enough to fit in memory,
> so inserts in this index are very fast.
> Periodically we will merge data from the top index into the base index and truncate the top index. To prevent blocking of inserts into the table
> while we are merging indexes, we can add one more index, which will be used during the merge.
>
> So the final architecture of Lsm3 is the following:
> two top indexes used in a cyclic way and one main index. When the top index reaches some threshold value
> we initiate a merge with the main index, done by a bgworker, and switch to the other top index.
> As the merging of indexes is done in the background, it doesn't affect insert speed.
> Unfortunately the Postgres Index AM has no bulk insert operation, so we have to perform normal inserts.
> But the inserted data is already sorted by key, which should improve access locality and partly solve the random reads problem for the base index.
>
> Certainly to perform search in Lsm3 we have to make lookups in all three indexes and merge search results.
> (in case of unique index we can avoid extra searches if searched value is found in top index).
> It can happen that during merge of top and base indexes the same TID can be found in both of them.
> But such duplicates can be easily eliminated during merge of search results.

You use a fixed number of trees.  Is this a limitation of prototype or
intention of design?  I guess if the base index is orders of magnitude
bigger than RAM, this scheme can degrade greatly.

> As far as we are using standard nbtree indexes there is no need to worry about logging information in WAL.
> There is no need to use inefficient "generic WAL records" or patch kernel by adding own WAL records.

As I understand it, the merge operations are logged in the same way as ordinary
inserts.  This seems acceptable, but a single insert operation would
eventually cause several times more WAL than it does in a B-tree (especially
if we'll have an implementation with a flexible number of trees).  In
principle that could be avoided if recovery could replay the logic of tree
merges on its side.  But this approach can hardly fit Postgres in many
ways.

> I have performed the same benchmark with random inserts (described above) for Lsm3:
>
> Index Clients TPS
> Inclusive B-Tree 1 9387
> Inclusive B-Tree 10 18761
> RocksDB FDW 1 138350
> RocksDB FDW 10 122369
> RocksDB 1 166333
> RocksDB 10 141482
> Lsm3 1 151699
> Lsm3 10 65997
>
> Size of the nbtree is about 29GB; single client performance is even higher than that of the RocksDB FDW, but parallel results are significantly worse.

Did you investigate the source of degradation?  Such degradation
doesn't yet look inevitable for me.  Probably, things could be
improved.

> I will be glad to receive any feedback, may be some change requests or proposals.

As I understand, you benchmarked just inserts.  But what about vacuum?  I
think the way Postgres vacuum works for now isn't optimal for LSM.
Postgres vacuum requires a full scan of the index, because it provides a
bitmap of TIDs to be deleted without information about index keys.  For
LSM, it would be better if vacuum would push delete requests to the
top level of the LSM (as specially marked tuples or something).  Thanks to
that, index deletes could be as efficient as inserts.  This is
especially important for an LSM with many levels and/or aggressive
vacuum.

------
Regards,
Alexander Korotkov



Re: LSM tree for Postgres

From
Tomas Vondra
Date:
On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>Hi hackers,
>
>I want to share results of my last research of implementing LSM index 
>in Postgres.
>Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using 
>LSM tree instead of classical B-Tree.
>

I was under the impression that LSM is more an alternative primary
storage, not for indexes. Or am I wrong / confused?

>From one side, capacity of RAM at modern servers allows to keep the 
>whole database in memory.
>It leads to the anti-caching approach proposed by Michael Stonebraker
>https://15721.courses.cs.cmu.edu/spring2016/papers/hstore-anticaching.pdf
>
>From the other side: maintaining indexes is one of the most 
>important factors limiting database performance.
>Postgres is able to insert records in a table without indexes almost 
>at sequential disk write speed (hundreds of megabytes per second).
>But if the table contains indexes, inserted keys have random values and 
>indexes don't fit in memory, then we observe dramatic degradation of 
>performance. Average HDD access time is about 10msec, which 
>corresponds to 100 random reads per second. If table has several 
>indexes
>and is large enough not to fit in memory, then insert speed can be as 
>low as tens TPS. Certainly SSD can provide much better random access 
>time,
>but still random reads are slow.
>

True. Indexes (the way we do them) almost inevitably cause random I/O.

>LSM approach tries to address this problem.
>First of all I started my experiments with RocksDB (may be most 
>popular LSM-based key-value storage, used in many databases).
>There was FDW for RocksDB from VidarDB project: 
>https://github.com/vidardb/pgrocks
>As far as RocksDB is a multithreaded embedded database engine and 
>Postgres is based on a multiprocess architecture,
>they used an interesting approach "server inside server": there is a 
>bgworker process which works with RocksDB and
>backends send requests to it through a shared memory queue.
>
>I have significantly rewritten their FDW implementation: the original 
>RocksDB server implementation was single threaded.
>I have made it multithreaded making it possible to run multiple RocksDB 
>requests in parallel.
>My implementation can be found there:
>https://github.com/postgrespro/lsm
>
>Some results of benchmarking.
>Benchmark is just insertion of 250 million records with random keys 
>in an inclusive index containing one bigint key and 8 bigint fields.
>Size of the index is about 20GB and the target system has 16GB of RAM:
>
>
>Index     Clients     TPS
>Inclusive B-Tree     1     9387
>Inclusive B-Tree     10     18761
>RocksDB FDW     1     138350
>RocksDB FDW     10     122369
>RocksDB     1     166333
>RocksDB     10     141482
>

Interesting, although those are just writes, right? Do you have any
numbers for read? Also, what are the numbers when you start with "larger
than RAM" data (i.e. ignoring the initial period when the index fits
into memory)?

>
>As you can see there is about 10 times difference.
>May be somebody will find useful this idea of using IOT (index 
>organized table) based on RocksDB  in Postgres.
>But this approach violates all ACID properties of Postgres:
>there is no atomicity and consistency (in principle RocksDB supports 
>2PC, but it is not used here),
>isolation corresponds to something like "read uncommitted",
>and  concerning durability - it is all up to RocksDB and I have 
>serious doubts that it will survive failure especially with sync write 
>mode disabled.
>So I considered this project mostly as prototype for estimating 
>efficiency of LSM approach.
>

Yeah, I think in general an FDW to a database with a different consistency
model is not going to get us very far ... Good for PoC experiments, but
not really designed for stuff like this. Also, in my experience FDW has
significant per-row overhead.

>Then I think about implementing ideas of LSM using standard Postgres nbtree.
>
>We need two indexes: one small for fast inserts and another - big 
>(main) index. This top index is small enough to fit in memory
>so inserts in this index are very fast.
>Periodically we will merge data from top index to base index and 
>truncate the top index. To prevent blocking of inserts in the table
>while we are merging indexes we can add one more index, which will 
>be used during merge.
>
>So final architecture of Lsm3 is the following:
>two top indexes used in cyclic way and one main index. When top index 
>reaches some threshold value
>we initiate merge with main index, done by bgworker and switch to 
>another top index.
>As far as merging indexes is done in background, it doesn't  affect 
>insert speed.
>Unfortunately Postgres Index AM has not bulk insert operation, so we 
>have to perform normal inserts.
>But inserted data is already sorted by key which should improve access 
>locality and partly solve random reads problem for base index.
>
>Certainly to perform search in Lsm3 we have to make lookups in all 
>three indexes and merge search results.
>(in case of unique index we can avoid extra searches if searched value 
>is found in top index).
>It can happen that during merge of top and base indexes the same TID 
>can be found in both of them.
>But such duplicates can be easily eliminated during merge of search results.
>
>As far as we are using standard nbtree indexes there is no need to 
>worry about logging information in WAL.
>There is no need to use inefficient "generic WAL records" or patch 
>kernel by adding own WAL records.
>

Makes sense, I guess. I always imagined we could do something like this
by adding "buffers" into the btree directly, and instead of pushing them
all the way down to the leaf pages we'd only insert them into the first
buffer (and full buffers would get "propagated" in the background).

How many such "buffers" we'd need / in which places in the btree is an
open question - I suppose we could have a buffer for every internal page.
Typical indexes have ~1% in internal pages, so even if each 8kB internal
page has an associated 8kB buffer page, it's not going to increase the
size significantly. Of course, it's going to make lookups more expensive
because you have to search all the buffers on the way to the leaf page
(I wonder if we could improve this by keeping a tiny bloom filter for
those buffers, representing data in the subtree).

Not sure if this would be simpler/cheaper than maintaining multiple
separate indexes, which is what you propose.

BTW how would your approach work with unique indexes, speculative
inserts etc.?

>Implementation of the Lsm3 Index AM as a standard Postgres extension is 
>available here:
>https://github.com/postgrespro/lsm3
>
>I have performed the same benchmark with random inserts (described 
>above) for Lsm3:
>
>Index     Clients     TPS
>Inclusive B-Tree     1     9387
>Inclusive B-Tree     10     18761
>RocksDB FDW     1     138350
>RocksDB FDW     10     122369
>RocksDB     1     166333
>RocksDB     10     141482
>Lsm3     1     151699
>Lsm3     10     65997
>
>
>Size of the nbtree is about 29GB, single client performance is even higher 
>than that of the RocksDB FDW, but parallel results are significantly worse.
>So Lsm3 can provide a significant improvement of performance for large 
>indexes not fitting in main memory.
>And the larger the ratio between index size and RAM size is, the larger 
>benefit in insertion speed you get.
>Lsm3 is just a standard Postgres extension, fully integrated in the Postgres 
>infrastructure (MVCC, WAL, backups,...).
>So I hope it can be useful when standard indexes become a bottleneck.
>

Isn't it a bit suspicious that with more clients the throughput actually
drops significantly? Is this merely due to PoC stage, or is there some
inherent concurrency bottleneck?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: LSM tree for Postgres

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
> >two top indexes used in cyclic way and one main index. When top index
> >reaches some threshold value
> >we initiate merge with main index, done by bgworker and switch to another
> >top index.
> >As far as merging indexes is done in background, it doesn't  affect insert
> >speed.
> >Unfortunately Postgres Index AM has not bulk insert operation, so we have
> >to perform normal inserts.
> >But inserted data is already sorted by key which should improve access
> >locality and partly solve random reads problem for base index.
> >
> >Certainly to perform search in Lsm3 we have to make lookups in all three
> >indexes and merge search results.
> >(in case of unique index we can avoid extra searches if searched value is
> >found in top index).
> >It can happen that during merge of top and base indexes the same TID can
> >be found in both of them.
> >But such duplicates can be easily eliminated during merge of search results.
> >
> >As far as we are using standard nbtree indexes there is no need to worry
> >about logging information in WAL.
> >There is no need to use inefficient "generic WAL records" or patch kernel
> >by adding own WAL records.
>
> Makes sense, I guess. I always imagined we could do something like this
> by adding "buffers" into the btree directly, and instead of pushing them
> all the way down to the leaf pages we'd only insert them into the first
> buffer (and full buffers would get "propagated" in the background).

I get that it's not quite the same, but this all is reminding me of the
GIN pending list and making me wonder if there's some way to generalize
that (or come up with something new that would work for GIN too).

Independently, while considering this, I don't think the issues around
how to deal with unique btrees properly have really been considered - you
certainly can't stop your search on the first tuple you find even if the
index is unique, since the "unique" btree could certainly have multiple
entries for a given key and you might need to find a different one.

Thanks,

Stephen


Re: LSM tree for Postgres

From
Alexander Korotkov
Date:
On Tue, Aug 4, 2020 at 6:11 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
> >Hi hackers,
> >
> >I want to share results of my last research of implementing LSM index
> >in Postgres.
> >Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using
> >LSM tree instead of classical B-Tree.
> >
>
> I was under the impression that LSM is more an alternative primary
> storage, not for indexes. Or am I wrong / confused?

As I understand, there are different use-cases.  We can use LSM for
index, and this is good already.  Such indexes would be faster for
insertions and probably even vacuum if we redesign it (see my previous
message), but slower for search.  But for updates/deletes you still
have to do random access to the heap.  And you also need to find a
heap record to update/delete, probably using the LSM index (and it's
slower for search than B-tree).

LSM as a primary storage can do more advanced tricks.  For instance,
some updates/inserts_on_conflict could be also just pushed to the top
level of LSM without fetching the affected record before.

So, from my point of view, LSM as an index AM is far from a full-power LSM
for PostgreSQL, but it's still useful.  Large insert-only tables can
benefit from LSM.  Large tables with many indexes could also benefit,
because non-HOT updates will become cheaper.

------
Regards,
Alexander Korotkov



Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 04.08.2020 18:04, Alexander Korotkov wrote:
> Hi!
>
> On Tue, Aug 4, 2020 at 11:22 AM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> I want to share results of my last research of implementing LSM index in Postgres.
>> Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using LSM tree instead of classical B-Tree.
> I wouldn't say it that way.  I would say they are providing LSM in
> addition to the B-tree. For instance WiredTiger (which is the main
> engine for MongoDB) provides both B-tree and LSM.  As I know Tarantool
> provides at least an in-memory B-tree.  If RocksDB is used as an
> engine for MySQL, then it's also an addition to B-tree, which is
> provided by InnoDB.  Also, implementation of B-tree's in mentioned
> DBMSes are very different.  I would say none of them is purely
> classical.
I am not suggesting to completely replace B-Tree with LSM.
My experiments show that Postgres nbtree is faster than RocksDB when the 
index is small and fits in memory.
I suggest using LSM only for huge tables whose indexes 
are much larger than the size of available memory.


>> Some results of benchmarking.
>> Benchmark is just insertion of 250 million records with random keys in an inclusive index containing one bigint key and 8 bigint fields.
>> Size of the index is about 20GB and the target system has 16GB of RAM:
> What storage do you use?

It is my notebook with 16GB of RAM and an SSD.
Certainly it should be tested on more serious hardware.
But there is a "problem": powerful servers now have hundreds of 
gigabytes of memory.
To let the LSM index demonstrate its advantages we need to create an index 
not fitting in memory.
And the CPU speed of very expensive servers is not significantly higher than 
that of my notebook.
Performing many inserts in parallel will also not significantly speed up 
populating the table with data: multiple bottlenecks in Postgres
do not allow reaching linear scalability even though we have hundreds of 
CPU cores.

> Yes, integration of WAL and snapshots between Postgres and RocksDB is
> problematic.  I also doubt that RocksDB can use the full power of
> Postgres extendable type system.
This implementation supports only basic scalar Postgres types.
Actually RocksDB is dealing only with string key-value pairs.
So we need to serialize Postgres types into byte arrays (and preserve 
the correct ordering)!
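
To illustrate why ordering is the tricky part (just an illustration using the built-in int8send(), not the actual encoding used by the FDW): naive big-endian serialization puts negative bigints after positive ones, because bytea comparison is unsigned and byte-wise, so the encoder has to flip the sign bit (or apply an equivalent transformation):

-- naive encoding breaks ordering for negative keys:
select int8send(-1) > int8send(1);   -- true, although -1 < 1
-- flipping the sign bit (XOR with the minimal bigint) restores it:
select int8send(-1 # (-9223372036854775807 - 1))
     < int8send( 1 # (-9223372036854775807 - 1));  -- true, as desired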


>
> You use a fixed number of trees.  Is this a limitation of prototype or
> intention of design?  I guess if the base index is orders of magnitude
> bigger than RAM, this scheme can degrade greatly.
I do not understand why we need multiple indexes.
We need one "hot" index which fits in memory to perform fast inserts.
But it should not be too small, so that it can accumulate a substantial 
amount of records to provide efficient bulk inserts.
I expect that the top index can be efficiently merged with the base index 
because of better access locality.
I.e. we will insert multiple entries into one B-Tree page and so 
minimize the slowdown of random reads.

The third index is needed to perform the merge in parallel (while a merge is in 
progress the top index is blocked and we cannot perform inserts into it).
I do not understand the benefits of performing more than one merge in 
parallel: it will only increase the fraction of random reads.

Degradation certainly takes place. But it is not as critical as in the case 
of a standard nbtree.
It is possible to tune the threshold for the top index size to make the merge most 
efficient.
But we can not truncate and swap the index before we complete the merge.
So if the merge operation takes a long time, then it will cause bloat of the 
top index and it will not fit in memory any more.
It will lead to further slowdown (so we have negative feedback here).

Certainly it is possible to create more top indexes, keeping their size 
small enough to fit in memory.
But in this case search will require a merge not of 3, but of N indexes.
I think that it may cause an unacceptable slowdown of search operations.
And it is highly undesirable, taking into account that most applications 
send more select-only queries than updates.


>
>> As far as we are using standard nbtree indexes there is no need to worry about logging information in WAL.
>> There is no need to use inefficient "generic WAL records" or patch kernel by adding own WAL records.
> As I understand it, the merge operations are logged in the same way as ordinary
> inserts.  This seems acceptable, but a single insert operation would
> eventually cause several times more WAL than it does in a B-tree (especially
> if we'll have an implementation with a flexible number of trees).  In
> principle that could be avoided if recovery could replay the logic of tree
> merges on its side.  But this approach can hardly fit Postgres in many
> ways.

Yes, this approach doubles the amount of logged data:
we need to write WAL for inserts into both the top and the base indexes.
And it will certainly have a negative influence on performance.
Unfortunately I have no idea how to avoid it without patching the Postgres core.
>
>> I have performed the same benchmark with random inserts (described above) for Lsm3:
>>
>> Index Clients TPS
>> Inclusive B-Tree 1 9387
>> Inclusive B-Tree 10 18761
>> RocksDB FDW 1 138350
>> RocksDB FDW 10 122369
>> RocksDB 1 166333
>> RocksDB 10 141482
>> Lsm3 1 151699
>> Lsm3 10 65997
>>
>> Size of the nbtree is about 29GB, single client performance is even higher than that of the RocksDB FDW, but parallel results are significantly worse.
> Did you investigate the source of degradation?  Such degradation
> doesn't yet look inevitable for me.  Probably, things could be
> improved.

As I explained above, insertion in Lsm3 is now a process with negative 
feedback.
If the insertion rate is higher than the merge speed, then the top index bloats and 
doesn't fit in memory any more, which in turn causes
more slowdown of the merge and, as a result, further increase of the top index size.
Several parallel clients performing inserts can fill the top index faster 
than a single client does, and faster than it can be merged with the main index.

I have tried several approaches which try to slow down inserts to prevent 
undesired growth of the top index
(I have considered two criteria: size of the top index and time of the merge 
operation).
But none of my attempts was successful: they lead to even worse performance.

>
>> I will be glad to receive any feedback, may be some change requests or proposals.
> As I understand, you benchmarked just inserts.  But what about vacuum?  I
> think the way Postgres vacuum works for now isn't optimal for LSM.
> Postgres vacuum requires a full scan of the index, because it provides a
> bitmap of TIDs to be deleted without information about index keys.  For
> LSM, it would be better if vacuum would push delete requests to the
> top level of the LSM (as specially marked tuples or something).  Thanks to
> that, index deletes could be as efficient as inserts.  This is
> especially important for an LSM with many levels and/or aggressive
> vacuum.

Right now vacuum processes Lsm3 indexes in the usual way.
Removing records from the top indexes may not be needed at all (just because 
these indexes will be truncated in any case).
But as far as the size of the top index is expected to be small enough, 
vacuuming it should not take a long time,
so I didn't try to avoid it (although it should not be difficult - just 
disable ambulkdelete for the corresponding nbtree wrappers).
Concerning deletes from the main index - I do not understand how they can be 
optimized.




Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 04.08.2020 18:11, Tomas Vondra wrote:
> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>> Hi hackers,
>>
>> I want to share results of my last research of implementing LSM index 
>> in Postgres.
>> Most of modern databases (RocksDB, MongoDB, Tarantool,...) are using 
>> LSM tree instead of classical B-Tree.
>>
>
> I was under the impression that LSM is more an alternative primary
> storage, not for indexes. Or am I wrong / confused?

Yes, originally I considered LSM for IOT (index organized table).
And the RocksDB FDW is actually such an implementation of IOT.
But implementing IOT using existing nbtree indexes is a more challenging task:
I have thought about it but have not tried it yet.

> Interesting, although those are just writes, right? Do you have any
> numbers for read? Also, what are the numbers when you start with "larger
> than RAM" data (i.e. ignoring the initial period when the index fits
> into memory)?

When the benchmark starts, insertion speed is almost the same for Lsm3 and 
standard nbtree
(we have to insert each record twice, but the second insertion is done in background).
At the end of the benchmark - when we get close to 250 million records - Lsm3 
shows TPS about 20 times higher than nbtree.
Finally it gives about a 6 times difference in elapsed time.


> Yeah, I think in general FDW to a database with different consistency
> model is not going to get us very far ... Good for PoC experiments, but
> not really designed for stuff like this. Also, in my experience FDW has
> siginficant per-row overhead.

 From my experience the main drawback of FDW is the lack of support for 
parallel operations.
But it is important mostly for OLAP, not for OLTP.

> BTW how would your approach work with unique indexes, speculative
> inserts etc.?

Unique indexes are not supported now.
And I do not see an acceptable solution here.
If we have to check for the presence of a duplicate at insert time, 
then it will eliminate all advantages of the LSM approach.
And if we postpone the check to the moment of merge, then... I am afraid that it will 
be too late.

>
> Isn't it a bit suspicious that with more clients the throughput actually
> drops significantly? Is this merely due to PoC stage, or is there some
> inherent concurrency bottleneck?
>
My explanation is the following (I am not 100% sure that it is true): 
multiple clients insert records faster than the merge bgworker is able to 
merge them into the main index. It causes bloat of the top index and as a result it 
doesn't fit in memory any more.
So we lose the advantages of fast inserts. If we have N top indexes instead 
of just 2, we can keep the size of each top index small enough.
But in this case search operations will have to merge N indexes and so 
search is almost N times slower (the fact that each top index fits in memory
doesn't mean that all of them fit in memory at the same time, so we 
still have to read pages from disk during lookups in the top indexes).





Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 04.08.2020 18:18, Stephen Frost wrote:
>
> Independently while considering this, I don't think the issues around
> how to deal with unique btrees properly has really been considered- you
> certainly can't stop your search on the first tuple you find even if the
> index is unique, since the "unique" btree could certainly have multiple
> entries for a given key and you might need to find a different one.
But the search locates not just ANY record with the specified key in the top index, but the record
which satisfies the snapshot of the transaction. Why do we need more records 
if we know that
there are no duplicates?



Re: LSM tree for Postgres

From
Tomas Vondra
Date:
On Tue, Aug 04, 2020 at 08:18:01PM +0300, Konstantin Knizhnik wrote:
>
>
>On 04.08.2020 18:11, Tomas Vondra wrote:
>>On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>>>Hi hackers,
>>>
>>>I want to share results of my last research of implementing LSM 
>>>index in Postgres.
>>>Most of modern databases (RocksDB, MongoDB, Tarantool,...) are 
>>>using LSM tree instead of classical B-Tree.
>>>
>>
>>I was under the impression that LSM is more an alternative primary
>>storage, not for indexes. Or am I wrong / confused?
>
>Yes, originally I considered LSM for IOT (index organized table).
>And RocksDB FDW is actually such implementation of IOT.
>But implement IOT using existed nbtree indexes is more challenging task:
>I have thought about it but have not tried it yet.
>
>>Interesting, although those are just writes, right? Do you have any
>>numbers for read? Also, what are the numbers when you start with "larger
>>than RAM" data (i.e. ignoring the initial period when the index fits
>>into memory)?
>
>When the benchmark starts, insertion speed is almost the same for Lsm3 and 
>standard nbtree
>(we have to insert each record twice, but the second insertion is done in background).
>At the end of the benchmark - when we get close to 250 million records - Lsm3 
>shows TPS about 20 times higher than nbtree.
>Finally it gives about a 6 times difference in elapsed time.
>

IMO the 6x difference is rather misleading, as it very much depends on
the duration of the benchmark and how much data it ends up with. I think
it's better to test 'stable states', i.e. with a small data set that does
not exceed RAM during the whole test, and large ones that already start
larger than RAM. Not sure if it makes sense to make a difference between
cases that fit into shared buffers and those that exceed shared buffers
but still fit into RAM.

>
>>Yeah, I think in general FDW to a database with different consistency
>>model is not going to get us very far ... Good for PoC experiments, but
>>not really designed for stuff like this. Also, in my experience FDW has
>>siginficant per-row overhead.
>
>From my experience the main drawback of FDW is lack of support of 
>parallel operations.
>But it is important mostly for OLAP, not for OLTP.
>

True. There are other overheads, though - having to format/copy the
tuples is not particularly cheap.

>>BTW how would your approach work with unique indexes, speculative
>>inserts etc.?
>
>Unique indexes are not supported now.
>And I do not see an acceptable solution here.
>If we have to check for the presence of a duplicate at insert time, 
>then it will eliminate all advantages of the LSM approach.
>And if we postpone the check to the moment of merge, then... I am afraid that it 
>will be too late.
>

Ummm, but in your response to Stephen you said:

     But search locates not ANY record with specified key in top index
     but record which satisfies snapshot of the transaction. Why do we
     need more records if we know that there are no duplicates?

So how do you know there are no duplicates, if unique indexes are not
supported (and may not be for LSM)?

>>
>>Isn't it a bit suspicious that with more clients the throughput actually
>>drops significantly? Is this merely due to PoC stage, or is there some
>>inherent concurrency bottleneck?
>>
>My explanation is the following (I am not 100% sure that it is true): 
>multiple clients insert records faster than the merge bgworker is able to 
>merge them into the main index. It causes bloat of the top index and as a result 
>it doesn't fit in memory any more.
>So we lose the advantages of fast inserts. If we have N top indexes 
>instead of just 2, we can keep the size of each top index small enough.
>But in this case search operations will have to merge N indexes and so 
>search is almost N times slower (the fact that each top index fits in 
>memory
>doesn't mean that all of them fit in memory at the same time, so we 
>still have to read pages from disk during lookups in the top indexes).
>

Hmmm, maybe. Should be easy to verify by monitoring the size of the top
index, and limiting it to some reasonable value to keep good
performance. Something like gin_pending_list_size I guess.
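
Something roughly like this, perhaps (both the GUC name and the top-index relation name below are invented for the sake of the example, not actual lsm3 identifiers):

-- hypothetical knob, analogous to gin_pending_list_size:
set lsm3.top_index_size = '64MB';
-- and monitoring could be plain relation-size tracking of the active top index:
select pg_size_pretty(pg_relation_size('lsm_index_top1'));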


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 04.08.2020 20:44, Tomas Vondra wrote:
>> Unique indexes are not supported now.
>> And I do not see an acceptable solution here.
>> If we have to check for the presence of a duplicate at insert time, 
>> then it will eliminate all advantages of the LSM approach.
>> And if we postpone the check to the moment of merge, then... I am afraid that it 
>> will be too late.
>>
>
> Ummm, but in your response to Stephen you said:
>
>     But search locates not ANY record with specified key in top index
>     but record which satisfies snapshot of the transaction. Why do we
>     need more records if we know that there are no duplicates?
>
> So how do you know there are no duplicates, if unique indexes are not
> supported (and may not be for LSM)?
>

In the index AM I marked the Lsm3 index as not supporting the unique constraint.
So it can not be used to enforce a unique constraint.
But it is possible to specify "unique" in the index properties.
In this case it is the responsibility of the programmer to guarantee that there 
are no duplicates in the index.
This option allows using the search optimization mentioned above - locate the first record 
satisfying the snapshot and do not touch the other indexes.
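
For example (the reloption spelling here is illustrative - check the extension sources for the actual property name):

-- tell Lsm3 that keys are known to be unique, so a search can stop at the
-- first visible match found in the top index:
create index on t using lsm3(k) include (v1) with (unique=true);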

>>>
>>> Isn't it a bit suspicious that with more clients the throughput 
>>> actually
>>> drops significantly? Is this merely due to PoC stage, or is there some
>>> inherent concurrency bottleneck?
>>>
>> My explanation is the following (I am not 100% sure that it is 
>> true): multiple clients insert records faster than the merge bgworker is 
>> able to merge them into the main index. It causes bloat of the top index and as 
>> a result it doesn't fit in memory any more.
>> So we lose the advantages of fast inserts. If we have N top indexes 
>> instead of just 2, we can keep the size of each top index small enough.
>> But in this case search operations will have to merge N indexes and 
>> so search is almost N times slower (the fact that each top index fits 
>> in memory
>> doesn't mean that all of them fit in memory at the same time, so we 
>> still have to read pages from disk during lookups in the top indexes).
>>
>
> Hmmm, maybe. Should be easy to verify by monitoring the size of the top
> index, and limiting it to some reasonable value to keep good
> performance. Something like gin_pending_list_size I guess.
>

Lsm3 provides functions for getting the size of the active top index, explicitly 
forcing a merge of the top index, and
waiting for completion of a merge operation.
One of the use cases of Lsm3 may be delayed update of indexes.
For some applications insert speed is very critical: they can not lose 
data which is received at a high rate.
In this case, during working hours we insert data into the small index and at night 
initiate a merge of this index with the main index.
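
A nightly merge could then look roughly like this (the function names below are paraphrased - see the extension's README for the exact signatures):

-- check how much has accumulated in the active top index
select lsm3_top_index_size('lsm_index'::regclass);
-- at night: force a merge and wait for it to finish
select lsm3_start_merge('lsm_index'::regclass);
select lsm3_wait_merge_completion('lsm_index'::regclass);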




Re: LSM tree for Postgres

From
Alexander Korotkov
Date:
On Tue, Aug 4, 2020 at 7:56 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I do not understand why do we need multiple indexes.
> We need one "hot" index which fits in memory to perform fast inserts.
> But it should not be too small to be able to accumulate substantial
> amount of records to provide efficient bulk insert.
> I expect that the top index can be efficiently merged with the base index
> because of better access locality.
> I.e. we will insert multiple entries into one B-Tree page and so
> minimize slowdown of random reads.
>
> Third index is needed to perform parallel merge (while merge is in
> progress top index will be blocked and we can not perform inserts in it).
> I do not understand benefits of performing more than one merge in
> parallel: it will only increase fraction of random reads.
>
> Degradation certainly takes place. But it is not so critical as in case
> of standard nbtree.
> It is possible to tune threshold for top index size to make merge most
> efficient.
> But we can not truncate and swap index before we complete the merge.
> So if the merge operation takes a long time, then it will cause bloat of the
> top index and it will not fit in memory any more.
> It will lead to further slowdown (so we have negative feedback here).
>
> Certainly it is possible to create more top indexes, keeping their size
> small enough to fit in memory.
> But in this case search will require merge not of 3, but of N indexes.
> I think that it may cause unacceptable slowdown of search operations.
> And it is highly undesirable, taking into account that most applications
> send more select-only queries than updates.

The things you're writing make me uneasy.  I initially understood
lsm3 as a quick and dirty prototype, while you're probably keeping
some design in your mind (for instance, the original design of LSM).
However, your message makes me think you're trying to defend the
approach currently implemented in the lsm3 extension.  Therefore, I have to
criticise this approach.

1) The base index can degrade.  At first, since merge can cause page
splits.  Therefore logical ordering of pages will become less
correlated with their physical ordering with each merge.
2) If your workload will include updates and/or deletes, page
utilization may also degrade.
3) While base index degrades, merge performance also degrades.
Traverse of base index in logical order will require more and more
random reads (at some point almost every page read will be random).
While the base index becomes large and/or bloat, you push less top
index tuples to a single base index page (at some point you will push
one tuple per page).

The original LSM design implies strict guarantees over the average resources
spent per index operation.  Your design doesn't.  Moreover, I bet lsm3
will degrade significantly even on an insert-only workload.  It should
degrade to the performance level of B-tree once you insert enough
data.  Try something like number_of_merges =
number_of_tuples_per_index_page * 2 and you should see this
degradation.  Real LSM doesn't degrade that way.

> Yes, this approach doubles the amount of logged data:
> we need to write WAL for inserts into both the top and the base indexes.
> And it will certainly have a negative influence on performance.
> Unfortunately I have no idea how to avoid it without patching the Postgres core.

Huh, I didn't mean "without patching Postgres core" :)  I mean it's
difficult in principle, assuming PostgreSQL recovery is single-process
and doesn't have access to system catalog (because it might be
inconsistent).

> Right now vacuum processes Lsm3 indexes in the usual way.
> Removing records from the top indexes may not be needed at all (just because
> these indexes will be truncated in any case).
> But as far as the size of the top index is expected to be small enough,
> vacuuming it should not take a long time,
> so I didn't try to avoid it (although it should not be difficult - just
> disable ambulkdelete for the corresponding nbtree wrappers).

It doesn't seem important, but I don't get your point here.  Postgres
expects ambulkdelete to delete TIDs from the index.  If you don't delete
them from the top index, these TIDs will be merged into the base index.  And
that could lead to wrong query answers unless you eliminate those TIDs in
a different way (during the merge stage or something).

> Concerning deletes from the main index - I do not understand how they can be
> optimized.

This is a trick you can learn from almost every LSM implementation.
For instance, check docs for leveldb [1] about "delete marker".  For
sure, that requires some redesign of the vacuum and can't be done in
extension (at least in the reasonable way).  But, frankly speaking, I
think core modifications are inevitable to utilize the power of LSM in
PostgreSQL.

Links
1. https://github.com/google/leveldb/blob/master/doc/impl.md

------
Regards,
Alexander Korotkov



Re: LSM tree for Postgres

From
Peter Geoghegan
Date:
On Tue, Aug 4, 2020 at 8:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> So, in my point of view LSM as an index AM is far not a full power LSM
> for PostgreSQL, but it's still useful.  Large insert-only tables can
> benefit from LSM.  Large tables with many indexes could also benefit,
> because non-HOT updates will become cheaper.

Right -- this is why you usually have to choose one or the other. An
LSM design typically subsumes not just indexing and table storage, but
also checkpointing -- you cannot really compare an LSM to a B-Tree
because you really have to talk about other components to make a
sensible comparison (at which point you're actually comparing two
totally different *storage engines*). Roughly speaking, the compaction
process is the equivalent of checkpointing. So you either use (say)
InnoDB or RocksDB everywhere -- you usually can't have it both ways.
Well, maybe you can kind of get the benefits of both, but in practice
LSMs are usually highly optimized for the things that they're good at,
at the expense of other things. So in practice you kind of have to
make an up-front choice. An LSM is definitely not a natural fit for
the index access method interface in Postgres.

One thing that I don't think anyone else made reference to on the
thread (which is surprising) is that the performance of an LSM is
usually not measured using any of the conventional metrics that we
care about. For example, consider the Facebook MyRocks paper
"Optimizing Space Amplification in RocksDB" [1]. The reported RocksDB
throughput for an LSM-sympathetic workload is not really any faster
than InnoDB, and sometimes slower. That's not the point, though; the
main advantages of using an LSM are reductions in space amplification
and write amplification, particularly the latter. This isn't so much
about performance as it is about efficiency -- it enabled Facebook to
get a lot more out of the inexpensive flash storage that they use. It
lowered the total cost of ownership by *a lot*.

I personally think that we should care about efficiency in this sense
a lot more than we do now, but the fact remains that it hasn't really
been considered an independent problem that could be addressed by
accepting a tradeoff similar to the tradeoff LSMs make quite explicit
(apparently you can tune LSMs to get less write amplification but more
read amplification, or vice versa). In general, we almost always just
talk about throughput and latency without considering efficiency
specifically. I'm not suggesting that we need an LSM, but an
appreciation of LSMs could be helpful -- it could lead to better
designs elsewhere.

Mark Callaghan's blog is a pretty good resource for learning about
LSMs [2] (perhaps you've heard of him?). He wrote a bunch of stuff
about Postgres recently, which I enjoyed.

[1] http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf
[2] https://smalldatum.blogspot.com/
-- 
Peter Geoghegan



Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 05.08.2020 02:59, Alexander Korotkov wrote:
>
> The things you're writing makes me uneasy.  I initially understood
> lsm3 as a quick and dirty prototype, while you're probably keeping
> some design in your mind (for instance, original design of LSM).
> However, your message makes me think you're trying to defend the
> approach currently implemented in lsm3 extension.  Therefore, I've to
> criticise this approach.
>
> 1) The base index can degrade.  At first, since merge can cause page
> splits.  Therefore logical ordering of pages will become less
> correlated with their physical ordering with each merge.
> 2) If your workload will include updates and/or deletes, page
> utilization may also degrade.
> 3) While base index degrades, merge performance also degrades.
> Traverse of base index in logical order will require more and more
> random reads (at some point almost every page read will be random).
> While the base index becomes large and/or bloat, you push less top
> index tuples to a single base index page (at some point you will push
> one tuple per page).
>
> Original LSM design implies strict guarantees over average resources
> spent per index operation.  Your design doesn't.  Moreover, I bet lsm3
> will degrade significantly even on insert-only workload.  It should
> degrade to the performance level of B-tree once you insert enough
> data.  Try something like number_of_merges =
> numer_of_tuples_per_index_page * 2 and you should see this
> degradation.  Real LSM doesn't degrade that way.

I mostly agree with your criticism.
My Lsm3 is not a true LSM, but from my point of view it preserves the basic 
principles of LSM: a fast and small top index and bulk updates of the main index.
My experiments with RocksDB show that degradation also takes place in 
that case. More experiments are needed to compare the two approaches.

Concerning degradation of the base index - a B-Tree is itself a balanced tree. Yes, 
insertion of random keys can cause splits of B-Tree pages.
In the worst case half of a B-Tree page will be empty, so the B-Tree size will 
be two times larger than the ideal tree.
It may cause degradation of up to two times. But that is all. There should not 
be an infinite degradation of speed tending to zero.

>
>> Right now vacuum processes Lsm3 indexes in the usual way.
>> Removing records from the top indexes may not be needed at all (just because
>> these indexes will be truncated in any case).
>> But as far as the size of the top index is expected to be small enough,
>> vacuuming it should not take a long time,
>> so I didn't try to avoid it (although it should not be difficult - just
>> disable ambulkdelete for the corresponding nbtree wrappers).
> It doesn't seem important, but I don't get your point here.  Postgres
> expects ambulkdelete to delete TIDs from the index.  If you don't delete
> them from the top index, these TIDs will be merged into the base index.  And
> that could lead to wrong query answers unless you eliminate those TIDs in
> a different way (during the merge stage or something).

Yes, you are right. It is not possible to avoid deleting TIDs from the top 
indexes.
>> Concerning deletes from main index - I do not understand how it can be
>> optimized.
> This is a trick you can learn from almost every LSM implementation.
> For instance, check docs for leveldb [1] about "delete marker".  For
> sure, that requires some redesign of the vacuum and can't be done in
> extension (at least in the reasonable way).  But, frankly speaking, I
> think core modifications are inevitable to utilize the power of LSM in
> PostgreSQL.

The main idea of Lsm3 was to investigate whether it is possible to 
achieve the same result as with a "classical" LSM
using standard Postgres nbtree indexes. Right now it seems to me that the 
answer is positive, but I have not performed
exhaustive measurements. For example, I have not measured vacuum overhead 
(it was enabled, so vacuuming takes place
in my benchmark, but I have not tried to separate its overhead and 
influence on performance), index search speed, ...




> Links
> 1. https://github.com/google/leveldb/blob/master/doc/impl.md
>
> ------
> Regards,
> Alexander Korotkov




Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:


On 04.08.2020 20:44, Tomas Vondra wrote:

> IMO the 6x difference is rather misleading, as it very much depends on
> the duration of the benchmark and how much data it ends up with. I think
> it's better to test 'stable states', i.e. with a small data set that does
> not exceed RAM during the whole test, and large ones that already start
> larger than RAM. Not sure if it makes sense to make a difference between
> cases that fit into shared buffers and those that exceed shared buffers
> but still fit into RAM.

I have changed the benchmark scenario.
Now I insert 200 million records with sequential keys: it is fast enough and makes the index size about 19GB.
Then I perform 1 million random inserts.

-- init schema
create table t(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index lsm_index on t using lsm3(k) include (v1,v2,v3,v4,v5,v6,v7,v8);
create table t2(k bigint, v1 bigint, v2 bigint, v3 bigint, v4 bigint, v5 bigint, v6 bigint, v7 bigint, v8 bigint);
create index on t2(k) include (v1,v2,v3,v4,v5,v6,v7,v8);

-- fill with sequential data
insert into t values (generate_series(1,200000000),0,0,0,0,0,0,0,0);
Time: 520655,635 ms (08:40,656)

insert into t2 values (generate_series(1,200000000),0,0,0,0,0,0,0,0);
Time: 372245,093 ms (06:12,245)

-- random inserts
insert into t (v1,k,v2,v3,v4,v5,v6,v7,v8) values (generate_series(1,1000000),(random()*1000000000)::bigint,0,0,0,0,0,0,0);
Time: 3781,614 ms (00:03,782)

insert into t2 (v1,k,v2,v3,v4,v5,v6,v7,v8) values (generate_series(1,1000000),(random()*1000000000)::bigint,0,0,0,0,0,0,0);
Time: 39034,574 ms (00:39,035)

Then I perform random selects:

select.sql:
\set k random(1, 1000000000)
select * from t where k=:k;

select2.sql:
\set k random(1, 1000000000)
select * from t2 where k=:k;

pgbench -n -T 100 -P 10 -M prepared -f select.sql postgres
tps = 11372.821006 (including connections establishing)

pgbench -n -T 100 -P 10 -M prepared -f select2.sql postgres
tps = 10392.729026 (including connections establishing)


So as you can see, the insertion speed of Lsm3 is ten times higher and the select speed is the same as that of nbtree.

Re: LSM tree for Postgres

From
Dmitry Dolgov
Date:
> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>
> Then I think about implementing ideas of LSM using standard Postgres
> nbtree.
>
> We need two indexes: one small for fast inserts and another - big
> (main) index. This top index is small enough to fit in memory so
> inserts in this index are very fast.  Periodically we will merge data
> from top index to base index and truncate the top index. To prevent
> blocking of inserts in the table while we are merging indexes we can
> add one more index, which will be used during merge.
>
> So final architecture of Lsm3 is the following: two top indexes used
> in cyclic way and one main index. When top index reaches some
> threshold value we initiate merge with main index, done by bgworker
> and switch to another top index.  As far as merging indexes is done in
> background, it doesn't  affect insert speed.  Unfortunately Postgres
> Index AM has not bulk insert operation, so we have to perform normal
> inserts.  But inserted data is already sorted by key which should
> improve access locality and partly solve random reads problem for base
> index.
>
> Certainly to perform search in Lsm3 we have to make lookups in all
> three indexes and merge search results.

Thanks for sharing this! In fact this reminds me more of partitioned
b-trees [1] (and the older [2]) rather than LSM as such (although it
could be that the former was influenced by the latter). What could be
interesting is that quite often in these and many other whitepapers
(e.g. [3]), to address the lookup overhead, the design includes bloom
filters in one way or another to avoid searching irrelevant parts of an
index. Tomas mentioned them in this thread as well (in a different
context); probably the design suggested here could also benefit from them?

[1]: Riegger Christian, Vincon Tobias, Petrov Ilia. Write-optimized
indexing with partitioned b-trees. (2017). 296-300. 10.1145/3151759.3151814.
[2]: Graefe Goetz. Write-Optimized B-Trees. (2004). 672-683.
10.1016/B978-012088469-8/50060-7.
[3]: Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky,
Lin Ma, and Rui Shen. Reducing the Storage Overhead of Main-Memory OLTP
Databases with Hybrid Indexes. (2016). 1567–1581. 10.1145/2882903.2915222.
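
As a small illustration of the bloom filter idea: PostgreSQL already ships a
lossy bloom index AM in contrib/bloom. It currently provides operator classes
only for int4 and text, so it is not directly applicable to the bigint key
used in the benchmark above, but a self-contained sketch of the general
trade-off could look like this:

create extension if not exists bloom;
create table bloom_demo(id int4, payload text);
-- 80-bit signature per row, 4 signature bits taken from the indexed column
create index bloom_demo_idx on bloom_demo using bloom(id) with (length=80, col1=4);
-- with enough data the planner can answer equality lookups via a lossy
-- bitmap scan on the bloom index, rechecking the fetched heap rows
explain select * from bloom_demo where id = 42;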



Re: LSM tree for Postgres

From
Alexander Korotkov
Date:
Wed, Aug 5, 2020, 09:13 Konstantin Knizhnik <k.knizhnik@postgrespro.ru>:
> Concerning degrade of basic index - B-Tree itself is balanced tree. Yes,
> insertion of random keys can cause split of B-Tree page.
> In the worst case half of B-Tree page will be empty. So B-Tree size will
> be two times larger than ideal tree.
> It may cause degrade up to two times. But that is all. There should not
> be infinite degrade of speed tending to zero.

My concerns are not just about space utilization.  My main concern is
about the order of the pages.  After the first merge the base index
will be filled in key order.  So physical page ordering perfectly
matches their logical ordering.  After the second merge some pages of
the base index split, and new pages are added to the end of the index.
Splits also happen in key order.  So, now physical and logical
orderings match within the two extents corresponding to the first and second
merges, but not within the whole tree.  While there are only a few such
extents, disk page reads may in fact be mostly sequential, thanks to
OS cache and readahead.  But finally, after many merges, we can end up
with mostly random page reads.  For instance, leveldb doesn't have this
problem of ordering degradation, because it stores levels in sorted
files.

------
Regards,
Alexander Korotkov



Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 07.08.2020 15:31, Alexander Korotkov wrote:
> Wed, Aug 5, 2020, 09:13 Konstantin Knizhnik <k.knizhnik@postgrespro.ru>:
>> Concerning degrade of basic index - B-Tree itself is balanced tree. Yes,
>> insertion of random keys can cause split of B-Tree page.
>> In the worst case half of B-Tree page will be empty. So B-Tree size will
>> be two times larger than ideal tree.
>> It may cause degrade up to two times. But that is all. There should not
>> be infinite degrade of speed tending to zero.
> My concerns are not just about space utilization.  My main concern is
> about the order of the pages.  After the first merge the base index
> will be filled in key order.  So physical page ordering perfectly
> matches their logical ordering.  After the second merge some pages of
> base index splits, and new pages are added to the end of the index.
> Splits also happen in key order.  So, now physical and logical
> orderings match within two extents corresponding to first and second
> merges, but not within the whole tree.  While there are only few such
> extents, disk page reads may in fact be mostly sequential, thanks to
> OS cache and readahead.  But finally, after many merges, we can end up
> with mostly random page reads.  For instance, leveldb doesn't have a
> problem of ordering degradation, because it stores levels in sorted
> files.
>
I agree with you that losing the sequential order of B-Tree pages may have
a negative impact on performance.
But it is first of all critical for order-by and range queries, where we
have to traverse several subsequent leaf pages.
It is less critical for exact-search or delete/insert operations.
The efficiency of merge operations mostly depends on how many keys
are stored on the same B-Tree page, and that is first of all
determined by the size of the top index and the key distribution.
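
If somebody wants to measure this effect, the same pgbench approach can be
used with a range scan instead of a point lookup, e.g. (just a sketch,
scanning about 1000 adjacent keys per query; the file name is arbitrary):

range.sql:
\set k random(1, 1000000000)
select count(*) from t where k between :k and :k + 1000;

pgbench -n -T 100 -P 10 -M prepared -f range.sql postgres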





Re: LSM tree for Postgres

From
Alexander Korotkov
Date:
On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I agree with your that loosing sequential order of B-Tree pages may have
> negative impact on performance.
> But it first of all critical for order-by and range queries, when we
> should traverse several subsequent leave pages.
> It is less critical for exact-search or delete/insert operations.
> Efficiency of merge operations mostly depends on how much keys
> will be stored at the same B-Tree page.

What do you mean by "mostly"?  Given that PostgreSQL has quite small (8k)
pages, sequential reads are several times faster than random reads on SSDs
(and dozens of times faster on HDDs).  I don't think this is something to
neglect.

> And it is first of all
> determined by size of top index and key distribution.

How can you be sure that the top index can fit in memory?  On production
systems, typically there are multiple consumers of memory: other
tables, indexes, other LSMs.  This is one of the reasons why LSM
implementations have multiple levels: they don't know in advance which
levels fit in memory.  Another reason is dealing with very large
datasets.  And I believe there is a quite strong reason to keep the page
order sequential within a level.

I'm OK with your design for a third-party extension.  It's very cool
to have.  But I'm -1 for something like this to get into core
PostgreSQL, assuming it's feasible to push some effort and get a
state-of-the-art LSM there.

------
Regards,
Alexander Korotkov



Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 08.08.2020 21:18, Alexander Korotkov wrote:
> On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> I agree with your that loosing sequential order of B-Tree pages may have
>> negative impact on performance.
>> But it first of all critical for order-by and range queries, when we
>> should traverse several subsequent leave pages.
>> It is less critical for exact-search or delete/insert operations.
>> Efficiency of merge operations mostly depends on how much keys
>> will be stored at the same B-Tree page.
> What do you mean by "mostly"?  Given PostgreSQL has quite small (8k)
> pages, sequential read in times faster than random read on SSDs
> (dozens of times on HDDs).  I don't think this is something to
> neglect.

When you insert one record into a B-Tree, the order of pages doesn't matter
at all.
If you insert ten records into one leaf page, then the order is also not so
important.
If you insert 100 records, 50 going to one page and 50 to the next page,
then insertion may be faster if the second page follows the first one on disk.
But such an insertion may cause a page split and thus allocation of a new page,
so the sequential write order can still be violated.

>> And it is first of all
>> determined by size of top index and key distribution.
> How can you be sure that the top index can fit memory?  On production
> systems, typically there are multiple consumers of memory: other
> tables, indexes, other LSMs.  This is one of reasons why LSM
> implementations have multiple levels: they don't know in advance which
> levels fit memory.  Another reason is dealing with very large
> datasets.  And I believe there is a quite strong reason to keep page
> order sequential within level.

There is no guarantee that the top index is kept in memory.
But as far as the top index pages are frequently accessed, I hope that the
buffer manager's cache replacement algorithm does its best to keep them in
memory.
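
How much of a particular index actually stays in shared buffers can be
checked with the pg_buffercache extension, e.g. (just a diagnostic query; it
accounts only for PostgreSQL shared buffers, not the OS page cache):

create extension if not exists pg_buffercache;
select c.relname,
       count(*) as buffers,
       pg_size_pretty(count(*) * current_setting('block_size')::bigint) as cached
  from pg_buffercache b
  join pg_class c on b.relfilenode = pg_relation_filenode(c.oid)
 where b.reldatabase = (select oid from pg_database
                         where datname = current_database())
 group by c.relname
 order by buffers desc
 limit 10;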

> I'm OK with your design for a third-party extension.  It's very cool
> to have.  But I'm -1 for something like this to get into core
> PostgreSQL, assuming it's feasible to push some effort and get
> state-of-art LSM there.
I realize that it is not a true LSM.
But still I want to note that it is able to provide a ~10 times increase
in insert speed when the size of the index is comparable with the RAM size.
And the "true LSM" from RocksDB shows similar results. Maybe if the size of
the index is 100 times larger than the size of RAM, RocksDB will be
significantly faster than Lsm3. But modern servers have 0.5-1Tb of RAM.
I can't believe that there are databases with 100Tb indexes.




Re: LSM tree for Postgres

From
Alexander Korotkov
Date:
On Sat, Aug 8, 2020 at 11:49 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 08.08.2020 21:18, Alexander Korotkov wrote:
> > On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
> > <k.knizhnik@postgrespro.ru> wrote:
> >> I agree with your that loosing sequential order of B-Tree pages may have
> >> negative impact on performance.
> >> But it first of all critical for order-by and range queries, when we
> >> should traverse several subsequent leave pages.
> >> It is less critical for exact-search or delete/insert operations.
> >> Efficiency of merge operations mostly depends on how much keys
> >> will be stored at the same B-Tree page.
> > What do you mean by "mostly"?  Given PostgreSQL has quite small (8k)
> > pages, sequential read in times faster than random read on SSDs
> > (dozens of times on HDDs).  I don't think this is something to
> > neglect.
>
> When yo insert one record in B-Tree, the order of pages doesn't matter
> at all.
> If you insert ten records at one leaf page then order is also not so
> important.
> If you insert 100 records, 50 got to one page and 50 to the next page,
> then insertion may be faster if second page follows on  the disk first one.
> But such insertion may cause page split and so allocation of new page,
> so sequential write order can still be violated.

Sorry, I have no idea what you're getting at.

> >> And it is first of all
> >> determined by size of top index and key distribution.
> > How can you be sure that the top index can fit memory?  On production
> > systems, typically there are multiple consumers of memory: other
> > tables, indexes, other LSMs.  This is one of reasons why LSM
> > implementations have multiple levels: they don't know in advance which
> > levels fit memory.  Another reason is dealing with very large
> > datasets.  And I believe there is a quite strong reason to keep page
> > order sequential within level.
>
> There is no any warranty that top index is kept in memory.
> But as far top index pages are frequently accessed,  I hope that buffer
> management cache replacement
> algorithm does it best to keep them in memory.

So, the top index should be small enough that we can safely assume it
wouldn't be evicted from cache on a heavily loaded production system.
I think it's evident that it should be orders of magnitude smaller
than the total amount of server RAM.

> > I'm OK with your design for a third-party extension.  It's very cool
> > to have.  But I'm -1 for something like this to get into core
> > PostgreSQL, assuming it's feasible to push some effort and get
> > state-of-art LSM there.
> I realize that it is not true LSM.
> But still I wan to notice that it is able to provide ~10 times increase
> of insert speed when size of index is comparable with RAM size.
> And "true LSM" from RocksDB shows similar results.

It's very far from being shown.  All you have shown so far is a
naive benchmark.  I don't object that your design can work out in some
cases.  And it's great that we have the lsm3 extension now.  But I
think for the PostgreSQL core we should think about a better design.

> May be if size of
> index will be 100 times larger then
> size of RAM, RocksDB will be significantly faster than Lsm3. But modern
> servers has 0.5-1Tb of RAM.
> Can't believe that there are databases with 100Tb indexes.

Comparing the whole RAM size to a single index size looks plain wrong
to me.  I think we can roughly compare the whole RAM size to the whole
database size.  But also, not the whole RAM size is always available
for caching data.  Let's assume half of the RAM is used for caching data.
So, a modern server with 0.5-1Tb of RAM, which suffers from random
B-tree insertions and badly needs an LSM-like data structure, runs a
database roughly 100 times larger than its 0.25-0.5Tb cache, i.e.
25-50Tb.  Frankly speaking, there is nothing counterintuitive in that
for me.

------
Regards,
Alexander Korotkov



Re: LSM tree for Postgres

From
Konstantin Knizhnik
Date:

On 09.08.2020 04:53, Alexander Korotkov wrote:
>>
>> I realize that it is not true LSM.
>> But still I wan to notice that it is able to provide ~10 times increase
>> of insert speed when size of index is comparable with RAM size.
>> And "true LSM" from RocksDB shows similar results.
> It's very far from being shown.  All the things you've shown is a
> naive benchmark.  I don't object that your design can work out some
> cases.  And it's great that we have the lsm3 extension now.  But I
> think for PostgreSQL core we should think about better design.

Sorry, I meant that on this particular benchmark and hardware Lsm3 and
RocksDB show similar performance.
It definitely doesn't mean that this will be true in all other cases.
This is one of the reasons why I have published the Lsm3 and RocksDB FDW
extensions:
anybody can try to test them on their own workload.
It will be very interesting for me to know these results, because I
certainly understand
that measuring random insert performance on a dummy table is not enough
to draw conclusions.

And I certainly do not want to say that we do not need a "right" LSM
implementation inside the Postgres core.
It just requires an order of magnitude more effort.
And there are many questions and challenges. For example, the Postgres
buffer size (8kb) seems to be too small for LSM.
Should an LSM implementation bypass the Postgres buffer cache? There are
pros and cons...
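
For reference, the page size of a particular build can be checked with a
trivial query:

show block_size;   -- 8192 in a default build; changing it requires
                   -- recompiling with --with-blocksize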

Another issue is logging. Should we just log all LSM operations in WAL
in the usual way (as it is done for nbtree and Lsm3)?
It seems to me that for LSM alternative, more efficient solutions may
be proposed.
For example, we may not log inserts into the top index at all and just
replay them during recovery, assuming that this operation on a
small index is fast enough. And the merge of the top index with the base
index can be done in an atomic way and so also doesn't require WAL.

As far as I know, Anastasia Lubennikova implemented LSM for Postgres
several years ago.
There were some performance issues (with concurrent access?).
This is why the first thing I want to clarify for myself is what the
bottlenecks of the LSM architecture are,
and whether they are caused by LSM itself or by its integration into the
Postgres infrastructure.

In any case, before thinking about the details of an in-core LSM
implementation for Postgres, I think that
it is necessary to demonstrate workloads at which RocksDB (or any other
existing DBMS with LSM)
shows significant performance advantages compared with Postgres with
nbtree/Lsm3.

>> May be if size of
>> index will be 100 times larger then
>> size of RAM, RocksDB will be significantly faster than Lsm3. But modern
>> servers has 0.5-1Tb of RAM.
>> Can't believe that there are databases with 100Tb indexes.
> Comparison of whole RAM size to single index size looks plain wrong
> for me.  I think we can roughly compare whole RAM size to whole
> database size.  But also not the whole RAM size is always available
> for caching data.  Let's assume half of RAM is used for caching data.
> So, a modern server with 0.5-1Tb of RAM, which suffers from random
> B-tree insertions and badly needs LSM-like data-structure, runs a
> database of 25-50Tb.  Frankly speaking, there is nothing
> counterintuitive for me.

There is actually nothing counterintuitive in it.
I just mean that there are not so many 25-50Tb OLTP databases.




Re: LSM tree for Postgres

From
AJG
Date:
Dmitry Dolgov wrote
>> On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>>
>> Then I think about implementing ideas of LSM using standard Postgres
>> nbtree.
>>
>> We need two indexes: one small for fast inserts and another - big
>> (main) index. This top index is small enough to fit in memory so
>> inserts in this index are very fast.  Periodically we will merge data
>> from top index to base index and truncate the top index. To prevent
>> blocking of inserts in the table while we are merging indexes we can
>> add ... on more index, which will be used during merge.
>>
>> So final architecture of Lsm3 is the following: two top indexes used
>> in cyclic way and one main index. When top index reaches some
>> threshold value we initiate merge with main index, done by bgworker
>> and switch to another top index.  As far as merging indexes is done in
>> background, it doesn't  affect insert speed.  Unfortunately Postgres
>> Index AM has not bulk insert operation, so we have to perform normal
>> inserts.  But inserted data is already sorted by key which should
>> improve access locality and partly solve random reads problem for base
>> index.
>>
>> Certainly to perform search in Lsm3 we have to make lookups in all
>> three indexes and merge search results.
>
> Thanks for sharing this! In fact this reminds me more of partitioned
> b-trees [1] (and more older [2]) rather than LSM as it is (although
> could be that the former was influenced by the latter). What could be
> interesting is that quite often in these and many other whitepapers
> (e.g. [3]) to address the lookup overhead the design includes bloom
> filters in one or another way to avoid searching not relevant part of an
> index. Tomas mentioned them in this thread as well (in the different
> context), probably the design suggested here could also benefit from it?
>
> [1]: Riegger Christian, Vincon Tobias, Petrov Ilia. Write-optimized
> indexing with partitioned b-trees. (2017). 296-300.
> 10.1145/3151759.3151814.
> [2]: Graefe Goetz. Write-Optimized B-Trees. (2004). 672-683.
> 10.1016/B978-012088469-8/50060-7.
> [3]: Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky,
> Lin Ma, and Rui Shen. Reducing the Storage Overhead of Main-Memory OLTP
> Databases with Hybrid Indexes. (2016). 1567–1581. 10.1145/2882903.2915222.


I found this 2019 paper recently; it might be worth a skim read for some
different ideas. Too technical for me :)
"Jungle: Towards Dynamically Adjustable Key-Value Store by Combining
LSM-Tree and Copy-On-Write B+-Tree"
https://www.usenix.org/system/files/hotstorage19-paper-ahn.pdf


