Re: LSM tree for Postgres - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: LSM tree for Postgres
Date
Msg-id 20200804174408.5aq7g7fmfonflp23@development
In response to Re: LSM tree for Postgres  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
On Tue, Aug 04, 2020 at 08:18:01PM +0300, Konstantin Knizhnik wrote:
>
>
>On 04.08.2020 18:11, Tomas Vondra wrote:
>>On Tue, Aug 04, 2020 at 11:22:13AM +0300, Konstantin Knizhnik wrote:
>>>Hi hackers,
>>>
>>>I want to share results of my last research of implementing LSM 
>>>index in Postgres.
>>>Most of modern databases (RocksDB, MongoDB, Tarantool,...) are 
>>>using LSM tree instead of classical B-Tree.
>>>
>>
>>I was under the impression that LSM is more an alternative primary
>>storage, not for indexes. Or am I wrong / confused?
>
>Yes, originally I considered LSM for IOT (index-organized tables).
>And the RocksDB FDW is actually such an implementation of IOT.
>But implementing IOT using the existing nbtree indexes is a more challenging task:
>I have thought about it but have not tried it yet.
>
>>Interesting, although those are just writes, right? Do you have any
>>numbers for read? Also, what are the numbers when you start with "larger
>>than RAM" data (i.e. ignoring the initial period when the index fits
>>into memory)?
>
>When the benchmark starts, insertion speed is almost the same for Lsm3 and 
>standard nbtree
>(we have to insert each record twice, but the second insertion is done in the background).
>At the end of the benchmark - when we get close to 250 million records - Lsm3 
>shows about 20 times higher TPS than nbtree.
>In total it gives about a 6 times difference in elapsed time.
>

IMO the 6x difference is rather misleading, as it very much depends on
the duration of the benchmark and how much data it ends up with. I think
it's better to test 'stable states', i.e. with a small data set that does
not exceed RAM during the whole test, and a large one that starts out
larger than RAM. I'm not sure if it makes sense to distinguish between
cases that fit into shared buffers and those that exceed shared buffers
but still fit into RAM.
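To make the point concrete, here is a toy cost model (illustrative only; the constants and function names are made up, not measured) of why a single elapsed-time ratio is misleading: per-insert cost for a B-tree jumps once the index no longer fits in RAM, while LSM-style inserts stay cheap, so the headline ratio depends entirely on how far past the RAM boundary the benchmark happens to run.

```python
def elapsed_btree(n_rows, ram_rows, cheap=1.0, expensive=20.0):
    """Cumulative B-tree insert cost: cheap while the index is cached,
    expensive (random I/O) once it outgrows RAM."""
    if n_rows <= ram_rows:
        return n_rows * cheap
    return ram_rows * cheap + (n_rows - ram_rows) * expensive

def elapsed_lsm(n_rows, cheap=1.0):
    """LSM top-index inserts stay in memory; merge cost is hidden in
    the background worker, so cost stays roughly linear."""
    return n_rows * cheap

# RAM holds 100 rows; the "speedup" grows with benchmark duration.
for n in (100, 1_000, 10_000):
    ratio = elapsed_btree(n, ram_rows=100) / elapsed_lsm(n)
    print(n, round(ratio, 2))
```

Stopping the run at the RAM boundary reports no difference at all; running ten times longer reports nearly the full 20x per-insert gap, which is why steady-state measurements are more informative.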

>
>>Yeah, I think in general FDW to a database with different consistency
>>model is not going to get us very far ... Good for PoC experiments, but
>>not really designed for stuff like this. Also, in my experience FDW has
>>significant per-row overhead.
>
>From my experience the main drawback of FDW is the lack of support for 
>parallel operations.
>But that is important mostly for OLAP, not for OLTP.
>

True. There are other overheads, though - having to format/copy the
tuples is not particularly cheap.

>>BTW how would your approach work with unique indexes, speculative
>>inserts etc.?
>
>Unique indexes are not supported now,
>and I do not see an acceptable solution here.
>If we have to check for the presence of a duplicate at insert time, 
>then it eliminates all the advantages of the LSM approach.
>And if we postpone it to the moment of merge, then... I am afraid it 
>will be too late.
>

Ummm, but in your response to Stephen you said:

     But search locates not ANY record with specified key in top index
     but record which satisfies snapshot of the transaction. Why do we
     need more records if we know that there are no duplicates?

So how do you know there are no duplicates, if unique indexes are not
supported (and may not be for LSM)?
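The dilemma can be sketched in a few lines (a hedged illustration; `Level` and `unique_insert` are hypothetical names, not from the patch): enforcing uniqueness at insert time means probing every level of the index before each write, which reintroduces exactly the random reads the LSM layout tries to avoid.

```python
class Level:
    """One level of the index; on disk, contains() would be page reads."""
    def __init__(self):
        self.keys = set()

    def insert(self, key):
        self.keys.add(key)

    def contains(self, key):
        return key in self.keys

def unique_insert(levels, key):
    # O(number of levels) lookups per write: the cost the LSM approach
    # hoped to skip by inserting blindly into the top level.
    if any(level.contains(key) for level in levels):
        raise ValueError(f"duplicate key {key}")
    levels[0].insert(key)

top, base = Level(), Level()
unique_insert([top, base], 1)
unique_insert([top, base], 2)
```

Deferring the check to merge time avoids those reads but, as noted above, detects the violation only after the inserting transaction has long since committed.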

>>
>>Isn't it a bit suspicious that with more clients the throughput actually
>>drops significantly? Is this merely due to PoC stage, or is there some
>>inherent concurrency bottleneck?
>>
>My explanation is the following (I am not 100% sure that it is true): 
>multiple clients insert records faster than the merge bgworker is able to 
>merge them into the main index. It causes the top index to bloat, and as 
>a result it doesn't fit in memory any more.
>So we lose the advantage of fast inserts. If we have N top indexes 
>instead of just 2, we can keep the size of each top index small enough.
>But in this case search operations will have to merge N indexes, and so 
>search is almost N times slower (the fact that each top index fits in 
>memory
>doesn't mean that all of them fit in memory at the same time, so we 
>still have to read pages from disk during lookups in top indexes).
>

Hmmm, maybe. Should be easy to verify by monitoring the size of the top
index, and limiting it to some reasonable value to keep good
performance. Something like gin_pending_list_size I guess.
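A sketch of that idea (an assumed design for illustration, not the actual Lsm3 code): cap the size of the fast top index the way gin_pending_list_size caps the GIN pending list, merging into the base index whenever the top level grows past the limit, so a burst of inserts cannot bloat the top index until it no longer fits in memory.

```python
import bisect
import heapq

class CappedLsm:
    def __init__(self, top_limit):
        self.top_limit = top_limit
        self.top = []    # small sorted level, assumed to stay in memory
        self.base = []   # large sorted base index (on disk in reality)

    def insert(self, key):
        bisect.insort(self.top, key)
        if len(self.top) > self.top_limit:
            self.merge()

    def merge(self):
        # Fold the whole top level into the base index and reset it;
        # the top level therefore stays bounded regardless of insert rate.
        self.base = list(heapq.merge(self.base, self.top))
        self.top = []

    def lookup(self, key):
        # Probe the top level first, then the base index; with N top
        # levels this loop grows to N+1 probes, which is the search-cost
        # trade-off described above.
        for level in (self.top, self.base):
            i = bisect.bisect_left(level, key)
            if i < len(level) and level[i] == key:
                return True
        return False

idx = CappedLsm(top_limit=4)
for k in [5, 1, 9, 3, 7, 2]:
    idx.insert(k)
assert len(idx.top) <= 4 and idx.lookup(7)
```

In the real system the merge would run in the background worker; the point of the cap is only that inserters throttle (or trigger merges) before the top index outgrows memory.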


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


