Re: LSM tree for Postgres - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: LSM tree for Postgres
Date
Msg-id CAPpHfdvkFFKd3-JvVX_aWALMb8xjhkLA9juFG1uAk6aZ7JVRSA@mail.gmail.com
Whole thread Raw
In response to Re: LSM tree for Postgres  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: LSM tree for Postgres
List pgsql-hackers
On Sat, Aug 8, 2020 at 11:49 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> On 08.08.2020 21:18, Alexander Korotkov wrote:
> > On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
> > <k.knizhnik@postgrespro.ru> wrote:
> >> I agree with your that loosing sequential order of B-Tree pages may have
> >> negative impact on performance.
> >> But it first of all critical for order-by and range queries, when we
> >> should traverse several subsequent leave pages.
> >> It is less critical for exact-search or delete/insert operations.
> >> Efficiency of merge operations mostly depends on how much keys
> >> will be stored at the same B-Tree page.
> > What do you mean by "mostly"?  Given PostgreSQL has quite small (8k)
> > pages, sequential read in times faster than random read on SSDs
> > (dozens of times on HDDs).  I don't think this is something to
> > neglect.
>
> When yo insert one record in B-Tree, the order of pages doesn't matter
> at all.
> If you insert ten records at one leaf page then order is also not so
> important.
> If you insert 100 records, 50 got to one page and 50 to the next page,
> then insertion may be faster if second page follows on  the disk first one.
> But such insertion may cause page split and so allocation of new page,
> so sequential write order can still be violated.

Sorry, I've no idea of what you're getting at.

> >> And it is first of all
> >> determined by size of top index and key distribution.
> > How can you be sure that the top index can fit memory?  On production
> > systems, typically there are multiple consumers of memory: other
> > tables, indexes, other LSMs.  This is one of reasons why LSM
> > implementations have multiple levels: they don't know in advance which
> > levels fit memory.  Another reason is dealing with very large
> > datasets.  And I believe there is a quite strong reason to keep page
> > order sequential within level.
>
> There is no any warranty that top index is kept in memory.
> But as far top index pages are frequently accessed,  I hope that buffer
> management cache replacement
> algorithm does it best to keep them in memory.

So, the top index should be small enough that we can safely assume it
wouldn't be evicted from cache on a heavily loaded production system.
I think it's evident that it should be in orders of magnitude less
than the total amount of server RAM.

> > I'm OK with your design for a third-party extension.  It's very cool
> > to have.  But I'm -1 for something like this to get into core
> > PostgreSQL, assuming it's feasible to push some effort and get
> > state-of-art LSM there.
> I realize that it is not true LSM.
> But still I wan to notice that it is able to provide ~10 times increase
> of insert speed when size of index is comparable with RAM size.
> And "true LSM" from RocksDB shows similar results.

It's very far from being shown.  All the things you've shown is a
naive benchmark.  I don't object that your design can work out some
cases.  And it's great that we have the lsm3 extension now.  But I
think for PostgreSQL core we should think about better design.

> May be if size of
> index will be 100 times larger then
> size of RAM, RocksDB will be significantly faster than Lsm3. But modern
> servers has 0.5-1Tb of RAM.
> Can't believe that there are databases with 100Tb indexes.

Comparison of whole RAM size to single index size looks plain wrong
for me.  I think we can roughly compare whole RAM size to whole
database size.  But also not the whole RAM size is always available
for caching data.  Let's assume half of RAM is used for caching data.
So, a modern server with 0.5-1Tb of RAM, which suffers from random
B-tree insertions and badly needs LSM-like data-structure, runs a
database of 25-50Tb.  Frankly speaking, there is nothing
counterintuitive for me.

------
Regards,
Alexander Korotkov



pgsql-hackers by date:

Previous
From: Konstantin Knizhnik
Date:
Subject: Re: LSM tree for Postgres
Next
From: Michael Paquier
Date:
Subject: Re: 回复:how to create index concurrently on partitioned table