Re: LSM tree for Postgres - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: LSM tree for Postgres
Date
Msg-id CAPpHfdu9jW-HnemRy-Yo-fAwoEVPfUpXi1e63pnmWqnk8SheFA@mail.gmail.com
In response to Re: LSM tree for Postgres  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: LSM tree for Postgres
List pgsql-hackers
On Tue, Aug 4, 2020 at 7:56 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> I do not understand why we need multiple indexes.
> We need one "hot" index which fits in memory to perform fast inserts.
> But it should not be too small, so that it can accumulate a substantial
> amount of records to provide efficient bulk insert.
> I expect that the top index can be efficiently merged with the base index
> because of better access locality.
> I.e. we will insert multiple entries into one B-Tree page and so
> minimize the slowdown of random reads.
>
> A third index is needed to perform a parallel merge (while the merge is in
> progress the top index will be blocked and we cannot perform inserts into it).
> I do not understand the benefits of performing more than one merge in
> parallel: it will only increase the fraction of random reads.
>
> Degradation certainly takes place. But it is not as critical as in the case
> of standard nbtree.
> It is possible to tune the threshold for the top index size to make the
> merge most efficient.
> But we cannot truncate and swap the index before we complete the merge.
> So if the merge operation takes a long time, it will cause exhaustion of
> the top index and it will not fit in memory any more.
> This will lead to further slowdown (so we have negative feedback here).
>
> Certainly it is possible to create more top indexes, keeping their size
> small enough to fit in memory.
> But in this case search will require merging not 3, but N indexes.
> I think that may cause an unacceptable slowdown of search operations.
> And that is highly undesirable, taking into account that most applications
> send more select-only queries than updates.
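For readers following along, the three-index scheme described above can be modelled in a few lines. This is a minimal sketch only; the class and method names are illustrative and not taken from the lsm3 extension, and the "indexes" are plain dicts standing in for B-trees.

```python
class Lsm3Sketch:
    """Toy model of the scheme above: two small 'top' indexes that fit
    in memory, and one large 'base' index (all modelled as dicts)."""

    def __init__(self, top_limit=4):
        self.top_limit = top_limit  # threshold that triggers a merge
        self.base = {}              # large persistent B-tree
        self.tops = [{}, {}]        # two small in-memory B-trees
        self.active = 0             # top index currently taking inserts

    def insert(self, key, tid):
        self.tops[self.active][key] = tid
        if len(self.tops[self.active]) >= self.top_limit:
            self.start_merge()

    def start_merge(self):
        # Freeze the full top index and switch inserts to the other one;
        # the frozen index is merged into the base in key order, which
        # groups several inserts per base page (better access locality).
        frozen = self.active
        self.active = 1 - self.active
        for key in sorted(self.tops[frozen]):
            self.base[key] = self.tops[frozen][key]
        self.tops[frozen].clear()   # truncate after the merge completes

    def search(self, key):
        # A lookup must consult all three indexes, newest first.
        for idx in (self.tops[0], self.tops[1], self.base):
            if key in idx:
                return idx[key]
        return None
```

This also makes the search-cost concern visible: every lookup pays for three probes, and with N top indexes it would pay for N + 1.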

The things you're writing make me uneasy.  I initially understood
lsm3 as a quick and dirty prototype, while you're probably keeping
some design in your mind (for instance, the original LSM design).
However, your message makes me think you're trying to defend the
approach currently implemented in the lsm3 extension.  Therefore, I
have to criticise this approach.

1) The base index can degrade.  First, merges can cause page
splits, so the logical ordering of pages becomes less
correlated with their physical ordering with each merge.
2) If your workload includes updates and/or deletes, page
utilization may also degrade.
3) As the base index degrades, merge performance degrades too.
Traversing the base index in logical order will require more and more
random reads (at some point almost every page read will be random).
And as the base index becomes large and/or bloated, you push fewer top
index tuples to a single base index page (at some point you will push
one tuple per page).

The original LSM design implies strict guarantees on the average
resources spent per index operation.  Your design doesn't.  Moreover, I
bet lsm3 will degrade significantly even on an insert-only workload.  It
should degrade to the performance level of a B-tree once you insert
enough data.  Try something like number_of_merges =
number_of_tuples_per_index_page * 2 and you should see this
degradation.  A real LSM doesn't degrade that way.
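The degradation in point (3) is easy to see with a back-of-envelope simulation: merging a fixed-size top index into a growing base index touches more and more distinct base pages, so the number of top tuples amortized per base-page access falls toward one. The constants below are illustrative, not measured, and the model assumes uniformly random keys.

```python
import random

TUPLES_PER_PAGE = 100     # assumed base-index page capacity (illustrative)
TOP_SIZE = 10_000         # tuples pushed down per merge pass (illustrative)

def tuples_per_touched_page(base_tuples):
    # With uniformly random keys, the merged tuples scatter over the base
    # pages; count how many distinct base pages one merge pass touches.
    base_pages = max(1, base_tuples // TUPLES_PER_PAGE)
    touched = len({random.randrange(base_pages) for _ in range(TOP_SIZE)})
    return TOP_SIZE / touched

# Small base: many top tuples land on each base page per merge.
small_base = tuples_per_touched_page(20_000)
# Huge base: the ratio approaches one tuple per (random) page access.
huge_base = tuples_per_touched_page(100_000_000)
```

With a base only twice the top size, each touched page absorbs dozens of tuples; once the base holds orders of magnitude more tuples, nearly every merged tuple costs its own random page access.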

> Yes, this approach increases the amount of logged data twice:
> we need to WAL-log inserts into both the top and the base indexes.
> And it will certainly have a negative influence on performance.
> Unfortunately I have no idea how to avoid it without patching the Postgres core.

Huh, I didn't mean "without patching Postgres core" :)  I mean it's
difficult in principle, assuming PostgreSQL recovery is single-process
and doesn't have access to the system catalog (because it might be
inconsistent).

> Right now vacuum processes Lsm3 indexes in the usual way.
> Removing records from the top indexes may not be needed at all (just because
> these indexes will be truncated in any case).
> But as far as the size of the top index is expected to be small enough,
> vacuuming it should not take a long time,
> so I didn't try to avoid it (although it should not be difficult - just
> disable ambulkdelete for the corresponding nbtree wrappers).

It doesn't seem important, but I don't get your point here.  Postgres
expects ambulkdelete to delete TIDs from the index.  If you don't delete
a TID from the top index, it will be merged into the base index.  And
that could lead to wrong query answers unless you eliminate those TIDs
in a different way (during the merge stage or something).
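Eliminating the dead TIDs at merge time could look roughly like the sketch below. This is a hypothetical helper, not lsm3 code: `dead_tids` stands for the set of TIDs vacuum reported via ambulkdelete, and the indexes are modelled as dicts.

```python
def merge_top_into_base(top, base, dead_tids):
    """Merge top-index entries into the base index, dropping entries
    whose TID vacuum has already reported dead, instead of bulk-deleting
    them from the top index beforehand (hypothetical sketch)."""
    for key in sorted(top):       # traverse top index in key order
        tid = top[key]
        if tid in dead_tids:
            continue              # eliminate the dead TID at merge time
        base[key] = tid
    top.clear()                   # top index is truncated after the merge
```

Without a filter of this kind, disabling ambulkdelete on the top indexes would let dead TIDs survive into the base index.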

> Concerning deletes from main index - I do not understand how it can be
> optimized.

This is a trick you can learn from almost every LSM implementation.
For instance, check the leveldb docs [1] about the "delete marker".  For
sure, that requires some redesign of vacuum and can't be done in an
extension (at least not in a reasonable way).  But, frankly speaking, I
think core modifications are inevitable to utilize the power of LSM in
PostgreSQL.
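The delete-marker trick can be sketched as follows. This is a simplified model of the general LSM technique (as described in the leveldb docs), not of any PostgreSQL code: levels are plain dicts, newest first, and a delete writes a cheap tombstone instead of touching the base index.

```python
DELETE = object()  # tombstone sentinel (the "delete marker")

def lsm_delete(top, key):
    # A delete only inserts a tombstone into the in-memory top level;
    # the base index is not touched at delete time.
    top[key] = DELETE

def lsm_lookup(levels, key):
    # Newer levels shadow older ones; a tombstone hides any older value.
    for level in levels:
        if key in level:
            value = level[key]
            return None if value is DELETE else value
    return None

def lsm_merge(top, base):
    # When a tombstone reaches the bottom level it removes the base
    # entry and is itself discarded, since no older level can still
    # hold the key.
    for key, value in top.items():
        if value is DELETE:
            base.pop(key, None)
        else:
            base[key] = value
    top.clear()
```

The point is that the delete itself is an O(1) in-memory write; the actual removal is deferred to the merge, which is exactly the step vacuum would need to cooperate with.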

Links
1. https://github.com/google/leveldb/blob/master/doc/impl.md

------
Regards,
Alexander Korotkov


