Thread: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid) ) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.

This patch was originally suggested at [0], but it was mentioned that
they could be pulled out into it's own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with the downlink right separator/HIKEY optimization
(separate patch to be submitted soon), and specialization over at [0].

[0] https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com

Attachment
Hi

út 31. 10. 2023 v 22:12 odesílatel Matthias van de Meent <boekewurm+postgres@gmail.com> napsal:
Hi,

Currently, nbtree code compares each and every column of an index
tuple during the binary search on the index page. With large indexes
that have many duplicate prefix column values (e.g. an index on (bool,
bool, uuid) ) that means a lot of wasted time getting to the right
column.

The attached patch improves on that by doing per-page dynamic prefix
truncation: If we know that on both the right and left side there are
index tuples where the first two attributes are equal to the scan key,
we skip comparing those attributes at the current index tuple and
start with comparing attribute 3, saving two attribute compares. We
gain performance whenever comparing prefixing attributes is expensive
and when there are many tuples with a shared prefix - in unique
indexes this doesn't gain much, but we also don't lose much in this
case.

This patch was originally suggested at [0], but it was mentioned that
they could be pulled out into it's own thread. Earlier, the
performance gains were not clearly there for just this patch, but
after further benchmarking this patch stands on its own for
performance: it sees no obvious degradation of performance, while
gaining 0-5% across various normal indexes on the cc-complete sample
dataset, with the current worst-case index shape getting a 60%+
improved performance on INSERTs in the tests at [0].

+1

This can be nice functionality. I had a customer with a very slow index scan - the main problem was a long common prefix like prg010203xxxx.

Regards

Pavel


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with the downlink right separator/HIKEY optimization
(separate patch to be submitted soon), and specialization over at [0].

[0] https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com
On Wed, 1 Nov 2023 at 07:47, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> út 31. 10. 2023 v 22:12 odesílatel Matthias van de Meent <boekewurm+postgres@gmail.com> napsal:
>> This patch was originally suggested at [0], but it was mentioned that
>> they could be pulled out into it's own thread. Earlier, the
>> performance gains were not clearly there for just this patch, but
>> after further benchmarking this patch stands on its own for
>> performance: it sees no obvious degradation of performance, while
>> gaining 0-5% across various normal indexes on the cc-complete sample
>> dataset, with the current worst-case index shape getting a 60%+
>> improved performance on INSERTs in the tests at [0].
>
>
> +1

Thanks for showing interest.

> This can be nice functionality. I had a customer with a very slow index scan - the main problem was a long common
prefixlike prg010203xxxx. 

I'll have to note that this patch doesn't cover cases where e.g. text
attributes have large shared prefixes, but are still unique: the
dynamic prefix compression in this patch is only implemented at the
tuple attribute level; it doesn't implement type aware dynamic prefix
compression inside the attributes. So, a unique index on a column of
int32 formatted like '%0100i' would not materially benefit from this
patch.

While would certainly be possible to add some type-level prefix
truncation in the framework of this patch, adding that would require
significant code churn in btree compare operators, because we'd need
an additional return argument to contain a numerical "shared prefix",
and that is not something I was planning to implement in this patch.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)





st 1. 11. 2023 v 11:32 odesílatel Matthias van de Meent <boekewurm+postgres@gmail.com> napsal:
On Wed, 1 Nov 2023 at 07:47, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> út 31. 10. 2023 v 22:12 odesílatel Matthias van de Meent <boekewurm+postgres@gmail.com> napsal:
>> This patch was originally suggested at [0], but it was mentioned that
>> they could be pulled out into it's own thread. Earlier, the
>> performance gains were not clearly there for just this patch, but
>> after further benchmarking this patch stands on its own for
>> performance: it sees no obvious degradation of performance, while
>> gaining 0-5% across various normal indexes on the cc-complete sample
>> dataset, with the current worst-case index shape getting a 60%+
>> improved performance on INSERTs in the tests at [0].
>
>
> +1

Thanks for showing interest.

> This can be nice functionality. I had a customer with a very slow index scan - the main problem was a long common prefix like prg010203xxxx.

I'll have to note that this patch doesn't cover cases where e.g. text
attributes have large shared prefixes, but are still unique: the
dynamic prefix compression in this patch is only implemented at the
tuple attribute level; it doesn't implement type aware dynamic prefix
compression inside the attributes. So, a unique index on a column of
int32 formatted like '%0100i' would not materially benefit from this
patch.

While would certainly be possible to add some type-level prefix
truncation in the framework of this patch, adding that would require
significant code churn in btree compare operators, because we'd need
an additional return argument to contain a numerical "shared prefix",
and that is not something I was planning to implement in this patch.

Thanks for the explanation. 

Pavel


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> Hi,
>
> Currently, nbtree code compares each and every column of an index
> tuple during the binary search on the index page. With large indexes
> that have many duplicate prefix column values (e.g. an index on (bool,
> bool, uuid) ) that means a lot of wasted time getting to the right
> column.
>
> The attached patch improves on that by doing per-page dynamic prefix
> truncation: If we know that on both the right and left side there are
> index tuples where the first two attributes are equal to the scan key,
> we skip comparing those attributes at the current index tuple and
> start with comparing attribute 3, saving two attribute compares. We
> gain performance whenever comparing prefixing attributes is expensive
> and when there are many tuples with a shared prefix - in unique
> indexes this doesn't gain much, but we also don't lose much in this
> case.
>
> This patch was originally suggested at [0], but it was mentioned that
> they could be pulled out into it's own thread. Earlier, the
> performance gains were not clearly there for just this patch, but
> after further benchmarking this patch stands on its own for
> performance: it sees no obvious degradation of performance, while
> gaining 0-5% across various normal indexes on the cc-complete sample
> dataset, with the current worst-case index shape getting a 60%+
> improved performance on INSERTs in the tests at [0].

+1 for the idea, I have some initial comments while reading through the patch.

1.
Commit message refers to a non-existing reference '(see [0])'.


2.
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.

I think this should be 'larger than left neighbour and smaller than
right neighbour' instead of the other way around.

3.
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page)))

Where do the 3*natts come from?  Is it related to setting up the
dynamic prefix at each level?

4.
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *compareattr = ntupatts + 1;

Can you elaborate on this more? If all tuple attributes are equal to
the scan key then what do those 'later attributes' mean?


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, 19 Jan 2024 at 05:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > Hi,
> >
> > Currently, nbtree code compares each and every column of an index
> > tuple during the binary search on the index page. With large indexes
> > that have many duplicate prefix column values (e.g. an index on (bool,
> > bool, uuid) ) that means a lot of wasted time getting to the right
> > column.
> >
> > The attached patch improves on that by doing per-page dynamic prefix
> > truncation: If we know that on both the right and left side there are
> > index tuples where the first two attributes are equal to the scan key,
> > we skip comparing those attributes at the current index tuple and
> > start with comparing attribute 3, saving two attribute compares. We
> > gain performance whenever comparing prefixing attributes is expensive
> > and when there are many tuples with a shared prefix - in unique
> > indexes this doesn't gain much, but we also don't lose much in this
> > case.
> >
> > This patch was originally suggested at [0], but it was mentioned that
> > they could be pulled out into it's own thread. Earlier, the
> > performance gains were not clearly there for just this patch, but
> > after further benchmarking this patch stands on its own for
> > performance: it sees no obvious degradation of performance, while
> > gaining 0-5% across various normal indexes on the cc-complete sample
> > dataset, with the current worst-case index shape getting a 60%+
> > improved performance on INSERTs in the tests at [0].
>
> +1 for the idea, I have some initial comments while reading through the patch.

Thank you for the review.

> 1.
> Commit message refers to a non-existing reference '(see [0])'.

Noted, I'll update that.

> 2.
> +When we do a binary search on a sorted set (such as a BTree), we know that a
> +tuple will be smaller than its left neighbour, and larger than its right
> +neighbour.
>
> I think this should be 'larger than left neighbour and smaller than
> right neighbour' instead of the other way around.

Noted, will be fixed, too.

> 3.
> +With the above optimizations, dynamic prefix truncation improves the worst
> +case complexity of indexing from O(tree_height * natts * log(tups_per_page))
> +to O(tree_height * (3*natts + log(tups_per_page)))
>
> Where do the 3*natts come from?  Is it related to setting up the
> dynamic prefix at each level?

Yes: We need to establish prefixes for both a tuple that's ahead of
the to-be-compared tuple, and one that's after the to-be-compared
tuple. Assuming homogenous random distribution of scan key accesses
across the page (not always the case, but IMO a reasonable starting
point) this would average to 3 unprefixed compares before you have
established both a higher and a lower prefix.

> 4.
> + /*
> + * All tuple attributes are equal to the scan key, only later attributes
> + * could potentially not equal the scan key.
> + */
> + *compareattr = ntupatts + 1;
>
> Can you elaborate on this more? If all tuple attributes are equal to
> the scan key then what do those 'later attributes' mean?

In inner pages, tuples may not have all key attributes, as some may
have been truncated away in page splits. So, tuples that have at least
the same prefix as this (potentially truncated) tuple will need to be
compared starting at the first missing attribute of this tuple, i.e.
ntupatts + 1.

Kind regards,

Matthias van de Meent



On Wed, 24 Jan 2024 at 13:02, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > 1.
> > Commit message refers to a non-existing reference '(see [0])'.
>
> Noted, I'll update that.
>
> > 2.
> > +When we do a binary search on a sorted set (such as a BTree), we know that a
> > +tuple will be smaller than its left neighbour, and larger than its right
> > +neighbour.
> >
> > I think this should be 'larger than left neighbour and smaller than
> > right neighbour' instead of the other way around.
>
> Noted, will be fixed, too.

Attached is version 15 of this patch, with the above issues fixed.
It's also rebased on top of 655dc310 of this morning, so that should
keep good for some time again.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment
On Fri, 1 Mar 2024 at 14:48, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Attached is version 15 of this patch, with the above issues fixed.
> It's also rebased on top of 655dc310 of this morning, so that should
> keep good for some time again.

Attached is version 16 now. Relevant changes from previous patch versions:

- Move from 1-indexed AttributeNumber to 0-indexed ints for prefixes,
and use "prefix" as naming scheme, rather than cmpcol. A lack of
prefix, previously indicated with a cmpcol value of 1, is now a prefix
value of 0.
- Adjusted README
- Improved the efficiency of the insertion path in some cases where
we've previously compared the page's highkey.

As always, why we need this:

Currently, btrees are quite inefficient when they have many key
attributes but low attribute cardinality in the prefix, e.g. an index
on ("", "", "", uuid). This is not just inefficient use of disk space
with the high repetition of duplicate prefix values in pages, but it
is also a computational overhead when we're descending the tree in
e.g. _bt_first() or btinsert(): The code we use to search a page
currently compares the full key to the full searchkey, for a
complexity of O(n_equal_attributes + 1) for every tuple on the page,
for O(log(page_ntups) * (n_equal_attributes + 1)) attribute compares
every page during descent.

This patch fixes that part of the computational complexity by applying
dynamic prefix compression, thus reducing the average computational
complexity in random index lookups to O(3 * (n_equal_attributes) +
log(page_ntups)) per page (assuming at least 4 non-hikey tuples on
each page). In practice, this makes indexes with 3+ attributes and
prefixes with low selectivity (such as the example above) much more
viable computationally, as we have to spend much less effort on
comparing known index attributes during descent.

Note that this does _not_ reuse prefix bounds across pages - it
re-establishes the left- and right prefixes every page during the
binary search. See the README modified in the patch for specific
implementation details and considerations.

This patch synergizes with the highkey optimization used in [0]: When
combined, the number of attribute compare function calls could be
further reduced to O(2 * (n_equal_atts) + log(page_ntups)), a
reduction by n_equal_atts every page, which in certain wide index
types could be over 25% of all attribute compare function calls on the
page after dynamic prefix truncation. However, both are separately
useful and reduce the amount of work done on most pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/flat/CAEze2WijWhCQy_nZVP4Ye5Hwj=YW=3rqv+hbMJGcOHtrYQmyKw@mail.gmail.com

Attachment
On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Attached is version 16 now.

I ran this with my old land registry benchmark, used for validating
the space utilization impact of nbtree deduplication (among other
things). This isn't obviously the best benchmark for this sort of
thing, but I seem to recall that you'd used it yourself at some point.
To validate work in this area, likely including this patch. So I
decided to start there.

To be clear, this test involves bulk loading of an unlogged table (the
land registry table). The following composite index is created on the
table before we insert any rows, so most of the cycles here are in
index maintenance including _bt_search descents:

CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
COLLATE "C", locality COLLATE "C");

I wasn't able to see much of an improvement with this patch applied.
It went from  ~00:51.598 to ~00:51.053. That's a little disappointing,
given that this is supposed to be a sympathetic case for the patch.
Can you suggest something else? (Granted, I understand that this patch
has some complicated relationship with other patches of yours, which I
don't understand currently.)

I'm a bit worried about side-effects for this assertion:

@@ -485,7 +489,7 @@ _bt_check_unique(Relation rel, BTInsertState
insertstate, Relation heapRel,
                Assert(insertstate->bounds_valid);
                Assert(insertstate->low >= P_FIRSTDATAKEY(opaque));
                Assert(insertstate->low <= insertstate->stricthigh);
-               Assert(_bt_compare(rel, itup_key, page, offset) < 0);
+               Assert(_bt_compare(rel, itup_key, page, offset, &sprefix) < 0);
                break;
            }

More generally, it's not entirely clear how the code in
_bt_check_unique is supposed to work with the patch. Why should it be
safe to do what you're doing with the prefix there? It's not like
we're doing a binary search here -- it's more like a linear search.

> - Move from 1-indexed AttributeNumber to 0-indexed ints for prefixes,
> and use "prefix" as naming scheme, rather than cmpcol. A lack of
> prefix, previously indicated with a cmpcol value of 1, is now a prefix
> value of 0.

Found a likely-related bug in the changes you made to amcheck, which I
was able to fix myself like so:

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index c7dc6725a..15be61777 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -3187,7 +3187,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
         insertstate.buf = lbuf;

         /* Get matching tuple on leaf page */
-        offnum = _bt_binsrch_insert(state->rel, &insertstate, 1);
+        offnum = _bt_binsrch_insert(state->rel, &insertstate, 0);
         /* Compare first >= matching item on leaf page, if any */

--
Peter Geoghegan



> On Tue, Aug 13, 2024 at 02:39:10PM GMT, Peter Geoghegan wrote:
> On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Attached is version 16 now.
>
> I ran this with my old land registry benchmark, used for validating
> the space utilization impact of nbtree deduplication (among other
> things). This isn't obviously the best benchmark for this sort of
> thing, but I seem to recall that you'd used it yourself at some point.
> To validate work in this area, likely including this patch. So I
> decided to start there.
>
> To be clear, this test involves bulk loading of an unlogged table (the
> land registry table). The following composite index is created on the
> table before we insert any rows, so most of the cycles here are in
> index maintenance including _bt_search descents:
>
> CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
> COLLATE "C", locality COLLATE "C");
>
> I wasn't able to see much of an improvement with this patch applied.
> It went from  ~00:51.598 to ~00:51.053. That's a little disappointing,
> given that this is supposed to be a sympathetic case for the patch.
> Can you suggest something else? (Granted, I understand that this patch
> has some complicated relationship with other patches of yours, which I
> don't understand currently.)

Under the danger of showing my ignorance, what is the definition of land
registry benchmark? I think it would be useful if others could reproduce
the results as well, especially if they're somewhat surprising.

At the same time I would like to note, that dynamic prefix truncation
can be useful not only on realistic data, but in some weird-but-useful
cases, e.g. for partitioned B-Tree with an artificial leading key
separating partitions. It's a hypothetical case, which can't justify
a feature of course, but makes it worth investigating.



On Wed, Nov 13, 2024 at 3:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Tue, Aug 13, 2024 at 02:39:10PM GMT, Peter Geoghegan wrote:
> > On Tue, Aug 6, 2024 at 5:42 PM Matthias van de Meent
> > To be clear, this test involves bulk loading of an unlogged table (the
> > land registry table). The following composite index is created on the
> > table before we insert any rows, so most of the cycles here are in
> > index maintenance including _bt_search descents:
> >
> > CREATE INDEX composite ON land2 USING btree (county COLLATE "C", city
> > COLLATE "C", locality COLLATE "C");

> Under the danger of showing my ignorance, what is the definition of land
> registry benchmark? I think it would be useful if others could reproduce
> the results as well, especially if they're somewhat surprising.

It's a sample dataset that I've found useful from time to time,
particularly when testing nbtree features. Usually using a composite
index like the one I described.

One slightly useful (though far from unique) property of such an index
is that it contains data that's low cardinality (sometimes extremely
low cardinality) across multiple index columns. With variable-width
(text) index columns. That specific combination made the index a
decent test of certain issues affecting the nbtsplitloc.c split point
choice logic during work on Postgres 12 and 13. I believe that
Matthias independently found it useful on a number of other occasions,
too.

See https://wiki.postgresql.org/wiki/Sample_Databases for instructions
on how to set it up for yourself. You could probably come up with a
way of generating a similar dataset, without needing to download
anything, though. The fact that I found it useful in the past is at
least somewhat arbitrary.

--
Peter Geoghegan