Thread: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Matthias van de Meent
Date:
-- Targeting PG15; if too early / noise then please ignore.

I've noticed there are a lot of places in the btree index infrastructure (and also some other index AMs) that effectively iterate over the attributes of the index tuple, but won't use index_deform_tuple for reasons. However, this implies that they must repeatedly call index_getattr, which in the worst case is O(n) for the n-th attribute, slowing down extraction of multi-column indexes significantly. As such, I've added some API that allows for iteration (ish) over index attributes.

Please find attached patch 0001, which improves the runtime complexity of many of these places by storing and reusing the offset of the last extracted attribute. This improves the worst-case runtime of extracting all attributes incrementally to O(n), down from O(n*n). Note that finding the first offset is still an O(n) worst case when starting at the n-th attribute, but nothing can be done about that. Almost all workloads on multi-column nbtree indexes that cannot use attcacheoff should see a benefit from this patch; only those that exclusively use row scans cannot use this optimization. Additionally, multi-column GiST indexes could also see some (albeit limited) benefit, which is useful considering the newly added INCLUDE support in the GiST AM.

Also attached is 0002, which dynamically truncates attribute prefixes of tuples whilst _binsrch-ing through an nbtree page. It makes heavy use of the improved performance of 0001; they work very well together. The problems that Peter (cc-ed) mentions in [0] only result in invalid search bounds when traversing the tree, but on the page level valid bounds can be constructed.

This is patchset 1 of a series of patches I'm starting for eventually adding static prefix truncation into nbtree infrastructure in PostgreSQL. I've put up a wiki page [1] with my current research and thoughts on that topic.

Performance
-----------

I've run some performance tests on my laptop; they test nbtree index traversal. The test is based on a recent UK land registry sales prices dataset (25744780 rows), being copied from one table into an unlogged table with disabled autovacuum, with one index as specified by the result. Master @ 99964c4a, patched is with both 0001 and 0002. The results are averages over 3 runs, with plain configure, compiled by gcc (Debian 6.3.0-18+deb9u1).

INSERT (index definition)                  | master (s) | patched (s) | improv (%)
-------------------------------------------+------------+-------------+-----------
UNIQUE (transaction)                       |     256851 |      251705 |       2.00
(county, city, locality)                   |     154529 |      147495 |       4.55
(county COLLATE "en_US", city, locality)   |     174028 |      164165 |       5.67
(always_null, county, city, locality)      |     173090 |      166851 |       3.60

Some testing for REINDEX indicates improvements there as well. Same compiled version; all indexes on an unlogged table; REINDEX run 4 times on each index, with the last 3 runs averaged.

REINDEX (index definition)                 | master (s) | patched (s) | improv (%)
-------------------------------------------+------------+-------------+-----------
UNIQUE (transaction)                       |      11623 |       11692 |       -0.6
(county, city, locality)                   |      58299 |       54770 |        6.1
(county COLLATE "en_US", city, locality)   |      61790 |       55887 |        9.6
(always_null, county, city, locality)      |      69703 |       63925 |        8.3

I am quite surprised by the results for the single-column unique index insertions, as that was one of the points where I was expecting a slight decrease in performance for inserts.
I haven't really checked why the performance increased, but I suspect it has to do with an improved fast path for finding the first attribute (we know it always starts at offset 0 of the data section), though it might just as well be due to throttling (sadly, I do not have a stable benchmarking machine, so my laptop will do).

I'm also slightly disappointed with the results of the always_null insert load; I had hoped for better results there, seeing the results for the other two multi-column indexes.

With regards,

Matthias van de Meent.

[0] https://www.postgresql.org/message-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[1] https://wiki.postgresql.org/wiki/NBTree_Prefix_Truncation
Attachment
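To make the "store and reuse the offset of the last extracted attribute" idea concrete, here is a minimal standalone sketch. The tuple layout (each attribute prefixed by a one-byte length) and all names are invented for illustration; this is not the API from the attached patch nor PostgreSQL's IndexTuple format. The point is only that an iterator which carries the running offset extracts attribute k in O(1) amortized time, whereas a from-scratch getattr(k) would re-walk attributes 1..k-1 on every call, giving O(n*n) for a full extraction.

/*
 * Toy model of incremental attribute extraction: remember where the
 * previous attribute ended so the next one is found without rescanning.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct AttrIter
{
    const uint8_t *data;        /* start of the tuple's data area */
    size_t      next_off;       /* where the next attribute begins */
} AttrIter;

static void
attr_iter_init(AttrIter *it, const uint8_t *data)
{
    it->data = data;
    it->next_off = 0;
}

/* Return a pointer to the next attribute's payload, set *len to its size. */
static const uint8_t *
attr_iter_next(AttrIter *it, uint8_t *len)
{
    const uint8_t *payload;

    *len = it->data[it->next_off];
    payload = it->data + it->next_off + 1;

    /* remember where this attribute ended: the next one starts there */
    it->next_off += 1 + *len;
    return payload;
}

int
main(void)
{
    /* three variable-length attributes: "ab", "cde", "f" */
    const uint8_t tuple[] = {2, 'a', 'b', 3, 'c', 'd', 'e', 1, 'f'};
    AttrIter    it;
    uint8_t     len;

    attr_iter_init(&it, tuple);
    for (int attnum = 1; attnum <= 3; attnum++)
    {
        const uint8_t *p = attr_iter_next(&it, &len);

        printf("attribute %d: %.*s\n", attnum, (int) len, (const char *) p);
    }
    return 0;
}

This is essentially the shape of optimization the first message describes for index_getattr-heavy code paths that cannot rely on attcacheoff.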
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Peter Geoghegan
Date:
On Thu, Apr 15, 2021 at 11:06 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> I've noticed there are a lot of places in the btree index infrastructure (and also some other index AMs) that effectively iterate over the attributes of the index tuple, but won't use index_deform_tuple for reasons. However, this implies that they must repeatedly call index_getattr, which in the worst case is O(n) for the n-th attribute, slowing down extraction of multi-column indexes significantly. As such, I've added some API that allows for iteration (ish) over index attributes.

Interesting approach. I think that in an ideal world we would have a tuple format with attribute lengths/offsets right in the header. But it's too late for that, so other approaches seem worth considering.

> Also attached is 0002, which dynamically truncates attribute prefixes of tuples whilst _binsrch-ing through an nbtree page. It makes heavy use of the improved performance of 0001; they work very well together. The problems that Peter (cc-ed) mentions in [0] only result in invalid search bounds when traversing the tree, but on the page level valid bounds can be constructed.
>
> This is patchset 1 of a series of patches I'm starting for eventually adding static prefix truncation into nbtree infrastructure in PostgreSQL. I've put up a wiki page [1] with my current research and thoughts on that topic.

The idea of making _bt_truncate() produce new leaf page high keys based on the lastleft tuple rather than the firstright tuple (i.e. +inf truncated attribute values rather than the current -inf) seems like a non-starter. As you point out in "1.) Suffix-truncation; -INF in high keys" on the Postgres wiki page, the current approach truncates firstright (not lastleft), making the left page's new high key contain what you call a 'foreign' value. But I see that as a big advantage of the current approach.

Consider, for example, the nbtree high key "continuescan" optimization added by commit 29b64d1d. The fact that leaf page high keys are generated in this way kind of allows us to "peek" at the page to the immediate right before actually visiting it -- possibly without ever visiting it (which is where the benefit of that particular optimization lies). _bt_check_unique() uses a similar trick. After the Postgres 12 work, _bt_check_unique() will only visit a second page in the extreme case where we cannot possibly fit all of the relevant version duplicates on even one whole leaf page (i.e. practically never). There is also cleverness inside _bt_compare() to make sure that we handle the boundary cases perfectly while descending the tree.

You might also consider how the nbtsplitloc.c code works with duplicates, and how that would be affected by +inf truncated attributes. The leaf-page-packing performed in the SPLIT_SINGLE_VALUE case only goes ahead when the existing high key confirms that this must be the rightmost page. Now, I guess that you could still do something like that if we switched to +inf semantics. But the fact that the new right page will have a 'foreign' value in the SPLIT_SINGLE_VALUE-split case is also of benefit -- it is practically empty right after the split (since the original/left page is packed full), and we want this empty space to be eligible to either take more duplicates, or to take values that may happen to fit between the highly duplicated value and the original foreign high key value. We want that flexibility, I think.

I also find -inf much more natural.
If in the future we teach nbtree to truncate "inside" attributes (say text columns), you'd pretty much be doing the same thing at the level of characters rather than whole columns. The -inf semantics are like strcmp() semantics.

If you're going to pursue full prefix compression anyway, maybe you should use a low key on the leaf level in cases where the optimization is in use. This creates complexity during page deletion, because the low key in the subtree to the right of the deletion target subtree may need to be updated. Perhaps you can find a way to make that work that isn't too complicated.

> I've run some performance tests on my laptop; they test nbtree index traversal. The test is based on a recent UK land registry sales prices dataset (25744780 rows), being copied from one table into an unlogged table with disabled autovacuum, with one index as specified by the result. Master @ 99964c4a, patched is with both 0001 and 0002. The results are averages over 3 runs, with plain configure, compiled by gcc (Debian 6.3.0-18+deb9u1).

You should probably account for index size here. I have lots of my own tests for space utilization, using data from a variety of sources.

--
Peter Geoghegan
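The -inf semantics being discussed can be illustrated with a small standalone sketch (integer keys only, invented code; not the real _bt_compare()): a suffix-truncated pivot tuple behaves as if its missing attributes were minus infinity, so a search key that ties on the stored prefix is treated as greater than the pivot.

/*
 * Toy illustration of "-inf" suffix truncation semantics.
 */
#include <stdio.h>

/* Returns <0, 0 or >0, comparing the search key against a pivot tuple. */
static int
compare_key_to_pivot(const int *key, int nkeyatts,
                     const int *pivot, int npivotatts)
{
    for (int i = 0; i < npivotatts; i++)
    {
        if (key[i] != pivot[i])
            return (key[i] < pivot[i]) ? -1 : 1;
    }

    /*
     * All stored pivot attributes compare equal.  If the pivot was
     * truncated, its missing attributes count as -inf, so the fully
     * specified search key is the greater of the two.
     */
    return (npivotatts < nkeyatts) ? 1 : 0;
}

int
main(void)
{
    int         searchkey[] = {42, 7};
    int         highkey[] = {42};   /* truncated high key: (42, -inf) */

    /* prints 1: (42, 7) sorts after (42, -inf), i.e. to its right */
    printf("%d\n", compare_key_to_pivot(searchkey, 2, highkey, 1));
    return 0;
}

Under the +inf semantics debated in this thread, the tie case would return -1 instead, which is the behavioral difference at stake.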
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Matthias van de Meent
Date:
On Fri, 16 Apr 2021 at 18:03, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Apr 15, 2021 at 11:06 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > I've noticed there are a lot of places in the btree index infrastructure (and also some other index AMs) that effectively iterate over the attributes of the index tuple, but won't use index_deform_tuple for reasons. However, this implies that they must repeatedly call index_getattr, which in the worst case is O(n) for the n-th attribute, slowing down extraction of multi-column indexes significantly. As such, I've added some API that allows for iteration (ish) over index attributes.
>
> Interesting approach. I think that in an ideal world we would have a tuple format with attribute lengths/offsets right in the header.

I believe that that would indeed be ideal w.r.t. access speed, but also quite expensive w.r.t. amount of data stored. This would add 2 bytes per attribute in the current infrastructure (at least 11 bits per attribute to store offsets), on top of the current 12 bytes of overhead per index tuple (= 8 for IndexTuple + 4 for ItemId). That is probably always going to be a non-starter, seeing that we can relatively easily optimize our current attribute access patterns.

> But it's too late for that, so other approaches seem worth considering.

Yep.

> > Also attached is 0002, which dynamically truncates attribute prefixes of tuples whilst _binsrch-ing through an nbtree page. It makes heavy use of the improved performance of 0001; they work very well together. The problems that Peter (cc-ed) mentions in [0] only result in invalid search bounds when traversing the tree, but on the page level valid bounds can be constructed.
> >
> > This is patchset 1 of a series of patches I'm starting for eventually adding static prefix truncation into nbtree infrastructure in PostgreSQL. I've put up a wiki page [1] with my current research and thoughts on that topic.
>
> The idea of making _bt_truncate() produce new leaf page high keys based on the lastleft tuple rather than the firstright tuple (i.e. +inf truncated attribute values rather than the current -inf) seems like a non-starter. As you point out in "1.) Suffix-truncation; -INF in high keys" on the Postgres wiki page, the current approach truncates firstright (not lastleft), making the left page's new high key contain what you call a 'foreign' value. But I see that as a big advantage of the current approach.
>
> Consider, for example, the nbtree high key "continuescan" optimization added by commit 29b64d1d. The fact that leaf page high keys are generated in this way kind of allows us to "peek" at the page to the immediate right before actually visiting it -- possibly without ever visiting it (which is where the benefit of that particular optimization lies). _bt_check_unique() uses a similar trick. After the Postgres 12 work, _bt_check_unique() will only visit a second page in the extreme case where we cannot possibly fit all of the relevant version duplicates on even one whole leaf page (i.e. practically never). There is also cleverness inside _bt_compare() to make sure that we handle the boundary cases perfectly while descending the tree.

I understand and appreciate that the "-INF" truncation that is currently in place is being relied upon in quite some places. Part of the effort for "+INF" truncation would be determining where and how to keep the benefits of the "-INF" truncation.
I also believe that for internal pages truncating to "+INF" would be perfectly fine; the optimizations that I know of only rely on it at the leaf level. Completely separate from that, there's no reason (except for a potential lack of unused bits) we can't flag suffix-truncated columns as either "+INF" or "-INF" - that would allow us to apply each where useful.

> You might also consider how the nbtsplitloc.c code works with duplicates, and how that would be affected by +inf truncated attributes. The leaf-page-packing performed in the SPLIT_SINGLE_VALUE case only goes ahead when the existing high key confirms that this must be the rightmost page. Now, I guess that you could still do something like that if we switched to +inf semantics. But the fact that the new right page will have a 'foreign' value in the SPLIT_SINGLE_VALUE-split case is also of benefit -- it is practically empty right after the split (since the original/left page is packed full), and we want this empty space to be eligible to either take more duplicates, or to take values that may happen to fit between the highly duplicated value and the original foreign high key value. We want that flexibility, I think.
>
> I also find -inf much more natural. If in the future we teach nbtree to truncate "inside" attributes (say text columns), you'd pretty much be doing the same thing at the level of characters rather than whole columns. The -inf semantics are like strcmp() semantics.

Yes, I also read and appreciate your comments on +inf vs -inf when this came up in [0]. However, if we could choose, I think that having both options could be quite beneficial, especially when dealing with many duplicates or duplicate prefixes.

> If you're going to pursue full prefix compression anyway, maybe you should use a low key on the leaf level in cases where the optimization is in use. This creates complexity during page deletion, because the low key in the subtree to the right of the deletion target subtree may need to be updated. Perhaps you can find a way to make that work that isn't too complicated.

That would be an interesting research path as well, the cost/benefit analysis would be much trickier when comparing to the status quo.

> You should probably account for index size here. I have lots of my own tests for space utilization, using data from a variety of sources.

I'd like to mention that the current (and measured) patchset only does _logical_ dynamic prefix truncation, not the physical prefix truncation that is described on the wiki page. Physical prefix truncation will probably be a summer / fall project, and I will indeed at some point need to build a test suite that would measure the benefits, but for this patch I do not see the need for benchmarks on size, as that is not the point of these patches. These patches are useful on their own for multi-key-column btree performance (and some GiST), regardless of later patches implementing physical dynamic prefix truncation in the btree AM.

With regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAH2-Wzm_Kxm26E_DwK7AR%2BZB_-B50OMpGoO%3Dn08tD%2BqH%3DMD-zw%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
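The page-level dynamic prefix truncation being discussed can be sketched in isolation (invented toy code, not the attached patch): during a binary search over lexicographically sorted multi-attribute rows, once the rows at both search bounds are known to agree with the search key on some leading attributes, comparisons against rows between those bounds may skip that shared prefix, because every row in between must agree on it too.

/*
 * Toy sketch of prefix skipping during a page-level binary search.
 * Rows are arrays of NATTS integers, sorted lexicographically; this is
 * not real nbtree code, only an illustration of the invariant it uses.
 */
#include <stdio.h>

#define NATTS 3

/*
 * Compare key against row, starting at attribute 'skip' (the caller
 * guarantees the first 'skip' attributes are equal).  On return, *neq
 * holds the number of leading attributes now known to be equal.
 */
static int
tuple_cmp(const int *key, const int *row, int skip, int *neq)
{
    for (int i = skip; i < NATTS; i++)
    {
        if (key[i] != row[i])
        {
            *neq = i;
            return (key[i] < row[i]) ? -1 : 1;
        }
    }
    *neq = NATTS;
    return 0;
}

/* Return the index of the first row >= key. */
static int
binsrch(const int rows[][NATTS], int nrows, const int *key)
{
    int         lo = 0, hi = nrows;
    int         lo_eq = 0, hi_eq = 0;   /* prefix known equal at each bound */

    while (lo < hi)
    {
        int         mid = lo + (hi - lo) / 2;
        int         skip = (lo_eq < hi_eq) ? lo_eq : hi_eq;
        int         neq;
        int         cmp = tuple_cmp(key, rows[mid], skip, &neq);

        if (cmp > 0)
        {
            lo = mid + 1;       /* rows[lo - 1] now matches key on neq attrs */
            lo_eq = neq;
        }
        else
        {
            hi = mid;           /* rows[hi] now matches key on neq attrs */
            hi_eq = neq;
        }
    }
    return lo;
}

int
main(void)
{
    const int   rows[][NATTS] = {
        {5, 2, 0}, {5, 2, 1}, {5, 2, 3}, {5, 2, 9}, {5, 3, 0}
    };
    const int   key[NATTS] = {5, 2, 4};

    /* prints 3: the first row >= (5, 2, 4) is (5, 2, 9) */
    printf("%d\n", binsrch(rows, 5, key));
    return 0;
}

The invariant only holds between two concrete tuples that were actually compared on the same page, which matches the thread's point that bounds established while descending the tree cannot be reused this way, while page-local bounds can.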
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Peter Geoghegan
Date:
On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > Interesting approach. I think that in an ideal world we would have a tuple format with attribute lengths/offsets right in the header.
>
> I believe that that would indeed be ideal w.r.t. access speed, but also quite expensive w.r.t. amount of data stored. This would add 2 bytes per attribute in the current infrastructure (at least 11 bits per attribute to store offsets), on top of the current 12 bytes of overhead per index tuple (= 8 for IndexTuple + 4 for ItemId). That is probably always going to be a non-starter, seeing that we can relatively easily optimize our current attribute access patterns.

I don't think that that's why it's a non-starter. This design assumes a world in which everything has already been optimized for this layout. You no longer get to store the varlena header inline, which would break a lot of things in Postgres were it ever to be attempted. The space efficiency issues don't really apply because you have an offset for fixed-length types -- their presence is always implied. I think that you need to encode NULLs differently, which is a lot less space efficient when there are a lot of NULLs. But on the whole this design seems more efficient than what we have currently.

This representation of index tuples would be a totally reasonable design were we in a green field situation. (Which is pretty far from the situation we're actually in, of course.)

> I understand and appreciate that the "-INF" truncation that is currently in place is being relied upon in quite some places. Part of the effort for "+INF" truncation would be determining where and how to keep the benefits of the "-INF" truncation. I also believe that for internal pages truncating to "+INF" would be perfectly fine; the optimizations that I know of only rely on it at the leaf level.

I don't doubt that there is nothing special about -inf from a key space point of view. Actually... you could say that -inf is special to the limited extent that we know it only appears in pivot tuples, and exploit that property when the !pivotsearch case/optimization is used. But that isn't much of an exception at a high level, so whatever.

Anyway, it follows that +inf could in principle be used instead in some or all cases -- all that is truly essential for correctness is that the invariants always be respected. We're still in agreement up until here.

> Yes, I also read and appreciate your comments on +inf vs -inf when this came up in [0].

I'm impressed that you've done your homework on this.

> However, if we could choose, I think that having both options could be quite beneficial, especially when dealing with many duplicates or duplicate prefixes.

This is where things are much less clear -- maybe we're not in agreement here. Who knows, though -- maybe you're right. But you haven't presented any kind of argument. I understand that it's hard to articulate what effects might be in play with stuff like this, so I won't force the issue now. Strong evidence is of course the only way that you'll reliably convince me of this.

I should point out that I am a little confused about how this +inf business could be both independently useful and pivotal to implementing [dynamic] prefix truncation/compression. Seems... weird to discuss them together, except maybe to mention in passing that this +inf thing is notable for particularly helping dynamic prefix stuff -- which is it?
It is my strong preference that nbtsplitloc.c continue to know approximately nothing about compression or deduplication. While it is true that nbtsplitloc.c's _bt_recsplitloc() is aware of posting lists, this is strictly an optimization that is only justified by the fact that posting lists are sometimes very large, and therefore worth considering directly -- just to get a more accurate idea of how a relevant split point choice affects the balance of free space (we don't bother to do the same thing with non-key INCLUDE columns because they're generally small and equi-sized). And so this _bt_recsplitloc() thing is no exception to the general rule, which is: deduplication/posting list maintenance should be *totally* orthogonal to the page split choice logic (the design of posting list splits helps a lot with that). We can afford to have complicated split point choice logic because the question of which split point is optimal is totally decoupled from the question of which are correct -- in particular, from the correctness of the space accounting used to generate candidate split points.

It may interest you to know that I once thought that it would be nice to have the *option* of +inf too, so that we could use it in very rare cases like the pathological SPLIT_MANY_DUPLICATES case that _bt_bestsplitloc() has some defenses against. It would perhaps be nice if we could use +inf selectively in that case. I never said anything about this publicly before now, mostly because it wasn't that important -- pathological behaviors like this have never been reported on by users a full 18 months after the release of 12.0, so it's unlikely to be a real concern.

> > If you're going to pursue full prefix compression anyway, maybe you should use a low key on the leaf level in cases where the optimization is in use. This creates complexity during page deletion, because the low key in the subtree to the right of the deletion target subtree may need to be updated. Perhaps you can find a way to make that work that isn't too complicated.
>
> That would be an interesting research path as well, the cost/benefit analysis would be much trickier when comparing to the status quo.

I'd say that's unclear right now.

> > You should probably account for index size here. I have lots of my own tests for space utilization, using data from a variety of sources.
>
> I'd like to mention that the current (and measured) patchset only does _logical_ dynamic prefix truncation, not the physical prefix truncation that is described on the wiki page.

If you change how _bt_truncate() behaves in any way (e.g. sometimes it's lastleft/+inf based now), and nothing else, you're still bound to change the space utilization with the tests that I maintain -- though perhaps only at the level of noise. I sometimes call these tests "wind tunnel tests". It turns out that you can simulate rather a lot about a real complicated workload with simple, deterministic, serial test cases -- provided you're only interested in the space utilization. This helped a lot for both the Postgres 12 and Postgres 13 stuff (though not the Postgres 14 stuff).

> These patches are useful on their own for multi-key-column btree performance (and some GiST), regardless of later patches implementing physical dynamic prefix truncation in the btree AM.

Have you isolated the performance impact of the first patch at all? Can you characterize how well it works on its own, perhaps just informally?
It would be convenient if the first patch could be treated as an independent thing. -- Peter Geoghegan
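For readers unfamiliar with the "green field" format being debated, here is a rough, purely hypothetical sketch of a tuple header that stores per-attribute offsets up front. The struct name and field sizes are invented for illustration and this is not PostgreSQL's on-disk IndexTuple layout; it only shows where the roughly 2 extra bytes per attribute would go, and why NULLs would need a different encoding than the current null bitmap.

#include <stdint.h>

/*
 * Hypothetical index tuple header with attribute offsets in the header,
 * so that any attribute can be located in O(1).  Illustration only.
 */
typedef struct GreenFieldIndexTuple
{
    uint8_t     t_tid[6];       /* heap item pointer: block number + offset */
    uint16_t    t_info;         /* flag bits and attribute count */

    /*
     * One entry per attribute: byte offset of that attribute's data
     * relative to the start of the data area.  A reserved value such as
     * 0xFFFF could stand in for NULL, replacing the null bitmap.  These
     * entries are the extra per-attribute overhead discussed above.
     */
    uint16_t    t_attoffsets[]; /* C99 flexible array member */
} GreenFieldIndexTuple;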
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Matthias van de Meent
Date:
On Sat, 17 Apr 2021 at 01:05, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > > Interesting approach. I think that in an ideal world we would have a tuple format with attribute lengths/offsets right in the header.
> >
> > I believe that that would indeed be ideal w.r.t. access speed, but also quite expensive w.r.t. amount of data stored. This would add 2 bytes per attribute in the current infrastructure (at least 11 bits per attribute to store offsets), on top of the current 12 bytes of overhead per index tuple (= 8 for IndexTuple + 4 for ItemId). That is probably always going to be a non-starter, seeing that we can relatively easily optimize our current attribute access patterns.
>
> I don't think that that's why it's a non-starter. This design assumes a world in which everything has already been optimized for this layout. You no longer get to store the varlena header inline, which would break a lot of things in Postgres were it ever to be attempted. The space efficiency issues don't really apply because you have an offset for fixed-length types -- their presence is always implied. I think that you need to encode NULLs differently, which is a lot less space efficient when there are a lot of NULLs. But on the whole this design seems more efficient than what we have currently.

I believe that that depends on your definition of 'efficiency'. For storage efficiency, the current design is quite good: except for the varlena header size of 4 bytes for attributes > 127 bytes (which could be 2 bytes, because pages cannot be larger than 64 KiB -- actually 32 KiB -- with our current design), all attributes use just about the least data possible. For access efficiency / code complexity, you're probably right that storing attribute offsets in the tuple header is preferable, but such a design would still need some alignment calls, or would need to store attribute lengths as well to prevent reading the next attribute's alignment padding into a variable-length attribute, at an additional overhead of up to 2 bytes per attribute.

> This representation of index tuples would be a totally reasonable design were we in a green field situation. (Which is pretty far from the situation we're actually in, of course.)

That might indeed be the case, assuming a green field with different or no processing architecture or storage limitations. CPU-to-storage bandwidth can be (and often is) a bottleneck, as well as alignment.

> > I understand and appreciate that the "-INF" truncation that is currently in place is being relied upon in quite some places. Part of the effort for "+INF" truncation would be determining where and how to keep the benefits of the "-INF" truncation. I also believe that for internal pages truncating to "+INF" would be perfectly fine; the optimizations that I know of only rely on it at the leaf level.
>
> I don't doubt that there is nothing special about -inf from a key space point of view. Actually... you could say that -inf is special to the limited extent that we know it only appears in pivot tuples, and exploit that property when the !pivotsearch case/optimization is used. But that isn't much of an exception at a high level, so whatever.
>
> Anyway, it follows that +inf could in principle be used instead in some or all cases -- all that is truly essential for correctness is that the invariants always be respected.
> We're still in agreement up until here.

Agreed.

> > Yes, I also read and appreciate your comments on +inf vs -inf when this came up in [0].
>
> I'm impressed that you've done your homework on this.
>
> > However, if we could choose, I think that having both options could be quite beneficial, especially when dealing with many duplicates or duplicate prefixes.
>
> This is where things are much less clear -- maybe we're not in agreement here. Who knows, though -- maybe you're right. But you haven't presented any kind of argument. I understand that it's hard to articulate what effects might be in play with stuff like this, so I won't force the issue now. Strong evidence is of course the only way that you'll reliably convince me of this.
>
> I should point out that I am a little confused about how this +inf business could be both independently useful and pivotal to implementing [dynamic] prefix truncation/compression. Seems... weird to discuss them together, except maybe to mention in passing that this +inf thing is notable for particularly helping dynamic prefix stuff -- which is it?

I agree that my reasoning might have been unclear and confusing. I mean that most benefits that we could receive from +inf would be in improving the ability to apply [dynamic] prefix truncation on a page, by limiting the keyspace of that page to 'local' values. If prefix truncation is impossible / does not apply for some index (a single unique column !allequalimage index is a likely worst-case scenario), then applying +inf would potentially be detrimental to the performance of certain other optimizations (e.g. the continuescan optimization), in which case using -inf would probably be preferable. Ergo, I'm planning on making _bt_recsplitloc aware of +inf and -inf after implementing physical prefix truncation, and allowing it to decide if and when each should be applied, if it turns out it consistently improves space and/or time performance without significantly decreasing either.

> It is my strong preference that nbtsplitloc.c continue to know approximately nothing about compression or deduplication. While it is true that nbtsplitloc.c's _bt_recsplitloc() is aware of posting lists, this is strictly an optimization that is only justified by the fact that posting lists are sometimes very large, and therefore worth considering directly -- just to get a more accurate idea of how a relevant split point choice affects the balance of free space (we don't bother to do the same thing with non-key INCLUDE columns because they're generally small and equi-sized). And so this _bt_recsplitloc() thing is no exception to the general rule, which is: deduplication/posting list maintenance should be *totally* orthogonal to the page split choice logic (the design of posting list splits helps a lot with that). We can afford to have complicated split point choice logic because the question of which split point is optimal is totally decoupled from the question of which are correct -- in particular, from the correctness of the space accounting used to generate candidate split points.

I would argue that it also knows about duplicate attributes and shared prefixes? It already optimizes (unintentionally?) for deduplication by choosing split points between two runs of equal values. I believe that implementing the same for prefixes (if not already in place) would not stand out too much. I think we can discuss that more extensively when we actually have code that would benefit from that.
> It may interest you to know that I once thought that it would be nice to have the *option* of +inf too, so that we could use it in very rare cases like the pathological SPLIT_MANY_DUPLICATES case that _bt_bestsplitloc() has some defenses against. It would perhaps be nice if we could use +inf selectively in that case. I never said anything about this publicly before now, mostly because it wasn't that important -- pathological behaviors like this have never been reported on by users a full 18 months after the release of 12.0, so it's unlikely to be a real concern.

I do not per se disagree, but I should note that the amazing work on btree page split prevention through 'heapkeyspace', deduplication and eager tuple deletion has changed some key behaviours of btree index pages. The same would likely occur once physical prefix truncation is implemented, and in that case I believe that some decisions that were previously non-problematic might need to be re-examined.

> > > If you're going to pursue full prefix compression anyway, maybe you should use a low key on the leaf level in cases where the optimization is in use. This creates complexity during page deletion, because the low key in the subtree to the right of the deletion target subtree may need to be updated. Perhaps you can find a way to make that work that isn't too complicated.
> >
> > That would be an interesting research path as well, the cost/benefit analysis would be much trickier when comparing to the status quo.
>
> I'd say that's unclear right now.

I agree. My 'trickier' pointed to the fact that "adding an extra non-key tuple to the page" needs solid understanding and reasoning about the use of the AM to prove that it's worth the extra metadata on the page. Proving that is, in my opinion, difficult.

> > > You should probably account for index size here. I have lots of my own tests for space utilization, using data from a variety of sources.
> >
> > I'd like to mention that the current (and measured) patchset only does _logical_ dynamic prefix truncation, not the physical prefix truncation that is described on the wiki page.
>
> If you change how _bt_truncate() behaves in any way (e.g. sometimes it's lastleft/+inf based now), and nothing else, you're still bound to change the space utilization with the tests that I maintain -- though perhaps only at the level of noise. I sometimes call these tests "wind tunnel tests". It turns out that you can simulate rather a lot about a real complicated workload with simple, deterministic, serial test cases -- provided you're only interested in the space utilization. This helped a lot for both the Postgres 12 and Postgres 13 stuff (though not the Postgres 14 stuff).

I would be interested in running these benchmarks when I get to updating the physical format. Good to know there are relatively easy tests available.

> > These patches are useful on their own for multi-key-column btree performance (and some GiST), regardless of later patches implementing physical dynamic prefix truncation in the btree AM.
>
> Have you isolated the performance impact of the first patch at all? Can you characterize how well it works on its own, perhaps just informally?

The REINDEX performance results are where attribute iteration shines best, as the hot path in reindex is tuple comparison, which used index_getattr heavily, and dynamic prefix truncation is not applicable there (yet?).
Its time spent went down by over 6% for the indexes with 3 key columns of variable length, whereas the indexes with only a single fixed-size attribute took only slightly longer (+0.6% avg in 3 runs on a laptop, high variance). I have not tested it with GiST, but I believe that similar results are realistic there as well for varlen attributes.

> It would be convenient if the first patch could be treated as an independent thing.

Patch 0002 was the reason for writing 0001, and uses the performance improvements of 0001 to show its worth. As such, I submitted them as a set. If you'd like, I could submit 0002 separately?

With regards,

Matthias van de Meent

[+] Instead of starting _binsrch with only the high key compare result, we could also eagerly compare the search key to the lowest key. This way, we have high+low bounds for the whole page, instead of having that only after finding a key < searchkey on the page. The effort might just as well not be worth it, as it is one extra key compare (out of a max of 9 on a page, plus one highkey).
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Matthias van de Meent
Date:
On Fri, 23 Apr 2021 at 12:45, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Sat, 17 Apr 2021 at 01:05, Peter Geoghegan <pg@bowt.ie> wrote:
> > It would be convenient if the first patch could be treated as an independent thing.
>
> Patch 0002 was the reason for writing 0001, and uses the performance improvements of 0001 to show its worth. As such, I submitted them as a set. If you'd like, I could submit 0002 separately?

For now, here is version 2 of the patchset, to make MSVC and cfbot happy (it only fixes the compilation issues; no significant changes). I'll try to benchmark the patches in this patchset (both 0001, and 0001+0002) in the upcoming weekend.

Kind regards,

Matthias van de Meent.
Attachment
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Matthias van de Meent
Date:
On Thu, 17 Jun 2021 at 17:14, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> I'll try to benchmark the patches in this patchset (both 0001, and 0001+0002) in the upcoming weekend.

Somewhat delayed, benchmark results are attached. These are based on 7 iterations of the attached benchmark script ('scratch.sql'), with the latest 'UK Price Paid' dataset. Again, the index_test table is an unlogged copy of the land_registry_price_paid_uk table, with one additional trailing always_null column.

Results for 0001 are quite good in the target area of multi-column indexes in which attcacheoff cannot be used (2-4% for insertion workloads, 4-12% for reindex workloads), but they regress slightly for the single unique column insertion test, and are quite a bit worse for both the insert and reindex cases on the attcacheoff-enabled multi-column index (4% and 18% respectively (!)).

With 0001+0002, further improvements are made in the target area (now 4-7% for the various insertion workloads, 5-14% for reindex). The regression in the insert and reindex workloads on attcacheoff-enabled multi-column indexes is still substantial, but slightly less bad (down to a 2% and 15% degradation respectively).

Evidently, this needs improvements in the (likely common) attcacheoff-enabled multi-column case, as I don't think we can reasonably commit a 10+% regression. I'll work on that this weekend.

Kind regards,

Matthias van de Meent

Benchmarks were all performed on WSL2 running Debian 10, on an AMD 5950X, with shared_buffers = 15GB (which should fit the dataset three times), enable_indexscan = off, autovacuum disabled, and parallel workers disabled on the tables, so that the results should be about as stable as it gets.
Attachment
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
From
Daniel Gustafsson
Date:
> On 24 Jun 2021, at 18:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Thu, 17 Jun 2021 at 17:14, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> I'll try to benchmark the patches in this patchset (both 0001, and 0001+0002) in the upcoming weekend.
>
> Somewhat delayed, benchmark results are attached.

I'm moving this patch to the next CF to allow for more review of the latest revision and benchmark results. It no longer applies though, so please post a rebased version.

--
Daniel Gustafsson   https://vmware.com/