Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
Date
Msg-id CAH2-WzmqQPoRXwMP=ko8wDj49evFqU+kYYu_vB4S=6BiLcbnDg@mail.gmail.com
Whole thread Raw
In response to Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
List pgsql-hackers
On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > Interesting approach. I think that in an ideal world we would have a
> > tuple format with attribute lengths/offsets right in the header.
>
> I believe that that would indeed be ideal w.r.t. access speed, but
> also quite expensive w.r.t. amount of data stored. This would add 2
> bytes per attribute in the current infrastructure (11 bits at least
> for each attribute to store offsets), on the current 12 bytes of
> overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is
> probably always going to be a non-starter, seeing that we can
> relatively easily optimize our current attribute access patterns.

I don't think that that's why it's a non-starter. This design assumes
a world in which everything has already been optimized for this
layout. You no longer get to store the varlena header inline, which
would break a lot of things in Postgres were it ever to be attempted.
The space efficiency issues don't really apply because you have an
offset for fixed-length types -- their presence is always implied. I
think that you need to encode NULLs differently, which is a lot less
space efficient when there are a lot of NULLs. But on the whole this
design seems more efficient than what we have currently.

This representation of index tuples would be a totally reasonable
design were we in a green field situation. (Which is pretty far from
the situation we're actually in, of course.)

> I understand and appreciate that the "-INF" truncation that is
> currently in place is being relied upon in quite some places. Part of
> the effort for "+INF" truncation would be determining where and how to
> keep the benefits of the "-INF" truncation. I also believe that for
> internal pages truncating to "+INF" would be perfectly fine; the
> optimizations that I know of only rely on it at the leaf level.

I don't doubt that there is nothing special about -inf from a key
space point of view. Actually...you could say that -inf is special to
the limited extent that we know it only appears in pivot tuples and
exploit that property when the !pivotsearch case/optimization is used.
But that isn't much of an exception at a high level, so whatever.

Anyway, it follows that +inf could in principle be used instead in
some or all cases -- all that is truly essential for correctness is
that the invariants always be respected. We're still in agreement up
until here.

> Yes, I also read and appreciate your comments on +inf vs -inf when
> this came up in [0].

I'm impressed that you've done your homework on this.

> However, if we could choose, I think that having
> both options could be quite beneficial, especially when dealing with
> many duplicates or duplicate prefixes.

This is where things are much less clear -- maybe we're not in
agreement here. Who knows, though -- maybe you're right. But you
haven't presented any kind of argument. I understand that it's hard to
articulate what effects might be in play with stuff like this, so I
won't force the issue now. Strong evidence is of course the only way
that you'll reliably convince me of this.

I should point out that I am a little confused about how this +inf
business could be both independently useful and pivotal to
implementing [dynamic] prefix truncation/compression. Seems...weird to
discuss them together, except maybe to mention in passing that this
+inf thing is notable for particularly helping dynamic prefix stuff --
which is it?

It is my strong preference that nbtsplitloc.c continue to know
approximately nothing about compression or deduplication. While it is
true that nbtsplitloc.c's _bt_recsplitloc() is aware of posting lists,
this is strictly an optimization that is only justified by the fact
that posting lists are sometimes very large, and therefore worth
considering directly -- just to get a more accurate idea of how a
relevant split point choice affects the balance of free space (we
don't bother to do the same thing with non-key INCLUDE columns because
they're generally small and equi-sized). And so this _bt_recsplitloc()
thing no exception to the general rule, which is:
deduplication/posting list maintenance should be *totally* orthogonal
to the page split choice logic (the design of posting list splits
helps a lot with that). We can afford to have complicated split point
choice logic because the question of which split point is optimal is
totally decoupled from the question of which are correct -- in
particular, from the correctness of the space accounting used to
generate candidate split points.

It may interest you to know that I once thought that it would be nice
to have the *option* of +inf too, so that we could use it in very rare
cases like the pathological SPLIT_MANY_DUPLICATES case that
_bt_bestsplitloc() has some defenses against. It would perhaps be nice
if we could use +inf selectively in that case. I never said anything
about this publicly before now, mostly because it wasn't that
important -- pathological behaviors like this have never been reported
on by users a full 18 months after the release of 12.0, so it's
unlikely to be a real concern.

> > If you're going to pursue full prefix compression anyway, maybe you
> > should use a low key on the leaf level in cases where the optimization
> > is in use. This creates complexity during page deletion, because the
> > low key in the subtree to the right of the deletion target subtree may
> > need to be updated. Perhaps you can find a way to make that work that
> > isn't too complicated.
>
> That would be an interesting research path as well, the cost/benefit
> analysis would be much trickier when comparing to the status quo.

I'd say that's unclear right now.

> > You should probably account for index size here. I have lots of my own
> > tests for space utilization, using data from a variety of sources.
>
> I'd like to mention that the current (and measured) patchset only does
> _logical_ dynamic prefix truncation, not the physical prefix
> truncation that is described on the wiki page.

If you change how _bt_truncate() behaves in any way (e.g. sometimes
it's lastleft/+inf based now), and nothing else, you're still bound to
change the space utilization with the tests that I maintain -- though
perhaps only at the level of noise. I sometimes call these tests "wind
tunnel tests". It turns out that you can simulate rather a lot about a
real complicated workload with simple, deterministic, serial test
cases -- provided you're only interested in the space utilization.
This helped a lot for both the Postgres 12 and Postgres 13 stuff
(though not the Postgres 14 stuff).

> These patches are
> useful on their own for multi-key-column btree performance (and some
> GIST), regardless of later patches implementing physical dynamic
> prefix truncation in the btree AM.

Have you isolated the performance impact of the first patch at all?
Can you characterize how well it works on its own, perhaps just
informally? It would be convenient if the first patch could be treated
as an independent thing.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Bogus collation version recording in recordMultipleDependencies
Next
From: Tomas Vondra
Date:
Subject: PATCH: generate fractional cheapest paths in generate_orderedappend_path