Thread: "Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

On Wed, Jul 4, 2018 at 9:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'm starting this new thread to discuss the benefits of B-Tree suffix
> truncation, and to demonstrate how effective it can be at increasing
> fan-out and space utilization with a real-world example. I haven't
> explained why suffix truncation matters very well before now. Now that
> I have a patch that works, I can just show the benefit directly. (By
> the way, there are similar benefits to covering INCLUDE indexes, where
> suffix truncation occurs all the time, and not just
> opportunistically.)

> Explanation
> -----------
>
> Pivot tuples describe how the key space will be split up, which will
> *hopefully* be balanced for future insertions. So, apart from the
> obvious issue about the size of pivot tuples, there is a more
> important issue: why unnecessarily commit to certain exact boundaries
> early on, before it's clear how well that will balance values that get
> inserted later?

Actually, this particular example does not really show why general
suffix truncation is important. My example did manage to make an index
15% smaller in a practical, workable way, but traditional suffix
truncation deserves far less credit for that than I initially claimed.
My explanation was about 99% wrong, but my example is still valid. The
true explanation for why my patch (the pending v3 of my
unique-key-heap-TID patch) avoided so much bloat is very interesting,
because it is related to a broader problem. I'll show a simple example
where the bloat that my patch can avoid is far greater still,
involving a simple single-attribute secondary index.

Before I get to that example, I'll give the real explanation. The real
reason for the marked decrease in the level of bloating is that my
patch makes insertions into secondary indexes (non-unique indexes)
jump immediately to the leaf page that the tuple unambiguously belongs
on -- since it now has a unique key, there must be one exact page that
the value is supposed to go on. We avoid trying to find a place among
pages full of logical duplicates, and so we avoid the risk of "getting
tired" [1] and giving up on finding free space that is actually
available to us. "Getting tired" tends to produce really inefficient
space utilization among leaf pages full of duplicates, at least past a
certain tipping point.
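
To make the "getting tired" tipping point more concrete, here is a tiny
standalone C program (a toy model only, not nbtree code) that assumes
the roughly 1-in-100 per-page chance of giving up that the current code
uses, and measures how many duplicate-filled pages a typical insertion
is willing to walk before it gives up and splits a page:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(void)
{
    const int trials = 1000000;
    long total = 0;

    srand((unsigned) time(NULL));
    for (int i = 0; i < trials; i++)
    {
        int pages_visited = 0;

        /* keep moving right until we "get tired" (1-in-100 per page) */
        while (rand() % 100 != 0)
            pages_visited++;
        total += pages_visited;
    }

    /*
     * Prints roughly 99: once a run of duplicates spans hundreds of
     * pages, free space that is far to the right is almost never
     * reached, so we split a page near the start of the run instead.
     */
    printf("average pages visited before giving up: %.1f\n",
           (double) total / trials);
    return 0;
}

(The ~1% figure matches the "probability of moving right is set at
0.99" behavior in the code at [1]; everything else here is made up
purely for illustration.)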

The whole "getting tired" thing is the root of the problem here, which
is why the pending v3 of my patch will remove that code completely
(_bt_findinsertloc() is streamlined). The "tipping point" nature of
getting tired is of particular concern to me, since it sounds like
something that could have been a factor in Uber's much-publicized blog
post.   :-(

I attach the test case I mentioned; its output is shown below, with my
commentary in parentheses (note that there are "\echo" psql statements
whose output you'll also see):

$ psql -f testcase.sql
autovacuum should probably be disabled:
 autovacuum
────────────
 off
(1 row)

psql:testcase.sql:3: NOTICE:  table "sec_idx_bloat" does not exist, skipping
DROP TABLE
CREATE TABLE

(Created empty table named "sec_idx_bloat", with a single int4 attribute.)

CREATE INDEX

(Created empty index named "bloated" on that attribute.)

Initial insert of 1e7 sequential values:
INSERT 0 10000000

"bloated" size on master: 214 MB
"bloated" size with patch: 214 MB

Initial insert of the value 42 1e7 times:
INSERT 0 10000000

"bloated" size on master: 486 MB
"bloated" size with patch: 430 MB

Repeated insertion of the value 42 1e7 times:
INSERT 0 10000000

"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

Delete + VACUUM away most of the dups:
DELETE 18001072
VACUUM

"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

(That is, VACUUM hasn't made either case have a smaller index yet,
though it probably gave back many more whole index pages to the FSM in
the case of the patch.)

Third insertion of the value 42 1e7 times:
INSERT 0 10000000

(This is where it gets really interesting, because things truly fall
apart for master, whereas the patch case manages to reuse existing
free space in all cases.)

"bloated" size on master: 1029 MB
"bloated" size with patch: 645 MB

Fourth insertion of the value 42 1e7 times:
INSERT 0 10000000

"bloated" size on master: 1301 MB
"bloated" size with patch: 688 MB

(The patch can still reuse almost all of the space left behind by
VACUUM, though since the index is now a bit bigger than the original
high watermark size of 645 MB it's not perfectly efficient. Master
falls even further behind, though.)


To summarize: recognizing that bloat in indexes is correlated with
bloat in tables allows us to recycle space in both structures
cooperatively, which can be very important.

While this example focuses on bloat, there is also plenty to be unhappy
about around buffer lock contention, and around the number of pages
that will be dirtied. Random choices about which page to dirty lead to
increased random I/O when checkpointing.

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtinsert.c;h=907cce072412adf88d41ce9317a795fb25820be2;hb=refs/heads/REL_11_STABLE#l694
--
Peter Geoghegan

On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> The whole "getting tired" thing is the root of the problem here, which
> is why the pending v3 of my patch will remove that code completely
> (_bt_findinsertloc() is streamlined).

Peter,

This seems like really interesting and important work.  I wouldn't
have foreseen that the "getting tired" code would have led to this
kind of bloat (even if I had known about it at all).  I wonder,
though, whether it's possible that the reverse could happen in some
other scenario.  It seems to me that with the existing code, if you
reinsert a value many copies of which have been deleted, you'll
probably find partially-empty pages whose free space can be reused,
but if there's one specific place where each tuple needs to go, you
might end up having to split pages if the new TIDs are all larger or
smaller than the old TIDs.

I'm really glad that you are working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jul 8, 2018 at 7:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> The whole "getting tired" thing is the root of the problem here, which
>> is why the pending v3 of my patch will remove that code completely
>> (_bt_findinsertloc() is streamlined).

> This seems like really interesting and important work.  I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all).  I wonder,
> though, whether it's possible that the reverse could happen in some
> other scenario.  It seems to me that with the existing code, if you
> reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

Yeah ... if memory serves, there were specific usage patterns where
that hack made things way better than they'd been before.  (I do not
recall if the hack itself was mine, but I think I can be blamed for
the "getting tired" comment ...)  I'd suggest git blaming your way
to the commit that put that in, and then checking the hackers archives
around that date for more info.

            regards, tom lane


On Tue, Jul 17, 2018 at 1:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This seems like really interesting and important work.  I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all).

Thanks!

I'm glad that I can come up with concrete, motivating examples like
this, because it's really hard to talk about the big picture here.
With something like a pgbench workload, there are undoubtedly many
different factors in play, since temporal locality influences many
different things all at once. I don't think that I understand it all
just yet. Taking a holistic view of the problem seems very helpful,
but it's also very frustrating at times.

> I wonder,
> though, whether it's possible that the reverse could happen in some
> other scenario.  It seems to me that with the existing code, if you
> reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

That's a legitimate concern. After all, what I've done boils down to
adding a restriction on space utilization that wasn't there before.
This clearly helps because it makes it practical to rip out the
"getting tired" thing, but that's not everything. There are good
reasons for that hack, but if latency magically didn't matter then we
could just tear the hack out without doing anything else. That would
make groveling through pages full of duplicates at least as discerning
about space utilization as my patch manages to be, without any of the
complexity.

There is actually a flipside to that downside, though (i.e. the
downside is also an upside): While not filling up leaf pages that have
free space on them is bad, it's only bad when it doesn't leave the
pages completely empty. Leaving the pages completely empty is actually
good, because then VACUUM is in a position to delete entire pages,
removing their downlinks from parent pages. That's a variety of bloat
that we can reverse completely. I suspect that you'll see far more of
that favorable case in the real world with my patch. It's pretty much
impossible to do page deletions with pages full of duplicates today,
because the roughly-uniform distribution of still-live tuples among
leaf pages fails to exhibit any temporal locality. So, maybe my patch
would still come out ahead of simply ripping out "getting tired" in
this parallel universe where latency doesn't matter, and space
utilization is everything.

I made one small mistake with my test case: It actually *is* perfectly
efficient at recycling space even at the end, since I don't delete all
the duplicates (just 90% of them). Getting tired might have been a
contributing factor there, too.

-- 
Peter Geoghegan


On Tue, Jul 17, 2018 at 1:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah ... if memory serves, there were specific usage patterns where
> that hack made things way better than they'd been before.  (I do not
> recall if the hack itself was mine, but I think I can be blamed for
> the "getting tired" comment ...)  I'd suggest git blaming your way
> to the commit that put that in, and then checking the hackers archives
> around that date for more info.

I've done plenty of research into the history of this hack. It was
your work, but it does actually make sense in the context of today's
nbtree code. It is essential with scankey-wise duplicates, since
groveling through hundreds or even thousands of pages full of
duplicates to find free space (and avoid a page split) is going to
have a very serious downside for latency.

Vadim wanted to fix the problem by more or less doing what I propose
[1], though he never got into figuring out how to make that practical
(i.e. how to make it not bloat up internal pages, which must represent
heap TID as just another part of the key space). Having unique keys
isn't just assumed by Lehman & Yao; I think that it's assumed by most
or all real-world B-Tree implementations.

[1] https://www.postgresql.org/message-id/18788.963953289@sss.pgh.pa.us
-- 
Peter Geoghegan


Peter Geoghegan <pg@bowt.ie> writes:
> I've done plenty of research into the history of this hack. It was
> your work, but it does actually make sense in the context of today's
> nbtree code. It is essential with scankey-wise duplicates, since
> groveling through hundreds or even thousands of pages full of
> duplicates to find free space (and avoid a page split) is going to
> have a very serious downside for latency.

Well, the actual problem was O(N^2) behavior:

https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6

I certainly have no objection to improving matters, but let's be sure
we don't re-introduce any two-decade-old problems.

            regards, tom lane


On Tue, Jul 17, 2018 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6

Oh, yeah. We still put the new tuple to the left of all existing
duplicates on the leaf that we decide to insert on, because "we'll
insert it before any existing equal keys because of the way
_bt_binsrch() works". I actually plan on mostly fixing that aspect of
_bt_binsrch() while I'm at it, which is why I didn't think to frame it
that way.

It is claimed that we need to do this _bt_binsrch() go-left-on-equal
thing for internal pages because we allow duplicates, contrary to L&Y.
I find it much easier to think of it as being necessary due to index
scans where only some columns are specified in the initial insertion
scan key that finds the initial leaf page (i.e. the _bt_first()
stuff). I'm not going to touch the _bt_binsrch() go-left-on-equal
thing, because it's actually necessary for any real world
implementation that has multiple columns in tuples. Fortunately, we
can usually avoid the go-left-on-equal thing for internal pages by
avoiding equality -- a sentinel heap scan TID value may be used for
this in many cases.
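
To illustrate the go-left-on-equal thing outside of nbtree (a schematic
sketch only, not the actual _bt_binsrch()): a binary search where
"equal" means "keep looking left" always lands on the first of a run of
equal keys, which is why new duplicates currently pile up at the start
of the run:

#include <stdio.h>

/* returns the first position whose key is >= probe ("go left on equal") */
static int
search_go_left_on_equal(const int *keys, int nkeys, int probe)
{
    int lo = 0;
    int hi = nkeys;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (keys[mid] < probe)
            lo = mid + 1;
        else
            hi = mid;           /* equal counts as "go left" */
    }
    return lo;
}

int
main(void)
{
    int keys[] = {10, 42, 42, 42, 42, 99};

    /* a search for 42 lands at position 1, the leftmost 42 */
    printf("start at position %d\n", search_go_left_on_equal(keys, 6, 42));
    return 0;
}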

The Lehman and Yao paper is badly written. They should have talked
about the _bt_binsrch() go-left-on-equal thing, rather than leaving it
to the reader to figure it out. It's not why L&Y require unique keys,
though.

Sorry if all this detail isn't useful right now. The important point
is that I actually think that this code gets most things right
already.

> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

A mountain of work remains to validate the performance of the patch.

-- 
Peter Geoghegan


On Tue, Jul 17, 2018 at 5:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There is actually a flipside to that downside, though (i.e. the
> downside is also an upside): While not filling up leaf pages that have
> free space on them is bad, it's only bad when it doesn't leave the
> pages completely empty. Leaving the pages completely empty is actually
> good, because then VACUUM is in a position to delete entire pages,
> removing their downlinks from parent pages. That's a variety of bloat
> that we can reverse completely. I suspect that you'll see far more of
> that favorable case in the real world with my patch. It's pretty much
> impossible to do page deletions with pages full of duplicates today,
> because the roughly-uniform distribution of still-live tuples among
> leaf pages fails to exhibit any temporal locality. So, maybe my patch
> would still come out ahead of simply ripping out "getting tired" in
> this parallel universe where latency doesn't matter, and space
> utilization is everything.

Yes, that's a good point.  Also, and I think pretty importantly, this
seems essential if we want to allow retail index tuple deletion, which
has its own set of advantages.

I don't think you're going to be able to rip out the getting-tired
code, though, because presumably we'll have to continue to support
existing btree indexes that don't include TIDs for some
probably-not-small number of future releases, even if we decide that
all future btree indexes (and any that are reindexed) should have
TIDs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, Jul 17, 2018 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, that's a good point.  Also, and I think pretty importantly, this
> seems essential if we want to allow retail index tuple deletion, which
> has its own set of advantages.

Retail index tuple deletion is clearly something we should have in
some form, and is the top reason to pursue the project. I could
probably come up with dozens of other reasons if I had to, though.
Here are a few of the less obvious ones:

* This improves usage_count of heap page buffers automatically, by
recognizing correlated references when there are many duplicates. [1]

* How else is VACUUM/deletion supposed to work with indirect indexes?
I think that it's impossible without either retail deletion or retail
updates, unless an AEL is okay.

* Global indexes (indexes that simultaneously cover multiple
partitions) are just a generalization of this idea, with a partition
number attribute before the heap TID attribute.

* The sort order of duplicates actually matters quite a lot to the
executor, and to the planner, which this lets us reason about in many
cases. [2][3]

* There is nothing to stop us from exploiting this within
tidbitmap.c/bitmap index scans. We can build disjunct lists of sorted
TIDs, one per value. Merging lists like this together is typically
very cheap, even when there are relatively many distinct lists. We
could probably do that on-the-fly.

* When building a bitmap for a low cardinality value from a secondary
index, why bother even visiting the leaf pages? We can make a lossy
bitmap from a subset of the internal pages.

* Techniques like bitmap semi-join, which are important for
star-schema queries, are dependent on the merging and intersection of
fact table bitmaps being cheap (basically, you build several bitmaps
through nestloop joins with some dimension table on the outer side,
and some fact table index on the inner side -- that's relatively easy
to parallelize well). Tom's "Improve OR conditions on joined columns"
patch sorts a list of TIDs [4], and I suspect that there is some
high-level connection there.

* Heap prefetching for nbtree index scans is a lot more effective than
it might otherwise be.

I really could go on, but you get the idea. The merits of each of
these ideas are debatable, but the fact that there are so many ideas
that are at least plausible-sounding seems significant to me.

> I don't think you're going to be able to rip out the getting-tired
> code, though, because presumably we'll have to continue to support
> existing btree indexes that don't include TIDs for some
> probably-not-small number of future releases, even if we decide that
> all future btree indexes (and any that are reindexed) should have
> TIDs.

That's definitely true. There is no way that it's okay to ignore this
issue with pg_upgrade.

[1] https://postgr.es/m/CAH2-WzkrKKW88Yq4-BLN5N3DrFv93P8r+ti-KfbTeR08P8ckBQ@mail.gmail.com
[2] https://postgr.es/m/20170707234119.GN17566@telsasoft.com
[3] https://postgr.es/m/520D6610.8040907%40emulex.com#520D6610.8040907@emulex.com
[4] https://postgr.es/m/22002.1487099934%40sss.pgh.pa.us
-- 
Peter Geoghegan


On 17 July 2018 at 23:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> I've done plenty of research into the history of this hack. It was
>> your work, but it does actually make sense in the context of today's
>> nbtree code. It is essential with scankey-wise duplicates, since
>> groveling through hundreds or even thousands of pages full of
>> duplicates to find free space (and avoid a page split) is going to
>> have a very serious downside for latency.
>
> Well, the actual problem was O(N^2) behavior:
>
> https://www.postgresql.org/message-id/2378.967216388%40sss.pgh.pa.us
>
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=40549e9cb5abd2986603883e4ab567dab34723c6
>
> I certainly have no objection to improving matters, but let's be sure
> we don't re-introduce any two-decade-old problems.

Looking at this in terms of CPU cost for a single insert, insertion at
the end of the list of duplicates was O(N), which was much improved by
the O(k) behavior.
Keeping the index entries in order is O(log N).

So there is a higher cost to keeping the entries in order, though that
becomes a win because of the reduced bloat in cases where we have a
high number of deletes. That is a win in insertion time as well,
since by packing the index more tightly we cause fewer costly splits
and less I/O.

If we knew that we were never going to do deletes/non-HOT updates on
the table we could continue to use the existing mechanism; otherwise we
will be better off using sorted index entries. However, it does appear
as if keeping entries in sorted order would be a win on concurrency,
from reduced block contention on the first few blocks of a given index
key, so it may also be a win in cases where there are heavy concurrent
inserts but no deletes.

Bulk loading will improve because adding a new data block via COPY
will cause all of the resulting index entries to be added with more
adjacency, so a block full of duplicate entries could be sorted and
then merged efficiently into the index.

I hope we can see a patch that just adds the sorting-by-TID property
so we can evaluate that aspect before we try to add other more
advanced index ideas.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Tue, Jul 17, 2018 at 10:42 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If we knew that we were never going to do deletes/non-HOT updates from
> the table we could continue to use the existing mechanism, otherwise
> we will be better off to use sorted index entries. However, it does
> appear as if keeping entries in sorted order would be a win on
> concurrency from reduced block contention on the first few blocks of
> the index key, so it may also be a win in cases where there are heavy
> concurrent inserts but no deletes.

I think so too. I also expect a big reduction in the number of FPIs in
the event of many duplicates.

> I hope we can see a patch that just adds the sorting-by-TID property
> so we can evaluate that aspect before we try to add other more
> advanced index ideas.

I can certainly see why that's desirable. Unfortunately, it isn't that simple.

If I want to sort on heap TID as a tie-breaker, I cannot cut any
corners. That is, it has to be just another column, at least as far as
the implementation is concerned (heap TID will have a different
representation in internal pages and leaf high keys, but nonetheless
it's essentially just another column in the keyspace). This means that
if I don't have suffix truncation, I'll regress performance in many
important cases that have no compensating benefit (e.g. pgbench).
There is no point in trying to assess that.
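
To show what "just another column" means in the simplest possible terms
(a standalone sketch with made-up struct and function names; the real
patch does this inside nbtree's own comparison code, not with a qsort
comparator): once heap TID is a trailing tie-breaker, every duplicate
has exactly one legal position:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct IndexEntry
{
    int32_t  key;       /* user-visible key attribute */
    uint32_t heap_blk;  /* heap block number */
    uint16_t heap_off;  /* offset within heap block */
} IndexEntry;

/* compare on key first, then heap TID (block, offset) as a tie-breaker */
static int
entry_cmp(const void *a, const void *b)
{
    const IndexEntry *x = a;
    const IndexEntry *y = b;

    if (x->key != y->key)
        return (x->key < y->key) ? -1 : 1;
    if (x->heap_blk != y->heap_blk)
        return (x->heap_blk < y->heap_blk) ? -1 : 1;
    if (x->heap_off != y->heap_off)
        return (x->heap_off < y->heap_off) ? -1 : 1;
    return 0;
}

int
main(void)
{
    IndexEntry dups[] = {{42, 7, 3}, {42, 2, 1}, {42, 7, 1}, {41, 9, 5}};

    /*
     * The duplicates of 42 get a single well-defined order, so an
     * insertion can descend straight to the one page it belongs on.
     */
    qsort(dups, 4, sizeof(IndexEntry), entry_cmp);
    for (int i = 0; i < 4; i++)
        printf("key=%d tid=(%u,%u)\n", (int) dups[i].key,
               (unsigned) dups[i].heap_blk, (unsigned) dups[i].heap_off);
    return 0;
}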

It is true that I could opt to only "logically truncate" the heap TID
attribute during a leaf page split (i.e. there'd only be "logical
truncation", which is to say there'd only be the avoidance of adding a
heap TID to the new high key, and never true physical truncation of
user attributes). But doing only that much saves very little code,
since the logic for assessing whether or not it's safe to avoid adding
a new heap TID attribute (whether or not we logically truncate) still has
to involve an insertion scankey. It seems much more natural to do
everything at once. Again, the heap TID attribute is more or less just
another attribute. Also, the code for doing physical suffix truncation
already exists from the covering/INCLUDE index commit.

I'm currently improving the logic for picking a page split point in light of
suffix truncation, which I've been working on for weeks now. I had
something that did quite well with the initial index sizes for TPC-C
and TPC-H, but then realized I'd totally regressed the motivating
example with many duplicates that I started this thread with. I now
have something that does both things well, which I'm trying to
simplify. Another thing to bear in mind is that page split logic for
suffix truncation also helps space utilization on the leaf level. I
can get the main TPC-C order_line pkey about 7% smaller with true
suffix truncation, even though the internal page index tuples can
never be any smaller due to alignment, and even though there are no
duplicates that would otherwise make the implementation "get tired".

Can I really fix space utilization in a piecemeal fashion? I strongly doubt it.

-- 
Peter Geoghegan


Do you plan to submit this patch to the upcoming commit fest perhaps?  I
have done some testing on it and it seems worth pursuing further.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Tue, Aug 28, 2018 at 11:32 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Do you plan to submit this patch to the upcoming commit fest perhaps?  I
> have done some testing on it and it seems worth pursuing further.

I should make sure that this makes it into the September 'fest. Thanks
for reminding me. I've been distracted by other responsibilities, but
I think that this project is well worthwhile. I intend to spend lots
of time on it, as I think that it has real strategic importance. I
would be most grateful if you signed up to review the patch. I've been
testing it with quite a variety of real-world data, which seems like
the way to go until the code really matures.

As I indicated to Simon on August 2nd, it seems like I should further
refine my current working draft of the next version to have less
magic. I have a cumbersome test suite that proves that I have
something that improves fan-out on TPC-C and TPC-H indexes by quite a
bit (e.g. the main TPC-C order_line pkey is about 7% smaller after
initial preload, despite not even having smaller pivot tuples due to
alignment). It also proves that the benefits of not "getting tired" in
the event of many duplicates are preserved. We need to have both at
once.

The code to pick a split point during a page split is rather tricky.
That's what's holding the next version up. I can post what I have
right now on the other thread, to give you a better idea of the
direction of the patch. You'll have to hold your nose when you look at
the code that picks a split point, though. Let me know if you think
that that makes sense. I wouldn't want you to spend too much time on
old-ish code.

-- 
Peter Geoghegan


On 2 August 2018 at 21:32, Peter Geoghegan <pg@bowt.ie> wrote:

> If I want to sort on heap TID as a tie-breaker, I cannot cut any
> corners. That is, it has to be just another column, at least as far as
> the implementation is concerned (heap TID will have a different
> representation in internal pages and leaf high keys, but nonetheless
> it's essentially just another column in the keyspace). This means that
> if I don't have suffix truncation, I'll regress performance in many
> important cases that have no compensating benefit (e.g. pgbench).
> There is no point in trying to assess that.

If you include heap TID as a column the suffix will be unique and cannot benefit from suffix truncation.

It would be better to avoid including the heap TID as a column, since it is already there. Or the other way around, don't include it as payload if it has to be a column.

Alternatively, include only the heap block number. That would make it non-unique, but with never more than 240 duplicates. So it would allow suffix truncation, and yet still avoid the multi-page split effect.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Wed, Aug 29, 2018 at 11:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If you include heap TID as a column the suffix will be unique and cannot
> benefit from suffix truncation.

Right. During a page split, we must generate a new high key that's
greater than or equal to all items on the left side (where the new high
key goes), and strictly less than all items on the right side (which
the new high key will be used as a downlink to). We'll only include
heap TID explicitly in the new high key when there is no other way to
get a value that is strictly less than all tuples on the right side,
and only at the leaf level (we don't truncate during internal page
splits, and the rules automatically work at non-leaf levels because
values come from the leaf level).

The original L&Y invariant that I propose to restore requires that
downlinks be strictly less than all items on the child page the point
to. Truncated attributes logically retain the value minus infinity,
which is often what makes the true L&Y invariant hold.

> It would be better to avoid including the heap TID as a column, since it is
> already there. Or the other way around, don't include it as payload if it
> has to be a column.

The high level principle to bear in mind is that pivot tuples (high
keys at the leaf level, and all internal page tuples) only need to
correctly guide searches. There is no obligation for pivot tuples to
be able to retrieve the original values, for something like an
index-only scan. We could even make their representation into a flat
binary string someday -- this is called "key normalization" [1].
Likewise, pivot tuples are good candidates for automatic prefix
compression because there is scarcely any need to decompress them (not
working on that for v12, though).

Nothing changes about the direct representation of 99%+ of tuples
(those at the leaf level that point to the heap -- non-pivot tuples).
We need an alternative representation for heap TID in pivot tuples for
historical reasons (actually, I think what we have here would even
work best in a green field situation).

Not including a special heap TID representation in pivots, but
including everything else is what I've called "logical truncation" on
this thread. If you compare a serial primary key case with my patch
and without, you typically see no difference in the size of the index.
It is possible to end up with a larger index overall, but that should
be extremely rare, and only affect very small indexes.

> Alternatively, include only the heap block number. That would make it
> non-unique, but never more than 240 duplicates. So it would allow suffix
> truncation, and yet still avoid the multi-page split effect.

I'm not sure what you mean by the multi-page split effect -- internal
pages are split recursively, as needed, during a leaf page split. In
general, free space management needs to remain as simple as possible,
and splitting the same page twice in one go is a big no-no.

I conservatively enforce the 1/3 of a page restriction with a bit of
extra slop for an extra heap TID. There is one choke point where
everything takes place -- leaf page splits. Nothing else
changes, apart from index scan logic, which may have to search with an
extra heap TID column, plus one or two other small things along those
lines.

[1] https://wiki.postgresql.org/wiki/Key_normalization
-- 
Peter Geoghegan