
From Peter Geoghegan
Subject Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date
Msg-id CAH2-WzkjyQ7tnqqOvOamTw0zeLKTZcrPEF8uUMhqf4mUL_xzug@mail.gmail.com
In response to Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach  is quite different. At a very high level, I
> believe the goals can be described as:
>
> 1. Find out how much suffix truncation is possible, i.e. how many key
> columns can be truncated away, in the best case, among all possible ways
> to split the page.
>
> 2. Among all the splits that achieve that optimum suffix truncation,
> find the one with smallest "delta".

Thanks for going to the trouble of implementing what you have in mind!

I agree that the code that I wrote within nbtsplitloc.c is very
subtle, and I do think that I have further work to do to make its
design clearer. I think that this high level description of the goals
of the algorithm is inaccurate in subtle but important ways, though.
Hopefully there will be a way of making it more understandable that
preserves certain important characteristics. If you had my test
cases/data that would probably help you a lot (more on that later).

The algorithm I came up with almost always does these two things in
the opposite order, with each considered in clearly separate phases.
There are good reasons for this. We start with the same criteria as
the old algorithm. We assemble a small array of candidate split
points, rather than one split point, but otherwise it's almost exactly
the same (the array is sorted by delta). Then, at the very end, we
iterate through the small array to find the best choice for suffix
truncation. IOW, we only consider suffix truncation as a *secondary*
goal. The delta is still by far the most important thing 99%+ of the
time. I assume it's fairly rare to not have two distinct tuples within
9 or so tuples of the delta-wise optimal split position -- 99% is
probably a low estimate, at least in OLTP apps, or within unique
indexes. I see that you do something with a "good enough" delta that
seems like it also makes delta the most important thing, but that
doesn't appear to be, uh, good enough. ;-)
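
To make that shape concrete, here is a toy, self-contained sketch of
the two phases in pseudo-C (simplified stand-in names and types, not
the actual nbtsplitloc.c code):

/*
 * Toy sketch of the two-phase approach (illustrative names only).
 * Phase one is assumed to have admitted only a handful of the
 * lowest-delta split points, ranked just as the old single-candidate
 * logic would have ranked them.  Phase two walks that short array and
 * prefers the candidate that allows the most suffix truncation.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct SplitCandidate
{
    int     splitoffset;    /* would-be first-right item offset */
    int     delta;          /* imbalance between left and right halves */
    int     penalty;        /* # of key attrs the new high key must keep */
} SplitCandidate;

static int
cmp_delta(const void *a, const void *b)
{
    const SplitCandidate *ca = (const SplitCandidate *) a;
    const SplitCandidate *cb = (const SplitCandidate *) b;

    return ca->delta - cb->delta;
}

/*
 * Phase two: among the few surviving low-delta candidates, choose the
 * one with the lowest truncation penalty.  Ties go to the lower delta,
 * since the array is kept sorted by delta.
 */
static SplitCandidate
choose_default_split(SplitCandidate *cand, int ncand)
{
    SplitCandidate best;

    qsort(cand, ncand, sizeof(SplitCandidate), cmp_delta);

    best = cand[0];
    for (int i = 1; i < ncand; i++)
    {
        if (cand[i].penalty < best.penalty)
            best = cand[i];
    }

    return best;
}

int
main(void)
{
    /* three made-up low-delta candidates: {offset, delta, penalty} */
    SplitCandidate cand[] = {{100, 40, 3}, {95, 55, 2}, {103, 48, 3}};
    SplitCandidate winner = choose_default_split(cand, 3);

    printf("split at offset %d (delta %d, high key keeps %d attrs)\n",
           winner.splitoffset, winner.delta, winner.penalty);
    return 0;
}

The point is that a high-delta split point never makes it into the
array in the first place, so delta remains the primary criterion even
though the final loop only looks at the penalty.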

Now, it's true that my approach does occasionally work in a way close
to what you describe above -- it does this when we give up on default
mode and check "how much suffix truncation is possible?" exhaustively,
for every possible candidate split point. "Many duplicates" mode kicks
in when we need to be aggressive about suffix truncation. Even then,
the exact goals are different to what you have in mind in subtle but
important ways. While "truncating away the heap TID" isn't really a
special case in other places, it is a special case for my version of
nbtsplitloc.c, which more or less avoids it at all costs. Truncating
away heap TID is more important than truncating away any other
attribute by a *huge* margin. Many duplicates mode *only* specifically
cares about truncating the final TID attribute. That is the only thing
that is ever treated as more important than delta, though even there
we don't forget about delta entirely. That is, we assume that the
"perfect penalty" is nkeyatts when in many duplicates mode, because we
don't care about suffix truncation beyond heap TID truncation by then.
So, if we find 5 split points out of 250 in the final array that avoid
appending heap TID, we use the earliest/lowest delta out of those 5.
We're not going to try to maximize the number of *additional*
attributes that get truncated, because that can make the leaf pages
unbalanced in an *unbounded* way. None of these 5 split points are
"good enough", but the distinction between their deltas still matters
a lot. We strongly prefer a split with a *mediocre* delta to a split
with a *terrible* delta -- a bigger high key is the least of our
worries here. (I made similar mistakes myself months ago, BTW.)
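
In the same toy pseudo-C as before, the many duplicates rule amounts
to something like this (illustrative names only, and I'm glossing over
how the penalty actually gets computed):

#include <stdio.h>

typedef struct SplitCandidate
{
    int     splitoffset;    /* would-be first-right item offset */
    int     delta;          /* imbalance between left and right halves */
    int     penalty;        /* nkeyatts + 1 means "high key needs heap TID" */
} SplitCandidate;

/*
 * cand[] holds every legal split point, sorted by ascending delta.
 * The perfect penalty is nkeyatts: any candidate whose new high key
 * avoids carrying a heap TID is acceptable, and the first such
 * candidate (the lowest delta among them) wins.  If nothing avoids
 * the heap TID, delta alone decides.
 */
static SplitCandidate
choose_many_dups_split(const SplitCandidate *cand, int ncand, int nkeyatts)
{
    for (int i = 0; i < ncand; i++)
    {
        if (cand[i].penalty <= nkeyatts)
            return cand[i];     /* avoids appending heap TID */
    }

    return cand[0];             /* heap TID unavoidable; lowest delta wins */
}

int
main(void)
{
    /*
     * nkeyatts = 3.  The lowest-delta candidate needs a heap TID in
     * its high key, the next avoids it at a mediocre delta, and a
     * third would truncate even more, but only at a terrible delta.
     */
    SplitCandidate cand[] = {{128, 2, 4}, {120, 10, 3}, {40, 180, 2}};
    SplitCandidate winner = choose_many_dups_split(cand, 3, 3);

    printf("split at offset %d (delta %d)\n",
           winner.splitoffset, winner.delta);
    return 0;
}

In that toy example the middle candidate wins: mediocre delta, but no
heap TID in the high key, and no chasing after extra truncation at a
terrible delta.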

Your version of the algorithm makes a test case of mine (UK land
registry test case [1]) go from having an index that's 1101 MB with my
version of the algorithm/patch and 1329 MB on the master branch to an
index that's 3030 MB in size. I think that this happens because it
effectively fails to give any consideration to delta at all at certain
points. On leaf pages with lots of unique keys, your algorithm does
about as well as mine because all possible split points look equally
good suffix-truncation-wise, plus you have the "good enough" test, so
delta isn't ignored. I think that your algorithm also works well when
there are many duplicates but only one non-TID index column, since the
heap TID truncation versus other truncation issue does not arise. The
test case I used is an index on "(county, city, locality)", though --
low cardinality, but more than a single column. That's a *combination*
of two separate considerations that seem to get conflated. I don't
think that you can avoid doing "a second pass" in some sense, because
these really are separate considerations.

There is an important middle-ground that your algorithm fails to
handle with this test case. You end up maximizing the number of
attributes that are truncated when you shouldn't -- leaf page splits
are totally unbalanced much of the time. Pivot tuples are smaller on
average, but are also far far more numerous, because there are more
leaf page splits as a result of earlier leaf page splits being
unbalanced. If instead you treated heap TID truncation as the only
thing that you were willing to go to huge lengths to prevent, then
unbalanced splits become *self-limiting*. The next split will probably
end up being a single value mode split, which packs pages full of
duplicates as tightly as possible.

Splits should "degrade gracefully" from default mode to many
duplicates mode to single value mode in cases where the number of
distinct values is constant (or almost constant), but the total number
of tuples grows over time.
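
If it helps, this is how I mentally carve up the modes (just an
illustrative enum; the real conditions for falling back from one mode
to the next are subtler than these comments can convey):

/*
 * The three split modes, roughly as described above.
 */
typedef enum SplitMode
{
    SPLITMODE_DEFAULT,          /* delta first; suffix truncation is only a
                                 * tie-breaker among low-delta candidates */
    SPLITMODE_MANY_DUPLICATES,  /* exhaustive pass over all candidates; avoid
                                 * appending heap TID to the new high key if
                                 * at all possible, lowest delta among those */
    SPLITMODE_SINGLE_VALUE      /* page is all duplicates of one value; pack
                                 * them as tightly as possible */
} SplitMode;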

> I've only been testing this on leaf splits. I didn't understand how the
> penalty worked for internal pages in your patch. In this version, the
> same algorithm is used for leaf and internal pages.

The approach that I use for internal pages is only slightly different
to what we've always done -- I split very near the delta-wise optimal
point, with a slight preference for a tuple that happens to be
smaller. And, there is no way in which the delta-optimal point can be
different to what it would have been on master with internal pages
(they only use default mode). I don't think it's appropriate to use
the same algorithm for leaf and internal page splits at all. We cannot
perform suffix truncation on internal pages.
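
Sketched in the same toy style as above (illustrative names only, not
the patch's code), the internal page case is just:

/*
 * Toy sketch of the internal-page case.  The candidate array is the
 * same small, delta-sorted array as in default mode, but the only
 * "penalty" is the size of the tuple that would become the new pivot:
 * suffix truncation never applies to internal pages, so a slightly
 * smaller pivot is the only thing worth preferring.
 */
#include <stdio.h>
#include <stddef.h>

typedef struct SplitCandidate
{
    int     splitoffset;    /* would-be first-right item offset */
    int     delta;          /* imbalance between left and right halves */
    size_t  newpivotsize;   /* size in bytes of the would-be pivot tuple */
} SplitCandidate;

static SplitCandidate
choose_internal_split(const SplitCandidate *cand, int ncand)
{
    SplitCandidate best = cand[0];  /* cand[] sorted by ascending delta */

    for (int i = 1; i < ncand; i++)
    {
        if (cand[i].newpivotsize < best.newpivotsize)
            best = cand[i];
    }

    return best;
}

int
main(void)
{
    SplitCandidate cand[] = {{100, 12, 64}, {98, 20, 32}, {105, 28, 64}};
    SplitCandidate winner = choose_internal_split(cand, 3);

    printf("split at offset %d (pivot is %zu bytes)\n",
           winner.splitoffset, winner.newpivotsize);
    return 0;
}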

> What have you been using to test this? I wrote the attached little test
> extension, to explore what _bt_findsplitloc() decides in different
> scenarios.

I've specifically tested the _bt_findsplitloc() stuff using a couple
of different techniques. Primarily, I've been using lots of real world
data and TPC benchmark test data, with expected/test output generated
by a contrib/pageinspect query that determines the exact number of
leaf blocks and internal page blocks from each index in a test
database. Just bash and SQL. I'm happy to share that with you, if
you're able to accept a couple of gigabytes worth of dumps that are
needed to make the scripts work. Details:

pg@bat:~/hdd/sample-data$ ll land_registry.custom.dump
-rw------- 1 pg pg 1.1G Mar  3  2018 land_registry.custom.dump
pg@bat:~/hdd/sample-data$ ll tpcc_2018-07-20_unlogged.dump
-rw-rw-r-- 1 pg pg 1.8G Jul 20  2018 tpcc_2018-07-20_unlogged.dump

(The only other components for these "fast" tests are simple bash scripts.)

I think that you'd find it a lot easier to work with me on these
issues if you at least had these tests -- my understanding of the
problems was shaped by the tests. I strongly recommend that you try
out my UK land registry test and the TPC-C test as a way of
understanding the design I've used for _bt_findsplitloc(). It
shouldn't be that inconvenient to get it over to you. I have several
more tests besides these two, but they're much more cumbersome and
much less valuable. I have a script that I can run in 5 minutes that
probably catches all the regressions. The long running stuff, like my
TPC-E test case (the stuff that I won't bother sending) hasn't caught
any regressions that the fast tests didn't catch as well.

Separately, I also have a .gdbinit function that looks like this:

define dump_page
  dump binary memory /tmp/gdb_postgres_page.dump $arg0 ($arg0 + 8192)
  echo Invoking pg_hexedit + wxHexEditor on page...\n
  ! ~/code/pg_hexedit/pg_hexedit -n 1 /tmp/gdb_postgres_page.dump > /tmp/gdb_postgres_page.dump.tags
  ! ~/code/wxHexEditor/wxHexEditor /tmp/gdb_postgres_page.dump &> /dev/null
end

This allows me to see an arbitrary page from an interactive gdb
session using my pg_hexedit tool. I can simply "dump_page page" from
most functions in the nbtree source code. At various points I found it
useful to add optimistic assertions to the split point choosing
routines that failed. I could then see why they failed by using gdb
with the resulting core dump. I could look at the page image using
pg_hexedit/wxHexEditor from there. This allowed me to understand one
or two corner cases. For example, this is how I figured out the exact
details at the end of _bt_perfect_penalty(), when it looks like we're
about to go into a second pass of the page.

> It's pretty rough, but that's what I've been using while
> hacking on this. It prints output like this:

Cool! I did have something that would LOG the new high key in an easy
to interpret way at one point, which was a little like this.

[1] https://postgr.es/m/CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com
--
Peter Geoghegan

