Re: Making all nbtree entries unique by having heap TIDs participate in comparisons - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date |
Msg-id | CAH2-WzkjyQ7tnqqOvOamTw0zeLKTZcrPEF8uUMhqf4mUL_xzug@mail.gmail.com
In response to | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
List | pgsql-hackers
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I spent some time first trying to understand the current algorithm, and
> then rewriting it in a way that I find easier to understand. I came up
> with the attached. I think it optimizes for the same goals as your
> patch, but the approach is quite different. At a very high level, I
> believe the goals can be described as:
>
> 1. Find out how much suffix truncation is possible, i.e. how many key
> columns can be truncated away, in the best case, among all possible ways
> to split the page.
>
> 2. Among all the splits that achieve that optimum suffix truncation,
> find the one with smallest "delta".

Thanks for going to the trouble of implementing what you have in mind! I agree that the code I wrote within nbtsplitloc.c is very subtle, and I do think that I have further work to do to make its design clearer. I think that this high-level description of the algorithm's goals is inaccurate in subtle but important ways, though. Hopefully there will be a way of making it more understandable that preserves certain important characteristics. If you had my test cases/data, that would probably help you a lot (more on that later).

The algorithm I came up with almost always does these two things in the opposite order, with each considered in clearly separate phases. There are good reasons for this. We start with the same criteria as the old algorithm. We assemble a small array of candidate split points rather than a single split point, but otherwise it's almost exactly the same (the array is sorted by delta). Then, at the very end, we iterate through the small array to find the best choice for suffix truncation. IOW, we only consider suffix truncation as a *secondary* goal. The delta is still by far the most important thing 99%+ of the time. I assume it's fairly rare not to have two distinct tuples within 9 or so tuples of the delta-wise optimal split position -- 99% is probably a low estimate, at least in OLTP apps, or within unique indexes. I see that you do something with a "good enough" delta that seems like it also makes delta the most important thing, but that doesn't appear to be, uh, good enough. ;-)

Now, it's true that my approach does occasionally work in a way close to what you describe above -- it does this when we give up on default mode and check "how much suffix truncation is possible?" exhaustively, for every possible candidate split point. "Many duplicates" mode kicks in when we need to be aggressive about suffix truncation. Even then, the exact goals are different from what you have in mind in subtle but important ways. While "truncating away the heap TID" isn't really a special case in other places, it is a special case for my version of nbtsplitloc.c, which more or less avoids it at all costs. Truncating away the heap TID is more important than truncating away any other attribute by a *huge* margin. Many duplicates mode *only* specifically cares about truncating the final TID attribute. That is the only thing that is ever treated as more important than delta, though even there we don't forget about delta entirely. That is, we assume that the "perfect penalty" is nkeyatts when in many duplicates mode, because we don't care about suffix truncation beyond heap TID truncation by then. So, if we find 5 split points out of 250 in the final array that avoid appending heap TID, we use the earliest/lowest delta out of those 5.
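Roughly, the shape of the default-mode structure is something like the following toy, standalone sketch. To be clear, this is not the patch's actual nbtsplitloc.c code -- every name, field, and number here is invented purely to show the two phases:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int firstright;  /* offset of the first tuple that goes to the right page */
    int delta;       /* distance from an even space split; lower is better */
    int penalty;     /* key attributes the new high key must keep; lower is better */
} ToyCandidate;

static int
cmp_delta(const void *a, const void *b)
{
    return ((const ToyCandidate *) a)->delta - ((const ToyCandidate *) b)->delta;
}

static ToyCandidate
choose_split(ToyCandidate *candidates, int ncandidates)
{
    int bestpenalty = INT_MAX;

    /* phase one (not shown) put only the few lowest-delta legal split
     * points into candidates[]; keep them ordered by delta */
    qsort(candidates, ncandidates, sizeof(ToyCandidate), cmp_delta);

    /* phase two: find the most truncation any candidate allows... */
    for (int i = 0; i < ncandidates; i++)
        if (candidates[i].penalty < bestpenalty)
            bestpenalty = candidates[i].penalty;

    /* ...then take the earliest (lowest-delta) candidate that allows it */
    for (int i = 0; i < ncandidates; i++)
        if (candidates[i].penalty == bestpenalty)
            return candidates[i];

    return candidates[0];       /* unreachable */
}

int
main(void)
{
    ToyCandidate candidates[] = {
        {100, 40, 2}, {101, 10, 3}, {102, 25, 2}, {103, 90, 3},
    };
    ToyCandidate best = choose_split(candidates, 4);

    /* prints "offset 102, delta 25, penalty 2" */
    printf("offset %d, delta %d, penalty %d\n",
           best.firstright, best.delta, best.penalty);
    return 0;
}

The important point is that delta decided which split points were even in the array; truncation only breaks ties among them. In many duplicates mode the target penalty is simply pinned at nkeyatts -- avoiding an appended heap TID is all that matters -- so the earliest/lowest-delta split point that manages that wins.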
We're not going to try to maximize the number of *additional* attributes that get truncated, because that can make the leaf pages unbalanced in an *unbounded* way. None of these 5 split points are "good enough", but the distinction between their deltas still matters a lot. We strongly prefer a split with a *mediocre* delta to a split with a *terrible* delta -- a bigger high key is the least of our worries here. (I made similar mistakes myself months ago, BTW.)

Your version of the algorithm makes a test case of mine (the UK land registry test case [1]) go from having an index that's 1101 MB with my version of the algorithm/patch, and 1329 MB on the master branch, to an index that's 3030 MB in size. I think that this happens because it effectively fails to give any consideration to delta at all at certain points. On leaf pages with lots of unique keys, your algorithm does about as well as mine, because all possible split points look equally good suffix-truncation-wise, plus you have the "good enough" test, so delta isn't ignored. I think that your algorithm also works well when there are many duplicates but only one non-TID index column, since the heap TID truncation versus other truncation issue does not arise. The test case I used is an index on "(county, city, locality)", though -- low cardinality, but more than a single column. That's a *combination* of two separate considerations that seem to get conflated. I don't think that you can avoid doing "a second pass" in some sense, because these really are separate considerations.

There is an important middle ground that your algorithm fails to handle with this test case. You end up maximizing the number of attributes that are truncated when you shouldn't -- leaf page splits are totally unbalanced much of the time. Pivot tuples are smaller on average, but they are also far, far more numerous, because there are more leaf page splits as a result of earlier leaf page splits being unbalanced. If you instead treat having to append a heap TID to the new high key as the only thing worth going to huge lengths to prevent, then unbalanced splits become *self-limiting*: the next split will probably end up being a single value mode split, which packs pages full of duplicates as tightly as possible. Splits should "degrade gracefully" from default mode to many duplicates mode to single value mode in cases where the number of distinct values is constant (or almost constant) but the total number of tuples grows over time.
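To make the special status of the heap TID concrete, here's another toy sketch (again, nothing here is the patch's actual code -- the representation and the rows are made up, loosely in the shape of that land registry index):

#include <stdio.h>
#include <string.h>

#define NKEYATTS 3              /* think (county, city, locality) */

typedef struct
{
    const char *keys[NKEYATTS];
    long        heaptid;
} ToyTuple;

/*
 * How much of a new high key formed between these two tuples must be kept?
 * Returns the number of leading key attributes needed to tell the last
 * tuple on the left page apart from the first tuple on the right page.
 * When every key attribute matches, the heap TID itself would have to be
 * appended, which is scored as strictly worse than keeping all key columns.
 */
static int
toy_split_penalty(const ToyTuple *lastleft, const ToyTuple *firstright)
{
    for (int attno = 0; attno < NKEYATTS; attno++)
    {
        if (strcmp(lastleft->keys[attno], firstright->keys[attno]) != 0)
            return attno + 1;
    }
    return NKEYATTS + 1;        /* heap TID must go into the high key */
}

int
main(void)
{
    ToyTuple a = {{"ESSEX", "CHELMSFORD", "WRITTLE"}, 1001};
    ToyTuple b = {{"ESSEX", "COLCHESTER", "LEXDEN"}, 1002};
    ToyTuple c = {{"ESSEX", "COLCHESTER", "LEXDEN"}, 1003};

    printf("split between cities:     penalty %d\n", toy_split_penalty(&a, &b));
    printf("split between duplicates: penalty %d\n", toy_split_penalty(&b, &c));
    return 0;
}

In these terms, many duplicates mode treats any split that scores NKEYATTS or less as acceptable; it doesn't chase even smaller penalties at the expense of delta, which is what keeps those splits from becoming unbalanced in an unbounded way.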
> I've only been testing this on leaf splits. I didn't understand how the
> penalty worked for internal pages in your patch. In this version, the
> same algorithm is used for leaf and internal pages.

The approach that I use for internal pages is only slightly different from what we've always done -- I split very near the delta-wise optimal point, with a slight preference for a tuple that happens to be smaller. And there is no way in which the delta-optimal point can differ from what it would have been on master for internal pages (they only use default mode). I don't think it's appropriate to use the same algorithm for leaf and internal page splits at all; we cannot perform suffix truncation on internal pages.

> What have you been using to test this? I wrote the attached little test
> extension, to explore what _bt_findsplitloc() decides in different
> scenarios.

I've specifically tested the _bt_findsplitloc() stuff using a couple of different techniques.

Primarily, I've been using lots of real-world data and TPC benchmark test data, with expected/test output generated by a contrib/pageinspect query that determines the exact number of leaf blocks and internal blocks in each index in a test database. Just bash and SQL. I'm happy to share that with you, if you're able to accept a couple of gigabytes worth of dumps that are needed to make the scripts work. Details:

pg@bat:~/hdd/sample-data$ ll land_registry.custom.dump
-rw------- 1 pg pg 1.1G Mar 3 2018 land_registry.custom.dump
pg@bat:~/hdd/sample-data$ ll tpcc_2018-07-20_unlogged.dump
-rw-rw-r-- 1 pg pg 1.8G Jul 20 2018 tpcc_2018-07-20_unlogged.dump

(The only other components of these "fast" tests are simple bash scripts.)

I think that you'd find it a lot easier to work with me on these issues if you at least had these tests -- my understanding of the problems was shaped by the tests. I strongly recommend that you try out my UK land registry test and the TPC-C test as a way of understanding the design I've used for _bt_findsplitloc(). It shouldn't be that inconvenient to get it over to you. I have several more tests besides these two, but they're much more cumbersome and much less valuable. I have a script that I can run in 5 minutes that probably catches all the regressions. The long-running stuff, like my TPC-E test case (the stuff that I won't bother sending), hasn't caught any regressions that the fast tests didn't catch as well.

Separately, I also have a .gdbinit function that looks like this:

define dump_page
  dump binary memory /tmp/gdb_postgres_page.dump $arg0 ($arg0 + 8192)
  echo Invoking pg_hexedit + wxHexEditor on page...\n
  ! ~/code/pg_hexedit/pg_hexedit -n 1 /tmp/gdb_postgres_page.dump > /tmp/gdb_postgres_page.dump.tags
  ! ~/code/wxHexEditor/wxHexEditor /tmp/gdb_postgres_page.dump &> /dev/null
end

This allows me to see an arbitrary page from an interactive gdb session using my pg_hexedit tool. I can simply "dump_page page" from most functions in the nbtree source code. At various points I found it useful to add optimistic assertions to the split point choosing routines; when one failed, I could see why by using gdb with the resulting core dump, and could look at the page image using pg_hexedit/wxHexEditor from there. This allowed me to understand one or two corner cases. For example, this is how I figured out the exact details at the end of _bt_perfect_penalty(), when it looks like we're about to go into a second pass over the page.

> It's pretty rough, but that's what I've been using while
> hacking on this. It prints output like this:

Cool! I did have something that would LOG the new high key in an easy-to-interpret way at one point, which was a little like this.

[1] https://postgr.es/m/CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com

--
Peter Geoghegan