Thread: Making all nbtree entries unique by having heap TIDs participate in comparisons
Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
I've been thinking about using heap TID as a tie-breaker when comparing B-Tree index tuples for a while now [1]. I'd like to make all tuples at the leaf level unique, as assumed by L&Y. This can enable "retail index tuple deletion", which I think we'll probably end up implementing in some form or another, possibly as part of the zheap project. It's also possible that this work will facilitate GIN-style deduplication based on run length encoding of TIDs, or storing versioned heap TIDs in an out-of-line nbtree-versioning structure (unique indexes only). I can see many possibilities, but we have to start somewhere.

I attach an unfinished prototype of suffix truncation that also sometimes *adds* a new attribute in pivot tuples. It adds an extra heap TID from the leaf level when truncating away non-distinguishing attributes during a leaf page split, though only when it must. The patch also has nbtree treat heap TID as a first-class part of the key space of the index. Claudio wrote a patch that did something similar, though without the suffix truncation part [2] (I haven't studied his patch, to be honest). My patch is actually a very indirect spin-off of Anastasia's covering index patch, and I want to show what I have in mind now, while it's still swapped into my head. I won't do any serious work on this project unless and until I see a way to implement retail index tuple deletion, which seems like a multi-year project that requires the buy-in of multiple senior community members. On its own, my patch regresses performance unacceptably in some workloads, probably due to interactions with kill_prior_tuple()/LP_DEAD hint setting, and interactions with page space management when there are many "duplicates" (it can still help performance in some pgbench workloads with non-unique indexes, though).

Note that the approach to suffix truncation that I've taken isn't even my preferred approach [3] -- it's a medium-term solution that enables making a heap TID attribute part of the key space, which enables everything else. Cheap incremental/retail tuple deletion is the real prize here; don't lose sight of that when looking through my patch. If we're going to teach nbtree to truncate this new implicit heap TID attribute, which seems essential, then we might as well teach nbtree to do suffix truncation of other (user-visible) attributes while we're at it. This patch isn't a particularly effective implementation of suffix truncation, because that's not what I'm truly interested in improving here (plus I haven't even bothered to optimize the logic for picking a split point in light of suffix truncation).

amcheck
=======

This patch adds amcheck coverage, which seems like essential infrastructure for developing a feature such as this. Extensive amcheck coverage gave me confidence in my general approach. The basic idea, invariant-wise, is to treat truncated attributes (often including a truncated heap TID attribute in internal pages) as "minus infinity" attributes, which participate in comparisons if and only if we reach such attributes before the end of the scan key (a smaller keysz for the index scan could prevent this). I've generalized the minus infinity concept that _bt_compare() has always considered as a special case, extending it to individual attributes.

It's actually possible to remove that old hard-coded _bt_compare() logic with this patch applied without breaking anything, since we can rely on the comparison of an explicitly 0-attribute tuple working the same way (pg_upgrade'd databases will break if we do this, however, so I didn't go that far).

Note that I didn't change the logic that has _bt_binsrch() treat internal pages in a special way when tuples compare as equal. We still need that logic for cases where keysz is less than the number of indexed columns. It's only possible to avoid this _bt_binsrch() thing for internal pages when all attributes, including heap TID, were specified and compared (an insertion scan key has to have an entry for every indexed column, including even heap TID). Doing better there doesn't seem worth the trouble of teaching _bt_compare() to tell the _bt_binsrch() caller about this as a special case. That means that we still move left on equality in some cases where it isn't strictly necessary, contrary to L&Y. However, amcheck verifies that the classic "Ki < v <= Ki+1" invariant holds (as opposed to "Ki <= v <= Ki+1") when verifying parent/child relationships, which demonstrates that I have restored the classic invariant (I just don't find it worthwhile to take advantage of it within _bt_binsrch() just yet).

Most of this work was done while I was an employee of VMware, though I joined Crunchy Data on Monday and cleaned it up a bit more since then. I'm excited about joining Crunchy, but I should also acknowledge VMware's strong support of my work.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[2] https://postgr.es/m/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
[3] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys

--
Peter Geoghegan
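As a concrete illustration of the comparison rule described above, here is a small self-contained C sketch (invented toy types, and v1's ascending TID order; not the patch's actual _bt_compare() code): truncated attributes sort as "minus infinity", and the heap TID acts as the implicit last attribute.

    /*
     * Toy model of the comparison rule: user-visible attributes first, a
     * truncated attribute sorts as "minus infinity", heap TID breaks ties.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct ToyTid { unsigned block; unsigned offset; } ToyTid;

    typedef struct ToyTuple
    {
        int     natts;      /* attributes physically present (suffix may be truncated) */
        int     atts[4];    /* user-visible attribute values */
        bool    has_tid;    /* pivot tuples may have the heap TID truncated away too */
        ToyTid  tid;        /* heap TID: the implicit last attribute */
    } ToyTuple;

    /* Compare an insertion scan key (keysz attributes plus optional scantid) to a tuple. */
    static int
    toy_compare(const int *scankey, int keysz, const ToyTid *scantid, const ToyTuple *itup)
    {
        for (int i = 0; i < keysz; i++)
        {
            /* A truncated attribute sorts as "minus infinity": scan key is greater. */
            if (i >= itup->natts)
                return 1;
            if (scankey[i] != itup->atts[i])
                return (scankey[i] < itup->atts[i]) ? -1 : 1;
        }

        /* All user-visible attributes equal: fall back on the heap TID tie-breaker. */
        if (scantid == NULL)
            return 0;               /* caller supplied no TID; treat as equal */
        if (!itup->has_tid)
            return 1;               /* truncated TID is "minus infinity" as well */
        if (scantid->block != itup->tid.block)
            return (scantid->block < itup->tid.block) ? -1 : 1;
        if (scantid->offset != itup->tid.offset)
            return (scantid->offset < itup->tid.offset) ? -1 : 1;
        return 0;
    }

    int
    main(void)
    {
        ToyTuple    pivot = { .natts = 1, .atts = {42}, .has_tid = false };
        int         key[] = {42, 7};
        ToyTid      tid = {3, 1};

        /*
         * Equal first attribute, but the pivot's second attribute is
         * truncated, so the scan key compares as greater (prints 1).
         */
        printf("%d\n", toy_compare(key, 2, &tid, &pivot));
        return 0;
    }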
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Robert Haas
Date:
On Thu, Jun 14, 2018 at 2:44 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I've been thinking about using heap TID as a tie-breaker when comparing B-Tree index tuples for a while now [1]. I'd like to make all tuples at the leaf level unique, as assumed by L&Y. This can enable "retail index tuple deletion", which I think we'll probably end up implementing in some form or another, possibly as part of the zheap project. It's also possible that this work will facilitate GIN-style deduplication based on run length encoding of TIDs, or storing versioned heap TIDs in an out-of-line nbtree-versioning structure (unique indexes only). I can see many possibilities, but we have to start somewhere.

Yes, retail index deletion is essential for the delete-marking approach that is proposed for zheap.

It could also be extremely useful in some workloads with the regular heap. If the indexes are large -- say, 100GB -- and the number of tuples that vacuum needs to kill is small -- say, 5 -- scanning them all to remove the references to those tuples is really inefficient. If we had retail index deletion, then we could make a cost-based decision about which approach to use in a particular case.

> mind now, while it's still swapped into my head. I won't do any serious work on this project unless and until I see a way to implement retail index tuple deletion, which seems like a multi-year project that requires the buy-in of multiple senior community members.

Can you enumerate some of the technical obstacles that you see?

> On its own, my patch regresses performance unacceptably in some workloads, probably due to interactions with kill_prior_tuple()/LP_DEAD hint setting, and interactions with page space management when there are many "duplicates" (it can still help performance in some pgbench workloads with non-unique indexes, though).

I think it would be helpful if you could talk more about these regressions (and the wins).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
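A rough sketch of the cost-based decision Robert describes, with an invented cost model and names (purely illustrative, not a proposed PostgreSQL API):

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical rule of thumb: if only a handful of heap tuples died,
     * delete their index entries one at a time (one tree descent each);
     * otherwise fall back on scanning every leaf page, as VACUUM does today.
     */
    static bool
    prefer_retail_deletion(size_t ndead_tuples,     /* e.g. 5 */
                           double index_leaf_pages, /* e.g. a 100GB index */
                           double btree_height)
    {
        /* Retail: one root-to-leaf descent (plus the leaf itself) per dead tuple. */
        double  retail_cost = (double) ndead_tuples * (btree_height + 1.0);

        /* Bulk: the conventional approach reads every leaf page once. */
        double  bulk_cost = index_leaf_pages;

        return retail_cost < bulk_cost;
    }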
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Jun 15, 2018 at 2:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yes, retail index deletion is essential for the delete-marking approach that is proposed for zheap.

Makes sense. I don't know that much about zheap. I'm sure that retail index tuple deletion is really important in general, though. The Gray & Reuter book treats unique keys as a basic assumption, as do other authoritative reference works and papers. Other database systems probably make unique indexes simply use the user-visible attributes as unique values, but appending heap TID as a unique-ifier is probably a reasonably common design for secondary indexes (it would also be nice if we could simply not store duplicates for unique indexes, rather than using heap TID).

I generally have a very high opinion of the nbtree code, but this seems like a problem that ought to be fixed. I've convinced myself that I basically have the right idea with this patch, because the classic L&Y invariants have all been tested with an enhanced amcheck run against all indexes in a regression test database. There was other stress-testing, too. The remaining problems are fixable, but I need some guidance.

> It could also be extremely useful in some workloads with the regular heap. If the indexes are large -- say, 100GB -- and the number of tuples that vacuum needs to kill is small -- say, 5 -- scanning them all to remove the references to those tuples is really inefficient. If we had retail index deletion, then we could make a cost-based decision about which approach to use in a particular case.

I remember talking to Andres about this in a bar 3 years ago. I can imagine variations of pruning that do some amount of this when there are lots of duplicates. Perhaps something like InnoDB's purge threads, which do things like in-place deletes of secondary indexes after an updating (or deleting) xact commits. I believe that that mechanism targets secondary indexes specifically, and that it operates quite eagerly.

> Can you enumerate some of the technical obstacles that you see?

The #1 technical obstacle is that I simply don't know where I should try to take this patch, given that it probably needs to be tied to some much bigger project, such as zheap. I have an open mind, though, and intend to help if I can. I'm not really sure what the #2 and #3 problems are, because I'd need to be able to see a few steps ahead to be sure. Maybe #2 is that I'm doing something wonky to avoid breaking duplicate checking for unique indexes. (The way that unique duplicate checking has always worked [1] is perhaps questionable, though.)

> I think it would be helpful if you could talk more about these regressions (and the wins).

I think that the performance regressions are due to the fact that when you have a huge number of duplicates today, it's useful to be able to claim space to fit further duplicates from almost any of the multiple leaf pages that contain or have contained duplicates. I'd hoped that the increased temporal locality that the patch gets would more than make up for that. As far as I can tell, the problem is that temporal locality doesn't help enough. I saw that performance was somewhat improved with extreme Zipf distribution contention, but it went the other way with less extreme contention. The details are not that fresh in my mind, since I shelved this patch for a while following limited performance testing.

The code could certainly use more performance testing, and more general polishing. I'm not strongly motivated to do that right now, because I don't quite see a clear path to making this patch useful. But, as I said, I have an open mind about what the next step should be.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Claudio Freire
Date:
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think it would be helpful if you could talk more about these regressions (and the wins).
>
> I think that the performance regressions are due to the fact that when you have a huge number of duplicates today, it's useful to be able to claim space to fit further duplicates from almost any of the multiple leaf pages that contain or have contained duplicates. I'd hoped that the increased temporal locality that the patch gets would more than make up for that. As far as I can tell, the problem is that temporal locality doesn't help enough. I saw that performance was somewhat improved with extreme Zipf distribution contention, but it went the other way with less extreme contention. The details are not that fresh in my mind, since I shelved this patch for a while following limited performance testing.
>
> The code could certainly use more performance testing, and more general polishing. I'm not strongly motivated to do that right now, because I don't quite see a clear path to making this patch useful. But, as I said, I have an open mind about what the next step should be.

Way back when I was dabbling in this kind of endeavor, my main idea to counteract that, and possibly improve performance overall, was a microvacuum kind of thing that would do some on-demand cleanup to remove duplicates or make room before page splits. Since nbtree uniqueification enables efficient retail deletions, that could end up as a net win.

I never got around to implementing it though, and it does get tricky if you don't want to allow unbounded latency spikes.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Way back when I was dabbling in this kind of endeavor, my main idea to counteract that, and possibly improve performance overall, was a microvacuum kind of thing that would do some on-demand cleanup to remove duplicates or make room before page splits. Since nbtree uniqueification enables efficient retail deletions, that could end up as a net win.

That sounds like a mechanism that works a bit like _bt_vacuum_one_page(), which we run at the last second before a page split. We do this to see if a page split that looks necessary can actually be avoided.

I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits. Maybe we could justify having _bt_vacuum_one_page() do cleanup of those tuples (i.e. check if they're visible to anyone, and if not recycle), because we at least know that the deleting transaction committed there. That is, they could be recently dead or dead, and it may be worth going to the extra trouble of checking which when we know that it's one of the two possibilities.

--
Peter Geoghegan
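A toy sketch of that "last second" cleanup idea, using invented types rather than nbtree's page format: only split if removing known-dead items still can't make room for the incoming tuple.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct ToyItem { size_t size; bool known_dead; } ToyItem;
    typedef struct ToyPage { ToyItem items[64]; int nitems; size_t freespace; } ToyPage;

    /* Returns true if cleanup made room for newitemsize bytes, avoiding a split. */
    static bool
    toy_cleanup_avoids_split(ToyPage *page, size_t newitemsize)
    {
        size_t  reclaimable = 0;

        for (int i = 0; i < page->nitems; i++)
            if (page->items[i].known_dead)
                reclaimable += page->items[i].size;

        if (page->freespace + reclaimable < newitemsize)
            return false;               /* the split really is necessary */

        /* Compact away the dead items (real page layout details elided). */
        int j = 0;
        for (int i = 0; i < page->nitems; i++)
            if (!page->items[i].known_dead)
                page->items[j++] = page->items[i];
        page->nitems = j;
        page->freespace += reclaimable;
        return true;
    }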
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Claudio Freire
Date:
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to counteract that, and possibly improve performance overall, was a microvacuum kind of thing that would do some on-demand cleanup to remove duplicates or make room before page splits. Since nbtree uniqueification enables efficient retail deletions, that could end up as a net win.
>
> That sounds like a mechanism that works a bit like _bt_vacuum_one_page(), which we run at the last second before a page split. We do this to see if a page split that looks necessary can actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits. Maybe we could justify having _bt_vacuum_one_page() do cleanup of those tuples (i.e. check if they're visible to anyone, and if not recycle), because we at least know that the deleting transaction committed there. That is, they could be recently dead or dead, and it may be worth going to the extra trouble of checking which when we know that it's one of the two possibilities.

Yes, but currently _bt_vacuum_one_page() does local work on the pinned page. Doing dead tuple deletion however involves reading the heap to check visibility at the very least, and doing it on the whole page might involve several heap fetches, so it's an order of magnitude heavier if done naively.

But the idea is to do just that, only not naively.
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Amit Kapila
Date:
On Mon, Jun 18, 2018 at 10:33 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Way back when I was dabbling in this kind of endeavor, my main idea to counteract that, and possibly improve performance overall, was a microvacuum kind of thing that would do some on-demand cleanup to remove duplicates or make room before page splits. Since nbtree uniqueification enables efficient retail deletions, that could end up as a net win.
>
> That sounds like a mechanism that works a bit like _bt_vacuum_one_page(), which we run at the last second before a page split. We do this to see if a page split that looks necessary can actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits.

No, I don't think that is the case, because we want to perform in-place updates for indexed-column updates. If we won't delete-mark the index tuple before performing the in-place update, then we will have two tuples in the index which point to the same heap TID.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits.
>
> No, I don't think that is the case, because we want to perform in-place updates for indexed-column updates. If we won't delete-mark the index tuple before performing the in-place update, then we will have two tuples in the index which point to the same heap TID.

How can an old MVCC snapshot that needs to find the heap tuple using some now-obsolete key values get to the heap tuple via an index scan if there are no index tuples that stick around until "recently dead" heap tuples become "fully dead"? How can you avoid keeping around both old and new index tuples at the same time?

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Amit Kapila
Date:
On Tue, Jun 19, 2018 at 11:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jun 19, 2018 at 4:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I imagine that retail index tuple deletion (the whole point of this project) would be run by a VACUUM-like process that kills tuples that are dead to everyone. Even with something like zheap, you cannot just delete index tuples until you establish that they're truly dead. I guess that the delete marking stuff that Robert mentioned marks tuples as dead when the deleting transaction commits.
>>
>> No, I don't think that is the case, because we want to perform in-place updates for indexed-column updates. If we won't delete-mark the index tuple before performing the in-place update, then we will have two tuples in the index which point to the same heap TID.
>
> How can an old MVCC snapshot that needs to find the heap tuple using some now-obsolete key values get to the heap tuple via an index scan if there are no index tuples that stick around until "recently dead" heap tuples become "fully dead"? How can you avoid keeping around both old and new index tuples at the same time?

Both values will be present in the index, but the old value will be delete-marked. It is correct that we can't remove the value (index tuple) from the index until it is truly dead (not visible to anyone), but during a delete or index-update operation, we need to traverse the index to mark the entries as delete-marked. At this stage, I don't want to go into too much detail about how delete-marking will happen in zheap, and I am also not sure this thread is the right place to discuss the details of that technology.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Jun 19, 2018 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Both values will be present in the index, but the old value will be delete-marked. It is correct that we can't remove the value (index tuple) from the index until it is truly dead (not visible to anyone), but during a delete or index-update operation, we need to traverse the index to mark the entries as delete-marked. At this stage, I don't want to go into too much detail about how delete-marking will happen in zheap, and I am also not sure this thread is the right place to discuss the details of that technology.

I don't understand, but okay. I can provide feedback once a design for delete marking is available.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Jun 14, 2018 at 11:44 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I attach an unfinished prototype of suffix truncation that also sometimes *adds* a new attribute in pivot tuples. It adds an extra heap TID from the leaf level when truncating away non-distinguishing attributes during a leaf page split, though only when it must. The patch also has nbtree treat heap TID as a first-class part of the key space of the index. Claudio wrote a patch that did something similar, though without the suffix truncation part [2] (I haven't studied his patch, to be honest). My patch is actually a very indirect spin-off of Anastasia's covering index patch, and I want to show what I have in mind now, while it's still swapped into my head. I won't do any serious work on this project unless and until I see a way to implement retail index tuple deletion, which seems like a multi-year project that requires the buy-in of multiple senior community members. On its own, my patch regresses performance unacceptably in some workloads, probably due to interactions with kill_prior_tuple()/LP_DEAD hint setting, and interactions with page space management when there are many "duplicates" (it can still help performance in some pgbench workloads with non-unique indexes, though).

I attach a revised version, which is still very much of prototype quality, but manages to solve a few of the problems that v1 had. Andrey Lepikhov (CC'd) asked me to post any improved version I might have for use with his retail index tuple deletion patch, so I thought I'd post what I have.

The main development for v2 is that the sort order of the implicit heap TID attribute is flipped. In v1, it was in "ascending" order. In v2, comparisons of heap TIDs are inverted to make the attribute order "descending". This has a number of advantages:

* It's almost consistent with the current behavior when there are repeated insertions of duplicates. Currently, this tends to result in page splits of the leftmost leaf page among pages that mostly consist of the same duplicated value. This means that the destabilizing impact on DROP SCHEMA ... CASCADE regression test output noted before [1] is totally eliminated. There is now only a single trivial change to regression test "expected" files, whereas in v1 dozens of "expected" files had to be changed, often resulting in less useful reports for the user.

* The performance regression I observed with various pgbench workloads seems to have gone away, or is now within the noise range. A patch like this one requires a lot of validation and testing, so this should be taken with a grain of salt.

I may have been too quick to give up on my original ambition of writing a stand-alone patch that can be justified entirely on its own merits, without being tied to some much more ambitious project like retail index tuple deletion by VACUUM, or zheap's deletion marking. I still haven't tried to replace the kludgey handling of unique index enforcement, even though that would probably have a measurable additional performance benefit. I think that this patch could become an unambiguous win.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com

--
Peter Geoghegan
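A minimal sketch of that inversion, again with invented toy types rather than PostgreSQL's ItemPointerData: user-visible attributes still sort ascending, but the heap TID tie-breaker is flipped so duplicates are kept in descending TID order.

    typedef struct ToyTid { unsigned block; unsigned offset; } ToyTid;

    static int
    toy_tid_cmp_asc(const ToyTid *a, const ToyTid *b)
    {
        if (a->block != b->block)
            return (a->block < b->block) ? -1 : 1;
        if (a->offset != b->offset)
            return (a->offset < b->offset) ? -1 : 1;
        return 0;
    }

    /* Tie-breaker applied only after every user-visible attribute compared equal. */
    static int
    toy_tid_tiebreak_desc(const ToyTid *scantid, const ToyTid *tuptid)
    {
        return -toy_tid_cmp_asc(scantid, tuptid);   /* note the inverted sign */
    }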
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
Attached is my v3, which has some significant improvements:

* The hinting for unique index inserters within _bt_findinsertloc() has been restored, more or less.

* Bug fix for the case where the left side of the split comes from the tuple being inserted. We need to pass this to _bt_suffix_truncate() as the left side of the split, which we previously failed to do. The amcheck coverage I've added allowed me to catch this issue during a benchmark. (I use amcheck during benchmarks to get some amount of stress-testing in.)

* New performance optimization that allows us to descend a downlink when its user-visible attributes have scankey-equal values. We avoid an unnecessary move left by using a sentinel scan tid that's less than any possible real heap TID, but still greater than minus infinity to _bt_compare().

I am now considering pursuing this as a project in its own right, which can be justified without being part of some larger effort to add retail index tuple deletion (e.g. by VACUUM). I think that I can get it to the point of being a totally unambiguous win, if I haven't already. So, this patch is no longer just an interesting prototype of a new architectural direction we should take. In any case, it has far fewer problems than v2.

Testing the performance characteristics of this patch has proven difficult. My home server seems to show a nice win with a pgbench workload that uses a Gaussian distribution for the pgbench_accounts queries (script attached). That seems consistent and reproducible. My home server has 32GB of RAM and a 250GB Samsung 850 EVO SSD. With shared_buffers set to 12GB, 80 minute runs at scale 4800 look like this:

Master:

25 clients:
tps = 15134.223357 (excluding connections establishing)

50 clients:
tps = 13708.419887 (excluding connections establishing)

75 clients:
tps = 12951.286926 (excluding connections establishing)

90 clients:
tps = 12057.852088 (excluding connections establishing)

Patch:

25 clients:
tps = 17857.863353 (excluding connections establishing)

50 clients:
tps = 14319.514825 (excluding connections establishing)

75 clients:
tps = 14015.794005 (excluding connections establishing)

90 clients:
tps = 12495.683053 (excluding connections establishing)

I ran this twice, and got pretty consistent results each time (there were many other benchmarks on my home server -- this was the only one that tested this exact patch, though). Note that there was only one pgbench initialization for each set of runs. It looks like a pretty strong result for the patch -- note that the accounts table is about twice the size of available main memory. The server is pretty well overloaded in every individual run.

Unfortunately, I have a hard time showing much of any improvement on a storage-optimized AWS instance with EBS storage, with scaled-up pgbench scale and main memory. I'm using an i3.4xlarge, which has 16 vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It appears to more or less make no overall difference there, for reasons that I have yet to get to the bottom of. I conceived this AWS benchmark as something that would have far longer run times with a scaled-up database size. My expectation was that it would confirm the preliminary result, but it hasn't.

Maybe the issue is that it's far harder to fill the I/O queue on this AWS instance? Or perhaps it's related to the higher latency of EBS, compared to the local SSD on my home server? I would welcome any ideas about how to benchmark the patch. It doesn't necessarily have to be a huge win for a very generic workload like the one I've tested, since it would probably still be enough of a win for things like free space management in secondary indexes [1]. Plus, of course, it seems likely that we're going to eventually add retail index tuple deletion in some form or another, which this is a prerequisite to.

For a project like this, I expect an unambiguous, across-the-board win from the committed patch, even if it isn't a huge win. I'm encouraged by the fact that this is starting to look credible as a stand-alone patch, but I have to admit that there are probably still significant gaps in my understanding of how it affects real-world performance. I don't have a lot of recent experience with benchmarking workloads like this one.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com

--
Peter Geoghegan
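One way to picture the sentinel scan tid mentioned above is as a tagged value, shown here with a toy type instead of PostgreSQL's ItemPointerData (purely illustrative): a truncated ("minus infinity") TID sorts below everything, the sentinel sorts above that but below every real heap TID, so an equal-keyed downlink can be descended without an unnecessary move left.

    typedef enum ToyTidKind { TID_MINUS_INFTY, TID_SENTINEL, TID_REAL } ToyTidKind;

    typedef struct ToyScanTid
    {
        ToyTidKind  kind;
        unsigned    block;      /* meaningful only when kind == TID_REAL */
        unsigned    offset;
    } ToyScanTid;

    static int
    toy_scantid_cmp(const ToyScanTid *a, const ToyScanTid *b)
    {
        if (a->kind != b->kind)
            return (a->kind < b->kind) ? -1 : 1;    /* enum order encodes the ranking */
        if (a->kind != TID_REAL)
            return 0;

        /*
         * Real-vs-real comparisons would use whichever TID sort order is in
         * effect; plain ascending order is shown here for simplicity.
         */
        if (a->block != b->block)
            return (a->block < b->block) ? -1 : 1;
        if (a->offset != b->offset)
            return (a->offset < b->offset) ? -1 : 1;
        return 0;
    }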
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
I use v3 version of the patch for a Retail Indextuple Deletion and from time to time i catch regression test error (see attachment). As i see in regression.diff, the problem is instability order of DROP ... CASCADE deletions. Most frequently i get error on a test called 'updatable views'. I check nbtree invariants during all tests, but index relations is in consistent state all time. My hypothesis is: instability order of logical duplicates in index relations on a pg_depend relation. But 'updatable views' test not contains any sources of instability: concurrent insertions, updates, vacuum and so on. This fact discourage me. May be you have any ideas on this problem? 18.07.2018 00:21, Peter Geoghegan пишет: > Attached is my v3, which has some significant improvements: > > * The hinting for unique index inserters within _bt_findinsertloc() > has been restored, more or less. > > * Bug fix for case where left side of split comes from tuple being > inserted. We need to pass this to _bt_suffix_truncate() as the left > side of the split, which we previously failed to do. The amcheck > coverage I've added allowed me to catch this issue during a benchmark. > (I use amcheck during benchmarks to get some amount of stress-testing > in.) > > * New performance optimization that allows us to descend a downlink > when its user-visible attributes have scankey-equal values. We avoid > an unnecessary move left by using a sentinel scan tid that's less than > any possible real heap TID, but still greater than minus infinity to > _bt_compare(). > > I am now considering pursuing this as a project in its own right, > which can be justified without being part of some larger effort to add > retail index tuple deletion (e.g. by VACUUM). I think that I can get > it to the point of being a totally unambiguous win, if I haven't > already. So, this patch is no longer just an interesting prototype of > a new architectural direction we should take. In any case, it has far > fewer problems than v2. > > Testing the performance characteristics of this patch has proven > difficult. My home server seems to show a nice win with a pgbench > workload that uses a Gaussian distribution for the pgbench_accounts > queries (script attached). That seems consistent and reproducible. My > home server has 32GB of RAM, and has a Samsung SSD 850 EVO SSD, with a > 250GB capacity. With shared_buffers set to 12GB, 80 minute runs at > scale 4800 look like this: > > Master: > > 25 clients: > tps = 15134.223357 (excluding connections establishing) > > 50 clients: > tps = 13708.419887 (excluding connections establishing) > > 75 clients: > tps = 12951.286926 (excluding connections establishing) > > 90 clients: > tps = 12057.852088 (excluding connections establishing) > > Patch: > > 25 clients: > tps = 17857.863353 (excluding connections establishing) > > 50 clients: > tps = 14319.514825 (excluding connections establishing) > > 75 clients: > tps = 14015.794005 (excluding connections establishing) > > 90 clients: > tps = 12495.683053 (excluding connections establishing) > > I ran this twice, and got pretty consistent results each time (there > were many other benchmarks on my home server -- this was the only one > that tested this exact patch, though). Note that there was only one > pgbench initialization for each set of runs. It looks like a pretty > strong result for the patch - note that the accounts table is about > twice the size of available main memory. The server is pretty well > overloaded in every individual run. 
> > Unfortunately, I have a hard time showing much of any improvement on a > storage-optimized AWS instance with EBS storage, with scaled up > pgbench scale and main memory. I'm using an i3.4xlarge, which has 16 > vCPUs, 122 GiB RAM, and 2 SSDs in a software RAID0 configuration. It > appears to more or less make no overall difference there, for reasons > that I have yet to get to the bottom of. I conceived this AWS > benchmark as something that would have far longer run times with a > scaled-up database size. My expectation was that it would confirm the > preliminary result, but it hasn't. > > Maybe the issue is that it's far harder to fill the I/O queue on this > AWS instance? Or perhaps its related to the higher latency of EBS, > compared to the local SSD on my home server? I would welcome any ideas > about how to benchmark the patch. It doesn't necessarily have to be a > huge win for a very generic workload like the one I've tested, since > it would probably still be enough of a win for things like free space > management in secondary indexes [1]. Plus, of course, it seems likely > that we're going to eventually add retail index tuple deletion in some > form or another, which this is prerequisite to. > > For a project like this, I expect an unambiguous, across the board win > from the committed patch, even if it isn't a huge win. I'm encouraged > by the fact that this is starting to look like credible as a > stand-alone patch, but I have to admit that there's probably still > significant gaps in my understanding of how it affects real-world > performance. I don't have a lot of recent experience with benchmarking > workloads like this one. > > [1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com > -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Aug 1, 2018 at 9:48 PM, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> I use v3 version of the patch for a Retail Indextuple Deletion and from time to time i catch regression test error (see attachment). As i see in regression.diff, the problem is instability order of DROP ... CASCADE deletions. Most frequently i get error on a test called 'updatable views'. I check nbtree invariants during all tests, but index relations is in consistent state all time. My hypothesis is: instability order of logical duplicates in index relations on a pg_depend relation. But 'updatable views' test not contains any sources of instability: concurrent insertions, updates, vacuum and so on. This fact discourage me. May be you have any ideas on this problem?

It's caused by an implicit dependency on the order of items in an index. See https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql.

I've been making "\set VERBOSITY terse" changes like this whenever it happens in a new place. It seems to have finally stopped happening. Note that this is a preexisting issue; there are already places in the regression tests where we paper over the problem in a similar way. I notice that it tends to happen when the machine running the regression tests is heavily loaded.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
Attached is v4. I have two goals in mind for this revision, goals that are of great significance to the project as a whole:

* Making better choices around leaf page split points, in order to maximize suffix truncation and thereby maximize fan-out. This is important when there are mostly-distinct index tuples on each leaf page (i.e. most of the time). Maximizing the effectiveness of suffix truncation needs to be weighed against the existing/main consideration: evenly distributing space among each half of a page split. This is tricky.

* Not regressing the logic that lets us pack leaf pages full when there are a great many logical duplicates. That is, I still want to get the behavior I described on the '"Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes' thread [1]. This is not something that happens as a consequence of thinking about suffix truncation specifically, and seems like a fairly distinct thing to me. It's actually a bit similar to the rightmost 90/10 page split case.

v4 adds significant new logic to make us do better on the first goal, without hurting the second goal. It's easy to regress one while focussing on the other, so I've leaned on a custom test suite throughout development. Previous versions mostly got the first goal wrong, but got the second goal right. For the time being, I'm focussing on index size, on the assumption that I'll be able to demonstrate a nice improvement in throughput or latency later. I can get the main TPC-C order_line pkey about 7% smaller after an initial bulk load with the new logic added for the first goal (note that the benefits with a fresh CREATE INDEX are close to zero). The index is significantly smaller, even though the internal page index tuples can themselves never be any smaller due to alignment -- this is all about not restricting what can go on each leaf page by too much. 7% is not as dramatic as the "get tired" case, which saw something like a 50% decrease in bloat for one pathological case, but it's still clearly well worth having. The order_line primary key is the largest TPC-C index, and I'm merely doing a standard bulk load to get this 7% shrinkage. The TPC-C order_line primary key happens to be kind of adversarial or pathological to B-Tree space management in general, but it's still fairly realistic.

For the first goal, page splits now weigh what I've called the "distance" between tuples, with a view to getting the most discriminating split point -- the leaf split point that maximizes the effectiveness of suffix truncation, within a range of acceptable split points (acceptable from the point of view of not implying a lopsided page split). This is based on probing IndexTuple contents naively when deciding on a split point, without regard for the underlying opclass/types. We mostly just use char integer comparisons to probe, on the assumption that that's a good enough proxy for using real insertion scankey comparisons (only actual truncation goes to those lengths, since that's a strict matter of correctness). This distance business might be considered a bit iffy by some, so I want to get early feedback. This new "distance" code clearly needs more work, but I felt that I'd gone too long without posting a new version.

For the second goal, I've added a new macro that can be enabled for debugging purposes. This has the implementation sort heap TIDs in ASC order, rather than DESC order. This nicely demonstrates how my two goals for v4 are fairly independent; uncommenting "#define BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many duplicates are inserted, but won't regress things like the TPC-C indexes. (Note that BTREE_ASC_HEAP_TID will break the regression tests, though in a benign way that can safely be ignored.)

Open items:

* Do more traditional benchmarking.

* Add pg_upgrade support.

* Simplify _bt_findsplitloc() logic.

[1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com

--
Peter Geoghegan
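A toy version of the byte-wise "distance" probe described above; the scoring is invented, and, as the message notes, plain byte ("char integer") comparisons are only a proxy for real insertion scankey comparisons.

    #include <stddef.h>

    /*
     * Score a candidate split point by probing the serialized key images of
     * the would-be last tuple on the left and first tuple on the right.  The
     * earlier the two diverge, the more of the suffix a new pivot could
     * truncate away, so such split points get a higher score.
     */
    static int
    toy_split_point_score(const unsigned char *lastleft, size_t leftlen,
                          const unsigned char *firstright, size_t rightlen)
    {
        size_t  minlen = (leftlen < rightlen) ? leftlen : rightlen;

        for (size_t i = 0; i < minlen; i++)
        {
            if (lastleft[i] != firstright[i])
                return (int) (minlen - i);
        }

        /* Identical prefixes: at best the length difference distinguishes them. */
        return (leftlen == rightlen) ? 0 : 1;
    }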
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
Attached is v5, which significantly simplifies the _bt_findsplitloc() logic. It's now a great deal easier to follow.

It would be helpful if someone could do code-level review of the overhauled _bt_findsplitloc(). That's the most important part of the patch. It involves relatively subjective trade-offs around total effort spent during a page split, space utilization, and avoiding "false sharing". (I call the situation where a range of duplicate values straddles two leaf pages unnecessarily "false sharing", since it necessitates that subsequent index scans visit two leaf pages rather than just one, even when that's avoidable.)

This version has slightly improved performance, especially for cases where an index gets bloated without any garbage being generated. With the UK land registry data [1], an index on (county, city, locality) is shrunk by just over 18% by the new logic (I recall that it was shrunk by ~15% in an earlier version). In concrete terms, it goes from being 1.288 GiB on master to being 1.054 GiB with v5 of the patch. This is mostly because the patch intelligently packs together duplicate-filled pages tightly (in particular, it avoids "getting tired"), but also because it makes pivots less restrictive about where leaf tuples can go. I still manage to shrink the largest TPC-C and TPC-H indexes by at least 5% following an initial load performed by successive INSERTs. Those are unique indexes, so the benefits are certainly not limited to cases involving many duplicates.

3 modes
-------

My new approach is to teach _bt_findsplitloc() 3 distinct modes of operation: regular/default mode, many duplicates mode, and single value mode. The higher level split code always asks for a default mode call to _bt_findsplitloc(), so that's always where we start. For leaf page splits, _bt_findsplitloc() will occasionally call itself recursively in either many duplicates mode or single value mode. This happens when the default strategy doesn't work out.

* Default mode almost does what we do already, but remembers the top n candidate split points, sorted by the delta between left and right post-split free space, rather than just looking for the overall lowest delta split point. Then, we go through a second pass over the temp array of "acceptable" split points, that considers the needs of suffix truncation.

* Many duplicates mode is used when we fail to find a "distinguishing" split point in regular mode, but have determined that it's possible to get one if a new, exhaustive search is performed. We go to great lengths to avoid having to append a heap TID to the new left page high key -- that's what I mean by "distinguishing". We're particularly concerned with false sharing by subsequent point lookup index scans here.

* Single value mode is used when we see that even many duplicates mode would be futile, as the leaf page is already *entirely* full of logical duplicates. Single value mode isn't exhaustive, since there is clearly nothing to exhaustively search for. Instead, it packs together as many tuples as possible on the right side of the split. Since heap TIDs sort in descending order, this is very much like a "leftmost" split that tries to free most of the space on the left side, and pack most of the page contents on the right side. Except that it's leftmost, and in particular is leftmost among pages full of logical duplicates (as opposed to being leftmost/rightmost among pages on an entire level of the tree, as with the traditional rightmost 90:10 split thing).

Other changes
-------------

* I now explicitly use fillfactor in the manner of a rightmost split to get the single value mode behavior. I call these types of splits (rightmost and single value mode splits) "weighted" splits in the patch. This is much more consistent with our existing conventions than my previous approach.

* Improved approach to inexpensively determining how effective suffix truncation will be for a given candidate split point. I no longer naively probe the contents of index tuples to do char comparisons. Instead, I use a tuple descriptor to get offsets to each attribute in each tuple in turn, then call datumIsEqual() to determine if they're equal. This is almost as good as a full scan key comparison. This actually seems to be a bit faster, and also takes care of INCLUDE indexes without special care (no need to worry about probing non-key attributes, and reaching a faulty conclusion about which split point helps with suffix truncation).

I still haven't managed to add pg_upgrade support, but that's my next step. I am more or less happy with the substance of the patch in v5, and feel that I can now work backwards towards figuring out the best way to deal with on-disk compatibility. It shouldn't be too hard -- most of the effort will involve coming up with a good test suite.

[1] https://wiki.postgresql.org/wiki/Sample_Databases

--
Peter Geoghegan
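The datumIsEqual() technique described under "Other changes" might look roughly like this (assumes PostgreSQL backend headers; the helper name and signature are illustrative, not the patch's actual code):

    #include "postgres.h"
    #include "access/itup.h"
    #include "access/tupdesc.h"
    #include "utils/datum.h"
    #include "utils/rel.h"

    /*
     * Count how many leading key attributes of the last tuple on the left and
     * the first tuple on the right are binary-equal.  A truncated pivot would
     * still need to keep the attribute just after that count.
     */
    static int
    toy_count_equal_leading_atts(Relation rel, IndexTuple lastleft,
                                 IndexTuple firstright, int nkeyatts)
    {
        TupleDesc   itupdesc = RelationGetDescr(rel);
        int         nequal = 0;

        for (int attnum = 1; attnum <= nkeyatts; attnum++)
        {
            Form_pg_attribute att = TupleDescAttr(itupdesc, attnum - 1);
            bool        isnull1;
            bool        isnull2;
            Datum       datum1 = index_getattr(lastleft, attnum, itupdesc, &isnull1);
            Datum       datum2 = index_getattr(firstright, attnum, itupdesc, &isnull2);

            if (isnull1 || isnull2)
            {
                /* Treat NULLs conservatively: equal only if both are NULL. */
                if (isnull1 != isnull2)
                    break;
            }
            else if (!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
                break;

            nequal++;
        }
        return nequal;
    }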
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
I use the v5 version in quick vacuum strategy and in the heap&index cleaner (see new patches at corresponding thread a little bit later). It works fine and give quick vacuum 2-3% performance growup in comparison with version v3 at my 24-core test server. Note, that the interface of _bt_moveright() and _bt_binsrch() functions with combination of scankey, scantid and nextkey parameters is too semantic loaded. Everytime of code reading i spend time to remember, what this functions do exactly. May be it needed to rewrite comments. For example, _bt_moveright() comments may include phrase: nextkey=false: Traverse to the next suitable index page if the current page does not contain the value (scan key; scan tid). What do you think about submitting the patch to the next CF? 12.09.2018 23:11, Peter Geoghegan пишет: > Attached is v4. I have two goals in mind for this revision, goals that > are of great significance to the project as a whole: > > * Making better choices around leaf page split points, in order to > maximize suffix truncation and thereby maximize fan-out. This is > important when there are mostly-distinct index tuples on each leaf > page (i.e. most of the time). Maximizing the effectiveness of suffix > truncation needs to be weighed against the existing/main > consideration: evenly distributing space among each half of a page > split. This is tricky. > > * Not regressing the logic that lets us pack leaf pages full when > there are a great many logical duplicates. That is, I still want to > get the behavior I described on the '"Write amplification" is made > worse by "getting tired" while inserting into nbtree secondary > indexes' thread [1]. This is not something that happens as a > consequence of thinking about suffix truncation specifically, and > seems like a fairly distinct thing to me. It's actually a bit similar > to the rightmost 90/10 page split case. > > v4 adds significant new logic to make us do better on the first goal, > without hurting the second goal. It's easy to regress one while > focussing on the other, so I've leaned on a custom test suite > throughout development. Previous versions mostly got the first goal > wrong, but got the second goal right. For the time being, I'm > focussing on index size, on the assumption that I'll be able to > demonstrate a nice improvement in throughput or latency later. I can > get the main TPC-C order_line pkey about 7% smaller after an initial > bulk load with the new logic added to get the first goal (note that > the benefits with a fresh CREATE INDEX are close to zero). The index > is significantly smaller, even though the internal page index tuples > can themselves never be any smaller due to alignment -- this is all > about not restricting what can go on each leaf page by too much. 7% is > not as dramatic as the "get tired" case, which saw something like a > 50% decrease in bloat for one pathological case, but it's still > clearly well worth having. The order_line primary key is the largest > TPC-C index, and I'm merely doing a standard bulk load to get this 7% > shrinkage. The TPC-C order_line primary key happens to be kind of > adversarial or pathological to B-Tree space management in general, but > it's still fairly realistic. 
> > For the first goal, page splits now weigh what I've called the > "distance" between tuples, with a view to getting the most > discriminating split point -- the leaf split point that maximizes the > effectiveness of suffix truncation, within a range of acceptable split > points (acceptable from the point of view of not implying a lopsided > page split). This is based on probing IndexTuple contents naively when > deciding on a split point, without regard for the underlying > opclass/types. We mostly just use char integer comparisons to probe, > on the assumption that that's a good enough proxy for using real > insertion scankey comparisons (only actual truncation goes to those > lengths, since that's a strict matter of correctness). This distance > business might be considered a bit iffy by some, so I want to get > early feedback. This new "distance" code clearly needs more work, but > I felt that I'd gone too long without posting a new version. > > For the second goal, I've added a new macro that can be enabled for > debugging purposes. This has the implementation sort heap TIDs in ASC > order, rather than DESC order. This nicely demonstrates how my two > goals for v4 are fairly independent; uncommenting "#define > BTREE_ASC_HEAP_TID" will cause a huge regression with cases where many > duplicates are inserted, but won't regress things like the TPC-C > indexes. (Note that BTREE_ASC_HEAP_TID will break the regression > tests, though in a benign way can safely be ignored.) > > Open items: > > * Do more traditional benchmarking. > > * Add pg_upgrade support. > > * Simplify _bt_findsplitloc() logic. > > [1] https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com > -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Sep 19, 2018 at 9:56 PM, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> Note, that the interface of _bt_moveright() and _bt_binsrch() functions with combination of scankey, scantid and nextkey parameters is too semantic loaded. Everytime of code reading i spend time to remember, what this functions do exactly. May be it needed to rewrite comments.

I think that it might be a good idea to create a "BTInsertionScankey" struct, or similar, since keysz, nextkey, the scankey array and now scantid are all part of that, and are all common to these 4 or so functions. It could have a flexible array at the end, so that we still only need a single palloc(). I'll look into that.

> What do you think about submitting the patch to the next CF?

Clearly the project that you're working on is a difficult one. It's easy for me to understand why you might want to take an iterative approach, with lots of prototyping. Your patch needs attention to advance, and IMV the CF is the best way to get that attention. So, I think that it would be fine to go submit it now. I must admit that I didn't even notice that your patch lacked a CF entry. Everyone has a different process, perhaps.

--
Peter Geoghegan
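A hypothetical layout for the struct floated above -- nothing here is committed code, just a sketch of the single-palloc() idea with a flexible array member:

    #include "postgres.h"
    #include "access/skey.h"
    #include "storage/itemptr.h"

    /* Sketch only: bundles everything an insertion scan key needs. */
    typedef struct BTInsertionScankey
    {
        bool            nextkey;        /* land on the first key > scankey? */
        bool            has_scantid;    /* is scantid valid? */
        ItemPointerData scantid;        /* heap TID tie-breaker, if any */
        int             keysz;          /* number of entries in scankeys[] */
        ScanKeyData     scankeys[FLEXIBLE_ARRAY_MEMBER];
    } BTInsertionScankey;

    /* A single allocation covers the header plus keysz scan key entries. */
    static BTInsertionScankey *
    toy_alloc_insertion_scankey(int keysz)
    {
        BTInsertionScankey *key;

        key = palloc(offsetof(BTInsertionScankey, scankeys) +
                     keysz * sizeof(ScanKeyData));
        key->nextkey = false;
        key->has_scantid = false;
        key->keysz = keysz;
        return key;
    }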
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <pg@bowt.ie> wrote:
> 3 modes
> -------
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of operation: regular/default mode, many duplicates mode, and single value mode.

I think that I'll have to add a fourth mode, since I came up with another strategy that is really effective though totally complementary to the other 3 -- "multiple insertion point" mode. Credit goes to Kevin Grittner for pointing out that this technique exists about 2 years ago [1].

The general idea is to pick a split point just after the insertion point of the new item (the incoming tuple that prompted a page split) when it looks like there are localized monotonically increasing ranges. This is like a rightmost 90:10 page split, except the insertion point is not at the rightmost page on the level -- it's rightmost within some local grouping of values. This makes the two largest TPC-C indexes *much* smaller. Previously, they were shrunk by a little over 5% by using the new generic strategy, a win that now seems like small potatoes. With this new mode, TPC-C's order_line primary key, which is the largest index of all, is ~45% smaller following a standard initial bulk load at scalefactor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020 blocks (429.84 MiB). It's actually slightly smaller than it would be after a fresh REINDEX with the new strategy. We see almost as big a win with the second largest TPC-C index, the stock table's primary key -- it's ~40% smaller.

Here is the definition of the biggest index, the order_line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
      Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically increasing inserts on ol_number (an order's item number), but those are grouped by order. It's kind of an adversarial case for our existing implementation, and yet it seems like it's probably a fairly common scenario in the real world.

Obviously these are very significant improvements. They really exceed my initial expectations for the patch. TPC-C is generally considered to be by far the most influential database benchmark of all time, and this is something that we need to pay more attention to. My sense is that the TPC-C benchmark is deliberately designed to almost require that the system under test have this "multiple insertion point" B-Tree optimization, suffix truncation, etc. This is exactly the same index that we've seen reports of out of control bloat on when people run TPC-C over hours or days [2].

My next task is to find heuristics to make the new page split mode/strategy kick in when it's likely to help, but not kick in when it isn't (when we want something close to a generic 50:50 page split). These heuristics should look similar to what I've already done to get cases with lots of duplicates to behave sensibly. Anyone have any ideas on how to do this? I might end up inferring a "multiple insertion point" case from the fact that there are multiple pass-by-value attributes for the index, with the new/incoming tuple having distinct-to-immediate-left-tuple attribute values for the last column, but not the first few. It also occurs to me to consider the fragmentation of the page as a guide, though I'm less sure about that. I'll probably need to experiment with a variety of datasets before I settle on something that looks good. Forcing the new strategy without considering any of this actually works surprisingly well on cases where you'd think it wouldn't, since a 50:50 page split is already something of a guess about where future insertions will end up.

[1] https://postgr.es/m/CACjxUsN5fV0kV=YirXwA0S7LqoOJuy7soPtipDhUCemhgwoVFg@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/

--
Peter Geoghegan
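A toy model of the heuristic being speculated about here, with invented thresholds and names (integer key columns only, for brevity): if the incoming tuple matches its immediate left neighbour on every key column except the last, treat it as a localized ascending insertion pattern and split just after the insertion point, in the style of a rightmost 90:10 split; otherwise fall back on a generic 50:50 split.

    #include <stdbool.h>

    static bool
    toy_looks_like_local_ascending_insert(const int *newtup, const int *lefttup,
                                          int natts)
    {
        if (natts < 2)
            return false;
        for (int i = 0; i < natts - 1; i++)
            if (newtup[i] != lefttup[i])
                return false;                   /* grouping columns must match */
        return newtup[natts - 1] > lefttup[natts - 1];  /* last column ascends */
    }

    /* Candidate split offset: everything up to and including the new tuple goes left. */
    static int
    toy_choose_split_offset(int insert_offset, int nitems, bool local_ascending)
    {
        if (local_ascending)
            return insert_offset;               /* split right after the insertion point */
        return nitems / 2;                      /* otherwise, a plain 50:50 split */
    }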
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Eisentraut
Date:
On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So. I don't know much about the btree code, so don't believe anything I say. I was very interested in the bloat test case that you posted on 2018-07-09 and I tried to understand it more. The current method for inserting a duplicate value into a btree is to go to the leftmost point for that value and then move right until we find some space or get "tired" of searching, in which case we just make some space right there. The problem is that it's tricky to decide when to stop searching, and there are scenarios when we stop too soon and repeatedly miss all the good free space to the right, leading to bloat even though the index is perhaps quite empty. I tried playing with the getting-tired factor (it could plausibly be a reloption), but that wasn't very successful. You can use that to postpone the bloat, but you won't stop it, and performance becomes terrible. You propose to address this by appending the tid to the index key, so each key, even if its "payload" is a duplicate value, is unique and has a unique place, so we never have to do this "tiresome" search. This makes a lot of sense, and the results in the bloat test you posted are impressive and reproducible. I tried a silly alternative approach by placing a new duplicate key in a random location. This should be equivalent since tids are effectively random. I didn't quite get this to fully work yet, but at least it doesn't blow up, and it gets the same regression test ordering differences for pg_depend scans that you are trying to paper over. ;-) As far as the code is concerned, I agree with Andrey Lepikhov that one more abstraction layer that somehow combines the scankey and the tid or some combination like that would be useful, instead of passing the tid as a separate argument everywhere. I think it might help this patch move along if it were split up a bit, for example 1) suffix truncation, 2) tid stuff, 3) new split strategies. That way, it would also be easier to test out each piece separately. For example, how much space does suffix truncation save in what scenario, are there any performance regressions, etc. In the last few versions, the patches have still been growing significantly in size and functionality, and most of the supposed benefits are not readily visible in tests. And of course we need to think about how to handle upgrades, but you have already started a separate discussion about that.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
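For readers who have not looked at the nbtree code, the "getting tired" behavior being discussed can be paraphrased roughly as follows. This is a simplified sketch over a toy page abstraction, not PostgreSQL's actual code (the real logic lives in _bt_findinsertloc() in nbtinsert.c), and the probability shown is only approximate:

#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a leaf page full of duplicates of the new key */
typedef struct ToyPage
{
    struct ToyPage *right;      /* right sibling, NULL if rightmost */
    int         free_space;     /* usable bytes left on this page */
    bool        highkey_equals_newkey;  /* could the key also go right? */
} ToyPage;

/*
 * Starting from the leftmost page that could hold the new duplicate, keep
 * moving right while the current page is full and the key could equally
 * well go on the next page -- but give up ("get tired") with a small
 * probability on each step, splitting the current page instead, so that a
 * long run of full duplicate pages doesn't have to be walked every time.
 */
static ToyPage *
find_insert_page(ToyPage *page, int new_item_size)
{
    while (page->free_space < new_item_size)
    {
        if (page->right == NULL ||
            !page->highkey_equals_newkey ||
            rand() % 100 == 0)  /* roughly a 1% chance of giving up */
            break;              /* caller will split this page */
        page = page->right;
    }
    return page;
}

Stopping too soon, as the review above describes, corresponds to the early "give up" branch firing while plenty of free space still exists further to the right.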
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> So. I don't know much about the btree code, so don't believe anything I
> say.

I think that showing up and reviewing this patch makes you somewhat of an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily as a project to make B-Trees less bloated when there are many duplicates, with maybe as many as a dozen or more secondary benefits. That's what I thought it would say in the release notes, even though the patch was always a broader strategic thing. Now I think that the TPC-C multiple insert point bloat issue might be the primary headline benefit, though. I hate to add more complexity to get it to work well, but just look at how much smaller the indexes are following an initial bulk load (bulk insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to do TPC-C are listed, so you're seeing the full picture for TPC-C bulk loading here (actually, there is another index that has an identical definition to oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which is omitted as redundant). As you can see, all the largest indexes are *significantly* smaller, with the exception of oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this improvement until I post the next version, though, since this is a brand new development. Note that VACUUM hasn't been run at all, and doesn't need to be run, as there are no dead tuples. Note also that this has *nothing* to do with getting tired -- almost all of these indexes are unique indexes. Note that I'm also testing TPC-E and TPC-H in a very similar way, which have both been improved noticeably, but to a degree that's much less compelling than what we see with TPC-C. They have "getting tired" cases that benefit quite a bit, but those are the minority. Have you ever used HammerDB? I got this data from oltpbench, but I think that HammerDB might be the way to go with TPC-C testing Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search. This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location. This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well, because while the TIDs may seem to be random in some sense, they have various properties that are very useful from a high level, data life cycle point of view.
For insertions of duplicates, heap TID has temporal locality -- you are only going to dirty one or two leaf pages, rather than potentially dirtying dozens or hundreds. Furthermore, heap TID is generally strongly correlated with primary key values, so VACUUM can be much much more effective at killing duplicates in low cardinality secondary indexes when there are DELETEs with a range predicate on the primary key. This is a lot more realistic than the 2018-07-09 test case, but it still could make as big of a difference. > I didn't quite get this to fully work yet, but at least it > doesn't blow up, and it gets the same regression test ordering > differences for pg_depend scans that you are trying to paper over. ;-) FWIW, I actually just added to the papering over, rather than creating a new problem. There are plenty of instances of "\set VERBOSITY terse" in the regression tests already, for the same reason. If you run the regression tests with ignore_system_indexes=on, there are very similar failures [1]. > As far as the code is concerned, I agree with Andrey Lepikhov that one > more abstraction layer that somehow combines the scankey and the tid or > some combination like that would be useful, instead of passing the tid > as a separate argument everywhere. I've already drafted this in my working copy. It is a clear improvement. You can expect it in the next version. > I think it might help this patch move along if it were split up a bit, > for example 1) suffix truncation, 2) tid stuff, 3) new split strategies. > That way, it would also be easier to test out each piece separately. > For example, how much space does suffix truncation save in what > scenario, are there any performance regressions, etc. I'll do my best. I don't think I can sensibly split out suffix truncation from the TID stuff -- those seem truly inseparable, since my mental model for suffix truncation breaks without fully unique keys. I can break out all the cleverness around choosing a split point into its own patch, though -- _bt_findsplitloc() has only been changed to give weight to several factors that become important. It's the "brain" of the optimization, where 90% of the complexity actually lives. Removing the _bt_findsplitloc() changes will make the performance of the other stuff pretty poor, and in particular will totally remove the benefit for cases that "become tired" on the master branch. That could be slightly interesting, I suppose. > In the last few > versions, the patches have still been growing significantly in size and > functionality, and most of the supposed benefits are not readily visible > in tests. I admit that this patch has continued to evolve up until this week, despite the fact that I thought it would be a lot more settled by now. It has actually become simpler in recent months, though. And, I think that the results justify the iterative approach I've taken. This stuff is inherently very subtle, and I've had to spend a lot of time paying attention to tiny regressions across a fairly wide variety of test cases. > And of course we need to think about how to handle upgrades, but you > have already started a separate discussion about that. Right. [1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-MvcM_pHN_w@mail.gmail.com -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
28.09.2018 23:08, Peter Geoghegan wrote:
> On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> I think it might help this patch move along if it were split up a bit,
>> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
>> That way, it would also be easier to test out each piece separately.
>> For example, how much space does suffix truncation save in what
>> scenario, are there any performance regressions, etc.
>
> I'll do my best. I don't think I can sensibly split out suffix
> truncation from the TID stuff -- those seem truly inseparable, since
> my mental model for suffix truncation breaks without fully unique
> keys. I can break out all the cleverness around choosing a split point
> into its own patch, though -- _bt_findsplitloc() has only been changed
> to give weight to several factors that become important. It's the
> "brain" of the optimization, where 90% of the complexity actually
> lives.
>
> Removing the _bt_findsplitloc() changes will make the performance of
> the other stuff pretty poor, and in particular will totally remove the
> benefit for cases that "become tired" on the master branch. That could
> be slightly interesting, I suppose.

I am reviewing this patch too, and I join Peter Eisentraut's opinion about splitting the patch into a hierarchy of two or three patches: "functional" (the TID stuff) and "optimizational" (suffix truncation & splitting). My reasons are simplification of code review, investigation and benchmarking. Right now the benchmarking picture is not clear: possible performance degradation from TID ordering interferes with the positive effects of the optimizations in a non-trivial way.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Sep 28, 2018 at 10:58 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> I am reviewing this patch too, and I join Peter Eisentraut's opinion
> about splitting the patch into a hierarchy of two or three patches:
> "functional" (the TID stuff) and "optimizational" (suffix truncation &
> splitting). My reasons are simplification of code review, investigation
> and benchmarking.

As I mentioned to Peter, I don't think that I can split out the heap TID stuff from the suffix truncation stuff. At least not without making the patch even more complicated, for no benefit. I will split out the "brain" of the patch (the _bt_findsplitloc() stuff, which decides on a split point using sophisticated rules) from the "brawn" (the actual changes to how index scans work, including the heap TID stuff, as well as the code for actually physically performing suffix truncation). The brain of the patch is where most of the complexity is, as well as most of the code. The brawn of the patch is _totally unusable_ without intelligence around split points, but I'll split things up along those lines anyway. Doing so should make the whole design a little easier to follow.

> Right now the benchmarking picture is not clear: possible performance
> degradation from TID ordering interferes with the positive effects of
> the optimizations in a non-trivial way.

Is there any evidence of a regression in the last 2 versions? I've been using pgbench, which didn't show any. That's not a sympathetic case for the patch, though it would be nice to confirm if there was some small improvement there. I've seen contradictory results (slight improvements and slight regressions), but that was with a much earlier version, so it just isn't relevant now. pgbench is mostly interesting as a thing that we want to avoid regressing. Once I post the next version, it would be great if somebody could use HammerDB's OLTP test, which seems like the best fair use implementation of TPC-C that's available. I would like to make that the "this is why you should care, even if you happen to not believe in the patch's strategic importance" benchmark. TPC-C is clearly the most influential database benchmark ever, so I think that that's a fair request. (See the TPC-C commentary at https://www.hammerdb.com/docs/ch03s02.html, for example.)

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Sep 30, 2018 at 2:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Right now the benchmarking picture is not clear: possible performance
> > degradation from TID ordering interferes with the positive effects of
> > the optimizations in a non-trivial way.
>
> Is there any evidence of a regression in the last 2 versions?

I did find a pretty clear regression, though only with writes to unique indexes. Attached is v6, which fixes the issue. More on that below. v6 also:

* Adds a new-to-v6 "insert at new item's insertion point" optimization, which is broken out into its own commit. This *greatly* improves the index bloat situation with the TPC-C benchmark in particular, even before the benchmark starts (just with the initial bulk load). See the relevant commit message for full details, or a couple of my previous mails on this thread. I will provide my own TPC-C test data + test case to any reviewer that wants to see this for themselves. It shouldn't be hard to verify the improvement in raw index size with any TPC-C implementation, though. Please make an off-list request if you're interested. The raw dump is 1.8GB.

The exact details of when this new optimization kicks in and how it works are tentative. They should really be debated. Reviewers should try to think of edge cases in which my "heap TID adjacency" approach could make the optimization kick in when it shouldn't -- cases where it causes bloat rather than preventing it. I couldn't find any such regressions, but this code was written very recently. I should also look into using HammerDB to do a real TPC-C benchmark, and really put the patch to the test...anybody have experience with it?

* Generally groups everything into a relatively manageable series of cumulative improvements, starting with the infrastructure required to physically truncate tuples correctly, without any of the smarts around selecting a split point. The base patch is useless on its own, since it's just necessary to have the split point selection smarts to see a consistent benefit. Reviewers shouldn't waste their time doing any real benchmarking with just the first patch applied.

* Adds a lot of new information to the nbtree README, about the high-level thought process behind the design, including citing the classic paper that this patch was primarily inspired by.

* Adds a new, dedicated insertion scan key struct -- BTScanInsert[Data]. This is passed around to a number of different routines (_bt_search(), _bt_binsrch(), _bt_compare(), etc). This was suggested by Andrey, and also requested by Peter Eisentraut. While this BTScanInsert work started out as straightforward refactoring, it actually led to my discovering and fixing the regression I mentioned. Previously, I passed a lower bound on a binary search to _bt_binsrch() within _bt_findinsertloc(). This wasn't nearly as effective as what the master branch does for unique indexes at the same point -- it usually manages to reuse a result from an earlier _bt_binsrch() as the offset for the new tuple, since it has no need to worry about the new tuple's position *among duplicates* on the page. In earlier versions of my patch, most of the work of a second binary search took place, despite being redundant and unnecessary. This happened for every new insertion into a non-unique index -- I could easily measure the problem with a simple serial test case. I can see no regression there against master now, though.
My fix for the regression involves including some mutable state in the new BTScanInsert struct (within v6-0001-*patch), to explicitly remember and restore some internal details across two binary searches against the same leaf page. We now remember a useful lower *and* upper bound within _bt_binsrch(), which is what is truly required to fix the regression. While there is still a second call to _bt_binsrch() within _bt_findinsertloc() for unique indexes, it will do no comparisons in the common case where there are no existing dead duplicate tuples in the unique index. This means that the number of _bt_compare() calls we get in this _bt_findinsertloc() unique index path is the same as the master branch in almost all cases (I instrumented the regression tests to make sure of this). I also think that having BTScanInsert will ease things around pg_upgrade support, something that remains an open item. Changes in this area seem to make everything clearer -- the signature of _bt_findinsertloc() seemed a bit jumbled to me.

Aside: I think that this BTScanInsert mutable state idea could be pushed even further in the future. "Dynamic prefix truncation" could be implemented by taking a similar approach when descending composite indexes for an index scan (doesn't have to be a unique index). We can observe that earlier attributes must all be equal to our own scankey's values once we descend the tree and pass between a pair of pivot tuples where a common prefix (some number of leading attributes) is fully equal. It's safe to just not bother comparing these prefix attributes on lower levels, because we can reason about their values transitively; _bt_compare() can be told to always skip the first attribute or two during later/lower-in-the-tree binary searches. This idea will not be implemented for Postgres v12 by me, though.

--
Peter Geoghegan
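The "remember and restore bounds across two binary searches" idea described above can be illustrated with a self-contained toy over a sorted array of (key, tid) pairs. The low/high bookkeeping loosely mirrors the fields discussed in this thread, but everything else is made up for illustration and is not the patch's code:

typedef struct ToyItem
{
    int         key;
    int         tid;            /* stand-in for the heap TID tiebreaker */
} ToyItem;

typedef struct ToyInsertState
{
    int         low;            /* first offset whose key >= search key */
    int         high;           /* first offset whose key > search key */
} ToyInsertState;

/*
 * First binary search: ignore the TID, but remember the interval that
 * brackets every item with an equal key.
 */
static void
toy_first_binsrch(const ToyItem *items, int nitems, int key,
                  ToyInsertState *state)
{
    int         low = 0,
                high = nitems;

    while (low < high)          /* lower bound of the equal-key run */
    {
        int         mid = low + (high - low) / 2;

        if (items[mid].key < key)
            low = mid + 1;
        else
            high = mid;
    }
    state->low = low;

    high = nitems;              /* upper bound of the equal-key run */
    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        if (items[mid].key <= key)
            low = mid + 1;
        else
            high = mid;
    }
    state->high = low;
}

/*
 * Second binary search: now that the heap TID participates, find the exact
 * insertion point -- but only within the remembered interval, so the work
 * already done by the first search is not repeated.
 */
static int
toy_second_binsrch(const ToyItem *items, int key, int tid,
                   const ToyInsertState *state)
{
    int         low = state->low,
                high = state->high;

    while (low < high)
    {
        int         mid = low + (high - low) / 2;

        if (items[mid].key < key ||
            (items[mid].key == key && items[mid].tid < tid))
            low = mid + 1;
        else
            high = mid;
    }
    return low;                 /* offset at which to insert (key, tid) */
}

In a unique index with no dead duplicates, the remembered interval is empty or a single slot, so the second search does little or no comparison work -- which is the effect described above.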
Attachment
- v6-0005-DEBUG-Add-pageinspect-instrumentation.patch
- v6-0006-DEBUG-Allow-nbtree-to-use-ASC-heap-TID-order.patch
- v6-0003-Add-split-at-new-tuple-page-split-optimization.patch
- v6-0004-DEBUG-Add-page-split-instrumentation.patch
- v6-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch
- v6-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark these past few weeks, which has been very frustrating -- the picture is very mixed. I'm testing a patch that has evolved from v6, but isn't too different. In one way, the patch does exactly what it's supposed to do when these benchmarks are run: it leaves indexes *significantly* smaller than the master branch will on the same (rate-limited) workload, without affecting the size of tables in any noticeable way. The numbers that I got from my much earlier synthetic single client benchmark mostly hold up. For example, the stock table's primary key is about 35% smaller, and the order line index is only about 20% smaller relative to master, which isn't quite as good as in the synthetic case, but I'll take it (this is all because of the v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).

However, despite significant effort, and despite the fact that the index shrinking is reliable, I cannot yet consistently show an improvement in either transaction throughput or transaction latency. I can show a nice improvement in latency on a slightly-rate-limited TPC-C workload when backend_flush_after=0 (something like a 40% reduction on average), but that doesn't hold up when oltpbench isn't rate-limited and/or has backend_flush_after set. Usually, there is a 1% - 2% regression, despite the big improvements in index size, and despite the big reduction in the amount of buffers that backends must write out themselves.

The obvious explanation is that throughput is decreased due to our doing extra work (truncation) while under an exclusive buffer lock. However, I've worked hard on that, and, as I said, I can sometimes observe a nice improvement in latency. This makes me doubt the obvious explanation. My working theory is that this has something to do with shared_buffers eviction. Maybe we're making worse decisions about which buffer to evict, or maybe the scalability of eviction is hurt. Perhaps both. You can download results from a recent benchmark to get some sense of this. It includes latency and throughput graphs, plus detailed statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I think that this is fixable, since the picture for the patch is very positive, provided you only focus on bgwriter/checkpoint activity and on-disk sizes. It seems likely that there is a very specific gap in my understanding of how the patch affects buffer cleaning.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andres Freund
Date:
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying? backend_flush_after=0 obviously is the default, so I'm not clear on that. How large is the database here, and how high is shared_buffers?

> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there are new / prolonged cases where a buffer is read from disk after the patch? Because that might require doing *write* IO when evicting the previous contents of the victim buffer, and obviously that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a buffer while already holding another lwlock. That shouldn't be too hard.

> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
>
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear on which runs are what here. I assume "public" is your patchset, and master is master? Do you reset the stats in between runs?

Greetings,

Andres Freund
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
shared_buffers is 10GB, IIRC. The server has 32GB of memory. Yes, 'public' is the patch case. Sorry for not mentioning it initially.
--
Peter Geoghegan
(Sent from my phone)
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that. How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's backend_flush_after=0 in the benchmark I posted. See the "postgres*settings" files. On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
                      List of relations
 Schema │    Name    │ Type  │ Owner │   Size   │ Description
────────┼────────────┼───────┼───────┼──────────┼─────────────
 public │ customer   │ table │ pg    │ 4757 MB  │
 public │ district   │ table │ pg    │ 5240 kB  │
 public │ history    │ table │ pg    │ 1442 MB  │
 public │ item       │ table │ pg    │ 10192 kB │
 public │ new_order  │ table │ pg    │ 140 MB   │
 public │ oorder     │ table │ pg    │ 1185 MB  │
 public │ order_line │ table │ pg    │ 19 GB    │
 public │ stock      │ table │ pg    │ 9008 MB  │
 public │ warehouse  │ table │ pg    │ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
                                        List of relations
 Schema │                 Name                 │ Type  │ Owner │   Table    │  Size   │ Description
────────┼──────────────────────────────────────┼───────┼───────┼────────────┼─────────┼─────────────
 public │ customer_pkey                        │ index │ pg    │ customer   │ 367 MB  │
 public │ district_pkey                        │ index │ pg    │ district   │ 600 kB  │
 public │ idx_customer_name                    │ index │ pg    │ customer   │ 564 MB  │
 public │ idx_order                            │ index │ pg    │ oorder     │ 715 MB  │
 public │ item_pkey                            │ index │ pg    │ item       │ 2208 kB │
 public │ new_order_pkey                       │ index │ pg    │ new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg    │ oorder     │ 715 MB  │
 public │ oorder_pkey                          │ index │ pg    │ oorder     │ 958 MB  │
 public │ order_line_pkey                      │ index │ pg    │ order_line │ 9624 MB │
 public │ stock_pkey                           │ index │ pg    │ stock      │ 904 MB  │
 public │ warehouse_pkey                       │ index │ pg    │ warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there are new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer popularity/usage_count will be affected in ways that cannot easily be predicted. However, I'm not running with "backend_flush_after > 0" here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this. Thanks

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund <andres@anarazel.de> wrote: > I wonder if it'd make sense to hack up a patch that logs when evicting a > buffer while already holding another lwlock. That shouldn't be too hard. I tried this. It looks like we're calling FlushBuffer() with more than a single LWLock held (not just the single buffer lock) somewhat *less* with the patch. This is a positive sign for the patch, but also means that I'm no closer to figuring out what's going on. I tested a case with a 1GB shared_buffers + a TPC-C database sized at about 10GB. I didn't want the extra LOG instrumentation to influence the outcome. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
On 19.10.2018 0:54, Peter Geoghegan wrote:
> I would welcome any theories as to what could be the problem here. I
> think that this is fixable, since the picture for the patch is very
> positive, provided you only focus on bgwriter/checkpoint activity and
> on-disk sizes. It seems likely that there is a very specific gap in my
> understanding of how the patch affects buffer cleaning.

I have the same problem with the background heap & index cleaner (based on your patch). In that case the bottleneck is the WAL record that I need to write for each cleaned block, and the locks that are held while the WAL record is being written. Maybe you could do a test without writing any data to disk?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Oct 23, 2018 at 11:35 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> I have the same problem with the background heap & index cleaner (based
> on your patch). In that case the bottleneck is the WAL record that I
> need to write for each cleaned block, and the locks that are held while
> the WAL record is being written.

Part of the problem here is that v6 uses up to 25 candidate split points, even during regular calls to _bt_findsplitloc(). That was based on some synthetic test-cases. I've found that I can get most of the benefit in index size with far fewer split points, though. The extra work done with an exclusive buffer lock held will be considerably reduced in v7. I'll probably post that in a couple of weeks, since I'm in Europe for pgConf.EU. I don't fully understand the problems here, but even so I know that what you were testing wasn't very well optimized for write-heavy workloads. It would be especially bad with pgbench, since there isn't much opportunity to reduce the size of indexes there.

> Maybe you could do a test without writing any data to disk?

Yeah, I should test that on its own. I'm particularly interested in TPC-C, because it's a particularly good target for my patch. I can find a way of only executing the read TPC-C queries, to see how they perform on their own. TPC-C is particularly write-heavy, especially compared to the much more recent though less influential TPC-E benchmark.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
I am doing the code review. For now it covers the first patch - v6-0001..., dedicated to logical duplicate ordering.

The documentation is full and clear, and all non-trivial logic is commented accurately. The patch applies cleanly on top of current master. Regression tests pass, and my "Retail Indextuple deletion" use cases work without mistakes. But I have two comments on the code.

The new BTScanInsert structure reduces the parameter lists of many functions and looks fine. But it contains an optimization part (the 'restorebinsrch' field et al.) that is used very locally in the code - in the _bt_findinsertloc()->_bt_binsrch() calls. Maybe you could localize this logic in a separate struct, which would be passed to _bt_binsrch() as a pointer; other routines could pass a NULL value. That might simplify use of the struct.

Also, due to the optimization, _bt_binsrch() has roughly doubled in size. Maybe you could move this to some service routine?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Nov 2, 2018 at 3:06 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> The documentation is full and clear, and all non-trivial logic is
> commented accurately.

Glad you think so. I had the opportunity to discuss this patch at length with Heikki during pgConf.EU. I don't want to speak on his behalf, but I will say that he seemed to understand all aspects of the patch series, and seemed generally well disposed towards the high level design. The high-level design is the most important aspect -- B-Trees can be optimized in many ways, all at once, and we must be sure to come up with something that enables most or all of them. I really care about the long term perspective. That conversation with Heikki eventually turned into a conversation about reimplementing GIN using the nbtree code, which is actually related to my patch series (sorted on heap TID is the first step to optional run length encoding for duplicates). Heikki seemed to think that we can throw out a lot of the optimizations within GIN, and add a few new ones to nbtree, while still coming out ahead. This made the general nbtree-as-GIN idea (which we've been talking about casually for years) seem a lot more realistic to me. Anyway, he requested that I support this long term goal by getting rid of the DESC TID sort order thing -- that breaks GIN-style TID compression. It also increases the WAL volume unnecessarily when a page is split that contains all duplicates.

The DESC heap TID sort order thing probably needs to go. I'll probably have to go fix the regression test failures that occur when ASC heap TID order is used. (Technically those failures are a pre-existing problem, a problem that I mask by using DESC order...which is weird. The problem is masked in the master branch by accidental behaviors around nbtree duplicates, which is something that deserves to die. DESC order is closer to the accidental current behavior.)

> The patch applies cleanly on top of current master. Regression tests
> pass, and my "Retail Indextuple deletion" use cases work without
> mistakes.

Cool.

> The new BTScanInsert structure reduces the parameter lists of many
> functions and looks fine. But it contains an optimization part (the
> 'restorebinsrch' field et al.) that is used very locally in the code -
> in the _bt_findinsertloc()->_bt_binsrch() calls. Maybe you could
> localize this logic in a separate struct, which would be passed to
> _bt_binsrch() as a pointer; other routines could pass a NULL value.
> That might simplify use of the struct.

Hmm. I see your point. I did it that way because the knowledge of having cached an upper and lower bound for a binary search of a leaf page needs to last for a relatively long time. I'll look into it again, though.

> Also, due to the optimization, _bt_binsrch() has roughly doubled in
> size. Maybe you could move this to some service routine?

Maybe. There are some tricky details that seem to work against it. I'll see if it's possible to polish that some more, though.

--
Peter Geoghegan
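To make the run-length-encoding remark in the message above a bit more concrete: once all duplicates of a key are stored in heap TID order, the key only needs to be stored once, followed by a sorted TID list that compresses well. A toy sketch of such a posting-list layout follows (purely illustrative; nothing like this is in the patch series being discussed, it is merely enabled by it, and all names are made up):

#include <stdint.h>
#include <stddef.h>

/* Toy TID: block number plus offset, comparable as a 48-bit integer */
typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

/*
 * One deduplicated leaf entry: the duplicated key value stored once,
 * followed by its TIDs in ascending order.  Ascending order is what makes
 * delta encoding effective, since consecutive TIDs tend to be close.
 */
typedef struct ToyPostingTuple
{
    int         key;
    size_t      ntids;
    ToyTid      tids[];         /* sorted; could be stored as deltas */
} ToyPostingTuple;

static uint64_t
toy_tid_encode(ToyTid tid)
{
    return ((uint64_t) tid.block << 16) | tid.offset;
}

/* Delta between adjacent TIDs; small values compress well (varint, etc.) */
static uint64_t
toy_tid_delta(ToyTid prev, ToyTid next)
{
    return toy_tid_encode(next) - toy_tid_encode(prev);
}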
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
On 03.11.2018 5:00, Peter Geoghegan wrote: > The DESC heap TID sort order thing probably needs to go. I'll probably > have to go fix the regression test failures that occur when ASC heap > TID order is used. (Technically those failures are a pre-existing > problem, a problem that I mask by using DESC order...which is weird. > The problem is masked in the master branch by accidental behaviors > around nbtree duplicates, which is something that deserves to die. > DESC order is closer to the accidental current behavior.) I applied your patches at top of master. After tests corrections (related to TID ordering in index relations DROP...CASCADE operation) 'make check-world' passed successfully many times. In the case of 'create view' regression test - 'drop cascades to 62 other objects' problem - I verify an Álvaro Herrera hypothesis [1] and it is true. You can verify it by tracking the object_address_present_add_flags() routine return value. Some doubts, however, may be regarding the 'triggers' test. May you specify the test failures do you mean? [1] https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Nov 2, 2018 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > I had the opportunity to discuss this patch at length with Heikki > during pgConf.EU. > The DESC heap TID sort order thing probably needs to go. I'll probably > have to go fix the regression test failures that occur when ASC heap > TID order is used. I've found that TPC-C testing with ASC heap TID order fixes the regression that I've been concerned about these past few weeks. Making this change leaves the patch a little bit faster than the master branch for TPC-C, while still leaving TPC-C indexes about as small as they were with v6 of the patch (i.e. much smaller). I now get about a 1% improvement in transaction throughput, an improvement that seems fairly consistent. It seems likely that the next revision of the patch series will be an unambiguous across the board win for performance. I think that I come out ahead with ASC heap TID order because that has the effect of reducing the volume of WAL generated by page splits. Page splits are already optimized for splitting right, not left. I should thank Heikki for pointing me in the right direction here. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote: > I applied your patches at top of master. After tests corrections > (related to TID ordering in index relations DROP...CASCADE operation) > 'make check-world' passed successfully many times. > In the case of 'create view' regression test - 'drop cascades to 62 > other objects' problem - I verify an Álvaro Herrera hypothesis [1] and > it is true. You can verify it by tracking the > object_address_present_add_flags() routine return value. I'll have to go and fix the problem directly, so that ASC sort order can be used. > Some doubts, however, may be regarding the 'triggers' test. > May you specify the test failures do you mean? Not sure what you mean. The order of items that are listed in the DETAIL for a cascading DROP can have an "implementation defined" order. I think that this is an example of the more general problem -- what you call the 'drop cascades to 62 other objects' problem is a more specific subproblem, or, if you prefer, a more specific symptom of the same problem. Since I'm going to have to fix the problem head-on, I'll have to study it in detail anyway. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andrey Lepikhov
Date:
On 04.11.2018 9:31, Peter Geoghegan wrote:
> On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> I applied your patches at top of master. After tests corrections
>> (related to TID ordering in index relations DROP...CASCADE operation)
>> 'make check-world' passed successfully many times.
>> In the case of 'create view' regression test - 'drop cascades to 62
>> other objects' problem - I verify an Álvaro Herrera hypothesis [1] and
>> it is true. You can verify it by tracking the
>> object_address_present_add_flags() routine return value.
>
> I'll have to go and fix the problem directly, so that ASC sort order
> can be used.
>
>> Some doubts, however, may be regarding the 'triggers' test.
>> May you specify the test failures do you mean?
>
> Not sure what you mean. The order of items that are listed in the
> DETAIL for a cascading DROP can have an "implementation defined"
> order. I think that this is an example of the more general problem --
> what you call the 'drop cascades to 62 other objects' problem is a
> more specific subproblem, or, if you prefer, a more specific symptom
> of the same problem.

I mean that your code has no problems that I can detect by regression tests or by the retail index tuple deletion patch. The difference in the number of dropped objects is not a problem. It is caused by line 2293 - 'else if (thisobj->objectSubId == 0)' - in catalog/dependency.c, and it is legal behavior: a column object is deleted without any report because we already decided to drop its whole table.

Also, I checked the triggers test. The difference in the ERROR message 'cannot drop trigger trg1' is caused by a different order of tuples in the relation with the dependDependerIndexId relid. It is legal behavior, and we can simply replace the test results. Maybe you know of any other problems with the patch?

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Nov 4, 2018 at 8:21 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> I mean that your code has no problems that I can detect by regression
> tests or by the retail index tuple deletion patch. The difference in
> the number of dropped objects is not a problem. It is caused by line
> 2293 - 'else if (thisobj->objectSubId == 0)' - in catalog/dependency.c,
> and it is legal behavior: a column object is deleted without any report
> because we already decided to drop its whole table.

The behavior implied by using ASC heap TID order is always "legal", but it may cause a regression in certain functionality -- something that an ordinary user might complain about. There were some changes when DESC heap TID order is used too, of course, but those were safe to ignore (it seemed like nobody could ever care). It might have been okay to just use DESC order, but since it now seems like I must use ASC heap TID order for performance reasons, I have to tackle a couple of these issues head-on (e.g. 'cannot drop trigger trg1').

> Also, I checked the triggers test. The difference in the ERROR message
> 'cannot drop trigger trg1' is caused by a different order of tuples in
> the relation with the dependDependerIndexId relid. It is legal
> behavior, and we can simply replace the test results.

Let's look at this specific "trg1" case:

"""
create table trigpart (a int, b int) partition by range (a);
create table trigpart1 partition of trigpart for values from (0) to (1000);
create trigger trg1 after insert on trigpart for each row execute procedure trigger_nothing();
...
drop trigger trg1 on trigpart1; -- fail
-ERROR: cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
-HINT: You can drop trigger trg1 on table trigpart instead.
+ERROR: cannot drop trigger trg1 on table trigpart1 because table trigpart1 requires it
+HINT: You can drop table trigpart1 instead.
"""

The original hint suggests "you need to drop the object on the partition parent instead of its child", which is useful. The new hint suggests "instead of dropping the trigger on the partition child, maybe drop the child itself!". That's almost an insult to the user. Now, I suppose that I could claim that it's not my responsibility to fix this, since we get the useful behavior only due to accidental implementation details. I'm not going to take that position, though. I think that I am obliged to follow both the letter and the spirit of the law. I'm almost certain that this regression test was written because somebody specifically cared about getting the original, useful message. The underlying assumptions may have been a bit shaky, but we all know how common it is for software to evolve to depend on implementation-defined details. We've all written code that does it, but hopefully it didn't hurt us much because we also wrote regression tests that exercised the useful behavior.

> Maybe you know of any other problems with the patch?

Just the lack of pg_upgrade support. That is progressing nicely, though. I'll probably have that part in the next revision of the patch. I've found what looks like a workable approach, though I need to work on a testing strategy for pg_upgrade.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Nov 4, 2018 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Just the lack of pg_upgrade support.

Attached is v7 of the patch series. Changes:

* Pre-pg_upgrade indexes (indexes of an earlier BTREE_VERSION) are now supported. Using pg_upgrade will be seamless to users. "Getting tired" returns, for the benefit of old indexes that regularly have lots of duplicates inserted. Notably, the new/proposed version of btree (BTREE_VERSION 4) cannot be upgraded on-the-fly -- we're changing more than the contents of the metapage, so that won't work. Version 2 -> version 3 upgrades can still take place dynamically/on-the-fly. If you want to upgrade to version 4, you'll need to REINDEX. The performance of the patch with pg_upgrade'd indexes has been validated. There don't seem to be any regressions. amcheck checks both the old invariants, and the new/stricter/L&Y invariants. Which set is checked depends on the btree version of the index undergoing verification.

* ASC heap TID order is now used -- not DESC order, as before. This fixed all performance regressions that I'm aware of, and seems quite a lot more elegant overall. I believe that the patch series is now an unambiguous, across the board win for performance. I could see about a 1% increase in transaction throughput with my own TPC-C tests, while the big drop in the size of indexes was preserved. pgbench testing also showed as much as a 3.5% increase in transaction throughput in some cases with non-uniform distributions. Thanks for the suggestion, Heikki! Unfortunately, and as predicted, this change created a new problem that I need to fix directly: it makes certain diagnostic messages that accidentally depend on a certain pg_depend scan order say something different, and less useful (though still technically correct). I'll tackle that problem over on the dedicated thread I started [1]. (For now, I include a separate patch to paper over questionable regression test changes in a controlled way: v7-0005-Temporarily-paper-over-problematic-regress-output.patch.)

* New optimization that has index scans avoid visiting the next page by checking the high key -- this is broken out into its own commit (v7-0004-Add-high-key-continuescan-optimization.patch). This is related to an optimization that has been around for years -- we're now using the high key, rather than using a normal (non-pivot) index tuple. High keys are much more likely to indicate that the scan doesn't need to visit the next page with the earlier patches in the patch series applied, since the new logic for choosing a split point favors a high key with earlier differences. It's pretty easy to take advantage of that. With a composite index, or a secondary index, it's particularly likely that we can avoid visiting the next leaf page. In other words, now that we're being smarter about future locality of access during page splits, we should take full advantage during index scans. The v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch commit uses a _bt_lowest_scantid() sentinel value to avoid unnecessarily visiting a page to the left of the page we actually ought to go to directly during a descent of a B-Tree -- that optimization was around in all earlier versions of the patch series. It seems natural to also have this new-to-v7 optimization. It avoids unnecessarily going right once we reach the leaf level, so it "does the same thing on the right side" -- the two optimizations mirror each other.
If you don't get what I mean by that, then imagine a secondary index where each value appears a few hundred times. Literally every simple lookup query will either benefit from the first optimization on the way down the tree, or from the second optimization towards the end of the scan. (The page split logic ought to pack large groups of duplicates together, ideally confining them to one leaf page.) Andrey: the BTScanInsert struct still has the restorebinsrch stuff (mutable binary search optimization state) in v7. It seemed to make sense to keep it there, because I think that we'll be able to add similar optimizations in the future, that use similar mutable state. See my remarks on "dynamic prefix truncation" [2]. I think that that could be very helpful with skip scans, for example, so we'll probably end up adding it before too long. I hope you don't feel too strongly about it. [1] https://postgr.es/m/CAH2-Wzkypv1R+teZrr71U23J578NnTBt2X8+Y=Odr4pOdW1rXg@mail.gmail.com [2] https://postgr.es/m/CAH2-WzkpKeZJrXvR_p7VSY1b-s85E3gHyTbZQzR0BkJ5LrWF_A@mail.gmail.com -- Peter Geoghegan
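A toy model of the high-key check may help visualize the new optimization (illustration only, with a single integer key and a "key <= bound" qual; the real code works through the usual scan key machinery, and all names below are invented):

#include <stdbool.h>

/* Toy leaf page: just what the example needs */
typedef struct ToyLeafPage
{
    int         highkey;        /* upper bound for keys on this page; keys
                                 * on the right sibling are >= this value */
    bool        rightmost;      /* no right sibling (conceptual +infinity) */
} ToyLeafPage;

/*
 * A forward scan with a "key <= scanbound" qual has run out of matching
 * items on this page.  If even the high key is beyond the bound, every
 * tuple on the right sibling must fail the qual too, so the scan can end
 * here.  Suffix-truncated high keys with early differences make this test
 * fire more often.
 */
static bool
must_visit_right_sibling(const ToyLeafPage *page, int scanbound)
{
    if (page->rightmost)
        return false;           /* nothing to the right anyway */

    return page->highkey <= scanbound;
}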
Attachment
- v7-0005-Temporarily-paper-over-problematic-regress-output.patch
- v7-0006-DEBUG-Add-pageinspect-instrumentation.patch
- v7-0004-Add-high-key-continuescan-optimization.patch
- v7-0002-Weigh-suffix-truncation-when-choosing-a-split-poi.patch
- v7-0003-Add-split-at-new-tuple-page-split-optimization.patch
- v7-0001-Make-nbtree-indexes-have-unique-keys-in-tuples.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
Attached is v8 of the patch series, which has some relatively minor changes:

* A new commit adds an artificial tie-breaker column to pg_depend indexes, comprehensively solving the issues with regression test instability. This is the only really notable change.

* Clean-up of how the design is described in the nbtree README, and elsewhere. I want to make it clear that we're now more or less using the Lehman and Yao design. I re-read the Lehman and Yao paper to make sure that the patch acknowledges what Lehman and Yao say to expect, at least in cases that seemed to matter.

* Stricter verification by contrib/amcheck. Not likely to catch a case that wouldn't have been caught by previous revisions, but should make the design a bit clearer to somebody following L&Y.

* Tweaks to how _bt_findsplitloc() accumulates candidate split points. We're less aggressive in choosing a smaller tuple during an internal page split in this revision.

The overall impact of the pg_depend change is that required regression test output changes are *far* less numerous than they were in v7. There are now only trivial differences in the output order of items. And, there are very few diagnostic message changes overall -- we see exactly 5 changes now, rather than dozens. Importantly, there is no longer any question about whether I could make diagnostic messages less useful to users, because the existing behavior for findDependentObjects() is retained. This is an independent improvement, since it fixes an independent problem with test flappiness that we've been papering-over for some time [2] -- I make the required order actually-deterministic, removing heap TID ordering as a factor that can cause seemingly-random regression test failures on slow/overloaded buildfarm animals. Robert Haas remarked that he thought that the pg_depend index tie-breaker commit's approach is acceptable [1] -- see the other thread that Robert weighed in on for all the gory details. The patch's draft commit message may also be interesting. Note that adding a new column turns out to have *zero* storage overhead, because we only ever end up filling up space that was already getting lost to alignment.

The pg_depend thing is clearly a kludge. It's ugly, though in no small part because it acknowledges the existing reality of how findDependentObjects() already depends on scan order. I'm optimistic that I'll be able to push this groundwork commit before too long; it doesn't hinge on whether or not the nbtree patches are any good.

[1] https://postgr.es/m/CA+TgmoYNeFxdPimiXGL=tCiCXN8zWosUFxUfyDBaTd2VAg-D9w@mail.gmail.com
[2] https://postgr.es/m/11852.1501610262%40sss.pgh.pa.us

--
Peter Geoghegan
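The zero-storage-overhead claim can be sanity-checked with a little alignment arithmetic. The sketch below assumes the usual pg_depend layout (six 4-byte oid/int4 columns plus the 1-byte deptype "char"), a 23-byte heap tuple header rounded up to 24, and 8-byte MAXALIGN; it is only an illustration of the argument, not code from the patch:

#include <stdio.h>

#define TYPEALIGN(a, x) (((x) + ((a) - 1)) & ~((unsigned long) ((a) - 1)))
#define MAXALIGN(x)     TYPEALIGN(8, (x))

int
main(void)
{
    unsigned long header = 24;                  /* t_hoff, no null bitmap */
    unsigned long old_data = 6 * 4 + 1;         /* 25 bytes of columns */
    /* a new int4 tiebreaker must start on a 4-byte boundary: offset 28 */
    unsigned long new_data = TYPEALIGN(4, old_data) + 4;    /* 32 bytes */

    printf("old tuple: %lu bytes, new tuple: %lu bytes\n",
           MAXALIGN(header + old_data),         /* 56 */
           MAXALIGN(header + new_data));        /* 56 -- no growth */
    return 0;
}

Both tuple sizes round up to the same 56 bytes, so under these assumptions the extra column really does fit entirely in space that was already lost to alignment padding.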
Attachment
- v8-0003-Pick-nbtree-split-points-discerningly.patch
- v8-0004-Add-split-at-new-tuple-page-split-optimization.patch
- v8-0005-Add-high-key-continuescan-optimization.patch
- v8-0006-DEBUG-Add-pageinspect-instrumentation.patch
- v8-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v8-0001-Add-pg_depend-index-scan-tiebreaker-column.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Dmitry Dolgov
Date:
> On Sun, Nov 25, 2018 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v8 of the patch series, which has some relatively minor changes:

Thank you for working on this patch. Just for your information, cfbot says there are problems on Windows:

src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' : undeclared identifier
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Dec 1, 2018 at 4:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Just for the information, cfbot says there are problems on windows: > > src/backend/catalog/pg_depend.c(33): error C2065: 'INT32_MAX' : > undeclared identifier Thanks. Looks like I should have used PG_INT32_MAX. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Dec 1, 2018 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > Thanks. Looks like I should have used PG_INT32_MAX. Attached is v9, which does things that way. There are no interesting changes, though I have set things up so that a later patch in the series can add "dynamic prefix truncation" -- I do not include any such patch in v9, though. I'm going to start a new thread on that topic, and include the patch there, since it's largely unrelated to this work, and in any case still isn't in scope for Postgres 12 (the patch is still experimental, for reasons that are of general interest). If nothing else, Andrey and Peter E. will probably get a better idea of why I thought that an insertion scan key was a good place to put mutable state if they go read that other thread -- there really was a bigger picture to setting things up that way. -- Peter Geoghegan
Attachment
- v9-0006-DEBUG-Add-pageinspect-instrumentation.patch
- v9-0001-Add-pg_depend-index-scan-tiebreaker-column.patch
- v9-0003-Pick-nbtree-split-points-discerningly.patch
- v9-0004-Add-split-at-new-tuple-page-split-optimization.patch
- v9-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v9-0005-Add-high-key-continuescan-optimization.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Mon, Dec 3, 2018 at 7:10 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v9, which does things that way. There are no interesting > changes, though I have set things up so that a later patch in the > series can add "dynamic prefix truncation" -- I do not include any > such patch in v9, though. I'm going to start a new thread on that > topic, and include the patch there, since it's largely unrelated to > this work, and in any case still isn't in scope for Postgres 12 (the > patch is still experimental, for reasons that are of general > interest). The dynamic prefix truncation thread that I started: https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 04/12/2018 05:10, Peter Geoghegan wrote:
> Attached is v9, ...

I spent some time reviewing this. I skipped the first patch, to add a column to pg_depend, and I got through patches 2, 3 and 4. Impressive results, and the code looks sane. I wrote a laundry list of little comments on minor things, suggested rewordings of comments etc. I hope they're useful, but feel free to ignore/override my opinions of any of those, as you see best. But first, a few slightly bigger (medium-sized?) issues that caught my eye:

1. How about doing the BTScanInsertData refactoring as a separate commit, first? It seems like a good thing for readability on its own, and would slim the big main patch. (And make sure to credit Andrey for that idea in the commit message.)

2. In the "Treat heap TID as part of the nbtree key space" patch:

> * Build an insertion scan key that contains comparison data from itup
> * as well as comparator routines appropriate to the key datatypes.
> *
> + * When itup is a non-pivot tuple, the returned insertion scan key is
> + * suitable for finding a place for it to go on the leaf level. When
> + * itup is a pivot tuple, the returned insertion scankey is suitable
> + * for locating the leaf page with the pivot as its high key (there
> + * must have been one like it at some point if the pivot tuple
> + * actually came from the tree).
> + *
> + * Note that we may occasionally have to share lock the metapage, in
> + * order to determine whether or not the keys in the index are expected
> + * to be unique (i.e. whether or not heap TID is treated as a tie-breaker
> + * attribute). Callers that cannot tolerate this can request that we
> + * assume that this is a heapkeyspace index.
> + *
> * The result is intended for use with _bt_compare().
> */
> -ScanKey
> -_bt_mkscankey(Relation rel, IndexTuple itup)
> +BTScanInsert
> +_bt_mkscankey(Relation rel, IndexTuple itup, bool assumeheapkeyspace)

This 'assumeheapkeyspace' flag feels awkward. What if the caller knows that it is a v3 index? There's no way to tell _bt_mkscankey() that. (There's no need for that, currently, but seems a bit weird.)

_bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which calls _bt_mkscankey(). It's holding a lock on the page being split. Do we risk deadlock by locking the metapage at the same time?

I don't have any great ideas on what to do about this, but it's awkward as it is. Can we get away without the new argument? Could we somehow arrange things so that rd_amcache would be guaranteed to already be set?

3. In the "Pick nbtree split points discerningly" patch

I find the different modes and the logic in _bt_findsplitloc() very hard to understand. I've spent a while looking at it now, and I think I have a vague understanding of what things it takes into consideration, but I don't understand why it performs those multiple stages, what each stage does, and how that leads to an overall strategy. I think a rewrite would be in order, to make that more understandable. I'm not sure what exactly it should look like, though.

If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or SINGLE_VALUE modes, it has to redo a lot of the work that was done in the DEFAULT mode already. That's probably not a big deal in practice, performance-wise, but I feel that it's another hint that some refactoring would be in order.

One idea on how to restructure that: Make a single pass over all the offset numbers, considering a split at that location. Like the current code does. For each offset, calculate a "penalty" based on two factors:

* free space on each side
* the number of attributes in the pivot tuple, and whether it needs to store the heap TID

Define the penalty function so that having to add a heap TID to the pivot tuple is considered very expensive, more expensive than anything else, and truncating away other attributes gives a reward of some size. However, naively computing the penalty upfront for every offset would be a bit wasteful. Instead, start from the middle of the page, and walk "outwards" towards both ends, until you find a "good enough" penalty. Or something like that...

Now, the laundry list of smaller items:

----- laundry list begins -----

1st commit's commit message:

> Make nbtree treat all index tuples as having a heap TID trailing key
> attribute. Heap TID becomes a first class part of the key space on all
> levels of the tree. Index searches can distinguish duplicates by heap
> TID, at least in principle.

What do you mean by "at least in principle"?

> Secondary index insertions will descend
> straight to the leaf page that they'll insert on to (unless there is a
> concurrent page split).

What is a "Secondary" index insertion?

> Naively adding a new attribute to every pivot tuple has unacceptable
> overhead (it bloats internal pages), so suffix truncation of pivot
> tuples is added. This will generally truncate away the "extra" heap TID
> attribute from pivot tuples during a leaf page split, and may also
> truncate away additional user attributes. This can increase fan-out,
> especially when there are several attributes in an index.

Suggestion: "when there are several attributes in an index" -> "in a multi-column index"

> +/*
> + * Convenience macro to get number of key attributes in tuple in low-context
> + * fashion
> + */
> +#define BTreeTupleGetNKeyAtts(itup, rel) \
> +	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
> +

What is "low-context fashion"?

> + * scankeys is an array of scan key entries for attributes that are compared
> + * before scantid (user-visible attributes). Every attribute should have an
> + * entry during insertion, though not necessarily when a regular index scan
> + * uses an insertion scankey to find an initial leaf page.

Suggestion: Reword to something like "During insertion, there must be a scan key for every attribute, but when starting a regular index scan, some can be omitted."

> +typedef struct BTScanInsertData
> +{
> +	/*
> +	 * Mutable state used by _bt_binsrch() to inexpensively repeat a binary
> +	 * search on the leaf level when only scantid has changed. Only used for
> +	 * insertions where _bt_check_unique() is called.
> +	 */
> +	bool		savebinsrch;
> +	bool		restorebinsrch;
> +	OffsetNumber low;
> +	OffsetNumber high;
> +
> +	/* State used to locate a position at the leaf level */
> +	bool		heapkeyspace;
> +	bool		nextkey;
> +	ItemPointer scantid;	/* tiebreaker for scankeys */
> +	int			keysz;		/* Size of scankeys */
> +	ScanKeyData scankeys[INDEX_MAX_KEYS];	/* Must appear last */
> +} BTScanInsertData;

It would feel more natural to me, to have the mutable state *after* the other fields. Also, it'd feel less error-prone to have 'scantid' be ItemPointerData, rather than a pointer to somewhere else. The 'heapkeyspace' name isn't very descriptive. I understand that it means that the heap TID is part of the keyspace. Not sure what to suggest instead, though.

> +The requirement that all btree keys be unique is satisfied by treating heap
> +TID as a tiebreaker attribute. Logical duplicates are sorted in heap item
> +pointer order.

Suggestion: "item pointer" -> TID, to use consistent terms.

> We don't use btree keys to disambiguate downlinks from the
> +internal pages during a page split, though: only one entry in the parent
> +level will be pointing at the page we just split, so the link fields can be
> +used to re-find downlinks in the parent via a linear search. (This is
> +actually a legacy of when heap TID was not treated as part of the keyspace,
> +but it does no harm to keep things that way.)

I don't understand this paragraph.

> +Lehman and Yao talk about pairs of "separator" keys and downlinks in
> +internal pages rather than tuples or records. We use the term "pivot"
> +tuple to distinguish tuples which don't point to heap tuples, that are
> +used only for tree navigation. Pivot tuples include all tuples on
> +non-leaf pages and high keys on leaf pages.

Suggestion: reword to "All tuples on non-leaf pages, and high keys on leaf pages, are pivot tuples"

> Note that pivot tuples are
> +only used to represent which part of the key space belongs on each page,
> +and can have attribute values copied from non-pivot tuples that were
> +deleted and killed by VACUUM some time ago. A pivot tuple may contain a
> +"separator" key and downlink, just a separator key (in practice the
> +downlink will be garbage), or just a downlink.

Rather than store garbage, set it to zeros?

> +Lehman and Yao require that the key range for a subtree S is described by
> +Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page.
> +A search where the scan key is equal to a pivot tuple in an upper tree
> +level must descend to the left of that pivot to ensure it finds any equal
> +keys. Pivot tuples are always a _strict_ lower bound on items on their
> +downlink page; the equal item(s) being searched for must therefore be to
> +the left of that downlink page on the next level down. (It's possible to
> +arrange for internal page tuples to be strict lower bounds in all cases
> +because their values come from leaf tuples, which are guaranteed unique by
> +the use of heap TID as a tiebreaker. We also make use of hard-coded
> +negative infinity values in internal pages. Rightmost pages don't have a
> +high key, though they conceptually have a positive infinity high key). A
> +handy property of this design is that there is never any need to
> +distinguish between equality in the case where all attributes/keys are used
> +in a scan from equality where only some prefix is used.

"distinguish between ... from ..." doesn't sound like correct grammar. Suggestion: "distinguish between ... and ...", or just "distinguish ... from ...". Or rephrase the sentence some other way.

> +We truncate away suffix key attributes that are not needed for a page high
> +key during a leaf page split when the remaining attributes distinguish the
> +last index tuple on the post-split left page as belonging on the left page,
> +and the first index tuple on the post-split right page as belonging on the
> +right page.

That's a very long sentence.

> *		Since the truncated tuple is probably smaller than the
> *		original, it cannot just be copied in place (besides, we want
> *		to actually save space on the leaf page).  We delete the
> *		original high key, and add our own truncated high key at the
> *		same offset.  It's okay if the truncated tuple is slightly
> *		larger due to containing a heap TID value, since pivot tuples
> *		are treated as a special case by _bt_check_third_page().
By "treated as a special case", I assume that _bt_check_third_page() always reserves some space for that? Maybe clarify that somehow. _bt_truncate(): > This is possible when there are > * attributes that follow an attribute in firstright that is not equal to the > * corresponding attribute in lastleft (equal according to insertion scan key > * semantics). I can't comprehend that sentence. Simpler English, maybe add an example, please. > /* > * _bt_leave_natts - how many key attributes to leave when truncating. > * > * Caller provides two tuples that enclose a split point. CREATE INDEX > * callers must pass build = true so that we may avoid metapage access. (This > * is okay because CREATE INDEX always creates an index on the latest btree > * version.) > * > * This can return a number of attributes that is one greater than the > * number of key attributes for the index relation. This indicates that the > * caller must use a heap TID as a unique-ifier in new pivot tuple. > */ > static int > _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright, > bool build) IMHO "keep" would sound better here than "leave". > + if (needheaptidspace) > + ereport(ERROR, > + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), > + errmsg("index row size %zu exceeds btree version %u maximum %zu for index \"%s\"", > + itemsz, BTREE_VERSION, BTMaxItemSize(page), > + RelationGetRelationName(rel)), > + errdetail("Index row references tuple (%u,%u) in relation \"%s\".", > + ItemPointerGetBlockNumber(&newtup->t_tid), > + ItemPointerGetOffsetNumber(&newtup->t_tid), > + RelationGetRelationName(heap)), > + errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n" > + "Consider a function index of an MD5 hash of the value, " > + "or use full text indexing."), > + errtableconstraint(heap, > + RelationGetRelationName(rel)))); > + else > + ereport(ERROR, > + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), > + errmsg("index row size %zu exceeds btree version 3 maximum %zu for index \"%s\"", > + itemsz, BTMaxItemSizeNoHeapTid(page), > + RelationGetRelationName(rel)), > + errdetail("Index row references tuple (%u,%u) in relation \"%s\".", > + ItemPointerGetBlockNumber(&newtup->t_tid), > + ItemPointerGetOffsetNumber(&newtup->t_tid), > + RelationGetRelationName(heap)), > + errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n" > + "Consider a function index of an MD5 hash of the value, " > + "or use full text indexing."), > + errtableconstraint(heap, > + RelationGetRelationName(rel)))); Could restructure this to avoid having two almost identical strings to translate. > #define BTREE_METAPAGE 0 /* first page is meta */ > #define BTREE_MAGIC 0x053162 /* magic number of btree pages */ > -#define BTREE_VERSION 3 /* current version number */ > +#define BTREE_VERSION 4 /* current version number */ > #define BTREE_MIN_VERSION 2 /* minimal supported version number */ > +#define BTREE_META_VERSION 3 /* minimal version with all meta fields */ BTREE_META_VERSION is a strange name for version 3. I think this deserves a more verbose comment, above these #defines, to list all the versions and their differences. v9-0003-Pick-nbtree-split-points-discerningly.patch commit message: > Add infrastructure to determine where the earliest difference appears > among a pair of tuples enclosing a candidate split point. I don't understand this sentence. > _bt_findsplitloc() is also taught to care about the case where there are > many duplicates, making it hard to find a distinguishing split point. 
> _bt_findsplitloc() may even conclude that it isn't possible to avoid > filling a page entirely with duplicates, in which case it packs pages > full of duplicates very tightly. Hmm. Is the assumption here that if a page is full of duplicates, there will be no more insertions into that page? Why? > The number of cycles added is not very noticeable, which is important, > since _bt_findsplitloc() is run while an exclusive (leaf page) buffer > lock is held. We avoid using authoritative insertion scankey > comparisons, unlike suffix truncation proper. What do you do instead, then? memcmp? (Reading the patch, yes. Suggestion: "We use a faster binary comparison, instead of proper datatype-aware comparison, for speed". Aside from performance, it would feel inappropriate to call user-defined code while holding a buffer lock, anyway. > +There is sophisticated criteria for choosing a leaf page split point. The > +general idea is to make suffix truncation effective without unduly > +influencing the balance of space for each half of the page split. The > +choice of leaf split point can be thought of as a choice among points > +*between* items on the page to be split, at least if you pretend that the > +incoming tuple was placed on the page already, without provoking a split. I'd leave out the ", without provoking a split" part. Or maybe reword to "if you pretend that the incoming tuple fit and was placed on the page already". > +Choosing the split point between two index tuples with differences that > +appear as early as possible results in truncating away as many suffix > +attributes as possible. It took me a while to understand what the "appear as early as possible" means here. It's talking about a multi-column index, and about finding a difference in one of the leading key columns. Not, for example, about finding a split point early in the page. > An array of acceptable candidate split points > +(points that balance free space on either side of the split sufficiently > +well) is assembled in a pass over the page to be split, sorted by delta. > +An optimal split point is chosen during a pass over the assembled array. > +There are often several split points that allow the maximum number of > +attributes to be truncated away -- we choose whichever one has the lowest > +free space delta. Perhaps we should leave out these details in the README, and explain this in the comments of the picksplit-function itself? In the README, I think a more high-level description of what things are taken into account when picking the split point, would be enough. > +Suffix truncation is primarily valuable because it makes pivot tuples > +smaller, which delays splits of internal pages, but that isn't the only > +reason why it's effective. Suggestion: reword to "... , but that isn't the only benefit" ? > There are cases where suffix truncation can > +leave a B-Tree significantly smaller in size than it would have otherwise > +been without actually making any pivot tuple smaller due to restrictions > +relating to alignment. Suggestion: reword to "... smaller in size than it would otherwise be, without ..." and "without making any pivot tuple *physically* smaller, due to alignment". This sentence is a bit of a cliffhanger: what are those cases, and how is that possible? > The criteria for choosing a leaf page split point > +for suffix truncation is also predictive of future space utilization. How so? What does this mean? 
> +Furthermore, even truncation that doesn't make pivot tuples smaller still > +prevents pivot tuples from being more restrictive than truly necessary in > +how they describe which values belong on which pages. Ok, I guess these sentences resolve the cliffhanger I complained about. But this still feels like magic. When you split a page, all of the keyspace must belong on the left or the right page. Why does it make a difference to space utilization, where exactly you split the key space? > +While it's not possible to correctly perform suffix truncation during > +internal page splits, it's still useful to be discriminating when splitting > +an internal page. The split point that implies a downlink be inserted in > +the parent that's the smallest one available within an acceptable range of > +the fillfactor-wise optimal split point is chosen. This idea also comes > +from the Prefix B-Tree paper. This process has much in common with to what > +happens at the leaf level to make suffix truncation effective. The overall > +effect is that suffix truncation tends to produce smaller and less > +discriminating pivot tuples, especially early in the lifetime of the index, > +while biasing internal page splits makes the earlier, less discriminating > +pivot tuples end up in the root page, delaying root page splits. Ok, so this explains it further, I guess. I find this paragraph difficult to understand, though. The important thing here is the idea that some split points are more "discriminating" than others, but I think it needs some further explanation. What makes a split point more discriminating? Maybe add an example. > +Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf > +tuple that it's based on (the first item on the right page), since a heap > +TID must be appended when nothing else distinguishes each side of a leaf > +split. Truncation cannot simply reuse the leaf level representation: we > +must append an additional attribute, rather than incorrectly leaving a heap > +TID in the generic IndexTuple item pointer field. (The field is already > +used by pivot tuples to store their downlink, plus some additional > +metadata.) That's not really the fault of suffix truncation as such, but the process of turning a leaf tuple into a pivot tuple. It would happen even if you didn't truncate anything. I think this point, that we have to store the heap TID differently in pivot tuples, would deserve a comment somewhere else, too. While reading the patch, I didn't realize that that's what we're doing, until I read this part of the README, even though I saw the new code to deal with heap TIDs elsewhere in the code. Not sure where, maybe in BTreeTupleGetHeapTID(). > +Adding a heap TID attribute during a leaf page split should only occur when > +the page to be split is entirely full of duplicates (the new item must also > +be a duplicate). The logic for selecting a split point goes to great > +lengths to avoid heap TIDs in pivots --- "many duplicates" mode almost > +always manages to pick a split point between two user-key-distinct tuples, > +accepting a completely lopsided split if it must. This is the first mention of "many duplicates" mode. Maybe just say "_bt_findsplitloc() almost always ..." or "The logic for selecting a split point goes to great lengths to avoid heap TIDs in pivots, and almost always manages to pick a split point between two user-key-distinct tuples, accepting a completely lopsided split if it must." 
> Once appending a heap > +TID to a split's pivot becomes completely unavoidable, there is a fallback > +strategy --- "single value" mode is used, which makes page splits pack the > +new left half full by using a high fillfactor. Single value mode leads to > +better overall space utilization when a large number of duplicates are the > +norm, and thereby also limits the total number of pivot tuples with an > +untruncated heap TID attribute. This assumes that tuples are inserted in increasing TID order, right? Seems like a valid assumption, no complaints there, but it's an assumption nevertheless. I'm not sure if this level of detail is worthwhile in the README. This logic on deciding the split point is all within the _bt_findsplitloc() function, so maybe put this explanation there. In the README, a more high-level explanation of what things _bt_findsplitloc() considers, should be enough. _bt_findsplitloc(), and all its helper structs and subroutines, are about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it would be a good idea to move it to a whole new nbtsplitloc.c file? It's a very isolated piece of code. In the comment on _bt_leave_natts_fast(): > + * Testing has shown that an approach involving treating the tuple as a > + * decomposed binary string would work almost as well as the approach taken > + * here. It would also be faster. It might actually be necessary to go that > + * way in the future, if suffix truncation is made sophisticated enough to > + * truncate at a finer granularity (i.e. truncate within an attribute, rather > + * than just truncating away whole attributes). The current approach isn't > + * markedly slower, since it works particularly well with the "perfect > + * penalty" optimization (there are fewer, more expensive calls here). It > + * also works with INCLUDE indexes (indexes with non-key attributes) without > + * any special effort. That's an interesting tidbit, but I'd suggest just removing this comment altogether. It's not really helping to understand the current implementation. v9-0005-Add-high-key-continuescan-optimization.patch commit message: > Note that even pre-pg_upgrade'd v3 indexes make use of this > optimization. .. but we're missing the other optimizations that make it more effective, so it probably won't do much for v3 indexes. Does it make them slower? It's probably acceptable, even if there's a tiny regression, but I'm curious. ----- laundry list ends ----- - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Dec 28, 2018 at 10:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I spent some time reviewing this. I skipped the first patch, to add a > column to pg_depend, and I got through patches 2, 3 and 4. Impressive > results, and the code looks sane. Thanks! I really appreciate your taking the time to do such a thorough review. You were right to skip the first patch, because there is a fair chance that it won't be used in the end. Tom is looking into the pg_depend problem that I paper over with the first patch. > I wrote a laundry list of little comments on minor things, suggested > rewordings of comments etc. I hope they're useful, but feel free to > ignore/override my opinions of any of those, as you see best. I think that that feedback is also useful, and I'll end up using 95%+ of it. Much of the information I'm trying to get across is very subtle. > But first, a few slightly bigger (medium-sized?) issues that caught my eye: > > 1. How about doing the BTScanInsertData refactoring as a separate > commit, first? It seems like a good thing for readability on its own, > and would slim the big main patch. (And make sure to credit Andrey for > that idea in the commit message.) Good idea. I'll do that. > This 'assumeheapkeyspace' flag feels awkward. What if the caller knows > that it is a v3 index? There's no way to tell _bt_mkscankey() that. > (There's no need for that, currently, but seems a bit weird.) This is there for CREATE INDEX -- we cannot access the metapage during an index build. We'll only be able to create new v4 indexes with the patch applied, so we can assume that heap TID is part of the key space safely. > _bt_split() calls _bt_truncate(), which calls _bt_leave_natts(), which > calls _bt_mkscankey(). It's holding a lock on the page being split. Do > we risk deadlock by locking the metapage at the same time? I already had vague concerns along the same lines. I am also concerned about index_getprocinfo() calls that happen in the same code path, with a buffer lock held. (SP-GiST's doPickSplit() function can be considered a kind of precedent that makes the second issue okay, I suppose.) See also: My later remarks on the use of "authoritative comparisons" from this same e-mail. > I don't have any great ideas on what to do about this, but it's awkward > as it is. Can we get away without the new argument? Could we somehow > arrange things so that rd_amcache would be guaranteed to already be set? These are probably safe in practice, but the way that we rely on them being safe from a distance is a concern. Let me get back to you on this. > 3. In the "Pick nbtree split points discerningly" patch > > I find the different modes and the logic in _bt_findsplitloc() very hard > to understand. I've spent a while looking at it now, and I think I have > a vague understanding of what things it takes into consideration, but I > don't understand why it performs those multiple stages, what each stage > does, and how that leads to an overall strategy. I think a rewrite would > be in order, to make that more understandable. I'm not sure what exactly > it should look like, though. I've already refactored that a little bit for the upcoming v10. The way _bt_findsplitloc() state is initially set up becomes slightly more streamlined. It still works in the same way, though, so you'll probably only think that the new version is a minor improvement. (Actually, v10 focuses on making _bt_splitatnewitem() a bit less magical, at least right now.) 
> If _bt_findsplitloc() has to fall back to the MANY_DUPLICATES or > SINGLE_VALUE modes, it has to redo a lot of the work that was done in > the DEFAULT mode already. That's probably not a big deal in practice, > performance-wise, but I feel that it's another hint that some > refactoring would be in order. The logic within _bt_findsplitloc() has been very hard to refactor all along. You're right that there is a fair amount of redundant-ish work that the alternative modes (MANY_DUPLICATES + SINGLE_VALUE) perform. The idea is to not burden the common DEFAULT case, and to keep the control flow relatively simple. I'm sure that if I was in your position I'd say something similar. It is complicated in subtle ways that look like they might not matter, but actually do. I am working off a fair variety of test cases, which really came in handy. I remember thinking that I'd simplified it a couple of times back in August or September, only to realize that I'd regressed a case that I cared about. I eventually realized that I needed to come up with a comprehensive though relatively fast test suite, which seems essential for refactoring _bt_findsplitloc(), and maybe even for fully understanding how _bt_findsplitloc() works. Another complicating factor is that I have to worry about the number of cycles used under a buffer lock (not just the impact on space utilization). With all of that said, I am willing to give it another try. You've seen opportunities to refactor that I missed before now. More than once. > One idea on how to restructure that: > > Make a single pass over all the offset numbers, considering a split at > that location. Like the current code does. For each offset, calculate a > "penalty" based on two factors: > > * free space on each side > * the number of attributes in the pivot tuple, and whether it needs to > store the heap TID > > Define the penalty function so that having to add a heap TID to the > pivot tuple is considered very expensive, more expensive than anything > else, and truncating away other attributes gives a reward of some size. As you go on to say, accessing the tuple to calculate a penalty like this is expensive, and shouldn't be done exhaustively if at all possible. We only access item pointer information (that is, lp_len) in the master branch's _bt_findsplitloc(), and that's all we do within the patch until the point where we have a (usually quite small) array of candidate split points, sorted by delta. Doing a pass over the page to assemble an array of candidate splits, and then doing a pass over the sorted array of splits with tolerably-low left/right space deltas works pretty well. "Mixing" the penalties together up front like that is something I considered, and decided not to pursue -- it obscures relatively uncommon though sometimes important large differences, that a single DEFAULT mode style pass would probably miss. MANY_DUPLICATES mode is totally exhaustive, because it's worth being totally exhaustive in the extreme case where there are only a few distinct values, and it's still possible to avoid a large grouping of values that spans more than one page. But it's not worth being exhaustive like that most of the time. That's the useful thing about having 2 alternative modes, that we "escalate" to if and only if it seems necessary to. MANY_DUPLICATES can be expensive, because no workload is likely to consistently use it. 
Most will almost always use DEFAULT, some will use SINGLE_VALUE quite a bit -- MANY_DUPLICATES is for when we're "in between" those two. Seems unlikely to be the steady state. Maybe we could just have MANY_DUPLICATES mode, and make SINGLE_VALUE mode something that happens within a DEFAULT pass. It's probably not worth it, though -- SINGLE_VALUE mode generally wants to split the page in a way that makes the left page mostly full, and the right page mostly empty. So eliminating SINGLE_VALUE mode would probably not simplify the code. > However, naively computing the penalty upfront for every offset would be > a bit wasteful. Instead, start from the middle of the page, and walk > "outwards" towards both ends, until you find a "good enough" penalty. You can't start at the middle of the page, though. You have to start at the left (though you could probably start at the right instead). This is because of page fragmentation -- it's not correct to assume that the line pointer offset into tuple space on the page (firstright line pointer lp_off for candidate split point) tells you anything about what the space delta will be after the split. You have to exhaustively add up the free space before the line pointer (the free space for all earlier line pointers) before seeing if the line pointer works as a split point, since each previous line pointer's tuple could be located anywhere in the original page's tuple space (anywhere to the left or to the right of where it would be in the simple/unfragmented case). > 1st commits commit message: > > > Make nbtree treat all index tuples as having a heap TID trailing key > > attribute. Heap TID becomes a first class part of the key space on all > > levels of the tree. Index searches can distinguish duplicates by heap > > TID, at least in principle. > > What do you mean by "at least in principle"? I mean that we don't really do that currently, because we don't have something like retail index tuple deletion. However, we do have, uh, insertion, so I guess that this is just wrong. Will fix. > > Secondary index insertions will descend > > straight to the leaf page that they'll insert on to (unless there is a > > concurrent page split). > > What is a "Secondary" index insertion? Secondary index is how I used to refer to a non-unique index, until I realized that that was kind of wrong. (In fact, all indexes in Postgres are secondary indexes, because we always use a heap, never a clustered index.) Will fix. > Suggestion: "when there are several attributes in an index" -> "in a > multi-column index" I'll change it to say that. > > +/* > > + * Convenience macro to get number of key attributes in tuple in low-context > > + * fashion > > + */ > > +#define BTreeTupleGetNKeyAtts(itup, rel) \ > > + Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel)) > > + > > What is "low-context fashion"? I mean that it works with non-pivot tuples in INCLUDE indexes without special effort on the caller's part, while also fetching the number of key attributes in any pivot tuple, where it might well be < IndexRelationGetNumberOfKeyAttributes(). Maybe no comment is necessary -- BTreeTupleGetNKeyAtts() is exactly what it sounds like to somebody that already knows about BTreeTupleGetNAtts(). > Suggestion: Reword to something like "During insertion, there must be a > scan key for every attribute, but when starting a regular index scan, > some can be omitted." Will do. > It would feel more natural to me, to have the mutable state *after* the > other fields. 
I fully agree, but I can't really change it. The struct BTScanInsertData ends with a flexible array member, though its sized INDEX_MAX_KEYS because _bt_first() wants to allocate it on the stack without special effort. This was found to make a measurable difference with nested loop joins -- I used to always allocate BTScanInsertData using palloc(), until I found a regression. This nestloop join issue must be why commit d961a568 removed an insertion scan key palloc() from _bt_first(), way back in 2005. It seems like _bt_first() should remain free of palloc()s, which it seems to actually manage to do, despite being so hairy. > Also, it'd feel less error-prone to have 'scantid' be > ItemPointerData, rather than a pointer to somewhere else. It's useful for me to be able to set it to NULL, though -- I'd need another bool to represent the absence of a scantid if the field was ItemPointerData (the absence could occur when _bt_mkscankey() is passed a pivot tuple with its heap TID already truncated away, for example). Besides, the raw scan keys themselves are very often pointers to an attribute in some index tuple -- a tuple that the caller needs to keep around for as long as the insertion scan key needs to be used. Why not do the same thing with scantid? It is more or less just another attribute, so it's really the same situation as before. > The 'heapkeyspace' name isn't very descriptive. I understand that it means > that the heap TID is part of the keyspace. Not sure what to suggest > instead, though. I already changed this once, based on a similar feeling. If you come up with an even better name than "heapkeyspace", let me know. :-) > > +The requirement that all btree keys be unique is satisfied by treating heap > > +TID as a tiebreaker attribute. Logical duplicates are sorted in heap item > > +pointer order. > > Suggestion: "item pointer" -> TID, to use consistent terms. Will do. > > We don't use btree keys to disambiguate downlinks from the > > +internal pages during a page split, though: only one entry in the parent > > +level will be pointing at the page we just split, so the link fields can be > > +used to re-find downlinks in the parent via a linear search. (This is > > +actually a legacy of when heap TID was not treated as part of the keyspace, > > +but it does no harm to keep things that way.) > > I don't understand this paragraph. I mean that we could now "go full Lehman and Yao" if we wanted to: it's not necessary to even use the link field like this anymore. We don't do that because of v3 indexes, but also because it doesn't actually matter. The current way of re-finding downlinks would probably even be better in a green field situation, in fact -- it's just a bit harder to explain in a research paper. > Suggestion: reword to "All tuples on non-leaf pages, and high keys on > leaf pages, are pivot tuples" Will do. > > Note that pivot tuples are > > +only used to represent which part of the key space belongs on each page, > > +and can have attribute values copied from non-pivot tuples that were > > +deleted and killed by VACUUM some time ago. A pivot tuple may contain a > > +"separator" key and downlink, just a separator key (in practice the > > +downlink will be garbage), or just a downlink. > > Rather than store garbage, set it to zeros? There may be minor forensic value in keeping the item pointer block as the heap block (but not the heap item pointer) within leaf high keys (i.e. 
only changing it when it gets copied over for insertion into the parent, and the block needs to point to the leaf child). I recall discussing this with Alexander Korotkov shortly before the INCLUDE patch went in. I'd rather keep it that way, rather than zeroing. I could say "undefined" instead of "garbage", though. Not at all attached to that wording. > "distringuish between ... from ..." doesn't sound like correct grammar. > Suggestion: "distinguish between ... and ...", or just "distinguish ... > from ...". Or rephrase the sentence some other way. Yeah, I mangled the grammar. Which is kind of surprising, since I make a very important point about why strict lower bounds are handy in that sentence! > > +We truncate away suffix key attributes that are not needed for a page high > > +key during a leaf page split when the remaining attributes distinguish the > > +last index tuple on the post-split left page as belonging on the left page, > > +and the first index tuple on the post-split right page as belonging on the > > +right page. > > That's a very long sentence. Will restructure. > > * Since the truncated tuple is probably smaller than the > > * original, it cannot just be copied in place (besides, we want > > * to actually save space on the leaf page). We delete the > > * original high key, and add our own truncated high key at the > > * same offset. It's okay if the truncated tuple is slightly > > * larger due to containing a heap TID value, since pivot tuples > > * are treated as a special case by _bt_check_third_page(). > > By "treated as a special case", I assume that _bt_check_third_page() > always reserves some space for that? Maybe clarify that somehow. I'll just say that _bt_check_third_page() reserves space for it in the next revision of the patch. > _bt_truncate(): > > This is possible when there are > > * attributes that follow an attribute in firstright that is not equal to the > > * corresponding attribute in lastleft (equal according to insertion scan key > > * semantics). > > I can't comprehend that sentence. Simpler English, maybe add an example, > please. Okay. > > static int > > _bt_leave_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright, > > bool build) > > IMHO "keep" would sound better here than "leave". WFM. > Could restructure this to avoid having two almost identical strings to > translate. I'll try. > > #define BTREE_METAPAGE 0 /* first page is meta */ > > #define BTREE_MAGIC 0x053162 /* magic number of btree pages */ > > -#define BTREE_VERSION 3 /* current version number */ > > +#define BTREE_VERSION 4 /* current version number */ > > #define BTREE_MIN_VERSION 2 /* minimal supported version number */ > > +#define BTREE_META_VERSION 3 /* minimal version with all meta fields */ > > BTREE_META_VERSION is a strange name for version 3. I think this > deserves a more verbose comment, above these #defines, to list all the > versions and their differences. Okay, but what would be better? I'm trying to convey that BTREE_META_VERSION is the last version where upgrading was a simple matter of changing the metapage, which can be performed on the fly. The details of what were added to v3 (what nbtree stuff went into Postgres 11) are not really interesting enough to have a descriptive nbtree.h #define name. The metapage-only distinction is actually the interesting distinction here (if I could do the upgrade on-the-fly, there'd be no need for a v3 #define at all). 
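To sketch what such a comment might look like (wording drawn only from what has been said in this thread, not from any committed code, so treat it as a rough draft):

/*
 * One possible shape for the more verbose comment (a sketch based only on
 * what this thread states):
 *
 * Version 2 (BTREE_MIN_VERSION): oldest on-disk format we still read.
 * Version 3 (BTREE_META_VERSION): same key space and tuple format as
 *      version 2, but the metapage has all of the newer fields; a v2
 *      metapage can be upgraded to v3 on the fly.
 * Version 4 (BTREE_VERSION): heap TID is treated as a tiebreaker key
 *      attribute, so the key space differs; no on-the-fly upgrade from v3.
 */
#define BTREE_METAPAGE      0           /* first page is meta */
#define BTREE_MAGIC         0x053162    /* magic number of btree pages */
#define BTREE_VERSION       4           /* current version number */
#define BTREE_MIN_VERSION   2           /* minimal supported version number */
#define BTREE_META_VERSION  3           /* minimal version with all meta fields */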
> v9-0003-Pick-nbtree-split-points-discerningly.patch commit message: > > Add infrastructure to determine where the earliest difference appears > > among a pair of tuples enclosing a candidate split point. > > I don't understand this sentence. A (candidate) split point is a point *between* two enclosing tuples on the original page, provided you pretend that the new tuple that caused the split is already on the original page. I probably don't need to be (un)clear on that in the commit message, though. I think that I'll probably end up committing 0002-* and 0003-* in one go anyway (though not before doing the insertion scan key struct refactoring in a separate commit, as you suggest). > > _bt_findsplitloc() is also taught to care about the case where there are > > many duplicates, making it hard to find a distinguishing split point. > > _bt_findsplitloc() may even conclude that it isn't possible to avoid > > filling a page entirely with duplicates, in which case it packs pages > > full of duplicates very tightly. > > Hmm. Is the assumption here that if a page is full of duplicates, there > will be no more insertions into that page? Why? This is a really important point, that should probably have been in your main feedback, rather than the laundry list. I was hoping you'd comment on this more, in fact. Imagine the extreme (and admittedly unrealistic) case first: We have a page full of duplicates, all of which point to one heap page, and with a gapless sequence of heap TID item pointers. It's literally impossible to have another page split in this extreme case, because VACUUM is guaranteed to kill the tuples in the leaf page before anybody can insert next time (IOW, there has to be TID recycling before an insertion into the leaf page is even possible). Now, I've made the "fillfactor" 99, so I haven't actually assumed that there will be *no* further insertions on the page. I'm almost assuming that, but not quite. My thinking was that I should match the greedy behavior that we already have to some degree, and continue to pack leaf pages full of duplicates very tight. I am quite willing to consider whether or not I'm still being too aggressive, all things considered. If I made it 50:50, that would make indexes with relatively few distinct values significantly larger than on master, which would probably be deemed a regression. FWIW, I think that even that regression in space utilization would be more than made up for in other ways. The master branch _bt_findinsertloc() stuff is a disaster with many duplicates for a bunch of reasons that are even more important than the easy-to-measure bloat issue (FPIs, unnecessary buffer lock contention... I could go on). What value do you think works better than 99? 95? 90? I'm open minded about this. I have my own ideas about why 99 works, but they're based on intuitions that might fail to consider something important. The current behavior with many duplicates is pretty awful, so we can at least be sure that it isn't any worse than that. > What do you do instead, then? memcmp? (Reading the patch, yes. > Suggestion: "We use a faster binary comparison, instead of proper > datatype-aware comparison, for speed". WFM. > Aside from performance, it would feel inappropriate to call user-defined > code while holding a buffer lock, anyway. But we do that all the time for this particular variety of user defined code? 
I mean, we actually *have* to use the authoritative comparisons at the last moment, once we actually make our mind up about where to split -- nothing else is truly trustworthy. So, uh, we actually do this "inappropriate" thing -- just not that much of it. > I'd leave out the ", without provoking a split" part. Or maybe reword to > "if you pretend that the incoming tuple fit and was placed on the page > already". Okay. > It took me a while to understand what the "appear as early as possible" > means here. It's talking about a multi-column index, and about finding a > difference in one of the leading key columns. Not, for example, about > finding a split point early in the page. This is probably a hold-over from when we didn't look at candidate split point tuples an attribute at a time (months ago, it was something pretty close to a raw memcmp()). Will fix. > Perhaps we should leave out these details in the README, and explain > this in the comments of the picksplit-function itself? In the README, I > think a more high-level description of what things are taken into > account when picking the split point, would be enough. Agreed. > > +Suffix truncation is primarily valuable because it makes pivot tuples > > +smaller, which delays splits of internal pages, but that isn't the only > > +reason why it's effective. > > Suggestion: reword to "... , but that isn't the only benefit" ? WFM. > > There are cases where suffix truncation can > > +leave a B-Tree significantly smaller in size than it would have otherwise > > +been without actually making any pivot tuple smaller due to restrictions > > +relating to alignment. > > Suggestion: reword to "... smaller in size than it would otherwise be, > without ..." WFM. > and "without making any pivot tuple *physically* smaller, due to alignment". WFM. > This sentence is a bit of a cliffhanger: what are those cases, and how > is that possible? This is something you see with the TPC-C indexes, even without the new split stuff. The TPC-C stock pk is about 45% smaller with that later commit, but it's something like 6% or 7% smaller even without it (or maybe it's the orderlines pk). And without ever managing to make a pivot tuple physically smaller. This happens because truncating away trailing attributes allows more stuff to go on the first right half of a split. In more general terms: suffix truncation avoids committing ourselves to rules about where values should go that are stricter than truly necessary. On balance, this improves space utilization quite noticeably, even without the special cases where really big improvements are made. If that still doesn't make sense, perhaps you should just try out the TPC-C stuff without the new split patch, and see for yourself. The easiest way to do that is to follow the procedure I describe here: https://bitbucket.org/openscg/benchmarksql/issues/6/making-it-easier-to-recreate-postgres-tpc (BenchmarkSQL is by far the best TPC-C implementation I've found that works with Postgres, BTW. Yes, I also hate Java.) > Ok, I guess these sentences resolve the cliffhanger I complained about. > But this still feels like magic. When you split a page, all of the > keyspace must belong on the left or the right page. Why does it make a > difference to space utilization, where exactly you split the key space? You have to think about the aggregate effect, rather than thinking about a single split at a time. But, like I said, maybe the best thing is to see the effect for yourself with TPC-C (while reverting the split-at-new-item patch). 
> Ok, so this explains it further, I guess. I find this paragraph > difficult to understand, though. The important thing here is the idea > that some split points are more "discriminating" than others, but I > think it needs some further explanation. What makes a split point more > discriminating? Maybe add an example. An understandable example seems really hard, even though the effect is clear. Maybe I should just say *nothing* about the benefits when pivot tuples don't actually shrink? I found it pretty interesting, and maybe even something that makes it more understandable, but maybe that isn't a good enough reason to keep the explanation. This doesn't address your exact concern, but I think it might help: Bayer's Prefix B-tree paper talks about the effect of being more aggressive in finding a split point. You tend to be able to make index have more leaf pages but fewer internal pages as you get more aggressive about split points. However, both internal pages and leaf pages eventually become more numerous than they'd be with a reasonable interval/level of aggression/discernment -- the saving in internal pages no longer compensates for having more downlinks in internal pages. Bayer ends up saying next to nothing about how big the "split interval" should be. BTW, somebody named Timothy L. Towns wrote the only analysis I've been able to find on split interval for "simply prefix B-Trees" (suffix truncation): https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1 He is mostly talking about the classic case from Bayer's 77 paper, where everything is a memcmp()-able string, which is probably what some systems actually do. On the other hand, I care about attribute granularity. Anyway, it's pretty clear that this Timothy L. Towns fellow should have picked a better topic for his thesis, because he fails to say anything practical about it. Unfortunately, a certain amount of magic in this area is unavoidable. > > +Suffix truncation may make a pivot tuple *larger* than the non-pivot/leaf > > +tuple that it's based on (the first item on the right page), since a heap > > +TID must be appended when nothing else distinguishes each side of a leaf > > +split. Truncation cannot simply reuse the leaf level representation: we > > +must append an additional attribute, rather than incorrectly leaving a heap > > +TID in the generic IndexTuple item pointer field. (The field is already > > +used by pivot tuples to store their downlink, plus some additional > > +metadata.) > > That's not really the fault of suffix truncation as such, but the > process of turning a leaf tuple into a pivot tuple. It would happen even > if you didn't truncate anything. Fair. Will change. > I think this point, that we have to store the heap TID differently in > pivot tuples, would deserve a comment somewhere else, too. While reading > the patch, I didn't realize that that's what we're doing, until I read > this part of the README, even though I saw the new code to deal with > heap TIDs elsewhere in the code. Not sure where, maybe in > BTreeTupleGetHeapTID(). Okay. > This is the first mention of "many duplicates" mode. Maybe just say > "_bt_findsplitloc() almost always ..." or "The logic for selecting a > split point goes to great lengths to avoid heap TIDs in pivots, and > almost always manages to pick a split point between two > user-key-distinct tuples, accepting a completely lopsided split if it must." Sure. 
> > Once appending a heap > > +TID to a split's pivot becomes completely unavoidable, there is a fallback > > +strategy --- "single value" mode is used, which makes page splits pack the > > +new left half full by using a high fillfactor. Single value mode leads to > > +better overall space utilization when a large number of duplicates are the > > +norm, and thereby also limits the total number of pivot tuples with an > > +untruncated heap TID attribute. > > This assumes that tuples are inserted in increasing TID order, right? > Seems like a valid assumption, no complaints there, but it's an > assumption nevertheless. I can be explicit about that. See also: my remarks above about "fillfactor" with single value mode. > I'm not sure if this level of detail is worthwhile in the README. This > logic on deciding the split point is all within the _bt_findsplitloc() > function, so maybe put this explanation there. In the README, a more > high-level explanation of what things _bt_findsplitloc() considers, > should be enough. Okay. > _bt_findsplitloc(), and all its helper structs and subroutines, are > about 1000 lines of code now, and big part of nbtinsert.c. Perhaps it > would be a good idea to move it to a whole new nbtsplitloc.c file? It's > a very isolated piece of code. Good idea. I'll give that a go. > In the comment on _bt_leave_natts_fast(): > That's an interesting tidbit, but I'd suggest just removing this comment > altogether. It's not really helping to understand the current > implementation. Will do. > v9-0005-Add-high-key-continuescan-optimization.patch commit message: > > > Note that even pre-pg_upgrade'd v3 indexes make use of this > > optimization. > > .. but we're missing the other optimizations that make it more > effective, so it probably won't do much for v3 indexes. Does it make > them slower? It's probably acceptable, even if there's a tiny > regression, but I'm curious. But v3 indexes get the same _bt_findsplitloc() treatment as v4 indexes -- the new-item-split stuff works almost as well for v3 indexes, and the other _bt_findsplitloc() stuff doesn't seem to make much difference. I'm not sure if that's the right thing to do (probably doesn't matter very much). Now, to answer your question about v3 indexes + the continuescan optimization: I think that it probably will help a bit, with or without the _bt_findsplitloc() changes. Much harder to be sure whether it's worth it on balance, since that's workload dependent. My sense is that it's a much smaller benefit much of the time, but the cost is still pretty low. So why not just make it version-generic, and keep things relatively uncluttered? Once again, I greatly appreciate your excellent review! -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 29/12/2018 01:04, Peter Geoghegan wrote: >> However, naively computing the penalty upfront for every offset would be >> a bit wasteful. Instead, start from the middle of the page, and walk >> "outwards" towards both ends, until you find a "good enough" penalty. > > You can't start at the middle of the page, though. > > You have to start at the left (though you could probably start at the > right instead). This is because of page fragmentation -- it's not > correct to assume that the line pointer offset into tuple space on the > page (firstright linw pointer lp_off for candidate split point) tells > you anything about what the space delta will be after the split. You > have to exhaustively add up the free space before the line pointer > (the free space for all earlier line pointers) before seeing if the > line pointer works as a split point, since each previous line > pointer's tuple could be located anywhere in the original page's tuple > space (anywhere to the left or to the right of where it would be in > the simple/unfragmented case). Right. You'll need to do the free space computations from left to right, but once you have done that, you can compute the penalties in any order. I'm envisioning that you have an array, with one element for each item on the page (including the tuple we're inserting, which isn't really on the page yet). In the first pass, you count up from left to right, filling the array. Next, you compute the complete penalties, starting from the middle, walking outwards. That's not so different from what you're doing now, but I find it more natural to explain the algorithm that way. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Right. You'll need to do the free space computations from left to right, > but once you have done that, you can compute the penalties in any order. > > I'm envisioning that you have an array, with one element for each item > on the page (including the tuple we're inserting, which isn't really on > the page yet). In the first pass, you count up from left to right, > filling the array. Next, you compute the complete penalties, starting > from the middle, walking outwards. > > That's not so different from what you're doing now, but I find it more > natural to explain the algorithm that way. Ah, right. I think I see what you mean now. I like that this data structure explicitly has a place for the new item, so you really do "pretend it's already on the page". Maybe that's what you liked about it as well. I'm a little concerned about the cost of maintaining the data structure. This sounds workable, but we probably don't want to allocate a buffer most of the time, or even hold on to the information most of the time. The current design throws away potentially useful information that it may later have to recreate, but even that has the benefit of having little storage overhead in the common case. Leave it with me. I'll need to think about this some more. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Alexander Korotkov
Date:
Hi! I'm starting to look at this patchset. Not ready to post a detailed review, but I have a couple of questions. On Wed, Sep 19, 2018 at 9:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > I still haven't managed to add pg_upgrade support, but that's my next > step. I am more or less happy with the substance of the patch in v5, > and feel that I can now work backwards towards figuring out the best > way to deal with on-disk compatibility. It shouldn't be too hard -- > most of the effort will involve coming up with a good test suite. Yes, it shouldn't be too hard, but it seems like we have to keep two branches of code for different handling of duplicates. Is that true? + * In the worst case (when a heap TID is appended) the size of the returned + * tuple is the size of the first right tuple plus an additional MAXALIGN() + * quantum. This guarantee is important, since callers need to stay under + * the 1/3 of a page restriction on tuple size. If this routine is ever + * taught to truncate within an attribute/datum, it will need to avoid + * returning an enlarged tuple to caller when truncation + TOAST compression + * ends up enlarging the final datum. I didn't get the point of this paragraph. Might it happen that the first right tuple is under the tuple size restriction, but the new pivot tuple is beyond that restriction? If so, would we get an error because the pivot tuple is too long? If not, I think this needs to be explained better. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
Hi Alexander, On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > I'm starting to look at this patchset. Not ready to post a detailed > review, but I have a couple of questions. Thanks for taking a look! > Yes, it shouldn't be too hard, but it seems like we have to keep two > branches of code for different handling of duplicates. Is that true? Not really. If you take a look at v9, you'll see the approach I've taken is to make insertion scan keys aware of which rules apply (the "heapkeyspace" field controls this). I think that there are about 5 "if" statements for that outside of amcheck. It's pretty manageable. I like to imagine that the existing code already has unique keys, but nobody ever gets to look at the final attribute. It works that way most of the time -- the only exception is insertion with user keys that aren't unique already. Note that the way we move left on equal pivot tuples, rather than right (rather than following the pivot's downlink) wasn't invented by Postgres to deal with the lack of unique keys. That's actually a part of the Lehman and Yao design itself. Almost all of the special cases are optimizations rather than truly necessary infrastructure. > I didn't get the point of this paragraph. Might it happen that the > first right tuple is under the tuple size restriction, but the new > pivot tuple is beyond that restriction? If so, would we get an error > because the pivot tuple is too long? If not, I think this needs to be > explained better. The v9 version of the function _bt_check_third_page() shows what it means (comments on this will be improved in v10, too). The old limit of 2712 bytes still applies to pivot tuples, while a new, lower limit of 2704 bytes applies to non-pivot tuples. This difference is necessary because an extra MAXALIGN() quantum could be needed to add a heap TID to a pivot tuple during truncation in the worst case. To users, the limit is 2704 bytes, because that's the limit that actually needs to be enforced during insertion. We never actually say "1/3 of a page means 2704 bytes" in the docs, since the definition was always a bit fuzzy. There will need to be a compatibility note in the release notes, though. -- Peter Geoghegan
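As an illustration of where those two numbers come from on the default 8192-byte page, here is a hedged arithmetic sketch; the names below are stand-ins, not the verbatim nbtree.h macros.

#include <stdio.h>
#include <stddef.h>

#define SKETCH_BLCKSZ        8192   /* default page size */
#define SKETCH_PAGEHDR       24     /* page header size */
#define SKETCH_ITEMID        4      /* size of one line pointer */
#define SKETCH_BTOPAQUE      16     /* nbtree special space, MAXALIGN'd */
#define SKETCH_HEAPTID       6      /* block number + offset number */

#define SKETCH_MAXALIGN(x)       (((x) + 7) & ~((size_t) 7))
#define SKETCH_MAXALIGN_DOWN(x)  ((x) & ~((size_t) 7))

int
main(void)
{
    size_t      usable;
    size_t      pivot_limit;
    size_t      nonpivot_limit;

    /*
     * Space left once the page header, three line pointers, and the
     * special space are accounted for, divided three ways.
     */
    usable = SKETCH_BLCKSZ -
        SKETCH_MAXALIGN(SKETCH_PAGEHDR + 3 * SKETCH_ITEMID) -
        SKETCH_BTOPAQUE;
    pivot_limit = SKETCH_MAXALIGN_DOWN(usable / 3);

    /* reserve room for a worst-case appended heap TID in pivot tuples */
    nonpivot_limit = pivot_limit - SKETCH_MAXALIGN(SKETCH_HEAPTID);

    printf("pivot limit: %zu bytes, non-pivot limit: %zu bytes\n",
           pivot_limit, nonpivot_limit);   /* 2712 and 2704 */
    return 0;
}

The 8-byte gap is just MAXALIGN(sizeof(ItemPointerData)): the worst-case growth from appending a heap TID during truncation.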
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I'm envisioning that you have an array, with one element for each item > > on the page (including the tuple we're inserting, which isn't really on > > the page yet). In the first pass, you count up from left to right, > > filling the array. Next, you compute the complete penalties, starting > > from the middle, walking outwards. > Ah, right. I think I see what you mean now. > Leave it with me. I'll need to think about this some more. Attached is v10 of the patch series, which has many changes based on your feedback. However, I didn't end up refactoring _bt_findsplitloc() in the way you described, because it seemed hard to balance all of the concerns there. I still have an open mind on this question, and recognize the merit in what you suggested. Perhaps it's possible to reach a compromise here. I did refactor the _bt_findsplitloc() stuff to make the division of work clearer, though -- I think that you'll find that to be a clear improvement, even though it's less than what you asked for. I also moved all of the _bt_findsplitloc() stuff (old and new) into its own .c file, nbtsplitloc.c, as you suggested. Other significant changes ========================= * Creates a new commit that changes routines like _bt_search() and _bt_binsrch() to use a dedicated insertion scankey struct, per request from Heikki. * As I mentioned in passing, many other small changes to comments, the nbtree README, and the commit messages based on your (Heikki's) first round of review. * v10 generalizes the previous _bt_lowest_scantid() logic for adding a tie-breaker on equal pivot tuples during a descent of a B-Tree. The new code works with any truncated attribute, not just a truncated heap TID (I removed _bt_lowest_scantid() entirely). This also allowed me to remove a couple of places that previously opted in to _bt_lowest_scantid(), since the new approach can work without anybody explicitly opting in. As a bonus, the new approach makes the patch faster, since remaining queries where we unnecessarily follow an equal-though-truncated downlink are fixed (it's usually only the heap TID that's truncated when we can do this, but not always). The idea behind this new generalized approach is to recognize that minus infinity is an artificial/sentinel value that doesn't appear in real keys (it only appears in pivot tuples). The majority of callers (all callers aside from VACUUM's leaf page deletion code) can therefore go to the right of a pivot that has all-equal attributes, if and only if: 1. The pivot has at least one truncated/minus infinity attribute *and* 2. The number of attributes matches the scankey. In other words, we tweak the comparison logic to add a new tie-breaker. (A rough sketch of this rule appears at the end of this message.) There is no change to the on-disk structures compared to v9 of the patch -- I've only made index scans able to take advantage of minus infinity values in *all* cases. If this explanation is confusing to somebody less experienced with nbtree than Heikki: consider the way we descend *between* the values on internal pages, rather than expecting exact matches. _bt_binsrch() behaves slightly differently when doing a binary search on an internal page already: equality actually means "go left" when descending the tree (though it doesn't work like that on leaf pages, where insertion scankeys almost always search for a >= match). 
We want to "go right" instead in cases where it's clear that tuples of interest to our scan can only be in that child page (we're rarely searching for a minus infinity value, since that doesn't appear in real tuples). (Note that this optimization has nothing to do with "moving right" to recover from concurrent page splits -- we would have relied on code like _bt_findsplicloc() and _bt_readpage() to move right once we reach the leaf level when we didn't have this optimization, but that code isn't concerned with recovering from concurrent page splits.) Minor changes ============= * Addresses Heikki's concerns about locking the metapage more frequently in a general way. Comments are added to nbtpage.c, and updated in a number of places that already talk about the same risk. The master branch seems to be doing much the same thing in similar situations already (e.g. during a root page split, when we need to finish an interrupted page split but don't have a usable parent/ancestor page stack). Importantly, the patch does not change the dependency graph. * Small changes to user docs where existing descriptions of things seem to be made inaccurate by the patch. Benchmarking ============ I have also recently been doing a lot of automated benchmarking. Here are results of a BenchmarkSQL benchmark (plus various instrumentation) as a bz2 archive: https://drive.google.com/file/d/1RVJUzMtMNDi4USg0-Yo56LNcRItbFg1Q/view?usp=sharing I completed on my home server last night, against v10 of the patch series. Note that there were 4 runs for each case (master case + public/patch case), with each run lasting 2 hours (so the benchmark took over 8 hours once you include bulk loading time). There were 400 "warehouses" (this is similar to pgbench's scale factor), and 16 terminals/clients. This left the database 110GB+ in size on a server with 32GB of memory and a fast consumer grade SSD. Autovacuum was tuned to perform aggressive cleanup of bloat. All the settings used are available in the bz2 archive (there are "settings" output files, too). Summary ------- See the html "report" files for a quick visual indication of how the tests progresses. BenchmarkSQL uses R to produce useful graphs, which is quite convenient. (I have automated a lot of this with my own ugly shellscript.) We see a small but consistent increase in transaction throughput here, as well as a small but consistent decrease in average latency for each class of transaction. There is also a large and consistent decrease in the on-disk size of indexes, especially if you just consider the number of internal pages (diff the "balance" files to see what I mean). Note that the performance is expected to degrade across runs, since the database is populated once, at the start, and has more data added over time; the important thing is that run n on master be compared to run n on public/patch. Note also that I use my own fork of BenchmarkSQL that does its CREATE INDEX before initial bulk loading, not after [1]. It'll take longer to see problems on Postgres master if the initial bulk load does CREATE INDEX after BenchmarkSQL workers populate tables (we only need INSERTs to see significant index bloat). Avoiding pristine indexes at the start of the benchmark makes the problems on the master branch apparent sooner. The benchmark results also include things like pg_statio* + pg_stat_bgwriter view output (reset between test runs), which gives some insight into what's going on. 
Checkpoints tend to write out a few more dirty buffers with the patch, while there is a much larger drop in the number of buffers written out by backends. There are probably workloads where we'd see a much larger increase in transaction throughput -- TPC-C happens to access index pages with significant locality, and happens to be very write-heavy, especially compared to the more modern (though less influential) TPC-E benchmark. Plus, the TPC-C workload isn't at all helped by the fact that the patch will never "get tired", even though that's the most notable improvement overall. [1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations -- Peter Geoghegan
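Here is the rough sketch of the tie-breaker rule promised above. The names and types are hypothetical stand-ins rather than the actual patch code -- the real logic lives in the patched _bt_compare()/_bt_binsrch() paths -- but the decision it encodes follows the two conditions described in this message:

    #include <stdbool.h>

    /* simplified stand-ins for the insertion scankey and a pivot tuple summary */
    typedef struct DescentKey
    {
        int  nkeyatts;         /* attributes for which the scankey has values */
        bool leafpagedeletion; /* true only for VACUUM's leaf page deletion caller */
    } DescentKey;

    typedef struct PivotSummary
    {
        int  natts;            /* untruncated attributes remaining in the pivot */
        int  nindexatts;       /* total attributes in the index, incl. heap TID */
    } PivotSummary;

    /*
     * All compared attributes were equal.  May we descend to the right of this
     * pivot (i.e. follow its downlink) instead of moving left, as nbtree
     * classically does on equality?
     */
    static bool
    go_right_on_equal_pivot(const DescentKey *key, const PivotSummary *pivot)
    {
        if (key->leafpagedeletion)
            return false;      /* the VACUUM page deletion caller still goes left */

        /* 1. pivot has at least one truncated ("minus infinity") attribute */
        bool truncated = pivot->natts < pivot->nindexatts;

        /* 2. the scankey's number of attributes matches the pivot's */
        bool covered = key->nkeyatts == pivot->natts;

        return truncated && covered;
    }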
Attachment
- v10-0002-Add-pg_depend-index-scan-tiebreaker-column.patch
- v10-0001-Refactor-nbtree-insertion-scankeys.patch
- v10-0004-Pick-nbtree-split-points-discerningly.patch
- v10-0005-Add-split-at-new-tuple-page-split-optimization.patch
- v10-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v10-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v10-0006-Add-high-key-continuescan-optimization.patch
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Tue, Jan 8, 2019 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v10 of the patch series, which has many changes based on > your feedback. However, I didn't end up refactoring _bt_findsplitloc() > in the way you described, because it seemed hard to balance all of the > concerns there. I still have an open mind on this question, and > recognize the merit in what you suggested. Perhaps it's possible to > reach a compromise here. > * Addresses Heikki's concerns about locking the metapage more > frequently in a general way. Comments are added to nbtpage.c, and > updated in a number of places that already talk about the same risk. Attached is v11 of the patch, which removes the comments mentioned here, and instead finds a way to not do new things with buffer locks. Changes ======= * We simply avoid holding buffer locks while accessing the metapage. (Of course, the old root page split stuff still does this -- it has worked that way forever.) * We also avoid calling index_getprocinfo() with any buffer lock held. We'll still call support function 1 with a buffer lock held to truncate, but that's not new -- *any* insertion will do that. For some reason I got stuck on the idea that we need to use a scankey's own values within _bt_truncate()/_bt_keep_natts() by generating a new insertion scankey every time. We now simply ignore those values, and call the comparator with pairs of tuples that each come from the page directly. Usually, we'll just reuse the insertion scankey that we were using for the insertion anyway (we no longer build our own scankey for truncation). Other times, we'll build an empty insertion scankey (one that has the function pointer and so on, but no values). The only downside is that I cannot have an assertion that calls _bt_compare() to make sure we truncated correctly there and then, since a dedicated insertion scankey is no longer conveniently available. I feel rather silly for not having gone this way from the beginning, because the new approach is quite obviously simpler and safer. nbtsort.c now gets a reusable, valueless insertion scankey that it uses for both truncation and for setting up a merge of the two spools for unique index builds. This approach allows me to remove _bt_mkscankey_nodata() altogether -- callers build a "nodata" insertion scankey with empty values by passing _bt_mkscankey() a NULL tuple. That's equivalent to having an insertion scankey built from an all-attributes-truncated tuple. IOW, the patch now makes the "nodata" stuff a degenerate case of building a scankey from a truncated-attributes tuple. tuplesort.c also uses the new BTScanInsert struct. There is no longer any case where there is an insertion scankey that isn't accessed using the BTScanInsert struct. * No more pg_depend tie-breaker column commit. That was an ugly hack that I'm glad to be rid of -- many thanks to Tom for working through a number of test instability issues that affected the patch. I do still need to paper-over one remaining regression test issue/bug that the patch happens to unmask, pending Tom fixing it directly [1]. This papering-over is broken out into its own commit ("v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch"). I expect that Tom will fix the bug before too long, at which point the temporary workaround can just be reverted from your local tree. Outlook ======= I feel that this version is pretty close to being committable, since everything about the design is settled. 
It completely avoids saying anything new about buffer locking protocols, LWLock deadlock safety, etc. VACUUM and crash recovery are also unchanged, so subtle bugs should at least not be too hard to reproduce when observed once. It's pretty complementary code: the new logic for picking a split point builds a list of candidate split points using the old technique, with a second pass to choose the best one for suffix truncation among the accumulated list. Hard to see how that could introduce an invalid split point choice. I take the risk of introducing new corruption bugs very seriously. contrib/amcheck now verifies all aspects of the new on-disk representation. The stricter Lehman & Yao style invariant ("the subtree S is described by Ki < v <= Ki + 1 ...") allows amcheck to be stricter in what it will accept (e.g., heap TID needs to be in order among logical duplicates, we always expect to see a representation of the number of pivot tuple attributes, and we expect the high key to be strictly greater than items on internal pages). Review ====== It would be very helpful if a reviewer such as Heikki or Alexander could take a look at the patch once more. I suggest that they look at the following points in the patch: * The minusinfkey stuff, which is explained within _bt_compare(), and within nbtree.h header comments. Page deletion by VACUUM is the only _bt_search() caller that sets minusinfkey to true (though older versions of btree and amcheck also set minusinfkey to true). * Does the value of BTREE_SINGLEVAL_FILLFACTOR make sense? Am I being a little too aggressive there, possibly hurting workloads where HOT pruning occurs periodically? Sane duplicate handling is the most compelling improvement that the patch makes, but I may still have been a bit too aggressive in packing pages full of duplicates so tightly. I figured that that was the closest thing to the previous behavior that's still reasonable. * Does the _bt_splitatnewitem() criteria for deciding if we should split at the point the new tuple is positioned at miss some subtlety? It's important that splitting at the new item when we shouldn't doesn't happen, or hardly ever happens -- it should be *self-limiting*. This was tested using BenchmarkSQL/TPC-C [2] -- TPC-C has a workload where this particular enhancement makes indexes a lot smaller. * There was also testing of index bloat following bulk insertions, based on my own custom test suite. Data and indexes were taken from TPC-C tables, TPC-H tables, TPC-E tables, UK land registry data [3], and the Mouse Genome Database Project (Postgres schema + indexes) [4]. This takes almost an hour to run on my development machine, though the most important tests finish in less than 5 minutes. I can provide access to all or some of these tests, if reviewers are interested and are willing to download several gigabytes of sample data that I'll provide privately. [1] https://postgr.es/m/19220.1547767251@sss.pgh.pa.us [2] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations [3] https://wiki.postgresql.org/wiki/Sample_Databases [4] http://www.informatics.jax.org/software.shtml -- Peter Geoghegan
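As a toy illustration of the stricter invariant that amcheck can now verify -- every item v on a child page must satisfy Ki < v <= Ki+1, with the heap TID participating as the tie-breaker -- here is a self-contained sketch. The types are stand-ins, not the real IndexTuple machinery:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct ToyKey
    {
        int  userkey;   /* stands in for the user-visible key attributes */
        long heaptid;   /* tie-breaker, compared last */
    } ToyKey;

    static int
    toy_compare(const ToyKey *a, const ToyKey *b)
    {
        if (a->userkey != b->userkey)
            return (a->userkey < b->userkey) ? -1 : 1;
        if (a->heaptid != b->heaptid)
            return (a->heaptid < b->heaptid) ? -1 : 1;
        return 0;
    }

    /*
     * True iff every item on the child page respects the strict bounds taken
     * from its parent: strictly greater than the downlink's key (lowkey), and
     * less than or equal to the next downlink's key (highkey).
     */
    static bool
    toy_check_child_bounds(const ToyKey *lowkey, const ToyKey *highkey,
                           const ToyKey *items, size_t nitems)
    {
        for (size_t i = 0; i < nitems; i++)
        {
            if (toy_compare(lowkey, &items[i]) >= 0)
                return false;   /* violates Ki < v */
            if (toy_compare(&items[i], highkey) > 0)
                return false;   /* violates v <= Ki+1 */
        }
        return true;
    }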
Attachment
- v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch
- v11-0001-Refactor-nbtree-insertion-scankeys.patch
- v11-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v11-0004-Pick-nbtree-split-points-discerningly.patch
- v11-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v11-0006-Add-high-key-continuescan-optimization.patch
- v11-0005-Add-split-at-new-tuple-page-split-optimization.patch
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Heikki Linnakangas
Date:
On 09/01/2019 02:47, Peter Geoghegan wrote: > On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote: >> On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> I'm envisioning that you have an array, with one element for each item >>> on the page (including the tuple we're inserting, which isn't really on >>> the page yet). In the first pass, you count up from left to right, >>> filling the array. Next, you compute the complete penalties, starting >>> from the middle, walking outwards. > >> Ah, right. I think I see what you mean now. > >> Leave it with me. I'll need to think about this some more. > > Attached is v10 of the patch series, which has many changes based on > your feedback. However, I didn't end up refactoring _bt_findsplitloc() > in the way you described, because it seemed hard to balance all of the > concerns there. I still have an open mind on this question, and > recognize the merit in what you suggested. Perhaps it's possible to > reach a compromise here. I spent some time first trying to understand the current algorithm, and then rewriting it in a way that I find easier to understand. I came up with the attached. I think it optimizes for the same goals as your patch, but the approach is quite different. At a very high level, I believe the goals can be described as: 1. Find out how much suffix truncation is possible, i.e. how many key columns can be truncated away, in the best case, among all possible ways to split the page. 2. Among all the splits that achieve that optimum suffix truncation, find the one with smallest "delta". For performance reasons, it doesn't actually do it in that order. It's more like this: 1. First, scan all split positions, recording the 'leftfree' and 'rightfree' at every valid split position. The array of possible splits is kept in order by offset number. (This scans through all items, but the math is simple, so it's pretty fast) 2. Compute the optimum suffix truncation, by comparing the leftmost and rightmost keys, among all the possible split positions. 3. Split the array of possible splits in half, and process both halves recursively. The recursive process "zooms in" to the place where we'd expect to find the best candidate, but will ultimately scan through all split candidates, if no "good enough" match is found. (A rough sketch of this structure appears at the end of this message.) I've only been testing this on leaf splits. I didn't understand how the penalty worked for internal pages in your patch. In this version, the same algorithm is used for leaf and internal pages. I'm sure this still has bugs in it, and could use some polishing, but I think this will be a more readable way of doing it. What have you been using to test this? I wrote the attached little test extension, to explore what _bt_findsplitloc() decides in different scenarios. It's pretty rough, but that's what I've been using while hacking on this. 
It prints output like this:

postgres=# select test_split();
NOTICE:  test 1:
left 2/358: 1 0
left 358/358: 1 356
right 1/ 51: 1 357
right 51/ 51: 1 407 SPLIT TUPLE
split ratio: 12/87
NOTICE:  test 2:
left 2/358: 0 0
left 358/358: 356 356
right 1/ 51: 357 357
right 51/ 51: 407 407 SPLIT TUPLE
split ratio: 12/87
NOTICE:  test 3:
left 2/358: 0 0
left 320/358: 10 10 SPLIT TUPLE
left 358/358: 48 48
right 1/ 51: 49 49
right 51/ 51: 99 99
split ratio: 12/87
NOTICE:  test 4:
left 2/309: 1 100
left 309/309: 1 407 SPLIT TUPLE
right 1/100: 2 0
right 100/100: 2 99
split ratio: 24/75

Each test consists of creating a temp table with one index, and inserting rows in a certain pattern, until the root page splits. It then prints the first and last tuples on both pages, after the split, as well as the tuple that caused the split. I don't know if this is useful to anyone but myself, but I thought I'd share it. - Heikki
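A rough skeleton of the recursive "zoom in" structure described above (made-up types and a simplistic "good enough" test; the real prototype works directly on page items and real free-space arithmetic):

    #include <stdbool.h>

    typedef struct SplitCand
    {
        int firstright;   /* offset of the first item to go on the right page */
        int leftfree;     /* step 1: free space recorded for each legal split */
        int rightfree;
        int truncatable;  /* key columns that suffix truncation could drop here */
    } SplitCand;

    /* "good enough": the two halves are within some slack of each other */
    static bool
    good_enough(const SplitCand *c, int slack)
    {
        int delta = c->leftfree - c->rightfree;

        return delta > -slack && delta < slack;
    }

    /*
     * Step 3: look at the middle candidate first, then recurse into both
     * halves, so the search "zooms in" near the expected best split but still
     * visits every candidate if nothing good enough turns up.  besttrunc is
     * the optimum truncation computed in step 2.  Returns an index into
     * cands[], or -1.
     */
    static int
    zoom_in(const SplitCand *cands, int lo, int hi, int besttrunc, int slack)
    {
        if (lo >= hi)
            return -1;

        int mid = lo + (hi - lo) / 2;

        if (cands[mid].truncatable == besttrunc && good_enough(&cands[mid], slack))
            return mid;

        int found = zoom_in(cands, lo, mid, besttrunc, slack);

        if (found != -1)
            return found;
        return zoom_in(cands, mid + 1, hi, besttrunc, slack);
    }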
Attachment
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I spent some time first trying to understand the current algorithm, and > then rewriting it in a way that I find easier to understand. I came up > with the attached. I think it optimizes for the same goals as your > patch, but the approach is quite different. At a very high level, I > believe the goals can be described as: > > 1. Find out how much suffix truncation is possible, i.e. how many key > columns can be truncated away, in the best case, among all possible ways > to split the page. > > 2. Among all the splits that achieve that optimum suffix truncation, > find the one with smallest "delta". Thanks for going to the trouble of implementing what you have in mind! I agree that the code that I wrote within nbtsplitloc.c is very subtle, and I do think that I have further work to do to make its design clearer. I think that this high level description of the goals of the algorithm is inaccurate in subtle but important ways, though. Hopefully there will be a way of making it more understandable that preserves certain important characteristics. If you had my test cases/data that would probably help you a lot (more on that later). The algorithm I came up with almost always does these two things in the opposite order, with each considered in clearly separate phases. There are good reasons for this. We start with the same criteria as the old algorithm. We assemble a small array of candidate split points, rather than one split point, but otherwise it's almost exactly the same (the array is sorted by delta). Then, at the very end, we iterate through the small array to find the best choice for suffix truncation. IOW, we only consider suffix truncation as a *secondary* goal. The delta is still by far the most important thing 99%+ of the time. I assume it's fairly rare to not have two distinct tuples within 9 or so tuples of the delta-wise optimal split position -- 99% is probably a low estimate, at least in OLTP apps, or within unique indexes. I see that you do something with a "good enough" delta that seems like it also makes delta the most important thing, but that doesn't appear to be, uh, good enough. ;-) Now, it's true that my approach does occasionally work in a way close to what you describe above -- it does this when we give up on default mode and check "how much suffix truncation is possible?" exhaustively, for every possible candidate split point. "Many duplicates" mode kicks in when we need to be aggressive about suffix truncation. Even then, the exact goals are different to what you have in mind in subtle but important ways. While "truncating away the heap TID" isn't really a special case in other places, it is a special case for my version of nbtsplitloc.c, which more or less avoids it at all costs. Truncating away heap TID is more important than truncating away any other attribute by a *huge* margin. Many duplicates mode *only* specifically cares about truncating the final TID attribute. That is the only thing that is ever treated as more important than delta, though even there we don't forget about delta entirely. That is, we assume that the "perfect penalty" is nkeyatts when in many duplicates mode, because we don't care about suffix truncation beyond heap TID truncation by then. So, if we find 5 split points out of 250 in the final array that avoid appending heap TID, we use the earliest/lowest delta out of those 5. (A rough sketch of this selection rule appears at the end of this message.) 
We're not going to try to maximize the number of *additional* attributes that get truncated, because that can make the leaf pages unbalanced in an *unbounded* way. None of these 5 split points are "good enough", but the distinction between their deltas still matters a lot. We strongly prefer a split with a *mediocre* delta to a split with a *terrible* delta -- a bigger high key is the least of our worries here. (I made similar mistakes myself months ago, BTW.) Your version of the algorithm makes a test case of mine (UK land registry test case [1]) go from having an index that's 1101 MB with my version of the algorithm/patch and 1329 MB on the master branch to an index that's 3030 MB in size. I think that this happens because it effectively fails to give any consideration to delta at all at certain points. On leaf pages with lots of unique keys, your algorithm does about as well as mine because all possible split points look equally good suffix-truncation-wise, plus you have the "good enough" test, so delta isn't ignored. I think that your algorithm also works well when there are many duplicates but only one non-TID index column, since the heap TID truncation versus other truncation issue does not arise. The test case I used is an index on "(county, city, locality)", though -- low cardinality, but more than a single column. That's a *combination* of two separate considerations that seem to get conflated. I don't think that you can avoid doing "a second pass" in some sense, because these really are separate considerations. There is an important middle-ground that your algorithm fails to handle with this test case. You end up maximizing the number of attributes that are truncated when you shouldn't -- leaf page splits are totally unbalanced much of the time. Pivot tuples are smaller on average, but are also far far more numerous, because there are more leaf page splits as a result of earlier leaf page splits being unbalanced. If instead you treated heap TID truncation as the only thing that you were willing to go to huge lengths to prevent, then unbalanced splits become *self-limiting*. The next split will probably end up being a single value mode split, which packs pages full of duplicates as tightly as possible. Splits should "degrade gracefully" from default mode to many duplicates mode to single value mode in cases where the number of distinct values is constant (or almost constant), but the total number of tuples grows over time. > I've only been testing this on leaf splits. I didn't understand how the > penalty worked for internal pages in your patch. In this version, the > same algorithm is used for leaf and internal pages. The approach that I use for internal pages is only slightly different to what we've always done -- I split very near the delta-wise optimal point, with a slight preference for a tuple that happens to be smaller. And, there is no way in which the delta-optimal point can be different to what it would have been on master with internal pages (they only use default mode). I don't think it's appropriate to use the same algorithm for leaf and internal page splits at all. We cannot perform suffix truncation on internal pages. > What have you been using to test this? I wrote the attached little test > extension, to explore what _bt_findsplitloc() decides in different > scenarios. I've specifically tested the _bt_findsplitloc() stuff using a couple of different techniques. 
Primarily, I've been using lots of real world data and TPC benchmark test data, with expected/test output generated by a contrib/pageinspect query that determines the exact number of leaf blocks and internal page blocks from each index in a test database. Just bash and SQL. I'm happy to share that with you, if you're able to accept a couple of gigabytes worth of dumps that are needed to make the scripts work. Details:

pg@bat:~/hdd/sample-data$ ll land_registry.custom.dump
-rw------- 1 pg pg 1.1G Mar 3 2018 land_registry.custom.dump
pg@bat:~/hdd/sample-data$ ll tpcc_2018-07-20_unlogged.dump
-rw-rw-r-- 1 pg pg 1.8G Jul 20 2018 tpcc_2018-07-20_unlogged.dump

(The only other components for these "fast" tests are simple bash scripts.) I think that you'd find it a lot easier to work with me on these issues if you at least had these tests -- my understanding of the problems was shaped by the tests. I strongly recommend that you try out my UK land registry test and the TPC-C test as a way of understanding the design I've used for _bt_findsplitloc(). It shouldn't be that inconvenient to get it over to you. I have several more tests besides these two, but they're much more cumbersome and much less valuable. I have a script that I can run in 5 minutes that probably catches all the regressions. The long running stuff, like my TPC-E test case (the stuff that I won't bother sending) hasn't caught any regressions that the fast tests didn't catch as well. Separately, I also have a .gdbinit function that looks like this:

define dump_page
  dump binary memory /tmp/gdb_postgres_page.dump $arg0 ($arg0 + 8192)
  echo Invoking pg_hexedit + wxHexEditor on page...\n
  ! ~/code/pg_hexedit/pg_hexedit -n 1 /tmp/gdb_postgres_page.dump > /tmp/gdb_postgres_page.dump.tags
  ! ~/code/wxHexEditor/wxHexEditor /tmp/gdb_postgres_page.dump &> /dev/null
end

This allows me to see an arbitrary page from an interactive gdb session using my pg_hexedit tool. I can simply "dump_page page" from most functions in the nbtree source code. At various points I found it useful to add optimistic assertions to the split point choosing routines that failed. I could then see why they failed by using gdb with the resulting core dump. I could look at the page image using pg_hexedit/wxHexEditor from there. This allowed me to understand one or two corner cases. For example, this is how I figured out the exact details at the end of _bt_perfect_penalty(), when it looks like we're about to go into a second pass of the page. > It's pretty rough, but that's what I've been using while > hacking on this. It prints output like this: Cool! I did have something that would LOG the new high key in an easy to interpret way at one point, which was a little like this. [1] https://postgr.es/m/CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3bYT4yQ0_zzHxmw@mail.gmail.com -- Peter Geoghegan
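Here is the sketch of the "many duplicates" mode selection rule mentioned earlier in this message (hypothetical names and types, not the actual nbtsplitloc.c code):

    #include <stdbool.h>

    typedef struct Candidate
    {
        int  delta;          /* free space imbalance; array is sorted on this */
        bool needsheaptid;   /* would the new pivot still need a heap TID? */
    } Candidate;

    /*
     * Among candidate splits kept in order of increasing delta, take the first
     * one whose pivot can truncate away the heap TID.  We deliberately do not
     * try to maximize how many additional attributes get truncated.  Returns
     * -1 when every candidate would still need a heap TID appended, in which
     * case the caller has to settle for that or change strategy.
     */
    static int
    choose_many_duplicates_split(const Candidate *cands, int ncands)
    {
        for (int i = 0; i < ncands; i++)
        {
            if (!cands[i].needsheaptid)
                return i;    /* lowest-delta split that avoids a heap TID */
        }
        return -1;
    }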
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Mon, Jan 28, 2019 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > Thanks for going to the trouble of implementing what you have in mind! > > I agree that the code that I wrote within nbtsplitloc.c is very > subtle, and I do think that I have further work to do to make its > design clearer. I think that this high level description of the goals > of the algorithm is inaccurate in subtle but important ways, though. > Hopefully there will be a way of making it more understandable that > preserves certain important characteristics. Heikki and I had the opportunity to talk about this recently. We found an easy way forward. I believe that the nbtsplitloc.c algorithm itself is fine -- the code will need to be refactored, though. nbtsplitloc.c can be refactored to assemble a list of legal split points up front, before deciding which one to go with in a separate pass (using one of two "alternative modes", as before). I now understand that Heikki simply wants to separate the questions of "Is this candidate split point legal?" from "Is this known-legal candidate split point good/ideal based on my current criteria?". This seems like a worthwhile goal to me. Heikki accepts the need for multiple modes/passes, provided recursion isn't used in the implementation. It's clear to me that the algorithm should start off trying to split towards the middle of the page (or towards the end in the rightmost case), while possibly making a small compromise on the exact split point to maximize the effectiveness of suffix truncation. We must change strategy entirely if and only if the middle of the page (or wherever we'd like to split initially) is found to be completely full of duplicates -- that's where the need for a second pass comes in. This should almost never happen in most applications. Even when it happens, we only care about not splitting inside a group of duplicates. That's not the same thing as caring about maximizing the number of attributes truncated away. Those two things seem similar, but are actually very different. It might have sounded like Heikki and I disagreed on the design of the algorithm at a high level, or what its goals ought to be. That is not the case, though. (At least not so far.) -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Tue, Feb 5, 2019 at 4:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > Heikki and I had the opportunity to talk about this recently. We found > an easy way forward. I believe that the nbtsplitloc.c algorithm itself > is fine -- the code will need to be refactored, though. Attached v12 does not include this change, though I have every intention of doing the refactoring described for v13. The nbtsplitloc.c/split algorithm refactoring would necessitate revalidating the patch's performance, though, which didn't seem worth blocking on. Besides, there was bit rot that needed to be fixed. Notable improvements in v12: * No more papering-over regression test differences caused by pg_depend issues, thanks to recent work by Tom (today's commit 1d92a0c9). * I simplified the code added to _bt_binsrch() to deal with saving and restoring binary search bounds for _bt_check_unique()-caller insertions (this is from the first patch, "Refactor nbtree insertion scankeys"). I also improved matters within _bt_check_unique() itself: the early "break" there (based on reaching the known strict upper bound from the cached binary search) works in terms of the existing _bt_check_unique() loop invariant. This even allowed me to add a new assertion that makes sure that breaking out of the loop early is correct -- we call _bt_isequal() for the next item on assert-enabled builds when we break having reached the strict upper bound established by the initial binary search. In other words, _bt_check_unique() ends up doing the same number of _bt_isequal() calls as it did on the master branch, provided assertions are enabled. * I've restored regression test coverage that the patch previously inadvertently took away. Suffix truncation made deliberately-tall B-Tree indexes from the regression tests much shorter, making them fail to test the code paths they originally targeted. I needed to find ways to "defeat" suffix truncation so I still ended up with a fairly tall tree that hit various code paths. I think that we went from having 8 levels in btree_tall_idx (i.e. ridiculously many) to having only a single root page when I first caught the problem! Now btree_tall_idx only has 3 levels, which is all we really need. Even multi-level page deletion didn't have any coverage in previous versions. I used gcov to specifically verify that we have good multi-level page deletion coverage. I also used gcov to make sure that we have coverage of the v11 "cache rightmost block" optimization, since I noticed that that was missing (though present on the master branch) -- that's actually all that the btree_tall_idx tests in the patch, since multi-level page deletion is covered by a covering-indexes-era test. Finally, I made sure that we have coverage of fast root splits. In general, I preserved the original intent behind the existing tests, all of which I was fairly familiar with from previous projects. * I've added a new "relocate" bt_index_parent_check()/amcheck option, broken out in a separate commit. This new option makes verification relocate each and every leaf page tuple, starting from the root each time. This means that there will be at least one piece of code that specifically relies on "every tuple should have a unique key" from the start, which seems like a good idea. This enhancement to amcheck allows me to detect various forms of corruption that no other existing verification option would catch. 
In particular, I can catch various very subtle "cross-cousin inconsistencies" that require that we verify a page using its grandparent rather than its parent [1] (existing checks catch some but not all "cousin problem" corruption). Simply put, this amcheck enhancement allows me to detect corruption of the least significant byte in a key value in the root page -- that kind of corruption will cause index scans to miss only a small number of tuples at the leaf level. Maybe this scenario isn't realistic, but I'd rather not take any chances. * I rethought the "single value mode" fillfactor, which I've been suspicious of for a while now. It's now 96, down from 99. Micro-benchmarks involving concurrent sessions inserting into a low cardinality index led me to the conclusion that 99 was aggressively high. It was not that hard to get excessive page splits with these microbenchmarks, since insertions with monotonically increasing heap TIDs arrived a bit out of order with a lot of concurrency. 99 worked a bit better than 96 with only one session, but significantly worse with concurrent sessions. I still think that it's a good idea to be more aggressive than default leaf fillfactor, but reducing "single value mode" fillfactor to 90 (or whatever the user set general leaf fillfactor to) wouldn't be so bad. [1] http://subs.emis.de/LNI/Proceedings/Proceedings144/32.pdf -- Peter Geoghegan
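To illustrate the fillfactor discussion above in the simplest possible terms, a toy sketch (the constants and names are stand-ins; the real code obviously works in terms of actual page space accounting):

    /*
     * Toy illustration of "single value mode" being more aggressive than the
     * general leaf fillfactor when a leaf page is full of duplicates.
     */
    #define TOY_LEAF_FILLFACTOR       90   /* general leaf default */
    #define TOY_SINGLEVAL_FILLFACTOR  96   /* was 99 in earlier patch versions */

    static int
    left_page_target_bytes(int usablebytes, int singlevaluemode)
    {
        /*
         * Single value mode packs the left half more tightly than the general
         * leaf fillfactor (which governs rightmost page splits) would.
         */
        int fillfactor = singlevaluemode ? TOY_SINGLEVAL_FILLFACTOR
                                         : TOY_LEAF_FILLFACTOR;

        return usablebytes * fillfactor / 100;
    }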
Attachment
- v12-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v12-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v12-0004-Add-split-after-new-tuple-optimization.patch
- v12-0005-Add-high-key-continuescan-optimization.patch
- v12-0003-Pick-nbtree-split-points-discerningly.patch
- v12-0002-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v12-0001-Refactor-nbtree-insertion-scankeys.patch
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Mon, Feb 11, 2019 at 12:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > Notable improvements in v12: I've been benchmarking v12, once again using a slightly modified BenchmarkSQL that doesn't do up-front CREATE INDEX builds [1], since the problems with index bloat don't take so long to manifest themselves when the indexes are inserted into incrementally from the very beginning. This benchmarking process took over 20 hours, with a database that started off at about 90GB (700 TPC-C/BenchmarkSQL warehouses were used). That easily exceeded available main memory on my test server, which was 32GB. This is a pretty I/O bound workload, and a fairly write-heavy one at that. I used a Samsung 970 PRO 512GB, NVMe PCIe M.2 2280 SSD for both pg_wal and the default and only tablespace. Importantly, I figured out that I should disable both hash joins and merge joins with BenchmarkSQL, in order to force all joins to be nested loop joins. Otherwise, the "stock level" transaction eventually starts to use a hash join, even though that's about 10x slower than a nestloop join (~4ms vs. ~40ms on this machine) -- the hash join produces a lot of noise without really testing anything. It usually takes a couple of hours before we start to get obviously-bad plans, but it also usually takes about that long until the patch series starts to noticeably overtake the master branch. I don't think that TPC-C will ever benefit from using a hash join or a merge join, since it's supposed to be a pure OLTP benchmark, and is a benchmark that MySQL is known to do at least respectably-well on. This is the first benchmark I've published that was considerably I/O bound. There are significant improvements in performance across the board, on every measure, though it takes several hours for that to really show. The benchmark was not rate-limited. 16 clients/"terminals" are used throughout. There were 5 runs for master and 5 for patch, interlaced, each lasting 2 hours. Initialization occurred once, so it's expected that both databases will gradually get larger across runs. Summary (appears in same order as the execution of each run) -- each run is 2 hours, so 20 hours total excluding initial load time (2 hours * 5 runs for master + 2 hours * 5 runs for patch):

Run 1 -- master: Measured tpmTOTAL = 90063.79, Measured tpmC (NewOrders) = 39172.37
Run 1 -- patch: Measured tpmTOTAL = 90922.63, Measured tpmC (NewOrders) = 39530.2

Run 2 -- master: Measured tpmTOTAL = 77091.63, Measured tpmC (NewOrders) = 33530.66
Run 2 -- patch: Measured tpmTOTAL = 83905.48, Measured tpmC (NewOrders) = 36508.38 <-- 8.8% increase in tpmTOTAL/throughput

Run 3 -- master: Measured tpmTOTAL = 71224.25, Measured tpmC (NewOrders) = 30949.24
Run 3 -- patch: Measured tpmTOTAL = 78268.29, Measured tpmC (NewOrders) = 34021.98 <-- 9.8% increase in tpmTOTAL/throughput

Run 4 -- master: Measured tpmTOTAL = 71671.96, Measured tpmC (NewOrders) = 31163.29
Run 4 -- patch: Measured tpmTOTAL = 73097.42, Measured tpmC (NewOrders) = 31793.99

Run 5 -- master: Measured tpmTOTAL = 66503.38, Measured tpmC (NewOrders) = 28908.8
Run 5 -- patch: Measured tpmTOTAL = 71072.3, Measured tpmC (NewOrders) = 30885.56 <-- 6.9% increase in tpmTOTAL/throughput

There were *also* significant reductions in transaction latency for the patch -- see the full html reports in the provided tar archive for full details (URL provided below). The html reports have nice SVG graphs, generated by BenchmarkSQL using R -- one for transaction throughput, and another for transaction latency. 
The overall picture is that the patched version starts out ahead, and has a much more gradual decline as the database becomes larger and more bloated. Note also that the statistics collector stats show a *big* reduction in blocks read into shared_buffers for the duration of these runs. For example, here is what pg_stat_database shows for run 3 (I reset the stats between runs):

master: blks_read = 78,412,640, blks_hit = 4,022,619,556
patch: blks_read = 70,033,583, blks_hit = 4,505,308,517 <-- 10.7% reduction in blks_read/logical I/O

This suggests an indirect benefit, likely related to how buffers are evicted in each case. pg_stat_bgwriter indicates that more buffers are written out during checkpoints, while fewer are written out by backends. I won't speculate further on what all of this means right now, though. You can find the raw details for blks_read for each and every run in the full tar archive. It is available for download from: https://drive.google.com/file/d/1kN4fDmh1a9jtOj8URPrnGYAmuMPmcZax/view?usp=sharing There are also dumps of the other pg_stat* views at the end of each run, logs for each run, etc. There's more information than anybody else is likely to find interesting. If anyone needs help in recreating this benchmark, then I'd be happy to assist in that. There is a shell script (zsh) included in the tar archive, although that will need to be changed a bit to point to the correct installations and so on. Independent validation of the performance of the patch series on this and other benchmarks is very welcome. [1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I spent some time first trying to understand the current algorithm, and > then rewriting it in a way that I find easier to understand. I came up > with the attached. I think it optimizes for the same goals as your > patch, but the approach is quite different. Attached is v13 of the patch series, which significantly refactors nbtsplitloc.c to implement the algorithm using the approach from your prototype posted on January 28 -- I now take a "top down" approach that materializes all legal split points up-front, as opposed to the initial "bottom up" approach that used recursion, and weighed everything (balance of free space, suffix truncation, etc) all at once. Some of the code is directly lifted from your prototype, so there is now a question about whether or not you should be listed as a co-author. (I think that you should be credited as a secondary author of the nbtsplitloc.c patch, and as a secondary author in the release notes for the feature as a whole, which I imagine will be rolled into one item there.) I always knew that a "top down" approach would be simpler, but I underestimated how much better it would be overall, and how manageable the downsides are -- the added cycles are not actually noticeable when compared to the master branch, even during microbenchmarks. Thanks for suggesting this approach! I don't even need to simulate recursion with a loop or a goto; everything is structured as a linear series of steps now. There are still the same modes as before, though; the algorithm is essentially unchanged. All of my tests show that it's at least as effective as v12 was in terms of how effective the final _bt_findsplitloc() results are in reducing index size. The new approach will make more sophisticated suffix truncation costing much easier to implement in a future release, when suffix truncation is taught to truncate *within* individual datums/attributes (e.g. generate the text string "new m" given a split point between "new jersey" and "new york", by using some new opclass infrastructure). "Top down" also makes the implementation of the "split after new item" optimization safer and simpler -- we already have all split points conveniently available, so we can seek out an exact match instead of interpolating where it ought to appear later using a dynamic fillfactor. We can back out of the "split after new item" optimization in the event of the *precise* split point we want to use not being legal. That shouldn't be necessary, and isn't necessary in practice, but it seems like a good idea to be defensive with something so delicate as this. I'm using qsort() to sort the candidate split points array. I'm not trying to do something clever to avoid the up-front effort of sorting everything, even though we could probably get away with that much of the time (e.g. by doing a top-N sort in default mode). Testing has shown that using an inlined qsort() routine in the style of tuplesort.c would make my serial test cases/microbenchmarks faster, without adding much complexity. We're already competitive with the master branch even without this microoptimization, so I've put that off for now. What do you think of the idea of specializing an inlineable qsort() for nbtsplitloc.c? Performance is at least as good as v12 with a more relevant workload, such as BenchmarkSQL. Transaction throughput is 5% - 10% greater in my most recent tests (benchmarks for v13 specifically). 
Unlike in your prototype, v13 makes the array for holding candidate split points into a single big allocation that is always exactly BLCKSZ. The idea is that palloc() can thereby recycle the big _bt_findsplitloc() allocation within _bt_split(). palloc() considers 8KiB to be the upper limit on the size of individual blocks it manages, and we'll always go on to palloc(BLCKSZ) through the _bt_split() call to PageGetTempPage(). In a sense, we're not even allocating memory that we weren't allocating already. (Not sure that this really matters, but it is easy to do it that way.) Other changes from your prototype: * I found a more efficient representation than a pair of raw IndexTuple pointers for each candidate split. Actually, I use the same old representation (firstoldonright + newitemonleft) in each split, and provide routines to work backwards from that to get the lastleft and firstright tuples. This approach is far more space efficient, and space efficiency matters when you're allocating space for hundreds of items in a critical path like this. * You seemed to refactor _bt_checksplitloc() in passing, making it not do the newitemisfirstonright thing. I changed that back. Did I miss something that you intended here? * Fixed a bug in the loop that adds split points. Your refactoring made the main loop responsible for new item space handling, as just mentioned, but it didn't create a split where the new item is first on the page, and the split puts the new item on the left page on its own, and all existing items on the new right page. I couldn't prove that this caused failures to find a legal split, but it still seemed like a bug. In general, I think that we should generate our initial list of split points in exactly the same manner as we do so already. The only difference as far as split legality/feasibility goes is that we pessimistically assume that suffix truncation will have to add a heap TID in all cases. I don't see any advantage to going further than that. Changes to my own code since v12: * Simplified "Add "split after new tuple" optimization" commit, and made it more consistent with associated code. This is something that was made a lot easier by the new approach. It would be great to hear what you think of this. * Removed a subtly wrong assertion in nbtpage.c, concerning VACUUM's page deletion. Even a page that is about to be deleted can be filled up again and split when we release and reacquire a lock, however unlikely that may be. * Rename _bt_checksplitloc() to _bt_recordsplit(). I think that it makes more sense to make that about recording a split point, rather than about making sure a split point is legal. It still does that, but perhaps 99%+ of calls to _bt_recordsplit()/_bt_checksplitloc() result in the split being deemed legal, so the new name makes much more sense. -- Peter Geoghegan
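To show the space-efficient representation idea in isolation, here is a sketch with illustrative names (the real code works against the page's item array and the incoming new tuple; these are not the actual nbtsplitloc.c types or signatures):

    #include <stdbool.h>

    /* each candidate split is just an offset plus a flag */
    typedef struct SplitSketch
    {
        int  firstoldonright;  /* first pre-existing item to go on the right page */
        bool newitemonleft;    /* does the incoming new item go on the left half? */
    } SplitSketch;

    /*
     * Work backwards from the compact representation to the tuples that
     * enclose the split point.  "items" stands in for the existing page items
     * (indexed by offset), and newitemoff is where the new item would go.
     */
    static void
    split_boundary_tuples(const SplitSketch *split,
                          const void *const *items, int newitemoff,
                          const void *newitem,
                          const void **lastleft, const void **firstright)
    {
        /* last tuple on the left half */
        if (split->newitemonleft && newitemoff == split->firstoldonright)
            *lastleft = newitem;
        else
            *lastleft = items[split->firstoldonright - 1];

        /* first tuple on the right half */
        if (!split->newitemonleft && newitemoff == split->firstoldonright)
            *firstright = newitem;
        else
            *firstright = items[split->firstoldonright];
    }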
Attachment
- v13-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v13-0004-Add-split-after-new-tuple-optimization.patch
- v13-0005-Add-high-key-continuescan-optimization.patch
- v13-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v13-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v13-0001-Refactor-nbtree-insertion-scankeys.patch
- v13-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Heikki Linnakangas
Date:
On 26/02/2019 12:31, Peter Geoghegan wrote: > On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> I spent some time first trying to understand the current algorithm, and >> then rewriting it in a way that I find easier to understand. I came up >> with the attached. I think it optimizes for the same goals as your >> patch, but the approach is quite different. > > Attached is v13 of the patch series, which significantly refactors > nbtsplitloc.c to implement the algorithm using the approach from your > prototype posted on January 28 -- I now take a "top down" approach > that materializes all legal split points up-front, as opposed to the > initial "bottom up" approach that used recursion, and weighed > everything (balance of free space, suffix truncation, etc) all at > once. Great, looks much simpler now, indeed! Now I finally understand the algorithm. > I'm using qsort() to sort the candidate split points array. I'm not > trying to do something clever to avoid the up-front effort of sorting > everything, even though we could probably get away with that much of > the time (e.g. by doing a top-N sort in default mode). Testing has > shown that using an inlined qsort() routine in the style of > tuplesort.c would make my serial test cases/microbenchmarks faster, > without adding much complexity. We're already competitive with the > master branch even without this microoptimization, so I've put that > off for now. What do you think of the idea of specializing an > inlineable qsort() for nbtsplitloc.c? If the performance is acceptable without it, let's not bother. We can optimize later. What would be the worst case scenario for this? Splitting a page that has as many tuples as possible, I guess, so maybe inserting into a table with a single-column index, with 32k BLCKSZ. Have you done performance testing on something like that? > Unlike in your prototype, v13 makes the array for holding candidate > split points into a single big allocation that is always exactly > BLCKSZ. The idea is that palloc() can thereby recycle the big > _bt_findsplitloc() allocation within _bt_split(). palloc() considers > 8KiB to be the upper limit on the size of individual blocks it > manages, and we'll always go on to palloc(BLCKSZ) through the > _bt_split() call to PageGetTempPage(). In a sense, we're not even > allocating memory that we weren't allocating already. (Not sure that > this really matters, but it is easy to do it that way.) Rounding up the allocation to BLCKSZ seems like a premature optimization. Even if it saved some cycles, I don't think it's worth the trouble of having to explain all that in the comment. I think you could change the curdelta, leftfree, and rightfree fields in SplitPoint to int16, to make the array smaller. > Other changes from your prototype: > > * I found a more efficient representation than a pair of raw > IndexTuple pointers for each candidate split. Actually, I use the same > old representation (firstoldonright + newitemonleft) in each split, > and provide routines to work backwards from that to get the lastleft > and firstright tuples. This approach is far more space efficient, and > space efficiency matters when you've allocating space for hundreds of > items in a critical path like this. Ok. > * You seemed to refactor _bt_checksplitloc() in passing, making it not > do the newitemisfirstonright thing. I changed that back. Did I miss > something that you intended here? 
My patch treated the new item the same as all the old items, in _bt_checksplitloc(), so it didn't need newitemisonright. You still need it with your approach. > Changes to my own code since v12: > > * Simplified "Add "split after new tuple" optimization" commit, and > made it more consistent with associated code. This is something that > was made a lot easier by the new approach. It would be great to hear > what you think of this. I looked at it very briefly. Yeah, it's pretty simple now. Nice! About this comment on _bt_findsplitloc(): >/* > * _bt_findsplitloc() -- find an appropriate place to split a page. > * > * The main goal here is to equalize the free space that will be on each > * split page, *after accounting for the inserted tuple*. (If we fail to > * account for it, we might find ourselves with too little room on the page > * that it needs to go into!) > * > * If the page is the rightmost page on its level, we instead try to arrange > * to leave the left split page fillfactor% full. In this way, when we are > * inserting successively increasing keys (consider sequences, timestamps, > * etc) we will end up with a tree whose pages are about fillfactor% full, > * instead of the 50% full result that we'd get without this special case. > * This is the same as nbtsort.c produces for a newly-created tree. Note > * that leaf and nonleaf pages use different fillfactors. > * > * We are passed the intended insert position of the new tuple, expressed as > * the offsetnumber of the tuple it must go in front of (this could be > * maxoff+1 if the tuple is to go at the end). The new tuple itself is also > * passed, since it's needed to give some weight to how effective suffix > * truncation will be. The implementation picks the split point that > * maximizes the effectiveness of suffix truncation from a small list of > * alternative candidate split points that leave each side of the split with > * about the same share of free space. Suffix truncation is secondary to > * equalizing free space, except in cases with large numbers of duplicates. > * Note that it is always assumed that caller goes on to perform truncation, > * even with pg_upgrade'd indexes where that isn't actually the case > * (!heapkeyspace indexes). See nbtree/README for more information about > * suffix truncation. > * > * We return the index of the first existing tuple that should go on the > * righthand page, plus a boolean indicating whether the new tuple goes on > * the left or right page. The bool is necessary to disambiguate the case > * where firstright == newitemoff. > * > * The high key for the left page is formed using the first item on the > * right page, which may seem to be contrary to Lehman & Yao's approach of > * using the left page's last item as its new high key on the leaf level. > * It isn't, though: suffix truncation will leave the left page's high key > * fully equal to the last item on the left page when two tuples with equal > * key values (excluding heap TID) enclose the split point. It isn't > * necessary for a new leaf high key to be equal to the last item on the > * left for the L&Y "subtree" invariant to hold. It's sufficient to make > * sure that the new leaf high key is strictly less than the first item on > * the right leaf page, and greater than the last item on the left page. 
> * When suffix truncation isn't possible, L&Y's exact approach to leaf > * splits is taken (actually, a tuple with all the keys from firstright but > * the heap TID from lastleft is formed, so as to not introduce a special > * case). > * > * Starting with the first right item minimizes the divergence between leaf > * and internal splits when checking if a candidate split point is legal. > * It is also inherently necessary for suffix truncation, since truncation > * is a subtractive process that specifically requires lastleft and > * firstright inputs. > */ This is pretty good, but I think some copy-editing can make this even better. If you look at the old comment, it had this structure: 1. Explain what the function does. 2. Explain the arguments 3. Explain the return value. The additions to this comment broke the structure. The explanations of argument and return value are now in the middle, in 3rd and 4th paragraphs. And the 3rd paragraph that explains the arguments, now also goes into detail on what the function does with it. I'd suggest moving things around to restore the old structure, that was more clear. The explanation of how the high key for the left page is formed (5th paragraph), seems out-of-place here, because the high key is not formed here. Somewhere in the 1st or 2nd paragraph, perhaps we should mention that the function effectively uses a different fillfactor in some other scenarios too, not only when it's the rightmost page. In the function itself: > * maxsplits should never exceed maxoff because there will be at most as > * many candidate split points as there are points _between_ tuples, once > * you imagine that the new item is already on the original page (the > * final number of splits may be slightly lower because not all splits > * will be legal). Even still, add space for an extra two splits out of > * sheer paranoia. > */ > state.maxsplits = maxoff + 2; > state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits)); > state.nsplits = 0; I wouldn't be that paranoid. The code that populates the array is pretty straightforward. > /* > * Scan through the data items and calculate space usage for a split at > * each possible position. We start at the first data offset rather than > * the second data offset to handle the "newitemoff == first data offset" > * case (otherwise, a split whose firstoldonright is the first data offset > * can't be legal, and won't actually end up being recorded by > * _bt_recordsplit). > * > * Still, it's typical for almost all calls to _bt_recordsplit to > * determine that the split is legal, and therefore enter it into the > * candidate split point array for later consideration. > */ Suggestion: Remove the "Still" word. The observation that typically all split points are legal is valid, but it seems unrelated to the first paragraph. (Do we need to mention it at all?) > /* > * If the new item goes as the last item, record the split point that > * leaves all the old items on the left page, and the new item on the > * right page. This is required because a split that leaves the new item > * as the firstoldonright won't have been reached within the loop. We > * always record every possible split point. > */ Suggestion: Remove the last sentence. An earlier comment already said that we calculate space usage for a split at each possible position, that seems sufficient. Like it was before this patch. > /* > * Find lowest possible penalty among split points currently regarded as > * acceptable -- the "perfect" penalty. 
The perfect penalty often saves > * _bt_bestsplitloc() additional work around calculating penalties. This > * is also a convenient point to determine if default mode worked out, or > * if we should instead reassess which split points should be considered > * acceptable (split interval, and possibly fillfactormult). > */ > perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff, > newitem, &secondmode); ISTM that figuring out which "mode" we want to operate in is actually the *primary* purpose of _bt_perfect_penalty(). We only really use the penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd suggest changing the function name to something like _bt_choose_mode(), and have secondmode be the primary return value from it, with perfectpenalty being the secondary result through a pointer. It doesn't really choose the mode, either, though. At least after the next "Add split after new tuple optimization" patch. The caller has a big part in choosing what to do. So maybe split _bt_perfect_penalty into two functions: _bt_perfect_penalty(), which just computes the perfect penalty, and _bt_analyze_split_interval(), which determines how many duplicates there are in the top-N split points. BTW, I like the word "strategy", like you called it in the comment on SplitPoint struct, better than "mode". > + if (usemult) > + delta = fillfactormult * split->leftfree - > + (1.0 - fillfactormult) * split->rightfree; > + else > + delta = split->leftfree - split->rightfree; > How about removing the "usemult" variable, and just check if fillfactormult == 0.5? > /* > * There are a much smaller number of candidate split points when > * splitting an internal page, so we can afford to be exhaustive. Only > * give up when pivot that will be inserted into parent is as small as > * possible. > */ > if (!state->is_leaf) > return MAXALIGN(sizeof(IndexTupleData) + 1); Why are there fewer candidate split points on an internal page? - Heikki
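For what it's worth, a tiny standalone illustration of that last suggestion (split_delta, leftfree and rightfree here are stand-in names, not the patch's code): with fillfactormult == 0.5 the weighted delta is exactly half of the plain leftfree - rightfree difference, so both forms rank candidate split points identically.

#include <stdio.h>

/* Weighted free-space delta, in the shape of the snippet quoted above */
static double
split_delta(double leftfree, double rightfree, double fillfactormult)
{
	return fillfactormult * leftfree - (1.0 - fillfactormult) * rightfree;
}

int
main(void)
{
	double		leftfree = 4096.0;
	double		rightfree = 3072.0;

	/* 0.5 * leftfree - 0.5 * rightfree == 0.5 * (leftfree - rightfree) */
	printf("weighted (0.5): %.1f\n", split_delta(leftfree, rightfree, 0.5));
	printf("plain:          %.1f\n", leftfree - rightfree);
	return 0;
}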
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
Some comments on v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. Mostly about code comments. In general, I think a round of copy-editing the comments, to use simpler language, would do good. The actual code changes look good to me. > /* > * _bt_findinsertloc() -- Finds an insert location for a tuple > * > * On entry, *bufptr contains the page that the new tuple unambiguously > * belongs on. This may not be quite right for callers that just called > * _bt_check_unique(), though, since they won't have initially searched > * using a scantid. They'll have to insert into a page somewhere to the > * right in rare cases where there are many physical duplicates in a > * unique index, and their scantid directs us to some page full of > * duplicates to the right, where the new tuple must go. (Actually, > * since !heapkeyspace pg_upgraded'd non-unique indexes never get a > * scantid, they too may require that we move right. We treat them > * somewhat like unique indexes.) Seems confusing to first say assertively that "*bufptr contains the page that the new tuple unambiguously belongs to", and then immediately go on to list a whole bunch of exceptions. Maybe just remove "unambiguously". > @@ -759,7 +787,10 @@ _bt_findinsertloc(Relation rel, > * If this page was incompletely split, finish the split now. We > * do this while holding a lock on the left sibling, which is not > * good because finishing the split could be a fairly lengthy > - * operation. But this should happen very seldom. > + * operation. But this should only happen when inserting into a > + * unique index that has more than an entire page for duplicates > + * of the value being inserted. (!heapkeyspace non-unique indexes > + * are an exception, once again.) > */ > if (P_INCOMPLETE_SPLIT(lpageop)) > { This happens very seldom, because you only get an incomplete split if you crash in the middle of a page split, and that should be very rare. I don't think we need to list more fine-grained conditions here, that just confuses the reader. > /* > * _bt_useduplicatepage() -- Settle for this page of duplicates? > * > * Prior to PostgreSQL 12/Btree version 4, heap TID was never treated > * as a part of the keyspace. If there were many tuples of the same > * value spanning more than one leaf page, a new tuple of that same > * value could legally be placed on any one of the pages. > * > * This function handles the question of whether or not an insertion > * of a duplicate into a pg_upgrade'd !heapkeyspace index should > * insert on the page contained in buf when a choice must be made. > * Preemptive microvacuuming is performed here when that could allow > * caller to insert on to the page in buf. > * > * Returns true if caller should proceed with insert on buf's page. > * Otherwise, caller should move on to the page to the right (caller > * must always be able to still move right following call here). > */ So, this function is only used for legacy pg_upgraded indexes. The comment implies that, but doesn't actually say it. > /* > * Get tie-breaker heap TID attribute, if any. Macro works with both pivot > * and non-pivot tuples, despite differences in how heap TID is represented. > * > * Assumes that any tuple without INDEX_ALT_TID_MASK set has a t_tid that > * points to the heap, and that all pivot tuples have INDEX_ALT_TID_MASK set > * (since all pivot tuples must as of BTREE_VERSION 4). 
When non-pivot > * tuples use the INDEX_ALT_TID_MASK representation in the future, they'll > * probably also contain a heap TID at the end of the tuple. We currently > * assume that a tuple with INDEX_ALT_TID_MASK set is a pivot tuple within > * heapkeyspace indexes (and that a tuple without it set must be a non-pivot > * tuple), but it might also be used by non-pivot tuples in the future. > * pg_upgrade'd !heapkeyspace indexes only set INDEX_ALT_TID_MASK in pivot > * tuples that actually originated with the truncation of one or more > * attributes. > */ > #define BTreeTupleGetHeapTID(itup) ... The comment claims that "all pivot tuples must be as of BTREE_VERSION 4". I thought that all internal tuples are called pivot tuples, even on version 3. I think what this means to say is that this macro is only used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only have a heap TID in BTREE_VERSION 4 indexes. This macro (and many others in nbtree.h) is quite complicated. A static inline function might be easier to read. > @@ -1114,6 +1151,8 @@ _bt_insertonpg(Relation rel, > > if (BufferIsValid(metabuf)) > { > + Assert(metad->btm_version >= BTREE_NOVAC_VERSION); > + xlmeta.version = metad->btm_root; > xlmeta.root = metad->btm_root; > xlmeta.level = metad->btm_level; > xlmeta.fastroot = metad->btm_fastroot; 'xlmeta.version' is set incorrectly. > /* > * Btree version 4 (used by indexes initialized by PostgreSQL v12) made > * general changes to the on-disk representation to add support for > * heapkeyspace semantics, necessitating a REINDEX to get heapkeyspace > * semantics in pg_upgrade scenarios. We continue to offer support for > * BTREE_MIN_VERSION in order to support upgrades from PostgreSQL versions > * up to and including v10 to v12+ without requiring a REINDEX. > * Similarly, we continue to offer support for BTREE_NOVAC_VERSION to > * support upgrades from v11 to v12+ without requiring a REINDEX. > * > * We maintain PostgreSQL v11's ability to upgrade from BTREE_MIN_VERSION > * to BTREE_NOVAC_VERSION automatically. v11's "no vacuuming" enhancement > * (the ability to skip full index scans during vacuuming) only requires > * two new metapage fields, which makes it possible to upgrade at any > * point that the metapage must be updated anyway (e.g. during a root page > * split). Note also that there happened to be no changes in metapage > * layout for btree version 4. All current metapage fields should have > * valid values set when a metapage WAL record is replayed. > * > * It's convenient to consider the "no vacuuming" enhancement (metapage > * layout compatibility) separately from heapkeyspace semantics, since > * each issue affects different areas. This is just a convention; in > * practice a heapkeyspace index is always also a "no vacuuming" index. > */ > #define BTREE_METAPAGE 0 /* first page is meta */ > #define BTREE_MAGIC 0x053162 /* magic number of btree pages */ > #define BTREE_VERSION 4 /* current version number */ > #define BTREE_MIN_VERSION 2 /* minimal supported version number */ > #define BTREE_NOVAC_VERSION 3 /* minimal version with all meta fields */ I find this comment difficult to read. I suggest rewriting it to: /* * The current Btree version is 4. That's what you'll get when you create * a new index. * * Btree version 3 was used in PostgreSQL v11. It is mostly the same as * version 4, but heap TIDs were not part of the keyspace. Index tuples * with duplicate keys could be stored in any order. 
We continue to * support reading and writing Btree version 3, so that they don't need * to be immediately re-indexed at pg_upgrade. In order to get the new * heapkeyspace semantics, however, a REINDEX is needed. * * Btree version 2 is the same as version 3, except for two new fields * in the metapage that were introduced in version 3. A version 2 metapage * will be automatically upgraded to version 3 on the first insert to it. */ Now that the index tuple format becomes more complicated, I feel that there should be some kind of an overview explaining the format. All the information is there, in the comments in nbtree.h, but you have to piece together all the details to get the overall picture. I wrote this to keep my head straight:

B-tree tuple format
===================

Leaf tuples
-----------

t_tid | t_info | key values | INCLUDE columns if any

t_tid points to the heap TID.

Pivot tuples
------------

All tuples on internal pages are pivot tuples. Also, the high keys on leaf pages.

t_tid | t_info | key values | [heap TID]

The INDEX_ALT_TID_MASK bit in t_info is set. The block number in 't_tid' points to the lower B-tree page. The lower bits in 't_tid.ip_posid' store the number of keys stored (it can be less than the number of keys in the index, if some keys have been suffix-truncated). If the BT_HEAP_TID_ATTR flag is set, there's an extra heap TID field after the key datums. (In version 3 indexes, the INDEX_ALT_TID_MASK flag might not be set. In that case, the number of keys is implicitly the same as the number of keys in the index, and there is no heap TID.)

I think adding something like this in nbtree.h would be good.

- Heikki
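As a rough, self-contained illustration of the pivot-tuple encoding sketched above (the mask names and bit layout here are hypothetical stand-ins, not the actual nbtree.h macros), the idea is to reuse a spare 16-bit field to carry both the number of untruncated key attributes and a flag saying whether a heap TID is appended at the end of the tuple:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_NKEYATTS_MASK	0x0FFF	/* low bits: number of key attributes */
#define SKETCH_HEAP_TID_FLAG	0x1000	/* flag bit: heap TID stored at end */

static uint16_t
pack_pivot_status(uint16_t nkeyatts, bool has_heap_tid)
{
	uint16_t	status = nkeyatts & SKETCH_NKEYATTS_MASK;

	if (has_heap_tid)
		status |= SKETCH_HEAP_TID_FLAG;
	return status;
}

int
main(void)
{
	/* a pivot truncated down to 2 key attributes, plus a heap TID */
	uint16_t	status = pack_pivot_status(2, true);

	printf("nkeyatts = %u, heap TID present = %s\n",
		   (unsigned) (status & SKETCH_NKEYATTS_MASK),
		   (status & SKETCH_HEAP_TID_FLAG) ? "yes" : "no");
	return 0;
}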
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Mar 3, 2019 at 5:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Great, looks much simpler now, indeed! Now I finally understand the > algorithm. Glad to hear it. Thanks for the additional review! Attached is v14, which has changes based on your feedback. This includes changes based on your more recent feedback on v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch, though I'll respond to those points directly in a later email. v14 also changes the logic that decides whether an alternative strategy should be used: it now looks at the leftmost and rightmost splits for the entire page, rather than accessing the page directly. We always handle the newitem-at-end edge case correctly now, since the new "top down" approach has all legal splits close at hand. This is more elegant, more obviously correct, and even more effective, at least in some cases -- it's another example of why "top down" is the superior approach for nbtsplitloc.c. This made my "UK land registry data" index have about 2.5% fewer leaf pages than with v13, which is small but significant. Separately, I made most of the new nbtsplitloc.c functions use a FindSplitData argument in v14, which simplifies their signatures quite a bit. > What would be the worst case scenario for this? Splitting a page that > has as many tuples as possible, I guess, so maybe inserting into a table > with a single-column index, with 32k BLCKSZ. Have you done performance > testing on something like that? I'll test that (added to my project TODO list), though it's not obvious that that's the worst case. Page splits will be less frequent, and have better choices about where to split. > Rounding up the allocation to BLCKSZ seems like a premature > optimization. Even if it saved some cycles, I don't think it's worth the > trouble of having to explain all that in the comment. Removed that optimization. > I think you could change the curdelta, leftfree, and rightfree fields in > SplitPoint to int16, to make the array smaller. Added this alternative optimization to replace the BLCKSZ allocation thing. I even found a way to get the array elements down to 8 bytes, but that made the code noticeably slower with "many duplicates" splits, so I didn't end up doing that (I used bitfields, plus the same pragmas that we use to make sure that item pointers are packed). > > * You seemed to refactor _bt_checksplitloc() in passing, making it not > > do the newitemisfirstonright thing. I changed that back. Did I miss > > something that you intended here? > > My patch treated the new item the same as all the old items, in > _bt_checksplitloc(), so it didn't need newitemisonright. You still need > it with your approach. I would feel better about it if we stuck to the same method for calculating if a split point is legal as before (the only difference being that we pessimistically add heap TID overhead to new high key on leaf level). That seems safer to me. > > Changes to my own code since v12: > > > > * Simplified "Add "split after new tuple" optimization" commit, and > > made it more consistent with associated code. This is something that > > was made a lot easier by the new approach. It would be great to hear > > what you think of this. > > I looked at it very briefly. Yeah, it's pretty simple now. Nice! I can understand why it might be difficult to express an opinion on the heuristics themselves. The specific cut-off points (e.g.
details of what "heap TID adjacency" actually means) are not that easy to defend with a theoretical justification, though they have been carefully tested. I think it's worth comparing the "split after new tuple" optimization to the traditional leaf fillfactor of 90, which is a very similar situation. Why should it be 90? Why not 85, or 95? Why is it okay to assume that the rightmost page shouldn't be split 50/50? The answers to all of these questions about the well established idea of a leaf fillfactor boil down to this: it's very likely to be correct on average, and when it isn't correct the problem is self-limiting, and has an infinitesimally small chance of continually recurring (unless you imagine an *adversarial* case). Similarly, it doesn't matter if these new heuristics get it wrong once every 1000 splits (a very pessimistic estimate), because even then those will cancel each other out in the long run. It is necessary to take a holistic view of things. We're talking about an optimization that makes the two largest TPC-C indexes over 40% smaller -- I can hold my nose if I must in order to get that benefit. There were also a couple of indexes in the real-world mouse genome database that this made much smaller, so this will clearly help in the real world. Besides all this, the "split after new tuple" optimization fixes an existing worst case, rather than being an optimization, at least in my mind. It's not supposed to be possible to have leaf pages that are all only 50% full without any deletes, and yet we allow it to happen in this one weird case. Even completely random insertions result in 65% - 70% average space utilization, so the existing worst case really is exceptional. We are forced to take a holistic view, and infer something about the pattern of insertions over time, even though holistic is a dirty word. > > (New header comment block over _bt_findsplitloc()) > > This is pretty good, but I think some copy-editing can make this even > better I've restored the old structure of the _bt_findsplitloc() header comments. > The explanation of how the high key for the left page is formed (5th > paragraph), seems out-of-place here, because the high key is not formed > here. Moved that to _bt_split(), which seems like a good compromise. > Somewhere in the 1st or 2nd paragraph, perhaps we should mention that > the function effectively uses a different fillfactor in some other > scenarios too, not only when it's the rightmost page. Done. > > state.maxsplits = maxoff + 2; > > state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits)); > > state.nsplits = 0; > > I wouldn't be that paranoid. The code that populates the array is pretty > straightforward. Done that way. But are you sure? Some of the attempts to create a new split point are bound to fail, because they try to put everything (including new item) on one side of the split. I'll leave the assertion there. > > * Still, it's typical for almost all calls to _bt_recordsplit to > > * determine that the split is legal, and therefore enter it into the > > * candidate split point array for later consideration. > > */ > > Suggestion: Remove the "Still" word. The observation that typically all > split points are legal is valid, but it seems unrelated to the first > paragraph. (Do we need to mention it at all?) Removed the second paragraph entirely. > > /* > > * If the new item goes as the last item, record the split point that > > * leaves all the old items on the left page, and the new item on the > > * right page.
This is required because a split that leaves the new item > > * as the firstoldonright won't have been reached within the loop. We > > * always record every possible split point. > > */ > > Suggestion: Remove the last sentence. Agreed. Removed. > ISTM that figuring out which "mode" we want to operate in is actually > the *primary* purpose of _bt_perfect_penalty(). We only really use the > penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd > suggest changing the function name to something like _bt_choose_mode(), > and have secondmode be the primary return value from it, with > perfectpenalty being the secondary result through a pointer. I renamed _bt_perfect_penalty() to _bt_strategy(), since I agree that its primary purpose is to decide on a strategy (which is what I'm now calling a mode, per your request a bit further down). It still returns perfectpenalty, though, since that seemed more natural to me, probably because its style matches the style of caller/_bt_findsplitloc(). perfectpenalty isn't a mere optimization -- it is important to prevent many duplicates mode from going overboard with suffix truncation. It does more than just save _bt_bestsplitloc() cycles, which I've tried to make clearer in v14. > It doesn't really choose the mode, either, though. At least after the > next "Add split after new tuple optimization" patch. The caller has a > big part in choosing what to do. So maybe split _bt_perfect_penalty into > two functions: _bt_perfect_penalty(), which just computes the perfect > penalty, and _bt_analyze_split_interval(), which determines how many > duplicates there are in the top-N split points. Hmm. I didn't create a _bt_analyze_split_interval(), because now _bt_perfect_penalty()/_bt_strategy() is responsible for setting the perfect penalty in all cases. It was a mistake for me to move some perfect penalty stuff for alternative modes/strategies out to the caller in v13. In v14, we never make _bt_findsplitloc() change its perfect penalty -- it only changes its split interval, based on the strategy/mode, possibly after sorting. Let me know what you think of this. > BTW, I like the word "strategy", like you called it in the comment on > SplitPoint struct, better than "mode". I've adopted that terminology in v14 -- it's always "strategy", never "mode". > How about removing the "usemult" variable, and just check if > fillfactormult == 0.5? I need to use "usemult" to determine if the "split after new tuple" optimization should apply leaf fillfactor, or should instead split at the exact point after the newly inserted tuple -- it's very natural to have a single bool flag for that. It's seems simpler to continue to use "usemult" for everything, and not distinguish "split after new tuple" as a special case later on. (Besides, the master branch already uses a bool for this, even though it handles far fewer things.) > > /* > > * There are a much smaller number of candidate split points when > > * splitting an internal page, so we can afford to be exhaustive. Only > > * give up when pivot that will be inserted into parent is as small as > > * possible. > > */ > > if (!state->is_leaf) > > return MAXALIGN(sizeof(IndexTupleData) + 1); > > Why are there fewer candidate split points on an internal page? The comment should say that there is typically a much smaller split interval (this used to be controlled by limiting the size of the array initially -- should have updated this for v13 of the patch). 
I believe that you understand that, and are interested in why the split interval itself is different on internal pages. Or why we are more conservative with internal pages in general. Assuming that's what you meant, here is my answer: The "Prefix B-Tree" paper establishes the idea that there are different split intervals for leaf pages and internal pages (which it calls branch pages). We care about different things in each case. With leaf pages, we care about choosing the split point that allows us to create the smallest possible pivot tuple as our secondary goal (primary goal is balancing space). With internal pages, we care about choosing the smallest tuple to insert into the parent of the internal page (often the root) as our secondary goal, but don't care about truncation, because _bt_split() won't truncate the new pivot. That's why the definition of "penalty" varies according to whether we're splitting an internal page or a leaf page. Clearly the idea of having separate split intervals is well established, and makes sense. It's fair to ask if I'm being too conservative (or not conservative enough) with the split interval in either case. Unfortunately, the Prefix B-Tree paper never seems to give practical advice about how to come up with an interval. They say: "We have not analyzed the influence of sigma L [leaf interval] or sigma B [branch/internal interval] on the performance of the trees. We expect such an analysis to be quite involved and difficult. We are quite confident, however, that small split intervals improve performance considerably. Sets of keys that arise in practical applications are often far from random, and clusters of similar keys differing only in the last few letters (e.g. plural forms) are quite common." I am aware of another, not-very-notable paper that tries to impose some theory here, but doesn't really help much [1]. Anyway, I've found that I was too conservative with split interval for internal pages. It pays to make the internal interval higher than the leaf interval, because internal pages cover a much bigger portion of the key space than leaf pages, which will tend to get filled up one way or another. Leaf pages cover a tight part of the key space, in contrast. In v14, I've increased the internal page split interval to 18, a big increase from 3, and twice what it is for leaf splits (still 9 -- no change there). This mostly isn't that different from 3, since there usually are pivot tuples that are all the same size anyway. However, with cases where suffix truncation makes pivot tuples a lot smaller (e.g. UK land registry test case), this makes the items in the root a lot smaller on average, and even makes the whole index smaller. My entire test suite has a few cases that are noticeably improved by this change, and no cases that are hurt. I'm going to have to revalidate the performance of long-running benchmarks with this change, so this should be considered provisional. I think that it will probably be kept, though. Not expecting it to noticeably impact either BenchmarkSQL or pgbench benchmarks. [1] https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1 -- Peter Geoghegan
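A self-contained sketch of the split-interval idea described above (this is not nbtsplitloc.c; the struct and function names are stand-ins): candidate split points are assumed to be ordered so that the most space-balanced ones come first, and within the first interval of them we pick the lowest penalty (for a leaf split, roughly the size of the pivot tuple that suffix truncation would leave behind), stopping early once the "perfect" penalty is reached.

#include <stdio.h>

typedef struct SplitCandidate
{
	int			firstright;		/* offset of first tuple that goes right */
	int			penalty;		/* e.g. size of the would-be pivot tuple */
} SplitCandidate;

/*
 * Choose among the first 'interval' candidates (already ordered by how
 * evenly they balance free space).  Returns the array index of the winner.
 */
static int
choose_split(const SplitCandidate *cand, int ncand,
			 int interval, int perfectpenalty)
{
	int			best = 0;
	int			i;

	if (interval > ncand)
		interval = ncand;
	for (i = 1; i < interval; i++)
	{
		if (cand[best].penalty <= perfectpenalty)
			break;				/* can't do any better than "perfect" */
		if (cand[i].penalty < cand[best].penalty)
			best = i;
	}
	return best;
}

int
main(void)
{
	/* earlier candidates balance space best; a later one truncates better */
	SplitCandidate page[] = {{50, 32}, {51, 16}, {52, 8}, {53, 4}};
	int			best = choose_split(page, 4, 3, 8);

	printf("chosen firstright = %d (penalty %d)\n",
		   page[best].firstright, page[best].penalty);
	return 0;
}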
Attachment
- v14-0001-Refactor-nbtree-insertion-scankeys.patch
- v14-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v14-0004-Add-split-after-new-tuple-optimization.patch
- v14-0005-Add-high-key-continuescan-optimization.patch
- v14-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch
- v14-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v14-0007-DEBUG-Add-pageinspect-instrumentation.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Some comments on > v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below. > Mostly about code comments. In general, I think a round of copy-editing > the comments, to use simpler language, would do good. The actual code > changes look good to me. I'm delighted that the code looks good to you, and makes sense overall. I worked hard to make the patch a natural adjunct to the existing code, which wasn't easy. > Seems confusing to first say assertively that "*bufptr contains the page > that the new tuple unambiguously belongs to", and then immediately go on > to list a whole bunch of exceptions. Maybe just remove "unambiguously". This is fixed in v14 of the patch series. > This happens very seldom, because you only get an incomplete split if > you crash in the middle of a page split, and that should be very rare. I > don't think we need to list more fine-grained conditions here, that just > confuses the reader. Fixed in v14. > > /* > > * _bt_useduplicatepage() -- Settle for this page of duplicates? > So, this function is only used for legacy pg_upgraded indexes. The > comment implies that, but doesn't actually say it. I made that more explicit in v14. > > /* > > * Get tie-breaker heap TID attribute, if any. Macro works with both pivot > > * and non-pivot tuples, despite differences in how heap TID is represented. > > #define BTreeTupleGetHeapTID(itup) ... I fixed up the comments above BTreeTupleGetHeapTID() significantly. > The comment claims that "all pivot tuples must be as of BTREE_VERSION > 4". I thought that all internal tuples are called pivot tuples, even on > version 3. In my mind, "pivot tuple" is a term that describes any tuple that contains a separator key, which could apply to any nbtree version. It's useful to have a distinct term (to not just say "separator key tuple") because Lehman and Yao think of separator keys as separate and distinct from downlinks. Internal page splits actually split *between* a separator key and a downlink. So nbtree internal page splits must split "inside a pivot tuple", leaving its separator on the left hand side (new high key), and its downlink on the right hand side (new minus infinity tuple). Pivot tuples may contain a separator key and a downlink, just a downlink, or just a separator key (sometimes this is implicit, and the block number is garbage). I am particular about the terminology because the pivot tuple vs. downlink vs. separator key thing causes a lot of confusion, particularly when you're using Lehman and Yao (or Lanin and Shasha) to understand how things work in Postgres. We want to have a broad term that refers to the tuples that describe the keyspace (pivot tuples), since it's often helpful to refer to them collectively, without seeming to contradict Lehman and Yao.
They always do the right thing (move left) on otherwise-equal pivot tuples, since they have no scantid. That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring about the version the index uses. It might be NULL for a pivot tuple in a v3 index, even though we imagine/pretend that it should have a value set. But that doesn't matter, because higher level code knows that !heapkeyspace indexes should never get a scantid (_bt_compare() does Assert() that they got that detail right, though). We "have no reason to peek", because we don't have a scantid, so index scans work essentially the same way, regardless of the version in use. There are a few specific cross-version things that we need to think about outside of making sure that there is never a scantid (and !minusinfkey optimization is unused) in < v4 indexes, but these are all related to unique indexes. "Pretending that all indexes have a heap TID" is a very useful mental model. Nothing really changes, even though you might guess that changing the classic "Subtree S is described by Ki < v <= Ki+1" invariant would need to break code in _bt_binsrch()/_bt_compare(). Just pretend that the classic invariant was there since the Berkeley days, and don't do anything that breaks the useful illusion on versions before v4. > This macro (and many others in nbtree.h) is quite complicated. A static > inline function might be easier to read. I agree that the macros are complicated, but that seems to be because the rules are complicated. I'd rather leave the macros in place, and improve the commentary on the rules. > 'xlmeta.version' is set incorrectly. Oops. Fixed in v14. > I find this comment difficult to read. I suggest rewriting it to: > > /* > * The current Btree version is 4. That's what you'll get when you create > * a new index. I used your wording for this in v14, almost verbatim. > Now that the index tuple format becomes more complicated, I feel that > there should be some kind of an overview explaining the format. All the > information is there, in the comments in nbtree.h, but you have to piece > together all the details to get the overall picture. I wrote this to > keep my head straight: v14 uses your diagrams in nbtree.h, and expands some existing discussion of INCLUDE indexes/non-key attributes/tuple format. Let me know what you think. -- Peter Geoghegan
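To make the mental model above concrete, here is a simplified, self-contained sketch of the tie-breaker rule (it is not _bt_compare(), and the types and names are stand-ins): the heap TID column only participates when the insertion scan key actually carries a scantid, which !heapkeyspace indexes never provide, and a pivot tuple whose TID was truncated away behaves like minus infinity for that attribute.

#include <stdio.h>

typedef struct SketchTid
{
	unsigned	block;
	unsigned	offset;
} SketchTid;

/*
 * Tie-breaker applied only after all user-visible key columns compared
 * equal.  Returns <0, 0 or >0, following _bt_compare()'s convention.
 */
static int
tid_tiebreak(const SketchTid *scantid, const SketchTid *tupletid)
{
	if (scantid == NULL)
		return 0;				/* no scantid: keys simply compare equal */
	if (tupletid == NULL)
		return 1;				/* truncated TID sorts as minus infinity */
	if (scantid->block != tupletid->block)
		return scantid->block < tupletid->block ? -1 : 1;
	if (scantid->offset != tupletid->offset)
		return scantid->offset < tupletid->offset ? -1 : 1;
	return 0;
}

int
main(void)
{
	SketchTid	scan = {10, 3};
	SketchTid	tup = {10, 7};

	printf("%d %d %d\n",
		   tid_tiebreak(NULL, &tup),	/* v3-style scan: 0, keys equal */
		   tid_tiebreak(&scan, NULL),	/* truncated pivot: 1 */
		   tid_tiebreak(&scan, &tup));	/* full comparison: -1 */
	return 0;
}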
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
I'm looking at the first patch in the series now. I'd suggest that you commit that very soon. It's useful on its own, and seems pretty much ready to be committed already. I don't think it will be much affected by whatever changes we make to the later patches, anymore. I did some copy-editing of the code comments, see attached patch which applies on top of v14-0001-Refactor-nbtree-insertion-scankeys.patch. Mostly, to use more Plain English: use active voice instead of passive, split long sentences, avoid difficult words. I also had a few comments and questions on some details. I added them in the same patch, marked with "HEIKKI:". Please take a look. - Heikki
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I'm looking at the first patch in the series now. I'd suggest that you > commit that very soon. It's useful on its own, and seems pretty much > ready to be committed already. I don't think it will be much affected by > whatever changes we make to the later patches, anymore. I agree that the parts covered by the first patch in the series are very unlikely to need changes, but I hesitate to commit it weeks ahead of the other patches. Some of the things that make _bt_findinsertloc() fast are missing for v3 indexes. The "consider secondary factors during nbtree splits" patch actually more than compensates for that with v3 indexes, at least in some cases, but the first patch applied on its own will slightly regress performance. At least, I benchmarked the first patch on its own several months ago and noticed a small regression at the time, though I don't have the exact details at hand. It might have been an invalid result, because I wasn't particularly thorough at the time. We do make some gains in the first patch (the _bt_check_unique() thing), but we also check the high key more than we need to within _bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming thing isn't as streamlined. It's a lot of work to validate and revalidate the performance of a patch like this, and I'd rather commit the first three patches within a couple of days of each other (I can validate v3 indexes and v4 indexes separately, though). We can put off the other patches for longer, and treat them as independent. I guess I'd also push the final amcheck patch following the first three -- no point in holding back on that. Then we'd be left with "Add "split after new tuple" optimization", and "Add high key "continuescan" optimization" as independent improvements that can be pushed at the last minute of the final CF. > I also had a few comments and questions on some details. I added them in > the same patch, marked with "HEIKKI:". Please take a look. Will respond now. Any point that I haven't responded to directly has been accepted. > +HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit > +weird to talk about it in the function comment. I didn't understand what > +the point of adding this sentence was, so I removed it. Maybe there is no point in the comment you reference here, but I like the idea of "checkingunique", because that symbol name is a common thread between a number of functions that coordinate with each other. It's not just a local variable in one function. > @@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key, > if (P_RIGHTMOST(opaque)) > break; > highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY); > + > + /* > + * HEIKKI: This assertion might fire if the user-defined opclass > + * is broken. It's just an assertion, so maybe that's ok. With a > + * broken opclass, it's obviously "garbage in, garbage out", but > + * we should try to behave sanely anyway. I don't remember what > + * our general policy on that is; should we assert, elog(ERROR), > + * or continue silently in that case? An elog(ERROR) or > + * elog(WARNING) would feel best to me, but I don't remember what > + * we usually do. > + */ > Assert(highkeycmp <= 0); > if (highkeycmp != 0) > break; We don't really have a general policy on it. However, I don't have any sympathy for the idea of trying to soldier on with a corrupt index. I also don't think that it's worth making this a "can't happen" error.
Like many of my assertions, this assertion is intended to document an invariant. I don't actually anticipate that it could ever really fail. > +Should we mention explicitly that this binary-search reuse is only applicable > +if unique checks were performed? It's kind of implied by the fact that it's > +_bt_check_unique() that saves the state, but perhaps we should be more clear > +about it. I guess so. > +What is a "garbage duplicate"? Same as a "dead duplicate"? Yes. > +The last sentence, about garbage duplicates, seems really vague. Why do we > +ever do any comparisons that are not strictly necessary? Perhaps it's best to > +just remove that last sentence. Okay -- will remove. > + > +HEIKKI: I don't buy the argument that microvacuuming has to happen here. You > +could easily imagine a separate function that does microvacuuming, and resets > +(or even updates) the binary-search cache in the insertion key. I agree this > +is a convenient place to do it, though. It wasn't supposed to be a water-tight argument. I'll just say that it's convenient. > +/* HEIKKI: > +Do we need 'checkunique' as an argument? If unique checks were not > +performed, the insertion key will simply not have saved state. > +*/ We need it in the next patch in the series, because it's also useful for optimizing away the high key check with non-unique indexes. We know that _bt_moveright() was called at the leaf level, with scantid filled in, so there is no question of needing to move right within _bt_findinsertloc() (provided it's a heapkeyspace index). Actually, we even need it in the first patch: we only restore a binary search because we know that there is something to restore, and must ask for it to be restored explicitly (anything else seems unsafe). Maybe we can't restore it because it's not a unique index, or maybe we can't restore it because we microvacuumed, or moved right to get free space. I don't think that it'll be helpful to make _bt_findinsertloc() pretend that it doesn't know exactly where the binary search bounds come from -- it already knows plenty about unique indexes specifically, and about how it may have to invalidate the bounds. The whole way that it couples buffer locks is only useful for unique indexes, so it already knows *plenty* about unique indexes specifically. I actually like the idea of making certain insertion scan key mutable state relating to search bounds hidden in the case of "dynamic prefix truncation" [1]. Doesn't seem to make sense here, though. > + /* HEIKKI: I liked this comment that we used to have here, before this patch: */ > + /*---------- > + * If we will need to split the page to put the item on this page, > + * check whether we can put the tuple somewhere to the right, > + * instead. Keep scanning right until we > + /* HEIKKI: Maybe it's not relevant with the later patches, but at least > + * with just this first patch, it's still valid. I noticed that the > + * comment is now in _bt_useduplicatepage, it seems a bit out-of-place > + * there. */ I don't think it matters, because I don't think that the first patch can be justified as an independent piece of work. I like the idea of breaking up the patch series, because it makes it all easier to understand, but the first three patches are kind of intertwined. > +HEIKKI: In some scenarios, if the BTP_HAS_GARBAGE flag is falsely set, we would > +try to microvacuum the page twice: first in _bt_useduplicatepage, and second > +time here. 
That's because _bt_vacuum_one_page() doesn't clear the flag, if > +there are in fact no LP_DEAD items. That's probably insignificant and not worth > +worrying about, but I thought I'd mention it. Right. It's also true that all future insertions will reach _bt_vacuum_one_page() and do the same again, until there either is garbage, or until the page splits. > - * rightmost page case), all the items on the right half will be user data > - * (there is no existing high key that needs to be relocated to the new > - * right page). > + * rightmost page case), all the items on the right half will be user > + * data. > + * > +HEIKKI: I don't think the comment change you made here was needed or > +helpful, so I reverted it. I thought it added something when you're looking at it from a WAL-logging point of view. But I can live without this. > - * starting a regular index scan some can be omitted. The array is used as a > + * starting a regular index scan, some can be omitted. The array is used as a > * flexible array member, though it's sized in a way that makes it possible to > * use stack allocations. See nbtree/README for full details. > + > +HEIKKI: I don't see anything in the README about stack allocations. What > +exactly does the README reference refer to? No code seems to actually allocate > +this in the stack, so we don't really need that. The README discusses insertion scankeys in general, though. I think that you read it that way because you're focussed on my changes, and not because it actually implies that the README talks about the stack thing specifically. (But I can change it if you like.) There is a stack allocation in _bt_first(). This was once just another dynamic allocation, that called _bt_mkscankey(), but that regressed nested loop joins, so I had to make it work the same way as before. I noticed this about six months ago, because there was a clear impact on the TPC-C "Stock level" transaction, which is now sometimes twice as fast with the patch series. Note also that commit d961a568, from 2005, changed the _bt_first() code to use a stack allocation. Besides, sticking to a stack allocation makes the changes to _bt_first() simpler, even though it has to duplicate a few things from _bt_mkscankey(). I could get you a v15 that integrates your changes pretty quickly, but I'll hold off on that for at least a few days. I have a feeling that you'll have more feedback for me to work through before too long. [1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com -- Peter Geoghegan
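As a self-contained sketch of the binary-search-bounds reuse being discussed (stand-in code, not _bt_binsrch()): the unique-check search effectively compares without the tie-breaker, which bounds the run of duplicates, and the later search that does include the tie-breaker only has to look inside that cached range, provided the page hasn't changed in the meantime.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct Item
{
	int			key;
	int			tid;			/* stand-in for the heap TID tie-breaker */
} Item;

/* first position in [low, high) whose (key, tid) is >= the search value */
static int
first_ge(const Item *items, int low, int high, int key, int tid)
{
	while (low < high)
	{
		int			mid = low + (high - low) / 2;
		bool		less = items[mid].key < key ||
			(items[mid].key == key && items[mid].tid < tid);

		if (less)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

int
main(void)
{
	Item		page[] = {{1, 1}, {2, 1}, {2, 5}, {2, 9}, {3, 1}};
	int			n = 5;

	/* unique-check pass: tie-breaker treated as -inf / +inf */
	int			low = first_ge(page, 0, n, 2, INT_MIN);
	int			stricthigh = first_ge(page, 0, n, 2, INT_MAX);

	/* insertion pass: reuse the cached bounds, now with the tie-breaker */
	int			insertpos = first_ge(page, low, stricthigh, 2, 7);

	printf("duplicates span [%d, %d); insert (2,7) at offset %d\n",
		   low, stricthigh, insertpos);
	return 0;
}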
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Robert Haas
Date:
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan <pg@bowt.ie> wrote: > I agree that the parts covered by the first patch in the series are > very unlikely to need changes, but I hesitate to commit it weeks ahead > of the other patches. I know I'm stating the obvious here, but we don't have many weeks left at this point. I have not reviewed any code, but I have been following this thread and I'd really like to see this work go into PostgreSQL 12, assuming it's in good enough shape. It sounds like really good stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas <robertmhaas@gmail.com> wrote: > I know I'm stating the obvious here, but we don't have many weeks left > at this point. I have not reviewed any code, but I have been > following this thread and I'd really like to see this work go into > PostgreSQL 12, assuming it's in good enough shape. It sounds like > really good stuff. Thanks! Barring any objections, I plan to commit the first 3 patches (plus the amcheck "relocate" patch) within 7 - 10 days (that's almost everything). Heikki hasn't reviewed 'Add high key "continuescan" optimization' yet, and it seems like he should take a look at that before I proceed with it. But that seems like the least controversial enhancement within the entire patch series, so I'm not very worried about it. I'm currently working on v15, which has comment-only revisions requested by Heikki. I expect to continue to work with him to make sure that he is happy with the presentation. I'll also need to revalidate the performance of the patch series following recent minor changes to the logic for choosing a split point. That can take days. This is why I don't want to commit the first patch without committing at least the first three all at once -- it increases the amount of performance validation work I'll have to do considerably. (I have to consider both v4 and v3 indexes already, which seems like enough work.) Two of the later patches (one of which I plan to push as part of the first batch of commits) use heuristics to decide where to split the page. As a Postgres contributor, I have learned to avoid inventing heuristics, so this automatically makes me a bit uneasy. However, I don't feel so bad about it here, on reflection. The on-disk size of the TPC-C indexes is reduced by 35% with the 'Add "split after new tuple" optimization' patch (I think that the entire database is usually about 12% smaller). There simply isn't a fundamentally better way to get the same benefit, and I'm sure that nobody will argue that we should just accept the fact that the most influential database benchmark of all time has a big index bloat problem with Postgres. That would be crazy. That said, it's not impossible that somebody will shout at me because my heuristics made their index bloated. I can't see how that could happen, but I am prepared. I can always adjust the heuristics when new information comes to light. I have fairly thorough test cases that should allow me to do this without regressing anything else. This is a risk that can be managed sensibly. There is no gnawing ambiguity about the on-disk changes laid down in the second patch (nor the first patch), though. Making on-disk changes is always a bit scary, but making the keys unique is clearly a big improvement architecturally, as it brings nbtree closer to the Lehman & Yao design without breaking anything for v3 indexes (v3 indexes simply aren't allowed to use a heap TID in their scankey). Unique keys also allow amcheck to relocate every tuple in the index from the root page, using the same code path as regular index scans. We'll be relying on the uniqueness of keys within amcheck from the beginning, before anybody teaches nbtree to perform retail index tuple deletion. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 06/03/2019 04:03, Peter Geoghegan wrote: > On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> I'm looking at the first patch in the series now. I'd suggest that you >> commit that very soon. It's useful on its own, and seems pretty much >> ready to be committed already. I don't think it will be much affected by >> whatever changes we make to the later patches, anymore. After staring at the first patch for a bit longer, a few things started to bother me: * The new struct is called BTScanInsert, but it's used for searches, too. It makes sense when you read the README, which explains the difference between "search scan keys" and "insertion scan keys", but now that we have a separate struct for this, perhaps we should call insertion scan keys by a less confusing name. I don't know what to suggest, though. "Positioning key"? * We store the binary search bounds in BTScanInsertData, but they're only used during insertions. * The binary search bounds are specific for a particular buffer. But that buffer is passed around separately from the bounds. It seems easy to have them go out of sync, so that you try to use the cached bounds for a different page. The savebinsrch and restorebinsrch are used to deal with that, but it is pretty complicated. I came up with the attached (against master), which addresses the 2nd and 3rd points. I added a whole new BTInsertStateData struct, to hold the binary search bounds. BTScanInsert now only holds the 'scankeys' array, and the 'nextkey' flag. The new BTInsertStateData struct also holds the current buffer we're considering to insert to, and a 'bounds_valid' flag to indicate if the saved bounds are valid for the current buffer. That way, it's more straightforward to clear the 'bounds_valid' flag whenever we move right. I made a copy of _bt_binsrch, called _bt_binsrch_insert. It does the binary search like _bt_binsrch does, but the bounds caching is only done in _bt_binsrch_insert. Seems more clear to have separate functions for them now, even though there's some duplication. >> +/* HEIKKI: >> +Do we need 'checkunique' as an argument? If unique checks were not >> +performed, the insertion key will simply not have saved state. >> +*/ > > We need it in the next patch in the series, because it's also useful > for optimizing away the high key check with non-unique indexes. We > know that _bt_moveright() was called at the leaf level, with scantid > filled in, so there is no question of needing to move right within > _bt_findinsertloc() (provided it's a heapkeyspace index). Hmm. Perhaps it would be better to move the call to _bt_binsrch (or _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So that _bt_findinsertloc would only be responsible for finding the correct page to insert to. So the overall code, after patch #2, would be like: /* * Do the insertion. First move right to find the correct page to * insert to, if necessary. If we're inserting to a non-unique index, * _bt_search() already did this when it checked if a move to the * right was required for leaf page. Insertion scankey's scantid * would have been filled out at the time. On a unique index, the * current buffer is the first buffer containing duplicates, however, * so we may need to move right to the correct location for this * tuple.
*/ if (checkingunique || itup_key->heapkeyspace) _bt_findinsertpage(rel, &insertstate, stack, heapRel); newitemoff = _bt_binsrch_insert(rel, &insertstate); _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, newitemoff, false); Does this make sense? > Actually, we even need it in the first patch: we only restore a binary > search because we know that there is something to restore, and must > ask for it to be restored explicitly (anything else seems unsafe). > Maybe we can't restore it because it's not a unique index, or maybe we > can't restore it because we microvacuumed, or moved right to get free > space. I don't think that it'll be helpful to make _bt_findinsertloc() > pretend that it doesn't know exactly where the binary search bounds > come from -- it already knows plenty about unique indexes > specifically, and about how it may have to invalidate the bounds. The > whole way that it couples buffer locks is only useful for unique > indexes, so it already knows *plenty* about unique indexes > specifically. The attached new version simplifies this, IMHO. The bounds and the current buffer go together in the same struct, so it's easier to keep track whether the bounds are valid or not. >> - * starting a regular index scan some can be omitted. The array is used as a >> + * starting a regular index scan, some can be omitted. The array is used as a >> * flexible array member, though it's sized in a way that makes it possible to >> * use stack allocations. See nbtree/README for full details. >> + >> +HEIKKI: I don't see anything in the README about stack allocations. What >> +exactly does the README reference refer to? No code seems to actually allocate >> +this in the stack, so we don't really need that. > > The README discusses insertion scankeys in general, though. I think > that you read it that way because you're focussed on my changes, and > not because it actually implies that the README talks about the stack > thing specifically. (But I can change it if you like.) > > There is a stack allocation in _bt_first(). This was once just another > dynamic allocation, that called _bt_mkscankey(), but that regressed > nested loop joins, so I had to make it work the same way as before. Ah, gotcha, I missed that. - Heikki
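A rough standalone sketch of the grouping described above (the field names are illustrative stand-ins; the attached patch is authoritative): everything that describes the insert position currently under consideration travels together, so the cached bounds can be invalidated in exactly one place whenever the buffer being considered changes.

#include <stdbool.h>

/* stand-ins for the real Buffer/OffsetNumber types */
typedef int SketchBuffer;
typedef int SketchOffset;

typedef struct InsertStateSketch
{
	SketchBuffer buf;			/* page currently being considered */
	SketchOffset low;			/* cached binary search bounds ... */
	SketchOffset stricthigh;	/* ... only meaningful while bounds_valid */
	bool		bounds_valid;	/* cleared as soon as buf changes */
} InsertStateSketch;

void
sketch_move_right(InsertStateSketch *state, SketchBuffer next_buf)
{
	state->buf = next_buf;
	state->bounds_valid = false;	/* bounds belonged to the previous page */
}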
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > After staring at the first patch for bit longer, a few things started to > bother me: > > * The new struct is called BTScanInsert, but it's used for searches, > too. It makes sense when you read the README, which explains the > difference between "search scan keys" and "insertion scan keys", but now > that we have a separate struct for this, perhaps we call insertion scan > keys with a less confusing name. I don't know what to suggest, though. > "Positioning key"? I think that insertion scan key is fine. It's been called that for almost twenty years. It's not like it's an intuitive concept that could be conveyed easily if only we came up with a new, pithy name. > * We store the binary search bounds in BTScanInsertData, but they're > only used during insertions. > > * The binary search bounds are specific for a particular buffer. But > that buffer is passed around separately from the bounds. It seems easy > to have them go out of sync, so that you try to use the cached bounds > for a different page. The savebinsrch and restorebinsrch is used to deal > with that, but it is pretty complicated. That might be an improvement, but I do think that using mutable state in the insertion scankey, to restrict a search is an idea that could work well in at least one other way. That really isn't a once-off thing, even though it looks that way. > I came up with the attached (against master), which addresses the 2nd > and 3rd points. I added a whole new BTInsertStateData struct, to hold > the binary search bounds. BTScanInsert now only holds the 'scankeys' > array, and the 'nextkey' flag. It will also have to store heapkeyspace, of course. And minusinfkey. BTW, I would like to hear what you think of the idea of minusinfkey (and the !minusinfkey optimization) specifically. > The new BTInsertStateData struct also > holds the current buffer we're considering to insert to, and a > 'bounds_valid' flag to indicate if the saved bounds are valid for the > current buffer. That way, it's more straightforward to clear the > 'bounds_valid' flag whenever we move right. I'm not sure that that's an improvement. Moving right should be very rare with my patch. gcov shows that we never move right here anymore with the regression tests, or within _bt_check_unique() -- not once. For a second, I thought that you forgot to invalidate the bounds_valid flag, because you didn't pass it directly, by value to _bt_useduplicatepage(). > I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary > search like _bt_binsrch does, but the bounds caching is only done in > _bt_binsrch_insert. Seems more clear to have separate functions for them > now, even though there's some duplication. I'll have to think about that some more. Having a separate _bt_binsrch_insert() may be worth it, but I'll need to do some profiling. > Hmm. Perhaps it would be to move the call to _bt_binsrch (or > _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So > that _bt_findinsertloc would only be responsible for finding the correct > page to insert to. So the overall code, after patch #2, would be like: Maybe, but as I said it's not like _bt_findinsertloc() doesn't know all about unique indexes already. This is pointed out in a comment in _bt_doinsert(), even. I guess that it might have to be changed to say _bt_findinsertpage() instead, with your new approach. > /* > * Do the insertion. First move right to find the correct page to > * insert to, if necessary. 
If we're inserting to a non-unique index, > * _bt_search() already did this when it checked if a move to the > * right was required for leaf page. Insertion scankey's scantid > * would have been filled out at the time. On a unique index, the > * current buffer is the first buffer containing duplicates, however, > * so we may need to move right to the correct location for this > * tuple. > */ > if (checkingunique || itup_key->heapkeyspace) > _bt_findinsertpage(rel, &insertstate, stack, heapRel); > > newitemoff = _bt_binsrch_insert(rel, &insertstate); > > _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, > newitemoff, false); > > Does this make sense? I guess you're saying this because you noticed that the for (;;) loop in _bt_findinsertloc() doesn't do that much in many cases, because of the fastpath. I suppose that this could be an improvement, provided we keep assertions that verify that the work "_bt_findinsertpage()" would have done (had it been called) was in fact unnecessary (e.g., checking the high key/rightmost-ness). > The attached new version simplifies this, IMHO. The bounds and the > current buffer go together in the same struct, so it's easier to keep > track whether the bounds are valid or not. I'll look into integrating this with my current draft v15 tomorrow. Need to sleep on it. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > It will also have to store heapkeyspace, of course. And minusinfkey. > BTW, I would like to hear what you think of the idea of minusinfkey > (and the !minusinfkey optimization) specifically. > I'm not sure that that's an improvement. Moving right should be very > rare with my patch. gcov shows that we never move right here anymore > with the regression tests, or within _bt_check_unique() -- not once. > For a second, I thought that you forgot to invalidate the bounds_valid > flag, because you didn't pass it directly, by value to > _bt_useduplicatepage(). BTW, the !minusinfkey optimization is why we literally never move right within _bt_findinsertloc() while the regression tests run. We always land on the correct leaf page to begin with. (It works with unique index insertions, where scantid is NULL when we descend the tree.) In general, there are two good reasons for us to move right: * There was a concurrent page split (or page deletion), and we just missed the downlink in the parent, and need to recover. * We omit some columns from our scan key (at least scantid), and there are perhaps dozens of matches -- this is not relevant to _bt_doinsert() code. The single value strategy used by nbtsplitloc.c does a good job of making it unlikely that _bt_check_unique()-wise duplicates will cross leaf pages, so there will almost always be one leaf page to visit. And, the !minusinfkey optimization ensures that the only reason we'll move right is because of a concurrent page split, within _bt_moveright(). The buffer lock coupling move to the right that _bt_findinsertloc() does should be considered an edge case with all of these measures, at least with v4 indexes. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 07/03/2019 14:54, Peter Geoghegan wrote: > On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> After staring at the first patch for bit longer, a few things started to >> bother me: >> >> * The new struct is called BTScanInsert, but it's used for searches, >> too. It makes sense when you read the README, which explains the >> difference between "search scan keys" and "insertion scan keys", but now >> that we have a separate struct for this, perhaps we call insertion scan >> keys with a less confusing name. I don't know what to suggest, though. >> "Positioning key"? > > I think that insertion scan key is fine. It's been called that for > almost twenty years. It's not like it's an intuitive concept that > could be conveyed easily if only we came up with a new, pithy name. Yeah. It's been like that forever, but I must confess I hadn't paid any attention to it, until now. I had not understood that text in the README explaining the difference between search and insertion scan keys, before looking at this patch. Not sure I ever read it with any thought. Now that I understand it, I don't like the "insertion scan key" name. > BTW, I would like to hear what you think of the idea of minusinfkey > (and the !minusinfkey optimization) specifically. I don't understand it :-(. I guess that's valuable feedback on its own. I'll spend more time reading the code around that, but meanwhile, if you can think of a simpler way to explain it in the comments, that'd be good. >> The new BTInsertStateData struct also >> holds the current buffer we're considering to insert to, and a >> 'bounds_valid' flag to indicate if the saved bounds are valid for the >> current buffer. That way, it's more straightforward to clear the >> 'bounds_valid' flag whenever we move right. > > I'm not sure that that's an improvement. Moving right should be very > rare with my patch. gcov shows that we never move right here anymore > with the regression tests, or within _bt_check_unique() -- not once. I haven't given performance much thought, really. I don't expect this method to be any slower, but the point of the refactoring is to make the code easier to understand. >> /* >> * Do the insertion. First move right to find the correct page to >> * insert to, if necessary. If we're inserting to a non-unique index, >> * _bt_search() already did this when it checked if a move to the >> * right was required for leaf page. Insertion scankey's scantid >> * would have been filled out at the time. On a unique index, the >> * current buffer is the first buffer containing duplicates, however, >> * so we may need to move right to the correct location for this >> * tuple. >> */ >> if (checkingunique || itup_key->heapkeyspace) >> _bt_findinsertpage(rel, &insertstate, stack, heapRel); >> >> newitemoff = _bt_binsrch_insert(rel, &insertstate); >> >> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, >> newitemoff, false); >> >> Does this make sense? > > I guess you're saying this because you noticed that the for (;;) loop > in _bt_findinsertloc() doesn't do that much in many cases, because of > the fastpath. The idea is that _bt_findinsertpage() would not need to know whether the unique checks were performed or not. I'd like to encapsulate all the information about the "insert position we're considering" in the BTInsertStateData struct. Passing 'checkingunique' as a separate argument violates that, because when it's set, the key means something slightly different. Hmm. 
Actually, with patch #2, _bt_findinsertloc() could look at whether 'scantid' is set, instead of 'checkingunique'. That would seem better. If it looks at 'checkingunique', it's making the assumption that if unique checks were not performed, then we are already positioned on the correct page, according to the heap TID. But looking at 'scantid' seems like a more direct way of getting the same information. And then we won't need to pass the 'checkingunique' flag as an "out-of-band" argument. So I'm specifically suggesting that we replace this, in _bt_findinsertloc: if (!checkingunique && itup_key->heapkeyspace) break; With this: if (itup_key->scantid) break; And remove the 'checkingunique' argument from _bt_findinsertloc. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 07/03/2019 15:41, Heikki Linnakangas wrote: > On 07/03/2019 14:54, Peter Geoghegan wrote: >> On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> After staring at the first patch for bit longer, a few things started to >>> bother me: >>> >>> * The new struct is called BTScanInsert, but it's used for searches, >>> too. It makes sense when you read the README, which explains the >>> difference between "search scan keys" and "insertion scan keys", but now >>> that we have a separate struct for this, perhaps we call insertion scan >>> keys with a less confusing name. I don't know what to suggest, though. >>> "Positioning key"? >> >> I think that insertion scan key is fine. It's been called that for >> almost twenty years. It's not like it's an intuitive concept that >> could be conveyed easily if only we came up with a new, pithy name. > > Yeah. It's been like that forever, but I must confess I hadn't paid any > attention to it, until now. I had not understood that text in the README > explaining the difference between search and insertion scan keys, before > looking at this patch. Not sure I ever read it with any thought. Now > that I understand it, I don't like the "insertion scan key" name. > >> BTW, I would like to hear what you think of the idea of minusinfkey >> (and the !minusinfkey optimization) specifically. > > I don't understand it :-(. I guess that's valuable feedback on its own. > I'll spend more time reading the code around that, but meanwhile, if you > can think of a simpler way to explain it in the comments, that'd be good. > >>> The new BTInsertStateData struct also >>> holds the current buffer we're considering to insert to, and a >>> 'bounds_valid' flag to indicate if the saved bounds are valid for the >>> current buffer. That way, it's more straightforward to clear the >>> 'bounds_valid' flag whenever we move right. >> >> I'm not sure that that's an improvement. Moving right should be very >> rare with my patch. gcov shows that we never move right here anymore >> with the regression tests, or within _bt_check_unique() -- not once. > > I haven't given performance much thought, really. I don't expect this > method to be any slower, but the point of the refactoring is to make the > code easier to understand. > >>> /* >>> * Do the insertion. First move right to find the correct page to >>> * insert to, if necessary. If we're inserting to a non-unique index, >>> * _bt_search() already did this when it checked if a move to the >>> * right was required for leaf page. Insertion scankey's scantid >>> * would have been filled out at the time. On a unique index, the >>> * current buffer is the first buffer containing duplicates, however, >>> * so we may need to move right to the correct location for this >>> * tuple. >>> */ >>> if (checkingunique || itup_key->heapkeyspace) >>> _bt_findinsertpage(rel, &insertstate, stack, heapRel); >>> >>> newitemoff = _bt_binsrch_insert(rel, &insertstate); >>> >>> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, >>> newitemoff, false); >>> >>> Does this make sense? >> >> I guess you're saying this because you noticed that the for (;;) loop >> in _bt_findinsertloc() doesn't do that much in many cases, because of >> the fastpath. > > The idea is that _bt_findinsertpage() would not need to know whether the > unique checks were performed or not. I'd like to encapsulate all the > information about the "insert position we're considering" in the > BTInsertStateData struct. 
Passing 'checkingunique' as a separate > argument violates that, because when it's set, the key means something > slightly different. > > Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether > 'scantid' is set, instead of 'checkingunique'. That would seem better. > If it looks at 'checkingunique', it's making the assumption that if > unique checks were not performed, then we are already positioned on the > correct page, according to the heap TID. But looking at 'scantid' seems > like a more direct way of getting the same information. And then we > won't need to pass the 'checkingunique' flag as an "out-of-band" argument. > > So I'm specifically suggesting that we replace this, in _bt_findinsertloc: > > if (!checkingunique && itup_key->heapkeyspace) > break; > > With this: > > if (itup_key->scantid) > break; > > And remove the 'checkingunique' argument from _bt_findinsertloc. Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has already been restored, even if it was not set originally when we did _bt_search. My dislike here is that passing 'checkingunique' as a separate argument acts like a "modifier", slightly changing the meaning of the insertion scan key. If it's not set, we know we're positioned on the correct page. Otherwise, we might not be. And it's a pretty indirect way of saying that, as it also depends on 'heapkeyspace'. Perhaps add a flag to BTInsertStateData, to indicate the same thing more explicitly. Something like "bool is_final_insertion_page; /* when set, no need to move right */". - Heikki
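To make the shape of that suggestion easier to picture, here is a minimal, self-contained C sketch of an insertion state that carries the current page, cached binary-search bounds, and a bounds_valid flag that is cleared whenever the search moves right. The type and function names are illustrative only -- they are modeled on the discussion above, not taken from the actual patches:

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy stand-ins for nbtree concepts; not the real PostgreSQL types. */
    typedef int BlockNum;

    typedef struct ToyInsertState
    {
        BlockNum    curblock;       /* page currently considered for the insert */
        int         low;            /* cached binary-search lower bound */
        int         high;           /* cached binary-search upper bound (strict) */
        bool        bounds_valid;   /* are low/high valid for curblock? */
    } ToyInsertState;

    /* Whenever we move to a different page, the cached bounds become garbage. */
    static void
    toy_move_right(ToyInsertState *state, BlockNum rightsib)
    {
        state->curblock = rightsib;
        state->bounds_valid = false;
    }

    /* Binary search over a sorted int array, reusing cached bounds if valid. */
    static int
    toy_binsrch_insert(ToyInsertState *state, const int *items, int nitems, int key)
    {
        int low = state->bounds_valid ? state->low : 0;
        int high = state->bounds_valid ? state->high : nitems;

        while (low < high)
        {
            int mid = low + (high - low) / 2;

            if (items[mid] < key)
                low = mid + 1;
            else
                high = mid;
        }

        /* Cache the narrowed bounds for a later recheck on this same page */
        state->low = low;
        state->high = high;
        state->bounds_valid = true;
        return low;                 /* offset at which the new item would go */
    }

    int
    main(void)
    {
        int items[] = {10, 20, 30, 40};
        ToyInsertState state = {.curblock = 1, .bounds_valid = false};

        printf("insert 25 at offset %d\n", toy_binsrch_insert(&state, items, 4, 25));
        toy_move_right(&state, 2);  /* bounds are invalidated automatically */
        printf("bounds_valid after move right: %d\n", state.bounds_valid);
        return 0;
    }

The point of bundling the buffer and the bounds in one struct, as the sketch does, is that there is only one place where the "current page" can change, so there is only one place where the cached bounds can go stale.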
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 05/03/2019 05:16, Peter Geoghegan wrote: > Attached is v14, which has changes based on your feedback. As a quick check of the backwards-compatibility code, i.e. !heapkeyspace, I hacked _bt_initmetapage to force the version number to 3, and ran the regression tests. It failed an assertion in the 'create_index' test:

(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f2943f9a535 in __GI_abort () at abort.c:79
#2 0x00005622c7d9d6b4 in ExceptionalCondition (conditionName=0x5622c7e4cbe8 "!(_bt_check_natts(rel, key->heapkeyspace, page, offnum))", errorType=0x5622c7e4c62a "FailedAssertion", fileName=0x5622c7e4c734 "nbtsearch.c", lineNumber=511) at assert.c:54
#3 0x00005622c78627fb in _bt_compare (rel=0x5622c85afbe0, key=0x7ffd7a996db0, page=0x7f293d433780 "", offnum=2) at nbtsearch.c:511
#4 0x00005622c7862640 in _bt_binsrch (rel=0x5622c85afbe0, key=0x7ffd7a996db0, buf=4622) at nbtsearch.c:432
#5 0x00005622c7861ec9 in _bt_search (rel=0x5622c85afbe0, key=0x7ffd7a996db0, bufP=0x7ffd7a9976d4, access=1, snapshot=0x5622c8353740) at nbtsearch.c:142
#6 0x00005622c7863a44 in _bt_first (scan=0x5622c841e828, dir=ForwardScanDirection) at nbtsearch.c:1183
#7 0x00005622c785f8b0 in btgettuple (scan=0x5622c841e828, dir=ForwardScanDirection) at nbtree.c:245
#8 0x00005622c78522e3 in index_getnext_tid (scan=0x5622c841e828, direction=ForwardScanDirection) at indexam.c:542
#9 0x00005622c7a67784 in IndexOnlyNext (node=0x5622c83ad280) at nodeIndexonlyscan.c:120
#10 0x00005622c7a438d5 in ExecScanFetch (node=0x5622c83ad280, accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9 <IndexOnlyRecheck>) at execScan.c:95
#11 0x00005622c7a4394a in ExecScan (node=0x5622c83ad280, accessMtd=0x5622c7a67254 <IndexOnlyNext>, recheckMtd=0x5622c7a67bc9 <IndexOnlyRecheck>) at execScan.c:145
#12 0x00005622c7a67c73 in ExecIndexOnlyScan (pstate=0x5622c83ad280) at nodeIndexonlyscan.c:322
#13 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83ad280) at execProcnode.c:445
#14 0x00005622c7a501a5 in ExecProcNode (node=0x5622c83ad280) at ../../../src/include/executor/executor.h:231
#15 0x00005622c7a50693 in fetch_input_tuple (aggstate=0x5622c83acdd0) at nodeAgg.c:406
#16 0x00005622c7a529d9 in agg_retrieve_direct (aggstate=0x5622c83acdd0) at nodeAgg.c:1737
#17 0x00005622c7a525a9 in ExecAgg (pstate=0x5622c83acdd0) at nodeAgg.c:1552
#18 0x00005622c7a41814 in ExecProcNodeFirst (node=0x5622c83acdd0) at execProcnode.c:445
#19 0x00005622c7a3621d in ExecProcNode (node=0x5622c83acdd0) at ../../../src/include/executor/executor.h:231
#20 0x00005622c7a38bd9 in ExecutePlan (estate=0x5622c83acb78, planstate=0x5622c83acdd0, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x5622c8462088, execute_once=true) at execMain.c:1645
#21 0x00005622c7a36872 in standard_ExecutorRun (queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:363
#22 0x00005622c7a36696 in ExecutorRun (queryDesc=0x5622c83a9eb8, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:307
#23 0x00005622c7c357dc in PortalRunSelect (portal=0x5622c8336778, forward=true, count=0, dest=0x5622c8462088) at pquery.c:929
#24 0x00005622c7c3546f in PortalRun (portal=0x5622c8336778, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5622c8462088, altdest=0x5622c8462088, completionTag=0x7ffd7a997d50 "") at pquery.c:770
#25 0x00005622c7c2f029 in exec_simple_query (query_string=0x5622c82cf508 "SELECT count(*) FROM onek_with_null WHERE unique1 IS NULL AND unique2 IS NULL;") at postgres.c:1215
#26 0x00005622c7c3369a in PostgresMain (argc=1, argv=0x5622c82faee0, dbname=0x5622c82fac50 "regression", username=0x5622c82c81e8 "heikki") at postgres.c:4256
#27 0x00005622c7b8bcf2 in BackendRun (port=0x5622c82f3d80) at postmaster.c:4378
#28 0x00005622c7b8b45b in BackendStartup (port=0x5622c82f3d80) at postmaster.c:4069
#29 0x00005622c7b87633 in ServerLoop () at postmaster.c:1699
#30 0x00005622c7b86e61 in PostmasterMain (argc=3, argv=0x5622c82c6160) at postmaster.c:1372
#31 0x00005622c7aa9925 in main (argc=3, argv=0x5622c82c6160) at main.c:228

I haven't investigated any deeper, but apparently something's broken. This was with patch v14, without any further changes. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Mar 7, 2019 at 12:14 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I haven't investigated any deeper, but apparently something's broken. > This was with patch v14, without any further changes. Try it with my patch -- attached. I think that you missed that the INCLUDE indexes thing within nbtsort.c needs to be changed back. -- Peter Geoghegan
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > BTW, I would like to hear what you think of the idea of minusinfkey > > (and the !minusinfkey optimization) specifically. > > I don't understand it :-(. I guess that's valuable feedback on its own. > I'll spend more time reading the code around that, but meanwhile, if you > can think of a simpler way to explain it in the comments, that'd be good. Here is another way of explaining it: When I drew you that picture while we were in Lisbon, I mentioned to you that the patch sometimes used a sentinel scantid value that was greater than minus infinity, but less than any real scantid. This could be used to force an otherwise-equal-to-pivot search to go left rather than uselessly going right. I explained this about 30 minutes in, when I was drawing you a picture. Well, that sentinel heap TID thing doesn't exist any more, because it was replaced by the !minusinfkey optimization, which is a *generalization* of the same idea, which extends it to all columns (not just the heap TID column). That way, you never have to go to two pages just because you searched for a value that happened to be at the "right at the edge" of a leaf page. Page deletion wants to assume that truncated attributes from the high key of the page being deleted have actual negative infinity values -- negative infinity is a value, just like any other, albeit one that can only appear in pivot tuples. This is simulated by VACUUM using "minusinfkey = true". We go left in the parent, not right, and land on the correct leaf page. Technically we don't compare the negative infinity values in the pivot to the negative infinity values in the scankey, but we return 0 just as if we had, and found them equal. Similarly, v3 indexes specify "minusinfkey = true" in all cases, because they always want to go left -- just like in old Postgres versions. They don't have negative infinity values (matches can be on either side of the all-equal pivot, so they must go left). -- Peter Geoghegan
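To make the comparison rule described above easier to follow, here is a heavily simplified, self-contained C model of it: truncated pivot attributes behave as minus infinity, and a search that opts out of minusinfkey semantics treats "equal to a truncated pivot" as "greater", so it descends to the right of the pivot rather than to its left. The real logic in _bt_compare() also has to deal with scantid, keysz, heapkeyspace, and v3 indexes; none of the names below are the patch's:

    #include <stdbool.h>
    #include <stdio.h>

    #define NINDEXATTS 2            /* toy index: two key columns */

    typedef struct ToyKey
    {
        int         natts;          /* attributes actually present in the key */
        int         vals[NINDEXATTS];
        bool        minusinfkey;    /* page-deletion style search? */
    } ToyKey;

    typedef struct ToyPivot
    {
        int         natts;          /* untruncated attributes in the pivot */
        int         vals[NINDEXATTS];
    } ToyPivot;

    /*
     * Truncated pivot attributes read as minus infinity.  A search that is
     * not interested in -inf values (!minusinfkey) may treat "equal to a
     * truncated pivot" as "greater", so that it descends right of the pivot.
     */
    static int
    toy_compare(const ToyKey *key, const ToyPivot *pivot)
    {
        int         natts = key->natts < pivot->natts ? key->natts : pivot->natts;

        for (int i = 0; i < natts; i++)
        {
            if (key->vals[i] != pivot->vals[i])
                return key->vals[i] < pivot->vals[i] ? -1 : 1;
        }

        if (key->natts > pivot->natts)
            return 1;               /* -inf pivot attribute < real scankey value */

        if (pivot->natts < NINDEXATTS && !key->minusinfkey)
            return 1;               /* not interested in -inf: go right */

        return 0;                   /* equal: binary search will go left */
    }

    int
    main(void)
    {
        ToyPivot    highkey = {.natts = 1, .vals = {2}};    /* "(2, -inf)" */
        ToyKey      search = {.natts = 1, .vals = {2}, .minusinfkey = false};
        ToyKey      pagedel = {.natts = 1, .vals = {2}, .minusinfkey = true};

        printf("ordinary search vs (2,-inf): %d (go right)\n",
               toy_compare(&search, &highkey));
        printf("page deletion  vs (2,-inf): %d (stay left)\n",
               toy_compare(&pagedel, &highkey));
        return 0;
    }

In this toy, only the page-deletion style search lands on the page to the left of the truncated pivot; every other search descends directly to the page where real matches could be, which is the optimization being discussed.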
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Mar 7, 2019 at 12:37 AM Peter Geoghegan <pg@bowt.ie> wrote: > When I drew you that picture while we were in Lisbon, I mentioned to > you that the patch sometimes used a sentinel scantid value that was > greater than minus infinity, but less than any real scantid. This > could be used to force an otherwise-equal-to-pivot search to go left > rather than uselessly going right. I explained this about 30 minutes > in, when I was drawing you a picture. I meant the opposite: it could be used to force the search to go right during the descent, instead of going left while descending the tree and then unnecessarily moving right on the leaf level. As I said, moving right on the leaf level (rather than during the descent) should only happen when it's necessary, such as when there is a concurrent page split. It shouldn't happen consistently when searching for the same value, unless there really are matches across multiple leaf pages, and that's just what we have to do. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 11:41 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I don't understand it :-(. I guess that's valuable feedback on its own. > I'll spend more time reading the code around that, but meanwhile, if you > can think of a simpler way to explain it in the comments, that'd be good. One more thing on this: If you force bitmap index scans (by disabling index-only scans and index scans with the "enable_" GUCs), then you get EXPLAIN (ANALYZE, BUFFERS) instrumentation for the index alone (and the heap, separately). No visibility map accesses, which obscure the same numbers for a similar index-only scan. You can then observe that most searches for a single value will touch the bare minimum number of index pages. For example, if there are 3 levels in the index, you should access only 3 index pages total, unless there are literally hundreds of matches and we cannot avoid storing them on more than one leaf page. You'll see that the scan touches the minimum possible number of index pages, because of: * The "many duplicates" strategy. (Not the "single value" strategy, which I incorrectly mentioned in relation to this earlier.) * The !minusinfkey optimization, which ensures that we go to the right of an otherwise-equal pivot tuple in an internal page, rather than left. * The "continuescan" high key patch, which ensures that the scan doesn't go to the right from the first leaf page to try to find even more matches. The high key on the same leaf page will indicate that the scan is over, without actually visiting the sibling. (Again, I'm assuming that your search is for a single value.) -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I came up with the attached (against master), which addresses the 2nd > and 3rd points. I added a whole new BTInsertStateData struct, to hold > the binary search bounds. BTScanInsert now only holds the 'scankeys' > array, and the 'nextkey' flag. The new BTInsertStateData struct also > holds the current buffer we're considering to insert to, and a > 'bounds_valid' flag to indicate if the saved bounds are valid for the > current buffer. That way, it's more straightforward to clear the > 'bounds_valid' flag whenever we move right. > > I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary > search like _bt_binsrch does, but the bounds caching is only done in > _bt_binsrch_insert. Seems more clear to have separate functions for them > now, even though there's some duplication. Attached is v15, which does not yet integrate these changes. However, it does integrate earlier feedback that you posted for v14. I also cleaned up some comments within nbtsplitloc.c. I would like to work through these other items with you (_bt_binsrch_insert() and so on), but I think that it would be helpful if you made an effort to understand the minusinfkey stuff first. I spent a lot of time improving the explanation of that within _bt_compare(). It's important. The !minusinfkey optimization is more than just a "nice to have". Suffix truncation makes pivot tuples less restrictive about what can go on each page, but that might actually hurt performance if we're not also careful to descend directly to the leaf page where matches will first appear (rather than descending to a page to its left). If we needlessly descend to a page that's to the left of the leaf page we really ought to go straight to, then there are cases that are regressed rather than helped -- especially cases where splits use the "many duplicates" strategy. You continue to get correct answers when the !minusinfkey optimization is ripped out, but it seems almost essential that we include it. While it's true that we've always had to move left like this, it's also true that suffix truncation will make it happen much more frequently. It would happen (without the !minusinfkey optimization) most often where suffix truncation makes pivot tuples smallest. Once you grok the minusinfkey stuff, then we'll be in a better position to work through the feedback about _bt_binsrch_insert() and so on, I think. You may lack all of the context of how the second patch goes on to use the new insertion scan key struct, so it will probably make life easier if we're both on the same page. (Pun very much intended.) Thanks again! -- Peter Geoghegan
Attachment
- v15-0001-Refactor-nbtree-insertion-scankeys.patch
- v15-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v15-0005-Add-split-after-new-tuple-optimization.patch
- v15-0004-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v15-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch
- v15-0006-Add-high-key-continuescan-optimization.patch
- v15-0007-DEBUG-Add-pageinspect-instrumentation.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 08/03/2019 12:22, Peter Geoghegan wrote: > I would like to work through these other items with you > (_bt_binsrch_insert() and so on), but I think that it would be helpful > if you made an effort to understand the minusinfkey stuff first. I > spent a lot of time improving the explanation of that within > _bt_compare(). It's important. Ok, after thinking about it for a while, I think I understand the minus infinity stuff now. Let me try to explain it in my own words: Imagine that you have an index with two key columns, A and B. The index has two leaf pages, with the following items: +--------+ +--------+ | Page 1 | | Page 2 | | | | | | 1 1 | | 2 1 | | 1 2 | | 2 2 | | 1 3 | | 2 3 | | 1 4 | | 2 4 | | 1 5 | | 2 5 | +--------+ +--------+ The key space is neatly split on the first key column - probably thanks to the new magic in the page split code. Now, what do we have as the high key of page 1? Answer: "2 -inf". The "-inf" is not stored in the key itself, the second key column is just omitted, and the search code knows to treat it implicitly as a value that's lower than any real value. Hence, "minus infinity". However, during page deletion, we need to perform a search to find the downlink pointing to a leaf page. We do that by using the leaf page's high key as the search key. But the search needs to treat it slightly differently in that case. Normally, searching with a single key value, "2", we would land on page 2, because any real value beginning with "2" would be on that page, but in the page deletion case, we want to find page 1. Setting the BTScanInsert.minusinfkey flag tells the search code to do that. Question: Wouldn't it be more straightforward to use "1 +inf" as page 1's high key? I.e treat any missing columns as positive infinity. That way, the search for page deletion wouldn't need to be treated differently. That's also how this used to work: all tuples on a page used to be <= high key, not strictly < high key. And it would also make the rightmost page less of a special case: you could pretend the rightmost page has a pivot tuple with all columns truncated away, which means positive infinity. You have this comment _bt_split which touches the subject: > /* > * The "high key" for the new left page will be the first key that's going > * to go into the new right page, or possibly a truncated version if this > * is a leaf page split. This might be either the existing data item at > * position firstright, or the incoming tuple. > * > * The high key for the left page is formed using the first item on the > * right page, which may seem to be contrary to Lehman & Yao's approach of > * using the left page's last item as its new high key when splitting on > * the leaf level. It isn't, though: suffix truncation will leave the > * left page's high key fully equal to the last item on the left page when > * two tuples with equal key values (excluding heap TID) enclose the split > * point. It isn't actually necessary for a new leaf high key to be equal > * to the last item on the left for the L&Y "subtree" invariant to hold. > * It's sufficient to make sure that the new leaf high key is strictly > * less than the first item on the right leaf page, and greater than or > * equal to (not necessarily equal to) the last item on the left leaf > * page. > * > * In other words, when suffix truncation isn't possible, L&Y's exact > * approach to leaf splits is taken. (Actually, even that is slightly > * inaccurate. 
A tuple with all the keys from firstright but the heap TID > * from lastleft will be used as the new high key, since the last left > * tuple could be physically larger despite being opclass-equal in respect > * of all attributes prior to the heap TID attribute.) > */ But it doesn't explain why it's done like that. - Heikki
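As a concrete illustration of the invariant described in that quoted comment -- lastleft <= new high key < firstright -- here is a toy split-truncation routine over integer keys. It keeps just enough leading attributes of firstright to distinguish it from lastleft, and borrows lastleft's heap TID as a tie-breaker when every key attribute is equal. This is only a model of the idea under discussion, not the code in the patch:

    #include <stdbool.h>
    #include <stdio.h>

    #define NKEYATTS 3              /* toy index: three key columns */

    typedef struct ToyTuple
    {
        int         keys[NKEYATTS];
        int         heaptid;        /* toy stand-in for the heap TID */
    } ToyTuple;

    typedef struct ToyPivot
    {
        int         natts;          /* key attributes kept (rest read as -inf) */
        int         keys[NKEYATTS];
        bool        has_tid;
        int         heaptid;
    } ToyPivot;

    /*
     * Build the new left-page high key for a split between lastleft and
     * firstright: keep the smallest attribute prefix of firstright that
     * still distinguishes it from lastleft, so that lastleft <= pivot <
     * firstright holds.
     */
    static ToyPivot
    toy_truncate(const ToyTuple *lastleft, const ToyTuple *firstright)
    {
        ToyPivot    pivot = {.natts = 0, .has_tid = false};

        for (int i = 0; i < NKEYATTS; i++)
        {
            pivot.keys[i] = firstright->keys[i];
            pivot.natts = i + 1;
            if (lastleft->keys[i] != firstright->keys[i])
                return pivot;       /* this attribute already breaks the tie */
        }

        /* All key attributes equal: borrow lastleft's heap TID as tie-breaker */
        pivot.has_tid = true;
        pivot.heaptid = lastleft->heaptid;
        return pivot;
    }

    int
    main(void)
    {
        ToyTuple    lastleft = {.keys = {1, 5, 7}, .heaptid = 100};
        ToyTuple    firstright = {.keys = {2, 1, 1}, .heaptid = 101};
        ToyTuple    dupright = {.keys = {1, 5, 7}, .heaptid = 102};

        ToyPivot    p1 = toy_truncate(&lastleft, &firstright);
        ToyPivot    p2 = toy_truncate(&lastleft, &dupright);

        printf("distinct keys: keep %d attribute(s), no TID\n", p1.natts);
        printf("all-equal keys: keep %d attribute(s), TID %d from lastleft\n",
               p2.natts, p2.heaptid);
        return 0;
    }

In the first case the pivot equals firstright on the kept attribute and reads as minus infinity afterwards, so it is strictly greater than lastleft and strictly less than firstright; in the all-equal case the borrowed heap TID plays the same role, which is what the quoted _bt_split() comment describes.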
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Mar 8, 2019 at 2:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Now, what do we have as the high key of page 1? Answer: "2 -inf". The > "-inf" is not stored in the key itself, the second key column is just > omitted, and the search code knows to treat it implicitly as a value > that's lower than any real value. Hence, "minus infinity". Right. > However, during page deletion, we need to perform a search to find the > downlink pointing to a leaf page. We do that by using the leaf page's > high key as the search key. But the search needs to treat it slightly > differently in that case. Normally, searching with a single key value, > "2", we would land on page 2, because any real value beginning with "2" > would be on that page, but in the page deletion case, we want to find > page 1. Setting the BTScanInsert.minusinfkey flag tells the search code > to do that. Right. > Question: Wouldn't it be more straightforward to use "1 +inf" as page > 1's high key? I.e treat any missing columns as positive infinity. That might also work, but it wouldn't be more straightforward on balance. This is because: * We have always taken the new high key from the firstright item, and we also continue to do that on internal pages -- same as before. It would certainly complicate the nbtsplitloc.c code to have to deal with this new special case now (leaf and internal pages would have to have far different handling, not just slightly different handling). * We have always had "-inf" values as the first item on an internal page, which is explicitly truncated to zero attributes as of Postgres v11. It seems ugly to me to make truncated attributes mean negative infinity in that context, but positive infinity in every other context. * Another reason that I prefer "-inf" to "+inf" is that you can imagine an implementation that makes pivot tuples into normalized binary keys, that are truncated using generic/opclass-agnostic logic, and compared using strcmp(). If the scankey binary string is longer than the pivot tuple, then it's greater according to strcmp() -- that just works. And, you can truncate the original binary strings built using opclass infrastructure without having to understand where attributes begin and end (though this relies on encoding things like NULL-ness a certain way). If we define truncation to be "+inf" now, then none of this works. All of that said, maybe it would be clearer if page deletion was not the special case that has to opt in to minusinfkey semantics. Perhaps it would make more sense for everyone else to opt out of minusinfkey semantics, and to get the !minusinfkey optimization as a result of that. I only did it the other way around because that meant that only nbtpage.c had to acknowledge the special case. Even calling it minusinfkey is misleading in one way, because we're not so much searching for "-inf" values as we are searching for the first page that could have tuples for the untruncated attributes. But isn't that how this has always worked, given that we've had to deal with duplicate pivot tuples on the same level before now? As I said, we're not doing an extra thing when minusinfykey is true (during page deletion) -- it's the other way around. Saying that we're searching for minus infinity values for the truncated attributes is kind of a lie, although the search does behave that way. >That way, the search for page deletion wouldn't need to be treated > differently. That's also how this used to work: all tuples on a page > used to be <= high key, not strictly < high key. 
That isn't accurate -- it still works that way on the leaf level. The alternative that you've described is possible, I think, but the key space works just the same with either of our approaches. You've merely thought of an alternative way of generating new high keys that satisfy the same invariants as my own scheme. Provided the new separator for high key is >= last item on the left and < first item on the right, everything works. As you point out, the original Lehman and Yao rule for leaf pages (which Postgres kinda followed before) is that the high key is <= items on the leaf level. But this patch makes nbtree follow that rule fully and properly. Maybe you noticed that amcheck tests < on internal pages, and only checks <= on leaf pages. Perhaps it led you to believe that I did things differently. Actually, this is classic Lehman and Yao. The keys in internal pages are all "separators" as far as Lehman and Yao are concerned, so the high key is less of a special case on internal pages. We check < on internal pages because all separators are supposed to be unique on a level. But, as I said, we do check <= on the leaf level. Take a look at "Fig. 7 A B-Link Tree" in the Lehman and Yao paper if this is unclear. That shows that internal pages have unique keys -- we can therefore expect the high key to be < items in internal pages. It also shows that leaf pages copy the high key from the last item on the left page -- we can expect the high key to be <= items there. Just like with the patch, in effect. The comment from _bt_split() that you quoted explains why what we do is like what Lehman and Yao do when suffix truncation cannot truncate anything -- the new high key on the left page comes from the last item on the left page. > And it would also make > the rightmost page less of a special case: you could pretend the > rightmost page has a pivot tuple with all columns truncated away, which > means positive infinity. But we do already pretend that. How is that not the case already? > But it doesn't explain why it's done like that. It's done this way because that's equivalent to what Lehman and Yao do, while also avoiding adding the special cases that I mentioned (in nbtsplitloc.c, and so on). -- Peter Geoghegan
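The normalized-binary-key argument made earlier in this message can be seen with a few lines of standalone C: when keys are compared bytewise, a truncated key that is a prefix of a longer key always sorts first, which is exactly minus-infinity behavior, and no attribute boundaries need to be known. Defining truncation as plus infinity would not compose with a plain bytewise comparison. This is a toy model under those stated assumptions, not an actual key-normalization implementation:

    #include <stdio.h>
    #include <string.h>

    /*
     * Compare two normalized binary keys the way strcmp()/memcmp() would:
     * byte by byte, with a shorter key that is a prefix of a longer key
     * sorting first.  Truncating a suffix therefore behaves exactly like
     * replacing it with minus infinity.
     */
    static int
    normkey_cmp(const unsigned char *a, size_t alen,
                const unsigned char *b, size_t blen)
    {
        size_t      n = alen < blen ? alen : blen;
        int         cmp = memcmp(a, b, n);

        if (cmp != 0)
            return cmp;
        return (alen > blen) - (alen < blen);   /* prefix sorts first */
    }

    int
    main(void)
    {
        /* Toy normalized keys: first byte is column A, second is column B */
        unsigned char pivot[] = {2};            /* "(2, <truncated>)" */
        unsigned char scan[] = {2, 1};          /* "(2, 1)" */

        /* The scan key is longer and therefore greater: it belongs to the right */
        printf("cmp((2,1), (2,-inf)) = %d\n",
               normkey_cmp(scan, sizeof(scan), pivot, sizeof(pivot)));
        return 0;
    }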
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Question: Wouldn't it be more straightforward to use "1 +inf" as page > > 1's high key? I.e treat any missing columns as positive infinity. > > That might also work, but it wouldn't be more straightforward on > balance. This is because: I thought of another reason: * The 'Add high key "continuescan" optimization' is effective because the high key of a leaf page tends to look relatively dissimilar to other items on the page. The optimization would almost never help if the high key was derived from the lastleft item at the time of a split -- that's no more informative than the lastleft item itself. As things stand with the patch, a high key usually has a value for its last untruncated attribute that can only appear on the page to the right, and never the current page. We'd quite like to be able to conclude that the page to the right can't be interesting there and then, without needing to visit it. Making new leaf high keys "as close as possible to items on the right, without actually touching them" makes the optimization quite likely to work out with the TPC-C indexes, when we search for orderline items for an order that is rightmost of a leaf page in the orderlines primary key. And another reason: * This makes it likely that any new items that would go between the original lastleft and firstright items end up on the right page when they're inserted after the lastleft/firstright split. This is generally a good thing, because we've optimized WAL-logging for new pages that go on the right. (You pointed this out to me in Lisbon, in fact.) -- Peter Geoghegan
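A minimal sketch of the "continuescan" idea from the first bullet above, assuming a forward scan for a single value and toy integer keys rather than the patch's real data structures: every tuple on the right sibling is >= the left page's high key, so the high key alone can prove that the sibling is not worth visiting.

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * When a forward scan for a single value reaches the end of a leaf page,
     * compare the value against the page's high key.  All tuples on the
     * right sibling are >= the high key, so if the searched-for value is
     * below it, the sibling cannot contain matches and need not be visited.
     */
    static bool
    toy_must_visit_right_sibling(int search_value, int high_key)
    {
        return search_value >= high_key;
    }

    int
    main(void)
    {
        /* high key 3 means every tuple to the right is >= 3 */
        printf("searching 2, high key 3: visit sibling? %d\n",
               toy_must_visit_right_sibling(2, 3));
        printf("searching 3, high key 3: visit sibling? %d\n",
               toy_must_visit_right_sibling(3, 3));
        return 0;
    }

The argument in the message is that suffix truncation makes the leaf high key as dissimilar as possible from the items on the page, which makes the first (cheap) outcome far more likely in practice.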
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote: > All of that said, maybe it would be clearer if page deletion was not > the special case that has to opt in to minusinfkey semantics. Perhaps > it would make more sense for everyone else to opt out of minusinfkey > semantics, and to get the !minusinfkey optimization as a result of > that. I only did it the other way around because that meant that only > nbtpage.c had to acknowledge the special case. This seems like a good idea -- we should reframe the !minusinfkey optimization, without actually changing the behavior. Flip it around. The minusinfkey field within the insertion scankey struct would be called something like "descendrighttrunc" instead. Same idea, but with the definition inverted. Most _bt_search() callers (all of those outside of nbtpage.c and amcheck) would be required to opt in to that optimization to get it. Under this arrangement, nbtpage.c/page deletion would not ask for the "descendrighttrunc" optimization, and would therefore continue to do what it has always done: find the first leaf page that its insertion scankey values could be on (we don't lie about searching for negative infinity, or having a negative infinity sentinel value in scan key). The only difference for page deletion between v3 indexes and v4 indexes is that with v4 indexes we'll relocate the same leaf page reliably, since every separator key value is guaranteed to be unique on its level (including the leaf level/leaf high keys). This is just a detail, though, and not one that's even worth pointing out; we're not *relying* on that being true on v4 indexes anyway (we check that the block number is a match too, which is strictly necessary for v3 indexes and seems like a good idea for v4 indexes). This is also good because it makes it clear that the unique index code within _bt_doinsert() (that temporarily sets scantid to NULL) benefits from the descendrighttrunc/!minusinfkey optimization -- it should be "honest" and ask for it explicitly. We can make _bt_doinsert() opt in to the optimization for unique indexes, but not for other indexes, where scantid is set from the start. The descendrighttrunc/!minusinfkey optimization cannot help when scantid is set from the start, because we'll always have an attribute value in insertion scankey that breaks the tie for us instead. We'll always move right of a heap-TID-truncated separator key whose untruncated attributes are all equal to a prefix of our insertion scankey values. (This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for unique indexes matters more than you might think -- we do really badly with things like Zipfian distributions currently, and reducing the contention goes some way towards helping with that. Postgres pro noticed this a couple of years back, and analyzed it in detail at that time. It's really nice that we very rarely have to move right within code like _bt_check_unique() and _bt_findsplitloc() with the patch.) Does that make sense to you? Can you live with the name "descendrighttrunc", or do you have a better one? Thanks -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 08/03/2019 23:21, Peter Geoghegan wrote: > On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan <pg@bowt.ie> wrote: >> All of that said, maybe it would be clearer if page deletion was not >> the special case that has to opt in to minusinfkey semantics. Perhaps >> it would make more sense for everyone else to opt out of minusinfkey >> semantics, and to get the !minusinfkey optimization as a result of >> that. I only did it the other way around because that meant that only >> nbtpage.c had to acknowledge the special case. > > This seems like a good idea -- we should reframe the !minusinfkey > optimization, without actually changing the behavior. Flip it around. > > The minusinfkey field within the insertion scankey struct would be > called something like "descendrighttrunc" instead. Same idea, but with > the definition inverted. Most _bt_search() callers (all of those > outside of nbtpage.c and amcheck) would be required to opt in to that > optimization to get it. > > Under this arrangement, nbtpage.c/page deletion would not ask for the > "descendrighttrunc" optimization, and would therefore continue to do > what it has always done: find the first leaf page that its insertion > scankey values could be on (we don't lie about searching for negative > infinity, or having a negative infinity sentinel value in scan key). > The only difference for page deletion between v3 indexes and v4 > indexes is that with v4 indexes we'll relocate the same leaf page > reliably, since every separator key value is guaranteed to be unique > on its level (including the leaf level/leaf high keys). This is just a > detail, though, and not one that's even worth pointing out; we're not > *relying* on that being true on v4 indexes anyway (we check that the > block number is a match too, which is strictly necessary for v3 > indexes and seems like a good idea for v4 indexes). > > This is also good because it makes it clear that the unique index code > within _bt_doinsert() (that temporarily sets scantid to NULL) benefits > from the descendrighttrunc/!minusinfkey optimization -- it should be > "honest" and ask for it explicitly. We can make _bt_doinsert() opt in > to the optimization for unique indexes, but not for other indexes, > where scantid is set from the start. The > descendrighttrunc/!minusinfkey optimization cannot help when scantid > is set from the start, because we'll always have an attribute value in > insertion scankey that breaks the tie for us instead. We'll always > move right of a heap-TID-truncated separator key whose untruncated > attributes are all equal to a prefix of our insertion scankey values. > > (This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for > unique indexes matters more than you might think -- we do really badly > with things like Zipfian distributions currently, and reducing the > contention goes some way towards helping with that. Postgres pro > noticed this a couple of years back, and analyzed it in detail at that > time. It's really nice that we very rarely have to move right within > code like _bt_check_unique() and _bt_findsplitloc() with the patch.) > > Does that make sense to you? Can you live with the name > "descendrighttrunc", or do you have a better one? "descendrighttrunc" doesn't make much sense to me, either. I don't understand it. Maybe a comment would make it clear, though. I don't feel like this is an optimization. It's natural consequence of what the high key means. 
I guess you can think of it as an optimization, in the same way that not fully scanning the whole index for every search is an optimization, but that's not how I think of it :-). If we don't flip the meaning of the flag, then maybe calling it something like "searching_for_leaf_page" would make sense: /* * When set, we're searching for the leaf page with the given high key, * rather than leaf tuples matching the search keys. * * Normally, when !searching_for_pivot_tuple, if a page's high key * has truncated columns, and the columns that are present are equal to * the search key, the search will not descend to that page. For * example, if an index has two columns, and a page's high key is * ("foo", <omitted>), and the search key is also ("foo," <omitted>), * the search will not descend to that page, but its right sibling. The * omitted column in the high key means that all tuples on the page must * be strictly < "foo", so we don't need to visit it. However, sometimes * we perform a search to find the parent of a leaf page, using the leaf * page's high key as the search key. In that case, when we search for * ("foo", <omitted>), we do want to land on that page, not its right * sibling. */ bool searching_for_leaf_page; As the patch stands, you're also setting minusinfkey when dealing with v3 indexes. I think it would be better to only set searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert should describe the search key, regardless of whether it's a V3 or V4 index. Properties of the index belong elsewhere. (We're violating that by storing the 'heapkeyspace' flag in BTScanInsert. That wart is probably OK, it is pretty convenient to have it there. But in principle...) - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > "descendrighttrunc" doesn't make much sense to me, either. I don't > understand it. Maybe a comment would make it clear, though. It's not an easily grasped concept. I don't think that any name will easily convey the idea to the reader, though. I'm happy to go with whatever name you prefer. > I don't feel like this is an optimization. It's natural consequence of > what the high key means. I guess you can think of it as an optimization, > in the same way that not fully scanning the whole index for every search > is an optimization, but that's not how I think of it :-). I would agree with this in a green field situation, where we don't have to consider the legacy of v3 indexes. But that's not the case here. > If we don't flip the meaning of the flag, then maybe calling it > something like "searching_for_leaf_page" would make sense: > > /* > * When set, we're searching for the leaf page with the given high key, > * rather than leaf tuples matching the search keys. > * > * Normally, when !searching_for_pivot_tuple, if a page's high key I guess you meant to say "searching_for_pivot_tuple" both times (not "searching_for_leaf_page"). After all, we always search for a leaf page. :-) I'm fine with "searching_for_pivot_tuple", I think. The underscores are not really stylistically consistent with other stuff in nbtree.h, but I can use something very similar to your suggestion that is consistent. > * has truncated columns, and the columns that are present are equal to > * the search key, the search will not descend to that page. For > * example, if an index has two columns, and a page's high key is > * ("foo", <omitted>), and the search key is also ("foo," <omitted>), > * the search will not descend to that page, but its right sibling. The > * omitted column in the high key means that all tuples on the page must > * be strictly < "foo", so we don't need to visit it. However, sometimes > * we perform a search to find the parent of a leaf page, using the leaf > * page's high key as the search key. In that case, when we search for > * ("foo", <omitted>), we do want to land on that page, not its right > * sibling. > */ > bool searching_for_leaf_page; That works for me (assuming you meant searching_for_pivot_tuple). > As the patch stands, you're also setting minusinfkey when dealing with > v3 indexes. I think it would be better to only set > searching_for_leaf_page in nbtpage.c. That would mean I would have to check both heapkeyspace and minusinfkey within _bt_compare(). I would rather just keep the assertion that makes sure that !heapkeyspace callers are also minusinfkey callers, and the comments that explain why that is. It might even matter to performance -- having an extra condition in _bt_compare() is something we should avoid. > In general, I think BTScanInsert > should describe the search key, regardless of whether it's a V3 or V4 > index. Properties of the index belong elsewhere. (We're violating that > by storing the 'heapkeyspace' flag in BTScanInsert. That wart is > probably OK, it is pretty convenient to have it there. But in principle...) The idea with pg_upgrade'd v3 indexes is, as I said a while back, that they too have a heap TID attribute. nbtsearch.c code is not allowed to rely on its value, though, and must use minusinfkey/searching_for_pivot_tuple semantics (relying on its value being minus infinity is still relying on its value being something). 
Now, it's also true that there are a number of things that we have to do within nbtinsert.c for !heapkeyspace that don't really concern the key space as such. Even so, thinking about everything with reference to the keyspace, and keeping that as similar as possible between v3 and v4, is a good thing. It is up to high level code (such as _bt_first()) to not allow a !heapkeyspace index scan to do something that won't work for it. It is not up to low level code like _bt_compare() to worry about these differences (beyond asserting that the caller got it right). If page deletion didn't need minusinfkey semantics (if nobody but v3 indexes needed that), then I would just make the "move right of separator" !minusinfkey code within _bt_compare() test heapkeyspace. But we do have a general need for minusinfkey semantics, so it seems simpler and more future-proof to keep heapkeyspace out of low-level nbtsearch.c code.
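The trade-off being described here -- an assertion that constrains callers versus an extra test in a hot code path -- can be sketched as follows. This is only an illustration of the reasoning; the flag and function names are invented for the example and are not the patch's:

    #include <assert.h>
    #include <stdbool.h>

    typedef struct ToyScanKey
    {
        bool        heapkeyspace;   /* v4 index: heap TID is part of key space */
        bool        minusinfkey;    /* page-deletion style search */
    } ToyScanKey;

    /*
     * Alternative A (the approach argued for above): require callers on v3
     * indexes to set minusinfkey, so the hot comparison path only ever tests
     * one flag.  The relationship between the flags is enforced with an
     * assertion rather than a runtime branch.
     */
    static bool
    descend_right_of_equal_pivot_A(const ToyScanKey *key)
    {
        assert(key->heapkeyspace || key->minusinfkey);
        return !key->minusinfkey;
    }

    /*
     * Alternative B: let v3 callers leave minusinfkey unset and test both
     * flags on every comparison -- one more branch in a very hot code path.
     */
    static bool
    descend_right_of_equal_pivot_B(const ToyScanKey *key)
    {
        return key->heapkeyspace && !key->minusinfkey;
    }

    int
    main(void)
    {
        ToyScanKey  v4search = {.heapkeyspace = true, .minusinfkey = false};

        return descend_right_of_equal_pivot_A(&v4search) ==
               descend_right_of_equal_pivot_B(&v4search) ? 0 : 1;
    }

Both alternatives behave identically for legal callers; the disagreement in the thread is only about which one is clearer and where the cost of the extra check should live.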
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 10/03/2019 20:53, Peter Geoghegan wrote: > On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> If we don't flip the meaning of the flag, then maybe calling it >> something like "searching_for_leaf_page" would make sense: >> >> /* >> * When set, we're searching for the leaf page with the given high key, >> * rather than leaf tuples matching the search keys. >> * >> * Normally, when !searching_for_pivot_tuple, if a page's high key > > I guess you meant to say "searching_for_pivot_tuple" both times (not > "searching_for_leaf_page"). After all, we always search for a leaf > page. :-) Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but changed to "searching_for_leaf_page" at the last minute. My thinking was that in the page-deletion case, you're trying to re-locate a particular leaf page. Otherwise, you're searching for matching tuples, not a particular page. Although during insertion, I guess you are also searching for a particular page, the page to insert to. >> As the patch stands, you're also setting minusinfkey when dealing with >> v3 indexes. I think it would be better to only set >> searching_for_leaf_page in nbtpage.c. > > That would mean I would have to check both heapkeyspace and > minusinfkey within _bt_compare(). Yeah. > I would rather just keep the > assertion that makes sure that !heapkeyspace callers are also > minusinfkey callers, and the comments that explain why that is. It > might even matter to performance -- having an extra condition in > _bt_compare() is something we should avoid. It's a hot codepath, but I doubt it's *that* hot that it matters, performance-wise... >> In general, I think BTScanInsert >> should describe the search key, regardless of whether it's a V3 or V4 >> index. Properties of the index belong elsewhere. (We're violating that >> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is >> probably OK, it is pretty convenient to have it there. But in principle...) > > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that > they too have a heap TID attribute. nbtsearch.c code is not allowed to > rely on its value, though, and must use > minusinfkey/searching_for_pivot_tuple semantics (relying on its value > being minus infinity is still relying on its value being something). Yeah. I find that's a complicated way to think about it. My mental model is that v4 indexes store heap TIDs, and every tuple is unique thanks to that. But on v3, we don't store heap TIDs, and duplicates are possible. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but > changed to "searching_for_leaf_page" at the last minute. My thinking was > that in the page-deletion case, you're trying to re-locate a particular > leaf page. Otherwise, you're searching for matching tuples, not a > particular page. Although during insertion, I guess you are also > searching for a particular page, the page to insert to. I prefer something like "searching_for_pivot_tuple", because it's unambiguous. Okay with that? > It's a hot codepath, but I doubt it's *that* hot that it matters, > performance-wise... I'll figure that out. Although I am currently looking into a regression in workloads that fit in shared_buffers, that my micro-benchmarks didn't catch initially. Indexes are still much smaller, but we get a ~2% regression all the same. OTOH, we get a 7.5%+ increase in throughput when the workload is I/O bound, and latency is generally no worse and even better with any workload. I suspect that the nice top-down approach to nbtsplitloc.c has its costs...will let you know more when I know more. > > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that > > they too have a heap TID attribute. nbtsearch.c code is not allowed to > > rely on its value, though, and must use > > minusinfkey/searching_for_pivot_tuple semantics (relying on its value > > being minus infinity is still relying on its value being something). > > Yeah. I find that's a complicated way to think about it. My mental model > is that v4 indexes store heap TIDs, and every tuple is unique thanks to > that. But on v3, we don't store heap TIDs, and duplicates are possible. I'll try it that way, then. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 1:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that > > > they too have a heap TID attribute. nbtsearch.c code is not allowed to > > > rely on its value, though, and must use > > > minusinfkey/searching_for_pivot_tuple semantics (relying on its value > > > being minus infinity is still relying on its value being something). > > > > Yeah. I find that's a complicated way to think about it. My mental model > > is that v4 indexes store heap TIDs, and every tuple is unique thanks to > > that. But on v3, we don't store heap TIDs, and duplicates are possible. > > I'll try it that way, then. Attached is v16, which does it that way instead. There are simpler comments, still located within _bt_compare(). These are based on your suggested wording, with some changes. I think that I prefer it this way too. Please let me know what you think. Other changes: * nbtsplitloc.c failed to consider the full range of values in the split interval when deciding perfect penalty. It considered the range from the middle to the left or right edge, rather than from the left edge to the right edge. This didn't seem to really affect the quality of its decisions very much, but it was still wrong. This is fixed by a new function that determines the left and right edges of the split interval -- _bt_interval_edges(). * We now record the smallest observed tuple during our pass over the page to record split points. This is used by internal page splits, to get a more useful "perfect penalty", saving cycles in the common case where there isn't much variability in the size of tuples on the page being split. The same field is used within the "split after new tuple" optimization as a further crosscheck -- it's now impossible to fool it into thinking that the page has equisized tuples. The regression that I mentioned earlier isn't in pgbench type workloads (even when the distribution is something more interesting than the uniform distribution default). It is only in workloads with lots of page splits and lots of index churn, where we get most of the benefit of the patch, but also where the costs are most apparent. Hopefully it can be fixed, but if not I'm inclined to think that it's a price worth paying. This certainly still needs further analysis and discussion, though. This revision of the patch does not attempt to address that problem in any way. -- Peter Geoghegan
Attachment
- v16-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v16-0004-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v16-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v16-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch
- v16-0001-Refactor-nbtree-insertion-scankeys.patch
- v16-0005-Add-split-after-new-tuple-optimization.patch
- v16-0006-Add-high-key-continuescan-optimization.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > The regression that I mentioned earlier isn't in pgbench type > workloads (even when the distribution is something more interesting > that the uniform distribution default). It is only in workloads with > lots of page splits and lots of index churn, where we get most of the > benefit of the patch, but also where the costs are most apparent. > Hopefully it can be fixed, but if not I'm inclined to think that it's > a price worth paying. This certainly still needs further analysis and > discussion, though. This revision of the patch does not attempt to > address that problem in any way. I believe that I've figured out what's going on here. At first, I thought that this regression was due to the cycles that have been added to page splits, but that doesn't seem to be the case at all. Nothing that I did to make page splits faster helped (e.g. temporarily go back to doing them "bottom up" made no difference). CPU utilization was consistently slightly *higher* with the master branch (patch spent slightly more CPU time idle). I now believe that the problem is with LWLock/buffer lock contention on index pages, and that that's an inherent cost with a minority of write-heavy high contention workloads. A cost that we should just accept. Making the orderline primary key about 40% smaller increases contention when BenchmarkSQL is run with this particular configuration. The latency for the NEW_ORDER transaction went from ~4ms average on master to ~5ms average with the patch, while the latency for other types of transactions was either unchanged or improved. It's noticeable, but not that noticeable. This needs to be put in context. The final transactions per minute for this configuration was 250,000, with a total of only 100 warehouses. What this boils down to is that the throughput per warehouse is about 8000% of the maximum valid NOPM specified by the TPC-C spec [1]. In other words, the database is too small relative to the machine, by a huge amount, making the result totally and utterly invalid if you go on what the TPC-C spec says. This exaggerates the LWLock/buffer lock contention on index pages. TPC-C is supposed to simulate a real use case with a plausible configuration, but the details here are totally unrealistic. For example, there are 3 million customers here (there are always 30k customers per warehouse). 250k TPM means that there were about 112k new orders per minute. It's hard to imagine a population of 3 million customers making 112k orders per minute. That's over 20 million orders in the first 3 hour long run that I got these numbers from. Each of these orders has an average of about 10 line items. These people must be very busy, and must have an awful lot of storage space in their homes! (There are various other factors here, such as skew, and the details will never be completely realistic anyway, but you take my point. TPC-C is *designed* to be a realistic distillation of a real use case, going so far as to require usable GUI input terminals when evaluating a formal benchmark submission.) The benchmark that I posted in mid-February [2] (which showed better performance across the board) was much closer to what the TPC-C spec requires -- that was only ~400% of maximum valid NOPM (the BenchmarkSQL html reports will tell you this if you download the archive I posted), and had 2,000 warehouses. TPC-C is *supposed* to be I/O bound, and I/O bound workloads are what the patch helps with the most. 
The general idea with TPC-C's NOPM is that you're required to increase the number of warehouses as throughput increases. This stops you from getting an unrealistically favorable result by churning through a small amount of data, from the same few warehouses. The only benchmark that I ran that actually satisfied TPC-C's NOPM requirements had a total of 7,000 warehouses, and was almost a full terabyte in size on the master branch. This was run on an i3.4xlarge high I/O AWS ec2 instance. That was substantially I/O bound, and had an improvement in throughput that was very similar to the mid-February results which came from my home server -- we see a ~7.5% increase in transaction throughput after a few hours. I attach a graph of block device reads/writes for the second 4 hour run for this same 7,000 warehouse benchmark (master and patch). This shows a substantial reduction in I/O according to OS-level instrumentation. (Note that the same FS/logical block device was used for both WAL and database files.) In conclusion: I think that this regression is a cost worth accepting. The regression in throughput is relatively small (2% - 3%), and the NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens to be used for 45% of all transactions with TPC-C, and inserts into the largest index by far, without reading much). The patch overtakes master after a few hours anyway -- the patch will still win after about 6 hours, once the database gets big enough, despite all the contention. As I said, I think that we see a regression *because* the indexes are much smaller, not in spite of the fact that they're smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40% smaller than they are on master, no matter the details, even after many hours. I'm not seeing the problem when pgbench is run with a small scale factor but with a high client count. pgbench doesn't have the benefit of much smaller indexes, so it also doesn't bear any cost when contention is ramped up. The pgbench_accounts primary key (which is by far the largest index) is *precisely* the same size as it is on master, though the other indexes do seem to be a lot smaller. They were already tiny, though. OTOH, the TPC-C NEW_ORDER transaction does a lot of straight inserts, localized by warehouse, with skewed access. [1] https://youtu.be/qYeRHK6oq7g?t=1340 [2] https://www.postgresql.org/message-id/CAH2-WzmsK-1qVR8xC86DXv8U0cHwfPcuH6hhA740fCeEu3XsVg@mail.gmail.com -- Peter Geoghegan
Attachment
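To put rough numbers on the NOPM point (the ~12.86 new-orders-per-minute-per-warehouse ceiling is the TPC-C spec's usual limit, not something stated in the message above, so treat it as an approximation):

    ~112,000 new orders/minute / 100 warehouses ~= 1,120 NOPM per warehouse
    1,120 / ~12.86 (spec ceiling per warehouse)  ~= 87x, i.e. roughly 8,000-9,000%

By contrast, the mid-February run's ~400% of maximum valid NOPM with 2,000 warehouses is only a few times over that ceiling, which is consistent with it behaving so differently.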
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Heikki Linnakangas
Date:
On 12/03/2019 04:47, Peter Geoghegan wrote: > In conclusion: I think that this regression is a cost worth accepting. > The regression in throughput is relatively small (2% - 3%), and the > NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens > to be used for 45% of all transactions with TPC-C, and inserts into > the largest index by far, without reading much). The patch overtakes > master after a few hours anyway -- the patch will still win after > about 6 hours, once the database gets big enough, despite all the > contention. As I said, I think that we see a regression*because* the > indexes are much smaller, not in spite of the fact that they're > smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40% > smaller than they are on master, no matter the details, even after > many hours. Yeah, that's fine. I'm curious, though, could you bloat the indexes back to the old size by setting the fillfactor? - Heikki
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Mon, Mar 11, 2019 at 11:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Yeah, that's fine. I'm curious, though, could you bloat the indexes back > to the old size by setting the fillfactor? I think that that might work, though it's hard to say for sure offhand. The "split after new item" optimization is supposed to be a variation of rightmost splits, of course. We apply fillfactor in the same way much of the time. You would still literally split immediately after the new item some of the time, though, which makes it unclear how much bloat there would be without testing it. Some indexes mostly apply fillfactor in non-rightmost pages, while other indexes mostly split at the exact point past the new item, depending on details like the size of the groupings. I am currently doing a multi-day 6,000 warehouse benchmark, since I want to be sure that the bloat resistance will hold up over days. I think that it will, because there aren't that many updates, and they're almost all HOT-safe. I'll put the idea of a 50/50 fillfactor benchmark with the high-contention/regressed workload on my TODO list, though. -- Peter Geoghegan
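A sketch of how that fillfactor experiment might be set up (untested; the 50% figure and the index name are illustrative placeholders, not anything decided in this thread):

    -- Illustrative only: pad leaf pages so the patched index ends up roughly
    -- as large as it was on master, then re-run the regressed workload.
    ALTER INDEX bmsql_order_line_pkey SET (fillfactor = 50);
    REINDEX INDEX bmsql_order_line_pkey;  -- rebuild so the new fillfactor takes effect

    -- Compare index size before and after the benchmark run:
    SELECT pg_size_pretty(pg_relation_size('bmsql_order_line_pkey'));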
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Robert Haas
Date:
On Mon, Mar 11, 2019 at 10:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote: > > The regression that I mentioned earlier isn't in pgbench type > > workloads (even when the distribution is something more interesting > > that the uniform distribution default). It is only in workloads with > > lots of page splits and lots of index churn, where we get most of the > > benefit of the patch, but also where the costs are most apparent. > > Hopefully it can be fixed, but if not I'm inclined to think that it's > > a price worth paying. This certainly still needs further analysis and > > discussion, though. This revision of the patch does not attempt to > > address that problem in any way. > > I believe that I've figured out what's going on here. > > At first, I thought that this regression was due to the cycles that > have been added to page splits, but that doesn't seem to be the case > at all. Nothing that I did to make page splits faster helped (e.g. > temporarily go back to doing them "bottom up" made no difference). CPU > utilization was consistently slightly *higher* with the master branch > (patch spent slightly more CPU time idle). I now believe that the > problem is with LWLock/buffer lock contention on index pages, and that > that's an inherent cost with a minority of write-heavy high contention > workloads. A cost that we should just accept. If I wanted to try to say this in fewer words, would it be fair to say that reducing the size of an index by 40% without changing anything else can increase contention on the remaining pages? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:32 AM Robert Haas <robertmhaas@gmail.com> wrote: > If I wanted to try to say this in fewer words, would it be fair to say > that reducing the size of an index by 40% without changing anything > else can increase contention on the remaining pages? Yes. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Robert Haas
Date:
On Tue, Mar 12, 2019 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Mar 12, 2019 at 11:32 AM Robert Haas <robertmhaas@gmail.com> wrote: > > If I wanted to try to say this in fewer words, would it be fair to say > > that reducing the size of an index by 40% without changing anything > > else can increase contention on the remaining pages? > > Yes. Hey, I understood something today! I think it's pretty clear that we have to view that as acceptable. I mean, we could reduce contention even further by finding a way to make indexes 40% larger, but I think it's clear that nobody wants that. Now, maybe in the future we'll want to work on other techniques for reducing contention, but I don't think we should make that the problem of this patch, especially because the regressions are small and go away after a few hours of heavy use. We should optimize for the case where the user intends to keep the database around for years, not hours. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hey, I understood something today! And I said something that could be understood! > I think it's pretty clear that we have to view that as acceptable. I > mean, we could reduce contention even further by finding a way to make > indexes 40% larger, but I think it's clear that nobody wants that. > Now, maybe in the future we'll want to work on other techniques for > reducing contention, but I don't think we should make that the problem > of this patch, especially because the regressions are small and go > away after a few hours of heavy use. We should optimize for the case > where the user intends to keep the database around for years, not > hours. I think so too. There is a feature in other database systems called "reverse key indexes", which deals with this problem in a rather extreme way. This situation is very similar to the situation with rightmost page splits, where fillfactor is applied to pack leaf pages full. The only difference is that there are multiple groupings, not just one single implicit grouping (everything in the index). You could probably make very similar observations about rightmost page splits that apply leaf fillfactor. Here is an example of how the largest index looks for master with the 100 warehouse case that was slightly regressed:

 table_name       | index_name            | page_type |    npages | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+-----------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |         1 |         54.000 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | I         |    11,482 |        143.200 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 1,621,316 |        139.458 |          0.003 |        24.000

Here is what we see with the patch:

 table_name       | index_name            | page_type |  npages | avg_live_items | avg_dead_items | avg_item_size
------------------+-----------------------+-----------+---------+----------------+----------------+---------------
 bmsql_order_line | bmsql_order_line_pkey | R         |       1 |         29.000 |          0.000 |        22.000
 bmsql_order_line | bmsql_order_line_pkey | I         |   5,957 |        159.149 |          0.000 |        23.000
 bmsql_order_line | bmsql_order_line_pkey | L         | 936,170 |        233.496 |          0.052 |        23.999

REINDEX would leave bmsql_order_line_pkey with 262 items, and we see here that it has 233 after several hours, which is pretty good given the amount of contention. The index actually looks very much like it was just REINDEXED when initial bulk loading finishes, before we get any updates, even though that happens using retail insertions. Notice that the number of internal pages is reduced by almost a full 50% -- it's somewhat better than the reduction in the number of leaf pages, because the benefits compound (items in the root are even a bit smaller, because of this compounding effect, despite alignment effects). Internal pages are the most important pages to have cached, but also potentially the biggest points of contention. -- Peter Geoghegan
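The per-page-type summaries above appear to come from the DEBUG pageinspect instrumentation patch that shows up later in the series; something similar can be approximated with stock pageinspect, along these lines (untested sketch; assumes the default 8kB block size):

    -- Approximate per-page-type statistics for a B-Tree index.
    -- Block 0 is the metapage, so it is skipped.
    CREATE EXTENSION IF NOT EXISTS pageinspect;

    SELECT s.type AS page_type,
           count(*) AS npages,
           round(avg(s.live_items), 3) AS avg_live_items,
           round(avg(s.dead_items), 3) AS avg_dead_items,
           round(avg(s.avg_item_size), 3) AS avg_item_size
    FROM generate_series(1, pg_relation_size('bmsql_order_line_pkey') / 8192 - 1) AS blkno,
         LATERAL bt_page_stats('bmsql_order_line_pkey', blkno::int) AS s
    GROUP BY s.type
    ORDER BY npages;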
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andres Freund
Date:
Hi, On 2019-03-11 19:47:29 -0700, Peter Geoghegan wrote: > I now believe that the problem is with LWLock/buffer lock contention > on index pages, and that that's an inherent cost with a minority of > write-heavy high contention workloads. A cost that we should just > accept. Have you looked at an offwake or lwlock wait graph (bcc tools) or something in that vein? Would be interesting to see what is waiting for what most often... Greetings, Andres Freund
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 12:40 PM Andres Freund <andres@anarazel.de> wrote: > Have you looked at an offwake or lwlock wait graph (bcc tools) or > something in that vein? Would be interesting to see what is waiting for > what most often... Not recently, though I did use your BCC script for this very purpose quite a few months ago. I don't remember it helping that much at the time, but then that was with a version of the patch that lacked a couple of important optimizations that we have now. We're now very careful to not descend to the left with an equal pivot tuple. We descend right instead when that's definitely the only place we'll find matches (a high key doesn't count as a match in almost all cases!). Edge-cases where we unnecessarily move left then right, or unnecessarily move right a second time once on the leaf level have been fixed. I fixed the regression I was worried about at the time, without getting much benefit from the BCC script, and moved on. This kind of minutiae is more important than it sounds. I have used EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I understand where every single block access comes from with these edge-cases, paying close attention to the structure of the index, and how the key space is broken up (the values of pivot tuples in internal pages). It is one thing to make the index smaller, and another thing to take full advantage of that -- I have both. This is one of the reasons why I believe that this minor regression cannot be avoided, short of simply allowing the index to get bloated: I'm simply not doing things that differently outside of the page split code, and what I am doing differently is clearly superior. Both in general, and for the NEW_ORDER transaction in particular. I'll make that another TODO item -- this regression will be revisited using BCC instrumentation. I am currently performing a multi-day benchmark on a very large TPC-C/BenchmarkSQL database, and it will have to wait for that. (I would like to use the same environment as before.) -- Peter Geoghegan
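For readers following along, the EXPLAIN instrumentation mentioned above is just the stock facility; for example (table and column names are from the BenchmarkSQL schema, quoted from memory):

    -- Show how many buffers a single index descent touches ("shared hit"
    -- vs. "read" in the Buffers line of the output).
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM bmsql_order_line
    WHERE ol_w_id = 1 AND ol_d_id = 1 AND ol_o_id = 42;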
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Andres Freund
Date:
On 2019-03-12 14:15:06 -0700, Peter Geoghegan wrote: > On Tue, Mar 12, 2019 at 12:40 PM Andres Freund <andres@anarazel.de> wrote: > > Have you looked at an offwake or lwlock wait graph (bcc tools) or > > something in that vein? Would be interesting to see what is waiting for > > what most often... > > Not recently, though I did use your BCC script for this very purpose > quite a few months ago. I don't remember it helping that much at the > time, but then that was with a version of the patch that lacked a > couple of important optimizations that we have now. We're now very > careful to not descend to the left with an equal pivot tuple. We > descend right instead when that's definitely the only place we'll find > matches (a high key doesn't count as a match in almost all cases!). > Edge-cases where we unnecessarily move left then right, or > unnecessarily move right a second time once on the leaf level have > been fixed. I fixed the regression I was worried about at the time, > without getting much benefit from the BCC script, and moved on. > > This kind of minutiae is more important than it sounds. I have used > EXPLAIN(ANALYZE, BUFFERS) instrumentation to make sure that I > understand where every single block access comes from with these > edge-cases, paying close attention to the structure of the index, and > how the key space is broken up (the values of pivot tuples in internal > pages). It is one thing to make the index smaller, and another thing > to take full advantage of that -- I have both. This is one of the > reasons why I believe that this minor regression cannot be avoided, > short of simply allowing the index to get bloated: I'm simply not > doing things that differently outside of the page split code, and what > I am doing differently is clearly superior. Both in general, and for > the NEW_ORDER transaction in particular. > > I'll make that another TODO item -- this regression will be revisited > using BCC instrumentation. I am currently performing a multi-day > benchmark on a very large TPC-C/BenchmarkSQL database, and it will > have to wait for that. (I would like to use the same environment as > before.) I'm basically just curious which buffers have most of the additional contention. Is it the lower number of leaf pages, the inner pages, or (somewhat unexplicably) the meta page, or ...? I was thinking that the callstack that e.g. my lwlock tool gives should be able to explain what callstack most of the waits are occuring on. (I should work a bit on that script, I locally had a version that showed both waiters and the waking up callstack, but I don't find it anymore) Greetings, Andres Freund
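Not a replacement for the bcc/off-CPU tooling being discussed here, but a cruder first pass is to repeatedly sample wait events while the benchmark runs, for example:

    -- Sample wait events; buffer content lock waits show up as
    -- LWLock / buffer_content.  This won't say which page or level is hot,
    -- which is exactly what the call-stack tooling would add.
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE wait_event IS NOT NULL
    GROUP BY 1, 2
    ORDER BY count(*) DESC;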
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 2:22 PM Andres Freund <andres@anarazel.de> wrote: > I'm basically just curious which buffers have most of the additional > contention. Is it the lower number of leaf pages, the inner pages, or > (somewhat unexplicably) the meta page, or ...? I was thinking that the > callstack that e.g. my lwlock tool gives should be able to explain what > callstack most of the waits are occuring on. Right -- that's exactly what I'm interested in, too. If we can characterize the contention in terms of the types of nbtree blocks that are involved (their level), that could be really helpful. There are 200x+ more leaf blocks than internal blocks, so the internal blocks are a lot hotter. But there are also a lot fewer splits of internal pages, because you need hundreds of leaf page splits to get one internal split. Is the problem contention caused by internal page splits, or is it contention in internal pages that can be traced back to leaf splits, which insert a downlink into their parent page? It would be really cool to have some idea of the answers to questions like these. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary > search like _bt_binsrch does, but the bounds caching is only done in > _bt_binsrch_insert. Seems more clear to have separate functions for them > now, even though there's some duplication.

>     /*
>      * Do the insertion. First move right to find the correct page to
>      * insert to, if necessary. If we're inserting to a non-unique index,
>      * _bt_search() already did this when it checked if a move to the
>      * right was required for leaf page. Insertion scankey's scantid
>      * would have been filled out at the time. On a unique index, the
>      * current buffer is the first buffer containing duplicates, however,
>      * so we may need to move right to the correct location for this
>      * tuple.
>      */
>     if (checkingunique || itup_key->heapkeyspace)
>         _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
>     newitemoff = _bt_binsrch_insert(rel, &insertstate);

> The attached new version simplifies this, IMHO. The bounds and the > current buffer go together in the same struct, so it's easier to keep > track whether the bounds are valid or not.

Now that you have a full understanding of how the negative infinity sentinel values work, and how page deletion's leaf page search and !heapkeyspace indexes need to be considered, I think that we should come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is that you now have a full understanding of all the subtleties of the patch, including those that affect unique index insertion. That will make it much easier to talk about these unresolved questions.

My current sense is that it isn't useful to store the current buffer alongside the binary search bounds/hint. It'll hardly ever need to be invalidated, because we'll hardly ever have to move right within _bt_findsplitloc() when doing unique index insertion (as I said before, the regression tests *never* have to do this according to gcov). We're talking about a very specific set of conditions here, so I'd like something that's lightweight and specialized. I agree that the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't think of anything that's better offhand. Perhaps you can suggest something that is both lightweight, and an improvement on savebinsrch/restorebinsrch.

I'm of the opinion that having a separate _bt_binsrch_insert() does not make anything clearer. Actually, I think that saving the bounds within the original _bt_binsrch() makes the design of that function clearer, not less clear. It's all quite confusing at the moment, given the rightmost/!leaf/page empty special cases. Seeing how the bounds are reused (or not reused) outside of _bt_binsrch() helps with that.

The first 3 patches seem committable now, but I think that it's important to be sure that I've addressed everything you raised satisfactorily before pushing. Or that everything works in a way that you can live with, at least. It would be great if you could take a look at the 'Add high key "continuescan" optimization' patch, which is the only one you haven't commented on so far (excluding the amcheck "relocate" patch, which is less important). I can put that one off for a while after the first 3 go in. I will also put off the "split after new item" commit for at least a week or two.
I'm sure that the idea behind the "continuescan" patch will now seem pretty obvious to you -- it's just taking advantage of the high key when an index scan on the leaf level (which uses a search style scankey, not an insertion style scankey) looks like it may have to move to the next leaf page, but we'd like to avoid it where possible. Checking the high key there is now much more likely to result in the index scan not going to the next page, since we're more careful when considering a leaf split point these days. The high key often looks like the items on the page to the right, not the items on the same page. Thanks -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think it's pretty clear that we have to view that as acceptable. I > mean, we could reduce contention even further by finding a way to make > indexes 40% larger, but I think it's clear that nobody wants that. I found this analysis of bloat in the production database of Gitlab in January 2019 fascinating: https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/ They determined that their tables consisted of about 2% bloat, whereas indexes were 51% bloat (determined by running VACUUM FULL, and measuring how much smaller indexes and tables were afterwards). That in itself may not be that telling. What is telling is the index bloat disproportionately affects certain kinds of indexes. As they put it, "Indexes that do not serve a primary key constraint make up 95% of the overall index bloat". In other words, the vast majority of all bloat occurs within non-unique indexes, with most remaining bloat in unique indexes. One factor that could be relevant is that unique indexes get a lot more opportunistic LP_DEAD killing. Unique indexes don't rely on the similar-but-distinct kill_prior_tuple optimization -- a lot more tuples can be killed within _bt_check_unique() than with kill_prior_tuple in realistic cases. That's probably not really that big a factor, though. It seems almost certain that "getting tired" is the single biggest problem. The blog post drills down further, and cites the examples of several *extremely* bloated indexes on a single-column, which is obviously low cardinality. This includes an index on a boolean field, and an index on an enum-like text field. In my experience, having many indexes like that is very common in real world applications, though not at all common in popular benchmarks (with the exception of TPC-E). It also looks like they may benefit from the "split after new item" optimization, at least among the few unique indexes that were very bloated, such as merge_requests_pkey: https://gitlab.com/snippets/1812014 Gitlab is open source, so it should be possible to confirm my theory about the "split after new item" optimization (I am certain about "getting tired", though). Their schema is defined here: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb I don't have time to confirm all this right now, but I am pretty confident that they have both problems. And almost as confident that they'd notice substantial benefits from this patch series. -- Peter Geoghegan
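To make the low-cardinality scenario concrete, here is a minimal sketch (purely illustrative -- not Gitlab's actual schema or workload):

    -- Illustrative only: low-cardinality indexes of the kind described above,
    -- which tend to bloat badly under churn with the "getting tired" behavior.
    CREATE TABLE issues (
        id           bigserial PRIMARY KEY,
        confidential boolean NOT NULL DEFAULT false,
        state        text NOT NULL DEFAULT 'opened'
    );
    CREATE INDEX ON issues (confidential);   -- boolean index
    CREATE INDEX ON issues (state);          -- enum-like text index

    INSERT INTO issues (confidential, state)
    SELECT random() < 0.1,
           (ARRAY['opened','closed','merged'])[1 + (random() * 2)::int]
    FROM generate_series(1, 1000000);

    -- After a long period of UPDATE/DELETE churn, compare the live size
    -- against a freshly rebuilt copy to estimate bloat:
    SELECT pg_size_pretty(pg_relation_size('issues_state_idx'));
    REINDEX INDEX issues_state_idx;
    SELECT pg_size_pretty(pg_relation_size('issues_state_idx'));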
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 13/03/2019 03:28, Peter Geoghegan wrote: > On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary >> search like _bt_binsrch does, but the bounds caching is only done in >> _bt_binsrch_insert. Seems more clear to have separate functions for them >> now, even though there's some duplication. > >> /* >> * Do the insertion. First move right to find the correct page to >> * insert to, if necessary. If we're inserting to a non-unique index, >> * _bt_search() already did this when it checked if a move to the >> * right was required for leaf page. Insertion scankey's scantid >> * would have been filled out at the time. On a unique index, the >> * current buffer is the first buffer containing duplicates, however, >> * so we may need to move right to the correct location for this >> * tuple. >> */ >> if (checkingunique || itup_key->heapkeyspace) >> _bt_findinsertpage(rel, &insertstate, stack, heapRel); >> >> newitemoff = _bt_binsrch_insert(rel, &insertstate); > >> The attached new version simplifies this, IMHO. The bounds and the >> current buffer go together in the same struct, so it's easier to keep >> track whether the bounds are valid or not. > > Now that you have a full understanding of how the negative infinity > sentinel values work, and how page deletion's leaf page search and > !heapkeyspace indexes need to be considered, I think that we should > come back to this _bt_binsrch()/_bt_findsplitloc() stuff. My sense is > that you now have a full understanding of all the subtleties of the > patch, including those that that affect unique index insertion. That > will make it much easier to talk about these unresolved questions. > > My current sense is that it isn't useful to store the current buffer > alongside the binary search bounds/hint. It'll hardly ever need to be > invalidated, because we'll hardly ever have to move right within > _bt_findsplitloc() when doing unique index insertion (as I said > before, the regression tests *never* have to do this according to > gcov). It doesn't matter how often it happens, the code still needs to deal with it. So let's try to make it as readable as possible. > We're talking about a very specific set of conditions here, so > I'd like something that's lightweight and specialized. I agree that > the savebinsrch/restorebinsrch fields are a bit ugly, though. I can't > think of anything that's better offhand. Perhaps you can suggest > something that is both lightweight, and an improvement on > savebinsrch/restorebinsrch. Well, IMHO holding the buffer and the bounds in the new struct is more clean than the savebinsrc/restorebinsrch flags. That's exactly why I suggested it. I don't know what else to suggest. I haven't done any benchmarking, but I doubt there's any measurable difference. > I'm of the opinion that having a separate _bt_binsrch_insert() does > not make anything clearer. Actually, I think that saving the bounds > within the original _bt_binsrch() makes the design of that function > clearer, not less clear. It's all quite confusing at the moment, given > the rightmost/!leaf/page empty special cases. Seeing how the bounds > are reused (or not reused) outside of _bt_binsrch() helps with that. Ok. I think having some code duplication is better than one function that tries to do many things, but I'm not wedded to that. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 13/03/2019 03:28, Peter Geoghegan wrote: > It would be great if you could take a look at the 'Add high key > "continuescan" optimization' patch, which is the only one you haven't > commented on so far (excluding the amcheck "relocate" patch, which is > less important). I can put that one off for a while after the first 3 > go in. I will also put off the "split after new item" commit for at > least a week or two. I'm sure that the idea behind the "continuescan" > patch will now seem pretty obvious to you -- it's just taking > advantage of the high key when an index scan on the leaf level (which > uses a search style scankey, not an insertion style scankey) looks > like it may have to move to the next leaf page, but we'd like to avoid > it where possible. Checking the high key there is now much more likely > to result in the index scan not going to the next page, since we're > more careful when considering a leaf split point these days. The high > key often looks like the items on the page to the right, not the items > on the same page. Oh yeah, that makes perfect sense. I wonder why we haven't done it like that before? The new page split logic makes it more likely to help, but even without that, I don't see any downside. I find it a bit confusing, that the logic is now split between _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() does the high-key check, but the corresponding "first-key" check in a backward scan is done in _bt_checkkeys(). I'd suggest moving the logic completely to _bt_readpage(), so that it's in one place. With that, _bt_checkkeys() can always check the keys as it's told, without looking at the LP_DEAD flag. Like the attached. - Heikki
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Mar 14, 2019 at 4:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Oh yeah, that makes perfect sense. I wonder why we haven't done it like > that before? The new page split logic makes it more likely to help, but > even without that, I don't see any downside. The only downside is that we spend a few extra cycles, and that might not work out. This optimization would have always worked, though. The new page split logic clearly makes checking the high key in the "continuescan" path an easy win. > I find it a bit confusing, that the logic is now split between > _bt_checkkeys() and _bt_readpage(). For a forward scan, _bt_readpage() > does the high-key check, but the corresponding "first-key" check in a > backward scan is done in _bt_checkkeys(). I'd suggest moving the logic > completely to _bt_readpage(), so that it's in one place. With that, > _bt_checkkeys() can always check the keys as it's told, without looking > at the LP_DEAD flag. Like the attached. I'm convinced. I'd like to go a bit further, and also pass tupnatts to _bt_checkkeys(). That makes it closer to the similar _bt_check_rowcompare() function that _bt_checkkeys() must sometimes call. It also allows us to only call BTreeTupleGetNAtts() for the high key, while passing down a generic, loop-invariant IndexRelationGetNumberOfAttributes() value for non-pivot tuples. I'll do it that way in the next revision. Thanks -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > It doesn't matter how often it happens, the code still needs to deal > with it. So let's try to make it as readable as possible. > Well, IMHO holding the buffer and the bounds in the new struct is more > clean than the savebinsrc/restorebinsrch flags. That's exactly why I > suggested it. I don't know what else to suggest. I haven't done any > benchmarking, but I doubt there's any measurable difference. Fair enough. Attached is v17, which does it using the approach taken in your earlier prototype. I even came around to your view on _bt_binsrch_insert() -- I kept that part, too. Note, however, that I still pass checkingunique to _bt_findinsertloc(), because that's a distinct condition to whether or not bounds were cached (they happen to be the same thing right now, but I don't want to assume that). This revision also integrates your approach to the "continuescan" optimization patch, with the small tweak I mentioned yesterday (we also pass ntupatts). I also prefer this approach. I plan on committing the first few patches early next week, barring any objections, or any performance problems noticed during an additional, final round of performance validation. I won't expect feedback from you until Monday at the earliest. It would be nice if you could take a look at the amcheck "relocate" patch. My intention is to push patches up to and including the amcheck "relocate" patch on the same day (I'll leave a few hours between the first two patches, to confirm that the first patch doesn't break the buildfarm). BTW, my multi-day, large BenchmarkSQL benchmark continues, with some interesting results. The first round of 12 hour long runs showed the patch nearly 6% ahead in terms of transaction throughput, with a database that's almost 1 terabyte. The second round, which completed yesterday and reuses the database initialized for the first round showed that the patch had 10.7% higher throughput. That's a new record for the patch. I'm going to leave this benchmark running for a few more days, at least until it stops being interesting. I wonder how long it will be before the master branch throughput stops declining relative to throughput with the patched version. I expect that the master branch will reach "index bloat saturation point" sooner or later. Indexes in the patch's data directory continue to get larger, as expected, but the amount of bloat accumulated over time is barely noticeable (i.e. the pages are packed tight with tuples, which barely declines over time). This version of the patch series has attributions/credits at the end of the commit messages. I have listed you as a secondary author on a couple of the patches, where code was lifted from your feedback patches. Let me know if you think that I have it right. Thanks -- Peter Geoghegan
Attachment
- v17-0001-Refactor-nbtree-insertion-scankeys.patch
- v17-0005-Add-split-after-new-tuple-optimization.patch
- v17-0002-Make-heap-TID-a-tiebreaker-nbtree-index-column.patch
- v17-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v17-0004-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v17-0006-Add-high-key-continuescan-optimization.patch
- v17-0007-DEBUG-Add-pageinspect-instrumentation.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 16/03/2019 06:16, Peter Geoghegan wrote: > On Thu, Mar 14, 2019 at 2:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> It doesn't matter how often it happens, the code still needs to deal >> with it. So let's try to make it as readable as possible. > >> Well, IMHO holding the buffer and the bounds in the new struct is more >> clean than the savebinsrc/restorebinsrch flags. That's exactly why I >> suggested it. I don't know what else to suggest. I haven't done any >> benchmarking, but I doubt there's any measurable difference. > > Fair enough. Attached is v17, which does it using the approach taken > in your earlier prototype. I even came around to your view on > _bt_binsrch_insert() -- I kept that part, too. Note, however, that I > still pass checkingunique to _bt_findinsertloc(), because that's a > distinct condition to whether or not bounds were cached (they happen > to be the same thing right now, but I don't want to assume that). > > This revision also integrates your approach to the "continuescan" > optimization patch, with the small tweak I mentioned yesterday (we > also pass ntupatts). I also prefer this approach. Great, thank you! > It would be nice if you could take a look at the amcheck "relocate" > patch When I started looking at this, I thought that "relocate" means "move". So I thought that the new mode would actually move tuples, i.e. that it would modify the index. That sounded horrible. Of course, it doesn't actually do that. It just checks that each tuple can be re-found, or "relocated", by descending the tree from the root. I'd suggest changing the language to avoid that confusion. It seems like a nice way to catch all kinds of index corruption issues. Although, we already check that the tuples are in order within the page. Is it really necessary to traverse the tree for every tuple, as well? Maybe do it just for the first and last item? > + * This routine can detect very subtle transitive consistency issues across > + * more than one level of the tree. Leaf pages all have a high key (even the > + * rightmost page has a conceptual positive infinity high key), but not a low > + * key. Their downlink in parent is a lower bound, which along with the high > + * key is almost enough to detect every possible inconsistency. A downlink > + * separator key value won't always be available from parent, though, because > + * the first items of internal pages are negative infinity items, truncated > + * down to zero attributes during internal page splits. While it's true that > + * bt_downlink_check() and the high key check can detect most imaginable key > + * space problems, there are remaining problems it won't detect with non-pivot > + * tuples in cousin leaf pages. Starting a search from the root for every > + * existing leaf tuple detects small inconsistencies in upper levels of the > + * tree that cannot be detected any other way. (Besides all this, it's > + * probably a useful testing strategy to exhaustively verify that all > + * non-pivot tuples can be relocated in the index using the same code paths as > + * those used by index scans.) I don't understand this. Can you give an example of this kind of inconsistency? - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > It would be nice if you could take a look at the amcheck "relocate" > > patch > When I started looking at this, I thought that "relocate" means "move". > So I thought that the new mode would actually move tuples, i.e. that it > would modify the index. That sounded horrible. Of course, it doesn't > actually do that. It just checks that each tuple can be re-found, or > "relocated", by descending the tree from the root. I'd suggest changing > the language to avoid that confusion. Okay. What do you suggest? :-) > It seems like a nice way to catch all kinds of index corruption issues. > Although, we already check that the tuples are in order within the page. > Is it really necessary to traverse the tree for every tuple, as well? > Maybe do it just for the first and last item? It's mainly intended as a developer option. I want it to be possible to detect any form of corruption, however unlikely. It's an adversarial mindset that will at least make me less nervous about the patch. > I don't understand this. Can you give an example of this kind of > inconsistency? The commit message gives an example, but I suggest trying it out for yourself. Corrupt the least significant key byte of a root page of a B-Tree using pg_hexedit. Say it's an index on a text column, then you'd corrupt the last byte to be something slightly wrong. Then, the only way to catch it is with "relocate" verification. You'll only miss a few tuples on a cousin page at the leaf level (those on either side of the high key that the corrupted separator key in the root was originally copied from). The regular checks won't catch this, because the keys are similar enough one level down. The "minus infinity" item is a kind of a blind spot, because we cannot do a parent check of its children, because we don't have the key (it's truncated when the item becomes a right page minus infinity item, during an internal page split). -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 16/03/2019 10:51, Peter Geoghegan wrote: > On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> It would be nice if you could take a look at the amcheck "relocate" >>> patch >> When I started looking at this, I thought that "relocate" means "move". >> So I thought that the new mode would actually move tuples, i.e. that it >> would modify the index. That sounded horrible. Of course, it doesn't >> actually do that. It just checks that each tuple can be re-found, or >> "relocated", by descending the tree from the root. I'd suggest changing >> the language to avoid that confusion. > > Okay. What do you suggest? :-) Hmm. "re-find", maybe? We use that term in a few error messages already, to mean something similar. >> It seems like a nice way to catch all kinds of index corruption issues. >> Although, we already check that the tuples are in order within the page. >> Is it really necessary to traverse the tree for every tuple, as well? >> Maybe do it just for the first and last item? > > It's mainly intended as a developer option. I want it to be possible > to detect any form of corruption, however unlikely. It's an > adversarial mindset that will at least make me less nervous about the > patch. Fair enough. At first, I thought this would be horrendously expensive, but thinking about it a bit more, nearby tuples will always follow the same path through the upper nodes, so it'll all be cached. So maybe it's not quite so bad. >> I don't understand this. Can you give an example of this kind of >> inconsistency? > > The commit message gives an example, but I suggest trying it out for > yourself. Corrupt the least significant key byte of a root page of a > B-Tree using pg_hexedit. Say it's an index on a text column, then > you'd corrupt the last byte to be something slightly wrong. Then, the > only way to catch it is with "relocate" verification. You'll only miss > a few tuples on a cousin page at the leaf level (those on either side > of the high key that the corrupted separator key in the root was > originally copied from). > > The regular checks won't catch this, because the keys are similar > enough one level down. The "minus infinity" item is a kind of a blind > spot, because we cannot do a parent check of its children, because we > don't have the key (it's truncated when the item becomes a right page > minus infinity item, during an internal page split). Hmm. So, the initial situation would be something like this:

                  +-----------+
                  | 1: root   |
                  |           |
                  | -inf -> 2 |
                  | 20 -> 3   |
                  |           |
                  +-----------+

        +-------------+   +-------------+
        | 2: internal |   | 3: internal |
        |             |   |             |
        | -inf -> 4   |   | -inf -> 6   |
        | 10 -> 5     |   | 30 -> 7     |
        |             |   |             |
        | hi: 20      |   |             |
        +-------------+   +-------------+

 +---------+  +---------+  +---------+  +---------+
 | 4: leaf |  | 5: leaf |  | 6: leaf |  | 7: leaf |
 |         |  |         |  |         |  |         |
 | 1       |  | 11      |  | 21      |  | 31      |
 | 5       |  | 15      |  | 25      |  | 35      |
 | 9       |  | 19      |  | 29      |  | 39      |
 |         |  |         |  |         |  |         |
 | hi: 10  |  | hi: 20  |  | hi: 30  |  |         |
 +---------+  +---------+  +---------+  +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes the leaf tuple 19 to become non-re-findable; if you descend the tree for 19, you'll incorrectly land on page 6. But it also causes the high key on page 2 to be different from the downlink on the root page. Wouldn't the existing checks catch this? - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Hmm. "re-find", maybe? We use that term in a few error messages already, > to mean something similar. WFM. > At first, I thought this would be horrendously expensive, but thinking > about it a bit more, nearby tuples will always follow the same path > through the upper nodes, so it'll all be cached. So maybe it's not quite > so bad. That's deliberate, though you could call bt_relocate_from_root() from anywhere else if you wanted to. It's a bit like a big nested loop join, where the inner side has locality. > Then, a cosmic ray changes the 20 on the root page to 18. That causes > the the leaf tuple 19 to become non-re-findable; if you descend the for > 19, you'll incorrectly land on page 6. But it also causes the high key > on page 2 to be different from the downlink on the root page. Wouldn't > the existing checks catch this? No, the existing checks will not check that. I suppose something closer to the existing approach *could* detect this issue, by making sure that the "seam of identical high keys" from the root to the leaf are a match, but we don't use the high key outside of its own page. Besides, there is something useful about having the code actually rely on _bt_search(). There are other similar cases, where it's not obvious how you can do verification without either 1) crossing multiple levels, or 2) retaining a "low key" as well as a high key (this is what Goetz Graefe calls "retaining fence keys to solve the cousin verification problem" in Modern B-Tree Techniques). What if the corruption was in the leaf page 6 from your example, which had a 20 at the start? We wouldn't have compared the downlink from the parent to the child, because leaf page 6 is the leftmost child, and so we only have "-inf". The lower bound actually comes from the root page, because we truncate "-inf" attributes during page splits, even though we don't have to. Most of the time they're not "absolute minus infinity" -- they're "minus infinity in this subtree". Maybe you could actually do something with the high key from leaf page 5 to detect the stray value "20" in leaf page 6, but again, we don't do anything like that right now. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Hmm. "re-find", maybe? We use that term in a few error messages already, > > to mean something similar. > > WFM. Actually, how about "rootsearch", or "rootdescend"? You're supposed to hyphenate "re-find", and so it doesn't really work as a function argument name. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 16/03/2019 19:32, Peter Geoghegan wrote: > On Sat, Mar 16, 2019 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote: >> On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> Hmm. "re-find", maybe? We use that term in a few error messages already, >>> to mean something similar. >> >> WFM. > > Actually, how about "rootsearch", or "rootdescend"? You're supposed to > hyphenate "re-find", and so it doesn't really work as a function > argument name. Works for me. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 16/03/2019 18:55, Peter Geoghegan wrote: > On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Then, a cosmic ray changes the 20 on the root page to 18. That causes >> the leaf tuple 19 to become non-re-findable; if you descend the tree for >> 19, you'll incorrectly land on page 6. But it also causes the high key >> on page 2 to be different from the downlink on the root page. Wouldn't >> the existing checks catch this? > > No, the existing checks will not check that. I suppose something > closer to the existing approach *could* detect this issue, by making > sure that the "seam of identical high keys" from the root to the leaf > are a match, but we don't use the high key outside of its own page. > Besides, there is something useful about having the code actually rely > on _bt_search(). > > There are other similar cases, where it's not obvious how you can do > verification without either 1) crossing multiple levels, or 2) > retaining a "low key" as well as a high key (this is what Goetz Graefe > calls "retaining fence keys to solve the cousin verification problem" > in Modern B-Tree Techniques). What if the corruption was in the leaf > page 6 from your example, which had a 20 at the start? We wouldn't > have compared the downlink from the parent to the child, because leaf > page 6 is the leftmost child, and so we only have "-inf". The lower > bound actually comes from the root page, because we truncate "-inf" > attributes during page splits, even though we don't have to. Most of > the time they're not "absolute minus infinity" -- they're "minus > infinity in this subtree". AFAICS, there is a copy of every page's high key in its immediate parent, either in the downlink of the right sibling, or as the high key of the parent page itself. Cross-checking those would catch any corruption in high keys. Note that this would potentially catch some corruption that the descend-from-root approach would not. If you have a mismatch between the high key of a leaf page and its parent or grandparent, all the current tuples might pass the rootdescend check. But a tuple might get inserted into the wrong location later. > Maybe you could actually do something with the high key from leaf page > 5 to detect the stray value "20" in leaf page 6, but again, we don't > do anything like that right now. Hmm, yeah, to check for stray values, you could follow the left-link, get the high key of the left sibling, and compare against that. - Heikki
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 1:33 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > AFAICS, there is a copy of every page's high key in its immediate > parent. Either in the downlink of the right sibling, or as the high key > of the parent page itself. Cross-checking those would catch any > corruption in high keys. I agree that it's always true that the high key is also in the parent, and we could cross-check that within the child. Actually, I should probably figure out a way of arranging for the Bloom filter used within bt_relocate_from_root() (which has been around since PG v11) to include the key itself when fingerprinting. That would probably necessitate that we don't truncate "negative infinity" items (it was actually that way about 18 years ago), just for the benefit of verification. This is almost the same thing as what Graefe argues for (don't think that you need a low key on the leaf level, since you can cross a single level there). I wonder if truncating the negative infinity item in internal pages to zero attributes is actually worth it, since a low key might be useful for a number of reasons. > Note that this would potentially catch some corruption that the > descend-from-root would not. If you have a mismatch between the high key > of a leaf page and its parent or grandparent, all the current tuples > might be pass the rootdescend check. But a tuple might get inserted to > wrong location later. I also agree with this. However, the rootdescend check will always work better than this in some cases that you can at least imagine, for as long as there are negative infinity items to worry about. (And, even if we decided not to truncate to support easy verification, there is still a good argument to be made for involving the authoritative _bt_search() code at some point). > > Maybe you could actually do something with the high key from leaf page > > 5 to detect the stray value "20" in leaf page 6, but again, we don't > > do anything like that right now. > > Hmm, yeah, to check for stray values, you could follow the left-link, > get the high key of the left sibling, and compare against that. Graefe argues for retaining a low key, even in leaf pages (the left page's old high key becomes the left page's low key during a split, and the left page's new high key becomes the new right pages low key at the same time). It's useful for what he calls "write-optimized B-Trees", and maybe even for optional compression. As I said earlier, I guess you can just go left on the leaf level if you need to, and you have all you need. But I'd need to think about it some more. Point taken; rootdescend isn't enough to make verification exactly perfect. But it makes verification approach being perfect, because you're going to get right answers to queries as long as it passes (I think). There could be a future problem for a future insertion that you could also detect, but can't. But you'd have to be extraordinarily unlucky to have that happen for any amount of time. Unlucky even by my own extremely paranoid standard. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > I agree that it's always true that the high key is also in the parent, > and we could cross-check that within the child. Actually, I should > probably figure out a way of arranging for the Bloom filter used > within bt_relocate_from_root() (which has been around since PG v11) to > include the key itself when fingerprinting. That would probably > necessitate that we don't truncate "negative infinity" items (it was > actually that way about 18 years ago), just for the benefit of > verification. Clarification: You'd fingerprint an entire pivot tuple -- key, block number, everything. Then, you'd probe the Bloom filter for the high key one level down, with the downlink block in the high key set to point to the current sibling on the same level (the child level). As I said, I think that the only reason that that cannot be done at present is because of the micro-optimization of truncating the first item on the right page to zero attributes during an internal page split. (We can retain the key without getting rid of the hard-coded logic for negative infinity within _bt_compare()). bt_relocate_from_root() already has smarts around interrupted page splits (with the incomplete split bit set). Finally, you'd also make bt_downlink_check follow every downlink, not all-but-one downlink (no more excuse for leaving out the first one if we don't truncate during internal page splits). -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Sat, Mar 16, 2019 at 1:47 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I agree that it's always true that the high key is also in the parent, > > and we could cross-check that within the child. Actually, I should > > probably figure out a way of arranging for the Bloom filter used > > within bt_relocate_from_root() (which has been around since PG v11) to > > include the key itself when fingerprinting. That would probably > > necessitate that we don't truncate "negative infinity" items (it was > > actually that way about 18 years ago), just for the benefit of > > verification. > > Clarification: You'd fingerprint an entire pivot tuple -- key, block > number, everything. Then, you'd probe the Bloom filter for the high > key one level down, with the downlink block in the high key set to > point to the current sibling on the same level (the child level). As I > said, I think that the only reason that that cannot be done at present > is because of the micro-optimization of truncating the first item on > the right page to zero attributes during an internal page split. (We > can retain the key without getting rid of the hard-coded logic for > negative infinity within _bt_compare()). > > bt_relocate_from_root() already has smarts around interrupted page > splits (with the incomplete split bit set). Clarification to my clarification: I meant bt_downlink_missing_check(), not bt_relocate_from_root(). The former really has been around since v11, unlike the latter, which is part of this new amcheck patch we're discussing. -- Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Actually, how about "rootsearch", or "rootdescend"? You're supposed to > > hyphenate "re-find", and so it doesn't really work as a function > > argument name. > > Works for me. Attached is v18 of the patch series, which calls the new verification option "rootdescend" verification. As previously stated, I intend to commit the first 4 patches (up to and including this amcheck "rootdescend" patch) during the workday tomorrow, Pacific time.

Other changes:

* Further consolidation of the nbtree.h comments from the second patch, so that the on-disk representation overview that you requested a while back has all the details. A couple of these were moved from macro comments also in nbtree.h, and were missed earlier.

* Tweaks to comments on _bt_binsrch_insert() and its callers. Streamlined to reflect the fact that it doesn't need to talk so much about cases that only apply to internal pages. Explicitly stated requirements for the caller.

* Made _bt_binsrch_insert() set InvalidOffsetNumber for bounds in cases where valid bounds cannot be established initially. This seemed like a good idea.

* A few more defensive assertions were added to nbtinsert.c (also related to _bt_binsrch_insert()).

Thanks -- Peter Geoghegan
Attachment
- v18-0005-Add-split-after-new-tuple-optimization.patch
- v18-0001-Refactor-nbtree-insertion-scankeys.patch
- v18-0004-Allow-amcheck-to-re-find-tuples-using-new-search.patch
- v18-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v18-0006-Add-high-key-continuescan-optimization.patch
- v18-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v18-0002-Make-heap-TID-a-tiebreaker-nbtree-index-column.patch
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Heikki Linnakangas
Date:
On 18/03/2019 02:59, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:05 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Actually, how about "rootsearch", or "rootdescend"? You're supposed to
>>> hyphenate "re-find", and so it doesn't really work as a function
>>> argument name.
>>
>> Works for me.
>
> Attached is v18 of patch series, which calls the new verification
> option "rootdescend" verification.

Thanks! I'm getting a regression failure from the 'create_table' test with this:

> --- /home/heikki/git-sandbox/postgresql/src/test/regress/expected/create_table.out 2019-03-11 14:41:41.382759197 +0200
> +++ /home/heikki/git-sandbox/postgresql/src/test/regress/results/create_table.out 2019-03-18 13:49:49.480249055 +0200
> @@ -413,18 +413,17 @@
> c text,
> d text
> ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "C");
> +ERROR: function plusone(integer) does not exist
> +HINT: No function matches the given name and argument types. You might need to add explicit type casts.

Are you seeing that?

Looking at patches 1 and 2 again:

I'm still not totally happy with the program flow and all the conditions in _bt_findinsertloc(). I have a hard time following which codepaths are taken when. I refactored that, so that there is a separate copy of the loop for V3 and V4 indexes. So, where the code used to be something like this:

    _bt_findinsertloc(...)
    {
        ...

        /* move right, if needed */
        for (;;)
        {
            /*
             * various conditions for when to stop. Different conditions
             * apply depending on whether it's a V3 or V4 index.
             */
        }

        ...
    }

it is now:

    _bt_findinsertloc(...)
    {
        ...

        if (heapkeyspace)
        {
            /*
             * If 'checkingunique', move right to the correct page.
             */
            for (;;)
            {
                ...
            }
        }
        else
        {
            /*
             * Move right, until we find a page with enough space or "get
             * tired".
             */
            for (;;)
            {
                ...
            }
        }

        ...
    }

I think this makes the logic easier to understand. Although there is some commonality, the conditions for when to move right on a V3 vs V4 index are quite different, so it seems better to handle them separately. There is some code duplication, but it's not too bad. I moved the common code to step to the next page to a new helper function, _bt_stepright(), which actually seems like a good idea in any case.

See attached patches with those changes, plus some minor comment kibitzing. It's still failing the 'create_table' regression test, though.

- Heikki

PS. The commit message of the first patch needs updating, now that BTInsertState is different from BTScanInsert.
Attachment
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Mon, Mar 18, 2019 at 4:59 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm getting a regression failure from the 'create_table' test with this:
> Are you seeing that?

Yes -- though the bug is in your revised v18, not the original v18, which passed CFTester. Your revision fails on Travis/Linux, which is pretty close to what I see locally, and much less subtle than the test failures you mentioned:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507816665

"make check" did pass locally for me with your patch, but "make check-world" (parallel recipe) did not. The original v18 passed both CFTester tests about 15 hours ago:

https://travis-ci.org/postgresql-cfbot/postgresql/builds/507643402

I see the bug. You're not supposed to test this way with a heapkeyspace index:

> +        if (P_RIGHTMOST(lpageop) ||
> +            _bt_compare(rel, itup_key, page, P_HIKEY) != 0)
> +            break;

This is because the presence of scantid makes it almost certain that you'll break when you shouldn't. You're doing it the old way, which is inappropriate for a heapkeyspace index. Note that it would probably take much longer to notice this bug if the "consider secondary factors" patch was also applied, because then you would rarely have cause to step right here (duplicates would never occupy more than a single page in the regression tests). The test failures are probably also timing sensitive, though they happen very reliably on my machine.

> Looking at the patches 1 and 2 again:
>
> I'm still not totally happy with the program flow and all the conditions
> in _bt_findinsertloc(). I have a hard time following which codepaths are
> taken when. I refactored that, so that there is a separate copy of the
> loop for V3 and V4 indexes.

The big difference is that you make the possible call to _bt_stepright() conditional on this being a checkingunique index -- the duplicate code is indented in that branch of _bt_findinsertloc(). Whereas I break early in the loop when "checkingunique && heapkeyspace".

The flow of the original loop not only had less code. It also contrasted the important differences between the heapkeyspace and !heapkeyspace cases:

    /* If this is the page that the tuple must go on, stop */
    if (P_RIGHTMOST(lpageop))
        break;
    cmpval = _bt_compare(rel, itup_key, page, P_HIKEY);
    if (itup_key->heapkeyspace)
    {
        if (cmpval <= 0)
            break;
    }
    else
    {
        /*
         * pg_upgrade'd !heapkeyspace index.
         *
         * May have to handle legacy case where there is a choice of which
         * page to place new tuple on, and we must balance space
         * utilization as best we can. Note that this may invalidate
         * cached bounds for us.
         */
        if (cmpval != 0 || _bt_useduplicatepage(rel, heapRel, insertstate))
            break;
    }

I thought it was obvious that the "cmpval <= 0" code was different for a reason. It now seems that this at least needs a comment.

I still believe that the best way to handle the !heapkeyspace case is to make it similar to the heapkeyspace checkingunique case, regardless of whether or not we're checkingunique. The fact that this bug slipped in supports that view. Besides, the alternative that you suggest treats !heapkeyspace indexes as if they were just as important to the reader, which seems inappropriate (better to make the legacy case follow the new case, not the other way around).

I'm fine with the comment tweaks that you made that are not related to _bt_findinsertloc(), though.

I won't push the patches today, to give you the opportunity to respond. I am not at all convinced right now, though.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.
> Now, maybe in the future we'll want to work on other techniques for
> reducing contention, but I don't think we should make that the problem
> of this patch, especially because the regressions are small and go
> away after a few hours of heavy use. We should optimize for the case
> where the user intends to keep the database around for years, not
> hours.

I came back to the question of contention recently. I don't think it's okay to make contention worse in workloads where indexes are mostly the same size as before, such as almost any workload that pgbench can simulate. I have made a lot of the fact that the TPC-C indexes are ~40% smaller, in part because lots of people outside the community find TPC-C interesting, and in part because this patch series is focused on cases where we currently do unusually badly (cases where good intuitions about how B-Trees are supposed to perform break down [1]). These pinpointable problems must affect a lot of users some of the time, but certainly not all users all of the time.

The patch series is actually supposed to *improve* the situation with index buffer lock contention in general, and it looks like it manages to do that with pgbench, which doesn't do inserts into indexes, except for those required for non-HOT updates. pgbench requires relatively few page splits, but is in every other sense a high contention workload.

With pgbench scale factor 20, here are results for patch and master with a Gaussian distribution on my 8 thread/4 core home server, with each run reported lasting 10 minutes, repeating twice for client counts 1, 2, 8, 16, and 64, patch and master branch:

\set aid random_gaussian(1, 100000 * :scale, 20)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

1st pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7203.983289 (including connections establishing)
tps = 7204.020457 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.026 ms

1 client patch:
tps = 7012.575167 (including connections establishing)
tps = 7012.590007 (excluding connections establishing)
latency average = 0.143 ms
latency stddev = 0.020 ms

2 clients master:
tps = 13434.043832 (including connections establishing)
tps = 13434.076194 (excluding connections establishing)
latency average = 0.149 ms
latency stddev = 0.032 ms

2 clients patch:
tps = 13105.620223 (including connections establishing)
tps = 13105.654109 (excluding connections establishing)
latency average = 0.153 ms
latency stddev = 0.033 ms

8 clients master:
tps = 27126.852038 (including connections establishing)
tps = 27126.986978 (excluding connections establishing)
latency average = 0.295 ms
latency stddev = 0.095 ms

8 clients patch:
tps = 27945.457965 (including connections establishing)
tps = 27945.565242 (excluding connections establishing)
latency average = 0.286 ms
latency stddev = 0.089 ms

16 clients master:
tps = 32297.612323 (including connections establishing)
tps = 32297.743929 (excluding connections establishing)
latency average = 0.495 ms
latency stddev = 0.185 ms

16 clients patch:
tps = 33434.889405 (including connections establishing)
tps = 33435.021738 (excluding connections establishing)
latency average = 0.478 ms
latency stddev = 0.167 ms

64 clients master:
tps = 25699.029787 (including connections establishing)
tps = 25699.217022 (excluding connections establishing)
latency average = 2.482 ms
latency stddev = 1.715 ms

64 clients patch:
tps = 26513.816673 (including connections establishing)
tps = 26514.013638 (excluding connections establishing)
latency average = 2.405 ms
latency stddev = 1.690 ms

2nd pass
========

(init pgbench from scratch for each database, scale 20)

1 client master:
tps = 7172.995796 (including connections establishing)
tps = 7173.013472 (excluding connections establishing)
latency average = 0.139 ms
latency stddev = 0.022 ms

1 client patch:
tps = 7024.724365 (including connections establishing)
tps = 7024.739237 (excluding connections establishing)
latency average = 0.142 ms
latency stddev = 0.021 ms

2 clients master:
tps = 13489.016303 (including connections establishing)
tps = 13489.047968 (excluding connections establishing)
latency average = 0.148 ms
latency stddev = 0.032 ms

2 clients patch:
tps = 13210.292833 (including connections establishing)
tps = 13210.321528 (excluding connections establishing)
latency average = 0.151 ms
latency stddev = 0.029 ms

8 clients master:
tps = 27470.112858 (including connections establishing)
tps = 27470.229891 (excluding connections establishing)
latency average = 0.291 ms
latency stddev = 0.093 ms

8 clients patch:
tps = 28132.981815 (including connections establishing)
tps = 28133.096414 (excluding connections establishing)
latency average = 0.284 ms
latency stddev = 0.081 ms

16 clients master:
tps = 32409.399669 (including connections establishing)
tps = 32409.533400 (excluding connections establishing)
latency average = 0.493 ms
latency stddev = 0.182 ms

16 clients patch:
tps = 33678.304986 (including connections establishing)
tps = 33678.427420 (excluding connections establishing)
latency average = 0.475 ms
latency stddev = 0.168 ms

64 clients master:
tps = 25864.453485 (including connections establishing)
tps = 25864.639098 (excluding connections establishing)
latency average = 2.466 ms
latency stddev = 1.698 ms

64 clients patch:
tps = 26382.926218 (including connections establishing)
tps = 26383.166692 (excluding connections establishing)
latency average = 2.417 ms
latency stddev = 1.678 ms

There was a third run which has been omitted, because it's practically the same as the first two. The order that results appear in is the order things actually ran in (I like to interlace master and patch runs closely).

Analysis
========

There seems to be a ~2% regression with one or two clients, but we more than make up for that as the client count goes up -- the 8 and 64 client cases improve throughput by ~2.5%, and the 16 client case improves throughput by ~4%. This seems like a totally reasonable trade-off to me. As I said already, the patch isn't really about workloads that we already do acceptably well on, such as this one, so you're not expected to be impressed with these numbers. My goal is to show that boring workloads that fit everything in shared_buffers appear to be fine.
I think that that's a reasonable conclusion, based on these numbers. Lower client count cases are generally considered less interesting, and also lose less in throughput than we go on to gain as more clients are added. I'd be surprised if anybody complained.

I think that the explanation for the regression with one or two clients boils down to this: We're making better decisions about where to split pages, and even about how pages are accessed by index scans (more on that in the next paragraph). However, this isn't completely free (particularly the page split stuff), and it doesn't pay for itself until the number of clients ramps up. Still, not being more careful about that stuff is penny wise, pound foolish. I even suspect that there are priority inversion issues when there is high contention during unique index enforcement, which might be a big problem on multi-socket machines with hundreds of clients. I am not in a position to confirm that right now, but we have heard reports that are consistent with this explanation at least once before now [2]. Zipfian was also somewhat better when I last measured it, using the same fairly modest machine -- I didn't repeat that here because I wanted something simple and widely studied.

The patch establishes the principle that there is only one good reason to visit more than one leaf page within index scans like those used by pgbench: a concurrent page split, where the scan simply must go right to find matches that were just missed in the first leaf page. That should be very rare. We should never visit two leaf pages because we're confused about where there might be matches. There is simply no good reason for there to be any ambiguity or confusion. The patch could still make index scans like these visit more than a single leaf page for a bad reason, at least in theory: when there are at least ~400 duplicates in a unique index, and we therefore can't possibly store them all on one leaf page, index scans will of course have to visit more than one leaf page. Again, that should be very rare.

All index scans can now check the high key on the leaf level, and avoid going right when they happen to be very close to the right edge of the leaf page's key space. And, we never have to take the scenic route when descending the tree on an equal internal page key, since that condition has practically been eliminated by suffix truncation. No new tuple can be equal to negative infinity, and negative infinity appears in every pivot tuple. There is a place for everything, and everything is in its place.

[1] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
[2] https://postgr.es/m/BF3B6F54-68C3-417A-BFAB-FB4D66F2B410@postgrespro.ru

--
Peter Geoghegan
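[One hedged way to observe the "single leaf page per lookup" behavior described above on a running system is to look at the buffers touched by a pgbench-style point lookup. Exact buffer counts depend on tree height and caching, and BUFFERS counts internal and heap pages too, so this is only a sketch of the kind of check involved; the aid value is arbitrary.]

    -- Sketch: a pgbench-style primary key lookup.  With heap TID as a
    -- tiebreaker and suffix truncation in place, the descent is expected
    -- to touch one buffer per internal level plus a single leaf page (and
    -- then the heap page), rather than sometimes stepping right to a
    -- second leaf page without a good reason.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT abalance
    FROM pgbench_accounts
    WHERE aid = 12345;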
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Robert Haas
Date:
On Mon, Mar 18, 2019 at 7:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> With pgbench scale factor 20, here are results for patch and master
> with a Gaussian distribution on my 8 thread/4 core home server, with
> each run reported lasting 10 minutes, repeating twice for client
> counts 1, 2, 8, 16, and 64, patch and master branch:
>
> 1 client master:
> tps = 7203.983289 (including connections establishing)
> 1 client patch:
> tps = 7012.575167 (including connections establishing)
>
> 2 clients master:
> tps = 13434.043832 (including connections establishing)
> 2 clients patch:
> tps = 13105.620223 (including connections establishing)

Blech. I think the patch has enough other advantages that it's worth accepting that, but it's not great. We seem to keep finding reasons to reduce single client performance in the name of scalability, which is often reasonable but not wonderful.

> However, this isn't completely
> free (particularly the page split stuff), and it doesn't pay for
> itself until the number of clients ramps up.

I don't really understand that explanation. It makes sense that more intelligent page split decisions could require more CPU cycles, but it is not evident to me why more clients would help better page split decisions pay off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Mon, Mar 18, 2019 at 5:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Blech. I think the patch has enough other advantages that it's worth
> accepting that, but it's not great. We seem to keep finding reasons
> to reduce single client performance in the name of scalability, which
> is often reasonable but not wonderful.

The good news is that the quicksort that we now perform in nbtsplitloc.c is not optimized at all. Heikki thought it premature to optimize that, for example by inlining/specializing the quicksort. I can make that 3x faster fairly easily, which could well change the picture here. The code will be uglier that way, but not much more complicated. I even prototyped this, and managed to make serial microbenchmarks I've used noticeably faster. This could very well fix the problem here. It clearly showed up in perf profiles with serial bulk loads.

> > However, this isn't completely
> > free (particularly the page split stuff), and it doesn't pay for
> > itself until the number of clients ramps up.
>
> I don't really understand that explanation. It makes sense that more
> intelligent page split decisions could require more CPU cycles, but it
> is not evident to me why more clients would help better page split
> decisions pay off.

Smarter choices on page splits pay off with higher client counts because they reduce contention at likely hot points. It's kind of crazy that the code in _bt_check_unique() sometimes has to move right, while holding an exclusive buffer lock on the original page and a shared buffer lock on its sibling page at the same time. It then has to hold a third buffer lock concurrently, this time on any heap pages it is interested in. Each in turn, to check if they're possibly conflicting. gcov shows that that never happens with the regression tests once the patch is applied (you can at least get away with only having one buffer lock on a leaf page at all times in practically all cases).

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Mon, Mar 18, 2019 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Smarter choices on page splits pay off with higher client counts
> because they reduce contention at likely hot points. It's kind of
> crazy that the code in _bt_check_unique() sometimes has to move right,
> while holding an exclusive buffer lock on the original page and a
> shared buffer lock on its sibling page at the same time. It then has
> to hold a third buffer lock concurrently, this time on any heap pages
> it is interested in.

Actually, by the time we get to 16 clients, this workload does make the indexes and tables smaller. Here is pg_buffercache output collected after the first 16 client case:

Master
======

        relname        │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   123,484 │      123,484 │ 4.9989715266755207
 pgbench_accounts      │             0 │                    34,665 │       10,682 │ 4.4948511514697622
 pgbench_accounts_pkey │             0 │                     5,708 │        1,561 │ 4.8731582319026265
 pgbench_tellers       │             0 │                       489 │          489 │ 5.0000000000000000
 pgbench_branches      │             0 │                       284 │          284 │ 5.0000000000000000
 pgbench_tellers_pkey  │             0 │                        56 │           56 │ 5.0000000000000000
 ....

Patch
=====

        relname        │ relforknumber │ size_main_rel_fork_blocks │ buffer_count │   avg_buffer_usg
───────────────────────┼───────────────┼───────────────────────────┼──────────────┼────────────────────
 pgbench_history       │             0 │                   127,864 │      127,864 │ 4.9980447975974473
 pgbench_accounts      │             0 │                    33,933 │        9,614 │ 4.3517786561264822
 pgbench_accounts_pkey │             0 │                     5,487 │        1,322 │ 4.8857791225416036
 pgbench_tellers       │             0 │                       204 │          204 │ 4.9803921568627451
 pgbench_branches      │             0 │                       198 │          198 │ 4.3535353535353535
 pgbench_tellers_pkey  │             0 │                        14 │           14 │ 5.0000000000000000
 ....

The main fork for pgbench_history is larger with the patch, obviously, but that's good. pgbench_accounts_pkey is about 4% smaller, which is probably the most interesting observation that can be made here, but the tables are also smaller. pgbench_accounts itself is ~2% smaller. pgbench_branches is ~30% smaller, and pgbench_tellers is 60% smaller. Of course, the smaller tables were already very small, so maybe that isn't important. I think that this is due to more effective pruning, possibly because we get better lock arbitration as a consequence of better splits, and also because duplicates are in heap TID order. I haven't observed this effect with larger databases, which have been my focus.

It isn't weird that shared_buffers doesn't have all the pgbench_accounts blocks, since, of course, this is highly skewed by design -- most blocks were never accessed from the table.

This effect seems to be robust, at least with this workload. The second round of benchmarks (which have their own pgbench -i initialization) show very similar amounts of bloat at the same point. It may not be that significant, but it's also not a fluke.

--
Peter Geoghegan
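[The exact query that produced the output above isn't included in the thread. The following is an approximate reconstruction of how such a per-relation summary can be built from pg_buffercache; the column aliases mirror the output shown, but treat the query itself as an assumption rather than the one actually used.]

    -- Approximate reconstruction: summarize shared_buffers usage per
    -- relation, alongside each relation's main fork size in blocks.
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    SELECT c.relname,
           b.relforknumber,
           pg_relation_size(c.oid, 'main') /
               current_setting('block_size')::int AS size_main_rel_fork_blocks,
           count(*) AS buffer_count,
           avg(b.usagecount) AS avg_buffer_usg
    FROM pg_buffercache b
    JOIN pg_class c ON pg_relation_filenode(c.oid) = b.relfilenode
    WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                WHERE datname = current_database()))
    GROUP BY c.oid, c.relname, b.relforknumber
    ORDER BY buffer_count DESC;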
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Mon, Mar 18, 2019 at 10:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The big difference is that you make the possible call to
> _bt_stepright() conditional on this being a checkingunique index --
> the duplicate code is indented in that branch of _bt_findinsertloc().
> Whereas I break early in the loop when "checkingunique &&
> heapkeyspace".

Heikki and I discussed this issue privately, over IM, and reached final agreement on remaining loose ends. I'm going to use his code for _bt_findinsertloc(). Plan to push a final version of the first four patches tomorrow morning PST.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Tue, Mar 19, 2019 at 4:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Heikki and I discussed this issue privately, over IM, and reached
> final agreement on remaining loose ends. I'm going to use his code for
> _bt_findinsertloc(). Plan to push a final version of the first four
> patches tomorrow morning PST.

I've committed the first 4 patches. Many thanks to Heikki for his very valuable help! Thanks also to the other reviewers.

I'll likely push the remaining two patches on Sunday or Monday.

--
Peter Geoghegan
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've committed the first 4 patches. Many thanks to Heikki for his very
> valuable help! Thanks also to the other reviewers.
>
> I'll likely push the remaining two patches on Sunday or Monday.

I noticed that if I initdb and run "make installcheck" with and without the "split after new tuple" optimization patch, the largest system catalog indexes shrink quite noticeably:

Master
======
pg_depend_depender_index 1456 kB
pg_depend_reference_index 1416 kB
pg_class_tblspc_relfilenode_index 224 kB

Patch
=====
pg_depend_depender_index 1088 kB -- ~25% smaller
pg_depend_reference_index 1136 kB -- ~20% smaller
pg_class_tblspc_relfilenode_index 160 kB -- 28% smaller

This is interesting to me because it is further evidence that the problem that the patch targets is reasonably common. It's also interesting to me because we benefit despite the fact that there are a lot of duplicates in parts of these indexes; we vary our strategy at different parts of the key space, which works well. We pack pages tightly where they're full of duplicates, using the "single value" strategy that I've already committed, whereas we apply the "split after new tuple" optimization in parts of the index with localized monotonically increasing insertions. If there were no duplicates in the indexes, then they'd be about 40% smaller, which is exactly what we see with the TPC-C indexes (they're all unique indexes, with very few physical duplicates). Looks like the duplicates are mostly bootstrap mode entries. Lots of the pg_depend_depender_index duplicates look like "(classid, objid, objsubid)=(0, 0, 0)", for example.

I also noticed one further difference: the pg_shdepend_depender_index index grew from 40 kB to 48 kB. I guess that might count as a regression, though I'm not sure that it should. I think that we would do better if the volume of data in the underlying table was greater. contrib/pageinspect shows that a small number of the leaf pages in the improved cases are not very full at all, which is more than made up for by the fact that many more pages are packed as if they were created by a rightmost split (262 items of 24 byte tuples is exactly consistent with that). IOW, I suspect that the extra page in pg_shdepend_depender_index is due to a "local minimum".

--
Peter Geoghegan
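[A minimal sketch of the size comparison itself, run against a freshly initdb'd cluster after "make installcheck", on builds with and without the patch. The index names are the ones mentioned above.]

    -- Sketch: report the on-disk size of the system catalog indexes
    -- discussed above, to compare patched and unpatched builds.
    SELECT c.relname,
           pg_size_pretty(pg_relation_size(c.oid)) AS index_size
    FROM pg_class c
    WHERE c.relname IN ('pg_depend_depender_index',
                        'pg_depend_reference_index',
                        'pg_class_tblspc_relfilenode_index',
                        'pg_shdepend_depender_index')
    ORDER BY pg_relation_size(c.oid) DESC;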
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From
Peter Geoghegan
Date:
On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initdb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:

I pushed this final patch a week ago, as commit f21668f3, concluding work on integrating the patch series. I have some closing thoughts that I'd like to close out the project with.

I was casually discussing this project over IM with Robert the other day. I was asked a question I'd often asked myself about the "split after new tuple" heuristics: What if you're wrong? What if some "black swan" type workload fools your heuristics into bloating an index uncontrollably?

I gave an answer to his question that may have seemed kind of inscrutable. My intuition about the worst case for the heuristics is based on its similarity to the worst case for quicksort. Any real-world instance of quicksort going quadratic is essentially a case where we *consistently* do the wrong thing when selecting a pivot. A random pivot selection will still perform reasonably well, because on average the pivot will still be close to the median. A malicious actor will always be able to fool any quicksort implementation into going quadratic [1] in certain circumstances. We're defending against Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though. The attached patch removes every heuristic that limits the application of the "split after new tuple" optimization (it doesn't force the optimization in the case of rightmost splits, or in the case where the new item happens to be first on the page, since the caller isn't prepared for that). This is an attempt to come up with a wildly exaggerated worst case. Nevertheless, the consequences are not actually all that bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the patch has a final index that's about 1% larger. However, it was about 16% smaller compared to Postgres without the patch, so this is not a problem.

* Most of the TPC-C indexes are actually slightly smaller, because we didn't quite go as far as we could have (TPC-C strongly rewards this optimization). 8 out of the 10 indexes are either smaller or unchanged. The customer name index is about 28% larger, though. The oorder table index is also about 28% larger.

* TPC-E never benefits from the "split after new tuple" optimization, and yet the picture isn't so bad here either. The holding history PK is about 40% bigger, which is quite bad, and the biggest regression overall. However, in other affected cases indexes are about 15% larger, which is not that bad.

Also attached are the regressions from my test suite in the form of diff files -- these are the full details of the regressions, just in case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total certainty that there can never be a "black swan" workload that the heuristics consistently mishandle, leading to pathological performance. However, I think it's fair to say that the risk of that happening has been managed well. The attached test patch literally removes any restraint on applying the optimization, and yet we arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help, especially Heikki.
[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf -- Peter Geoghegan