Thread: btree: downlink right separator/HIKEY optimization
(now really to -hackers) Hi, Over at [0] I'd implemented an optimization that allows us to skip calling _bt_compare in _bt_moveright in many common cases. This patch, when stacked on top of the prefix truncation patch, improves INSERT performance by an additional 2-9%pt, with an extreme case of 45% in the worscase index tests at [0]. The optimization is that we now recognze that our page split algorithm all but guarantees that the HIKEY matches this page's downlink's right separator key bytewise, excluding the data stored in the IndexTupleData struct. By caching the right separator index tuple in _bt_search, we can compare the downlink's right separator and the HIKEY, and when they are equal (memcmp() == 0) we don't have to call _bt_compare - the HIKEY is known to be larger than the scan key, because our key is smaller than the right separator, and thus transitively also smaller than the HIKEY because it contains the same data. As _bt_compare can call expensive user-provided functions, this can be a large performance boon, especially when there are only a small number of column getting compared on each page (e.g. index tuples of many 100s of bytes, or dynamic prefix truncation is enabled). By adding this, the number of _bt_compare calls per _bt_search is often reduced by one per btree level. Kind regards, Matthias van de Meent Neon (https://neon.tech) PS. Best served with dynamic prefix truncation [1] and btree specialization [0]. [0] https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8snuw@mail.gmail.com [1] https://www.postgresql.org/message-id/flat/CAEze2Wh-h20DmPSMXp4qHR0-ykh9=Z3ejX8MSsbikbOqaYe_OQ@mail.gmail.com
Attachment
On 01/11/2023 00:08, Matthias van de Meent wrote: > calling _bt_compare in _bt_moveright in many common cases. This patch, > when stacked on top of the prefix truncation patch, improves INSERT > performance by an additional 2-9%pt, with an extreme case of 45% in > the worscase index tests at [0]. > > The optimization is that we now recognze that our page split algorithm > all but guarantees that the HIKEY matches this page's downlink's right > separator key bytewise, excluding the data stored in the > IndexTupleData struct. Good observation. > By caching the right separator index tuple in _bt_search, we can > compare the downlink's right separator and the HIKEY, and when they > are equal (memcmp() == 0) we don't have to call _bt_compare - the > HIKEY is known to be larger than the scan key, because our key is > smaller than the right separator, and thus transitively also smaller > than the HIKEY because it contains the same data. As _bt_compare can > call expensive user-provided functions, this can be a large > performance boon, especially when there are only a small number of > column getting compared on each page (e.g. index tuples of many 100s > of bytes, or dynamic prefix truncation is enabled). What would be the worst case scenario for this? One situation where the memcmp() would not match is when there is a concurrent page split. I think it's OK to pessimize that case. Are there any other situations? When the memcmp() matches, I think this is almost certainly not slower than calling the datatype's comparison function. > if (offnum < PageGetMaxOffsetNumber(page)) > { > ItemId rightsepitem = PageGetItemId(page, offnum + 1); > IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem); > memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem)); > rightsep = &rsepbuf.tuple; > } > else if (!P_RIGHTMOST(opaque)) > { > /* > * The rightmost data tuple on inner page has P_HIKEY as its right > * separator. > */ > ItemId rightsepitem = PageGetItemId(page, P_HIKEY); > IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem); > memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem)); > rightsep = &rsepbuf.tuple; > } This could use a one-line comment above this, something like "Remember the right separator of the downlink we follow, to speed up the next _bt_moveright call". Should there be an "else rightsep = NULL;" here? Is it possible that we follow the non-rightmost downlink on a higher level and rightmost downlink on next level? Concurrent page deletion? Please update the comment above _bt_moveright to describe the new argument. Perhaps the text from README should go there, this feels like a detail specific to _bt_search and _bt_moveright. -- Heikki Linnakangas Neon (https://neon.tech)
On Wed, 1 Nov 2023 at 03:38, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > (now really to -hackers) > Hi, > > Over at [0] I'd implemented an optimization that allows us to skip > calling _bt_compare in _bt_moveright in many common cases. This patch, > when stacked on top of the prefix truncation patch, improves INSERT > performance by an additional 2-9%pt, with an extreme case of 45% in > the worscase index tests at [0]. > > The optimization is that we now recognze that our page split algorithm > all but guarantees that the HIKEY matches this page's downlink's right > separator key bytewise, excluding the data stored in the > IndexTupleData struct. > > By caching the right separator index tuple in _bt_search, we can > compare the downlink's right separator and the HIKEY, and when they > are equal (memcmp() == 0) we don't have to call _bt_compare - the > HIKEY is known to be larger than the scan key, because our key is > smaller than the right separator, and thus transitively also smaller > than the HIKEY because it contains the same data. As _bt_compare can > call expensive user-provided functions, this can be a large > performance boon, especially when there are only a small number of > column getting compared on each page (e.g. index tuples of many 100s > of bytes, or dynamic prefix truncation is enabled). > > By adding this, the number of _bt_compare calls per _bt_search is > often reduced by one per btree level. CFBot shows the following compilation error at [1]: [16:56:22.153] FAILED: src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj [16:56:22.153] "cl" "-Isrc\backend\postgres_lib.a.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "/MDd" "/FIpostgres_pch.h" "/Yupostgres_pch.h" "/Fpsrc\backend\postgres_lib.a.p\postgres_pch.pch" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "-DBUILDING_DLL" "/FS" "/FdC:\cirrus\build\src\backend\postgres_lib.pdb" /Fosrc/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj "/c" ../src/backend/access/nbtree/nbtsearch.c [16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error C2143: syntax error: missing ';' before 'type' [16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning C4091: ' ': ignored on left of 'int' when no variable is declared [1] - https://cirrus-ci.com/task/4634619035779072 Regards, Vignesh
On Tue, 5 Dec 2023 at 08:43, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 01/11/2023 00:08, Matthias van de Meent wrote: > > By caching the right separator index tuple in _bt_search, we can > > compare the downlink's right separator and the HIKEY, and when they > > are equal (memcmp() == 0) we don't have to call _bt_compare - the > > HIKEY is known to be larger than the scan key, because our key is > > smaller than the right separator, and thus transitively also smaller > > than the HIKEY because it contains the same data. As _bt_compare can > > call expensive user-provided functions, this can be a large > > performance boon, especially when there are only a small number of > > column getting compared on each page (e.g. index tuples of many 100s > > of bytes, or dynamic prefix truncation is enabled). > > What would be the worst case scenario for this? One situation where the > memcmp() would not match is when there is a concurrent page split. I > think it's OK to pessimize that case. Are there any other situations? There is also concurrent page deletion which can cause downlinked pages to get removed from the set of accessible pages, but that's quite rare, too: arguably even more rare than page splits. > When the memcmp() matches, I think this is almost certainly not slower > than calling the datatype's comparison function. > > > if (offnum < PageGetMaxOffsetNumber(page)) > > [...] > > else if (!P_RIGHTMOST(opaque)) > > [...] > > } > > This could use a one-line comment above this, something like "Remember > the right separator of the downlink we follow, to speed up the next > _bt_moveright call". Done. > Should there be an "else rightsep = NULL;" here? Is it possible that we > follow the non-rightmost downlink on a higher level and rightmost > downlink on next level? Concurrent page deletion? While possible, the worst this could do is be less efficient in those fringe cases: The cached right separator is a key that is known to compare larger than the search key and thus always correct to use as an optimization for "is this HIKEY larger than my search key", as long as we don't clobber the data in that cache (which we don't). Null-ing the argument, while not incorrect, could be argued to be worse than useless here, as the only case where NULL may match an actual highkey is on the rightmost page, which we already special-case in _bt_moveright before hitting the 'compare the highkey' code. Removal of the value would thus remove any chance of using the optimization after hitting the rightmost page in a layer below. I've added a comment to explain this in an empty else block in the attached version 2 of the patch. > Please update the comment above _bt_moveright to describe the new > argument. Perhaps the text from README should go there, this feels like > a detail specific to _bt_search and _bt_moveright. Done. Thank you for the review. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Attachment
On Sat, 6 Jan 2024 at 16:40, vignesh C <vignesh21@gmail.com> wrote: > > CFBot shows the following compilation error at [1]: > [16:56:22.153] FAILED: > src/backend/postgres_lib.a.p/access_nbtree_nbtsearch.c.obj > [...] > ../src/backend/access/nbtree/nbtsearch.c > [16:56:22.153] ../src/backend/access/nbtree/nbtsearch.c(112): error > C2143: syntax error: missing ';' before 'type' > [16:56:22.280] ../src/backend/access/nbtree/nbtsearch.c(112): warning > C4091: ' ': ignored on left of 'int' when no variable is declared I forgot to address this in the previous patch, so here's v3 which fixes the issue warning. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Attachment
On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I forgot to address this in the previous patch, so here's v3 which > fixes the issue warning. What benchmarking have you done here? Have you tried just reordering things in _bt_search() instead? If we delay the check until after the binary search, then the result of the binary search is usually proof enough that we cannot possibly need to move right. That has the advantage of not requiring that we copy anything to the stack. Admittedly, it's harder to make the "binary search first" approach work on the leaf level, under the current code structure. But maybe that doesn't matter very much. And even if it does matter, maybe we should just move the call to _bt_binsrch() that currently takes place in _bt_first into _bt_search() itself -- so that _bt_binsrch() is strictly under the control of _bt_search() (obviously not doable for non-_bt_first callers, which need to call _bt_binsrch_insert instead). This whole approach will have been made easier by the refactoring I did late last year, in commit c9c0589fda. -- Peter Geoghegan
On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > What benchmarking have you done here? I think that the memcmp() test is subtly wrong: > + if (PointerIsValid(rightsep)) > + { > + /* > + * Note: we're not in the rightmost page (see branchpoint earlier in > + * the loop), so we always have a HIKEY on this page. > + */ > + ItemId hikeyid = PageGetItemId(page, P_HIKEY); > + IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid); > + > + if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) && > + memcmp(&highkey[1], &rightsep[1], > + IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0) > + { > + break; > + } > + } Unlike amcheck's bt_pivot_tuple_identical, your memcmp() does not compare relevant metadata fields from struct IndexTupleData. It wouldn't make sense for it to compare the block number, of course (if it did then the optimization would simply never work), but ISTM that you still need to compare ItemPointerData.ip_posid. Suppose, for example, that you had two very similar pivot tuples from a multicolumn index on (a int4, b int2) columns. The first/earlier tuple is (a,b) = "(42, -inf)", due to the influence of suffix truncation. The second/later tuple is (a,b) = "(42, 0)", since suffix truncation couldn't take place when the second pivot tuple was created. (Actually, suffix truncation would have been possible, but it would have only managed to truncate-away the tiebreak heap TID attribute value in the case of our second tuple). There'll be more alignment padding (more zero padding bytes) in the second tuple, compared to the first. But the tuples are still the same size. When you go to you memcmp() this pair of tuples using the approach in v3, the bytes that are actually compared will be identical, despite the fact that these are really two distinct tuples, with distinct values. (As I said, you'd have to actually compare the ItemPointerData.ip_posid metadata to notice this small difference.) -- Peter Geoghegan
On Fri, 8 Mar 2024 at 20:14, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > I forgot to address this in the previous patch, so here's v3 which > > fixes the issue warning. > > What benchmarking have you done here? I have benchmarked this atop various versions of master when it was part of the btree specialization patchset, where it showed a 2-9% increase in btree insert performance over the previous patch in the patchset on the various index types in that set. More recently, on an unlogged pgbench with foreign keys enabled (-s400 -j4 -c8) I can't find any obvious regressions (it gains 0-0.7% on master across 5-minute runs), while being 4.5% faster on inserting data on a table with an excessively bad index shape (single index of 10 columns of empty strings with the non-default "nl-BE-x-icu" collation followed by 1 random uuid column, inserted from a 10M row dataset. Extrapolation indicates this could indeed get over 7% improvement when the index shape is 31 nondefault -collated nonnull text columns and a single random ID index column). > Have you tried just reordering things in _bt_search() instead? If we > delay the check until after the binary search, then the result of the > binary search is usually proof enough that we cannot possibly need to > move right. That has the advantage of not requiring that we copy > anything to the stack. I've not tried that, because it would makes page-level prefix truncation more expensive by ~50%: With this patch, we need only 2 full tuple _bt_compares per page before we can establish a prefix, while without this patch (i.e. if we did a binsrch-first approach) we'd need 3 on average (assuming linearly randomly distributed accesses). Because full-key compares can be arbitrarily more expensive than normal attribute compares, I'd rather not have that 50% overhead. > > On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan <pg@bowt.ie> wrote: > > What benchmarking have you done here? > I think that the memcmp() test is subtly wrong: Good catch, it's been fixed in the attached version, using a new function. Kind regards, Matthias van de Meent.