Thread: memory layouts for binary search in nbtree
Hi,

currently we IIRC use linearly sorted datums for the search in individual btree nodes. Not surprisingly, that's often one of the dominant entries in profiles. We could probably improve upon that by using an order more optimized for efficient binary search. See e.g. http://cglab.ca/~morin/misc/arraylayout/ for benchmarks showing the benefits.

Andres
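[Editor's note: the page linked above benchmarks cache-friendly array layouts, the best-known of which is the Eytzinger (BFS) order. A minimal sketch of the idea in plain C, with ints standing in for datums and nothing nbtree-specific (`__builtin_ffs` is a GCC/Clang builtin):]

```c
/* Rebuild sorted[0..n-1] into Eytzinger (BFS) order in a[1..n]: the
 * implicit tree root lives at index 1 and the children of node k at
 * 2k and 2k+1, so the first few probes of every search share a small
 * number of hot cache lines. */
static int
eytzinger_build(const int *sorted, int *a, int n, int i, int k)
{
    if (k <= n)
    {
        i = eytzinger_build(sorted, a, n, i, 2 * k);
        a[k] = sorted[i++];
        i = eytzinger_build(sorted, a, n, i, 2 * k + 1);
    }
    return i;
}

/* Return the index in a[1..n] of the smallest key >= x, or 0 if every
 * key is smaller. */
static int
eytzinger_search(const int *a, int n, int x)
{
    int         k = 1;

    while (k <= n)
        k = 2 * k + (a[k] < x);
    /* Trailing 1-bits record the consecutive "right" turns taken at
     * the bottom of the descent; shifting them off lands on the
     * lower-bound element. */
    k >>= __builtin_ffs(~k);
    return k;
}
```

The descent is branch-free apart from the loop itself, which is part of why the linked benchmarks show it beating a textbook binary search on the same data.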
On 18 May 2016 at 14:25, Andres Freund <andres@anarazel.de> wrote:
> currently we IIRC use linearly sorted datums for the search in
> individual btree nodes. Not surprisingly that's often one of the
> dominant entries in profiles. We could probably improve upon that by
> using an order more optimized for efficient binary search.
> See e.g. http://cglab.ca/~morin/misc/arraylayout/ for benchmarks
> showing benefits.
There is also some work from more than 10 years ago on cache-conscious btree layouts; on some benchmarks that led to the adoption of 64kB pages.

I think it's a good area of work.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 18, 2016 at 6:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I think it's a good area of work.

I strongly agree.

Abbreviated keys in indexes are supposed to help with this. Basically, the ItemId array is made to be interlaced with small abbreviated keys (say one or two bytes), only in the typically less than 1% of pages that are internal (leaf page binary searches don't change). Those internal pages naturally have a wide range of values represented, so 1 byte turns out to be a lot more than you'd think. And you only have to generate a new one when there is a page split, which is relatively infrequent. You could squeeze out the lp_len bits to fit the abbreviated keys, and store the length in the IndexTuple proper.

I've discussed this idea with Masahiko extensively in private. I have lots of related ideas, and think it's a very promising area. I think that this project will be very difficult without better testing.

This idea also enables complementary techniques, like interpolation search that can degrade to binary search.

--
Peter Geoghegan
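[Editor's note: a hedged sketch of the comparison order described above. The `AbbrevItemId` struct and all names here are hypothetical illustrations, not the actual nbtree data structures: the search compares a one-byte abbreviated key stored alongside each item pointer and falls back to the full key, stored elsewhere on the page, only when the abbreviation cannot decide.]

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical line-pointer entry modelling the proposed interlacing
 * of a 1-byte abbreviated key with each internal-page item pointer. */
typedef struct
{
    uint8_t     abbrev;     /* first byte of the full key */
    uint16_t    offset;     /* where the full key lives on the page */
} AbbrevItemId;

/* Full-key comparison; only reached when the abbreviation ties. */
static int
full_compare(const char *page, uint16_t offset, const char *key)
{
    return strcmp(page + offset, key);
}

/* Return the number of items whose key sorts before "key".  The hot
 * loop usually touches only the small, densely packed AbbrevItemId
 * array, not the tuples scattered across the page. */
static int
abbrev_binsrch(const AbbrevItemId *items, int nitems,
               const char *page, const char *key)
{
    uint8_t     kabbrev = (uint8_t) key[0];
    int         lo = 0, hi = nitems;

    while (lo < hi)
    {
        int         mid = (lo + hi) / 2;
        int         cmp;

        if (items[mid].abbrev != kabbrev)
            cmp = (items[mid].abbrev < kabbrev) ? -1 : 1;   /* cheap */
        else
            cmp = full_compare(page, items[mid].offset, key);

        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```

Because internal pages cover a wide key range, most probes are resolved by the single-byte comparison and never dereference a tuple at all, which is where the cache-miss savings would come from.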
On Wed, May 18, 2016 at 6:25 AM, Andres Freund <andres@anarazel.de> wrote:
> currently we IIRC use linearly sorted datums for the search in
> individual btree nodes. Not surprisingly that's often one of the
> dominant entries in profiles. We could probably improve upon that by
> using an order more optimized for efficient binary search.

Did you ever try running a pgbench SELECT benchmark, having modified things such that all PKs are on columns that are not of type int4/int8, but rather are of type numeric? It's an interesting experiment that I've been meaning to re-run on a big box.

Obviously this will be slower than an equivalent plain pgbench SELECT, but the difference may be smaller than you expect.

--
Peter Geoghegan
On Thu, May 19, 2016 at 7:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Abbreviated keys in indexes are supposed to help with this. Basically,
> the ItemId array is made to be interlaced with small abbreviated keys
> (say one or two bytes), only in the typically less than 1% of pages
> that are internal (leaf page binary searches don't change).

I looked at this again recently. I wrote a patch to prove to myself that we can fairly easily reclaim 15 bits from every nbtree internal page ItemId, and put an abbreviated key there instead. The patch has nbtree tell PageAddItem() to store an abbreviated key within lp_len (actually, we call PageAddItemAbbrev() now). I didn't realize that stealing lp_len might be feasible until recently, but it appears to be -- this is a lot simpler than other approaches might be.

I already implemented a rudimentary, POC encoding scheme for int4, but text is the datatype that I'd expect to see a real benefit for. While this POC patch of mine could easily have bugs, it is at least true that "make check-world" passes for me.

For nbtree, reclaiming the lp_len space was just a matter of teaching a small number of existing places to go to the IndexTuple to get a tuple's length, rather than trusting an ItemId's lp_len field (that is, rather than calling ItemIdGetLength()). Most nbtree-related code already gets the length from the index tuple header today, since it's pretty rare to care about the length of a tuple but not its contents. This did require updating some generic routines within bufpage.c too, but that wasn't so bad.

Can you suggest a workload/hardware to assess the benefits of an optimization like this, Andres? Is there a profile available somewhere in the archives that shows many cache misses within _bt_binsrch()?

--
Peter Geoghegan
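[Editor's note: the ItemIdData bitfield layout below matches the one in PostgreSQL's storage/itemid.h, and INDEX_SIZE_MASK matches access/itup.h; the abbreviated-key accessors and the simplified IndexTupleData (the real struct also carries a heap TID before t_info) are a hypothetical sketch of the POC scheme described above, not code from the patch.]

```c
#include <stdint.h>

/* On-page line pointer: three bitfields packed into 32 bits. */
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/* Simplified index tuple header; the low 13 bits of t_info hold the
 * tuple size, which is why lp_len is redundant for index tuples. */
#define INDEX_SIZE_MASK 0x1FFF

typedef struct IndexTupleData
{
    uint16_t    t_info;         /* size + flag bits */
    /* key data follows */
} IndexTupleData;

/* Hypothetical accessors: on internal pages, lp_len no longer stores
 * a length but a 15-bit abbreviated key... */
static inline void
ItemIdSetAbbrev(ItemIdData *itemid, uint16_t abbrev)
{
    itemid->lp_len = abbrev & 0x7FFF;   /* 15 bits available */
}

static inline uint16_t
ItemIdGetAbbrev(const ItemIdData *itemid)
{
    return itemid->lp_len;
}

/* ...and the true length is read from the tuple's own header, i.e.
 * the replacement for trusting ItemIdGetLength(). */
static inline uint16_t
IndexTupleSize(const IndexTupleData *itup)
{
    return itup->t_info & INDEX_SIZE_MASK;
}
```

The sketch shows why the trick is "free" space-wise: the length was stored twice, once in the line pointer and once in the tuple header, so one copy can be repurposed.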
Hi,

On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
> I looked at this again recently. I wrote a patch to prove to myself
> that we can fairly easily reclaim 15 bits from every nbtree internal
> page ItemId, and put an abbreviated key there instead.

Interesting. Not sure, however, that really addresses my layout / cache efficiency concern? Or is that just a largely independent optimization, except it's affecting the same area of code?

> Can you suggest a workload/hardware to assess the benefits of an
> optimization like this, Andres? Is there a profile available somewhere
> in the archives that shows many cache misses within _bt_binsrch()?

I don't quite remember what triggered my report, but it's quite commonly visible in any workload that has btrees above toy sizes but still fitting in shared_buffers. Obviously you need something where btree lookups are a significant share of the time, but that's easy enough with index nested loop joins and such. I do recall seeing it recently-ish in a number of TPC-H queries.

Greetings,

Andres Freund
On 2016-05-19 19:38:02 -0700, Peter Geoghegan wrote:
> On Wed, May 18, 2016 at 6:25 AM, Andres Freund <andres@anarazel.de> wrote:
> > currently we IIRC use linearly sorted datums for the search in
> > individual btree nodes. Not surprisingly that's often one of the
> > dominant entries in profiles. We could probably improve upon that by
> > using an order more optimized for efficient binary search.
>
> Did you ever try running a pgbench SELECT benchmark, having modified
> things such that all PKs are on columns that are not of type
> int4/int8, but rather are of type numeric? It's an interesting
> experiment, that I've been meaning to re-run on a big box.
>
> Obviously this will be slower than an equivalent plain pgbench SELECT,
> but the difference may be smaller than you expect.

I'm not sure what that has to do with the topic?

- Andres
On Tue, Jun 27, 2017 at 11:04 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-06-27 10:57:15 -0700, Peter Geoghegan wrote:
>> I looked at this again recently. I wrote a patch to prove to myself
>> that we can fairly easily reclaim 15 bits from every nbtree internal
>> page ItemId, and put an abbreviated key there instead.
>
> Interesting. Not sure however that really addresses my layout / cache
> efficiency concern? Or is that just a largely independent optimization,
> except it's affecting the same area of code?

Well, you'd only do this on internal pages, which are a tiny minority of the total, and yet are where the majority of binary searches for an index scan occur in practice. The optimization has the effect of making the binary search only need to access the much smaller ItemId array, in the best case. In the best case, where you resolve all comparisons on internal pages, you still have to get the index tuple whose TID you need to follow to go to a page on the next level down, once the binary search for an internal page actually finds it. But that's all.

In the best case, and maybe the average case, this could be highly effective, I think. There would definitely be cases where the optimization wouldn't help at all, but hopefully it would also not hurt.

--
Peter Geoghegan
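[Editor's note: a sketch of that best-case descent step, with entirely hypothetical names and layout (in particular, the assumption that the downlink is the first field of an internal tuple is for illustration only): after the search has picked an item using only the line-pointer/abbreviated-key array, the tuple is touched exactly once, to read the downlink.]

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;   /* stand-in for Postgres's BlockNumber */

/* Happy-path descent: "chosen" was decided without dereferencing any
 * tuple, so this is at most one extra cache miss per internal page. */
static BlockNumber
step_down(const char *page, const uint16_t *offsets, int chosen)
{
    BlockNumber child;

    /* memcpy avoids alignment assumptions about where tuples start */
    memcpy(&child, page + offsets[chosen], sizeof(child));
    return child;
}
```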
On Tue, Jun 27, 2017 at 11:05 AM, Andres Freund <andres@anarazel.de> wrote:
>> Did you ever try running a pgbench SELECT benchmark, having modified
>> things such that all PKs are on columns that are not of type
>> int4/int8, but rather are of type numeric? It's an interesting
>> experiment, that I've been meaning to re-run on a big box.
>>
>> Obviously this will be slower than an equivalent plain pgbench SELECT,
>> but the difference may be smaller than you expect.
>
> I'm not sure what that has to do with the topic?

It suggests that cache misses are much more important than anything else. Even with collated text, the difference is not so large, IIRC. Though it was noticeably larger.

--
Peter Geoghegan
On 2017-06-27 11:13:38 -0700, Peter Geoghegan wrote:
> On Tue, Jun 27, 2017 at 11:04 AM, Andres Freund <andres@anarazel.de> wrote:
> > Interesting. Not sure however that really addresses my layout / cache
> > efficiency concern? Or is that just a largely independent optimization,
> > except it's affecting the same area of code?
>
> Well, you'd only do this on internal pages, which are a tiny minority
> of the total, and yet are where the majority of binary searches for an
> index scan occur in practice. The optimization has the effect of
> making the binary search only need to access the much smaller ItemId
> array in that best case. In the best case, where you resolve all
> comparisons on internal pages, you still have to get the index tuple
> that you need to follow the TID of to go to a page on the next level
> down, once the binary search for an internal page actually finds it.
> But that's all.
>
> In the best case, and maybe the average case, this could be highly
> effective, I think. There would definitely be cases where the
> optimization wouldn't help at all, but hopefully it would also not
> hurt.

In other words, it's an independent optimization. That's cool, but I'd rather talk about it in an independent thread, to avoid conflating the issues.