From: Peter Geoghegan
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Msg-id: CAH2-Wzny5oJ7RGcTO9vT9wyzXqx-N0qivtPXm+G0tuv68rdF9Q@mail.gmail.com
In response to: Adding skip scan (including MDAM style range skip scan) to nbtree (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Mon, Nov 4, 2024 at 4:58 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> This is a review on v11, not the latest v13. I suspect most comments
> still apply, but I haven't verified this.

v11 is indeed quite similar to v13, so this shouldn't really matter.

> I'm a bit concerned about the additional operations that are being
> added to the scan. Before this patch, the amount of work in the
> "horizontal" portion of the scan was limited to user-supplied
> scankeys, so O(1) even when the index condition is only (f < 7). But,
> with this patch, we're adding work for (a=, b=, c=, etc.) for every
> tuple in the scan.

There's no question that there are still some cases where this cannot
possibly pay for itself. And that just isn't acceptable -- no
arguments here.

> As these new "skip array" keys are primarily useful for inter-page
> coordination (by determining if we want to start a primitive scan to
> skip to a different page and which value range that primitive scan
> would search for, or continue on to the next sibling), can't we only
> apply the "skip array" portion of the code at the final tuple we
> access on this page?

I plan on doing something like that. I'll need to. AFAICT I only need
to avoid wasting CPU cycles here -- there are no notable regressions
from performing excessive amounts of index descents.

And so I plan on doing this without fundamentally changing anything
about the current design. In particular, I want the performance to
remain robust in cases where the best strategy varies significantly
as the scan progresses -- even when we need to strongly favor
skipping at first, and then strongly favor staying/not skipping later
on (all during the same individual index scan). I really like that
the current design gets that part right.

> While this section already defines some things about index scans which
> seem btree-specific, I don't think we should add more references to
> btree scan internals in a section about bitmaps and bitmap index
> scans.

This section of the docs discusses the trade-off between multiple
single-column indexes and fewer multicolumn indexes. How could skip
scan not be relevant to such a discussion? One of the main benefits
of skip scan is that it'll allow users to get by with fewer indexes.

> > +++ b/src/backend/access/nbtree/nbtree.c
> [...]
> > -   slock_t     btps_mutex;     /* protects above variables, btps_arrElems */
> > +   LWLock      btps_lock;      /* protects above variables, btps_arrElems */
>
> Why is this changed to LWLock, when it's only ever acquired exclusively?

In general one should never do more than an extremely small, tightly
controlled amount of work with a spinlock held. It's now possible
that we'll allocate memory with the lock held -- doing that with a
spinlock held is an obvious no-no. We really need to use an LWLock
for this stuff now, on general principle.

> > +btestimateparallelscan(Relation rel, int nkeys, int norderbys)
>
> I notice you're using DatumSerialize. Are there reasons why we
> wouldn't want to use heap_fill_tuple, which generally produces much
> more compact output?

heap_fill_tuple doesn't support the notion of -inf and +inf scan key
sentinel values. Plus I'm inclined to use DatumSerialize because it's
more or less designed for this kind of problem.
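
To make that more concrete, here's a rough sketch of the flavor of
scheme I have in mind -- this isn't the patch's actual code, and the
BTSerializedKeyTag/serialize_array_key names are made up for
illustration:

#include "postgres.h"
#include "utils/datum.h"

typedef enum
{
    BTKEY_MINUS_INF,            /* -inf sentinel: no datum follows */
    BTKEY_PLUS_INF,             /* +inf sentinel: no datum follows */
    BTKEY_VALUE                 /* ordinary key: serialized datum follows */
} BTSerializedKeyTag;

static void
serialize_array_key(char **ptr, BTSerializedKeyTag tag,
                    Datum value, bool isnull, bool typByVal, int typLen)
{
    /* one tag byte covers the sentinel cases out-of-band */
    *(*ptr)++ = (char) tag;
    if (tag == BTKEY_VALUE)
        DatumSerialize(value, isnull, typByVal, typLen, ptr);
}

With something like that, the btestimateparallelscan space estimate
is just datumEstimateSpace() plus a tag byte per key -- conservative,
but simple.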
> Also, I think you can limit the space usage to BLCKSZ in total,
> because a full index tuple can't be larger than 1/3rd of a block; and
> for skip scans we'll only have known equality bounds for a prefix of
> attributes available in the index tuples, and a single (?)
> index-produced dynamic attribute we want to skip ahead of. So, IIUC,
> at most we'll have 2 index tuples' worth of data, or 2/3 BLCKSZ.
> Right?

Possibly, but who wants to take a chance? The scheme you're
describing only saves memory when there are 3 skip arrays, which is
fairly unlikely in general. I think that the approach taken to
serializing the array keys should be as conservative as possible.

It's not particularly likely that we'll want to do a parallel skip
scan. It's rather hard to test those code paths.

> I needed to look up what this 'cons up' thing is, as it wasn't
> something that I'd seen before. It also seems used exclusively in
> btree code, and only after the array keys patch, so I think it'd be
> better in general to use 'construct' here.

FWIW I wasn't the first person to use the term in the nbtree code. I
think you're right, though. It is a needlessly obscure term that is
only known to Lisp hackers. I'll fix it.

> > +++ b/src/backend/access/nbtree/nbtcompare.c
>
> The changes here are essentially 6x the same code, but for different
> types. What do you think about the attached
> 0001-Deduplicate[...].patch.txt, which has the same effect but with
> only 1 copy of the code checked in?

Reminds me of the approach taken by this extension:

https://github.com/petere/pguint

I do find the need to write so much boilerplate code for B-Tree
opclasses annoying. I also find it annoying that the nbtree code
insists on being as forgiving as possible with incomplete opfamilies.
But those problems seem out of scope here -- not like I'm really
making it any worse.

> > +++ b/src/backend/access/nbtree/nbtutils.c
> [...]
> > +_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
>
> Why does this stop processing keys after hitting a row compare?

Why not? It's not ideal, but there are a number of things about
RowCompare scan keys that are already less than ideal. We don't even
try to do any kind of preprocessing that involves RowCompares --
they're already a slightly awkward special case in the nbtree code.

> Doesn't skip scan still apply to any subsequent normal keys? E.g.
> "c=1" creates a scan "a=skip, b=skip, c=1", so "(a, b)>(1, 2), c=1"
> should IMO still allow a skip scan for a=skip, b=1 to be constructed -
> it shouldn't be that we get a much less specific (and potentially,
> performant) scan just by adding a rowcompare scankey on early
> attributes.

It's just awkward to get it to work as expected, while still
preserving all of the useful properties of the design.

Your "(a, b)>(1,2)" scan key returns rows matching:

WHERE (a = 1 AND b > 2) OR (a > 1)

And so your complete "(a, b)>(1,2) AND c = 1" qual returns rows
matching:

WHERE ((a = 1 AND b > 2) OR (a > 1)) AND c = 1

It is difficult to imagine how the existing design for skip arrays
can be extended to support this kind of qual. I guess we'd still need
skip arrays on both "a" and "b" here. Right? The "b" skip array would
be restricted to a range of values between 3 and +inf inclusive, but
only while we were still on the "a" skip array's first array element
(in other words, "b" only has a valid low_compare while "a = 1").
Otherwise (i.e. when "a > 1"), the "b" skip array wouldn't be
constrained by any low_compare inequality.
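
To see why that's awkward, consider what a conditionally constrained
skip array would even have to look like. A purely hypothetical sketch
(nothing like this struct exists in the patch):

#include "postgres.h"
#include "access/skey.h"

/*
 * Hypothetical only -- illustrates why conditional bounds complicate
 * the design. The "b" skip array's lower bound would no longer stand
 * on its own; it would only be in force while some earlier skip array
 * remained on a particular element.
 */
typedef struct BTConditionalBound
{
    ScanKey     low_compare;        /* e.g. "b > 2" */
    AttrNumber  dependent_attno;    /* the "a" attribute */
    bool        first_elem_only;    /* bound applies only while a = 1 */
} BTConditionalBound;

Every place that currently treats low_compare/high_compare as
standalone facts about one array would then have to consult the
current element of some earlier array.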
So low_compare/high_compare only apply conditionally, in a world
where we support these kinds of RowCompare quals. Right now I can
avoid the problem by refusing to allow "c = 1" to ever be marked
required (by never creating any skip array on an index attribute >=
an attribute with a RowCompare key on input).

Obviously, the current design of skip arrays involves arrays that are
always constrained by the same range/low_compare and high_compare
inequalities, independent of any other factor/any wider context. It's
not impossible to make something like your RowCompare case work, but
it'd be very significantly more complicated than the existing design.
Though not all that much more useful.

Doing something like this might make sense in the context of a
project that adds support for the MDAM paper's "General OR
Optimization" transformations -- RowCompare support would only be a
bonus. I don't see any opportunities to target RowCompare as
independent work -- supporting RowCompare quals doesn't seem to be
significantly simpler than supporting very general MDAM OR
optimizations.

> > _bt_preprocess_array_keys
> > -       output_ikey = 0;
> > +       numArrayKeyData,
> > +       numSkipArrayKeys;
>
> I don't think numArrayKeyData/arrayKeyData are good names here, as it
> confused me many times reviewing this function's changes.

The code in this area actually did change recently, though I didn't
announce it.

> On a scankey
> of a=1,b=2 we won't have any array keys, yet this variable is set to
> 2. Something like numOutputKeys is probably more accurate.

That's not accurate. _bt_preprocess_array_keys() sets
*new_numberOfKeys on output. When there are no array keys (of either
type) to be output, _bt_preprocess_array_keys will just return NULL
-- it won't have changed the *new_numberOfKeys passed by its
_bt_preprocess_keys caller when it returns NULL.

In other words, when _bt_preprocess_array_keys determines that there
are no array keys to process (neither SAOP arrays nor skip arrays),
it won't modify anything, and won't return an alternative
input-to-_bt_preprocess_keys scan key array. _bt_preprocess_keys
will work directly off of the scan->keyData[] input scan keys passed
by the executor proper.

> > +       /* Create a skip array and scan key where indicated by skipatts */
> > +       while (numSkipArrayKeys &&
> > +              attno_skip <= scan->keyData[input_ikey].sk_attno)
> > +       {
> > +           Oid         opcintype = rel->rd_opcintype[attno_skip - 1];
> > +           Oid         collation = rel->rd_indcollation[attno_skip - 1];
> > +           Oid         eq_op = skipatts[attno_skip - 1].eq_op;
> > +           RegProcedure cmp_proc;
> > +
> > +           if (!OidIsValid(eq_op))
> > +           {
> > +               /* won't skip using this attribute */
> > +               attno_skip++;
>
> Isn't this branch impossible, given that numSkipArrayKeys is output
> from _bt_decide_skipatts, whose output won't contain skipped
> attributes which have eq_op=InvalidOid? I'd replace this with
> Assert(OidIsValid(eq_op)).

It's very possible to hit this code path -- we'll hit it every time
an explicit "=" input scan key appears before an attribute that
requires a skip array (which happens whenever there's a third, later
column that we'll be able to mark required thanks to the skip array).

Note that BTSkipPreproc.eq_op is the "= op to be used to add array,
if any". And so when we see !OidIsValid(eq_op) in this loop, it means
"this is for an index attribute that we don't want to add a skip
array to" (though, as I said, a later attribute is expected to get a
skip array when this happens).
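
A worked example (the specifics here are my own illustration, not
comments taken from the patch):

/*
 * Index on (a, b, c), qual "WHERE a = 5 AND c = 1". The decisions
 * made by _bt_decide_skipatts would come out as:
 *
 *   skipatts[0].eq_op = InvalidOid    -- "a" already has an explicit "="
 *   skipatts[1].eq_op = <int4 "=">    -- "b" gets a skip array
 *   skipatts[2].eq_op = InvalidOid    -- "c" already has an explicit "="
 *
 * The loop above still walks attribute "a" before reaching "b", so
 * the !OidIsValid(eq_op) branch is taken for "a" -- perfectly
 * reachable. The payoff is the skip array on "b", which is what
 * allows the "c = 1" key to be marked required.
 */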
> > _bt_rewind_nonrequired_arrays
>
> What types of scan keys can still generate non-required array keys?

Right now it's:

1. As you said, certain cases involving RowCompare scan keys.

2. Cases involving B-Tree operator classes that don't even have a "="
operator (yes, technically those are supported!).

Also seems possible that I'll end up relying on our support for
non-required arrays when I go fix those regressions. Seems possible
that I'll get rid of the explicit SK_BT_REQFWD/SK_BT_REQBKWD
markings, and go back to treating scan keys as required based on
context (a little like how things were prior to commit 7ccaf13a06).

> It seems to me those are now mostly impossible, as this patch generates
> required skip arrays for all attributes that don't yet have an
> equality key and which are ahead of any (in)equality keys, except the
> case with row compare keys which I already commented on above.

I agree that non-required arrays become an obscure edge case with the
patch, having been reasonably common before now (well, they were
common in Postgres 17). I don't think that provides us with any
opportunities to get rid of unneeded code, though.

> > utils/skipsupport.[ch]
>
> I'm not sure why this is included in utils - isn't this exclusively
> used in access/nbtree/*?

The location of skip support is based on (though slightly different
from) the location of sort support. In general many B-Tree opclasses
are implemented in or around src/backend/utils/adt/*. The exception
is all of the stuff in nbtcompare.c, though I always found that weird
(I wouldn't mind getting rid of nbtcompare.c by relocating its code
to places like int.c and int8.c).

> > +++ b/src/include/access/nbtree.h
>
> BTArrayKeyInfo explodes in size, from 24B to 88B. I think some of that
> is necessary, but should it really be that large?

I'm disinclined to do anything about it right now. I'll make a note
of it, and review when the most important performance problems are
fixed.

Thanks for the review
--
Peter Geoghegan