Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date:
Msg-id: CAH2-WznZ5ebfv96DqpKArBBcdnCUxkMOhFGipRA5mdhPKjVHsg@mail.gmail.com
In response to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
List: pgsql-hackers

On Wed, Mar 20, 2024 at 3:26 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> OK, here's a small additional review, with a suggestion for additional
> changes to _bt_preprocess:

Attached is v16. This has some of the changes that you asked for, but my
main focus has been on fixing regressions that were discovered during
recent performance validation.

Up until now, my stress testing of the patch (not quite the same thing as
benchmarking) more or less consisted of using pgbench to look at two
different types of cases:

1. Extreme cases where the patch obviously helps the most, at least
insofar as navigating the index structure itself is concerned (note that
avoiding heap accesses isn't in scope for the pgbench stress testing).
These are all variants of pgbench's SELECT workload with a huge IN() list,
either containing contiguous integers (we can get maybe a 5.3x increase in
TPS here) or integers spaced apart by maybe 20 - 50 tuples (we can get
maybe a 20% - 30% improvement in TPS here).

2. The opposite extreme, where an IN() list has integers that are
essentially random, and so are spaced very far apart on average (another
variant of pgbench's SELECT workload with an IN() list). This is a case
that the patch has hardly any chance of helping, so the goal here is just
to avoid regressions. Since I've been paying attention to this for a long
time, this wasn't really a problem for v15.

To these I've now added a third type of case, which is where the
regressions showed up:

3. Cases where matching tuples are spaced apart by 100 - 300
tuples/integer values -- "somewhere between 1 and 2". These are cases
where I would expect to win by a smaller amount (say a 5% improvement in
TPS), but v15 nevertheless lost by as much as 12% here. That was a problem
that seemed to demand a fix.

The underlying reason why all earlier versions of the patch had these
regressions came down to their use of a pure linear search (by which I
mean having _bt_readpage call _bt_checkkeys again and again on the same
page). That was just too slow here. v15 could roughly halve the number of
index descents compared to master, as expected -- but that wasn't enough
to make up for the overhead of having to plow through hundreds of
non-matching tuples on every leaf page. v15 couldn't even break even.

v16 of the patch fixes the problem by adding a limited additional form of
"skipping", used *within* a page as it is scanned by _bt_readpage (as
opposed to skipping over the index by redescending anew, starting from the
root page). In other words, v16 teaches the linear search process to
occasionally "look ahead" whenever it seems to have required too many
comparisons at that point on the page (comparisons by
_bt_tuple_before_array_skeys, made from within _bt_checkkeys).

You can think of this new "look ahead" mechanism as bridging the gap
between case 1 and case 2 (fixing case 3, a hybrid of 1 and 2). We still
mostly want to use a linear search from within _bt_readpage, but we
benefit from having this fallback for when that just isn't working very
well. Now even these new type 3 stress tests see an increase in TPS of
perhaps 5% - 12%, depending on just how far apart you arrange for the
matching tuples to be spaced (the total size of the IN() list is also
relevant, but much less so).
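
To make the shape of that within-page "look ahead" concrete, here is a
minimal standalone sketch in C. It scans a plain sorted int array standing
in for a leaf page; the function name, thresholds, and overall structure
are illustrative assumptions, not the patch's actual _bt_readpage /
_bt_checkkeys code:

/*
 * Minimal standalone sketch (made-up names and thresholds, not the
 * patch's actual code) of a linear scan over a sorted "leaf page" that
 * occasionally looks ahead once too many consecutive comparisons have
 * shown tuples that still sort before the scan's current array key.
 */
#include <stdio.h>

#define LOOKAHEAD_TRIGGER   8   /* consecutive "tuple before key" results */
#define LOOKAHEAD_DISTANCE  16  /* how far ahead to probe */

/*
 * Return the first offset in page[start..npage) whose key is >= arraykey.
 * Mostly a plain linear search, but after LOOKAHEAD_TRIGGER non-matching
 * comparisons we probe LOOKAHEAD_DISTANCE entries ahead; if the probed
 * key still sorts before arraykey, everything in between must too (the
 * page is sorted), so the scan resumes from the probe point.
 */
static int
scan_with_lookahead(const int *page, int npage, int start, int arraykey)
{
    int     nbefore = 0;

    for (int off = start; off < npage; off++)
    {
        if (page[off] >= arraykey)
            return off;         /* caller would now evaluate this tuple */

        if (++nbefore >= LOOKAHEAD_TRIGGER &&
            off + LOOKAHEAD_DISTANCE < npage &&
            page[off + LOOKAHEAD_DISTANCE] < arraykey)
        {
            off += LOOKAHEAD_DISTANCE;  /* skip the intervening tuples */
            nbefore = 0;
        }
    }

    return npage;               /* no matching tuple on this "page" */
}

int
main(void)
{
    int     page[400];

    for (int i = 0; i < 400; i++)
        page[i] = i;            /* a sorted "page" of integer keys */

    /* Find the first key >= 250 without comparing all 250 keys before it */
    printf("first offset with key >= 250: %d\n",
           scan_with_lookahead(page, 400, 0, 250));

    return 0;
}

The point of the sketch is just that a failed probe costs one extra
comparison, while a successful probe saves LOOKAHEAD_DISTANCE of them, so
the scan degrades gracefully back to a plain linear search whenever
looking ahead stops paying off.
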
Separately, I added a new optimization to _bt_binsrch_array_skey, the
function that selects the scan's next array element via a binary search of
the array. It lowers the cost of advancing required arrays whose scan key
triggered the advancement: we only really need one comparison per
_bt_binsrch_array_skey call in almost all cases. In practice we'll almost
certainly find that the next array element in line is <= the unsatisfied
tuple datum (we already know from context that the current/former array
element must now be < that same tuple datum, so the conditions under which
this doesn't work out right away are narrow). A simplified standalone
sketch of this shortcut appears at the end of this message.

This second optimization is much more general than the first one. It
helps with pretty much any kind of adversarial pgbench stress test.

> Here, you insert your code between the comment about which opfamily to
> choose and the code assigning the opfamily. I think this needs some
> cleanup.

Fixed.

> > + * Don't call _bt_preprocess_array_keys_final in this fast path
> > + * (we'll miss out on the single value array transformation, but
> > + * that's not nearly as important when there's only one scan key)
>
> Why is it OK to ignore it? Or, why don't we apply it here?

It doesn't seem worth adding any cycles to that fast path, given that the
array transformation in question can only help us to convert
"= any('{1}')" into "= 1" -- there are no other scan keys that could cut
down the number of array constants first (through the earlier
array-specific preprocessing steps). Note that even this can't be said for
converting "IN (1)" into "= 1", since the planner already does that for
us, without nbtree ever seeing it directly.

> Attached 2 patches for further optimization of the _bt_preprocess_keys
> path (on top of your v15), according to the following idea:
>
> Right now, we do "expensive" processing with xform merging for all
> keys when we have more than 1 keys in the scan. However, we only do
> per-attribute merging of these keys, so if there is only one key for
> any attribute, the many cycles spent in that loop are mostly wasted.

Why do you say that? I agree that calling _bt_compare_scankey_args() more
than necessary might matter, but that doesn't come up when there's only
one scan key per attribute. We'll only copy each scan key into the
strategy-specific slot of the attribute's temp xform[] array. Actually, it
doesn't necessarily come up even when there's more than one scan key per
index attribute, depending on the strategies in use.

That seems like an existing problem on HEAD; if anything, I think we
should be doing this kind of preprocessing more. We could stand to do
better when inequalities are mixed together. Right now, we can't detect
contradictory quals in a case such as "SELECT * FROM pgbench_accounts
WHERE aid > 5 AND aid < 3" -- detection only works for something like
"WHERE aid = 5 AND aid < 3" -- so we'll still uselessly descend the index
once (not a disaster, but not ideal either).

Honestly, I don't follow what the goal is here. Does it really help to
restructure the loop like this? It feels like I'm missing something. We
sort of do this already when there's only one scan key, but that's
probably not all that important anyway. Maybe it helps a bit for nestloop
joins with an inner index scan, where a single array scan key is common.

--
Peter Geoghegan
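
For reference, here is a similarly simplified standalone sketch of the
array-advancement shortcut described earlier in this message: try the
element immediately after the current one before falling back to a binary
search over the rest of the array. The names and the plain-int comparison
are assumptions for illustration, not the actual _bt_binsrch_array_skey
code:

/*
 * Simplified standalone sketch (not the patch's actual code) of advancing
 * a sorted array of equality-key constants once a tuple value has moved
 * past the current array element: try the element immediately after the
 * current one before falling back to a binary search over what remains.
 */
#include <stdio.h>

/*
 * Advance from cur_elem to the first element >= tupval.  The caller has
 * already established that arr[cur_elem] < tupval, so the answer lies in
 * (cur_elem, nelems).  Returns nelems when every remaining element is
 * < tupval (the array is exhausted in this scan direction).
 */
static int
advance_array_key(const int *arr, int nelems, int cur_elem, int tupval)
{
    int     low = cur_elem + 1;
    int     high = nelems - 1;

    if (low >= nelems)
        return nelems;          /* nothing left to advance to */

    /* Shortcut: the next element is very often the one we want */
    if (arr[low] >= tupval)
        return low;
    low++;

    /* Fall back to a plain binary search over the remaining elements */
    while (low <= high)
    {
        int     mid = low + (high - low) / 2;

        if (arr[mid] < tupval)
            low = mid + 1;
        else
            high = mid - 1;
    }

    return low;                 /* first element >= tupval, or nelems */
}

int
main(void)
{
    int     arr[] = {10, 20, 30, 40, 50};

    /* Current element is 20 (index 1); tuple value 30 went unsatisfied */
    printf("new element index: %d\n", advance_array_key(arr, 5, 1, 30));

    /* Tuple value 45: the shortcut fails, the binary search returns 4 */
    printf("new element index: %d\n", advance_array_key(arr, 5, 1, 45));

    return 0;
}

In this sketch, an IN() list of contiguous or nearly contiguous values
needs only the shortcut comparison; the binary search only runs when the
tuple value has jumped past more than one array element at once.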