Thread: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Over in "execExprInterp() questions / How to improve scalar array op expr eval?" [1] I'd mused about how we might be able to optimized scalar array ops with OR'd semantics. This patch implements a binary search for such expressions when the array argument is a constant so that we can avoid needing to teach expression execution to cache stable values or know when a param has changed. The speed-up for the target case can pretty impressive: in my admittedly contrived and relatively unscientific test with a query in the form: select count(*) from generate_series(1,100000) n(i) where i in (<1000 random integers in the series>) shows ~30ms for the patch versus ~640ms on master. James [1]: https://www.postgresql.org/message-id/flat/CAAaqYe-UQBba7sScrucDOyHb7cDoNbWf_rcLrOWeD4ikP3_qTQ%40mail.gmail.com
Attachment
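To make the idea concrete, here is a minimal sketch of the approach: sort the constant array once up front, then binary-search it for every input value. This uses plain C ints and libc qsort/bsearch rather than the patch's actual Datum and comparison-function machinery, and the function names are made up for illustration.

/*
 * Minimal sketch (not the patch itself): sort the constant array once
 * when the expression is initialized, then binary-search it for every
 * input value.
 */
#include <stdbool.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* One-time setup, analogous to work done at expression initialization. */
static void
saop_init(int *elems, int nelems)
{
    qsort(elems, nelems, sizeof(int), cmp_int);
}

/* Per-tuple check, analogous to the per-row expression step. */
static bool
saop_match(const int *elems, int nelems, int value)
{
    return bsearch(&value, elems, nelems, sizeof(int), cmp_int) != NULL;
}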
On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: >Over in "execExprInterp() questions / How to improve scalar array op >expr eval?" [1] I'd mused about how we might be able to optimized >scalar array ops with OR'd semantics. > >This patch implements a binary search for such expressions when the >array argument is a constant so that we can avoid needing to teach >expression execution to cache stable values or know when a param has >changed. > >The speed-up for the target case can pretty impressive: in my >admittedly contrived and relatively unscientific test with a query in >the form: > >select count(*) from generate_series(1,100000) n(i) where i in (<1000 >random integers in the series>) > >shows ~30ms for the patch versus ~640ms on master. >

Nice improvement, although 1000 items is probably a bit unusual. The threshold used in the patch (9 elements) seems a bit too low - what results have you seen with smaller arrays?

Another idea - would a bloom filter be useful here, as a second optimization? That is, for large arrays build a small bloom filter, allowing us to skip even the binary search.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: > >Over in "execExprInterp() questions / How to improve scalar array op > >expr eval?" [1] I'd mused about how we might be able to optimized > >scalar array ops with OR'd semantics. > > > >This patch implements a binary search for such expressions when the > >array argument is a constant so that we can avoid needing to teach > >expression execution to cache stable values or know when a param has > >changed. > > > >The speed-up for the target case can pretty impressive: in my > >admittedly contrived and relatively unscientific test with a query in > >the form: > > > >select count(*) from generate_series(1,100000) n(i) where i in (<1000 > >random integers in the series>) > > > >shows ~30ms for the patch versus ~640ms on master. > > > > Nice improvement, although 1000 items is probably a bit unusual. The > threshold used in the patch (9 elements) seems a bit too low - what > results have you seen with smaller arrays? At least in our systems we regularly work with 1000 batches of items, which means you get IN clauses of identifiers of that size. Admittedly the most common case sees those IN clauses as simple index scans (e.g., WHERE <primary key> IN (...)), but it's also common to have a broader query that merely filters additionally on something like "... AND <some foreign key> IN (...)" where it makes sense for the rest of the quals to take precedence in generating a reasonable plan. In that case, the IN becomes a regular filter, hence the idea behind the patch. Side note: I'd love for us to be able to treat "IN (VALUES)" the same way...but as noted in the other thread that's an extremely large amount of work, I think. But similarly you could use a hash here instead of a binary search...but this seems quite good. As to the choice of 9 elements: I just picked that as a starting point; Andres had previously commented off hand that at 8 elements serial scanning was faster, so I figured this was a reasonable starting point for discussion. Perhaps it would make sense to determine that minimum not as a pure constant but scaled based on how many rows the planner expects us to see? Of course that'd be a more invasive patch...so may or may not be as feasible as a reasonable default. > Another idea - would a bloom filter be useful here, as a second > optimization? That is, for large arrays build s small bloom filter, > allowing us to skip even the binary search. That's an interesting idea. I actually haven't personally worked with bloom filters, so didn't think about that. Are you thinking that you'd also build the filter *and* presort the array? Or try to get away with using only the bloom filter and not expanding and sorting the array at all? Thanks, James
On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: >> >Over in "execExprInterp() questions / How to improve scalar array op >> >expr eval?" [1] I'd mused about how we might be able to optimized >> >scalar array ops with OR'd semantics. >> > >> >This patch implements a binary search for such expressions when the >> >array argument is a constant so that we can avoid needing to teach >> >expression execution to cache stable values or know when a param has >> >changed. >> > >> >The speed-up for the target case can pretty impressive: in my >> >admittedly contrived and relatively unscientific test with a query in >> >the form: >> > >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000 >> >random integers in the series>) >> > >> >shows ~30ms for the patch versus ~640ms on master. >> > >> >> Nice improvement, although 1000 items is probably a bit unusual. The >> threshold used in the patch (9 elements) seems a bit too low - what >> results have you seen with smaller arrays? > >At least in our systems we regularly work with 1000 batches of items, >which means you get IN clauses of identifiers of that size. Admittedly >the most common case sees those IN clauses as simple index scans >(e.g., WHERE <primary key> IN (...)), but it's also common to have a >broader query that merely filters additionally on something like "... >AND <some foreign key> IN (...)" where it makes sense for the rest of >the quals to take precedence in generating a reasonable plan. In that >case, the IN becomes a regular filter, hence the idea behind the >patch. > >Side note: I'd love for us to be able to treat "IN (VALUES)" the same >way...but as noted in the other thread that's an extremely large >amount of work, I think. But similarly you could use a hash here >instead of a binary search...but this seems quite good. > >As to the choice of 9 elements: I just picked that as a starting >point; Andres had previously commented off hand that at 8 elements >serial scanning was faster, so I figured this was a reasonable >starting point for discussion. > >Perhaps it would make sense to determine that minimum not as a pure >constant but scaled based on how many rows the planner expects us to >see? Of course that'd be a more invasive patch...so may or may not be >as feasible as a reasonable default. > Not sure. That seems a bit overcomplicated and I don't think it depends on the number of rows the planner expects to see very much. I think we usually assume the linear search is cheaper for small arrays and then at some point the binary search starts winning The question is where this "break even" point is. I think we usually use something like 64 or so in other places, but maybe I'm wrong. The current limit 9 seems a bit too low, but I may be wrong. Let's not obsess about this too much, let's do some experiments and pick a value based on that. >> Another idea - would a bloom filter be useful here, as a second >> optimization? That is, for large arrays build s small bloom filter, >> allowing us to skip even the binary search. > >That's an interesting idea. I actually haven't personally worked with >bloom filters, so didn't think about that. > >Are you thinking that you'd also build the filter *and* presort the >array? Or try to get away with using only the bloom filter and not >expanding and sorting the array at all? > Yeah, something like that. 
My intuition is that the bloom filter is useful only above some number of items, and that number is higher than for the binary search. So we'd end up with two thresholds, the first one enabling binary search, the second one enabling the bloom filter.

Of course, the "unknown" variable here is how often we actually find the value in the array. If 100% of the queries have a match, then the bloom filter is a waste of time. If there are no matches, it can make a significant difference.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 24 Apr 2020 at 02:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: > >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >As to the choice of 9 elements: I just picked that as a starting > >point; Andres had previously commented off hand that at 8 elements > >serial scanning was faster, so I figured this was a reasonable > >starting point for discussion. > > > >Perhaps it would make sense to determine that minimum not as a pure > >constant but scaled based on how many rows the planner expects us to > >see? Of course that'd be a more invasive patch...so may or may not be > >as feasible as a reasonable default. > > > > Not sure. That seems a bit overcomplicated and I don't think it depends > on the number of rows the planner expects to see very much. I think we > usually assume the linear search is cheaper for small arrays and then at > some point the binary search starts winning The question is where this > "break even" point is. > > I think we usually use something like 64 or so in other places, but > maybe I'm wrong. The current limit 9 seems a bit too low, but I may be > wrong. Let's not obsess about this too much, let's do some experiments > and pick a value based on that.

If a single comparison in a binary search costs about the same as an equality check, then wouldn't the crossover point be much lower than 64? The binary search should find or not find the target in log2(N) comparisons rather than N. ceil(log2(9)) is 4, which is of course less than 9. For 64, it's 6, so aren't you potentially doing 58 more equality checks than necessary?

Of course, it's a bit more complex, as for values that *are* in the array the linear search will, on average, only check half the values. Assuming that, then 9 does not seem too far off.

I guess benchmarks at various crossover points would speak a thousand words.

David
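To put rough numbers on the comparison-count argument above (and only on comparison counts - the cache effects Andres raises in the next message are ignored here), a throwaway snippet such as the following prints the expected comparisons for linear and binary search. It is purely illustrative and not taken from the patch.

/*
 * Comparison counts only: linear search costs ~K/2 comparisons on a hit
 * and K on a miss; binary search costs ~ceil(log2(K)) either way.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    int sizes[] = {8, 9, 16, 64, 1000};

    for (int i = 0; i < 5; i++)
    {
        int K = sizes[i];

        printf("K=%4d  linear hit ~%5.1f  linear miss %4d  binary ~%2.0f\n",
               K, K / 2.0, K, ceil(log2(K)));
    }

    return 0;
}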
Hi,

On 2020-04-24 10:09:36 +1200, David Rowley wrote: > If single comparison for a binary search costs about the same as an > equality check, then wouldn't the crossover point be much lower than > 64?

The costs aren't quite as simple as that though. Binary search usually has issues with cache misses: in contrast to linear accesses, each step will be a cache miss, as the address is not predictable; and even if the CPU couldn't predict accesses in the linear search case, often multiple entries fit on a single cache line.

Additionally, out-of-order execution is usually a lot more effective for linear searches (e.g. the next elements can be compared before the current one is finished, if that's what the branch predictor says is likely).

Greetings,

Andres Freund
On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: > >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: > >> >Over in "execExprInterp() questions / How to improve scalar array op > >> >expr eval?" [1] I'd mused about how we might be able to optimized > >> >scalar array ops with OR'd semantics. > >> > > >> >This patch implements a binary search for such expressions when the > >> >array argument is a constant so that we can avoid needing to teach > >> >expression execution to cache stable values or know when a param has > >> >changed. > >> > > >> >The speed-up for the target case can pretty impressive: in my > >> >admittedly contrived and relatively unscientific test with a query in > >> >the form: > >> > > >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000 > >> >random integers in the series>) > >> > > >> >shows ~30ms for the patch versus ~640ms on master. > >> > > >> > >> Nice improvement, although 1000 items is probably a bit unusual. The > >> threshold used in the patch (9 elements) seems a bit too low - what > >> results have you seen with smaller arrays? > > > >At least in our systems we regularly work with 1000 batches of items, > >which means you get IN clauses of identifiers of that size. Admittedly > >the most common case sees those IN clauses as simple index scans > >(e.g., WHERE <primary key> IN (...)), but it's also common to have a > >broader query that merely filters additionally on something like "... > >AND <some foreign key> IN (...)" where it makes sense for the rest of > >the quals to take precedence in generating a reasonable plan. In that > >case, the IN becomes a regular filter, hence the idea behind the > >patch. > > > >Side note: I'd love for us to be able to treat "IN (VALUES)" the same > >way...but as noted in the other thread that's an extremely large > >amount of work, I think. But similarly you could use a hash here > >instead of a binary search...but this seems quite good. > > > >As to the choice of 9 elements: I just picked that as a starting > >point; Andres had previously commented off hand that at 8 elements > >serial scanning was faster, so I figured this was a reasonable > >starting point for discussion. > > > >Perhaps it would make sense to determine that minimum not as a pure > >constant but scaled based on how many rows the planner expects us to > >see? Of course that'd be a more invasive patch...so may or may not be > >as feasible as a reasonable default. > > > > Not sure. That seems a bit overcomplicated and I don't think it depends > on the number of rows the planner expects to see very much. I think we > usually assume the linear search is cheaper for small arrays and then at > some point the binary search starts winning The question is where this > "break even" point is. Well since it has to do preprocessing work (expanding the array and then sorting it), then the number of rows processed matters, right? For example, doing a linear search on 1000 items only once is going to be cheaper than preprocessing the array and then doing a binary search, but only a very large row count the binary search + preprocessing might very well win out for only a 10 element array. 
I'm not trying to argue for more work for myself here: I think the optimization is worth it on its own, and something like this could be a further improvement on its own. But it is interesting to think about. James
On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote: >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra >> ><tomas.vondra@2ndquadrant.com> wrote: >> >> >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: >> >> >Over in "execExprInterp() questions / How to improve scalar array op >> >> >expr eval?" [1] I'd mused about how we might be able to optimized >> >> >scalar array ops with OR'd semantics. >> >> > >> >> >This patch implements a binary search for such expressions when the >> >> >array argument is a constant so that we can avoid needing to teach >> >> >expression execution to cache stable values or know when a param has >> >> >changed. >> >> > >> >> >The speed-up for the target case can pretty impressive: in my >> >> >admittedly contrived and relatively unscientific test with a query in >> >> >the form: >> >> > >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000 >> >> >random integers in the series>) >> >> > >> >> >shows ~30ms for the patch versus ~640ms on master. >> >> > >> >> >> >> Nice improvement, although 1000 items is probably a bit unusual. The >> >> threshold used in the patch (9 elements) seems a bit too low - what >> >> results have you seen with smaller arrays? >> > >> >At least in our systems we regularly work with 1000 batches of items, >> >which means you get IN clauses of identifiers of that size. Admittedly >> >the most common case sees those IN clauses as simple index scans >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a >> >broader query that merely filters additionally on something like "... >> >AND <some foreign key> IN (...)" where it makes sense for the rest of >> >the quals to take precedence in generating a reasonable plan. In that >> >case, the IN becomes a regular filter, hence the idea behind the >> >patch. >> > >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same >> >way...but as noted in the other thread that's an extremely large >> >amount of work, I think. But similarly you could use a hash here >> >instead of a binary search...but this seems quite good. >> > >> >As to the choice of 9 elements: I just picked that as a starting >> >point; Andres had previously commented off hand that at 8 elements >> >serial scanning was faster, so I figured this was a reasonable >> >starting point for discussion. >> > >> >Perhaps it would make sense to determine that minimum not as a pure >> >constant but scaled based on how many rows the planner expects us to >> >see? Of course that'd be a more invasive patch...so may or may not be >> >as feasible as a reasonable default. >> > >> >> Not sure. That seems a bit overcomplicated and I don't think it depends >> on the number of rows the planner expects to see very much. I think we >> usually assume the linear search is cheaper for small arrays and then at >> some point the binary search starts winning The question is where this >> "break even" point is. > >Well since it has to do preprocessing work (expanding the array and >then sorting it), then the number of rows processed matters, right? >For example, doing a linear search on 1000 items only once is going to >be cheaper than preprocessing the array and then doing a binary >search, but only a very large row count the binary search + >preprocessing might very well win out for only a 10 element array. > Hmmm, good point. 
Essentially the initialization (sorting of the array) has some cost, and the question is how much extra per-tuple cost this adds. It's probably not worth it for a single lookup, but for many lookups it's probably OK. Let's see if I can do the math right:

N - number of lookups
K - number of array elements

Cost to sort the array is

    O(K * log(K)) = C1 * K * log(K)

and the cost of a lookup is C2 * log(K), so with the extra cost amortized over N lookups, the total "per lookup" cost is

    C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)

We need to compare this to the O(K) cost of a simple linear search, and the question is at which point the linear search gets more expensive:

    C3 * K = log(K) * (C1 * K / N + C2)

I think we can assume that C3 is somewhere between 0.5 and 1, i.e. if there's a matching item we find it half-way through on average, and if there is not we have to walk the whole array. So let's say it's 1.

C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for random pivot choice IIRC, and C2 is probably similar. With both values being ~1.5 we get this:

    K = log(K) * (1.5 * K/N + 1.5)

For a fixed K, we get this formula for N:

    N = log(K) * 1.5 * K / (K - 1.5 * log(K))

and for a bunch of K values the results look like this:

         K | N
    -------|-------
         1 | 0
        10 | 5.27
       100 | 7.42
      1000 | 10.47
     10000 | 13.83
    100000 | 17.27

i.e. the binary search with 10k values starts winning over linear search with just ~13 lookups.

(Assuming I haven't made some silly mistake in the math ...)

Obviously, this only accounts for the cost of comparisons and neglects e.g. the indirect costs of less predictable memory access patterns mentioned by Andres in his response. But I think it still shows the number of lookups needed for the binary search to be a win is pretty low - at least for a reasonable number of values in the array. Maybe it's 20 and not 10, but I don't think that changes much.

The other question is if we can get N at all and how reliable the value is. We can probably get the number of rows, but that will ignore other conditions that may eliminate the row before the binary search.

>I'm not trying to argue for more work for myself here: I think the >optimization is worth it on its own, and something like this could be >a further improvement on its own. But it is interesting to think >about. >

I don't know. Clearly, if the user sends a query with 10k values and only does a single lookup, that won't win. And if we can reasonably and reliably protect against that, I wouldn't mind doing so, although it means a risk of not using the binary search in case of underestimates etc.

I don't have any hard data about this, but I think we can assume the number of rows processed by the clause is (much) higher than the number of keys in it. If you have a clause with 10k values, then you probably expect it to be applied to many rows, far more than the "break even" point of about 10-20 rows ...

So I wouldn't worry about this too much.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
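The table is easy to re-derive mechanically; this throwaway snippet just evaluates the final formula with the same assumed constants (C1 = C2 = 1.5, C3 = 1) and natural logarithms, which reproduces the numbers above. It is a sanity check of the arithmetic, nothing more.

#include <math.h>
#include <stdio.h>

/* Evaluate N = log(K) * 1.5 * K / (K - 1.5 * log(K)) for a few K values. */
int
main(void)
{
    double ks[] = {10, 100, 1000, 10000, 100000};

    for (int i = 0; i < 5; i++)
    {
        double K = ks[i];
        double N = log(K) * 1.5 * K / (K - 1.5 * log(K));

        printf("%8.0f | %5.2f\n", K, N);
    }

    return 0;
}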
On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote: >On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: >>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra >><tomas.vondra@2ndquadrant.com> wrote: >>> >>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: >>>>Over in "execExprInterp() questions / How to improve scalar array op >>>>expr eval?" [1] I'd mused about how we might be able to optimized >>>>scalar array ops with OR'd semantics. >>>> >>>>This patch implements a binary search for such expressions when the >>>>array argument is a constant so that we can avoid needing to teach >>>>expression execution to cache stable values or know when a param has >>>>changed. >>>> >>>>The speed-up for the target case can pretty impressive: in my >>>>admittedly contrived and relatively unscientific test with a query in >>>>the form: >>>> >>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000 >>>>random integers in the series>) >>>> >>>>shows ~30ms for the patch versus ~640ms on master. >>>> >>> >>>Nice improvement, although 1000 items is probably a bit unusual. The >>>threshold used in the patch (9 elements) seems a bit too low - what >>>results have you seen with smaller arrays? >> >>At least in our systems we regularly work with 1000 batches of items, >>which means you get IN clauses of identifiers of that size. Admittedly >>the most common case sees those IN clauses as simple index scans >>(e.g., WHERE <primary key> IN (...)), but it's also common to have a >>broader query that merely filters additionally on something like "... >>AND <some foreign key> IN (...)" where it makes sense for the rest of >>the quals to take precedence in generating a reasonable plan. In that >>case, the IN becomes a regular filter, hence the idea behind the >>patch. >> >>Side note: I'd love for us to be able to treat "IN (VALUES)" the same >>way...but as noted in the other thread that's an extremely large >>amount of work, I think. But similarly you could use a hash here >>instead of a binary search...but this seems quite good. >> >>As to the choice of 9 elements: I just picked that as a starting >>point; Andres had previously commented off hand that at 8 elements >>serial scanning was faster, so I figured this was a reasonable >>starting point for discussion. >> >>Perhaps it would make sense to determine that minimum not as a pure >>constant but scaled based on how many rows the planner expects us to >>see? Of course that'd be a more invasive patch...so may or may not be >>as feasible as a reasonable default. >> > >Not sure. That seems a bit overcomplicated and I don't think it depends >on the number of rows the planner expects to see very much. I think we >usually assume the linear search is cheaper for small arrays and then at >some point the binary search starts winning The question is where this >"break even" point is. > >I think we usually use something like 64 or so in other places, but >maybe I'm wrong. The current limit 9 seems a bit too low, but I may be >wrong. Let's not obsess about this too much, let's do some experiments >and pick a value based on that. > > >>>Another idea - would a bloom filter be useful here, as a second >>>optimization? That is, for large arrays build s small bloom filter, >>>allowing us to skip even the binary search. >> >>That's an interesting idea. I actually haven't personally worked with >>bloom filters, so didn't think about that. >> >>Are you thinking that you'd also build the filter *and* presort the >>array? 
Or try to get away with using only the bloom filter and not >>expanding and sorting the array at all? >> > >Yeah, something like that. My intuition is the bloom filter is useful >only above some number of items, and the number is higher than for the >binary search. So we'd end up with two thresholds, first one enabling >binary search, the second one enabling bloom filter. > >Of course, the "unknown" variable here is how often we actually find the >value in the array. If 100% of the queries has a match, then the bloom >filter is a waste of time. If there are no matches, it can make a >significant difference. >

I did experiment with this a bit, both to get more familiar with this code and to see if the bloom filter might help. The short answer is that the bloom filter does not seem to help at all, so I wouldn't bother about it too much.

Attached is an updated patch series, the script I used to collect some performance measurements, and a spreadsheet with results. The patch series is broken into four parts:

0001 - the original patch with binary search
0002 - adds GUCs to enable bin search / tweak threshold
0003 - allows using a bloom filter + binary search
0004 - tries using murmurhash

The test script runs a wide range of queries with different numbers of lookups, keys in the array, and match probability (i.e. fraction of lookups that find a match) ranging from 1% to 100%. And of course, it runs this with the binsearch/bloom either enabled or disabled (the threshold was lowered to 1, so it's the on/off GUCs that determine whether the binsearch/bloom is used).

The results are summarized in the spreadsheet, demonstrating how useless the bloom filter is. There's not a single case where it would beat the binary search. I believe this is because the access to the bloom filter is random (determined by the hash function) and we don't save much compared to the log(K) lookups in the sorted array.

That makes sense - I think bloom filters are meant to be used in cases when the main data doesn't fit into memory, which is not the case here. But I wonder how this would change for cases with more expensive comparisons - this was using just integers, so maybe strings would result in different behavior.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
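For anyone who wants the shape of the 0003 idea without reading the patch: the check looks roughly like the sketch below - test a couple of hash-derived bits first and fall back to the binary search only on a "maybe". The filter size, the hash functions, and the reuse of the earlier saop_match() sketch are all placeholders, not what the attached patch actually does.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 1024             /* filter size; a placeholder value */

typedef struct BloomFilter
{
    uint8_t     bits[BLOOM_BITS / 8];
} BloomFilter;

/* Two cheap placeholder hash functions (a murmur-style mixer). */
static inline uint32_t
hash1(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6b;
    x ^= x >> 13;
    return x;
}

static inline uint32_t
hash2(uint32_t x)
{
    return hash1(x ^ 0xdeadbeef);
}

static void
bloom_init(BloomFilter *bf)
{
    memset(bf->bits, 0, sizeof(bf->bits));
}

static void
bloom_add(BloomFilter *bf, uint32_t key)
{
    uint32_t    h1 = hash1(key) % BLOOM_BITS;
    uint32_t    h2 = hash2(key) % BLOOM_BITS;

    bf->bits[h1 / 8] |= (uint8_t) (1 << (h1 % 8));
    bf->bits[h2 / 8] |= (uint8_t) (1 << (h2 % 8));
}

static bool
bloom_maybe_contains(const BloomFilter *bf, uint32_t key)
{
    uint32_t    h1 = hash1(key) % BLOOM_BITS;
    uint32_t    h2 = hash2(key) % BLOOM_BITS;

    return (bf->bits[h1 / 8] & (1 << (h1 % 8))) != 0 &&
           (bf->bits[h2 / 8] & (1 << (h2 % 8))) != 0;
}

/* Per-tuple check: bloom filter first, binary search only on a "maybe". */
static bool
saop_match_with_bloom(const BloomFilter *bf,
                      const int *sorted_elems, int nelems, int value)
{
    if (!bloom_maybe_contains(bf, (uint32_t) value))
        return false;               /* definitely not in the array */

    return saop_match(sorted_elems, nelems, value);     /* earlier sketch */
}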
On Fri, Apr 24, 2020 at 5:55 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote: > >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: > >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra > >> ><tomas.vondra@2ndquadrant.com> wrote: > >> >> > >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: > >> >> >Over in "execExprInterp() questions / How to improve scalar array op > >> >> >expr eval?" [1] I'd mused about how we might be able to optimized > >> >> >scalar array ops with OR'd semantics. > >> >> > > >> >> >This patch implements a binary search for such expressions when the > >> >> >array argument is a constant so that we can avoid needing to teach > >> >> >expression execution to cache stable values or know when a param has > >> >> >changed. > >> >> > > >> >> >The speed-up for the target case can pretty impressive: in my > >> >> >admittedly contrived and relatively unscientific test with a query in > >> >> >the form: > >> >> > > >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000 > >> >> >random integers in the series>) > >> >> > > >> >> >shows ~30ms for the patch versus ~640ms on master. > >> >> > > >> >> > >> >> Nice improvement, although 1000 items is probably a bit unusual. The > >> >> threshold used in the patch (9 elements) seems a bit too low - what > >> >> results have you seen with smaller arrays? > >> > > >> >At least in our systems we regularly work with 1000 batches of items, > >> >which means you get IN clauses of identifiers of that size. Admittedly > >> >the most common case sees those IN clauses as simple index scans > >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a > >> >broader query that merely filters additionally on something like "... > >> >AND <some foreign key> IN (...)" where it makes sense for the rest of > >> >the quals to take precedence in generating a reasonable plan. In that > >> >case, the IN becomes a regular filter, hence the idea behind the > >> >patch. > >> > > >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same > >> >way...but as noted in the other thread that's an extremely large > >> >amount of work, I think. But similarly you could use a hash here > >> >instead of a binary search...but this seems quite good. > >> > > >> >As to the choice of 9 elements: I just picked that as a starting > >> >point; Andres had previously commented off hand that at 8 elements > >> >serial scanning was faster, so I figured this was a reasonable > >> >starting point for discussion. > >> > > >> >Perhaps it would make sense to determine that minimum not as a pure > >> >constant but scaled based on how many rows the planner expects us to > >> >see? Of course that'd be a more invasive patch...so may or may not be > >> >as feasible as a reasonable default. > >> > > >> > >> Not sure. That seems a bit overcomplicated and I don't think it depends > >> on the number of rows the planner expects to see very much. I think we > >> usually assume the linear search is cheaper for small arrays and then at > >> some point the binary search starts winning The question is where this > >> "break even" point is. > > > >Well since it has to do preprocessing work (expanding the array and > >then sorting it), then the number of rows processed matters, right? 
> >For example, doing a linear search on 1000 items only once is going to > >be cheaper than preprocessing the array and then doing a binary > >search, but only a very large row count the binary search + > >preprocessing might very well win out for only a 10 element array. > > > > Hmmm, good point. Essentially the initialization (sorting of the array) > has some cost, and the question is how much extra per-tuple cost this > adds. It's probably not worth it for a single lookup, but for many > lookups it's probably OK. Let's see if I can do the math right: > > N - number of lookups > K - number of array elements > > Cost to sort the array is > > O(K * log(K)) = C1 * K * log(K) > > and the cost of a lookup is C2 * log(K), so with the extra cost amortized > for N lookups, the total "per lookup" cost is > > C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2) > > We need to compare this to the O(K) cost of simple linear search, and > the question is at which point the linear search gets more expensive: > > C3 * K = log(K) * (C1 * K / N + C2) > > I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if > there's a matching item we find it half-way through on average, and if > there is not we have to walk the whole array. So let's say it's 1. > > C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for > random pivot choice IIRC, and C2 is probably similar. With both values > being ~1.5 we get this: > > K = log(K) * (1.5 * K/N + 1.5) > > for a fixed K, we get this formula for N: > > N = log(K) * 1.5 * K / (K - 1.5 * log(K)) > > and for a bunch of K values the results look like this: > > K | N > -------|------- > 1 | 0 > 10 | 5.27 > 100 | 7.42 > 1000 | 10.47 > 10000 | 13.83 > 100000 | 17.27 > > i.e. the binary search with 10k values starts winning over linear search > with just ~13 lookups. > > (Assuming I haven't made some silly mistake in the math ...) > > Obviously, this only accounts for cost of comparisons and neglects e.g. > the indirect costs for less predictable memory access patterns mentioned > by Andres in his response. > > But I think it still shows the number of lookups needed for the binary > search to be a win is pretty low - at least for reasonable number of > values in the array. Maybe it's 20 and not 10, but I don't think that > changes much. > > The other question is if we can get N at all and how reliable the value > is. We can probably get the number of rows, but that will ignore other > conditions that may eliminate the row before the binary search. > > >I'm not trying to argue for more work for myself here: I think the > >optimization is worth it on its own, and something like this could be > >a further improvement on its own. But it is interesting to think > >about. > > > > I don't know. Clearly, if the user sends a query with 10k values and > only does a single lookup, that won't win. And if we can reasonably and > reliably protect against that, I wouldn't mind doing that, although it > means a risk of not using the bin search in case of underestimates etc. > > I don't have any hard data about this, but I think we can assume the > number of rows processed by the clause is (much) higher than the number > of keys in it. If you have a clause with 10k values, then you probably > expect it to be applied to many rows, far more than the "beak even" > point of about 10-20 rows ... > > So I wouldn't worry about this too much. Yeah. 
I think it becomes a lot more interesting in the future if/when we end up with a way to use this with params and not just constant arrays. Then the "group" size would matter a whole lot more. For now, the constant amount of overhead is quite small, so even if we only execute it once we won't make the query that much worse (or, at least, the total query time will still be very small). Also, because it's only applied to constants, there's a natural limit to how much overhead we're likely to introduce into a query. James
On Sat, Apr 25, 2020 at 12:21:06AM +0200, Tomas Vondra wrote: >On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote: >>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: >>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra >>><tomas.vondra@2ndquadrant.com> wrote: >>>> >>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: >>>>>Over in "execExprInterp() questions / How to improve scalar array op >>>>>expr eval?" [1] I'd mused about how we might be able to optimized >>>>>scalar array ops with OR'd semantics. >>>>> >>>>>This patch implements a binary search for such expressions when the >>>>>array argument is a constant so that we can avoid needing to teach >>>>>expression execution to cache stable values or know when a param has >>>>>changed. >>>>> >>>>>The speed-up for the target case can pretty impressive: in my >>>>>admittedly contrived and relatively unscientific test with a query in >>>>>the form: >>>>> >>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000 >>>>>random integers in the series>) >>>>> >>>>>shows ~30ms for the patch versus ~640ms on master. >>>>> >>>> >>>>Nice improvement, although 1000 items is probably a bit unusual. The >>>>threshold used in the patch (9 elements) seems a bit too low - what >>>>results have you seen with smaller arrays? >>> >>>At least in our systems we regularly work with 1000 batches of items, >>>which means you get IN clauses of identifiers of that size. Admittedly >>>the most common case sees those IN clauses as simple index scans >>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a >>>broader query that merely filters additionally on something like "... >>>AND <some foreign key> IN (...)" where it makes sense for the rest of >>>the quals to take precedence in generating a reasonable plan. In that >>>case, the IN becomes a regular filter, hence the idea behind the >>>patch. >>> >>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same >>>way...but as noted in the other thread that's an extremely large >>>amount of work, I think. But similarly you could use a hash here >>>instead of a binary search...but this seems quite good. >>> >>>As to the choice of 9 elements: I just picked that as a starting >>>point; Andres had previously commented off hand that at 8 elements >>>serial scanning was faster, so I figured this was a reasonable >>>starting point for discussion. >>> >>>Perhaps it would make sense to determine that minimum not as a pure >>>constant but scaled based on how many rows the planner expects us to >>>see? Of course that'd be a more invasive patch...so may or may not be >>>as feasible as a reasonable default. >>> >> >>Not sure. That seems a bit overcomplicated and I don't think it depends >>on the number of rows the planner expects to see very much. I think we >>usually assume the linear search is cheaper for small arrays and then at >>some point the binary search starts winning The question is where this >>"break even" point is. >> >>I think we usually use something like 64 or so in other places, but >>maybe I'm wrong. The current limit 9 seems a bit too low, but I may be >>wrong. Let's not obsess about this too much, let's do some experiments >>and pick a value based on that. >> >> >>>>Another idea - would a bloom filter be useful here, as a second >>>>optimization? That is, for large arrays build s small bloom filter, >>>>allowing us to skip even the binary search. >>> >>>That's an interesting idea. 
I actually haven't personally worked with >>>bloom filters, so didn't think about that. >>> >>>Are you thinking that you'd also build the filter *and* presort the >>>array? Or try to get away with using only the bloom filter and not >>>expanding and sorting the array at all? >>> >> >>Yeah, something like that. My intuition is the bloom filter is useful >>only above some number of items, and the number is higher than for the >>binary search. So we'd end up with two thresholds, first one enabling >>binary search, the second one enabling bloom filter. >> >>Of course, the "unknown" variable here is how often we actually find the >>value in the array. If 100% of the queries has a match, then the bloom >>filter is a waste of time. If there are no matches, it can make a >>significant difference. >> > >I did experiment with this is a bit, both to get a bit more familiar >with this code and to see if the bloom filter might help. The short >answer is the bloom filter does not seem to help at all, so I wouldn't >bother about it too much. > >Attacched is an updated patch series and, script I used to collect some >performance measurements, and a spreadsheet with results. The patch >series is broken into four parts: > > 0001 - the original patch with binary search > 0002 - adds GUCs to enable bin search / tweak threshold > 0003 - allows to use bloom filter + binary search > 0004 - try using murmurhash > >The test script runs a wide range of queries with different number >of lookups, keys in the array, match probability (i.e. fraction of >lookups that find a match) ranging from 1% to 100%. And of course, it >runs this with the binsearch/bloom either enabled or disabled (the >threshold was lowered to 1, so it's the on/off GUCs that determine >whether the binsearch/bloom is used). > >The results are summarized in the spreadsheet, demonstrating how useless >the bloom filter is. There's not a single case where it would beat the >binary search. I believe this is because theaccess to bloom filter is >random (determined by the hash function) and we don't save much compared >to the log(K) lookups in the sorted array. > >That makes sense, I think the bloom filters are meant to be used in >cases when the main data don't fit into memory - which is not the case >here. But I wonder how would this change for cases with more expensive >comparisons - this was using just integers, so maybe strings would >result in different behavior. OK, I tried the same test with text columns (with md5 strings), and the results are about as I predicted - the bloom filter actually makes a difference in this case. Depending on the number of lookups and selectivity (i.e. how many lookups have a match in the array) it can mean additional speedup up to ~5x compared to binary search alone. For the case with 100% selectivity (i.e. all rows have a match) this can't really save any time - it's usually still much faster than master, but it's a bit slower than binary search. So I think this might be worth investigating further, once the simple binary search gets committed. We'll probably need to factor in the cost of the comparison (higher cost -> BF more useful), selectivity of the filter (fewer matches -> BF more useful) and number of lookups. This reminds me our attempts to add bloom filters to hash joins, which I think ran into mostly the same challenge of deciding when the bloom filter can be useful and is worth the extra work. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
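The "when is the filter worth it" decision discussed above could eventually be expressed as a small cost-style heuristic along these lines. Everything here - the function, the constants, the cost model - is invented purely to make the shape of the trade-off concrete; it is not anything from the attached patches.

#include <math.h>
#include <stdbool.h>

/*
 * Illustrative only: rough per-lookup costs for "binary search" versus
 * "bloom filter + binary search", ignoring false positives.  All constants
 * are invented placeholders, not measured values.
 */
static bool
bloom_filter_seems_useful(double nlookups,      /* expected lookups (rows) */
                          double nkeys,         /* array elements */
                          double match_rate,    /* fraction of lookups that match */
                          double cmp_cost)      /* cost of one comparison */
{
    double      probe_cost = 1.0;   /* assumed cost of hashing + bit tests */
    double      bsearch_cost = cmp_cost * log2(nkeys);
    double      plain = bsearch_cost;
    double      filtered = probe_cost + match_rate * bsearch_cost;
    double      build_cost = nkeys; /* roughly one insert per element */

    /* worth it if the per-lookup savings amortize building the filter */
    return nlookups * (plain - filtered) > build_cost;
}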
On Fri, Apr 24, 2020 at 09:22:34PM -0400, James Coleman wrote: >On Fri, Apr 24, 2020 at 5:55 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote: >> >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra >> ><tomas.vondra@2ndquadrant.com> wrote: >> >> >> >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote: >> >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra >> >> ><tomas.vondra@2ndquadrant.com> wrote: >> >> >> >> >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote: >> >> >> >Over in "execExprInterp() questions / How to improve scalar array op >> >> >> >expr eval?" [1] I'd mused about how we might be able to optimized >> >> >> >scalar array ops with OR'd semantics. >> >> >> > >> >> >> >This patch implements a binary search for such expressions when the >> >> >> >array argument is a constant so that we can avoid needing to teach >> >> >> >expression execution to cache stable values or know when a param has >> >> >> >changed. >> >> >> > >> >> >> >The speed-up for the target case can pretty impressive: in my >> >> >> >admittedly contrived and relatively unscientific test with a query in >> >> >> >the form: >> >> >> > >> >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000 >> >> >> >random integers in the series>) >> >> >> > >> >> >> >shows ~30ms for the patch versus ~640ms on master. >> >> >> > >> >> >> >> >> >> Nice improvement, although 1000 items is probably a bit unusual. The >> >> >> threshold used in the patch (9 elements) seems a bit too low - what >> >> >> results have you seen with smaller arrays? >> >> > >> >> >At least in our systems we regularly work with 1000 batches of items, >> >> >which means you get IN clauses of identifiers of that size. Admittedly >> >> >the most common case sees those IN clauses as simple index scans >> >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a >> >> >broader query that merely filters additionally on something like "... >> >> >AND <some foreign key> IN (...)" where it makes sense for the rest of >> >> >the quals to take precedence in generating a reasonable plan. In that >> >> >case, the IN becomes a regular filter, hence the idea behind the >> >> >patch. >> >> > >> >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same >> >> >way...but as noted in the other thread that's an extremely large >> >> >amount of work, I think. But similarly you could use a hash here >> >> >instead of a binary search...but this seems quite good. >> >> > >> >> >As to the choice of 9 elements: I just picked that as a starting >> >> >point; Andres had previously commented off hand that at 8 elements >> >> >serial scanning was faster, so I figured this was a reasonable >> >> >starting point for discussion. >> >> > >> >> >Perhaps it would make sense to determine that minimum not as a pure >> >> >constant but scaled based on how many rows the planner expects us to >> >> >see? Of course that'd be a more invasive patch...so may or may not be >> >> >as feasible as a reasonable default. >> >> > >> >> >> >> Not sure. That seems a bit overcomplicated and I don't think it depends >> >> on the number of rows the planner expects to see very much. I think we >> >> usually assume the linear search is cheaper for small arrays and then at >> >> some point the binary search starts winning The question is where this >> >> "break even" point is. 
>> > >> >Well since it has to do preprocessing work (expanding the array and >> >then sorting it), then the number of rows processed matters, right? >> >For example, doing a linear search on 1000 items only once is going to >> >be cheaper than preprocessing the array and then doing a binary >> >search, but only a very large row count the binary search + >> >preprocessing might very well win out for only a 10 element array. >> > >> >> Hmmm, good point. Essentially the initialization (sorting of the array) >> has some cost, and the question is how much extra per-tuple cost this >> adds. It's probably not worth it for a single lookup, but for many >> lookups it's probably OK. Let's see if I can do the math right: >> >> N - number of lookups >> K - number of array elements >> >> Cost to sort the array is >> >> O(K * log(K)) = C1 * K * log(K) >> >> and the cost of a lookup is C2 * log(K), so with the extra cost amortized >> for N lookups, the total "per lookup" cost is >> >> C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2) >> >> We need to compare this to the O(K) cost of simple linear search, and >> the question is at which point the linear search gets more expensive: >> >> C3 * K = log(K) * (C1 * K / N + C2) >> >> I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if >> there's a matching item we find it half-way through on average, and if >> there is not we have to walk the whole array. So let's say it's 1. >> >> C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for >> random pivot choice IIRC, and C2 is probably similar. With both values >> being ~1.5 we get this: >> >> K = log(K) * (1.5 * K/N + 1.5) >> >> for a fixed K, we get this formula for N: >> >> N = log(K) * 1.5 * K / (K - 1.5 * log(K)) >> >> and for a bunch of K values the results look like this: >> >> K | N >> -------|------- >> 1 | 0 >> 10 | 5.27 >> 100 | 7.42 >> 1000 | 10.47 >> 10000 | 13.83 >> 100000 | 17.27 >> >> i.e. the binary search with 10k values starts winning over linear search >> with just ~13 lookups. >> >> (Assuming I haven't made some silly mistake in the math ...) >> >> Obviously, this only accounts for cost of comparisons and neglects e.g. >> the indirect costs for less predictable memory access patterns mentioned >> by Andres in his response. >> >> But I think it still shows the number of lookups needed for the binary >> search to be a win is pretty low - at least for reasonable number of >> values in the array. Maybe it's 20 and not 10, but I don't think that >> changes much. >> >> The other question is if we can get N at all and how reliable the value >> is. We can probably get the number of rows, but that will ignore other >> conditions that may eliminate the row before the binary search. >> >> >I'm not trying to argue for more work for myself here: I think the >> >optimization is worth it on its own, and something like this could be >> >a further improvement on its own. But it is interesting to think >> >about. >> > >> >> I don't know. Clearly, if the user sends a query with 10k values and >> only does a single lookup, that won't win. And if we can reasonably and >> reliably protect against that, I wouldn't mind doing that, although it >> means a risk of not using the bin search in case of underestimates etc. >> >> I don't have any hard data about this, but I think we can assume the >> number of rows processed by the clause is (much) higher than the number >> of keys in it. 
If you have a clause with 10k values, then you probably >> expect it to be applied to many rows, far more than the "beak even" >> point of about 10-20 rows ... >> >> So I wouldn't worry about this too much. > >Yeah. I think it becomes a lot more interesting in the future if/when >we end up with a way to use this with params and not just constant >arrays. Then the "group" size would matter a whole lot more. >

True. That probably changes the calculations quite a bit.

>For now, the constant amount of overhead is quite small, so even if we >only execute it once we won't make the query that much worse (or, at >least, the total query time will still be very small). Also, because >it's only applied to constants, there's a natural limit to how much >overhead we're likely to introduce into a query. >

FWIW the results from the repeated test with both int and text columns that I shared in [1] also have data for smaller numbers of rows. I haven't tried very hard to minimize noise (the original goal was to test the speedup for large numbers of rows and large arrays, where this is not an issue). But I think it still shows that the threshold of ~10 elements is in the right ballpark. We might use a higher value to be a bit more defensive, but it's never going to be perfect for types with both cheap and more expensive comparisons.

One more note - shouldn't this also tweak cost_qual_eval_walker(), which computes the cost for ScalarArrayOpExpr? At the moment it does this:

    /*
     * Estimate that the operator will be applied to about half of the
     * array elements before the answer is determined.
     */

but that's appropriate for linear search.

[1] https://www.postgresql.org/message-id/20200425124024.hsv7z6bia752uymz%40development

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
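A sketch of the corresponding costing tweak (paraphrased, not the actual costsize.c code; per_element_cost and array_length stand in for whatever the real walker computes): once the binary-search path applies, log2(n) comparisons per row plus a one-time sort charge models the cost better than n/2 comparisons per row.

#include <math.h>
#include <stdbool.h>

/*
 * Paraphrased sketch, not actual costsize.c code: with a linear scan we
 * charge about half the array per row; with the sorted-array binary search
 * a log2(n) charge per row is a better model (the one-time sort would be
 * charged as startup cost, not shown here).
 */
static double
saop_per_tuple_cost(double per_element_cost, double array_length,
                    bool uses_binary_search)
{
    if (uses_binary_search && array_length > 1)
        return per_element_cost * ceil(log2(array_length));

    return per_element_cost * array_length * 0.5;
}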
On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > This reminds me our attempts to add bloom filters to hash joins, which I > think ran into mostly the same challenge of deciding when the bloom > filter can be useful and is worth the extra work. Speaking of that, it would be interesting to see how a test where you write the query as IN(VALUES(...)) instead of IN() compares. It would be interesting to know if the planner is able to make a more suitable choice and also to see how all the work over the years to improve Hash Joins compares to the bsearch with and without the bloom filter. David
On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > This reminds me our attempts to add bloom filters to hash joins, which I > > think ran into mostly the same challenge of deciding when the bloom > > filter can be useful and is worth the extra work. > > Speaking of that, it would be interesting to see how a test where you > write the query as IN(VALUES(...)) instead of IN() compares. It would > be interesting to know if the planner is able to make a more suitable > choice and also to see how all the work over the years to improve Hash > Joins compares to the bsearch with and without the bloom filter.

It would be interesting.

It also makes one wonder about optimizing these into hash joins...which I'd thought about over at [1]. I think it'd be a very significant effort though.

James

[1]: https://www.postgresql.org/message-id/CAAaqYe_zVVOURfdPbAhssijw7yV0uKi350gQ%3D_QGDz7R%3DHpGGQ%40mail.gmail.com
On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> > This reminds me our attempts to add bloom filters to hash joins, which I >> > think ran into mostly the same challenge of deciding when the bloom >> > filter can be useful and is worth the extra work. >> >> Speaking of that, it would be interesting to see how a test where you >> write the query as IN(VALUES(...)) instead of IN() compares. It would >> be interesting to know if the planner is able to make a more suitable >> choice and also to see how all the work over the years to improve Hash >> Joins compares to the bsearch with and without the bloom filter. > >It would be interesting. > >It also makes one wonder about optimizing these into to hash >joins...which I'd thought about over at [1]. I think it'd be a very >significant effort though. >

I modified the script to also do the join version of the query. I can only run it on my laptop at the moment, so the results may be a bit different from those I shared before, but it's interesting, I think.

In most cases it's comparable to the binsearch/bloom approach, and in some cases it actually beats them quite significantly. It seems to depend on how expensive the comparison is - for "int" the comparison is very cheap and there's almost no difference. For "text" the comparison is much more expensive, and there are significant speedups.

For example, for the test with 100k lookups in an array of 10k elements and a 10% match probability, the timings are these:

master:    62362 ms
binsearch:   201 ms
bloom:        65 ms
hashjoin:     36 ms

I do think the explanation is fairly simple - the bloom filter eliminates about 90% of the expensive comparisons, so it's 20ms plus some overhead to build and check the bits. The hash join probably eliminates a lot of the remaining comparisons, because the hash table is sized to have one tuple per bucket.

Note: I also don't claim the PoC has the most efficient bloom filter implementation possible. I'm sure it could be made faster.

Anyway, I'm not sure transforming this to a hash join is worth the effort - I agree that seems quite complex. But perhaps this suggests we should not be doing binary search and instead just build a simple hash table - that seems much simpler, and it'll probably give us about the same benefits.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: > >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > >> > >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> > This reminds me our attempts to add bloom filters to hash joins, which I > >> > think ran into mostly the same challenge of deciding when the bloom > >> > filter can be useful and is worth the extra work. > >> > >> Speaking of that, it would be interesting to see how a test where you > >> write the query as IN(VALUES(...)) instead of IN() compares. It would > >> be interesting to know if the planner is able to make a more suitable > >> choice and also to see how all the work over the years to improve Hash > >> Joins compares to the bsearch with and without the bloom filter. > > > >It would be interesting. > > > >It also makes one wonder about optimizing these into to hash > >joins...which I'd thought about over at [1]. I think it'd be a very > >significant effort though. > > > > I modified the script to also do the join version of the query. I can > only run it on my laptop at the moment, so the results may be a bit > different from those I shared before, but it's interesting I think. > > In most cases it's comparable to the binsearch/bloom approach, and in > some cases it actually beats them quite significantly. It seems to > depend on how expensive the comparison is - for "int" the comparison is > very cheap and there's almost no difference. For "text" the comparison > is much more expensive, and there are significant speedups. > > For example the test with 100k lookups in array of 10k elements and 10% > match probability, the timings are these > > master: 62362 ms > binsearch: 201 ms > bloom: 65 ms > hashjoin: 36 ms > > I do think the explanation is fairly simple - the bloom filter > eliminates about 90% of the expensive comparisons, so it's 20ms plus > some overhead to build and check the bits. The hash join probably > eliminates a lot of the remaining comparisons, because the hash table > is sized to have one tuple per bucket. > > Note: I also don't claim the PoC has the most efficient bloom filter > implementation possible. I'm sure it could be made faster. > > Anyway, I'm not sure transforming this to a hash join is worth the > effort - I agree that seems quite complex. But perhaps this suggest we > should not be doing binary search and instead just build a simple hash > table - that seems much simpler, and it'll probably give us about the > same benefits. That's actually what I originally thought about doing, but I chose binary search since it seemed a lot easier to get off the ground. If we instead build a hash is there anything else we need to be concerned about? For example, work mem? I suppose for the binary search we already have to expand the array, so perhaps it's not all that meaningful relative to that... I was looking earlier at what our standard hash implementation was, and it seemed less obvious what was needed to set that up (so binary search seemed a faster proof of concept). If you happen to have any pointers to similar usages I should look at, please let me know. James
On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: >> >> >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> > This reminds me our attempts to add bloom filters to hash joins, which I >> >> > think ran into mostly the same challenge of deciding when the bloom >> >> > filter can be useful and is worth the extra work. >> >> >> >> Speaking of that, it would be interesting to see how a test where you >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would >> >> be interesting to know if the planner is able to make a more suitable >> >> choice and also to see how all the work over the years to improve Hash >> >> Joins compares to the bsearch with and without the bloom filter. >> > >> >It would be interesting. >> > >> >It also makes one wonder about optimizing these into to hash >> >joins...which I'd thought about over at [1]. I think it'd be a very >> >significant effort though. >> > >> >> I modified the script to also do the join version of the query. I can >> only run it on my laptop at the moment, so the results may be a bit >> different from those I shared before, but it's interesting I think. >> >> In most cases it's comparable to the binsearch/bloom approach, and in >> some cases it actually beats them quite significantly. It seems to >> depend on how expensive the comparison is - for "int" the comparison is >> very cheap and there's almost no difference. For "text" the comparison >> is much more expensive, and there are significant speedups. >> >> For example the test with 100k lookups in array of 10k elements and 10% >> match probability, the timings are these >> >> master: 62362 ms >> binsearch: 201 ms >> bloom: 65 ms >> hashjoin: 36 ms >> >> I do think the explanation is fairly simple - the bloom filter >> eliminates about 90% of the expensive comparisons, so it's 20ms plus >> some overhead to build and check the bits. The hash join probably >> eliminates a lot of the remaining comparisons, because the hash table >> is sized to have one tuple per bucket. >> >> Note: I also don't claim the PoC has the most efficient bloom filter >> implementation possible. I'm sure it could be made faster. >> >> Anyway, I'm not sure transforming this to a hash join is worth the >> effort - I agree that seems quite complex. But perhaps this suggest we >> should not be doing binary search and instead just build a simple hash >> table - that seems much simpler, and it'll probably give us about the >> same benefits. > >That's actually what I originally thought about doing, but I chose >binary search since it seemed a lot easier to get off the ground. > OK, that makes perfect sense. >If we instead build a hash is there anything else we need to be >concerned about? For example, work mem? I suppose for the binary >search we already have to expand the array, so perhaps it's not all >that meaningful relative to that... > I don't think we need to be particularly concerned about work_mem. We don't care about it now, and it's not clear to me what we could do about it - we already have the array in memory anyway, so it's a bit futile. Furthermore, if we need to care about it, it probably applies to the binary search too. 
>I was looking earlier at what our standard hash implementation was, >and it seemed less obvious what was needed to set that up (so binary >search seemed a faster proof of concept). If you happen to have any >pointers to similar usages I should look at, please let me know. > I think the hash join implementation is far too complicated. It has to care about work_mem, so it implements batching, etc. That's a lot of complexity we don't need here. IMO we could use either the usual dynahash, or maybe even the simpler simplehash. FWIW it'd be good to verify the numbers I shared, i.e. checking that the benchmarks make sense and running it independently. I'm not aware of any issues, but it was done late at night and only ran on my laptop. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > >> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I > >> >> > think ran into mostly the same challenge of deciding when the bloom > >> >> > filter can be useful and is worth the extra work. > >> >> > >> >> Speaking of that, it would be interesting to see how a test where you > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would > >> >> be interesting to know if the planner is able to make a more suitable > >> >> choice and also to see how all the work over the years to improve Hash > >> >> Joins compares to the bsearch with and without the bloom filter. > >> > > >> >It would be interesting. > >> > > >> >It also makes one wonder about optimizing these into to hash > >> >joins...which I'd thought about over at [1]. I think it'd be a very > >> >significant effort though. > >> > > >> > >> I modified the script to also do the join version of the query. I can > >> only run it on my laptop at the moment, so the results may be a bit > >> different from those I shared before, but it's interesting I think. > >> > >> In most cases it's comparable to the binsearch/bloom approach, and in > >> some cases it actually beats them quite significantly. It seems to > >> depend on how expensive the comparison is - for "int" the comparison is > >> very cheap and there's almost no difference. For "text" the comparison > >> is much more expensive, and there are significant speedups. > >> > >> For example the test with 100k lookups in array of 10k elements and 10% > >> match probability, the timings are these > >> > >> master: 62362 ms > >> binsearch: 201 ms > >> bloom: 65 ms > >> hashjoin: 36 ms > >> > >> I do think the explanation is fairly simple - the bloom filter > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus > >> some overhead to build and check the bits. The hash join probably > >> eliminates a lot of the remaining comparisons, because the hash table > >> is sized to have one tuple per bucket. > >> > >> Note: I also don't claim the PoC has the most efficient bloom filter > >> implementation possible. I'm sure it could be made faster. > >> > >> Anyway, I'm not sure transforming this to a hash join is worth the > >> effort - I agree that seems quite complex. But perhaps this suggest we > >> should not be doing binary search and instead just build a simple hash > >> table - that seems much simpler, and it'll probably give us about the > >> same benefits. > > > >That's actually what I originally thought about doing, but I chose > >binary search since it seemed a lot easier to get off the ground. > > > > OK, that makes perfect sense. > > >If we instead build a hash is there anything else we need to be > >concerned about? For example, work mem? I suppose for the binary > >search we already have to expand the array, so perhaps it's not all > >that meaningful relative to that... > > > > I don't think we need to be particularly concerned about work_mem. 
We > don't care about it now, and it's not clear to me what we could do about > it - we already have the array in memory anyway, so it's a bit futile. > Furthermore, if we need to care about it, it probably applies to the > binary search too. > > >I was looking earlier at what our standard hash implementation was, > >and it seemed less obvious what was needed to set that up (so binary > >search seemed a faster proof of concept). If you happen to have any > >pointers to similar usages I should look at, please let me know. > > > > I think the hash join implementation is far too complicated. It has to > care about work_mem, so it implements batching, etc. That's a lot of > complexity we don't need here. IMO we could use either the usual > dynahash, or maybe even the simpler simplehash. > > FWIW it'd be good to verify the numbers I shared, i.e. checking that the > benchmarks makes sense and running it independently. I'm not aware of > any issues but it was done late at night and only ran on my laptop. Some quick calculations (don't have the scripting in a form I can attach yet; using this as an opportunity to hack on a genericized performance testing framework of sorts) suggest your results are correct. I was also testing on my laptop, but I showed 1.) roughly equivalent results for IN (VALUES ...) and IN (<list>) for integers, but when I switch to (short; average 3 characters long) text values I show the hash join on VALUES is twice as fast as the binary search. Given that, I'm planning to implement this as a hash lookup and share a revised patch. James
On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > >> >>
> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> > >> >> > think ran into mostly the same challenge of deciding when the bloom
> > >> >> > filter can be useful and is worth the extra work.
> > >> >>
> > >> >> Speaking of that, it would be interesting to see how a test where you
> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> > >> >> be interesting to know if the planner is able to make a more suitable
> > >> >> choice and also to see how all the work over the years to improve Hash
> > >> >> Joins compares to the bsearch with and without the bloom filter.
> > >> >
> > >> >It would be interesting.
> > >> >
> > >> >It also makes one wonder about optimizing these into to hash
> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
> > >> >significant effort though.
> > >> >
> > >>
> > >> I modified the script to also do the join version of the query. I can
> > >> only run it on my laptop at the moment, so the results may be a bit
> > >> different from those I shared before, but it's interesting I think.
> > >>
> > >> In most cases it's comparable to the binsearch/bloom approach, and in
> > >> some cases it actually beats them quite significantly. It seems to
> > >> depend on how expensive the comparison is - for "int" the comparison is
> > >> very cheap and there's almost no difference. For "text" the comparison
> > >> is much more expensive, and there are significant speedups.
> > >>
> > >> For example the test with 100k lookups in array of 10k elements and 10%
> > >> match probability, the timings are these
> > >>
> > >> master: 62362 ms
> > >> binsearch: 201 ms
> > >> bloom: 65 ms
> > >> hashjoin: 36 ms
> > >>
> > >> I do think the explanation is fairly simple - the bloom filter
> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> > >> some overhead to build and check the bits. The hash join probably
> > >> eliminates a lot of the remaining comparisons, because the hash table
> > >> is sized to have one tuple per bucket.
> > >>
> > >> Note: I also don't claim the PoC has the most efficient bloom filter
> > >> implementation possible. I'm sure it could be made faster.
> > >>
> > >> Anyway, I'm not sure transforming this to a hash join is worth the
> > >> effort - I agree that seems quite complex. But perhaps this suggest we
> > >> should not be doing binary search and instead just build a simple hash
> > >> table - that seems much simpler, and it'll probably give us about the
> > >> same benefits.
> > >
> > >That's actually what I originally thought about doing, but I chose
> > >binary search since it seemed a lot easier to get off the ground.
> > >
> >
> > OK, that makes perfect sense.
> >
> > >If we instead build a hash is there anything else we need to be
> > >concerned about? For example, work mem? I suppose for the binary
> > >search we already have to expand the array, so perhaps it's not all
> > >that meaningful relative to that...
> > >
> >
> > I don't think we need to be particularly concerned about work_mem. We
> > don't care about it now, and it's not clear to me what we could do about
> > it - we already have the array in memory anyway, so it's a bit futile.
> > Furthermore, if we need to care about it, it probably applies to the
> > binary search too.
> >
> > >I was looking earlier at what our standard hash implementation was,
> > >and it seemed less obvious what was needed to set that up (so binary
> > >search seemed a faster proof of concept). If you happen to have any
> > >pointers to similar usages I should look at, please let me know.
> > >
> >
> > I think the hash join implementation is far too complicated. It has to
> > care about work_mem, so it implements batching, etc. That's a lot of
> > complexity we don't need here. IMO we could use either the usual
> > dynahash, or maybe even the simpler simplehash.
> >
> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> > benchmarks makes sense and running it independently. I'm not aware of
> > any issues but it was done late at night and only ran on my laptop.
>
> Some quick calculations (don't have the scripting in a form I can
> attach yet; using this as an opportunity to hack on a genericized
> performance testing framework of sorts) suggest your results are
> correct. I was also testing on my laptop, but I showed 1.) roughly
> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
> but when I switch to (short; average 3 characters long) text values I
> show the hash join on VALUES is twice as fast as the binary search.
>
> Given that, I'm planning to implement this as a hash lookup and share
> a revised patch.
While working on this I noticed that dynahash.c line 499 has this assertion:
Assert(info->entrysize >= info->keysize);
Do you by any chance know why the entry would need to be larger than the key? In this case I'm really treating the hash like a set (if there's a hash set implementation that doesn't store a value, then I'd be happy to use that instead), so I've configured the entry as sizeof(bool), which is obviously smaller than the key.
If it helps, that line was added by Tom in fba2a104c6d.
Thanks,
James
On Mon, 27 Apr 2020 at 15:12, James Coleman <jtc331@gmail.com> wrote: > While working on this I noticed that dynahash.c line 499 has this assertion: > > Assert(info->entrysize >= info->keysize); > > Do you by any chance know why the entry would need to be larger than the key? Larger or equal. They'd be equal if the key was the data, since you do need to store at least the key. Looking at the code for examples where dynahash is used in that situation, I see _hash_finish_split(). David
On Sun, Apr 26, 2020 at 11:44 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Mon, 27 Apr 2020 at 15:12, James Coleman <jtc331@gmail.com> wrote: > > While working on this I noticed that dynahash.c line 499 has this assertion: > > > > Assert(info->entrysize >= info->keysize); > > > > Do you by any chance know why the entry would need to be larger than the key? > > Larger or equal. They'd be equal if you the key was the data, since > you do need to store at least the key. Looking at the code for > examples where dynahash is used in that situation, I see > _hash_finish_split(). Ah, I was thinking of it as key and value being separate sizes added together rather than one including the other. Thanks, James
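To make the set-style dynahash usage concrete, here is a hedged sketch (not the patch itself) in which the entry is simply the key, which satisfies the entrysize >= keysize assertion. It assumes a by-value element type, since HASH_BLOBS hashes and compares the raw Datum bytes; by-reference types such as text would instead need the element type's own hash and match support functions. The variable and function names are placeholders:

/* Sketch only: a dynahash table used as a set of by-value Datums. */
#include "postgres.h"
#include "utils/hsearch.h"

static HTAB *
build_saop_element_set(Datum *elem_values, int nitems)
{
    HASHCTL     ctl;
    HTAB       *elements;
    bool        found;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Datum);
    ctl.entrysize = sizeof(Datum);  /* entry == key, i.e. set semantics */
    ctl.hcxt = CurrentMemoryContext;

    elements = hash_create("ScalarArrayOp elements", nitems, &ctl,
                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

    for (int i = 0; i < nitems; i++)
        hash_search(elements, &elem_values[i], HASH_ENTER, &found);

    return elements;
}

/* the per-row probe: found says whether the scalar matched a constant */
static bool
saop_element_set_contains(HTAB *elements, Datum value)
{
    bool        found;

    (void) hash_search(elements, &value, HASH_FIND, &found);
    return found;
}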
On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote: > > On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: > > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra > > ><tomas.vondra@2ndquadrant.com> wrote: > > >> > > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: > > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > > >> >> > > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I > > >> >> > think ran into mostly the same challenge of deciding when the bloom > > >> >> > filter can be useful and is worth the extra work. > > >> >> > > >> >> Speaking of that, it would be interesting to see how a test where you > > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would > > >> >> be interesting to know if the planner is able to make a more suitable > > >> >> choice and also to see how all the work over the years to improve Hash > > >> >> Joins compares to the bsearch with and without the bloom filter. > > >> > > > >> >It would be interesting. > > >> > > > >> >It also makes one wonder about optimizing these into to hash > > >> >joins...which I'd thought about over at [1]. I think it'd be a very > > >> >significant effort though. > > >> > > > >> > > >> I modified the script to also do the join version of the query. I can > > >> only run it on my laptop at the moment, so the results may be a bit > > >> different from those I shared before, but it's interesting I think. > > >> > > >> In most cases it's comparable to the binsearch/bloom approach, and in > > >> some cases it actually beats them quite significantly. It seems to > > >> depend on how expensive the comparison is - for "int" the comparison is > > >> very cheap and there's almost no difference. For "text" the comparison > > >> is much more expensive, and there are significant speedups. > > >> > > >> For example the test with 100k lookups in array of 10k elements and 10% > > >> match probability, the timings are these > > >> > > >> master: 62362 ms > > >> binsearch: 201 ms > > >> bloom: 65 ms > > >> hashjoin: 36 ms > > >> > > >> I do think the explanation is fairly simple - the bloom filter > > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus > > >> some overhead to build and check the bits. The hash join probably > > >> eliminates a lot of the remaining comparisons, because the hash table > > >> is sized to have one tuple per bucket. > > >> > > >> Note: I also don't claim the PoC has the most efficient bloom filter > > >> implementation possible. I'm sure it could be made faster. > > >> > > >> Anyway, I'm not sure transforming this to a hash join is worth the > > >> effort - I agree that seems quite complex. But perhaps this suggest we > > >> should not be doing binary search and instead just build a simple hash > > >> table - that seems much simpler, and it'll probably give us about the > > >> same benefits. > > > > > >That's actually what I originally thought about doing, but I chose > > >binary search since it seemed a lot easier to get off the ground. > > > > > > > OK, that makes perfect sense. > > > > >If we instead build a hash is there anything else we need to be > > >concerned about? For example, work mem? 
I suppose for the binary > > >search we already have to expand the array, so perhaps it's not all > > >that meaningful relative to that... > > > > > > > I don't think we need to be particularly concerned about work_mem. We > > don't care about it now, and it's not clear to me what we could do about > > it - we already have the array in memory anyway, so it's a bit futile. > > Furthermore, if we need to care about it, it probably applies to the > > binary search too. > > > > >I was looking earlier at what our standard hash implementation was, > > >and it seemed less obvious what was needed to set that up (so binary > > >search seemed a faster proof of concept). If you happen to have any > > >pointers to similar usages I should look at, please let me know. > > > > > > > I think the hash join implementation is far too complicated. It has to > > care about work_mem, so it implements batching, etc. That's a lot of > > complexity we don't need here. IMO we could use either the usual > > dynahash, or maybe even the simpler simplehash. > > > > FWIW it'd be good to verify the numbers I shared, i.e. checking that the > > benchmarks makes sense and running it independently. I'm not aware of > > any issues but it was done late at night and only ran on my laptop. > > Some quick calculations (don't have the scripting in a form I can > attach yet; using this as an opportunity to hack on a genericized > performance testing framework of sorts) suggest your results are > correct. I was also testing on my laptop, but I showed 1.) roughly > equivalent results for IN (VALUES ...) and IN (<list>) for integers, > but when I switch to (short; average 3 characters long) text values I > show the hash join on VALUES is twice as fast as the binary search. > > Given that, I'm planning to implement this as a hash lookup and share > a revised patch. I've attached a patch series as before, but with an additional patch that switches to using dynahash instead of binary search. Whereas before the benchmarking ended up with a trimodal distribution (i.e., master with IN <list>, patch with IN <list>, and either with IN VALUES), the hash implementation brings us back to an effectively bimodal distribution -- though the hash scalar array op expression implementation for text is about 5% faster than the hash join. Current outstanding thoughts (besides comment/naming cleanup): - The saop costing needs to be updated to match, as Tomas pointed out. - Should we be concerned about single execution cases? For example, is the regression of speed on a simple SELECT x IN something we should try to defeat by only kicking in the optimization if we execute in a loop at least twice? That might be of particular interest to pl/pgsql. - Should we have a test for an operator with a non-strict function? I'm not aware of any built-in ops that have that characteristic; would you suggest just creating a fake one for the test? James
Attachment
On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote: >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote: >> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >> > >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra >> > ><tomas.vondra@2ndquadrant.com> wrote: >> > >> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: >> > >> >> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I >> > >> >> > think ran into mostly the same challenge of deciding when the bloom >> > >> >> > filter can be useful and is worth the extra work. >> > >> >> >> > >> >> Speaking of that, it would be interesting to see how a test where you >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would >> > >> >> be interesting to know if the planner is able to make a more suitable >> > >> >> choice and also to see how all the work over the years to improve Hash >> > >> >> Joins compares to the bsearch with and without the bloom filter. >> > >> > >> > >> >It would be interesting. >> > >> > >> > >> >It also makes one wonder about optimizing these into to hash >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very >> > >> >significant effort though. >> > >> > >> > >> >> > >> I modified the script to also do the join version of the query. I can >> > >> only run it on my laptop at the moment, so the results may be a bit >> > >> different from those I shared before, but it's interesting I think. >> > >> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in >> > >> some cases it actually beats them quite significantly. It seems to >> > >> depend on how expensive the comparison is - for "int" the comparison is >> > >> very cheap and there's almost no difference. For "text" the comparison >> > >> is much more expensive, and there are significant speedups. >> > >> >> > >> For example the test with 100k lookups in array of 10k elements and 10% >> > >> match probability, the timings are these >> > >> >> > >> master: 62362 ms >> > >> binsearch: 201 ms >> > >> bloom: 65 ms >> > >> hashjoin: 36 ms >> > >> >> > >> I do think the explanation is fairly simple - the bloom filter >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus >> > >> some overhead to build and check the bits. The hash join probably >> > >> eliminates a lot of the remaining comparisons, because the hash table >> > >> is sized to have one tuple per bucket. >> > >> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter >> > >> implementation possible. I'm sure it could be made faster. >> > >> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the >> > >> effort - I agree that seems quite complex. But perhaps this suggest we >> > >> should not be doing binary search and instead just build a simple hash >> > >> table - that seems much simpler, and it'll probably give us about the >> > >> same benefits. >> > > >> > >That's actually what I originally thought about doing, but I chose >> > >binary search since it seemed a lot easier to get off the ground. >> > > >> > >> > OK, that makes perfect sense. 
>> > >> > >If we instead build a hash is there anything else we need to be >> > >concerned about? For example, work mem? I suppose for the binary >> > >search we already have to expand the array, so perhaps it's not all >> > >that meaningful relative to that... >> > > >> > >> > I don't think we need to be particularly concerned about work_mem. We >> > don't care about it now, and it's not clear to me what we could do about >> > it - we already have the array in memory anyway, so it's a bit futile. >> > Furthermore, if we need to care about it, it probably applies to the >> > binary search too. >> > >> > >I was looking earlier at what our standard hash implementation was, >> > >and it seemed less obvious what was needed to set that up (so binary >> > >search seemed a faster proof of concept). If you happen to have any >> > >pointers to similar usages I should look at, please let me know. >> > > >> > >> > I think the hash join implementation is far too complicated. It has to >> > care about work_mem, so it implements batching, etc. That's a lot of >> > complexity we don't need here. IMO we could use either the usual >> > dynahash, or maybe even the simpler simplehash. >> > >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the >> > benchmarks makes sense and running it independently. I'm not aware of >> > any issues but it was done late at night and only ran on my laptop. >> >> Some quick calculations (don't have the scripting in a form I can >> attach yet; using this as an opportunity to hack on a genericized >> performance testing framework of sorts) suggest your results are >> correct. I was also testing on my laptop, but I showed 1.) roughly >> equivalent results for IN (VALUES ...) and IN (<list>) for integers, >> but when I switch to (short; average 3 characters long) text values I >> show the hash join on VALUES is twice as fast as the binary search. >> >> Given that, I'm planning to implement this as a hash lookup and share >> a revised patch. > >I've attached a patch series as before, but with an additional patch >that switches to using dynahash instead of binary search. > OK. I can't take closer look at the moment, I'll check later. Any particular reasons to pick dynahash over simplehash? ISTM we're using simplehash elsewhere in the executor (grouping, tidbitmap, ...), while there are not many places using dynahash for simple short-lived hash tables. Of course, that alone is a weak reason to insist on using simplehash here, but I suppose there were reasons for not using dynahash and we'll end up facing the same issues here. >Whereas before the benchmarking ended up with a trimodal distribution >(i.e., master with IN <list>, patch with IN <list>, and either with IN >VALUES), the hash implementation brings us back to an effectively >bimodal distribution -- though the hash scalar array op expression >implementation for text is about 5% faster than the hash join. > Nice. I'm not surprised this is a bit faster than hash join, which has to worry about additional stuff. >Current outstanding thoughts (besides comment/naming cleanup): > >- The saop costing needs to be updated to match, as Tomas pointed out. > >- Should we be concerned about single execution cases? For example, is >the regression of speed on a simple SELECT x IN something we should >try to defeat by only kicking in the optimization if we execute in a >loop at least twice? That might be of particular interest to pl/pgsql. > I don't follow. How is this related to the number of executions and/or plpgsql? 
I suspect you might be talking about prepared statements, but surely the hash table is built for each execution anyway, even if the plan is reused, right? I think the issue we've been talking about is considering the number of lookups we expect to do in the array/hash table. But that has nothing to do with plpgsql and/or multiple executions ... >- Should we have a test for an operator with a non-strict function? >I'm not aware of any built-in ops that have that characteristic; would >you suggest just creating a fake one for the test? > Dunno, I haven't thought about this very much. In general I think the new code should simply behave the same as the current code, i.e. if it does not check for strictness, we don't need either. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
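For comparison, a sketch of what the simplehash route could look like. The saophash prefix, the entry struct, and the hash helper below are assumptions modeled on existing simplehash users (e.g. tidbitmap.c), not code from the patch; simplehash is generated from template macros, and the element struct has to carry the status byte the implementation manages:

/* Sketch only: a simplehash-generated set over uint32 keys. */
#include "postgres.h"
#include "common/hashfn.h"      /* murmurhash32; utils/hashutils.h on older branches */

typedef struct SaopSetEntry
{
    uint32      key;
    char        status;         /* required by simplehash.h */
} SaopSetEntry;

#define SH_PREFIX saophash
#define SH_ELEMENT_TYPE SaopSetEntry
#define SH_KEY_TYPE uint32
#define SH_KEY key
#define SH_HASH_KEY(tb, k) murmurhash32(k)
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

static bool
saop_set_probe(const uint32 *values, int nitems, uint32 probe)
{
    saophash_hash *tab = saophash_create(CurrentMemoryContext, nitems, NULL);
    bool        found;

    for (int i = 0; i < nitems; i++)
        saophash_insert(tab, values[i], &found);

    return saophash_lookup(tab, probe) != NULL;
}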
On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote: > >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote: > >> > >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra > >> <tomas.vondra@2ndquadrant.com> wrote: > >> > > >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: > >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra > >> > ><tomas.vondra@2ndquadrant.com> wrote: > >> > >> > >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: > >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: > >> > >> >> > >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I > >> > >> >> > think ran into mostly the same challenge of deciding when the bloom > >> > >> >> > filter can be useful and is worth the extra work. > >> > >> >> > >> > >> >> Speaking of that, it would be interesting to see how a test where you > >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would > >> > >> >> be interesting to know if the planner is able to make a more suitable > >> > >> >> choice and also to see how all the work over the years to improve Hash > >> > >> >> Joins compares to the bsearch with and without the bloom filter. > >> > >> > > >> > >> >It would be interesting. > >> > >> > > >> > >> >It also makes one wonder about optimizing these into to hash > >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very > >> > >> >significant effort though. > >> > >> > > >> > >> > >> > >> I modified the script to also do the join version of the query. I can > >> > >> only run it on my laptop at the moment, so the results may be a bit > >> > >> different from those I shared before, but it's interesting I think. > >> > >> > >> > >> In most cases it's comparable to the binsearch/bloom approach, and in > >> > >> some cases it actually beats them quite significantly. It seems to > >> > >> depend on how expensive the comparison is - for "int" the comparison is > >> > >> very cheap and there's almost no difference. For "text" the comparison > >> > >> is much more expensive, and there are significant speedups. > >> > >> > >> > >> For example the test with 100k lookups in array of 10k elements and 10% > >> > >> match probability, the timings are these > >> > >> > >> > >> master: 62362 ms > >> > >> binsearch: 201 ms > >> > >> bloom: 65 ms > >> > >> hashjoin: 36 ms > >> > >> > >> > >> I do think the explanation is fairly simple - the bloom filter > >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus > >> > >> some overhead to build and check the bits. The hash join probably > >> > >> eliminates a lot of the remaining comparisons, because the hash table > >> > >> is sized to have one tuple per bucket. > >> > >> > >> > >> Note: I also don't claim the PoC has the most efficient bloom filter > >> > >> implementation possible. I'm sure it could be made faster. > >> > >> > >> > >> Anyway, I'm not sure transforming this to a hash join is worth the > >> > >> effort - I agree that seems quite complex. But perhaps this suggest we > >> > >> should not be doing binary search and instead just build a simple hash > >> > >> table - that seems much simpler, and it'll probably give us about the > >> > >> same benefits. 
> >> > > > >> > >That's actually what I originally thought about doing, but I chose > >> > >binary search since it seemed a lot easier to get off the ground. > >> > > > >> > > >> > OK, that makes perfect sense. > >> > > >> > >If we instead build a hash is there anything else we need to be > >> > >concerned about? For example, work mem? I suppose for the binary > >> > >search we already have to expand the array, so perhaps it's not all > >> > >that meaningful relative to that... > >> > > > >> > > >> > I don't think we need to be particularly concerned about work_mem. We > >> > don't care about it now, and it's not clear to me what we could do about > >> > it - we already have the array in memory anyway, so it's a bit futile. > >> > Furthermore, if we need to care about it, it probably applies to the > >> > binary search too. > >> > > >> > >I was looking earlier at what our standard hash implementation was, > >> > >and it seemed less obvious what was needed to set that up (so binary > >> > >search seemed a faster proof of concept). If you happen to have any > >> > >pointers to similar usages I should look at, please let me know. > >> > > > >> > > >> > I think the hash join implementation is far too complicated. It has to > >> > care about work_mem, so it implements batching, etc. That's a lot of > >> > complexity we don't need here. IMO we could use either the usual > >> > dynahash, or maybe even the simpler simplehash. > >> > > >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the > >> > benchmarks makes sense and running it independently. I'm not aware of > >> > any issues but it was done late at night and only ran on my laptop. > >> > >> Some quick calculations (don't have the scripting in a form I can > >> attach yet; using this as an opportunity to hack on a genericized > >> performance testing framework of sorts) suggest your results are > >> correct. I was also testing on my laptop, but I showed 1.) roughly > >> equivalent results for IN (VALUES ...) and IN (<list>) for integers, > >> but when I switch to (short; average 3 characters long) text values I > >> show the hash join on VALUES is twice as fast as the binary search. > >> > >> Given that, I'm planning to implement this as a hash lookup and share > >> a revised patch. > > > >I've attached a patch series as before, but with an additional patch > >that switches to using dynahash instead of binary search. > > > > OK. I can't take closer look at the moment, I'll check later. > > Any particular reasons to pick dynahash over simplehash? ISTM we're > using simplehash elsewhere in the executor (grouping, tidbitmap, ...), > while there are not many places using dynahash for simple short-lived > hash tables. Of course, that alone is a weak reason to insist on using > simplehash here, but I suppose there were reasons for not using dynahash > and we'll end up facing the same issues here. No particular reason; it wasn't clear to me that there was a reason to prefer one or the other (and I'm not acquainted with the codebase enough to know the differences), so I chose dynahash because it was easier to find examples to follow for implementation. > >Whereas before the benchmarking ended up with a trimodal distribution > >(i.e., master with IN <list>, patch with IN <list>, and either with IN > >VALUES), the hash implementation brings us back to an effectively > >bimodal distribution -- though the hash scalar array op expression > >implementation for text is about 5% faster than the hash join. > > > > Nice. 
I'm not surprised this is a bit faster than hash join, which has > to worry about additional stuff. > > >Current outstanding thoughts (besides comment/naming cleanup): > > > >- The saop costing needs to be updated to match, as Tomas pointed out. > > > >- Should we be concerned about single execution cases? For example, is > >the regression of speed on a simple SELECT x IN something we should > >try to defeat by only kicking in the optimization if we execute in a > >loop at least twice? That might be of particular interest to pl/pgsql. > > > > I don't follow. How is this related to the number of executions and/or > plpgsql? I suspect you might be talking about prepared statements, but > surely the hash table is built for each execution anyway, even if the > plan is reused, right? > > I think the issue we've been talking about is considering the number of > lookups we expect to do in the array/hash table. But that has nothing to > do with plpgsql and/or multiple executions ... Suppose I do "SELECT 1 IN <100 item list>" (whether just as a standalone query or in pl/pgsql). Then it doesn't make sense to use the optimization, because it can't possibly win over a naive linear scan through the array since to build the hash we have to do that linear scan anyway (I suppose in theory the hashing function could be dramatically faster than the equality op, so maybe it could win overall, but it seems unlikely to me). I'm not so concerned about this in any query where we have a real FROM clause because even if we end up with only one row, the relative penalty is low, and the potential gain is very high. But simple expressions in pl/pgsql, for example, are a case where we can know for certain (correct me if I've wrong on this) that we'll only execute the expression once, which means there's probably always a penalty for choosing the implementation with setup costs over the default linear scan through the array. > >- Should we have a test for an operator with a non-strict function? > >I'm not aware of any built-in ops that have that characteristic; would > >you suggest just creating a fake one for the test? > > > > Dunno, I haven't thought about this very much. In general I think the > new code should simply behave the same as the current code, i.e. if it > does not check for strictness, we don't need either. Both the original implementation and this optimized version have the following short circuit condition: if (fcinfo->args[0].isnull && strictfunc) But I'm not sure there are any existing tests to show that the "&& strictfunc" is required. James
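One cheap way to sidestep the single-execution regression, short of a real costing model, would be to gate the hash build on the number of elements and (where known) the expected number of evaluations. The constant and function below are hypothetical illustrations, not part of the patch:

/*
 * Hypothetical heuristic: only pay the setup cost of hashing the constant
 * array when a linear scan is likely to lose. A single evaluation can never
 * win, since building the table is itself a full pass over the array.
 */
#define MIN_ELEMS_FOR_HASHED_SAOP 9     /* the starting point discussed upthread */

static bool
saop_should_build_hash(int nitems, double expected_evals)
{
    if (expected_evals <= 1.0)
        return false;

    return nitems >= MIN_ELEMS_FOR_HASHED_SAOP;
}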
On Tue, Apr 28, 2020 at 08:39:18AM -0400, James Coleman wrote: >On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote: >> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote: >> >> >> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra >> >> <tomas.vondra@2ndquadrant.com> wrote: >> >> > >> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote: >> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra >> >> > ><tomas.vondra@2ndquadrant.com> wrote: >> >> > >> >> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote: >> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote: >> >> > >> >> >> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I >> >> > >> >> > think ran into mostly the same challenge of deciding when the bloom >> >> > >> >> > filter can be useful and is worth the extra work. >> >> > >> >> >> >> > >> >> Speaking of that, it would be interesting to see how a test where you >> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would >> >> > >> >> be interesting to know if the planner is able to make a more suitable >> >> > >> >> choice and also to see how all the work over the years to improve Hash >> >> > >> >> Joins compares to the bsearch with and without the bloom filter. >> >> > >> > >> >> > >> >It would be interesting. >> >> > >> > >> >> > >> >It also makes one wonder about optimizing these into to hash >> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very >> >> > >> >significant effort though. >> >> > >> > >> >> > >> >> >> > >> I modified the script to also do the join version of the query. I can >> >> > >> only run it on my laptop at the moment, so the results may be a bit >> >> > >> different from those I shared before, but it's interesting I think. >> >> > >> >> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in >> >> > >> some cases it actually beats them quite significantly. It seems to >> >> > >> depend on how expensive the comparison is - for "int" the comparison is >> >> > >> very cheap and there's almost no difference. For "text" the comparison >> >> > >> is much more expensive, and there are significant speedups. >> >> > >> >> >> > >> For example the test with 100k lookups in array of 10k elements and 10% >> >> > >> match probability, the timings are these >> >> > >> >> >> > >> master: 62362 ms >> >> > >> binsearch: 201 ms >> >> > >> bloom: 65 ms >> >> > >> hashjoin: 36 ms >> >> > >> >> >> > >> I do think the explanation is fairly simple - the bloom filter >> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus >> >> > >> some overhead to build and check the bits. The hash join probably >> >> > >> eliminates a lot of the remaining comparisons, because the hash table >> >> > >> is sized to have one tuple per bucket. >> >> > >> >> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter >> >> > >> implementation possible. I'm sure it could be made faster. >> >> > >> >> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the >> >> > >> effort - I agree that seems quite complex. 
But perhaps this suggest we >> >> > >> should not be doing binary search and instead just build a simple hash >> >> > >> table - that seems much simpler, and it'll probably give us about the >> >> > >> same benefits. >> >> > > >> >> > >That's actually what I originally thought about doing, but I chose >> >> > >binary search since it seemed a lot easier to get off the ground. >> >> > > >> >> > >> >> > OK, that makes perfect sense. >> >> > >> >> > >If we instead build a hash is there anything else we need to be >> >> > >concerned about? For example, work mem? I suppose for the binary >> >> > >search we already have to expand the array, so perhaps it's not all >> >> > >that meaningful relative to that... >> >> > > >> >> > >> >> > I don't think we need to be particularly concerned about work_mem. We >> >> > don't care about it now, and it's not clear to me what we could do about >> >> > it - we already have the array in memory anyway, so it's a bit futile. >> >> > Furthermore, if we need to care about it, it probably applies to the >> >> > binary search too. >> >> > >> >> > >I was looking earlier at what our standard hash implementation was, >> >> > >and it seemed less obvious what was needed to set that up (so binary >> >> > >search seemed a faster proof of concept). If you happen to have any >> >> > >pointers to similar usages I should look at, please let me know. >> >> > > >> >> > >> >> > I think the hash join implementation is far too complicated. It has to >> >> > care about work_mem, so it implements batching, etc. That's a lot of >> >> > complexity we don't need here. IMO we could use either the usual >> >> > dynahash, or maybe even the simpler simplehash. >> >> > >> >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the >> >> > benchmarks makes sense and running it independently. I'm not aware of >> >> > any issues but it was done late at night and only ran on my laptop. >> >> >> >> Some quick calculations (don't have the scripting in a form I can >> >> attach yet; using this as an opportunity to hack on a genericized >> >> performance testing framework of sorts) suggest your results are >> >> correct. I was also testing on my laptop, but I showed 1.) roughly >> >> equivalent results for IN (VALUES ...) and IN (<list>) for integers, >> >> but when I switch to (short; average 3 characters long) text values I >> >> show the hash join on VALUES is twice as fast as the binary search. >> >> >> >> Given that, I'm planning to implement this as a hash lookup and share >> >> a revised patch. >> > >> >I've attached a patch series as before, but with an additional patch >> >that switches to using dynahash instead of binary search. >> > >> >> OK. I can't take closer look at the moment, I'll check later. >> >> Any particular reasons to pick dynahash over simplehash? ISTM we're >> using simplehash elsewhere in the executor (grouping, tidbitmap, ...), >> while there are not many places using dynahash for simple short-lived >> hash tables. Of course, that alone is a weak reason to insist on using >> simplehash here, but I suppose there were reasons for not using dynahash >> and we'll end up facing the same issues here. > >No particular reason; it wasn't clear to me that there was a reason to >prefer one or the other (and I'm not acquainted with the codebase >enough to know the differences), so I chose dynahash because it was >easier to find examples to follow for implementation. > OK, understood. 
>> >Whereas before the benchmarking ended up with a trimodal distribution >> >(i.e., master with IN <list>, patch with IN <list>, and either with IN >> >VALUES), the hash implementation brings us back to an effectively >> >bimodal distribution -- though the hash scalar array op expression >> >implementation for text is about 5% faster than the hash join. >> > >> >> Nice. I'm not surprised this is a bit faster than hash join, which has >> to worry about additional stuff. >> >> >Current outstanding thoughts (besides comment/naming cleanup): >> > >> >- The saop costing needs to be updated to match, as Tomas pointed out. >> > >> >- Should we be concerned about single execution cases? For example, is >> >the regression of speed on a simple SELECT x IN something we should >> >try to defeat by only kicking in the optimization if we execute in a >> >loop at least twice? That might be of particular interest to pl/pgsql. >> > >> >> I don't follow. How is this related to the number of executions and/or >> plpgsql? I suspect you might be talking about prepared statements, but >> surely the hash table is built for each execution anyway, even if the >> plan is reused, right? >> >> I think the issue we've been talking about is considering the number of >> lookups we expect to do in the array/hash table. But that has nothing to >> do with plpgsql and/or multiple executions ... > >Suppose I do "SELECT 1 IN <100 item list>" (whether just as a >standalone query or in pl/pgsql). Then it doesn't make sense to use >the optimization, because it can't possibly win over a naive linear >scan through the array since to build the hash we have to do that >linear scan anyway (I suppose in theory the hashing function could be >dramatically faster than the equality op, so maybe it could win >overall, but it seems unlikely to me). True. I'm sure we could construct such cases, but this is hardly the only place where that would be an issue. We could create some rough costing model and make a decision based on that. >I'm not so concerned about this in any query where we have a real FROM >clause because even if we end up with only one row, the relative >penalty is low, and the potential gain is very high. But simple >expressions in pl/pgsql, for example, are a case where we can know for >certain (correct me if I've wrong on this) that we'll only execute the >expression once, which means there's probably always a penalty for >choosing the implementation with setup costs over the default linear >scan through the array. > What do you mean by "simple expressions"? I'm not plpgsql expert and I see it mostly as a way to glue together SQL queries, but yeah - if we know a given ScalarArrayOpExpr will only be executed once, then we can disable this optimization for now. >> >- Should we have a test for an operator with a non-strict function? >> >I'm not aware of any built-in ops that have that characteristic; >> >would you suggest just creating a fake one for the test? >> > >> >> Dunno, I haven't thought about this very much. In general I think the >> new code should simply behave the same as the current code, i.e. if >> it does not check for strictness, we don't need either. > >Both the original implementation and this optimized version have the >following short circuit condition: > >if (fcinfo->args[0].isnull && strictfunc) > >But I'm not sure there are any existing tests to show that the "&& >strictfunc" is required. > Ah! You mean a regression test, not a test in the "if condition" sense. 
I don't see a reason not to have such a test, although it's probably not something we should require from this patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2020 at 3:26 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Apr 28, 2020 at 08:39:18AM -0400, James Coleman wrote:
>On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
>> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>> >>
>> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
>> >> <tomas.vondra@2ndquadrant.com> wrote:
>> >> >
>> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
>> >> > ><tomas.vondra@2ndquadrant.com> wrote:
>> >> > >>
>> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> >> > >> >>
>> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> >> > >> >> > think ran into mostly the same challenge of deciding when the bloom
>> >> > >> >> > filter can be useful and is worth the extra work.
>> >> > >> >>
>> >> > >> >> Speaking of that, it would be interesting to see how a test where you
>> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> >> > >> >> be interesting to know if the planner is able to make a more suitable
>> >> > >> >> choice and also to see how all the work over the years to improve Hash
>> >> > >> >> Joins compares to the bsearch with and without the bloom filter.
>> >> > >> >
>> >> > >> >It would be interesting.
>> >> > >> >
>> >> > >> >It also makes one wonder about optimizing these into to hash
>> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
>> >> > >> >significant effort though.
>> >> > >> >
>> >> > >>
>> >> > >> I modified the script to also do the join version of the query. I can
>> >> > >> only run it on my laptop at the moment, so the results may be a bit
>> >> > >> different from those I shared before, but it's interesting I think.
>> >> > >>
>> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in
>> >> > >> some cases it actually beats them quite significantly. It seems to
>> >> > >> depend on how expensive the comparison is - for "int" the comparison is
>> >> > >> very cheap and there's almost no difference. For "text" the comparison
>> >> > >> is much more expensive, and there are significant speedups.
>> >> > >>
>> >> > >> For example the test with 100k lookups in array of 10k elements and 10%
>> >> > >> match probability, the timings are these
>> >> > >>
>> >> > >> master: 62362 ms
>> >> > >> binsearch: 201 ms
>> >> > >> bloom: 65 ms
>> >> > >> hashjoin: 36 ms
>> >> > >>
>> >> > >> I do think the explanation is fairly simple - the bloom filter
>> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> >> > >> some overhead to build and check the bits. The hash join probably
>> >> > >> eliminates a lot of the remaining comparisons, because the hash table
>> >> > >> is sized to have one tuple per bucket.
>> >> > >>
>> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter
>> >> > >> implementation possible. I'm sure it could be made faster.
>> >> > >>
>> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the
>> >> > >> effort - I agree that seems quite complex. But perhaps this suggest we
>> >> > >> should not be doing binary search and instead just build a simple hash
>> >> > >> table - that seems much simpler, and it'll probably give us about the
>> >> > >> same benefits.
>> >> > >
>> >> > >That's actually what I originally thought about doing, but I chose
>> >> > >binary search since it seemed a lot easier to get off the ground.
>> >> > >
>> >> >
>> >> > OK, that makes perfect sense.
>> >> >
>> >> > >If we instead build a hash is there anything else we need to be
>> >> > >concerned about? For example, work mem? I suppose for the binary
>> >> > >search we already have to expand the array, so perhaps it's not all
>> >> > >that meaningful relative to that...
>> >> > >
>> >> >
>> >> > I don't think we need to be particularly concerned about work_mem. We
>> >> > don't care about it now, and it's not clear to me what we could do about
>> >> > it - we already have the array in memory anyway, so it's a bit futile.
>> >> > Furthermore, if we need to care about it, it probably applies to the
>> >> > binary search too.
>> >> >
>> >> > >I was looking earlier at what our standard hash implementation was,
>> >> > >and it seemed less obvious what was needed to set that up (so binary
>> >> > >search seemed a faster proof of concept). If you happen to have any
>> >> > >pointers to similar usages I should look at, please let me know.
>> >> > >
>> >> >
>> >> > I think the hash join implementation is far too complicated. It has to
>> >> > care about work_mem, so it implements batching, etc. That's a lot of
>> >> > complexity we don't need here. IMO we could use either the usual
>> >> > dynahash, or maybe even the simpler simplehash.
>> >> >
>> >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
>> >> > benchmarks makes sense and running it independently. I'm not aware of
>> >> > any issues but it was done late at night and only ran on my laptop.
>> >>
>> >> Some quick calculations (don't have the scripting in a form I can
>> >> attach yet; using this as an opportunity to hack on a genericized
>> >> performance testing framework of sorts) suggest your results are
>> >> correct. I was also testing on my laptop, but I showed 1.) roughly
>> >> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
>> >> but when I switch to (short; average 3 characters long) text values I
>> >> show the hash join on VALUES is twice as fast as the binary search.
>> >>
>> >> Given that, I'm planning to implement this as a hash lookup and share
>> >> a revised patch.
>> >
>> >I've attached a patch series as before, but with an additional patch
>> >that switches to using dynahash instead of binary search.
>> >
>>
>> OK. I can't take closer look at the moment, I'll check later.
>>
>> Any particular reasons to pick dynahash over simplehash? ISTM we're
>> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
>> while there are not many places using dynahash for simple short-lived
>> hash tables. Of course, that alone is a weak reason to insist on using
>> simplehash here, but I suppose there were reasons for not using dynahash
>> and we'll end up facing the same issues here.
>
>No particular reason; it wasn't clear to me that there was a reason to
>prefer one or the other (and I'm not acquainted with the codebase
>enough to know the differences), so I chose dynahash because it was
>easier to find examples to follow for implementation.
>
OK, understood.
>> >Whereas before the benchmarking ended up with a trimodal distribution
>> >(i.e., master with IN <list>, patch with IN <list>, and either with IN
>> >VALUES), the hash implementation brings us back to an effectively
>> >bimodal distribution -- though the hash scalar array op expression
>> >implementation for text is about 5% faster than the hash join.
>> >
>>
>> Nice. I'm not surprised this is a bit faster than hash join, which has
>> to worry about additional stuff.
>>
>> >Current outstanding thoughts (besides comment/naming cleanup):
>> >
>> >- The saop costing needs to be updated to match, as Tomas pointed out.
>> >
>> >- Should we be concerned about single execution cases? For example, is
>> >the regression of speed on a simple SELECT x IN something we should
>> >try to defeat by only kicking in the optimization if we execute in a
>> >loop at least twice? That might be of particular interest to pl/pgsql.
>> >
>>
>> I don't follow. How is this related to the number of executions and/or
>> plpgsql? I suspect you might be talking about prepared statements, but
>> surely the hash table is built for each execution anyway, even if the
>> plan is reused, right?
>>
>> I think the issue we've been talking about is considering the number of
>> lookups we expect to do in the array/hash table. But that has nothing to
>> do with plpgsql and/or multiple executions ...
>
>Suppose I do "SELECT 1 IN <100 item list>" (whether just as a
>standalone query or in pl/pgsql). Then it doesn't make sense to use
>the optimization, because it can't possibly win over a naive linear
>scan through the array since to build the hash we have to do that
>linear scan anyway (I suppose in theory the hashing function could be
>dramatically faster than the equality op, so maybe it could win
>overall, but it seems unlikely to me).
True. I'm sure we could construct such cases, but this is hardly the
only place where that would be an issue. We could create some rough
costing model and make a decision based on that.
>I'm not so concerned about this in any query where we have a real FROM
>clause because even if we end up with only one row, the relative
>penalty is low, and the potential gain is very high. But simple
>expressions in pl/pgsql, for example, are a case where we can know for
>certain (correct me if I've wrong on this) that we'll only execute the
>expression once, which means there's probably always a penalty for
>choosing the implementation with setup costs over the default linear
>scan through the array.
>
What do you mean by "simple expressions"? I'm not plpgsql expert and I
see it mostly as a way to glue together SQL queries, but yeah - if we
know a given ScalarArrayOpExpr will only be executed once, then we can
disable this optimization for now.
a := a + 1
is translated to
SELECT $1 + 1, with the result saved into the variable a.
Queries like "SELECT $1 + 1" are simple expressions. They are evaluated directly at the executor level - they skip SPI.
A simple expression has no FROM clause and has to return just one row. I am not sure whether it is also required to return just one column.
I am not sure whether the executor knows that an expression is being executed as a simple expression, but it can probably be deduced from context.
Pavel
>> >- Should we have a test for an operator with a non-strict function?
>> >I'm not aware of any built-in ops that have that characteristic;
>> >would you suggest just creating a fake one for the test?
>> >
>>
>> Dunno, I haven't thought about this very much. In general I think the
>> new code should simply behave the same as the current code, i.e. if
>> it does not check for strictness, we don't need either.
>
>Both the original implementation and this optimized version have the
>following short circuit condition:
>
>if (fcinfo->args[0].isnull && strictfunc)
>
>But I'm not sure there are any existing tests to show that the "&&
>strictfunc" is required.
>
Ah! You mean a regression test, not a test in the "if condition" sense.
I don't see a reason not to have such test, although it's probably not
something we should require from this patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote: >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> >napsal: > >> ... >> >> >I'm not so concerned about this in any query where we have a real FROM >> >clause because even if we end up with only one row, the relative >> >penalty is low, and the potential gain is very high. But simple >> >expressions in pl/pgsql, for example, are a case where we can know for >> >certain (correct me if I've wrong on this) that we'll only execute the >> >expression once, which means there's probably always a penalty for >> >choosing the implementation with setup costs over the default linear >> >scan through the array. >> > >> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I >> see it mostly as a way to glue together SQL queries, but yeah - if we >> know a given ScalarArrayOpExpr will only be executed once, then we can >> disable this optimization for now. >> > >a := a + 1 > >is translated to > >SELECT $1 + 1 and save result to var a > >The queries like this "SELECT $1 + 1" are simple expressions. They are >evaluated just on executor level - it skip SPI > >the simple expression has not FROM clause, and have to return just one row. >I am not sure if it is required, it has to return just one column. > >I am not sure if executor knows so expression is executed as simply >expressions. But probably it can be deduced from context > Not sure. The executor state is created by exec_eval_simple_expr, which calls ExecInitExprWithParams (and it's the only caller). And that in turn is the only place that leaves (state->parent == NULL). So maybe that's a way to identify simple (standalone) expressions? Otherwise we might add a new EEO_FLAG_* to identify these expressions explicitly. I wonder if it would be possible to identify cases when the expression is executed in a loop, e.g. like this: FOR i IN 1..1000 LOOP x := y IN (1, 2, ..., 999); END LOOP; in which case we only build the ScalarArrayOpExpr once, so maybe we could keep the hash table for all executions. But maybe that's not possible or maybe it's pointless for other reasons. It sure looks a bit like trying to build a query engine from FOR loop. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
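A minimal sketch of the standalone-expression check described above; the wrapping function and the threshold constant are hypothetical, while ExprState->parent and the exec_eval_simple_expr() / ExecInitExprWithParams() relationship are as discussed here:

    /*
     * Hypothetical helper: decide whether building a hash table for the
     * array is likely to pay off.  Per the discussion above, a NULL
     * parent currently only happens for expressions initialized via
     * ExecInitExprWithParams(), i.e. plpgsql simple expressions that are
     * evaluated outside any plan tree.
     */
    static bool
    saop_should_build_hash(ExprState *state, int nelements)
    {
        if (state->parent == NULL)
            return false;       /* standalone expression, likely run once */

        return nelements >= SAOP_HASH_MIN_ELEMENTS;     /* hypothetical threshold */
    }

An explicit EEO_FLAG_* bit, as suggested above, would express the same thing without relying on the parent pointer being NULL as a side effect.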
út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>napsal:
>
>> ...
>>
>> >I'm not so concerned about this in any query where we have a real FROM
>> >clause because even if we end up with only one row, the relative
>> >penalty is low, and the potential gain is very high. But simple
>> >expressions in pl/pgsql, for example, are a case where we can know for
>> >certain (correct me if I've wrong on this) that we'll only execute the
>> >expression once, which means there's probably always a penalty for
>> >choosing the implementation with setup costs over the default linear
>> >scan through the array.
>> >
>>
>> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> see it mostly as a way to glue together SQL queries, but yeah - if we
>> know a given ScalarArrayOpExpr will only be executed once, then we can
>> disable this optimization for now.
>>
>
>a := a + 1
>
>is translated to
>
>SELECT $1 + 1 and save result to var a
>
>The queries like this "SELECT $1 + 1" are simple expressions. They are
>evaluated just on executor level - it skip SPI
>
>the simple expression has not FROM clause, and have to return just one row.
>I am not sure if it is required, it has to return just one column.
>
>I am not sure if executor knows so expression is executed as simply
>expressions. But probably it can be deduced from context
>
Not sure. The executor state is created by exec_eval_simple_expr, which
calls ExecInitExprWithParams (and it's the only caller). And that in
turn is the only place that leaves (state->parent == NULL). So maybe
that's a way to identify simple (standalone) expressions? Otherwise we
might add a new EEO_FLAG_* to identify these expressions explicitly.
I wonder if it would be possible to identify cases when the expression
is executed in a loop, e.g. like this:
FOR i IN 1..1000 LOOP
x := y IN (1, 2, ..., 999);
END LOOP;
in which case we only build the ScalarArrayOpExpr once, so maybe we
could keep the hash table for all executions. But maybe that's not
possible or maybe it's pointless for other reasons. It sure looks a bit
like trying to build a query engine from FOR loop.
Theoretically it is possible, but not today. I don't think it is necessary, though - I cannot remember seeing this pattern in any plpgsql code, and I have never seen a request for this feature.
I don't think this is a task for the plpgsql engine. Maybe for the JIT someday.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
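Whichever way simple expressions end up being detected, the "build once, probe many times" behaviour under discussion amounts to creating the element table lazily on the first evaluation of the step and reusing it afterwards; a rough sketch, with the per-step cache field and the helper functions invented for illustration:

    /* Build the element table the first time this step is evaluated. */
    if (op->d.hashedscalararrayop.elements_tab == NULL)         /* hypothetical field */
        op->d.hashedscalararrayop.elements_tab =
            saop_build_element_table(op, econtext);             /* hypothetical helper */

    found = saop_element_table_contains(op->d.hashedscalararrayop.elements_tab,
                                        scalar_value);          /* hypothetical helper */

Within a single execution this already amortizes the setup cost across all rows; keeping the table alive across separate executions of a plpgsql loop would additionally require the expression state itself to be reused, which is what the FOR-loop question above is really about.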
On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule <pavel.stehule@gmail.com> wrote: > > > > út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal: >> >> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote: >> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> >> >napsal: >> > >> >> ... >> >> >> >> >I'm not so concerned about this in any query where we have a real FROM >> >> >clause because even if we end up with only one row, the relative >> >> >penalty is low, and the potential gain is very high. But simple >> >> >expressions in pl/pgsql, for example, are a case where we can know for >> >> >certain (correct me if I've wrong on this) that we'll only execute the >> >> >expression once, which means there's probably always a penalty for >> >> >choosing the implementation with setup costs over the default linear >> >> >scan through the array. >> >> > >> >> >> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I >> >> see it mostly as a way to glue together SQL queries, but yeah - if we >> >> know a given ScalarArrayOpExpr will only be executed once, then we can >> >> disable this optimization for now. >> >> >> > >> >a := a + 1 >> > >> >is translated to >> > >> >SELECT $1 + 1 and save result to var a >> > >> >The queries like this "SELECT $1 + 1" are simple expressions. They are >> >evaluated just on executor level - it skip SPI >> > >> >the simple expression has not FROM clause, and have to return just one row. >> >I am not sure if it is required, it has to return just one column. Yes, this is what I meant by simple expressions. >> >I am not sure if executor knows so expression is executed as simply >> >expressions. But probably it can be deduced from context >> > >> >> Not sure. The executor state is created by exec_eval_simple_expr, which >> calls ExecInitExprWithParams (and it's the only caller). And that in >> turn is the only place that leaves (state->parent == NULL). So maybe >> that's a way to identify simple (standalone) expressions? Otherwise we >> might add a new EEO_FLAG_* to identify these expressions explicitly. I'll look into doing one of these. >> I wonder if it would be possible to identify cases when the expression >> is executed in a loop, e.g. like this: >> >> FOR i IN 1..1000 LOOP >> x := y IN (1, 2, ..., 999); >> END LOOP; >> >> in which case we only build the ScalarArrayOpExpr once, so maybe we >> could keep the hash table for all executions. But maybe that's not >> possible or maybe it's pointless for other reasons. It sure looks a bit >> like trying to build a query engine from FOR loop. > > > Theoretically it is possible, not now. But I don't think so it is necessary. I cannot to remember this pattern in any plpgsqlcode and I never seen any request on this feature. > > I don't think so this is task for plpgsql engine. Maybe for JIT sometimes. Agreed. I'd thought about this kind of scenario when I brought this up, but I think solving it would the responsibility of the pg/pgsql compiler rather than the expression evaluation code, because it'd have to recognize the situation and setup a shared expression evaluation context to be reused each time through the loop. James
út 28. 4. 2020 v 18:17 odesílatel James Coleman <jtc331@gmail.com> napsal:
On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>
>
> út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
>>
>> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> >napsal:
>> >
>> >> ...
>> >>
>> >> >I'm not so concerned about this in any query where we have a real FROM
>> >> >clause because even if we end up with only one row, the relative
>> >> >penalty is low, and the potential gain is very high. But simple
>> >> >expressions in pl/pgsql, for example, are a case where we can know for
>> >> >certain (correct me if I've wrong on this) that we'll only execute the
>> >> >expression once, which means there's probably always a penalty for
>> >> >choosing the implementation with setup costs over the default linear
>> >> >scan through the array.
>> >> >
>> >>
>> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> >> see it mostly as a way to glue together SQL queries, but yeah - if we
>> >> know a given ScalarArrayOpExpr will only be executed once, then we can
>> >> disable this optimization for now.
>> >>
>> >
>> >a := a + 1
>> >
>> >is translated to
>> >
>> >SELECT $1 + 1 and save result to var a
>> >
>> >The queries like this "SELECT $1 + 1" are simple expressions. They are
>> >evaluated just on executor level - it skip SPI
>> >
>> >the simple expression has not FROM clause, and have to return just one row.
>> >I am not sure if it is required, it has to return just one column.
Yes, this is what I meant by simple expressions.
>> >I am not sure if executor knows so expression is executed as simply
>> >expressions. But probably it can be deduced from context
>> >
>>
>> Not sure. The executor state is created by exec_eval_simple_expr, which
>> calls ExecInitExprWithParams (and it's the only caller). And that in
>> turn is the only place that leaves (state->parent == NULL). So maybe
>> that's a way to identify simple (standalone) expressions? Otherwise we
>> might add a new EEO_FLAG_* to identify these expressions explicitly.
I'll look into doing one of these.
>> I wonder if it would be possible to identify cases when the expression
>> is executed in a loop, e.g. like this:
>>
>> FOR i IN 1..1000 LOOP
>> x := y IN (1, 2, ..., 999);
>> END LOOP;
>>
>> in which case we only build the ScalarArrayOpExpr once, so maybe we
>> could keep the hash table for all executions. But maybe that's not
>> possible or maybe it's pointless for other reasons. It sure looks a bit
>> like trying to build a query engine from FOR loop.
>
>
> Theoretically it is possible, not now. But I don't think so it is necessary. I cannot to remember this pattern in any plpgsql code and I never seen any request on this feature.
>
> I don't think so this is task for plpgsql engine. Maybe for JIT sometimes.
Agreed. I'd thought about this kind of scenario when I brought this
up, but I think solving it would be the responsibility of the pl/pgsql
compiler rather than the expression evaluation code, because it'd have
to recognize the situation and set up a shared expression evaluation
context to be reused each time through the loop.
It would be nice if the new implementation were not slower than the old one in any environment or context (including plpgsql expressions).
Regards
Pavel
James
On Tue, Apr 28, 2020 at 1:40 PM Pavel Stehule <pavel.stehule@gmail.com> wrote: > > > > út 28. 4. 2020 v 18:17 odesílatel James Coleman <jtc331@gmail.com> napsal: >> >> On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule <pavel.stehule@gmail.com> wrote: >> > >> > >> > >> > út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal: >> >> >> >> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote: >> >> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> >> >> >napsal: >> >> > >> >> >> ... >> >> >> >> >> >> >I'm not so concerned about this in any query where we have a real FROM >> >> >> >clause because even if we end up with only one row, the relative >> >> >> >penalty is low, and the potential gain is very high. But simple >> >> >> >expressions in pl/pgsql, for example, are a case where we can know for >> >> >> >certain (correct me if I've wrong on this) that we'll only execute the >> >> >> >expression once, which means there's probably always a penalty for >> >> >> >choosing the implementation with setup costs over the default linear >> >> >> >scan through the array. >> >> >> > >> >> >> >> >> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I >> >> >> see it mostly as a way to glue together SQL queries, but yeah - if we >> >> >> know a given ScalarArrayOpExpr will only be executed once, then we can >> >> >> disable this optimization for now. >> >> >> >> >> > >> >> >a := a + 1 >> >> > >> >> >is translated to >> >> > >> >> >SELECT $1 + 1 and save result to var a >> >> > >> >> >The queries like this "SELECT $1 + 1" are simple expressions. They are >> >> >evaluated just on executor level - it skip SPI >> >> > >> >> >the simple expression has not FROM clause, and have to return just one row. >> >> >I am not sure if it is required, it has to return just one column. >> >> Yes, this is what I meant by simple expressions. >> >> >> >I am not sure if executor knows so expression is executed as simply >> >> >expressions. But probably it can be deduced from context >> >> > >> >> >> >> Not sure. The executor state is created by exec_eval_simple_expr, which >> >> calls ExecInitExprWithParams (and it's the only caller). And that in >> >> turn is the only place that leaves (state->parent == NULL). So maybe >> >> that's a way to identify simple (standalone) expressions? Otherwise we >> >> might add a new EEO_FLAG_* to identify these expressions explicitly. >> >> I'll look into doing one of these. >> >> >> I wonder if it would be possible to identify cases when the expression >> >> is executed in a loop, e.g. like this: >> >> >> >> FOR i IN 1..1000 LOOP >> >> x := y IN (1, 2, ..., 999); >> >> END LOOP; >> >> >> >> in which case we only build the ScalarArrayOpExpr once, so maybe we >> >> could keep the hash table for all executions. But maybe that's not >> >> possible or maybe it's pointless for other reasons. It sure looks a bit >> >> like trying to build a query engine from FOR loop. >> > >> > >> > Theoretically it is possible, not now. But I don't think so it is necessary. I cannot to remember this pattern in anyplpgsql code and I never seen any request on this feature. >> > >> > I don't think so this is task for plpgsql engine. Maybe for JIT sometimes. >> >> Agreed. 
I'd thought about this kind of scenario when I brought this >> up, but I think solving it would the responsibility of the pg/pgsql >> compiler rather than the expression evaluation code, because it'd have >> to recognize the situation and setup a shared expression evaluation >> context to be reused each time through the loop. > > > can be nice if new implementation was not slower then older in all environments and context (including plpgsql expressions) Agreed, which is why I'm going to look into preventing using the new code path for simple expressions. James
I cc'd Andres given his commit introduced simplehash, so I figured he'd probably have a few pointers on when each one might be useful. On Tue, Apr 28, 2020 at 8:39 AM James Coleman <jtc331@gmail.com> wrote: ... > > Any particular reasons to pick dynahash over simplehash? ISTM we're > > using simplehash elsewhere in the executor (grouping, tidbitmap, ...), > > while there are not many places using dynahash for simple short-lived > > hash tables. Of course, that alone is a weak reason to insist on using > > simplehash here, but I suppose there were reasons for not using dynahash > > and we'll end up facing the same issues here. > > No particular reason; it wasn't clear to me that there was a reason to > prefer one or the other (and I'm not acquainted with the codebase > enough to know the differences), so I chose dynahash because it was > easier to find examples to follow for implementation. Do you have any thoughts on what the trade-offs/use-cases etc. are for dynahash versus simple hash? From reading the commit message in b30d3ea824c it seems like simple hash is faster and optimized for CPU cache benefits. The comments at the top of simplehash.h also discourage it's use in non performance/space sensitive uses, but there isn't anything I can see that explicitly tries to discuss when dynahash is useful, etc. Given the performance notes in that commit message, I thinking switching to simple hash is worth it. But I also wonder if there might be some value in a README or comments addition that would be a guide to what the various hash implementations are useful for. If there's interest, I could try to type something short up so that we have something to make the code base a bit more discoverable. James
On Tue, Apr 28, 2020 at 06:22:20PM -0400, James Coleman wrote: >I cc'd Andres given his commit introduced simplehash, so I figured >he'd probably have a few pointers on when each one might be useful. > >On Tue, Apr 28, 2020 at 8:39 AM James Coleman <jtc331@gmail.com> wrote: >... >> > Any particular reasons to pick dynahash over simplehash? ISTM we're >> > using simplehash elsewhere in the executor (grouping, tidbitmap, ...), >> > while there are not many places using dynahash for simple short-lived >> > hash tables. Of course, that alone is a weak reason to insist on using >> > simplehash here, but I suppose there were reasons for not using dynahash >> > and we'll end up facing the same issues here. >> >> No particular reason; it wasn't clear to me that there was a reason to >> prefer one or the other (and I'm not acquainted with the codebase >> enough to know the differences), so I chose dynahash because it was >> easier to find examples to follow for implementation. > >Do you have any thoughts on what the trade-offs/use-cases etc. are for >dynahash versus simple hash? > >From reading the commit message in b30d3ea824c it seems like simple >hash is faster and optimized for CPU cache benefits. The comments at >the top of simplehash.h also discourage it's use in non >performance/space sensitive uses, but there isn't anything I can see >that explicitly tries to discuss when dynahash is useful, etc. > >Given the performance notes in that commit message, I thinking >switching to simple hash is worth it. > I recall doing some benchmarks for that patch, but it's so long I don't really remember the details. But in general, I agree simplehash is a bit more efficient in terms of CPU / caching. I think the changes required to switch from dynahash to simplehash are fairly small, so I think the best thing we can do is just try do some measurement and then decide. >But I also wonder if there might be some value in a README or comments >addition that would be a guide to what the various hash >implementations are useful for. If there's interest, I could try to >type something short up so that we have something to make the code >base a bit more discoverable. > I wouldn't object to that. Although maybe we should simply add some basic recommendations to the comments in dynahash/simplehash. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-04-28 18:22:20 -0400, James Coleman wrote: > I cc'd Andres given his commit introduced simplehash, so I figured > he'd probably have a few pointers on when each one might be useful. > [...] > Do you have any thoughts on what the trade-offs/use-cases etc. are for > dynahash versus simple hash? > > From reading the commit message in b30d3ea824c it seems like simple > hash is faster and optimized for CPU cache benefits. The comments at > the top of simplehash.h also discourage it's use in non > performance/space sensitive uses, but there isn't anything I can see > that explicitly tries to discuss when dynahash is useful, etc. Benefits of dynahash (chained hashtable): - supports partitioning, useful for shared memory accessed under locks - better performance for large entries, as they don't need to be moved around in case of hash conflicts - stable pointers to hash table entries Benefits of simplehash (open addressing hash table): - no indirect function calls, known structure sizes, due to "templated" code generation (these show up substantially in profiles for dynahash) - considerably faster for small entries due to previous point, and due open addressing hash tables having better cache behaviour than chained hashtables - once set-up the interface is type safe and easier to use - no overhead of a separate memory context etc > Given the performance notes in that commit message, I thinking > switching to simple hash is worth it. Seems plausible to me. > But I also wonder if there might be some value in a README or comments > addition that would be a guide to what the various hash > implementations are useful for. If there's interest, I could try to > type something short up so that we have something to make the code > base a bit more discoverable. That'd make sense to me. Greetings, Andres Freund
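As a concrete reference point for the dynahash-to-simplehash switch, a simplehash table is instantiated by defining the SH_* template parameters and then including simplehash.h; the prefix, entry layout, and helper functions below are made up for illustration and are not taken from the patch:

    /*
     * Hypothetical entry type: the array element Datum plus the status
     * byte that simplehash.h requires in every entry.
     */
    typedef struct saophash_entry
    {
        Datum   key;
        char    status;
    } saophash_entry;

    #define SH_PREFIX saophash
    #define SH_ELEMENT_TYPE saophash_entry
    #define SH_KEY_TYPE Datum
    #define SH_KEY key
    #define SH_HASH_KEY(tb, key) saop_hash_datum(tb, key)   /* hypothetical helper */
    #define SH_EQUAL(tb, a, b) saop_datum_eq(tb, a, b)      /* hypothetical helper */
    #define SH_SCOPE static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

This generates type-safe saophash_create(), saophash_insert(), and saophash_lookup() functions specialized for the entry type, which is the "templated code generation" benefit mentioned above.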
On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-04-28 18:22:20 -0400, James Coleman wrote: > > I cc'd Andres given his commit introduced simplehash, so I figured > > he'd probably have a few pointers on when each one might be useful. > > [...] > > Do you have any thoughts on what the trade-offs/use-cases etc. are for > > dynahash versus simple hash? > > > > From reading the commit message in b30d3ea824c it seems like simple > > hash is faster and optimized for CPU cache benefits. The comments at > > the top of simplehash.h also discourage it's use in non > > performance/space sensitive uses, but there isn't anything I can see > > that explicitly tries to discuss when dynahash is useful, etc. > > Benefits of dynahash (chained hashtable): > - supports partitioning, useful for shared memory accessed under locks > - better performance for large entries, as they don't need to be moved > around in case of hash conflicts > - stable pointers to hash table entries > > Benefits of simplehash (open addressing hash table): > - no indirect function calls, known structure sizes, due to "templated" > code generation (these show up substantially in profiles for dynahash) > - considerably faster for small entries due to previous point, and due > open addressing hash tables having better cache behaviour than chained > hashtables > - once set-up the interface is type safe and easier to use > - no overhead of a separate memory context etc > > > > Given the performance notes in that commit message, I thinking > > switching to simple hash is worth it. > > Seems plausible to me. > > > > But I also wonder if there might be some value in a README or comments > > addition that would be a guide to what the various hash > > implementations are useful for. If there's interest, I could try to > > type something short up so that we have something to make the code > > base a bit more discoverable. > > That'd make sense to me. Cool, I'll work on that as I have time then. One question: what is the reasoning behind having SH_STORE_HASH? The only things I could imagine would be a case where you have external pointers to some set of values or need to be able to use the hash for other reasons besides the hash table (and so can avoid calculating it twice), but maybe I'm missing something. Thanks, James
On Wed, Apr 29, 2020 at 10:26:12AM -0400, James Coleman wrote: >On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote: >> >> Hi, >> >> On 2020-04-28 18:22:20 -0400, James Coleman wrote: >> > I cc'd Andres given his commit introduced simplehash, so I figured >> > he'd probably have a few pointers on when each one might be useful. >> > [...] >> > Do you have any thoughts on what the trade-offs/use-cases etc. are for >> > dynahash versus simple hash? >> > >> > From reading the commit message in b30d3ea824c it seems like simple >> > hash is faster and optimized for CPU cache benefits. The comments at >> > the top of simplehash.h also discourage it's use in non >> > performance/space sensitive uses, but there isn't anything I can see >> > that explicitly tries to discuss when dynahash is useful, etc. >> >> Benefits of dynahash (chained hashtable): >> - supports partitioning, useful for shared memory accessed under locks >> - better performance for large entries, as they don't need to be moved >> around in case of hash conflicts >> - stable pointers to hash table entries >> >> Benefits of simplehash (open addressing hash table): >> - no indirect function calls, known structure sizes, due to "templated" >> code generation (these show up substantially in profiles for dynahash) >> - considerably faster for small entries due to previous point, and due >> open addressing hash tables having better cache behaviour than chained >> hashtables >> - once set-up the interface is type safe and easier to use >> - no overhead of a separate memory context etc >> >> >> > Given the performance notes in that commit message, I thinking >> > switching to simple hash is worth it. >> >> Seems plausible to me. >> >> >> > But I also wonder if there might be some value in a README or comments >> > addition that would be a guide to what the various hash >> > implementations are useful for. If there's interest, I could try to >> > type something short up so that we have something to make the code >> > base a bit more discoverable. >> >> That'd make sense to me. > >Cool, I'll work on that as I have time then. > >One question: what is the reasoning behind having SH_STORE_HASH? The >only things I could imagine would be a case where you have external >pointers to some set of values or need to be able to use the hash for >other reasons besides the hash table (and so can avoid calculating it >twice), but maybe I'm missing something. > I believe it's because computing the hash may be fairly expensive for some data types, in which case it may be better to just store it for future use. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 29, 2020 at 11:17 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > On Wed, Apr 29, 2020 at 10:26:12AM -0400, James Coleman wrote: > >On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote: > >> > >> Hi, > >> > >> On 2020-04-28 18:22:20 -0400, James Coleman wrote: > >> > I cc'd Andres given his commit introduced simplehash, so I figured > >> > he'd probably have a few pointers on when each one might be useful. > >> > [...] > >> > Do you have any thoughts on what the trade-offs/use-cases etc. are for > >> > dynahash versus simple hash? > >> > > >> > From reading the commit message in b30d3ea824c it seems like simple > >> > hash is faster and optimized for CPU cache benefits. The comments at > >> > the top of simplehash.h also discourage it's use in non > >> > performance/space sensitive uses, but there isn't anything I can see > >> > that explicitly tries to discuss when dynahash is useful, etc. > >> > >> Benefits of dynahash (chained hashtable): > >> - supports partitioning, useful for shared memory accessed under locks > >> - better performance for large entries, as they don't need to be moved > >> around in case of hash conflicts > >> - stable pointers to hash table entries > >> > >> Benefits of simplehash (open addressing hash table): > >> - no indirect function calls, known structure sizes, due to "templated" > >> code generation (these show up substantially in profiles for dynahash) > >> - considerably faster for small entries due to previous point, and due > >> open addressing hash tables having better cache behaviour than chained > >> hashtables > >> - once set-up the interface is type safe and easier to use > >> - no overhead of a separate memory context etc > >> > >> > >> > Given the performance notes in that commit message, I thinking > >> > switching to simple hash is worth it. > >> > >> Seems plausible to me. > >> > >> > >> > But I also wonder if there might be some value in a README or comments > >> > addition that would be a guide to what the various hash > >> > implementations are useful for. If there's interest, I could try to > >> > type something short up so that we have something to make the code > >> > base a bit more discoverable. > >> > >> That'd make sense to me. > > > >Cool, I'll work on that as I have time then. > > > >One question: what is the reasoning behind having SH_STORE_HASH? The > >only things I could imagine would be a case where you have external > >pointers to some set of values or need to be able to use the hash for > >other reasons besides the hash table (and so can avoid calculating it > >twice), but maybe I'm missing something. > > > > I believe it's because computing the hash may be fairly expensive for > some data types, in which case it may be better to just store it for > future use. But is it storing it for use primarily by the hash table implementation (i.e., does it need the hash stored this way to avoid repeated recalculation) or for caller's use? James
On Wed, Apr 29, 2020 at 11:34:24AM -0400, James Coleman wrote: >On Wed, Apr 29, 2020 at 11:17 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> >> On Wed, Apr 29, 2020 at 10:26:12AM -0400, James Coleman wrote: >> >On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote: >> >> >> >> Hi, >> >> >> >> On 2020-04-28 18:22:20 -0400, James Coleman wrote: >> >> > I cc'd Andres given his commit introduced simplehash, so I figured >> >> > he'd probably have a few pointers on when each one might be useful. >> >> > [...] >> >> > Do you have any thoughts on what the trade-offs/use-cases etc. are for >> >> > dynahash versus simple hash? >> >> > >> >> > From reading the commit message in b30d3ea824c it seems like simple >> >> > hash is faster and optimized for CPU cache benefits. The comments at >> >> > the top of simplehash.h also discourage it's use in non >> >> > performance/space sensitive uses, but there isn't anything I can see >> >> > that explicitly tries to discuss when dynahash is useful, etc. >> >> >> >> Benefits of dynahash (chained hashtable): >> >> - supports partitioning, useful for shared memory accessed under locks >> >> - better performance for large entries, as they don't need to be moved >> >> around in case of hash conflicts >> >> - stable pointers to hash table entries >> >> >> >> Benefits of simplehash (open addressing hash table): >> >> - no indirect function calls, known structure sizes, due to "templated" >> >> code generation (these show up substantially in profiles for dynahash) >> >> - considerably faster for small entries due to previous point, and due >> >> open addressing hash tables having better cache behaviour than chained >> >> hashtables >> >> - once set-up the interface is type safe and easier to use >> >> - no overhead of a separate memory context etc >> >> >> >> >> >> > Given the performance notes in that commit message, I thinking >> >> > switching to simple hash is worth it. >> >> >> >> Seems plausible to me. >> >> >> >> >> >> > But I also wonder if there might be some value in a README or comments >> >> > addition that would be a guide to what the various hash >> >> > implementations are useful for. If there's interest, I could try to >> >> > type something short up so that we have something to make the code >> >> > base a bit more discoverable. >> >> >> >> That'd make sense to me. >> > >> >Cool, I'll work on that as I have time then. >> > >> >One question: what is the reasoning behind having SH_STORE_HASH? The >> >only things I could imagine would be a case where you have external >> >pointers to some set of values or need to be able to use the hash for >> >other reasons besides the hash table (and so can avoid calculating it >> >twice), but maybe I'm missing something. >> > >> >> I believe it's because computing the hash may be fairly expensive for >> some data types, in which case it may be better to just store it for >> future use. > >But is it storing it for use primarily by the hash table >implementation (i.e., does it need the hash stored this way to avoid >repeated recalculation) or for caller's use? > For the hash table implementation, I think. Simplehash is using "robin hood" hashing, which needs to decide which entry to move in case of collision. AFAIC that requires the knowledge of hash value for both the new and existing entry, and having to compute that over and over would be fairly expensive. But that's my understanding, I might be wrong and it's useful for external use too. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
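For the record (and perhaps for the README idea above), caching the hash in each entry is opt-in via two further template parameters; the accessor below assumes the entry struct has been given a uint32 hash member to hold the cached value:

    /* Ask simplehash to store each entry's hash and tell it where to find it. */
    #define SH_STORE_HASH
    #define SH_GET_HASH(tb, a) ((a)->hash)

With those defined, growing the table and resolving collisions can reuse the stored hash instead of recomputing it from the key, which matters when the hash function itself is expensive.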
On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: ... > Any particular reasons to pick dynahash over simplehash? ISTM we're > using simplehash elsewhere in the executor (grouping, tidbitmap, ...), > while there are not many places using dynahash for simple short-lived > hash tables. Of course, that alone is a weak reason to insist on using > simplehash here, but I suppose there were reasons for not using dynahash > and we'll end up facing the same issues here. I've attached a patch series that includes switching to simplehash. Obviously we'd really just collapse all of these patches, but it's perhaps interesting to see the changes required to use each style (binary search, dynahash, simplehash). As before, there are clearly comments and naming things to be addressed, but the implementation should be reasonably clean. James
Attachment
On 01/05/2020 05:20, James Coleman wrote: > On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > ... >> Any particular reasons to pick dynahash over simplehash? ISTM we're >> using simplehash elsewhere in the executor (grouping, tidbitmap, ...), >> while there are not many places using dynahash for simple short-lived >> hash tables. Of course, that alone is a weak reason to insist on using >> simplehash here, but I suppose there were reasons for not using dynahash >> and we'll end up facing the same issues here. > > I've attached a patch series that includes switching to simplehash. > Obviously we'd really just collapse all of these patches, but it's > perhaps interesting to see the changes required to use each style > (binary search, dynahash, simplehash). > > As before, there are clearly comments and naming things to be > addressed, but the implementation should be reasonably clean. Looks good, aside from the cleanup work that you mentioned. There are a few more cases that I think you could easily handle with very little extra code: You could also apply the optimization for non-Const expressions, as long as they're stable (i.e. !contain_volatile_functions(expr)). Deconstructing the array Datum into a simple C array on first call would be a win even for very small arrays and for AND semantics, even if you don't use a hash table. - Heikki
On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 01/05/2020 05:20, James Coleman wrote: > > On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > ... > >> Any particular reasons to pick dynahash over simplehash? ISTM we're > >> using simplehash elsewhere in the executor (grouping, tidbitmap, ...), > >> while there are not many places using dynahash for simple short-lived > >> hash tables. Of course, that alone is a weak reason to insist on using > >> simplehash here, but I suppose there were reasons for not using dynahash > >> and we'll end up facing the same issues here. > > > > I've attached a patch series that includes switching to simplehash. > > Obviously we'd really just collapse all of these patches, but it's > > perhaps interesting to see the changes required to use each style > > (binary search, dynahash, simplehash). > > > > As before, there are clearly comments and naming things to be > > addressed, but the implementation should be reasonably clean. > > Looks good, aside from the cleanup work that you mentioned. There are a > few more cases that I think you could easily handle with very little > extra code: > > You could also apply the optimization for non-Const expressions, as long > as they're stable (i.e. !contain_volatile_functions(expr)). Is that true? Don't we also have to worry about whether or not the value is stable (i.e., know when a param has changed)? There have been discussions about being able to cache stable subexpressions, and my understanding was that we'd need to have that infrastructure (along with the necessarily invalidation when the param changes) to be able to use this for non-Const expressions. > Deconstructing the array Datum into a simple C array on first call would > be a win even for very small arrays and for AND semantics, even if you > don't use a hash table. Because you wouldn't have to repeatedly detoast it? Or some other reason I'm not thinking of? My intuition would have been that (aside from detoasting if necessary) there wouldn't be much real overhead in, for example, an array storing integers. James
On 08/09/2020 22:25, James Coleman wrote:
> On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> You could also apply the optimization for non-Const expressions, as long
>> as they're stable (i.e. !contain_volatile_functions(expr)).
>
> Is that true? Don't we also have to worry about whether or not the
> value is stable (i.e., know when a param has changed)? There have been
> discussions about being able to cache stable subexpressions, and my
> understanding was that we'd need to have that infrastructure (along
> with the necessarily invalidation when the param changes) to be able
> to use this for non-Const expressions.

Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
Vars. I think the conditions are the same as those checked in
match_clause_to_partition_key() in partprune.c (it's a long function,
search for "if (!IsA(rightop, Const))"). Not sure it's worth the
trouble, then. But it would be nice to handle queries like "WHERE column
= ANY ($1)"

>> Deconstructing the array Datum into a simple C array on first call would
>> be a win even for very small arrays and for AND semantics, even if you
>> don't use a hash table.
>
> Because you wouldn't have to repeatedly detoast it? Or some other
> reason I'm not thinking of? My intuition would have been that (aside
> from detoasting if necessary) there wouldn't be much real overhead in,
> for example, an array storing integers.

Dealing with NULLs and different element sizes in the array is pretty
complicated. Looping through the array currently looks like this:

/* Loop over the array elements */
s = (char *) ARR_DATA_PTR(arr);
bitmap = ARR_NULLBITMAP(arr);
bitmask = 1;

for (int i = 0; i < nitems; i++)
{
    Datum       elt;
    Datum       thisresult;

    /* Get array element, checking for NULL */
    if (bitmap && (*bitmap & bitmask) == 0)
    {
        fcinfo->args[1].value = (Datum) 0;
        fcinfo->args[1].isnull = true;
    }
    else
    {
        elt = fetch_att(s, typbyval, typlen);
        s = att_addlength_pointer(s, typlen, s);
        s = (char *) att_align_nominal(s, typalign);
        fcinfo->args[1].value = elt;
        fcinfo->args[1].isnull = false;
    }

    [do stuff with Datum/isnull]

    /* advance bitmap pointer if any */
    if (bitmap)
    {
        bitmask <<= 1;
        if (bitmask == 0x100)
        {
            bitmap++;
            bitmask = 1;
        }
    }
}

Compared with just:

for (int i = 0; i < nitems; i++)
{
    Datum       elt = datums[i];

    [do stuff with the Datum]
}

I'm not sure how much difference that makes, but I presume it's not
zero, and it seems like an easy win when you have the code to deal with
the Datum array representation anyway.

- Heikki
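For comparison, the "deconstruct once" approach can lean on the existing deconstruct_array() helper, at the cost of allocating two parallel arrays up front; this sketch assumes typlen/typbyval/typalign have already been looked up for the element type, as in the loop above:

    Datum      *elem_values;
    bool       *elem_nulls;
    int         nelems;

    /* One-time pass: expand the array into parallel Datum/isnull arrays. */
    deconstruct_array(arr, ARR_ELEMTYPE(arr),
                      typlen, typbyval, typalign,
                      &elem_values, &elem_nulls, &nelems);

    for (int i = 0; i < nelems; i++)
    {
        fcinfo->args[1].value = elem_values[i];
        fcinfo->args[1].isnull = elem_nulls[i];

        /* [do stuff with Datum/isnull, as in the loop above] */
    }

All of the attlen/alignment bookkeeping then happens once in the expansion rather than once per element per evaluation, which is essentially what the simpler loop above assumes.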
On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 08/09/2020 22:25, James Coleman wrote: > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > >> > >> You could also apply the optimization for non-Const expressions, as long > >> as they're stable (i.e. !contain_volatile_functions(expr)). > > > > Is that true? Don't we also have to worry about whether or not the > > value is stable (i.e., know when a param has changed)? There have been > > discussions about being able to cache stable subexpressions, and my > > understanding was that we'd need to have that infrastructure (along > > with the necessarily invalidation when the param changes) to be able > > to use this for non-Const expressions. > > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And > Vars. I think the conditions are the same as those checked in > match_clause_to_partition_key() in partprune.c (it's a long function, > search for "if (!IsA(rightop, Const))"). Not sure it's worth the > trouble, then. But it would be nice to handle queries like "WHERE column > = ANY ($1)" If I'm understanding properly you're imagining something in the form of: with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[]) select * from t where i = any ((select * from x)::int[]); I've been playing around with this and currently have these checks: contain_var_clause((Node *) arrayarg) contain_volatile_functions((Node *) arrayarg) contain_exec_param((Node *) arrayarg, NIL) (the last one I had to modify the function to handle empty lists) If any are true, then have to disable the optimization. But for queries in the form above the test contain_exec_param((Node *) arrayarg, NIL) evaluates to true, even though we know from looking at the query that the array subexpression is stable for the length of the query. Am I misunderstanding what you're going for? Or is there a way to confirm the expr, although an exec param, won't change? Another interesting thing that this would imply is that we'd either have to: 1. Remove the array length check altogether, 2. Always use the hash when have a non-Const, but when a Const only if the array length check passes, or 3. Make this new expr op more fully featured by teaching it how to use either a straight loop through a deconstructed array or use the hash. That last option feeds into further discussion in the below: > >> Deconstructing the array Datum into a simple C array on first call would > >> be a win even for very small arrays and for AND semantics, even if you > >> don't use a hash table. > > > > Because you wouldn't have to repeatedly detoast it? Or some other > > reason I'm not thinking of? My intuition would have been that (aside > > from detoasting if necessary) there wouldn't be much real overhead in, > > for example, an array storing integers. > > Dealing with NULLs and different element sizes in the array is pretty > complicated. 
Looping through the array currently looks like this: > > /* Loop over the array elements */ > s = (char *) ARR_DATA_PTR(arr); > bitmap = ARR_NULLBITMAP(arr); > bitmask = 1; > > for (int i = 0; i < nitems; i++) > { > Datum elt; > Datum thisresult; > > /* Get array element, checking for NULL */ > if (bitmap && (*bitmap & bitmask) == 0) > { > fcinfo->args[1].value = (Datum) 0; > fcinfo->args[1].isnull = true; > } > else > { > elt = fetch_att(s, typbyval, typlen); > s = att_addlength_pointer(s, typlen, s); > s = (char *) att_align_nominal(s, typalign); > fcinfo->args[1].value = elt; > fcinfo->args[1].isnull = false; > } > > [do stuff with Datum/isnull] > > /* advance bitmap pointer if any */ > if (bitmap) > { > bitmask <<= 1; > if (bitmask == 0x100) > { > bitmap++; > bitmask = 1; > } > } > } > > Compared with just: > > for (int i = 0; i < nitems; i++) > { > Datum elt = datums[i]; > > [do stuff with the Datum] > } > > I'm not sure how much difference that makes, but I presume it's not > zero, and it seems like an easy win when you have the code to deal with > the Datum array representation anyway. Doing this would necessitate option 3 above: we'd have to have this new expr op be able both to use a hash or alternatively do a normal loop. Being able to use this in more cases than just a Const array expr is certainly interesting, but I'm not sure yet about the feasibility or desirability of that at this point given the above restrictions. One other point in favor of the additional complexity here is that it's likely that the above described runtime switching between hash and loop would be necessary (for this optimization to come into play) if caching of stable subexpressions ever lands. I have some interest in working on that...but it's also a large project. James
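Restating the three checks from the message above as a single gate (contain_exec_param() here is the modified version that accepts an empty list, as described; the variable names are illustrative):

    /*
     * The hashed path is only safe when the array argument cannot change
     * between evaluations of this expression.
     */
    bool        array_arg_is_stable =
        !contain_var_clause((Node *) arrayarg) &&
        !contain_volatile_functions((Node *) arrayarg) &&
        !contain_exec_param((Node *) arrayarg, NIL);

    if (!array_arg_is_stable)
    {
        /* fall back to the existing linear scan through the array */
    }

As noted above, this still rejects cases like "i = any ((select * from x)::int[])", because the subplan output arrives as an exec Param even though it is in fact stable for the life of the query.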
On Fri, Sep 11, 2020 at 5:11 PM James Coleman <jtc331@gmail.com> wrote: > > On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > On 08/09/2020 22:25, James Coleman wrote: > > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > >> > > >> You could also apply the optimization for non-Const expressions, as long > > >> as they're stable (i.e. !contain_volatile_functions(expr)). > > > > > > Is that true? Don't we also have to worry about whether or not the > > > value is stable (i.e., know when a param has changed)? There have been > > > discussions about being able to cache stable subexpressions, and my > > > understanding was that we'd need to have that infrastructure (along > > > with the necessarily invalidation when the param changes) to be able > > > to use this for non-Const expressions. > > > > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And > > Vars. I think the conditions are the same as those checked in > > match_clause_to_partition_key() in partprune.c (it's a long function, > > search for "if (!IsA(rightop, Const))"). Not sure it's worth the > > trouble, then. But it would be nice to handle queries like "WHERE column > > = ANY ($1)" > > If I'm understanding properly you're imagining something in the form of: > > with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[]) > select * from t where i = any ((select * from x)::int[]); > > I've been playing around with this and currently have these checks: > > contain_var_clause((Node *) arrayarg) > contain_volatile_functions((Node *) arrayarg) > contain_exec_param((Node *) arrayarg, NIL) > > (the last one I had to modify the function to handle empty lists) > > If any are true, then have to disable the optimization. But for > queries in the form above the test contain_exec_param((Node *) > arrayarg, NIL) evaluates to true, even though we know from looking at > the query that the array subexpression is stable for the length of the > query. > > Am I misunderstanding what you're going for? Or is there a way to > confirm the expr, although an exec param, won't change? > > Another interesting thing that this would imply is that we'd either have to: > > 1. Remove the array length check altogether, > 2. Always use the hash when have a non-Const, but when a Const only if > the array length check passes, or > 3. Make this new expr op more fully featured by teaching it how to use > either a straight loop through a deconstructed array or use the hash. > > That last option feeds into further discussion in the below: > > > >> Deconstructing the array Datum into a simple C array on first call would > > >> be a win even for very small arrays and for AND semantics, even if you > > >> don't use a hash table. > > > > > > Because you wouldn't have to repeatedly detoast it? Or some other > > > reason I'm not thinking of? My intuition would have been that (aside > > > from detoasting if necessary) there wouldn't be much real overhead in, > > > for example, an array storing integers. > > > > Dealing with NULLs and different element sizes in the array is pretty > > complicated. 
Looping through the array currently looks like this: > > > > /* Loop over the array elements */ > > s = (char *) ARR_DATA_PTR(arr); > > bitmap = ARR_NULLBITMAP(arr); > > bitmask = 1; > > > > for (int i = 0; i < nitems; i++) > > { > > Datum elt; > > Datum thisresult; > > > > /* Get array element, checking for NULL */ > > if (bitmap && (*bitmap & bitmask) == 0) > > { > > fcinfo->args[1].value = (Datum) 0; > > fcinfo->args[1].isnull = true; > > } > > else > > { > > elt = fetch_att(s, typbyval, typlen); > > s = att_addlength_pointer(s, typlen, s); > > s = (char *) att_align_nominal(s, typalign); > > fcinfo->args[1].value = elt; > > fcinfo->args[1].isnull = false; > > } > > > > [do stuff with Datum/isnull] > > > > /* advance bitmap pointer if any */ > > if (bitmap) > > { > > bitmask <<= 1; > > if (bitmask == 0x100) > > { > > bitmap++; > > bitmask = 1; > > } > > } > > } > > > > Compared with just: > > > > for (int i = 0; i < nitems; i++) > > { > > Datum elt = datums[i]; > > > > [do stuff with the Datum] > > } > > > > I'm not sure how much difference that makes, but I presume it's not > > zero, and it seems like an easy win when you have the code to deal with > > the Datum array representation anyway. > > Doing this would necessitate option 3 above: we'd have to have this > new expr op be able both to use a hash or alternatively do a normal > loop. > > Being able to use this in more cases than just a Const array expr is > certainly interesting, but I'm not sure yet about the feasibility or > desirability of that at this point given the above restrictions. > > One other point in favor of the additional complexity here is that > it's likely that the above described runtime switching between hash > and loop would be necessary (for this optimization to come into play) > if caching of stable subexpressions ever lands. I have some interest > in working on that...but it's also a large project. I've attached a cleaned up patch. Last CF it was listed in is https://commitfest.postgresql.org/29/2542/ -- what's the appropriate step to take here given it's an already existing patch, but not yet moved into recent CFs? Thanks, James
Attachment
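For concreteness, the two query shapes being contrasted in the preceding message can be sketched like this (the table name and the values are made up for the illustration; only the second shape needs the stable-subexpression machinery discussed above):

CREATE TABLE t (i int);

-- Const array argument: the full list of values is visible at plan time,
-- so it can be pre-sorted (or, later in this thread, hashed) up front.
SELECT * FROM t WHERE i = ANY ('{1,2,3,4,5,6,7,8,9,10}'::int[]);

-- Array supplied as a parameter: its contents are only known at execution
-- time, even though they are stable for the length of the query.
PREPARE q(int[]) AS SELECT * FROM t WHERE i = ANY ($1);
EXECUTE q('{1,2,3,4,5,6,7,8,9,10}');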
On Sat, 20 Mar 2021 at 09:41, James Coleman <jtc331@gmail.com> wrote: > I've attached a cleaned up patch. Last CF it was listed in is > https://commitfest.postgresql.org/29/2542/ -- what's the appropriate > step to take here given it's an already existing patch, but not yet > moved into recent CFs? I had a look at this patch. I like the idea of using a simplehash.h hash table to hash the constant values so that repeat lookups can be performed much more quickly, however, I'm a bit concerned that there are quite a few places in the code where we often just execute a ScalarArrayOpExpr once and I'm a bit worried that we'll slow down expression evaluation of those cases. The two cases that I have in mind are: 1. eval_const_expressions() where we use the executor to evaluate the ScalarArrayOpExpr to see if the result is Const. 2. CHECK constraints with IN clauses and single-row INSERTs. I tried to benchmark both of these but I'm struggling to get stable enough performance for #2, even with fsync=off. Sometimes I'm getting results 2.5x slower than other runs. For benchmarking #1 I'm also not too sure I'm getting stable enough results for them to mean anything. I was running: create table a (a int); bench.sql: explain select * from a where a in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16); drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres Master (6d41dd045): tps = 992586.991045 (without initial connection time) tps = 987964.990483 (without initial connection time) tps = 994309.670918 (without initial connection time) Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch tps = 956022.557626 (without initial connection time) tps = 963043.352823 (without initial connection time) tps = 968582.600100 (without initial connection time) This puts the patched version about 3% slower. I'm not sure how much of that is changes in the binary and noise and how much is the needless hashtable build done for eval_const_expressions(). I wondered if we should make it the query planner's job of deciding if the ScalarArrayOpExpr should be hashed or not. I ended up with the attached rough-cut patch that introduces HashedScalarArrayOpExpr and has the query planner decide if it's going to replace ScalarArrayOpExpr with these HashedScalarArrayOpExpr during preprocess_expression(). I do think that we might want to consider being a bit selective about when we do these replacements. It seems likely that we'd want to do this for EXPRKIND_QUAL and maybe EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time since those will just be executed once. I tried the same above test with the v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the attached rough-cut patch and got: master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch + v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch tps = 1167969.983173 (without initial connection time) tps = 1199636.793314 (without initial connection time) tps = 1190690.939963 (without initial connection time) I can't really explain why this became faster. I was expecting it just to reduce that slowdown of the v4 patch a little. I don't really see any reason why it would become faster. It's almost 20% faster which seems like too much to just be fluctuations in code alignment in the binary. The attached patch is still missing the required changes to llvmjit_expr.c. I think that was also missing from the original patch too, however. 
Also, I added HashedScalarArrayOpExpr to plannodes.h. All other Expr type nodes are in primnodes.h. However, I put HashedScalarArrayOpExpr in plannodes.h because the parser does not generate this and it's not going to be stored in the catalogue files anywhere. I'm not so sure inventing a new Expr type node that only can be generated by the planner is a good thing to do. Anyway, wondering what you think of the idea of allowing the planner to choose if it's going to hash or not? It might also be good if someone else can check if they can get a bit more stable performance results from benchmarking the patches. (Also attached your v4 patch again just so anyone following along at home does not need to hunt around for the correct set of patches to apply to test this) David
Attachment
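The second "execute once" case mentioned above (a CHECK constraint containing an IN clause, evaluated for a single-row INSERT) looks like this in SQL; the table and values are invented for the illustration:

CREATE TABLE orders (
    status text CHECK (status IN ('new', 'paid', 'shipped', 'cancelled'))
);

-- The IN list in the constraint is evaluated exactly once here, so any
-- up-front hash table build would be pure overhead.
INSERT INTO orders VALUES ('new');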
On Mon, Apr 5, 2021 at 11:58 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Sat, 20 Mar 2021 at 09:41, James Coleman <jtc331@gmail.com> wrote: > > I've attached a cleaned up patch. Last CF it was listed in is > > https://commitfest.postgresql.org/29/2542/ -- what's the appropriate > > step to take here given it's an already existing patch, but not yet > > moved into recent CFs? > > I had a look at this patch. I like the idea of using a simplehash.h > hash table to hash the constant values so that repeat lookups can be > performed much more quickly, however, I'm a bit concerned that there > are quite a few places in the code where we often just execute a > ScalarArrayOpExpr once and I'm a bit worried that we'll slow down > expression evaluation of those cases. > > The two cases that I have in mind are: > > 1. eval_const_expressions() where we use the executor to evaluate the > ScalarArrayOpExpr to see if the result is Const. > 2. CHECK constraints with IN clauses and single-row INSERTs. This is a good point I hadn't considered; now that you mention it, I think another case would be expression evaluation in pl/pgsql. > I tried to benchmark both of these but I'm struggling to get stable > enough performance for #2, even with fsync=off. Sometimes I'm getting > results 2.5x slower than other runs. > > For benchmarking #1 I'm also not too sure I'm getting stable enough > results for them to mean anything. > > I was running: > > create table a (a int); > > bench.sql: explain select * from a where a > in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16); > > drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres > Master (6d41dd045): > tps = 992586.991045 (without initial connection time) > tps = 987964.990483 (without initial connection time) > tps = 994309.670918 (without initial connection time) > > Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch > tps = 956022.557626 (without initial connection time) > tps = 963043.352823 (without initial connection time) > tps = 968582.600100 (without initial connection time) > > This puts the patched version about 3% slower. I'm not sure how much > of that is changes in the binary and noise and how much is the > needless hashtable build done for eval_const_expressions(). > > I wondered if we should make it the query planner's job of deciding if > the ScalarArrayOpExpr should be hashed or not. I ended up with the > attached rough-cut patch that introduces HashedScalarArrayOpExpr and > has the query planner decide if it's going to replace > ScalarArrayOpExpr with these HashedScalarArrayOpExpr during > preprocess_expression(). I do think that we might want to consider > being a bit selective about when we do these replacements. It seems > likely that we'd want to do this for EXPRKIND_QUAL and maybe > EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to > HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time > since those will just be executed once. In theory we might want to cost them differently as well, though I'm slightly hesitant to do so at this point to avoid causing plan changes (I'm not sure how we would balance that concern with the potential that the best plan isn't chosen). 
> I tried the same above test with the > v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the > attached rough-cut patch and got: > > master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch > + v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch > tps = 1167969.983173 (without initial connection time) > tps = 1199636.793314 (without initial connection time) > tps = 1190690.939963 (without initial connection time) > > I can't really explain why this became faster. I was expecting it just > to reduce that slowdown of the v4 patch a little. I don't really see > any reason why it would become faster. It's almost 20% faster which > seems like too much to just be fluctuations in code alignment in the > binary. I'm not at a place where I can do good perf testing right now (just on my laptop for the moment), unfortunately, so I can't confirm one way or the other. > The attached patch is still missing the required changes to > llvmjit_expr.c. I think that was also missing from the original patch > too, however. Ah, I didn't realize that needed to be changed as well. I'll take a look at that. > Also, I added HashedScalarArrayOpExpr to plannodes.h. All other Expr > type nodes are in primnodes.h. However, I put HashedScalarArrayOpExpr > in plannodes.h because the parser does not generate this and it's not > going to be stored in the catalogue files anywhere. I'm not so sure > inventing a new Expr type node that only can be generated by the > planner is a good thing to do. I don't know what the positives and negatives are of this. > Anyway, wondering what you think of the idea of allowing the planner > to choose if it's going to hash or not? In general I think it's very reasonable. I kinda wonder if HashedScalarArrayOpExpr should have the ScalarArrayOpExp inlined instead of maintaining a pointer, but it's not a big deal to me either way. It certainly adds additional code, but probably also makes the execExpr code clearer. > It might also be good if someone else can check if they can get a bit > more stable performance results from benchmarking the patches. > > (Also attached your v4 patch again just so anyone following along at > home does not need to hunt around for the correct set of patches to > apply to test this) A few other comments: - It looks like several of the "result is always InvalidOid" changes should get committed separately (and now)? - Two comment tweaks: + * 1. The 2nd argument of the array does not contain any Vars, Params or s/array/array op/ + * worthwhile using the hashed version of ScalarArrayOpExprs rather than s/ScalarArrayOpExprs/ScalarArrayOpExpr/ - Using op_hashjoinable is an improvement over my initial patch. - I like the name change to put HASHED/Hashed first. Thanks, James
On 4/6/21 5:58 AM, David Rowley wrote: > On Sat, 20 Mar 2021 at 09:41, James Coleman <jtc331@gmail.com> wrote: >> I've attached a cleaned up patch. Last CF it was listed in is >> https://commitfest.postgresql.org/29/2542/ -- what's the appropriate >> step to take here given it's an already existing patch, but not yet >> moved into recent CFs? > > I had a look at this patch. I like the idea of using a simplehash.h > hash table to hash the constant values so that repeat lookups can be > performed much more quickly, however, I'm a bit concerned that there > are quite a few places in the code where we often just execute a > ScalarArrayOpExpr once and I'm a bit worried that we'll slow down > expression evaluation of those cases. > > The two cases that I have in mind are: > > 1. eval_const_expressions() where we use the executor to evaluate the > ScalarArrayOpExpr to see if the result is Const. > 2. CHECK constraints with IN clauses and single-row INSERTs. > > I tried to benchmark both of these but I'm struggling to get stable > enough performance for #2, even with fsync=off. Sometimes I'm getting > results 2.5x slower than other runs. > > For benchmarking #1 I'm also not too sure I'm getting stable enough > results for them to mean anything. > > I was running: > > create table a (a int); > > bench.sql: explain select * from a where a > in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16); > > drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres > Master (6d41dd045): > tps = 992586.991045 (without initial connection time) > tps = 987964.990483 (without initial connection time) > tps = 994309.670918 (without initial connection time) > > Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch > tps = 956022.557626 (without initial connection time) > tps = 963043.352823 (without initial connection time) > tps = 968582.600100 (without initial connection time) > > This puts the patched version about 3% slower. I'm not sure how much > of that is changes in the binary and noise and how much is the > needless hashtable build done for eval_const_expressions(). > > I wondered if we should make it the query planner's job of deciding if > the ScalarArrayOpExpr should be hashed or not. I ended up with the > attached rough-cut patch that introduces HashedScalarArrayOpExpr and > has the query planner decide if it's going to replace > ScalarArrayOpExpr with these HashedScalarArrayOpExpr during > preprocess_expression(). I do think that we might want to consider > being a bit selective about when we do these replacements. It seems > likely that we'd want to do this for EXPRKIND_QUAL and maybe > EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to > HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time > since those will just be executed once. > > I tried the same above test with the > v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the > attached rough-cut patch and got: > > master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch > + v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch > tps = 1167969.983173 (without initial connection time) > tps = 1199636.793314 (without initial connection time) > tps = 1190690.939963 (without initial connection time) > > I can't really explain why this became faster. I was expecting it just > to reduce that slowdown of the v4 patch a little. I don't really see > any reason why it would become faster. 
It's almost 20% faster which > seems like too much to just be fluctuations in code alignment in the > binary. > Interesting. I tried this on the "small" machine I use for benchmarking, with the same SQL script you used, and also with IN() containing 10 and 100 values - so less/more than your script, which used 16 values. I only ran that with a single client, the machine only has 4 cores and this should not be related to concurrency, so 1 client seems fine. The average of 10 runs, 15 seconds each look like this: simple prepared 10/s 10/p 100/s 100/p ------------------------------------------------------------- master 21847 59476 23343 59380 11757 56488 v4 21546 57757 22864 57704 11572 57350 v4+v5 23374 56089 24410 56140 14765 55302 The first two columns are your bench.sql, with -M simple or prepared. The other columns are 10 or 100 values, /s is simple, /p is prepared. Compared to master: simple prepared 10/s 10/p 100/s 100/p ------------------------------------------------------------- v4 98.62% 97.11% 97.95% 97.18% 98.43% 101.52% v4+v5 106.99% 94.31% 104.57% 94.54% 125.59% 97.90% That seems to mostly match your observation - there's a small performance hit (~2%), although that might be due to changes in the layout of the binary. And v4+v5 improves that a bit (even compared to master), although I don't see the same 20% speedup. I see +25% improvement, but only with 100 values. It's a bit strange that in prepared mode, the v5 actually hurts the performance a bit. That being said, this is a pretty extreme test case. I'm pretty sure that once the table is not empty, the results will probably show a clear improvement. I'll collect some of those results. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
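The exact scripts for these runs aren't included above, so the details below are assumptions, but the described setup presumably boils down to something like this for the 1000-row case (the 0/1/5-row variants just change the INSERT):

CREATE TABLE t_int  (a int);
CREATE TABLE t_text (a text);
INSERT INTO t_int  SELECT g FROM generate_series(1, 1000) g;
INSERT INTO t_text SELECT g::text FROM generate_series(1, 1000) g;

-- benchmarked statement, 16-value integer variant:
SELECT * FROM t_int
WHERE a IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);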
On Thu, 8 Apr 2021 at 05:54, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I only ran that with a single client, the machine only has 4 cores and > this should not be related to concurrency, so 1 client seems fine. The > average of 10 runs, 15 seconds each look like this: Thanks for running these tests. The reason I added so much concurrency was that this AMD machine has some weird behaviour in regards to power management. Every bios update I seem to get changes the power management but it still is very unstable despite me having it on the most optimal settings in the bios. Working it a bit harder seems to help make it realise that there might be some urgency. > simple prepared 10/s 10/p 100/s 100/p > ------------------------------------------------------------- > master 21847 59476 23343 59380 11757 56488 > v4 21546 57757 22864 57704 11572 57350 > v4+v5 23374 56089 24410 56140 14765 55302 > > The first two columns are your bench.sql, with -M simple or prepared. > The other columns are 10 or 100 values, /s is simple, /p is prepared. > > Compared to master: > > simple prepared 10/s 10/p 100/s 100/p > ------------------------------------------------------------- > v4 98.62% 97.11% 97.95% 97.18% 98.43% 101.52% > v4+v5 106.99% 94.31% 104.57% 94.54% 125.59% 97.90% > > That seems to mostly match your observation - there's a small > performance hit (~2%), although that might be due to changes in the > layout of the binary. And v4+v5 improves that a bit (even compared to > master), although I don't see the same 20% speedup. I've spent more time hacking at this patch. I had a bit of a change of heart earlier about having this new HashedScalarArrayOpExpr node type. There were more places that I imagined that I needed to add handling for it. For example, partprune.c needed to know about it to allow partition pruning on them. While supporting that is just a few lines to make a recursive call passing in the underlying ScalarArrayOpExpr, I just didn't like the idea. Instead, I think it'll be better just to add a new field to ScalarArrayOpExpr and have the planner set that to tell the executor that it should use a hash table to perform the lookups rather than a linear search. This can just be the hash function oid, which also saves the executor from having to look that up. After quite a bit of hacking, I've ended up with the attached. I added the required JIT code to teach the jit code about EEOP_HASHED_SCALARARRAYOP. I also wrote the missing regression test for non-strict equality ops and moved the declaration for the simplehash.h code into execExprInterp.c and forward declared ScalarArrayOpExprHashTable in exprExpr.h. I also rewrote a large number of comments and fixed a few things like missing permission checks for the hash function. I've not done any further performance tests yet but will start those now. David
Attachment
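The "hash function oid" being recorded is the hash opclass support function for the element type (the planner-side lookup referred to later in this thread as get_op_hash_functions()). The same information is visible from the catalogs; a query like the following, with int4 chosen as an example and no patch-specific objects involved, shows which function that is:

SELECT p.amproc AS hash_support_func, p.amproc::oid AS func_oid
FROM pg_amproc p
     JOIN pg_opfamily f ON f.oid = p.amprocfamily
     JOIN pg_am a ON a.oid = f.opfmethod
WHERE a.amname = 'hash'
  AND p.amproclefttype = 'int4'::regtype
  AND p.amprocrighttype = 'int4'::regtype
  AND p.amprocnum = 1;   -- support proc 1 is the standard 32-bit hash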
On Thu, 8 Apr 2021 at 18:50, David Rowley <dgrowleyml@gmail.com> wrote: > I've not done any further performance tests yet but will start those now. I ran a set of tests on this: select * from a where a in( < 1 to 10 > ); and select * from a where a in( < 1 to 100 > ); the table "a" is just an empty table with a single int column. I ran "pgbench -T 15 -c 16 -t 16" ten times each and the resulting tps is averaged over the 10 runs. With 10 items in the IN clause: master: 99887.9098314 tps patched: 103235.7616416 tps (3.35% faster) With 100 items: master: 62442.4838792 tps patched:62275.4955754 tps (0.27% slower) These tests are just designed to test the overhead of the additional planning and expression initialisation. Testing the actual performance of the patch vs master with large IN lists shows the expected significant speedups. These results show that there's not much in the way of a measurable slowdown in planning or executor startup from the additional code which decides if we should hash the ScalarArrayOpExpr. I think the changes in the patch are fairly isolated and the test coverage is now pretty good. I'm planning on looking at the patch again now and will consider pushing it for PG14. David
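For anyone reproducing these overhead tests, the longer IN lists don't have to be typed out by hand. A statement like the one below generates the benchmark query text (it assumes the empty table "a" from the test above; in psql the output can be executed directly with \gexec):

SELECT format('SELECT * FROM a WHERE a IN (%s);',
              string_agg(g::text, ','))
FROM generate_series(1, 100) AS g;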
On Thu, 8 Apr 2021 at 22:54, David Rowley <dgrowleyml@gmail.com> wrote: > > I think the changes in the patch are fairly isolated and the test > coverage is now pretty good. I'm planning on looking at the patch > again now and will consider pushing it for PG14. I pushed this with some minor cleanup from the v6 patch I posted earlier. David
On Thu, Apr 8, 2021 at 8:01 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Thu, 8 Apr 2021 at 22:54, David Rowley <dgrowleyml@gmail.com> wrote: > > > > I think the changes in the patch are fairly isolated and the test > > coverage is now pretty good. I'm planning on looking at the patch > > again now and will consider pushing it for PG14. > > I push this with some minor cleanup from the v6 patch I posted earlier. > > David Thank you! I assume proper procedure for the CF entry is to move it into the current CF and then mark it as committed, however I don't know how (or don't have permissions?) to move it into the current CF. How does one go about doing that? Here's the entry: https://commitfest.postgresql.org/29/2542/ Thanks, James
On 2021-Apr-08, James Coleman wrote: > I assume proper procedure for the CF entry is to move it into the > current CF and then mark it as committed, however I don't know how (or > don't have permissions?) to move it into the current CF. How does one > go about doing that? > > Here's the entry: https://commitfest.postgresql.org/29/2542/ Done, thanks. -- Álvaro Herrera 39°49'30"S 73°17'W
On Thu, Apr 8, 2021 at 1:04 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Apr-08, James Coleman wrote: > > > I assume proper procedure for the CF entry is to move it into the > > current CF and then mark it as committed, however I don't know how (or > > don't have permissions?) to move it into the current CF. How does one > > go about doing that? > > > > Here's the entry: https://commitfest.postgresql.org/29/2542/ > > Done, thanks. > > -- > Álvaro Herrera 39°49'30"S 73°17'W Thanks. Is that something I should be able to do myself (should I be asking someone for getting privileges in the app to do so)? I'm not sure what the project policy is on that. James
On 2021-Apr-08, James Coleman wrote: > On Thu, Apr 8, 2021 at 1:04 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2021-Apr-08, James Coleman wrote: > > > > > I assume proper procedure for the CF entry is to move it into the > > > current CF and then mark it as committed, however I don't know how (or > > > don't have permissions?) to move it into the current CF. How does one > > > go about doing that? > > > > > > Here's the entry: https://commitfest.postgresql.org/29/2542/ > > > > Done, thanks. > Thanks. Is that something I should be able to do myself No, sorry. -- Álvaro Herrera Valdivia, Chile "La vida es para el que se aventura"
On 4/8/21 2:00 PM, David Rowley wrote: > On Thu, 8 Apr 2021 at 22:54, David Rowley <dgrowleyml@gmail.com> wrote: >> >> I think the changes in the patch are fairly isolated and the test >> coverage is now pretty good. I'm planning on looking at the patch >> again now and will consider pushing it for PG14. > > I push this with some minor cleanup from the v6 patch I posted earlier. > I ran the same set of benchmarks on the v6, which I think should be mostly identical to what was committed. I extended the test a bit to test table with 0, 1, 5 and 1000 rows, and also with int and text values, to see how it works with more expensive comparators. I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are attached. There's a bit of difference between gcc and clang, but the general behavior is about the same, so I'll only present gcc results to keep this simple. I'll only throughput comparison to master, so >1.0 means good, <1.0 means bad. If you're interested in actual tps, see the full results. For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are: integer column / v5 =================== rows 10/p 100/p 16/p 10/s 100/s 16/s ----------------------------------------------------------- 0 97% 97% 97% 107% 126% 108% 1 95% 82% 94% 108% 132% 110% 5 95% 83% 95% 108% 132% 110% 1000 129% 481% 171% 131% 382% 165% integer column / v6 =================== rows 10/p 100/p 16/p 10/s 100/s 16/s ----------------------------------------------------------- 0 97% 97% 97% 98% 98% 98% 1 96% 84% 95% 97% 97% 98% 5 97% 85% 96% 98% 97% 97% 1000 129% 489% 172% 128% 330% 162% text column / v5 ================ rows 10/p 100/p 16/p 10/s 100/s 16/s ----------------------------------------------------------- 0 100% 100% 100% 106% 119% 108% 1 96% 81% 95% 107% 120% 109% 5 97% 82% 96% 107% 121% 109% 1000 291% 1622% 402% 255% 1092% 337% text column / v6 ================ rows 10/p 100/p 16/p 10/s 100/s 16/s ----------------------------------------------------------- 0 101% 101% 101% 98% 99% 99% 1 98% 82% 96% 98% 96% 97% 5 100% 84% 98% 98% 96% 98% 1000 297% 1645% 408% 255% 1000% 336% Overall, the behavior for integer and text columns is the same, for both patches. There's a couple interesting observations: 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup), but v6 does not seem to help at all - it's either same or slower than unpatched master. I wonder why is that, and if we could get some of the speedup with v6? At first I thought that maybe v5 is not building the hash table in cases where v6 does, but that shouldn't be faster than master. 2) For the "prepared" mode, there's a clear performance hit the longer the array is (for both v5 and v6). For 100 elements it's about 15%, which is not great. I think the reason is fairly simple - building the hash table is not free, and with few rows it's not worth it - it'd be faster to just search the array directly. Unfortunately, the logic that makes the decision to switch to hashing only looks at the array length only, and ignores the number of rows entirely. So I think if we want to address this, convert_saop_to_hashed_saop needs to compare has_build_cost + nrows * hash_lookup_cost and nrows * linear_lookup_cost to make reasonable decision. I was thinking that maybe we can ignore this, because people probably have much larger tables in practice. But I'm not sure that's really true, because there may be other quals and it's possible the preceding ones are quite selective, filtering most of the rows. 
I'm not sure how much of the necessary information we have available in convert_saop_to_hashed_saop_walker, though :-( I suppose we know the number of input rows for that plan node, not sure about selectivity of the other quals, though. It's also a bit strange that we get speedup for "simple" protocol, while for "prepared" it gets slower. That seems counter-intuitive, because why should we see opposite outcomes in those cases? I'd assume that we'll see either speedup or slowdown in both cases, with the relative change being more significant in the "prepared" mode. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, 9 Apr 2021 at 09:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I ran the same set of benchmarks on the v6, which I think should be > mostly identical to what was committed. I extended the test a bit to > test table with 0, 1, 5 and 1000 rows, and also with int and text > values, to see how it works with more expensive comparators. > > I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are > attached. There's a bit of difference between gcc and clang, but the > general behavior is about the same, so I'll only present gcc results to > keep this simple. I'll only throughput comparison to master, so >1.0 > means good, <1.0 means bad. If you're interested in actual tps, see the > full results. > > For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are: > > integer column / v5 > =================== > > rows 10/p 100/p 16/p 10/s 100/s 16/s > ----------------------------------------------------------- > 0 97% 97% 97% 107% 126% 108% > 1 95% 82% 94% 108% 132% 110% > 5 95% 83% 95% 108% 132% 110% > 1000 129% 481% 171% 131% 382% 165% I think we should likely ignore the v5 patch now. The reason is that it was pretty much unfinished and there were many places that I'd not yet added support for the HashedScalarArrayOpExpr node type yet. This could cause these nodes to be skipped during node mutations or node walking which would certainly make planning faster, just not in a way that's correct. > integer column / v6 > =================== > > rows 10/p 100/p 16/p 10/s 100/s 16/s > ----------------------------------------------------------- > 0 97% 97% 97% 98% 98% 98% > 1 96% 84% 95% 97% 97% 98% > 5 97% 85% 96% 98% 97% 97% > 1000 129% 489% 172% 128% 330% 162% This is a really informative set of results. I can only guess that the slowdown of the 100/prepared query is down to building the hash table. I think that because the 0 rows test does not show the slowdown and we only build the table when evaluating for the first time. There's a slightly larger hit on 1 row vs 5 rows, which makes sense since the rewards of the hash lookup start paying off more with more rows. Looking at your tps numbers, I think I can see why we get the drop in performance with prepared statements but not simple statements. This seems to just be down to the fact that the planning time dominates in the simple statement case. For example, the "1 row" test for 100/s for v6 is 10023.3 tps, whereas the 100/p result is 44093.8 tps. With master, prepared gets 52400.0 tps. So we could say the hash table build costs us 8306.2 tps, or 3.59 microseconds per execution, per: postgres=# select 1000000 / 52400.0 - 1000000 / 44093.8; ?column? --------------------- -3.5949559161508538 (1 row) If we look at the tps for the simple query version of the same test. Master did 10309.6 tps, v6 did 10023.3 tps. If we apply that 3.59 microsecond slowdown to master's tps, then we get pretty close to within 1% of the v6 tps: postgres=# select 1000000 / (1000000 / 10309.6 + 3.59); ?column? ----------------------- 9941.6451581291294165 (1 row) > text column / v6 > ================ > > rows 10/p 100/p 16/p 10/s 100/s 16/s > ----------------------------------------------------------- > 0 101% 101% 101% 98% 99% 99% > 1 98% 82% 96% 98% 96% 97% > 5 100% 84% 98% 98% 96% 98% > 1000 297% 1645% 408% 255% 1000% 336% > > > Overall, the behavior for integer and text columns is the same, for both > patches. 
There's a couple interesting observations: > > 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup), > but v6 does not seem to help at all - it's either same or slower than > unpatched master. I think that's related to the fact that I didn't finish adding HashedScalarArrayOpExpr processing to all places that needed it. > I wonder why is that, and if we could get some of the speedup with v6? > At first I thought that maybe v5 is not building the hash table in cases > where v6 does, but that shouldn't be faster than master. I don't think v5 and v6 really do anything much differently in the executor. The only difference is really during ExecInitExprRec() when we initialize the expression. With v5 we had a case T_HashedScalarArrayOpExpr: to handle the new node type, but in v6 we have if (OidIsValid(opexpr->hashfuncid)). Oh, wait. I did add the missing permissions check on the hash function, so that will account for something. As far as I can see, that's required. > 2) For the "prepared" mode, there's a clear performance hit the longer > the array is (for both v5 and v6). For 100 elements it's about 15%, > which is not great. > > I think the reason is fairly simple - building the hash table is not > free, and with few rows it's not worth it - it'd be faster to just > search the array directly. Unfortunately, the logic that makes the > decision to switch to hashing only looks at the array length only, and > ignores the number of rows entirely. So I think if we want to address > this, convert_saop_to_hashed_saop needs to compare > > has_build_cost + nrows * hash_lookup_cost > > and > > nrows * linear_lookup_cost > > to make reasonable decision. I thought about that but I was really worried that the performance of ScalarArrayOpExpr would just become too annoyingly unpredictable. You know fairly well that we can often get massive row underestimations in the planner. (I guess you worked on ext stats mainly because of that) The problem I want to avoid is the ones where we get a big row underestimation but don't really get a bad plan as a result. For example a query like: SELECT * FROM big_table WHERE col1 = ... AND col2 = ... AND col3 = ... AND col4 IN( ... big list of values ...); If col1, col2 and col3 are highly correlated but individually fairly selective, then we could massively underestimate how many rows the IN clause will see (assuming no rearranging was done here). I'm not completely opposed to the idea of taking the estimated rows into account during planning. It might just mean having to move the convert_saop_to_hashed_saop() call somewhere else. I imagine that's fairly trivial to do. I just have concerns about doing so. > I was thinking that maybe we can ignore this, because people probably > have much larger tables in practice. But I'm not sure that's really > true, because there may be other quals and it's possible the preceding > ones are quite selective, filtering most of the rows. > > I'm not sure how much of the necessary information we have available in > convert_saop_to_hashed_saop_walker, though :-( I suppose we know the > number of input rows for that plan node, not sure about selectivity of > the other quals, though. > > It's also a bit strange that we get speedup for "simple" protocol, while > for "prepared" it gets slower. That seems counter-intuitive, because why > should we see opposite outcomes in those cases? I'd assume that we'll > see either speedup or slowdown in both cases, with the relative change > being more significant in the "prepared" mode. 
I hope my theory above about the planner time dominating the overall time shows why that is. Another way to look at these result is by taking your tps value and calculating how long it takes to do N number of transactions then totalling up the time it takes. If I do that to calculate how long it took each test to perform 1000 transactions and sum each test grouping by mode and version with rollup on mode, I get: time values are in seconds: pg-master 28.07 prepared 11.28 simple 16.79 pg-v6 15.86 prepared 5.23 simple 10.63 So, the overall result when applying the total time is 177%. David
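The totals above ("how long it took each test to perform 1000 transactions", summed and grouped by version and mode with a rollup on mode) can be expressed directly in SQL over the raw numbers; the bench_results table and its columns are invented here purely for the illustration:

-- assumes bench_results(version text, mode text, tps numeric)
SELECT version, mode,
       round(sum(1000.0 / tps), 2) AS seconds_per_1000_xacts
FROM bench_results
GROUP BY version, ROLLUP (mode)
ORDER BY version, mode;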
On 4/9/21 1:21 AM, David Rowley wrote: > On Fri, 9 Apr 2021 at 09:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >> >> I ran the same set of benchmarks on the v6, which I think should be >> mostly identical to what was committed. I extended the test a bit to >> test table with 0, 1, 5 and 1000 rows, and also with int and text >> values, to see how it works with more expensive comparators. >> >> I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are >> attached. There's a bit of difference between gcc and clang, but the >> general behavior is about the same, so I'll only present gcc results to >> keep this simple. I'll only throughput comparison to master, so >1.0 >> means good, <1.0 means bad. If you're interested in actual tps, see the >> full results. >> >> For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are: >> >> integer column / v5 >> =================== >> >> rows 10/p 100/p 16/p 10/s 100/s 16/s >> ----------------------------------------------------------- >> 0 97% 97% 97% 107% 126% 108% >> 1 95% 82% 94% 108% 132% 110% >> 5 95% 83% 95% 108% 132% 110% >> 1000 129% 481% 171% 131% 382% 165% > > I think we should likely ignore the v5 patch now. The reason is that > it was pretty much unfinished and there were many places that I'd not > yet added support for the HashedScalarArrayOpExpr node type yet. This > could cause these nodes to be skipped during node mutations or node > walking which would certainly make planning faster, just not in a way > that's correct. > >> integer column / v6 >> =================== >> >> rows 10/p 100/p 16/p 10/s 100/s 16/s >> ----------------------------------------------------------- >> 0 97% 97% 97% 98% 98% 98% >> 1 96% 84% 95% 97% 97% 98% >> 5 97% 85% 96% 98% 97% 97% >> 1000 129% 489% 172% 128% 330% 162% > > This is a really informative set of results. I can only guess that the > slowdown of the 100/prepared query is down to building the hash table. > I think that because the 0 rows test does not show the slowdown and we > only build the table when evaluating for the first time. There's a > slightly larger hit on 1 row vs 5 rows, which makes sense since the > rewards of the hash lookup start paying off more with more rows. > Agreed. I think that's essentially what I wrote too. > Looking at your tps numbers, I think I can see why we get the drop in > performance with prepared statements but not simple statements. This > seems to just be down to the fact that the planning time dominates in > the simple statement case. For example, the "1 row" test for 100/s > for v6 is 10023.3 tps, whereas the 100/p result is 44093.8 tps. With > master, prepared gets 52400.0 tps. So we could say the hash table > build costs us 8306.2 tps, or 3.59 microseconds per execution, per: > > postgres=# select 1000000 / 52400.0 - 1000000 / 44093.8; > ?column? > --------------------- > -3.5949559161508538 > (1 row) > > If we look at the tps for the simple query version of the same test. > Master did 10309.6 tps, v6 did 10023.3 tps. If we apply that 3.59 > microsecond slowdown to master's tps, then we get pretty close to > within 1% of the v6 tps: > > postgres=# select 1000000 / (1000000 / 10309.6 + 3.59); > ?column? > ----------------------- > 9941.6451581291294165 > (1 row) > Right, that makes perfect sense. 
>> text column / v6 >> ================ >> >> rows 10/p 100/p 16/p 10/s 100/s 16/s >> ----------------------------------------------------------- >> 0 101% 101% 101% 98% 99% 99% >> 1 98% 82% 96% 98% 96% 97% >> 5 100% 84% 98% 98% 96% 98% >> 1000 297% 1645% 408% 255% 1000% 336% >> >> >> Overall, the behavior for integer and text columns is the same, for both >> patches. There's a couple interesting observations: >> >> 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup), >> but v6 does not seem to help at all - it's either same or slower than >> unpatched master. > > I think that's related to the fact that I didn't finish adding > HashedScalarArrayOpExpr processing to all places that needed it. > Not sure I understand. Shouldn't that effectively default to the unpatched behavior? How could that result in code that is *faster* than master? >> I wonder why is that, and if we could get some of the speedup with v6? >> At first I thought that maybe v5 is not building the hash table in cases >> where v6 does, but that shouldn't be faster than master. > > I don't think v5 and v6 really do anything much differently in the > executor. The only difference is really during ExecInitExprRec() when > we initialize the expression. With v5 we had a case > T_HashedScalarArrayOpExpr: to handle the new node type, but in v6 we > have if (OidIsValid(opexpr->hashfuncid)). Oh, wait. I did add the > missing permissions check on the hash function, so that will account > for something. As far as I can see, that's required. > I think the check is required, but I don't think that should be so very expensive - it's likely one of many other permission checks. But I think the puzzle is not so much about v5 vs v6, but more about v5 vs. master. I still don't understand how v5 managed to be faster than master, but maybe I'm missing something. >> 2) For the "prepared" mode, there's a clear performance hit the longer >> the array is (for both v5 and v6). For 100 elements it's about 15%, >> which is not great. >> >> I think the reason is fairly simple - building the hash table is not >> free, and with few rows it's not worth it - it'd be faster to just >> search the array directly. Unfortunately, the logic that makes the >> decision to switch to hashing only looks at the array length only, and >> ignores the number of rows entirely. So I think if we want to address >> this, convert_saop_to_hashed_saop needs to compare >> >> has_build_cost + nrows * hash_lookup_cost >> >> and >> >> nrows * linear_lookup_cost >> >> to make reasonable decision. > > I thought about that but I was really worried that the performance of > ScalarArrayOpExpr would just become too annoyingly unpredictable. You > know fairly well that we can often get massive row underestimations in > the planner. (I guess you worked on ext stats mainly because of that) > The problem I want to avoid is the ones where we get a big row > underestimation but don't really get a bad plan as a result. For > example a query like: > > SELECT * FROM big_table WHERE col1 = ... AND col2 = ... AND col3 = ... > AND col4 IN( ... big list of values ...); > > If col1, col2 and col3 are highly correlated but individually fairly > selective, then we could massively underestimate how many rows the IN > clause will see (assuming no rearranging was done here). > > I'm not completely opposed to the idea of taking the estimated rows > into account during planning. It might just mean having to move the > convert_saop_to_hashed_saop() call somewhere else.
I imagine that's > fairly trivial to do. I just have concerns about doing so. > Hmm, yeah. I understand your concerns, but the way it works now it kinda penalizes correct estimates. Imagine you have workload with simple OLTP queries, the queries always hit only a couple rows - but we make it run 15% slower because we're afraid there might be under-estimate. It's sensible to make the planning resilient to under-estimates, but I'm not sure just ignoring the cardinality estimate with the justification it might be wrong is good strategy. In a way, all the other planning decisions assume it's correct, so why should this be any different? We're not using seqscan exclusively just because the selectivity might be wrong, making index scan ineffective, for example. Maybe the right solution is to rely on the estimates, but then also enable the hashing if we significantly cross the threshold during execution. So for example we might get estimate 10 rows, and calculate that the hashing would start winning at 100 rows, so we start without hashing. But then at execution if we get 200 rows, we build the hash table and start using it. Yes, there's a risk that there are only 200 rows, and the time spent building the hash table is wasted. But it's much more likely that there are many more rows. >> I was thinking that maybe we can ignore this, because people probably >> have much larger tables in practice. But I'm not sure that's really >> true, because there may be other quals and it's possible the preceding >> ones are quite selective, filtering most of the rows. >> >> I'm not sure how much of the necessary information we have available in >> convert_saop_to_hashed_saop_walker, though :-( I suppose we know the >> number of input rows for that plan node, not sure about selectivity of >> the other quals, though. >> >> It's also a bit strange that we get speedup for "simple" protocol, while >> for "prepared" it gets slower. That seems counter-intuitive, because why >> should we see opposite outcomes in those cases? I'd assume that we'll >> see either speedup or slowdown in both cases, with the relative change >> being more significant in the "prepared" mode. > > I hope my theory above about the planner time dominating the overall > time shows why that is. > I think it does explain the v6 behavior, where prepared gets slower while simple is about the same (with low row counts). But I'm not sure I understand the v5, where simple got faster than master. > Another way to look at these result is by taking your tps value and > calculating how long it takes to do N number of transactions then > totalling up the time it takes. If I do that to calculate how long it > took each test to perform 1000 transactions and sum each test grouping > by mode and version with rollup on mode, I get: > > time values are in seconds: > > pg-master 28.07 > prepared 11.28 > simple 16.79 > > pg-v6 15.86 > prepared 5.23 > simple 10.63 > > So, the overall result when applying the total time is 177%. > Not sure how you got those numbers, or how it explains the results. E.g. on v5, the results for 100 int values / 1 row look like this: 100/p 100/s master 52400 10310 v5 43446 13610 I understand why the prepared mode got slower. I don't understand how the simple mode got faster. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
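As an aside, the correlated-columns under-estimation in David's big_table example is the sort of thing the extended statistics mentioned above are meant to mitigate; it doesn't settle whether the hashing decision should trust estimates, but for completeness (column and table names are taken from that example, the statistics name is made up):

CREATE STATISTICS big_table_cols_dep (dependencies)
    ON col1, col2, col3 FROM big_table;
ANALYZE big_table;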
On Sat, 10 Apr 2021 at 10:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > But I think the puzzle is not so much about v5 vs v6, but more about v5 > vs. master. I still don't understand how v5 managed to be faster than > master, but maybe I'm missing something. Well, v5 wrapped ScalarArrayOpExpr inside HashedScalarArrayOpExpr, but didn't add cases for HashedScalarArrayOpExpr in all locations where it should have. For example, fix_expr_common() has a case for ScalarArrayOpExpr, but if we gave it a HashedScalarArrayOpExpr it would have very little work to do. I've not gone and proved that's the exact reason why the planner became faster, but I really don't see any other reason. > Maybe the right solution is to rely on the estimates, but then also > enable the hashing if we significantly cross the threshold during > execution. So for example we might get estimate 10 rows, and calculate > that the hashing would start winning at 100 rows, so we start without > hashing. But then at execution if we get 200 rows, we build the hash > table and start using it. To do that, we'd need to store the number of evaluations of the function somewhere. I'm not really sure that would be a good idea as I imagine we'd need to store that in ExprEvalStep. I imagine if that was a good idea then we'd have done the same for JIT. > On 4/9/21 1:21 AM, David Rowley wrote: > > time values are in seconds: > > > > pg-master 28.07 > > prepared 11.28 > > simple 16.79 > > > > pg-v6 15.86 > > prepared 5.23 > > simple 10.63 > > > > So, the overall result when applying the total time is 177%. > > > > Not sure how you got those numbers, or how it explains the results. I got it by doing "1 / tps * 1000" to get the time it would take to execute 1000 transactions of each of your tests. I then grouped by patch, mode and took the sum of the calculated number. My point was that overall the patch is significantly faster. I was trying to highlight that the 0 and 1 row test take up very little time and the overhead of building the hash table is only showing up because the query executes so quickly. FWIW, I think executing a large IN clause on a table that has 0 rows is likely not that interesting a case to optimise for. That's not the same as a query that just returns 0 rows due to filtering out hundreds or thousands of rows during execution. The overhead of building the hash table is not going to show up very easily in that sort of case. > I understand why the prepared mode got slower. I don't understand how > the simple mode got faster. I very much imagine v5 was faster at planning due to the unfinished nature of the patch. I'd not added support for HashedScalarArrayOpExpr in all the places I should have. That would result in the planner skipping lots of work that it needs to do. The way I got it to work was to add it, then just add enough cases in the planner to handle HashedScalarArrayOpExpr so I didn't get any errors. I stopped after that just to show the idea. Lack of errors does not mean it was correct. At least setrefs.c was not properly handling HashedScalarArrayOpExpr. I really think it would be best if we just ignore the performance of v5. Looking at the performance of a patch that was incorrectly skipping a bunch of required work does not seem that fair. David
On 4/11/21 12:03 AM, David Rowley wrote: > On Sat, 10 Apr 2021 at 10:32, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> But I think the puzzle is not so much about v5 vs v6, but more about v5 >> vs. master. I still don't understand how v5 managed to be faster than >> master, but maybe I'm missing something. > > Well, v5 wrapped ScalarArrayOpExpr inside HashedScalarArrayOpExpr, but > didn't add cases for HashedScalarArrayOpExpr in all locations where it > should have. For example, fix_expr_common() has a case for > ScalarArrayOpExpr, but if we gave it a HashedScalarArrayOpExpr it > would have very little work to do. > > I've not gone and proved that's the exact reason why the planner > became faster, but I really don't see any other reason. > >> Maybe the right solution is to rely on the estimates, but then also >> enable the hashing if we significantly cross the threshold during >> execution. So for example we might get estimate 10 rows, and calculate >> that the hashing would start winning at 100 rows, so we start without >> hashing. But then at execution if we get 200 rows, we build the hash >> table and start using it. > > To do that, we'd need to store the number of evaluations of the > function somewhere. I'm not really sure that would be a good idea as I > imagine we'd need to store that in ExprEvalStep. I imagine if that > was a good idea then we'd have done the same for JIT. > Sure, we'd need to track the number of lookups, but I'd imagine that's fairly cheap and we can stop once we switch to hash mode. I'm not sure "JIT does not do that" is really a proof it's a bad idea. My guess is it wasn't considered back then, and the current heuristics is the simplest possible. So maybe it's the other way and we should consider to do the same thing for JIT? FWIW if we look at what JIT does, it'd argue it supports the approach to trust the estimates. Because if we under-estimate stuff, the cost won't exceed the "jit_above_cost" threshold, and we won't use JIT. > >> On 4/9/21 1:21 AM, David Rowley wrote: >>> time values are in seconds: >>> >>> pg-master 28.07 >>> prepared 11.28 >>> simple 16.79 >>> >>> pg-v6 15.86 >>> prepared 5.23 >>> simple 10.63 >>> >>> So, the overall result when applying the total time is 177%. >>> >> >> Not sure how you got those numbers, or how it explains the results. > > I got it by doing "1 / tps * 1000" to get the time it would take to > execute 1000 transactions of each of your tests. I then grouped by > patch, mode and took the sum of the calculated number. My point was > that overall the patch is significantly faster. I was trying to > highlight that the 0 and 1 row test take up very little time and the > overhead of building the hash table is only showing up because the > query executes so quickly. > Ah, I see. TBH I don't think combining the results gives us a very meaningful value - those cases were quite arbitrary, but summing them together like this assumes the workload has about 25% of each. But if your workload is exclusively 0/1/5 rows it's going to be hit. > FWIW, I think executing a large IN clause on a table that has 0 rows > is likely not that interesting a case to optimise for. That's not the > same as a query that just returns 0 rows due to filtering out hundreds > or thousands of rows during execution. The overhead of building the > hash table is not going to show up very easily in that sort of case. > Yeah, it's probably true that queries with long IN lists are probably dealing with many input rows. 
And you're right we don't really care about how many rows are ultimately produced by the query (or even the step with the IN list) - if we spent a lot of time to filter the rows before applying the IN list, the time to initialize the hash table is probably just noise. I wonder what's the relationship between the length of the IN list and the minimum number of rows needed for the hash to start winning. >> I understand why the prepared mode got slower. I don't understand how >> the simple mode got faster. > > I very much imagine v5 was faster at planning due to the unfinished > nature of the patch. I'd not added support for HashedScalarArrayOpExpr > in all the places I should have. That would result in the planner > skipping lots of work that it needs to do. The way I got it to work > was to add it, then just add enough cases in the planner to handle > HashedScalarArrayOpExpr so I didn't get any errors. I stopped after > that just to show the idea. Lack of errors does not mean it was > correct. At least setrefs.c was not properly handling > HashedScalarArrayOpExpr. > > I really think it would be best if we just ignore the performance of > v5. Looking at the performance of a patch that was incorrectly > skipping a bunch of required work does not seem that fair. > Aha! I was assuming v5 was correct, but if that assumption is incorrect then the whole "v5 speedup" is just an illusion, and you're right we should simply ignore that. Thanks for the explanation! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, 11 Apr 2021 at 10:38, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I wonder what's the relationship between the length of the IN list and > the minimum number of rows needed for the hash to start winning. I made the attached spreadsheet which demonstrates the crossover point using the costs that I coded into cost_qual_eval_walker(). It basically shows, for large arrays, that there are fairly significant benefits to hashing with as few as 2 lookups, and that not hashing wins only when there is a single lookup. However, the cost model does not account for allocating memory for the hash table, which is far from free. You can adjust the number of items in the IN clause by changing the value in cell B1. The values in B2 and B3 are what I saw the planner set when I tested with both INT and TEXT types. David
Attachment
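The spreadsheet itself isn't reproduced here, but the crossover it demonstrates follows from a simple model: if building the hash table costs about n units for an n-element array, a hash probe about 1 unit, and a linear search about n/2 units per lookup, then hashing wins once the number of lookups exceeds n / (n/2 - 1), which approaches 2 as n grows, consistent with the observation above. The unit costs here are illustrative only; the planner's actual constants live in cost_qual_eval_walker(). A quick check of the ratio:

SELECT n, round(n / (n / 2.0 - 1), 2) AS breakeven_lookups
FROM (VALUES (10), (100), (1000)) AS t(n);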
On Fri, 9 Apr 2021 at 00:00, David Rowley <dgrowleyml@gmail.com> wrote: > I push this with some minor cleanup from the v6 patch I posted earlier. I realised when working on something unrelated last night that we can also do hash lookups for NOT IN too. We'd just need to check if the operator's negator operator is hashable. No new fields would need to be added to ScalarArrayOpExpr. We'd just set the hashfuncid to the correct value and then set the opfuncid to the negator function. In the executor, we'd know to check if the value is in the table or not in the table based on the useOr value. I'm not really sure whether lack of NOT IN support is going to be a source of bug reports for PG14 or not. If it was, then it might be worth doing something about that for PG14. Otherwise, we can just leave it for future work for PG15 and beyond. I personally don't have any strong feelings either way, but I'm leaning towards just writing a patch and thinking of pushing it sometime after we branch for PG15. I've included the RMT, just in case they want to voice an opinion on that. David
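The "negator is hashable" condition is visible in the catalogs. For the int4 <> operator, for example, the negator is = and it is marked hashable, which is what would allow the NOT IN form to use the hashed path:

SELECT o.oprname AS op, n.oprname AS negator, n.oprcanhash
FROM pg_operator o
     JOIN pg_operator n ON n.oid = o.oprnegate
WHERE o.oprname = '<>'
  AND o.oprleft = 'int4'::regtype
  AND o.oprright = 'int4'::regtype;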
David Rowley <dgrowleyml@gmail.com> writes: > I realised when working on something unrelated last night that we can > also do hash lookups for NOT IN too. ... and still get the behavior right for nulls? regards, tom lane
On Tue, 13 Apr 2021 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > I realised when working on something unrelated last night that we can > > also do hash lookups for NOT IN too. > > ... and still get the behavior right for nulls? Yeah, it will. There are already some special cases for NULLs in the IN version. Those would need to be adapted for NOT IN. David
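For reference, these are the NULL corner cases that any hashed path has to preserve; a quick sanity check written for illustration rather than taken from the regression tests:

    SELECT 1 IN (1, NULL)     AS in_match,       -- true: a match is found
           1 IN (2, NULL)     AS in_unknown,     -- null: no match, but a NULL is present
           1 NOT IN (2, 3)    AS not_in_true,    -- true: provably absent
           1 NOT IN (2, NULL) AS not_in_unknown; -- null: the NULL could have been 1

In particular, a hashed NOT IN cannot simply return true on a missed probe when the array contains a NULL, or when the probe value itself is NULL.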
On Mon, Apr 12, 2021 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Tue, 13 Apr 2021 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > David Rowley <dgrowleyml@gmail.com> writes: > > > I realised when working on something unrelated last night that we can > > > also do hash lookups for NOT IN too. > > > > ... and still get the behavior right for nulls? > > Yeah, it will. There are already some special cases for NULLs in the > IN version. Those would need to be adapted for NOT IN. I hadn't thought about using the negator operator directly that way when I initially wrote the patch. But also I didn't think a whole lot about the NOT IN case at all -- and there's no mention of such that I see in this thread or the precursor thread. It's pretty obvious that it wasn't part of my immediate need, but obviously it'd be nice to have the consistency. All that to say this: my vote would be to put it into PG15 also. James
On Mon, Apr 12, 2021 at 10:07 PM James Coleman <jtc331@gmail.com> wrote: > > On Mon, Apr 12, 2021 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > > On Tue, 13 Apr 2021 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > David Rowley <dgrowleyml@gmail.com> writes: > > > > I realised when working on something unrelated last night that we can > > > > also do hash lookups for NOT IN too. > > > > > > ... and still get the behavior right for nulls? > > > > Yeah, it will. There are already some special cases for NULLs in the > > IN version. Those would need to be adapted for NOT IN. > > I hadn't thought about using the negator operator directly that way > when I initially wrote the patch. > > But also I didn't think a whole lot about the NOT IN case at all -- > and there's no mention of such that I see in this thread or the > precursor thread. It's pretty obvious that it wasn't part of my > immediate need, but obviously it'd be nice to have the consistency. > > All that to say this: my vote would be to put it into PG15 also. ...and here's a draft patch. I can take this to a new thread if you'd prefer; the one here already got committed, on the other hand this is pretty strongly linked to this discussion, so I figured it made sense to post it here. James
Attachment
On Wed, 14 Apr 2021 at 05:40, James Coleman <jtc331@gmail.com> wrote: > ...and here's a draft patch. I can take this to a new thread if you'd > prefer; the one here already got committed, on the other hand this is > pretty strongly linked to this discussion, so I figured it made sense > to post it here. I only glanced at this when you sent it and I was confused about how it works. The patch didn't look like how I imagined it should and I couldn't see how the executor part worked without any changes. Anyway, I decided to clear up my confusion tonight and apply the patch to figure all this out... unfortunately, I see why I was confused now. It actually does not work at all :-( You're still passing the <> operator to get_op_hash_functions(), which of course is not hashable, so we just never do hashing for NOT IN. All your tests pass just fine because the standard non-hashed code path is used. My idea was that you'd not add any fields to ScalarArrayOpExpr and for soaps with useOr == false, check if the negator of the operator is hashable. If so set the opfuncid to the negator operator's function. I'm a bit undecided if it's safe to set the opfuncid to the negator function. If anything were to set that again based on the opno then it would likely set it to the wrong thing. We can't go changing the opno either because EXPLAIN would display the wrong thing. Anyway, I've attached what I ended up with after spending a few hours looking at this. I pretty much used all your tests as is with the exception of removing one that looked duplicated. David
Attachment
On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Wed, 14 Apr 2021 at 05:40, James Coleman <jtc331@gmail.com> wrote: > > ...and here's a draft patch. I can take this to a new thread if you'd > > prefer; the one here already got committed, on the other hand this is > > pretty strongly linked to this discussion, so I figured it made sense > > to post it here. > > I only glanced at this when you sent it and I was confused about how > it works. The patch didn't look like how I imagined it should and I > couldn't see how the executor part worked without any changes. > > Anyway, I decided to clear up my confusion tonight and apply the patch > to figure all this out... unfortunately, I see why I was confused > now. It actually does not work at all :-( > > You're still passing the <> operator to get_op_hash_functions(), which > of course is not hashable, so we just never do hashing for NOT IN. > > All your tests pass just fine because the standard non-hashed code path is used. I was surprised when it "just worked" too; I should have stopped to verify the path was being taken. Egg on my face for not doing so. :( > My idea was that you'd not add any fields to ScalarArrayOpExpr and for > soaps with useOr == false, check if the negator of the operator is > hashable. If so set the opfuncid to the negator operator's function. > > I'm a bit undecided if it's safe to set the opfuncid to the negator > function. If anything were to set that again based on the opno then > it would likely set it to the wrong thing. We can't go changing the > opno either because EXPLAIN would display the wrong thing. I don't personally see a reason why this is a problem. But I also don't know that I have enough knowledge of the codebase to say that definitively. > Anyway, I've attached what I ended up with after spending a few hours > looking at this. Overall I like this approach. One thing I think we could clean up: + bool useOr; /* use OR or AND semantics? */ ... + /* useOr == true means an IN clause, useOr == false is NOT IN */ I'm wondering if the intersection of these two lines implies that useOr isn't quite the right name here. Perhaps something like "negated"? On the other hand (to make the counterargument) useOr would keep it consistent with the other ones. The other thing I got to thinking about was = ALL. It doesn't get turned into a hash op because the negator of = isn't hashable. I think it's correct that that's the determining factor, because I can't imagine what it would mean to hash <>. But...I wanted to confirm I wasn't missing something. We don't have explicit tests for that case, but I'm not sure it's necessary either. > I pretty much used all your tests as is with the exception of removing > one that looked duplicated. Sounds good. James
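To spell out the = ALL case being asked about, here is what it evaluates to over a Const array (an illustration only, not from the patch's tests):

    SELECT 1 = ALL ('{1,1,1}'::int[])  AS all_same,   -- true: every element matches
           1 = ALL ('{1,2}'::int[])    AS mixed,      -- false: some element differs
           1 = ALL ('{1,NULL}'::int[]) AS with_null;  -- null: true AND unknown

Since = ALL over a Const array can only be true when every element is the same non-NULL value, there is little to gain from hashing it.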
On Sat, 8 May 2021 at 09:15, James Coleman <jtc331@gmail.com> wrote: > > On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote: > > I'm a bit undecided if it's safe to set the opfuncid to the negator > > function. If anything were to set that again based on the opno then > > it would likely set it to the wrong thing. We can't go changing the > > opno either because EXPLAIN would display the wrong thing. > > I don't personally see a reason why this is a problem. But I also > don't know that I have enough knowledge of the codebase to say that > definitively. The reason for my concern is that if the opfuncid is set to InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's opfuncid to get_opcode(opexpr->opno). I'm effectively setting the opfuncid to get_opcode(get_negator(opexpr->opno)); if anything were to reset the ScalarArrayOpExpr's opfuncid to InvalidOid, then set_sa_opfuncid() would repopulate it with the wrong value. Maybe the solution there is to teach set_sa_opfuncid() about our hashing NOT IN trick and have it check if (!opexpr->useOr && OidIsValid(opexpr->hashfuncid)) and, if that's true, do opexpr->opfuncid = get_opcode(get_negator(opexpr->opno)). Then we could just not bother setting opfuncid in convert_saop_to_hashed_saop_walker(). > > Anyway, I've attached what I ended up with after spending a few hours > > looking at this. > > Overall I like this approach. > > One thing I think we could clean up: > > + bool useOr; /* use OR or AND semantics? */ > ... > + /* useOr == true means an IN clause, useOr == false is NOT IN */ > > I'm wondering if the intersection of these two lines implies that > useOr isn't quite the right name here. Perhaps something like > "negated"? I'm not sure I want to go changing that. The whole IN() / NOT IN() behaviour regarding NULLs all seems pretty weird until you mentally replace a IN (1,2,3) with a = 1 OR a = 2 OR a = 3. And for the a NOT IN(1,2,3) case, a <> 1 AND a <> 2 AND a <> 3. People can make a bit more sense of the weirdness of NULLs with NOT IN when they mentally convert their expression like that. I think having that in code is useful too. Any optimisations that are added must match those semantics. > The other thing I got to thinking about was = ALL. It doesn't get > turned into a hash op because the negator of = isn't hashable. I think > it's correct that that's the determining factor, because I can't > imagine what it would mean to hash <>. But...I wanted to confirm I > wasn't missing something. We don't have explicit tests for that case, > but I'm not sure it's necessary either. It's important to think of other cases; I just don't think there's any need to do anything for that one. Remember that we have the restriction of requiring a set of Consts, so for that case to be met, someone would have to write something like: col = ALL('{1,1,1,1,1,1,1,1}'::int[]); I think if anyone comes along complaining that a query containing that is not as fast as they'd like then we might tell them that they should just use: col = 1. A sanity check might not go amiss either. David
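The mental model described above is easy to check directly; a small illustration (not from the thread) showing that the array form and the expanded AND form agree, NULLs included:

    SELECT x,
           x NOT IN (1, 2, 3)           AS not_in_form,
           x <> 1 AND x <> 2 AND x <> 3 AS expanded_form
    FROM (VALUES (1), (4), (NULL::int)) AS v(x);

Both columns come out false for x = 1, true for x = 4, and NULL for x = NULL, and that is the truth table any hashed implementation has to reproduce.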
David Rowley <dgrowleyml@gmail.com> writes: > On Sat, 8 May 2021 at 09:15, James Coleman <jtc331@gmail.com> wrote: >> On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote: >>> I'm a bit undecided if it's safe to set the opfuncid to the negator >>> function. If anything were to set that again based on the opno then >>> it would likely set it to the wrong thing. We can't go changing the >>> opno either because EXPLAIN would display the wrong thing. >> I don't personally see a reason why this is a problem. But I also >> don't know that I have enough knowledge of the codebase to say that >> definitively. > The reason for my concern is that if the opfuncid is set to > InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's > opfuncid to get_opcode(opexpr->opno). I will personally veto any design that involves setting opfuncid to something that doesn't match the opno. That's just horrid, and it will break something somewhere, either immediately or down the road. I don't immediately see why you can't add an "invert" boolean flag to ScalarArrayOpExpr and let the executor machinery deal with this. That'd have the advantage of not having to depend on there being a negator. regards, tom lane
On Fri, May 7, 2021 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > On Sat, 8 May 2021 at 09:15, James Coleman <jtc331@gmail.com> wrote: > >> On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote: > >>> I'm a bit undecided if it's safe to set the opfuncid to the negator > >>> function. If anything were to set that again based on the opno then > >>> it would likely set it to the wrong thing. We can't go changing the > >>> opno either because EXPLAIN would display the wrong thing. > > >> I don't personally see a reason why this is a problem. But I also > >> don't know that I have enough knowledge of the codebase to say that > >> definitively. > > > The reason for my concern is that if the opfuncid is set to > > InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's > > opfuncid to get_opcode(opexpr->opno). > > I will personally veto any design that involves setting opfuncid to > something that doesn't match the opno. That's just horrid, and it > will break something somewhere, either immediately or down the road. This is the kind of "project design" style/policy knowledge I don't have. Thanks. > I don't immediately see why you can't add an "invert" boolean flag to > ScalarArrayOpExpr and let the executor machinery deal with this. That'd > have the advantage of not having to depend on there being a negator. Don't we need to have a negator to be able to look up the proper hash function? At least somewhere in the process you'd have to convert from looking up the <> op to looking up the = op and then setting the "invert" flag. James
On Fri, May 7, 2021 at 8:38 PM David Rowley <dgrowleyml@gmail.com> wrote: > > It's important to think of other cases, I just don't think there's any > need to do anything for that one. Remember that we have the > restriction of requiring a set of Consts, so for that case to be met, > someone would have to write something like: col = > ALL('{1,1,1,1,1,1,1,1}'::int[]); I think if anyone comes along > complaining that a query containing that is not as fast as they'd like > then we might tell them that they should just use: col = 1. A sanity > checkup might not go amiss either. I wasn't concerned with trying to optimize this case (I don't think we can anyway, at least not without adding new work, like de-duplicating the array first). Though I do hope that someday I'll/we'll get around to getting the stable subexpressions caching patch finished, and then this will be able to work for more than constant arrays. I just wanted to confirm we'd thought through the cases we can't handle to ensure we're not accidentally covering them. James
On Sat, 8 May 2021 at 13:37, James Coleman <jtc331@gmail.com> wrote: > > On Fri, May 7, 2021 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I don't immediately see why you can't add an "invert" boolean flag to > > ScalarArrayOpExpr and let the executor machinery deal with this. That'd > > have the advantage of not having to depend on there being a negator. > > Don't we need to have a negator to be able to look up the proper hash > function? At least somewhere in the process you'd have to convert from > looking up the <> op to looking up the = op and then setting the > "invert" flag. Yeah, we *do* need to ensure there's a negator in the planner, as we need to use it during hash probes. It's no good checking whether the hash bucket we landed on matches using the <> operator's function. We won't find many matches that way! I'm not opposed to adding some new field if that's what it takes. I'd imagine the new field will be something like negfuncid, which will be InvalidOid unless the hash function is set and useOr == false. David
On Sat, 8 May 2021 at 14:04, David Rowley <dgrowleyml@gmail.com> wrote: > I'm not opposed to adding some new field if that's what it takes. I'd > imagine the new field will be something like negfuncid which will be > InvalidOid unless the hash function is set and useOr == false Just while this is still swapped into main memory, I've attached a patch that adds a new field to ScalarArrayOpExpr rather than repurposing the existing field. David
Attachment
On Fri, May 7, 2021 at 9:50 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Sat, 8 May 2021 at 14:04, David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm not opposed to adding some new field if that's what it takes. I'd
> > imagine the new field will be something like negfuncid which will be
> > InvalidOid unless the hash function is set and useOr == false
>
> Just while this is still swapped into main memory, I've attached a
> patch that adds a new field to ScalarArrayOpExpr rather than
> repurposing the existing field.
>
> David

Hi,

+ if (!OidIsValid(saop->negfuncid))
+ record_plan_function_dependency(root, saop->hashfuncid);

Is there a typo in the second line ? (root, saop->negfuncid)

Cheers
On Sat, 8 May 2021 at 20:17, Zhihong Yu <zyu@yugabyte.com> wrote: > + if (!OidIsValid(saop->negfuncid)) > + record_plan_function_dependency(root, saop->hashfuncid); > > Is there a typo in the second line ? (root, saop->negfuncid) Yeah, that's a mistake. Thanks for checking it. David
On Sat, 8 May 2021 at 20:29, David Rowley <dgrowleyml@gmail.com> wrote: > > On Sat, 8 May 2021 at 20:17, Zhihong Yu <zyu@yugabyte.com> wrote: > > + if (!OidIsValid(saop->negfuncid)) > > + record_plan_function_dependency(root, saop->hashfuncid); > > > > Is there a typo in the second line ? (root, saop->negfuncid) > > Yeah, that's a mistake. Thanks for checking it. I've attached a patch which fixes the mistake mentioned above. Also, I've dropped the RMT from the thread. I only introduced them when I wanted some input on whether hashing NOT IN should be included in PG14. Nobody seems to think that should be done. David
Attachment
I've been looking at the NOT IN hashing patch again and after making a few minor tweaks I think it's pretty much ready to go. If anyone feels differently, please let me know in the next couple of days. Otherwise, I plan on taking a final look and pushing it soon. David
Attachment
On Tue, 6 Jul 2021 at 22:39, David Rowley <dgrowleyml@gmail.com> wrote: > If anyone feels differently, please let me know in the next couple of > days. Otherwise, I plan on taking a final look and pushing it soon. After doing some very minor adjustments, I pushed this. (29f45e299). Thanks to James and Zhihong for reviewing. David