Thread: Binary search in ScalarArrayOpExpr for OR'd constant arrays

Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
Over in "execExprInterp() questions / How to improve scalar array op
expr eval?" [1] I'd mused about how we might be able to optimized
scalar array ops with OR'd semantics.

This patch implements a binary search for such expressions when the
array argument is a constant so that we can avoid needing to teach
expression execution to cache stable values or know when a param has
changed.

The speed-up for the target case can be pretty impressive: in my
admittedly contrived and relatively unscientific test with a query of
the form:

select count(*) from generate_series(1,100000) n(i) where i in (<1000
random integers in the series>)

shows ~30ms for the patch versus ~640ms on master.
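
The core mechanics are simple: expand and sort the constant array once
up front, then binary search it for each row. A rough standalone sketch
of that idea (not the actual patch code; the names here are made up):

    #include <stdio.h>
    #include <stdlib.h>

    static int
    cmp_int(const void *a, const void *b)
    {
        int     av = *(const int *) a;
        int     bv = *(const int *) b;

        return (av > bv) - (av < bv);
    }

    /* one-time setup, analogous to initializing the expression */
    static void
    saop_setup(int *elems, int nelems)
    {
        qsort(elems, nelems, sizeof(int), cmp_int);
    }

    /* per-row evaluation, analogous to "i IN (...)" with OR semantics */
    static int
    saop_match(const int *elems, int nelems, int key)
    {
        return bsearch(&key, elems, nelems, sizeof(int), cmp_int) != NULL;
    }

    int
    main(void)
    {
        int     elems[] = {42, 7, 19, 3, 88, 61, 5, 27, 14, 99};
        int     nelems = sizeof(elems) / sizeof(elems[0]);
        int     i, matches = 0;

        saop_setup(elems, nelems);
        for (i = 1; i <= 100; i++)      /* stand-in for scanning rows */
            matches += saop_match(elems, nelems, i);
        printf("%d rows matched\n", matches);
        return 0;
    }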

James

[1]: https://www.postgresql.org/message-id/flat/CAAaqYe-UQBba7sScrucDOyHb7cDoNbWf_rcLrOWeD4ikP3_qTQ%40mail.gmail.com

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>Over in "execExprInterp() questions / How to improve scalar array op
>expr eval?" [1] I'd mused about how we might be able to optimized
>scalar array ops with OR'd semantics.
>
>This patch implements a binary search for such expressions when the
>array argument is a constant so that we can avoid needing to teach
>expression execution to cache stable values or know when a param has
>changed.
>
>The speed-up for the target case can pretty impressive: in my
>admittedly contrived and relatively unscientific test with a query in
>the form:
>
>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>random integers in the series>)
>
>shows ~30ms for the patch versus ~640ms on master.
>

Nice improvement, although 1000 items is probably a bit unusual. The
threshold used in the patch (9 elements) seems a bit too low - what
results have you seen with smaller arrays?

Another idea - would a bloom filter be useful here, as a second
optimization? That is, for large arrays build a small bloom filter,
allowing us to skip even the binary search.
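
Roughly something like this - a minimal sketch of the kind of filter I
mean, two cheap hash functions over a small bit array (not meant as an
actual implementation, just to illustrate the idea):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOOM_BITS 1024         /* tiny filter, power of two */

    typedef struct BloomFilter
    {
        uint8_t bits[BLOOM_BITS / 8];
    } BloomFilter;

    /* two cheap integer mixers; real code would use proper hash functions */
    static uint32_t
    hash1(uint32_t x)
    {
        x ^= x >> 16;
        x *= 0x85ebca6b;
        return x ^ (x >> 13);
    }

    static uint32_t
    hash2(uint32_t x)
    {
        x *= 0xc2b2ae35;
        return x ^ (x >> 16);
    }

    static void
    bloom_add(BloomFilter *bf, uint32_t key)
    {
        uint32_t a = hash1(key) % BLOOM_BITS;
        uint32_t b = hash2(key) % BLOOM_BITS;

        bf->bits[a / 8] |= 1 << (a % 8);
        bf->bits[b / 8] |= 1 << (b % 8);
    }

    /* false means "definitely not in the array", so skip the binary search */
    static int
    bloom_maybe_contains(const BloomFilter *bf, uint32_t key)
    {
        uint32_t a = hash1(key) % BLOOM_BITS;
        uint32_t b = hash2(key) % BLOOM_BITS;

        return (bf->bits[a / 8] & (1 << (a % 8))) &&
               (bf->bits[b / 8] & (1 << (b % 8)));
    }

    int
    main(void)
    {
        BloomFilter bf;
        uint32_t elems[] = {42, 7, 19, 3, 88};
        int      i;

        memset(&bf, 0, sizeof(bf));
        for (i = 0; i < 5; i++)
            bloom_add(&bf, elems[i]);
        printf("19 -> %d, 20 -> %d\n",
               bloom_maybe_contains(&bf, 19),
               bloom_maybe_contains(&bf, 20));
        return 0;
    }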


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >Over in "execExprInterp() questions / How to improve scalar array op
> >expr eval?" [1] I'd mused about how we might be able to optimized
> >scalar array ops with OR'd semantics.
> >
> >This patch implements a binary search for such expressions when the
> >array argument is a constant so that we can avoid needing to teach
> >expression execution to cache stable values or know when a param has
> >changed.
> >
> >The speed-up for the target case can pretty impressive: in my
> >admittedly contrived and relatively unscientific test with a query in
> >the form:
> >
> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >random integers in the series>)
> >
> >shows ~30ms for the patch versus ~640ms on master.
> >
>
> Nice improvement, although 1000 items is probably a bit unusual. The
> threshold used in the patch (9 elements) seems a bit too low - what
> results have you seen with smaller arrays?

At least in our systems we regularly work with batches of 1000 items,
which means you get IN clauses of identifiers of that size. Admittedly
the most common case sees those IN clauses as simple index scans
(e.g., WHERE <primary key> IN (...)), but it's also common to have a
broader query that merely filters additionally on something like "...
AND <some foreign key> IN (...)" where it makes sense for the rest of
the quals to take precedence in generating a reasonable plan. In that
case, the IN becomes a regular filter, hence the idea behind the
patch.

Side note: I'd love for us to be able to treat "IN (VALUES)" the same
way...but as noted in the other thread that's an extremely large
amount of work, I think. But similarly you could use a hash here
instead of a binary search...but this seems quite good.

As to the choice of 9 elements: I just picked that as a starting
point; Andres had previously commented off hand that at 8 elements
serial scanning was faster, so I figured this was a reasonable
starting point for discussion.

Perhaps it would make sense to determine that minimum not as a pure
constant but scale it based on how many rows the planner expects us to
see? Of course that'd be a more invasive patch...so it may or may not be
as feasible as a reasonable default.

> Another idea - would a bloom filter be useful here, as a second
> optimization? That is, for large arrays build s small bloom filter,
> allowing us to skip even the binary search.

That's an interesting idea. I actually haven't personally worked with
bloom filters, so didn't think about that.

Are you thinking that you'd also build the filter *and* presort the
array? Or try to get away with using only the bloom filter and not
expanding and sorting the array at all?

Thanks,
James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>> >Over in "execExprInterp() questions / How to improve scalar array op
>> >expr eval?" [1] I'd mused about how we might be able to optimized
>> >scalar array ops with OR'd semantics.
>> >
>> >This patch implements a binary search for such expressions when the
>> >array argument is a constant so that we can avoid needing to teach
>> >expression execution to cache stable values or know when a param has
>> >changed.
>> >
>> >The speed-up for the target case can pretty impressive: in my
>> >admittedly contrived and relatively unscientific test with a query in
>> >the form:
>> >
>> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
>> >random integers in the series>)
>> >
>> >shows ~30ms for the patch versus ~640ms on master.
>> >
>>
>> Nice improvement, although 1000 items is probably a bit unusual. The
>> threshold used in the patch (9 elements) seems a bit too low - what
>> results have you seen with smaller arrays?
>
>At least in our systems we regularly work with 1000 batches of items,
>which means you get IN clauses of identifiers of that size. Admittedly
>the most common case sees those IN clauses as simple index scans
>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>broader query that merely filters additionally on something like "...
>AND <some foreign key> IN (...)" where it makes sense for the rest of
>the quals to take precedence in generating a reasonable plan. In that
>case, the IN becomes a regular filter, hence the idea behind the
>patch.
>
>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>way...but as noted in the other thread that's an extremely large
>amount of work, I think. But similarly you could use a hash here
>instead of a binary search...but this seems quite good.
>
>As to the choice of 9 elements: I just picked that as a starting
>point; Andres had previously commented off hand that at 8 elements
>serial scanning was faster, so I figured this was a reasonable
>starting point for discussion.
>
>Perhaps it would make sense to determine that minimum not as a pure
>constant but scaled based on how many rows the planner expects us to
>see? Of course that'd be a more invasive patch...so may or may not be
>as feasible as a reasonable default.
>

Not sure. That seems a bit overcomplicated and I don't think it depends
on the number of rows the planner expects to see very much. I think we
usually assume the linear search is cheaper for small arrays and then at
some point the binary search starts winning. The question is where this
"break even" point is.

I think we usually use something like 64 or so in other places, but
maybe I'm wrong. The current limit of 9 seems a bit too low, though.
Let's not obsess about this too much - let's do some experiments
and pick a value based on that.


>> Another idea - would a bloom filter be useful here, as a second
>> optimization? That is, for large arrays build s small bloom filter,
>> allowing us to skip even the binary search.
>
>That's an interesting idea. I actually haven't personally worked with
>bloom filters, so didn't think about that.
>
>Are you thinking that you'd also build the filter *and* presort the
>array? Or try to get away with using only the bloom filter and not
>expanding and sorting the array at all?
>

Yeah, something like that. My intuition is the bloom filter is useful
only above some number of items, and that number is higher than for the
binary search. So we'd end up with two thresholds: the first one enabling
binary search, the second one enabling the bloom filter.

Of course, the "unknown" variable here is how often we actually find the
value in the array. If 100% of the queries have a match, then the bloom
filter is a waste of time. If there are no matches, it can make a
significant difference.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Fri, 24 Apr 2020 at 02:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >As to the choice of 9 elements: I just picked that as a starting
> >point; Andres had previously commented off hand that at 8 elements
> >serial scanning was faster, so I figured this was a reasonable
> >starting point for discussion.
> >
> >Perhaps it would make sense to determine that minimum not as a pure
> >constant but scaled based on how many rows the planner expects us to
> >see? Of course that'd be a more invasive patch...so may or may not be
> >as feasible as a reasonable default.
> >
>
> Not sure. That seems a bit overcomplicated and I don't think it depends
> on the number of rows the planner expects to see very much. I think we
> usually assume the linear search is cheaper for small arrays and then at
> some point the binary search starts winning The question is where this
> "break even" point is.
>
> I think we usually use something like 64 or so in other places, but
> maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
> wrong. Let's not obsess about this too much, let's do some experiments
> and pick a value based on that.

If a single comparison for a binary search costs about the same as an
equality check, then wouldn't the crossover point be much lower than
64? The binary search should find or not find the target in log2(N)
steps rather than N.  ceil(log2(9)) is 4, which is of course less than 9.
For 64, it's 6, so aren't you just doing up to 58 more equality
checks than necessary?  Of course, it's a bit more complex, as for
values that *are* in the array the linear search will, on average,
only check half the values. Assuming that, 9 does not seem too
far off.  I guess benchmarks at various crossover points would speak a
thousand words.
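
For what it's worth, a trivial program comparing the raw probe counts
(ignoring constant factors) gives:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        int     sizes[] = {8, 9, 16, 64, 1000};
        int     i;

        for (i = 0; i < 5; i++)
        {
            int     n = sizes[i];

            /* linear: ~n/2 probes on a hit, n on a miss; binary: ~ceil(log2(n)) */
            printf("n=%4d  linear(hit)=%5.1f  linear(miss)=%4d  binary=%2.0f\n",
                   n, n / 2.0, n, ceil(log2(n)));
        }
        return 0;
    }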

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Andres Freund
Date:
Hi,

On 2020-04-24 10:09:36 +1200, David Rowley wrote:
> If single comparison for a binary search costs about the same as an
> equality check, then wouldn't the crossover point be much lower than
> 64?

The costs aren't quite as simple as that though. Binary search usually
has issues with cache misses: in contrast to linear accesses, each step
will be a cache miss, as the address is not predictable; and even if the
CPU couldn't predict accesses in the linear search case, often multiple
entries fit on a single cache line. Additionally, out-of-order execution
is usually a lot more effective for linear searches (e.g. the next
elements can be compared before the current one is finished, if that's
what the branch predictor says is likely).

Greetings,

Andres Freund



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >> >Over in "execExprInterp() questions / How to improve scalar array op
> >> >expr eval?" [1] I'd mused about how we might be able to optimized
> >> >scalar array ops with OR'd semantics.
> >> >
> >> >This patch implements a binary search for such expressions when the
> >> >array argument is a constant so that we can avoid needing to teach
> >> >expression execution to cache stable values or know when a param has
> >> >changed.
> >> >
> >> >The speed-up for the target case can pretty impressive: in my
> >> >admittedly contrived and relatively unscientific test with a query in
> >> >the form:
> >> >
> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >> >random integers in the series>)
> >> >
> >> >shows ~30ms for the patch versus ~640ms on master.
> >> >
> >>
> >> Nice improvement, although 1000 items is probably a bit unusual. The
> >> threshold used in the patch (9 elements) seems a bit too low - what
> >> results have you seen with smaller arrays?
> >
> >At least in our systems we regularly work with 1000 batches of items,
> >which means you get IN clauses of identifiers of that size. Admittedly
> >the most common case sees those IN clauses as simple index scans
> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
> >broader query that merely filters additionally on something like "...
> >AND <some foreign key> IN (...)" where it makes sense for the rest of
> >the quals to take precedence in generating a reasonable plan. In that
> >case, the IN becomes a regular filter, hence the idea behind the
> >patch.
> >
> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
> >way...but as noted in the other thread that's an extremely large
> >amount of work, I think. But similarly you could use a hash here
> >instead of a binary search...but this seems quite good.
> >
> >As to the choice of 9 elements: I just picked that as a starting
> >point; Andres had previously commented off hand that at 8 elements
> >serial scanning was faster, so I figured this was a reasonable
> >starting point for discussion.
> >
> >Perhaps it would make sense to determine that minimum not as a pure
> >constant but scaled based on how many rows the planner expects us to
> >see? Of course that'd be a more invasive patch...so may or may not be
> >as feasible as a reasonable default.
> >
>
> Not sure. That seems a bit overcomplicated and I don't think it depends
> on the number of rows the planner expects to see very much. I think we
> usually assume the linear search is cheaper for small arrays and then at
> some point the binary search starts winning The question is where this
> "break even" point is.

Well, since it has to do preprocessing work (expanding the array and
then sorting it), the number of rows processed matters, right?
For example, doing a linear search on 1000 items only once is going to
be cheaper than preprocessing the array and then doing a binary
search, but with a very large row count the binary search +
preprocessing might very well win out even for only a 10-element array.

I'm not trying to argue for more work for myself here: I think the
optimization is worth it on its own, and something like this could be
a further improvement on its own. But it is interesting to think
about.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote:
>On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>> >> >Over in "execExprInterp() questions / How to improve scalar array op
>> >> >expr eval?" [1] I'd mused about how we might be able to optimized
>> >> >scalar array ops with OR'd semantics.
>> >> >
>> >> >This patch implements a binary search for such expressions when the
>> >> >array argument is a constant so that we can avoid needing to teach
>> >> >expression execution to cache stable values or know when a param has
>> >> >changed.
>> >> >
>> >> >The speed-up for the target case can pretty impressive: in my
>> >> >admittedly contrived and relatively unscientific test with a query in
>> >> >the form:
>> >> >
>> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
>> >> >random integers in the series>)
>> >> >
>> >> >shows ~30ms for the patch versus ~640ms on master.
>> >> >
>> >>
>> >> Nice improvement, although 1000 items is probably a bit unusual. The
>> >> threshold used in the patch (9 elements) seems a bit too low - what
>> >> results have you seen with smaller arrays?
>> >
>> >At least in our systems we regularly work with 1000 batches of items,
>> >which means you get IN clauses of identifiers of that size. Admittedly
>> >the most common case sees those IN clauses as simple index scans
>> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>> >broader query that merely filters additionally on something like "...
>> >AND <some foreign key> IN (...)" where it makes sense for the rest of
>> >the quals to take precedence in generating a reasonable plan. In that
>> >case, the IN becomes a regular filter, hence the idea behind the
>> >patch.
>> >
>> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>> >way...but as noted in the other thread that's an extremely large
>> >amount of work, I think. But similarly you could use a hash here
>> >instead of a binary search...but this seems quite good.
>> >
>> >As to the choice of 9 elements: I just picked that as a starting
>> >point; Andres had previously commented off hand that at 8 elements
>> >serial scanning was faster, so I figured this was a reasonable
>> >starting point for discussion.
>> >
>> >Perhaps it would make sense to determine that minimum not as a pure
>> >constant but scaled based on how many rows the planner expects us to
>> >see? Of course that'd be a more invasive patch...so may or may not be
>> >as feasible as a reasonable default.
>> >
>>
>> Not sure. That seems a bit overcomplicated and I don't think it depends
>> on the number of rows the planner expects to see very much. I think we
>> usually assume the linear search is cheaper for small arrays and then at
>> some point the binary search starts winning The question is where this
>> "break even" point is.
>
>Well since it has to do preprocessing work (expanding the array and
>then sorting it), then the number of rows processed matters, right?
>For example, doing a linear search on 1000 items only once is going to
>be cheaper than preprocessing the array and then doing a binary
>search, but only a very large row count the binary search +
>preprocessing might very well win out for only a 10 element array.
>

Hmmm, good point. Essentially the initialization (sorting of the array)
has some cost, and the question is how much extra per-tuple cost this
adds. It's probably not worth it for a single lookup, but for many
lookups it's probably OK. Let's see if I can do the math right:

   N - number of lookups
   K - number of array elements

Cost to sort the array is

    O(K * log(K)) = C1 * K * log(K)

and the cost of a lookup is C2 * log(K), so with the extra cost amortized
for N lookups, the total "per lookup" cost is

    C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)

We need to compare this to the O(K) cost of simple linear search, and
the question is at which point the linear search gets more expensive:

    C3 * K = log(K) * (C1 * K / N + C2)

I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if
there's a matching item we find it half-way through on average, and if
there is not we have to walk the whole array. So let's say it's 1.

C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for
random pivot choice IIRC, and C2 is probably similar. With both values
being ~1.5 we get this:

    K = log(K) * (1.5 * K/N + 1.5)

for a fixed K, we get this formula for N:

    N = log(K) * 1.5 * K / (K - 1.5 * log(K))

and for a bunch of K values the results look like this:

         K |     N
    -------|-------
         1 |     0
        10 |  5.27
       100 |  7.42
      1000 | 10.47
     10000 | 13.83
    100000 | 17.27

i.e. the binary search with 10k values starts winning over linear search
with just ~13 lookups.

(Assuming I haven't made some silly mistake in the math ...)
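
(A trivial program plugging the K values into the same formula - natural
log, C1 = C2 = 1.5 - reproduces the table, for anyone who wants to check:)

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  ks[] = {10, 100, 1000, 10000, 100000};
        int     i;

        for (i = 0; i < 5; i++)
        {
            double  K = ks[i];
            double  N = log(K) * 1.5 * K / (K - 1.5 * log(K));

            printf("%8.0f | %6.2f\n", K, N);
        }
        return 0;
    }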

Obviously, this only accounts for cost of comparisons and neglects e.g.
the indirect costs for less predictable memory access patterns mentioned
by Andres in his response.

But I think it still shows the number of lookups needed for the binary
search to be a win is pretty low - at least for a reasonable number of
values in the array. Maybe it's 20 and not 10, but I don't think that
changes much.

The other question is if we can get N at all and how reliable the value
is. We can probably get the number of rows, but that will ignore other
conditions that may eliminate the row before the binary search.

>I'm not trying to argue for more work for myself here: I think the
>optimization is worth it on its own, and something like this could be
>a further improvement on its own. But it is interesting to think
>about.
>

I don't know. Clearly, if the user sends a query with 10k values and
only does a single lookup, that won't win. And if we can reasonably and
reliably protect against that, I wouldn't mind doing that, although it
means a risk of not using the bin search in case of underestimates etc.

I don't have any hard data about this, but I think we can assume the
number of rows processed by the clause is (much) higher than the number
of keys in it. If you have a clause with 10k values, then you probably
expect it to be applied to many rows, far more than the "break even"
point of about 10-20 rows ...

So I wouldn't worry about this too much.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote:
>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>><tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>>>>Over in "execExprInterp() questions / How to improve scalar array op
>>>>expr eval?" [1] I'd mused about how we might be able to optimized
>>>>scalar array ops with OR'd semantics.
>>>>
>>>>This patch implements a binary search for such expressions when the
>>>>array argument is a constant so that we can avoid needing to teach
>>>>expression execution to cache stable values or know when a param has
>>>>changed.
>>>>
>>>>The speed-up for the target case can pretty impressive: in my
>>>>admittedly contrived and relatively unscientific test with a query in
>>>>the form:
>>>>
>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>>>>random integers in the series>)
>>>>
>>>>shows ~30ms for the patch versus ~640ms on master.
>>>>
>>>
>>>Nice improvement, although 1000 items is probably a bit unusual. The
>>>threshold used in the patch (9 elements) seems a bit too low - what
>>>results have you seen with smaller arrays?
>>
>>At least in our systems we regularly work with 1000 batches of items,
>>which means you get IN clauses of identifiers of that size. Admittedly
>>the most common case sees those IN clauses as simple index scans
>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>>broader query that merely filters additionally on something like "...
>>AND <some foreign key> IN (...)" where it makes sense for the rest of
>>the quals to take precedence in generating a reasonable plan. In that
>>case, the IN becomes a regular filter, hence the idea behind the
>>patch.
>>
>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>>way...but as noted in the other thread that's an extremely large
>>amount of work, I think. But similarly you could use a hash here
>>instead of a binary search...but this seems quite good.
>>
>>As to the choice of 9 elements: I just picked that as a starting
>>point; Andres had previously commented off hand that at 8 elements
>>serial scanning was faster, so I figured this was a reasonable
>>starting point for discussion.
>>
>>Perhaps it would make sense to determine that minimum not as a pure
>>constant but scaled based on how many rows the planner expects us to
>>see? Of course that'd be a more invasive patch...so may or may not be
>>as feasible as a reasonable default.
>>
>
>Not sure. That seems a bit overcomplicated and I don't think it depends
>on the number of rows the planner expects to see very much. I think we
>usually assume the linear search is cheaper for small arrays and then at
>some point the binary search starts winning The question is where this
>"break even" point is.
>
>I think we usually use something like 64 or so in other places, but
>maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
>wrong. Let's not obsess about this too much, let's do some experiments
>and pick a value based on that.
>
>
>>>Another idea - would a bloom filter be useful here, as a second
>>>optimization? That is, for large arrays build s small bloom filter,
>>>allowing us to skip even the binary search.
>>
>>That's an interesting idea. I actually haven't personally worked with
>>bloom filters, so didn't think about that.
>>
>>Are you thinking that you'd also build the filter *and* presort the
>>array? Or try to get away with using only the bloom filter and not
>>expanding and sorting the array at all?
>>
>
>Yeah, something like that. My intuition is the bloom filter is useful
>only above some number of items, and the number is higher than for the
>binary search. So we'd end up with two thresholds, first one enabling
>binary search, the second one enabling bloom filter.
>
>Of course, the "unknown" variable here is how often we actually find the
>value in the array. If 100% of the queries has a match, then the bloom
>filter is a waste of time. If there are no matches, it can make a
>significant difference.
>

I did experiment with this a bit, both to get a bit more familiar
with this code and to see if the bloom filter might help. The short
answer is the bloom filter does not seem to help at all, so I wouldn't
bother about it too much.

Attached is an updated patch series, the script I used to collect some
performance measurements, and a spreadsheet with the results. The patch
series is broken into four parts:

   0001 - the original patch with binary search
   0002 - adds GUCs to enable bin search / tweak threshold
   0003 - allows to use bloom filter + binary search
   0004 - try using murmurhash

The test script runs a wide range of queries with different numbers
of lookups, keys in the array, and match probabilities (i.e. the fraction
of lookups that find a match), ranging from 1% to 100%. And of course, it
runs this with the binsearch/bloom either enabled or disabled (the
threshold was lowered to 1, so it's the on/off GUCs that determine
whether the binsearch/bloom is used).

The results are summarized in the spreadsheet, demonstrating how useless
the bloom filter is. There's not a single case where it would beat the
binary search. I believe this is because the access to the bloom filter is
random (determined by the hash function) and we don't save much compared
to the log(K) lookups in the sorted array.

That makes sense; I think bloom filters are meant to be used in
cases where the main data doesn't fit into memory - which is not the case
here. But I wonder how this would change for cases with more expensive
comparisons - this was using just integers, so maybe strings would
result in different behavior.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Fri, Apr 24, 2020 at 5:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >> >> >Over in "execExprInterp() questions / How to improve scalar array op
> >> >> >expr eval?" [1] I'd mused about how we might be able to optimized
> >> >> >scalar array ops with OR'd semantics.
> >> >> >
> >> >> >This patch implements a binary search for such expressions when the
> >> >> >array argument is a constant so that we can avoid needing to teach
> >> >> >expression execution to cache stable values or know when a param has
> >> >> >changed.
> >> >> >
> >> >> >The speed-up for the target case can pretty impressive: in my
> >> >> >admittedly contrived and relatively unscientific test with a query in
> >> >> >the form:
> >> >> >
> >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >> >> >random integers in the series>)
> >> >> >
> >> >> >shows ~30ms for the patch versus ~640ms on master.
> >> >> >
> >> >>
> >> >> Nice improvement, although 1000 items is probably a bit unusual. The
> >> >> threshold used in the patch (9 elements) seems a bit too low - what
> >> >> results have you seen with smaller arrays?
> >> >
> >> >At least in our systems we regularly work with 1000 batches of items,
> >> >which means you get IN clauses of identifiers of that size. Admittedly
> >> >the most common case sees those IN clauses as simple index scans
> >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
> >> >broader query that merely filters additionally on something like "...
> >> >AND <some foreign key> IN (...)" where it makes sense for the rest of
> >> >the quals to take precedence in generating a reasonable plan. In that
> >> >case, the IN becomes a regular filter, hence the idea behind the
> >> >patch.
> >> >
> >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
> >> >way...but as noted in the other thread that's an extremely large
> >> >amount of work, I think. But similarly you could use a hash here
> >> >instead of a binary search...but this seems quite good.
> >> >
> >> >As to the choice of 9 elements: I just picked that as a starting
> >> >point; Andres had previously commented off hand that at 8 elements
> >> >serial scanning was faster, so I figured this was a reasonable
> >> >starting point for discussion.
> >> >
> >> >Perhaps it would make sense to determine that minimum not as a pure
> >> >constant but scaled based on how many rows the planner expects us to
> >> >see? Of course that'd be a more invasive patch...so may or may not be
> >> >as feasible as a reasonable default.
> >> >
> >>
> >> Not sure. That seems a bit overcomplicated and I don't think it depends
> >> on the number of rows the planner expects to see very much. I think we
> >> usually assume the linear search is cheaper for small arrays and then at
> >> some point the binary search starts winning The question is where this
> >> "break even" point is.
> >
> >Well since it has to do preprocessing work (expanding the array and
> >then sorting it), then the number of rows processed matters, right?
> >For example, doing a linear search on 1000 items only once is going to
> >be cheaper than preprocessing the array and then doing a binary
> >search, but only a very large row count the binary search +
> >preprocessing might very well win out for only a 10 element array.
> >
>
> Hmmm, good point. Essentially the initialization (sorting of the array)
> has some cost, and the question is how much extra per-tuple cost this
> adds. It's probably not worth it for a single lookup, but for many
> lookups it's probably OK. Let's see if I can do the math right:
>
>    N - number of lookups
>    K - number of array elements
>
> Cost to sort the array is
>
>     O(K * log(K)) = C1 * K * log(K)
>
> and the cost of a lookup is C2 * log(K), so with the extra cost amortized
> for N lookups, the total "per lookup" cost is
>
>     C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)
>
> We need to compare this to the O(K) cost of simple linear search, and
> the question is at which point the linear search gets more expensive:
>
>     C3 * K = log(K) * (C1 * K / N + C2)
>
> I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if
> there's a matching item we find it half-way through on average, and if
> there is not we have to walk the whole array. So let's say it's 1.
>
> C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for
> random pivot choice IIRC, and C2 is probably similar. With both values
> being ~1.5 we get this:
>
>     K = log(K) * (1.5 * K/N + 1.5)
>
> for a fixed K, we get this formula for N:
>
>     N = log(K) * 1.5 * K / (K - 1.5 * log(K))
>
> and for a bunch of K values the results look like this:
>
>          K |     N
>     -------|-------
>          1 |     0
>         10 |  5.27
>        100 |  7.42
>       1000 | 10.47
>      10000 | 13.83
>     100000 | 17.27
>
> i.e. the binary search with 10k values starts winning over linear search
> with just ~13 lookups.
>
> (Assuming I haven't made some silly mistake in the math ...)
>
> Obviously, this only accounts for cost of comparisons and neglects e.g.
> the indirect costs for less predictable memory access patterns mentioned
> by Andres in his response.
>
> But I think it still shows the number of lookups needed for the binary
> search to be a win is pretty low - at least for reasonable number of
> values in the array. Maybe it's 20 and not 10, but I don't think that
> changes much.
>
> The other question is if we can get N at all and how reliable the value
> is. We can probably get the number of rows, but that will ignore other
> conditions that may eliminate the row before the binary search.
>
> >I'm not trying to argue for more work for myself here: I think the
> >optimization is worth it on its own, and something like this could be
> >a further improvement on its own. But it is interesting to think
> >about.
> >
>
> I don't know. Clearly, if the user sends a query with 10k values and
> only does a single lookup, that won't win. And if we can reasonably and
> reliably protect against that, I wouldn't mind doing that, although it
> means a risk of not using the bin search in case of underestimates etc.
>
> I don't have any hard data about this, but I think we can assume the
> number of rows processed by the clause is (much) higher than the number
> of keys in it. If you have a clause with 10k values, then you probably
> expect it to be applied to many rows, far more than the "beak even"
> point of about 10-20 rows ...
>
> So I wouldn't worry about this too much.

Yeah. I think it becomes a lot more interesting in the future if/when
we end up with a way to use this with params and not just constant
arrays. Then the "group" size would matter a whole lot more.

For now, the constant amount of overhead is quite small, so even if we
only execute it once we won't make the query that much worse (or, at
least, the total query time will still be very small). Also, because
it's only applied to constants, there's a natural limit to how much
overhead we're likely to introduce into a query.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Sat, Apr 25, 2020 at 12:21:06AM +0200, Tomas Vondra wrote:
>On Thu, Apr 23, 2020 at 04:55:51PM +0200, Tomas Vondra wrote:
>>On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>>>On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>>><tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>>On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>>>>>Over in "execExprInterp() questions / How to improve scalar array op
>>>>>expr eval?" [1] I'd mused about how we might be able to optimized
>>>>>scalar array ops with OR'd semantics.
>>>>>
>>>>>This patch implements a binary search for such expressions when the
>>>>>array argument is a constant so that we can avoid needing to teach
>>>>>expression execution to cache stable values or know when a param has
>>>>>changed.
>>>>>
>>>>>The speed-up for the target case can pretty impressive: in my
>>>>>admittedly contrived and relatively unscientific test with a query in
>>>>>the form:
>>>>>
>>>>>select count(*) from generate_series(1,100000) n(i) where i in (<1000
>>>>>random integers in the series>)
>>>>>
>>>>>shows ~30ms for the patch versus ~640ms on master.
>>>>>
>>>>
>>>>Nice improvement, although 1000 items is probably a bit unusual. The
>>>>threshold used in the patch (9 elements) seems a bit too low - what
>>>>results have you seen with smaller arrays?
>>>
>>>At least in our systems we regularly work with 1000 batches of items,
>>>which means you get IN clauses of identifiers of that size. Admittedly
>>>the most common case sees those IN clauses as simple index scans
>>>(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>>>broader query that merely filters additionally on something like "...
>>>AND <some foreign key> IN (...)" where it makes sense for the rest of
>>>the quals to take precedence in generating a reasonable plan. In that
>>>case, the IN becomes a regular filter, hence the idea behind the
>>>patch.
>>>
>>>Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>>>way...but as noted in the other thread that's an extremely large
>>>amount of work, I think. But similarly you could use a hash here
>>>instead of a binary search...but this seems quite good.
>>>
>>>As to the choice of 9 elements: I just picked that as a starting
>>>point; Andres had previously commented off hand that at 8 elements
>>>serial scanning was faster, so I figured this was a reasonable
>>>starting point for discussion.
>>>
>>>Perhaps it would make sense to determine that minimum not as a pure
>>>constant but scaled based on how many rows the planner expects us to
>>>see? Of course that'd be a more invasive patch...so may or may not be
>>>as feasible as a reasonable default.
>>>
>>
>>Not sure. That seems a bit overcomplicated and I don't think it depends
>>on the number of rows the planner expects to see very much. I think we
>>usually assume the linear search is cheaper for small arrays and then at
>>some point the binary search starts winning The question is where this
>>"break even" point is.
>>
>>I think we usually use something like 64 or so in other places, but
>>maybe I'm wrong. The current limit 9 seems a bit too low, but I may be
>>wrong. Let's not obsess about this too much, let's do some experiments
>>and pick a value based on that.
>>
>>
>>>>Another idea - would a bloom filter be useful here, as a second
>>>>optimization? That is, for large arrays build s small bloom filter,
>>>>allowing us to skip even the binary search.
>>>
>>>That's an interesting idea. I actually haven't personally worked with
>>>bloom filters, so didn't think about that.
>>>
>>>Are you thinking that you'd also build the filter *and* presort the
>>>array? Or try to get away with using only the bloom filter and not
>>>expanding and sorting the array at all?
>>>
>>
>>Yeah, something like that. My intuition is the bloom filter is useful
>>only above some number of items, and the number is higher than for the
>>binary search. So we'd end up with two thresholds, first one enabling
>>binary search, the second one enabling bloom filter.
>>
>>Of course, the "unknown" variable here is how often we actually find the
>>value in the array. If 100% of the queries has a match, then the bloom
>>filter is a waste of time. If there are no matches, it can make a
>>significant difference.
>>
>
>I did experiment with this is a bit, both to get a bit more familiar
>with this code and to see if the bloom filter might help. The short
>answer is the bloom filter does not seem to help at all, so I wouldn't
>bother about it too much.
>
>Attacched is an updated patch series and, script I used to collect some
>performance measurements, and a spreadsheet with results. The patch
>series is broken into four parts:
>
>  0001 - the original patch with binary search
>  0002 - adds GUCs to enable bin search / tweak threshold
>  0003 - allows to use bloom filter + binary search
>  0004 - try using murmurhash
>
>The test script runs a wide range of queries with different number
>of lookups, keys in the array, match probability (i.e. fraction of
>lookups that find a match) ranging from 1% to 100%. And of course, it
>runs this with the binsearch/bloom either enabled or disabled (the
>threshold was lowered to 1, so it's the on/off GUCs that determine
>whether the binsearch/bloom is used).
>
>The results are summarized in the spreadsheet, demonstrating how useless
>the bloom filter is. There's not a single case where it would beat the
>binary search. I believe this is because theaccess to bloom filter is
>random (determined by the hash function) and we don't save much compared
>to the log(K) lookups in the sorted array.
>
>That makes sense, I think the bloom filters are meant to be used in
>cases when the main data don't fit into memory - which is not the case
>here. But I wonder how would this change for cases with more expensive
>comparisons - this was using just integers, so maybe strings would
>result in different behavior.

OK, I tried the same test with text columns (with md5 strings), and the
results are about as I predicted - the bloom filter actually makes a
difference in this case. Depending on the number of lookups and
selectivity (i.e. how many lookups have a match in the array) it can
mean an additional speedup of up to ~5x compared to binary search alone.

For the case with 100% selectivity (i.e. all rows have a match) this
can't really save any time - it's usually still much faster than master,
but it's a bit slower than binary search.

So I think this might be worth investigating further, once the simple
binary search gets committed. We'll probably need to factor in the cost
of the comparison (higher cost -> BF more useful), selectivity of the
filter (fewer matches -> BF more useful) and number of lookups.

This reminds me of our attempts to add bloom filters to hash joins, which
I think ran into mostly the same challenge of deciding when the bloom
filter can be useful and is worth the extra work.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Fri, Apr 24, 2020 at 09:22:34PM -0400, James Coleman wrote:
>On Fri, Apr 24, 2020 at 5:55 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Apr 24, 2020 at 09:38:54AM -0400, James Coleman wrote:
>> >On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
>> >> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
>> >> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> >>
>> >> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
>> >> >> >Over in "execExprInterp() questions / How to improve scalar array op
>> >> >> >expr eval?" [1] I'd mused about how we might be able to optimized
>> >> >> >scalar array ops with OR'd semantics.
>> >> >> >
>> >> >> >This patch implements a binary search for such expressions when the
>> >> >> >array argument is a constant so that we can avoid needing to teach
>> >> >> >expression execution to cache stable values or know when a param has
>> >> >> >changed.
>> >> >> >
>> >> >> >The speed-up for the target case can pretty impressive: in my
>> >> >> >admittedly contrived and relatively unscientific test with a query in
>> >> >> >the form:
>> >> >> >
>> >> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
>> >> >> >random integers in the series>)
>> >> >> >
>> >> >> >shows ~30ms for the patch versus ~640ms on master.
>> >> >> >
>> >> >>
>> >> >> Nice improvement, although 1000 items is probably a bit unusual. The
>> >> >> threshold used in the patch (9 elements) seems a bit too low - what
>> >> >> results have you seen with smaller arrays?
>> >> >
>> >> >At least in our systems we regularly work with 1000 batches of items,
>> >> >which means you get IN clauses of identifiers of that size. Admittedly
>> >> >the most common case sees those IN clauses as simple index scans
>> >> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
>> >> >broader query that merely filters additionally on something like "...
>> >> >AND <some foreign key> IN (...)" where it makes sense for the rest of
>> >> >the quals to take precedence in generating a reasonable plan. In that
>> >> >case, the IN becomes a regular filter, hence the idea behind the
>> >> >patch.
>> >> >
>> >> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
>> >> >way...but as noted in the other thread that's an extremely large
>> >> >amount of work, I think. But similarly you could use a hash here
>> >> >instead of a binary search...but this seems quite good.
>> >> >
>> >> >As to the choice of 9 elements: I just picked that as a starting
>> >> >point; Andres had previously commented off hand that at 8 elements
>> >> >serial scanning was faster, so I figured this was a reasonable
>> >> >starting point for discussion.
>> >> >
>> >> >Perhaps it would make sense to determine that minimum not as a pure
>> >> >constant but scaled based on how many rows the planner expects us to
>> >> >see? Of course that'd be a more invasive patch...so may or may not be
>> >> >as feasible as a reasonable default.
>> >> >
>> >>
>> >> Not sure. That seems a bit overcomplicated and I don't think it depends
>> >> on the number of rows the planner expects to see very much. I think we
>> >> usually assume the linear search is cheaper for small arrays and then at
>> >> some point the binary search starts winning The question is where this
>> >> "break even" point is.
>> >
>> >Well since it has to do preprocessing work (expanding the array and
>> >then sorting it), then the number of rows processed matters, right?
>> >For example, doing a linear search on 1000 items only once is going to
>> >be cheaper than preprocessing the array and then doing a binary
>> >search, but only a very large row count the binary search +
>> >preprocessing might very well win out for only a 10 element array.
>> >
>>
>> Hmmm, good point. Essentially the initialization (sorting of the array)
>> has some cost, and the question is how much extra per-tuple cost this
>> adds. It's probably not worth it for a single lookup, but for many
>> lookups it's probably OK. Let's see if I can do the math right:
>>
>>    N - number of lookups
>>    K - number of array elements
>>
>> Cost to sort the array is
>>
>>     O(K * log(K)) = C1 * K * log(K)
>>
>> and the cost of a lookup is C2 * log(K), so with the extra cost amortized
>> for N lookups, the total "per lookup" cost is
>>
>>     C1 * K * log(K) / N + C2 * log(K) = log(K) * (C1 * K / N + C2)
>>
>> We need to compare this to the O(K) cost of simple linear search, and
>> the question is at which point the linear search gets more expensive:
>>
>>     C3 * K = log(K) * (C1 * K / N + C2)
>>
>> I think we can assume that C3 is somewhere in between 0.5 and 1, i.e. if
>> there's a matching item we find it half-way through on average, and if
>> there is not we have to walk the whole array. So let's say it's 1.
>>
>> C1 and C2 are probably fairly low, I think - C1 is typically ~1.4 for
>> random pivot choice IIRC, and C2 is probably similar. With both values
>> being ~1.5 we get this:
>>
>>     K = log(K) * (1.5 * K/N + 1.5)
>>
>> for a fixed K, we get this formula for N:
>>
>>     N = log(K) * 1.5 * K / (K - 1.5 * log(K))
>>
>> and for a bunch of K values the results look like this:
>>
>>          K |     N
>>     -------|-------
>>          1 |     0
>>         10 |  5.27
>>        100 |  7.42
>>       1000 | 10.47
>>      10000 | 13.83
>>     100000 | 17.27
>>
>> i.e. the binary search with 10k values starts winning over linear search
>> with just ~13 lookups.
>>
>> (Assuming I haven't made some silly mistake in the math ...)
>>
>> Obviously, this only accounts for cost of comparisons and neglects e.g.
>> the indirect costs for less predictable memory access patterns mentioned
>> by Andres in his response.
>>
>> But I think it still shows the number of lookups needed for the binary
>> search to be a win is pretty low - at least for reasonable number of
>> values in the array. Maybe it's 20 and not 10, but I don't think that
>> changes much.
>>
>> The other question is if we can get N at all and how reliable the value
>> is. We can probably get the number of rows, but that will ignore other
>> conditions that may eliminate the row before the binary search.
>>
>> >I'm not trying to argue for more work for myself here: I think the
>> >optimization is worth it on its own, and something like this could be
>> >a further improvement on its own. But it is interesting to think
>> >about.
>> >
>>
>> I don't know. Clearly, if the user sends a query with 10k values and
>> only does a single lookup, that won't win. And if we can reasonably and
>> reliably protect against that, I wouldn't mind doing that, although it
>> means a risk of not using the bin search in case of underestimates etc.
>>
>> I don't have any hard data about this, but I think we can assume the
>> number of rows processed by the clause is (much) higher than the number
>> of keys in it. If you have a clause with 10k values, then you probably
>> expect it to be applied to many rows, far more than the "beak even"
>> point of about 10-20 rows ...
>>
>> So I wouldn't worry about this too much.
>
>Yeah. I think it becomes a lot more interesting in the future if/when
>we end up with a way to use this with params and not just constant
>arrays. Then the "group" size would matter a whole lot more.
>

True. That probably changes the calculations quite a bit.

>For now, the constant amount of overhead is quite small, so even if we
>only execute it once we won't make the query that much worse (or, at
>least, the total query time will still be very small). Also, because
>it's only applied to constants, there's a natural limit to how much
>overhead we're likely to introduce into a query.
>

FWIW the results from the repeated tests with both int and text columns
that I shared in [1] also have data for smaller numbers of rows. I haven't
tried very much to minimize noise (the original goal was to test speedup
for large numbers of rows and large arrays, where this is not an issue).
But I think it still shows that the threshold of ~10 elements is in the
right ballpark. We might use a higher value to be a bit more defensive,
but it's never going to be perfect for types with both cheap and more
expensive comparisons.

One more note - shouldn't this also tweak cost_qual_eval_walker, which
computes the cost for ScalarArrayOpExpr? At the moment it does this:

   /*
    * Estimate that the operator will be applied to about half of the
    * array elements before the answer is determined.
    */

but that's appropriate for linear search.
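
Just to illustrate the kind of tweak I mean - purely hypothetical, not
the actual costsize.c code - the per-tuple estimate could switch from
N/2 to roughly log2(N) comparisons once the array is large enough for
the binary search:

    #include <math.h>
    #include <stdio.h>

    #define BINSEARCH_THRESHOLD 9      /* threshold used by the patch */

    static double
    saop_comparisons_per_tuple(double nelems)
    {
        if (nelems >= BINSEARCH_THRESHOLD)
            return ceil(log2(nelems)); /* sorted array + binary search */
        return nelems * 0.5;           /* linear scan, as the comment assumes */
    }

    int
    main(void)
    {
        printf("10 -> %.1f, 1000 -> %.1f\n",
               saop_comparisons_per_tuple(10),
               saop_comparisons_per_tuple(1000));
        return 0;
    }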


[1] https://www.postgresql.org/message-id/20200425124024.hsv7z6bia752uymz%40development


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> This reminds me our attempts to add bloom filters to hash joins, which I
> think ran into mostly the same challenge of deciding when the bloom
> filter can be useful and is worth the extra work.

Speaking of that, it would be interesting to see how a test where you
write the query as IN(VALUES(...)) instead of IN() compares. It would
be interesting to know if the planner is able to make a more suitable
choice and also to see how all the work over the years to improve Hash
Joins compares to the bsearch with and without the bloom filter.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > This reminds me our attempts to add bloom filters to hash joins, which I
> > think ran into mostly the same challenge of deciding when the bloom
> > filter can be useful and is worth the extra work.
>
> Speaking of that, it would be interesting to see how a test where you
> write the query as IN(VALUES(...)) instead of IN() compares. It would
> be interesting to know if the planner is able to make a more suitable
> choice and also to see how all the work over the years to improve Hash
> Joins compares to the bsearch with and without the bloom filter.

It would be interesting.

It also makes one wonder about optimizing these into hash
joins...which I'd thought about over at [1]. I think it'd be a very
significant effort though.

James

[1]: https://www.postgresql.org/message-id/CAAaqYe_zVVOURfdPbAhssijw7yV0uKi350gQ%3D_QGDz7R%3DHpGGQ%40mail.gmail.com



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> > This reminds me our attempts to add bloom filters to hash joins, which I
>> > think ran into mostly the same challenge of deciding when the bloom
>> > filter can be useful and is worth the extra work.
>>
>> Speaking of that, it would be interesting to see how a test where you
>> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> be interesting to know if the planner is able to make a more suitable
>> choice and also to see how all the work over the years to improve Hash
>> Joins compares to the bsearch with and without the bloom filter.
>
>It would be interesting.
>
>It also makes one wonder about optimizing these into to hash
>joins...which I'd thought about over at [1]. I think it'd be a very
>significant effort though.
>

I modified the script to also do the join version of the query. I can
only run it on my laptop at the moment, so the results may be a bit
different from those I shared before, but it's interesting, I think.

In most cases it's comparable to the binsearch/bloom approach, and in
some cases it actually beats them quite significantly. It seems to
depend on how expensive the comparison is - for "int" the comparison is
very cheap and there's almost no difference. For "text" the comparison
is much more expensive, and there are significant speedups.

For example, in the test with 100k lookups in an array of 10k elements
and 10% match probability, the timings are these:

   master:  62362 ms
   binsearch: 201 ms
   bloom:      65 ms
   hashjoin:   36 ms

I do think the explanation is fairly simple - the bloom filter
eliminates about 90% of the expensive comparisons, so it's 20ms plus
some overhead to build and check the bits. The hash join probably
eliminates a lot of the remaining comparisons, because the hash table
is sized to have one tuple per bucket.

Note: I also don't claim the PoC has the most efficient bloom filter
implementation possible. I'm sure it could be made faster.

Anyway, I'm not sure transforming this to a hash join is worth the
effort - I agree that seems quite complex. But perhaps this suggests we
should not be doing binary search and instead just build a simple hash
table - that seems much simpler, and it'll probably give us about the
same benefits.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >>
> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> > think ran into mostly the same challenge of deciding when the bloom
> >> > filter can be useful and is worth the extra work.
> >>
> >> Speaking of that, it would be interesting to see how a test where you
> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> be interesting to know if the planner is able to make a more suitable
> >> choice and also to see how all the work over the years to improve Hash
> >> Joins compares to the bsearch with and without the bloom filter.
> >
> >It would be interesting.
> >
> >It also makes one wonder about optimizing these into to hash
> >joins...which I'd thought about over at [1]. I think it'd be a very
> >significant effort though.
> >
>
> I modified the script to also do the join version of the query. I can
> only run it on my laptop at the moment, so the results may be a bit
> different from those I shared before, but it's interesting I think.
>
> In most cases it's comparable to the binsearch/bloom approach, and in
> some cases it actually beats them quite significantly. It seems to
> depend on how expensive the comparison is - for "int" the comparison is
> very cheap and there's almost no difference. For "text" the comparison
> is much more expensive, and there are significant speedups.
>
> For example the test with 100k lookups in array of 10k elements and 10%
> match probability, the timings are these
>
>    master:  62362 ms
>    binsearch: 201 ms
>    bloom:      65 ms
>    hashjoin:   36 ms
>
> I do think the explanation is fairly simple - the bloom filter
> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> some overhead to build and check the bits. The hash join probably
> eliminates a lot of the remaining comparisons, because the hash table
> is sized to have one tuple per bucket.
>
> Note: I also don't claim the PoC has the most efficient bloom filter
> implementation possible. I'm sure it could be made faster.
>
> Anyway, I'm not sure transforming this to a hash join is worth the
> effort - I agree that seems quite complex. But perhaps this suggest we
> should not be doing binary search and instead just build a simple hash
> table - that seems much simpler, and it'll probably give us about the
> same benefits.

That's actually what I originally thought about doing, but I chose
binary search since it seemed a lot easier to get off the ground.

If we instead build a hash, is there anything else we need to be
concerned about? For example, work_mem? I suppose for the binary
search we already have to expand the array, so perhaps it's not all
that meaningful relative to that...

I was looking earlier at what our standard hash implementation was,
and it seemed less obvious what was needed to set that up (so binary
search seemed a faster proof of concept). If you happen to have any
pointers to similar usages I should look at, please let me know.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> >>
>> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> >> > think ran into mostly the same challenge of deciding when the bloom
>> >> > filter can be useful and is worth the extra work.
>> >>
>> >> Speaking of that, it would be interesting to see how a test where you
>> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> >> be interesting to know if the planner is able to make a more suitable
>> >> choice and also to see how all the work over the years to improve Hash
>> >> Joins compares to the bsearch with and without the bloom filter.
>> >
>> >It would be interesting.
>> >
>> >It also makes one wonder about optimizing these into to hash
>> >joins...which I'd thought about over at [1]. I think it'd be a very
>> >significant effort though.
>> >
>>
>> I modified the script to also do the join version of the query. I can
>> only run it on my laptop at the moment, so the results may be a bit
>> different from those I shared before, but it's interesting I think.
>>
>> In most cases it's comparable to the binsearch/bloom approach, and in
>> some cases it actually beats them quite significantly. It seems to
>> depend on how expensive the comparison is - for "int" the comparison is
>> very cheap and there's almost no difference. For "text" the comparison
>> is much more expensive, and there are significant speedups.
>>
>> For example the test with 100k lookups in array of 10k elements and 10%
>> match probability, the timings are these
>>
>>    master:  62362 ms
>>    binsearch: 201 ms
>>    bloom:      65 ms
>>    hashjoin:   36 ms
>>
>> I do think the explanation is fairly simple - the bloom filter
>> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> some overhead to build and check the bits. The hash join probably
>> eliminates a lot of the remaining comparisons, because the hash table
>> is sized to have one tuple per bucket.
>>
>> Note: I also don't claim the PoC has the most efficient bloom filter
>> implementation possible. I'm sure it could be made faster.
>>
>> Anyway, I'm not sure transforming this to a hash join is worth the
>> effort - I agree that seems quite complex. But perhaps this suggest we
>> should not be doing binary search and instead just build a simple hash
>> table - that seems much simpler, and it'll probably give us about the
>> same benefits.
>
>That's actually what I originally thought about doing, but I chose
>binary search since it seemed a lot easier to get off the ground.
>

OK, that makes perfect sense.

>If we instead build a hash is there anything else we need to be
>concerned about? For example, work mem? I suppose for the binary
>search we already have to expand the array, so perhaps it's not all
>that meaningful relative to that...
>

I don't think we need to be particularly concerned about work_mem. We
don't care about it now, and it's not clear to me what we could do about
it - we already have the array in memory anyway, so it's a bit futile.
Furthermore, if we need to care about it, it probably applies to the
binary search too.

>I was looking earlier at what our standard hash implementation was,
>and it seemed less obvious what was needed to set that up (so binary
>search seemed a faster proof of concept). If you happen to have any
>pointers to similar usages I should look at, please let me know.
>

I think the hash join implementation is far too complicated. It has to
care about work_mem, so it implements batching, etc. That's a lot of
complexity we don't need here. IMO we could use either the usual
dynahash, or maybe even the simpler simplehash.

FWIW it'd be good to verify the numbers I shared, i.e. checking that the
benchmark makes sense and running it independently. I'm not aware of
any issues, but it was done late at night and only ran on my laptop.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >> >>
> >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> >> > think ran into mostly the same challenge of deciding when the bloom
> >> >> > filter can be useful and is worth the extra work.
> >> >>
> >> >> Speaking of that, it would be interesting to see how a test where you
> >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> >> be interesting to know if the planner is able to make a more suitable
> >> >> choice and also to see how all the work over the years to improve Hash
> >> >> Joins compares to the bsearch with and without the bloom filter.
> >> >
> >> >It would be interesting.
> >> >
> >> >It also makes one wonder about optimizing these into to hash
> >> >joins...which I'd thought about over at [1]. I think it'd be a very
> >> >significant effort though.
> >> >
> >>
> >> I modified the script to also do the join version of the query. I can
> >> only run it on my laptop at the moment, so the results may be a bit
> >> different from those I shared before, but it's interesting I think.
> >>
> >> In most cases it's comparable to the binsearch/bloom approach, and in
> >> some cases it actually beats them quite significantly. It seems to
> >> depend on how expensive the comparison is - for "int" the comparison is
> >> very cheap and there's almost no difference. For "text" the comparison
> >> is much more expensive, and there are significant speedups.
> >>
> >> For example the test with 100k lookups in array of 10k elements and 10%
> >> match probability, the timings are these
> >>
> >>    master:  62362 ms
> >>    binsearch: 201 ms
> >>    bloom:      65 ms
> >>    hashjoin:   36 ms
> >>
> >> I do think the explanation is fairly simple - the bloom filter
> >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> >> some overhead to build and check the bits. The hash join probably
> >> eliminates a lot of the remaining comparisons, because the hash table
> >> is sized to have one tuple per bucket.
> >>
> >> Note: I also don't claim the PoC has the most efficient bloom filter
> >> implementation possible. I'm sure it could be made faster.
> >>
> >> Anyway, I'm not sure transforming this to a hash join is worth the
> >> effort - I agree that seems quite complex. But perhaps this suggest we
> >> should not be doing binary search and instead just build a simple hash
> >> table - that seems much simpler, and it'll probably give us about the
> >> same benefits.
> >
> >That's actually what I originally thought about doing, but I chose
> >binary search since it seemed a lot easier to get off the ground.
> >
>
> OK, that makes perfect sense.
>
> >If we instead build a hash is there anything else we need to be
> >concerned about? For example, work mem? I suppose for the binary
> >search we already have to expand the array, so perhaps it's not all
> >that meaningful relative to that...
> >
>
> I don't think we need to be particularly concerned about work_mem. We
> don't care about it now, and it's not clear to me what we could do about
> it - we already have the array in memory anyway, so it's a bit futile.
> Furthermore, if we need to care about it, it probably applies to the
> binary search too.
>
> >I was looking earlier at what our standard hash implementation was,
> >and it seemed less obvious what was needed to set that up (so binary
> >search seemed a faster proof of concept). If you happen to have any
> >pointers to similar usages I should look at, please let me know.
> >
>
> I think the hash join implementation is far too complicated. It has to
> care about work_mem, so it implements batching, etc. That's a lot of
> complexity we don't need here. IMO we could use either the usual
> dynahash, or maybe even the simpler simplehash.
>
> FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> benchmarks makes sense and running it independently. I'm not aware of
> any issues but it was done late at night and only ran on my laptop.

Some quick calculations (I don't have the scripting in a form I can
attach yet; I'm using this as an opportunity to hack on a genericized
performance testing framework of sorts) suggest your results are
correct. I was also testing on my laptop, and I saw roughly equivalent
results for IN (VALUES ...) and IN (<list>) for integers, but when I
switched to (short; average 3 characters long) text values the hash
join on VALUES was twice as fast as the binary search.

Given that, I'm planning to implement this as a hash lookup and share
a revised patch.

James



Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > >> >>
> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> > >> >> > think ran into mostly the same challenge of deciding when the bloom
> > >> >> > filter can be useful and is worth the extra work.
> > >> >>
> > >> >> Speaking of that, it would be interesting to see how a test where you
> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> > >> >> be interesting to know if the planner is able to make a more suitable
> > >> >> choice and also to see how all the work over the years to improve Hash
> > >> >> Joins compares to the bsearch with and without the bloom filter.
> > >> >
> > >> >It would be interesting.
> > >> >
> > >> >It also makes one wonder about optimizing these into to hash
> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
> > >> >significant effort though.
> > >> >
> > >>
> > >> I modified the script to also do the join version of the query. I can
> > >> only run it on my laptop at the moment, so the results may be a bit
> > >> different from those I shared before, but it's interesting I think.
> > >>
> > >> In most cases it's comparable to the binsearch/bloom approach, and in
> > >> some cases it actually beats them quite significantly. It seems to
> > >> depend on how expensive the comparison is - for "int" the comparison is
> > >> very cheap and there's almost no difference. For "text" the comparison
> > >> is much more expensive, and there are significant speedups.
> > >>
> > >> For example the test with 100k lookups in array of 10k elements and 10%
> > >> match probability, the timings are these
> > >>
> > >>    master:  62362 ms
> > >>    binsearch: 201 ms
> > >>    bloom:      65 ms
> > >>    hashjoin:   36 ms
> > >>
> > >> I do think the explanation is fairly simple - the bloom filter
> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> > >> some overhead to build and check the bits. The hash join probably
> > >> eliminates a lot of the remaining comparisons, because the hash table
> > >> is sized to have one tuple per bucket.
> > >>
> > >> Note: I also don't claim the PoC has the most efficient bloom filter
> > >> implementation possible. I'm sure it could be made faster.
> > >>
> > >> Anyway, I'm not sure transforming this to a hash join is worth the
> > >> effort - I agree that seems quite complex. But perhaps this suggest we
> > >> should not be doing binary search and instead just build a simple hash
> > >> table - that seems much simpler, and it'll probably give us about the
> > >> same benefits.
> > >
> > >That's actually what I originally thought about doing, but I chose
> > >binary search since it seemed a lot easier to get off the ground.
> > >
> >
> > OK, that makes perfect sense.
> >
> > >If we instead build a hash is there anything else we need to be
> > >concerned about? For example, work mem? I suppose for the binary
> > >search we already have to expand the array, so perhaps it's not all
> > >that meaningful relative to that...
> > >
> >
> > I don't think we need to be particularly concerned about work_mem. We
> > don't care about it now, and it's not clear to me what we could do about
> > it - we already have the array in memory anyway, so it's a bit futile.
> > Furthermore, if we need to care about it, it probably applies to the
> > binary search too.
> >
> > >I was looking earlier at what our standard hash implementation was,
> > >and it seemed less obvious what was needed to set that up (so binary
> > >search seemed a faster proof of concept). If you happen to have any
> > >pointers to similar usages I should look at, please let me know.
> > >
> >
> > I think the hash join implementation is far too complicated. It has to
> > care about work_mem, so it implements batching, etc. That's a lot of
> > complexity we don't need here. IMO we could use either the usual
> > dynahash, or maybe even the simpler simplehash.
> >
> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> > benchmarks makes sense and running it independently. I'm not aware of
> > any issues but it was done late at night and only ran on my laptop.
>
> Some quick calculations (don't have the scripting in a form I can
> attach yet; using this as an opportunity to hack on a genericized
> performance testing framework of sorts) suggest your results are
> correct. I was also testing on my laptop, but I showed 1.) roughly
> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
> but when I switch to (short; average 3 characters long) text values I
> show the hash join on VALUES is twice as fast as the binary search.
>
> Given that, I'm planning to implement this as a hash lookup and share
> a revised patch.

While working on this I noticed that dynahash.c line 499 has this assertion:

Assert(info->entrysize >= info->keysize);

Do you by any chance know why the entry would need to be larger than
the key? In this case I'm really treating the hash like a set (if
there's a hash set implementation that doesn't store a value, then I'd
be happy to use that instead) so I've configured the entry as
sizeof(bool) which is obviously smaller than the key.

If it helps, that line was added by Tom in fba2a104c6d.

Thanks,
James

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Mon, 27 Apr 2020 at 15:12, James Coleman <jtc331@gmail.com> wrote:
> While working on this I noticed that dynahash.c line 499 has this assertion:
>
> Assert(info->entrysize >= info->keysize);
>
> Do you by any chance know why the entry would need to be larger than the key?

Larger or equal. They'd be equal if the key was the data, since
you do need to store at least the key.  Looking at the code for
examples where dynahash is used in that situation, I see
_hash_finish_split().

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sun, Apr 26, 2020 at 11:44 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Mon, 27 Apr 2020 at 15:12, James Coleman <jtc331@gmail.com> wrote:
> > While working on this I noticed that dynahash.c line 499 has this assertion:
> >
> > Assert(info->entrysize >= info->keysize);
> >
> > Do you by any chance know why the entry would need to be larger than the key?
>
> Larger or equal. They'd be equal if you the key was the data, since
> you do need to store at least the key.  Looking at the code for
> examples where dynahash is used in that situation, I see
> _hash_finish_split().

Ah, I was thinking of it as key and value being separate sizes added
together rather than one including the other.
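
So something like this rough sketch should satisfy the assertion (the
names are made up, and HASH_BLOBS over the raw Datum bytes is only for
illustration - the real patch has to use the element type's own hash
and equality support):

    #include "postgres.h"
    #include "utils/hsearch.h"

    /* build a key-only ("set") dynahash table over the array elements */
    static HTAB *
    build_element_set(Datum *elems, int nelems)
    {
        HASHCTL     ctl;
        HTAB       *seen;
        int         i;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Datum);
        ctl.entrysize = sizeof(Datum);      /* the key is the whole entry */
        ctl.hcxt = CurrentMemoryContext;

        seen = hash_create("saop element set", nelems, &ctl,
                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

        for (i = 0; i < nelems; i++)
        {
            bool        found;

            (void) hash_search(seen, &elems[i], HASH_ENTER, &found);
        }

        return seen;
    }

Lookups are then just hash_search(seen, &value, HASH_FIND, &found) and
a check of "found".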

Thanks,
James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> > ><tomas.vondra@2ndquadrant.com> wrote:
> > >>
> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > >> >>
> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> > >> >> > think ran into mostly the same challenge of deciding when the bloom
> > >> >> > filter can be useful and is worth the extra work.
> > >> >>
> > >> >> Speaking of that, it would be interesting to see how a test where you
> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> > >> >> be interesting to know if the planner is able to make a more suitable
> > >> >> choice and also to see how all the work over the years to improve Hash
> > >> >> Joins compares to the bsearch with and without the bloom filter.
> > >> >
> > >> >It would be interesting.
> > >> >
> > >> >It also makes one wonder about optimizing these into to hash
> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
> > >> >significant effort though.
> > >> >
> > >>
> > >> I modified the script to also do the join version of the query. I can
> > >> only run it on my laptop at the moment, so the results may be a bit
> > >> different from those I shared before, but it's interesting I think.
> > >>
> > >> In most cases it's comparable to the binsearch/bloom approach, and in
> > >> some cases it actually beats them quite significantly. It seems to
> > >> depend on how expensive the comparison is - for "int" the comparison is
> > >> very cheap and there's almost no difference. For "text" the comparison
> > >> is much more expensive, and there are significant speedups.
> > >>
> > >> For example the test with 100k lookups in array of 10k elements and 10%
> > >> match probability, the timings are these
> > >>
> > >>    master:  62362 ms
> > >>    binsearch: 201 ms
> > >>    bloom:      65 ms
> > >>    hashjoin:   36 ms
> > >>
> > >> I do think the explanation is fairly simple - the bloom filter
> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> > >> some overhead to build and check the bits. The hash join probably
> > >> eliminates a lot of the remaining comparisons, because the hash table
> > >> is sized to have one tuple per bucket.
> > >>
> > >> Note: I also don't claim the PoC has the most efficient bloom filter
> > >> implementation possible. I'm sure it could be made faster.
> > >>
> > >> Anyway, I'm not sure transforming this to a hash join is worth the
> > >> effort - I agree that seems quite complex. But perhaps this suggest we
> > >> should not be doing binary search and instead just build a simple hash
> > >> table - that seems much simpler, and it'll probably give us about the
> > >> same benefits.
> > >
> > >That's actually what I originally thought about doing, but I chose
> > >binary search since it seemed a lot easier to get off the ground.
> > >
> >
> > OK, that makes perfect sense.
> >
> > >If we instead build a hash is there anything else we need to be
> > >concerned about? For example, work mem? I suppose for the binary
> > >search we already have to expand the array, so perhaps it's not all
> > >that meaningful relative to that...
> > >
> >
> > I don't think we need to be particularly concerned about work_mem. We
> > don't care about it now, and it's not clear to me what we could do about
> > it - we already have the array in memory anyway, so it's a bit futile.
> > Furthermore, if we need to care about it, it probably applies to the
> > binary search too.
> >
> > >I was looking earlier at what our standard hash implementation was,
> > >and it seemed less obvious what was needed to set that up (so binary
> > >search seemed a faster proof of concept). If you happen to have any
> > >pointers to similar usages I should look at, please let me know.
> > >
> >
> > I think the hash join implementation is far too complicated. It has to
> > care about work_mem, so it implements batching, etc. That's a lot of
> > complexity we don't need here. IMO we could use either the usual
> > dynahash, or maybe even the simpler simplehash.
> >
> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> > benchmarks makes sense and running it independently. I'm not aware of
> > any issues but it was done late at night and only ran on my laptop.
>
> Some quick calculations (don't have the scripting in a form I can
> attach yet; using this as an opportunity to hack on a genericized
> performance testing framework of sorts) suggest your results are
> correct. I was also testing on my laptop, but I showed 1.) roughly
> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
> but when I switch to (short; average 3 characters long) text values I
> show the hash join on VALUES is twice as fast as the binary search.
>
> Given that, I'm planning to implement this as a hash lookup and share
> a revised patch.

I've attached a patch series as before, but with an additional patch
that switches to using dynahash instead of binary search.

Whereas before the benchmarking ended up with a trimodal distribution
(i.e., master with IN <list>, patch with IN <list>, and either with IN
VALUES), the hash implementation brings us back to an effectively
bimodal distribution -- though the hash scalar array op expression
implementation for text is about 5% faster than the hash join.

Current outstanding thoughts (besides comment/naming cleanup):

- The saop costing needs to be updated to match, as Tomas pointed out.

- Should we be concerned about single execution cases? For example, is
the speed regression on a simple SELECT x IN (...) something we should
try to defeat by only kicking in the optimization if we execute in a
loop at least twice? That might be of particular interest to pl/pgsql.

- Should we have a test for an operator with a non-strict function?
I'm not aware of any built-in ops that have that characteristic; would
you suggest just creating a fake one for the test?

James

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
>On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> >
>> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
>> > ><tomas.vondra@2ndquadrant.com> wrote:
>> > >>
>> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> > >> >>
>> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> > >> >> > think ran into mostly the same challenge of deciding when the bloom
>> > >> >> > filter can be useful and is worth the extra work.
>> > >> >>
>> > >> >> Speaking of that, it would be interesting to see how a test where you
>> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> > >> >> be interesting to know if the planner is able to make a more suitable
>> > >> >> choice and also to see how all the work over the years to improve Hash
>> > >> >> Joins compares to the bsearch with and without the bloom filter.
>> > >> >
>> > >> >It would be interesting.
>> > >> >
>> > >> >It also makes one wonder about optimizing these into to hash
>> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
>> > >> >significant effort though.
>> > >> >
>> > >>
>> > >> I modified the script to also do the join version of the query. I can
>> > >> only run it on my laptop at the moment, so the results may be a bit
>> > >> different from those I shared before, but it's interesting I think.
>> > >>
>> > >> In most cases it's comparable to the binsearch/bloom approach, and in
>> > >> some cases it actually beats them quite significantly. It seems to
>> > >> depend on how expensive the comparison is - for "int" the comparison is
>> > >> very cheap and there's almost no difference. For "text" the comparison
>> > >> is much more expensive, and there are significant speedups.
>> > >>
>> > >> For example the test with 100k lookups in array of 10k elements and 10%
>> > >> match probability, the timings are these
>> > >>
>> > >>    master:  62362 ms
>> > >>    binsearch: 201 ms
>> > >>    bloom:      65 ms
>> > >>    hashjoin:   36 ms
>> > >>
>> > >> I do think the explanation is fairly simple - the bloom filter
>> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> > >> some overhead to build and check the bits. The hash join probably
>> > >> eliminates a lot of the remaining comparisons, because the hash table
>> > >> is sized to have one tuple per bucket.
>> > >>
>> > >> Note: I also don't claim the PoC has the most efficient bloom filter
>> > >> implementation possible. I'm sure it could be made faster.
>> > >>
>> > >> Anyway, I'm not sure transforming this to a hash join is worth the
>> > >> effort - I agree that seems quite complex. But perhaps this suggest we
>> > >> should not be doing binary search and instead just build a simple hash
>> > >> table - that seems much simpler, and it'll probably give us about the
>> > >> same benefits.
>> > >
>> > >That's actually what I originally thought about doing, but I chose
>> > >binary search since it seemed a lot easier to get off the ground.
>> > >
>> >
>> > OK, that makes perfect sense.
>> >
>> > >If we instead build a hash is there anything else we need to be
>> > >concerned about? For example, work mem? I suppose for the binary
>> > >search we already have to expand the array, so perhaps it's not all
>> > >that meaningful relative to that...
>> > >
>> >
>> > I don't think we need to be particularly concerned about work_mem. We
>> > don't care about it now, and it's not clear to me what we could do about
>> > it - we already have the array in memory anyway, so it's a bit futile.
>> > Furthermore, if we need to care about it, it probably applies to the
>> > binary search too.
>> >
>> > >I was looking earlier at what our standard hash implementation was,
>> > >and it seemed less obvious what was needed to set that up (so binary
>> > >search seemed a faster proof of concept). If you happen to have any
>> > >pointers to similar usages I should look at, please let me know.
>> > >
>> >
>> > I think the hash join implementation is far too complicated. It has to
>> > care about work_mem, so it implements batching, etc. That's a lot of
>> > complexity we don't need here. IMO we could use either the usual
>> > dynahash, or maybe even the simpler simplehash.
>> >
>> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
>> > benchmarks makes sense and running it independently. I'm not aware of
>> > any issues but it was done late at night and only ran on my laptop.
>>
>> Some quick calculations (don't have the scripting in a form I can
>> attach yet; using this as an opportunity to hack on a genericized
>> performance testing framework of sorts) suggest your results are
>> correct. I was also testing on my laptop, but I showed 1.) roughly
>> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
>> but when I switch to (short; average 3 characters long) text values I
>> show the hash join on VALUES is twice as fast as the binary search.
>>
>> Given that, I'm planning to implement this as a hash lookup and share
>> a revised patch.
>
>I've attached a patch series as before, but with an additional patch
>that switches to using dynahash instead of binary search.
>

OK. I can't take a closer look at the moment; I'll check later.

Any particular reasons to pick dynahash over simplehash? ISTM we're
using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
while there are not many places using dynahash for simple short-lived
hash tables. Of course, that alone is a weak reason to insist on using
simplehash here, but I suppose there were reasons for not using dynahash
and we'll end up facing the same issues here.
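
For comparison, the simplehash route is roughly this much boilerplate
(a sketch from memory - the element type and the hash/equality helpers
here are placeholders, not a concrete proposal):

    /* placeholder element type; simplehash needs a "status" member it manages */
    typedef struct SaopHashEntry
    {
        Datum       key;        /* the array element */
        char        status;
    } SaopHashEntry;

    #define SH_PREFIX saophash
    #define SH_ELEMENT_TYPE SaopHashEntry
    #define SH_KEY_TYPE Datum
    #define SH_KEY key
    #define SH_HASH_KEY(tb, key) hash_element(key)      /* placeholder */
    #define SH_EQUAL(tb, a, b) element_equal(a, b)      /* placeholder */
    #define SH_SCOPE static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

which generates saophash_create(), saophash_insert(), saophash_lookup()
and friends, much like the grouping and tidbitmap code does.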


>Whereas before the benchmarking ended up with a trimodal distribution
>(i.e., master with IN <list>, patch with IN <list>, and either with IN
>VALUES), the hash implementation brings us back to an effectively
>bimodal distribution -- though the hash scalar array op expression
>implementation for text is about 5% faster than the hash join.
>

Nice. I'm not surprised this is a bit faster than hash join, which has
to worry about additional stuff.

>Current outstanding thoughts (besides comment/naming cleanup):
>
>- The saop costing needs to be updated to match, as Tomas pointed out.
>
>- Should we be concerned about single execution cases? For example, is
>the regression of speed on a simple SELECT x IN something we should
>try to defeat by only kicking in the optimization if we execute in a
>loop at least twice? That might be of particular interest to pl/pgsql.
>

I don't follow. How is this related to the number of executions and/or
plpgsql? I suspect you might be talking about prepared statements, but
surely the hash table is built for each execution anyway, even if the
plan is reused, right?

I think the issue we've been talking about is considering the number of
lookups we expect to do in the array/hash table. But that has nothing to
do with plpgsql and/or multiple executions ...

>- Should we have a test for an operator with a non-strict function?
>I'm not aware of any built-in ops that have that characteristic; would
>you suggest just creating a fake one for the test?
>

Dunno, I haven't thought about this very much. In general I think the
new code should simply behave the same as the current code, i.e. if it
does not check for strictness, we don't need to either.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
> >>
> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> >> <tomas.vondra@2ndquadrant.com> wrote:
> >> >
> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> >> > ><tomas.vondra@2ndquadrant.com> wrote:
> >> > >>
> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >> > >> >>
> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> > >> >> > think ran into mostly the same challenge of deciding when the bloom
> >> > >> >> > filter can be useful and is worth the extra work.
> >> > >> >>
> >> > >> >> Speaking of that, it would be interesting to see how a test where you
> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> > >> >> be interesting to know if the planner is able to make a more suitable
> >> > >> >> choice and also to see how all the work over the years to improve Hash
> >> > >> >> Joins compares to the bsearch with and without the bloom filter.
> >> > >> >
> >> > >> >It would be interesting.
> >> > >> >
> >> > >> >It also makes one wonder about optimizing these into to hash
> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
> >> > >> >significant effort though.
> >> > >> >
> >> > >>
> >> > >> I modified the script to also do the join version of the query. I can
> >> > >> only run it on my laptop at the moment, so the results may be a bit
> >> > >> different from those I shared before, but it's interesting I think.
> >> > >>
> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in
> >> > >> some cases it actually beats them quite significantly. It seems to
> >> > >> depend on how expensive the comparison is - for "int" the comparison is
> >> > >> very cheap and there's almost no difference. For "text" the comparison
> >> > >> is much more expensive, and there are significant speedups.
> >> > >>
> >> > >> For example the test with 100k lookups in array of 10k elements and 10%
> >> > >> match probability, the timings are these
> >> > >>
> >> > >>    master:  62362 ms
> >> > >>    binsearch: 201 ms
> >> > >>    bloom:      65 ms
> >> > >>    hashjoin:   36 ms
> >> > >>
> >> > >> I do think the explanation is fairly simple - the bloom filter
> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> >> > >> some overhead to build and check the bits. The hash join probably
> >> > >> eliminates a lot of the remaining comparisons, because the hash table
> >> > >> is sized to have one tuple per bucket.
> >> > >>
> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter
> >> > >> implementation possible. I'm sure it could be made faster.
> >> > >>
> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the
> >> > >> effort - I agree that seems quite complex. But perhaps this suggest we
> >> > >> should not be doing binary search and instead just build a simple hash
> >> > >> table - that seems much simpler, and it'll probably give us about the
> >> > >> same benefits.
> >> > >
> >> > >That's actually what I originally thought about doing, but I chose
> >> > >binary search since it seemed a lot easier to get off the ground.
> >> > >
> >> >
> >> > OK, that makes perfect sense.
> >> >
> >> > >If we instead build a hash is there anything else we need to be
> >> > >concerned about? For example, work mem? I suppose for the binary
> >> > >search we already have to expand the array, so perhaps it's not all
> >> > >that meaningful relative to that...
> >> > >
> >> >
> >> > I don't think we need to be particularly concerned about work_mem. We
> >> > don't care about it now, and it's not clear to me what we could do about
> >> > it - we already have the array in memory anyway, so it's a bit futile.
> >> > Furthermore, if we need to care about it, it probably applies to the
> >> > binary search too.
> >> >
> >> > >I was looking earlier at what our standard hash implementation was,
> >> > >and it seemed less obvious what was needed to set that up (so binary
> >> > >search seemed a faster proof of concept). If you happen to have any
> >> > >pointers to similar usages I should look at, please let me know.
> >> > >
> >> >
> >> > I think the hash join implementation is far too complicated. It has to
> >> > care about work_mem, so it implements batching, etc. That's a lot of
> >> > complexity we don't need here. IMO we could use either the usual
> >> > dynahash, or maybe even the simpler simplehash.
> >> >
> >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> >> > benchmarks makes sense and running it independently. I'm not aware of
> >> > any issues but it was done late at night and only ran on my laptop.
> >>
> >> Some quick calculations (don't have the scripting in a form I can
> >> attach yet; using this as an opportunity to hack on a genericized
> >> performance testing framework of sorts) suggest your results are
> >> correct. I was also testing on my laptop, but I showed 1.) roughly
> >> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
> >> but when I switch to (short; average 3 characters long) text values I
> >> show the hash join on VALUES is twice as fast as the binary search.
> >>
> >> Given that, I'm planning to implement this as a hash lookup and share
> >> a revised patch.
> >
> >I've attached a patch series as before, but with an additional patch
> >that switches to using dynahash instead of binary search.
> >
>
> OK. I can't take closer look at the moment, I'll check later.
>
> Any particular reasons to pick dynahash over simplehash? ISTM we're
> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> while there are not many places using dynahash for simple short-lived
> hash tables. Of course, that alone is a weak reason to insist on using
> simplehash here, but I suppose there were reasons for not using dynahash
> and we'll end up facing the same issues here.

No particular reason; it wasn't clear to me that there was a reason to
prefer one or the other (and I'm not acquainted with the codebase
enough to know the differences), so I chose dynahash because it was
easier to find examples to follow for implementation.

> >Whereas before the benchmarking ended up with a trimodal distribution
> >(i.e., master with IN <list>, patch with IN <list>, and either with IN
> >VALUES), the hash implementation brings us back to an effectively
> >bimodal distribution -- though the hash scalar array op expression
> >implementation for text is about 5% faster than the hash join.
> >
>
> Nice. I'm not surprised this is a bit faster than hash join, which has
> to worry about additional stuff.
>
> >Current outstanding thoughts (besides comment/naming cleanup):
> >
> >- The saop costing needs to be updated to match, as Tomas pointed out.
> >
> >- Should we be concerned about single execution cases? For example, is
> >the regression of speed on a simple SELECT x IN something we should
> >try to defeat by only kicking in the optimization if we execute in a
> >loop at least twice? That might be of particular interest to pl/pgsql.
> >
>
> I don't follow. How is this related to the number of executions and/or
> plpgsql? I suspect you might be talking about prepared statements, but
> surely the hash table is built for each execution anyway, even if the
> plan is reused, right?
>
> I think the issue we've been talking about is considering the number of
> lookups we expect to do in the array/hash table. But that has nothing to
> do with plpgsql and/or multiple executions ...

Suppose I do "SELECT 1 IN <100 item list>" (whether just as a
standalone query or in pl/pgsql). Then it doesn't make sense to use
the optimization, because it can't possibly win over a naive linear
scan through the array since to build the hash we have to do that
linear scan anyway (I suppose in theory the hashing function could be
dramatically faster than the equality op, so maybe it could win
overall, but it seems unlikely to me). I'm not so concerned about this
in any query where we have a real FROM clause because even if we end
up with only one row, the relative penalty is low, and the potential
gain is very high. But simple expressions in pl/pgsql, for example,
are a case where we can know for certain (correct me if I'm wrong on
this) that we'll only execute the expression once, which means there's
probably always a penalty for choosing the implementation with setup
costs over the default linear scan through the array.
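
As a back-of-the-envelope way of putting it (made-up cost variables,
not a proposed costing model):

    /* does building + probing a hash beat a naive linear scan? */
    static bool
    hash_lookup_wins(double nelems, double nlookups,
                     double cmp_cost, double hash_cost)
    {
        /* expect to scan ~half the array per lookup when the value is present */
        double      linear = nlookups * (nelems / 2.0) * cmp_cost;
        /* one pass to build, then one hash + one comparison per probe */
        double      hashed = nelems * hash_cost +
                             nlookups * (hash_cost + cmp_cost);

        return hashed < linear;
    }

With nlookups = 1 that only comes out ahead when hash_cost is less than
about half of cmp_cost, which is the "dramatically faster hashing"
caveat above.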

> >- Should we have a test for an operator with a non-strict function?
> >I'm not aware of any built-in ops that have that characteristic; would
> >you suggest just creating a fake one for the test?
> >
>
> Dunno, I haven't thought about this very much. In general I think the
> new code should simply behave the same as the current code, i.e. if it
> does not check for strictness, we don't need either.

Both the original implementation and this optimized version have the
following short circuit condition:

if (fcinfo->args[0].isnull && strictfunc)

But I'm not sure there are any existing tests to show that the "&&
strictfunc" is required.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Tue, Apr 28, 2020 at 08:39:18AM -0400, James Coleman wrote:
>On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
>> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>> >>
>> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
>> >> <tomas.vondra@2ndquadrant.com> wrote:
>> >> >
>> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
>> >> > ><tomas.vondra@2ndquadrant.com> wrote:
>> >> > >>
>> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> >> > >> >>
>> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> >> > >> >> > think ran into mostly the same challenge of deciding when the bloom
>> >> > >> >> > filter can be useful and is worth the extra work.
>> >> > >> >>
>> >> > >> >> Speaking of that, it would be interesting to see how a test where you
>> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> >> > >> >> be interesting to know if the planner is able to make a more suitable
>> >> > >> >> choice and also to see how all the work over the years to improve Hash
>> >> > >> >> Joins compares to the bsearch with and without the bloom filter.
>> >> > >> >
>> >> > >> >It would be interesting.
>> >> > >> >
>> >> > >> >It also makes one wonder about optimizing these into to hash
>> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
>> >> > >> >significant effort though.
>> >> > >> >
>> >> > >>
>> >> > >> I modified the script to also do the join version of the query. I can
>> >> > >> only run it on my laptop at the moment, so the results may be a bit
>> >> > >> different from those I shared before, but it's interesting I think.
>> >> > >>
>> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in
>> >> > >> some cases it actually beats them quite significantly. It seems to
>> >> > >> depend on how expensive the comparison is - for "int" the comparison is
>> >> > >> very cheap and there's almost no difference. For "text" the comparison
>> >> > >> is much more expensive, and there are significant speedups.
>> >> > >>
>> >> > >> For example the test with 100k lookups in array of 10k elements and 10%
>> >> > >> match probability, the timings are these
>> >> > >>
>> >> > >>    master:  62362 ms
>> >> > >>    binsearch: 201 ms
>> >> > >>    bloom:      65 ms
>> >> > >>    hashjoin:   36 ms
>> >> > >>
>> >> > >> I do think the explanation is fairly simple - the bloom filter
>> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> >> > >> some overhead to build and check the bits. The hash join probably
>> >> > >> eliminates a lot of the remaining comparisons, because the hash table
>> >> > >> is sized to have one tuple per bucket.
>> >> > >>
>> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter
>> >> > >> implementation possible. I'm sure it could be made faster.
>> >> > >>
>> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the
>> >> > >> effort - I agree that seems quite complex. But perhaps this suggest we
>> >> > >> should not be doing binary search and instead just build a simple hash
>> >> > >> table - that seems much simpler, and it'll probably give us about the
>> >> > >> same benefits.
>> >> > >
>> >> > >That's actually what I originally thought about doing, but I chose
>> >> > >binary search since it seemed a lot easier to get off the ground.
>> >> > >
>> >> >
>> >> > OK, that makes perfect sense.
>> >> >
>> >> > >If we instead build a hash is there anything else we need to be
>> >> > >concerned about? For example, work mem? I suppose for the binary
>> >> > >search we already have to expand the array, so perhaps it's not all
>> >> > >that meaningful relative to that...
>> >> > >
>> >> >
>> >> > I don't think we need to be particularly concerned about work_mem. We
>> >> > don't care about it now, and it's not clear to me what we could do about
>> >> > it - we already have the array in memory anyway, so it's a bit futile.
>> >> > Furthermore, if we need to care about it, it probably applies to the
>> >> > binary search too.
>> >> >
>> >> > >I was looking earlier at what our standard hash implementation was,
>> >> > >and it seemed less obvious what was needed to set that up (so binary
>> >> > >search seemed a faster proof of concept). If you happen to have any
>> >> > >pointers to similar usages I should look at, please let me know.
>> >> > >
>> >> >
>> >> > I think the hash join implementation is far too complicated. It has to
>> >> > care about work_mem, so it implements batching, etc. That's a lot of
>> >> > complexity we don't need here. IMO we could use either the usual
>> >> > dynahash, or maybe even the simpler simplehash.
>> >> >
>> >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
>> >> > benchmarks makes sense and running it independently. I'm not aware of
>> >> > any issues but it was done late at night and only ran on my laptop.
>> >>
>> >> Some quick calculations (don't have the scripting in a form I can
>> >> attach yet; using this as an opportunity to hack on a genericized
>> >> performance testing framework of sorts) suggest your results are
>> >> correct. I was also testing on my laptop, but I showed 1.) roughly
>> >> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
>> >> but when I switch to (short; average 3 characters long) text values I
>> >> show the hash join on VALUES is twice as fast as the binary search.
>> >>
>> >> Given that, I'm planning to implement this as a hash lookup and share
>> >> a revised patch.
>> >
>> >I've attached a patch series as before, but with an additional patch
>> >that switches to using dynahash instead of binary search.
>> >
>>
>> OK. I can't take closer look at the moment, I'll check later.
>>
>> Any particular reasons to pick dynahash over simplehash? ISTM we're
>> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
>> while there are not many places using dynahash for simple short-lived
>> hash tables. Of course, that alone is a weak reason to insist on using
>> simplehash here, but I suppose there were reasons for not using dynahash
>> and we'll end up facing the same issues here.
>
>No particular reason; it wasn't clear to me that there was a reason to
>prefer one or the other (and I'm not acquainted with the codebase
>enough to know the differences), so I chose dynahash because it was
>easier to find examples to follow for implementation.
>

OK, understood.

>> >Whereas before the benchmarking ended up with a trimodal distribution
>> >(i.e., master with IN <list>, patch with IN <list>, and either with IN
>> >VALUES), the hash implementation brings us back to an effectively
>> >bimodal distribution -- though the hash scalar array op expression
>> >implementation for text is about 5% faster than the hash join.
>> >
>>
>> Nice. I'm not surprised this is a bit faster than hash join, which has
>> to worry about additional stuff.
>>
>> >Current outstanding thoughts (besides comment/naming cleanup):
>> >
>> >- The saop costing needs to be updated to match, as Tomas pointed out.
>> >
>> >- Should we be concerned about single execution cases? For example, is
>> >the regression of speed on a simple SELECT x IN something we should
>> >try to defeat by only kicking in the optimization if we execute in a
>> >loop at least twice? That might be of particular interest to pl/pgsql.
>> >
>>
>> I don't follow. How is this related to the number of executions and/or
>> plpgsql? I suspect you might be talking about prepared statements, but
>> surely the hash table is built for each execution anyway, even if the
>> plan is reused, right?
>>
>> I think the issue we've been talking about is considering the number of
>> lookups we expect to do in the array/hash table. But that has nothing to
>> do with plpgsql and/or multiple executions ...
>
>Suppose I do "SELECT 1 IN <100 item list>" (whether just as a
>standalone query or in pl/pgsql). Then it doesn't make sense to use
>the optimization, because it can't possibly win over a naive linear
>scan through the array since to build the hash we have to do that
>linear scan anyway (I suppose in theory the hashing function could be
>dramatically faster than the equality op, so maybe it could win
>overall, but it seems unlikely to me).

True. I'm sure we could construct such cases, but this is hardly the
only place where that would be an issue. We could create some rough
costing model and make a decision based on that.

>I'm not so concerned about this in any query where we have a real FROM
>clause because even if we end up with only one row, the relative
>penalty is low, and the potential gain is very high. But simple
>expressions in pl/pgsql, for example, are a case where we can know for
>certain (correct me if I've wrong on this) that we'll only execute the
>expression once, which means there's probably always a penalty for
>choosing the implementation with setup costs over the default linear
>scan through the array.
>

What do you mean by "simple expressions"? I'm not a plpgsql expert and I
see it mostly as a way to glue together SQL queries, but yeah - if we
know a given ScalarArrayOpExpr will only be executed once, then we can
disable this optimization for now.

>> >- Should we have a test for an operator with a non-strict function?
>> >I'm not aware of any built-in ops that have that characteristic;
>> >would you suggest just creating a fake one for the test?
>> >
>>
>> Dunno, I haven't thought about this very much. In general I think the
>> new code should simply behave the same as the current code, i.e. if
>> it does not check for strictness, we don't need either.
>
>Both the original implementation and this optimized version have the
>following short circuit condition:
>
>if (fcinfo->args[0].isnull && strictfunc)
>
>But I'm not sure there are any existing tests to show that the "&&
>strictfunc" is required.
>

Ah! You mean a regression test, not a test in the "if condition" sense.
I don't see a reason not to have such a test, although it's probably not
something we should require from this patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Pavel Stehule
Date:


On Tue, Apr 28, 2020 at 15:26, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Apr 28, 2020 at 08:39:18AM -0400, James Coleman wrote:
>On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
>> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman <jtc331@gmail.com> wrote:
>> >>
>> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
>> >> <tomas.vondra@2ndquadrant.com> wrote:
>> >> >
>> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
>> >> > ><tomas.vondra@2ndquadrant.com> wrote:
>> >> > >>
>> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> >> > >> >>
>> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
>> >> > >> >> > think ran into mostly the same challenge of deciding when the bloom
>> >> > >> >> > filter can be useful and is worth the extra work.
>> >> > >> >>
>> >> > >> >> Speaking of that, it would be interesting to see how a test where you
>> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> >> > >> >> be interesting to know if the planner is able to make a more suitable
>> >> > >> >> choice and also to see how all the work over the years to improve Hash
>> >> > >> >> Joins compares to the bsearch with and without the bloom filter.
>> >> > >> >
>> >> > >> >It would be interesting.
>> >> > >> >
>> >> > >> >It also makes one wonder about optimizing these into to hash
>> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
>> >> > >> >significant effort though.
>> >> > >> >
>> >> > >>
>> >> > >> I modified the script to also do the join version of the query. I can
>> >> > >> only run it on my laptop at the moment, so the results may be a bit
>> >> > >> different from those I shared before, but it's interesting I think.
>> >> > >>
>> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in
>> >> > >> some cases it actually beats them quite significantly. It seems to
>> >> > >> depend on how expensive the comparison is - for "int" the comparison is
>> >> > >> very cheap and there's almost no difference. For "text" the comparison
>> >> > >> is much more expensive, and there are significant speedups.
>> >> > >>
>> >> > >> For example the test with 100k lookups in array of 10k elements and 10%
>> >> > >> match probability, the timings are these
>> >> > >>
>> >> > >>    master:  62362 ms
>> >> > >>    binsearch: 201 ms
>> >> > >>    bloom:      65 ms
>> >> > >>    hashjoin:   36 ms
>> >> > >>
>> >> > >> I do think the explanation is fairly simple - the bloom filter
>> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> >> > >> some overhead to build and check the bits. The hash join probably
>> >> > >> eliminates a lot of the remaining comparisons, because the hash table
>> >> > >> is sized to have one tuple per bucket.
>> >> > >>
>> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter
>> >> > >> implementation possible. I'm sure it could be made faster.
>> >> > >>
>> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the
>> >> > >> effort - I agree that seems quite complex. But perhaps this suggest we
>> >> > >> should not be doing binary search and instead just build a simple hash
>> >> > >> table - that seems much simpler, and it'll probably give us about the
>> >> > >> same benefits.
>> >> > >
>> >> > >That's actually what I originally thought about doing, but I chose
>> >> > >binary search since it seemed a lot easier to get off the ground.
>> >> > >
>> >> >
>> >> > OK, that makes perfect sense.
>> >> >
>> >> > >If we instead build a hash is there anything else we need to be
>> >> > >concerned about? For example, work mem? I suppose for the binary
>> >> > >search we already have to expand the array, so perhaps it's not all
>> >> > >that meaningful relative to that...
>> >> > >
>> >> >
>> >> > I don't think we need to be particularly concerned about work_mem. We
>> >> > don't care about it now, and it's not clear to me what we could do about
>> >> > it - we already have the array in memory anyway, so it's a bit futile.
>> >> > Furthermore, if we need to care about it, it probably applies to the
>> >> > binary search too.
>> >> >
>> >> > >I was looking earlier at what our standard hash implementation was,
>> >> > >and it seemed less obvious what was needed to set that up (so binary
>> >> > >search seemed a faster proof of concept). If you happen to have any
>> >> > >pointers to similar usages I should look at, please let me know.
>> >> > >
>> >> >
>> >> > I think the hash join implementation is far too complicated. It has to
>> >> > care about work_mem, so it implements batching, etc. That's a lot of
>> >> > complexity we don't need here. IMO we could use either the usual
>> >> > dynahash, or maybe even the simpler simplehash.
>> >> >
>> >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
>> >> > benchmarks makes sense and running it independently. I'm not aware of
>> >> > any issues but it was done late at night and only ran on my laptop.
>> >>
>> >> Some quick calculations (don't have the scripting in a form I can
>> >> attach yet; using this as an opportunity to hack on a genericized
>> >> performance testing framework of sorts) suggest your results are
>> >> correct. I was also testing on my laptop, but I showed 1.) roughly
>> >> equivalent results for IN (VALUES ...) and IN (<list>) for integers,
>> >> but when I switch to (short; average 3 characters long) text values I
>> >> show the hash join on VALUES is twice as fast as the binary search.
>> >>
>> >> Given that, I'm planning to implement this as a hash lookup and share
>> >> a revised patch.
>> >
>> >I've attached a patch series as before, but with an additional patch
>> >that switches to using dynahash instead of binary search.
>> >
>>
>> OK. I can't take closer look at the moment, I'll check later.
>>
>> Any particular reasons to pick dynahash over simplehash? ISTM we're
>> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
>> while there are not many places using dynahash for simple short-lived
>> hash tables. Of course, that alone is a weak reason to insist on using
>> simplehash here, but I suppose there were reasons for not using dynahash
>> and we'll end up facing the same issues here.
>
>No particular reason; it wasn't clear to me that there was a reason to
>prefer one or the other (and I'm not acquainted with the codebase
>enough to know the differences), so I chose dynahash because it was
>easier to find examples to follow for implementation.
>

OK, understood.

>> >Whereas before the benchmarking ended up with a trimodal distribution
>> >(i.e., master with IN <list>, patch with IN <list>, and either with IN
>> >VALUES), the hash implementation brings us back to an effectively
>> >bimodal distribution -- though the hash scalar array op expression
>> >implementation for text is about 5% faster than the hash join.
>> >
>>
>> Nice. I'm not surprised this is a bit faster than hash join, which has
>> to worry about additional stuff.
>>
>> >Current outstanding thoughts (besides comment/naming cleanup):
>> >
>> >- The saop costing needs to be updated to match, as Tomas pointed out.
>> >
>> >- Should we be concerned about single execution cases? For example, is
>> >the regression of speed on a simple SELECT x IN something we should
>> >try to defeat by only kicking in the optimization if we execute in a
>> >loop at least twice? That might be of particular interest to pl/pgsql.
>> >
>>
>> I don't follow. How is this related to the number of executions and/or
>> plpgsql? I suspect you might be talking about prepared statements, but
>> surely the hash table is built for each execution anyway, even if the
>> plan is reused, right?
>>
>> I think the issue we've been talking about is considering the number of
>> lookups we expect to do in the array/hash table. But that has nothing to
>> do with plpgsql and/or multiple executions ...
>
>Suppose I do "SELECT 1 IN <100 item list>" (whether just as a
>standalone query or in pl/pgsql). Then it doesn't make sense to use
>the optimization, because it can't possibly win over a naive linear
>scan through the array since to build the hash we have to do that
>linear scan anyway (I suppose in theory the hashing function could be
>dramatically faster than the equality op, so maybe it could win
>overall, but it seems unlikely to me).

True. I'm sure we could construct such cases, but this is hardly the
only place where that would be an issue. We could create some rough
costing model and make a decision based on that.

>I'm not so concerned about this in any query where we have a real FROM
>clause because even if we end up with only one row, the relative
>penalty is low, and the potential gain is very high. But simple
>expressions in pl/pgsql, for example, are a case where we can know for
>certain (correct me if I've wrong on this) that we'll only execute the
>expression once, which means there's probably always a penalty for
>choosing the implementation with setup costs over the default linear
>scan through the array.
>

What do you mean by "simple expressions"? I'm not plpgsql expert and I
see it mostly as a way to glue together SQL queries, but yeah - if we
know a given ScalarArrayOpExpr will only be executed once, then we can
disable this optimization for now.

a := a + 1

is translated to

SELECT $1 + 1, and the result is saved into the variable a.

Queries like "SELECT $1 + 1" are simple expressions. They are evaluated directly at the executor level - they skip SPI.

A simple expression has no FROM clause and has to return just one row. I am not sure if it is strictly required, but I think it also has to return just one column.

I am not sure whether the executor knows that an expression is being executed as a simple expression. But it can probably be deduced from context.

Pavel



>> >- Should we have a test for an operator with a non-strict function?
>> >I'm not aware of any built-in ops that have that characteristic;
>> >would you suggest just creating a fake one for the test?
>> >
>>
>> Dunno, I haven't thought about this very much. In general I think the
>> new code should simply behave the same as the current code, i.e. if
>> it does not check for strictness, we don't need either.
>
>Both the original implementation and this optimized version have the
>following short circuit condition:
>
>if (fcinfo->args[0].isnull && strictfunc)
>
>But I'm not sure there are any existing tests to show that the "&&
>strictfunc" is required.
>

Ah! You mean a regression test, not a test in the "if condition" sense.
I don't see a reason not to have such test, although it's probably not
something we should require from this patch.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>napsal:
>
>> ...
>>
>> >I'm not so concerned about this in any query where we have a real FROM
>> >clause because even if we end up with only one row, the relative
>> >penalty is low, and the potential gain is very high. But simple
>> >expressions in pl/pgsql, for example, are a case where we can know for
>> >certain (correct me if I've wrong on this) that we'll only execute the
>> >expression once, which means there's probably always a penalty for
>> >choosing the implementation with setup costs over the default linear
>> >scan through the array.
>> >
>>
>> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> see it mostly as a way to glue together SQL queries, but yeah - if we
>> know a given ScalarArrayOpExpr will only be executed once, then we can
>> disable this optimization for now.
>>
>
>a := a + 1
>
>is translated to
>
>SELECT $1 + 1 and save result to var a
>
>The queries like this "SELECT $1 + 1" are simple expressions. They are
>evaluated just on executor level - it skip SPI
>
>the simple expression has not FROM clause, and have to return just one row.
>I am not sure if it is required, it has to return just one column.
>
>I am not sure if executor knows so expression is executed as simply
>expressions. But probably it can be deduced from context
>

Not sure. The executor state is created by exec_eval_simple_expr, which
calls ExecInitExprWithParams (and it's the only caller). And that in
turn is the only place that leaves (state->parent == NULL). So maybe
that's a way to identify simple (standalone) expressions? Otherwise we
might add a new EEO_FLAG_* to identify these expressions explicitly.
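
For illustration only, the check during expression compilation might end up
looking something like this (EEO_FLAG_STANDALONE_EXPR, EEOP_HASHED_SCALARARRAYOP
and useHashedLookup are made-up names used purely to sketch the idea, not
anything from the patch):

    /* while building the ScalarArrayOpExpr step in ExecInitExprRec() */
    bool    standalone;

    /* exec_eval_simple_expr() is the only caller that leaves parent == NULL */
    standalone = (state->parent == NULL) ||
                 (state->flags & EEO_FLAG_STANDALONE_EXPR) != 0;

    if (useHashedLookup && !standalone)
        scratch.opcode = EEOP_HASHED_SCALARARRAYOP;
    else
        scratch.opcode = EEOP_SCALARARRAYOP;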

I wonder if it would be possible to identify cases when the expression
is executed in a loop, e.g. like this:

     FOR i IN 1..1000 LOOP
         x := y IN (1, 2, ..., 999);
     END LOOP;

in which case we only build the ScalarArrayOpExpr once, so maybe we
could keep the hash table for all executions. But maybe that's not
possible or maybe it's pointless for other reasons. It sure looks a bit
like trying to build a query engine from FOR loop.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Pavel Stehule
Date:


On Tue, Apr 28, 2020 at 4:48 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>napsal:
>
>> ...
>>
>> >I'm not so concerned about this in any query where we have a real FROM
>> >clause because even if we end up with only one row, the relative
>> >penalty is low, and the potential gain is very high. But simple
>> >expressions in pl/pgsql, for example, are a case where we can know for
>> >certain (correct me if I've wrong on this) that we'll only execute the
>> >expression once, which means there's probably always a penalty for
>> >choosing the implementation with setup costs over the default linear
>> >scan through the array.
>> >
>>
>> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> see it mostly as a way to glue together SQL queries, but yeah - if we
>> know a given ScalarArrayOpExpr will only be executed once, then we can
>> disable this optimization for now.
>>
>
>a := a + 1
>
>is translated to
>
>SELECT $1 + 1 and save result to var a
>
>The queries like this "SELECT $1 + 1" are simple expressions. They are
>evaluated just on executor level - it skip SPI
>
>the simple expression has not FROM clause, and have to return just one row.
>I am not sure if it is required, it has to return just one column.
>
>I am not sure if executor knows so expression is executed as simply
>expressions. But probably it can be deduced from context
>

Not sure. The executor state is created by exec_eval_simple_expr, which
calls ExecInitExprWithParams (and it's the only caller). And that in
turn is the only place that leaves (state->parent == NULL). So maybe
that's a way to identify simple (standalone) expressions? Otherwise we
might add a new EEO_FLAG_* to identify these expressions explicitly.

I wonder if it would be possible to identify cases when the expression
is executed in a loop, e.g. like this:

     FOR i IN 1..1000 LOOP
         x := y IN (1, 2, ..., 999);
     END LOOP;

in which case we only build the ScalarArrayOpExpr once, so maybe we
could keep the hash table for all executions. But maybe that's not
possible or maybe it's pointless for other reasons. It sure looks a bit
like trying to build a query engine from FOR loop.

Theoretically it is possible, but not now. And I don't think it is necessary: I cannot remember this pattern in any plpgsql code, and I have never seen any request for this feature.

I don't think this is a task for the plpgsql engine. Maybe for the JIT someday.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>
>
> út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
>>
>> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> >napsal:
>> >
>> >> ...
>> >>
>> >> >I'm not so concerned about this in any query where we have a real FROM
>> >> >clause because even if we end up with only one row, the relative
>> >> >penalty is low, and the potential gain is very high. But simple
>> >> >expressions in pl/pgsql, for example, are a case where we can know for
>> >> >certain (correct me if I've wrong on this) that we'll only execute the
>> >> >expression once, which means there's probably always a penalty for
>> >> >choosing the implementation with setup costs over the default linear
>> >> >scan through the array.
>> >> >
>> >>
>> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> >> see it mostly as a way to glue together SQL queries, but yeah - if we
>> >> know a given ScalarArrayOpExpr will only be executed once, then we can
>> >> disable this optimization for now.
>> >>
>> >
>> >a := a + 1
>> >
>> >is translated to
>> >
>> >SELECT $1 + 1 and save result to var a
>> >
>> >The queries like this "SELECT $1 + 1" are simple expressions. They are
>> >evaluated just on executor level - it skip SPI
>> >
>> >the simple expression has not FROM clause, and have to return just one row.
>> >I am not sure if it is required, it has to return just one column.

Yes, this is what I meant by simple expressions.

>> >I am not sure if executor knows so expression is executed as simply
>> >expressions. But probably it can be deduced from context
>> >
>>
>> Not sure. The executor state is created by exec_eval_simple_expr, which
>> calls ExecInitExprWithParams (and it's the only caller). And that in
>> turn is the only place that leaves (state->parent == NULL). So maybe
>> that's a way to identify simple (standalone) expressions? Otherwise we
>> might add a new EEO_FLAG_* to identify these expressions explicitly.

I'll look into doing one of these.

>> I wonder if it would be possible to identify cases when the expression
>> is executed in a loop, e.g. like this:
>>
>>      FOR i IN 1..1000 LOOP
>>          x := y IN (1, 2, ..., 999);
>>      END LOOP;
>>
>> in which case we only build the ScalarArrayOpExpr once, so maybe we
>> could keep the hash table for all executions. But maybe that's not
>> possible or maybe it's pointless for other reasons. It sure looks a bit
>> like trying to build a query engine from FOR loop.
>
>
> Theoretically it is possible, not now. But I don't think so it is necessary. I cannot to remember this pattern in any plpgsql code and I never seen any request on this feature.
>
> I don't think so this is task for plpgsql engine. Maybe for JIT sometimes.

Agreed. I'd thought about this kind of scenario when I brought this
up, but I think solving it would be the responsibility of the pl/pgsql
compiler rather than the expression evaluation code, because it'd have
to recognize the situation and set up a shared expression evaluation
context to be reused each time through the loop.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Pavel Stehule
Date:


On Tue, Apr 28, 2020 at 6:17 PM James Coleman <jtc331@gmail.com> wrote:
On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>
>
> út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
>>
>> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> >napsal:
>> >
>> >> ...
>> >>
>> >> >I'm not so concerned about this in any query where we have a real FROM
>> >> >clause because even if we end up with only one row, the relative
>> >> >penalty is low, and the potential gain is very high. But simple
>> >> >expressions in pl/pgsql, for example, are a case where we can know for
>> >> >certain (correct me if I've wrong on this) that we'll only execute the
>> >> >expression once, which means there's probably always a penalty for
>> >> >choosing the implementation with setup costs over the default linear
>> >> >scan through the array.
>> >> >
>> >>
>> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> >> see it mostly as a way to glue together SQL queries, but yeah - if we
>> >> know a given ScalarArrayOpExpr will only be executed once, then we can
>> >> disable this optimization for now.
>> >>
>> >
>> >a := a + 1
>> >
>> >is translated to
>> >
>> >SELECT $1 + 1 and save result to var a
>> >
>> >The queries like this "SELECT $1 + 1" are simple expressions. They are
>> >evaluated just on executor level - it skip SPI
>> >
>> >the simple expression has not FROM clause, and have to return just one row.
>> >I am not sure if it is required, it has to return just one column.

Yes, this is what I meant by simple expressions.

>> >I am not sure if executor knows so expression is executed as simply
>> >expressions. But probably it can be deduced from context
>> >
>>
>> Not sure. The executor state is created by exec_eval_simple_expr, which
>> calls ExecInitExprWithParams (and it's the only caller). And that in
>> turn is the only place that leaves (state->parent == NULL). So maybe
>> that's a way to identify simple (standalone) expressions? Otherwise we
>> might add a new EEO_FLAG_* to identify these expressions explicitly.

I'll look into doing one of these.

>> I wonder if it would be possible to identify cases when the expression
>> is executed in a loop, e.g. like this:
>>
>>      FOR i IN 1..1000 LOOP
>>          x := y IN (1, 2, ..., 999);
>>      END LOOP;
>>
>> in which case we only build the ScalarArrayOpExpr once, so maybe we
>> could keep the hash table for all executions. But maybe that's not
>> possible or maybe it's pointless for other reasons. It sure looks a bit
>> like trying to build a query engine from FOR loop.
>
>
> Theoretically it is possible, not now. But I don't think so it is necessary. I cannot to remember this pattern in any plpgsql code and I never seen any request on this feature.
>
> I don't think so this is task for plpgsql engine. Maybe for JIT sometimes.

Agreed. I'd thought about this kind of scenario when I brought this
up, but I think solving it would the responsibility of the pg/pgsql
compiler rather than the expression evaluation code, because it'd have
to recognize the situation and setup a shared expression evaluation
context to be reused each time through the loop.

It would be nice if the new implementation was not slower than the old one in any environment or context (including plpgsql expressions).

Regards

Pavel


James

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Tue, Apr 28, 2020 at 1:40 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>
>
> út 28. 4. 2020 v 18:17 odesílatel James Coleman <jtc331@gmail.com> napsal:
>>
>> On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> >
>> >
>> >
>> > út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com> napsal:
>> >>
>> >> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>> >> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> >> >napsal:
>> >> >
>> >> >> ...
>> >> >>
>> >> >> >I'm not so concerned about this in any query where we have a real FROM
>> >> >> >clause because even if we end up with only one row, the relative
>> >> >> >penalty is low, and the potential gain is very high. But simple
>> >> >> >expressions in pl/pgsql, for example, are a case where we can know for
>> >> >> >certain (correct me if I've wrong on this) that we'll only execute the
>> >> >> >expression once, which means there's probably always a penalty for
>> >> >> >choosing the implementation with setup costs over the default linear
>> >> >> >scan through the array.
>> >> >> >
>> >> >>
>> >> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> >> >> see it mostly as a way to glue together SQL queries, but yeah - if we
>> >> >> know a given ScalarArrayOpExpr will only be executed once, then we can
>> >> >> disable this optimization for now.
>> >> >>
>> >> >
>> >> >a := a + 1
>> >> >
>> >> >is translated to
>> >> >
>> >> >SELECT $1 + 1 and save result to var a
>> >> >
>> >> >The queries like this "SELECT $1 + 1" are simple expressions. They are
>> >> >evaluated just on executor level - it skip SPI
>> >> >
>> >> >the simple expression has not FROM clause, and have to return just one row.
>> >> >I am not sure if it is required, it has to return just one column.
>>
>> Yes, this is what I meant by simple expressions.
>>
>> >> >I am not sure if executor knows so expression is executed as simply
>> >> >expressions. But probably it can be deduced from context
>> >> >
>> >>
>> >> Not sure. The executor state is created by exec_eval_simple_expr, which
>> >> calls ExecInitExprWithParams (and it's the only caller). And that in
>> >> turn is the only place that leaves (state->parent == NULL). So maybe
>> >> that's a way to identify simple (standalone) expressions? Otherwise we
>> >> might add a new EEO_FLAG_* to identify these expressions explicitly.
>>
>> I'll look into doing one of these.
>>
>> >> I wonder if it would be possible to identify cases when the expression
>> >> is executed in a loop, e.g. like this:
>> >>
>> >>      FOR i IN 1..1000 LOOP
>> >>          x := y IN (1, 2, ..., 999);
>> >>      END LOOP;
>> >>
>> >> in which case we only build the ScalarArrayOpExpr once, so maybe we
>> >> could keep the hash table for all executions. But maybe that's not
>> >> possible or maybe it's pointless for other reasons. It sure looks a bit
>> >> like trying to build a query engine from FOR loop.
>> >
>> >
>> > Theoretically it is possible, not now. But I don't think so it is necessary. I cannot to remember this pattern in any plpgsql code and I never seen any request on this feature.
>> >
>> > I don't think so this is task for plpgsql engine. Maybe for JIT sometimes.
>>
>> Agreed. I'd thought about this kind of scenario when I brought this
>> up, but I think solving it would the responsibility of the pg/pgsql
>> compiler rather than the expression evaluation code, because it'd have
>> to recognize the situation and setup a shared expression evaluation
>> context to be reused each time through the loop.
>
>
> can be nice if new implementation was not slower then older in all environments and context (including plpgsql expressions)

Agreed, which is why I'm going to look into preventing using the new
code path for simple expressions.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
I cc'd Andres given his commit introduced simplehash, so I figured
he'd probably have a few pointers on when each one might be useful.

On Tue, Apr 28, 2020 at 8:39 AM James Coleman <jtc331@gmail.com> wrote:
...
> > Any particular reasons to pick dynahash over simplehash? ISTM we're
> > using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> > while there are not many places using dynahash for simple short-lived
> > hash tables. Of course, that alone is a weak reason to insist on using
> > simplehash here, but I suppose there were reasons for not using dynahash
> > and we'll end up facing the same issues here.
>
> No particular reason; it wasn't clear to me that there was a reason to
> prefer one or the other (and I'm not acquainted with the codebase
> enough to know the differences), so I chose dynahash because it was
> easier to find examples to follow for implementation.

Do you have any thoughts on what the trade-offs/use-cases etc. are for
dynahash versus simple hash?

From reading the commit message in b30d3ea824c it seems like simple
hash is faster and optimized for CPU cache benefits. The comments at
the top of simplehash.h also discourage its use in non-performance/space-sensitive
uses, but there isn't anything I can see
that explicitly tries to discuss when dynahash is useful, etc.

Given the performance notes in that commit message, I'm thinking
switching to simple hash is worth it.

But I also wonder if there might be some value in a README or comments
addition that would be a guide to what the various hash
implementations are useful for. If there's interest, I could try to
type something short up so that we have something to make the code
base a bit more discoverable.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Tue, Apr 28, 2020 at 06:22:20PM -0400, James Coleman wrote:
>I cc'd Andres given his commit introduced simplehash, so I figured
>he'd probably have a few pointers on when each one might be useful.
>
>On Tue, Apr 28, 2020 at 8:39 AM James Coleman <jtc331@gmail.com> wrote:
>...
>> > Any particular reasons to pick dynahash over simplehash? ISTM we're
>> > using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
>> > while there are not many places using dynahash for simple short-lived
>> > hash tables. Of course, that alone is a weak reason to insist on using
>> > simplehash here, but I suppose there were reasons for not using dynahash
>> > and we'll end up facing the same issues here.
>>
>> No particular reason; it wasn't clear to me that there was a reason to
>> prefer one or the other (and I'm not acquainted with the codebase
>> enough to know the differences), so I chose dynahash because it was
>> easier to find examples to follow for implementation.
>
>Do you have any thoughts on what the trade-offs/use-cases etc. are for
>dynahash versus simple hash?
>
>From reading the commit message in b30d3ea824c it seems like simple
>hash is faster and optimized for CPU cache benefits. The comments at
>the top of simplehash.h also discourage it's use in non
>performance/space sensitive uses, but there isn't anything I can see
>that explicitly tries to discuss when dynahash is useful, etc.
>
>Given the performance notes in that commit message, I thinking
>switching to simple hash is worth it.
>

I recall doing some benchmarks for that patch, but it's been so long that I don't
really remember the details. But in general, I agree simplehash is a bit
more efficient in terms of CPU / caching.

I think the changes required to switch from dynahash to simplehash are
fairly small, so I think the best thing we can do is just try to do some
measurements and then decide.

>But I also wonder if there might be some value in a README or comments
>addition that would be a guide to what the various hash
>implementations are useful for. If there's interest, I could try to
>type something short up so that we have something to make the code
>base a bit more discoverable.
>

I wouldn't object to that. Although maybe we should simply add some
basic recommendations to the comments in dynahash/simplehash.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Andres Freund
Date:
Hi,

On 2020-04-28 18:22:20 -0400, James Coleman wrote:
> I cc'd Andres given his commit introduced simplehash, so I figured
> he'd probably have a few pointers on when each one might be useful.
> [...]
> Do you have any thoughts on what the trade-offs/use-cases etc. are for
> dynahash versus simple hash?
> 
> From reading the commit message in b30d3ea824c it seems like simple
> hash is faster and optimized for CPU cache benefits. The comments at
> the top of simplehash.h also discourage it's use in non
> performance/space sensitive uses, but there isn't anything I can see
> that explicitly tries to discuss when dynahash is useful, etc.

Benefits of dynahash (chained hashtable):
- supports partitioning, useful for shared memory accessed under locks
- better performance for large entries, as they don't need to be moved
  around in case of hash conflicts
- stable pointers to hash table entries

Benefits of simplehash (open addressing hash table):
- no indirect function calls, known structure sizes, due to "templated"
  code generation (these show up substantially in profiles for dynahash)
- considerably faster for small entries due to previous point, and due
  open addressing hash tables having better cache behaviour than chained
  hashtables
- once set-up the interface is type safe and easier to use
- no overhead of a separate memory context etc
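
For reference, the "templating" on the usage side looks roughly like the
following (the names and the hash/equality helpers are only an illustration
of the pattern, not anything from the patch):

    /* simplehash requires the element type to carry a "status" member */
    typedef struct saophash_entry
    {
        Datum       key;
        char        status;
    } saophash_entry;

    #define SH_PREFIX saophash
    #define SH_ELEMENT_TYPE saophash_entry
    #define SH_KEY_TYPE Datum
    #define SH_KEY key
    #define SH_HASH_KEY(tb, key) saop_hash_datum(tb, key)   /* caller-supplied */
    #define SH_EQUAL(tb, a, b) saop_datum_equal(tb, a, b)   /* caller-supplied */
    #define SH_SCOPE static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

    /* this generates saophash_create(), saophash_insert(), saophash_lookup(), ... */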


> Given the performance notes in that commit message, I thinking
> switching to simple hash is worth it.

Seems plausible to me.


> But I also wonder if there might be some value in a README or comments
> addition that would be a guide to what the various hash
> implementations are useful for. If there's interest, I could try to
> type something short up so that we have something to make the code
> base a bit more discoverable.

That'd make sense to me.

Greetings,

Andres Freund



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-28 18:22:20 -0400, James Coleman wrote:
> > I cc'd Andres given his commit introduced simplehash, so I figured
> > he'd probably have a few pointers on when each one might be useful.
> > [...]
> > Do you have any thoughts on what the trade-offs/use-cases etc. are for
> > dynahash versus simple hash?
> >
> > From reading the commit message in b30d3ea824c it seems like simple
> > hash is faster and optimized for CPU cache benefits. The comments at
> > the top of simplehash.h also discourage it's use in non
> > performance/space sensitive uses, but there isn't anything I can see
> > that explicitly tries to discuss when dynahash is useful, etc.
>
> Benefits of dynahash (chained hashtable):
> - supports partitioning, useful for shared memory accessed under locks
> - better performance for large entries, as they don't need to be moved
>   around in case of hash conflicts
> - stable pointers to hash table entries
>
> Benefits of simplehash (open addressing hash table):
> - no indirect function calls, known structure sizes, due to "templated"
>   code generation (these show up substantially in profiles for dynahash)
> - considerably faster for small entries due to previous point, and due
>   open addressing hash tables having better cache behaviour than chained
>   hashtables
> - once set-up the interface is type safe and easier to use
> - no overhead of a separate memory context etc
>
>
> > Given the performance notes in that commit message, I thinking
> > switching to simple hash is worth it.
>
> Seems plausible to me.
>
>
> > But I also wonder if there might be some value in a README or comments
> > addition that would be a guide to what the various hash
> > implementations are useful for. If there's interest, I could try to
> > type something short up so that we have something to make the code
> > base a bit more discoverable.
>
> That'd make sense to me.

Cool, I'll work on that as I have time then.

One question: what is the reasoning behind having SH_STORE_HASH? The
only things I could imagine would be a case where you have external
pointers to some set of values or need to be able to use the hash for
other reasons besides the hash table (and so can avoid calculating it
twice), but maybe I'm missing something.

Thanks,
James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Wed, Apr 29, 2020 at 10:26:12AM -0400, James Coleman wrote:
>On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2020-04-28 18:22:20 -0400, James Coleman wrote:
>> > I cc'd Andres given his commit introduced simplehash, so I figured
>> > he'd probably have a few pointers on when each one might be useful.
>> > [...]
>> > Do you have any thoughts on what the trade-offs/use-cases etc. are for
>> > dynahash versus simple hash?
>> >
>> > From reading the commit message in b30d3ea824c it seems like simple
>> > hash is faster and optimized for CPU cache benefits. The comments at
>> > the top of simplehash.h also discourage it's use in non
>> > performance/space sensitive uses, but there isn't anything I can see
>> > that explicitly tries to discuss when dynahash is useful, etc.
>>
>> Benefits of dynahash (chained hashtable):
>> - supports partitioning, useful for shared memory accessed under locks
>> - better performance for large entries, as they don't need to be moved
>>   around in case of hash conflicts
>> - stable pointers to hash table entries
>>
>> Benefits of simplehash (open addressing hash table):
>> - no indirect function calls, known structure sizes, due to "templated"
>>   code generation (these show up substantially in profiles for dynahash)
>> - considerably faster for small entries due to previous point, and due
>>   open addressing hash tables having better cache behaviour than chained
>>   hashtables
>> - once set-up the interface is type safe and easier to use
>> - no overhead of a separate memory context etc
>>
>>
>> > Given the performance notes in that commit message, I thinking
>> > switching to simple hash is worth it.
>>
>> Seems plausible to me.
>>
>>
>> > But I also wonder if there might be some value in a README or comments
>> > addition that would be a guide to what the various hash
>> > implementations are useful for. If there's interest, I could try to
>> > type something short up so that we have something to make the code
>> > base a bit more discoverable.
>>
>> That'd make sense to me.
>
>Cool, I'll work on that as I have time then.
>
>One question: what is the reasoning behind having SH_STORE_HASH? The
>only things I could imagine would be a case where you have external
>pointers to some set of values or need to be able to use the hash for
>other reasons besides the hash table (and so can avoid calculating it
>twice), but maybe I'm missing something.
>

I believe it's because computing the hash may be fairly expensive for
some data types, in which case it may be better to just store it for
future use.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Wed, Apr 29, 2020 at 11:17 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Apr 29, 2020 at 10:26:12AM -0400, James Coleman wrote:
> >On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote:
> >>
> >> Hi,
> >>
> >> On 2020-04-28 18:22:20 -0400, James Coleman wrote:
> >> > I cc'd Andres given his commit introduced simplehash, so I figured
> >> > he'd probably have a few pointers on when each one might be useful.
> >> > [...]
> >> > Do you have any thoughts on what the trade-offs/use-cases etc. are for
> >> > dynahash versus simple hash?
> >> >
> >> > From reading the commit message in b30d3ea824c it seems like simple
> >> > hash is faster and optimized for CPU cache benefits. The comments at
> >> > the top of simplehash.h also discourage it's use in non
> >> > performance/space sensitive uses, but there isn't anything I can see
> >> > that explicitly tries to discuss when dynahash is useful, etc.
> >>
> >> Benefits of dynahash (chained hashtable):
> >> - supports partitioning, useful for shared memory accessed under locks
> >> - better performance for large entries, as they don't need to be moved
> >>   around in case of hash conflicts
> >> - stable pointers to hash table entries
> >>
> >> Benefits of simplehash (open addressing hash table):
> >> - no indirect function calls, known structure sizes, due to "templated"
> >>   code generation (these show up substantially in profiles for dynahash)
> >> - considerably faster for small entries due to previous point, and due
> >>   open addressing hash tables having better cache behaviour than chained
> >>   hashtables
> >> - once set-up the interface is type safe and easier to use
> >> - no overhead of a separate memory context etc
> >>
> >>
> >> > Given the performance notes in that commit message, I thinking
> >> > switching to simple hash is worth it.
> >>
> >> Seems plausible to me.
> >>
> >>
> >> > But I also wonder if there might be some value in a README or comments
> >> > addition that would be a guide to what the various hash
> >> > implementations are useful for. If there's interest, I could try to
> >> > type something short up so that we have something to make the code
> >> > base a bit more discoverable.
> >>
> >> That'd make sense to me.
> >
> >Cool, I'll work on that as I have time then.
> >
> >One question: what is the reasoning behind having SH_STORE_HASH? The
> >only things I could imagine would be a case where you have external
> >pointers to some set of values or need to be able to use the hash for
> >other reasons besides the hash table (and so can avoid calculating it
> >twice), but maybe I'm missing something.
> >
>
> I believe it's because computing the hash may be fairly expensive for
> some data types, in which case it may be better to just store it for
> future use.

But is it storing it for use primarily by the hash table
implementation (i.e., does it need the hash stored this way to avoid
repeated recalculation) or for caller's use?

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On Wed, Apr 29, 2020 at 11:34:24AM -0400, James Coleman wrote:
>On Wed, Apr 29, 2020 at 11:17 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Apr 29, 2020 at 10:26:12AM -0400, James Coleman wrote:
>> >On Tue, Apr 28, 2020 at 7:05 PM Andres Freund <andres@anarazel.de> wrote:
>> >>
>> >> Hi,
>> >>
>> >> On 2020-04-28 18:22:20 -0400, James Coleman wrote:
>> >> > I cc'd Andres given his commit introduced simplehash, so I figured
>> >> > he'd probably have a few pointers on when each one might be useful.
>> >> > [...]
>> >> > Do you have any thoughts on what the trade-offs/use-cases etc. are for
>> >> > dynahash versus simple hash?
>> >> >
>> >> > From reading the commit message in b30d3ea824c it seems like simple
>> >> > hash is faster and optimized for CPU cache benefits. The comments at
>> >> > the top of simplehash.h also discourage it's use in non
>> >> > performance/space sensitive uses, but there isn't anything I can see
>> >> > that explicitly tries to discuss when dynahash is useful, etc.
>> >>
>> >> Benefits of dynahash (chained hashtable):
>> >> - supports partitioning, useful for shared memory accessed under locks
>> >> - better performance for large entries, as they don't need to be moved
>> >>   around in case of hash conflicts
>> >> - stable pointers to hash table entries
>> >>
>> >> Benefits of simplehash (open addressing hash table):
>> >> - no indirect function calls, known structure sizes, due to "templated"
>> >>   code generation (these show up substantially in profiles for dynahash)
>> >> - considerably faster for small entries due to previous point, and due
>> >>   open addressing hash tables having better cache behaviour than chained
>> >>   hashtables
>> >> - once set-up the interface is type safe and easier to use
>> >> - no overhead of a separate memory context etc
>> >>
>> >>
>> >> > Given the performance notes in that commit message, I thinking
>> >> > switching to simple hash is worth it.
>> >>
>> >> Seems plausible to me.
>> >>
>> >>
>> >> > But I also wonder if there might be some value in a README or comments
>> >> > addition that would be a guide to what the various hash
>> >> > implementations are useful for. If there's interest, I could try to
>> >> > type something short up so that we have something to make the code
>> >> > base a bit more discoverable.
>> >>
>> >> That'd make sense to me.
>> >
>> >Cool, I'll work on that as I have time then.
>> >
>> >One question: what is the reasoning behind having SH_STORE_HASH? The
>> >only things I could imagine would be a case where you have external
>> >pointers to some set of values or need to be able to use the hash for
>> >other reasons besides the hash table (and so can avoid calculating it
>> >twice), but maybe I'm missing something.
>> >
>>
>> I believe it's because computing the hash may be fairly expensive for
>> some data types, in which case it may be better to just store it for
>> future use.
>
>But is it storing it for use primarily by the hash table
>implementation (i.e., does it need the hash stored this way to avoid
>repeated recalculation) or for caller's use?
>

For the hash table implementation, I think. Simplehash is using "robin
hood" hashing, which needs to decide which entry to move in case of
collision. AFAICS that requires knowing the hash value of both the
new and the existing entry, and having to compute those over and over would
be fairly expensive. But that's my understanding, I might be wrong and maybe
it's useful for external use too.
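
For completeness, opting into that is just a couple of extra "template"
parameters - a sketch only, assuming the entry struct carries a uint32 hash
member:

    #define SH_STORE_HASH
    #define SH_GET_HASH(tb, a) ((a)->hash)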

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
...
> Any particular reasons to pick dynahash over simplehash? ISTM we're
> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> while there are not many places using dynahash for simple short-lived
> hash tables. Of course, that alone is a weak reason to insist on using
> simplehash here, but I suppose there were reasons for not using dynahash
> and we'll end up facing the same issues here.

I've attached a patch series that includes switching to simplehash.
Obviously we'd really just collapse all of these patches, but it's
perhaps interesting to see the changes required to use each style
(binary search, dynahash, simplehash).

As before, there are clearly comments and naming things to be
addressed, but the implementation should be reasonably clean.

James

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Heikki Linnakangas
Date:
On 01/05/2020 05:20, James Coleman wrote:
> On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> ...
>> Any particular reasons to pick dynahash over simplehash? ISTM we're
>> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
>> while there are not many places using dynahash for simple short-lived
>> hash tables. Of course, that alone is a weak reason to insist on using
>> simplehash here, but I suppose there were reasons for not using dynahash
>> and we'll end up facing the same issues here.
> 
> I've attached a patch series that includes switching to simplehash.
> Obviously we'd really just collapse all of these patches, but it's
> perhaps interesting to see the changes required to use each style
> (binary search, dynahash, simplehash).
> 
> As before, there are clearly comments and naming things to be
> addressed, but the implementation should be reasonably clean.

Looks good, aside from the cleanup work that you mentioned. There are a 
few more cases that I think you could easily handle with very little 
extra code:

You could also apply the optimization for non-Const expressions, as long 
as they're stable (i.e. !contain_volatile_functions(expr)).

Deconstructing the array Datum into a simple C array on first call would 
be a win even for very small arrays and for AND semantics, even if you 
don't use a hash table.

- Heikki



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 01/05/2020 05:20, James Coleman wrote:
> > On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > ...
> >> Any particular reasons to pick dynahash over simplehash? ISTM we're
> >> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> >> while there are not many places using dynahash for simple short-lived
> >> hash tables. Of course, that alone is a weak reason to insist on using
> >> simplehash here, but I suppose there were reasons for not using dynahash
> >> and we'll end up facing the same issues here.
> >
> > I've attached a patch series that includes switching to simplehash.
> > Obviously we'd really just collapse all of these patches, but it's
> > perhaps interesting to see the changes required to use each style
> > (binary search, dynahash, simplehash).
> >
> > As before, there are clearly comments and naming things to be
> > addressed, but the implementation should be reasonably clean.
>
> Looks good, aside from the cleanup work that you mentioned. There are a
> few more cases that I think you could easily handle with very little
> extra code:
>
> You could also apply the optimization for non-Const expressions, as long
> as they're stable (i.e. !contain_volatile_functions(expr)).

Is that true? Don't we also have to worry about whether or not the
value is stable (i.e., know when a param has changed)? There have been
discussions about being able to cache stable subexpressions, and my
understanding was that we'd need to have that infrastructure (along
with the necessary invalidation when the param changes) to be able
to use this for non-Const expressions.

> Deconstructing the array Datum into a simple C array on first call would
> be a win even for very small arrays and for AND semantics, even if you
> don't use a hash table.

Because you wouldn't have to repeatedly detoast it? Or some other
reason I'm not thinking of? My intuition would have been that (aside
from detoasting if necessary) there wouldn't be much real overhead in,
for example, an array storing integers.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Heikki Linnakangas
Date:
On 08/09/2020 22:25, James Coleman wrote:
> On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> You could also apply the optimization for non-Const expressions, as long
>> as they're stable (i.e. !contain_volatile_functions(expr)).
> 
> Is that true? Don't we also have to worry about whether or not the
> value is stable (i.e., know when a param has changed)? There have been
> discussions about being able to cache stable subexpressions, and my
> understanding was that we'd need to have that infrastructure (along
> with the necessarily invalidation when the param changes) to be able
> to use this for non-Const expressions.

Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And 
Vars. I think the conditions are the same as those checked in 
match_clause_to_partition_key() in partprune.c (it's a long function, 
search for "if (!IsA(rightop, Const))"). Not sure it's worth the 
trouble, then. But it would be nice to handle queries like "WHERE column 
= ANY ($1)"

>> Deconstructing the array Datum into a simple C array on first call would
>> be a win even for very small arrays and for AND semantics, even if you
>> don't use a hash table.
> 
> Because you wouldn't have to repeatedly detoast it? Or some other
> reason I'm not thinking of? My intuition would have been that (aside
> from detoasting if necessary) there wouldn't be much real overhead in,
> for example, an array storing integers.

Dealing with NULLs and different element sizes in the array is pretty 
complicated. Looping through the array currently looks like this:

    /* Loop over the array elements */
    s = (char *) ARR_DATA_PTR(arr);
    bitmap = ARR_NULLBITMAP(arr);
    bitmask = 1;

    for (int i = 0; i < nitems; i++)
    {
        Datum        elt;
        Datum        thisresult;

        /* Get array element, checking for NULL */
        if (bitmap && (*bitmap & bitmask) == 0)
        {
            fcinfo->args[1].value = (Datum) 0;
            fcinfo->args[1].isnull = true;
        }
        else
        {
            elt = fetch_att(s, typbyval, typlen);
            s = att_addlength_pointer(s, typlen, s);
            s = (char *) att_align_nominal(s, typalign);
            fcinfo->args[1].value = elt;
            fcinfo->args[1].isnull = false;
        }

        [do stuff with Datum/isnull]

        /* advance bitmap pointer if any */
        if (bitmap)
        {
            bitmask <<= 1;
            if (bitmask == 0x100)
            {
                bitmap++;
                bitmask = 1;
            }
        }
    }

Compared with just:

    for (int i = 0; i < nitems; i++)
    {
        Datum        elt = datums[i];

        [do stuff with the Datum]
    }

I'm not sure how much difference that makes, but I presume it's not 
zero, and it seems like an easy win when you have the code to deal with 
the Datum array representation anyway.
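
The one-time flattening itself is already available as deconstruct_array();
roughly, on the first call you could do something like this and cache the
result (just a sketch, the variable names are made up):

    int16       typlen;
    bool        typbyval;
    char        typalign;
    Datum      *datums;
    bool       *nulls;
    int         nitems;

    /* DatumGetArrayTypeP() detoasts the array if necessary */
    ArrayType  *arr = DatumGetArrayTypeP(arraydatum);

    get_typlenbyvalalign(ARR_ELEMTYPE(arr), &typlen, &typbyval, &typalign);
    deconstruct_array(arr, ARR_ELEMTYPE(arr),
                      typlen, typbyval, typalign,
                      &datums, &nulls, &nitems);

    /* remember datums/nulls/nitems for subsequent evaluations */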

- Heikki



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 08/09/2020 22:25, James Coleman wrote:
> > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> You could also apply the optimization for non-Const expressions, as long
> >> as they're stable (i.e. !contain_volatile_functions(expr)).
> >
> > Is that true? Don't we also have to worry about whether or not the
> > value is stable (i.e., know when a param has changed)? There have been
> > discussions about being able to cache stable subexpressions, and my
> > understanding was that we'd need to have that infrastructure (along
> > with the necessarily invalidation when the param changes) to be able
> > to use this for non-Const expressions.
>
> Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> Vars. I think the conditions are the same as those checked in
> match_clause_to_partition_key() in partprune.c (it's a long function,
> search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> trouble, then. But it would be nice to handle queries like "WHERE column
> = ANY ($1)"

If I'm understanding properly you're imagining something in the form of:

with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
select * from t where i = any ((select * from x)::int[]);

I've been playing around with this and currently have these checks:

contain_var_clause((Node *) arrayarg)
contain_volatile_functions((Node *) arrayarg)
contain_exec_param((Node *) arrayarg, NIL)

(the last one I had to modify the function to handle empty lists)

If any are true, then we have to disable the optimization. But for
queries in the form above the test contain_exec_param((Node *)
arrayarg, NIL) evaluates to true, even though we know from looking at
the query that the array subexpression is stable for the length of the
query.
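
For clarity, the guard currently amounts to roughly the following (the helper
functions are the real ones listed above; the flag variable is just
illustrative):

    if (contain_var_clause((Node *) arrayarg) ||
        contain_volatile_functions((Node *) arrayarg) ||
        contain_exec_param((Node *) arrayarg, NIL))
    {
        /* array value may change between evaluations; keep the linear scan */
        use_hashed_saop = false;
    }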

Am I misunderstanding what you're going for? Or is there a way to
confirm that the expr, although it is an exec param, won't change?

Another interesting thing that this would imply is that we'd either have to:

1. Remove the array length check altogether,
2. Always use the hash when we have a non-Const, but when a Const only if
the array length check passes, or
3. Make this new expr op more fully featured by teaching it how to use
either a straight loop through a deconstructed array or use the hash.

That last option feeds into further discussion in the below:

> >> Deconstructing the array Datum into a simple C array on first call would
> >> be a win even for very small arrays and for AND semantics, even if you
> >> don't use a hash table.
> >
> > Because you wouldn't have to repeatedly detoast it? Or some other
> > reason I'm not thinking of? My intuition would have been that (aside
> > from detoasting if necessary) there wouldn't be much real overhead in,
> > for example, an array storing integers.
>
> Dealing with NULLs and different element sizes in the array is pretty
> complicated. Looping through the array currently looks like this:
>
>         /* Loop over the array elements */
>         s = (char *) ARR_DATA_PTR(arr);
>         bitmap = ARR_NULLBITMAP(arr);
>         bitmask = 1;
>
>         for (int i = 0; i < nitems; i++)
>         {
>                 Datum           elt;
>                 Datum           thisresult;
>
>                 /* Get array element, checking for NULL */
>                 if (bitmap && (*bitmap & bitmask) == 0)
>                 {
>                         fcinfo->args[1].value = (Datum) 0;
>                         fcinfo->args[1].isnull = true;
>                 }
>                 else
>                 {
>                         elt = fetch_att(s, typbyval, typlen);
>                         s = att_addlength_pointer(s, typlen, s);
>                         s = (char *) att_align_nominal(s, typalign);
>                         fcinfo->args[1].value = elt;
>                         fcinfo->args[1].isnull = false;
>                 }
>
>                 [do stuff with Datum/isnull]
>
>                 /* advance bitmap pointer if any */
>                 if (bitmap)
>                 {
>                         bitmask <<= 1;
>                         if (bitmask == 0x100)
>                         {
>                                 bitmap++;
>                                 bitmask = 1;
>                         }
>                 }
>         }
>
> Compared with just:
>
>         for (int i = 0; i < nitems; i++)
>         {
>                 Datum           elt = datums[i];
>
>                 [do stuff with the Datum]
>         }
>
> I'm not sure how much difference that makes, but I presume it's not
> zero, and it seems like an easy win when you have the code to deal with
> the Datum array representation anyway.

Doing this would necessitate option 3 above: we'd have to have this
new expr op be able both to use a hash or alternatively do a normal
loop.
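
(For reference, the "deconstructed array" version of that loop would be
something along these lines, using the existing deconstruct_array() helper.
This is only a sketch: caching the flattened arrays in the expression state
on first evaluation is elided, and typlen/typbyval/typalign are the same
per-element type properties already used in the current code:)

Datum      *elem_values;
bool       *elem_nulls;
int         nelems;

/* flatten the array Datum once, e.g. on first evaluation */
deconstruct_array(arr, ARR_ELEMTYPE(arr),
                  typlen, typbyval, typalign,
                  &elem_values, &elem_nulls, &nelems);

for (int i = 0; i < nelems; i++)
{
    fcinfo->args[1].value = elem_values[i];
    fcinfo->args[1].isnull = elem_nulls[i];

    /* [do stuff with Datum/isnull] */
}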

Being able to use this in more cases than just a Const array expr is
certainly interesting, but I'm not sure yet about the feasibility or
desirability of that at this point given the above restrictions.

One other point in favor of the additional complexity here is that
it's likely that the above described runtime switching between hash
and loop would be necessary (for this optimization to come into play)
if caching of stable subexpressions ever lands. I have some interest
in working on that...but it's also a large project.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Fri, Sep 11, 2020 at 5:11 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 08/09/2020 22:25, James Coleman wrote:
> > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> You could also apply the optimization for non-Const expressions, as long
> > >> as they're stable (i.e. !contain_volatile_functions(expr)).
> > >
> > > Is that true? Don't we also have to worry about whether or not the
> > > value is stable (i.e., know when a param has changed)? There have been
> > > discussions about being able to cache stable subexpressions, and my
> > > understanding was that we'd need to have that infrastructure (along
> > > with the necessarily invalidation when the param changes) to be able
> > > to use this for non-Const expressions.
> >
> > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> > Vars. I think the conditions are the same as those checked in
> > match_clause_to_partition_key() in partprune.c (it's a long function,
> > search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> > trouble, then. But it would be nice to handle queries like "WHERE column
> > = ANY ($1)"
>
> If I'm understanding properly you're imagining something in the form of:
>
> with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
> select * from t where i = any ((select * from x)::int[]);
>
> I've been playing around with this and currently have these checks:
>
> contain_var_clause((Node *) arrayarg)
> contain_volatile_functions((Node *) arrayarg)
> contain_exec_param((Node *) arrayarg, NIL)
>
> (the last one I had to modify the function to handle empty lists)
>
> If any are true, then have to disable the optimization. But for
> queries in the form above the test contain_exec_param((Node *)
> arrayarg, NIL) evaluates to true, even though we know from looking at
> the query that the array subexpression is stable for the length of the
> query.
>
> Am I misunderstanding what you're going for? Or is there a way to
> confirm the expr, although an exec param, won't change?
>
> Another interesting thing that this would imply is that we'd either have to:
>
> 1. Remove the array length check altogether,
> 2. Always use the hash when have a non-Const, but when a Const only if
> the array length check passes, or
> 3. Make this new expr op more fully featured by teaching it how to use
> either a straight loop through a deconstructed array or use the hash.
>
> That last option feeds into further discussion in the below:
>
> > >> Deconstructing the array Datum into a simple C array on first call would
> > >> be a win even for very small arrays and for AND semantics, even if you
> > >> don't use a hash table.
> > >
> > > Because you wouldn't have to repeatedly detoast it? Or some other
> > > reason I'm not thinking of? My intuition would have been that (aside
> > > from detoasting if necessary) there wouldn't be much real overhead in,
> > > for example, an array storing integers.
> >
> > Dealing with NULLs and different element sizes in the array is pretty
> > complicated. Looping through the array currently looks like this:
> >
> >         /* Loop over the array elements */
> >         s = (char *) ARR_DATA_PTR(arr);
> >         bitmap = ARR_NULLBITMAP(arr);
> >         bitmask = 1;
> >
> >         for (int i = 0; i < nitems; i++)
> >         {
> >                 Datum           elt;
> >                 Datum           thisresult;
> >
> >                 /* Get array element, checking for NULL */
> >                 if (bitmap && (*bitmap & bitmask) == 0)
> >                 {
> >                         fcinfo->args[1].value = (Datum) 0;
> >                         fcinfo->args[1].isnull = true;
> >                 }
> >                 else
> >                 {
> >                         elt = fetch_att(s, typbyval, typlen);
> >                         s = att_addlength_pointer(s, typlen, s);
> >                         s = (char *) att_align_nominal(s, typalign);
> >                         fcinfo->args[1].value = elt;
> >                         fcinfo->args[1].isnull = false;
> >                 }
> >
> >                 [do stuff with Datum/isnull]
> >
> >                 /* advance bitmap pointer if any */
> >                 if (bitmap)
> >                 {
> >                         bitmask <<= 1;
> >                         if (bitmask == 0x100)
> >                         {
> >                                 bitmap++;
> >                                 bitmask = 1;
> >                         }
> >                 }
> >         }
> >
> > Compared with just:
> >
> >         for (int i = 0; i < nitems; i++)
> >         {
> >                 Datum           elt = datums[i];
> >
> >                 [do stuff with the Datum]
> >         }
> >
> > I'm not sure how much difference that makes, but I presume it's not
> > zero, and it seems like an easy win when you have the code to deal with
> > the Datum array representation anyway.
>
> Doing this would necessitate option 3 above: we'd have to have this
> new expr op be able both to use a hash or alternatively do a normal
> loop.
>
> Being able to use this in more cases than just a Const array expr is
> certainly interesting, but I'm not sure yet about the feasibility or
> desirability of that at this point given the above restrictions.
>
> One other point in favor of the additional complexity here is that
> it's likely that the above described runtime switching between hash
> and loop would be necessary (for this optimization to come into play)
> if caching of stable subexpressions ever lands. I have some interest
> in working on that...but it's also a large project.

I've attached a cleaned up patch. Last CF it was listed in is
https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
step to take here given it's an already existing patch, but not yet
moved into recent CFs?

Thanks,
James

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 20 Mar 2021 at 09:41, James Coleman <jtc331@gmail.com> wrote:
> I've attached a cleaned up patch. Last CF it was listed in is
> https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
> step to take here given it's an already existing patch, but not yet
> moved into recent CFs?

I had a look at this patch.  I like the idea of using a simplehash.h
hash table to hash the constant values so that repeat lookups can be
performed much more quickly, however, I'm a bit concerned that there
are quite a few places in the code where we often just execute a
ScalarArrayOpExpr once and I'm a bit worried that we'll slow down
expression evaluation of those cases.

The two cases that I have in mind are:

1. eval_const_expressions() where we use the executor to evaluate the
ScalarArrayOpExpr to see if the result is Const.
2. CHECK constraints with IN clauses and single-row INSERTs.

I tried to benchmark both of these but I'm struggling to get stable
enough performance for #2, even with fsync=off.  Sometimes I'm getting
results 2.5x slower than other runs.

For benchmarking #1 I'm also not too sure I'm getting stable enough
results for them to mean anything.

I was running:

create table a (a int);

bench.sql: explain select * from a where a
in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);

drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres
Master (6d41dd045):
tps = 992586.991045 (without initial connection time)
tps = 987964.990483 (without initial connection time)
tps = 994309.670918 (without initial connection time)

Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
tps = 956022.557626 (without initial connection time)
tps = 963043.352823 (without initial connection time)
tps = 968582.600100 (without initial connection time)

This puts the patched version about 3% slower. I'm not sure how much
of that is changes in the binary and noise and how much is the
needless hashtable build done for eval_const_expressions().

I wondered if we should make it the query planner's job of deciding if
the ScalarArrayOpExpr should be hashed or not.  I ended up with the
attached rough-cut patch that introduces HashedScalarArrayOpExpr and
has the query planner decide if it's going to replace
ScalarArrayOpExpr with these HashedScalarArrayOpExpr during
preprocess_expression().  I do think that we might want to consider
being a bit selective about when we do these replacements.  It seems
likely that we'd want to do this for EXPRKIND_QUAL and maybe
EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to
HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time
since those will just be executed once.
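
(Roughly, the replacement point in preprocess_expression() would look like
this -- a sketch only, with the helper name purely illustrative:)

    /* only bother for expressions that may be evaluated many times */
    if (kind == EXPRKIND_QUAL || kind == EXPRKIND_TARGET)
        expr = convert_saop_to_hashed_saop(expr);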

I tried the same above test with the
v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the
attached rough-cut patch and got:

master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
+ v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch
tps = 1167969.983173 (without initial connection time)
tps = 1199636.793314 (without initial connection time)
tps = 1190690.939963 (without initial connection time)

I can't really explain why this became faster. I was expecting it just
to reduce that slowdown of the v4 patch a little. I don't really see
any reason why it would become faster.  It's almost 20% faster which
seems like too much to just be fluctuations in code alignment in the
binary.

The attached patch is still missing the required changes to
llvmjit_expr.c. I think that was also missing from the original patch
too, however.

Also, I added HashedScalarArrayOpExpr to plannodes.h.  All other Expr
type nodes are in primnodes.h. However, I put HashedScalarArrayOpExpr
in plannodes.h because the parser does not generate this and it's not
going to be stored in the catalogue files anywhere. I'm not so sure
inventing a new Expr type node that only can be generated by the
planner is a good thing to do.

Anyway, wondering what you think of the idea of allowing the planner
to choose if it's going to hash or not?

It might also be good if someone else can check if they can get a bit
more stable performance results from benchmarking the patches.

(Also attached your v4 patch again just so anyone following along at
home does not need to hunt around for the correct set of patches to
apply to test this)

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Mon, Apr 5, 2021 at 11:58 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 20 Mar 2021 at 09:41, James Coleman <jtc331@gmail.com> wrote:
> > I've attached a cleaned up patch. Last CF it was listed in is
> > https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
> > step to take here given it's an already existing patch, but not yet
> > moved into recent CFs?
>
> I had a look at this patch.  I like the idea of using a simplehash.h
> hash table to hash the constant values so that repeat lookups can be
> performed much more quickly, however, I'm a bit concerned that there
> are quite a few places in the code where we often just execute a
> ScalarArrayOpExpr once and I'm a bit worried that we'll slow down
> expression evaluation of those cases.
>
> The two cases that I have in mind are:
>
> 1. eval_const_expressions() where we use the executor to evaluate the
> ScalarArrayOpExpr to see if the result is Const.
> 2. CHECK constraints with IN clauses and single-row INSERTs.

This is a good point I hadn't considered; now that you mention it, I
think another case would be expression evaluation in pl/pgsql.

> I tried to benchmark both of these but I'm struggling to get stable
> enough performance for #2, even with fsync=off.  Sometimes I'm getting
> results 2.5x slower than other runs.
>
> For benchmarking #1 I'm also not too sure I'm getting stable enough
> results for them to mean anything.
>
> I was running:
>
> create table a (a int);
>
> bench.sql: explain select * from a where a
> in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
>
> drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres
> Master (6d41dd045):
> tps = 992586.991045 (without initial connection time)
> tps = 987964.990483 (without initial connection time)
> tps = 994309.670918 (without initial connection time)
>
> Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
> tps = 956022.557626 (without initial connection time)
> tps = 963043.352823 (without initial connection time)
> tps = 968582.600100 (without initial connection time)
>
> This puts the patched version about 3% slower. I'm not sure how much
> of that is changes in the binary and noise and how much is the
> needless hashtable build done for eval_const_expressions().
>
> I wondered if we should make it the query planner's job of deciding if
> the ScalarArrayOpExpr should be hashed or not.  I ended up with the
> attached rough-cut patch that introduces HashedScalarArrayOpExpr and
> has the query planner decide if it's going to replace
> ScalarArrayOpExpr with these HashedScalarArrayOpExpr during
> preprocess_expression().  I do think that we might want to consider
> being a bit selective about when we do these replacements.  It seems
> likely that we'd want to do this for EXPRKIND_QUAL and maybe
> EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to
> HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time
> since those will just be executed once.

In theory we might want to cost them differently as well, though I'm
slightly hesitant to do so at this point to avoid causing plan changes
(I'm not sure how we would balance that concern with the potential
that the best plan isn't chosen).

> I tried the same above test with the
> v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the
> attached rough-cut patch and got:
>
> master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
> + v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch
> tps = 1167969.983173 (without initial connection time)
> tps = 1199636.793314 (without initial connection time)
> tps = 1190690.939963 (without initial connection time)
>
> I can't really explain why this became faster. I was expecting it just
> to reduce that slowdown of the v4 patch a little. I don't really see
> any reason why it would become faster.  It's almost 20% faster which
> seems like too much to just be fluctuations in code alignment in the
> binary.

I'm not at a place where I can do good perf testing right now (just on
my laptop for the moment), unfortunately, so I can't confirm one way
or the other.

> The attached patch is still missing the required changes to
> llvmjit_expr.c. I think that was also missing from the original patch
> too, however.

Ah, I didn't realize that needed to be changed as well. I'll take a
look at that.

> Also, I added HashedScalarArrayOpExpr to plannodes.h.  All other Expr
> type nodes are in primnodes.h. However, I put HashedScalarArrayOpExpr
> in plannodes.h because the parser does not generate this and it's not
> going to be stored in the catalogue files anywhere. I'm not so sure
> inventing a new Expr type node that only can be generated by the
> planner is a good thing to do.

I don't know what the positives and negatives are of this.

> Anyway, wondering what you think of the idea of allowing the planner
> to choose if it's going to hash or not?

In general I think it's very reasonable. I kinda wonder if
HashedScalarArrayOpExpr should have the ScalarArrayOpExp inlined
instead of maintaining a pointer, but it's not a big deal to me either
way. It certainly adds additional code, but probably also makes the
execExpr code clearer.

> It might also be good if someone else can check if they can get a bit
> more stable performance results from benchmarking the patches.
>
> (Also attached your v4 patch again just so anyone following along at
> home does not need to hunt around for the correct set of patches to
> apply to test this)

A few other comments:

- It looks like several of the "result is always InvalidOid" changes
should get committed separately (and now)?
- Two comment tweaks:

+ * 1. The 2nd argument of the array does not contain any Vars, Params or

s/array/array op/

+ *    worthwhile using the hashed version of ScalarArrayOpExprs rather than

s/ScalarArrayOpExprs/ScalarArrayOpExpr/

- Using op_hashjoinable is an improvement over my initial patch.
- I like the name change to put HASHED/Hashed first.

Thanks,
James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On 4/6/21 5:58 AM, David Rowley wrote:
> On Sat, 20 Mar 2021 at 09:41, James Coleman <jtc331@gmail.com> wrote:
>> I've attached a cleaned up patch. Last CF it was listed in is
>> https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
>> step to take here given it's an already existing patch, but not yet
>> moved into recent CFs?
> 
> I had a look at this patch.  I like the idea of using a simplehash.h
> hash table to hash the constant values so that repeat lookups can be
> performed much more quickly, however, I'm a bit concerned that there
> are quite a few places in the code where we often just execute a
> ScalarArrayOpExpr once and I'm a bit worried that we'll slow down
> expression evaluation of those cases.
> 
> The two cases that I have in mind are:
> 
> 1. eval_const_expressions() where we use the executor to evaluate the
> ScalarArrayOpExpr to see if the result is Const.
> 2. CHECK constraints with IN clauses and single-row INSERTs.
> 
> I tried to benchmark both of these but I'm struggling to get stable
> enough performance for #2, even with fsync=off.  Sometimes I'm getting
> results 2.5x slower than other runs.
> 
> For benchmarking #1 I'm also not too sure I'm getting stable enough
> results for them to mean anything.
> 
> I was running:
> 
> create table a (a int);
> 
> bench.sql: explain select * from a where a
> in(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
> 
> drowley@amd3990x:~$ pgbench -n -T 60 -j 64 -c 64 -f bench.sql -P 10 postgres
> Master (6d41dd045):
> tps = 992586.991045 (without initial connection time)
> tps = 987964.990483 (without initial connection time)
> tps = 994309.670918 (without initial connection time)
> 
> Master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
> tps = 956022.557626 (without initial connection time)
> tps = 963043.352823 (without initial connection time)
> tps = 968582.600100 (without initial connection time)
> 
> This puts the patched version about 3% slower. I'm not sure how much
> of that is changes in the binary and noise and how much is the
> needless hashtable build done for eval_const_expressions().
> 
> I wondered if we should make it the query planner's job of deciding if
> the ScalarArrayOpExpr should be hashed or not.  I ended up with the
> attached rough-cut patch that introduces HashedScalarArrayOpExpr and
> has the query planner decide if it's going to replace
> ScalarArrayOpExpr with these HashedScalarArrayOpExpr during
> preprocess_expression().  I do think that we might want to consider
> being a bit selective about when we do these replacements.  It seems
> likely that we'd want to do this for EXPRKIND_QUAL and maybe
> EXPRKIND_TARGET, but I imagine that converting ScalarArrayOpExpr to
> HashedScalarArrayOpExpr for EXPRKIND_VALUES would be a waste of time
> since those will just be executed once.
> 
> I tried the same above test with the
> v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch plus the
> attached rough-cut patch and got:
> 
> master + v4-0001-Hash-lookup-const-arrays-in-OR-d-ScalarArrayOps.patch
> + v5-0002-Rough-cut-patch-for-HashedScalarArrayOpExpr.patch
> tps = 1167969.983173 (without initial connection time)
> tps = 1199636.793314 (without initial connection time)
> tps = 1190690.939963 (without initial connection time)
> 
> I can't really explain why this became faster. I was expecting it just
> to reduce that slowdown of the v4 patch a little. I don't really see
> any reason why it would become faster.  It's almost 20% faster which
> seems like too much to just be fluctuations in code alignment in the
> binary.
> 

Interesting. I tried this on the "small" machine I use for benchmarking,
with the same SQL script you used, and also with IN() containing 10 and
100 values - so less/more than your script, which used 16 values.

I only ran that with a single client, the machine only has 4 cores and
this should not be related to concurrency, so 1 client seems fine. The
average of 10 runs, 15 seconds each look like this:

            simple   prepared      10/s    10/p    100/s    100/p
    -------------------------------------------------------------
    master   21847      59476     23343   59380    11757    56488
    v4       21546      57757     22864   57704    11572    57350
    v4+v5    23374      56089     24410   56140    14765    55302

The first two columns are your bench.sql, with -M simple or prepared.
The other columns are 10 or 100 values, /s is simple, /p is prepared.

Compared to master:

            simple   prepared      10/s    10/p    100/s    100/p
    -------------------------------------------------------------
    v4      98.62%     97.11%    97.95%  97.18%   98.43%  101.52%
    v4+v5  106.99%     94.31%   104.57%  94.54%  125.59%   97.90%

That seems to mostly match your observation - there's a small
performance hit (~2%), although that might be due to changes in the
layout of the binary. And v4+v5 improves that a bit (even compared to
master), although I don't see the same 20% speedup.

I see +25% improvement, but only with 100 values.

It's a bit strange that in prepared mode, the v5 actually hurts the
performance a bit.

That being said, this is a pretty extreme test case. I'm pretty sure
that once the table is not empty, the results will probably show a clear
improvement. I'll collect some of those results.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Thu, 8 Apr 2021 at 05:54, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I only ran that with a single client, the machine only has 4 cores and
> this should not be related to concurrency, so 1 client seems fine. The
> average of 10 runs, 15 seconds each look like this:

Thanks for running these tests.   The reason I added so much
concurrency was that this AMD machine has some weird behaviour in
regards to power management.   Every BIOS update I get seems to change
the power management behaviour, but it's still very unstable despite
having it on the most optimal settings in the BIOS.  Working it a bit harder
seems to help make it realise that there might be some urgency.


>             simple   prepared      10/s    10/p    100/s    100/p
>     -------------------------------------------------------------
>     master   21847      59476     23343   59380    11757    56488
>     v4       21546      57757     22864   57704    11572    57350
>     v4+v5    23374      56089     24410   56140    14765    55302
>
> The first two columns are your bench.sql, with -M simple or prepared.
> The other columns are 10 or 100 values, /s is simple, /p is prepared.
>
> Compared to master:
>
>             simple   prepared      10/s    10/p    100/s    100/p
>     -------------------------------------------------------------
>     v4      98.62%     97.11%    97.95%  97.18%   98.43%  101.52%
>     v4+v5  106.99%     94.31%   104.57%  94.54%  125.59%   97.90%
>
> That seems to mostly match your observation - there's a small
> performance hit (~2%), although that might be due to changes in the
> layout of the binary. And v4+v5 improves that a bit (even compared to
> master), although I don't see the same 20% speedup.

I've spent more time hacking at this patch.  I had a bit of a change
of heart earlier about having this new HashedScalarArrayOpExpr node
type.  There were more places that I imagined that I needed to add
handling for it.  For example, partprune.c needed to know about it to
allow partition pruning on them. While supporting that is just a few
lines to make a recursive call passing in the underlying
ScalarArrayOpExpr, I just didn't like the idea.

Instead, I think it'll be better just to add a new field to
ScalarArrayOpExpr and have the planner set that to tell the executor
that it should use a hash table to perform the lookups rather than a
linear search. This can just be the hash function oid, which also
saves the executor from having to look that up.
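
Concretely, the node ends up looking something like this (a sketch -- the
existing ScalarArrayOpExpr fields are unchanged, only hashfuncid is new,
and the exact name/comment may differ in the final patch):

typedef struct ScalarArrayOpExpr
{
    Expr        xpr;
    Oid         opno;           /* PG_OPERATOR OID of the operator */
    Oid         opfuncid;       /* PG_PROC OID of comparison function */
    Oid         hashfuncid;     /* PG_PROC OID of the hash function, or
                                 * InvalidOid if the planner chose not to
                                 * use a hash table */
    bool        useOr;          /* true for ANY op, false for ALL op */
    Oid         inputcollid;    /* OID of collation that operator should use */
    List       *args;           /* the scalar and array operands */
    int         location;       /* token location, or -1 if unknown */
} ScalarArrayOpExpr;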

After quite a bit of hacking, I've ended up with the attached.  I
added the required JIT code to teach the jit code about
EEOP_HASHED_SCALARARRAYOP.

I also wrote the missing regression test for non-strict equality ops
and moved the declaration for the simplehash.h code into
execExprInterp.c and forward declared ScalarArrayOpExprHashTable in
exprExpr.h. I also rewrote a large number of comments and fixed a few
things like missing permission checks for the hash function.

I've not done any further performance tests yet but will start those now.

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Thu, 8 Apr 2021 at 18:50, David Rowley <dgrowleyml@gmail.com> wrote:
> I've not done any further performance tests yet but will start those now.

I ran a set of tests on this:

select * from a where a in( < 1 to 10 > );
and
select * from a where a in( < 1 to 100 > );

the table "a" is just an empty table with a single int column.

I ran "pgbench -T 15 -c 16 -t 16" ten times each and the resulting tps
is averaged over the 10 runs.

With 10 items in the IN clause:
master: 99887.9098314 tps
patched: 103235.7616416 tps (3.35% faster)

With 100 items:
master: 62442.4838792 tps
patched: 62275.4955754 tps (0.27% slower)

These tests are just designed to test the overhead of the additional
planning and expression initialisation.  Testing the actual
performance of the patch vs master with large IN lists shows the
expected significant speedups.

These results show that there's not much in the way of a measurable
slowdown in planning or executor startup from the additional code
which decides if we should hash the ScalarArrayOpExpr.

I think the changes in the patch are fairly isolated and the test
coverage is now pretty good.  I'm planning on looking at the patch
again now and will consider pushing it for PG14.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Thu, 8 Apr 2021 at 22:54, David Rowley <dgrowleyml@gmail.com> wrote:
>
> I think the changes in the patch are fairly isolated and the test
> coverage is now pretty good.  I'm planning on looking at the patch
> again now and will consider pushing it for PG14.

I pushed this with some minor cleanup from the v6 patch I posted earlier.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Thu, Apr 8, 2021 at 8:01 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 8 Apr 2021 at 22:54, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > I think the changes in the patch are fairly isolated and the test
> > coverage is now pretty good.  I'm planning on looking at the patch
> > again now and will consider pushing it for PG14.
>
> I pushed this with some minor cleanup from the v6 patch I posted earlier.
>
> David

Thank you!

I assume proper procedure for the CF entry is to move it into the
current CF and then mark it as committed, however I don't know how (or
don't have permissions?) to move it into the current CF. How does one
go about doing that?

Here's the entry: https://commitfest.postgresql.org/29/2542/

Thanks,
James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Alvaro Herrera
Date:
On 2021-Apr-08, James Coleman wrote:

> I assume proper procedure for the CF entry is to move it into the
> current CF and then mark it as committed, however I don't know how (or
> don't have permissions?) to move it into the current CF. How does one
> go about doing that?
> 
> Here's the entry: https://commitfest.postgresql.org/29/2542/

Done, thanks.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Thu, Apr 8, 2021 at 1:04 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Apr-08, James Coleman wrote:
>
> > I assume proper procedure for the CF entry is to move it into the
> > current CF and then mark it as committed, however I don't know how (or
> > don't have permissions?) to move it into the current CF. How does one
> > go about doing that?
> >
> > Here's the entry: https://commitfest.postgresql.org/29/2542/
>
> Done, thanks.
>
> --
> Álvaro Herrera                            39°49'30"S 73°17'W

Thanks. Is that something I should be able to do myself (should I be
asking someone for getting privileges in the app to do so)? I'm not
sure what the project policy is on that.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Alvaro Herrera
Date:
On 2021-Apr-08, James Coleman wrote:

> On Thu, Apr 8, 2021 at 1:04 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2021-Apr-08, James Coleman wrote:
> >
> > > I assume proper procedure for the CF entry is to move it into the
> > > current CF and then mark it as committed, however I don't know how (or
> > > don't have permissions?) to move it into the current CF. How does one
> > > go about doing that?
> > >
> > > Here's the entry: https://commitfest.postgresql.org/29/2542/
> >
> > Done, thanks.

> Thanks. Is that something I should be able to do myself

No, sorry.

-- 
Álvaro Herrera       Valdivia, Chile
"La vida es para el que se aventura"



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:
On 4/8/21 2:00 PM, David Rowley wrote:
> On Thu, 8 Apr 2021 at 22:54, David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> I think the changes in the patch are fairly isolated and the test
>> coverage is now pretty good.  I'm planning on looking at the patch
>> again now and will consider pushing it for PG14.
> 
> I pushed this with some minor cleanup from the v6 patch I posted earlier.
> 

I ran the same set of benchmarks on the v6, which I think should be
mostly identical to what was committed. I extended the test a bit to
test table with 0, 1, 5 and 1000 rows, and also with int and text
values, to see how it works with more expensive comparators.

I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are
attached. There's a bit of difference between gcc and clang, but the
general behavior is about the same, so I'll only present gcc results to
keep this simple. I'll only show the throughput comparison to master, so >1.0
means good, <1.0 means bad. If you're interested in actual tps, see the
full results.

For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are:

integer column / v5
===================

    rows      10/p    100/p     16/p     10/s    100/s     16/s
    -----------------------------------------------------------
       0       97%      97%      97%     107%     126%     108%
       1       95%      82%      94%     108%     132%     110%
       5       95%      83%      95%     108%     132%     110%
    1000      129%     481%     171%     131%     382%     165%


integer column / v6
===================

    rows      10/p    100/p     16/p     10/s    100/s     16/s
    -----------------------------------------------------------
       0       97%      97%      97%      98%      98%      98%
       1       96%      84%      95%      97%      97%      98%
       5       97%      85%      96%      98%      97%      97%
    1000      129%     489%     172%     128%     330%     162%


text column / v5
================

    rows      10/p    100/p     16/p     10/s    100/s     16/s
    -----------------------------------------------------------
       0      100%     100%     100%     106%     119%     108%
       1       96%      81%      95%     107%     120%     109%
       5       97%      82%      96%     107%     121%     109%
    1000      291%    1622%     402%     255%    1092%     337%


text column / v6
================

    rows      10/p    100/p     16/p     10/s    100/s     16/s
    -----------------------------------------------------------
       0      101%     101%     101%      98%      99%      99%
       1       98%      82%      96%      98%      96%      97%
       5      100%      84%      98%      98%      96%      98%
    1000      297%    1645%     408%     255%    1000%     336%


Overall, the behavior for integer and text columns is the same, for both
patches. There's a couple interesting observations:

1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup),
but v6 does not seem to help at all - it's either same or slower than
unpatched master.

I wonder why is that, and if we could get some of the speedup with v6?
At first I thought that maybe v5 is not building the hash table in cases
where v6 does, but that shouldn't be faster than master.


2) For the "prepared" mode, there's a clear performance hit the longer
the array is (for both v5 and v6). For 100 elements it's about 15%,
which is not great.

I think the reason is fairly simple - building the hash table is not
free, and with few rows it's not worth it - it'd be faster to just
search the array directly. Unfortunately, the logic that makes the
decision to switch to hashing only looks at the array length, and
ignores the number of rows entirely. So I think if we want to address
this, convert_saop_to_hashed_saop needs to compare

    hash_build_cost + nrows * hash_lookup_cost

and

    nrows * linear_lookup_cost

to make a reasonable decision.
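
As a very rough sketch of the kind of check I mean (the helper and the cost
names are invented for illustration; only cpu_operator_cost is an existing
GUC, and the actual per-element costs would need real numbers):

static bool
hashed_saop_is_worthwhile(double nrows, int nitems)
{
    /* one-off cost of building the hash table from the array */
    double      hash_build_cost = nitems * cpu_operator_cost;

    /* per-row cost of a hash probe vs. an average linear search */
    double      hash_lookup_cost = cpu_operator_cost;
    double      linear_lookup_cost = (nitems / 2.0) * cpu_operator_cost;

    return (hash_build_cost + nrows * hash_lookup_cost) <
           (nrows * linear_lookup_cost);
}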

I was thinking that maybe we can ignore this, because people probably
have much larger tables in practice. But I'm not sure that's really
true, because there may be other quals and it's possible the preceding
ones are quite selective, filtering most of the rows.

I'm not sure how much of the necessary information we have available in
convert_saop_to_hashed_saop_walker, though :-( I suppose we know the
number of input rows for that plan node, not sure about selectivity of
the other quals, though.

It's also a bit strange that we get speedup for "simple" protocol, while
for "prepared" it gets slower. That seems counter-intuitive, because why
should we see opposite outcomes in those cases? I'd assume that we'll
see either speedup or slowdown in both cases, with the relative change
being more significant in the "prepared" mode.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Fri, 9 Apr 2021 at 09:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I ran the same set of benchmarks on the v6, which I think should be
> mostly identical to what was committed. I extended the test a bit to
> test table with 0, 1, 5 and 1000 rows, and also with int and text
> values, to see how it works with more expensive comparators.
>
> I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are
> attached. There's a bit of difference between gcc and clang, but the
> general behavior is about the same, so I'll only present gcc results to
> keep this simple. I'll only show the throughput comparison to master, so >1.0
> means good, <1.0 means bad. If you're interested in actual tps, see the
> full results.
>
> For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are:
>
> integer column / v5
> ===================
>
>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>     -----------------------------------------------------------
>        0       97%      97%      97%     107%     126%     108%
>        1       95%      82%      94%     108%     132%     110%
>        5       95%      83%      95%     108%     132%     110%
>     1000      129%     481%     171%     131%     382%     165%

I think we should likely ignore the v5 patch now.  The reason is that
it was pretty much unfinished and there were many places where I'd not
yet added support for the HashedScalarArrayOpExpr node type.  This
could cause these nodes to be skipped during node mutations or node
walking which would certainly make planning faster, just not in a way
that's correct.

> integer column / v6
> ===================
>
>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>     -----------------------------------------------------------
>        0       97%      97%      97%      98%      98%      98%
>        1       96%      84%      95%      97%      97%      98%
>        5       97%      85%      96%      98%      97%      97%
>     1000      129%     489%     172%     128%     330%     162%

This is a really informative set of results. I can only guess that the
slowdown of the 100/prepared query is down to building the hash table.
I think that because the 0 rows test does not show the slowdown and we
only build the table when evaluating for the first time. There's a
slightly larger hit on 1 row vs 5 rows, which makes sense since the
rewards of the hash lookup start paying off more with more rows.

Looking at your tps numbers, I think I can see why we get the drop in
performance with prepared statements but not simple statements. This
seems to just be down to the fact that the planning time dominates in
the simple statement case.  For example, the "1 row" test for 100/s
for v6 is 10023.3 tps, whereas the 100/p result is 44093.8 tps. With
master, prepared gets 52400.0 tps. So we could say the hash table
build costs us 8306.2 tps, or 3.59 microseconds per execution, per:

postgres=# select 1000000 / 52400.0 - 1000000 / 44093.8;
      ?column?
---------------------
 -3.5949559161508538
(1 row)

If we look at the tps for the simple query version of the same test.
Master did 10309.6 tps, v6 did 10023.3 tps. If we apply that 3.59
microsecond slowdown to master's tps, then we get pretty close to
within 1% of the v6 tps:

postgres=# select 1000000 / (1000000 / 10309.6 + 3.59);
       ?column?
-----------------------
 9941.6451581291294165
(1 row)

> text column / v6
> ================
>
>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>     -----------------------------------------------------------
>        0      101%     101%     101%      98%      99%      99%
>        1       98%      82%      96%      98%      96%      97%
>        5      100%      84%      98%      98%      96%      98%
>     1000      297%    1645%     408%     255%    1000%     336%
>
>
> Overall, the behavior for integer and text columns is the same, for both
> patches. There's a couple interesting observations:
>
> 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup),
> but v6 does not seem to help at all - it's either same or slower than
> unpatched master.

I think that's related to the fact that I didn't finish adding
HashedScalarArrayOpExpr processing to all places that needed it.

> I wonder why is that, and if we could get some of the speedup with v6?
> At first I thought that maybe v5 is not building the hash table in cases
> where v6 does, but that shouldn't be faster than master.

I don't think v5 and v6 really do anything much differently in the
executor. The only difference is really during ExecInitExprRec() when
we initialize the expression. With v5 we had a case
T_HashedScalarArrayOpExpr: to handle the new node type, but in v6 we
have if (OidIsValid(opexpr->hashfuncid)). Oh, wait. I did add the
missing permissions check on the hash function, so that will account
for something. As far as I can see, that's required.

> 2) For the "prepared" mode, there's a clear performance hit the longer
> the array is (for both v5 and v6). For 100 elements it's about 15%,
> which is not great.
>
> I think the reason is fairly simple - building the hash table is not
> free, and with few rows it's not worth it - it'd be faster to just
> search the array directly. Unfortunately, the logic that makes the
> decision to switch to hashing only looks at the array length, and
> ignores the number of rows entirely. So I think if we want to address
> this, convert_saop_to_hashed_saop needs to compare
>
>     hash_build_cost + nrows * hash_lookup_cost
>
> and
>
>     nrows * linear_lookup_cost
>
> to make a reasonable decision.

I thought about that but I was really worried that the performance of
ScalarArrayOpExpr would just become too annoyingly unpredictable. You
know fairly well that we can often get massive row underestimations in
the planner. (I guess you worked on ext stats mainly because of that)
The problem I want to avoid is the ones where we get a big row
underestimation but don't really get a bad plan as a result. For
example a query like:

SELECT * FROM big_table WHERE col1 = ... AND col2 = ... AND col3 = ...
AND col4 IN( ... big list of values ...);

If col1, col2 and col3 are highly correlated but individually fairly
selective, then we could massively underestimate how many rows the IN
clause will see (assuming no rearranging was done here).

I'm not completely opposed to the idea of taking the estimated rows
into account during planning. It might just mean having to move the
convert_saop_to_hashed_saop() call somewhere else. I imagine that's
fairly trivial to do. I just have concerns about doing so.

> I was thinking that maybe we can ignore this, because people probably
> have much larger tables in practice. But I'm not sure that's really
> true, because there may be other quals and it's possible the preceding
> ones are quite selective, filtering most of the rows.
>
> I'm not sure how much of the necessary information we have available in
> convert_saop_to_hashed_saop_walker, though :-( I suppose we know the
> number of input rows for that plan node, not sure about selectivity of
> the other quals, though.
>
> It's also a bit strange that we get speedup for "simple" protocol, while
> for "prepared" it gets slower. That seems counter-intuitive, because why
> should we see opposite outcomes in those cases? I'd assume that we'll
> see either speedup or slowdown in both cases, with the relative change
> being more significant in the "prepared" mode.

I hope my theory above about the planner time dominating the overall
time shows why that is.

Another way to look at these result is by taking your tps value and
calculating how long it takes to do N number of transactions then
totalling up the time it takes. If I do that to calculate how long it
took each test to perform 1000 transactions and sum each test grouping
by mode and version with rollup on mode, I get:

time values are in seconds:

pg-master 28.07
prepared 11.28
simple 16.79

pg-v6 15.86
prepared 5.23
simple 10.63

So, the overall result when applying the total time is 177%.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:

On 4/9/21 1:21 AM, David Rowley wrote:
> On Fri, 9 Apr 2021 at 09:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> I ran the same set of benchmarks on the v6, which I think should be
>> mostly identical to what was committed. I extended the test a bit to
>> test table with 0, 1, 5 and 1000 rows, and also with int and text
>> values, to see how it works with more expensive comparators.
>>
>> I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are
>> attached. There's a bit of difference between gcc and clang, but the
>> general behavior is about the same, so I'll only present gcc results to
>> keep this simple. I'll only show the throughput comparison to master, so >1.0
>> means good, <1.0 means bad. If you're interested in actual tps, see the
>> full results.
>>
>> For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are:
>>
>> integer column / v5
>> ===================
>>
>>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>>     -----------------------------------------------------------
>>        0       97%      97%      97%     107%     126%     108%
>>        1       95%      82%      94%     108%     132%     110%
>>        5       95%      83%      95%     108%     132%     110%
>>     1000      129%     481%     171%     131%     382%     165%
> 
> I think we should likely ignore the v5 patch now.  The reason is that
> it was pretty much unfinished and there were many places where I'd not
> yet added support for the HashedScalarArrayOpExpr node type.  This
> could cause these nodes to be skipped during node mutations or node
> walking which would certainly make planning faster, just not in a way
> that's correct.
> 
>> integer column / v6
>> ===================
>>
>>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>>     -----------------------------------------------------------
>>        0       97%      97%      97%      98%      98%      98%
>>        1       96%      84%      95%      97%      97%      98%
>>        5       97%      85%      96%      98%      97%      97%
>>     1000      129%     489%     172%     128%     330%     162%
> 
> This is a really informative set of results. I can only guess that the
> slowdown of the 100/prepared query is down to building the hash table.
> I think that because the 0 rows test does not show the slowdown and we
> only build the table when evaluating for the first time. There's a
> slightly larger hit on 1 row vs 5 rows, which makes sense since the
> rewards of the hash lookup start paying off more with more rows.
> 

Agreed. I think that's essentially what I wrote too.

> Looking at your tps numbers, I think I can see why we get the drop in
> performance with prepared statements but not simple statements. This
> seems to just be down to the fact that the planning time dominates in
> the simple statement case.  For example, the "1 row" test for 100/s
> for v6 is 10023.3 tps, whereas the 100/p result is 44093.8 tps. With
> master, prepared gets 52400.0 tps. So we could say the hash table
> build costs us 8306.2 tps, or 3.59 microseconds per execution, per:
> 
> postgres=# select 1000000 / 52400.0 - 1000000 / 44093.8;
>       ?column?
> ---------------------
>  -3.5949559161508538
> (1 row)
> 
> If we look at the tps for the simple query version of the same test.
> Master did 10309.6 tps, v6 did 10023.3 tps. If we apply that 3.59
> microsecond slowdown to master's tps, then we get pretty close to
> within 1% of the v6 tps:
> 
> postgres=# select 1000000 / (1000000 / 10309.6 + 3.59);
>        ?column?
> -----------------------
>  9941.6451581291294165
> (1 row)
> 

Right, that makes perfect sense.

>> text column / v6
>> ================
>>
>>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>>     -----------------------------------------------------------
>>        0      101%     101%     101%      98%      99%      99%
>>        1       98%      82%      96%      98%      96%      97%
>>        5      100%      84%      98%      98%      96%      98%
>>     1000      297%    1645%     408%     255%    1000%     336%
>>
>>
>> Overall, the behavior for integer and text columns is the same, for both
>> patches. There's a couple interesting observations:
>>
>> 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup),
>> but v6 does not seem to help at all - it's either same or slower than
>> unpatched master.
> 
> I think that's related to the fact that I didn't finish adding
> HashedScalarArrayOpExpr processing to all places that needed it.
> 

Not sure I understand. Shouldn't that effectively default to the
unpatched behavior? How could that result in code that is *faster* than
master?


>> I wonder why is that, and if we could get some of the speedup with v6?
>> At first I thought that maybe v5 is not building the hash table in cases
>> where v6 does, but that shouldn't be faster than master.
> 
> I don't think v5 and v6 really do anything much differently in the
> executor. The only difference is really during ExecInitExprRec() when
> we initialize the expression. With v5 we had a case
> T_HashedScalarArrayOpExpr: to handle the new node type, but in v6 we
> have if (OidIsValid(opexpr->hashfuncid)). Oh, wait. I did add the
> missing permissions check on the hash function, so that will account
> for something. As far as I can see, that's required.
> 

I think the check is required, but I don't think it should be all that
expensive - it's likely just one of many other permission checks.

But I think the puzzle is not so much about v5 vs v6, but more about v5
vs. master. I still don't understand how v5 managed to be faster than
master, but maybe I'm missing something.

>> 2) For the "prepared" mode, there's a clear performance hit the longer
>> the array is (for both v5 and v6). For 100 elements it's about 15%,
>> which is not great.
>>
>> I think the reason is fairly simple - building the hash table is not
>> free, and with few rows it's not worth it - it'd be faster to just
>> search the array directly. Unfortunately, the logic that makes the
>> decision to switch to hashing only looks at the array length, and
>> ignores the number of rows entirely. So I think if we want to address
>> this, convert_saop_to_hashed_saop needs to compare
>>
>>     hash_build_cost + nrows * hash_lookup_cost
>>
>> and
>>
>>     nrows * linear_lookup_cost
>>
>> to make a reasonable decision.
> 
> I thought about that but I was really worried that the performance of
> ScalarArrayOpExpr would just become too annoyingly unpredictable. You
> know fairly well that we can often get massive row underestimations in
> the planner. (I guess you worked on ext stats mainly because of that)
> The problem I want to avoid is the ones where we get a big row
> underestimation but don't really get a bad plan as a result. For
> example a query like:
> 
> SELECT * FROM big_table WHERE col1 = ... AND col2 = ... AND col3 = ...
> AND col4 IN( ... big list of values ...);
> 
> If col1, col2 and col3 are highly correlated but individually fairly
> selective, then we could massively underestimate how many rows the IN
> clause will see (assuming no rearranging was done here).
> 
> I'm not completely opposed to the idea of taking the estimated rows
> into account during planning. It might just mean having to move the
> convert_saop_to_hashed_saop() call somewhere else. I imagine that's
> fairly trivial to do. I just have concerns about doing so.
> 

Hmm, yeah. I understand your concerns, but the way it works now kinda
penalizes correct estimates. Imagine you have a workload with simple OLTP
queries that always hit only a couple of rows - but we make it run
15% slower because we're afraid there might be an under-estimate.

It's sensible to make the planning resilient to under-estimates, but I'm
not sure that just ignoring the cardinality estimate, on the justification
that it might be wrong, is a good strategy. In a way, all the other planning
decisions assume it's correct, so why should this be any different?
We don't use seqscans exclusively just because the selectivity might be
wrong and make an index scan ineffective, for example.

Maybe the right solution is to rely on the estimates, but then also
enable the hashing if we significantly cross the threshold during
execution. So for example we might get estimate 10 rows, and calculate
that the hashing would start winning at 100 rows, so we start without
hashing. But then at execution if we get 200 rows, we build the hash
table and start using it.

Yes, there's a risk that there are only 200 rows, and the time spent
building the hash table is wasted. But it's much more likely that there
are many more rows.
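
In executor terms I'm imagining something along these lines (purely a
sketch -- all the field and helper names here are invented, nothing like
this exists in any of the patches):

/* per input row, inside the ScalarArrayOpExpr evaluation step */
if (!state->use_hash)
{
    /* switch to hashing once we've clearly crossed the break-even point */
    if (++state->nevals > state->hash_threshold)
    {
        build_saop_hash_table(state);   /* invented helper */
        state->use_hash = true;
    }
}

if (state->use_hash)
    matched = saop_hash_lookup(state, scalarval);       /* invented helper */
else
    matched = saop_linear_search(state, scalarval);     /* invented helper */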

>> I was thinking that maybe we can ignore this, because people probably
>> have much larger tables in practice. But I'm not sure that's really
>> true, because there may be other quals and it's possible the preceding
>> ones are quite selective, filtering most of the rows.
>>
>> I'm not sure how much of the necessary information we have available in
>> convert_saop_to_hashed_saop_walker, though :-( I suppose we know the
>> number of input rows for that plan node, not sure about selectivity of
>> the other quals, though.
>>
>> It's also a bit strange that we get speedup for "simple" protocol, while
>> for "prepared" it gets slower. That seems counter-intuitive, because why
>> should we see opposite outcomes in those cases? I'd assume that we'll
>> see either speedup or slowdown in both cases, with the relative change
>> being more significant in the "prepared" mode.
> 
> I hope my theory above about the planner time dominating the overall
> time shows why that is.
> 

I think it does explain the v6 behavior, where prepared gets slower
while simple is about the same (with low row counts).

But I'm not sure I understand the v5, where simple got faster than master.

> Another way to look at these result is by taking your tps value and
> calculating how long it takes to do N number of transactions then
> totalling up the time it takes. If I do that to calculate how long it
> took each test to perform 1000 transactions and sum each test grouping
> by mode and version with rollup on mode, I get:
> 
> time values are in seconds:
> 
> pg-master 28.07
> prepared 11.28
> simple 16.79
> 
> pg-v6 15.86
> prepared 5.23
> simple 10.63
> 
> So, the overall result when applying the total time is 177%.
> 

Not sure how you got those numbers, or how it explains the results.

E.g. on v5, the results for 100 int values / 1 row look like this:

            100/p     100/s
    master  52400     10310
        v5  43446     13610

I understand why the prepared mode got slower. I don't understand how
the simple mode got faster.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 10 Apr 2021 at 10:32, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> But I think the puzzle is not so much about v5 vs v6, but more about v5
> vs. master. I still don't understand how v5 managed to be faster than
> master, but maybe I'm missing something.

Well, v5 wrapped ScalarArrayOpExpr inside HashedScalarArrayOpExpr, but
didn't add cases for HashedScalarArrayOpExpr in all locations where it
should have. For example, fix_expr_common() has a case for
ScalarArrayOpExpr, but if we gave it a HashedScalarArrayOpExpr it
would have very little work to do.

I've not gone and proved that's the exact reason why the planner
became faster, but I really don't see any other reason.

> Maybe the right solution is to rely on the estimates, but then also
> enable the hashing if we significantly cross the threshold during
> execution. So for example we might get estimate 10 rows, and calculate
> that the hashing would start winning at 100 rows, so we start without
> hashing. But then at execution if we get 200 rows, we build the hash
> table and start using it.

To do that, we'd need to store the number of evaluations of the
function somewhere. I'm not really sure that would be a good idea as I
imagine we'd need to store that in ExprEvalStep.  I imagine if that
was a good idea then we'd have done the same for JIT.


> On 4/9/21 1:21 AM, David Rowley wrote:
> > time values are in seconds:
> >
> > pg-master 28.07
> > prepared 11.28
> > simple 16.79
> >
> > pg-v6 15.86
> > prepared 5.23
> > simple 10.63
> >
> > So, the overall result when applying the total time is 177%.
> >
>
> Not sure how you got those numbers, or how it explains the results.

I got it by doing "1 / tps * 1000" to get the time it would take to
execute 1000 transactions of each of your tests. I then grouped by
patch, mode and took the sum of the calculated number.  My point was
that overall the patch is significantly faster.  I was trying to
highlight that the 0 and 1 row test take up very little time and the
overhead of building the hash table is only showing up because the
query executes so quickly.
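(To make the arithmetic concrete with the 100 int values / 1 row cells
you posted: on master, 52400 tps works out to 1000 / 52400 ~ 0.019
seconds per 1000 transactions and 10310 tps to ~0.097 seconds; the
totals above are just those per-test times summed per patch and mode.)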

FWIW, I think executing a large IN clause on a table that has 0 rows
is likely not that interesting a case to optimise for.  That's not the
same as a query that just returns 0 rows due to filtering out hundreds
or thousands of rows during execution.  The overhead of building the
hash table is not going to show up very easily in that sort of case.

> I understand why the prepared mode got slower. I don't understand how
> the simple mode got faster.

I very much imagine v5 was faster at planning due to the unfinished
nature of the patch. I'd not added support for HashedScalarArrayOpExpr
in all the places I should have. That would result in the planner
skipping lots of work that it needs to do.  The way I got it to work
was to add it, then just add enough cases in the planner to handle
HashedScalarArrayOpExpr so I didn't get any errors.  I stopped after
that just to show the idea.  Lack of errors does not mean it was
correct. At least setrefs.c was not properly handling
HashedScalarArrayOpExpr.

I really think it would be best if we just ignore the performance of
v5. Looking at the performance of a patch that was incorrectly
skipping a bunch of required work does not seem that fair.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tomas Vondra
Date:

On 4/11/21 12:03 AM, David Rowley wrote:
> On Sat, 10 Apr 2021 at 10:32, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> But I think the puzzle is not so much about v5 vs v6, but more about v5
>> vs. master. I still don't understand how v5 managed to be faster than
>> master, but maybe I'm missing something.
> 
> Well, v5 wrapped ScalarArrayOpExpr inside HashedScalarArrayOpExpr, but
> didn't add cases for HashedScalarArrayOpExpr in all locations where it
> should have. For example, fix_expr_common() has a case for
> ScalarArrayOpExpr, but if we gave it a HashedScalarArrayOpExpr it
> would have very little work to do.
> 
> I've not gone and proved that's the exact reason why the planner
> became faster, but I really don't see any other reason.
> 
>> Maybe the right solution is to rely on the estimates, but then also
>> enable the hashing if we significantly cross the threshold during
>> execution. So for example we might get estimate 10 rows, and calculate
>> that the hashing would start winning at 100 rows, so we start without
>> hashing. But then at execution if we get 200 rows, we build the hash
>> table and start using it.
> 
> To do that, we'd need to store the number of evaluations of the
> function somewhere. I'm not really sure that would be a good idea as I
> imagine we'd need to store that in ExprEvalStep.  I imagine if that
> was a good idea then we'd have done the same for JIT.
> 

Sure, we'd need to track the number of lookups, but I'd imagine that's
fairly cheap and we can stop once we switch to hash mode.
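
To sketch the idea (purely illustrative - the field names and the
helper don't exist anywhere, this is not from the patch):

/* in the per-step state for the scalar array op (made-up names) */
if (!state->use_hash &&
    ++state->num_lookups >= state->hash_threshold)
{
    /*
     * We've crossed the break-even point computed from the estimates,
     * so pay the one-off cost of building the hash table now.
     */
    build_saop_hash_table(state);   /* hypothetical helper */
    state->use_hash = true;         /* counting stops from here on */
}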

I'm not sure "JIT does not do that" is really a proof it's a bad idea.
My guess is it wasn't considered back then, and the current heuristics
is the simplest possible. So maybe it's the other way and we should
consider to do the same thing for JIT?

FWIW if we look at what JIT does, I'd argue it supports the approach of
trusting the estimates. Because if we under-estimate stuff, the cost won't
exceed the "jit_above_cost" threshold, and we won't use JIT.

> 
>> On 4/9/21 1:21 AM, David Rowley wrote:
>>> time values are in seconds:
>>>
>>> pg-master 28.07
>>> prepared 11.28
>>> simple 16.79
>>>
>>> pg-v6 15.86
>>> prepared 5.23
>>> simple 10.63
>>>
>>> So, the overall result when applying the total time is 177%.
>>>
>>
>> Not sure how you got those numbers, or how it explains the results.
> 
> I got it by doing "1 / tps * 1000" to get the time it would take to
> execute 1000 transactions of each of your tests. I then grouped by
> patch, mode and took the sum of the calculated number.  My point was
> that overall the patch is significantly faster.  I was trying to
> highlight that the 0 and 1 row test take up very little time and the
> overhead of building the hash table is only showing up because the
> query executes so quickly.
> 

Ah, I see. TBH I don't think combining the results gives us a very
meaningful value - those cases were quite arbitrary, but summing them
together like this assumes the workload has about 25% of each. But if
your workload is exclusively 0/1/5 rows it's going to take a hit.

> FWIW, I think executing a large IN clause on a table that has 0 rows
> is likely not that interesting a case to optimise for.  That's not the
> same as a query that just returns 0 rows due to filtering out hundreds
> or thousands of rows during execution.  The overhead of building the
> hash table is not going to show up very easily in that sort of case.
> 

Yeah, it's probably true that queries with long IN lists are dealing
with many input rows. And you're right we don't really care
about how many rows are ultimately produced by the query (or even the
step with the IN list) - if we spent a lot of time to filter the rows
before applying the IN list, the time to initialize the hash table is
probably just noise.

I wonder what's the relationship between the length of the IN list and
the minimum number of rows needed for the hash to start winning.

>> I understand why the prepared mode got slower. I don't understand how
>> the simple mode got faster.
> 
> I very much imagine v5 was faster at planning due to the unfinished
> nature of the patch. I'd not added support for HashedScalarArrayOpExpr
> in all the places I should have. That would result in the planner
> skipping lots of work that it needs to do.  The way I got it to work
> was to add it, then just add enough cases in the planner to handle
> HashedScalarArrayOpExpr so I didn't get any errors.  I stopped after
> that just to show the idea.  Lack of errors does not mean it was
> correct. At least setrefs.c was not properly handling
> HashedScalarArrayOpExpr.
> 
> I really think it would be best if we just ignore the performance of
> v5. Looking at the performance of a patch that was incorrectly
> skipping a bunch of required work does not seem that fair.
> 

Aha! I was assuming v5 was correct, but if that assumption is incorrect
then the whole "v5 speedup" is just an illusion, and you're right we
should simply ignore that. Thanks for the explanation!

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sun, 11 Apr 2021 at 10:38, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I wonder what's the relationship between the length of the IN list and
> the minimum number of rows needed for the hash to start winning.

I made the attached spreadsheet which demonstrates the crossover point
using the costs that I coded into cost_qual_eval_walker().

It basically shows, for large arrays, that there are fairly
significant benefits to hashing for as few as 2 lookups, and that not
hashing only wins when there's a single lookup.  However, the cost
model does not account for allocating memory for the hash table, which
is far from free.
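
As a back-of-the-envelope version of the same thing (rounded numbers,
not the exact formulas from cost_qual_eval_walker): a linear search of
a 100-element array is costed as testing about half the elements, i.e.
roughly 50 * cpu_operator_cost = 0.125 per lookup at the default
0.0025, while a hash probe is roughly one hash call plus one
comparison, about 0.005 per lookup, on top of a one-off cost of about
100 * 0.0025 = 0.25 to hash the array. 0.25 / (0.125 - 0.005) is just
over 2, which is why the crossover lands at around 2 lookups for large
arrays.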

You can adjust the number of items in the IN clause by changing the
value in cell B1. The values in B2 and B3 are what I saw the planner
set when I tested with both INT and TEXT types.

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Fri, 9 Apr 2021 at 00:00, David Rowley <dgrowleyml@gmail.com> wrote:
> I push this with some minor cleanup from the v6 patch I posted earlier.

I realised when working on something unrelated last night that we can
also do hash lookups for NOT IN too.

We'd just need to check if the operator's negator operator is
hashable.  No new fields would need to be added to ScalarArrayOpExpr.
We'd just set the hashfuncid to the correct value and then set the
opfuncid to the negator function.  In the executor, we'd know to check
if the value is in the table or not in the table based on the useOr
value.
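
Roughly this shape in convert_saop_to_hashed_saop_walker() (just a
sketch to show the idea, not a tested patch):

if (!saop->useOr)
{
    Oid         negator = get_negator(saop->opno);
    Oid         lefthashfunc;
    Oid         righthashfunc;

    if (OidIsValid(negator) &&
        get_op_hash_functions(negator, &lefthashfunc, &righthashfunc) &&
        lefthashfunc == righthashfunc)
    {
        saop->hashfuncid = lefthashfunc;
        /* probe the hash table with the equality (negator) function */
        saop->opfuncid = get_opcode(negator);
    }
}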

I'm not really sure whether lack of NOT IN support is going to be a
source of bug reports for PG14 or not.  If it was, then it might be
worth doing something about that for PG14.   Otherwise, we can just
leave it for future work for PG15 and beyond.  I personally don't have
any strong feelings either way, but I'm leaning towards just writing a
patch and thinking of pushing it sometime after we branch for PG15.

I've included the RMT, just in case they want to voice an opinion on that.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I realised when working on something unrelated last night that we can
> also do hash lookups for NOT IN too.

... and still get the behavior right for nulls?

            regards, tom lane



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Tue, 13 Apr 2021 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I realised when working on something unrelated last night that we can
> > also do hash lookups for NOT IN too.
>
> ... and still get the behavior right for nulls?

Yeah, it will. There are already some special cases for NULLs in the
IN version. Those would need to be adapted for NOT IN.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Mon, Apr 12, 2021 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 13 Apr 2021 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > David Rowley <dgrowleyml@gmail.com> writes:
> > > I realised when working on something unrelated last night that we can
> > > also do hash lookups for NOT IN too.
> >
> > ... and still get the behavior right for nulls?
>
> Yeah, it will. There are already some special cases for NULLs in the
> IN version. Those would need to be adapted for NOT IN.

I hadn't thought about using the negator operator directly that way
when I initially wrote the patch.

But also I didn't think a whole lot about the NOT IN case at all --
and there's no mention of such that I see in this thread or the
precursor thread. It's pretty obvious that it wasn't part of my
immediate need, but obviously it'd be nice to have the consistency.

All that to say this: my vote would be to put it into PG15 also.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Mon, Apr 12, 2021 at 10:07 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Mon, Apr 12, 2021 at 7:49 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Tue, 13 Apr 2021 at 11:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > David Rowley <dgrowleyml@gmail.com> writes:
> > > > I realised when working on something unrelated last night that we can
> > > > also do hash lookups for NOT IN too.
> > >
> > > ... and still get the behavior right for nulls?
> >
> > Yeah, it will. There are already some special cases for NULLs in the
> > IN version. Those would need to be adapted for NOT IN.
>
> I hadn't thought about using the negator operator directly that way
> when I initially wrote the patch.
>
> But also I didn't think a whole lot about the NOT IN case at all --
> and there's no mention of such that I see in this thread or the
> precursor thread. It's pretty obvious that it wasn't part of my
> immediate need, but obviously it'd be nice to have the consistency.
>
> All that to say this: my vote would be to put it into PG15 also.

...and here's a draft patch. I can take this to a new thread if you'd
prefer; the one here already got committed, on the other hand this is
pretty strongly linked to this discussion, so I figured it made sense
to post it here.

James

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Wed, 14 Apr 2021 at 05:40, James Coleman <jtc331@gmail.com> wrote:
> ...and here's a draft patch. I can take this to a new thread if you'd
> prefer; the one here already got committed, on the other hand this is
> pretty strongly linked to this discussion, so I figured it made sense
> to post it here.

I only glanced at this when you sent it and I was confused about how
it works. The patch didn't look like how I imagined it should and I
couldn't see how the executor part worked without any changes.

Anyway, I decided to clear up my confusion tonight and apply the patch
to figure all this out...  unfortunately, I see why I was confused
now. It actually does not work at all :-(

You're still passing the <> operator to get_op_hash_functions(), which
of course is not hashable, so we just never do hashing for NOT IN.

All your tests pass just fine because the standard non-hashed code path is used.

My idea was that you'd not add any fields to ScalarArrayOpExpr and for
soaps with useOr == false, check if the negator of the operator is
hashable. If so set the opfuncid to the negator operator's function.

I'm a bit undecided if it's safe to set the opfuncid to the negator
function.  If anything were to set that again based on the opno then
it would likely set it to the wrong thing. We can't go changing the
opno either because EXPLAIN would display the wrong thing.

Anyway, I've attached what I ended up with after spending a few hours
looking at this.

I pretty much used all your tests as is with the exception of removing
one that looked duplicated.

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 14 Apr 2021 at 05:40, James Coleman <jtc331@gmail.com> wrote:
> > ...and here's a draft patch. I can take this to a new thread if you'd
> > prefer; the one here already got committed, on the other hand this is
> > pretty strongly linked to this discussion, so I figured it made sense
> > to post it here.
>
> I only glanced at this when you sent it and I was confused about how
> it works. The patch didn't look like how I imagined it should and I
> couldn't see how the executor part worked without any changes.
>
> Anyway, I decided to clear up my confusion tonight and apply the patch
> to figure all this out...  unfortunately, I see why I was confused
> now. It actually does not work at all :-(
>
> You're still passing the <> operator to get_op_hash_functions(), which
> of course is not hashable, so we just never do hashing for NOT IN.
>
> All your tests pass just fine because the standard non-hashed code path is used.

I was surprised when it "just worked" too; I should have stopped to
verify the path was being taken. Egg on my face for not doing so. :(

> My idea was that you'd not add any fields to ScalarArrayOpExpr and for
> soaps with useOr == false, check if the negator of the operator is
> hashable. If so set the opfuncid to the negator operator's function.
>
> I'm a bit undecided if it's safe to set the opfuncid to the negator
> function.  If anything were to set that again based on the opno then
> it would likely set it to the wrong thing. We can't go changing the
> opno either because EXPLAIN would display the wrong thing.

I don't personally see a reason why this is a problem. But I also
don't know that I have enough knowledge of the codebase to say that
definitively.

> Anyway, I've attached what I ended up with after spending a few hours
> looking at this.

Overall I like this approach.

One thing I think we could clean up:

+ bool            useOr;  /* use OR or AND semantics? */
...
+ /* useOr == true means an IN clause, useOr == false is NOT IN */

I'm wondering if the intersection of these two lines implies that
useOr isn't quite the right name here. Perhaps something like
"negated"?

On the other hand (to make the counterargument) useOr would keep it
consistent with the other ones.

The other thing I got to thinking about was = ALL. It doesn't get
turned into a hash op because the negator of = isn't hashable. I think
it's correct that that's the determining factor, because I can't
imagine what it would mean to hash <>. But...I wanted to confirm I
wasn't missing something. We don't have explicit tests for that case,
but I'm not sure it's necessary either.

> I pretty much used all your tests as is with the exception of removing
> one that looked duplicated.

Sounds good.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 8 May 2021 at 09:15, James Coleman <jtc331@gmail.com> wrote:
>
> On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > I'm a bit undecided if it's safe to set the opfuncid to the negator
> > function.  If anything were to set that again based on the opno then
> > it would likely set it to the wrong thing. We can't go changing the
> > opno either because EXPLAIN would display the wrong thing.
>
> I don't personally see a reason why this is a problem. But I also
> don't know that I have enough knowledge of the codebase to say that
> definitively.

The reason for my concern is that if the opfuncid is set to
InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's
opfuncid to get_opcode(opexpr->opno). I'm effectively setting the
opfuncid to get_opcode(get_negator(opexpr->opno)), if anything were to
reset the ScalarArrayOpExpr's opfuncid to InvalidOid, then
set_sa_opfuncid() would repopulate it with the wrong value.

Maybe the solution there is to teach set_sa_opfuncid() about our
hashing NOT IN trick and have it check if (!opexpr->useOr &&
OidIsValid(opexpr->hashfuncid)) and if that's true then do
opexpr->opfuncid = get_opcode(get_negator(opexpr->opno)).  Then we
could just not bother setting opfuncid in
convert_saop_to_hashed_saop_walker().
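
i.e. something like this (untested sketch):

static void
set_sa_opfuncid(ScalarArrayOpExpr *opexpr)
{
    if (opexpr->opfuncid == InvalidOid)
    {
        /* hashed NOT IN probes with the negator's (equality) function */
        if (!opexpr->useOr && OidIsValid(opexpr->hashfuncid))
            opexpr->opfuncid = get_opcode(get_negator(opexpr->opno));
        else
            opexpr->opfuncid = get_opcode(opexpr->opno);
    }
}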


> > Anyway, I've attached what I ended up with after spending a few hours
> > looking at this.
>
> Overall I like this approach.
>
> One thing I think we could clean up:
>
> + bool            useOr;  /* use OR or AND semantics? */
> ...
> + /* useOr == true means an IN clause, useOr == false is NOT IN */
>
> I'm wondering if the intersection of these two lines implies that
> useOr isn't quite the right name here. Perhaps something like
> "negated"?

I'm not sure I want to go changing that.  The whole IN() / NOT IN()
behaviour regarding NULLs all seems pretty weird until you mentally
replace a IN (1,2,3) with a = 1 OR a = 2 OR a = 3.  And for the a NOT
IN(1,2,3) case, a <> 1 AND a <> 2 AND a <> 3.  People can make a bit
more sense of the weirdness of NULLs with NOT IN when they mentally
convert their expression like that.  I think having that in code is
useful too. Any optimisations that are added must match those
semantics.
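(For instance, under those semantics 1 NOT IN (2, NULL) comes out as
NULL rather than true, because 1 <> NULL is unknown, and any hashed
version has to give exactly the same answer.)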

> The other thing I got to thinking about was = ALL. It doesn't get
> turned into a hash op because the negator of = isn't hashable. I think
> it's correct that that's the determining factor, because I can't
> imagine what it would mean to hash <>. But...I wanted to confirm I
> wasn't missing something. We don't have explicit tests for that case,
> but I'm not sure it's necessary either.

It's important to think of other cases, I just don't think there's any
need to do anything for that one.  Remember that we have the
restriction of requiring a set of Consts, so for that case to be met,
someone would have to write something like: col =
ALL('{1,1,1,1,1,1,1,1}'::int[]);  I think if anyone comes along
complaining that a query containing that is not as fast as they'd like
then we might tell them that they should just use: col = 1. A sanity
checkup might not go amiss either.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Sat, 8 May 2021 at 09:15, James Coleman <jtc331@gmail.com> wrote:
>> On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>> I'm a bit undecided if it's safe to set the opfuncid to the negator
>>> function.  If anything were to set that again based on the opno then
>>> it would likely set it to the wrong thing. We can't go changing the
>>> opno either because EXPLAIN would display the wrong thing.

>> I don't personally see a reason why this is a problem. But I also
>> don't know that I have enough knowledge of the codebase to say that
>> definitively.

> The reason for my concern is that if the opfuncid is set to
> InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's
> opfuncid to get_opcode(opexpr->opno).

I will personally veto any design that involves setting opfuncid to
something that doesn't match the opno.  That's just horrid, and it
will break something somewhere, either immediately or down the road.

I don't immediately see why you can't add an "invert" boolean flag to
ScalarArrayOpExpr and let the executor machinery deal with this.  That'd
have the advantage of not having to depend on there being a negator.

            regards, tom lane



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Fri, May 7, 2021 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Sat, 8 May 2021 at 09:15, James Coleman <jtc331@gmail.com> wrote:
> >> On Sat, Apr 24, 2021 at 6:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >>> I'm a bit undecided if it's safe to set the opfuncid to the negator
> >>> function.  If anything were to set that again based on the opno then
> >>> it would likely set it to the wrong thing. We can't go changing the
> >>> opno either because EXPLAIN would display the wrong thing.
>
> >> I don't personally see a reason why this is a problem. But I also
> >> don't know that I have enough knowledge of the codebase to say that
> >> definitively.
>
> > The reason for my concern is that if the opfuncid is set to
> > InvalidOid, set_sa_opfuncid() always sets the ScalarArrayOpExpr's
> > opfuncid to get_opcode(opexpr->opno).
>
> I will personally veto any design that involves setting opfuncid to
> something that doesn't match the opno.  That's just horrid, and it
> will break something somewhere, either immediately or down the road.

This is the "project design" style/policy I don't have. Thanks.

> I don't immediately see why you can't add an "invert" boolean flag to
> ScalarArrayOpExpr and let the executor machinery deal with this.  That'd
> have the advantage of not having to depend on there being a negator.

Don't we need to have a negator to be able to look up the proper hash
function? At least somewhere in the process you'd have to convert from
looking up the <> op to looking up the = op and then setting the
"invert" flag.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
James Coleman
Date:
On Fri, May 7, 2021 at 8:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> It's important to think of other cases, I just don't think there's any
> need to do anything for that one.  Remember that we have the
> restriction of requiring a set of Consts, so for that case to be met,
> someone would have to write something like: col =
> ALL('{1,1,1,1,1,1,1,1}'::int[]);  I think if anyone comes along
> complaining that a query containing that is not as fast as they'd like
> then we might tell them that they should just use: col = 1. A sanity
> checkup might not go amiss either.

I wasn't concerned with trying to optimize this case (I don't think we
can anyway, at least not without adding new work, like de-duplicating
the array first). Though I do hope that someday I'll/we'll get around
to getting the stable subexpressions caching patch finished, and then
this will be able to work for more than constant arrays.

I just wanted to confirm we'd thought through the cases we can't
handle to ensure we're not accidentally covering them.

James



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 8 May 2021 at 13:37, James Coleman <jtc331@gmail.com> wrote:
>
> On Fri, May 7, 2021 at 9:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I don't immediately see why you can't add an "invert" boolean flag to
> > ScalarArrayOpExpr and let the executor machinery deal with this.  That'd
> > have the advantage of not having to depend on there being a negator.
>
> Don't we need to have a negator to be able to look up the proper has
> function? At least somewhere in the process you'd have to convert from
> looking up the <> op to looking up the = op and then setting the
> "invert" flag.

Yeah, we *do* need to ensure there's a negator in the planner as we
need to use it during hash probes.  It's no good checking whether the
hash bucket we landed on matches using the <> operator's function.  We
won't find many matches that way!
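
Conceptually the probe ends up being something like this (ignoring the
NULL special cases; the lookup helper here is made up):

/* look up the scalar value using the equality (negator) function */
found = saop_hash_element_exists(elements_tab, scalarval); /* hypothetical */

/* IN keeps a match as-is, NOT IN inverts it */
result = opexpr->useOr ? found : !found;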

I'm not opposed to adding some new field if that's what it takes.  I'd
imagine the new field will be something like negfuncid which will be
InvalidOid unless the hash function is set and useOr == false.
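
So roughly this, in terms of how I imagine the node ends up looking
(sketch only; comments abbreviated):

typedef struct ScalarArrayOpExpr
{
    Expr        xpr;
    Oid         opno;           /* PG_OPERATOR OID of the operator */
    Oid         opfuncid;       /* PG_PROC OID of comparison function */
    Oid         hashfuncid;     /* PG_PROC OID of hash func or InvalidOid */
    Oid         negfuncid;      /* PG_PROC OID of the negator's function,
                                 * or InvalidOid; only for hashed NOT IN */
    bool        useOr;          /* true for ANY op, false for ALL op */
    Oid         inputcollid;    /* OID of collation that operator should use */
    List       *args;           /* the scalar and array operands */
    int         location;       /* token location, or -1 if unknown */
} ScalarArrayOpExpr;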

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 8 May 2021 at 14:04, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm not opposed to adding some new field if that's what it takes.  I'd
> imagine the new field will be something like negfuncid which will be
> InvalidOid unless the hash function is set and useOr == false

Just while this is still swapped into main memory, I've attached a
patch that adds a new field to ScalarArrayOpExpr rather than
repurposing the existing field.

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
Zhihong Yu
Date:


On Fri, May 7, 2021 at 9:50 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 8 May 2021 at 14:04, David Rowley <dgrowleyml@gmail.com> wrote:
> I'm not opposed to adding some new field if that's what it takes.  I'd
> imagine the new field will be something like negfuncid which will be
> InvalidOid unless the hash function is set and useOr == false

Just while this is still swapped into main memory, I've attached a
patch that adds a new field to ScalarArrayOpExpr rather than
repurposing the existing field.

David

Hi,

+       if (!OidIsValid(saop->negfuncid))
+           record_plan_function_dependency(root, saop->hashfuncid);

Is there a typo in the second line? (root, saop->negfuncid)

Cheers

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 8 May 2021 at 20:17, Zhihong Yu <zyu@yugabyte.com> wrote:
> +       if (!OidIsValid(saop->negfuncid))
> +           record_plan_function_dependency(root, saop->hashfuncid);
>
> Is there a typo in the second line ? (root, saop->negfuncid)

Yeah, that's a mistake. Thanks for checking it.

David



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Sat, 8 May 2021 at 20:29, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 8 May 2021 at 20:17, Zhihong Yu <zyu@yugabyte.com> wrote:
> > +       if (!OidIsValid(saop->negfuncid))
> > +           record_plan_function_dependency(root, saop->hashfuncid);
> >
> > Is there a typo in the second line ? (root, saop->negfuncid)
>
> Yeah, that's a mistake. Thanks for checking it.

I've attached a patch which fixes the mistake mentioned above.

Also, dropped the RMT from the thread. I only introduced them when I
wanted some input about whether hashing NOT IN should be included in PG14.
Nobody seems to think that should be done.

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
I've been looking at the NOT IN hashing patch again and after making a
few minor tweaks I think it's pretty much ready to go.

If anyone feels differently, please let me know in the next couple of
days. Otherwise, I plan on taking a final look and pushing it soon.

David

Attachment

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

From
David Rowley
Date:
On Tue, 6 Jul 2021 at 22:39, David Rowley <dgrowleyml@gmail.com> wrote:
> If anyone feels differently, please let me know in the next couple of
> days. Otherwise, I plan on taking a final look and pushing it soon.

After doing some very minor adjustments, I pushed this. (29f45e299).

Thanks to James and Zhihong for reviewing.

David