Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers

From James Coleman
Subject Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date
Msg-id CAAaqYe9Mg0r_0oO1gt0rwoGphgnvBaULoEMaJMwouYvmf+oMuA@mail.gmail.com
Whole thread Raw
In response to Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Binary search in ScalarArrayOpExpr for OR'd constant arrays  (James Coleman <jtc331@gmail.com>)
Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (James Coleman <jtc331@gmail.com>)
List pgsql-hackers
On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml@gmail.com> wrote:
> >> >>
> >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> >> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> >> > think ran into mostly the same challenge of deciding when the bloom
> >> >> > filter can be useful and is worth the extra work.
> >> >>
> >> >> Speaking of that, it would be interesting to see how a test where you
> >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> >> be interesting to know if the planner is able to make a more suitable
> >> >> choice and also to see how all the work over the years to improve Hash
> >> >> Joins compares to the bsearch with and without the bloom filter.
> >> >
> >> >It would be interesting.
> >> >
> >> >It also makes one wonder about optimizing these into to hash
> >> >joins...which I'd thought about over at [1]. I think it'd be a very
> >> >significant effort though.
> >> >
> >>
> >> I modified the script to also do the join version of the query. I can
> >> only run it on my laptop at the moment, so the results may be a bit
> >> different from those I shared before, but it's interesting I think.
> >>
> >> In most cases it's comparable to the binsearch/bloom approach, and in
> >> some cases it actually beats them quite significantly. It seems to
> >> depend on how expensive the comparison is - for "int" the comparison is
> >> very cheap and there's almost no difference. For "text" the comparison
> >> is much more expensive, and there are significant speedups.
> >>
> >> For example the test with 100k lookups in array of 10k elements and 10%
> >> match probability, the timings are these
> >>
> >>    master:  62362 ms
> >>    binsearch: 201 ms
> >>    bloom:      65 ms
> >>    hashjoin:   36 ms
> >>
> >> I do think the explanation is fairly simple - the bloom filter
> >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> >> some overhead to build and check the bits. The hash join probably
> >> eliminates a lot of the remaining comparisons, because the hash table
> >> is sized to have one tuple per bucket.
> >>
> >> Note: I also don't claim the PoC has the most efficient bloom filter
> >> implementation possible. I'm sure it could be made faster.
> >>
> >> Anyway, I'm not sure transforming this to a hash join is worth the
> >> effort - I agree that seems quite complex. But perhaps this suggest we
> >> should not be doing binary search and instead just build a simple hash
> >> table - that seems much simpler, and it'll probably give us about the
> >> same benefits.
> >
> >That's actually what I originally thought about doing, but I chose
> >binary search since it seemed a lot easier to get off the ground.
> >
>
> OK, that makes perfect sense.
>
> >If we instead build a hash is there anything else we need to be
> >concerned about? For example, work mem? I suppose for the binary
> >search we already have to expand the array, so perhaps it's not all
> >that meaningful relative to that...
> >
>
> I don't think we need to be particularly concerned about work_mem. We
> don't care about it now, and it's not clear to me what we could do about
> it - we already have the array in memory anyway, so it's a bit futile.
> Furthermore, if we need to care about it, it probably applies to the
> binary search too.
>
> >I was looking earlier at what our standard hash implementation was,
> >and it seemed less obvious what was needed to set that up (so binary
> >search seemed a faster proof of concept). If you happen to have any
> >pointers to similar usages I should look at, please let me know.
> >
>
> I think the hash join implementation is far too complicated. It has to
> care about work_mem, so it implements batching, etc. That's a lot of
> complexity we don't need here. IMO we could use either the usual
> dynahash, or maybe even the simpler simplehash.
>
> FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> benchmarks makes sense and running it independently. I'm not aware of
> any issues but it was done late at night and only ran on my laptop.

Some quick calculations (don't have the scripting in a form I can
attach yet; using this as an opportunity to hack on a genericized
performance testing framework of sorts) suggest your results are
correct. I was also testing on my laptop, but I showed 1.) roughly
equivalent results for IN (VALUES ...) and IN (<list>) for integers,
but when I switch to (short; average 3 characters long) text values I
show the hash join on VALUES is twice as fast as the binary search.

Given that, I'm planning to implement this as a hash lookup and share
a revised patch.

James



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Setting min/max TLS protocol in clientside libpq
Next
From: James Coleman
Date:
Subject: Re: doc review for v13