Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers

From James Coleman
Subject Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date
Msg-id CAAaqYe-9N32MNK+nawqQfRcc-cwEbruzDV65p71X2nsntetyBQ@mail.gmail.com
In response to Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >Over in "execExprInterp() questions / How to improve scalar array op
> >expr eval?" [1] I'd mused about how we might be able to optimized
> >scalar array ops with OR'd semantics.
> >
> >This patch implements a binary search for such expressions when the
> >array argument is a constant so that we can avoid needing to teach
> >expression execution to cache stable values or know when a param has
> >changed.
> >
> >The speed-up for the target case can be pretty impressive: in my
> >admittedly contrived and relatively unscientific test with a query in
> >the form:
> >
> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >random integers in the series>)
> >
> >shows ~30ms for the patch versus ~640ms on master.
> >
>
> Nice improvement, although 1000 items is probably a bit unusual. The
> threshold used in the patch (9 elements) seems a bit too low - what
> results have you seen with smaller arrays?

At least in our systems we regularly work with batches of 1000 items,
which means you get IN clauses of identifiers of that size. Admittedly
the most common case sees those IN clauses as simple index scans
(e.g., WHERE <primary key> IN (...)), but it's also common to have a
broader query that merely filters additionally on something like "...
AND <some foreign key> IN (...)" where it makes sense for the rest of
the quals to take precedence in generating a reasonable plan. In that
case, the IN becomes a regular filter, hence the idea behind the
patch.

Side note: I'd love for us to be able to treat "IN (VALUES)" the same
way...but as noted in the other thread that's an extremely large
amount of work, I think. But similarly you could use a hash here
instead of a binary search...but this seems quite good.

As to the choice of 9 elements: I just picked that as a starting
point; Andres had previously commented off hand that at 8 elements
serial scanning was faster, so I figured this was a reasonable
starting point for discussion.
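To make the trade-off concrete, here's a minimal standalone sketch of the two strategies being compared (plain C using libc's `qsort`/`bsearch`; the function and variable names are illustrative, not the patch's actual code). The point is that the sort is paid once per constant array, while each per-row membership test drops from O(n) to O(log n), so the crossover threshold depends on how many rows are filtered:

```c
#include <stdlib.h>
#include <string.h>

/* Comparator over ints for qsort/bsearch. */
static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Linear scan: roughly what evaluating the SAOP does per row today. */
static int
in_linear(int key, const int *arr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (arr[i] == key)
            return 1;
    return 0;
}

/*
 * Binary search: assumes the constant array was sorted once up front
 * (e.g. at expression initialization), then reused for every row.
 */
static int
in_binary(int key, const int *sorted, size_t n)
{
    return bsearch(&key, sorted, n, sizeof(int), cmp_int) != NULL;
}
```

For tiny arrays the linear scan wins (no sort overhead, better branch behavior), which is why some cutoff like the patch's 9 elements is needed; where exactly it sits is an empirical question.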

Perhaps it would make sense to determine that minimum not as a pure
constant but scaled based on how many rows the planner expects us to
see? Of course that'd be a more invasive patch...so may or may not be
as feasible as a reasonable default.

> Another idea - would a bloom filter be useful here, as a second
> optimization? That is, for large arrays build a small bloom filter,
> allowing us to skip even the binary search.

That's an interesting idea. I actually haven't personally worked with
bloom filters, so didn't think about that.

Are you thinking that you'd also build the filter *and* presort the
array? Or try to get away with using only the bloom filter and not
expanding and sorting the array at all?
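For illustration, the prefilter idea might look something like the sketch below (a hypothetical 64-bit filter with two multiplicative hashes; the sizes, hash constants, and names are my assumptions, not anything PostgreSQL uses). The filter is built once from the constant array; a negative answer skips the binary search entirely, while a positive answer still requires the search because of possible false positives:

```c
#include <stdint.h>

/* A deliberately tiny, illustrative Bloom filter: one 64-bit word. */
typedef struct
{
    uint64_t bits;
} TinyBloom;

/* Two cheap multiplicative hashes; the top 6 bits select a bit position. */
static unsigned
bloom_pos1(int v)
{
    return (unsigned) (((uint64_t) (uint32_t) v * 0x9E3779B97F4A7C15ULL) >> 58);
}

static unsigned
bloom_pos2(int v)
{
    return (unsigned) (((uint64_t) (uint32_t) v * 0xC2B2AE3D27D4EB4FULL) >> 58);
}

/* Add one array element to the filter (done once, at build time). */
static void
bloom_add(TinyBloom *f, int v)
{
    f->bits |= UINT64_C(1) << bloom_pos1(v);
    f->bits |= UINT64_C(1) << bloom_pos2(v);
}

/*
 * Per-row test: false means "definitely not in the array" (skip the
 * search); true means "maybe present" (fall through to binary search).
 */
static int
bloom_maybe(const TinyBloom *f, int v)
{
    return (f->bits & (UINT64_C(1) << bloom_pos1(v))) != 0 &&
           (f->bits & (UINT64_C(1) << bloom_pos2(v))) != 0;
}
```

The filter guarantees no false negatives, so it can only ever save work, at the cost of the extra build step and a couple of extra branches per row when the answer is "maybe".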

Thanks,
James
