Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers
From: James Coleman
Subject: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date:
Msg-id: CAAaqYe8bAfECVwRVG8YdKbQ3yuQhxCytByM4Ln_vbJcA8BWQrw@mail.gmail.com
In response to: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
List: pgsql-hackers
On Thu, Apr 23, 2020 at 10:55 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Apr 23, 2020 at 09:02:26AM -0400, James Coleman wrote:
> >On Thu, Apr 23, 2020 at 8:47 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Apr 20, 2020 at 09:27:34PM -0400, James Coleman wrote:
> >> >Over in "execExprInterp() questions / How to improve scalar array op
> >> >expr eval?" [1] I'd mused about how we might be able to optimize
> >> >scalar array ops with OR'd semantics.
> >> >
> >> >This patch implements a binary search for such expressions when the
> >> >array argument is a constant so that we can avoid needing to teach
> >> >expression execution to cache stable values or know when a param has
> >> >changed.
> >> >
> >> >The speed-up for the target case can be pretty impressive: in my
> >> >admittedly contrived and relatively unscientific test with a query in
> >> >the form:
> >> >
> >> >select count(*) from generate_series(1,100000) n(i) where i in (<1000
> >> >random integers in the series>)
> >> >
> >> >shows ~30ms for the patch versus ~640ms on master.
> >> >
> >>
> >> Nice improvement, although 1000 items is probably a bit unusual. The
> >> threshold used in the patch (9 elements) seems a bit too low - what
> >> results have you seen with smaller arrays?
> >
> >At least in our systems we regularly work with 1000 batches of items,
> >which means you get IN clauses of identifiers of that size. Admittedly
> >the most common case sees those IN clauses as simple index scans
> >(e.g., WHERE <primary key> IN (...)), but it's also common to have a
> >broader query that merely filters additionally on something like "...
> >AND <some foreign key> IN (...)" where it makes sense for the rest of
> >the quals to take precedence in generating a reasonable plan. In that
> >case, the IN becomes a regular filter, hence the idea behind the
> >patch.
> >
> >Side note: I'd love for us to be able to treat "IN (VALUES)" the same
> >way...but as noted in the other thread that's an extremely large
> >amount of work, I think. But similarly you could use a hash here
> >instead of a binary search...but this seems quite good.
> >
> >As to the choice of 9 elements: I just picked that as a starting
> >point; Andres had previously commented off hand that at 8 elements
> >serial scanning was faster, so I figured this was a reasonable
> >starting point for discussion.
> >
> >Perhaps it would make sense to determine that minimum not as a pure
> >constant but scaled based on how many rows the planner expects us to
> >see? Of course that'd be a more invasive patch...so may or may not be
> >as feasible as a reasonable default.
> >
>
> Not sure. That seems a bit overcomplicated and I don't think it depends
> on the number of rows the planner expects to see very much. I think we
> usually assume the linear search is cheaper for small arrays and then at
> some point the binary search starts winning. The question is where this
> "break even" point is.

Well, since it has to do preprocessing work (expanding the array and then
sorting it), the number of rows processed matters, right? For example,
doing a linear search on 1000 items only once is going to be cheaper than
preprocessing the array and then doing a binary search, but at a very
large row count the binary search + preprocessing might very well win out
even for only a 10 element array.
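To make that trade-off concrete, here's a minimal standalone sketch (plain
C over ints rather than Datums; it is not the patch code, and the names
preprocess_array/binsearch_member are just illustrative): the sort is paid
once per scan, the O(log n) probe is paid per row, so the break-even point
depends on both the array length and the number of rows that reach the
qual.

/*
 * Illustrative sketch only, not the patch itself: sort the constant
 * "IN list" once, then probe it per row with a binary search.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* One-time preprocessing: O(n log n) sort, paid once per scan. */
static void
preprocess_array(int *arr, size_t nelems)
{
    qsort(arr, nelems, sizeof(int), cmp_int);
}

/* Per-row probe: O(log n) instead of the O(n) linear scan. */
static bool
binsearch_member(const int *arr, size_t nelems, int key)
{
    size_t lo = 0, hi = nelems;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (arr[mid] < key)
            lo = mid + 1;
        else if (arr[mid] > key)
            hi = mid;
        else
            return true;
    }
    return false;
}

int
main(void)
{
    /* hypothetical 10-element IN list, probed for many rows */
    int in_list[] = {42, 7, 19, 3, 88, 56, 21, 64, 30, 11};
    size_t nelems = sizeof(in_list) / sizeof(in_list[0]);
    long matches = 0;

    preprocess_array(in_list, nelems);

    /* The per-row savings only pay off once enough rows are filtered. */
    for (int row = 0; row < 100000; row++)
        matches += binsearch_member(in_list, nelems, row % 200);

    printf("matches: %ld\n", matches);
    return 0;
}

With a handful of rows the one-time qsort dominates and the linear scan
wins; with 100000 rows the cheaper per-row probe amortizes it easily, even
for a 10-element array.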
I'm not trying to argue for more work for myself here: I think the
optimization is worth it on its own, and something like this could be a
further improvement on its own. But it is interesting to think about.

James