Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers

From James Coleman
Subject Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date
Msg-id CAAaqYe8zdhuwWdhdKcg9n2_SfLe2dHZfpyfymXgGGP9mHgt5uQ@mail.gmail.com
Whole thread Raw
In response to Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (James Coleman <jtc331@gmail.com>)
Responses Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On Fri, Sep 11, 2020 at 5:11 PM James Coleman <jtc331@gmail.com> wrote:
>
> On Tue, Sep 8, 2020 at 4:37 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 08/09/2020 22:25, James Coleman wrote:
> > > On Wed, Aug 19, 2020 at 3:16 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> You could also apply the optimization for non-Const expressions, as long
> > >> as they're stable (i.e. !contain_volatile_functions(expr)).
> > >
> > > Is that true? Don't we also have to worry about whether or not the
> > > value is stable (i.e., know when a param has changed)? There have been
> > > discussions about being able to cache stable subexpressions, and my
> > > understanding was that we'd need to have that infrastructure (along
> > > with the necessarily invalidation when the param changes) to be able
> > > to use this for non-Const expressions.
> >
> > Yeah, you're right, you'd have to also check for PARAM_EXEC Params. And
> > Vars. I think the conditions are the same as those checked in
> > match_clause_to_partition_key() in partprune.c (it's a long function,
> > search for "if (!IsA(rightop, Const))"). Not sure it's worth the
> > trouble, then. But it would be nice to handle queries like "WHERE column
> > = ANY ($1)"
>
> If I'm understanding properly you're imagining something in the form of:
>
> with x as (select '{1,2,3,4,5,6,7,8,9,10}'::int[])
> select * from t where i = any ((select * from x)::int[]);
>
> I've been playing around with this and currently have these checks:
>
> contain_var_clause((Node *) arrayarg)
> contain_volatile_functions((Node *) arrayarg)
> contain_exec_param((Node *) arrayarg, NIL)
>
> (the last one I had to modify the function to handle empty lists)
>
> If any are true, then have to disable the optimization. But for
> queries in the form above the test contain_exec_param((Node *)
> arrayarg, NIL) evaluates to true, even though we know from looking at
> the query that the array subexpression is stable for the length of the
> query.
>
> Am I misunderstanding what you're going for? Or is there a way to
> confirm the expr, although an exec param, won't change?
>
> Another interesting thing that this would imply is that we'd either have to:
>
> 1. Remove the array length check altogether,
> 2. Always use the hash when have a non-Const, but when a Const only if
> the array length check passes, or
> 3. Make this new expr op more fully featured by teaching it how to use
> either a straight loop through a deconstructed array or use the hash.
>
> That last option feeds into further discussion in the below:
>
> > >> Deconstructing the array Datum into a simple C array on first call would
> > >> be a win even for very small arrays and for AND semantics, even if you
> > >> don't use a hash table.
> > >
> > > Because you wouldn't have to repeatedly detoast it? Or some other
> > > reason I'm not thinking of? My intuition would have been that (aside
> > > from detoasting if necessary) there wouldn't be much real overhead in,
> > > for example, an array storing integers.
> >
> > Dealing with NULLs and different element sizes in the array is pretty
> > complicated. Looping through the array currently looks like this:
> >
> >         /* Loop over the array elements */
> >         s = (char *) ARR_DATA_PTR(arr);
> >         bitmap = ARR_NULLBITMAP(arr);
> >         bitmask = 1;
> >
> >         for (int i = 0; i < nitems; i++)
> >         {
> >                 Datum           elt;
> >                 Datum           thisresult;
> >
> >                 /* Get array element, checking for NULL */
> >                 if (bitmap && (*bitmap & bitmask) == 0)
> >                 {
> >                         fcinfo->args[1].value = (Datum) 0;
> >                         fcinfo->args[1].isnull = true;
> >                 }
> >                 else
> >                 {
> >                         elt = fetch_att(s, typbyval, typlen);
> >                         s = att_addlength_pointer(s, typlen, s);
> >                         s = (char *) att_align_nominal(s, typalign);
> >                         fcinfo->args[1].value = elt;
> >                         fcinfo->args[1].isnull = false;
> >                 }
> >
> >                 [do stuff with Datum/isnull]
> >
> >                 /* advance bitmap pointer if any */
> >                 if (bitmap)
> >                 {
> >                         bitmask <<= 1;
> >                         if (bitmask == 0x100)
> >                         {
> >                                 bitmap++;
> >                                 bitmask = 1;
> >                         }
> >                 }
> >         }
> >
> > Compared with just:
> >
> >         for (int i = 0; i < nitems; i++)
> >         {
> >                 Datum           elt = datums[i];
> >
> >                 [do stuff with the Datum]
> >         }
> >
> > I'm not sure how much difference that makes, but I presume it's not
> > zero, and it seems like an easy win when you have the code to deal with
> > the Datum array representation anyway.
>
> Doing this would necessitate option 3 above: we'd have to have this
> new expr op be able both to use a hash or alternatively do a normal
> loop.
>
> Being able to use this in more cases than just a Const array expr is
> certainly interesting, but I'm not sure yet about the feasibility or
> desirability of that at this point given the above restrictions.
>
> One other point in favor of the additional complexity here is that
> it's likely that the above described runtime switching between hash
> and loop would be necessary (for this optimization to come into play)
> if caching of stable subexpressions ever lands. I have some interest
> in working on that...but it's also a large project.

I've attached a cleaned up patch. Last CF it was listed in is
https://commitfest.postgresql.org/29/2542/ -- what's the appropriate
step to take here given it's an already existing patch, but not yet
moved into recent CFs?

Thanks,
James

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] Custom compression methods
Next
From: John Naylor
Date:
Subject: Re: Speeding up GIST index creation for tsvectors