Thread: Patch: improve selectivity estimation for IN/NOT IN

Patch: improve selectivity estimation for IN/NOT IN

From
Daniele Varrazzo
Date:
Hello,

the attached patch improves the array selectivity estimation for = ANY
and <> ALL, hence for the IN/NOT IN operators, to avoid the
shortcoming described in
<http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>.

Cheers,

-- Daniele

Attachment

Re: Patch: improve selectivity estimation for IN/NOT IN

From
Tom Lane
Date:
Daniele Varrazzo <daniele.varrazzo@gmail.com> writes:
> the attached patch improves the array selectivity estimation for = ANY
> and <> ALL, hence for the IN/NOT IN operators, to avoid the
> shortcoming described in
> <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>.

In connection with Alexander Korotkov's array-estimation patch,
I just committed some code into scalararraysel() that checks whether the
operator is equality or inequality of the array element type.  It does
that by consulting the default btree or hash opclass for the element
type.  I did that with the thought that it could be used to attack this
issue too, but I see that you've done it another way, ie check to see
whether the operator uses eqsel() or neqsel() as selectivity estimator.

I'm not sure offhand which way is better.  It could be argued that yours
is more appropriate because if the operator isn't btree equality, but acts
enough like it to use eqsel() as estimator, then it's still appropriate
for scalararraysel() to treat it as equality.  On the other side of the
coin, an operator might be equality but have reason to use some
operator-specific estimator rather than eqsel().  We have real examples
of the former (such as the approximate-equality geometric operators)
but I think the latter case is just hypothetical.  Another thing that
has to be thought about is that there are numerous cross-type operators
that use eqsel, such as date-vs-timestamp, and it's far from clear
whether it's appropriate for scalararraysel() to use the modified stats
calculation when dealing with one of these.  The btree-based logic is
impervious to that since it won't match any cross-type operator.

Thoughts?

(BTW, in any case I don't trust function pointer comparison to be
portable.  It'd be a lot safer to look at the function OID.)
        regards, tom lane


Re: Patch: improve selectivity estimation for IN/NOT IN

From
Daniele Varrazzo
Date:
On Sun, Mar 4, 2012 at 2:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniele Varrazzo <daniele.varrazzo@gmail.com> writes:
>> the attached patch improves the array selectivity estimation for = ANY
>> and <> ALL, hence for the IN/NOT IN operators, to avoid the
>> shortcoming described in
>> <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>.
>
> In connection with Alexander Korotkov's array-estimation patch,
> I just committed some code into scalararraysel() that checks whether the
> operator is equality or inequality of the array element type.

It looks like you have grand plans for array estimation. My patch has
a much more modest scope, and I'm hoping it could be applied to
currently maintained PG versions, as I consider the currently produced
estimations a bug.

> It does
> that by consulting the default btree or hash opclass for the element
> type.  I did that with the thought that it could be used to attack this
> issue too, but I see that you've done it another way, ie check to see
> whether the operator uses eqsel() or neqsel() as selectivity estimator.
>
> I'm not sure offhand which way is better.  It could be argued that yours
> is more appropriate because if the operator isn't btree equality, but acts
> enough like it to use eqsel() as estimator, then it's still appropriate
> for scalararraysel() to treat it as equality.  On the other side of the
> coin, an operator might be equality but have reason to use some
> operator-specific estimator rather than eqsel().  We have real examples
> of the former (such as the approximate-equality geometric operators)
> but I think the latter case is just hypothetical.  Another thing that
> has to be thought about is that there are numerous cross-type operators
> that use eqsel, such as date-vs-timestamp, and it's far from clear
> whether it's appropriate for scalararraysel() to use the modified stats
> calculation when dealing with one of these.  The btree-based logic is
> impervious to that since it won't match any cross-type operator.
>
> Thoughts?
>
> (BTW, in any case I don't trust function pointer comparison to be
> portable.  It'd be a lot safer to look at the function OID.)

My original idea was to compare the comparison function with the
catalog: I just don't know how to inspect the catalog/perform the
check. Studying how to do it I've noticed the fn_addr referring to the
"well known" eqsel/neqsel functions and thought about using it as
indicators of the right operator semantics.

If you are still interested in the patch, for sake of bugfixing, and
somebody provides an example in the source about how to compare the
functions oid, I can try to improve it.

-- Daniele


Re: Patch: improve selectivity estimation for IN/NOT IN

From
Euler Taveira de Oliveira
Date:
On 04-03-2012 00:20, Daniele Varrazzo wrote:
> It looks like you have grand plans for array estimation. My patch has
> a much more modest scope, and I'm hoping it could be applied to
> currently maintained PG versions, as I consider the currently produced
> estimations a bug.
> 
We don't normally add new features to stable branches unless it is a bug. In
the optimizer case, planner regression is a bug (that this case is not).


--   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/  PostgreSQL: Consultoria, Desenvolvimento,
Suporte24x7 e Treinamento
 


Re: Patch: improve selectivity estimation for IN/NOT IN

From
Daniele Varrazzo
Date:
On Sun, Mar 4, 2012 at 1:44 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:
> On 04-03-2012 00:20, Daniele Varrazzo wrote:
>> It looks like you have grand plans for array estimation. My patch has
>> a much more modest scope, and I'm hoping it could be applied to
>> currently maintained PG versions, as I consider the currently produced
>> estimations a bug.
>>
> We don't normally add new features to stable branches unless it is a bug. In
> the optimizer case, planner regression is a bug (that this case is not).

Please note that we are talking about planning errors leading to
estimates of records in the millions instead of in the units, in
queries where all the elements are known (most common elements, with
the right stats, included in the query), even with a partial index
whose size could never physically contain all the estimated rows
screaming "something broken here!". So, it's not a regression as the
computation has always been broken, but I think it can be hardly
considered not a bug.

OTOH, I expect the decision of leaving things as they are could be
taken on the basis of the possibility that some program may stop
working in reaction of an altered query plan: I'm not going to argue
about this, although I feel it unlikely.

-- Daniele


Re: Patch: improve selectivity estimation for IN/NOT IN

From
Tom Lane
Date:
Daniele Varrazzo <daniele.varrazzo@gmail.com> writes:
> On Sun, Mar 4, 2012 at 1:44 PM, Euler Taveira de Oliveira
> <euler@timbira.com> wrote:
>> On 04-03-2012 00:20, Daniele Varrazzo wrote:
>>> It looks like you have grand plans for array estimation. My patch has
>>> a much more modest scope, and I'm hoping it could be applied to
>>> currently maintained PG versions, as I consider the currently produced
>>> estimations a bug.

>> We don't normally add new features to stable branches unless it is a bug. In
>> the optimizer case, planner regression is a bug (that this case is not).

> Please note that we are talking about planning errors leading to
> estimates of records in the millions instead of in the units,

We're also talking about a proposed patch that is not clearly a bug fix,
but is a change in a heuristic, meaning it has the potential to make
things worse in some cases.  (Notably, for arrays that *don't* contain
all-distinct values, the estimates are likely to get worse.)  So I
wasn't thinking of it as being back-patchable anyway.  It's generally
better not to destabilize planner behavior in minor releases, even if
it's a 90%-of-the-time improvement.
        regards, tom lane


Re: Patch: improve selectivity estimation for IN/NOT IN

From
Tom Lane
Date:
I wrote:
> Daniele Varrazzo <daniele.varrazzo@gmail.com> writes:
>> the attached patch improves the array selectivity estimation for = ANY
>> and <> ALL, hence for the IN/NOT IN operators, to avoid the
>> shortcoming described in
>> <http://archives.postgresql.org/pgsql-performance/2012-03/msg00006.php>.

I've committed a modified version of this patch.

> I'm not sure offhand which way is better.  It could be argued that yours
> is more appropriate because if the operator isn't btree equality, but acts
> enough like it to use eqsel() as estimator, then it's still appropriate
> for scalararraysel() to treat it as equality.  On the other side of the
> coin, an operator might be equality but have reason to use some
> operator-specific estimator rather than eqsel().

After some reflection I decided it was probably sane to use both
methods, that is apply the disjoint-probabilities calculation if the
operator appears to be equality/inequality according to either rule.
However, I also put in the safety valve I suggested in the -performance
thread, that the code fall back to the old calculation if the assumption
of disjoint probabilities yields an impossible result.  The patch as you
had it would have just clamped such a result to 1 or 0, which didn't
seem to me to be the best we could do.
        regards, tom lane