Re: Making clausesel.c Smarter - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Making clausesel.c Smarter
Date
Msg-id CAGTBQpYqtJEHq9LKbX2HaLMbKJX7MS5zsOJ=i5FhYHO0jyj7QA@mail.gmail.com
Whole thread Raw
In response to Re: Making clausesel.c Smarter  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: Making clausesel.c Smarter  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>>> One last observation:
>>>
>>> +        /*
>>> +         * An IS NOT NULL test is a no-op if there's any other strict quals,
>>> +         * so if that's the case, then we'll only apply this, otherwise we'll
>>> +         * ignore it.
>>> +         */
>>> +        else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>>> +            s1 *= cslist->notnullsel;
>>>
>>> In some paths, the range selectivity estimation code punts and returns
>>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>>> present, it should still be applied. It could make a big difference if
>>> the selectivity for the nulltest is high.
>>
>> I'd say that the location that applies the DEFAULT_RANGE_INEQ_SEL
>> should apply the nullfrac. Seems wrong to rely on a IS NOT NULL test
>> to exists to estimate that properly. I don't think that needs done as
>> part of this patch. I'd rather limit the scope since "returned with
>> feedback" is already knocking at the door of this patch.
>
> I agree, but this patch regresses for those cases if the user
> compensated with a not null test, leaving little recourse, so I'd fix
> it in this patch to avoid the regression.
>
> Maybe you're right that the null fraction should be applied regardless
> of whether there's an explicit null check, though.

The more I read that code, the more I'm convinced there's no way to
get a DEFAULT_RANGE_INEQ_SEL if (rqlist->hibound != DEFAULT_INEQ_SEL
&&rqlist->lobound != DEFAULT_INEQ_SEL) other than from contradictory
clauses, a case the current code handles very poorly, returning
DEFAULT_RANGE_INEQ_SEL and ignoring nullfrac, something that could
give way-off estimates if the table had lots of nulls.

Say, consider if "value" was mostly null and you write:

SELECT * FROM testdata WHERE value BETWEEN 10 AND 20 AND value BETWEEN
50 AND 60;

All other cases I can think of would end with either hibound or
lobound being equal to DEFAULT_INEQ_SEL.

It seems to me, doing a git blame, that the checks in the else branch
were left only as a precaution, one that seems unnecessary.

If you ask me, I'd just leave:

+ if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound ==
DEFAULT_INEQ_SEL)
+ {
+     /* No point in checking null selectivity, DEFAULT_INEQ_SEL
implies we have no stats */
+     s2 = DEFAULT_RANGE_INEQ_SEL;
+ }
+ else
+ {
...
+   s2 = rqlist->hibound + rqlist->lobound - 1.0;
+   nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
+   notnullsel = 1.0 - nullsel;
+
+   /* Adjust for double-exclusion of NULLs */
+   s2 += nullsel;
+   if (s2 <= 0.0) {
+      if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
+      {
+           /* Most likely contradictory clauses found */
+           s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
+      }
+      else
+      {
+           /* Could be a rounding error */
+           s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
+      }
+   }
+ }

Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
cautious) estimation of the amount of rounding error that could be
there with 32-bit floats.

The above assumes a non-DEFAULT_INEQ_SEL value in lobound/hibound is
an estimation based on stats, maybe approximate, but one that makes
sense (ie: preserves the monotonicity of the CPF), and as such
negative values are either sign of a contradiction or rounding error.
The previous code left the possibility of negatives coming out of
default selectivities creeping up on non-DEFAULT_INEQ_SEL estimates,
but that doesn't seem possible coming out of scalarineqsel.

But I'd like it if Tom could comment on that, since he's the one that
did that in commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a (way back
in 2004).

Barring that, I'd just change the

s2 = DEFAULT_RANGE_INEQ_SEL;

To

s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;

Which makes a lot more sense.



pgsql-hackers by date:

Previous
From: vinayak
Date:
Subject: Re: ANALYZE command progress checker
Next
From: Chapman Flack
Date:
Subject: If an extension library is loaded during pg_upgrade, can it tell?