Re: Making clausesel.c Smarter - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Making clausesel.c Smarter
Msg-id CAGTBQpaTNrXBuNw6-MwLLuuGUDT3db2QM_1t-Fo3GrzF33Harw@mail.gmail.com
In response to [HACKERS] Making clausesel.c Smarter  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
On Tue, Apr 4, 2017 at 8:21 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 4 April 2017 at 13:35, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
>>> <david.rowley@2ndquadrant.com> wrote:
>>>>> One last observation:
>>>>>
>>>>> +        /*
>>>>> +         * An IS NOT NULL test is a no-op if there are any other
>>>>> +         * strict quals, so we apply it only when it's the sole
>>>>> +         * cached qual; otherwise we ignore it.
>>>>> +         */
>>>>> +        else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>>>>> +            s1 *= cslist->notnullsel;
>>>>>
>>>>> In some paths, the range selectivity estimation code punts and returns
>>>>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>>>>> present, it should still be applied. It could make a big difference if
>>>>> the selectivity for the nulltest is high.
>>>>
>>>> I'd say that the location that applies DEFAULT_RANGE_INEQ_SEL
>>>> should apply the nullfrac. It seems wrong to rely on an IS NOT NULL
>>>> test existing to estimate that properly. I don't think that needs
>>>> to be done as part of this patch; I'd rather limit the scope, since
>>>> "returned with feedback" is already knocking at the door of this
>>>> patch.
>>>
>>> I agree, but this patch regresses those cases if the user
>>> compensated with a NOT NULL test, leaving them little recourse, so
>>> I'd fix it in this patch to avoid the regression.
>>>
>>> Maybe you're right that the null fraction should be applied regardless
>>> of whether there's an explicit null check, though.
>>
>> The more I read that code, the more I'm convinced there's no way to
>> get a DEFAULT_RANGE_INEQ_SEL if (rqlist->hibound != DEFAULT_INEQ_SEL
>> && rqlist->lobound != DEFAULT_INEQ_SEL) other than from contradictory
>> clauses. That's a case the current code handles very poorly: it
>> returns DEFAULT_RANGE_INEQ_SEL and ignores nullfrac, which could give
>> way-off estimates if the table has lots of nulls.
>>
>> Say "value" is mostly null and you write:
>>
>> SELECT * FROM testdata WHERE value BETWEEN 10 AND 20 AND value BETWEEN
>> 50 AND 60;
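>>
>> (Worked numbers, just to make the failure mode concrete: if the stats
>> put, say, P(value <= 20) at 0.01 and P(value >= 50) at 0.005 over all
>> rows, then s2 = hibound + lobound - 1.0 = 0.01 + 0.005 - 1.0 = -0.985,
>> far beyond anything roundoff could produce, so the clauses are
>> contradictory; yet the current code returns DEFAULT_RANGE_INEQ_SEL =
>> 0.005 with no scaling by the non-null fraction.)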
>>
>> All other cases I can think of would end with either hibound or
>> lobound being equal to DEFAULT_INEQ_SEL.
>>
>> Doing a git blame, it seems to me that the checks in the else branch
>> were left only as a precaution, one that now seems unnecessary.
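>>
>> For reference, the else-branch checks in question currently look
>> roughly like this (condensed from clausesel.c):
>>
>>     s2 = rqlist->hibound + rqlist->lobound - 1.0;
>>     if (s2 <= 0.0)
>>     {
>>         if (s2 < -0.01)
>>             /* No data available --- use a default estimate */
>>             s2 = DEFAULT_RANGE_INEQ_SEL;
>>         else
>>             /* It's just roundoff error; use a small positive value */
>>             s2 = 1.0e-10;
>>     }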
>>
>> If you ask me, I'd just leave:
>>
>> + if (rqlist->hibound == DEFAULT_INEQ_SEL ||
>> +     rqlist->lobound == DEFAULT_INEQ_SEL)
>> + {
>> +     /*
>> +      * No point in checking null selectivity: DEFAULT_INEQ_SEL
>> +      * implies we have no stats.
>> +      */
>> +     s2 = DEFAULT_RANGE_INEQ_SEL;
>> + }
>> + else
>> + {
>> ...
>> +     s2 = rqlist->hibound + rqlist->lobound - 1.0;
>> +     nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
>> +     notnullsel = 1.0 - nullsel;
>> +
>> +     /* Adjust for double-exclusion of NULLs */
>> +     s2 += nullsel;
>> +     if (s2 <= 0.0)
>> +     {
>> +         if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
>> +         {
>> +             /* Most likely contradictory clauses found */
>> +             s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
>> +         }
>> +         else
>> +         {
>> +             /* Could be a rounding error */
>> +             s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>> +         }
>> +     }
>> + }
>>
>> Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and
>> overly cautious) estimate of the amount of rounding error that could
>> be present with 32-bit floats.
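>>
>> (To put numbers on it: with notnullsel = 0.5 the cutoff is
>> -1.0e-4 * 0.5 - 1.0e-6 = -5.1e-5. The -0.985 from the contradictory
>> example above falls well below that and gets clamped to a tiny
>> positive value, while an s2 of, say, -1.0e-6 is written off as
>> roundoff and becomes DEFAULT_RANGE_INEQ_SEL * notnullsel.)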
>>
>> The above assumes that a non-DEFAULT_INEQ_SEL value in
>> lobound/hibound is an estimate based on stats, maybe an approximate
>> one, but one that makes sense (i.e., preserves the monotonicity of
>> the CDF), and as such negative values are a sign of either a
>> contradiction or rounding error. The previous code allowed for
>> negatives coming out of default selectivities creeping into
>> non-DEFAULT_INEQ_SEL estimates, but that doesn't seem possible coming
>> out of scalarineqsel.
>>
>> But I'd like it if Tom could comment on that, since he's the one that
>> did that in commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a (way back
>> in 2004).
>>
>> Barring that, I'd just change
>>
>> s2 = DEFAULT_RANGE_INEQ_SEL;
>>
>> to
>>
>> s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
>>
>> which makes a lot more sense.
>
> I think that to fix this properly the selfuncs would have to return
> the estimate and the nullfrac separately, so the nullfrac could be
> applied just once per expression. There are already hacks in there to
> get rid of the double null filtering. I'm not proposing that for this
> patch; it would be pretty invasive to change.

There's no need: you can compute the nullfrac with nulltestsel. While
there's some rework involved, it's not a big deal (it just reads the
stats tuple), and it's a clean solution.
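
A minimal sketch of what I mean, reusing the rqlist machinery from the
snippet above plus the jointype/sjinfo that clauselist_selectivity
already has in scope (the exact nulltestsel signature may differ a bit
across versions):

    Selectivity nullsel;
    Selectivity notnullsel;

    /*
     * nulltestsel() just fetches the column's pg_statistic tuple and
     * reads out the stored null fraction, so the rework costs one
     * extra stats lookup.
     */
    nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid,
                          jointype, sjinfo);
    notnullsel = 1.0 - nullsel;

    /* apply the non-null fraction once per expression */
    s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;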


