Re: [HACKERS] PATCH: multivariate histograms and MCV lists - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date
Msg-id b423b2fe-8aff-d64c-152a-8c77cf45faa7@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On 03/28/2018 04:12 PM, Dean Rasheed wrote:
> On 28 March 2018 at 01:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> Attached is a patch fixing this. In the end I've decided to keep both
>> branches - one handling boolean Vars and one for NOT clauses. I think
>> you're right we can only see (NOT var) cases, but I'm not sure about that.
>>
>> For example, what if an operator does not have a negator? Then we can't
>> transform NOT (a AND b) => (NOT a OR NOT b), I guess. So I kept this for
>> now, and we can remove this later.
>>
> 
> OK, but it's going to have to work harder to set "fullmatch"
> correctly. If you have a boolean Var clause, which is identical to
> "bool_var = true", it ought to add to "eqmatches" if true is found in
> the MCV list. Likewise a boolean Var under a NOT clause is identical
> to "bool_var = false", so it ought to add to "eqmatches" if false is
> found in the MCV list. Both those cases would be easy to handle, if
> general NOT support wasn't required, and you just special-cased "NOT
> bool_var".
> 
> If you're going to handle the general case of an arbitrary clause
> under a NOT, then the recursive call to mcv_update_match_bitmap()
> would seem to need to know that it's under a NOT (a new "is_not"
> parameter?), to invert the logic around adding to "eqmatches". That
> applies to other general OpExpr's too -- for example, "NOT (box_var =
> ?)" won't be rewritten because there is no box_ne operator, but when
> mcv_update_match_bitmap() is called recursively with the "box_var =
> ?", it shouldn't add to "eqmatches", despite this being an EQSEL
> operator.
> 
> As mentioned before, I think this whole thing only works if
> mcv_update_match_bitmap() returns the "eqmatches" list, so that if it
> is called recursively, it can be merged with the caller's list. What
> isn't immediately obvious to me is what happens to a NOT clause under
> another NOT clause, possibly with an AND or OR in-between. Would the
> "is_not" parameter just flip back to false again?
> 

After thinking about this a bit more, I'm not sure if updating the info
based on recursive calls makes sense. The fullmatch flag was supposed to
answer a simple question - can there be just a single matching item?

If there are equality conditions on all columns, there can be just a
single matching item - if we have found it in the MCV (i.e. s1 > 0.0),
then we don't need to inspect the non-MCV part.

But handling this in recursive manner breaks this entirely, because with
something like

   (a=1) AND (b=1 OR b=2)

you suddenly can have multiple matching items. Which makes the fullmatch
flag somewhat useless.

So I think we should be looking at top-level equality clauses only, just
like number_of_groups() does.


> There's also an interesting question around the NullTest clause. Since
> NULLs are being recorded in the MCV list, shouldn't "IS NULL" be
> treated as semantically like an equality clause, and cause that
> attribute to be added to "eqmatches" if NULL is found in the MCV list?
> 
> 
>> I've also realized that the "fullmatch" flag is somewhat confused,
>> because some places interpreted it as "there is equality on each
>> attribute" but in fact it also required an actual MCV match.
> 
> Yes, I was having similar thoughts. I think "eqmatches" / "fullmatch"
> probably just wants to track whether there was an exact comparison on
> all the attributes, not whether or not the value was in the MCV list,
> because the latter is already available in the "matches" bitmap.
> Knowing that complete, exact comparisons happened, and it wasn't in
> the MCV list, makes the "(1 - mcv_totalsel)) / otherdistinct" estimate
> reasonable.
> 

I think we can remove the fullmatch flag from mcv_update_bitmap
entirely. All we need to know is the presence of equality clauses and
whether there was a match in MCV (which we know from s1 > 0.0).

> However, I don't think that tracking "eqmatches" or "fullmatch" is
> sufficient for the general case. For example, for other operators like
> "!=", "<", "<=", all (or maybe half) the "1 - mcv_totalsel" ought to
> count towards the selectivity, plus possibly part of the MCV list
> (e.g., for "<=", using the sum of the matching MCV frequencies plus
> half the sum of the non-MCV frequencies might be reasonable -- c.f.
> scalarineqsel()). For an OR clause, you might want to count the number
> of non-MCV matches, because logically each one adds another "(1 -
> mcv_totalsel)) / otherdistinct" to the total selectivity. It's not
> immediately obvious how that can be made to fit into the current code
> structure. Perhaps it could be made to work by tracking the overall
> selectivity as it goes along. Or perhaps it could track the
> count/proportion of non-MCV matches.
> 

Yes, ignoring the non-equality clauses in 0002 is wrong - that's pretty
much why it's WIP and not merged into 0001.

thanks for the feedback

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: segfault due to invalid cached plan
Next
From: Nicolas Thauvin
Date:
Subject: Re: segfault due to invalid cached plan