Re: Use extended statistics to estimate (Var op Var) clauses - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Use extended statistics to estimate (Var op Var) clauses
Date
Msg-id e16416cc-51a0-dda1-64d5-d07ae34f0ea4@enterprisedb.com
Whole thread Raw
In response to Re: Use extended statistics to estimate (Var op Var) clauses  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: Use extended statistics to estimate (Var op Var) clauses  (Zhihong Yu <zyu@yugabyte.com>)
Re: Use extended statistics to estimate (Var op Var) clauses  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 8/28/21 6:30 PM, Mark Dilger wrote:
> 
> 
>> On Aug 28, 2021, at 6:52 AM, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>> 
>> Part 0003 fixes handling of those clauses so that we don't treat
>> them as simple, but it does that by tweaking
>> statext_is_compatible_clause(), as suggested by Dean.
> 
> Function examine_opclause_args() doesn't set issimple to anything in
> the IsA(rightop, Const) case, but assigns *issimplep = issimple at
> the bottom.  The compiler is not complaining about using a possibly
> uninitialized variable, but if I change the "return true" on the very
> next line to "return issimple", the compiler complains quite loudly.
> 

Yeah, true. Thanks for noticing this was a bug - I forgot to set the 
issimple variable in the first branch.

> 
> Some functions define bool *issimple, others bool *issimplep and bool
> issimple.  You might want to standardize the naming.
> 

I think the naming is standard with respect to the surrounding code. If 
the other parameters use "p" to mark "pointer" then issimplep is used, 
but in other places it's just "issimple". IMHO this is appropriate.

> It's difficult to know what "simple" means in extended_stats.c.
> There is no file-global comment explaining the concept, and functions
> like compare_scalars_simple don't have correlates named
> compare_scalars_complex or such, so the reader cannot infer by
> comparison what the difference might be between a "simple" case and
> some non-"simple" case.  The functions' issimple (or issimplep)
> argument are undocumented.
> 
> There is a comment:
> 
> /* * statext_mcv_clauselist_selectivity *      Estimate clauses using
> the best multi-column statistics. .... * * - simple selectivity:
> Computed without extended statistics, i.e. as if the *
> columns/clauses were independent. * .... */
> 
> but it takes a while to find if you search for "issimple".
> 

Yeah, true. This was added a while ago when Dean reworked the estimation 
(based on MCV), and it seemed clear back then. But now a comment 
explaining this concept (and how it affects the estimation) would be 
helpful. I'll try digging in the archives for the details.

> 
> In both scalarineqsel_wrapper() and eqsel_internal(), the call to
> matching_restriction_variables() should usually return false, since
> comparing a variable to itself is an unusual case.  The next call is
> to get_restriction_variable(), which repeats the work of examining
> the left and right variables.  So in almost all cases, after throwing
> away the results of:
> 
> examine_variable(root, left, varRelid, &ldata); 
> examine_variable(root, right, varRelid, &rdata);
> 
> performed in matching_restriction_variables(), we'll do exactly the
> same work again (with one variable named differently) in
> get_restriction_variable():
> 
> examine_variable(root, left, varRelid, vardata); 
> examine_variable(root, right, varRelid, &rdata);
> 
> That'd be fine if example_variable() were a cheap function, but it
> appears not to be.  Do you think you could save the results rather
> than recomputing them?  It's a little messy, since these are the only
> two functions out of about ten which follow this pattern, so you'd
> have to pass NULLs into get_restriction_variable() from the other
> eight callers, but it still looks like that would be a win.
> 

I had similar concerns, although I don't think those functions are very 
expensive compared to the rest of the estimation code. I haven't done 
any measurements yet, though.

But I don't think saving the results is the way to go - in a way, we 
already store the stats (which seems like the most expensive bit) in 
syscache. It seems better to just simplify examine_variable() so that it 
does not lookup the statistics, which we don't need here at all.


The attached version of the patches fixes the other bugs reported here 
so far - most importantly it reworks how we set issimple while examining 
the clauses, so that it's never skips the initialization. Hopefully the 
added comments also explain it a bit more clearly.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: "REIX, Tony"
Date:
Subject: RE: AIX: Symbols are missing in libpq.a
Next
From: Aleksander Alekseev
Date:
Subject: Re: Patch: shouldn't timezone(text, timestamp[tz]) be STABLE?