Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id f0f7e00c-7ccb-9412-0147-a06a60777f06@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: PoC/WIP: Extended statistics on expressions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
Hi,

On 3/17/21 4:55 PM, Dean Rasheed wrote:
> On Sun, 7 Mar 2021 at 21:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> 2) ndistinct
>>
>> There's one thing that's bugging me, in how we handle "partial" matches.
>> For each expression we track both the original expression and the Vars
>> we extract from it. If we can't find a statistics matching the whole
>> expression, we try to match those individual Vars, and we remove the
>> matching ones from the list. And in the end we multiply the estimates
>> for the remaining Vars.
>>
>> This works fine with one matching ndistinct statistics. Consider for example
>>
>>      GROUP BY (a+b), (c+d)
>>
>> with statistics on [(a+b),c] - that is, expression and one column.
> 
> I've just been going over this example, and I think it actually works
> slightly differently from how you described, though I haven't worked
> out the full general implications of that.
> 
>> We parse the expressions into two GroupExprInfo
>>
>>      {expr: (a+b), vars: [a, b]}
>>      {expr: (c+d), vars: [c, d]}
>>
> 
> Here, I think what you actually get, in the presence of stats on
> [(a+b),c] is actually the following two GroupExprInfos:
> 
>       {expr: (a+b), vars: []}
>       {expr: (c+d), vars: [c, d]}
> 

Yeah, right. To be precise, we get

    {expr: (a+b), vars: [(a+b)]}

because in the first case we pass NIL, so add_unique_group_expr treats
the whole expression as a var (a bit strange, but OK).

> because of the code immediately after this comment in estimate_num_groups():
> 
>         /*
>          * If examine_variable is able to deduce anything about the GROUP BY
>          * expression, treat it as a single variable even if it's really more
>          * complicated.
>          */
> 
> As it happens, that makes no difference in this case, where there is
> just this one stats object, but it does change things when there are
> two stats objects.
> 
>> and the statistics matches the first item exactly (the expression). The
>> second expression is not in the statistics, but we match "c". So we end
>> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo:
>>
>>      {expr: (c+d), vars: [d]}
> 
> Right.
> 
>> Without any other statistics we estimate that as ndistinct for "d", so
>> we end up with
>>
>>      ndistinct((a+b), c) * ndistinct(d)
>>
>> which mostly makes sense. It assumes ndistinct(c+d) is product of the
>> ndistinct estimates, but that's kinda what we've been always doing.
> 
> Yes, that appears to be what happens, and it's probably the best that
> can be done with the available stats.
> 
>> But now consider we have another statistics on just (c+d). In the second
>> loop we end up matching this expression exactly, so we end up with
>>
>>      ndistinct((a+b), c) * ndistinct((c+d))
> 
> In this case, with stats on (c+d) as well, the two GroupExprInfos
> built at the start change to:
> 
>       {expr: (a+b), vars: []}
>       {expr: (c+d), vars: []}
> 
> so it actually ends up not using any multivariate stats, but instead
> uses the two univariate expression stats, giving
> 
>      ndistinct((a+b)) * ndistinct((c+d))
> 
> which actually seems pretty good, and gave very good estimates in the
> simple test case I tried.
> 

Yeah, that works pretty well in this case.

I wonder if we'd be better off extracting the Vars and doing what I
mistakenly described as the current behavior. That's essentially mean
extracting the Vars even in the case where we now pass NIL.

My concern is that the current behavior (where we prefer expression
stats over multi-column stats to some extent) works fine as long as the
parts are independent, but once there's dependency it's probably more
likely to produce underestimates. I think underestimates for grouping
estimates were a risk in the past, so let's not make that worse.

>> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
>> what we should do after the first loop is just discarding the whole
>> expression and "expand" into per-variable GroupExprInfo, so in the
>> second step we would not match the (c+d) statistics.
> 
> Not using the (c+d) stats would give either
> 
>      ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> 
> or
> 
>      ndistinct((a+b), c) * ndistinct(d)
> 
> depending on exactly how the algorithm was changed. In my test, these
> both gave worse estimates, but there are probably other datasets for
> which they might do better. It all depends on how much correlation
> there is between (a+b) and c.
> 
> I suspect that there is no optimal strategy for handling overlapping
> stats that works best for all datasets, but the current algorithm
> seems to do a pretty decent job.
> 
>> Of course, maybe there's a better way to pick the statistics, but I
>> think our conclusion so far was that people should just create
>> statistics covering all the columns in the query, to not have to match
>> multiple statistics like this.
> 
> Yes, I think that's always likely to work better, especially for
> ndistinct stats, where all possible permutations of subsets of the
> columns are included, so a single ndistinct stat can work well for a
> range of different queries.
> 

Yeah, I agree that's a reasonable mitigation. Ultimately, there's no
perfect algorithm how to pick and combine stats when we don't know if
there even is a statistical dependency between the subsets of columns.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pg_dump new feature: exporting functions only. Bad or good idea ?
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Custom compression methods