Home > mailing lists
Re: improving GROUP BY estimation - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: improving GROUP BY estimation
Date	March 3, 2016 19:16:44
Msg-id	56D88D93.5000509@2ndquadrant.com Whole thread Raw
In response to	Re: improving GROUP BY estimation (Mark Dilger <hornschnorter@gmail.com>)
Responses	Re: improving GROUP BY estimation
List	pgsql-hackers
Tree view
Hi,

On 03/03/2016 06:40 PM, Mark Dilger wrote:
>
>> On Mar 3, 2016, at 8:35 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>>> Hi, Tomas!
>>>
>>> I've assigned to review this patch.
>>>
>>> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
>>> It applies to head cleanly, passes corrected regression tests.
>>>
>>> About correlated/uncorrelated cases. I think now optimizer mostly assumes
>>> all distributions to be independent. I think we should follow this
>>> assumption in this case also until we have fundamentally better option (like
>>> your multivariate statistics).
>>>
>>> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
>>> double input_rows,
>>> reldistinct = clamp;
>>>
>>> /*
>>> -* Multiply by restriction selectivity.
>>> +* Estimate number of distinct values expected in given number of rows.
>>> */
>>> -reldistinct *= rel->rows / rel->tuples;
>>> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>>>
>>> /*
>>> * Update estimate of total distinct groups.
>>>
>>> I think we need way more explanation in comments here (possibly with
>>> external links). For now, it looks like formula which appears from nowhere.
>>
>> I thought the details (particularly the link to stackexchange, discussing
>> the formula) would be enough, but let me elaborate.
>>
>> The current formula
>>
>>     reldistinct *= rel->rows / rel->tuples;
>>
>> simply multiplies the ndistinct estimate with selectivity. So when you
>> select 1% of the table, the estimate says we'll see 1% of ndistinct
>> values. But clearly that's naive, because for example when you have 10k
>> distinct values and you select 10M rows, you should expect nearly all of
>> them in the sample.
>
> The current formula assumes total dependence between the columns.  You can
> create tables where this is true, and the formula gives precisely the right
> estimate.  I'm not claiming this is common, or that it should be the default
> assumption, but it is one end of the spectrum of possible correct
> estimates.

I'm not really sure what you mean by dependence between columns, as both the 
old and new formula only works with total cardinality and selectivity, and has 
absolutely no idea about multiple columns.

In case you mean dependence between columns in the GROUP BY and columns used to 
compute the selectivity (i.e. referenced in WHERE), then perhaps you could say 
it like that, although I think there's no such explicit assumption.

>
>> And that's what the formula does - it gives you the expected number of
>> distinct values, when selecting 'k' values from 'd' distinct values with
>> replacements:
>>
>>     count(k, d) = d * (1 - ((d-1)/d) ^ k)
>>
>> It's assuming the distinct values are equally probable (uniform
>> distribution) and that the probabilities do not change (that's the "with
>> replacement").
>
> The new formula assumes total independence between the columns. You can
> likewise create tables where this is true, and you did so upthread, and for
> those tables it gives precisely the right estimate. This is the other end of
> the spectrum of possible correct estimates.
>
> In the absence of multivariate statistics, either because you haven't
> committed that work yet, or because the multivariate data the planner needs
> hasn't been collected, choosing an estimate exactly in the middle of the
> spectrum (whatever that would mean mathematically) would minimize the
> maximum possible error in the estimate.

FWIW the current version of multivariate statistics can't really help in this 
case. Perhaps it will help in the future, but it's way more complicated that it 
might seem as it requires transferring information from the WHERE clause to the 
GROUP BY clause.

>
> I have the sense that my point of view is in the minority, because the
> expectation is the problems caused by independence assumptions will only be
> fixed when multivariate statistics are implemented and available, and so we
> should just keep the independence assumptions everywhere until that time. I
>  tend to disagree with that, on the grounds that even when that work is
> finished, we'll never have complete coverage of every possible set of
> columns and what their degree of dependence is.

Well, in a sense you're right that those estimates work best in different 
situations. The problem with constructing an estimator the way you propose 
(simply returning an average) is that in both the extreme cases (perfect 
dependence or independence) one of those estimates is really bad. Moreover the 
new formula typically produces higher values than the old one,

Consider for example this:
    CREATE TABLE t AS SELECT (10000 * random())::int a,                             (10000 * random())::int b
            FROM generate_series(1,1000000) s(i);
 

and let's see estimates for queries like this:
    SELECT a FROM t WHERE b < $1 GROUP BY a;

Then for different values of $1 we get this:
               |  10 |   50 |  100 |  500 |  1000 |  5000   --------------------------------------------------------
   actual | 919 | 3829 | 6244 | 9944 | 10001 | 10001       current |  10 |   50 |  102 |  516 |  1018 |  4996
new| 973 | 4001 | 6382 | 9897 |  9951 |  9951       average | 491 | 2025 | 3205 | 5206 |  5484 |  7473
 

and for a table with perfectly dependent columns:
    CREATE TABLE t AS SELECT mod(i,10000) a,                             mod(i,10000) b                       FROM
generate_series(1,1000000)s(i);
 

we get this:
                |  10 |   50 |  100 |  500 |  1000 |  5000     --------------------------------------------------------
       actual |  10 |   50 |  100 |  500 |  1000 |  5000        current |  10 |   53 |  105 |  508 |  1016 |  5014
     new | 880 | 4105 | 6472 | 9955 | 10018 | 10018        average | 445 | 2079 | 3288 | 5231 |  5517 |  7516
 

So yes, each estimator works great for exactly the opposite cases. But notice 
that typically, the results of the new formula is much higher than the old one, 
sometimes by two orders of magnitude (and it shouldn't be difficult to 
construct examples of much higher differences).

The table also includes the 'average' estimator you propose, but it's rather 
obvious that the result is always much closer to the new value, simply because
   (small number) + (huge number)   ------------------------------                  2

is always much closer to the huge number. We're usually quite happy when the 
estimates are within the same order of magnitude, so whether it's K or K/2 
makes pretty much no difference.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date:
From: Andres Freund
Date: 03 March 2016, 19:06:20
Subject: Re: silent data loss with ext4 / all current versions
From: Tom Lane
Date: 03 March 2016, 19:20:03
Subject: Re: WIP: Upper planner pathification
Re: improving GROUP BY estimation - Mailing list pgsql-hackers

Previous

Next