Re: improving GROUP BY estimation - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: improving GROUP BY estimation
Date
Msg-id 56D867C3.3040309@2ndquadrant.com
Whole thread Raw
In response to Re: improving GROUP BY estimation  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: improving GROUP BY estimation
Re: improving GROUP BY estimation
List pgsql-hackers
Hi,

On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
> Hi, Tomas!
>
> I've assigned to review this patch.
>
> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
> It applies to head cleanly, passes corrected regression tests.
>
> About correlated/uncorrelated cases. I think now optimizer mostly assumes
> all distributions to be independent. I think we should follow this
> assumption in this case also until we have fundamentally better option (like
> your multivariate statistics).
>
> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs,
> double input_rows,
> reldistinct = clamp;
>
> /*
> -* Multiply by restriction selectivity.
> +* Estimate number of distinct values expected in given number of rows.
> */
> -reldistinct *= rel->rows / rel->tuples;
> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>
> /*
> * Update estimate of total distinct groups.
>
> I think we need way more explanation in comments here (possibly with
> external links). For now, it looks like formula which appears from nowhere.

I thought the details (particularly the link to stackexchange, discussing the 
formula) would be enough, but let me elaborate.

The current formula
    reldistinct *= rel->rows / rel->tuples;

simply multiplies the ndistinct estimate with selectivity. So when you select 
1% of the table, the estimate says we'll see 1% of ndistinct values. But 
clearly that's naive, because for example when you have 10k distinct values and 
you select 10M rows, you should expect nearly all of them in the sample.

And that's what the formula does - it gives you the expected number of distinct 
values, when selecting 'k' values from 'd' distinct values with replacements:
    count(k, d) = d * (1 - ((d-1)/d) ^ k)

It's assuming the distinct values are equally probable (uniform distribution) 
and that the probabilities do not change (that's the "with replacement").

I guess we could relax those assumptions and for example use the MCV statistics 
to further improve the estimate, and also relax the 'with replacement' but 
that'd make the formula way more complicated.

[1] 

http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

>
> Also, note that rel->tuples gone away from formula parameters. So, we
> actually don't care about how may tuples are in the table. This is because
> this formula assumes that same tuple could be selected multiple times. For
> low numbers this can lead to some errors. Consider selecting 2 from 3
> distinct tuples. If you can't select same tuple twice then you always
> selecting 2 distinct tuples. But according to formula you will select 5/3 in
> average. I think this error in small numbers is negligible, especially
> because getting rid of this error would require way more computations. But
> it worth mentioning in comments though.

Well, yeah. That's the consequence of 'with replacement' assumption.

I guess we could handle this somehow, though. For example we can compute the 
minimum possible number of distinct values like this:
    average_rows_per_value = (tuples / ndistinct);    min_estimate = ceil(rows / average_rows_per_value);

and use that as a minimum for the estimate. In the example you've given this 
would say
    average_rows_per_value = (3 / 3) = 1;    min_estimate = ceil(2 / 1) = 2;

So it serves as a correction for the 'with replacement' assumption. With only 2 
distinct values we'd get this:
    average_rows_per_value = (3 / 2) = 1.5;    min_estimate = ceil(2 / 1.5) = 2;

Of course, it's just a heuristics, so this may fail in some other cases.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: "Daniel Verite"
Date:
Subject: Re: pg_dump / copy bugs with "big lines" ?
Next
From: Andrew Dunstan
Date:
Subject: Re: VS 2015 support in src/tools/msvc