Home > mailing lists

Re: improving GROUP BY estimation - Mailing list pgsql-hackers

From	Dean Rasheed
Subject	Re: improving GROUP BY estimation
Date	March 18, 2016 13:20:31
Msg-id	CAEZATCVd9cACCX_fiFSa0sHiUR5JdzrYL_1N79-PYQe9WSaTWA@mail.gmail.com Whole thread Raw
In response to	Re: improving GROUP BY estimation (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: improving GROUP BY estimation
List	pgsql-hackers

Tree view

On 18 March 2016 at 00:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
>>> I think that a better formula to use would be
>>>
>>> reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples /
>>> reldistinct)
>
> Attached is a v3 of the patch using this formula instead of the original
> one. Interestingly, that apparently reduces the number of regression tests
> that get broken to a single one.
>

Cool.

> I'm not sure whether we need to provide a link to the PDF the formula comes
> from - perhaps we should?
>

Probably a better URL to give is
http://www.adellera.it/investigations/distinct_balls/ which has a link
to the PDF version of the paper and also some supporting material.

However, while that paper is in general very clear, I don't think it
gives a very clear explanation of that particular formula, and it
doesn't explain what it represents. It merely says that if "i" can be
ignored "for some reason (e.g. i << Nr)", then that formula is an
approximation to the exact "without replacement" formula, which is the
subject of that paper.

But actually, that formula is the exact formula for the expected
number of distinct values when selecting *with replacement* from a
collection where each of the distinct values occurs an equal number of
times. So I think we should say that.

Perhaps something along the lines of:
           /*            * Update the estimate based on the restriction selectivity.            *            * This
usesthe formula for the expected number of distinct values            * when selecting with replacement from a
collectionwhere each of            * the distinct values occurs an equal number of times (a uniform            *
distributionof values). This is a very close approximation to            * the more correct method of selecting without
replacementwhen            * the number of distinct values is quite large --- for example,            * see
http://www.adellera.it/investigations/distinct_balls/.           * Additionally, it also turns out to work very well
evenwhen the            * number of distinct values is small.            */

Regards,
Dean

pgsql-hackers by date:

From: Robert Haas
Date: 18 March 2016, 12:58:16
Subject: Re: Performance degradation in commit ac1d794

From: Rajkumar Raghuwanshi
Date: 18 March 2016, 13:23:03
Subject: Postgres_fdw join pushdown - getting server crash in left outer join of three table

Re: improving GROUP BY estimation - Mailing list pgsql-hackers

Previous

Next