Home > mailing lists

Re: LIMIT confuses the planner - Mailing list pgsql-performance

From	Tom Lane
Subject	Re: LIMIT confuses the planner
Date	March 22, 2009 18:29:32
Msg-id	17814.1237757356@sss.pgh.pa.us Whole thread Raw
In response to	Re: LIMIT confuses the planner (Kouber Saparev <kouber@saparev.com>)
Responses	Re: LIMIT confuses the planner Re: LIMIT confuses the planner
List	pgsql-performance

Tree view

Kouber Saparev <kouber@saparev.com> writes:
> Tom Lane wrote:
>> Hmph, that's still not real good.  Ideally it should be estimating
>> *less* than the average frequency, because the estimate is made after
>> excluding all the most-common-values, which evidently 'kouber' is not
>> one of.

> I altered the statistics for that column to 1000, so now the planner
> assumes exactly 492 rows for the fore-mentioned query, which is indeed
> the average. It never went *less* than that value, it was always higher,
> i.e. for a statistics value of 600, it was 588, for 800, it became 540.

I got some time to think more about this and experiment a bit further.
As far as I can tell there is no fundamental bug here --- given
reasonably accurate stats the rowcount estimate behaves as expected, ie,
you get an estimate that's less than the actual average number of values
if the target value is not one of the known MCVs.  However, as the
n_distinct estimate falls below the actual number of distinct values,
that rowcount estimate necessarily rises.  What had surprised me about
this report is that the estimate matched the true average number of rows
so closely; I wondered if there was some property of the way we estimate
n_distinct that would make that happen.  But I now think that that was
just chance: there doesn't seem to be any underlying behavior that would
cause it.  I did some experiments with random data matching a Zipfian
distribution (1/k law) and did not observe that the rowcount estimate
converged to the true average when the n_distinct value was too low.

So the bottom line here is just that the estimated n_distinct is too
low.  We've seen before that the equation we use tends to do that more
often than not.  I doubt that consistently erring on the high side would
be better though :-(.  Estimating n_distinct from a limited sample of
the population is known to be a statistically hard problem, so we'll
probably not ever have perfect answers, but doing better is on the
to-do list.

            regards, tom lane

pgsql-performance by date:

From: Greg Smith
Date: 22 March 2009, 17:17:50
Subject: Re: "iowait" bug?

From: Laurent Wandrebeck
Date: 22 March 2009, 20:15:01
Subject: Re: "iowait" bug?

Re: LIMIT confuses the planner - Mailing list pgsql-performance

Previous

Next