Re: LIMIT confuses the planner - Mailing list pgsql-performance

From Tom Lane
Subject Re: LIMIT confuses the planner
Date
Msg-id 17814.1237757356@sss.pgh.pa.us
Whole thread Raw
In response to Re: LIMIT confuses the planner  (Kouber Saparev <kouber@saparev.com>)
Responses Re: LIMIT confuses the planner
Re: LIMIT confuses the planner
List pgsql-performance
Kouber Saparev <kouber@saparev.com> writes:
> Tom Lane wrote:
>> Hmph, that's still not real good.  Ideally it should be estimating
>> *less* than the average frequency, because the estimate is made after
>> excluding all the most-common-values, which evidently 'kouber' is not
>> one of.

> I altered the statistics for that column to 1000, so now the planner
> assumes exactly 492 rows for the fore-mentioned query, which is indeed
> the average. It never went *less* than that value, it was always higher,
> i.e. for a statistics value of 600, it was 588, for 800, it became 540.

I got some time to think more about this and experiment a bit further.
As far as I can tell there is no fundamental bug here --- given
reasonably accurate stats the rowcount estimate behaves as expected, ie,
you get an estimate that's less than the actual average number of values
if the target value is not one of the known MCVs.  However, as the
n_distinct estimate falls below the actual number of distinct values,
that rowcount estimate necessarily rises.  What had surprised me about
this report is that the estimate matched the true average number of rows
so closely; I wondered if there was some property of the way we estimate
n_distinct that would make that happen.  But I now think that that was
just chance: there doesn't seem to be any underlying behavior that would
cause it.  I did some experiments with random data matching a Zipfian
distribution (1/k law) and did not observe that the rowcount estimate
converged to the true average when the n_distinct value was too low.

So the bottom line here is just that the estimated n_distinct is too
low.  We've seen before that the equation we use tends to do that more
often than not.  I doubt that consistently erring on the high side would
be better though :-(.  Estimating n_distinct from a limited sample of
the population is known to be a statistically hard problem, so we'll
probably not ever have perfect answers, but doing better is on the
to-do list.

            regards, tom lane

pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: "iowait" bug?
Next
From: Laurent Wandrebeck
Date:
Subject: Re: "iowait" bug?