Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From: Michael Lewis
Subject: Re: n_distinct off by a factor of 1000
Msg-id: CAHOFxGpv06xOQqxXcfCrb-A5LgwOyBaeHuZvVP3CEvk8L=MsTQ@mail.gmail.com
In response to: Re: n_distinct off by a factor of 1000 ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List: pgsql-general


On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> Yes, estimating the number of distinct values from a relatively small
> sample is hard when you don't know the underlying distribution. It might
> be possible to analyze the sample to find the distribution and get a
> better estimate. But I'm not sure how useful that would really be: if
> a few values are very common and most very rare, you are probably also
> much more likely to use the common values in a query. And for those,
> you would massively underestimate their frequency if you had an accurate
> n_distinct value. That might be just as bad or even worse.


This would only be true for values that are "common" but not in the MCV list, right?
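
(For reference, both the n_distinct estimate and the MCV list that ANALYZE produced are visible in the pg_stats view; a quick way to check, using hypothetical table/column names:

    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'bigtbl' AND attname = 'val';

A negative n_distinct there means the estimate is stored as a fraction of the row count rather than an absolute count.)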

If we could increase the sampling ratio beyond the hard-coded 300x multiplier to get a more representative sample, use that larger sample to estimate n_distinct (and also the frequencies of the most common values), but still store only the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time through stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing, plus, I suppose, the overhead of considering this potential new setting in all analyze code paths.
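
For comparison, the knobs that exist today couple the sample size to the amount of stats kept: raising the statistics target enlarges both the ANALYZE sample and the stored MCV/histogram arrays. A rough sketch of the existing workarounds, again with hypothetical table/column names:

    -- Sample 300 * 1000 rows at the next ANALYZE, but also store up
    -- to 1000 MCVs and histogram buckets for this column:
    ALTER TABLE bigtbl ALTER COLUMN val SET STATISTICS 1000;

    -- Or bypass estimation for this column entirely; a negative value
    -- is a fraction of the row count (here, 5% of rows are distinct):
    ALTER TABLE bigtbl ALTER COLUMN val SET (n_distinct = -0.05);

    ANALYZE bigtbl;  -- either setting takes effect at the next analyze

Decoupling the first knob, i.e. a bigger sample for estimation without bigger stored stats, is exactly what seems to be missing today.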
