Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From: Michael Lewis
Subject: Re: n_distinct off by a factor of 1000
Msg-id: CAHOFxGpv06xOQqxXcfCrb-A5LgwOyBaeHuZvVP3CEvk8L=MsTQ@mail.gmail.com
In response to: Re: n_distinct off by a factor of 1000 ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List: pgsql-general


On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> Yes, estimating the number of distinct values from a relatively small
> sample is hard when you don't know the underlying distribution. It might
> be possible to analyze the sample to find the distribution and get a
> better estimate. But I'm not sure how useful that would really be: if
> a few values are very common and most very rare, you are probably also
> much more likely to use the common values in a query. And for those,
> you would massively underestimate their frequency if you had an accurate
> n_distinct value. That might be just as bad or even worse.


This would only be true for values that are "common" but not in the MCV list, right?
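
(For reference, both the n_distinct estimate and the MCV list that ANALYZE produced are visible in the pg_stats view; a quick way to check, using hypothetical table/column names:

    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'bigtbl' AND attname = 'val';

A negative n_distinct there means the estimate is stored as a fraction of the row count rather than an absolute count.)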

If we could increase the sampling ratio beyond the hard-coded 300x multiplier to get a more representative sample, use that larger sample to estimate n_distinct (and also the frequencies of the most common values), but still store only the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time through stats that are larger than prudent. The "only" cost should be longer processing time when (auto)analyzing, plus, I suppose, the overhead of considering this potential new setting in all analyze code paths.
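
For comparison, the knobs that exist today couple the sample size to the amount of stats kept: raising the statistics target enlarges both the ANALYZE sample and the stored MCV/histogram arrays. A rough sketch of the existing workarounds, again with hypothetical table/column names:

    -- Sample 300 * 1000 rows at the next ANALYZE, but also store up
    -- to 1000 MCVs and histogram buckets for this column:
    ALTER TABLE bigtbl ALTER COLUMN val SET STATISTICS 1000;

    -- Or bypass estimation for this column entirely; a negative value
    -- is a fraction of the row count (here, 5% of rows are distinct):
    ALTER TABLE bigtbl ALTER COLUMN val SET (n_distinct = -0.05);

    ANALYZE bigtbl;  -- either setting takes effect at the next analyze

Decoupling the first knob, i.e. a bigger sample for estimation without bigger stored stats, is exactly what seems to be missing today.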
