Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: n_distinct off by a factor of 1000
Date
Msg-id 20200628215550.GA17947@hjp.at
Whole thread Raw
In response to Re: n_distinct off by a factor of 1000  (Michael Lewis <mlewis@entrata.com>)
List pgsql-general
On 2020-06-24 16:27:35 -0600, Michael Lewis wrote:
> On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>
>     Yes, estimating the number of distinct values from a relatively small
>     sample is hard when you don't know the underlying distribution. It might
>     be possible to analyze the sample to find the distribution and get a
>     better estimate. But I'm not sure how useful that would really be: If
>     a few values are very common and most very rare you are probably also
>     much more likely to use the common values in a query: And for those you
>     you would massively underestimate their frequency if you had an accurate
>     n_distinct value. That might be just as bad or even worse.
>
>
>
> This would only be true for values that are "common" but not in the MCVs list,
> right?

Yes, but if you have 33 million values there are likely to be a lot of
them "common but not in the MCVs list", even for a very biased
distribution.


> If we could increase the sampling ratio beyond the hard coded 300x to get a
> more representative sample

I thought of that but abandoned it since I don't think a better estimate
for n_distinct will help (see above for the reason). The problem is that
the distribution is biased and the planner has no idea whether the value
it is searching for is common or rare if it isn't in the MCV list.

Unless ...

As I understood Klaudie, the values are ids, and ids have no inherent
meaning, the common values are probably scattered randomly.

But it might be possible to change that. Group the ids by frequency. Ids
< 1E12 occur at most 10 times, Ids >= 1E12 <2E12 occur at most 100 times
and so on.

This may mean that ids aren't long time stable - they may change as
their frequency changes. But if an id always changes by a multiple of
1E12, the last 12 decimal digits are stable.

The advantage is that then the planner can use the histogram to get a
pretty good estimate of how frequent a value is.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: How to get previous log file
Next
From: Emrah Tema
Date:
Subject: bdr_init_copy ends fine but doesn't copy the remote db