Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: n_distinct off by a factor of 1000
Date
Msg-id 20200624203519.GA24048@hjp.at
Whole thread Raw
In response to n_distinct off by a factor of 1000  (Klaudie Willis <Klaudie.Willis@protonmail.com>)
Responses Re: n_distinct off by a factor of 1000  (Michael Lewis <mlewis@entrata.com>)
List pgsql-general
[Please keep replies on the list]

On 2020-06-24 11:02:22 +0000, Klaudie Willis wrote:
> Holzer,  thanks for your feedback.  Yes, your guess is very good.  The
> data consists of millions of instruments that occur a handful of cases
> (rare), and instruments that are very common.
>
> I am still a little surprised that it is so hard to sample data and
> estimate distinct values within a 1000X. My intuition misleads me into
> thinking this should be simpler, but I understand that this is no
> simple task at all.  To your information, it seems like SQL server
> 2016 makes the same "mistake" when calculating distincts (or 1/density
> as they call it). I have similar data there that I have looked into,
> and it seems "fooled" as well.

Yes, estimating the number of distinct values from a relatively small
sample is hard when you don't know the underlying distribution. It might
be possible to analyze the sample to find the distribution and get a
better estimate. But I'm not sure how useful that would really be: If
a few values are very common and most very rare you are probably also
much more likely to use the common values in a query: And for those you
you would massively underestimate their frequency if you had an accurate
n_distinct value. That might be just as bad or even worse.

I'm wondering whether it would make sense to use a range instead of a
single value in cost calculations. Then the planner could prefer plans
which had reasonable cost over the whole range over plans which are
good at one end of the range but horrible at the other.

I'm a afraid that would lead to combinatorial explosion, though. Even
with a small number of ranges there would be a large number of possible
combinations to calculate. So one would probably have to resort to monte
carlo simulation or soemthing like that.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

pgsql-general by date:

Previous
From: Erwin Sebastian Andreasen
Date:
Subject: Curious behaviour with "order by random()"
Next
From: Michael Lewis
Date:
Subject: Re: n_distinct off by a factor of 1000