Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: n_distinct off by a factor of 1000
Date
Msg-id 20200624095448.GA417@hjp.at
Whole thread Raw
In response to Re: n_distinct off by a factor of 1000  (Klaudie Willis <Klaudie.Willis@protonmail.com>)
List pgsql-general
On 2020-06-24 07:30:05 +0000, Klaudie Willis wrote:
> show default_statistics_target; --> 500
> ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000;
>
> Here is the output of the "ANALYZE VERBOSE bigtable;"
> INFO:  analyzing "public.bigtables" inheritance tree
[...]
> INFO:  analyzing "public.bigtable_y2018"
> INFO:  "bigtable_y2018": scanned 1500000 of 10661013 pages, containing 28915115
> live rows and 12589 dead rows; 1500000 rows in sample, 205509611 estimated
> total rows
> INFO:  analyzing "public.bigtable_y2019"
> INFO:  "bigtable_y2019": scanned 1500000 of 8911886 pages, containing 28888514
> live rows and 17778 dead rows; 1500000 rows in sample, 171634096 estimated
> total rows
> INFO:  analyzing "public.bigtable_y2020"
> INFO:  "bigtable_y2020": scanned 1500000 of 6126616 pages, containing 29488967
> live rows and 6330 dead rows; 1500000 rows in sample, 120445051 estimated total
> rows
> INFO:  analyzing "public.bigtable_y2021"
> INFO:  "bigtable_y2021": scanned 1 of 1 pages, containing 8 live rows and 0
> dead rows; 8 rows in sample, 8 estimated total rows

So it scans 1500000 rows from each partition, but ...

[...]
> frac_MCV;n_distinct; n_mcv; n_hist;tablename
> 0.9205394 122160 2140 5001 "bigtable"
> 0.9203018 124312 1736 5001 "bigtable_y2018"
> 0.9258158 113846 2107 5001 "bigtable_y2020"
> 0.875        -0.375      2              "bigtable_y2021"
> 0.92304045 118267 2204 5001 "bigtable_y2019"

Estimates the number of distinct values in each partition as between
113846 and 124312. So it can have encountered at most that many
different values, which means that it must have encountered each value
about 12 or 13 times on average.

My guess is that there are relatively few (less than 120000) distinct
values which make up the bulk (over 90 %) of these tables and a lot (33
million) values which are very rare.

Is this guess correct?

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

pgsql-general by date:

Previous
From: Jaime Soler
Date:
Subject: pgbench and timestamps
Next
From: "Peter J. Holzer"
Date:
Subject: Re: Persistent Connections