Home > mailing lists

Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From	Michael Lewis
Subject	Re: n_distinct off by a factor of 1000
Date	June 23, 2020 20:33:26
Msg-id	CAHOFxGoXQ9yCR3q=Lh2HGucJfFdgi=PUX+RAuakyiXdnjy6=XA@mail.gmail.com Whole thread Raw
In response to	Re: n_distinct off by a factor of 1000 (Klaudie Willis <Klaudie.Willis@protonmail.com>)
List	pgsql-general

Tree view

> > On 23/06/2020 14:42, Klaudie Willis wrote:
> >
> > > I got my first hint of why this problem occurs when I looked at the
> > > statistics. For the column in question, "instrument_ref" the
> > > statistics claimed it to be:
> > > The default_statistics_target=500, and analyze has been run.
> > > select * from pg_stats where attname like 'instr%_ref'; -- Result:
> > > 40.000
> > > select count(distinct instrumentid_ref) from bigtable -- Result: 33
> > > 385 922 (!!)That is an astonishing difference of almost a 1000X.

Try something like this to check how representative those "most common values" are. If you have n_distinct very low compared to reality and also the fraction of the table that the "most common" values are claiming to cover is low, then you can get very bad estimates when querying for values that are not in the MCVs list. The planner will assume an even distribution for other values and that may be much much higher or lower than reality. That is, if you have statistics target of 100 like normal, and those cover 5% of the table, and you have ndistinct value of 500, then the other 400 values are assumed to evenly cover that 95% of the table so each value would be .95/400 * reltuples as an estimate. If your real count of distinct values is 40000 then the number of values you expect to get for each value in your IN clause drops hugely.

Using a custom ndistinct will dramatically impact the estimates that the planner is using to make the decision of index vs sequential scan. Also, if the custom ndistinct and the actual distinct count vary by 2x or 10x as your data grows, it matters very little IMO as compared to relying on the sample taken by (auto)analyze job being off by a factor of 1000x or even 100x as you have experienced.

SELECT

( SELECT SUM (x) FROM UNNEST (most_common_freqs) x ) frac_MCV,

tablename,

attname,

inherited,

null_frac,

n_distinct,

array_length(most_common_vals,1) n_mcv,

array_length(histogram_bounds,1) n_hist,

correlation,

FROM pg_stats

WHERE

schemaname = 'public'

AND tablename=‘table’

AND attname=‘column’;

pgsql-general by date:

From: Adrian Klaver
Date: 23 June 2020, 20:25:26
Subject: Re: pg_dump empty tables

From: Michael Lewis
Date: 23 June 2020, 20:42:50
Subject: Re: UPDATE on 20 Million Records Transaction or not?

Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

Previous

Next