Re: n_distinct off by a factor of 1000 - Mailing list pgsql-general

From: Klaudie Willis
Subject: Re: n_distinct off by a factor of 1000
Date:
Msg-id: exYVuFsfuM3l2SdUeIoTOCojYDWE4uqL_jlZBZo0MB5ij7XZ8hlaOVzNKHJyRhb26mwAIaNxil96YgzzRxd1Tps2T4YrJUHMOtGraAi_hW8=@protonmail.com
In response to: Re: n_distinct off by a factor of 1000 (Michael Lewis <mlewis@entrata.com>)
List: pgsql-general

If we could increase the sampling ratio beyond the hard-coded 300x to get a more representative sample, and use that sample to estimate n_distinct (and the frequency of the most common values), while still only storing the 100 MCVs (or whatever the stats target is set to for the system or column), then the issue might be mitigated without increasing planning time due to stats that are larger than prudent. The "only" cost would be longer processing time when (auto)analyzing... plus the overhead of considering this potential new setting in all analyze paths, I suppose.
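For comparison, the closest knob that exists today is the per-column statistics target; it raises the sample size (roughly 300 rows per target unit) and the number of stored MCVs/histogram buckets together, which is exactly the coupling described above. A rough sketch, with made-up table and column names:

    -- Raising the target to 5000 makes ANALYZE sample about
    -- 300 * 5000 = 1,500,000 rows, but it also stores up to 5000
    -- MCVs and histogram buckets; the two cannot be set separately.
    ALTER TABLE bridge ALTER COLUMN b SET STATISTICS 5000;
    ANALYZE bridge;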

I found another large deviation in one of my bridge tables. It is an (int, int) table of 900M rows whose B column contains 2.7M distinct values, yet pg_stats claims only about 10,400. These numbers are with a statistics target of 500. I'm not sure it really matters for the planner in the queries I run, but it makes me a little nervous :)
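As a workaround in a case like this, the estimate can be checked against the real count and, when it is badly off, pinned manually so that subsequent ANALYZE runs keep the fixed value. A sketch, again with hypothetical names (bridge, b) standing in for the actual table:

    -- Compare the stored estimate with the true count
    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'bridge' AND attname = 'b';   -- shows ~10,400
    SELECT count(DISTINCT b) FROM bridge;            -- actual ~2,700,000

    -- Override the estimate; picked up by the next ANALYZE
    ALTER TABLE bridge ALTER COLUMN b SET (n_distinct = 2700000);
    ANALYZE bridge;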

Also, is it just my data samples, or is n_distinct underestimated far more often, and by a larger ratio, than it is overestimated?

K
