Home > mailing lists

[HACKERS] Make ANALYZE more selective about what is a "most common value"? - Mailing list pgsql-hackers

From	Tom Lane
Subject	[HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date	June 5, 2017 00:30:29
Msg-id	32261.1496611829@sss.pgh.pa.us Whole thread Raw
Responses	Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"? Re: [HACKERS] Make ANALYZE more selective about what is a "mostcommon value"?
List	pgsql-hackers

Tree view

I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do about
it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value". The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size. So accepting two-appearance values as
MCVs would lead to a minimum MCV frequency estimate of 1/1500. Now it
could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000. With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV. That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is. A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV. I wonder if we shouldn't make it be "at least double the estimated
average frequency".

Thoughts?
regards, tom lane

pgsql-hackers by date:

From: Andres Freund
Date: 05 June 2017, 00:19:42
Subject: Re: [HACKERS] PostgreSQL 10 changes in exclusion constraints - didsomething change? CASE WHEN behavior oddity

From: Mark Dilger
Date: 05 June 2017, 00:32:24
Subject: Re: [HACKERS] PostgreSQL 10 changes in exclusion constraints - did something change? CASE WHEN behavior oddity

[HACKERS] Make ANALYZE more selective about what is a "most common value"? - Mailing list pgsql-hackers

Previous

Next