Thread: Strange heuristic in analyze.c
So I never realized the consequences of this little heuristic in analyze.c in the handling of very low cardinality columns where we want to just capture the complete list of values in the mcv and throw away the histogram: else if (toowide_cnt == 0 && nmultiple == ndistinct) { /* * Every value in the sample appeared morethan once. Assume the * column has just these values. */ stats->stadistinct = ndistinct; } The problem with this heuristic is that if the table is small enough you might expect you can set the statistics target high and "sample" the entire table and get a very accurate mcv covering all the values. However if any of the values in the table appears only once this heuristic will defeat you. The following code will then throw out of the mcv any value which isn't 25% more common than "average". Leaving you with a histogram for those values which often does very poorly if the values don't fit any pattern and are just discrete arbitrary values. -- greg
Greg Stark wrote: > So I never realized the consequences of this little heuristic in > analyze.c in the handling of very low cardinality columns where we > want to just capture the complete list of values in the mcv and throw > away the histogram: > > else if (toowide_cnt == 0 && nmultiple == ndistinct) > { > /* > * Every value in the sample appeared more than once. Assume the > * column has just these values. > */ > stats->stadistinct = ndistinct; > } > > The problem with this heuristic is that if the table is small enough > you might expect you can set the statistics target high and "sample" > the entire table and get a very accurate mcv covering all the values. > However if any of the values in the table appears only once this > heuristic will defeat you. The following code will then throw out of > the mcv any value which isn't 25% more common than "average". Leaving > you with a histogram for those values which often does very poorly if > the values don't fit any pattern and are just discrete arbitrary > values. Do you want a C comment to document this problem? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, Feb 5, 2010 at 8:53 PM, Bruce Momjian <bruce@momjian.us> wrote: > Do you want a C comment to document this problem? Well I would rather a better heuristic :) We really need some statistics nerds in this group who can pipe up when these kinds of issues come up. There must be a good way to estimate the probability that we've seen all distinct values. -- greg