Re: Strange heuristic in analyze.c - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Strange heuristic in analyze.c
Date
Msg-id 201002052053.o15KrvG09347@momjian.us
In response to Strange heuristic in analyze.c  (Greg Stark <stark@mit.edu>)
Responses Re: Strange heuristic in analyze.c
List pgsql-hackers
Greg Stark wrote:
> So I never realized the consequences of this little heuristic in
> analyze.c in the handling of very low cardinality columns where we
> want to just capture the complete list of values in the mcv and throw
> away the histogram:
> 
>         else if (toowide_cnt == 0 && nmultiple == ndistinct)
>         {
>             /*
>              * Every value in the sample appeared more than once.  Assume the
>              * column has just these values.
>              */
>             stats->stadistinct = ndistinct;
>         }
> 
> The problem with this heuristic is that if the table is small enough,
> you might expect that you can set the statistics target high, "sample"
> the entire table, and get a very accurate mcv covering all the values.
> However, if any value in the table appears only once, this heuristic
> will defeat you. The code that follows it in analyze.c will then throw
> out of the mcv any value which isn't 25% more common than "average",
> leaving you with a histogram for those values. That histogram often
> does very poorly if the values don't fit any pattern and are just
> discrete arbitrary values.
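
To make the effect described above concrete, here is a minimal standalone sketch. It is not the analyze.c code: the names McvSketchResult and mcv_sketch are hypothetical, toowide_cnt is assumed to be zero, and the cutoff is reduced to the "25% more common than average" rule from the quoted text, ignoring any other limits the real code applies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not a PostgreSQL structure. */
typedef struct
{
    bool   assumed_complete;  /* did "every value appeared more than once" hold? */
    size_t mcv_kept;          /* number of distinct values kept in the mcv */
} McvSketchResult;

/*
 * counts[i] is how often the i-th distinct value appeared in the sample.
 * num_mcv is the maximum mcv size (derived from the statistics target).
 */
static McvSketchResult
mcv_sketch(const int *counts, size_t ndistinct, size_t num_mcv)
{
    McvSketchResult r = {false, 0};
    size_t      nmultiple = 0;
    long        samplerows = 0;
    double      mincount;
    size_t      i;

    if (ndistinct == 0)
        return r;

    for (i = 0; i < ndistinct; i++)
    {
        samplerows += counts[i];
        if (counts[i] > 1)
            nmultiple++;
    }

    /* The quoted heuristic: every sampled value appeared more than once. */
    r.assumed_complete = (nmultiple == ndistinct);

    /* If so, and everything fits, the complete list of values is kept. */
    if (r.assumed_complete && ndistinct <= num_mcv)
    {
        r.mcv_kept = ndistinct;
        return r;
    }

    /* Otherwise keep only values at least 25% more common than average. */
    mincount = 1.25 * (double) samplerows / (double) ndistinct;
    for (i = 0; i < ndistinct && r.mcv_kept < num_mcv; i++)
    {
        if ((double) counts[i] >= mincount)
            r.mcv_kept++;
    }
    return r;
}
```

With sample counts {40, 30, 30} all three values are kept. With {40, 30, 29, 1} the single value that appears once makes the first test fail, the cutoff becomes 1.25 * 100 / 4 = 31.25, and only the value seen 40 times survives in the mcv, even though the whole table was sampled.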

Do you want a C comment to document this problem?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

