MCV lists for highly skewed distributions - Mailing list pgsql-hackers

From Jeff Janes
Subject MCV lists for highly skewed distributions
Date
Msg-id CAMkU=1yvdGvW9TmiLAhz2erFnvnPFYHbOZuO+a=4DVkzpuQ2tw@mail.gmail.com
Whole thread Raw
Responses Re: MCV lists for highly skewed distributions  (John Naylor <jcnaylor@gmail.com>)
Re: MCV lists for highly skewed distributions  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
I want to revive a patch I sent  couple years ago to the performance list, as I have seen the issue pop up repeatedly since then.

The problem is that if you have a highly skewed distribution, such as exponential frequencies, it usually stores too few entries in the MCV list.

For example:

create table foo as select floor(-log(random())/log(2))::int  as who from generate_series(1,10000000);

This will usually store 0 to 5 as MCV with the default stat target[1].  Then value 6 is not stored, and its ~75k repetitions get averaged over the remaining distinct values.  This leads to horrible misplanning for rare values.  For example, who=17 has 37 entries, but is estimated to have 20,000.  The correct join for 37 values is often not the correct one for 20,000.

If we stored just a few more values, their inclusion in the MCV would mean they are depleted from the residual count, correctly lowering the estimate we would get for very rare values not included in the sample.

So instead of having the threshold of 1.25x the average frequency over all values, I think we should use 1.25x the average frequency of only those values not already included in the MCV, as in the attached.

As it is, you can partially overcome the too short MCV list by cranking up the default statistics target, but this is a weak effect and comes at a high cost of CPU time.  In some of the real distributions I've looked at, cranking up default statistics target is almost entirely ineffective.

I think that perhaps maxmincount should also use the dynamic values_cnt_remaining rather than the static one.  After all, things included in the MCV don' get represented in the histogram.  When I've seen planning problems due to skewed distributions I also usually see redundant values in the histogram boundary list which I think should be in the MCV list instead. But I have not changed that here, pending discussion.

[1] Occasionally it will store a much longer MCV list, because no values was sampled exactly once, which triggers a different code path in which all seen values are put in the MCV and the histogram is NULL.  This is not reliable, as whether the least-sample value is present in the sample once or twice is pretty brittle.

Cheers,

Jeff
Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] taking stdbool.h into use
Next
From: David Rowley
Date:
Subject: Re: [HACKERS] path toward faster partition pruning