On Fri, Feb 27, 2026, at 20:15, Tom Lane wrote:
> Hmm, git blame says I originated this function 25 years ago
> (f905d65ee), but I don't claim to remember that.
>
> Looking at it now, though, I think that bd3e3e9e5 is indeed
> wrong but not in the way Joel suggests. The longstanding
> way to compute mcv_freq is
>
> /*
> * The first MCV stat is for the most common value.
> */
> if (sslot.nnumbers > 0)
> *mcv_freq = sslot.numbers[0];
>
> *This number is a fraction measured on the raw relation.*
> (Necessarily so, because it's just a number computed by ANALYZE.)
> Then bd3e3e9e5 added
>
> /*
> * If there are no recorded MCVs, but we do have a histogram, then
> * assume that ANALYZE determined that the column is unique.
> */
> if (vardata.rel && vardata.rel->rows > 0)
> *mcv_freq = 1.0 / vardata.rel->rows;
>
> This is a pure thinko. rel->rows is the estimated number
> of filtered rows. What I should have used is rel->tuples,
> which is the estimated raw relation size, so as to get a
> number that is commensurate with the longstanding way
> of calculating mcv_freq. Then that also matches up with
> computing avgfreq on the raw relation.
>
> So I think the correct fix is basically
>
> - if (vardata.rel && vardata.rel->rows > 0)
> - *mcv_freq = 1.0 / vardata.rel->rows;
> + if (vardata.rel && vardata.rel->tuples > 0)
> + *mcv_freq = 1.0 / vardata.rel->tuples;
Nice catch, it turns out there are actually two bugs.
This is one of them, and I agree with the fix.
> and I wonder if that will wind up in reverting a lot of the plan
> choice changes seen in bd3e3e9e5.
>
> Joel, do you want to run this to ground, and in particular
> see if that way of fixing it passes your sanity tests?
Challenge accepted!
[...hours later...]
My conclusion is that we still need to move avgfreq
computation, like I suggested.
The reason for this is that estfract is calculated as:
estfract = 1.0 / ndistinct;
where ndistinct has been adjusted to account for restriction clauses.
Therefore, we must also use the adjusted avgfreq when adjusting
estfract here:
/*
* Adjust estimated bucketsize upward to account for skewed distribution.
*/
if (avgfreq > 0.0 && *mcv_freq > avgfreq)
estfract *= *mcv_freq / avgfreq;
What I first didn't understand was that mcv_freq is of course an
approximation of the mcv not only in the total table, but also in a
restriction, under the uniform restriction assumption. So we should not
adjust mcv_freq here, but we must use the restriction adjusted avgfreq
value, since estfract is calculated from the restriction adjusted
ndistinct value. Otherwise estfract will be garbage.
Feel free to skip looking at
demo-estimate_hash_bucket_stats.txt
if the above explanation above is satisfactory. It only shows demo
queries to prove the buggy calculations, and that both fixes are needed.
The queries demonstrates the separate bugs and how the fixes also fixes
both plans. Query 1 demonstrates the mcv_freq bug. Query 2 demonstrates
the avgfreq bug.
/Joel