Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq
Date
Msg-id 995278.1772219726@sss.pgh.pa.us
Whole thread Raw
In response to Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq  (Tender Wang <tndrwang@gmail.com>)
Responses Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq
Re: [BUG?] estimate_hash_bucket_stats uses wrong ndistinct for avgfreq
List pgsql-hackers
Tender Wang <tndrwang@gmail.com> writes:
> I added Tom to the cc list. He may know more about this.

Hmm, git blame says I originated this function 25 years ago
(f905d65ee), but I don't claim to remember that.

Looking at it now, though, I think that bd3e3e9e5 is indeed
wrong but not in the way Joel suggests.  The longstanding
way to compute mcv_freq is

            /*
             * The first MCV stat is for the most common value.
             */
            if (sslot.nnumbers > 0)
                *mcv_freq = sslot.numbers[0];

*This number is a fraction measured on the raw relation.*
(Necessarily so, because it's just a number computed by ANALYZE.)
Then bd3e3e9e5 added

            /*
             * If there are no recorded MCVs, but we do have a histogram, then
             * assume that ANALYZE determined that the column is unique.
             */
            if (vardata.rel && vardata.rel->rows > 0)
                *mcv_freq = 1.0 / vardata.rel->rows;

This is a pure thinko.  rel->rows is the estimated number
of filtered rows.  What I should have used is rel->tuples,
which is the estimated raw relation size, so as to get a
number that is commensurate with the longstanding way
of calculating mcv_freq.  Then that also matches up with
computing avgfreq on the raw relation.

So I think the correct fix is basically

-            if (vardata.rel && vardata.rel->rows > 0)
-                *mcv_freq = 1.0 / vardata.rel->rows;
+            if (vardata.rel && vardata.rel->tuples > 0)
+                *mcv_freq = 1.0 / vardata.rel->tuples;

and I wonder if that will wind up in reverting a lot of the plan
choice changes seen in bd3e3e9e5.

Joel, do you want to run this to ground, and in particular
see if that way of fixing it passes your sanity tests?

            regards, tom lane



pgsql-hackers by date:

Previous
From: Sami Imseih
Date:
Subject: Re: Fix bug in multixact Oldest*MXactId initialization and access
Next
From: 陈宗志
Date:
Subject: Re: [PROPOSAL] Doublewrite Buffer as an alternative torn page protection to Full Page Write