Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics - Mailing list pgsql-hackers

From Quan Zongliang
Subject Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date
Msg-id 4660403b-8df9-2b44-2214-26b21f35b539@yeah.net
Whole thread Raw
In response to Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
List pgsql-hackers

On 2023/6/17 06:46, Tom Lane wrote:
> Quan Zongliang <quanzongliang@yeah.net> writes:
>> Perhaps we should discard this (dups cnt > 1) restriction?
> 
> That's not going to happen on the basis of one test case that you
> haven't even shown us.  The implications of doing it are very unclear.
> In particular, I seem to recall that there are bits of logic that
> depend on the assumption that MCV entries always represent more than
> one row.  The nmultiple calculation Tomas referred to may be failing
> because of that, but I'm worried about there being other places.
> 

The statistics for the other table look like this:
stadistinct | 6
stanumbers1 | {0.50096667,0.49736667,0.0012}
stavalues1  | {v22,v23,v5}

The value that appears twice in the small table (v1 and v2) does not 
appear here. The stadistinct's true value is 18 instead of 6 (three 
values in the small table do not appear here).

When calculating the selectivity:
if (nd2 > sslot2->nvalues)
   totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);

totalsel1 = 0
nd2 = 21
sslot2->nvalues = 2
unmatchfreq1 = 0.99990002016420476
otherfreq2 = 0.82608695328235626

result: totalsel1 = 0.043473913749706022
rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800


> Basically, you're proposing a rather fundamental change in the rules
> by which Postgres has gathered statistics for decades.  You need to
> bring some pretty substantial evidence to support that.  The burden
> of proof is on you, not on the status quo.
> 
>             regards, tom lane




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Next
From: "Joel Jacobson"
Date:
Subject: Re: Do we want a hashset type?