On 2023/6/16 23:39, Tomas Vondra wrote:
>
>
> On 6/16/23 11:25, Quan Zongliang wrote:
>>
>> We have a small table with only 23 rows and 21 distinct values.
>>
>> The resulting MCV and histogram are as follows:
>> stanumbers1 | {0.08695652,0.08695652}
>> stavalues1 | {v1,v2}
>> stavalues2 |
>> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
>>
>> The estimated row count is far off when this table is hash-joined
>> with another, much larger table (about 2 million rows).
>>
>> Hash Join (cost=1.52..92414.61 rows=2035023 width=0) (actual
>> time=1.943..1528.983 rows=3902 loops=1)
>>
>
> That's interesting. I wonder how come the estimate gets this bad simply
> by skipping values entries with a single row in the sample, which means
> we know the per-value selectivity pretty well.
>
> I guess the explanation has to be something strange happening when
> estimating the join condition selectivity, where we combine MCVs from
> both sides of the join (which has to be happening here, otherwise it
> would not matter what gets to the MCV).
>
> It'd be interesting to know what's in the other MCV, and what are the
> other statistics for the attributes (ndistinct etc.).
>
> Or even better, a reproducer SQL script that builds two tables and then
> joins them.
>
The other table is severely skewed: most of its rows have no match in
the small table. That particular distribution is what makes the row
estimate (and hence the cost) so inaccurate.
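For reference, a rough reproducer sketch of the kind of distribution I
mean (table names, column names and row counts are made up for
illustration, so it may not reproduce the exact numbers above):

-- small table: 23 rows, 21 distinct values, only two values duplicated
CREATE TABLE small_t (v int);
INSERT INTO small_t SELECT g FROM generate_series(1, 21) g;
INSERT INTO small_t VALUES (1), (2);

-- large, severely skewed table: the bulk of its rows cannot match
-- anything in small_t, only a few thousand rows actually join
CREATE TABLE big_t (v int);
INSERT INTO big_t SELECT 1000000 + (g % 50) FROM generate_series(1, 2000000) g;
INSERT INTO big_t SELECT 1 + (g % 21) FROM generate_series(1, 3900) g;

ANALYZE small_t;
ANALYZE big_t;

EXPLAIN ANALYZE SELECT * FROM big_t JOIN small_t USING (v);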
>> The reason is that the MCV of the small table excludes values that
>> occur only once. Putting them into the MCV statistics gives the
>> correct result.
>>
>> Using the conservative condition samplerows <= attstattarget doesn't
>> completely solve this problem, but it does handle this case.
>>
>> After the modification we get statistics with no histogram:
>> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
>> stavalues1 | {v1,v2, ... }
>>
>> And we have the right estimates:
>> Hash Join (cost=1.52..72100.69 rows=3631 width=0) (actual
>> time=1.447..1268.385 rows=3902 loops=1)
>>
>
> I'm not against building a "complete" MCV, but I guess the case where
> (samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
> complete whenever we decide (ndistinct <= num_mcv)?
>
> That would need to happen later, because we don't have the ndistinct
> estimate yet at this point - we'd have to do the loop a bit later (or
> likely twice).
>
> FWIW the patch breaks the calculation of nmultiple (and thus likely the
> ndistinct estimate).
>
It's not just small tables. A column whose values are nearly unique
causes the same problem, because we exclude values that occur only
once. The samplerows <= num_mcv condition only covers one scenario.
Perhaps we should drop the (dups cnt > 1) restriction altogether?
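To illustrate the nearly-unique case, here is a sketch with made-up
names and sizes: a column where almost every value appears exactly once
ends up with essentially no MCV entries, even though the sample tells
us the per-value frequencies quite precisely.

-- nearly-unique column: only two values occur more than once
CREATE TABLE nearly_unique_t (v int);
INSERT INTO nearly_unique_t SELECT g FROM generate_series(1, 20000) g;
INSERT INTO nearly_unique_t VALUES (1), (2);

ANALYZE nearly_unique_t;

-- under the current (dups cnt > 1) rule, at most the two duplicated
-- values can show up in most_common_vals; the ~20000 single-occurrence
-- values never can, no matter how well the sample covers them
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'nearly_unique_t' AND attname = 'v';

With the default statistics target the whole table fits in the sample
here, but samplerows is far above num_mcv, so the samplerows <= num_mcv
condition would not help this case.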