Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date
Msg-id 753270d7-a303-d1bb-1291-da717bc615a1@enterprisedb.com
In response to Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics  (Quan Zongliang <quanzongliang@yeah.net>)
List pgsql-hackers
On 6/17/23 02:02, Quan Zongliang wrote:
> 
> 
> On 2023/6/17 06:46, Tom Lane wrote:
>> Quan Zongliang <quanzongliang@yeah.net> writes:
>>> Perhaps we should discard this (dups cnt > 1) restriction?
>>
>> That's not going to happen on the basis of one test case that you
>> haven't even shown us.  The implications of doing it are very unclear.
>> In particular, I seem to recall that there are bits of logic that
>> depend on the assumption that MCV entries always represent more than
>> one row.  The nmultiple calculation Tomas referred to may be failing
>> because of that, but I'm worried about there being other places.
>>

I don't recall any logic that'd outright fail with MCVs containing
single-row groups, and I haven't noticed anything obvious in analyze.c
during a cursory search. Maybe the paper analyze_mcv_list builds on
makes some assumptions? Not sure.

However, compute_distinct_stats() doesn't seem to have such protection
against single-row MCV groups, so if that's wrong we kinda already have
the issue I think (admittedly, compute_distinct_stats is much less used
than compute_scalar_stats).

> 
> The statistics for the other table look like this:
> stadistinct | 6
> stanumbers1 | {0.50096667,0.49736667,0.0012}
> stavalues1  | {v22,v23,v5}
> 
> The values that appear twice in the small table (v1 and v2) do not
> appear here. The true value of stadistinct is 18 instead of 6 (three
> values in the small table do not appear here).
> 
> When calculating the selectivity:
> if (nd2 > sslot2->nvalues)
>   totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);
> 
> totalsel1 = 0
> nd2 = 21
> sslot2->nvalues = 2
> unmatchfreq1 = 0.99990002016420476
> otherfreq2 = 0.82608695328235626
> 
> result: totalsel1 = 0.043473913749706022
> rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800
> 

Attached is a script reproducing this.

I think the fundamental issue here is that the most common element of
the large table, v22 (~50%), is not in the tiny one at all. IIRC the
join estimation assumes the domain of one table is a subset of the
other. The values v22 / v23 violate that assumption, unfortunately.

Including all of the small table's values in its MCV list fixes this,
because then

  otherfreq1 = 0.0

and that simply eliminates the impact of values that didn't have a match
between the two MCV lists, which mitigates the violated assumption.

But once the small table grows too large for the MCV list, this won't
work as well; it probably still helps a bit, as it makes otherfreq1
smaller.

That doesn't mean it's useless, but it's likely a rare combination for a
table to be (and remain) smaller than the MCV list while the large table
contains values without a match in the smaller one (think foreign keys).

> 
>> Basically, you're proposing a rather fundamental change in the rules
>> by which Postgres has gathered statistics for decades.  You need to
>> bring some pretty substantial evidence to support that.  The burden
>> of proof is on you, not on the status quo.
>>

Right. It's a good example of a "quick hack" that fixes one particular
case without considering the consequences for other cases too much. Good
as a starting point, but there's plenty of legwork to do.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
