Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date
Msg-id f4da3644-2ccd-f617-aa53-5d3dec265d07@enterprisedb.com
Whole thread Raw
In response to Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs  (Alena Rybakina <lena.ribackina@yandex.ru>)
Responses Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
List pgsql-hackers

On 7/8/23 10:29, Alena Rybakina wrote:
> 
>> Well, one option would be to modify all selectivity functions to do
>> something like the patch does for nulltestsel(). That seems a bit
>> cumbersome because why should those places care about maybe running on
>> the outer side of a join, or what? For code in extensions this would be
>> particularly problematic, I think.
> Agree. I would say that we can try it if nothing else works out.
>> So what I was thinking about doing this in a way that'd make this
>> automatic, without having to modify the selectivity functions.
>>
>> Option (3) is very simple - examine_variable would simply adjust the
>> statistics by tweaking the null_frac field, when looking at variables on
>> the outer side of the join. But it has issues when estimating multiple
>> conditions.
>>
>> Imagine t1 has 1M rows, and we want to estimate
>>
>>    SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
>>            WHERE ((t2.a=1) AND (t2.b=1))
>>
>> but only 50% of the t1 rows has a match in t2. Assume each of the t2
>> conditions matches 100% rows in the table. With the correction, this
>> means 50% selectivity for each condition. And if we combine them the
>> usual way, it's 0.5 * 0.5 = 0.25.
>>
>> But we know all the rows in the "matching" part match the condition, so
>> the correct selectivity should be 0.5.
>>
>> In a way, this is just another case of estimation issues due to the
>> assumption of independence.
>> FWIW, I used "AND" in the example for simplicity, but that'd probably be
>> pushed to the baserel level. There'd need to be OR to keep it at the
>> join level, but the overall issue is the same, I think.
>>
>> Also, this entirely ignores extended statistics - I have no idea how we
>> might tweak those in (3).
> 
> I understood the idea - it is very similar to what is implemented in the
> current patch.
> 
> But I don't understand how to do it in the examine_variable function, to
> be honest.
> 

Well, I don't have a detailed plan either. In principle it shouldn't be
that hard, I think - examine_variable is loading the statistics, so it
could apply the same null_frac correction, just like nulltestsel would
do a bit later.

The main question is how to pass the information to examine_variable. It
doesn't get the SpecialJoinInfo (which is what nulltestsel used), so
we'd need to invent something new ... add a new argument?

>> But (4) was suggesting we could improve this essentially by treating the
>> join as two distinct sets of rows
>>
>>   - the inner join result
>>
>>   - rows without match on the outer side
>>
>> For the inner part, we would do estimates as now (using the regular
>> per-column statistics). If we knew the conditions match 100% rows, we'd
>> still get 100% when the conditions are combined.
>>
>> For the second part of the join we know the outer side is just NULLs in
>> all columns, and that'd make the estimation much simpler for most
>> clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
>> that's it.
>>
>> And then we'd just combine these two selectivities. If we know the inner
>> side is 50% and all rows match the conditions, and no rows in the other
>> 50% match, the selectivity is 50%.
>>
>> inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.0 = 0.5
>>
>> Now, we still have issues with independence assumption in each of these
>> parts separately. But that's OK, I think.
>>
>> I think (4) could be implemented by doing the current estimation for the
>>   inner part, and by tweaking examine_variable in the "outer" part in a
>> way similar to (3). Except that it just sets null_frac=1.0 everywhere.
>>
>> For (4) we don't need to tweak those at all,
>> because for inner part we can just apply them as is, and for outer part
>> it's irrelevant because everything is NULL.
> I like this idea the most) I'll try to start with this and implement the
> patch.

Good to hear.

>> I hope this makes more sense. If not, let me know and I'll try to
>> explain it better.
> 
> Thank you for your explanation)
> 
> I will unsubscribe soon based on the results or if I have any questions.
> 

OK

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alena Rybakina
Date:
Subject: Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Next
From: Andrew Dunstan
Date:
Subject: Re: pg_basebackup check vs Windows file path limits