Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date
Msg-id 6dce55f4-e8f6-8e8e-b2af-b45ccff13598@enterprisedb.com
Whole thread Raw
In response to Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
List pgsql-hackers
On 6/24/23 02:08, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> The problem is that the selectivity for "IS NULL" is estimated using the
>> table-level statistics. But the LEFT JOIN entirely breaks the idea that
>> the null_frac has anything to do with NULLs in the join result.
> 
> Right.
> 
>> I wonder how to improve this, say by adjusting the IS NULL selectivity
>> when we know to operate on the outer side of the join. We're able to
>> do this for antijoins, so maybe we could do that here, somehow?
> 
> This mess is part of the long-term plan around the work I've been doing
> on outer-join-aware Vars.  We now have infrastructure that can let
> the estimator routines see "oh, this Var isn't directly from a scan
> of its table, it's been passed through a potentially-nulling outer
> join --- and I can see which one".  I don't have more than vague ideas
> about what happens next, but that is clearly an essential step on the
> road to doing better.
> 

I was wondering if that work on outer-join-aware Vars could help with
this, but I wasn't following it very closely. I agree the ability to
check if the Var could be NULL due to an outer join seems useful, as it
says whether applying raw attribute statistics makes sense or not.

I was thinking about what to do for the case when that's not possible,
i.e. when the Var refers to nullable side of the join. Knowing that this
is happening is clearly not enough - we need to know how many new NULLs
are "injected" into the join result, and "communicate" that to the
estimation routines.

Attached is a very ugly experimental patch doing that, and with it the
estimate changes to this:

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Left Join  (cost=3.25..18179.88 rows=999900 width=16)
                   (actual time=0.528..596.151 rows=999900 loops=1)
     Hash Cond: (large.id = small.id)
     Filter: ((small.id IS NULL) OR
              (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
     Rows Removed by Filter: 100
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                     (actual time=0.069..176.138 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8)
               (actual time=0.371..0.373 rows=100 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8)
                         (actual time=0.032..0.146 rows=100 loops=1)
   Planning Time: 3.845 ms
   Execution Time: 712.405 ms
  (10 rows)

Seems nice, but. The patch is pretty ugly, I don't claim it works for
other queries or that this is exactly what we should do. It calculates
"unmatched frequency" next to eqjoinsel_inner, stashes that info into
sjinfo and the estimator (nulltestsel) then uses that to adjust the
nullfrac it gets from the statistics.

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.

So maybe we should split the join estimate into two parts, one for each
subset of the join result. One for the rows with a match (and then we
can just do what we do now, with the attribute stats we already have).
And one for the "unmatched part" where we know the values on the outer
side are NULL (and then we can easily "fake" stats with null_frac=1.0).


I really hope what I just wrote makes at least a little bit of sense.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: "Joel Jacobson"
Date:
Subject: Re: Do we want a hashset type?
Next
From: Michael Paquier
Date:
Subject: Re: Remove deprecation warnings when compiling PG ~13 with OpenSSL 3.0~