Re: JOIN_SEMI planning question - Mailing list pgsql-hackers

From Tom Lane
Subject Re: JOIN_SEMI planning question
Date
Msg-id 18311.1561592101@sss.pgh.pa.us
Whole thread Raw
In response to JOIN_SEMI planning question  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> While looking at bug #15857[1], I wondered why the following two
> queries get different plans, given the schema and data from the bug
> report:
> ...
> Here's my question: how could it ever be OK to sort/unique something
> and put it in a hash table, but not OK to put exactly the same thing
> in the hash table directly, with JOIN_SEMI logic to prevent multiple
> matches?  And likewise for the inner side of other join strategies.

I think you're thinking about it wrong.  Or maybe there's some additional
work we could put in to extend the logic.

If we're considering the case where the semijoin is c.base_id = a.id,
then we definitely can do "a SEMIJOIN c WHERE a.id = c.base_id".
If we want to join b to c first, we can do so, but we have to unique-ify c
before that join, and then both the b/c join and the later join with a
can become plain inner joins.  We *don't* get to skip unique-ifying c
before joining to b and then apply a semijoin with a later, because in
general that's going to result in the wrong number of output rows.
(Example: if a is unique but b has multiple copies of a particular join
key, and the key does appear in c, we should end with multiple output rows
having that key, and we wouldn't.)

It's possible that in some situations we could prove that semijoining
later would work, but it seems like that'd require a great deal more
analysis than the code does now.  There'd also have to be tracking of
whether the final a join still has to be a semijoin or not, which'd
now depend on which path for b/c was being considered.

> Or to put it in the language of the comment, how could you ever have
> enough rels to do a join between B and unique(C), but not enough rels
> to do a semi-join between B and C?

If you're not joining C to B at all, but to some other rel A, you can't do
it as a semijoin because you can't execute the semijoin qual correctly.

In the particular case we're looking at here, it may be possible to prove
that equivalence-class substitution from the original B/C semijoin qual
gives rise to an A/C join qual that will work as an equivalent semijoin
qual, and then it's OK to do A/C as a semijoin with that.  But I'm not
very sure what the proof conditions need to be, and I'm 100% sure that
the code isn't making any such proof now.

            regards, tom lane



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Speed up transaction completion faster after many relations areaccessed in a transaction
Next
From: Michael Paquier
Date:
Subject: Re: Useless configure checks for RAND_OpenSSL (HAVE_RAND_OPENSSL)