Thread: JOIN_SEMI planning question

JOIN_SEMI planning question

From
Thomas Munro
Date:
Hello,

While looking at bug #15857[1], I wondered why the following two
queries get different plans, given the schema and data from the bug
report:

(1) SELECT COUNT (*)
      FROM a
      JOIN b
        ON a.id=b.base_id
     WHERE EXISTS (
           SELECT 1
             FROM c
            WHERE c.base_id = a.id
           );

(2) SELECT COUNT (*)
      FROM a
      JOIN b
        ON a.id=b.base_id
     WHERE EXISTS (
           SELECT 1
             FROM c
            WHERE c.base_id = b.base_id
           );

The only difference is a.id vs b.base_id in the WHERE clause, and
those are equivalent, and the planner knows it as we can see from the
join cond visible in the plan.  Query 1 gets a JOIN_UNIQUE_INNER for
BC (C is sorted, made unique and then hashed for an inner join with B)
while query 2 gets a JOIN_SEMI for BC (C is hashed for a semi join
with B).  In both cases JOIN_UNIQUE_INNER is considered for BC (though
it's later blocked for PHJ since commit aca127c1), but the better
JOIN_SEMI is considered only for query 2.

The relevant decision logic is in populate_joinrel_with_paths()'s
JOIN_SEMI case, where it considers which of JOIN_SEMI,
JOIN_UNIQUE_INNER and JOIN_UNIQUE_OUTER to add.

Here's my question: how could it ever be OK to sort/unique something
and put it in a hash table, but not OK to put exactly the same thing
in the hash table directly, with JOIN_SEMI logic to prevent multiple
matches?  And likewise for the inner side of other join strategies.
Or to put in in plain C, in what case would the following change be
wrong?

                        /*
                         * We might have a normal semijoin, or a case
where we don't have
                         * enough rels to do the semijoin but can
unique-ify the RHS and
                         * then do an innerjoin (see comments in
join_is_legal).  In the
                         * latter case we can't apply JOIN_SEMI joining.
                         */
-                       if (bms_is_subset(sjinfo->min_lefthand, rel1->relids) &&
-                               bms_is_subset(sjinfo->min_righthand,
rel2->relids))
+                       if ((bms_is_subset(sjinfo->min_lefthand,
rel1->relids) &&
+                                bms_is_subset(sjinfo->min_righthand,
rel2->relids)) ||
+                                bms_equal(sjinfo->syn_righthand, rel2->relids))
                        {
                                if (is_dummy_rel(rel1) || is_dummy_rel(rel2) ||

restriction_is_constant_false(restrictlist, joinrel, false))

Or to put it in the language of the comment, how could you ever have
enough rels to do a join between B and unique(C), but not enough rels
to do a semi-join between B and C?

I admit that I don't have a great grasp of equivalence classes,
(min|syn)_(left|right)hand or the join planning code in general,
having focused mostly on execution so far, so the above is a
cargo-cult change and I may be missing something fundamental...

Which plan wins is of course a costing matter, but having the
JOIN_SEMI option available has the advantage of being more profitably
parallelisable, which led me here.  With the above hack you get a
[Parallel] Hash Semi Join for BC with both queries (unless you set
work_mem higher and then you get a much slower merge join, but that's
an entirely separate problem).  Only one regression test plan changes
-- it's structurally similar to the bug report query, but at a glance
the new plan is better anyway.  The test still demonstrates what it
wants to demonstrate (namely that the absence or presence of a unique
index causes the plan to change, and with the above hack that is even
clearer because the two plans now differ only in "Nested Loop Semi
Join" vs "Nested Loop").

[1] https://www.postgresql.org/message-id/flat/15857-d1ba2a64bce0795e%40postgresql.org

-- 
Thomas Munro
https://enterprisedb.com



Re: JOIN_SEMI planning question

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> While looking at bug #15857[1], I wondered why the following two
> queries get different plans, given the schema and data from the bug
> report:
> ...
> Here's my question: how could it ever be OK to sort/unique something
> and put it in a hash table, but not OK to put exactly the same thing
> in the hash table directly, with JOIN_SEMI logic to prevent multiple
> matches?  And likewise for the inner side of other join strategies.

I think you're thinking about it wrong.  Or maybe there's some additional
work we could put in to extend the logic.

If we're considering the case where the semijoin is c.base_id = a.id,
then we definitely can do "a SEMIJOIN c WHERE a.id = c.base_id".
If we want to join b to c first, we can do so, but we have to unique-ify c
before that join, and then both the b/c join and the later join with a
can become plain inner joins.  We *don't* get to skip unique-ifying c
before joining to b and then apply a semijoin with a later, because in
general that's going to result in the wrong number of output rows.
(Example: if a is unique but b has multiple copies of a particular join
key, and the key does appear in c, we should end with multiple output rows
having that key, and we wouldn't.)

It's possible that in some situations we could prove that semijoining
later would work, but it seems like that'd require a great deal more
analysis than the code does now.  There'd also have to be tracking of
whether the final a join still has to be a semijoin or not, which'd
now depend on which path for b/c was being considered.

> Or to put it in the language of the comment, how could you ever have
> enough rels to do a join between B and unique(C), but not enough rels
> to do a semi-join between B and C?

If you're not joining C to B at all, but to some other rel A, you can't do
it as a semijoin because you can't execute the semijoin qual correctly.

In the particular case we're looking at here, it may be possible to prove
that equivalence-class substitution from the original B/C semijoin qual
gives rise to an A/C join qual that will work as an equivalent semijoin
qual, and then it's OK to do A/C as a semijoin with that.  But I'm not
very sure what the proof conditions need to be, and I'm 100% sure that
the code isn't making any such proof now.

            regards, tom lane