Thread: join plan with unexpected var clauses

join plan with unexpected var clauses

From
Luc Vlaming
Date:
Hi,

At a customer we came across a curious plan (see attached testcase).

Given the testcase we see that the outer semi join tries to join the 
outer with the inner table id columns, even though the middle table id 
column is also there. Is this expected behavior?

The reason i'm asking is two-fold:
- the inner hash table now is bigger than i'd expect and has columns 
that you would normally not select on.
- the middle join now projects the inner as result, which is quite 
suprising and seems invalid from a SQL standpoint.

Plan:
  Finalize Aggregate
    Output: count(*)
    ->  Gather
          Output: (PARTIAL count(*))
          Workers Planned: 4
          ->  Partial Aggregate
                Output: PARTIAL count(*)
                ->  Parallel Hash Semi Join
                      Hash Cond: (_outer.id3 = _inner.id2)
                      ->  Parallel Seq Scan on public._outer
                            Output: _outer.id3, _outer.extra1
                      ->  Parallel Hash
                            Output: middle.id1, _inner.id2
                            ->  Parallel Hash Semi Join
                                  Output: middle.id1, _inner.id2
                                  Hash Cond: (middle.id1 = _inner.id2)
                                  ->  Parallel Seq Scan on public.middle
                                        Output: middle.id1
                                  ->  Parallel Hash
                                        Output: _inner.id2
                                        ->  Parallel Seq Scan on 
public._inner
                                              Output: _inner.id2

Kind regards,
Luc
Swarm64

Attachment

Re: join plan with unexpected var clauses

From
Tom Lane
Date:
Luc Vlaming <luc@swarm64.com> writes:
> Given the testcase we see that the outer semi join tries to join the 
> outer with the inner table id columns, even though the middle table id 
> column is also there. Is this expected behavior?

I don't see anything greatly wrong with it.  The planner has concluded
that _inner.id2 and middle.id1 are part of an equivalence class, so it
can form the top-level join by equating _outer.id3 to either of them.
AFAIR that choice is made at random --- there's certainly not any logic
that thinks about "well, the intermediate join output could be a bit
narrower if we choose this one instead of that one".

I think "made at random" actually boils down to "take the first usable
member of the equivalence class".  If I switch around the wording of
the first equality condition:

   ... select 1 from _inner where middle.id1 = _inner.id2

then I get a plan where the top join uses middle.id1.  However,
it's still propagating both middle.id1 and _inner.id2 up through
the bottom join, so that isn't buying anything efficiency-wise.

            regards, tom lane