
From Thomas Munro
Subject Re: Asymmetric partition-wise JOIN
Date
Msg-id CA+hUKGJw6ii_jLgkit1FUF8-cXaBt+RFjC+B3quh8sTFa9ge0g@mail.gmail.com
In response to Re: Asymmetric partition-wise JOIN  (Kohei KaiGai <kaigai@heterodb.com>)
List pgsql-hackers
On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai@heterodb.com> wrote:
> We can consider that the table join ptable X t1 above is equivalent to:
>   (ptable_p0 + ptable_p1 + ptable_p2) X t1
> = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> It returns an equivalent result; however, the rows are already reduced by the
> HashJoin in each individual leaf of the Append, so the CPU cycles consumed by
> the Append node can be cheaper.
>
> On the other hand, it has a downside: t1 must be read 3 times and the
> hash table must also be built 3 times. That increases the expected cost,
> so the planner may not choose the asymmetric partition-wise join plan.
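
For concreteness, the shape being discussed is roughly this (a minimal
sketch; all names are invented, and I've used four hash partitions to
match the rewrite below):

  CREATE TABLE ptable (dist int, payload text) PARTITION BY HASH (dist);
  CREATE TABLE ptable_p0 PARTITION OF ptable
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
  -- ... likewise ptable_p1, ptable_p2, ptable_p3 with REMAINDER 1, 2, 3
  CREATE TABLE t1 (dist int, info text);   -- not partitioned

  -- The join under discussion; with the proposed asymmetric
  -- partition-wise join, each partition gets its own HashJoin against
  -- t1 underneath the Append:
  SELECT * FROM ptable JOIN t1 USING (dist);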

What if you include the partition constraint as a filter on t1?  So you get:

ptable X t1 =
  (ptable_p0 X (σ hash(dist)%4=0 (t1))) +
  (ptable_p1 X (σ hash(dist)%4=1 (t1))) +
  (ptable_p2 X (σ hash(dist)%4=2 (t1))) +
  (ptable_p3 X (σ hash(dist)%4=3 (t1)))

Pros:
1.  The hash tables will not contain unnecessary junk.
2.  You'll get the right answer if t1 is on the outer side of an outer join.
3.  If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4.  The filter might enable a selective and efficient plan like an index scan.

Cons:
1.  The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because you don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)
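
To make that concrete with the range-partition flavour (a sketch only;
the bounds and names are invented), one leaf of the rewritten Append
would behave as if it had been written as:

  -- Suppose ptable_p0 is defined FOR VALUES FROM (1) TO (100).
  -- Copying that constraint onto t1 keeps junk out of its hash table,
  -- and might let an index on t1.dist be used instead of a seq scan:
  SELECT *
    FROM ptable_p0
    JOIN (SELECT * FROM t1 WHERE dist BETWEEN 1 AND 99) AS t1_filtered
      USING (dist);

(In the hash-partitioned case the filter would instead be the
partition's satisfies_hash_partition(...) check constraint, which is
what the hash(dist)%4=N shorthand above stands for.)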

> One idea I have is that sibling HashJoins could share a hash table that was
> built once by any one of the sibling Hash plans. Right now, it is not implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition.  I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you end
up needing nodes in the plan that 'receive' tuples for each partition,
to match up with the PWJ plan structure.  That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table.  I ran into a number of problems
while thinking about that, which I should write about in another
thread.

--
Thomas Munro
https://enterprisedb.com


