Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Date
Msg-id CA+TgmoZB52TnJGX5jzO=NMfH+vGDjmqgk+HrrFErXgMnP521Nw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The third goal requires that the partition bounds be compared based on
> the partition keys present in the equi-join. While matching the
> partitions to be joined, the partition bounds corresponding the base
> relation whose partition keys appear in the equi-join are used for
> comparison using support function corresponding to the data types of
> partition keys. This requires us to associate the partitions of a join
> with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
> (notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
> (X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
> assuming strict operators this can be executed as (AB)C or A(BC). AB
> will have partitions A1B1, A2B3 since there is no matching bound of A
> for B2 and A is outer relation. A1B1 is associated with bound X1 of A
> and C both. A2B3 is associated with bound of X3, which happens to be
> 2nd bound of A but third of B. When we join (AB) with C, we should
> notice that C1 goes with A1B1, C2 doesn't have any matching partition
> in AB and C3 goes with A2B3. If we compare bounds of B with C without
> any transformation we will know C2 matches B2, but we need to look at
> the children of AB to realize that B2 isn't present in any of the
> children and thus C2 should not be joined with any partition of AB.

Sure.

> That usually looks a quadratic order operation on the number of
> partitions.

Now that I don't buy.  Certainly, for range partitions, given a list
of ranges of length M and another of length N, this can be done in
O(M+N) time by merge-joining the lists of bounds.  You pointed out
upthread that for list partitions, things are a bit complicated
because a single list partition can contain multiple values which are
not necessarily contiguous, but I think that this can still be done in
O(M+N) time.  Sort all of the bounds, associating each one to a
partition, and do a merge pass; whenever two bounds match, match the
two corresponding partitions, but if one of those partitions is
already matched to some other partition, then fail.

For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN
(2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4).  The sorted
bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6.  The first value in both
lists is a 1, so the corresponding partitions A1 and B1 are matched.
The second value in both lists is a 2, so the corresponding partitions
A2 and B2 are matched.  Then we hit a 3 on the A side that has no
match on the B side, but that's fine; we don't need to do anything.
If the partition on the A side never got a mapping at any point during
this merge pass, we'd eventually need to match it to a dummy partition
(unless this is an inner join) but it's already mapped to B1 so no
problem.  Then we hit a 4 which says that A2 must match B2, which is
consistent with what we already determine; no problem.  Then we hit
another value that only exists on the A side, which is fine just as
before.  Finally we hit a 6 on each side, which means that A2 must
match B1, which is inconsistent with the existing mappings so we give
up; no partitionwise join is possible here.

> The complexity can be reduced by maintaining as many
> partition bounds as the number of base relations participating in the
> join (an idea, I have floated earlier [1]) I don't elaborate it here
> to avoid digression. There's also the complexity of an N-way join with
> multiple partition keys and joins on partition keys from different
> relations as discussed in [1]. There may be more involved cases, that
> I haven't thought about. In short, implementation for 1st and 3rd
> optimization together looks fairly complex.

I spent some time thinking about this today and I think I see how we
could make it work: keep a single set of bounds for each join
relation, but record the type of each bound.  For example, suppose we
are full joining relation i2, with an int2 partition column, which has
partitions i2a from 0 to 10000 and i2b from 20000 to 30000, to
relation i4, with an int4 partition column, which has partitions i4a
from 5000 to 15000 and i4b from 25000 to 35000.   We end up with a
joinrel with 2 partitions.  The first goes from 0 (stored as an int2)
to 15000 (stored as an int4) and the second goes from 20000 (stored as
an int2) to 35000 (stored as an int4).  If we subsequently need to
merge these bounds with yet another relation at a higher join level,
we can use the opfamily (which is common) to dig out the right
cross-type operator for each comparison we may need to perform, based
on the precise types of the datums being compared.  Of course, we
might not find an appropriate cross-type operator in some cases,
because an opfamily isn't required to provide that, so then we'd have
to fail gracefully somehow, but that could be done.

Having said that I think we could make this work, I'm starting to
agree with you that it will add more complexity than it's worth.
Needing to keep track of the type of every partition bound
individually seems like a real nuisance, and it's not likely to win
very often because, realistically, people should and generally will
use the same type for the partitioning column in all of the relevant
tables.  So I'm going to revise my position and say it's fine to just
give up on partitionwise join unless the types match exactly, but I
still think we should try to cover the cases where the bounds don't
match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
optimizations 1 and 2 from your list of 4).  I agree that ganging
partitions (optimization 4 from your list) is not something to tackle
right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [HACKERS] identity columns
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] identity columns