Support run-time partition pruning for hash join

From
Richard Guo
Date:
If we have a hash join with an Append node on the outer side, something
like

 Hash Join
   Hash Cond: (pt.a = t.a)
   ->  Append
         ->  Seq Scan on pt_p1 pt_1
         ->  Seq Scan on pt_p2 pt_2
         ->  Seq Scan on pt_p3 pt_3
   ->  Hash
         ->  Seq Scan on t

We can actually prune those subnodes of the Append that cannot possibly
contain any matching tuples from the other side of the join.  To do
that, while building the hash table, for each row from the inner side we
can compute the minimum set of subnodes that can possibly match the join
condition.  Once we have built the hash table and start to execute the
Append node, we know which subnodes survive and can thus skip the
others.
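As an illustrative sketch of the idea (hypothetical Python, not the patch's C code, using the range bounds of the example table further below): while each inner row is inserted into the hash table, map its join key to the one range partition it could match and accumulate the union; the Append then only needs to scan the accumulated set.

```python
from bisect import bisect_right

# Hypothetical RANGE partition bounds: [0,250), [250,500), [500,600)
BOUNDS = [0, 250, 500, 600]

def partition_index(a):
    """Map a join-key value to its range partition, or None if out of range."""
    if a < BOUNDS[0] or a >= BOUNDS[-1]:
        return None
    return bisect_right(BOUNDS, a) - 1

def build_hash_and_prune(inner_rows):
    """While 'building the hash table', record which partitions can match."""
    hash_table, survivors = {}, set()
    for a in inner_rows:
        hash_table.setdefault(a, []).append(a)   # normal hash-table insert
        p = partition_index(a)                   # extra pruning step per row
        if p is not None:
            survivors.add(p)
    return hash_table, survivors

_, survivors = build_hash_and_prune([10, 12])   # all inner keys fall in [0, 250)
# survivors == {0}: only the first partition needs to be scanned by the Append
```

The per-row cost of the extra step is one binary search over the bound array, which is the overhead discussed later in the thread.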

This kind of partition pruning can be extended to happen across multiple
join levels.  For instance,

 Hash Join
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join
         Hash Cond: (pt.a = t1.a)
         ->  Append
               ->  Seq Scan on pt_p1 pt_1
               ->  Seq Scan on pt_p2 pt_2
               ->  Seq Scan on pt_p3 pt_3
         ->  Hash
               ->  Seq Scan on t1
   ->  Hash
         ->  Seq Scan on t2

We can compute the matching subnodes of the Append when building the Hash
table for 't1' according to the join condition 'pt.a = t1.a', and again
when building the Hash table for 't2' according to the join condition
'pt.a = t2.a'; the final surviving subnodes would be their intersection.
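That intersection step can be sketched as follows (hypothetical Python; the per-Hash matching sets are the ones recorded while building each hash table):

```python
def surviving_subnodes(*per_hash_matches):
    """Intersect the matching-subnode sets recorded at each hash build:
    a subnode survives only if some row in every hash table can match it."""
    result = None
    for matches in per_hash_matches:
        result = set(matches) if result is None else result & set(matches)
    return result if result is not None else set()

# t1's rows match only partition 0; t2's rows match partitions 0 and 1
print(surviving_subnodes({0}, {0, 1}))   # {0}
```

If the sets are disjoint, the intersection is empty and the Append can skip every subnode.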

Greenplum [1] has implemented this kind of partition pruning as
'Partition Selector'.  Attached is a patch that refactors Greenplum's
implementation to make it work on PostgreSQL master.  Here are some
details about the patch.

During planning:

1. When creating a hash join plan in create_hashjoin_plan() we first
   collect information required to build PartitionPruneInfos at this
   join, which includes the join's RestrictInfos and the join's inner
   relids, and put this information in a stack.

2. When we call create_append_plan() for an appendrel, for each of the
   joins we check whether join partition pruning can take place for this
   appendrel, based on the information collected at that join, and if so
   we build a PartitionPruneInfo and add it to the stack entry.

3. After finishing the outer side of the hash join, we should have built
   all the PartitionPruneInfos that can be used to perform join
   partition pruning at this join.  So we pop the stack entry to get
   the PartitionPruneInfos and add them to the Hash node.

During execution:

When building the hash table for a hash join, we perform the partition
pruning for each row according to each of the JoinPartitionPruneStates
at this join, and store each result in a special executor parameter to
make it available to Append nodes.  When executing an Append node, we
can directly use the pre-computed pruning results to skip subnodes that
cannot contain any matching rows.

Here is a query that shows the effect of the join partition pruning.

CREATE TABLE pt (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (0) TO (250);
CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (250) TO (500);
CREATE TABLE pt_p3 PARTITION OF pt FOR VALUES FROM (500) TO (600);
INSERT INTO pt SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 599) i WHERE i % 2 = 0;

CREATE TABLE t1 (a int, b int);
INSERT INTO t1 values (10, 10);

CREATE TABLE t2 (a int, b int);
INSERT INTO t2 values (300, 300);

ANALYZE pt, t1, t2;

SET enable_nestloop TO off;

explain (analyze, costs off, summary off, timing off)
select * from pt join t1 on pt.a = t1.a right join t2 on pt.a = t2.a;
                         QUERY PLAN
------------------------------------------------------------
 Hash Right Join (actual rows=1 loops=1)
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join (actual rows=0 loops=1)
         Hash Cond: (pt.a = t1.a)
         ->  Append (actual rows=0 loops=1)
               ->  Seq Scan on pt_p1 pt_1 (never executed)
               ->  Seq Scan on pt_p2 pt_2 (never executed)
               ->  Seq Scan on pt_p3 pt_3 (never executed)
         ->  Hash (actual rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on t1 (actual rows=1 loops=1)
   ->  Hash (actual rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on t2 (actual rows=1 loops=1)
(14 rows)


There are several points that need more consideration.

1. All the join partition pruning decisions are made in createplan.c,
   where the best path tree has already been decided.  This is not
   great.  Maybe it's better to make this happen while we build up the
   path tree, so that we can take the partition pruning into
   consideration when estimating the costs.

2. In order to make the join partition pruning take effect, the patch
   hacks the empty-outer optimization in ExecHashJoinImpl().  I'm not
   sure if this is good practice.

3. This patch does not support parallel hash join yet, but it would not
   be hard to add that support.

4. Is it possible and worthwhile to extend the join partition pruning
   mechanism to support nestloop and mergejoin as well?

Any thoughts or comments?

[1] https://github.com/greenplum-db/gpdb

Thanks
Richard

Re: Support run-time partition pruning for hash join

From
Andy Fan
Date:


On Mon, Aug 21, 2023 at 11:48 AM Richard Guo <guofenglinux@gmail.com> wrote:
If we have a hash join with an Append node on the outer side, something
like

 Hash Join
   Hash Cond: (pt.a = t.a)
   ->  Append
         ->  Seq Scan on pt_p1 pt_1
         ->  Seq Scan on pt_p2 pt_2
         ->  Seq Scan on pt_p3 pt_3
   ->  Hash
         ->  Seq Scan on t

We can actually prune those subnodes of the Append that cannot possibly
contain any matching tuples from the other side of the join.  To do
that, while building the hash table, for each row from the inner side we
can compute the minimum set of subnodes that can possibly match the join
condition.  Once we have built the hash table and start to execute the
Append node, we know which subnodes survive and can thus skip the
others.
  
This feature looks good, but is it possible to know whether we can prune
any subnodes before we pay the extra effort (building the hash
table, the per-row computation, and so on)?  IIUC, it looks like the
answer is no.  If so, I think this area needs more attention.  I can't
provide any good suggestions yet.

Maybe, at least, if we find that no subnodes can be skipped
during hashing, we can stop doing that work.

There are several points that need more consideration.

1. All the join partition pruning decisions are made in createplan.c
   where the best path tree has been decided.  This is not great.  Maybe
   it's better to make it happen when we build up the path tree, so that
   we can take the partition pruning into consideration when estimating
   the costs.

fwiw, the current master totally ignores the cost reduction from run-time
partition pruning, even for init partition pruning.  So in some real
cases, pg chooses a hash join just because the cost of the nested loop
join is highly overestimated.

4. Is it possible and worthwhile to extend the join partition pruning
   mechanism to support nestloop and mergejoin as well?

In my current knowledge, we have to build the inner table first for this
optimization?  So hash join and sort-merge join should be OK, but
nestloop seems impossible, unless I missed something.

--
Best Regards
Andy Fan

Re: Support run-time partition pruning for hash join

From
David Rowley
Date:
On Tue, 22 Aug 2023 at 00:34, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Mon, Aug 21, 2023 at 11:48 AM Richard Guo <guofenglinux@gmail.com> wrote:
>> 1. All the join partition prunning decisions are made in createplan.c
>>    where the best path tree has been decided.  This is not great.  Maybe
>>    it's better to make it happen when we build up the path tree, so that
>>    we can take the partition prunning into consideration when estimating
>>    the costs.
>
>
> fwiw, the current master totally ignores the cost reduction for run-time
> partition prune, even for init partition prune.  So in some real cases,
> pg chooses a hash join just because the cost of nest loop join is
> highly over estimated.

This is true about the existing code. It's a very tricky thing to cost
given that the parameter values are always unknown to the planner.
The best we have for these today is the various hardcoded constants in
selfuncs.h. While I do agree that it's not great that the costing code
knows nothing about run-time pruning, I also think that run-time
pruning during execution with parameterised nested loops is much more
likely to be able to prune partitions and save actual work than the
equivalent with Hash Joins.  It's more common for the planner to
choose a Nested Loop when there are fewer outer rows, so the pruning
code is likely to be called fewer times with Nested Loop than with
Hash Join.

With Hash Join, it seems to me that the pruning must take place for
every row that makes it into the hash table.  There may be
cases where the unioned set of partitions simply yields every
partition and all the work results in no savings.  Pruning on a scalar
value seems much more likely to be able to prune away unneeded
Append/MergeAppend subnodes.

Perhaps there can be something adaptive in Hash Join which stops
trying to prune when all partitions must be visited.  On a quick
glance at the patch, I don't see any code in ExecJoinPartitionPrune()
which gives up trying to prune once the number of members in
part_prune_result equals the number of prunable Append/MergeAppend
subnodes.
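The adaptive give-up could be sketched like this (hypothetical Python; names are illustrative, not from the patch):

```python
def note_matching_partition(survivors, part_idx, n_subnodes):
    """Record one hashed row's matching partition; return False once every
    subnode is already required, so the caller can stop trying to prune."""
    if part_idx is not None:
        survivors.add(part_idx)
    return len(survivors) < n_subnodes

survivors = set()
keep_pruning = note_matching_partition(survivors, 0, 3)   # True: pruning may still pay off
note_matching_partition(survivors, 1, 3)
keep_pruning = note_matching_partition(survivors, 2, 3)   # False: all 3 subnodes required
```

Once the function returns False, the hash build can skip the pruning step for all remaining rows.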

It would be good to see some performance comparisons of the worst case
to see how much overhead the pruning code adds to Hash Join.  It may
well be that we need to consider two Hash Join paths, one with and one
without run-time pruning. It's pretty difficult to meaningfully cost,
as I already mentioned, however.

>> 4. Is it possible and worthwhile to extend the join partition prunning
>>    mechanism to support nestloop and mergejoin also?
>
>
> In my current knowledge, we have to build the inner table first for this
> optimization?  so hash join and sort merge should be OK, but nestloop should
> be impossible unless I missed something.

But run-time pruning already works for Nested Loops... I must be
missing something here.

I imagine for Merge Joins a more generic approach would be better by
implementing parameterised Merge Joins (a.k.a zigzag merge joins).
The Append/MergeAppend node could then select the correct partition(s)
based on the current parameter value at rescan. I don't think any code
changes would be needed in node[Merge]Append.c for that to work.  This
could also speed up Merge Joins to non-partitioned tables when an
index is providing presorted input to the join.

David



Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Mon, Aug 21, 2023 at 8:34 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
This feature looks good, but is it possible to know if we can prune
any subnodes before we pay the extra effort (building the Hash 
table, for each row... stuff)? 

It might be possible if we take the partition pruning into
consideration when estimating costs.  But it seems difficult to
calculate the costs accurately.
 
Maybe, at least, if we find that no subnodes can be skipped
during hashing, we can stop doing that work.

Yeah, this is what we can do.
 
In my current knowledge, we have to build the inner table first for this
optimization?  So hash join and sort-merge join should be OK, but
nestloop seems impossible, unless I missed something.

For nestloop and mergejoin, we'd always execute the outer side first.
So the Append/MergeAppend node needs to be on the inner side for the
join partition pruning to take effect.  For a mergejoin that will
explicitly sort the outer side, the Sort node would process all the
outer rows before scanning the inner side, so we can do the join
partition pruning with that.  For a nestloop, if we have a Material
node on the outer side, we can do that too, but I wonder if we'd ever
have such a plan in the real world, because we only add Material to the
inner side of a nestloop.

Thanks
Richard

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Tue, Aug 22, 2023 at 2:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
With Hash Join, it seems to me that the pruning must take place for
every row that makes it into the hash table.  There may be
cases where the unioned set of partitions simply yields every
partition and all the work results in no savings.  Pruning on a scalar
value seems much more likely to be able to prune away unneeded
Append/MergeAppend subnodes.

Yeah, you're right.  If we have 'pt HashJoin t', for a subnode of 'pt'
to be pruned, every row of 't' needs to be able to prune that
subnode.  The situation may improve if we have more than a 2-way hash
join, because the final surviving subnodes would be the intersection of
the matching subnodes in each Hash.

With a parameterized nestloop I agree that it's more likely to be able
to prune subnodes at rescan of Append/MergeAppend nodes based on scalar
values.

Sometimes we may just not generate a parameterized nestloop as the final
plan, such as when there are no indexes and no lateral references in the
Append/MergeAppend node.  In this case I think it would be great if we
can still do some partition pruning.  So I think this new 'join
partition pruning mechanism' (maybe this is not a proper name) should
be treated as a supplement to, not a substitute for, the current
run-time partition pruning based on parameterized nestloops, and it is
implemented that way in the patch.
 
Perhaps there can be something adaptive in Hash Join which stops
trying to prune when all partitions must be visited.  On a quick
glance at the patch, I don't see any code in ExecJoinPartitionPrune()
which gives up trying to prune once the number of members in
part_prune_result equals the number of prunable Append/MergeAppend
subnodes.

Yeah, we can do that.
 
But run-time pruning already works for Nested Loops... I must be
missing something here.

Here I mean a nestloop with a non-parameterized inner path.  As I
explained upthread, we need to have a Material node on the outer side
for that to work, which seems unlikely in the real world.

Thanks
Richard

Re: Support run-time partition pruning for hash join

From
Andy Fan
Date:


On Tue, Aug 22, 2023 at 5:43 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Mon, Aug 21, 2023 at 8:34 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
This feature looks good, but is it possible to know if we can prune
any subnodes before we pay the extra effort (building the Hash 
table, for each row... stuff)? 

It might be possible if we take the partition prunning into
consideration when estimating costs.  But it seems not easy to calculate
the costs accurately.

This is the one place where I worry about the future of this patch.
Personally, I do like this patch, but I'm not sure what happens if this
issue can't be fixed to make everyone happy, and fixing it perfectly
looks hopeless to me.  However, let's see what happens.
 
 
Maybe, at least, if we find that no subnodes can be skipped
during hashing, we can stop doing that work.

Yeah, this is what we can do.
 
cool. 
 
 
In my current knowledge, we have to build the inner table first for this
optimization?  So hash join and sort-merge join should be OK, but
nestloop seems impossible, unless I missed something.

For nestloop and mergejoin, we'd always execute the outer side first.
So the Append/MergeAppend node needs to be on the inner side for the
join partition pruning to take effect.  For a mergejoin that will
explicitly sort the outer side, the Sort node would process all the
outer rows before scanning the inner side, so we can do the join
partition pruning with that.  For a nestloop, if we have a Material
node on the outer side, we can do that too, but I wonder if we'd ever
have such a plan in the real world, because we only add Material to the
inner side of a nestloop.
 
This is more interesting than I expected, thanks for the explanation.

--
Best Regards
Andy Fan

Re: Support run-time partition pruning for hash join

From
Andy Fan
Date:


> fwiw, the current master totally ignores the cost reduction for run-time
> partition prune, even for init partition prune.  So in some real cases,
> pg chooses a hash join just because the cost of nest loop join is
> highly over estimated.

This is true about the existing code. It's a very tricky thing to cost
given that the parameter values are always unknown to the planner.
The best we have for these today is the various hardcoded constants in
selfuncs.h. While I do agree that it's not great that the costing code
knows nothing about run-time pruning, I also think that run-time
pruning during execution with parameterised nested loops is much more
likely to be able to prune partitions and save actual work than the
equivalent with Hash Joins.  It's more common for the planner to
choose to Nested Loop when there are fewer outer rows, so the pruning
code is likely to be called fewer times with Nested Loop than with
Hash Join.

Yes, I agree with this.  In my 4 years with PostgreSQL, I have run into
only 2 cases of this issue, and 1 of them involved joining 12+ tables
with run-time partition pruning for every join.  But this situation
causes more issues than generating a wrong plan: for a simple
SELECT * FROM p WHERE partkey = $1, a generic plan will never win, so we
have to pay the expensive planning cost for the partitioned table.

If we don't require very accurate costing for every case, like if we
only care about the '=' operator, which is the most common case, it
should be easier than the case here, since we just need to know whether
only 1 partition will survive the pruning, but we don't care about which
one it is.  I'd like to discuss that in another thread, and leave this
thread for Richard's patch only.

--
Best Regards
Andy Fan

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Tue, Aug 22, 2023 at 2:38 PM David Rowley <dgrowleyml@gmail.com> wrote:
It would be good to see some performance comparisons of the worst case
to see how much overhead the pruning code adds to Hash Join.  It may
well be that we need to consider two Hash Join paths, one with and one
without run-time pruning. It's pretty difficult to meaningfully cost,
as I already mentioned, however.

I performed some performance comparisons of the worst case with two
tables as below:

1. The partitioned table has 1000 children, and 100,000 tuples in total.

2. The other table is designed so that
    a) its tuples occupy every partition of the partitioned table, so
       that no partitions can be pruned during execution,
    b) tuples belonging to the same partition are placed together, so
       that we need to scan all of its tuples before we can know that no
       pruning will happen and stop trying to prune,
    c) the tuples are unique on the hash key, so as to minimize the cost
       of the hash probe and highlight the impact of the pruning
       code.

Here is the execution time (ms) I get with different sizes of the other
table.

tuples  unpatched   patched  (relative overhead)
10000   45.74       53.46   (+0.17)
20000   54.48       70.18   (+0.29)
30000   62.57       85.18   (+0.36)
40000   69.14       99.19   (+0.43)
50000   76.46       111.09  (+0.45)
60000   82.68       126.37  (+0.53)
70000   92.69       137.89  (+0.49)
80000   94.49       151.46  (+0.60)
90000   101.53      164.93  (+0.62)
100000  107.22      178.44  (+0.66)

So the overhead the pruning code adds to Hash Join is too large to be
accepted :(.  I think we need to solve this problem before this new
partition pruning mechanism can be of any use in practice, but how?
Some thoughts currently in my mind include

1) we try our best to estimate the cost of this partition pruning when
creating hash join paths, and decide based on the cost whether to use it
or not.  But this does not seem to be an easy task.

2) we use some heuristics when executing hash join, such as when we
notice that a $threshold percentage of the partitions must be visited
we just abort the pruning and assume that no partitions can be pruned.

Any thoughts or comments?

Thanks
Richard

Re: Support run-time partition pruning for hash join

From
David Rowley
Date:
On Thu, 24 Aug 2023 at 21:27, Richard Guo <guofenglinux@gmail.com> wrote:
> I performed some performance comparisons of the worst case with two
> tables as below:
>
> 1. The partitioned table has 1000 children, and 100,000 tuples in total.
>
> 2. The other table is designed that
>     a) its tuples occupy every partition of the partitioned table so
>        that no partitions can be pruned during execution,
>     b) tuples belong to the same partition are placed together so that
>        we need to scan all its tuples before we could know that no
>        pruning would happen and we could stop trying to prune,
>     c) the tuples are unique on the hash key so as to minimize the cost
>        of hash probe, so that we can highlight the impact of the pruning
>        codes.
>
> Here is the execution time (ms) I get with different sizes of the other
> table.
>
> tuples  unpatched   patched
> 10000   45.74       53.46   (+0.17)
> 20000   54.48       70.18   (+0.29)
> 30000   62.57       85.18   (+0.36)
> 40000   69.14       99.19   (+0.43)
> 50000   76.46       111.09  (+0.45)
> 60000   82.68       126.37  (+0.53)
> 70000   92.69       137.89  (+0.49)
> 80000   94.49       151.46  (+0.60)
> 90000   101.53      164.93  (+0.62)
> 100000  107.22      178.44  (+0.66)
>
> So the overhead the pruning code adds to Hash Join is too large to be
> accepted :(.

Agreed.  Run-time pruning is pretty fast to execute, but so is
inserting a row into a hash table.

> I think we need to solve this problem first before we can
> make this new partition pruning mechanism some useful in practice, but
> how?  Some thoughts currently in my mind include
>
> 1) we try our best to estimate the cost of this partition pruning when
> creating hash join paths, and decide based on the cost whether to use it
> or not.  But this does not seem to be an easy task.

I think we need to consider another Hash Join path when we detect that
the outer side of the Hash Join involves scanning a partitioned table.

I'd suggest writing some cost which costs an execution of run-time
pruning.  With LIST and RANGE you probably want something like
cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
for the binary search over the sorted datum array. For HASH
partitions, something like cpu_operator_cost * npartcols once for each
hashed tuple.

You'll need to then come up with some counter costs to subtract from
the Append/MergeAppend.  This is tricky, as discussed.  Just come up
with something crude for now.

To start with, it could just be as crude as:

total_costs *= (Min(expected_outer_rows, n_append_subnodes) /
n_append_subnodes);

i.e. assume that every outer joined row will require exactly 1 new
partition, up to the total number of partitions.  That's pretty much the
worst case, but it'll at least allow the optimisation to work for
cases where the hash table is expected to contain just a tiny
number of rows (fewer than the number of partitions).

To make it better, you might want to look at join selectivity
estimation and see if you can find something there to influence
something better.
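The suggested costing could be sketched as follows (hypothetical Python, using PostgreSQL's default cpu_operator_cost of 0.0025; function names are illustrative, not from any patch):

```python
import math

CPU_OPERATOR_COST = 0.0025   # PostgreSQL's default cpu_operator_cost

def prune_cost_per_tuple(nparts, strategy, npartcols=1):
    """Cost of one pruning step: a binary search over the sorted datum
    array for LIST/RANGE, one hash op per partition key column for HASH."""
    if strategy in ("list", "range"):
        return CPU_OPERATOR_COST * math.log2(nparts)
    return CPU_OPERATOR_COST * npartcols

def discounted_append_cost(total_cost, expected_hashed_rows, n_subnodes):
    """Crude counter-cost: assume each hashed row keeps exactly one new
    partition, up to the total number of Append subnodes."""
    return total_cost * min(expected_hashed_rows, n_subnodes) / n_subnodes

# With 1000 partitions and only 10 rows expected in the hash table, the
# Append cost shrinks to 1%, while pruning costs only about 10 * 0.025.
print(discounted_append_cost(5000.0, 10, 1000))   # 50.0
```

With many expected hash-side rows, the discount vanishes and the pruning cost remains, which is the case where the optimisation should be skipped.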

> 2) we use some heuristics when executing hash join, such as when we
> notice that a $threshold percentage of the partitions must be visited
> we just abort the pruning and assume that no partitions can be pruned.

You could likely code in something that checks
bms_num_members(jpstate->part_prune_result) to see if it still remains
below the total number of Append/MergeAppend subplans whenever, say, the
lower 8 bits of hashtable->totalTuples are all zero.  You can just give
up doing any further pruning when all partitions are already required.
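That periodic check could be sketched like this (illustrative Python; the real code would use bms_num_members() and hashtable->totalTuples):

```python
def should_give_up(total_tuples, n_matched, n_subplans):
    """Every 256th hashed tuple (lower 8 bits of the running count all
    zero), check whether pruning can still exclude anything; give up
    once every subplan is already required."""
    if int(total_tuples) & 0xFF:
        return False                  # not at a checkpoint, keep pruning
    return n_matched >= n_subplans

# Checked only at tuple counts 256, 512, ... so the test itself stays cheap.
```

The bitmask keeps the check itself nearly free, since it fires only once every 256 tuples.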

David



Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Fri, Aug 25, 2023 at 11:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 24 Aug 2023 at 21:27, Richard Guo <guofenglinux@gmail.com> wrote:
> I think we need to solve this problem first before we can
> make this new partition pruning mechanism some useful in practice, but
> how?  Some thoughts currently in my mind include
>
> 1) we try our best to estimate the cost of this partition pruning when
> creating hash join paths, and decide based on the cost whether to use it
> or not.  But this does not seem to be an easy task.

I think we need to consider another Hash Join path when we detect that
the outer side of the Hash Join involves scanning a partitioned table.

I'd suggest writing some cost which costs an execution of run-time
pruning.  With LIST and RANGE you probably want something like
cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
for the binary search over the sorted datum array. For HASH
partitions, something like cpu_operator_cost * npartcols once for each
hashed tuple.

You'll need to then come up with some counter costs to subtract from
the Append/MergeAppend.  This is tricky, as discussed.  Just come up
with something crude for now.

To start with, it could just be as crude as:

total_costs *= (Min(expected_outer_rows, n_append_subnodes)  /
n_append_subnodes);

i.e assume that every outer joined row will require exactly 1 new
partition up to the total number of partitions.  That's pretty much
worst-case, but it'll at least allow the optimisation to work for
cases like where the hash table is expected to contain just a tiny
number of rows (fewer than the number of partitions)

To make it better, you might want to look at join selectivity
estimation and see if you can find something there to influence
something better.

Thank you for the suggestion.  I will take some time considering it.

When we have multiple join levels, it seems the situation becomes even
more complex.  One Append/MergeAppend node might be pruned by more than
one Hash node, and one Hash node might provide pruning for more than one
Append/MergeAppend node.  For instance, below is the plan from the test
case added in the v1 patch:

explain (analyze, costs off, summary off, timing off)
select * from tprt p1
   inner join tprt p2 on p1.col1 = p2.col1
   right join tbl1 t on p1.col1 = t.col1 and p2.col1 = t.col1;
                               QUERY PLAN
-------------------------------------------------------------------------
 Hash Right Join (actual rows=2 loops=1)
   Hash Cond: ((p1.col1 = t.col1) AND (p2.col1 = t.col1))
   ->  Hash Join (actual rows=3 loops=1)
         Hash Cond: (p1.col1 = p2.col1)
         ->  Append (actual rows=3 loops=1)
               ->  Seq Scan on tprt_1 p1_1 (never executed)
               ->  Seq Scan on tprt_2 p1_2 (actual rows=3 loops=1)
               ->  Seq Scan on tprt_3 p1_3 (never executed)
               ->  Seq Scan on tprt_4 p1_4 (never executed)
               ->  Seq Scan on tprt_5 p1_5 (never executed)
               ->  Seq Scan on tprt_6 p1_6 (never executed)
         ->  Hash (actual rows=3 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Append (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_1 p2_1 (never executed)
                     ->  Seq Scan on tprt_2 p2_2 (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_3 p2_3 (never executed)
                     ->  Seq Scan on tprt_4 p2_4 (never executed)
                     ->  Seq Scan on tprt_5 p2_5 (never executed)
                     ->  Seq Scan on tprt_6 p2_6 (never executed)
   ->  Hash (actual rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on tbl1 t (actual rows=2 loops=1)
(23 rows)

In this plan, the Append node of 'p1' is pruned by two Hash nodes: Hash
node of 't' and Hash node of 'p2'.  Meanwhile, the Hash node of 't'
provides pruning for two Append nodes: Append node of 'p1' and Append
node of 'p2'.

In this case, meaningfully costing for the partition pruning seems even
more difficult.  Do you have any suggestion on that?
 
> 2) we use some heuristics when executing hash join, such as when we
> notice that a $threshold percentage of the partitions must be visited
> we just abort the pruning and assume that no partitions can be pruned.

You could likely code in something that checks
bms_num_members(jpstate->part_prune_result) to see if it still remains
below the total number of Append/MergeAppend subplans whenever, say, the
lower 8 bits of hashtable->totalTuples are all zero.  You can just give
up doing any further pruning when all partitions are already required.

Yeah, we can do that.  While this may not help in the tests I performed
for the worst case, because the table on the hash side is designed so
that tuples belonging to the same partition are placed together, meaning
we need to scan almost all of its tuples before we can know that all
partitions are already required, I think this might help a lot in the
real world.

Thanks
Richard

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Fri, Aug 25, 2023 at 11:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
I'd suggest writing some cost which costs an execution of run-time
pruning.  With LIST and RANGE you probably want something like
cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
for the binary search over the sorted datum array. For HASH
partitions, something like cpu_operator_cost * npartcols once for each
hashed tuple.

You'll need to then come up with some counter costs to subtract from
the Append/MergeAppend.  This is tricky, as discussed.  Just come up
with something crude for now.

To start with, it could just be as crude as:

total_costs *= (Min(expected_outer_rows, n_append_subnodes)  /
n_append_subnodes);

i.e assume that every outer joined row will require exactly 1 new
partition up to the total number of partitions.  That's pretty much
worst-case, but it'll at least allow the optimisation to work for
cases like where the hash table is expected to contain just a tiny
number of rows (fewer than the number of partitions)

To make it better, you might want to look at join selectivity
estimation and see if you can find something there to influence
something better.

I had a go at writing some costing code following your suggestion.
That's compute_partprune_cost() in the v2 patch.

For the hash side, this function computes the pruning cost as
cpu_operator_cost * LOG2(nparts) * inner_rows for LIST and RANGE, and
cpu_operator_cost * nparts * inner_rows for HASH.

For the Append/MergeAppend side, this function first estimates the size
of outer side that matches, using the same idea as we estimate the
joinrel size for JOIN_SEMI.  Then it assumes that each outer joined row
occupies one new partition (the worst case) and computes how much cost
can be saved from partition pruning.

If the cost saved from the Append/MergeAppend side is larger than the
pruning cost from the Hash side, then we say that partition pruning is a
win.

Note that this costing logic runs for each Append-Hash pair, so it copes
with the case where we have multiple join levels.
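The decision itself might be modeled crudely like this (hypothetical Python, a simplified stand-in for what compute_partprune_cost() does, not its actual code):

```python
def pruning_is_win(inner_rows, nparts, per_tuple_prune_cost, append_cost):
    """Pruning wins if the Append cost it can save (assuming, worst case,
    that each hashed row keeps one new partition) exceeds the total cost
    of running the pruning step for every hash-side row."""
    prune_cost = per_tuple_prune_cost * inner_rows
    kept = min(inner_rows, nparts)
    saved = append_cost * (nparts - kept) / nparts
    return saved > prune_cost

# Small hash side vs. 1000 partitions: large expected savings, cheap pruning.
print(pruning_is_win(100, 1000, 0.025, 100.0))    # True
# Hash side as large as the partition count: nothing to save, pruning rejected.
print(pruning_is_win(1000, 1000, 0.025, 100.0))   # False
```

Under this worst-case assumption the model is very conservative: as soon as the hash side has as many rows as there are partitions, the estimated saving drops to zero and pruning is disabled.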

With this costing logic added, I performed the same performance
comparisons of the worst case as in [1], and here is what I got.

tuples  unpatched   patched  (relative diff)
10000   44.66       44.37   -0.006493506
20000   52.41       52.29   -0.002289639
30000   61.11       61.12   +0.000163639
40000   67.87       68.24   +0.005451599
50000   74.51       74.75   +0.003221044
60000   82.3        81.55   -0.009113001
70000   87.16       86.98   -0.002065168
80000   93.49       93.89   +0.004278532
90000   101.52      100.83  -0.00679669
100000  108.34      108.56  +0.002030644

So the costing logic successfully avoids performing the partition
pruning in the worst case.

I also tested the cases where partition pruning is possible with
different sizes of the hash side.

tuples  unpatched   patched     change
100     36.86       2.4     -0.934888768
200     35.87       2.37    -0.933928074
300     35.95       2.55    -0.92906815
400     36.4        2.63    -0.927747253
500     36.39       2.85    -0.921681781
600     36.32       2.97    -0.918226872
700     36.6        3.23    -0.911748634
800     36.88       3.44    -0.906724512
900     37.02       3.46    -0.906537007
1000    37.25       37.21   -0.001073826

The first 9 rows show that the costing logic allows the partition
pruning to be performed and the pruning turns out to be a big win.  The
last row shows that the partition pruning is disallowed by the costing
logic because it thinks no partition can be pruned (we have 1000
partitions in total).

So it seems that the new costing logic is quite crude and tends to be
very conservative, but it can help avoid the large overhead in the worst
cases.  I think this might be a good start to push this patch forward.

Any thoughts or comments?

[1] https://www.postgresql.org/message-id/CAMbWs49%2Bp6hBxXJHFiSwOtPCSkAHwhJj3hTpCR_pmMiUUVLZ1Q%40mail.gmail.com

Thanks
Richard
Attachment

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Tue, Aug 29, 2023 at 6:41 PM Richard Guo <guofenglinux@gmail.com> wrote:
So it seems that the new costing logic is quite crude and tends to be
very conservative, but it can help avoid the large overhead in the worst
cases.  I think this might be a good start to push this patch forward.

Any thoughts or comments?

I rebased this patch over the latest master.  Nothing changed except
that I revised the newly added test case to make it more stable.

However, the cfbot indicates that there are test cases that fail on
FreeBSD [1] (no failure on other platforms).  So I set up FreeBSD 13
locally but just cannot reproduce the failure.  I must be doing
something wrong.  Can anyone give me some hints or suggestions?

FYI. The failure looks like:

 explain (costs off)
   select p2.a, p1.c from permtest_parent p1 inner join permtest_parent p2
   on p1.a = p2.a and left(p1.c, 3) ~ 'a1$';
-                     QUERY PLAN
-----------------------------------------------------
- Hash Join
-   Hash Cond: (p2.a = p1.a)
-   ->  Seq Scan on permtest_grandchild p2
-   ->  Hash
-         ->  Seq Scan on permtest_grandchild p1
-               Filter: ("left"(c, 3) ~ 'a1$'::text)
-(6 rows)
-
+ERROR:  unrecognized node type: 1130127496

[1] https://api.cirrus-ci.com/v1/artifact/task/5334808075698176/testrun/build/testrun/regress/regress/regression.diffs

Thanks
Richard
Attachment

Re: Support run-time partition pruning for hash join

From
Alexander Lakhin
Date:
Hello Richard,

02.11.2023 14:19, Richard Guo wrote:

However, the cfbot indicates that there are test cases that fail on
FreeBSD [1] (no failure on other platforms).  So I set up a FreeBSD-13
locally but just cannot reproduce the failure.  I must be doing
something wrong.  Can anyone give me some hints or suggestions?

FYI. The failure looks like:

 explain (costs off)
   select p2.a, p1.c from permtest_parent p1 inner join permtest_parent p2
   on p1.a = p2.a and left(p1.c, 3) ~ 'a1$';
-                     QUERY PLAN
-----------------------------------------------------
- Hash Join
-   Hash Cond: (p2.a = p1.a)
-   ->  Seq Scan on permtest_grandchild p2
-   ->  Hash
-         ->  Seq Scan on permtest_grandchild p1
-               Filter: ("left"(c, 3) ~ 'a1$'::text)
-(6 rows)
-
+ERROR:  unrecognized node type: 1130127496

I've managed to reproduce that failure on my Ubuntu with:
CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES" ./configure ... make check
...
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                    QUERY PLAN                    
---------------------------------------------------
- Sort
-   Sort Key: t1.a, t2.b
-   ->  Hash Right Join
-         Hash Cond: (t2.b = t1.a)
-         ->  Append
-               ->  Seq Scan on prt2_p1 t2_1
-               ->  Seq Scan on prt2_p2 t2_2
-               ->  Seq Scan on prt2_p3 t2_3
-         ->  Hash
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                           Filter: (b = 0)
-                     ->  Seq Scan on prt1_p2 t1_2
-                           Filter: (b = 0)
-                     ->  Seq Scan on prt1_p3 t1_3
-                           Filter: (b = 0)
-(16 rows)
-
+ERROR:  unrecognized node type: -1465804424
...

As far as I can see from https://cirrus-ci.com/task/6642692659085312,
the FreeBSD host has the following CPPFLAGS specified:
-DRELCACHE_FORCE_RELEASE
-DCOPY_PARSE_PLAN_TREES
-DWRITE_READ_PARSE_PLAN_TREES
-DRAW_EXPRESSION_COVERAGE_TEST
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Best regards,
Alexander

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Sat, Nov 4, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
02.11.2023 14:19, Richard Guo wrote:
However, the cfbot indicates that there are test cases that fail on
FreeBSD [1] (no failure on other platforms).  So I set up a FreeBSD-13
locally but just cannot reproduce the failure.  I must be doing
something wrong.  Can anyone give me some hints or suggestions?
I've managed to reproduce that failure on my Ubuntu with:
CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES" ./configure ... make check

Wow, thank you so much.  You saved me a lot of time.  It turns out that
it was caused by me not making JoinPartitionPruneInfo a node.  The same
issue can also exist for JoinPartitionPruneCandidateInfo - if you
pprint(root) at some point you'll see 'could not dump unrecognized node
type' warning.

Fixed this issue in v4.

Thanks
Richard
Attachment

Re: Support run-time partition pruning for hash join

From
Alexander Lakhin
Date:
Hello Richard,

06.11.2023 06:05, Richard Guo wrote:

Fixed this issue in v4.


Please look at a warning and an assertion failure triggered by the
following script:
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = '1kB';

create table t1 (i int) partition by range (i);
create table t1_1 partition of t1 for values from (1) to (2);
create table t1_2 partition of t1 for values from (2) to (3);
insert into t1 values (1), (2);

create table t2(i int);
insert into t2 values (1), (2);
analyze t1, t2;

select * from t1 right join t2 on t1.i = t2.i;

2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING:  Join partition pruning $0 has not been performed yet.
TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line: 846, PID: 3747061

Best regards,
Alexander

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Mon, Nov 6, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
Please look at a warning and an assertion failure triggered by the
following script:
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = '1kB';

create table t1 (i int) partition by range (i);
create table t1_1 partition of t1 for values from (1) to (2);
create table t1_2 partition of t1 for values from (2) to (3);
insert into t1 values (1), (2);

create table t2(i int);
insert into t2 values (1), (2);
analyze t1, t2;

select * from t1 right join t2 on t1.i = t2.i;

2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING:  Join partition pruning $0 has not been performed yet.
TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line: 846, PID: 3747061

Thanks for the report!  I failed to take care of the parallel-hashjoin
case, and I have to admit that it's not clear to me yet how we should do
join partition pruning in that case.

For now I think it's better to just avoid performing join partition
pruning for parallel hashjoin, so that the patch doesn't become too
complex for review.  We can always extend it in the future.

I have done that in v5.  Thanks for testing!

Thanks
Richard
Attachment

Re: Support run-time partition pruning for hash join

From
vignesh C
Date:
On Tue, 7 Nov 2023 at 13:25, Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> On Mon, Nov 6, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>
>> Please look at a warning and an assertion failure triggered by the
>> following script:
>> set parallel_setup_cost = 0;
>> set parallel_tuple_cost = 0;
>> set min_parallel_table_scan_size = '1kB';
>>
>> create table t1 (i int) partition by range (i);
>> create table t1_1 partition of t1 for values from (1) to (2);
>> create table t1_2 partition of t1 for values from (2) to (3);
>> insert into t1 values (1), (2);
>>
>> create table t2(i int);
>> insert into t2 values (1), (2);
>> analyze t1, t2;
>>
>> select * from t1 right join t2 on t1.i = t2.i;
>>
>> 2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING:  Join partition pruning $0 has not been performed yet.
>> TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line: 846, PID: 3747061
>
>
> Thanks for the report!  I failed to take care of the parallel-hashjoin
> case, and I have to admit that it's not clear to me yet how we should do
> join partition pruning in that case.
>
> For now I think it's better to just avoid performing join partition
> pruning for parallel hashjoin, so that the patch doesn't become too
> complex for review.  We can always extend it in the future.
>
> I have done that in v5.  Thanks for testing!

CFBot shows that the patch does not apply anymore as in [1]:
=== Applying patches on top of PostgreSQL commit ID
924d046dcf55887c98a1628675a30f4b0eebe556 ===
=== applying patch
./v5-0001-Support-run-time-partition-pruning-for-hash-join.patch
...
patching file src/include/nodes/plannodes.h
...
patching file src/include/optimizer/cost.h
Hunk #1 FAILED at 211.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/optimizer/cost.h.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4512.log

Regards,
Vignesh



Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Sat, Jan 27, 2024 at 11:29 AM vignesh C <vignesh21@gmail.com> wrote:
CFBot shows that the patch does not apply anymore as in [1]:

Please post an updated version for the same.

Attached is an updated patch.  Nothing else has changed.

Thanks
Richard
Attachment

Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:

On Tue, Jan 30, 2024 at 10:33 AM Richard Guo <guofenglinux@gmail.com> wrote:
Attached is an updated patch.  Nothing else has changed.

Here is another rebase over master so it applies again.  Nothing else
has changed.

Thanks
Richard
Attachment

Re: Support run-time partition pruning for hash join

From
Andrei Lepikhov
Date:
On 19/3/2024 07:12, Richard Guo wrote:
> 
> On Tue, Jan 30, 2024 at 10:33 AM Richard Guo <guofenglinux@gmail.com 
> <mailto:guofenglinux@gmail.com>> wrote:
> 
>     Attached is an updated patch.  Nothing else has changed.
> 
> 
> Here is another rebase over master so it applies again.  Nothing else
> has changed.
The patch doesn't apply to master now.
I wonder why this work has stalled - it looks highly profitable in 
the case of foreign partitions. And the idea of cost-based enablement 
makes it a must-have, I think.
I have just skimmed through the patch and have a couple of questions:
1. It makes sense to calculate the cost and remember the minimum number 
of pruned partitions at which HJ with probing is still profitable. Why 
don't we disable this probing at run time if we see that the number of 
potentially prunable partitions is already too low?
2. Maybe I misunderstood the code, but having matched a hashed tuple 
with a partition, it makes sense for subsequent tuples to reduce the 
number of probing expressions, because we already know that the 
partition will not be pruned.

-- 
regards, Andrei Lepikhov




Re: Support run-time partition pruning for hash join

From
David Rowley
Date:
On Tue, 22 Aug 2023 at 21:51, Richard Guo <guofenglinux@gmail.com> wrote:
> Sometimes we may just not generate a parameterized nestloop as the final plan,
> such as when there are no indexes and no lateral references in the
> Append/MergeAppend node.  In this case I think it would be great if we
> can still do some partition pruning.

(I just read through this thread again to remind myself of where it's at.)

Here are my current thoughts: You've done some costing work which will
only prefer the part-prune hash join path in very conservative cases.
This is to reduce the risk of performance regressions caused by
running the pruning code too often in cases where it's less likely to
be able to prune any partitions.

Now, I'm not saying we shouldn't ever do this pruning hash join stuff,
but what I think might be better to do as a first step is to have
partitioned tables create a parameterized path on their partition key,
and a prefix thereof for RANGE partitioned tables. This would allow
parameterized nested loop joins when no index exists on the partition
key.

Right now you can get a plan that does this if you do:

create table p (col int);
create table pt (partkey int) partition by list(partkey);
create table pt1 partition of pt for values in(1);
create table pt2 partition of pt for values in(2);
insert into p values(1);
insert into pt values(1);

explain (analyze, costs off, timing off, summary off)
SELECT * FROM p, LATERAL (SELECT * FROM pt WHERE p.col = pt.partkey OFFSET 0);
                        QUERY PLAN
----------------------------------------------------------
 Nested Loop (actual rows=0 loops=1)
   ->  Seq Scan on p (actual rows=1 loops=1)
   ->  Append (actual rows=0 loops=1)
         ->  Seq Scan on pt1 pt_1 (actual rows=0 loops=1)
               Filter: (p.col = partkey)
         ->  Seq Scan on pt2 pt_2 (never executed)
               Filter: (p.col = partkey)

You get the parameterized nested loop. Great! But, as soon as you drop
the OFFSET 0, the lateral join will be converted to an inner join and
Nested Loop won't look so great when it's not parameterized.

explain (analyze, costs off, timing off, summary off)
SELECT * FROM p, LATERAL (SELECT * FROM pt WHERE p.col = pt.partkey);
                        QUERY PLAN
----------------------------------------------------------
 Hash Join (actual rows=1 loops=1)
   Hash Cond: (pt.partkey = p.col)
   ->  Append (actual rows=1 loops=1)
         ->  Seq Scan on pt1 pt_1 (actual rows=1 loops=1)
         ->  Seq Scan on pt2 pt_2 (actual rows=0 loops=1)
   ->  Hash (actual rows=1 loops=1)
         Buckets: 4096  Batches: 2  Memory Usage: 32kB
         ->  Seq Scan on p (actual rows=1 loops=1)

Maybe instead of inventing a very pessimistic part prune Hash Join, it
might be better to make the above work without the LATERAL + OFFSET 0
by creating the parameterized Seq Scan paths.  That's going to be
an immense help when the non-partitioned relation just has a small
number of rows, which I think your costing favoured anyway.

What do you think?

David



Re: Support run-time partition pruning for hash join

From
Richard Guo
Date:
On Fri, Sep 6, 2024 at 9:22 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Maybe instead of inventing a very pessimistic part prune Hash Join, it
> might be better to make the above work without the LATERAL + OFFSET 0
> by creating the parameterized paths Seq Scan paths. That's going to be
> an immense help when the non-partitioned relation just has a small
> number of rows, which I think your costing favoured anyway.
>
> What do you think?

This approach seems promising.  It reminds me of the discussion about
pushing join clauses into a seqscan [1].  But I think there are two
problems that we need to address to make it work.

* Currently, the costing code does not take run-time pruning into
consideration.  How should we calculate the costs of the parameterized
paths on partitioned tables?

* This approach generates additional paths at the scan level, which
may not be easily compared with regular scan paths.  As a result, we
might need to retain these paths at every level of the join tree.  I'm
afraid this could lead to a significant increase in planning time in
some cases.  We need to find a way to avoid regressions in planning
time.

[1] https://postgr.es/m/3478841.1724878067@sss.pgh.pa.us

Thanks
Richard



Re: Support run-time partition pruning for hash join

From
David Rowley
Date:
On Fri, 6 Sept 2024 at 19:19, Richard Guo <guofenglinux@gmail.com> wrote:
> * Currently, the costing code does not take run-time pruning into
> consideration.  How should we calculate the costs of the parameterized
> paths on partitioned tables?

Couldn't we assume total_cost = total_cost / n_append_children for
equality conditions and do something with DEFAULT_INEQ_SEL and
DEFAULT_RANGE_INEQ_SEL for more complex cases?  I understand we
probably need to do something about this to give the planner any
chance of actually choosing these Paths, so hacking something in there
to test that the idea is sound before going to the trouble of refining
the cost model seems like a good idea.
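That crude cost model could be sketched as follows.  The function name
and parameters are illustrative; the two selectivity constants match the
defaults in PostgreSQL's selfuncs.h:

```c
#include <assert.h>
#include <stdbool.h>

/* Default selectivity constants, as defined in selfuncs.h. */
#define DEFAULT_INEQ_SEL        0.3333333333333333
#define DEFAULT_RANGE_INEQ_SEL  0.005

/*
 * Crude per-rescan cost for a parameterized Append path on the
 * partition key: an equality parameter is assumed to prune down to
 * exactly one child, while inequality and range conditions fall back
 * to the default selectivity constants.
 */
static double
crude_param_append_cost(double unparam_total_cost, int n_append_children,
                        bool is_equality, bool is_range)
{
    if (is_equality)
        return unparam_total_cost / n_append_children;
    if (is_range)
        return unparam_total_cost * DEFAULT_RANGE_INEQ_SEL;
    return unparam_total_cost * DEFAULT_INEQ_SEL;
}
```

This is deliberately optimistic for equality (run-time pruning may keep
more than one child when the parameter value varies), but it is enough
to let the planner consider such paths at all.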

> * This approach generates additional paths at the scan level, which
> may not be easily compared with regular scan paths.  As a result, we
> might need to retain these paths at every level of the join tree.  I'm
> afraid this could lead to a significant increase in planning time in
> some cases.  We need to find a way to avoid regressions in planning
> time.

How about just creating these Paths for partitioned tables (and
partitions) when there's an EquivalenceClass containing multiple
relids on the partition key?  I think those are about the only cases
that could benefit, so I think it makes sense to restrict making the
additional Paths for that case.
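The gating condition could look something like the sketch below.  The
types and fields are simplified stand-ins for PostgreSQL's planner
structures (EquivalenceClass and its members), not the real API:

```c
#include <stdbool.h>

/* Simplified stand-in for one EquivalenceClass member. */
typedef struct EcMember
{
    int         relid;          /* relation the member expression comes from */
    bool        matches_partkey;    /* matches our table's partition key? */
} EcMember;

/*
 * Only build the extra parameterized paths for a partitioned table when
 * some EquivalenceClass ties its partition key to another relation,
 * i.e. the class contains a member matching our partition key plus at
 * least one member from a different relid.
 */
static bool
ec_justifies_param_paths(const EcMember *members, int nmembers, int our_relid)
{
    bool        partkey_seen = false;
    bool        other_rel_seen = false;

    for (int i = 0; i < nmembers; i++)
    {
        if (members[i].relid == our_relid && members[i].matches_partkey)
            partkey_seen = true;
        else if (members[i].relid != our_relid)
            other_rel_seen = true;
    }
    return partkey_seen && other_rel_seen;
}
```

Restricting path creation this way keeps the extra-path overhead away
from queries where no join clause could ever drive the pruning.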

David