Thread: [HACKERS] Runtime Partition Pruning

[HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
I have been working on implementing runtime partition pruning, which
would greatly improve the performance of queries involving partitioned
tables.

PFA the POC which can be applied over Amit's patch for faster
partition pruning [1] and Dilip's refactor patch [2] on commit
2c74e6c1dcc5002fa8b822e5757f6c95d899fb7a.

[1] https://www.postgresql.org/message-id/e02923ea-a117-a6ad-6a3e-ea5e1ba41ece%40lab.ntt.co.jp

[2] https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com

There are a couple of things that need improvement or an opinion:
In get_rel_partition_info, we store minop and maxop for each partition
key. For the equality case, which is the most common, both would store
the same value. We could improve this by storing equal(bound, bound,
...) instead of repeating the same value.

get_partitions_for_keys currently returns the list of partitions valid
for the given keys, but for a table with many partitions this list
could be very long. For a range qual (key > a AND key < b) we could
instead store only the min and max partition numbers and increment
as_whichplan by 1 until we reach the max partition number. For
non-contiguous partitions, we would still need the list.

Currently, the partition numbers are recalculated whenever chgParam is
set. This can be optimised by skipping the recalculation when only a
non-partition-key column has changed, reusing the partitions already
selected.
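
For example (a sketch only; tbl1 is an assumed driving table, while
tprt is the partitioned table described further below):

PREPARE nl_q (int) AS
  SELECT *
  FROM tbl1
  JOIN tprt ON tprt.col2 = tbl1.col2   -- col2 is not a partition key
  WHERE tprt.col1 < $1;                -- partition-key qual fixed per execution

Here the inner Append is rescanned with a new tbl1.col2 value for each
outer row, so chgParam is set every time, but the partition-key qual
col1 < $1 does not change between rescans and the partition set chosen
on the first rescan could simply be reused.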

Others:
- better handling of multiple keys
- allow the use of expressions in the quals (see the example after this list)
- use min_incl and max_incl properly in get_partitions_for_keys
- pruning during function calls
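
As an example of the expressions item (one possible reading, using the
tprt table described further below), a qual where the value compared
with the partition key is an expression rather than a bare Param or
Const is not pruned yet:

PREPARE prstmt_expr (int) AS SELECT * FROM tprt WHERE col1 < $1 + 10;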


Currently, with the patch, during a NestLoop join:
Nested Loop
-> SeqScan tbl1
-> Append
  -> Index Scan p01
  -> Index Scan p02
  -> Index Scan p03

For each tuple from tbl1, only the relevant partition (p01 or p02 or
p03) will be scanned.
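
A sketch of the kind of setup this covers (the table and partition
definitions here, and tbl1, are assumed for illustration; they are not
taken from the patch):

CREATE TABLE part_tbl (key int, val int) PARTITION BY RANGE (key);
CREATE TABLE p01 PARTITION OF part_tbl FOR VALUES FROM (0) TO (100);
CREATE TABLE p02 PARTITION OF part_tbl FOR VALUES FROM (100) TO (200);
CREATE TABLE p03 PARTITION OF part_tbl FOR VALUES FROM (200) TO (300);
CREATE INDEX ON p01 (key);
CREATE INDEX ON p02 (key);
CREATE INDEX ON p03 (key);

-- join on the partition key: for each tbl1 row only the index scan of
-- the matching partition under the Append needs to be executed
SELECT * FROM tbl1 JOIN part_tbl ON part_tbl.key = tbl1.col1;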


--- Prepared Statement Behaviour with patch---

Table Description:
                                   Table "public.tprt"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 col1   | integer |           |          |         | plain   |              |
 col2   | integer |           |          |         | plain   |              |
Partition key: RANGE (col1)
Partitions: tprt_1 FOR VALUES FROM (1) TO (50001),
            tprt_2 FOR VALUES FROM (50001) TO (100001),
            tprt_3 FOR VALUES FROM (100001) TO (200001)
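
The prepared statement and the data are not shown here; an assumed
setup consistent with the table description above and the plans below
is:

CREATE TABLE tprt (col1 int, col2 int) PARTITION BY RANGE (col1);
CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001);
CREATE TABLE tprt_2 PARTITION OF tprt FOR VALUES FROM (50001) TO (100001);
CREATE TABLE tprt_3 PARTITION OF tprt FOR VALUES FROM (100001) TO (200001);
INSERT INTO tprt SELECT i, i FROM generate_series(1, 200000) i;
ANALYZE tprt;

PREPARE prstmt_select (int) AS SELECT * FROM tprt WHERE col1 < $1;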

EXPLAIN EXECUTE prstmt_select(15);

                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1736.55 rows=1 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..849.15 rows=16724 width=8)
         Filter: (col1 < $1)
(3 rows)

EXPLAIN EXECUTE prstmt_select(60000);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1736.55 rows=2 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..849.15 rows=16724 width=8)
         Filter: (col1 < $1)
   ->  Seq Scan on tprt_2  (cost=0.00..849.15 rows=16724 width=8)
         Filter: (col1 < $1)
(5 rows)


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello all,

Here is the updated patch, rebased over v10 of Amit Langote's
path-towards-faster-pruning patch [1]. It modifies the PartScanKeyInfo
struct to hold expressions, which are then evaluated by the executor to
fetch the correct partitions using the function.

The code still chooses the custom plan instead of the generic plan for
the prepared statements. I am working on it. The following output is
after adding a hack in the code forcing selection of generic plan.

postgres=# EXPLAIN EXECUTE prstmt_select(70000);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1732.25 rows=2 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
   ->  Seq Scan on tprt_2  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
(5 rows)

postgres=# EXPLAIN EXECUTE prstmt_select(200000);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1732.25 rows=3 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
   ->  Seq Scan on tprt_2  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
   ->  Seq Scan on tprt_3  (cost=0.00..38.25 rows=753 width=8)
         Filter: ($1 > col1)
(7 rows)


[1] https://www.postgresql.org/message-id/b8094e71-2c73-ed8e-d8c3-53f232c8c049%40lab.ntt.co.jp

Tested on commit: 9b9cb3c4534d717c1c95758670198ebbf8a20af2

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Nov 9, 2017 at 6:18 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> The code still chooses the custom plan instead of the generic plan for
> the prepared statements. I am working on it.

I don't think it's really the job of this patch to do anything about
that problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Runtime Partition Pruning

From
Amit Kapila
Date:
On Thu, Nov 9, 2017 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 9, 2017 at 6:18 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> The code still chooses the custom plan instead of the generic plan for
>> the prepared statements. I am working on it.
>
> I don't think it's really the job of this patch to do anything about
> that problem.
>

+1.  I think if we really want to do something about plan choice when
partitions are involved that should be done as a separate patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Runtime Partition Pruning

From
amul sul
Date:
On Thu, Nov 9, 2017 at 4:48 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello all,
>
> Here is the updated patch which is rebased over v10 of Amit Langote's
> path towards faster pruning patch [1]. It modifies the PartScanKeyInfo
> struct to hold expressions which is then evaluated by the executor to
> fetch the correct partitions using the function.
>

Hi Beena,

I have started looking into your patch; here are a few initial
comments for your 0001 patch:

1.
 351 + *     Evaluate and store the ooutput of ExecInitExpr for each
of the keys.

Typo: ooutput

2.
 822 +               if (IsA(constexpr, Const) &&is_runtime)
 823 +                   continue;
 824 +
 825 +               if (IsA(constexpr, Param) &&!is_runtime)
 826 +                   continue;
 827 +
Add space after '&&'
3.
 1095 +    * Generally for appendrel we don't fetch the clause from the the

Typo: Double 'the'

4.
 272 -/*-------------------------------------------------------------------------
 273 + /*-------------------------------------------------------------------------
Unnecessary hunk.

5.
 313 +       Node       *n =
eval_const_expressions_from_list(estate->es_param_list_info, val);
 314 +

Crossing 80 column window.  Same at line # 323 & 325

6.
 315 +       keys->eqkeys_datums[i++] = ((Const *) n)->constvalue;

Don’t we need a check for IsA(n, Const) or assert ?

7.
1011 +   if (prmList)
1012 +       context.boundParams = prmList;  /* bound Params */
1013 +   else
1014 +       context.boundParams = NULL;

No need of prmList null check, context.boundParams = prmList; is enough.

8.  It would be nice if you create a separate patch where you are moving
    PartScanKeyInfo and exporting function declaration.

9.  Could you please add a few regression tests? That would help in
review & testing.

10. Could you please rebase your patch against latest "path toward faster
    partition pruning" patch by Amit.

Regards,
Amul


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello Amul,

Thank you for reviewing.

On Fri, Nov 10, 2017 at 4:33 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, Nov 9, 2017 at 4:48 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Hello all,
>>
>> Here is the updated patch which is rebased over v10 of Amit Langote's
>> path towards faster pruning patch [1]. It modifies the PartScanKeyInfo
>> struct to hold expressions which is then evaluated by the executor to
>> fetch the correct partitions using the function.
>>
>
> Hi Beena,
>
> I have started looking into your patch, here few initial comments
> for your 0001 patch:
>
> 1.
>  351 + *     Evaluate and store the ooutput of ExecInitExpr for each
> of the keys.
>
> Typo: ooutput

Corrected.

>
> 2.
>  822 +               if (IsA(constexpr, Const) &&is_runtime)
>  823 +                   continue;
>  824 +
>  825 +               if (IsA(constexpr, Param) &&!is_runtime)
>  826 +                   continue;
>  827 +
>
>  Add space after '&&'

Done.

>
>  3.
>  1095 +    * Generally for appendrel we don't fetch the clause from the the
>
> Typo: Double 'the'
>
> 4.
>  272 -/*-------------------------------------------------------------------------
>  273 + /*-------------------------------------------------------------------------
>
>  Unnecessary hunk.

Removed.

>
> 5.
>  313 +       Node       *n =
> eval_const_expressions_from_list(estate->es_param_list_info, val);
>  314 +
>
> Crossing 80 column window.  Same at line # 323 & 325

Fixed.

>
> 6.
>  315 +       keys->eqkeys_datums[i++] = ((Const *) n)->constvalue;
>
> Don’t we need a check for IsA(n, Const) or assert ?

added

>
> 7.
> 1011 +   if (prmList)
> 1012 +       context.boundParams = prmList;  /* bound Params */
> 1013 +   else
> 1014 +       context.boundParams = NULL;
>
> No need of prmList null check, context.boundParams = prmList; is enough.
>
> 8.  It would be nice if you create a separate patch where you are moving
>     PartScanKeyInfo and exporting function declaration.

This is in 0001.

>
> 9.  Could you please add few regression tests, that would help in
> review & testing.

I will make a separate regression patch and submit it soon.

>
> 10. Could you please rebase your patch against latest "path toward faster
>     partition pruning" patch by Amit.


The following is rebased over v11 of Amit's patch [1]


[1] https://www.postgresql.org/message-id/62d21a7b-fea9-f2d7-c33a-8caa12eca612%40lab.ntt.co.jp


--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
PFA the updated patches.

On Tue, Nov 14, 2017 at 11:45 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello Amul,
>
> Thank you for reviewing.
>
> On Fri, Nov 10, 2017 at 4:33 PM, amul sul <sulamul@gmail.com> wrote:
>> On Thu, Nov 9, 2017 at 4:48 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> Hello all,
>>>
>>> Here is the updated patch which is rebased over v10 of Amit Langote's
>>> path towards faster pruning patch [1]. It modifies the PartScanKeyInfo
>>> struct to hold expressions which is then evaluated by the executor to
>>> fetch the correct partitions using the function.
>>>
>>
>> Hi Beena,
>>
>> I have started looking into your patch, here few initial comments
>> for your 0001 patch:
>>
>> 1.
>>  351 + *     Evaluate and store the ooutput of ExecInitExpr for each
>> of the keys.
>>
>> Typo: ooutput
>
> Corrected.
>
>>
>> 2.
>>  822 +               if (IsA(constexpr, Const) &&is_runtime)
>>  823 +                   continue;
>>  824 +
>>  825 +               if (IsA(constexpr, Param) &&!is_runtime)
>>  826 +                   continue;
>>  827 +
>>
>>  Add space after '&&'
>
> Done.
>
>>
>>  3.
>>  1095 +    * Generally for appendrel we don't fetch the clause from the the
>>
>> Typo: Double 'the'
>>
>> 4.
>>  272 -/*-------------------------------------------------------------------------
>>  273 + /*-------------------------------------------------------------------------
>>
>>  Unnecessary hunk.
>
> Removed.
>
>>
>> 5.
>>  313 +       Node       *n =
>> eval_const_expressions_from_list(estate->es_param_list_info, val);
>>  314 +
>>
>> Crossing 80 column window.  Same at line # 323 & 325
>
> Fixed.
>
>>
>> 6.
>>  315 +       keys->eqkeys_datums[i++] = ((Const *) n)->constvalue;
>>
>> Don’t we need a check for IsA(n, Const) or assert ?
>
> added
>
>>
>> 7.
>> 1011 +   if (prmList)
>> 1012 +       context.boundParams = prmList;  /* bound Params */
>> 1013 +   else
>> 1014 +       context.boundParams = NULL;
>>
>> No need of prmList null check, context.boundParams = prmList; is enough.
>>
>> 8.  It would be nice if you create a separate patch where you are moving
>>     PartScanKeyInfo and exporting function declaration.
>
> This is in 0001.
>
>>
>> 9.  Could you please add few regression tests, that would help in
>> review & testing.
>
> I will make a seperate regression patch and submit soon.
>
>>
>> 10. Could you please rebase your patch against latest "path toward faster
>>     partition pruning" patch by Amit.
>
>
> The following is rebased over v11 Amit's patch [1]
>
>
> [1] https://www.postgresql.org/message-id/62d21a7b-fea9-f2d7-c33a-8caa12eca612%40lab.ntt.co.jp
>


--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Rajkumar Raghuwanshi
Date:
On Tue, Nov 14, 2017 at 11:46 AM, Beena Emerson <memissemerson@gmail.com> wrote:
PFA the updated patches.

Hi,

I have started testing this along with fast pruning. It is crashing for SQL with subqueries.
One such test case is given below.

CREATE TABLE prun_test_part (empno int, sal int, deptno int) PARTITION BY RANGE(sal);
CREATE TABLE prun_test_part_p1 PARTITION OF prun_test_part FOR VALUES FROM (0) TO (100);
CREATE TABLE prun_test_part_p2 PARTITION OF prun_test_part FOR VALUES FROM (100) TO (200);
CREATE TABLE prun_test_part_p3 PARTITION OF prun_test_part FOR VALUES FROM (200) TO (300);
CREATE TABLE prun_test_part_p4 PARTITION OF prun_test_part FOR VALUES FROM (300) TO (400);

INSERT INTO prun_test_part VALUES (10,90,10);
INSERT INTO prun_test_part VALUES (11,100,10);
INSERT INTO prun_test_part VALUES (20,110,20);
INSERT INTO prun_test_part VALUES (21,200,20);
INSERT INTO prun_test_part VALUES (30,210,30);
INSERT INTO prun_test_part VALUES (31,300,30);
INSERT INTO prun_test_part VALUES (50,310,20);

explain (costs off)
SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 50);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 14 November 2017 at 19:16, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patches.

Hi Beena,

Thanks for working on this. I've had a look at the patch to try to
understand how it is working. I found it a bit surprising that the
code assumes it can rely on the order of Append->appendplans matching
what's needed based on the return value of get_partitions_for_keys().

I tried using the following to break this:


drop table if exists ab;
create table ab (a int not null, b int not null) partition by list(a);
create table ab_a2 partition of ab for values in(2) partition by list (b);
create table ab_a2_b1 partition of ab_a2 for values in (1);
create table ab_a2_b2 partition of ab_a2 for values in (2);
create table ab_a2_b3 partition of ab_a2 for values in (3);

create table ab_a1 partition of ab for values in(1) partition by list (b);
create table ab_a1_b1 partition of ab_a1 for values in (1);
create table ab_a1_b2 partition of ab_a1 for values in (2);
create table ab_a1_b3 partition of ab_a1 for values in (3);
create table ab_a3 partition of ab for values in(3) partition by list (b);
create table ab_a3_b1 partition of ab_a3 for values in (1);
create table ab_a3_b2 partition of ab_a3 for values in (2);
create table ab_a3_b3 partition of ab_a3 for values in (3);

prepare ab_q1 (int, int) as select * from ab where a = $1 and b = $2;

explain execute ab_q1 (2,3);
explain execute ab_q1 (2,3);
explain execute ab_q1 (2,3);
explain execute ab_q1 (2,3);

postgres=# explain execute ab_q1 (2,3);
                         QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8)
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
         Filter: ((a = 2) AND (b = 3))
(3 rows)


postgres=# explain execute ab_q1 (2,3);
                         QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=1 width=8)
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8)    <--------- wrong partition
         Filter: ((a = $1) AND (b = $2))
(3 rows)

As soon as we hit the generic plan, the wrong partition is selected.

I think to do this you're going to have to store some sort of array
that maps the partition index to the subpath in the Append node so you
can correctly identify the subpath based on what you're getting back
from get_partitions_for_keys(). Perhaps what you had worked previously
when we were not returning a Bitmapset with that function.

Once you've got that design worked out I can take another look at this.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 15 November 2017 at 01:57, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I think to do this you're going to have to store some sort of array
> that maps the partition index to the subpath in the Append node so you
> can correctly identify the subpath based on what you're getting back
> from get_partitions_for_keys(). Perhaps what you had worked previously
> when we were not returning a Bitmapset with that function.
>
> Once you've got that design worked out I can take another look at this.

I think this is a bit more broken than I originally mentioned above.
The code you have at the moment assumes there will be just a single
partitioned table in the hierarchy. Remember that complex partitioned
hierarchies will be flattened during set_append_rel_pathlist(), so
there may be multiple partitioned relations to search for.

A simpler way to break the patch is to have some constants in the
query to eliminate some of the partitions during planning, leaving
just a few to be eliminated during execution.

Something like:

deallocate ab_q1;
drop table if exists ab;
create table ab (a int not null, b int not null) partition by list(a);
create table ab_a1 partition of ab for values in (1);
create table ab_a2 partition of ab for values in (2);
create table ab_a3 partition of ab for values in (3);
create table ab_a4 partition of ab for values in (4);
create table ab_a5 partition of ab for values in (5);
create table ab_a6 partition of ab for values in (6);
create table ab_a7 partition of ab for values in (7);
create table ab_a8 partition of ab for values in (8);
create table ab_a9 partition of ab for values in (9);
create table ab_a10 partition of ab for values in (10);

prepare ab_q1 (int) as select * from ab where a between 4 and 5 and a = $1;

explain execute ab_q1 (4);
explain execute ab_q1 (4);
explain execute ab_q1 (4);
explain execute ab_q1 (4);
explain execute ab_q1 (4);

explain execute ab_q1 (4); -- TRAP: FailedAssertion("!(n <
list->length)", File: "src/backend/nodes/list.c", Line: 392)

So some sort of hierarchical structure of the partition hierarchy
would need to be stored in the Append node and then you'd need to
search at each level, and then somehow match the results up to the
subpaths that you have in the Append. Although, I'm still not sure
this is the best way to go about this.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2017/11/15 8:13, David Rowley wrote:
> So some sort of hierarchical structure of the partition hierarchy
> would need to be stored in the Append node and then you'd need to
> search at each level, and then somehow match the results up to the
> subpaths that you have in the Append. Although, I'm still not sure
> this is the best way to go about this.

I think we should try to use PartitionDispatch stuff for this somehow,
just like get_partition_for_tuple() does, although I haven't thought very
hard whether that infrastructure would be useful as is.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello Rajkumar,

On Tue, Nov 14, 2017 at 2:22 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Tue, Nov 14, 2017 at 11:46 AM, Beena Emerson <memissemerson@gmail.com>
> wrote:
>>
>> PFA the updated patches.
>
>
> Hi,
>
> I have started testing this along with fast pruning. It is crashing for sql
> with subqueries.
> one to test case is given below.
>
> CREATE TABLE prun_test_part (empno int, sal int, deptno int) PARTITION BY
> RANGE(sal);
> CREATE TABLE prun_test_part_p1 PARTITION OF prun_test_part FOR VALUES FROM
> (0) TO (100);
> CREATE TABLE prun_test_part_p2 PARTITION OF prun_test_part FOR VALUES FROM
> (100) TO (200);
> CREATE TABLE prun_test_part_p3 PARTITION OF prun_test_part FOR VALUES FROM
> (200) TO (300);
> CREATE TABLE prun_test_part_p4 PARTITION OF prun_test_part FOR VALUES FROM
> (300) TO (400);
>
> INSERT INTO prun_test_part VALUES (10,90,10);
> INSERT INTO prun_test_part VALUES (11,100,10);
> INSERT INTO prun_test_part VALUES (20,110,20);
> INSERT INTO prun_test_part VALUES (21,200,20);
> INSERT INTO prun_test_part VALUES (30,210,30);
> INSERT INTO prun_test_part VALUES (31,300,30);
> INSERT INTO prun_test_part VALUES (50,310,20);
>
> explain (costs off)
> SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part
> WHERE sal = 50);
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>

Thanks for your review. I am looking into it.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello David,

Thank you for reviewing.

On Wed, Nov 15, 2017 at 4:43 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 15 November 2017 at 01:57, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think to do this you're going to have to store some sort of array
>> that maps the partition index to the subpath in the Append node so you
>> can correctly identify the subpath based on what you're getting back
>> from get_partitions_for_keys(). Perhaps what you had worked previously
>> when we were not returning a Bitmapset with that function.
>>
>> Once you've got that design worked out I can take another look at this.
>
> I think this is a bit more broken than I originally mentioned above.
> The code you have at the moment assumes there will be just a single
> partitioned table in the hierarchy. Remember that complex partitioned
> hierarchies will be flattened during set_append_rel_pathlist(), so
> there may be multiple partitioned relations to search for.
>
> A more simple way to break the patch is to have some constants in the
> query to eliminate some of the partitions during planning, leaving
> just a few to be eliminated during execution.
>
> Something like:
>
> deallocate ab_q1;
> drop table if exists ab;
> create table ab (a int not null, b int not null) partition by list(a);
> create table ab_a1 partition of ab for values in (1);
> create table ab_a2 partition of ab for values in (2);
> create table ab_a3 partition of ab for values in (3);
> create table ab_a4 partition of ab for values in (4);
> create table ab_a5 partition of ab for values in (5);
> create table ab_a6 partition of ab for values in (6);
> create table ab_a7 partition of ab for values in (7);
> create table ab_a8 partition of ab for values in (8);
> create table ab_a9 partition of ab for values in (9);
> create table ab_a10 partition of ab for values in (10);
>
> prepare ab_q1 (int) as select * from ab where a between 4 and 5 and a = $1;
>
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
>
> explain execute ab_q1 (4); -- TRAP: FailedAssertion("!(n <
> list->length)", File: "src/backend/nodes/list.c", Line: 392)
>
> So some sort of hierarchical structure of the partition hierarchy
> would need to be stored in the Append node and then you'd need to
> search at each level, and then somehow match the results up to the
> subpaths that you have in the Append. Although, I'm still not sure
> this is the best way to go about this.

Thank you for your suggestion. I am looking into this and will post a
patch soon.



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Dilip Kumar
Date:
On Wed, Nov 15, 2017 at 4:43 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 15 November 2017 at 01:57, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think to do this you're going to have to store some sort of array
>> that maps the partition index to the subpath in the Append node so you
>> can correctly identify the subpath based on what you're getting back
>> from get_partitions_for_keys(). Perhaps what you had worked previously
>> when we were not returning a Bitmapset with that function.
>>
>> Once you've got that design worked out I can take another look at this.
>

> So some sort of hierarchical structure of the partition hierarchy
> would need to be stored in the Append node and then you'd need to
> search at each level, and then somehow match the results up to the
> subpaths that you have in the Append. Although, I'm still not sure
> this is the best way to go about this.
>

Instead of a hierarchical structure, can't we maintain an array (one
entry per partition), and whenever any leaf partition's subpath
is added to the append rel (in set_append_rel_pathlist) set that
subpath's number in the corresponding array index?

Later, we can add a wrapper over get_partitions_for_keys that
recursively traverses the non-leaf partitions (something like
get_partition_for_tuple does, as Amit mentioned) and ultimately
returns a list of leaf-partition indexes. Then we can find the
sub-plan number by looking into the array.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Runtime Partition Pruning

From
Michael Paquier
Date:
On Wed, Nov 15, 2017 at 3:53 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Thank you for your suggestion. I am looking into this and will post a
> patch soon.

It has been two weeks since this update and no new patch has shown
up. I am marking the patch as returned with feedback. If you can
produce a new version, of course feel free to post it and register a
new entry in the CF app.
-- 
Michael


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

PFA the new version of the patch, which can be applied over the v11
patches of Amit Langote [1]. The patch has been completely modified and
the 0001 patch of the previous series is no longer required. As
mentioned above, I have used the PartitionDispatchInfo and an array
which holds the actual subplan index.

This one addresses the problems mentioned above but still needs to be
thoroughly tested. I will soon post a new patch rebased over Amit's
v13 patches [2], with additional code comments.



[1] https://www.postgresql.org/message-id/62d21a7b-fea9-f2d7-c33a-8caa12eca612%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello Rajkumar,


On Tue, Nov 14, 2017 at 2:22 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Tue, Nov 14, 2017 at 11:46 AM, Beena Emerson <memissemerson@gmail.com>
> wrote:
>>
>> PFA the updated patches.
>
>
> Hi,
>
> I have started testing this along with fast pruning. It is crashing for sql
> with subqueries.
> one to test case is given below.
>
> CREATE TABLE prun_test_part (empno int, sal int, deptno int) PARTITION BY
> RANGE(sal);
> CREATE TABLE prun_test_part_p1 PARTITION OF prun_test_part FOR VALUES FROM
> (0) TO (100);
> CREATE TABLE prun_test_part_p2 PARTITION OF prun_test_part FOR VALUES FROM
> (100) TO (200);
> CREATE TABLE prun_test_part_p3 PARTITION OF prun_test_part FOR VALUES FROM
> (200) TO (300);
> CREATE TABLE prun_test_part_p4 PARTITION OF prun_test_part FOR VALUES FROM
> (300) TO (400);
>
> INSERT INTO prun_test_part VALUES (10,90,10);
> INSERT INTO prun_test_part VALUES (11,100,10);
> INSERT INTO prun_test_part VALUES (20,110,20);
> INSERT INTO prun_test_part VALUES (21,200,20);
> INSERT INTO prun_test_part VALUES (30,210,30);
> INSERT INTO prun_test_part VALUES (31,300,30);
> INSERT INTO prun_test_part VALUES (50,310,20);
>
> explain (costs off)
> SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part
> WHERE sal = 50);
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>

With the new patch, the output is as follows:

postgres=# explain (costs off)
SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 50);
                          QUERY PLAN
-----------------------------------------------------------------
 Append
   InitPlan 1 (returns $0)
     ->  Append
           ->  Seq Scan on prun_test_part_p1 prun_test_part_p1_1
                 Filter: (sal = 50)
   ->  Seq Scan on prun_test_part_p1
         Filter: (sal < $0)
(7 rows)


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
On Tue, Nov 14, 2017 at 6:27 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 14 November 2017 at 19:16, Beena Emerson <memissemerson@gmail.com> wrote:
>> PFA the updated patches.
>
> Hi Beena,
>
> Thanks for working on this. I've had a look at the patch to try to
> understand how it is working. I found it a bit surprising that the
> code assumes it can rely on the order of Append->appendplans matching
> what's needed based on the return value of get_partitions_for_keys().
>
> I tried using the following to break this:
>
>
> drop table if exists ab;
> create table ab (a int not null, b int not null) partition by list(a);
> create table ab_a2 partition of ab for values in(2) partition by list (b);
> create table ab_a2_b1 partition of ab_a2 for values in (1);
> create table ab_a2_b2 partition of ab_a2 for values in (2);
> create table ab_a2_b3 partition of ab_a2 for values in (3);
>
> create table ab_a1 partition of ab for values in(1) partition by list (b);
> create table ab_a1_b1 partition of ab_a1 for values in (1);
> create table ab_a1_b2 partition of ab_a1 for values in (2);
> create table ab_a1_b3 partition of ab_a1 for values in (3);
> create table ab_a3 partition of ab for values in(3) partition by list (b);
> create table ab_a3_b1 partition of ab_a3 for values in (1);
> create table ab_a3_b2 partition of ab_a3 for values in (2);
> create table ab_a3_b3 partition of ab_a3 for values in (3);
>
> prepare ab_q1 (int, int) as select * from ab where a = $1 and b = $2;
>
> explain execute ab_q1 (2,3);
> explain execute ab_q1 (2,3);
> explain execute ab_q1 (2,3);
> explain execute ab_q1 (2,3);
>
> postgres=# explain execute ab_q1 (2,3);
>                           QUERY PLAN
> ---------------------------------------------------------------
>  Append  (cost=0.00..43.90 rows=1 width=8)
>    ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
>          Filter: ((a = 2) AND (b = 3))
> (3 rows)
>
>
> postgres=# explain execute ab_q1 (2,3);
>                           QUERY PLAN
> ---------------------------------------------------------------
>  Append  (cost=0.00..395.10 rows=1 width=8)
>    ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8)
> <--------- wrong partition
>          Filter: ((a = $1) AND (b = $2))
> (3 rows)
>

With the new patch, the output is:
postgres=#  explain execute ab_q1 (2,3);
                         QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8)
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
         Filter: ((a = 2) AND (b = 3))
(3 rows)

postgres=#  explain execute ab_q1 (2,3);
                         QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=4 width=8)
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
         Filter: ((a = $1) AND (b = $2))
(3 rows)


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Wed, Nov 15, 2017 at 4:43 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 15 November 2017 at 01:57, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think to do this you're going to have to store some sort of array
>> that maps the partition index to the subpath in the Append node so you
>> can correctly identify the subpath based on what you're getting back
>> from get_partitions_for_keys(). Perhaps what you had worked previously
>> when we were not returning a Bitmapset with that function.
>>
>> Once you've got that design worked out I can take another look at this.
>
> I think this is a bit more broken than I originally mentioned above.
> The code you have at the moment assumes there will be just a single
> partitioned table in the hierarchy. Remember that complex partitioned
> hierarchies will be flattened during set_append_rel_pathlist(), so
> there may be multiple partitioned relations to search for.
>
> A more simple way to break the patch is to have some constants in the
> query to eliminate some of the partitions during planning, leaving
> just a few to be eliminated during execution.
>
> Something like:
>
> deallocate ab_q1;
> drop table if exists ab;
> create table ab (a int not null, b int not null) partition by list(a);
> create table ab_a1 partition of ab for values in (1);
> create table ab_a2 partition of ab for values in (2);
> create table ab_a3 partition of ab for values in (3);
> create table ab_a4 partition of ab for values in (4);
> create table ab_a5 partition of ab for values in (5);
> create table ab_a6 partition of ab for values in (6);
> create table ab_a7 partition of ab for values in (7);
> create table ab_a8 partition of ab for values in (8);
> create table ab_a9 partition of ab for values in (9);
> create table ab_a10 partition of ab for values in (10);
>
> prepare ab_q1 (int) as select * from ab where a between 4 and 5 and a = $1;
>
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
> explain execute ab_q1 (4);
>
> explain execute ab_q1 (4); -- TRAP: FailedAssertion("!(n <
> list->length)", File: "src/backend/nodes/list.c", Line: 392)
>

This is handled in the new patch.

postgres=# explain execute ab_q1 (4);
                        QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..49.55 rows=1 width=8)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8)
         Filter: ((a >= 4) AND (a <= 5) AND (a = 4))
(3 rows)

postgres=# explain execute ab_q1 (4);
                        QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..99.10 rows=1 width=8)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
(3 rows)

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Wed, Nov 29, 2017 at 7:11 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 15, 2017 at 3:53 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>> Thank you for your suggestion. I am looking into this and will post a
>> patch soon.
>
> It has been two weeks since this update and no new patch has showed
> up. I am marking the patch as returned with feedback. If you can
> produce a new version, of course feel free to post it and register a
> new entry in the CF app.


I have moved it to next CF with status needs review.

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

PFA the updated patch, rebased over Amit's v13 patches [1], part of
which is committed. This also fixes a few bugs I found.
The InitPlans require execPlan, which is not set during ExecInitAppend,
so the evaluation of the extern quals is moved from ExecInitAppend to
ExecAppend. This changes the EXPLAIN output, but only the correct
partition(s) are scanned.

David Q1:
postgres=#  explain analyse execute ab_q1 (3,3); --const
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006
rows=0 loops=1)
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
time=0.005..0.005 rows=0 loops=1)
         Filter: ((a = 3) AND (b = 3))
 Planning time: 0.588 ms
 Execution time: 0.043 ms
(5 rows)

postgres=#  explain analyse execute ab_q1 (3,3); -- Param: only the
ab_a3_b3 plan is executed
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119
rows=0 loops=1)
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
time=0.006..0.006 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.828 ms
 Execution time: 0.234 ms
(21 rows)


David Q1
postgres=#  explain analyse execute ab_q1 (4); -- Const
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..49.55 rows=1 width=8) (actual time=0.005..0.005
rows=0 loops=1)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8) (actual
time=0.004..0.004 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = 4))
 Planning time: 0.501 ms
 Execution time: 0.039 ms
(5 rows)

postgres=#  explain analyse execute ab_q1 (4); --Param
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..99.10 rows=2 width=8) (actual time=0.063..0.063
rows=0 loops=1)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8) (actual
time=0.004..0.004 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
   ->  Seq Scan on ab_a5  (cost=0.00..49.55 rows=1 width=8) (never executed)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
 Planning time: 0.563 ms
 Execution time: 0.111 ms


I am still working on the patch to add more comments and regression
tests but comments on the code is welcome.

[1]https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> David Q1:
> postgres=#  explain analyse execute ab_q1 (3,3); --const
>                                                QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006
> rows=0 loops=1)
>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
> time=0.005..0.005 rows=0 loops=1)
>          Filter: ((a = 3) AND (b = 3))
>  Planning time: 0.588 ms
>  Execution time: 0.043 ms
> (5 rows)

I think the EXPLAIN ANALYZE output should show something attached to
the Append node so that we can tell that partition pruning is in use.
I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes"
or if we can give a few more useful details.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello Robert,

On Sat, Dec 2, 2017 at 12:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> David Q1:
>> postgres=#  explain analyse execute ab_q1 (3,3); --const
>>                                                QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------
>>  Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006
>> rows=0 loops=1)
>>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
>> time=0.005..0.005 rows=0 loops=1)
>>          Filter: ((a = 3) AND (b = 3))
>>  Planning time: 0.588 ms
>>  Execution time: 0.043 ms
>> (5 rows)
>
> I think the EXPLAIN ANALYZE input should show something attached to
> the Append node so that we can tell that partition pruning is in use.
> I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes"
> or if we can give a few more useful details.
>

The output above is shown for a Const value, i.e. optimizer pruning,
which I included just to show that the correct partition is chosen
even during runtime pruning for the given value. So, taking your
suggestion, the output for runtime pruning could be something as
follows:

postgres=#  explain analyse execute ab_q1 (3,3);
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119
rows=0 loops=1) (run-time partition pruning: on)
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
time=0.006..0.006 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.828 ms
 Execution time: 0.234 ms
(21 rows)


Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Sat, Dec 2, 2017 at 3:33 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>  Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119
> rows=0 loops=1) (run-time partition pruning: on)

If we can, it would be better to show something a bit more precise,
like the table being used for run-time pruning, or the expression
being used for pruning.

Also, we shouldn't use an ad-hoc format like "(run-time
partition-pruning: on)"; rather, we should display something using one
of the ExplainPropertyBlah functions in explain.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 2 December 2017 at 08:04, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> David Q1:
>> postgres=#  explain analyse execute ab_q1 (3,3); --const
>>                                                QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------
>>  Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006
>> rows=0 loops=1)
>>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
>> time=0.005..0.005 rows=0 loops=1)
>>          Filter: ((a = 3) AND (b = 3))
>>  Planning time: 0.588 ms
>>  Execution time: 0.043 ms
>> (5 rows)
>
> I think the EXPLAIN ANALYZE input should show something attached to
> the Append node so that we can tell that partition pruning is in use.
> I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes"
> or if we can give a few more useful details.

It already does. Any subnode with "(never executed)" was pruned
at runtime. Do we really need anything else to tell us that?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
On Wed, Dec 6, 2017 at 1:21 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 2 December 2017 at 08:04, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> David Q1:
>>> postgres=#  explain analyse execute ab_q1 (3,3); --const
>>>                                                QUERY PLAN
>>> ---------------------------------------------------------------------------------------------------------
>>>  Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006
>>> rows=0 loops=1)
>>>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual
>>> time=0.005..0.005 rows=0 loops=1)
>>>          Filter: ((a = 3) AND (b = 3))
>>>  Planning time: 0.588 ms
>>>  Execution time: 0.043 ms
>>> (5 rows)
>>
>> I think the EXPLAIN ANALYZE input should show something attached to
>> the Append node so that we can tell that partition pruning is in use.
>> I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes"
>> or if we can give a few more useful details.
>
> It already does. Anything subnode with "(never executed)" was pruned
> at runtime. Do we really need anything else to tell us that?

I have added the partition quals that are used for pruning.

PFA the updated patch. I have changed the names of some variables to
make them more appropriate, added more code comments, and done some
refactoring and other code cleanups.

A few cases:

1. Only runtime pruning - David's case1
explain analyse execute ab_q1 (2,3);
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.101..0.101
rows=0 loops=1)
   Runtime Partition Pruning: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (actual
time=0.007..0.007 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.780 ms
 Execution time: 0.220 ms
(22 rows)

2. Runtime pruning after optimizer pruning - David's case 2.
((a >= 4) AND (a <= 5)) is used during optimizer pruning and only
(a = $1) is used for runtime pruning.
=#  explain (analyse, costs off, summary off) execute ab_q1 (4);
                            QUERY PLAN
-------------------------------------------------------------------
 Append (actual time=0.062..0.062 rows=0 loops=1)
   Runtime Partition Pruning: (a = $1)
   ->  Seq Scan on ab_a4 (actual time=0.005..0.005 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
   ->  Seq Scan on ab_a5 (never executed)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
(6 rows)

3. Nestloop Join
tbl1.col1 only has values from 1 to 10. (An assumed setup for tbl1 and
the tprt indexes is sketched after case 4.2 below.)

=# \d+ tprt
                                   Table "public.tprt"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 col1   | integer |           |          |         | plain   |              |
 col2   | integer |           |          |         | plain   |              |
Partition key: RANGE (col1)
Partitions: tprt_1 FOR VALUES FROM (1) TO (5001),
            tprt_2 FOR VALUES FROM (5001) TO (10001),
            tprt_3 FOR VALUES FROM (10001) TO (20001)


=# explain (analyse, costs off, summary off) SELECT * FROM tbl1 JOIN
tprt ON tbl1.col1 > tprt.col1;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.053..0.192 rows=45 loops=1)
   ->  Seq Scan on tbl1 (actual time=0.007..0.009 rows=10 loops=1)
   ->  Append (actual time=0.003..0.004 rows=4 loops=10)
         Runtime Partition Pruning Join Filter: (tbl1.col1 > col1)
         ->  Index Scan using tprt1_idx on tprt_1 (actual
time=0.002..0.004 rows=5 loops=9)
               Index Cond: (tbl1.col1 > col1)
         ->  Index Scan using tprt2_idx on tprt_2 (never executed)
               Index Cond: (tbl1.col1 > col1)
         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
               Index Cond: (tbl1.col1 > col1)
(10 rows)

4. InitPlan - Raghu's test case:
4.1 Only a few partitions satisfy the param
explain (analyse, costs off, summary off) SELECT * FROM prun_test_part
WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 200);
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append (actual time=0.034..0.038 rows=3 loops=1)
   Runtime Partition Pruning: (sal < $0)
   InitPlan 1 (returns $0)
     ->  Append (actual time=0.008..0.009 rows=1 loops=1)
           ->  Seq Scan on prun_test_part_p3 prun_test_part_p3_1
(actual time=0.008..0.009 rows=1 loops=1)
                 Filter: (sal = 200)
                 Rows Removed by Filter: 1
   ->  Seq Scan on prun_test_part_p1 (actual time=0.002..0.003 rows=1 loops=1)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p2 (actual time=0.002..0.003 rows=2 loops=1)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p3 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p4 (never executed)
         Filter: (sal < $0)
(15 rows)


4.2 When the InitPlan query returns nothing
=# explain (analyse, costs off, summary off) SELECT * FROM
prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal =
50);
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append (actual time=0.050..0.050 rows=0 loops=1)
   Runtime Partition Pruning: (sal < $0)
   InitPlan 1 (returns $0)
     ->  Append (actual time=0.013..0.013 rows=0 loops=1)
           ->  Seq Scan on prun_test_part_p1 prun_test_part_p1_1
(actual time=0.012..0.012 rows=0 loops=1)
                 Filter: (sal = 50)
                 Rows Removed by Filter: 1
   ->  Seq Scan on prun_test_part_p1 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p2 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p3 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p4 (never executed)
         Filter: (sal < $0)
(15 rows)
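
For case 3 above, the tbl1 table and the tprt indexes are not shown;
an assumed setup consistent with the \d+ output and the shape of that
plan is:

CREATE TABLE tbl1 (col1 int);
INSERT INTO tbl1 SELECT generate_series(1, 10);

CREATE TABLE tprt (col1 int, col2 int) PARTITION BY RANGE (col1);
CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (5001);
CREATE TABLE tprt_2 PARTITION OF tprt FOR VALUES FROM (5001) TO (10001);
CREATE TABLE tprt_3 PARTITION OF tprt FOR VALUES FROM (10001) TO (20001);
CREATE INDEX tprt1_idx ON tprt_1 (col1);
CREATE INDEX tprt2_idx ON tprt_2 (col1);
CREATE INDEX tprt3_idx ON tprt_3 (col1);
INSERT INTO tprt SELECT i, i FROM generate_series(1, 20000) i;
ANALYZE tbl1;
ANALYZE tprt;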



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 December 2017 at 20:22, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patch.

Hi Beena,

Thanks for updating this.

Can you list the patches which are required for this to apply to
today's master branch?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello David,

On Thu, Dec 7, 2017 at 4:07 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 December 2017 at 20:22, Beena Emerson <memissemerson@gmail.com> wrote:
>> PFA the updated patch.
>
> Hi Beena,
>
> Thanks for updating this.
>
> Can you list the patches which are required for this to apply to
> today's master branch?
>

Thanks for looking into this.

Currently, Amit's v13 patches do not apply on HEAD, so I was
working on 487a0c1518af2f3ae2d05b7fd23d636d687f28f3, which is the last
commit where all of Amit's v13 patches apply cleanly.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Thu, Dec 7, 2017 at 12:52 PM, Beena Emerson <memissemerson@gmail.com> wrote:
>
> 1. Only runtime pruning - David's case1
> explain analyse execute ab_q1 (2,3);
>                                                QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.101..0.101
> rows=0 loops=1)
>    Runtime Partition Pruning: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (actual
> time=0.007..0.007 rows=0 loops=1)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
>          Filter: ((a = $1) AND (b = $2))
>  Planning time: 0.780 ms
>  Execution time: 0.220 ms
> (22 rows)
>
> 2. Runtime pruning after optimizer pruning - David's case 2.
> ((a >= 4) AND (a <= 5)  is used during optimizer pruning and only (a =
> $1) is used for runtime pruning.
> =#  explain (analyse, costs off, summary off) execute ab_q1 (4);
>                             QUERY PLAN
> -------------------------------------------------------------------
>  Append (actual time=0.062..0.062 rows=0 loops=1)
>    Runtime Partition Pruning: (a = $1)
>    ->  Seq Scan on ab_a4 (actual time=0.005..0.005 rows=0 loops=1)
>          Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
>    ->  Seq Scan on ab_a5 (never executed)
>          Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
> (6 rows)
>

FYI,

The v4 version of the patch accidentally included the
choose_custom_plan hack I had used to force runtime pruning in the
above cases (1, 2); it has been removed in v5. So with only the patch
applied, the output would continue to show the const rather than the
Param, because the custom plan is preferred over the generic one. This
was pointed out in the initial post of this thread. Just for
comparison, I continued using the hack for these tests to show the
behaviour changes.

A different test case would be needed to exercise the behaviour when
the generic plan is picked.

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 December 2017 at 23:56, Beena Emerson <memissemerson@gmail.com> wrote:
> Currently Amit's v13 patches do not apply on the HEAD and I was
> working on 487a0c1518af2f3ae2d05b7fd23d636d687f28f3 which is the last
> commit where all Amit's v13 patches applies cleanly.

Thanks.

I was just looking over this and was wondering about the following case:

drop table if exists p;
create table p (a int not null, b int not null) partition by range (a);
create table p1 partition of p for values from (0) to (1000);
create table p2 partition of p for values from (1000) to (2000);
create table p3 partition of p for values from (2000) to (3000);
create table p4 partition of p for values from (3000) to (4000);

create index on p1 (a);
create index on p2 (a);
create index on p3 (a);
create index on p4 (a);


insert into p select x,x from generate_series(1,3999) x;

drop table if exists t;
create table t (a int not null);

insert into t select generate_Series(1,10);

analyze p;

analyze t;

set enable_mergejoin=0;
set enable_hashjoin=0;

explain analyze select * from p inner join t on p.a = t.a;

The patch gives me:

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Nested Loop (actual time=0.032..0.159 rows=10 loops=1)
   ->  Seq Scan on t (actual time=0.012..0.013 rows=10 loops=1)
   ->  Append (actual time=0.004..0.013 rows=1 loops=10)
         ->  Index Scan using p1_a_idx on p1 (actual time=0.004..0.004
rows=1 loops=10)
               Index Cond: (a = t.a)
         ->  Index Scan using p2_a_idx on p2 (actual time=0.003..0.003
rows=0 loops=10)
               Index Cond: (a = t.a)
         ->  Index Scan using p3_a_idx on p3 (actual time=0.002..0.002
rows=0 loops=10)
               Index Cond: (a = t.a)
         ->  Index Scan using p4_a_idx on p4 (actual time=0.003..0.003
rows=0 loops=10)
               Index Cond: (a = t.a)
 Planning time: 0.472 ms
 Execution time: 0.241 ms
(13 rows)

but I expected to get (never executed) for p2, p3 and p4.

The following code makes me think you intend this to work:

@@ -280,6 +438,10 @@ ExecReScanAppend(AppendState *node)
 {
  int i;

+ /* Determine subplans to scan based on the new Params */
+ if (node->ps.chgParam != NULL && node->join_clauses)
+ set_append_subplan_indexes(node, node->join_clauses);
+

It just does not, because node->join_clauses is NULL.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> I have added the partition quals that are used for pruning.
>
> PFA the updated patch. I have changed the names of variables to make
> it more appropriate, along with adding more code comments and doing
> some refactoring and other code cleanups.

- initClauses() seems to be a duplicate of the existing function
ExecInitExprList(), except for the extra NULL test, which isn't
necessary.

- The executor already has a system for making sure that relations get
opened and locked, and it's different from the new scheme which
set_append_subplan_indexes() implements. Relations should be locked
during the executor initialization phase (i.e. ExecInitAppend) and not
when the first tuple is requested (i.e. ExecAppend).  Also, there's
already code to lock both child relations (i.e. the scans of those
relations, see InitScanRelation, ExecInitIndexScan) and non-leaf
partitions (ExecLockNonLeafAppendTables).  The call to
find_all_inheritors() will lock all of that same stuff again *plus*
the leaf partitions that were pruned during planning - moreover, if
the Append is rescanned, we'll walk the partitioning structure again
for every rescan.  I think RelationGetPartitionDispatchInfo should be
called directly from ExecInitAppend after the existing code to take
locks has been called, and store a pointer to the PartitionDispatch
object in the AppendState for future use.

- I am surprised that set_append_subplan_indexes() needs to worry
about multi-level partitioning directly.  I would have thought that
Amit's patch would take care of that, just returning a Bitmapset of
indexes which this function could use directly.  It also doesn't seem
like a very good idea to convert the Bitmapset (subplans) into a list
of integers (node->subplan_indexes), as set_append_subplan_indexes()
does at the bottom.  The Bitmapset will be a lot more efficient; we
should be able to just iterate over that directly rather than
converting it into a List.  Note that a Bitmapset can be created with
a single palloc, but an integer list needs one per list element plus
one for the list itself.
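
Just to illustrate the difference (a sketch only; validsubplans is a
made-up variable name standing in for the pruning result):

    /* iterate the pruning result directly -- the whole set is one palloc */
    int         i = -1;

    while ((i = bms_next_member(validsubplans, i)) >= 0)
    {
        /* scan subplan i */
    }

    /* versus converting to an integer List first -- one palloc per element */
    List       *subplan_indexes = NIL;
    ListCell   *lc;

    i = -1;
    while ((i = bms_next_member(validsubplans, i)) >= 0)
        subplan_indexes = lappend_int(subplan_indexes, i);

    foreach(lc, subplan_indexes)
    {
        /* scan subplan lfirst_int(lc) */
    }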

- I don't think it's a good idea for ExecInitAppend to copy so much
information into the appendstate.  It copies append_paths_size,
append_paths_array, parentoid, base_params, es_param_list_info,
join_clauses, but it could just as well access them directly via
planstate->plan and planstate->state when they are needed.  Maybe you
had some thought that this would be more efficient, but it probably
doesn't save much and it's unlike what we do elsewhere in the
executor.

- A good chunk of the rest of this logic in nodeAppend.c looks like
it's going to conflict heavily with the Parallel Append patch that
just went in.  That's going to require some thought.  There's no
reason why a parallel-aware Append can't do runtime partition pruning,
but we want to avoid as much overhead as possible when runtime pruning
isn't chosen.  In the parallel-aware case, I think we should just try
to jigger things so that the plans we don't need to scan get marked
pa_finished.  We don't want to hold pa_lock while doing the pruning,
so probably what should happen is add a new ParallelAppendState member
indicating whether pruning has been done; any process which needs to
choose a subplan and sees that pruning isn't done yet releases the
lock, performs pruning, then reacquires the lock, marks pa_finished on
all plans that we don't need to scan, marks pruning done, picks a
plan, and releases the lock.  There is a race condition where pruning
gets conducted by multiple workers around the same time, but it often
won't happen and isn't a big deal if it does; they should all get the
same answer.  In the non-parallel-aware case, I think we should
probably replace choose_next_subplan_locally with
choose_next_subplan_simply (for the non-pruning case) and
choose_next_subplan_with_pruning (for the other case).
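
A rough sketch of that flow (illustration only: pa_pruning_done is an
invented ParallelAppendState member and set_valid_runtime_subplans()
an invented helper that runs the pruning and returns the surviving
subplan indexes as a Bitmapset):

    LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
    if (!pstate->pa_pruning_done)
    {
        Bitmapset  *valid;
        int         i;

        /* don't hold pa_lock while doing the possibly-expensive pruning */
        LWLockRelease(&pstate->pa_lock);
        valid = set_valid_runtime_subplans(node);   /* invented helper */
        LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);

        if (!pstate->pa_pruning_done)
        {
            /* pretend the pruned subplans have already been fully scanned */
            for (i = 0; i < node->as_nplans; i++)
                if (!bms_is_member(i, valid))
                    pstate->pa_finished[i] = true;
            pstate->pa_pruning_done = true;         /* invented member */
        }
    }
    /* ...then pick the next unfinished subplan as usual and release pa_lock */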

- On the optimizer side of things, the new calls to
find_all_inheritors() and get_leaf_part_recurse() in
set_base_rel_sizes() don't look good.  As in the executor stuff,
that's work that the optimizer is already doing elsewhere, and we
don't want to redo it here.  In the case of the optimizer, the most
relevant place is probably expand_partitioned_rtentry().  Another
place where we already have the relation open is get_relation_info().
Any information you want to get from the relation descriptor needs to
be saved in one of those functions; don't re-open the relation
elsewhere.

- I think you are perhaps doing the work a little too early here,
especially with regards to the join clauses.  In set_base_rel_sizes()
we don't have a clear idea what the join order will be, or what type
of join will be used, so we don't know what join clauses are relevant
for run-time pruning.  I don't think that trying to identify join
clauses at that stage makes sense.  I think that the time when we know
(or can figure out) that stuff might be when we go to build a
parameterized append path - see the bottom of
add_paths_to_append_rel().  It's really the parameterization -- the
relids listed in required_outer -- that tell us which join clauses we
could potentially use for runtime pruning.  I'm not sure in detail how
this should work, but it seems to me that generally the right formula
is probably: useful clauses from the appendrel's baserestrictinfo PLUS
joinclauses that mention only vars from the relation itself and
whatever's in required_outer.
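
A sketch of that test (not actual patch code; rinfo, rel and
required_outer are assumed from the surrounding planner context):

    /*
     * A clause is a candidate for run-time pruning if it references only
     * the appendrel itself plus the rels supplying the parameterization.
     */
    Relids      allowed = bms_union(rel->relids, required_outer);

    if (bms_is_subset(rinfo->clause_relids, allowed))
    {
        /* usable for run-time pruning */
    }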

- You can't use random fields in PlannerInfo *root as scratch space
for functions in different files to communicate with each other.  I
mean, there's obviously some stuff that's valid to stick into the
PlannerInfo, but these are basically per-relation details which you're
counting on clearing before anybody gets confused.  That's not a good
design, and it's probably a sign that not all of the code is in the
right place yet.  For example, a lot of the logic you've added to
create_append_path() probably belongs in the caller -- look at how
simple create_.*_path() functions generally are.  Similarly, think
about whether the chunk of logic added to set_rel_size() just after
the call to set_append_rel_size() doesn't really belong inside that
function, or conversely whether the chunk of code in
set_base_rel_sizes() just before the call to set_rel_size() shouldn't
be moved down.  I'm hopeful that with some rejiggering of this sort
you can get the code that needs to communicate closer to being all in
one place and pass around whatever is needed via the parameter lists
of the functions involved, or even local variables, rather than via
PlannerInfo.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi Beena,

On 12/07/2017 02:22 AM, Beena Emerson wrote:
> I have added the partition quals that are used for pruning.
> 
> PFA the updated patch. I have changed the names of variables to make
> it more appropriate, along with adding more code comments and doing
> some refactoring and other code cleanups.
> 

As the current patch conflicts with [1] could you provide a rebased 
version ?

Thanks in advance !

[1] 
https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp

Best regards,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello Robert,
Thank you for the comments. I have started working on it.

On Fri, Dec 8, 2017 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> I have added the partition quals that are used for pruning.
>>
>> PFA the updated patch. I have changed the names of variables to make
>> it more appropriate, along with adding more code comments and doing
>> some refactoring and other code cleanups.
>
> - initClauses() seems to be a duplicate of the existing function
> ExecInitExprList(), except for the extra NULL test, which isn't
> necessary.

initClauses() has been removed and ExecInitExprList() is used instead.

> - The executor already has a system for making sure that relations get
> opened and locked, and it's different from the new one scheme which
> set_append_subplan_indexes() implements. Relations should be locked
> during the executor initialization phase (i.e. ExecInitAppend) and not
> when the first tuple is requested (i.e. ExecAppend).  Also, there's
> already code to lock both child relations (i.e. the scans of those
> relations, see InitScanRelation, ExecInitIndexScan) and non-leaf
> partitions (ExecLockNonLeafAppendTables).   The call to
> find_all_inheritors() will lock all of that same stuff again *plus*
> the leaf partitions that were pruned during planning - moreover, if
> the Append is rescanned, we'll walk the partitioning structure again
> for every rescan.  I think RelationGetPartitionDispatchInfo should be
> called directly from ExecInitAppend after the existing code to take
> locks has been called, and store a pointer to the PartitionDispatch
> object in the AppendState for future use.

I have moved the call to ExecInitAppend. This still uses the previous
locking method; I will work on it in the next version of the patch.


> - I am surprised that set_append_subplan_indexes() needs to worry
> about multi-level partitioning directly.  I would have thought that
> Amit's patch would take care of that, just returning a Bitmapset of
> indexes which this function could use directly.  It also doesn't seem
> like a very good idea to convert the Bitmapset (subplans) into a list
> of integers (node->subplan_indexes), as set_append_subplan_indexes()
> does at the bottom.  The Bitmapset will be a lot more efficient; we
> should be able to just iterate over that directly rather than
> converting it into a List.  Note that a Bitmapset can be created with
> a single palloc, but an integer list needs one per list element plus
> one for the list itself.

The function get_partitions_from_clauses returns the Bitmapset of
partitions for a single partitioning level. So when a bit in the
Bitmapset indicates a child partitioned table,
set_append_subplan_indexes loops through again until it has the list
of all leaf indexes.
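
Roughly, the loop has this shape (a simplified sketch; everything
except the bms_* calls is an invented stand-in for what the code
actually does with the PartitionDispatch info):

    /* partset is what get_partitions_from_clauses() returned for one level */
    int         i = -1;

    while ((i = bms_next_member(partset, i)) >= 0)
    {
        if (partition_index_is_leaf(pd, i))     /* invented helper */
            leaf_indexes = bms_add_member(leaf_indexes, i);
        else
            /* the bit refers to a sub-partitioned table: descend and repeat */
            leaf_indexes = bms_add_members(leaf_indexes,
                                           leaves_for_subpartition(pd, i)); /* invented */
    }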

I am working on the other comments and will post the patch, rebased to
v14 of Amit's patch, soon.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
On Tue, Dec 12, 2017 at 4:57 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello Robert,
> Thank you for the comments. I have started working on it.
>
> On Fri, Dec 8, 2017 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>>> I have added the partition quals that are used for pruning.
>>>
>>> PFA the updated patch. I have changed the names of variables to make
>>> it more appropriate, along with adding more code comments and doing
>>> some refactoring and other code cleanups.
>>
>> - initClauses() seems to be a duplicate of the existing function
>> ExecInitExprList(), except for the extra NULL test, which isn't
>> necessary.
>
> The initClauses has been removed and ExecInitExprList has been used.
>
>> - The executor already has a system for making sure that relations get
>> opened and locked, and it's different from the new one scheme which
>> set_append_subplan_indexes() implements. Relations should be locked
>> during the executor initialization phase (i.e. ExecInitAppend) and not
>> when the first tuple is requested (i.e. ExecAppend).  Also, there's
>> already code to lock both child relations (i.e. the scans of those
>> relations, see InitScanRelation, ExecInitIndexScan) and non-leaf
>> partitions (ExecLockNonLeafAppendTables).   The call to
>> find_all_inheritors() will lock all of that same stuff again *plus*
>> the leaf partitions that were pruned during planning - moreover, if
>> the Append is rescanned, we'll walk the partitioning structure again
>> for every rescan.  I think RelationGetPartitionDispatchInfo should be
>> called directly from ExecInitAppend after the existing code to take
>> locks has been called, and store a pointer to the PartitionDispatch
>> object in the AppendState for future use.
>
> I have moved the call to ExecInitAppend. This still uses the previous
> locking method, I will work on it in the next version of the patch.
>
>
>> - I am surprised that set_append_subplan_indexes() needs to worry
>> about multi-level partitioning directly.  I would have thought that
>> Amit's patch would take care of that, just returning a Bitmapset of
>> indexes which this function could use directly.  It also doesn't seem
>> like a very good idea to convert the Bitmapset (subplans) into a list
>> of integers (node->subplan_indexes), as set_append_subplan_indexes()
>> does at the bottom.  The Bitmapset will be a lot more efficient; we
>> should be able to just iterate over that directly rather than
>> converting it into a List.  Note that a Bitmapset can be created with
>> a single palloc, but an integer list needs one per list element plus
>> one for the list itself.
>
> The function get_partitions_from_clauses returns the Bitmap set of
> partitions for a level of partition. So when the BitmapSet that
> indicates a child partitioned table, set_append_subplan_indexes loops
> throgh again till it gets the list of all leaf indexes.
>
> I am working on the other comments and will post the patch along with
> rebasing to v14 of Amit's patch soon.
>
>
> --
>


PFA the updated patch; it can be applied over the v13 patches [1]
on commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3.

[1] https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Fri, Dec 8, 2017 at 3:37 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 December 2017 at 23:56, Beena Emerson <memissemerson@gmail.com> wrote:
>> Currently Amit's v13 patches do not apply on the HEAD and I was
>> working on 487a0c1518af2f3ae2d05b7fd23d636d687f28f3 which is the last
>> commit where all Amit's v13 patches applies cleanly.
>
> Thanks.
>
> I was just looking over this and was wondering about the following case:
>
> drop table if exists p;
> create table p (a int not null, b int not null) partition by range (a);
> create table p1 partition of p for values from (0) to (1000);
> create table p2 partition of p for values from (1000) to (2000);
> create table p3 partition of p for values from (2000) to (3000);
> create table p4 partition of p for values from (3000) to (4000);
>
> create index on p1 (a);
> create index on p2 (a);
> create index on p3 (a);
> create index on p4 (a);
>
>
> insert into p select x,x from generate_series(1,3999) x;
>
> drop table if exists t;
> create table t (a int not null);
>
> insert into t select generate_Series(1,10);
>
> analyze p;
>
> analyze t;
>
> set enable_mergejoin=0;
> set enable_hashjoin=0;
>
> explain analyze select * from p inner join t on p.a = t.a;
>
> The patch gives me:
>
>                                        QUERY PLAN
> ----------------------------------------------------------------------------------------
>  Nested Loop (actual time=0.032..0.159 rows=10 loops=1)
>    ->  Seq Scan on t (actual time=0.012..0.013 rows=10 loops=1)
>    ->  Append (actual time=0.004..0.013 rows=1 loops=10)
>          ->  Index Scan using p1_a_idx on p1 (actual time=0.004..0.004
> rows=1 loops=10)
>                Index Cond: (a = t.a)
>          ->  Index Scan using p2_a_idx on p2 (actual time=0.003..0.003
> rows=0 loops=10)
>                Index Cond: (a = t.a)
>          ->  Index Scan using p3_a_idx on p3 (actual time=0.002..0.002
> rows=0 loops=10)
>                Index Cond: (a = t.a)
>          ->  Index Scan using p4_a_idx on p4 (actual time=0.003..0.003
> rows=0 loops=10)
>                Index Cond: (a = t.a)
>  Planning time: 0.472 ms
>  Execution time: 0.241 ms
> (13 rows)
>
> but I expected to get (never executed) for p2, p3 and p4.
>
> The following code makes me think you intend this to work:
>
> @@ -280,6 +438,10 @@ ExecReScanAppend(AppendState *node)
>  {
>   int i;
>
> + /* Determine subplans to scan based on the new Params */
> + if (node->ps.chgParam != NULL && node->join_clauses)
> + set_append_subplan_indexes(node, node->join_clauses);
> +
>
> It just does not due to the node->join_clauses being NULL.
>
Thank you for your tests. I am working on this.

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello Jesper,

On Tue, Dec 12, 2017 at 4:04 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Hi Beena,
>
> On 12/07/2017 02:22 AM, Beena Emerson wrote:
>>
>> I have added the partition quals that are used for pruning.
>>
>> PFA the updated patch. I have changed the names of variables to make
>> it more appropriate, along with adding more code comments and doing
>> some refactoring and other code cleanups.
>>
>
> As the current patch conflicts with [1] could you provide a rebased version
> ?
>
> Thanks in advance !
>
> [1]
> https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp
>

I am aware of this and will post a rebased version soon.


Thank you,

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 13 December 2017 at 00:33, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patch, this can be applied over the v13 patches [1]
> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3

Hi Beena,

Thanks for posting an updated patch.

I've been looking over this and I think that the use of the
PartitionDispatch in set_append_subplan_indexes is not correct. What
we need here is the index of the Append's subnode and that's not what
RelationGetPartitionDispatchInfo() gives you. Remember that some
partitions could have been pruned away already during planning.

This quick example shows that the partition selection is not correct.

create table p (a int, b int) partition by range (a);

create table p_a_neg partition of p for values from (minvalue) to (0)
partition by range (b);
create table p_a_pos partition of p for values from (0) to (maxvalue)
partition by range (b);

create table p_a_neg_b_neg partition of p_a_neg for values from
(minvalue) to (0);
create table p_a_neg_b_pos partition of p_a_neg for values from (0) to
(maxvalue);

create table p_a_pos_b_neg partition of p_a_pos for values from
(minvalue) to (0);
create table p_a_pos_b_pos partition of p_a_pos for values from (0) to
(maxvalue);

prepare q1 (int, int) as select * from p where a = $1 and b = $1;

explain analyze execute q1 (-1,-1); -- this works.
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..175.60 rows=4 width=8) (actual time=1.099..1.099
rows=0 loops=1)
   Runtime Partition Pruning: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
(actual time=0.023..0.023 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
(12 rows)

explain analyze execute q1 (-1,1); -- should scan p_a_neg_b_pos, but does not.
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..175.60 rows=4 width=8) (actual
time=758996.359..758996.359 rows=0 loops=1)
   Runtime Partition Pruning: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
(actual time=0.056..0.056 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
   ->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
(never executed)
         Filter: ((a = $1) AND (b = $1))
(12 rows)

So, I started to look at what the best way to put this right might be.
I see that since Parallel Append was committed that the subnodes are
now sorted in cost order inside create_append_path(), so likely we'll
want to delay figuring out the subpath list indexes until after that's
done since sorting would scramble our index arrays. We could simply
look at the subpaths at the end of create_append_path() and create
some sort of new matrix type that can accept the output of Amit's
get_partitions_from_clauses() and translate that Bitmapset into the
subpath indexes (another Bitmapset). This will also need to work for
sub-partitions too, so this matrix must be some sort of tree that we
can descend into when we see that get_partitions_from_clauses returned
a bit for a sub-partition instead of a leaf-partition.
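
To make that a bit more concrete, such a translation node could look
something like this (names invented for illustration; this is not code
from the attached patch):

    /*
     * One of these per partitioned table in the tree.  For each partition
     * index that get_partitions_from_clauses() can return, subplan_map
     * gives the Append subpath index (or -1 if that partition was pruned
     * at plan time), and subpart_map gives the index of the child node to
     * descend into when the partition is itself partitioned (or -1).
     */
    typedef struct PartitionSubPathMap
    {
        int         nparts;        /* partitions at this level */
        int        *subplan_map;   /* partition index -> subpath index */
        int        *subpart_map;   /* partition index -> child map index */
    } PartitionSubPathMap;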

I bashed this idea around a bit and I came up with the attached. It's
very far from complete and in a very WIP state. I've not really done
anything to make the correct clause list available in nodeAppend.c
yet, but I think the code that's there is worthy of a look. I've not
done that much work on the new choose_next_subplan* functions in
nodeAppend.c. I just modified choose_next_subplan_locally to show how
this set of functions need to take into account the subnode bitmap set
of valid partitions to scan. Perhaps some special case is needed to
have these functions ignore the Bitmapset when runtime pruning is
disabled (perhaps a completely new set of the functions is needed to
support the selection of the next non-pruned partition). Although,
probably that can be debated a bit later as it's a fairly minor detail
for now.

My patch also lacks any means to extract the Params during
match_clauses_to_partkey(), or at least most of the cases. I've just
added 1 case there. I did this because I thought it was better to
extract the ParamIds rather than a bool to say we've matched params.
This way we can only reevaluate which subplans to look at on rescan of
an Append if and only if the params we actually care about have
changed. I've not given this part a huge amount of thought yet.
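
In other words, the rescan test becomes something like this (sketch
only; prune_param_ids, as_valid_subplans and
set_valid_runtime_subplans() are invented names):

    /*
     * In ExecReScanAppend(): only redo the pruning when a Param that the
     * pruning clauses actually use has changed.
     */
    if (node->ps.chgParam != NULL &&
        bms_overlap(node->ps.chgParam, node->prune_param_ids))
        node->as_valid_subplans = set_valid_runtime_subplans(node);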

I'm a little unsure where to go from here. Obviously, this is quite a
major rewrite of your patch. The parts that I've got missing likely
can use quite a bit of the stuff you've already written, but that
needs some review. I wanted to post this now as I know you're busy
working on this to rebase it on parallel Append and to also address
Robert's concerns, which all seem valid to me.

What's the best way for us to coordinate our efforts on this? Maybe
you could look at my patch and sanity check it to ensure I'm not
taking anything in the wrong direction?

I also think that MergeAppend needs similar work, but I think it's
best to get the Append case working first, or perhaps that's another
patch...

Please find attached my patch, which is based directly on top of
Amit's faster partition pruning v14 [1], which I patched against
4034db215b9

[1] https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi.

On 2017/12/16 15:05, David Rowley wrote:
> On 13 December 2017 at 00:33, Beena Emerson <memissemerson@gmail.com> wrote:
>> PFA the updated patch, this can be applied over the v13 patches [1]
>> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
> 
> Hi Beena,
> 
> Thanks for posting an updated patch.
> 
> I've been looking over this and I think that the use of the
> PartitionDispatch in set_append_subplan_indexes is not correct. What
> we need here is the index of the Append's subnode and that's not what
> RelationGetPartitionDispatchInfo() gives you. Remember that some
> partitions could have been pruned away already during planning.

A somewhat similar concern is being discussed on the "UPDATE partition
key" thread [1].  In that case, ExecInitModifyTable(), when initializing
tuple routing information to handle the "update partition key" case, will
have to deal with the fact that there might be fewer sub-plans in the
ModifyTable node than there are partitions in the partition tree.  That
is, source partitions that the planner would have determined after pruning
could be fewer than possible target partitions for rows from the source
partitions to move to, of which the latter consists of *all* partitions.
So, we have to have a mapping from leaf partition indexes as figured out
by RelationGetPartitionDispatchInfo() (indexes that are offsets into a
global array for *all* partitions), to sub-plan indexes which are offsets
into the array for only those partitions that have a sub-plan.  Such
mapping is built (per the latest patch on that thread) by
ExecSetupPartitionTupleRouting() in execPartition.c.

We could do something similar here using a similar code structure.  Maybe,
add a ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking
ExecSetupPartitionTupleRouting), that accepts AppendState node.
Furthermore, it might be a good idea to have something similar to
ExecFindPartition(), say, ExecGetPartitions().  That is, we have new
functions for run-time pruning that are counterparts to corresponding
functions for tuple routing.
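
Just to sketch the shape of that suggestion (hypothetical
declarations, mirroring the tuple-routing functions):

    /* set up the run-time pruning state when the Append node is initialized */
    extern void ExecSetupPartitionRuntimePruning(AppendState *appendstate);

    /* return the set of partitions surviving the run-time pruning clauses */
    extern Bitmapset *ExecGetPartitions(AppendState *appendstate);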

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/c5e1d4ad-d243-52c5-608b-5dbb7183e465%40lab.ntt.co.jp



Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2017/12/09 0:57, Robert Haas wrote:
> On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> I have added the partition quals that are used for pruning.
>>
>> PFA the updated patch. I have changed the names of variables to make
>> it more appropriate, along with adding more code comments and doing
>> some refactoring and other code cleanups.
> 
> - I am surprised that set_append_subplan_indexes() needs to worry
> about multi-level partitioning directly.  I would have thought that
> Amit's patch would take care of that, just returning a Bitmapset of
> indexes which this function could use directly.

Actually, the partition.c code that my patch adds is limited to considering
one partitioned table at a time, not the whole tree.  As of 0a480502b09
[1], we call set_append_rel_size() separately for each partitioned table
in a partition tree.  In each such call, we call partition.c to perform
partition pruning for the given partitioned table.

In the run-time pruning case, we should get, via Append, a list of pruning
clauses for each partitioned table in the tree that survived plan-time
pruning.  Then, just like ExecFindPartition() calls
get_partition_for_tuple() for each partitioned table until we get to a
leaf partition, we should call partition.c for each un-pruned partitioned
table that has run-time pruning clauses associated with it.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a480502b09



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 18 December 2017 at 21:31, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/12/16 15:05, David Rowley wrote:
>> I've been looking over this and I think that the use of the
>> PartitionDispatch in set_append_subplan_indexes is not correct. What
>> we need here is the index of the Append's subnode and that's not what
>> RelationGetPartitionDispatchInfo() gives you. Remember that some
>> partitions could have been pruned away already during planning.
>
> A somewhat similar concern is being discussed on the "UPDATE partition
> key" thread [1].  In that case, ExecInitModifyTable(), when initializing
> tuple routing information to handle the "update partition key" case, will
> have to deal with the fact that there might be fewer sub-plans in the
> ModifyTable node than there are partitions in the partition tree.  That
> is, source partitions that planner would have determined after pruning,
> could be fewer than possible target partitions for rows from the source
> partitions to move to, of which the latter consists of *all* partitions.
> So, we have to have a mapping from leaf partition indexes as figured out
> by RelationGetPartitionDispatchInfo() (indexes that are offsets into a
> global array for *all* partitions), to sub-plan indexes which are offsets
> into the array for only those partitions that have a sub-plan.  Such
> mapping is built (per the latest patch on that thread) by
> ExecSetupPartitionTupleRouting() in execPartition.c.

Surely this is a different problem? With UPDATE of a partition key, if
the planner eliminates all but 1 partition the UPDATE could cause that
tuple to be "moved" into any leaf partition, very possibly one that's
been eliminated during planning.

In the case of runtime Append pruning, we can forget about all
partitions that the planner managed to eliminate; we'll never need to
touch those, ever. All we care about here is trying to reduce the
number of partitions down further using values that were not available
during planning.

> We could do something similar here using a similar code structure.  Maybe,
> add a ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking
> ExecSetupPartitionTupleRouting), that accepts AppendState node.
> Furthermore, it might be a good idea to have something similar to
> ExecFindPartition(), say, ExecGetPartitions().  That is, we have new
> functions for run-time pruning that are counterparts to corresponding
> functions for tuple routing.

Seems to me that in this case we're better off building this structure during
planning and save it with the plan so that it can be used over and
over, rather than building it again and again each time the plan is
executed. Likely a common use case for run-time pruning is when the
plan is going to be used multiple times with different parameters, so
we really don't want to repeat any work that we don't have to here.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hi David,

Thank you for reviewing and looking at this.

I have attached the WIP patch which incorporates some of Robert's
comments and is rebased over Amit's v14 patch.

On Sat, Dec 16, 2017 at 11:35 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 13 December 2017 at 00:33, Beena Emerson <memissemerson@gmail.com> wrote:
>> PFA the updated patch, this can be applied over the v13 patches [1]
>> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
>
> Hi Beena,
>
> Thanks for posting an updated patch.
>
> I've been looking over this and I think that the use of the
> PartitionDispatch in set_append_subplan_indexes is not correct. What
> we need here is the index of the Append's subnode and that's not what
> RelationGetPartitionDispatchInfo() gives you. Remember that some
> partitions could have been pruned away already during planning.
>
> This quick example shows that the partition selection is not correct.
>
> create table p (a int, b int) partition by range (a);
>
> create table p_a_neg partition of p for values from (minvalue) to (0)
> partition by range (b);
> create table p_a_pos partition of p for values from (0) to (maxvalue)
> partition by range (b);
>
> create table p_a_neg_b_neg partition of p_a_neg for values from
> (minvalue) to (0);
> create table p_a_neg_b_pos partition of p_a_neg for values from (0) to
> (maxvalue);
>
> create table p_a_pos_b_neg partition of p_a_pos for values from
> (minvalue) to (0);
> create table p_a_pos_b_pos partition of p_a_pos for values from (0) to
> (maxvalue);
>
> prepare q1 (int, int) as select * from p where a = $1 and b = $1;
>
> explain analyze execute q1 (-1,-1); -- this works.
>                                                   QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..175.60 rows=4 width=8) (actual time=1.099..1.099
> rows=0 loops=1)
>    Runtime Partition Pruning: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (actual time=0.023..0.023 rows=0 loops=1)
>          Filter: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>          Filter: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>          Filter: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>          Filter: ((a = $1) AND (b = $1))
> (12 rows)
>
> explain analyze execute q1 (-1,1); -- should scan p_a_neg_b_pos, but does not.
>                                                   QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..175.60 rows=4 width=8) (actual
> time=758996.359..758996.359 rows=0 loops=1)
>    Runtime Partition Pruning: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_neg_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (actual time=0.056..0.056 rows=0 loops=1)
>          Filter: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_neg_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>          Filter: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_pos_b_neg  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>          Filter: ((a = $1) AND (b = $1))
>    ->  Seq Scan on p_a_pos_b_pos  (cost=0.00..43.90 rows=1 width=8)
> (never executed)
>          Filter: ((a = $1) AND (b = $1))
> (12 rows)
>
> So, I started to look at what the best way to put this right might be.
> I see that since Parallel Append was committed that the subnodes are
> now sorted in cost order inside create_append_path(), so likely we'll
> want to delay figuring out the subpath list indexes until after that's
> done since sorting would scramble our index arrays. We could simply
> look at the subpaths at the end of create_append_path() and create
> some sort of new matrix type that can accept the output of Amit's
> get_partitions_from_clauses() and translate that Bitmapset into the
> subpath indexes (another Bitmapset). This will also need to work for
> sub-partitions too, so this matrix must be some sort of tree that we
> can descend into when we see that get_partitions_from_clauses returned
> a bit for a sub-partition instead of a leaf-partition.

Yes, the change in sort order means that the current
append_paths_array cannot be used for Parallel Append and new logic
has to be devised. I have not thought about it yet, but your method
seems like a good way to go. Currently I have worked on the parallel
bit assuming that append_paths_array holds the correct
subplan_index.

>
> I bashed this idea around a bit and I came up with the attached. It's
> very far from complete and in a very WIP state. I've not really done
> anything to make the correct clause list available in nodeAppend.c
> yet, but I think the code that's there is worthy of a look. I've not
> done that much work on the new choose_next_subplan* functions in
> nodeAppend.c. I just modified choose_next_subplan_locally to show how
> this set of functions need to take into account the subnode bitmap set
> of valid partitions to scan. Perhaps some special case is needed to
> have these functions ignore the Bitmapset when runtime pruning is
> disabled (perhaps a completely new set of the functions is needed to
> support the selection of the next non-pruned partition). Although,
> probably that can be debated a bit later as it's a fairly minor detail
> for now.
>
> My patch also lacks any means to extract the Params during
> match_clauses_to_partkey(), or at least most of the cases. I've just
> added 1 case there. I did this because I thought it was better to
> extract the ParamIds rather than a bool to say we've matched params.
> This way we can only reevaluate which subplans to look at on rescan of
> an Append if and only if the params we actually care about have
> changed. I've not given this part a huge amount of thought yet.

Currently, during ReScan the patch prunes whenever any Param changes.
I had this 'rescan pruning optimization' in mind but had not worked on
it yet.

>
> I'm a little unsure where to go from here. Obviously, this is quite a
> major rewrite of your patch. The parts that I've got missing likely
> can use quite a bit of the stuff you've already written, but that
> needs some review. I wanted to post this now as I know you're busy
> working on this to rebase it on parallel Append and to also address
> Robert's concerns, which all seem valid to me.
>
> What's the best way for us to coordinate our efforts on this? Maybe
> you could look at my patch and sanity check it to ensure I'm not
> taking anything in the wrong direction?

I have not looked at the patch in depth but this approach seems good.
Dilip has already worked on the join equality bug you pointed out
earlier; I have yet to merge that patch and will address Robert's
optimizer comments. Maybe I can incorporate your patch while working
on it.

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Mon, Dec 18, 2017 at 4:03 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>
>> We could do something similar here using a similar code structure.  Maybe,
>> add a ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking
>> ExecSetupPartitionTupleRouting), that accepts AppendState node.
>> Furthermore, it might be a good idea to have something similar to
>> ExecFindPartition(), say, ExecGetPartitions().  That is, we have new
>> functions for run-time pruning that are counterparts to corresponding
>> functions for tuple routing.
>
> Seems to me in this case we're better to build this structure during
> planning and save it with the plan so that it can be used over and
> over, rather than building it again and again each time the plan is
> executed. Likely a common use case for run-time pruning is when the
> plan is going to be used multiple times with different parameters, so
> we really don't want to repeat any work that we don't have to here.
>

I agree. It would be better to avoid building the structure during execution.
PFA the updated patch.

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patch.

Hi Beena,

Thanks for updating the patch. I've merged a chunk of your latest
patch with what I was working on and cleaned up some mistakes that I
had made in my WIP version.

I've now done the work to make this work with parallel append, which
wasn't so difficult. The 3 functions which choose the next subplan
just needed to be made aware that they need to pay attention to the
Bitmapset which lists which subplans need to be scanned. I did have to
write a bms_prev_member() function to get the next lower set bit in a
Bitmapset to make this easier.  I think what I have is fairly clean.

I've also solved the problem we had discussed about only needing to
reselect the subplans during ExecReScanAppend() when a parameter
has changed that is actually used in run-time pruning. If it's not
used, we now no longer wastefully reselect the same matching subplans.

The patch still does need more work in match_clauses_to_partkey() to
ensure we extract all paramids from the clauses. Right now, I'm only
grabbing the paramids from an OpExpr. There's a bunch of other places
Params could be hidden in there. Anywhere Amit supports a Const, we
need to grab the Param Ids when it's a Param instead of a Const.
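
One generic way to collect them would be a small expression-tree
walker rather than handling each node type separately (a hypothetical
sketch, not part of the attached patch):

    typedef struct
    {
        Bitmapset  *paramids;
    } pull_paramids_context;

    static bool
    pull_paramids_walker(Node *node, pull_paramids_context *context)
    {
        if (node == NULL)
            return false;
        if (IsA(node, Param))
        {
            Param      *param = (Param *) node;

            /* remember every Param appearing anywhere in the clause */
            context->paramids = bms_add_member(context->paramids,
                                               param->paramid);
            return false;
        }
        return expression_tree_walker(node, pull_paramids_walker,
                                      (void *) context);
    }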

There are also a few things about the patch which I didn't change
around much, as I didn't want to alter Amit's v15 patch too much. He
is still working on it and I didn't want too many conflicts.
Basically, I had to make match_clauses_to_partkey() an external
function so that I could use it in createplan.c in create_append_plan()
in order to extract the Param Ids of any parameters in the
parameterized path clause. This function is currently in allpaths.c in
Amit's patch, but I feel that it does not really belong there. I'll
discuss this with Amit on the faster partition pruning thread.

The attached patch is pretty fresh. I've not given it a huge amount of
testing so far. I do know there's still work to do in order to make
cases like:

prepare q2 (int) as select * from p where a in(10000,20000,$1);
explain (costs off, analyze) execute q2 (45678);

work correctly. Fixing that requires a bit more invasion into the
faster partition pruning v15 patch which this is based on. I don't
really want to touch that too much just yet.

Here are some examples of what the patch does:

-- Test 1 Setup
-- This shows the patch working with multiple partition levels.

drop table if exists pt;
create table pt (a int, b int) partition by range (a);

create table pt_a_neg partition of pt for values from (minvalue) to
(0) partition by range (b);
create table pt_a_pos partition of pt for values from (0) to
(maxvalue) partition by range (b);

create table pt_a_neg_b_neg partition of pt_a_neg for values from
(minvalue) to (0);
create table pt_a_neg_b_pos partition of pt_a_neg for values from (0)
to (maxvalue);

create table pt_a_pos_b_neg partition of pt_a_pos for values from
(minvalue) to (0);
create table pt_a_pos_b_pos partition of pt_a_pos for values from (0)
to (maxvalue);

insert into pt select x,x from generate_series(-1000,1000) x;

analyze pt;

prepare q1 (int, int) as select * from pt where a = $1 and b = $2;

-- Test 1 Result (first 5 custom plan executions removed)

postgres=# explain (costs off, analyze) execute q1 (-10,10);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Append (actual time=0.007..0.007 rows=0 loops=1)
   ->  Seq Scan on pt_a_neg_b_neg (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_neg_b_pos (actual time=0.007..0.007 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_neg (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_pos (never executed)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.366 ms
 Execution time: 0.077 ms
(11 rows)

postgres=# explain (costs off, analyze) execute q1 (-10,-10);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Append (actual time=0.235..0.237 rows=1 loops=1)
   ->  Seq Scan on pt_a_neg_b_neg (actual time=0.234..0.236 rows=1 loops=1)
         Filter: ((a = $1) AND (b = $2))
         Rows Removed by Filter: 999
   ->  Seq Scan on pt_a_neg_b_pos (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_neg (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_pos (never executed)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.025 ms
 Execution time: 0.313 ms
(12 rows)

postgres=# explain (costs off, analyze) execute q1 (10,-10);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Append (actual time=0.014..0.014 rows=0 loops=1)
   ->  Seq Scan on pt_a_neg_b_neg (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_neg_b_pos (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_neg (actual time=0.013..0.013 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_pos (never executed)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.025 ms
 Execution time: 0.091 ms
(11 rows)

postgres=# explain (costs off, analyze) execute q1 (10,10);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Append (actual time=0.032..0.222 rows=1 loops=1)
   ->  Seq Scan on pt_a_neg_b_neg (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_neg_b_pos (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_neg (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on pt_a_pos_b_pos (actual time=0.031..0.221 rows=1 loops=1)
         Filter: ((a = $1) AND (b = $2))
         Rows Removed by Filter: 1000
 Planning time: 0.026 ms
 Execution time: 0.297 ms
(12 rows)

-- Test 2 Setup
-- This test shows the patch working with parameterized paths.  Note
-- that only subplans which actually have a value matching a value from
-- the outer side of the join get scanned.

create table p (a int not null) partition by range(a);
select 'create table p'||x|| ' partition of p for values from ('||x *
10000 || ') to (' || (x+1)*10000 || ');' from generate_Series(0,9)x;
\gexec

insert into p select generate_Series(0,99999);

select 'create index on p'||x||' (a)' from generate_Series(0,9)x;
\gexec

create table t (a int not null);

insert into t values(4),(5);

analyze p,t;
set enable_mergejoin=0;
explain (costs off, analyze) select * from t inner join p p on p.a=t.a;
insert into t values(45678);
explain (costs off, analyze) select * from t inner join p p on p.a=t.a;


-- Test 2 Result
postgres=# explain (costs off, analyze) select * from t inner join p p
on p.a=t.a;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.058..0.069 rows=2 loops=1)
   ->  Seq Scan on t (actual time=0.011..0.012 rows=2 loops=1)
   ->  Append (actual time=0.020..0.021 rows=1 loops=2)
         ->  Index Only Scan using p0_a_idx on p0 p (actual
time=0.019..0.020 rows=1 loops=2)
               Index Cond: (a = t.a)
               Heap Fetches: 2
         ->  Index Only Scan using p1_a_idx on p1 p_1 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p2_a_idx on p2 p_2 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p3_a_idx on p3 p_3 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p4_a_idx on p4 p_4 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p5_a_idx on p5 p_5 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p6_a_idx on p6 p_6 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p7_a_idx on p7 p_7 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p8_a_idx on p8 p_8 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p9_a_idx on p9 p_9 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
 Planning time: 1.080 ms
 Execution time: 0.315 ms
(35 rows)


postgres=# insert into t values(45678); -- add a record to show that
p4 gets scanned.
INSERT 0 1
postgres=# explain (costs off, analyze) select * from t inner join p p
on p.a=t.a;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.041..0.106 rows=3 loops=1)
   ->  Seq Scan on t (actual time=0.014..0.014 rows=3 loops=1)
   ->  Append (actual time=0.024..0.025 rows=1 loops=3)
         ->  Index Only Scan using p0_a_idx on p0 p (actual
time=0.010..0.011 rows=1 loops=2)
               Index Cond: (a = t.a)
               Heap Fetches: 2
         ->  Index Only Scan using p1_a_idx on p1 p_1 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p2_a_idx on p2 p_2 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p3_a_idx on p3 p_3 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p4_a_idx on p4 p_4 (actual
time=0.049..0.050 rows=1 loops=1)
               Index Cond: (a = t.a)
               Heap Fetches: 1
         ->  Index Only Scan using p5_a_idx on p5 p_5 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p6_a_idx on p6 p_6 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p7_a_idx on p7 p_7 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p8_a_idx on p8 p_8 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
         ->  Index Only Scan using p9_a_idx on p9 p_9 (never executed)
               Index Cond: (a = t.a)
               Heap Fetches: 0
 Planning time: 0.635 ms
 Execution time: 0.233 ms
(35 rows)


There is, however, still a fundamental problem with the patch, or
maybe the idea in general (this was also pointed out by Amit Langote
in an off-list discussion):

The problem is down to the logic in choose_custom_plan() only choosing
a generic plan if the average cost of the generic plan is less than
the average custom plan cost. The problem is that the generic plan can
have many extra Append subnodes in comparison to the custom plan, all
of which are taken into account in the total plan cost, but these may
be pruned during execution. The logic in choose_custom_plan() has no
idea about this.  I don't have any bright ideas on how to fix this
yet. Suppose a PREPAREd statement like the following comes along:

PREPARE q3 (int, int) AS SELECT * FROM partitioned_table WHERE partkey
BETWEEN $1 AND $2;

the run-time pruning may prune it down to no subplans, all subplans, or
any number in between. So we can't do anything like take the total
Append cost to be the highest costing of its subplans, and likely
using the average cost might not be a good idea either. It might work
sometimes, but likely won't be very stable. If this is not fixed then
choose_custom_plan() has a very low probability of choosing a generic
plan which has run-time partition pruning enabled, which in a way
defeats the purpose of this whole patch.
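
To make that concrete, the comparison driving the decision boils down to
something like the following (a simplified sketch only, not the actual
plancache.c code; details such as the initial five forced custom plans and
the planning-cost fudge factor are left out):

static bool
choose_custom_plan_sketch(double generic_cost,
                          double total_custom_cost,
                          int num_custom_plans)
{
    double      avg_custom_cost = total_custom_cost / num_custom_plans;

    /*
     * generic_cost still includes every Append subnode, even the ones
     * that run-time pruning would later skip, so this test is biased
     * against the generic plan.
     */
    if (generic_cost < avg_custom_cost)
        return false;           /* use the generic plan */

    return true;                /* build another custom plan */
}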

I'm a bit uncertain on the best way to resolve this. It needs to be
discussed here.

One more thing. The attached is not yet set up to work with
MergeAppend. It's likely just a small amount of additional work to
make this happen, so likely should be something that we do.

Anyway, I've attached the latest version of the patch. This is based
on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
apply to f94eec490

[1] https://www.postgresql.org/message-id/06cde8a5-0ac7-dcf5-ad66-1ca781623e0c@lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello David,


On Thu, Dec 21, 2017 at 2:31 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:

> The problem is down to the logic in choose_custom_plan() only choosing
> a generic plan if the average cost of the generic plan is less than
> the average custom plan cost. The problem is that the generic plan can
> have many extra Append subnodes in comparison to the custom plan, all
> of which are taken into account in the total plan cost, but these may
> be pruned during execution. The logic in choose_custom_plan() has no
> idea about this.  I don't have any bright ideas on how to fix this
> yet, as, suppose a PREPAREd statement like the following comes along:
>
> PREPARE q3 (int, int) AS SELECT * FROM partitioned_table WHERE partkey
> BETWEEN $1 AND $2;
>
> the run-time pruning may prune it down no subplans, all subplans, or
> any number in between. So we can't do anything like take the total
> Append cost to be the highest costing of its subplans, and likely
> using the average cost might not be a good idea either. It might work
> sometimes, but likely won't be very stable. If this is not fixed then
> choose_custom_plan() has a very low probability of choosing a generic
> plan which has run-time partition pruning enabled, which in a way
> defeats the purpose of this whole patch.
>
> I'm a bit uncertain on the best way to resolve this. It needs to be
> discussed here.

I had mentioned in the first mail on this thread that the generic
plan is always preferred, and Robert said that it is not in the scope of
this patch. Maybe we can start a new thread for this.

>
> One more thing. The attached is not yet set up to work with
> MergeAppend. It's likely just a small amount of additional work to
> make this happen, so likely should be something that we do.
>
> Anyway, I've attached the latest version of the patch. This is based
> on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
> apply to f94eec490

Thank you for working on this. I will look into this, merge it with
my current version of the patch and Amit's v16 patches, and post a new
patch soon.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 21 December 2017 at 23:51, Beena Emerson <memissemerson@gmail.com> wrote:
> On Thu, Dec 21, 2017 at 2:31 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:
>
>> The problem is down to the logic in choose_custom_plan() only choosing
>> a generic plan if the average cost of the generic plan is less than
>> the average custom plan cost. The problem is that the generic plan can
>> have many extra Append subnodes in comparison to the custom plan, all
>> of which are taken into account in the total plan cost, but these may
>> be pruned during execution. The logic in choose_custom_plan() has no
>> idea about this..
>
> I had mentioned this in the first mail on this thread that the generic
> plan is always preferred and Robert said that it is not in scope of
> this patch. Maybe we can start a new thread for this.

Sorry, I missed that. I knew you had modified choose_custom_plan()
too, but I didn't know the diagnosis of the problem had made its way
here yet.

I wonder if Robert understands the extent of the problem. The patch
will be useful to prune away partitions when the partitioned table is
on the inside of a parameterised nested loop join, but I think we'll
likely get some complaints about PREPARE statements always using a
custom plan.  It's going to make this pretty hard to write regression
tests for unless we went and invent some GUC "prefer_generic_plan", or
"generic_plan_cost_multiplier" to provide a way to coax
choose_custom_plan() into not choosing a custom plan.  I admit to not
having the answer to this but I don't think we should rule out
discussing possible fixes.

> Thank you for working on this. I  will look into this and merge with
> my current version of patch and Amit's v16 patches and post a new
> patch soon.

Okay. I've just rebased mine on top of Amit's v16 now and will send
shortly. I already took some code from your v7 patch to make my v2
work. Which parts do you think still need to be merged?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Thu, Dec 21, 2017 at 6:01 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 21 December 2017 at 23:51, Beena Emerson <memissemerson@gmail.com> wrote:
>> On Thu, Dec 21, 2017 at 2:31 PM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>> On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:
>>
>>> The problem is down to the logic in choose_custom_plan() only choosing
>>> a generic plan if the average cost of the generic plan is less than
>>> the average custom plan cost. The problem is that the generic plan can
>>> have many extra Append subnodes in comparison to the custom plan, all
>>> of which are taken into account in the total plan cost, but these may
>>> be pruned during execution. The logic in choose_custom_plan() has no
>>> idea about this..
>>
>> I had mentioned this in the first mail on this thread that the generic
>> plan is always preferred and Robert said that it is not in scope of
>> this patch. Maybe we can start a new thread for this.
>
> Sorry, I missed that. I knew you had modified choose_custom_plan()
> too, but I didn't know the diagnosis of the problem had made its way
> here yet.
>
> I wonder if Robert understands the extent of the problem. The patch
> will be useful to prune away partitions when the partitioned table is
> on the inside of a parameterised nested loop join, but I think we'll
> likely get some complaints about PREPARE statements always using a
> custom plan.  It's going to make this pretty hard to write regression
> tests for unless we went and invent some GUC "prefer_generic_plan", or
> "generic_plan_cost_multiplier" to provide a way to coax
> choose_custom_plan() into not choosing a custom plan.  I admit to not
> having the answer to this but I don't think we should rule out
> discussing possible fixes.

I have attached the regression tests I am currently using to check my
patch.  It has a few prepared statements using runtime pruning.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 21 December 2017 at 22:01, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached the latest version of the patch. This is based
> on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
> apply to f94eec490

Well, that went out of date pretty quickly. Amit has now posted v16 of
the faster partition pruning patch [1] which conflicts with my changes
in my v2 patch.

I've attached a new version of the patch to resolve these conflicts.

In v16 Amit added some code to eliminate LIST partitions when there
are not-equals (<>) expressions which cause all the items in the
partition's IN list to be unmatchable.

This appears to work fine with the run-time pruning in the attached, per:

create table ta (a int not null) partition by list (a);
create table ta1 partition of ta for values in(1,2);
create table ta2 partition of ta for values in(3,4);
explain select * from ta where a <> 1 and a <> 2;
create table ta_null partition of ta for values in(null);

prepare q1 (int) as select * from ta where a <> 1 and a <> $1;

explain (costs off, analyze) execute q1(2);
                           QUERY PLAN
-----------------------------------------------------------------
 Append (actual time=0.012..0.012 rows=0 loops=1)
   ->  Seq Scan on ta1 (never executed)
         Filter: ((a <> 1) AND (a <> $1))
   ->  Seq Scan on ta2 (actual time=0.012..0.012 rows=0 loops=1)
         Filter: ((a <> 1) AND (a <> $1))
 Planning time: 0.019 ms
 Execution time: 0.057 ms
(7 rows)


explain (costs off, analyze) execute q1(1);
                           QUERY PLAN
-----------------------------------------------------------------
 Append (actual time=0.017..0.017 rows=0 loops=1)
   ->  Seq Scan on ta1 (actual time=0.013..0.013 rows=0 loops=1)
         Filter: ((a <> 1) AND (a <> $1))
   ->  Seq Scan on ta2 (actual time=0.003..0.003 rows=0 loops=1)
         Filter: ((a <> 1) AND (a <> $1))
 Planning time: 0.021 ms
 Execution time: 0.068 ms
(7 rows)

[1] https://www.postgresql.org/message-id/5ebae4cf-8145-975c-ad75-16eb7f756f32%40lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 December 2017 at 01:43, Beena Emerson <memissemerson@gmail.com> wrote:
>> I wonder if Robert understands the extent of the problem. The patch
>> will be useful to prune away partitions when the partitioned table is
>> on the inside of a parameterised nested loop join, but I think we'll
>> likely get some complaints about PREPARE statements always using a
>> custom plan.  It's going to make this pretty hard to write regression
>> tests for unless we went and invent some GUC "prefer_generic_plan", or
>> "generic_plan_cost_multiplier" to provide a way to coax
>> choose_custom_plan() into not choosing a custom plan.  I admit to not
>> having the answer to this but I don't think we should rule out
>> discussing possible fixes.
>
> I have attached the regression tests I am currently using to check my
> patch.  It has few prepared statements using runtime pruning.

Thanks for sending.  Will the tests still pass if you remove the hack
from choose_custom_plan()?


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Thu, Dec 21, 2017 at 6:30 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 22 December 2017 at 01:43, Beena Emerson <memissemerson@gmail.com> wrote:
>>> I wonder if Robert understands the extent of the problem. The patch
>>> will be useful to prune away partitions when the partitioned table is
>>> on the inside of a parameterised nested loop join, but I think we'll
>>> likely get some complaints about PREPARE statements always using a
>>> custom plan.  It's going to make this pretty hard to write regression
>>> tests for unless we went and invent some GUC "prefer_generic_plan", or
>>> "generic_plan_cost_multiplier" to provide a way to coax
>>> choose_custom_plan() into not choosing a custom plan.  I admit to not
>>> having the answer to this but I don't think we should rule out
>>> discussing possible fixes.
>>
>> I have attached the regression tests I am currently using to check my
>> patch.  It has few prepared statements using runtime pruning.
>
> Thanks for sending.  Will the tests still pass if you remove the hack
> from choose_custom_plan()?
>

The hack has been removed in the v7 patch I sent, and yes, the tests work
without any additional adjustments.

Thank you,

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 December 2017 at 02:06, Beena Emerson <memissemerson@gmail.com> wrote:
>> Thanks for sending.  Will the tests still pass if you remove the hack
>> from choose_custom_plan()?
>>
>
> The hack has been removed in the v7 patch sent and yes the tests work
> without any additional adjustments.

Oh nice. I see you've included parameters to force the 5 custom plans
to have a higher cost, then for the first generic plan attempt,
you're choosing params that'll result in a cheaper plan. Great idea.
I'll revoke what I said about this being impossible to test now :)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Thu, Dec 21, 2017 at 6:26 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 21 December 2017 at 22:01, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've attached the latest version of the patch. This is based
>> on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
>> apply to f94eec490
>
> Well, that went out of date pretty quickly. Amit has now posted v16 of
> the faster partition pruning patch [1] which conflicts with my changes
> in my v2 patch.
>
> I've attached a new version of the patch to resolve these conflicts.

Thank you for the patch. I applied it over Amit's v16 patches on
commit 180428404.

I found that make check crashes, as does the following SQL you mentioned.

postgres=# create table ta (a int not null) partition by list (a);
CREATE TABLE
postgres=# create table ta1 partition of ta for values in(1,2);
CREATE TABLE
postgres=# create table ta2 partition of ta for values in(3,4);
CREATE TABLE
postgres=# explain select * from ta where a <> 1 and a <> 2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

It seems like we now have two different patches for the same feature. I
will post my version of the patch, which uses the struct
PartitionPruneInfo from your patch, and I will add the other
features you added, like optimizing the pruning rescan. I will try to
post the patch tomorrow.
If there are more suggestions, you can give them on that version; otherwise it
seems like we are duplicating effort.

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 December 2017 at 03:02, Beena Emerson <memissemerson@gmail.com> wrote:
> postgres=# create table ta (a int not null) partition by list (a);
> CREATE TABLE
> postgres=# create table ta1 partition of ta for values in(1,2);
> CREATE TABLE
> postgres=# create table ta2 partition of ta for values in(3,4);
> CREATE TABLE
> postgres=# explain select * from ta where a <> 1 and a <> 2;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

I just get "ERROR:  negative bitmapset member not allowed" here. I
reported that one to Amit over on the other thread. Not sure why
you're getting a crash. Can you get a backtrace?


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Dec 21, 2017 at 4:01 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> The problem is down to the logic in choose_custom_plan() only choosing
> a generic plan if the average cost of the generic plan is less than
> the average custom plan cost. The problem is that the generic plan can
> have many extra Append subnodes in comparison to the custom plan, all
> of which are taken into account in the total plan cost, but these may
> be pruned during execution. The logic in choose_custom_plan() has no
> idea about this.  I don't have any bright ideas on how to fix this
> yet, as, suppose a PREPAREd statement like the following comes along:
>
> PREPARE q3 (int, int) AS SELECT * FROM partitioned_table WHERE partkey
> BETWEEN $1 AND $2;
>
> the run-time pruning may prune it down no subplans, all subplans, or
> any number in between. So we can't do anything like take the total
> Append cost to be the highest costing of its subplans, and likely
> using the average cost might not be a good idea either.

Well, I do think we need to make some kind of estimate.  It may be a
bad estimate, but if we do nothing, we're estimating that no pruning
at all will happen, which is probably not right either.  I mean, if we
have...

PREPARE q3 (int, int) AS SELECT * FROM unpartitioned_table WHERE
partkey BETWEEN $1 AND $2;

...that has to decide whether to use an index.  And to do that it has
to estimate what fraction of the table will match the BETWEEN clause.
That may be an uninformed guess, but it guesses something.  We
probably want to do something here that makes the guess for a
partitioned_table similar to the guess for an unpartitioned_table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 December 2017 at 03:02, Beena Emerson <memissemerson@gmail.com> wrote:
> This seems like having two different patches for the same feature. I
> will post my version of the patch which uses the struct
> PartitionPruneInfo from your patch and I will add the other additional
> features you added like optimizing the pruning rescan. I will try and
> post the patch tomorrow.

I apologise for persisting in making these parallel efforts. I do have
time right now to dedicate to reviewing this patch, but that time is
running out. At the end of this time, I was really hoping that there
would be a patch that's worthy of being committed (or at least one
worthy of a committer's time). During my review of v5, because I found
the patch to still need quite a bit of work, I thought the best use of
that time was to make it work myself, which to my knowledge I have
done. Although I'm sure my patch will still have bugs, it appears to
me to be quite a bit further ahead than your v7 WIP patch.

> If there is more suggestions, you can give it over that; otherwise it
> seems like duplicating efforts.

Many of the things I did differently from you could be taken as suggestions.

There were a number of things in the v7 patch that were still not in a
workable state:

1. Using the PlannerInfo to record details about Append. How will this
work with a plan containing multiple Appends scanning partitioned
tables?
2. The use of the AppendState->subplan_indexes List. Please use a
Bitmapset to mark the valid subplans. Lists are not particularly
efficient for getting the nth item (see the sketch after this list).
3. Use of PlannerInfo to store details specific to a single
partitioned table in set_base_rel_sizes.
4. Use of a new PlannerInfo->join_clauses in set_rel_size(). How will
this work when there are multiple partitioned tables being scanned in
a single plan?
5. In match_clauses_to_partkey() you're using a new
PlannerInfo->baserestrictinfo_param_indexes List to store ParamIds.
How will this work when there are multiple partitioned tables being
scanned in a single plan? A Bitmapset would be a better choice to
store paramids in.
6. In set_append_rel_pathlist you're using more PlannerInfo members to
handle a specific Append rel. Again, how will it work for plans
scanning multiple different partitioned tables?
7. Your changes to get_appendrel_parampathinfo() ignore equivalence
join clauses; don't you need to look at these too? If so, it might be
worth just using get_baserel_parampathinfo()?
8. Lack of ability to detect if set_append_subplan_indexes() needs to
be called in ExecReScanAppend(). Some parameters that change might not
have an effect on which partitions to scan.
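
For point 2, roughly what I have in mind is something along these lines
(just an illustrative sketch; the function and variable names are made up
and not from either patch):

#include "postgres.h"
#include "nodes/bitmapset.h"

/*
 * Build the set of valid subplan indexes as a Bitmapset rather than an
 * integer List; the executor can then test bms_is_member(i, valid)
 * cheaply for each subplan instead of walking a List.
 */
static Bitmapset *
collect_valid_subplans(int nsubplans, const bool *subplan_matches)
{
    Bitmapset  *valid = NULL;
    int         i;

    for (i = 0; i < nsubplans; i++)
    {
        if (subplan_matches[i])
            valid = bms_add_member(valid, i);
    }

    return valid;
}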

If you go and find a new way to solve all those problems, then please
consider which one of us it is that's making the duplicate effort.

Again, I'm sorry that I have been standing on your toes with my work
here. I'm certainly not out to try to take any glory here. I just want
the patch to be in a working state and the time I have to do that is
fast running out.

Please consider my efforts as an offer of assistance rather than a
threat to your work.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 December 2017 at 12:45, Robert Haas <robertmhaas@gmail.com> wrote:
> PREPARE q3 (int, int) AS SELECT * FROM unpartitioned_table WHERE
> partkey BETWEEN $1 AND $2;
>
> ...that has to decide whether to use an index.  And to do that it has
> to estimate what fraction of the table will match the BETWEEN clause.
> That may be an uninformed guess, but it guesses something.  We
> probably want to do something here that makes the guess for a
> partitioned_table similar to the guess for an unpartitioned_table.

Are you considering some sort of clauselist_selectivity() estimate on
the given parameters and comparing that to the same selectivities that
were determined for the previous custom plans?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Dec 21, 2017 at 6:53 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 22 December 2017 at 12:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> PREPARE q3 (int, int) AS SELECT * FROM unpartitioned_table WHERE
>> partkey BETWEEN $1 AND $2;
>>
>> ...that has to decide whether to use an index.  And to do that it has
>> to estimate what fraction of the table will match the BETWEEN clause.
>> That may be an uninformed guess, but it guesses something.  We
>> probably want to do something here that makes the guess for a
>> partitioned_table similar to the guess for an unpartitioned_table.
>
> Are you considering some sort of clauselist_selectivity() estimate on
> the given parameters and comparing that to the same selectivities that
> were determined for the previous custom plans?

No, I don't think comparing to previous custom plans is a workable
approach.  I was thinking, rather, that if we know for example that
we're doing pruning on partition_column = $1, then we know that only
one partition will match.  That's probably a common case.  If we've
got partition_column > $1, we could assume that, say, 75% of the
partitions would match.  partition_column BETWEEN $1 and $2 is
probably a bit more selective, so maybe we assume 50% of the
partitions would match.
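
In other words, something roughly like this (purely illustrative; the
fractions and the clause classification are only guesses to make the
point, not a concrete proposal):

/* Guess what fraction of an Append's partitions will survive run-time
 * pruning, based only on the shape of the clause on the partition key. */
typedef enum PruneClauseShape
{
    PRUNE_EQUALITY,             /* partkey = $1 */
    PRUNE_INEQUALITY,           /* partkey > $1, partkey < $1, ... */
    PRUNE_RANGE,                /* partkey BETWEEN $1 AND $2 */
    PRUNE_NONE                  /* no usable clause */
} PruneClauseShape;

static double
estimated_matching_partitions(PruneClauseShape shape, int npartitions)
{
    switch (shape)
    {
        case PRUNE_EQUALITY:
            return 1.0;                 /* expect a single partition */
        case PRUNE_INEQUALITY:
            return 0.75 * npartitions;  /* assume 75% match */
        case PRUNE_RANGE:
            return 0.50 * npartitions;  /* assume 50% match */
        case PRUNE_NONE:
            break;
    }
    return (double) npartitions;        /* no pruning possible */
}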

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 December 2017 at 14:29, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 21, 2017 at 6:53 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Are you considering some sort of clauselist_selectivity() estimate on
>> the given parameters and comparing that to the same selectivities that
>> were determined for the previous custom plans?
>
> No, I don't think comparing to previous custom plans is a workable
> approach.  I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match.  That's probably a common case.  If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match.  partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.

Okay. Do you think this is something we need to solve for this patch?
When I complained originally I didn't quite see any way to even test
the majority of this patch with the regression tests, but Beena has
since proven me wrong about that.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Dec 21, 2017 at 8:37 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>> No, I don't think comparing to previous custom plans is a workable
>> approach.  I was thinking, rather, that if we know for example that
>> we've doing pruning on partition_column = $1, then we know that only
>> one partition will match.  That's probably a common case.  If we've
>> got partition_column > $1, we could assume that, say, 75% of the
>> partitions would match.  partition_column BETWEEN $1 and $2 is
>> probably a bit more selective, so maybe we assume 50% of the
>> partitions would match.
>
> Okay. Do you think this is something we need to solve for this patch?
> When I complained originally I didn't quite see any way to even test
> the majority of this patch with the regression tests, but Beena has
> since proven me wrong about that.

Although I have done one round of review of this patch, I haven't really
got my head around it completely yet and I haven't spent much time on it,
so my opinions are not as well-formed as maybe they should be.
I'm glad, by the way, that you are putting some effort into it, as I
think that will help move this forward more quickly.   At a high
level, I want to avoid trying to solve too many problems in one patch
(which was the motivation behind my comment near the top of the
thread), but I also want to end up with something useful (which I
think is your concern).

Leaving aside the difficulty of implementation, I have some questions
about what the right thing to do actually is.  In a simple case, I'm
guessing that the cost of creating a custom plan will exceed the
amount that the plan saves, but in more complex cases, I'm not sure
that will be true.  For instance, if we know the particular parameter
value at plan time, we can make a more accurate estimate of how many
times that value appears, which can then feed into our choice of what
plan shape to use.  That is, for a query like SELECT * FROM a JOIN b
ON a.x = b.x WHERE a.y = $1, the generic plan might choose, say, a
nested loop with b on the inner side, but if we know that a particular
value for $1 will match a lot of rows in a, we might prefer a hash or
merge join for that specific case.  Run-time pruning doesn't give us
that flexibility.  My intuition is that the more complex we make the
query, the more point there will be to making custom plans, and the
simpler the query, the more likely it is that a generic plan will be
good enough that it's not worth replanning every time.

Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and that
results in some pretty terrible behavior.  It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ.  It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option.  But this patch isn't going
to fix any of that.  The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning.  I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
Hi,

Please find attached my 4th version of this patch.

This is now based on v17 of Amit's faster partition pruning patch [1].
It also now includes Beena's tests, to which I've made some mostly
cosmetic changes.

I've also fixed a few bugs, one in a case where I was not properly
handling zero matching partitions in nodeAppend.c.

Another change I've made is to now perform the partition pruning at
run-time using a new memory context that's reset each time we
redetermine the matching partitions. This was required since we're
calling a planner function which might not be too careful about
pfreeing memory it allocates. A test case I was running before making
this change ended up failing to palloc memory due to OOM.
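
The pattern is roughly as follows (a sketch only; the struct and the
determine_matching_partitions() helper are placeholders rather than the
patch's actual names):

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "utils/memutils.h"

typedef struct PruneDemoState
{
    MemoryContext prune_context;    /* reset before each re-pruning */
    Bitmapset  *valid_subplans;
} PruneDemoState;

extern Bitmapset *determine_matching_partitions(PruneDemoState *state);

/* Done once, at node initialisation time. */
static void
prune_demo_init(PruneDemoState *state)
{
    state->prune_context =
        AllocSetContextCreate(CurrentMemoryContext,
                              "partition prune",
                              ALLOCSET_DEFAULT_SIZES);
    state->valid_subplans = NULL;
}

/* Done each time the matching partitions must be redetermined. */
static void
prune_demo_redetermine(PruneDemoState *state)
{
    MemoryContext oldcontext;

    /* Reclaim anything the previous pruning round did not pfree(). */
    MemoryContextReset(state->prune_context);

    oldcontext = MemoryContextSwitchTo(state->prune_context);
    state->valid_subplans = determine_matching_partitions(state);
    MemoryContextSwitchTo(oldcontext);
}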

I've not done anything about reducing the cost of the Append path when
runtime pruning is enabled. I'm still thinking over the best way to
handle that.

[1] https://www.postgresql.org/message-id/58c3e20a-a964-4fdb-4e7d-bd833e9bead1@lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello David,

On Wed, Dec 27, 2017 at 8:36 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Hi,
>
> Please find attached my 4th version this patch.

Thanks for the patch
>
> This is now based on v17 of Amit's faster partition pruning patch [1].
> It also now includes Beena's tests which I've done some mostly
> cosmetic changes to.
>
> I've also fixed a few bugs, one in a case where I was not properly
> handling zero matching partitions in nodeAppend.c.
>
> Another change I've made is to now perform the partition pruning at
> run-time using a new memory context that's reset each time we
> redetermine the matching partitions. This was required since we're
> calling a planner function which might not be too careful about
> pfreeing memory it allocates. A test case I was running before making
> this change ended out failing to palloc memory due to OOM.
>
> I've not done anything about reducing the cost of the Append path when
> runtime pruning is enabled. I'm still thinking over the best way to
> handle that.
>

I think you are testing without asserts.

The following assert fails: src/backend/optimizer/plan/setrefs.c :
set_plan_refs: ln 921
Assert(splan->plan.qual == NIL);
Append node now has runtime partition quals.

Also since the valid subplans are set in ExecInitAppend, the queries
with Init Plans do not work. I had moved it to ExecAppend in my patch
to handle the InitPlans as well.

DROP TABLE IF EXISTS prun_test_part;
CREATE TABLE prun_test_part (sal int) PARTITION BY RANGE(sal);
CREATE TABLE prun_test_part_p1 PARTITION OF prun_test_part FOR VALUES
FROM (0) TO (100);
CREATE TABLE prun_test_part_p2 PARTITION OF prun_test_part FOR VALUES
FROM (100) TO (200);
CREATE TABLE prun_test_part_p3 PARTITION OF prun_test_part FOR VALUES
FROM (200) TO (300);
CREATE TABLE prun_test_part_p4 PARTITION OF prun_test_part FOR VALUES
FROM (300) TO (400);

INSERT INTO prun_test_part VALUES (90), (100), (110), (200), (210),
(300), (310);
=# explain (analyze, costs off, summary off, timing off) SELECT * FROM
prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal =
200);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 1 January 2018 at 19:22, Beena Emerson <memissemerson@gmail.com> wrote:
> I think you are testing without asserts

Yeah, I was indeed. Oops.

> The following assert fails: src/backend/optimizer/plan/setrefs.c :
> set_plan_refs: ln 921
> Assert(splan->plan.qual == NIL);
> Append node now has runtime partition quals.
>
> Also since the valid subplans are set in ExecInitAppend, the queries
> with Init Plans do not work. I had moved it to ExecAppend in my patch
> to handle the InitPlans as well.

Thanks for noticing. I've now changed things around so this case works
as it should and I've added a test too.

I've attached an updated patch which also fixes a number of other
problems with my previous patch.

1. The Bitmapset I was using in nodeAppend.c to mark the valid
subplans was pretty bogus for Parallel Append since the memory for the
set was not in shared memory. I changed things around to reuse the
pa_finished[] array, and the patch now just sets pa_finished to true
for any invalid subplans.
2. I've added a new memory context to use in nodeAppend.c which is
used to call the planner code to determine which partitions are valid.
I'd been trying to have Amit be careful to pfree() everything in his
v17 patch, but I realised it was just not possible to get everything
pfree'd. I found it pretty easy to construct a test case which caused
an OOM.
3. I've added support for IN lists to be pruned when the IN() list
contains a parameter. The changes I made to support this case probably
mostly belong in Amit's faster partition pruning patch, but I've put
them here for now to get this case working. There's a bunch of new
tests to test this.
4. Various other cosmetic improvements.

The attached patch should be applied after patching master with Amit's
v17 faster partition pruning patch [1].

[1] https://www.postgresql.org/message-id/58c3e20a-a964-4fdb-4e7d-bd833e9bead1@lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
I tried this patch (applying it on Amit's last current version on top of
4e2970f8807f which is the latest it applies to) and regression tests
fail with the attached diff; in all cases it appears to be an off-by-one
in row count.  Would you please give it a look?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 5 January 2018 at 05:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I tried this patch (applying it on Amit's last current version on top of
> 4e2970f8807f which is the latest it applies to) and regression tests
> fail with the attached diff; in all cases it appears to be an off-by-one
> in row count.  Would you please give it a look?

Looks like it's down to ExplainPropertyFloat() having
machine-dependent behaviour.

On the machine that I was working with when testing this, the following
code outputs "1":

#include <stdio.h>

double nfiltered = 1.0;
double nloops = 2.0;

int main(void)
{
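/* 1.0 / 2.0 = 0.5 is a rounding tie at zero decimal places, so the
   printf output here is platform-dependent ("0" or "1") */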
printf("%.*f", 0, nfiltered / nloops);
return 0;
}

but on your machine it must be outputting "0"?

I'm not sure if I should just try to have even row numbers in the
tests, or if we should be fixing EXPLAIN so it is consistent.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 5 January 2018 at 05:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I tried this patch (applying it on Amit's last current version on top of
> 4e2970f8807f which is the latest it applies to) and regression tests
> fail with the attached diff; in all cases it appears to be an off-by-one
> in row count.  Would you please give it a look?

Thanks for testing. I've attached an updated patch which hopefully fixes this.

The only thing I did to fix it was to alter the tests a bit so that
the row counts in explain are evenly divisible by the nloops or
parallel workers. Looks like it was failing due to platform-dependent
behaviour in printf.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
David Rowley wrote:

> Looks like it's down to ExplainPropertyFloat() having
> machine-dependent behaviour.
> 
> On the machine that I was working with when testing this the following
> code outputs "1"
> [ sample code ]
> 
> but on your machine it must be outputting "0"?

Yeah, it does.  Thanks for updating --- I'll look at your patch
tomorrow.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Fri, Jan 5, 2018 at 6:24 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 5 January 2018 at 05:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> I tried this patch (applying it on Amit's last current version on top of
>> 4e2970f8807f which is the latest it applies to) and regression tests
>> fail with the attached diff; in all cases it appears to be an off-by-one
>> in row count.  Would you please give it a look?
>
> Thanks for testing. I've attached an updated patch which hopefully fixes this.
>
> I've only thing I did to fix it was to alter the tests a bit so that
> the row counts in explain are evenly divisible by the nloops or
> parallel workers. Looks like it was failing due to platform dependent
> behaviour in printf.
>

It does not handle a change in column order (varattno) in subpartitions.

In the following case, a2 has a different column order:
drop table ab_c;
create table ab_c (a int not null, b int) partition by list(a);

--a2 with different col order
  create table abc_a2 (b int, a int not null) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in (1);
  create table abc_a2_b2 partition of abc_a2 for values in (2);
  create table abc_a2_b3 partition of abc_a2 for values in (3);
  alter table ab_c attach partition abc_a2 for values in (2);

--a1 and a3 with same col order as the parent
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in (1);
  create table abc_a1_b2 partition of abc_a1 for values in (2);
  create table abc_a1_b3 partition of abc_a1 for values in (3);
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in (1);
  create table abc_a3_b2 partition of abc_a3 for values in (2);
  create table abc_a3_b3 partition of abc_a3 for values in (3);

  deallocate abc_q1;
  prepare abc_q1 (int, int, int) as select * from ab_c where a BETWEEN
$1 and $2 AND b <= $3;

--optimizer pruning
explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 3, 1);
                      QUERY PLAN
------------------------------------------------------
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 3) AND (b <= 1))
   ->  Seq Scan on abc_a2_b1 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 3) AND (b <= 1))
   ->  Seq Scan on abc_a3_b1 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 3) AND (b <= 1))
(7 rows)

--runtime pruning after 5 runs
explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 3, 1);
                       QUERY PLAN
---------------------------------------------------------
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b1 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b2 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b1 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
(19 rows)

As seen, partition a2 is not pruned like the other 2 subpartitions, a1 and a3.
-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 6 January 2018 at 07:31, Beena Emerson <memissemerson@gmail.com> wrote:
> It does not handle change in column order (varattno) in subpartitions.

Thanks for testing and finding that. I completely overlooked applying
translation of the prune qual so that it's compatible with the
sub-partition.

I've fixed this in the attached, but I did so by calling
adjust_appendrel_attrs() from nodeAppend.c, which did, of course,
mean that the AppendRelInfo needed to be given to the executor. I was
also a bit unsure what exactly I should be doing in primnodes.h, since
I've put PartitionPruneInfo in there, but AppendRelInfo is not. I
stuck a quick declaration of AppendRelInfo in primnodes.h with an XXX
comment so we don't forget to think about that again.

In all honesty, calling planner code from the executor like this seems like
quite new ground for PostgreSQL, so I'm really not sure if we're
breaking any major rules or not with what we've got now. Perhaps the saving
grace here is that we're not teaching the executor how to do anything
smart with these data structures; they're just given to it to pass
back to the planner function at the appropriate moment.

I've also borrowed and simplified your test case to ensure this fix
remains working. I added another level of partitioning with another
partition that has the columns in a different order again. This is to
ensure the translation code translates from the quals of the previous
level up, not the top-level. That's required since the AppendRelInfo
is only translating 1 level at a time.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 January 2018 at 00:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've fixed this in the attached, but I did so by calling
> adjust_appendrel_attrs() from the nodeAppend.c, which did, of course,
> mean that the AppendRelInfo needed to be given to the executor. I was
> also a bit unsure what exactly I should be doing in primnodes.h, since
> I've put PartitionPruneInfo in there, but AppendRelInfo is not. I
> stuck a quick declaration of AppendRelInfo in primnode.h with an XXX
> comment so we don't forget to think about that again.

Actually, this was not a very smart fix for the problem. It seems much
better to make the prune qual part of PartitionPruneInfo and just have
the planner translate the qual to what's required for the partition
that the PartitionPruneInfo belongs to. This means we no longer need
to use the Append's qual to store the prune qual and that all the
pruning information for one partition is now neatly in a single
struct.

I've attached a patch which does things like this.
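
In rough outline, the idea is something like this (the field names here are
purely illustrative, not the actual definition in the patch):

#include "postgres.h"
#include "nodes/pg_list.h"

typedef struct PartitionPruneInfoSketch
{
    Index       rti;            /* RT index of this partitioned table */
    List       *prunequal;      /* pruning clauses, already translated by
                                 * the planner to this table's attnums */
    /* ... plus whatever partition key and subplan mapping details the
     * executor needs to perform the pruning */
} PartitionPruneInfoSketch;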

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Sun, Jan 7, 2018 at 5:31 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 January 2018 at 00:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've fixed this in the attached, but I did so by calling
>> adjust_appendrel_attrs() from the nodeAppend.c, which did, of course,
>> mean that the AppendRelInfo needed to be given to the executor. I was
>> also a bit unsure what exactly I should be doing in primnodes.h, since
>> I've put PartitionPruneInfo in there, but AppendRelInfo is not. I
>> stuck a quick declaration of AppendRelInfo in primnode.h with an XXX
>> comment so we don't forget to think about that again.
>
> Actually, this was not a very smart fix for the problem. It seems much
> better to make the prune qual part of PartitionPruneInfo and just have
> the planner translate the qual to what's required for the partition
> that the PartitionPruneInfo belongs to. This means we no longer need
> to use the Append's qual to store the prune qual and that all the
> pruning information for one partition is now neatly in a single
> struct.
>
> I've attached a patch which does things like this.


The pruning does not work well with the char type:

Case: A subpartition has a different column order and the subpartitioning
column is of type char.

drop table ab_c;
 create table ab_c (a int not null, b char) partition by list(a);
  create table abc_a2 (b char, a int not null) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in ('1');
  create table abc_a2_b2 partition of abc_a2 for values in ('2');
  create table abc_a2_b3 partition of abc_a2 for values in ('3');
  alter table ab_c attach partition abc_a2 for values in (2);
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in ('1');
  create table abc_a1_b2 partition of abc_a1 for values in ('2');
  create table abc_a1_b3 partition of abc_a1 for values in ('3');
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in ('1');
  create table abc_a3_b2 partition of abc_a3 for values in ('2');
  create table abc_a3_b3 partition of abc_a3 for values in ('3');
  deallocate abc_q1;

INSERT INTO ab_c VALUES (1,'1'), (1,'2'), (1,'3');
INSERT INTO ab_c VALUES (2,'1'), (2,'2'), (2,'3');
INSERT INTO ab_c VALUES (3,'1'), (3,'2'), (3,'3');

prepare abc_q1 (int, int, char) as select * from ab_c where a BETWEEN
$1 and $2 AND b <= $3;

--after 5 runs: abc_a2_b3 is not pruned.

# explain (analyze, costs off, summary off, timing off) execute abc_q1
(1, 2, '2');
                       QUERY PLAN
---------------------------------------------------------
 Append (actual rows=4 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b1 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b2 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
         Rows Removed by Filter: 1
   ->  Seq Scan on abc_a3_b1 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
(20 rows)


Case 2: Case with optimizer pruning
drop table ab_c;
 create table ab_c (a int not null, b int) partition by list(a);
 create table abc_a2 (b int, a int not null) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in (1);
  create table abc_a2_b2 partition of abc_a2 for values in (2);
  create table abc_a2_b3 partition of abc_a2 for values in (3);
  alter table ab_c attach partition abc_a2 for values in (2);
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in (1);
  create table abc_a1_b2 partition of abc_a1 for values in (2);
  create table abc_a1_b3 partition of abc_a1 for values in (3);
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in (1);
  create table abc_a3_b2 partition of abc_a3 for values in (2);
  create table abc_a3_b3 partition of abc_a3 for values in (3);
  deallocate abc_q1;


=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Append (actual rows=2 loops=1)
   ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a1_b3 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a2_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a2_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a3_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a3_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
(13 rows)

postgres=# explain (analyze, costs off, summary off, timing off)
execute abc_q1 (1, 2);
ERROR:  partition missing from Append subplans

-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hi,

The mail was accidentally sent before I could complete it.

On Tue, Jan 9, 2018 at 2:24 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> The pruning does not work well with char type:
>
> Case 2: Case with optimizer pruning
> drop table ab_c;
>  create table ab_c (a int not null, b int) partition by list(a);
>  create table abc_a2 (b int, a int not null) partition by list(b);
>   create table abc_a2_b1 partition of abc_a2 for values in (1);
>   create table abc_a2_b2 partition of abc_a2 for values in (2);
>   create table abc_a2_b3 partition of abc_a2 for values in (3);
>   alter table ab_c attach partition abc_a2 for values in (2);
>   create table abc_a1 partition of ab_c for values in(1) partition by list (b);
>   create table abc_a1_b1 partition of abc_a1 for values in (1);
>   create table abc_a1_b2 partition of abc_a1 for values in (2);
>   create table abc_a1_b3 partition of abc_a1 for values in (3);
>   create table abc_a3 partition of ab_c for values in(3) partition by list (b);
>   create table abc_a3_b1 partition of abc_a3 for values in (1);
>   create table abc_a3_b2 partition of abc_a3 for values in (2);
>   create table abc_a3_b3 partition of abc_a3 for values in (3);
>   deallocate abc_q1;

The prepared statement was missing from the previous mail:
 prepare abc_q1 (int, int) as select a,b from ab_c where a BETWEEN $1
and $2 AND b IN (3, 2);
>
>
> =# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Append (actual rows=2 loops=1)
>    ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a1_b3 (actual rows=1 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a2_b2 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a2_b3 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a3_b2 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a3_b3 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
> (13 rows)
>
> postgres=# explain (analyze, costs off, summary off, timing off)
> execute abc_q1 (1, 2);
> ERROR:  partition missing from Append subplans

These work fine when the column order of the subpartitions is not changed.

Case 3: Optimizer pruning with char types:
Same as case 1, but with all subpartitions having the same column order as the parent.

drop table ab_c;
 create table ab_c (a int not null, b char) partition by list(a);
 create table abc_a2 ( a int not null, b char) partition by list(b);
  create table abc_a2_b1 partition of abc_a2 for values in ('1');
  create table abc_a2_b2 partition of abc_a2 for values in ('2');
  create table abc_a2_b3 partition of abc_a2 for values in ('3');
  alter table ab_c attach partition abc_a2 for values in (2);
  create table abc_a1 partition of ab_c for values in(1) partition by list (b);
  create table abc_a1_b1 partition of abc_a1 for values in ('1');
  create table abc_a1_b2 partition of abc_a1 for values in ('2');
  create table abc_a1_b3 partition of abc_a1 for values in ('3');
  create table abc_a3 partition of ab_c for values in(3) partition by list (b);
  create table abc_a3_b1 partition of abc_a3 for values in ('1');
  create table abc_a3_b2 partition of abc_a3 for values in ('2');
  create table abc_a3_b3 partition of abc_a3 for values in ('3');

deallocate abc_q1;
  prepare abc_q1 (int, int) as select a,b from ab_c where a BETWEEN $1
and $2 AND b IN ('3', '2');

-- before runtime pruning
=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 8);
                                QUERY PLAN
---------------------------------------------------------------------------
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b2 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a1_b3 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a2_b2 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a3_b2 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a3_b3 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
(13 rows)

-- after 5 runs

=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
ERROR:  operator 1057 is not a member of opfamily 1976



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 9 January 2018 at 22:22, Beena Emerson <memissemerson@gmail.com> wrote:
> ERROR:  operator 1057 is not a member of opfamily 1976

Thanks for finding these. I'm looking into the above, and the other
ones you've mentioned now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On Tue, Jan 9, 2018 at 2:24 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> prepare abc_q1 (int, int, char) as select * from ab_c where a BETWEEN
> $1 and $2 AND b <= $3;
>
> --after 5 runs: abc_a2_b3 is not pruned.

This seems to be down to an existing bug. I'm not yet sure if it's
caused by faster_partition_prune_v17, or if it exists in master.
Basically, RelOptInfo->partition_rels can contain duplicate entries for
relations. In your example, while debugging make_partition_pruneinfo, I
see:

list_nth_int(best_path->partitioned_rels,0)
1
list_nth_int(best_path->partitioned_rels,1)
3
list_nth_int(best_path->partitioned_rels,2)
8
list_nth_int(best_path->partitioned_rels,3)
13
list_nth_int(best_path->partitioned_rels,4)
3
list_nth_int(best_path->partitioned_rels,5)
8
list_nth_int(best_path->partitioned_rels,6)
13

There should only be 4 items in this list, not 7.

make_partition_pruneinfo might have been a bit naive to assume this
couldn't happen, so I've coded it to be a bit more resilient to this
happening. It'll still end up creating another sub-PartitionPruneInfo
and slotting into the same place, but it'll no longer attempt to
translate the prunequals twice... which was what was causing the
problem. I'd been a bit sloppy and assigned the output of
adjust_appendrel_attrs() back to the prunequals which is a parameter
to the function instead of assigning to a local variable like I've
done now.


On 9 January 2018 at 22:22, Beena Emerson <memissemerson@gmail.com> wrote:
>> postgres=# explain (analyze, costs off, summary off, timing off)
>> execute abc_q1 (1, 2);
>> ERROR:  partition missing from Append subplans

This also seems to be fixed by the above fix.

> =# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
> ERROR:  operator 1057 is not a member of opfamily 1976

This seems to be broken in faster_partition_prune_v17 where in
classify_partition_bounding_keys() the code properly checks if the
clause matches the partition key for OpExpr, but fails to do the same
for ScalarArrayOpExpr. I'll report to Amit on the thread for that
patch.

I'll also investigate the duplication in RelOptInfo->partition_rels
and report that in another thread.

Can you confirm that case 1 and 2 are working with the attached?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
Hi,

I've attached an updated patch, now at v10. v9 was short-lived due to
the evolution of Amit's patch, which this is based on.

This version is based on Amit's v27 of faster partition pruning [1]
which can be applied atop of ad7dbee36.

I've done a self review of this, but I've not yet done any final
polishing work as the faster partition pruning patch is still
evolving. I will, for example, likely need to do some more work in
nodeAppend.c to add a few more comments and probably think of better
names for a few new things that have made a first appearance in this
version of the patch

[1] https://www.postgresql.org/message-id/520f8a71-286d-e36d-34cf-363fd74366e1@lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Rajkumar Raghuwanshi
Date:
On Sat, Feb 17, 2018 at 2:27 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
Hi,

I've attached an updated patch, now at v10. v9 was short-lived due to
the evolution of Amit's patch, which this is based on.

This version is based on Amit's v27 of faster partition pruning [1]
which can be applied atop of ad7dbee36.

Hi,

I have applied the v10 patch on Amit's v27 over head ad7dbee36, and I got "ERROR:  partition missing from Append subplans" with the patch. On head, and with only Amit's patches applied, the query works fine. Please find the test case below.

CREATE TABLE part ( c1 INT2, c2 DATE) PARTITION BY RANGE (c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (0) TO (141) PARTITION BY RANGE(c2);
CREATE TABLE part_p11 PARTITION OF part_p1 FOR VALUES FROM ('1/1/1997') TO ('2/1/1999');
CREATE TABLE part_p12 PARTITION OF part_p1 FOR VALUES FROM ('2/1/1999') TO ('2/1/2000');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (141) TO (211) PARTITION BY RANGE(c2);
CREATE TABLE part_p21 PARTITION OF part_p2 FOR VALUES FROM ('1/1/2000') TO ('2/1/2001');
CREATE TABLE part_p22 PARTITION OF part_p2 FOR VALUES FROM ('2/1/2001') TO ('2/1/2006');

INSERT INTO part VALUES (100,'1/1/1999');
INSERT INTO part VALUES (110,'1/1/1998');
INSERT INTO part VALUES (130,'1/1/2000');
INSERT INTO part VALUES (170,'1/1/2000');
INSERT INTO part VALUES (180,'1/1/2001');
INSERT INTO part VALUES (190,'1/1/2006');
INSERT INTO part VALUES (200,'1/1/2000');
INSERT INTO part VALUES (210,'1/1/2002');

postgres=# PREPARE RTP AS SELECT * FROM PART WHERE c2 BETWEEN '1/1/1998' AND '1/1/1999';
PREPARE
postgres=# EXPLAIN execute RTP;
                                 QUERY PLAN                                 
-----------------------------------------------------------------------------
 Append  (cost=0.00..46.00 rows=12 width=6)
   ->  Seq Scan on part_p11  (cost=0.00..46.00 rows=12 width=6)
         Filter: ((c2 >= '1998-01-01'::date) AND (c2 <= '1999-01-01'::date))
(3 rows)

postgres=# execute RTP;
ERROR:  partition missing from Append subplans

deallocate RTP;
DROP TABLE part;

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
 

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 20 February 2018 at 23:46, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> I have applied the v10 patch on Amit's v27 over head ad7dbee36, and I got
> "ERROR: partition missing from Append subplans" with the patch. On head,
> and with only Amit's patches applied, the query works fine. Please find
> the test case below.

Thanks for the test case. I can recreate locally. This is down to the
fact that make_partition_pruneinfo() only makes sub-PartitionPruneInfo
for actual subpaths found in the Append. Your test case happens to
match both the part_p1 and part_p2 partitions on the first level of
iteration, but since no child of part_p2 is found in
make_partition_pruneinfo, that element in the subpartindex never gets
set.

The fix might be to just remove the error and silently ignore those
cases, but I was hoping to keep that around as it might catch other
bugs. I'm just not sure yet how to do both.

I'll rebase this on Amit's latest patch and have a think about it
while doing that.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 21 February 2018 at 13:44, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 20 February 2018 at 23:46, Rajkumar Raghuwanshi
> <rajkumar.raghuwanshi@enterprisedb.com> wrote:
>> I have applied the v10 patch on Amit's v27 over head ad7dbee36, and I got
>> "ERROR: partition missing from Append subplans" with the patch. On head,
>> and with only Amit's patches applied, the query works fine. Please find
>> the test case below.
>
> Thanks for the test case. I can recreate locally. This is down to the
> fact that make_partition_pruneinfo() only makes sub-PartitionPruneInfo
> for actual subpaths found in the Append. Your test case happens to
> match both the part_p1 and part_p2 partitions on the first level of
> iteration, but since no child of part_p2 is found in
> make_partition_pruneinfo, that element in the subpartindex never gets
> set.
>
> The fix might be to just remove the error and silently ignore those
> cases, but I was hoping to keep that around as it might catch other
> bugs. I'm just not sure yet how to do both.
 >
> I'll rebase this on Amit's latest patch and have a think about it
> while doing that.

I ended up fixing this another way.  The patch currently builds a
PartitionPruneInfo and includes it in the Append node for any Append
which belongs to a partitioned table with a non-empty set of clauses to
attempt to use for pruning.  Currently, at no point in planning does the
patch verify those quals to see if they'd be of any use for partition
pruning during execution.  Really, to be of any use they would need to
contain at least one Param in a clause which matches a partitioned key of
that partition or some subpartition thereof.

At the moment these clauses are only verified during the first call of
set_valid_runtime_subplans() which is called the first time the
choose_next_subplan() function is called in nodeAppend.c.   It would
be possible to determine this during planning, but it would mean doing
an extra call of generate_partition_clauses(), of which most of the
work would be thrown away, as we'd only really want to know if any
Params match the partition key.  I thought it was best to do this
during execution as then we can actually make full use of the work
done in that function and cache the result for reuse each time we need
to redetermine the newly matching subplans when any param matching a
partition key changes.  I don't entirely think this is perfect, but
since we can't send the PartitionClauseInfo over to the executor from
the planner it seems like the best option.

The way I fixed the reported error was to cache all the subplans for
the partition at the current hierarchy level and if the quals don't
contain any Params matching the partition key, then we can safely
assume that all of the subplans for that partition must match. The
planner should have pruned any unneeded partitions for this
partitioned table since there are no Params, which would only be known
at run-time.  Doing it this way allowed me to keep the sanity check
that alerted you to find this bug in the first place.
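
To make that concrete, here's a minimal made-up example: the Param matches
the sub-partition key but not the top-level key, so at the top level all
subplans are assumed to match and are kept, while the sub-partition level
can still prune on the Param at run-time:

create table rt_meas (a int, b int) partition by range (a);
create table rt_meas_a1 partition of rt_meas for values from (0) to (10)
  partition by range (b);
create table rt_meas_a1_b1 partition of rt_meas_a1 for values from (0) to (10);
create table rt_meas_a1_b2 partition of rt_meas_a1 for values from (10) to (20);
create table rt_meas_a2 partition of rt_meas for values from (10) to (20);

-- $1 is compared to b, the key of rt_meas_a1, but never to a, the key of rt_meas
prepare rt_q (int) as select * from rt_meas where b = $1;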

Other things I don't particularly like about the current patch are how
I had to construct the partition key expressions in
set_valid_runtime_subplans_recurse(). This pretty much uses code
borrowed from set_baserel_partition_key_exprs(). One way around that
would be to change the partprune.c code to deal with the
partkey->partattrs and consume an expression from the list on attnum =
0. I didn't do that as I want to minimise how much I touch Amit's
patch before it gets committed as doing so would likely just cause
more headaches for me keeping this merged with his work. Another
option to resolve this would be to put the partition key expressions
into the PartitionPruneInfo.

I've attached v11 of the patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Rajkumar Raghuwanshi
Date:
On Wed, Feb 21, 2018 at 2:36 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
I've attached v11 of the patch.

Hi,

I have applied attached patch on head "6f1d723b6359507ef55a81617167507bc25e3e2b" over Amit's v30 patches. while testing further I got a server crash with below test case. Please take a look.

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0, 599, 2) i;
CREATE INDEX iprt1_p1_a on prt1_p1(a);
CREATE INDEX iprt1_p2_a on prt1_p2(a);
CREATE INDEX iprt1_p3_a on prt1_p3(a);
ANALYZE prt1;

CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (250) TO (500);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (500) TO (600);
INSERT INTO prt2 SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0, 599, 3) i;
CREATE INDEX iprt2_p1_b on prt2_p1(b);
CREATE INDEX iprt2_p2_b on prt2_p2(b);
CREATE INDEX iprt2_p3_b on prt2_p3(b);
ANALYZE prt2;

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0003', '0004', '0010');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0001', '0005', '0002', '0009');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN ('0006', '0007', '0008', '0011');
INSERT INTO plt1 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 599, 2) i;
CREATE INDEX iplt1_p1_c on plt1_p1(c);
CREATE INDEX iplt1_p2_c on plt1_p2(c);
CREATE INDEX iplt1_p3_c on plt1_p3(c);
ANALYZE plt1;

CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0000', '0003', '0004', '0010');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0001', '0005', '0002', '0009');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0006', '0007', '0008', '0011');
INSERT INTO plt2 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 599, 3) i;
CREATE INDEX iplt2_p1_c on plt2_p1(c);
CREATE INDEX iplt2_p2_c on plt2_p2(c);
CREATE INDEX iplt2_p3_c on plt2_p3(c);
ANALYZE plt2;

select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 t1,prt2 t2 where t1.a=t2.b)
and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

/*
postgres=# select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 t1,prt2 t2 where t1.a=t2.b)
postgres-# and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
*/

stack-trace give below :

/*
(gdb) bt
#0  0x00000000006ce6dc in ExecEvalParamExec (state=0x26e9ee0, op=0x26e9f78, econtext=0x26ea390) at execExprInterp.c:2222
#1  0x00000000006cc66a in ExecInterpExpr (state=0x26e9ee0, econtext=0x26ea390, isnull=0x7ffe0f75d77f "") at execExprInterp.c:1024
#2  0x00000000006cdd8c in ExecInterpExprStillValid (state=0x26e9ee0, econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at execExprInterp.c:1819
#3  0x00000000007db078 in ExecEvalExprSwitchContext (state=0x26e9ee0, econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at ../../../../src/include/executor/executor.h:305
#4  0x00000000007e2072 in evaluate_expr (expr=0x26a3cb0, result_type=25, result_typmod=-1, result_collation=0) at clauses.c:4890
#5  0x00000000007e588a in partkey_datum_from_expr (context=0x26d3180, parttypid=25, expr=0x26a3cb0, value=0x7ffe0f75da00) at partprune.c:1504
#6  0x00000000007e5243 in extract_bounding_datums (context=0x26d3180, minimalclauses=0x7ffe0f75d900, keys=0x7ffe0f75da00) at partprune.c:1307
#7  0x00000000007e377d in get_partitions_from_clauses (context=0x26d3180) at partprune.c:273
#8  0x00000000006ea2ec in set_valid_runtime_subplans_recurse (node=0x269bf90, pinfo=0x7f6cf6765cf0, ctxcache=0x26d3158, validsubplans=0x7ffe0f75de10) at nodeAppend.c:771
#9  0x00000000006e9ebf in set_valid_runtime_subplans (node=0x269bf90) at nodeAppend.c:640
#10 0x00000000006e99b5 in choose_next_subplan_locally (node=0x269bf90) at nodeAppend.c:426
#11 0x00000000006e9598 in ExecAppend (pstate=0x269bf90) at nodeAppend.c:224
#12 0x00000000006deb3a in ExecProcNodeFirst (node=0x269bf90) at execProcnode.c:446
#13 0x00000000006fb9ee in ExecProcNode (node=0x269bf90) at ../../../src/include/executor/executor.h:239
#14 0x00000000006fbcc4 in ExecHashJoinImpl (pstate=0x2697808, parallel=0 '\000') at nodeHashjoin.c:262
#15 0x00000000006fc3fd in ExecHashJoin (pstate=0x2697808) at nodeHashjoin.c:565
#16 0x00000000006deb3a in ExecProcNodeFirst (node=0x2697808) at execProcnode.c:446
#17 0x000000000070c376 in ExecProcNode (node=0x2697808) at ../../../src/include/executor/executor.h:239
#18 0x000000000070c70e in ExecNestLoop (pstate=0x262c0a0) at nodeNestloop.c:160
#19 0x00000000006deb3a in ExecProcNodeFirst (node=0x262c0a0) at execProcnode.c:446
#20 0x00000000006fb9ee in ExecProcNode (node=0x262c0a0) at ../../../src/include/executor/executor.h:239
#21 0x00000000006fbcc4 in ExecHashJoinImpl (pstate=0x262bec8, parallel=0 '\000') at nodeHashjoin.c:262
#22 0x00000000006fc3fd in ExecHashJoin (pstate=0x262bec8) at nodeHashjoin.c:565
#23 0x00000000006deb3a in ExecProcNodeFirst (node=0x262bec8) at execProcnode.c:446
#24 0x00000000006ea5bd in ExecProcNode (node=0x262bec8) at ../../../src/include/executor/executor.h:239
#25 0x00000000006eaab0 in fetch_input_tuple (aggstate=0x262ba18) at nodeAgg.c:406
#26 0x00000000006ecd40 in agg_retrieve_direct (aggstate=0x262ba18) at nodeAgg.c:1736
#27 0x00000000006ec932 in ExecAgg (pstate=0x262ba18) at nodeAgg.c:1551
#28 0x00000000006deb3a in ExecProcNodeFirst (node=0x262ba18) at execProcnode.c:446
#29 0x00000000006d59cd in ExecProcNode (node=0x262ba18) at ../../../src/include/executor/executor.h:239
#30 0x00000000006d8326 in ExecutePlan (estate=0x262b7c8, planstate=0x262ba18, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0,
    direction=ForwardScanDirection, dest=0x7f6cf676c7f0, execute_once=1 '\001') at execMain.c:1721
#31 0x00000000006d5f9f in standard_ExecutorRun (queryDesc=0x258aa98, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at execMain.c:361
#32 0x00000000006d5dbb in ExecutorRun (queryDesc=0x258aa98, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at execMain.c:304
#33 0x00000000008b588b in PortalRunSelect (portal=0x25caa58, forward=1 '\001', count=0, dest=0x7f6cf676c7f0) at pquery.c:932
#34 0x00000000008b5519 in PortalRun (portal=0x25caa58, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x7f6cf676c7f0, altdest=0x7f6cf676c7f0,
    completionTag=0x7ffe0f75e5e0 "") at pquery.c:773
#35 0x00000000008af540 in exec_simple_query (
    query_string=0x2565728 "select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 t1,prt2 t2 where t1.a=t2.b) \nand (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);") at postgres.c:1120
#36 0x00000000008b37d4 in PostgresMain (argc=1, argv=0x25910e0, dbname=0x2590f40 "postgres", username=0x2562228 "edb") at postgres.c:4144
#37 0x0000000000812afa in BackendRun (port=0x2588ea0) at postmaster.c:4412
#38 0x000000000081226e in BackendStartup (port=0x2588ea0) at postmaster.c:4084
*/

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 21 February 2018 at 22:45, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1
> t1,prt2 t2 where t1.a=t2.b)
> and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

Thanks for the test case.

It seems like the x.c Param for some reason has a ParamId of 0. I need
to go learn a bit more about Params to understand why this is.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/02/21 18:06, David Rowley wrote:
> Other things I don't particularly like about the current patch are how
> I had to construct the partition key expressions in
> set_valid_runtime_subplans_recurse(). This pretty much uses code
> borrowed from set_baserel_partition_key_exprs(). One way around that
> would be to change the partprune.c code to deal with the
> partkey->partattrs and consume an expression from the list on attnum =
> 0. I didn't do that as I want to minimise how much I touch Amit's
> patch before it gets committed as doing so would likely just cause
> more headaches for me keeping this merged with his work. Another
> option to resolve this would be to put the partition key expressions
> into the PartitionPruneInfo.

Another way could be to refactor the code you've borrowed from
set_baserel_partition_key_exprs() into its own function that is exported
by some optimizer header.

I removed PartitionKey/Relation from signatures of various functions of my
patch to avoid having to re-heap_open() the relation per [1].

> I've attached v11 of the patch.

Some comments:

* I noticed that the patch adds a function to bitmapset.c which you might
want to extract into its own patch, like your patch to add bms_add_range()
that got committed as 84940644d [2].

* Maybe it's better to add `default: break;` in the two switch's
you've added in partkey_datum_from_expr()

partprune.c: In function ‘partkey_datum_from_expr’:
partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in
switch [-Wswitch]
  switch (nodeTag(expr))

partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled
in switch [-Wswitch]
    switch (((Param *) expr)->paramkind)

* I see that you moved PartitionClauseInfo's definition from partprune.c
to partition.h.  Isn't it better to move it to partprune.h?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com

[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=84940644d



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 February 2018 at 22:31, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Some comments:

Hi Amit,

Thanks for looking at this. I'll work through your comments and
produce a patch sometime in the near future.

One problem that I'm facing now is down to the way I'm gathering the
ParamIds that match the partkeys. As you'll see from the patch I've
added a 'paramids' field to PartitionPruneContext and I'm populating
this when the clauses are being pre-processed in
extract_partition_clauses(). The problem is that the or_clauses are
not pre-processed at all, so the current patch will not properly
perform run-time pruning when the Params are "hidden" in OR branches.
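
As an illustration of the kind of clause that's affected (table and
statement names here are made up):

create table rt_orders (region int, amt int) partition by list (region);
create table rt_orders_r1 partition of rt_orders for values in (1);
create table rt_orders_r2 partition of rt_orders for values in (2);

-- $1 only appears inside one arm of the OR, so unless the OR branches are
-- pre-processed too, run-time pruning can't make use of it
prepare rt_or_q (int) as
  select * from rt_orders where region = $1 or (region = 2 and amt > 100);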

One way I thought to fix this was to change the clause processing to
create an array of PartitionClauseInfos, one for each OR branch. This
would also improve the performance of the run-time pruning, meaning
that all of the or clauses would be already matched to the partition
keys once, rather than having to redo that again each time a Param
changes its value.

If I go and write a patch to do that, would you want it in your patch,
or would you rather I kept it over here?  Or perhaps you have a better
idea on how to solve...?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi David,

On 02/21/2018 04:06 AM, David Rowley wrote:
> I've attached v11 of the patch.
> 

Are UPDATE and DELETE suppose to be supported ?

With

-- test.sql --
CREATE TABLE test (a integer NOT NULL, b integer) PARTITION BY HASH(a);
CREATE TABLE test_p00 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 0);
CREATE TABLE test_p01 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 1);
CREATE INDEX idx_test_a ON test (a);
CREATE INDEX idx_test_b ON test (b);

INSERT INTO test (SELECT i,i FROM generate_series(1, 1000000) AS i);

ANALYZE;
-- test.sql --

and

UPDATE test SET b = 1 WHERE a = ?
DELETE FROM test WHERE a = ?

both show that all partitions are scanned:

  Update on test
    Update on test_p00
    Update on test_p01
    ->  Index Scan using test_p00_a_idx on test_p00
          Index Cond: (a = 1)
    ->  Index Scan using test_p01_a_idx on test_p01
          Index Cond: (a = 1)

Using prune_v32 and runtime_v11 with conflicts resolved.

Best regards,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 23 February 2018 at 01:15, David Rowley <david.rowley@2ndquadrant.com> wrote:
> One problem that I'm facing now is down to the way I'm gathering the
> ParamIds that match the partkeys. As you'll see from the patch I've
> added a 'paramids' field to PartitionPruneContext and I'm populating
> this when the clauses are being pre-processed in
> extract_partition_clauses(). The problem is that the or_clauses are
> not pre-processed at all, so the current patch will not properly
> perform run-time pruning when the Params are "hidden" in OR branches.
>
> One way I thought to fix this was to change the clause processing to
> create an array of PartitionClauseInfos, one for each OR branch. This
> would also improve the performance of the run-time pruning, meaning
> that all of the or clauses would be already matched to the partition
> keys once, rather than having to redo that again each time a Param
> changes its value.
>
> If I go and write a patch to do that, would you want it in your patch,
> or would you rather I kept it over here?  Or perhaps you have a better
> idea on how to solve...?

Hi Amit,

I've attached a patch which does this. For now, the patch is only
intended to assist in the discussion of the above idea.

The patch is based on a WIP version of run-time pruning that I'm not
quite ready to post yet, but with a small amount of work you could
probably take it and base it on your faster partition pruning v31
patch [1].

I ended up pulling the PartitionPruneInfo out of the
PartitionPruneContext. This was required due to how I've now made
extract_partition_clauses() recursively call itself. We don't want to
overwrite the context's clauseinfo with the one from the recursive
call. A side effect of this is that the subcontext is no longer
required when processing the OR clauses. You only did this so that the
context's clauseinfo was not overwritten. I also think it's better to
separate out the inputs and outputs. Anything in context was more
intended to be for input fields.

Let me know your thoughts about this. If you don't want it for faster
partition pruning, then I'll probably go and tidy it up and include it
for run-time pruning.

[1] https://www.postgresql.org/message-id/00ae2273-bb6b-1287-9ebc-5459b37c9078%40lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/02/23 0:11, David Rowley wrote:
> On 23 February 2018 at 01:15, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> One problem that I'm facing now is down to the way I'm gathering the
>> ParamIds that match the partkeys. As you'll see from the patch I've
>> added a 'paramids' field to PartitionPruneContext and I'm populating
>> this when the clauses are being pre-processed in
>> extract_partition_clauses(). The problem is that the or_clauses are
>> not pre-processed at all, so the current patch will not properly
>> perform run-time pruning when the Params are "hidden" in OR branches.
>>
>> One way I thought to fix this was to change the clause processing to
>> create an array of PartitionClauseInfos, one for each OR branch. This
>> would also improve the performance of the run-time pruning, meaning
>> that all of the or clauses would be already matched to the partition
>> keys once, rather than having to redo that again each time a Param
>> changes its value.
>>
>> If I go and write a patch to do that, would you want it in your patch,
>> or would you rather I kept it over here?  Or perhaps you have a better
>> idea on how to solve...?
> 
> I've attached a patch which does this. For now, the patch is only
> intended to assist in the discussion of the above idea.
> 
> The patch is based on a WIP version of run-time pruning that I'm not
> quite ready to post yet, but with a small amount of work you could
> probably take it and base it on your faster partition pruning v31
> patch [1].
> 
> I ended up pulling the PartitionPruneInfo out of the
> PartitionPruneContext. This was required due to how I've now made
> extract_partition_clauses() recursively call itself. We don't want to
> overwrite the context's clauseinfo with the one from the recursive
> call. A side effect of this is that the subcontext is no longer
> required when processing the OR clauses. You only did this so that the
> context's clauseinfo was not overwritten. I also think it's better to
> seperate out the inputs and outputs. Anything in context was more
> intended to be for input fields.
> 
> Let me know your thoughts about this. If you don't want it for faster
> partition pruning, then I'll probably go and tidy it up and include it
> for run-time pruning.

Thanks for the patch.

I don't have time today to look at the patch closely, but at first blush,
it seems like something I should incorporate into my own patch, because
it's restructuring the partprune.c code.  I will study the issue and your
proposed solution in the form of this restructuring more closely over the
weekend and reply (probably with a new version of my patch) on Monday.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 22 February 2018 at 22:31, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Some comments:
>
> * I noticed that the patch adds a function to bitmapset.c which you might
> want to extract into its own patch, like your patch to add bms_add_range()
> that got committed as 84940644d [2].

I've made that 0001 in the series

> * Maybe it's better to add `default: break;` in the two switch's
> you've added in partkey_datum_from_expr()
>
> partprune.c: In function ‘partkey_datum_from_expr’:
> partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in
> switch [-Wswitch]
>   switch (nodeTag(expr))
>
> partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled
> in switch [-Wswitch]
>     switch (((Param *) expr)->paramkind)

I wasn't aware of that gcc flag. I was also surprised to see a clean
compile from master with it enabled. This area has been changed a bit
from the last patch, but the remaining switch now has a default:
return false;

> * I see that you moved PartitionClauseInfo's definition from partprune.c
> to partition.h.  Isn't it better to move it to partprune.h?

Moved. I'd done it the other way to try to reduce the number of
planner headers included in the executor, but will defer to your
better judgement, as I see you're busy working on improving this area
in another patch set.

I've attached an updated set of patches.

I hope this also addresses the crash Rajkumar reported. I ended up making
some changes to how the Param values are determined by reusing more of
the existing executor code rather than duplicating it in
partkey_datum_from_expr. I really could use a sanity check on my
changes to that function now, especially the cross type portion.
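
For anyone wanting to sanity check the cross-type portion, the kind of
case I mean (names invented) is a partition key of type int4 compared
against a bigint parameter, which forces the use of a cross-type
comparison operator from the btree opfamily:

create table ct_tab (a int, payload text) partition by range (a);
create table ct_tab_p1 partition of ct_tab for values from (0) to (100);
create table ct_tab_p2 partition of ct_tab for values from (100) to (200);

prepare ct_q (bigint) as select * from ct_tab where a = $1;
execute ct_q (150);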

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 23 February 2018 at 04:11, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Are UPDATE and DELETE suppose to be supported ?

To be honest, I had not even considered those. Without looking in
detail I imagine it may be possible to allow this simply by setting
the AppendPath->trypartitionprune in the correct cases in the
inheritance_planner(). I would need to look into this in some detail
to find out for sure.

Another case which likely is simple to implement is the exact same
processing for MergeAppends. I currently see no reason why the same
pruning cannot be done for subnodes of that node type too. I've just
not done so yet. I'd rather get more sanity check reviews on the
current scope of the patch before I widen it out to other areas, but
at the same time also don't want to leave very simple things to PG12
which can easily be done in PG11. So I'll try to look at this and get
back to you, or perhaps release a new set of patches to support the
additional features.
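
For reference, the sort of plan shape I have in mind is an ordered scan of
a range-partitioned table, which can produce a Merge Append over
per-partition index scans (the names and bounds below are just
illustrative):

create table ts_tab (t int, v int) partition by range (t);
create table ts_tab_p1 partition of ts_tab for values from (0) to (100);
create table ts_tab_p2 partition of ts_tab for values from (100) to (200);
create index on ts_tab_p1 (t);
create index on ts_tab_p2 (t);

prepare ts_q (int) as select * from ts_tab where t < $1 order by t;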

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Simon Riggs
Date:
On 23 February 2018 at 11:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 23 February 2018 at 04:11, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Are UPDATE and DELETE suppose to be supported ?
>
> To be honest, I had not even considered those.

I can say that I had considered those. Handling of UPDATE and DELETE
with partitions is significantly different, so it's not just an
oversight, it's a different branch of the project.

We need SELECT to work first and then we have a chance of making
UPDATE/DELETE work.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
(Sorry, I had mistakenly replied only to Simon on Friday)

On Fri, Feb 23, 2018 at 9:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 23 February 2018 at 11:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 23 February 2018 at 04:11, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>> Are UPDATE and DELETE suppose to be supported ?
>>
>> To be honest, I had not even considered those.
>
> I can say that I had considered those. Handling of UPDATE and DELETE
> with partitions is significantly different, so its not just an
> oversight its a different branch of the project.
>
> We need SELECT to work first and then we have a chance of making
> UPDATE/DELETE work.

+1

Thanks,
Amit


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/02/23 20:40, David Rowley wrote:
> On 23 February 2018 at 04:11, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Are UPDATE and DELETE suppose to be supported ?
> 
> To be honest, I had not even considered those. Without looking in
> detail I imagine it may be possible to allow this simply by setting
> the AppendPath->trypartitionprune in the correct cases in the
> inheritance_planner(). I would need to look into this in some detail
> to find out for sure.

I guess you meant in ModifyTablePath.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/02/23 20:34, David Rowley wrote:
> On 22 February 2018 at 22:31, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Some comments:
>>
>> * I noticed that the patch adds a function to bitmapset.c which you might
>> want to extract into its own patch, like your patch to add bms_add_range()
>> that got committed as 84940644d [2].
> 
> I've made that 0001 in the series

Thanks.

> I've attached an updated set of patches.
> 
> I hope this also addresses Rajkumar reported crash. I ended up making
> some changes to how the Param values are determined by reusing more of
> the existing executor code rather than duplicating it in
> partkey_datum_from_expr. I really could use a sanity check on my
> changes to that function now, especially the cross type portion.

I've incorporated portions of 0002 and 0003 into my patch on the other
thread (v34) posted at [1].  That is, mostly the changes around handling
OR clauses and interface changes resulting from it.

Attached are revised versions of your patches after the aforementioned
rearrangements.  Note that after I took out the optimizer portion of the
0003 patch to incorporate it into my patch (OR clause processing bits),
not much was left in it, so I squashed it into 0002.  So there are only
0001 and 0002.

As a review comment on 0002, I think trypartitionprune is better written
as try_partition_prune.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/158f04ce-9deb-0457-ddcc-78fb73db4ebc%40lab.ntt.co.jp

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 27 February 2018 at 22:39, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've incorporated portions of 0002 and 0003 into my patch on the other
> thread (v34) posted at [1].  That is, mostly the changes around handling
> OR clauses and interface changes resulting from it.

Thanks. I was just in the middle of swapping the order of the patches
so that the OR clause patch was directly based on yours.

> Attached are revised version of your patches after the aforementioned
> rearrangements.  Note that after I took out the optimizer portion of the
> 0003 patch to incorporate it into my patch (OR clause processing bits),
> not much was left in it, so I squashed it into 0002.  So there are only
> 0001 and 0002.

I've locally got a patch which is significantly different from the v12
patch; it moves lots of code into nodePartition.c and fixes up the
missing node read/write functions too.

> As a review comment on 0002, I think trypartitionprune is better written
> as try_partition_prune.

That no longer exists in the new version... Will post soonish, just
need to base it all on your v34 [1] now! :)

[1] https://www.postgresql.org/message-id/158f04ce-9deb-0457-ddcc-78fb73db4ebc%40lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
I've attached v14 of the patch.

This is based on Amit's faster partition pruning patch v35 [1]

There's quite a bit of code churn in this version from the last
version. I've now moved most of the run-time pruning code into
execPartition.c and aimed to make it more generic to apply to node
types other than Append. I also had to make a few changes to the
PartitionPruneInfo node type so that it could get some support in
readfuncs.c and outfuncs.c, which I had previously missed.  As proof
that the code in execPartition.c is fairly generic and applies to any
subnode type that supports a List of subnodes, I went ahead and wrote
a small additional patch to add support for run-time pruning for
MergeAppend. This just takes an extra 100 lines of code.

I've also split the patch out a bit more into logical parts in the
hope it makes things easier to review.

[1] https://www.postgresql.org/message-id/0f96dd16-f5d5-7301-4ddf-858d41a6cbe3@lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi David,

On 03/01/2018 08:04 AM, David Rowley wrote:
> I've also split the patch out a bit more into logical parts in the
> hope it makes things easier to review.
> 

A small typo in 0001:

+ * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one 
bit in a

..."_one_"...

0004 fails "make check-world" due to

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE 
boolp_f jpedersen
pg_restore: [archiver (db)] could not execute query: ERROR:  syntax 
error at or near "false"
LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

Do you require https://commitfest.postgresql.org/17/1410/ as well ?

I'll look more at 0002-0005 over the coming days.

Thanks for working on this !

Best regards,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 2 March 2018 at 07:17, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> A small typo in 0001:
>
> + * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one bit
> in a
>
> ..."_one_"...

Oops. I'll fix that.

> 0004 fails "make check-world" due to
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
> boolp_f jpedersen
> pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
> or near "false"
> LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

The tests seem to have stumbled on a pg_dump bug which causes it to
produce syntax that's not valid (currently)

I should be able to stop my patch failing the test by dropping that
table, which I should have been doing anyway.

> Do you require https://commitfest.postgresql.org/17/1410/ as well ?

I'll look at that thread and see if there's any pg_dump being broken discussion.

> I'll look more at 0002-0005 over the coming days.

Thanks for the review and in advance for the future review.

I'll delay releasing a new patch as there's some discussion over on
the faster partition pruning thread which affects this too [1]

[1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi David,

On 03/01/2018 08:29 PM, David Rowley wrote:
>> 0004 fails "make check-world" due to
>>
>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>> pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
>> boolp_f jpedersen
>> pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
>> or near "false"
>> LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);
> 
> The tests seem to have stumbled on a pg_dump bug which causes it to
> produce syntax that's not valid (currently)
> 
> I should be able to stop my patch failing the test by dropping that
> table, which I should have been doing anyway.
>

I've added that thread to the Open Items list.

> Thanks for the review and in advance for the future review.
> 
> I'll delay releasing a new patch as there's some discussion over on
> the faster partition pruning thread which affects this too [1]
> 
> [1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com
> 

Sure, 0003-0005 depend on that thread. 0002 is refactoring, so that one
is ready.

In 0004, should we add a HASH-based test case such as this one?

-- test.sql --
-- verify pruning in hash partitions
create table hashp (a int primary key, b int) partition by hash (a);
create table hashp_0 partition of hashp for values with (modulus 2, 
remainder 0);
create table hashp_1 partition of hashp for values with (modulus 2, 
remainder 1);
insert into hashp values (0,0), (1,1), (2,2), (3,3);
prepare q1 (int) as select * from hashp where a = $1;
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
explain (analyze, costs off, summary off, timing off)  execute q1 (1);
explain (analyze, costs off, summary off, timing off)  execute q1 (3);
deallocate q1;
drop table hashp;
-- test.sql --

Also, should 0004 consider the "Parallel Append" case, as in:

-- parallel.sql --
CREATE TABLE t1 (
     a integer NOT NULL,
     b integer NOT NULL
) PARTITION BY HASH (b);

CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 3);
INSERT INTO t1 (SELECT i, i FROM generate_series(1, 1000000) AS i);
PREPARE q1 (int) AS SELECT * FROM t1 WHERE a = $1;
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)  EXECUTE q1 (5432);
DEALLOCATE q1;
DROP TABLE t1;
-- parallel.sql --

Best regards,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 10 March 2018 at 07:50, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> On 03/01/2018 08:29 PM, David Rowley wrote:
>> I'll delay releasing a new patch as there's some discussion over on
>> the faster partition pruning thread which affects this too [1]
>>
>> [1]
>> https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com
>>
>
> Sure, 0003-0005 depends on that thread. 0002 is refactoring so that one is
> ready.

Okay, back to this again. The faster partition pruning patch has
undergone a large refactor which caused me to delay basing this patch
on top of it until that patch had stabilised again.  It's now getting
close, so this seems like a good time to send a new version of this.

> In 0004 should we add a HASH based test case,

I don't think we really need to test each partition type in too much
detail in the run-time pruning patch. I'm more interested in testing
multiple partition levels; there's important new code in this patch
which does need lots of additional testing for multi-level partitions.
I think adding a HASH partition case would be aimed more at testing the
planner's pruning code, which this patch makes no behavioural changes
to.

> Also, should 0004 consider the "Parallel Append" case, aka

There is a parallel Append test case in there already, see the queries
below the "-- parallel append" comment.

I've attached a new version of the patch.  I'm now at v18 after having
some versions of the patch that I didn't release which were based on
various versions of Amit's faster partition pruning patch.

The attached patchset is based on Amit's v45 faster partition pruning [1].

I've made a few changes since the v14 version. Since Amit's v45 patch
now creates the partition pruning details in a data structure that can
be copied from the planner over to the executor, it means this patch
is now able to do much of the setup work for run-time pruning in the
planner. Doing this now allows us to determine if run-time pruning is
even possible at plan time, rather than the executor attempting it and
sometimes wasting effort when it failed to find Params matching the
partition key.

Another change from the last version is that I've separated out the
handling of exec Params and external Params.  The new patch now will
perform a pruning step at executor startup if some external params
match the partition key.  This is very beneficial to a PREPAREd OLTP
type query against a partitioned table as it means we can skip
sub-plan initialisation for all non-matching partitions.  Initialising
Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
originally planned did require a small change in explain.c in some
code that was assuming the subnode arrays were the same length as the
subplan list. I also ended up adding an Int property to EXPLAIN to show
the number of subnodes that have been removed during execution.
Unfortunately, there is a small corner case with this: when all subnodes
are removed it leaves EXPLAIN with no subnodes to search for outer-side
Vars in. I didn't really see any sane way to handle this in EXPLAIN, so
I think the best fix is what I've done, and that's just to always
initialise at least 1 subnode, even when none match the external Params.
Cost-wise, I don't think this is such a big deal as the cost savings
here come from saving on initialising tens or hundreds of subnodes, not 1.
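
To illustrate the distinction with a trivial made-up example: an external
Param comes from outside the plan, e.g. a prepared statement parameter,
and its value is available at executor startup, whereas an exec Param only
gets its value while the plan is running, e.g. from an initplan:

create table xp_tab (site int, val int) partition by list (site);
create table xp_tab_s1 partition of xp_tab for values in (1);
create table xp_tab_s2 partition of xp_tab for values in (2);

-- external Param: known at executor startup, so subplans for non-matching
-- partitions need not be initialised at all
prepare xp_q (int) as select * from xp_tab where site = $1;

-- exec Param: the uncorrelated sub-select becomes an initplan whose result
-- only appears during execution, so pruning has to happen at run-time
select * from xp_tab where site = (select 2);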

To put the new patch to the test, I tried pgbench -S -M prepared -s
100 with and without having modified pgbench_accounts to separate into
10 RANGE partitions of equal size.
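
Roughly, the partitioned setup looked like this (the exact boundary values
here are just for illustration; scale 100 gives 10 million accounts with
aid 1..10000000 split into 10 equal ranges):

create table pgbench_accounts (aid int not null, bid int, abalance int,
  filler char(84)) partition by range (aid);
create table pgbench_accounts_1 partition of pgbench_accounts
  for values from (1) to (1000001);
create table pgbench_accounts_2 partition of pgbench_accounts
  for values from (1000001) to (2000001);
-- ... and so on for partitions 3 through 9 ...
create table pgbench_accounts_10 partition of pgbench_accounts
  for values from (9000001) to (10000001);
-- one index per partition, so the -S lookup on aid remains an index scan
create index on pgbench_accounts_1 (aid);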

A non-partitioned table was getting 12503 TPS.
With partitioned tables, the old version of this patch was getting: 5470 TPS.
With partitioned tables, the attached version gets 11247 TPS.
For perspective, today's master with a partitioned table gets 4719 TPS.

So you can see it's a pretty good performance boost by skipping
initialisation of the 9 non-matching subplans.  It's not hard to
imagine the gains getting more significant with a larger number of
partitions. Ideally, the performance of a partitioned table would be
faster than the non-partitioned case, but please remember this query
is a single row PK lookup SELECT, so is a very fast query in both
cases. Partitioning cases should improve more as the table grows and
indexes struggle to fit in RAM, or when many rows are being taken from
the partition and being aggregated.

Unfortunately, the story is not quite as good for the non -M prepared
version of the benchmark. This shows that planning time of partitioned
table queries could still use some improvements. Amit's v45 patch
certainly makes a dent in the planner slow performance here, but it's
still nothing like as fast as the non-partitioned case. More work is
required there in the future.

This patchset also contains a patch to improve the performance of
inheritance planning of UPDATE/DELETE queries. This patch also
implements run-time pruning for UPDATE/DELETE too. This also has a
significant performance improvement for planning of UPDATE/DELETE
operations on partitioned tables with a large number of partitions,
performance is as follows:

Values in transactions per second.

Partitions = 1
Unpatched: 7323.3
Patched: 6573.2 (-10.24%)

Partitions = 2
Unpatched: 6784.8
Patched: 6377.1 (-6.01%)

Partitions = 4
Unpatched: 5903.0
Patched: 6106.8 (3.45%)

Partitions = 8
Unpatched: 4582.0
Patched: 5579.9 (21.78%)

Partitions = 16
Unpatched: 3131.5
Patched: 4521.2 (44.38%)

Partitions = 32
Unpatched: 1779.8
Patched: 3387.8 (90.35%)

Partitions = 64
Unpatched: 821.9
Patched: 2245.4 (173.18%)

Partitions = 128
Unpatched: 322.2
Patched: 1319.6 (309.56%)

Partitions = 256
Unpatched: 84.3
Patched: 731.7 (768.27%)

Partitions = 512
Unpatched: 22.5
Patched: 382.8 (1597.74%)

Partitions = 1024
Unpatched: 5.5
Patched: 150.1 (2607.83%)

This shows a small regression in planner performance of 10% for a
table with 1 partition, but performance starts improving again by just
4 partitions. By the time we get to 1024 partitions the patched
planner is 26 times faster than unpatched. It's likely not a
no-brainer since some people may get a small performance regression,
but it certainly makes having larger numbers of partitions much more
usable than it was previously.

There are still a few bugs lingering in v45 of Amit's faster partition
pruning patch. One of which is causing a regression test to fail in
the attached patch set. I've also attached a patch which must be
applied after Amit's v45 patchset to fixup a couple of things missing.
Hopefully, this won't be needed once v46 is out.

To test this apply all patches in [1], then apply
faster_part_prune_v45_fixups.patch, then the attached 0001-0005
patches.

[1] https://www.postgresql.org/message-id/fc73cef4-6879-26c3-6859-2f910640234a@lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
David Rowley wrote:

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
> 
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.
> 
> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans.  It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions.

These are excellent news!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi David,

On 03/31/2018 09:52 AM, David Rowley wrote:
> I've attached a new version of the patch.  I'm now at v18 after having
> some versions of the patch that I didn't release which were based on
> various versions of Amit's faster partition pruning patch.
> 

Thank you for the updated patch set !

I have tested this together with Amit's v46 patch.

The attached case doesn't trigger a generic plan, so basically all time 
is spent in GetCachedPlan.

The standard table case (std.sql) gives:

  generic_cost = 8.4525
  avg_custom_cost = 13.4525
  total_custom_cost = 67.2625

whereas the 64 hash partition case (hash.sql) gives:

  generic_cost = 540.32
  avg_custom_cost = 175.9425
  total_custom_cost = 879.7125

I tested with pgbench -M prepared -f select.sql.

Also, I'm seeing a regression for check-world in 
src/test/regress/results/inherit.out

***************
*** 642,648 ****
   ---------------------+---+---+-----
    mlparted_tab_part1  | 1 | a |
    mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b | xxx
    mlparted_tab_part3  | 3 | a | xxx
   (4 rows)

--- 642,648 ----
   ---------------------+---+---+-----
    mlparted_tab_part1  | 1 | a |
    mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b |
    mlparted_tab_part3  | 3 | a | xxx
   (4 rows)

I'll spend some more time tomorrow.

Thanks for working on this !

Best regards,
  Jesper

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hello,

On Tue, Apr 3, 2018 at 11:14 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Hi David,
>
> On 03/31/2018 09:52 AM, David Rowley wrote:
>>
>> I've attached a new version of the patch.  I'm now at v18 after having
>> some versions of the patch that I didn't release which were based on
>> various versions of Amit's faster partition pruning patch.
>>
>
> Thank you for the updated patch set !
>
> I have tested this together with Amit's v46 patch.
>
>
> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out
>

With Amit's v46 patch,  the following query in partition_prune was
crashing during make check.
explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> I have tested this together with Amit's v46 patch.

Thanks for testing this.

> The attached case doesn't trigger a generic plan, so basically all time is
> spent in GetCachedPlan.

Yeah, there's still no resolution to the fact that a generic plan +
runtime pruning might be cheaper than a custom plan.  The problem is
that the generic plan appears expensive to the custom vs generic plan
comparison, due to it containing more Append subnodes and the run-time
pruning not being taken into account by that comparison.

There's been some discussion about this on this thread somewhere.

I think the best solution is probably the one suggested by Robert [1]
and that's to alter the Append plan's cost when run-time pruning is
enabled to try to account for the run-time pruning. This would be a
bit of a blind guess akin to what we do for clause selectivity
estimates for Params, but it's probably better than nothing, and
likely better than doing nothing.

> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out
>
> ***************
> *** 642,648 ****
>   ---------------------+---+---+-----
>    mlparted_tab_part1  | 1 | a |
>    mlparted_tab_part2a | 2 | a |
> !  mlparted_tab_part2b | 2 | b | xxx
>    mlparted_tab_part3  | 3 | a | xxx
>   (4 rows)
>
> --- 642,648 ----
>   ---------------------+---+---+-----
>    mlparted_tab_part1  | 1 | a |
>    mlparted_tab_part2a | 2 | a |
> !  mlparted_tab_part2b | 2 | b |
>    mlparted_tab_part3  | 3 | a | xxx
>   (4 rows)
>
> I'll spend some more time tomorrow.

Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
that with Amit over on [2].

If you patch v46 with the patch I attached to that thread, it should work.

[1] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 4 April 2018 at 05:50, Beena Emerson <memissemerson@gmail.com> wrote:
> With Amit's v46 patch,  the following query in partition_prune was
> crashing during make check.
> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);

Hi Beena,

Thanks for looking.

Does it work correctly if you apply [1] to Amit's v46 patch before
patching with v18 run-time partition pruning?

[1] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 4 April 2018 at 14:10, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> The attached case doesn't trigger a generic plan, so basically all time is
>> spent in GetCachedPlan.
>
> Yeah, there's still no resolution to the fact that a generic plan +
> runtime pruning might be cheaper than a custom plan.  The problem is
> that the generic plan appears expensive to the custom vs generic plan
> comparison, due to it containing more Append subnodes and the run-time
> pruning not being taken into account by that comparison.

Just for the record, some of the benchmarks I did above also used the
attached patch for the -M prepared case.

I didn't intend the patch for PostgreSQL, but I am starting to think
that it would be useful to have something to save us from having to
EXECUTE PREPAREd statements 5 times before getting a generic plan.
Doing that is starting to seem a bit fragile to me. It would be nice to
have some solution, but I've so far not thought of anything better
than the attached (incomplete) patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Beena Emerson
Date:
Hi David,

On Wed, Apr 4, 2018 at 7:57 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 05:50, Beena Emerson <memissemerson@gmail.com> wrote:
>> With Amit's v46 patch,  the following query in partition_prune was
>> crashing during make check.
>> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);
>
> Hi Beena,
>
> Thanks for looking.
>
> Does it work correctly if you apply [1] to Amit's v46 patch before
> patching with v18 run-time partition pruning?
>
> [1] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com
>

Thanks for working on it. make check passes when the patch [1] is also applied.

--

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/04/04 11:10, David Rowley wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> Also, I'm seeing a regression for check-world in
>> src/test/regress/results/inherit.out
>>
>> ***************
>> *** 642,648 ****
>>   ---------------------+---+---+-----
>>    mlparted_tab_part1  | 1 | a |
>>    mlparted_tab_part2a | 2 | a |
>> !  mlparted_tab_part2b | 2 | b | xxx
>>    mlparted_tab_part3  | 3 | a | xxx
>>   (4 rows)
>>
>> --- 642,648 ----
>>   ---------------------+---+---+-----
>>    mlparted_tab_part1  | 1 | a |
>>    mlparted_tab_part2a | 2 | a |
>> !  mlparted_tab_part2b | 2 | b |
>>    mlparted_tab_part3  | 3 | a | xxx
>>   (4 rows)
>>
>> I'll spend some more time tomorrow.
> 
> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
> that with Amit over on [2].

I'm not sure if we've yet discussed anything that'd be related to this on
the faster pruning thread.  It seems that the difference arises from
mlparted_tab_part2b not being selected for an update query that's executed
just before this test.  When I execute an equivalent select query to check
if mlparted_tab_part2b is inadvertently pruned due to the new code, I
don't see the latest faster pruning patch doing it:

explain (costs off)
select *
from mlparted_tab mlp,
    (select a from some_tab union all select a+1 from some_tab) ss (a)
where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Nested Loop
   Join Filter: (((mlp.a = some_tab.a) AND (mlp.b = 'b'::bpchar)) OR (mlp.a = 3))
   ->  Append
         ->  Seq Scan on some_tab
         ->  Seq Scan on some_tab some_tab_1
   ->  Materialize
         ->  Append
               ->  Seq Scan on mlparted_tab_part1 mlp
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
               ->  Seq Scan on mlparted_tab_part2b mlp_1
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
               ->  Seq Scan on mlparted_tab_part3 mlp_2
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
(13 rows)

For the original update query, constraint exclusion selects the same set
of partitions:

explain (costs off) update mlparted_tab mlp set c = 'xxx'
from (select a from some_tab union all select a+1 from some_tab) ss (a)
where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Update on mlparted_tab mlp
   Update on mlparted_tab_part1 mlp_1
   Update on mlparted_tab_part2b mlp_2
   Update on mlparted_tab_part3 mlp_3
   ->  Nested Loop
         Join Filter: (((mlp_1.a = some_tab.a) AND (mlp_1.b = 'b'::bpchar)) OR (mlp_1.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part1 mlp_1
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
         Join Filter: (((mlp_2.a = some_tab.a) AND (mlp_2.b = 'b'::bpchar)) OR (mlp_2.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part2b mlp_2
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
         Join Filter: (((mlp_3.a = some_tab.a) AND (mlp_3.b = 'b'::bpchar)) OR (mlp_3.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part3 mlp_3
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
(28 rows)

What am I missing?

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 4 April 2018 at 18:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/04 11:10, David Rowley wrote:
>> On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>>> Also, I'm seeing a regression for check-world in
>>> src/test/regress/results/inherit.out
>>>
>>> ***************
>>> *** 642,648 ****
>>>   ---------------------+---+---+-----
>>>    mlparted_tab_part1  | 1 | a |
>>>    mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b | xxx
>>>    mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> --- 642,648 ----
>>>   ---------------------+---+---+-----
>>>    mlparted_tab_part1  | 1 | a |
>>>    mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b |
>>>    mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> I'll spend some more time tomorrow.
>>
>> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
>> that with Amit over on [2].
>
> I'm not sure if we've yet discussed anything that'd be related to this on
> the faster pruning thread.

hmm, yeah, I didn't really explain the context, but the report was in [1]

Basically, the OR clause in the following SQL fragment was overwriting
the scan_all_nonnull value:

where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

The line doing the overwriting was:

result->scan_all_nonnull = step_result->scan_all_nonnull;

The minimum fix would have been to change that line to:

result->scan_all_nonnull |= step_result->scan_all_nonnull;
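
To spell out the difference with a trivial standalone illustration (the
struct below is a simplified stand-in for the real step result, not the
actual code):

#include <stdio.h>
#include <stdbool.h>

/* simplified stand-in for a pruning step's result */
typedef struct StepResult
{
    bool        scan_all_nonnull;
} StepResult;

int
main(void)
{
    StepResult  result = {false};
    StepResult  step1 = {true};     /* a step that needs all non-null parts */
    StepResult  step2 = {false};    /* a step that doesn't */

    /* plain assignment lets the later OR'd step wipe out the earlier one */
    result.scan_all_nonnull = step1.scan_all_nonnull;
    result.scan_all_nonnull = step2.scan_all_nonnull;
    printf("with '=' : %d\n", result.scan_all_nonnull);    /* 0 -- the bug */

    /* OR-accumulating keeps it, which is what a union of OR'd steps needs */
    result.scan_all_nonnull = false;
    result.scan_all_nonnull |= step1.scan_all_nonnull;
    result.scan_all_nonnull |= step2.scan_all_nonnull;
    printf("with '|=': %d\n", result.scan_all_nonnull);    /* 1 */

    return 0;
}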

Anyway, it's all irrelevant now as that code has since changed.

[1] https://www.postgresql.org/message-id/CAKJS1f_SHPuqDhQWJq-_P1kpPQn7BJt71yPbDP_8b3rhwFQyGA@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/04/04 16:04, David Rowley wrote:
> On 4 April 2018 at 18:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I'm not sure if we've yet discussed anything that'd be related to this on
>> the faster pruning thread.
> 
> hmm, yeah, I didn't really explain the context, but the report was in [1]
> [1] https://www.postgresql.org/message-id/CAKJS1f_SHPuqDhQWJq-_P1kpPQn7BJt71yPbDP_8b3rhwFQyGA@mail.gmail.com

Oh, I see.  Hopefully it is no longer an issue.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi David,

On 04/03/2018 10:10 PM, David Rowley wrote:
>> The attached case doesn't trigger a generic plan, so basically all time is
>> spent in GetCachedPlan.
> 
> Yeah, there's still no resolution to the fact that a generic plan +
> runtime pruning might be cheaper than a custom plan.  The problem is
> the generic plan appears expensive to the custom vs generic plan
> comparison due to it containing more Append subnodes and the run-time
> pruning not being taken into account by that comparison.
> 
> There's been some discussion about this on this thread somewhere.
>

Forgot about that, sorry.

> I think the best solution is probably the one suggested by Robert [1]
> and that's to alter the Append plan's cost when run-time pruning is
> enabled to try to account for the run-time pruning. This would be a
> bit of a blind guess akin to what we do for clause selectivity
> estimates for Params, but it's probably better than doing nothing.
>

Yeah, something based on the number of WHERE clauses, and on whether the
partitioned table has a DEFAULT or NULL partition, could help. Forcing 
choose_custom_plan() to return false does help a lot (> 400%) for the 
HASH case.

But maybe this area is best left for PG12.

> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
> that with Amit over on [2].
>

I was running with a version of faster_part_prune_v45_fixups.patch.

Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

Thanks again,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 5 April 2018 at 05:31, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

I've attached an updated patchset against Amit's v49 faster partition
pruning patch [1].

v49 should also be patched with [2] and [3].

I'm pretty happy with patches 0001 to 0004. At the moment I'm still a
bit unsure of 0005. I need to do a bit more sanity checking on it,
mostly around the changes in planner.c. That said, I am keen to see 0005
make it into PG11 as it makes running larger numbers of partitions
more usable for UPDATE/DELETE too. Amit's improvements are really good
alone, but unfortunately, anyone who's waiting on us speeding up
partition pruning in SELECT queries is probably also waiting on us
doing the same for UPDATE/DELETE. So I do think 0005 is important for
PG11, provided it's correct, of course.

[1] https://www.postgresql.org/message-id/c5331ff6-8b31-0742-758e-bd7b9aeddf07%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/CAKJS1f_ad%3DB9rVf5dPD27%3DxTN1Ob7xJi6N4BFBNytj93rCPqzg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/3eedafaa-840f-bf72-2fa9-dadb0852e959%40redhat.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/03/31 22:52, David Rowley wrote:
> The attached patchset is based on Amit's v45 faster partition pruning [1].
> 
> I've made a few changes since the v14 version. Since Amit's v45 patch
> now creates the partition pruning details in a data structure that can
> be copied from the planner over to the executor, it means this patch
> is now able to do much of the setup work for run-time pruning in the
> planner. Doing this now allows us to determine if run-time pruning is
> even possible at plan time, rather than the executor attempting it and
> sometimes wasting effort when it failed to find Params matching the
> partition key.
>
> Another change from the last version is that I've separated out the
> handling of exec Params and external Params.  The new patch now will
> perform a pruning step at executor startup if some external params
> match the partition key.  This is very beneficial to a PREPAREd OLTP
> type query against a partitioned table as it means we can skip
> sub-plan initialisation for all non-matching partitions.

This is quite nice.

> Initialising
> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
> originally planned did require a small change in explain.c in some
> code that was assuming the subnode arrays were the same length as the
> subplan list. I also ended up adding an Int property to EXPLAIN to show
> the number of subnodes that have been removed during execution.
> Unfortunately, there is a small corner case with this in the case
> where all subnodes are removed, as it leaves EXPLAIN with no subnodes to
> search for outer side Vars in. I didn't really see any sane way to
> handle this in EXPLAIN, so I think the best fix for this is what I've
> done, and that's just to always initialise at least 1 subnode, even
> when none match the external Params. Cost-wise, I don't think this is
> such a big deal as the cost savings here are coming from saving on
> initialising tens or hundreds of subnodes, not 1.

I have wondered about the possibility of setting a valid (non-dummy)
targetlist in Append and MergeAppend if they're created for a partitioned
table.  Currently they're overwritten by a dummy one using
set_dummy_tlist_references() in set_plan_refs() citing the following reason:

 * set_dummy_tlist_references
 *    Replace the targetlist of an upper-level plan node with a simple
 *    list of OUTER_VAR references to its child.
 *
 * This is used for plan types like Sort and Append that don't evaluate
 * their targetlists.  Although the executor doesn't care at all what's in
 * the tlist, EXPLAIN needs it to be realistic.

In fact, when I noticed this EXPLAIN behavior, I wondered if that's
something we should have discussed when d3cc37f1d801a [1] went in.

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
> 
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.

Quite nice!

> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans.  It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions. Ideally, the performance of a partitioned table would be
> faster than the non-partitioned case, but please remember this query
> is a single row PK lookup SELECT, so is a very fast query in both
> cases. Partitioning cases should improve more as the table grows and
> indexes struggle to fit in RAM, or when many rows are being taken from
> the partition and being aggregated.
> 
> Unfortunately, the story is not quite as good for the non -M prepared
> version of the benchmark. This shows that planning time of partitioned
> table queries could still use some improvements. Amit's v45 patch
> certainly makes a dent in the planner slow performance here, but it's
> still nothing like as fast as the non-partitioned case. More work is
> required there in the future.

Certainly.  Things like the issue that we both replied to yesterday should
be addressed for starters [2].

> This patchset also contains a patch to improve the performance of
> inheritance planning of UPDATE/DELETE queries. This patch also
> implements run-time pruning for UPDATE/DELETE too. This also has a
> significant performance improvement for planning of UPDATE/DELETE
> operations on partitioned tables with a large number of partitions,
> performance is as follows:
> 
> Values in transactions per second.
> 
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
> 
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
> 
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
> 
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)
> 
> Partitions = 16
> Unpatched: 3131.5
> Patched: 4521.2 (44.38%)
> 
> Partitions = 32
> Unpatched: 1779.8
> Patched: 3387.8 (90.35%)
> 
> Partitions = 64
> Unpatched: 821.9
> Patched: 2245.4 (173.18%)
> 
> Partitions = 128
> Unpatched: 322.2
> Patched: 1319.6 (309.56%)
> 
> Partitions = 256
> Unpatched: 84.3
> Patched: 731.7 (768.27%)
> 
> Partitions = 512
> Unpatched: 22.5
> Patched: 382.8 (1597.74%)
> 
> Partitions = 1024
> Unpatched: 5.5
> Patched: 150.1 (2607.83%)

Great!

I will post comments on your v19 later today.

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d801a

[2]
https://www.postgresql.org/message-id/20180403194613.GY28454%40telsasoft.com



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 5 April 2018 at 15:14, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/03/31 22:52, David Rowley wrote:
>> Initialising
>> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
>> originally planned did require a small change in explain.c in some
>> code that was assuming the subnode arrays were the same length as the
>> subplan list. I also ended up adding an Int property to EXPLAIN to show
>> the number of subnodes that have been removed during execution.
>> Unfortunately, there is a small corner case with this in the case
>> where all subnodes are removed, as it leaves EXPLAIN with no subnodes to
>> search for outer side Vars in. I didn't really see any sane way to
>> handle this in EXPLAIN, so I think the best fix for this is what I've
>> done, and that's just to always initialise at least 1 subnode, even
>> when none match the external Params. Cost-wise, I don't think this is
>> such a big deal as the cost savings here are coming from saving on
>> initialising tens or hundreds of subnodes, not 1.
>
> I have wondered about the possibility of setting a valid (non-dummy)
> targetlist in Append and MergeAppend if they're created for a partitioned
> table.  Currently they're overwritten by a dummy one using
> set_dummy_tlist_references() in set_plan_refs() citing the following reason:
>
>  * set_dummy_tlist_references
>  *    Replace the targetlist of an upper-level plan node with a simple
>  *    list of OUTER_VAR references to its child.
>  *
>  * This is used for plan types like Sort and Append that don't evaluate
>  * their targetlists.  Although the executor doesn't care at all what's in
>  * the tlist, EXPLAIN needs it to be realistic.
>
> In fact, when I noticed this EXPLAIN behavior, I wondered if that's
> something we should have discussed when d3cc37f1d801a [1] went in.

I had a quick hack at this to see if it would work and it does seem to
on my very simple test. However, it would mean removing the
set_dummy_tlist_references() call for more than just Append/MergeAppend.

create table listp (a int, b int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
prepare q1 (int, int) as select * from listp where a in($1,$2) order
by b limit 1;
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain (verbose, costs off) execute q1(0,0);
                       QUERY PLAN
--------------------------------------------------------
 Limit
   Output: listp.a, listp.b
   ->  Sort
         Output: listp.a, listp.b
         Sort Key: listp.b
         ->  Append
               Subplans Pruned: 2
(7 rows)

The downside is that if we were to do this it would mean changing the
output in cases like:

explain (verbose, costs off) (select a z, b y from listp union all
select * from listp) order by y;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Sort
   Output: z, y
   Sort Key: y
   ->  Append
         ->  Seq Scan on public.listp1
               Output: listp1.a, listp1.b
         ->  Seq Scan on public.listp2
               Output: listp2.a, listp2.b
         ->  Seq Scan on public.listp1 listp1_1
               Output: listp1_1.a, listp1_1.b
         ->  Seq Scan on public.listp2 listp2_1
               Output: listp2_1.a, listp2_1.b

Notice the sort key now refers to the alias rather than a column from
the first append child.

It sure is an interesting thought, and one I'd not considered, but I
don't think trying for something like this is going to be the path of
least resistance. It may also add quite a bit of complexity if we just
try to do it when the OUTER_VAR would lead to an Append/MergeAppend
which belongs to a partitioned table scan.

I think this idea has good merit and some form of it might be the
nicest way to allow run-time pruning of all subnodes in the future.
Perhaps we can put it on the shelf and think about it again for PG12.
However, it might not be the most interesting optimization to work on,
as run-time pruning away all subnodes is probably far less common than
pruning some, or all but one, and the cost of initializing the one
unneeded subnode is not so big.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 5 April 2018 at 14:01, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached an updated patchset against Amit's v49 faster partition
> pruning patch.

Well, v19 was short lived. 0005 conflicted with some recent changes to MERGE.

The attached v20 patch set is now based on Amit's faster partition
pruning patch, v50 [1].

There are a few changes since v19:

1. I've run pgindent on the entire patchset.
2. Made a pass over the comments and fixed a few things that were not
quite right. Also improved some wording.
3. Fixed a small error in make_partition_pruneinfo where I had
accidentally used the 'i' variable for two purposes at once. There was
no live bug there, but I've now changed things around, as what was
there was wrong regardless of whether it was causing issues or not.
4. Updated typedefs.list with the new typedefs added in the patchset.

[1] https://www.postgresql.org/message-id/77a518ac-e4a0-4cd1-9988-e5d754a6501f%40lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/04/05 12:14, Amit Langote wrote:
> I will post comments on your v19 later today.

I looked at it and here are my thoughts on it after having for the first
time looked very closely at it.

* Regarding PartitionPruneInfo:

I think the names of the fields could be improved -- like pruning_steps
instead of prunesteps, unpruned_parts instead of allpartindexs.  The latter
is even a bit misleading because it doesn't in fact contain *all*
partition indexes, only those that are unpruned, because each either has a
subpath or appears in the (unpruned) partitioned_rels list.  Also, I didn't
like the names subnodeindex and subpartindex very much.  subplan_indexes
and parent_indexes would sound more informative to me.

* make_partition_pruneinfo has a parameter resultRelations that's not used
anywhere

* In make_partition_pruneinfo()

    allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
    allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);

I think these arrays need to have root->simple_rel_array_size + 1
elements, because they're subscripted using RT indexes which start at 1.

* Also in make_partition_pruneinfo()

    /* Initialize to -1 to indicate the rel was not found */
    for (i = 0; i < root->simple_rel_array_size; i++)
    {
        allsubnodeindex[i] = -1;
        allsubpartindex[i] = -1;
    }

Maybe allocate the arrays mentioned above using palloc0 and don't do this
initialization.  Instead, make the indexes stored in them start at 1 and
treat 0 as an invalid entry.

* Regarding the code added in execPartition.c and execPartition.h:

I wondered why PartitionedRelPruning is named the way it is.  I saw many
parallels with PartitionDispatchData, mainly in terms of the thing it
consists of, that is, the map that translates partition indexes as they
appear in the partition descriptor into indexes of subplans or of some
other executor structure.  Also, I imagine you tried to mimic
PartitionTupleRouting with PartitionPruning, but not throughout.  For
example, tuple routing struct pointer variables are called proute
throughout, whereas PartitionPruning ones are called partprune instead
of, say, pprune.  Consistency would help, imho.

* Instead of nesting one PartitionedRelPruning inside another, just store
them in a global flat array in the PartitionPruning, like
PartitionTupleRouting does for the PartitionDispatch of individual
partitioned tables in the tree.

typedef struct PartitionedRelPruning
{
    int         nparts;
    int        *subnodeindex;
    struct PartitionedRelPruning **subpartprune;

* I don't see why there needs to be nparts in the above, because it
already has a PartitionPruneContext member which has that information.

In fact, I made most of these changes myself while going through the code.
Please see attached the delta patch.  It also tweaks quite a few other
things including various comments.  I think parts of it apply to 0001,
0003, and 0004 patches.  See if this looks good to you.

Thanks,
Amit

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
Hi Amit,

Thanks for having a look at this.

On 6 April 2018 at 00:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I looked at it and here are my thoughts on it after having for the first
> time looked very closely at it.
>
> * Regarding PartitionPruneInfo:
>
> I think the names of the fields could be improved -- like pruning_steps
> instead of prunesteps, unpruned_parts instead of allpartindexs.  The latter
> is even a bit misleading because it doesn't in fact contain *all*
> partition indexes, only those that are unpruned, because each either has a
> subpath or it appears in (unpruned) partitioned_rels list.  Also, I didn't
> like the name subnodeindex and subpartindex very much.  subplan_indexes
> and parent_indexes would sound more informative to me.

Seems mostly fair. I'm not a fan of using the term "unpruned" though.
I'll have a think.  The "all" is meant in terms of what exists as
subnodes.

subplan_indexes and parent_indexes seem like better names, I agree.

> * make_partition_pruneinfo has a parameter resultRelations that's not used
> anywhere

It gets used in 0005.

I guess I could only add it in 0005, but make_partition_pruneinfo is
only used in 0003, so you could say the same about that entire
function.

Do you think I should delay adding that parameter until the 0005 patch?

> * In make_partition_pruneinfo()
>
>     allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
>     allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);
>
> I think these arrays need to have root->simple_rel_array_size + 1
> elements, because they're subscripted using RT indexes which start at 1.

RT indexes are always 1-based. See setup_simple_rel_arrays. It already
sets the array size to list_length(rtable) + 1.

> * Also in make_partition_pruneinfo()
>
>     /* Initialize to -1 to indicate the rel was not found */
>     for (i = 0; i < root->simple_rel_array_size; i++)
>     {
>         allsubnodeindex[i] = -1;
>         allsubpartindex[i] = -1;
>     }
>
> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
> initialization.  Instead make the indexes that are stored in these start
> with 1 and consider 0 as invalid entries.

0 is a valid subplan index. It is possible to make this happen, but
I'd need to subtract 1 everywhere I used the map. That does not seem
very nice. Seems more likely to result in bugs where we might forget
to do the - 1.

Did you want this because you'd rather have the palloc0() than the for
loop setting the array elements to -1? Or is there another reason?

> * Regarding the code added in execPartition.c and execPartition.h:
>
> I wondered why PartitionedRelPruning is named the way it is.  I saw many
> parallels with PartitionDispatchData both in terms of the main thing it
> consists of, that is, the map that translates partition indexes as in
> partition descriptor to that of subplans or of some other executor
> structure.  Also, I imagine you tried to mimic PartitionTupleRouting with
> PartitionPruning but not throughout.  For example, tuple routing struct
> pointer variables are throughout called proute, whereas PartitionPruning
> ones are called partprune instead of, say, pprune.  Consistency would
> help, imho.

Yes, I saw similarities and did end up moving all the code into
execPartition a while back.

I'll look into this renaming.

> * Instead of nesting PartitionedRelPruning inside another, just store them
> in a global flat array in the PartitionPruning, like PartitionTupleRouting
> does for PartitionDispatch of individual partitioned tables in the tree.
>
> typedef struct PartitionedRelPruning
> {
>     int         nparts;
>     int        *subnodeindex;
>     struct PartitionedRelPruning **subpartprune;

There is a flat array in PartitionPruning. subpartprune contains
pointers into that array. I want to have this pointer array so I can
directly reference the flat array in PartitionPruning.

Maybe I've misunderstood what you mean here.

> * I don't see why there needs to be nparts in the above, because it
> already has a PartitionPruneContext member which has that information.

Good point. I'll remove that.

> In fact, I made most of changes myself while going through the code.
> Please see attached the delta patch.  It also tweaks quite a few other
> things including various comments.  I think parts of it apply to 0001,
> 0003, and 0004 patches.  See if this looks good to you.

Thanks. I'll look.

It's late over this side now, so will look tomorrow.

Thanks again for reviewing this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi David,

First of all: Solid patch set with good documentation.

On 04/05/2018 09:41 AM, David Rowley wrote:
> Seems mostly fair. I'm not a fan of using the term "unpruned" though.
> I'll have a think.  The "all" is meant in terms of what exists as
> subnodes.
>

'included_parts' / 'excluded_parts' probably isn't better...

> subplan_indexes and parent_indexes seem like better names, I agree.
>

More clear.

>> * Also in make_partition_pruneinfo()
>>
>>      /* Initialize to -1 to indicate the rel was not found */
>>      for (i = 0; i < root->simple_rel_array_size; i++)
>>      {
>>          allsubnodeindex[i] = -1;
>>          allsubpartindex[i] = -1;
>>      }
>>
>> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
>> initialization.  Instead make the indexes that are stored in these start
>> with 1 and consider 0 as invalid entries.
> 
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. Seems more likely to result in bugs where we might forget
> to do the - 1.
> 
> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?
>

I think that doing palloc0 would be confusing; -1 is more clear, 
especially since it is used with 'allpartindexes' which is a Bitmapset.

Doing the variable renames as Amit suggests would be good.

I ran some tests (v50_v20) (make check-world passes), w/ and w/o
choose_custom_plan() forced to return false, and am seeing good
performance results without running into issues.

Maybe 0005's tests in partition_prune.sql should be expanded with the
supported cases to make those clearer.

Thanks !

Best regards,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/04/05 22:41, David Rowley wrote:
>> * make_partition_pruneinfo has a parameter resultRelations that's not used
>> anywhere
> 
> It gets used in 0005.
> 
> I guess I could only add it in 0005, but make_partition_pruneinfo is
> only used in 0003, so you could say the same about that entire
> function.
> 
> Do you think I should delay adding that parameter until the 0005 patch?

Yes, I think.

>> * In make_partition_pruneinfo()
>>
>>     allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
>>     allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);
>>
>> I think these arrays need to have root->simple_rel_array_size + 1
>> elements, because they're subscripted using RT indexes which start at 1.
> 
> RT indexes are always 1-based. See setup_simple_rel_arrays. It already
> sets the array size to list_length(rtable) + 1.

Oh, I missed that simple_rel_array_size itself is set to consider 1-based
RT indexes.

relnode.c:73
root->simple_rel_array_size = list_length(root->parse->rtable) + 1;

>> * Also in make_partition_pruneinfo()
>>
>>     /* Initialize to -1 to indicate the rel was not found */
>>     for (i = 0; i < root->simple_rel_array_size; i++)
>>     {
>>         allsubnodeindex[i] = -1;
>>         allsubpartindex[i] = -1;
>>     }
>>
>> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
>> initialization.  Instead make the indexes that are stored in these start
>> with 1 and consider 0 as invalid entries.
> 
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. Seems more likely to result in bugs where we might forget
> to do the - 1.

You can subtract 1 right here in make_partition_pruneinfo before setting
the values in PartitionPruneInfo's subplan_indexes and parent_indexes.
I'm only proposing to make make_partition_pruneinfo() a bit faster by not
looping over both the local map arrays setting them to -1.

IOW, I'm not saying that we emit PartitionPruneInfo nodes that contain
1-based indexes.

> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?

Yeah, that's it.

>> * Instead of nesting PartitionedRelPruning inside another, just store them
>> in a global flat array in the PartitionPruning, like PartitionTupleRouting
>> does for PartitionDispatch of individual partitioned tables in the tree.
>>
>> typedef struct PartitionedRelPruning
>> {
>>     int         nparts;
>>     int        *subnodeindex;
>>     struct PartitionedRelPruning **subpartprune;
> 
> There is a flat array in PartitionPruning. subpartprune contains
> pointers into that array. I want to have this pointer array so I can
> directly reference the flat array in PartitionPruning.
> 
> Maybe I've misunderstood what you mean here.

I think we can save some space here by not having the pointers stored
here.  Instead of passing the pointer itself in the recursive calls to
find_subplans_for_extparams_recurse, et al, just pass the entire array and
an offset to use for the given recursive invocation.

If you look at ExecFindPartition, used for tuple routing, you may see that
there is no recursion at all.  Similarly, find_subplans_for_extparams_recurse,
et al might be able to avoid recursion if similar tricks are used.

Finally, about having two different functions for different sets of Params:
can't we have just one, named find_subplans_for_params_recurse(), and use
the appropriate Param set based on the value of some parameter?  I can't
help but notice the duplication of code.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
(sending my reply in parts for concurrency)

On 6 April 2018 at 14:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think we can save some space here by not having the pointers stored
> here.  Instead of passing the pointer itself in the recursive calls to
> find_subplans_for_extparams_recurse, et al, just pass the entire array and
> an offset to use for the given recursive invocation.

hmm, where would those offsets be stored?  I don't want to have to do
any linear searching to determine the offset, which is why I just
stored the pointer to the flat array. It seems very efficient to me to
do this. Remember that this pruning can occur per tuple in cases like
parameterized nested loops.

Are you worried about memory consumption? It's one pointer per
partition. I imagine there's lots more allocated for DML on a
partitioned table as it needs to store maps to map attribute numbers.

Or are you thinking that storing an array of 32-bit int values is
better than an array of (probably 64-bit) pointers, since it requires
half the space?

> If you look at ExecFindPartition used for tuple routing, you may see that
> there is no recursion at all.  Similarly find_subplans_for_extparams_recurse,
> et al might be able to avoid recursion if similar tricks are used.

That seems pretty different. That's looking for a single node in a
tree, so it just follows a single path from the root; it never has
to go back up a level and look down any other paths.

What we need for the find_subplans_for_extparams_recurse is to find
all nodes in the entire tree which match the given clause. Doing this
without recursion would require some sort of stack so we can go back
up a level and search again down another branch.  There are ways
around this without using recursion, sure, but I don't think any of
them will be quite as convenient and simple. The best I can think of
is to palloc some stack manually and use some depth_level to track
which element to use.  An actual stack seems simpler. I can't
quite think of a good way to know in advance how many elements we'd
need to palloc.

> Finally about having two different functions for different sets of Params:
> can't we have just named find_subplans_for_params_recurse() and use the
> appropriate one based on the value of some parameter?  I can't help but
> notice the duplication of code.

I had decided not to make this one function previously as I didn't
really want to add unnecessary branching in the code. After
implementing it, it does not look as bad as I thought.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David,

On 2018/04/06 12:27, David Rowley wrote:
> (sending my reply in parts for concurrency)
> 
> On 6 April 2018 at 14:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think we can save some space here by not having the pointers stored
>> here.  Instead of passing the pointer itself in the recursive calls to
>> find_subplans_for_extparams_recurse, et al, just pass the entire array and
>> an offset to use for the given recursive invocation.
> 
> hmm, where would those offsets be stored?  I don't want to have to do
> any linear searching to determine the offset, which is why I just
> stored the pointer to the flat array. It seems very efficient to me to
> do this. Remember that this pruning can occur per tuple in cases like
> parameterized nested loops.
> 
> Are you worried about memory consumption? It's one pointer per
> partition. I imagine there's lots more allocated for DML on a
> partitioned table as it needs to store maps to map attribute numbers.
> 
> Or are you thinking the saving of storing an array of 32-bit int
> values is better than the array of probably 64-bit pointers? So
> requires half the space?

Yeah, just copy it from the PartitionPruneInfo like you're doing for
subnodeindex:

   memcpy(partrelprune->subpartindex, pinfo->subpartindex,
          sizeof(int) * pinfo->nparts);

Instead I see ExecSetupPartitionPruning is now doing this:

  /*
   * Setup the PartitionedRelPruning's subpartprune so that we can
   * quickly find sub-PartitionedRelPruning details for any
   * sub-partitioned tables that this partitioned table contains.
   * We need to be able to find these quickly during our recursive
   * search to find all matching subnodes.
   */
   for (j = 0; j < pinfo->nparts; j++)
   {
       int         subpartidx = pinfo->subpartindex[j];

       Assert(subpartidx < list_length(partitionpruneinfo));

       if (subpartidx >= 0)
           partrelprune->subpartprune[j] = &partrelprunes[subpartidx];
       else
           partrelprune->subpartprune[j] = NULL;
   }


With that in place, pass the index/offset instead of the pointer to the
next recursive invocation of find_subplans_*, along with the array
containing all PartitionedRelPruning's.

So, where you have in each of find_subplans_*:

  if (partrelprune->subnodeindex[i] >= 0)
      *validsubplans = bms_add_member(*validsubplans,
                                      partrelprune->subnodeindex[i]);
  else if (partrelprune->subpartprune[i] != NULL)
      find_subplans_for_allparams_recurse(partrelprune->subpartprune[i],
                                          validsubplans);

I'm proposing that you do:

  if (partrelprune->subnodeindex[i] >= 0)
      *validsubplans = bms_add_member(*validsubplans,
                                      partrelprune->subnodeindex[i]);
  else if (partrelprune->subpartindex[i] >= 0)
      find_subplans_for_allparams_recurse(all_partrelprunes,
                                          partrelprune->subpartindex[i],
                                          validsubplans);

And at the top of each of  find_subplans_*:

  PartitionedRelPruning *partrelprune = all_partrelprune[offset];

where the signature is:

  static void
  find_subplans_for_allparams_recurse(
                                 PartitionedRelPruning **all_partrelprune,
                                 int offset,
                                 Bitmapset **validsubplans)

The all_partrelprune above refers to the flat array in PartitionPruning.
On the first call from ExecFindMatchingSubPlans, et al, you'd pass 0 for
offset to start pruning with the root parent's PartitionedRelPruning.  All
the values contained in subnodeindex and subpartindex are indexes into the
global (whole-tree) array anyway, and that fact would be more apparent if
we used this code structure, imho.

>> If you look at ExecFindPartition used for tuple routing, you may see that
>> there is no recursion at all.  Similarly find_subplans_for_extparams_recurse,
>> et al might be able to avoid recursion if similar tricks are used.
> 
> That seems pretty different. That's looking for a single node in a
> tree, so just is following a single path from the root, it never has
> to go back up a level and look down any other paths.
> 
> What we need for the find_subplans_for_extparams_recurse is to find
> all nodes in the entire tree which match the given clause. Doing this
> without recursion would require some sort of stack so we can go back
> up a level and search again down another branch.  There are ways
> around this without using recursion, sure, but I don't think any of
> them will be quite as convenient and simple. The best I can think of
> is to palloc some stack manually and use some depth_level to track
> which element to use.  An actual stack seems more simple. I can't
> quite think of a good way to know in advance how many elements we'd
> need to palloc.

Hmm, yeah.  I just remembered that I had to give up suggesting this a
while back on this thread.  So, okay, you don't need to do anything about
this.

>> Finally about having two different functions for different sets of Params:
>> can't we have just named find_subplans_for_params_recurse() and use the
>> appropriate one based on the value of some parameter?  I can't help but
>> notice the duplication of code.
> 
> I had decided not to make this one function previously as I didn't
> really want to add unnecessary branching in the code. After
> implementing it, it does not look as bad as I thought.

You could at the top of, say, find_subplans_for_params_recurse(..., bool
extparam):

Bitmapset *params = extparam
                        ? partrelprune->extparams
                        : partrelprune->allparams;

ISTM, find_subplans_for_extparams_recurse and
find_subplans_for_allparams_recurse contain the exact same code except
these two lines in the two functions, respectively:

    if (!bms_is_empty(partrelprune->extparams))
    {
        context->safeparams = partrelprune->extparams;

    if (!bms_is_empty(partrelprune->allparams))
    {
        context->safeparams = partrelprune->allparams;
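
To be concrete, something along these lines is what I'm imagining -- a
rough, untested sketch only; the argument list is abbreviated, the body
mostly elided, and everything not shown stays as in the existing two
functions:

static void
find_subplans_for_params_recurse(PartitionedRelPruning *partrelprune,
                                 bool allparams,
                                 Bitmapset **validsubplans)
{
    /* the only real difference: pick the Param set up front */
    Bitmapset  *params = allparams ? partrelprune->allparams
                                   : partrelprune->extparams;

    if (!bms_is_empty(params))
    {
        context->safeparams = params;   /* 'context' as in the existing code */

        /*
         * ... pruning and recursion as before, passing 'allparams' through
         * to the recursive calls ...
         */
    }
}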

I didn't suggest the same for, say, ExecFindInitialMatchingSubPlans and
ExecFindMatchingSubPlans because they seem to be at least a bit different
in their missions.  IIUC, the former has to adjust the index values
(those in subnodeindex and subpartindex) after having discovered a new set
of matching partitions in the tree when pruning with external params at
Append/MergeAppend startup, and those params won't change after that point.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 6 April 2018 at 00:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> * Regarding PartitionPruneInfo:
>
> I think the names of the fields could be improved -- like pruning_steps
> instead prunesteps, unpruned_parts instead of allpartindexs.  The latter
> is even a bit misleading because it doesn't in fact contain *all*
> partition indexes, only those that are unpruned, because each either has a
> subpath or it appears in (unpruned) partitioned_rels list.  Also, I didn't
> like the name subnodeindex and subpartindex very much.  subplan_indexes
> and parent_indexes would sound more informative to me.

I've done a load of renaming. I went with subnode_map and subpart_map
instead of _indexes. I thought this was better. Let me know if you
disagree.

> * make_partition_pruneinfo has a parameter resultRelations that's not used
> anywhere

I've taken this out of 0001 and it is now introduced again in 0005,
where it's used.

> * Also in make_partition_pruneinfo()
>
>     /* Initialize to -1 to indicate the rel was not found */
>     for (i = 0; i < root->simple_rel_array_size; i++)
>     {
>         allsubnodeindex[i] = -1;
>         allsubpartindex[i] = -1;
>     }
>
> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
> initialization.  Instead make the indexes that are stored in these start
> with 1 and consider 0 as invalid entries.

I've made this change.

> * Regarding the code added in execPartition.c and execPartition.h:
>
> I wondered why PartitionedRelPruning is named the way it is.  I saw many
> parallels with PartitionDispatchData both in terms of the main thing it
> consists of, that is, the map that translates partition indexes as in
> partition descriptor to that of subplans or of some other executor
> structure.  Also, I imagine you tried to mimic PartitionTupleRouting with
> PartitionPruning but not throughout.  For example, tuple routing struct
> pointer variables are throughout called proute, whereas PartitionPruning
> ones are called partprune instead of, say, pprune.  Consistency would
> help, imho.

I've made a series of changes here too, but didn't use the word
"Dispatch" anywhere. I'm not really sure what the origin of this word
is. To me, it means to send something somewhere, which I thought might
be why we see it in tuple routing, since the tuple is being "dispatched"
to the correct partition.

We're not dispatching anything anywhere in partition pruning, so I don't
really think the term can be used here. Although, if you see some
other reason for using that word, please explain.

> * Instead of nesting PartitionedRelPruning inside another, just store them
> in a global flat array in the PartitionPruning, like PartitionTupleRouting
> does for PartitionDispatch of individual partitioned tables in the tree.
>
> typedef struct PartitionedRelPruning
> {
>     int         nparts;
>     int        *subnodeindex;
>     struct PartitionedRelPruning **subpartprune;

I've kept the same subpart_map from the PartitionPruneInfo. I actually
ended up using that one without performing a memcpy() on it, since
we're not changing it anywhere. That might or might not be a good idea,
since we might one day think we can change it, which would alter the
plan's copy; but on the other hand, there's a little performance boost
in not performing a copy.

> * I don't see why there needs to be nparts in the above, because it
> already has a PartitionPruneContext member which has that information.

I've removed that field.

I've also been doing a bit of work on the 0005 patch:

1. It now properly supports skipping subplans when exec params can
eliminate certain partitions. Previously I'd only coded it to work
with external params.
2. Fixed bug where statement level triggers would not fire when all
partitions were pruned.
3. Added tests

The patch set is still based on v50 of Amit's faster partition pruning patch [1].

[1] https://www.postgresql.org/message-id/77a518ac-e4a0-4cd1-9988-e5d754a6501f%40lab.ntt.co.jp

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
I rebased this series on top of the committed version of the other patch.
Here's v22, with no other changes than rebasing.  I did not include
0005, though.

Git changed the author tag for 0001.  Not intentional.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> I rebased this series on top of the committed version of the other patch.
> Here's v22, with no other changes than rebasing.  I did not include
> 0005, though.

Apologies, I forgot to "git add" one fixup for 0001.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 April 2018 at 09:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Alvaro Herrera wrote:
>> I rebased this series on top of the committed version of the other patch.
>> Here's v22, with no other changes than rebasing.  I did not include
>> 0005, though.
>
> Apologies, I forgot to "git add" one fixup for 0001.

0003 I think.

I'm looking over the rebased patches now.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 April 2018 at 10:45, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'm looking over the rebased patches now.

I've made a complete read of 0001 and 0002 so far.

Your rebase looks fine.

After the complete read, I only have the following comments:

0001:

1. missing "the" before "partition key":

* Extract Params matching partition key and record if we got any.

2. Is this the property name we're going to stick with:

ExplainPropertyInteger("Subplans Pruned", NULL, nplans - nsubnodes, es);

Other ideas are: "Subplans Removed"

3. In the following comment I've used the word "hierarchy", but maybe
we need to add the word "flattened" before it.

 * PartitionPruning - Encapsulates a hierarchy of PartitionRelPruning

4. Comment mentions "after init plan", but really we can only know the
value of an exec param during actual execution. So:

* Parameters that are safe to be used for partition pruning. execparams
* are not safe to use until after init plan.

maybe better as:

* Parameters that are safe to be used for partition pruning. execparams
* are not safe to use until the executor is running.

0002:

Looks fine. But if I was committing this, to give me confidence, I'd
want to know how the left_most_one table was generated.

I used:

#include <stdio.h>
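
/*
 * Prints the 256-entry left_most_one[] lookup table: entry i is the
 * 0-based index of the highest bit set in i (the leading 0 stands in
 * for i = 0), 16 entries per output line.
 */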

int main(void)
{
        int i = 1;

        printf("0, ");
        while (i < 256)
        {
                printf("%d, ", 31 - __builtin_clz(i));
                if ((i & 0xf) == 0xf)
                        putchar('\n');
                i++;
        }
        return 0;
}

Continuing to read 0003 and 0004 now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 April 2018 at 12:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Continuing to read 0003 and 0004 now.


0003:

1. "setup" -> "set"

/* If run-time partition pruning is enabled, then setup that up now */

2. We should be able to get rid of as_noopscan and just have another
special negative value for as_whichplan.

I've attached a patch to do this.

3. I've forgotten to drop table boolvalues in the tests.

Patch attached to fix.
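
Regarding point 2, roughly what I have in mind is the following -- an
illustrative sketch only, the attached patch may use different names:

/*
 * as_whichplan normally holds the index of the subplan currently being
 * scanned, so a second special negative value can replace a separate
 * as_noopscan flag.
 */
#define INVALID_SUBPLAN_INDEX   -1  /* no subplan chosen yet */
#define NO_MATCHING_SUBPLANS    -2  /* run-time pruning removed them all */

/* then, near the top of ExecAppend(), something like: */
if (node->as_whichplan == NO_MATCHING_SUBPLANS)
    return ExecClearTuple(node->ps.ps_ResultTupleSlot);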

0004:

1. "ms_valid_subplans" -> "valid_subplans" in:

 * ms_valid_subplans for runtime pruning, valid mergeplans indexes to
 * scan.

All the other fields are not being prefixed with ms_ in these comments.

Everything else looks fine from my point of view.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Everything else looks fine from my point of view.

Me too, although I still think having struct names PartitionPruning
and PartitionRelPruning is going to be a bit confusing.  We should
think about renaming the latter to something else.  I suggested
PartitionPruningDispatch(Data), but David doesn't seem to like it.
Maybe, PartitionPruneState, because it parallels the
PartitionPruneInfo that comes from the planner for every partitioned
table in the tree.

Thanks,
Amit


Re: [HACKERS] Runtime Partition Pruning

From
Andres Freund
Date:
On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > Everything else looks fine from my point of view.
> 
> Me too, although I still think having struct names PartitionPruning
> and PartitionRelPruning is going to be  a bit confusing.  We should
> think about naming the latter to something else.  I suggested
> PartitionPruningDispatch(Data), but David doesn't seem to like it.
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

I've not followed this thread/feature at all, but I don't find the
comments atop partprune.c even remotely sufficient. Unless there's an
README hidden or such hidden somewhere?

Greetings,

Andres Freund


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 April 2018 at 16:26, Amit Langote <amitlangote09@gmail.com> wrote:
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

I like that.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 April 2018 at 16:31, Andres Freund <andres@anarazel.de> wrote:
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's an
> README hidden or such hidden somewhere?

There's not a README file. Regarding the comments for partprune.c, do
you want them to explain more about how partition pruning works, or how
this patch uses the existing code?

Probably if we need to explain more there about how pruning works then
it should be a fixup patch to 9fdb675fc, no?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
>> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>> > Everything else looks fine from my point of view.
>>
>> Me too, although I still think having struct names PartitionPruning
>> and PartitionRelPruning is going to be  a bit confusing.  We should
>> think about naming the latter to something else.  I suggested
>> PartitionPruningDispatch(Data), but David doesn't seem to like it.
>> Maybe, PartitionPruneState, because it parallels the
>> PartitionPruneInfo that comes from the planner for every partitioned
>> table in the tree.
>
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's an
> README hidden or such hidden somewhere?

Sorry there isn't a README and I agree partprune.c's header comment
could be improved quite a bit.

Just to be clear, that's the fault of the patch that was already
committed earlier today (9fdb675fc "Faster partition pruning"), not
this patch, which just extends partition.c's functionality to
implement additional planner and executor support for runtime pruning.

I'm drafting a patch that expands the partprune.c comment and will post shortly.

Thanks,
Amit


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 1:58 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Probably if we need to explain more there about how pruning works then
> it should be a fixup patch to 9fdb675fc, no?

Yes, I just replied and working on a patch.

Thanks,
Amit


Re: [HACKERS] Runtime Partition Pruning

From
Andres Freund
Date:
On 2018-04-07 16:58:01 +1200, David Rowley wrote:
> On 7 April 2018 at 16:31, Andres Freund <andres@anarazel.de> wrote:
> > I've not followed this thread/feature at all, but I don't find the
> > comments atop partprune.c even remotely sufficient. Unless there's an
> > README hidden or such hidden somewhere?
> 
> There's not a README file. The comments for partprune.c, do you want
> this to explain more about how partition pruning works or how this
> patch uses the existing code?

Primarily the first. This isn't trivial, straightforward code.


> Probably if we need to explain more there about how pruning works then
> it should be a fixup patch to 9fdb675fc, no?

Yea, it's about that. Sorry for accidentally jumping on the wrong
thread.

Greetings,

Andres Freund


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 1:58 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
>> I've not followed this thread/feature at all, but I don't find the
>> comments atop partprune.c even remotely sufficient. Unless there's an
>> README hidden or such hidden somewhere?
>
> Sorry there isn't a README and I agree partprune.c's header comment
> could be improved quite a bit.
>
> Just to be clear, that's the fault of the patch that was already
> committed earlier today (9fdb675fc "Faster partition pruning"), not
> this patch, which just extends partition.c's functionality to
> implement additional planner and executor support for runtime pruning.
>
> I'm drafting a patch that expands the partprune.c comment and will post shortly.

See if the attached makes it any better.

Now I know we don't have the runtime pruning in yet, but since the
proposed patch would extend its functionality I have included its
description in the comment.

Thanks,
Amit

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 7 April 2018 at 16:26, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Everything else looks fine from my point of view.
>
> Me too, although I still think having struct names PartitionPruning
> and PartitionRelPruning is going to be  a bit confusing.  We should
> think about naming the latter to something else.  I suggested
> PartitionPruningDispatch(Data), but David doesn't seem to like it.
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

Ok, so I've gone and done this.

PartitionPruning has become PartitionPruneState
PartitionRelPruning has become PartitionPruningData

I've changed pointers to PartitionPruneStates to be named prunestate,
sometimes with the node prefix (as_, ma_); in these cases prune and
state are separated with a _, which seems to be the general rule for
executor state struct members.

Generally, pointers to PartitionPruningData are now named pprune.
Hopefully, that's ok, as this was the name previously used for
PartitionPruning pointers.

I applied the patch to get rid of as_noop_scan in favour of using a
special as_whichplan value. There was already one special value
(INVALID_SUBPLAN_INDEX), so it seemed better to build on that rather than
inventing something new. This also means we don't have to make the
AppendState struct any wider, which seems like a good thing.
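
To illustrate the shape of that, here is a hedged standalone sketch (not
the committed nodeAppend.c code; only the INVALID_SUBPLAN_INDEX name
mirrors the real constant, everything else here is made up):

#include <stdio.h>

#define INVALID_SUBPLAN_INDEX   (-1)

/* Stand-in for AppendState: one index doubles as the "no-op scan" flag. */
struct append_state
{
    int     nsubplans;
    int     whichplan;      /* current subplan, or INVALID_SUBPLAN_INDEX */
};

/* Pretend run-time pruning decided that no subplan needs scanning. */
static void
prune_all(struct append_state *as)
{
    as->whichplan = INVALID_SUBPLAN_INDEX;
}

static void
exec_append(struct append_state *as)
{
    if (as->whichplan == INVALID_SUBPLAN_INDEX)
    {
        printf("no matching subplans; returning no rows\n");
        return;
    }
    printf("scanning subplan %d of %d\n", as->whichplan, as->nsubplans);
}

int
main(void)
{
    struct append_state as = {3, 0};

    exec_append(&as);       /* scans subplan 0 */
    prune_all(&as);
    exec_append(&as);       /* behaves as the old as_noop_scan case */
    return 0;
}

The point being that reserving one sentinel value keeps the struct no wider
than before.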

I made all the fixups which I mentioned in my review earlier and also
re-removed the resultRelation parameter from make_partition_pruneinfo.
It sneaked back into v22.

v23 is attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> See if the attached makes it any better.
> 
> Now I know we don't have the runtime pruning in yet, but since the
> proposed patch would extend its functionality I have included its
> description in the comment.

Thanks!

I edited it as attached, to 1. avoid mentioning functionality that
doesn't yet exist, and 2. avoid excessive internal detail (we want a
high-level overview here), which from experience gets outdated pretty
quickly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 8 April 2018 at 00:23, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I edited it as attached, to 1. avoid mentioning functionality that
> doesn't yet exist, and 2. avoid excessive internal detail (we want a
> high-level overview here), which from experience gets outdated pretty
> quickly.

It's not exactly wrong but:

+ * are turned into a set of "pruning steps", which are then executed to
+ * produce a set of RTIs of partitions whose bounds satisfy the constraints in
+ * the step.  Partitions not in the set are said to have been pruned.

It's only prune_append_rel_partitions, which is used solely for the
planner's pruning needs, that converts the partition indexes to RTIs.
Would it be better to mention that the output is partition indexes?
Maybe:

"which are then executed to produce a set of partition indexes whose
bounds satisfy the constraints in the step. These partition indexes
may then be translated into RTIs", or maybe even not mention the RTIs.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
David Rowley wrote:

> It's not exactly wrong but:
> 
> + * are turned into a set of "pruning steps", which are then executed to
> + * produce a set of RTIs of partitions whose bounds satisfy the constraints in
> + * the step.  Partitions not in the set are said to have been pruned.
> 
> It's only prune_append_rel_partitions which is only used for the
> planner's pruning needs that converts the partition indexes to RTIs.
> Would it be better to mention that the output is partition indexes?
> Maybe:
> 
> "which are then executed to produce a set of partition indexes whose
> bounds satisfy the constraints in the step. These partition indexes
> may then be translated into RTIs", or maybe even not mention the RTIs.

Amit had it as "indexes" also in his original.  I wanted to avoid using
the "indexes" word alone, whose meaning is so overloaded.  How about
this?

"... which are then executed to produce a set of partitions (as indexes
of resultRelInfo->part_rels array) that satisfy the constraints in the
step".

Maybe "the boundinfo array" instead of part_rels, which as I understand
also uses the same indexing as the other array, and partprune mostly
works based on boundinfo anyway?

Not mentioning RTIs seems fine.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 8 April 2018 at 01:18, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Amit had it as "indexes" also in his original.  I wanted to avoid using
> the "indexes" word alone, whose meaning is so overloaded.

hmm, good point.

> How about this?
> "... which are then executed to produce a set of partitions (as indexes
> of resultRelInfo->part_rels array) that satisfy the constraints in the
> step".

Works for me, but with RelOptInfo rather than resultRelInfo.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
David Rowley wrote:
> On 8 April 2018 at 01:18, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Amit had it as "indexes" also in his original.  I wanted to avoid using
> > the "indexes" word alone, whose meaning is so overloaded.
> 
> hmm, good point.
> 
> > How about this?
> > "... which are then executed to produce a set of partitions (as indexes
> > of resultRelInfo->part_rels array) that satisfy the constraints in the
> > step".
> 
> Works for me, but with RelOptInfo rather than resultRelInfo.

Oops, sorry about that.  Pushed now, adding one line to
get_matching_partitions also.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Jesper Pedersen
Date:
Hi,

On 04/07/2018 04:45 AM, David Rowley wrote:
> Ok, so I've gone and done this.
> 
> PartitionPruning has become PartitionPruneState
> PartitionRelPruning has become PartitionPruningData
> 
> I've changed pointers to PartitionPruneStates to be named prunestate,
> sometimes having the node prefix; as_, ma_, in these cases prune and
> state are separated with a _ which seems to be the general rule for
> executor state struct members.
> 
> Generally, pointers to PartitionPruningData are now named pprune.
> Hopefully, that's ok, as this was the name previously used for
> PartitionPruning pointers.
> 
> I applied the patch to get rid of as_noop_scan in favour of using a
> special as_whichplan value. There was already one special value
> (INVALID_SUBPLAN_INDEX), so seemed better to build on that rather than
> inventing something new. This also means we don't have to make the
> AppendState struct and wider too, which seems like a good thing to try
> to do.
> 
> I made all the fixups which I mentioned in my review earlier and also
> re-removed the resultRelation parameter from make_partition_pruneinfo.
> It sneaked back into v22.
> 
> v23 is attached.
> 

Passes check-world.

Changing explain.c to "Subplans Removed" as suggested by you in [1] is a 
good idea.

[1] https://www.postgresql.org/message-id/CAKJS1f99JnkbOshdV_4zoJZ96DPtKeHMHv43JRL_ZdHRkkVKCA%40mail.gmail.com

Best regards,
  Jesper


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
I pushed this patch -- 0001, 0002 and 0003 only.  I did not include
anything from 0004 and 0005; I didn't even get to the point of reading
them, so that I could focus on the first part.

I did not find anything to complain about.  I made a few adjustments and
asked David to supply a paragraph for perform.sgml (the "Using EXPLAIN"
section) which is included here.  I also adopted Jesper's (actually
David's) suggestion of changing "Partitions Pruned" to "Partitions
Removed" in the EXPLAIN output.

I had reservations about a relation_open() in the new executor code.  It
seemed a bit odd; we don't have any other relation_open in the executor
anywhere.  However, setting up the pruneinfo needs some stuff from
relcache that I don't see a reasonable mechanism to pass through
planner.  I asked Andres about it on IM and while he didn't endorse the
patch in any way, his quick opinion was that "it wasn't entirely
insane".  I verified that we already hold lock on the relation.


While we didn't get fast pruning support for MergeAppend or the
DELETE/UPDATE parts, I think those are valuable and recommend to
resubmit those for PG12.

Thank you!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 8 April 2018 at 09:13, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I pushed this patch -- 0001, 0002 and 0003 only.  I did not include
> anything from 0004 and 0005; I didn't even get to the point of reading
> them, so that I could focus on the first part.

Oh great! Thank you for working on this and pushing it, especially so
during your weekend.

> While we didn't get fast pruning support for MergeAppend or the
> DELETE/UPDATE parts, I think those are valuable and recommend to
> resubmit those for PG12.

Thanks. I'll certainly be doing that.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I had reservations about a relation_open() in the new executor code.  It
> seemed a bit odd; we don't have any other relation_open in the executor
> anywhere.  However, setting up the pruneinfo needs some stuff from
> relcache that I don't see a reasonable mechanism to pass through
> planner.  I asked Andres about it on IM and while he didn't endorse the
> patch in any way, his quick opinion was that "it wasn't entirely
> insane".  I verified that we already hold lock on the relation.

I don't get this.  The executor surely had to (and did) open all of
the relations somewhere even before this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I had reservations about a relation_open() in the new executor code.  It
> > seemed a bit odd; we don't have any other relation_open in the executor
> > anywhere.  However, setting up the pruneinfo needs some stuff from
> > relcache that I don't see a reasonable mechanism to pass through
> > planner.  I asked Andres about it on IM and while he didn't endorse the
> > patch in any way, his quick opinion was that "it wasn't entirely
> > insane".  I verified that we already hold lock on the relation.
> 
> I don't get this.  The executor surely had to (and did) open all of
> the relations somewhere even before this patch.

Yeah.

I was worried that this coding could be seen as breaking modularity, or
trying to do excessive work.  However, after looking closer at it, it
doesn't really look like it's the case.  So, nevermind.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Robert Haas wrote:
>> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> > I had reservations about a relation_open() in the new executor code.  It
>> > seemed a bit odd; we don't have any other relation_open in the executor
>> > anywhere.  However, setting up the pruneinfo needs some stuff from
>> > relcache that I don't see a reasonable mechanism to pass through
>> > planner.  I asked Andres about it on IM and while he didn't endorse the
>> > patch in any way, his quick opinion was that "it wasn't entirely
>> > insane".  I verified that we already hold lock on the relation.
>>
>> I don't get this.  The executor surely had to (and did) open all of
>> the relations somewhere even before this patch.
>
> Yeah.
>
> I was worried that this coding could be seen as breaking modularity, or
> trying to do excessive work.  However, after looking closer at it, it
> doesn't really look like it's the case.  So, nevermind.

Well what I'm saying is that it shouldn't be necessary.  If the
relations are being opened already and the pointers to the relcache
entries are being saved someplace, you shouldn't need to re-open them
elsewhere to get pointers to the relcache entries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Robert Haas wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.
> >
> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

Oh, okay.  I can look into that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.

> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

I looked a bit more into this.  It turns out that we have indeed opened
the relation before -- first in parserOpenTable (for addRangeTableEntry),
then in expandRTE, then in QueryRewrite, then in subquery_planner, then
in get_relation_info.

So, frankly, since each module thinks it's okay to open it every once in
a while, I'm not sure we should be terribly stressed about doing it once
more for partition pruning.  Particularly since communicating the
pointer seems to be quite troublesome.

To figure this out, I used the attached patch (not intended for application)
to add a backtrace to each log message, plus a couple of accusatory
elog() calls in relation_open and ExecSetupPartitionPruneState.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/04/11 6:32, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> 
>>>> I don't get this.  The executor surely had to (and did) open all of
>>>> the relations somewhere even before this patch.
> 
>>> I was worried that this coding could be seen as breaking modularity, or
>>> trying to do excessive work.  However, after looking closer at it, it
>>> doesn't really look like it's the case.  So, nevermind.
>>
>> Well what I'm saying is that it shouldn't be necessary.  If the
>> relations are being opened already and the pointers to the relcache
>> entries are being saved someplace, you shouldn't need to re-open them
>> elsewhere to get pointers to the relcache entries.
> 
> I looked a bit more into this.  It turns out that we have indeed opened
> the relation before -- first in parserOpenTable (for addRangeTableEntry),
> then in expandRTE, then in QueryRewrite, then in subquery_planner, then
> in get_relation_info.
> 
> So, frankly, since each module thinks it's okay to open it every once in
> a while, I'm not sure we should be terribly stressed about doing it once
> more for partition pruning.  Particularly since communicating the
> pointer seems to be quite troublesome.

Maybe Robert meant somewhere in the executor itself, before ExecInitAppend
(or, more precisely, ExecSetupPartitionPruneState) is called.  But that
doesn't seem to be the case.

For the result relation case (insert/update/delete on a partitioned
table), we don't need to do extra relation_opens, as InitPlan opens the
needed relations when building the ResultRelInfo(s), from where they're
later accessed, for example, in ExecInitModifyTable.

However, in the Append/MergeAppend cases, we don't open the partitioned
tables at any earlier point in the executor.  InitPlan doesn't touch
them.  In ExecInitAppend, the ExecLockNonLeafAppendTables call made before
ExecSetupPartitionPruneState does not open them, it just locks them using
LockRelationOid.

So, the relation_open on partitioned tables in ExecSetupPartitionPruneState
seems to be the first time they're opened *in the executor*.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 11 April 2018 at 09:32, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Robert Haas wrote:
>> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
>> >> I don't get this.  The executor surely had to (and did) open all of
>> >> the relations somewhere even before this patch.
>
>> > I was worried that this coding could be seen as breaking modularity, or
>> > trying to do excessive work.  However, after looking closer at it, it
>> > doesn't really look like it's the case.  So, nevermind.
>>
>> Well what I'm saying is that it shouldn't be necessary.  If the
>> relations are being opened already and the pointers to the relcache
>> entries are being saved someplace, you shouldn't need to re-open them
>> elsewhere to get pointers to the relcache entries.
>
> I looked a bit more into this.  It turns out that we have indeed opened
> the relation before -- first in parserOpenTable (for addRangeTableEntry),
> then in expandRTE, then in QueryRewrite, then in subquery_planner, then
> in get_relation_info.
>
> So, frankly, since each module thinks it's okay to open it every once in
> a while, I'm not sure we should be terribly stressed about doing it once
> more for partition pruning.  Particularly since communicating the
> pointer seems to be quite troublesome.

I guess the problem there would be there's nothing to say that parse
analysis will shortly be followed by a call to the planner, and a call
to the planner does not mean the plan is about to be executed. So I
don't think it would be possible to keep pointers to relcache entries
between these modules, and it would be hard to determine whose
responsibility it would be to call relation_close().

It might be possible to do something better in each module by keeping
an array indexed by RTI, with each entry NULL initially; then, on the
first relation_open, we'd set the element in the array to that pointer.

This might mean we'd save a few relation_open calls, but I don't know
if there would be a way to somehow remove the Relation from the array
on relation_close.  Having something like this might mean we could
detect lock upgrade hazards more easily, but the whole thing is a
cache on top of a cache which does seem a bit weird.  relation_open()
should be pretty cheap if the relation is already open. It's just a
hash table lookup. What is described above just changes that to an
array lookup.  It also does nothing for index_open.

However, something like the above would simplify
ExecLockNonLeafAppendTables() a bit and get rid of the O(N^2) which
checks the partition is not a result relation.
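
A minimal standalone sketch of that lookup-array idea (all types and names
here are illustrative, not PostgreSQL's):

#include <stdio.h>
#include <stdlib.h>

typedef struct Relation { int rti; } Relation;

typedef struct ExecState
{
    int         nrels;          /* number of range table entries */
    Relation  **relations;      /* indexed by RTI - 1; NULL until first open */
} ExecState;

/* Stand-in for the hash-table-backed open; called at most once per RTI. */
static Relation *
expensive_open(int rti)
{
    Relation *rel = malloc(sizeof(Relation));

    rel->rti = rti;
    printf("opened relation for RTI %d\n", rti);
    return rel;
}

static Relation *
get_relation(ExecState *estate, int rti)
{
    if (estate->relations[rti - 1] == NULL)
        estate->relations[rti - 1] = expensive_open(rti);
    return estate->relations[rti - 1];
}

int
main(void)
{
    ExecState   estate;

    estate.nrels = 4;
    estate.relations = calloc(estate.nrels, sizeof(Relation *));

    get_relation(&estate, 2);   /* opens the relation */
    get_relation(&estate, 2);   /* second lookup is just an array fetch */
    return 0;
}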

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Apr 12, 2018 at 12:40 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I guess the problem there would be there's nothing to say that parse
> analysis will shortly be followed by a call to the planner, and a call
> to the planner does not mean the plan is about to be executed. So I
> don't think it would be possible to keep pointers to relcache entries
> between these modules, and it would be hard to determine whose
> responsibility it would be to call relation_close().

Yeah, that's definitely a non-starter.

> It might be possible to do something better in each module by keeping
> an array indexed by RTI which have each entry NULL initially then on
> first relation_open set the element in the array to that pointer.

I'm not sure that makes a lot of sense in the planner, but in the
executor it might be a good idea.   See also
https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
for related discussion.  I think that a coding pattern where we rely
on relation_open(..., NoLock) is inherently dangerous -- it's too easy
to be wrong about whether the lock is sure to have been taken.  It
would be much better to open the relation once and hold onto the
pointer, not just for performance reasons, but for robustness.

BTW, looking at ExecSetupPartitionPruneState:

        /*
         * Create a sub memory context which we'll use when making calls to the
         * query planner's function to determine which partitions will match.  The
         * planner is not too careful about freeing memory, so we'll ensure we
         * call the function in this context to avoid any memory leaking in the
         * executor's memory context.
         */

This is a sloppy cut-and-paste job, not only because somebody changed
one copy of the word "planner" to "executor" and left the others
untouched, but also because the rationale isn't really correct for the
executor anyway, which has memory contexts all over the place and
frees them all the time.  I don't know whether the context is not
needed at all or whether the context is needed but the rationale is
different, but I don't buy that explanation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 13 April 2018 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote:
> BTW, looking at ExecSetupPartitionPruneState:
>
>         /*
>          * Create a sub memory context which we'll use when making calls to the
>          * query planner's function to determine which partitions will
> match.  The
>          * planner is not too careful about freeing memory, so we'll ensure we
>          * call the function in this context to avoid any memory leaking in the
>          * executor's memory context.
>          */
>
> This is a sloppy cut-and-paste job, not only because somebody changed
> one copy of the word "planner" to "executor" and left the others
> untouched, but also because the rationale isn't really correct for the
> executor anyway, which has memory contexts all over the place and
> frees them all the time.  I don't know whether the context is not
> needed at all or whether the context is needed but the rationale is
> different, but I don't buy that explanation.

The comment is written exactly as intended. Unsure which of the
"planner"s you think should be "executor".

The context is needed. I can easily produce an OOM without it.



-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/04/13 1:57, Robert Haas wrote:
>> It might be possible to do something better in each module by keeping
>> an array indexed by RTI which have each entry NULL initially then on
>> first relation_open set the element in the array to that pointer.
> 
> I'm not sure that makes a lot of sense in the planner, but in the
> executor it might be a good idea.   See also
> https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
> for related discussion.  I think that a coding pattern where we rely
> on relation_open(..., NoLock) is inherently dangerous -- it's too easy
> to be wrong about whether the lock is sure to have been taken.  It
> would be much better to open the relation once and hold onto the
> pointer, not just for performance reasons, but for robustness.

About the specific relation_open(.., NoLock) under question, I think there
might be a way to address this by opening the tables with the appropriate
lock mode in partitioned_rels list in ExecLockNonleafAppendTables and
keeping the Relation pointers around until ExecEndNode.  Then instead of
ExecSetupPartitionPruneState doing relation_open/close(.., NoLock), it
just reuses the one that's passed by the caller.

Attached a PoC patch.  David, thoughts?

Thanks,
Amit

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/04/13 14:38, Amit Langote wrote:
> About the specific relation_open(.., NoLock) under question, I think there
> might be a way to address this by opening the tables with the appropriate
> lock mode in partitioned_rels list in ExecLockNonleafAppendTables

That may have sounded a bit confusing:

I meant to say: "by opening the tables in partitioned_rels list with the
appropriate lock mode in ExecLockNonleafAppendTables"

> Attached a PoC patch.

There were a couple of unnecessary hunks, which I removed in the attached.

Thanks,
Amit

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On 2018/04/13 14:48, Amit Langote wrote:
> On 2018/04/13 14:38, Amit Langote wrote:
>> About the specific relation_open(.., NoLock) under question, I think there
>> might be a way to address this by opening the tables with the appropriate
>> lock mode in partitioned_rels list in ExecLockNonleafAppendTables
> 
> That may have sounded a bit confusing:
> 
> I meant to say: "by opening the tables in partitioned_rels list with the
> appropriate lock mode in ExecLockNonleafAppendTables"
> 
>> Attached a PoC patch.
>
> There were a couple of unnecessary hunks, which removed in the attached.

Sorry, still a couple more were unnecessary.

Thanks,
Amit

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Thu, Apr 12, 2018 at 6:01 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 13 April 2018 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote:
>> BTW, looking at ExecSetupPartitionPruneState:
>>
>>         /*
>>          * Create a sub memory context which we'll use when making calls to the
>>          * query planner's function to determine which partitions will
>> match.  The
>>          * planner is not too careful about freeing memory, so we'll ensure we
>>          * call the function in this context to avoid any memory leaking in the
>>          * executor's memory context.
>>          */
>>
>> This is a sloppy cut-and-paste job, not only because somebody changed
>> one copy of the word "planner" to "executor" and left the others
>> untouched, but also because the rationale isn't really correct for the
>> executor anyway, which has memory contexts all over the place and
>> frees them all the time.  I don't know whether the context is not
>> needed at all or whether the context is needed but the rationale is
>> different, but I don't buy that explanation.
>
> The comment is written exactly as intended. Unsure which of the
> "planner"s you think should be "executor".
>
> The context is needed. I can easily produce an OOM without it.

Oh, crap.  You know, I totally misread what that comment was trying to
say.  Sorry.

But I wonder why it's the executor's job to clean up after the
planner, instead of adjusting the relevant planner functions to avoid
leaking memory?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 14 April 2018 at 05:04, Robert Haas <robertmhaas@gmail.com> wrote:
> But I wonder why it's the executor's job to clean up after the
> planner, instead of adjusting the relevant planner functions to avoid
> leaking memory?

It might be possible, but it might also be risky and difficult.

For a while, during my review of the faster partition pruning patch I
was asking Amit to add pfree() calls in various places for this exact
reason, but in the end, I gave up and decided it was easier to just
create a new memory context to call the planner function from. I've
now forgotten the exact reason why I finally decided it was too much
trouble.  The pruning code now works using your step logic so perhaps
that reason no longer applies, although, on a quick scan of the
pruning code now, it seems to require that get_matching_partitions
performs a deep pfree of each PruneStepResult. However, there is still
partkey_datum_from_expr which performs ExecInitExpr, although perhaps
that can just be done once, and the result stashed in the
PartitionPruneContext.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
David Rowley wrote:

> For a while, during my review of the faster partition pruning patch I
> was asking Amit to add pfree() calls in various places for this exact
> reason, but in the end, I gave up and decided it was easier to just
> create a new memory context to call the planner function from. I've
> now forgotten the exact reason why I finally decided it was too much
> trouble.  The pruning code now works using your step logic so perhaps
> that reason no longer applies, although, on a quick scan of the
> pruning code now, it seems to require that get_matching_partitions
> performs a deep pfree of each PruneStepResult. However, there is still
> partkey_datum_from_expr which performs ExecInitExpr, although perhaps
> that can just be done once, and the result stashed in the
> PartitionPruneContext.

I think trying to replace a well-placed MemoryContextReset (or Delete)
with a bunch of individual pfrees is pointless.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 17 April 2018 at 14:33, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> David Rowley wrote:
>
>> For a while, during my review of the faster partition pruning patch I
>> was asking Amit to add pfree() calls in various places for this exact
>> reason, but in the end, I gave up and decided it was easier to just
>> create a new memory context to call the planner function from. I've
>> now forgotten the exact reason why I finally decided it was too much
>> trouble.  The pruning code now works using your step logic so perhaps
>> that reason no longer applies, although, on a quick scan of the
>> pruning code now, it seems to require that get_matching_partitions
>> performs a deep pfree of each PruneStepResult. However, there is still
>> partkey_datum_from_expr which performs ExecInitExpr, although perhaps
>> that can just be done once, and the result stashed in the
>> PartitionPruneContext.
>
> I think trying to replace a well-placed MemoryContextReset (or Delete)
> with a bunch of individual pfrees is pointless.

I agree. I think I'd sleep better at night with the context reset in
there rather than hoping we've managed to pfree everything.

I did go and start working on a patch to test how possible this would
be and came up with the attached. I've left a stray
MemoryContextStatsDetail call in there which does indicate that
something is not being freed. I'm just not sure what it is yet.

The patch does happen to improve performance slightly, but that is
most likely due to the caching of the ExprStates rather than the
change of memory management. It's not really possible to do that with
the reset unless we stored the executor's memory context in
PartitionPruneContext and did a context switch back inside
partkey_datum_from_expr before calling ExecInitExpr.

My test case was as follows:

create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,10000000) x;
insert into t1 select x from generate_series(1,10000000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;


-- Unpatched
Execution Time: 19725.981 ms
Execution Time: 19533.655 ms
Execution Time: 19542.854 ms

-- Patched
Execution Time: 17389.537 ms
Execution Time: 17603.802 ms
Execution Time: 17618.670 ms

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I did go and start working on a patch to test how possible this would
> be and came up with the attached. I've left a stray
> MemoryContextStatsDetail call in there which does indicate that
> something is not being freed. I'm just not sure what it is yet.
>
> The patch does happen to improve performance slightly, but that is
> most likely due to the caching of the ExprStates rather than the
> change of memory management. It's not really possible to do that with
> the reset unless we stored the executor's memory context in
> PartitionPruneContext and did a context switch back inside
> partkey_datum_from_expr before calling ExecInitExpr.

10% is more than a "slight" improvement, I'd say!  It's certainly got
to be worth avoiding the repeated calls to ExecInitExpr, whatever we
do about the memory contexts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 19 April 2018 at 03:13, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I did go and start working on a patch to test how possible this would
>> be and came up with the attached. I've left a stray
>> MemoryContextStatsDetail call in there which does indicate that
>> something is not being freed. I'm just not sure what it is yet.
>>
>> The patch does happen to improve performance slightly, but that is
>> most likely due to the caching of the ExprStates rather than the
>> change of memory management. It's not really possible to do that with
>> the reset unless we stored the executor's memory context in
>> PartitionPruneContext and did a context switch back inside
>> partkey_datum_from_expr before calling ExecInitExpr.
>
> 10% is more than a "slight" improvement, I'd say!  It's certainly got
> to be worth avoiding the repeated calls to ExecInitExpr, whatever we
> do about the memory contexts.

I've attached a patch which does just this. On benchmarking again with
this single change performance has improved 15% over master.

Out of curiosity, I also checked how this performed before
the run-time pruning patch was committed (5c0675215). Taking the
average of the times below, it seems without this patch the
performance of this case has improved about 356% and about 410% with
this patch. So, I agree, it might be worth considering.

create table p (a int, value int) partition by hash (a);
select 'create table p'||x|| ' partition of p for values with (modulus
10, remainder '||x||');' from generate_series(0,9) x;
\gexec
create table t1 (a int);

insert into p select x,x from generate_Series(1,1000) x;
insert into t1 select x from generate_series(1,1000) x;

create index on p(a);

set enable_hashjoin = 0;
set enable_mergejoin = 0;
explain analyze select count(*) from t1 inner join p on t1.a=p.a;

-- Unpatched
Execution Time: 20413.975 ms
Execution Time: 20232.050 ms
Execution Time: 20229.116 ms

-- Patched
Execution Time: 17758.111 ms
Execution Time: 17645.151 ms
Execution Time: 17492.260 ms

-- 5c0675215e153ba1297fd494b34af2fdebd645d1
Execution Time: 72875.161 ms
Execution Time: 71817.757 ms
Execution Time: 72411.730 ms

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 19 April 2018 at 12:04, David Rowley <david.rowley@2ndquadrant.com> wrote:
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;

Correction. These were meant to read:

insert into p select x,x from generate_Series(1,10000000) x;
insert into t1 select x from generate_series(1,10000000) x;

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
Hi David.

On 2018/04/19 9:04, David Rowley wrote:
> On 19 April 2018 at 03:13, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>> The patch does happen to improve performance slightly, but that is
>>> most likely due to the caching of the ExprStates rather than the
>>> change of memory management. It's not really possible to do that with
>>> the reset unless we stored the executor's memory context in
>>> PartitionPruneContext and did a context switch back inside
>>> partkey_datum_from_expr before calling ExecInitExpr.
>>
>> 10% is more than a "slight" improvement, I'd say!  It's certainly got
>> to be worth avoiding the repeated calls to ExecInitExpr, whatever we
>> do about the memory contexts.
> 
> I've attached a patch which does just this. On benchmarking again with
> this single change performance has improved 15% over master.
> 
> Also, out of curiosity, I also checked what this performed like before
> the run-time pruning patch was committed (5c0675215). Taking the
> average of the times below, it seems without this patch the
> performance of this case has improved about 356% and about 410% with
> this patch. So, I agree, it might be worth considering.
> 
> create table p (a int, value int) partition by hash (a);
> select 'create table p'||x|| ' partition of p for values with (modulus
> 10, remainder '||x||');' from generate_series(0,9) x;
> \gexec
> create table t1 (a int);
> 
> insert into p select x,x from generate_Series(1,1000) x;
> insert into t1 select x from generate_series(1,1000) x;
> 
> create index on p(a);
> 
> set enable_hashjoin = 0;
> set enable_mergejoin = 0;
> explain analyze select count(*) from t1 inner join p on t1.a=p.a;
> 
> -- Unpatched
> Execution Time: 20413.975 ms
> Execution Time: 20232.050 ms
> Execution Time: 20229.116 ms
> 
> -- Patched
> Execution Time: 17758.111 ms
> Execution Time: 17645.151 ms
> Execution Time: 17492.260 ms
> 
> -- 5c0675215e153ba1297fd494b34af2fdebd645d1
> Execution Time: 72875.161 ms
> Execution Time: 71817.757 ms
> Execution Time: 72411.730 ms

That's neat!  Definitely agree that we should call ExecInitExpr just once
here.  The patch looks good too, except the long line.  Maybe:

@@ -1514,13 +1514,15 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
             foreach(lc3, step->exprs)
             {
                 Expr       *expr = (Expr *) lfirst(lc3);
+                int         step_id = step->step.step_id;

                 /*
                  * partkey_datum_from_expr does not need an expression state
                  * to evaluate a Const.
                  */
                 if (!IsA(expr, Const))
-                    context->exprstates[step->step.step_id * partnatts + keyno] = ExecInitExpr(expr, context->planstate);
+                    context->exprstates[step_id * partnatts + keyno] =
+                                    ExecInitExpr(expr, context->planstate);

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Anybody wanna argue against pushing this patch now?  I'm inclined to
push it on the grounds of being closure for already committed work, but
there are possible arguments about this being new development after
feature freeze.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> That's neat!  Definitely agree that we should call ExecInitExpr just once
> here.  The patch looks good too, except the long line.

How about this as a small tweak?  Determine the array index using a
macro, which serves as documentation.
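
Something along these lines, presumably; the macro name below is made up,
but the index formula matches the diff upthread (step_id * partnatts + keyno):

#include <assert.h>

/* One ExprState slot per (pruning step, partition key) pair. */
#define PruneStateIdx(partnatts, step_id, keyno) \
    ((partnatts) * (step_id) + (keyno))

int
main(void)
{
    int     partnatts = 2;

    /* step 1's keys land right after step 0's partnatts slots */
    assert(PruneStateIdx(partnatts, 1, 0) == 2);
    assert(PruneStateIdx(partnatts, 1, 1) == 3);
    return 0;
}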

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Pushed.  Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Andres Freund
Date:
On 2018-04-19 12:04:35 +1200, David Rowley wrote:
> On 19 April 2018 at 03:13, Robert Haas <robertmhaas@gmail.com> wrote:
> > 10% is more than a "slight" improvement, I'd say!  It's certainly got
> > to be worth avoiding the repeated calls to ExecInitExpr, whatever we
> > do about the memory contexts.

Yea, that seems important. Good that that got in.

What I wonder, after skimming this change, is where the relevant
expression context is reset?  That's not really related to this change
but the wider thread, I just noticed it while looking at this.

Greetings,

Andres Freund


Re: [HACKERS] Runtime Partition Pruning

From
Alvaro Herrera
Date:
Andres Freund wrote:

> What I wonder, after skimming this change, is where the relevant
> expression context is reset?  That's not really related to this change
> but the wider thread, I just noticed it while looking at this.

Do you mean ResetExprContext?  We use the plan's context, so it should
occur together with the plan reset itself, i.e. nodeAppend.c should do
it somewhere.

... Hmm, it appears we don't do it anywhere there actually.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 25 April 2018 at 06:21, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Andres Freund wrote:
>
>> What I wonder, after skimming this change, is where the relevant
>> expression context is reset?  That's not really related to this change
>> but the wider thread, I just noticed it while looking at this.
>
> Do you mean ResetExprContext?  We use the plan's context, so it should
> occur together with the plan reset itself, i.e. nodeAppend.c should do
> it somewhere.
>
> ... Hmm, it appears we don't do it anywhere there actually.

It's not immediately obvious to me why this is required.

All the expressions that are initialized here must live the entire
length of the executor run, and since they're allocated in the
ExecutorState context they'll be destroyed in FreeExecutorState().

If we do need to call ResetExprContext for these, then we'd just need
to invent a teardown for ExecSetupPartitionPruneState which would free
off the memory allocated and call ResetExprContext for all non-NULL
ExprStates in each context->exprstates. This function would need to be
called from the node's End function for any node that's set up a
PartitionPruneState.

Do we really need to do this given that the memory context these are
allocated in will be released a moment later anyway?

Just to ensure I'm not dreaming, I ran the following and couldn't see
the backend's memory consumption move.

create table lp (a int, value int) partition by list(a);
create table lp_1 partition of lp for values in(1);
create table lp_2 partition of lp for values in(2);
create function lp_value(p_a int) returns int as $$ select value from
lp where a = p_a; $$ language sql;
insert into lp values(1,10),(2,20);
select sum(lp_value(x)) from generate_Series(1,2) x,
generate_series(1,10000000);

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
David Rowley
Date:
On 25 April 2018 at 05:10, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Pushed.  Thanks!

Thanks!

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Runtime Partition Pruning

From
Peter Eisentraut
Date:
On 2018-04-10 23:32, Alvaro Herrera wrote:
> To figure out, I used the attached patch (not intended for application)
> to add a backtrace to each log message, plus a couple of accusatory
> elog() calls in relation_open and ExecSetupPartitionPruneState.

What do people think about adding something like this errbacktrace()
from Álvaro's patch to core PostgreSQL?  If we could devise a way to
selectively enable it, it might be an easier way for users to provide
backtraces from errors.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2018-04-10 23:32, Alvaro Herrera wrote:
>> To figure out, I used the attached patch (not intended for application)
>> to add a backtrace to each log message, plus a couple of accusatory
>> elog() calls in relation_open and ExecSetupPartitionPruneState.

> What do people think about adding something like this errbacktrace()
> from Álvaro's patch to core PostgreSQL?  If we could devise a way to
> selectively enable it, it might be an easier way for users to provide
> backtraces from errors.

I think we did discuss it right after that, or somewhere nearby, and
concluded that the output is so imprecise that it's not really going
to be worth whatever portability issues we'd have to deal with.

I'd be all for a better version, but glibc's backtrace() just sucks,
at least given our coding style with lots of static functions.
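
For readers who haven't used it, the API in question is backtrace() and
backtrace_symbols() from glibc's <execinfo.h>; a minimal example (assumes
glibc, and typically built with -rdynamic so exported symbols resolve at all)
looks like this:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* A static function: exactly the kind of frame that tends to be reported
 * against whatever exported function precedes it in the binary. */
static void
report_error(void)
{
    void   *frames[32];
    int     nframes = backtrace(frames, 32);
    char  **symbols = backtrace_symbols(frames, nframes);

    if (symbols == NULL)
        return;
    for (int i = 0; i < nframes; i++)
        fprintf(stderr, "%s\n", symbols[i]);
    free(symbols);
}

int
main(void)
{
    report_error();
    return 0;
}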

            regards, tom lane



Re: [HACKERS] Runtime Partition Pruning

From
Tom Lane
Date:
I wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> What do people think about adding something like this errbacktrace()
>> from Álvaro's patch to core PostgreSQL?

> I think we did discuss it right after that, or somewhere nearby, and
> concluded that the output is so imprecise that it's not really going
> to be worth whatever portability issues we'd have to deal with.

Hmm, after some digging in the archives, the closest thing I can find
is this thread:

https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com

where we discussed using libunwind instead, but people didn't like
the extra dependency.

However, I stand by the assertion that glibc's backtrace() is too
imprecise to be useful; I've experimented with it and despaired of
being able to tell where control had actually been.

            regards, tom lane



Re: [HACKERS] Runtime Partition Pruning

From
Andres Freund
Date:
Hi,

On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> I wrote:
> > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> >> What do people think about adding something like this errbacktrace()
> >> from Álvaro's patch to core PostgreSQL?
> 
> > I think we did discuss it right after that, or somewhere nearby, and
> > concluded that the output is so imprecise that it's not really going
> > to be worth whatever portability issues we'd have to deal with.
> 
> Hmm, after some digging in the archives, the closest thing I can find
> is this thread:
> 
>
> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> 
> where we discussed using libunwind instead, but people didn't like
> the extra dependency.

Hm, I didn't actually see that much concern about that. I still think we
should just go for libunwind. At least on debian it's likely to already
be installed:

andres@alap4:~$ apt rdepends libunwind8
libunwind8
Reverse Depends:
  Depends: libunwind-dev (= 1.2.1-9)
  Depends: linux-perf-4.16
  Depends: linux-perf-4.15
  Depends: linux-perf-4.14
  Depends: rspamd
  Depends: linux-perf-5.0
  Depends: libjulia1
  Depends: julia
  Depends: geary
  Depends: libunwind8-dbgsym (= 1.2.1-9)
  Depends: xwayland
  Depends: xvfb
  Depends: xserver-xorg-core
  Depends: xserver-xephyr
  Depends: xnest
  Depends: xdmx
  Depends: trafficserver
  Depends: tigervnc-standalone-server
  Depends: tarantool
  Depends: strace
  Depends: spring
  Depends: rspamd
  Depends: linux-perf-4.19
  Depends: libunwind-setjmp0
  Depends: libeina1a
  Depends: libjulia1
  Depends: julia
  Depends: intel-gpu-tools
  Depends: libheaptrack
  Depends: libgoogle-perftools4
  Depends: libgoogle-glog0v5
  Depends: gdnsd
  Depends: libevas1-engines-x
  Depends: libevas1

In particular strace, xserver-xorg-core and perf are reasonably likely to
already be installed.

It's also not a large library. I'd bet if we made it an optional
build-time dependency it'd get included by just about every distro.

Greetings,

Andres Freund



Re: [HACKERS] Runtime Partition Pruning

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
>> Hmm, after some digging in the archives, the closest thing I can find
>> is this thread:
>>
>> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>> where we discussed using libunwind instead, but people didn't like
>> the extra dependency.

> Hm, I didn't actually see that much concern about that. I still think we
> should just go for libunwind.

Is it actually better?  The basic problem with backtrace() is that it
only knows about global functions, and so reports call sites in static
functions as if they were in whatever global function physically precedes
the static one.  I think doing materially better requires depending on
debug symbols, which (at least in the Red Hat world) aren't going to
be there in a typical production situation.

            regards, tom lane



Re: [HACKERS] Runtime Partition Pruning

From
Andres Freund
Date:
Hi,

On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
> >> Hmm, after some digging in the archives, the closest thing I can find
> >> is this thread:
> >>
> >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
> >> where we discussed using libunwind instead, but people didn't like
> >> the extra dependency.
> 
> > Hm, I didn't actually see that much concern about that. I still think we
> > should just go for libunwind.
> 
> Is it actually better?

I've not looked in a while, but I think at some point it was.


> The basic problem with backtrace() is that it
> only knows about global functions, and so reports call sites in static
> functions as if they were in whatever global function physically precedes
> the static one.

Does that depend on whether the program was compiled with
-fno-omit-frame-pointer? At least some distros now compile with frame
pointers enabled, to get reasonably fast perf profiles (at a basically
immeasurable slowdown, on modern-ish CPUs).


> I think doing materially better requires depending on
> debug symbols, which (at least in the Red Hat world) aren't going to
> be there in a typical production situation.

It's still a lot easier to install debug symbols than to attach a
debugger and get a backtrace that way. Especially when the problem is
hard to reproduce.

Greetings,

Andres Freund



Re: [HACKERS] Runtime Partition Pruning

From
Tomas Vondra
Date:
On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > On 2019-05-24 11:34:58 -0400, Tom Lane wrote:
>> >> Hmm, after some digging in the archives, the closest thing I can find
>> >> is this thread:
>> >>
>> >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com
>> >> where we discussed using libunwind instead, but people didn't like
>> >> the extra dependency.
>>
>> > Hm, I didn't actually see that much concern about that. I still think we
>> > should just go for libunwind.
>>
>> Is it actually better?
>
>I've not looked in a while, but I think at some point it was.
>
>
>> The basic problem with backtrace() is that it
>> only knows about global functions, and so reports call sites in static
>> functions as if they were in whatever global function physically precedes
>> the static one.
>
>Does that depend on whether the program was compiled with
>-fno-omit-frame-pointer? At least some distros now compile with frame
>pointers enabled, to get reasonably fast perf profiles (at a basically
>immeasurable slowdown, on modern-ish CPUs).
>

I doubt that, because if that was the case we'd not be able to get
accurate profiles from perf, no? And AFAICS that's not the case,
irrespective of whether -fno-omit-frame-pointer is used.


>
>> I think doing materially better requires depending on
>> debug symbols, which (at least in the Red Hat world) aren't going to
>> be there in a typical production situation.
>
>It's still a lot easier to install debug symbols than to attach a
>debugger and get a backtrace that way. Especially when the problem is
>hard to reproduce.
>

Right. A debugger requires interacting with a running process, while
having this integrated would make that unnecessary.

But I think Peter also suggested this might require the ability to
selectively enable the backtraces, and I think he's right. I doubt we
want to log them for every log message, right?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: [HACKERS] Runtime Partition Pruning

From
Andres Freund
Date:
Hi,

On 2019-05-25 00:42:39 +0200, Tomas Vondra wrote:
> On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote:
> > On 2019-05-24 12:08:57 -0400, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > The basic problem with backtrace() is that it
> > > only knows about global functions, and so reports call sites in static
> > > functions as if they were in whatever global function physically precedes
> > > the static one.
> > 
> > Does that depend on whether the program was compiled with
> > -fno-omit-frame-pointer? At least some distros now compile with frame
> > pointers enabled, to get reasonably fast perf profiles (at a basically
> > immeasurable slowdown, on modern-ish CPUs).
> > 
> 
> I doubt that, because if that was the case we'd not be able to get
> accurate profiles from perf, no? And AFAICS that's not the case,
> irrespective of whether -fno-omit-frame-pointer is used.

I can't parse this. With perf you can get accurate call-graph profiles
if you either use -fno-omit-frame-pointer, to force frame pointers to be
present (so the call graph can cheaply be assembled during profiling),
or use dwarf (the entire stack is saved, and the unwinding is done from the
DWARF info at perf report time; very large), or use lbr (the CPU saves
traces of branches taken, enabling call graphs to be computed, but it
requires that the call stacks aren't very deep).


> > > I think doing materially better requires depending on
> > > debug symbols, which (at least in the Red Hat world) aren't going to
> > > be there in a typical production situation.
> > 
> > It's still a lot easier to install debug symbols than to attach a
> > debugger and get a backtrace that way. Especially when the problem is
> > hard to reproduce.
> > 
> 
> Right. Debugger requires interaction with a running process, while
> having it integrated would make that unnecessary.
> 
> But I think Peter also suggested this might require the ability to
> selectively enable the backtraces, and I think he's right. I doubt we
> want to log them for every log message, right?

Well, I think that if we had PANIC, SIGSEGV/SIGBUS and most FATALs covered,
we'd be off to a very good start. I'm not sure it's wise to give users
control over when stack traces get computed.
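To be concrete about the sort of thing I mean, here is a rough sketch that
glosses over signal safety and all of the elog integration details (the
handler and its wiring are illustrative only):

#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void
crash_handler(int signo)
{
    void       *frames[64];
    int         nframes = backtrace(frames, 64);

    /* backtrace_symbols_fd() writes straight to the fd, no malloc() here */
    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);

    /* restore the default action and re-raise so we still get a core dump */
    signal(signo, SIG_DFL);
    raise(signo);
}

int
main(void)
{
    signal(SIGSEGV, crash_handler);
    signal(SIGBUS, crash_handler);

    /* ... normal processing ... */
    return 0;
}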

Greetings,

Andres Freund



Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Fri, May 24, 2019 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Is it actually better?  The basic problem with backtrace() is that it
> only knows about global functions, and so reports call sites in static
> functions as if they were in whatever global function physically precedes
> the static one.  I think doing materially better requires depending on
> debug symbols, which (at least in the Red Hat world) aren't going to
> be there in a typical production situation.

I don't have an opinion on glibc vs. libunwind, but I don't understand
this argument.  If you are unlucky enough to have a production server
that is crashing due to some hitherto-unknown bug, and if it's not
possible to get a good backtrace without installing debugging symbols,
then you are going to have to pick between (1) installing those
debugging symbols and (2) getting a poor backtrace.  I don't really
see that as a problem so much as just the way life is.  You can't expect
good debugging output without debugging symbols, just as you can't expect a
clean audit without bank statements, or expect to find your way to an
unknown destination without a map.  If you don't have the thing that
contains the needed information, you can't get the information; c'est la vie.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 24, 2019 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Is it actually better?  The basic problem with backtrace() is that it
>> only knows about global functions, and so reports call sites in static
>> functions as if they were in whatever global function physically precedes
>> the static one.  I think doing materially better requires depending on
>> debug symbols, which (at least in the Red Hat world) aren't going to
>> be there in a typical production situation.

> I don't have an opinion on glibc vs. libunwind, but I don't understand
> this argument.  If you are unlucky enough to have a production server
> that is crashing due to some hitherto-unknown bug, and if it's not
> possible to get a good backtrace without installing debugging symbols,
> then you are going to have to pick between (1) installing those
> debugging symbols and (2) getting a poor backtrace.  I don't really
> see that as a problem so much as just the way life is.

Well, it *is* a problem.  The whole point of this discussion I think is
to try to get better information "by default" for routine bug reports.
So if those come from production servers without debug symbols, which
I believe will be the usual case, then it seems likely to me that
libunwind will produce no better results than glibc.  (But perhaps
I'm wrong about that --- I have not experimented with libunwind.)

Now it's true that "install debug symbols" is less of an ask than
"install debug symbols, *and* gdb, and make sure server core dumps are
enabled, and then go through this arcane manual procedure next time
you get a core dump".  But we shouldn't fool ourselves that it isn't
an ask that's going to be hard for people with corporate policies
against installing extra stuff on production servers.

            regards, tom lane



Re: [HACKERS] Runtime Partition Pruning

From
Robert Haas
Date:
On Wed, May 29, 2019 at 6:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, it *is* a problem.  The whole point of this discussion I think is
> to try to get better information "by default" for routine bug reports.
> So if those come from production servers without debug symbols, which
> I believe will be the usual case, then it seems likely to me that
> libunwind will produce no better results than glibc.  (But perhaps
> I'm wrong about that --- I have not experimented with libunwind.)

Sure, I agree.

> Now it's true that "install debug symbols" is less of an ask than
> "install debug symbols, *and* gdb, and make sure server core dumps are
> enabled, and then go through this arcane manual procedure next time
> you get a core dump".  But we shouldn't fool ourselves that it isn't
> an ask that's going to be hard for people with corporate policies
> against installing extra stuff on production servers.

There may be cases where that is true, but as you say, it's better
than what we have now.  Plus, what exactly is the alternative?  We
could:

- encourage packagers to install debug symbols by default (but they
might not; it might even be against policy), or
- invent our own system for generating backtraces and ignore what the
OS toolchain knows how to do (sounds painfully complex and expensive),
or
- just live with the fact that it's imperfect.

Is there a fourth option?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Runtime Partition Pruning

From
Andy Fan
Date:


Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior.  It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ.  It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option.  But this patch isn't going
to fix any of that.  The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning.  I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.


I only realized this issue recently and reported it at [1]; Amit then pointed
me to the discussion happening here, so I would like to continue the topic in
this thread.

I think we can split this into two issues.  One is pruning during the initial
partition prune, which may happen only in the custom-plan case and is what
caused the issue above.  The other one happens in "run-time" partition
pruning; I admit that is an important issue to resolve as well, but it looks
harder.  So I think we can fix the first one first.

The proposal for fixing the first one is to remember, for a custom plan, how
many partitions of a RelOptInfo survived plan-time pruning, and to record
that information in the CachedPlanSource.  When we compute the cost of a
generic plan, we can then reduce the cost based on that information.
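Very roughly, and with made-up names (none of this exists in plancache.c
today), I am imagining something like:

/* hypothetical bookkeeping hung off the CachedPlanSource */
typedef struct PlanTimePruneStats
{
    int         nparts_total;       /* partitions before plan-time pruning */
    int         nparts_surviving;   /* partitions left in the custom plans */
} PlanTimePruneStats;

/*
 * Hypothetical helper: discount the generic plan's estimated cost by the
 * fraction of partitions that survived pruning in the custom plans.
 */
static double
discounted_generic_cost(double generic_cost, const PlanTimePruneStats *stats)
{
    if (stats == NULL || stats->nparts_total <= 0)
        return generic_cost;

    return generic_cost *
        ((double) stats->nparts_surviving / (double) stats->nparts_total);
}

Of course discounting the whole plan cost this way is crude; it is only meant
to show where the recorded information would be used.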

Any thoughts on this?  Apologies if I missed an existing discussion on this
topic.


--
Best Regards
Andy Fan

Re: [HACKERS] Runtime Partition Pruning

From
Andy Fan
Date:


On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:


Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior.  It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ.  It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option.  But this patch isn't going
to fix any of that.  The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning.  I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.


I just realized this issue recently and reported it at [1], then Amit pointed
me to this issue being discussed here, so I would like to continue this topic
here. 

I think we can split the issue into 2 issues.  One is the partition prune in initial
partition prune, which maybe happen in custom plan case only and caused
the above issue.  The other one happens in the "Run-Time" partition prune, 
I admit that is an important issue to resolve as well, but looks harder.   So I
think we can fix the first one at first. 

... When we count for the cost of a
generic plan, we can reduce the cost based on such information. 
 
This way doesn't work, since after the initial partition prune it is not only
the cost of the Append node that should be reduced; the costs of the other
plan nodes should be reduced as well [1].

However, I think the issue can be fixed if we can use the partition-prune
information from a custom plan at the cost_append_path stage.  If so, the
idea is similar to David's idea in [2], which Robert didn't agree with [2].
Can anyone elaborate on that objection?  For the partkey > $1 or BETWEEN
cases, some real results from the past are probably better than hard-coded
assumptions, IMO.


--
Best Regards
Andy Fan

Re: [HACKERS] Runtime Partition Pruning

From
Andy Fan
Date:


On Wed, Oct 7, 2020 at 5:05 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:


On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:


Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior.  It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ.  It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option.  But this patch isn't going
to fix any of that.  The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning.  I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.


I just realized this issue recently and reported it at [1], then Amit pointed
me to this issue being discussed here, so I would like to continue this topic
here. 

I think we can split the issue into 2 issues.  One is the partition prune in initial
partition prune, which maybe happen in custom plan case only and caused
the above issue.  The other one happens in the "Run-Time" partition prune, 
I admit that is an important issue to resolve as well, but looks harder.   So I
think we can fix the first one at first. 

... When we count for the cost of a
generic plan, we can reduce the cost based on such information. 
 
This way doesn't work since after the initial partition prune,  not only the
cost of the Append node should be reduced,  the cost of other plans should
be reduced as well [1]

However I think if we can use partition prune information from a custom plan
at the cost_append_path stage,  it looks the issue can be fixed.  If so, the idea
is similar to David's idea in [2],  however Robert didn't agree with this[2]. 
Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN cases, 
some real results from the past are probably better than some hard-coded 
assumptions IMO. 

I can understand Robert's idea now: he intended to resolve both the
"initial partition prune" case and the "runtime partition prune" case, while
I had only been thinking about the former (Amit reminded me about that at the
beginning, but it has only just sunk in).

With Robert's idea, I think we can do some hacking in create_append_path:
figure out the pruneinfo at that stage (it is done in create_append_plan
now), and then do a fix_append_path() with that pruneinfo.  We need to fix
not only the cost of the AppendPath but also its rows.  For example, given
SELECT .. FROM t1, t2, p WHERE t1.a = p.partkey AND t1.b = t2.b, if the
plan is:
Nest Loop
   Nest Loop
      t1
      Append (p)
   t2

Then the rows of Append (p) will be very important. 

For this idea, how to estimate how many partitions/rows can be pruned is the
key part.  Robert said: "I was thinking, rather, that if we know for example
that we've doing pruning on partition_column = $1, then we know that only
one partition will match.  That's probably a common case.  If we've
got partition_column > $1, we could assume that, say, 75% of the
partitions would match.  partition_column BETWEEN $1 and $2 is
probably a bit more selective, so maybe we assume 50% of the
partitions would match."  I don't think we can say 75% or 50% is a good
number, but the key point may be that "partition_column = $1" is a common
case.  And for the other cases, we probably don't make things worse.
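Just to make the shape of that concrete, with invented names and the
percentages taken straight from Robert's mail:

typedef enum PruneQualShape
{
    PRUNE_QUAL_EQUAL,           /* partkey = $1 */
    PRUNE_QUAL_INEQUALITY,      /* partkey > $1, partkey < $1, ... */
    PRUNE_QUAL_RANGE            /* partkey BETWEEN $1 AND $2 */
} PruneQualShape;

/* fraction of partitions assumed to survive run-time pruning */
static double
assumed_surviving_fraction(PruneQualShape shape, int nparts)
{
    if (nparts <= 0)
        return 1.0;

    switch (shape)
    {
        case PRUNE_QUAL_EQUAL:
            return 1.0 / nparts;    /* exactly one partition matches */
        case PRUNE_QUAL_INEQUALITY:
            return 0.75;            /* Robert's guess */
        case PRUNE_QUAL_RANGE:
            return 0.50;            /* Robert's guess */
    }
    return 1.0;                     /* unknown shape: assume no pruning */
}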

I think we need to do something here; any thoughts?  Personally I like this
idea more now.

--
Best Regards
Andy Fan

Re: [HACKERS] Runtime Partition Pruning

From
Ashutosh Bapat
Date:
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Wed, Oct 7, 2020 at 5:05 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>>
>>
>> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Now, in my experience, the current system for custom plans vs. generic
>>>> plans doesn't approach the problem in this way at all, and in my
>>>> experience that results in some pretty terrible behavior.  It will do
>>>> things like form a custom plan every time because the estimated cost
>>>> of the custom plan is lower than the estimated cost of the generic
>>>> plan even though the two plans are structurally identical; only the
>>>> estimates differ.  It will waste gobs of CPU cycles by replanning a
>>>> primary key lookup 5 times just on the off chance that a lookup on the
>>>> primary key index isn't the best option.  But this patch isn't going
>>>> to fix any of that.  The best we can probably do is try to adjust the
>>>> costing for Append paths in some way that reflects the costs and
>>>> benefits of pruning.  I'm tentatively in favor of trying to do
>>>> something modest in that area, but I don't have a detailed proposal.
>>>>
>>>
>>> I just realized this issue recently and reported it at [1], then Amit pointed
>>> me to this issue being discussed here, so I would like to continue this topic
>>> here.
>>>
>>> I think we can split the issue into 2 issues.  One is the partition prune in initial
>>> partition prune, which maybe happen in custom plan case only and caused
>>> the above issue.  The other one happens in the "Run-Time" partition prune,
>>> I admit that is an important issue to resolve as well, but looks harder.   So I
>>> think we can fix the first one at first.
>>>
>>> ... When we count for the cost of a
>>> generic plan, we can reduce the cost based on such information.
>>
>>
>> This way doesn't work since after the initial partition prune,  not only the
>> cost of the Append node should be reduced,  the cost of other plans should
>> be reduced as well [1]
>>
>> However I think if we can use partition prune information from a custom plan
>> at the cost_append_path stage,  it looks the issue can be fixed.  If so, the idea
>> is similar to David's idea in [2],  however Robert didn't agree with this[2].
>> Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN cases,
>> some real results from the past are probably better than some hard-coded
>> assumptions IMO.
>
>
> I can understand Robert's idea now,  he intended to resolve both the
> "initial-partition-prune" case and "runtime partition prune" case while I always think
> about the former case (Amit reminded me about that at the beginning, but I just
> wake up right now..)
>
> With the Robert's idea,  I think we can do some hack at create_append_path,
> we can figure out the pruneinfo (it was done at create_append_plan now) at that
> stage, and then did a fix_append_path with such pruneinfo.  We need to fix not
> only the cost of AppendPath,  but also rows of AppendPath, For example:
> SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> plan is:
> Nest Loop
>    Nest Loop
>       t1
>       Append (p)
>    t2
>
> Then the rows of Append (p) will be very important.
>
> For this idea,  how to estimate how many partitions/rows can be pruned is a key
> part.  Robert said "I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match.  That's probably a common case.  If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match.  partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.".   I think we can't say the 75% or 50% is a good
> number,  but the keypoint may be "partition_column = $1" is a common
> case.  And for the others case, we probably don't make it worse.
>
> I think we need to do something here, any thoughts? Personally I'm more
> like this idea now.

Yes, I think we have to do something along those lines.  But I am
wondering why our stats machinery is failing to do that already.  For
equality I think it's straightforward.  But even for other operators
we should use the statistics from the partitioned table to estimate
the selectivity and then adjust the number of scanned partitions =
(number of partitions * fraction of rows scanned).  That might be better
than 50% or 75%.
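In other words, something like this (purely illustrative, the names are
made up):

#include <math.h>

/*
 * Estimate the number of partitions scanned from the parent table's
 * selectivity instead of a fixed 50%/75% assumption.
 */
static double
estimated_scanned_partitions(int nparts, double selectivity)
{
    double      scanned = nparts * selectivity;

    /* clamp: at least one partition, at most all of them */
    return fmax(1.0, fmin(scanned, (double) nparts));
}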

-- 
Best Wishes,
Ashutosh Bapat



Re: [HACKERS] Runtime Partition Pruning

From
Andy Fan
Date:
Hi Ashutosh:

On Thu, Oct 8, 2020 at 7:25 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Wed, Oct 7, 2020 at 5:05 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>>
>>
>> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Now, in my experience, the current system for custom plans vs. generic
>>>> plans doesn't approach the problem in this way at all, and in my
>>>> experience that results in some pretty terrible behavior.  It will do
>>>> things like form a custom plan every time because the estimated cost
>>>> of the custom plan is lower than the estimated cost of the generic
>>>> plan even though the two plans are structurally identical; only the
>>>> estimates differ.  It will waste gobs of CPU cycles by replanning a
>>>> primary key lookup 5 times just on the off chance that a lookup on the
>>>> primary key index isn't the best option.  But this patch isn't going
>>>> to fix any of that.  The best we can probably do is try to adjust the
>>>> costing for Append paths in some way that reflects the costs and
>>>> benefits of pruning.  I'm tentatively in favor of trying to do
>>>> something modest in that area, but I don't have a detailed proposal.
>>>>
>>>
>>> I just realized this issue recently and reported it at [1], then Amit pointed
>>> me to this issue being discussed here, so I would like to continue this topic
>>> here.
>>>
>>> I think we can split the issue into 2 issues.  One is the partition prune in initial
>>> partition prune, which maybe happen in custom plan case only and caused
>>> the above issue.  The other one happens in the "Run-Time" partition prune,
>>> I admit that is an important issue to resolve as well, but looks harder.   So I
>>> think we can fix the first one at first.
>>>
>>> ... When we count for the cost of a
>>> generic plan, we can reduce the cost based on such information.
>>
>>
>> This way doesn't work since after the initial partition prune,  not only the
>> cost of the Append node should be reduced,  the cost of other plans should
>> be reduced as well [1]
>>
>> However I think if we can use partition prune information from a custom plan
>> at the cost_append_path stage,  it looks the issue can be fixed.  If so, the idea
>> is similar to David's idea in [2],  however Robert didn't agree with this[2].
>> Can anyone elaborate this objection?  for a partkey > $1 or BETWEEN cases,
>> some real results from the past are probably better than some hard-coded
>> assumptions IMO.
>
>
> I can understand Robert's idea now,  he intended to resolve both the
> "initial-partition-prune" case and "runtime partition prune" case while I always think
> about the former case (Amit reminded me about that at the beginning, but I just
> wake up right now..)
>
> With the Robert's idea,  I think we can do some hack at create_append_path,
> we can figure out the pruneinfo (it was done at create_append_plan now) at that
> stage, and then did a fix_append_path with such pruneinfo.  We need to fix not
> only the cost of AppendPath,  but also rows of AppendPath, For example:
> SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> plan is:
> Nest Loop
>    Nest Loop
>       t1
>       Append (p)
>    t2
>
> Then the rows of Append (p) will be very important.
>
> For this idea,  how to estimate how many partitions/rows can be pruned is a key
> part.  Robert said "I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match.  That's probably a common case.  If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match.  partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.".   I think we can't say the 75% or 50% is a good
> number,  but the keypoint may be "partition_column = $1" is a common
> case.  And for the others case, we probably don't make it worse.
>
> I think we need to do something here, any thoughts? Personally I'm more
> like this idea now.

Yes, I think we have to do something on those lines. But I am
wondering why our stats machinery is failing to do that already. For
equality I think it's straight forward. But even for other operators
we should use the statistics from the partitioned table to estimate
the selectivity and then adjust number of scanned partitions = (number
of partitions * fraction of rows scanned). That might be better than
50% or 75%.

 
Sorry for the late reply!  Suppose we have partitions defined like this:
p
- p1
- p2

When you talk about "the statistics from the partitioned table", do you mean
the statistics for p or for p1/p2?  I just confirmed there are no statistics
for p (at least pg_class.reltuples = -1), so I think you are talking about
p1/p2.

Here we are talking about partkey = $1 or partkey = RunTimeValue, so even
though the value can hit only one partition, we don't know the exact value
and therefore treat all the partitions equally.  So there looks to be nothing
wrong with the partition-level estimation.  However, when we cost the Append
path, we need to know that just one of them can be hit, and do something
there.  Both AppendPath->rows and AppendPath->total_cost should be adjusted
(that doesn't happen now).

--
Best Regards
Andy Fan

Re: [HACKERS] Runtime Partition Pruning

From
Ashutosh Bapat
Date:
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

>
> Sorry for the late reply!  Suppose we have partition defined like this:
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or p1/p2?   I just confirmed there is no statistics
> for p (at least pg_class.reltuples = -1),  so I think you are talking about
> p1/p2.

I am talking about p when I say statistics from the partitioned table.
I see that the pg_statistic rows for p are well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based
on the rows in individual unpruned partitions.

>
> Here we are talking about partkey = $1 or  partkey = RunTimeValue.
> so even the value can hit 1 partition only,  but since we don't know
> the exact value, so we treat all the partition equally.  so looks
> nothing wrong with partition level estimation.  However when we cost
> the Append path, we need know just one of them can be hit,  then
> we need do something there.  Both AppendPath->rows/total_cost
> should be adjusted (That doesn't happen now).

I think in this case we can safely assume that only one partition will
remain, so normalize the costs on the assumption that only one partition
survives.

-- 
Best Wishes,
Ashutosh Bapat



Re: [HACKERS] Runtime Partition Pruning

From
Andy Fan
Date:


On Mon, Oct 12, 2020 at 5:48 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

>
> Sorry for the late reply!  Suppose we have partition defined like this:
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or p1/p2?   I just confirmed there is no statistics
> for p (at least pg_class.reltuples = -1),  so I think you are talking about
> p1/p2.

I am talking about p when I say statistics from the partitioned table.
I see that pg_statistic row from p is well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based
on the rows in individual unpruned partitions.

 
Glad to know that, Thanks for this info!
 
>
> Here we are talking about partkey = $1 or  partkey = RunTimeValue.
> so even the value can hit 1 partition only,  but since we don't know
> the exact value, so we treat all the partition equally.  so looks
> nothing wrong with partition level estimation.  However when we cost
> the Append path, we need know just one of them can be hit,  then
> we need do something there.  Both AppendPath->rows/total_cost
> should be adjusted (That doesn't happen now).

I think in this case we can safely assume that only one partition will
remain so normalize costs considering that only one partition will
survive.

Exactly.  What I am trying to do is fix this in create_append_path; do you
have different suggestions?  As for the pkey > $1 case, I think even if we
use the partition-level statistics it would still be hard-coded, since we
don't know what value $1 is.

I have gone through the main part of the run-time partition pruning code and
hope to post a runnable patch soon.  The main idea is to fix the rows/costs
at the create_append_path stage, so any suggestion in a different direction
would be very useful.

--
Best Regards
Andy Fan

Re: [HACKERS] Runtime Partition Pruning

From
Amit Langote
Date:
On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > I can understand Robert's idea now,  he intended to resolve both the
> > "initial-partition-prune" case and "runtime partition prune" case while I always think
> > about the former case (Amit reminded me about that at the beginning, but I just
> > wake up right now..)
> >
> > With the Robert's idea,  I think we can do some hack at create_append_path,
> > we can figure out the pruneinfo (it was done at create_append_plan now) at that
> > stage, and then did a fix_append_path with such pruneinfo.  We need to fix not
> > only the cost of AppendPath,  but also rows of AppendPath, For example:
> > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > plan is:
> > Nest Loop
> >    Nest Loop
> >       t1
> >       Append (p)
> >    t2
> >
> > Then the rows of Append (p) will be very important.
> >
> > For this idea,  how to estimate how many partitions/rows can be pruned is a key
> > part.  Robert said "I was thinking, rather, that if we know for example that
> > we've doing pruning on partition_column = $1, then we know that only
> > one partition will match.  That's probably a common case.  If we've
> > got partition_column > $1, we could assume that, say, 75% of the
> > partitions would match.  partition_column BETWEEN $1 and $2 is
> > probably a bit more selective, so maybe we assume 50% of the
> > partitions would match.".   I think we can't say the 75% or 50% is a good
> > number,  but the keypoint may be "partition_column = $1" is a common
> > case.  And for the others case, we probably don't make it worse.
> >
> > I think we need to do something here, any thoughts? Personally I'm more
> > like this idea now.
>
> Yes, I think we have to do something on those lines. But I am
> wondering why our stats machinery is failing to do that already. For
> equality I think it's straight forward. But even for other operators
> we should use the statistics from the partitioned table to estimate
> the selectivity and then adjust number of scanned partitions = (number
> of partitions * fraction of rows scanned). That might be better than
> 50% or 75%.

This is an interesting idea, that is, consulting the parent relation's stats
to guess at the number of partitions that might be scanned.

However, we don't currently consult an inheritance parent relation's
stats at all during "base" relation planning, which is why you don't
see the parent relation's statistics reflected in the row count and
costs assigned to its (Append et al.) paths.  The hard-coded rule is to
sum up the children's rows and their paths' costs; see cost_append().

My thinking on this was that we'd just extend that hard-coded rule to
take into account that run-time pruning, if applicable in a given
plan, will cause some of those child paths to be discarded.  Maybe
Andy is headed in that direction?
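Roughly like the following, hand-waving over where the surviving fraction
comes from (the helper name is made up; only the AppendPath/Path fields are
real):

#include "postgres.h"
#include "nodes/pathnodes.h"

/*
 * Hypothetical: after cost_append() has summed up the unpruned children the
 * usual way, scale the totals by the fraction of children expected to
 * survive run-time pruning.
 */
static void
adjust_append_for_runtime_pruning(AppendPath *apath, double surviving_fraction)
{
    if (surviving_fraction >= 1.0)
        return;                 /* run-time pruning is not expected to help */

    apath->path.rows *= surviving_fraction;
    apath->path.total_cost *= surviving_fraction;
    /* startup_cost is left alone; only the per-child work shrinks */
}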

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: [HACKERS] Runtime Partition Pruning

From
Andy Fan
Date:


On Tue, Oct 13, 2020 at 3:48 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > I can understand Robert's idea now,  he intended to resolve both the
> > "initial-partition-prune" case and "runtime partition prune" case while I always think
> > about the former case (Amit reminded me about that at the beginning, but I just
> > wake up right now..)
> >
> > With the Robert's idea,  I think we can do some hack at create_append_path,
> > we can figure out the pruneinfo (it was done at create_append_plan now) at that
> > stage, and then did a fix_append_path with such pruneinfo.  We need to fix not
> > only the cost of AppendPath,  but also rows of AppendPath, For example:
> > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > plan is:
> > Nest Loop
> >    Nest Loop
> >       t1
> >       Append (p)
> >    t2
> >
> > Then the rows of Append (p) will be very important.
> >
> > For this idea,  how to estimate how many partitions/rows can be pruned is a key
> > part.  Robert said "I was thinking, rather, that if we know for example that
> > we've doing pruning on partition_column = $1, then we know that only
> > one partition will match.  That's probably a common case.  If we've
> > got partition_column > $1, we could assume that, say, 75% of the
> > partitions would match.  partition_column BETWEEN $1 and $2 is
> > probably a bit more selective, so maybe we assume 50% of the
> > partitions would match.".   I think we can't say the 75% or 50% is a good
> > number,  but the keypoint may be "partition_column = $1" is a common
> > case.  And for the others case, we probably don't make it worse.
> >
> > I think we need to do something here, any thoughts? Personally I'm more
> > like this idea now.
>
> Yes, I think we have to do something on those lines. But I am
> wondering why our stats machinery is failing to do that already. For
> equality I think it's straight forward. But even for other operators
> we should use the statistics from the partitioned table to estimate
> the selectivity and then adjust number of scanned partitions = (number
> of partitions * fraction of rows scanned). That might be better than
> 50% or 75%.

This is an interesting idea, that is, the idea of consulting parent
relation's stats to guess at the number of partitions that might be
scanned.

However, we don't currently consult an inheritance parent relation's
stats at all during "base" relation planning, which is why you don't
see the parent relation's statistics reflected in the row count and
costs assigned to its (Append at al) paths.  The hard-coded rule is to
sum up children's rows and their paths' costs; see cost_append().

My thinking on this was that we'd just extend that hard-coded rule to
take into account that run-time pruning, if applicable in a given
plan, will cause some of those child paths to be discarded.  Maybe
Andy is headed in that direction?


Yes, that's what I am trying to do.  The minimum code in my mind is:

create_append_path(...)
{
    double      run_time_prune_est = 1.0;

    if (enable_partition_prune && ...)
    {
        List       *partition_prune_step_ops;

        partition_prune_step_ops = cal_prune_step_ops(rel, partitioned_rels);
        run_time_prune_est = estimate_runtime_prune_ratio(partition_prune_step_ops);
    }

    /* adjust the rows/costs of the AppendPath based on run_time_prune_est */
}

The difference between '=' and '<' would be handled in the function
estimate_runtime_prune_ratio.  Since I haven't made this runnable yet, some
data structures may change, but the overall idea is as above.


--
Best Regards
Andy Fan