Thread: [HACKERS] Runtime Partition Pruning
I have been working on implementing runtime partition pruning, which would greatly improve the performance of queries involving partitioned tables. PFA the POC, which can be applied over Amit's patch for faster partition pruning [1] and Dilip's refactor patch [2] on commit 2c74e6c1dcc5002fa8b822e5757f6c95d899fb7a.

[1] https://www.postgresql.org/message-id/e02923ea-a117-a6ad-6a3e-ea5e1ba41ece%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com

There are a couple of things that need improvement/opinion:

In get_rel_partition_info, we store minop and maxop for each partition key. For the equality case, which is the most common, both would store the same value. We could improve this by storing equal (bound, bound, ...) instead of repeating the same values.

get_partitions_for_keys currently returns the list of partitions valid for the given keys, but for a table with many partitions this list could be very long. So for a range qual (key > a AND key < b) we could store only the min and max partition numbers and increment as_whichplan by 1 until we reach the max partition number (sketched after this message). For non-contiguous partitions, we would still need the list.

Currently, the partition numbers are recalculated whenever ChgParam is set. This can be optimised by skipping this step when only a non-partition-key column has changed, reusing the partitions already selected.

Others:
- better handling of multiple keys
- allow use of expressions in the quals
- use min_incl, max_incl properly in get_partitions_for_keys
- pruning during function calls

Currently with the patch, during a NestLoop:

Nested Loop
  -> SeqScan tbl1
  -> Append
       -> Index Scan p01
       -> Index Scan p02
       -> Index Scan p03

For each tuple from tbl1, only the relevant partition (p01 or p02 or p03) will be scanned.

--- Prepared Statement Behaviour with patch ---

Table description:

                                   Table "public.tprt"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 col1   | integer |           |          |         | plain   |              |
 col2   | integer |           |          |         | plain   |              |
Partition key: RANGE (col1)
Partitions: tprt_1 FOR VALUES FROM (1) TO (50001),
            tprt_2 FOR VALUES FROM (50001) TO (100001),
            tprt_3 FOR VALUES FROM (100001) TO (200001)

EXPLAIN EXECUTE prstmt_select(15);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1736.55 rows=1 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..849.15 rows=16724 width=8)
         Filter: (col1 < $1)
(3 rows)

EXPLAIN EXECUTE prstmt_select(60000);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1736.55 rows=2 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..849.15 rows=16724 width=8)
         Filter: (col1 < $1)
   ->  Seq Scan on tprt_2  (cost=0.00..849.15 rows=16724 width=8)
         Filter: (col1 < $1)
(5 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
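A rough sketch of the min/max idea above, with hypothetical names: as_min_plan and as_max_plan are assumed AppendState fields holding the first and last matching subplan for a contiguous range qual, and exec_append_advance() is an illustrative helper, not patch code; as_whichplan is the existing AppendState field.

/*
 * Sketch: for a contiguous range qual (key > a AND key < b), remember
 * only the window of matching subplans and step as_whichplan through
 * it, instead of keeping a full list of valid partitions.
 */
static bool
exec_append_advance(AppendState *node)
{
	if (node->as_whichplan < node->as_min_plan)
		node->as_whichplan = node->as_min_plan;	/* jump to window start */
	else if (node->as_whichplan >= node->as_max_plan)
		return false;			/* past the window, we're done */
	else
		node->as_whichplan++;	/* next partition in the window */

	return true;
}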
Hello all,

Here is the updated patch which is rebased over v10 of Amit Langote's path towards faster pruning patch [1]. It modifies the PartScanKeyInfo struct to hold expressions, which are then evaluated by the executor to fetch the correct partitions.

The code still chooses the custom plan instead of the generic plan for prepared statements. I am working on it. The following output is after adding a hack in the code forcing selection of the generic plan.

postgres=# EXPLAIN EXECUTE prstmt_select(70000);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1732.25 rows=2 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
   ->  Seq Scan on tprt_2  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
(5 rows)

postgres=# EXPLAIN EXECUTE prstmt_select(200000);
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..1732.25 rows=3 width=8)
   ->  Seq Scan on tprt_1  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
   ->  Seq Scan on tprt_2  (cost=0.00..847.00 rows=16667 width=8)
         Filter: ($1 > col1)
   ->  Seq Scan on tprt_3  (cost=0.00..38.25 rows=753 width=8)
         Filter: ($1 > col1)
(7 rows)

[1] https://www.postgresql.org/message-id/b8094e71-2c73-ed8e-d8c3-53f232c8c049%40lab.ntt.co.jp

Tested on commit: 9b9cb3c4534d717c1c95758670198ebbf8a20af2

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Thu, Nov 9, 2017 at 6:18 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> The code still chooses the custom plan instead of the generic plan for
> the prepared statements. I am working on it.

I don't think it's really the job of this patch to do anything about that problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 9, 2017 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 9, 2017 at 6:18 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> The code still chooses the custom plan instead of the generic plan for
>> the prepared statements. I am working on it.
>
> I don't think it's really the job of this patch to do anything about
> that problem.

+1. I think if we really want to do something about plan choice when partitions are involved, that should be done as a separate patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 9, 2017 at 4:48 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello all,
>
> Here is the updated patch which is rebased over v10 of Amit Langote's
> path towards faster pruning patch [1]. It modifies the PartScanKeyInfo
> struct to hold expressions, which are then evaluated by the executor to
> fetch the correct partitions.

Hi Beena,

I have started looking into your patch; here are a few initial comments on your 0001 patch:

1.
351 +     * Evaluate and store the ooutput of ExecInitExpr for each of the keys.

Typo: ooutput

2.
822 +     if (IsA(constexpr, Const) &&is_runtime)
823 +         continue;
824 +
825 +     if (IsA(constexpr, Param) &&!is_runtime)
826 +         continue;
827 +

Add space after '&&'

3.
1095 +     * Generally for appendrel we don't fetch the clause from the the

Typo: Double 'the'

4.
272 -/*-------------------------------------------------------------------------
273 + /*-------------------------------------------------------------------------

Unnecessary hunk.

5.
313 +         Node *n = eval_const_expressions_from_list(estate->es_param_list_info, val);
314 +

Crossing 80 column window. Same at line # 323 & 325

6.
315 +         keys->eqkeys_datums[i++] = ((Const *) n)->constvalue;

Don’t we need a check for IsA(n, Const) or an assert?

7.
1011 +     if (prmList)
1012 +         context.boundParams = prmList;    /* bound Params */
1013 +     else
1014 +         context.boundParams = NULL;

No need for the prmList null check; context.boundParams = prmList; is enough.

8. It would be nice if you create a separate patch where you are moving PartScanKeyInfo and exporting the function declaration.

9. Could you please add a few regression tests? That would help in review & testing.

10. Could you please rebase your patch against the latest "path toward faster partition pruning" patch by Amit.

Regards,
Amul
Hello Amul,

Thank you for reviewing.

On Fri, Nov 10, 2017 at 4:33 PM, amul sul <sulamul@gmail.com> wrote:
> 1.
> 351 +     * Evaluate and store the ooutput of ExecInitExpr for each of the keys.
>
> Typo: ooutput

Corrected.

> 2.
> 822 +     if (IsA(constexpr, Const) &&is_runtime)
> 823 +         continue;
> 824 +
> 825 +     if (IsA(constexpr, Param) &&!is_runtime)
> 826 +         continue;
> 827 +
>
> Add space after '&&'

Done.

> 3.
> 1095 +     * Generally for appendrel we don't fetch the clause from the the
>
> Typo: Double 'the'
>
> 4.
> 272 -/*-------------------------------------------------------------------------
> 273 + /*-------------------------------------------------------------------------
>
> Unnecessary hunk.

Removed.

> 5.
> 313 +         Node *n = eval_const_expressions_from_list(estate->es_param_list_info, val);
> 314 +
>
> Crossing 80 column window. Same at line # 323 & 325

Fixed.

> 6.
> 315 +         keys->eqkeys_datums[i++] = ((Const *) n)->constvalue;
>
> Don’t we need a check for IsA(n, Const) or an assert?

Added.

> 7.
> 1011 +     if (prmList)
> 1012 +         context.boundParams = prmList;    /* bound Params */
> 1013 +     else
> 1014 +         context.boundParams = NULL;
>
> No need for the prmList null check; context.boundParams = prmList; is enough.
>
> 8. It would be nice if you create a separate patch where you are moving
> PartScanKeyInfo and exporting the function declaration.

This is in 0001.

> 9. Could you please add a few regression tests? That would help in
> review & testing.

I will make a separate regression patch and submit soon.

> 10. Could you please rebase your patch against the latest "path toward faster
> partition pruning" patch by Amit.

The following is rebased over v11 of Amit's patch [1].

[1] https://www.postgresql.org/message-id/62d21a7b-fea9-f2d7-c33a-8caa12eca612%40lab.ntt.co.jp

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
PFA the updated patches.

On Tue, Nov 14, 2017 at 11:45 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello Amul,
>
> Thank you for reviewing.
>
> [...review exchange quoted in full above...]
>
> The following is rebased over v11 of Amit's patch [1].
>
> [1] https://www.postgresql.org/message-id/62d21a7b-fea9-f2d7-c33a-8caa12eca612%40lab.ntt.co.jp

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Nov 14, 2017 at 11:46 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patches.

Hi,
I have started testing this along with fast pruning. It is crashing for SQL with subqueries; one test case is given below.
CREATE TABLE prun_test_part (empno int, sal int, deptno int) PARTITION BY RANGE(sal);
CREATE TABLE prun_test_part_p1 PARTITION OF prun_test_part FOR VALUES FROM (0) TO (100);
CREATE TABLE prun_test_part_p2 PARTITION OF prun_test_part FOR VALUES FROM (100) TO (200);
CREATE TABLE prun_test_part_p3 PARTITION OF prun_test_part FOR VALUES FROM (200) TO (300);
CREATE TABLE prun_test_part_p4 PARTITION OF prun_test_part FOR VALUES FROM (300) TO (400);
INSERT INTO prun_test_part VALUES (10,90,10);
INSERT INTO prun_test_part VALUES (11,100,10);
INSERT INTO prun_test_part VALUES (20,110,20);
INSERT INTO prun_test_part VALUES (21,200,20);
INSERT INTO prun_test_part VALUES (30,210,30);
INSERT INTO prun_test_part VALUES (31,300,30);
INSERT INTO prun_test_part VALUES (50,310,20);
explain (costs off)
SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 50);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
On 14 November 2017 at 19:16, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patches.

Hi Beena,

Thanks for working on this. I've had a look at the patch to try to understand how it is working. I found it a bit surprising that the code assumes it can rely on the order of Append->appendplans matching what's needed based on the return value of get_partitions_for_keys().

I tried using the following to break this:

drop table if exists ab;
create table ab (a int not null, b int not null) partition by list(a);
create table ab_a2 partition of ab for values in(2) partition by list (b);
create table ab_a2_b1 partition of ab_a2 for values in (1);
create table ab_a2_b2 partition of ab_a2 for values in (2);
create table ab_a2_b3 partition of ab_a2 for values in (3);
create table ab_a1 partition of ab for values in(1) partition by list (b);
create table ab_a1_b1 partition of ab_a1 for values in (1);
create table ab_a1_b2 partition of ab_a1 for values in (2);
create table ab_a1_b3 partition of ab_a1 for values in (3);
create table ab_a3 partition of ab for values in(3) partition by list (b);
create table ab_a3_b1 partition of ab_a3 for values in (1);
create table ab_a3_b2 partition of ab_a3 for values in (2);
create table ab_a3_b3 partition of ab_a3 for values in (3);

prepare ab_q1 (int, int) as select * from ab where a = $1 and b = $2;

explain execute ab_q1 (2,3);
explain execute ab_q1 (2,3);
explain execute ab_q1 (2,3);
explain execute ab_q1 (2,3);

postgres=# explain execute ab_q1 (2,3);
                           QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8)
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
         Filter: ((a = 2) AND (b = 3))
(3 rows)

postgres=# explain execute ab_q1 (2,3);
                           QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=1 width=8)
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8)   <--------- wrong partition
         Filter: ((a = $1) AND (b = $2))
(3 rows)

As soon as we hit the generic plan, the wrong partition is selected.

I think to do this you're going to have to store some sort of array that maps the partition index to the subpath in the Append node so you can correctly identify the subpath based on what you're getting back from get_partitions_for_keys(). Perhaps what you had worked previously when we were not returning a Bitmapset with that function.

Once you've got that design worked out I can take another look at this.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15 November 2017 at 01:57, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I think to do this you're going to have to store some sort of array
> that maps the partition index to the subpath in the Append node so you
> can correctly identify the subpath based on what you're getting back
> from get_partitions_for_keys(). Perhaps what you had worked previously
> when we were not returning a Bitmapset with that function.
>
> Once you've got that design worked out I can take another look at this.

I think this is a bit more broken than I originally mentioned above. The code you have at the moment assumes there will be just a single partitioned table in the hierarchy. Remember that complex partitioned hierarchies will be flattened during set_append_rel_pathlist(), so there may be multiple partitioned relations to search for.

A more simple way to break the patch is to have some constants in the query to eliminate some of the partitions during planning, leaving just a few to be eliminated during execution.

Something like:

deallocate ab_q1;
drop table if exists ab;
create table ab (a int not null, b int not null) partition by list(a);
create table ab_a1 partition of ab for values in (1);
create table ab_a2 partition of ab for values in (2);
create table ab_a3 partition of ab for values in (3);
create table ab_a4 partition of ab for values in (4);
create table ab_a5 partition of ab for values in (5);
create table ab_a6 partition of ab for values in (6);
create table ab_a7 partition of ab for values in (7);
create table ab_a8 partition of ab for values in (8);
create table ab_a9 partition of ab for values in (9);
create table ab_a10 partition of ab for values in (10);

prepare ab_q1 (int) as select * from ab where a between 4 and 5 and a = $1;

explain execute ab_q1 (4);
explain execute ab_q1 (4);
explain execute ab_q1 (4);
explain execute ab_q1 (4);
explain execute ab_q1 (4);

explain execute ab_q1 (4); -- TRAP: FailedAssertion("!(n < list->length)", File: "src/backend/nodes/list.c", Line: 392)

So some sort of hierarchical structure of the partition hierarchy would need to be stored in the Append node and then you'd need to search at each level, and then somehow match the results up to the subpaths that you have in the Append. Although, I'm still not sure this is the best way to go about this.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2017/11/15 8:13, David Rowley wrote:
> So some sort of hierarchical structure of the partition hierarchy
> would need to be stored in the Append node and then you'd need to
> search at each level, and then somehow match the results up to the
> subpaths that you have in the Append. Although, I'm still not sure
> this is the best way to go about this.

I think we should try to use the PartitionDispatch stuff for this somehow, just like get_partition_for_tuple() does, although I haven't thought very hard about whether that infrastructure would be useful as is.

Thanks,
Amit
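If I recall the tuple-routing conventions correctly, pd->indexes[i] >= 0 is a leaf partition's global index, while a negative value points at the next level's PartitionDispatch as pd_array[-pd->indexes[i] - 1]. Under those assumptions, a sketch of resolving per-level pruning results down to leaf indexes could look like the following; get_matching_bounds() is a hypothetical stand-in for whatever per-level pruning function the patch ends up with:

/*
 * Recursively translate the bound offsets matched at one partitioning
 * level into global leaf partition indexes.
 */
static Bitmapset *
collect_leaf_partitions(PartitionDispatch *pd_array, PartitionDispatch pd,
                        Bitmapset *matched, Bitmapset *leaves)
{
	int		i = -1;

	while ((i = bms_next_member(matched, i)) >= 0)
	{
		int		idx = pd->indexes[i];

		if (idx >= 0)
			leaves = bms_add_member(leaves, idx);	/* a leaf partition */
		else
		{
			/* a sub-partitioned table: descend and prune at that level */
			PartitionDispatch subpd = pd_array[-idx - 1];
			Bitmapset  *sub = get_matching_bounds(subpd);	/* hypothetical */

			leaves = collect_leaf_partitions(pd_array, subpd, sub, leaves);
		}
	}
	return leaves;
}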
Hello Rajkumar,

On Tue, Nov 14, 2017 at 2:22 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
> I have started testing this along with fast pruning. It is crashing for
> SQL with subqueries; one test case is given below.
>
> [...test case quoted in full above...]
>
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Thanks for your review. I am looking into it.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello David,

Thank you for reviewing.

On Wed, Nov 15, 2017 at 4:43 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I think this is a bit more broken than I originally mentioned above.
> The code you have at the moment assumes there will be just a single
> partitioned table in the hierarchy. Remember that complex partitioned
> hierarchies will be flattened during set_append_rel_pathlist(), so
> there may be multiple partitioned relations to search for.
>
> [...test case quoted in full above...]
>
> So some sort of hierarchical structure of the partition hierarchy
> would need to be stored in the Append node and then you'd need to
> search at each level, and then somehow match the results up to the
> subpaths that you have in the Append. Although, I'm still not sure
> this is the best way to go about this.

Thank you for your suggestion. I am looking into this and will post a patch soon.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 15, 2017 at 4:43 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 15 November 2017 at 01:57, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think to do this you're going to have to store some sort of array
>> that maps the partition index to the subpath in the Append node so you
>> can correctly identify the subpath based on what you're getting back
>> from get_partitions_for_keys(). Perhaps what you had worked previously
>> when we were not returning a Bitmapset with that function.
>>
>> Once you've got that design worked out I can take another look at this.
>
> So some sort of hierarchical structure of the partition hierarchy
> would need to be stored in the Append node and then you'd need to
> search at each level, and then somehow match the results up to the
> subpaths that you have in the Append. Although, I'm still not sure
> this is the best way to go about this.

Instead of a hierarchical structure, can't we maintain an array (one entry per partition), and whenever any leaf partition's subpath is added to the append rel (in function set_append_rel_pathlist) we set that subpath number in the corresponding array index (see the sketch after this message)? Later, we can add some wrapper over get_partitions_for_keys such that it can recursively traverse the non-leaf partitions (something like get_partition_for_tuple does, as Amit mentioned), ultimately giving as output a list of leaf partition indexes. Then we can find the subplan number by looking into the array.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
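A sketch of that array under assumed names: nparts, subpaths, and root come from the surrounding set_append_rel_pathlist() context, and leaf_index_for_rel() is hypothetical. The point is only that the map is filled as each subpath is collected, so leaves pruned at plan time keep -1:

/* In set_append_rel_pathlist(), as each leaf subpath is collected: */
int		   *subplan_map = (int *) palloc(sizeof(int) * nparts);
int			plan_no = 0;
int			i;
ListCell   *lc;

for (i = 0; i < nparts; i++)
	subplan_map[i] = -1;		/* pruned at plan time: no subplan */

foreach(lc, subpaths)
{
	Path	   *subpath = (Path *) lfirst(lc);
	int			leaf_index = leaf_index_for_rel(root, subpath->parent);	/* hypothetical */

	subplan_map[leaf_index] = plan_no++;
}

At execution time, each leaf index returned by the pruning wrapper would be translated through subplan_map; a -1 entry simply means the partition has no subplan to enable.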
On Wed, Nov 15, 2017 at 3:53 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Thank you for your suggestion. I am looking into this and will post a
> patch soon.

It has been two weeks since this update and no new patch has showed up. I am marking the patch as returned with feedback. If you can produce a new version, of course feel free to post it and register a new entry in the CF app.

--
Michael
Hello,

PFA the new version of the patch, which can be applied over the v11 patches of Amit Langote [1]. The patch has been completely modified and the 0001 patch of the previous series is no longer required.

As mentioned above, I have used the PartitionDispatchInfo and an array which holds the actual subplan index. This addresses the problems mentioned above but still needs to be thoroughly tested. I will soon post a new patch rebased over Amit's v13 patches [2], with additional code comments.

[1] https://www.postgresql.org/message-id/62d21a7b-fea9-f2d7-c33a-8caa12eca612%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hello Rajkumar,

On Tue, Nov 14, 2017 at 2:22 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
> explain (costs off)
> SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part
> WHERE sal = 50);
> server closed the connection unexpectedly

With the new patch, the output is as follows:

postgres=# explain (costs off) SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 50);
                           QUERY PLAN
-----------------------------------------------------------------
 Append
   InitPlan 1 (returns $0)
     ->  Append
           ->  Seq Scan on prun_test_part_p1 prun_test_part_p1_1
                 Filter: (sal = 50)
   ->  Seq Scan on prun_test_part_p1
         Filter: (sal < $0)
(7 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 14, 2017 at 6:27 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Thanks for working on this. I've had a look at the patch to try to
> understand how it is working. I found it a bit surprising that the code
> assumes it can rely on the order of Append->appendplans matching what's
> needed based on the return value of get_partitions_for_keys().
>
> [...test case quoted in full above...]
>
> As soon as we hit the generic plan, the wrong partition is selected.

With the new patch, the output is:

postgres=# explain execute ab_q1 (2,3);
                           QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8)
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
         Filter: ((a = 2) AND (b = 3))
(3 rows)

postgres=# explain execute ab_q1 (2,3);
                           QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=4 width=8)
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8)
         Filter: ((a = $1) AND (b = $2))
(3 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,

On Wed, Nov 15, 2017 at 4:43 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> A more simple way to break the patch is to have some constants in the
> query to eliminate some of the partitions during planning, leaving
> just a few to be eliminated during execution.
>
> [...test case quoted in full above...]
>
> explain execute ab_q1 (4); -- TRAP: FailedAssertion("!(n <
> list->length)", File: "src/backend/nodes/list.c", Line: 392)

This is handled in the new patch.

postgres=# explain execute ab_q1 (4);
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..49.55 rows=1 width=8)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8)
         Filter: ((a >= 4) AND (a <= 5) AND (a = 4))
(3 rows)

postgres=# explain execute ab_q1 (4);
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..99.10 rows=1 width=8)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
(3 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,

On Wed, Nov 29, 2017 at 7:11 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> It has been two weeks since this update and no new patch has showed
> up. I am marking the patch as returned with feedback. If you can
> produce a new version, of course feel free to post it and register a
> new entry in the CF app.

I have moved it to the next CF with status "needs review".

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,

PFA the updated patch rebased over Amit's v13 patches [1], part of which is committed. This also fixes a few bugs I found.

The InitPlans require execPlan, which is not set during ExecInitAppend, so the evaluation of extern quals is moved from ExecInitAppend to ExecAppend (see the sketch after this message). This changes the output of EXPLAIN, but only the correct partition(s) are scanned.

David Q1:
postgres=# explain analyse execute ab_q1 (3,3); --const
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
         Filter: ((a = 3) AND (b = 3))
 Planning time: 0.588 ms
 Execution time: 0.043 ms
(5 rows)

postgres=# explain analyse execute ab_q1 (3,3); --Param: only the ab_a3_b3 plan is executed
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119 rows=0 loops=1)
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.828 ms
 Execution time: 0.234 ms
(21 rows)

David Q2:
postgres=# explain analyse execute ab_q1 (4); --const
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..49.55 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = 4))
 Planning time: 0.501 ms
 Execution time: 0.039 ms
(5 rows)

postgres=# explain analyse execute ab_q1 (4); --Param
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..99.10 rows=2 width=8) (actual time=0.063..0.063 rows=0 loops=1)
   ->  Seq Scan on ab_a4  (cost=0.00..49.55 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
   ->  Seq Scan on ab_a5  (cost=0.00..49.55 rows=1 width=8) (never executed)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
 Planning time: 0.563 ms
 Execution time: 0.111 ms

I am still working on the patch to add more comments and regression tests, but comments on the code are welcome.

[1] https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
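The shape of that change, as a sketch: pruning_done and pruning_clauses are assumed patch fields, ExecAppendFetch() is a hypothetical name for the unchanged per-subplan fetch loop, and set_append_subplan_indexes() is the patch's function. By the time ExecAppend first runs, the executor has evaluated any InitPlans, so their output params (e.g. $0) are available to the pruning expressions:

static TupleTableSlot *
ExecAppend(PlanState *pstate)
{
	AppendState *node = castNode(AppendState, pstate);

	/*
	 * Defer pruning to the first call: InitPlan output params are not
	 * set during ExecInitAppend, but they are by the time we get here.
	 */
	if (!node->pruning_done)
	{
		set_append_subplan_indexes(node, node->pruning_clauses);
		node->pruning_done = true;
	}

	return ExecAppendFetch(node);	/* hypothetical: unchanged fetch logic */
}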
On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> David Q1:
> postgres=# explain analyse execute ab_q1 (3,3); --const
>
> [...plan quoted in full above...]

I think the EXPLAIN ANALYZE output should show something attached to the Append node so that we can tell that partition pruning is in use. I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes" or if we can give a few more useful details.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Robert,

On Sat, Dec 2, 2017 at 12:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the EXPLAIN ANALYZE output should show something attached to
> the Append node so that we can tell that partition pruning is in use.
> I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes"
> or if we can give a few more useful details.

The output above is shown for a Const value, i.e. optimizer pruning, which I included just to show that the correct partition is chosen even during runtime pruning for the given value. Taking your suggestion, the output for runtime pruning could be something as follows:

postgres=# explain analyse execute ab_q1 (3,3);
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119 rows=0 loops=1) (run-time partition pruning: on)
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.828 ms
 Execution time: 0.234 ms
(21 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 2, 2017 at 3:33 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.119..0.119
> rows=0 loops=1) (run-time partition pruning: on)

If we can, it would be better to show something a bit more precise, like the table being used for run-time pruning, or the expression being used for pruning. Also, we shouldn't use an ad-hoc format like "(run-time partition pruning: on)"; rather, we should display something using one of the ExplainPropertyBlah functions in explain.c.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
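For example, a sketch against explain.c's existing helpers, where runtime_pruning_quals is an assumed field on the Append plan node and plan, planstate, and ancestors come from the surrounding ExplainNode() context:

/* In ExplainNode(), in the arm handling T_Append: */
if (((Append *) plan)->runtime_pruning_quals)	/* assumed patch field */
{
	List	   *context;
	char	   *exprstr;

	/* Set up deparsing context and print the pruning expression. */
	context = set_deparse_context_planstate(es->deparse_cxt,
											(Node *) planstate,
											ancestors);
	exprstr = deparse_expression((Node *) ((Append *) plan)->runtime_pruning_quals,
								 context, false, false);
	ExplainPropertyText("Runtime Partition Pruning", exprstr, es);
}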
On 2 December 2017 at 08:04, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the EXPLAIN ANALYZE output should show something attached to
> the Append node so that we can tell that partition pruning is in use.
> I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes"
> or if we can give a few more useful details.

It already does. Any subnode with "(never executed)" was pruned at runtime. Do we really need anything else to tell us that?

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 6, 2017 at 1:21 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> It already does. Any subnode with "(never executed)" was pruned at
> runtime. Do we really need anything else to tell us that?

I have added the partition quals that are used for pruning.

PFA the updated patch. I have changed the names of variables to make them more appropriate, along with adding more code comments and doing some refactoring and other code cleanups.

A few cases:

1. Only runtime pruning - David's case 1

explain analyse execute ab_q1 (2,3);
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..395.10 rows=9 width=8) (actual time=0.101..0.101 rows=0 loops=1)
   Runtime Partition Pruning: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a1_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a2_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b1  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b2  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
   ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (never executed)
         Filter: ((a = $1) AND (b = $2))
 Planning time: 0.780 ms
 Execution time: 0.220 ms
(22 rows)

2. Runtime pruning after optimizer pruning - David's case 2
((a >= 4) AND (a <= 5)) is used during optimizer pruning and only (a = $1) is used for runtime pruning.

=# explain (analyse, costs off, summary off) execute ab_q1 (4);
                            QUERY PLAN
-------------------------------------------------------------------
 Append (actual time=0.062..0.062 rows=0 loops=1)
   Runtime Partition Pruning: (a = $1)
   ->  Seq Scan on ab_a4 (actual time=0.005..0.005 rows=0 loops=1)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
   ->  Seq Scan on ab_a5 (never executed)
         Filter: ((a >= 4) AND (a <= 5) AND (a = $1))
(6 rows)

3. Nestloop Join
tbl1.col1 only has values from 1 to 10.
=# \d+ tprt
                                   Table "public.tprt"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 col1   | integer |           |          |         | plain   |              |
 col2   | integer |           |          |         | plain   |              |
Partition key: RANGE (col1)
Partitions: tprt_1 FOR VALUES FROM (1) TO (5001),
            tprt_2 FOR VALUES FROM (5001) TO (10001),
            tprt_3 FOR VALUES FROM (10001) TO (20001)

=# explain (analyse, costs off, summary off) SELECT * FROM tbl1 JOIN tprt ON tbl1.col1 > tprt.col1;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.053..0.192 rows=45 loops=1)
   ->  Seq Scan on tbl1 (actual time=0.007..0.009 rows=10 loops=1)
   ->  Append (actual time=0.003..0.004 rows=4 loops=10)
         Runtime Partition Pruning Join Filter: (tbl1.col1 > col1)
         ->  Index Scan using tprt1_idx on tprt_1 (actual time=0.002..0.004 rows=5 loops=9)
               Index Cond: (tbl1.col1 > col1)
         ->  Index Scan using tprt2_idx on tprt_2 (never executed)
               Index Cond: (tbl1.col1 > col1)
         ->  Index Scan using tprt3_idx on tprt_3 (never executed)
               Index Cond: (tbl1.col1 > col1)
(10 rows)

4. InitPlan - Raghu's test case:

4.1 Only a few partitions satisfy the param

explain (analyse, costs off, summary off) SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 200);
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append (actual time=0.034..0.038 rows=3 loops=1)
   Runtime Partition Pruning: (sal < $0)
   InitPlan 1 (returns $0)
     ->  Append (actual time=0.008..0.009 rows=1 loops=1)
           ->  Seq Scan on prun_test_part_p3 prun_test_part_p3_1 (actual time=0.008..0.009 rows=1 loops=1)
                 Filter: (sal = 200)
                 Rows Removed by Filter: 1
   ->  Seq Scan on prun_test_part_p1 (actual time=0.002..0.003 rows=1 loops=1)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p2 (actual time=0.002..0.003 rows=2 loops=1)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p3 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p4 (never executed)
         Filter: (sal < $0)
(15 rows)

4.2 When the InitPlan query returns nothing

=# explain (analyse, costs off, summary off) SELECT * FROM prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 50);
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append (actual time=0.050..0.050 rows=0 loops=1)
   Runtime Partition Pruning: (sal < $0)
   InitPlan 1 (returns $0)
     ->  Append (actual time=0.013..0.013 rows=0 loops=1)
           ->  Seq Scan on prun_test_part_p1 prun_test_part_p1_1 (actual time=0.012..0.012 rows=0 loops=1)
                 Filter: (sal = 50)
                 Rows Removed by Filter: 1
   ->  Seq Scan on prun_test_part_p1 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p2 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p3 (never executed)
         Filter: (sal < $0)
   ->  Seq Scan on prun_test_part_p4 (never executed)
         Filter: (sal < $0)
(15 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 7 December 2017 at 20:22, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patch.

Hi Beena,

Thanks for updating this.

Can you list the patches which are required for this to apply to today's master branch?

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello David,

On Thu, Dec 7, 2017 at 4:07 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Can you list the patches which are required for this to apply to
> today's master branch?

Thanks for looking into this.

Currently Amit's v13 patches do not apply on HEAD, so I was working on 487a0c1518af2f3ae2d05b7fd23d636d687f28f3, which is the last commit where all of Amit's v13 patches apply cleanly.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,

On Thu, Dec 7, 2017 at 12:52 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> 1. Only runtime pruning - David's case 1
>
> [...]
>
> 2. Runtime pruning after optimizer pruning - David's case 2
>
> [...]

FYI, the v4 version of the patch accidentally included the choose_custom_plan hack I had used to force runtime pruning in the above cases (1, 2); it has been removed in v5. So with only the patch applied, it would continue to give the output with the Const and not the Param, because the custom plan is preferred over the generic one. This was pointed out in the initial post of this thread. Just to compare, I continued using the hack for the tests to show the behaviour changes. A different case would need to be used to test the behaviour which picks the generic plan.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 December 2017 at 23:56, Beena Emerson <memissemerson@gmail.com> wrote:
> Currently Amit's v13 patches do not apply on HEAD, so I was working on
> 487a0c1518af2f3ae2d05b7fd23d636d687f28f3, which is the last commit where
> all of Amit's v13 patches apply cleanly.

Thanks. I was just looking over this and was wondering about the following case:

drop table if exists p;
create table p (a int not null, b int not null) partition by range (a);
create table p1 partition of p for values from (0) to (1000);
create table p2 partition of p for values from (1000) to (2000);
create table p3 partition of p for values from (2000) to (3000);
create table p4 partition of p for values from (3000) to (4000);
create index on p1 (a);
create index on p2 (a);
create index on p3 (a);
create index on p4 (a);
insert into p select x,x from generate_series(1,3999) x;
drop table if exists t;
create table t (a int not null);
insert into t select generate_Series(1,10);
analyze p;
analyze t;
set enable_mergejoin=0;
set enable_hashjoin=0;
explain analyze select * from p inner join t on p.a = t.a;

The patch gives me:

                                        QUERY PLAN
----------------------------------------------------------------------------------------
 Nested Loop (actual time=0.032..0.159 rows=10 loops=1)
   ->  Seq Scan on t (actual time=0.012..0.013 rows=10 loops=1)
   ->  Append (actual time=0.004..0.013 rows=1 loops=10)
         ->  Index Scan using p1_a_idx on p1 (actual time=0.004..0.004 rows=1 loops=10)
               Index Cond: (a = t.a)
         ->  Index Scan using p2_a_idx on p2 (actual time=0.003..0.003 rows=0 loops=10)
               Index Cond: (a = t.a)
         ->  Index Scan using p3_a_idx on p3 (actual time=0.002..0.002 rows=0 loops=10)
               Index Cond: (a = t.a)
         ->  Index Scan using p4_a_idx on p4 (actual time=0.003..0.003 rows=0 loops=10)
               Index Cond: (a = t.a)
 Planning time: 0.472 ms
 Execution time: 0.241 ms
(13 rows)

but I expected to get (never executed) for p2, p3 and p4.

The following code makes me think you intend this to work:

@@ -280,6 +438,10 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;

+	/* Determine subplans to scan based on the new Params */
+	if (node->ps.chgParam != NULL && node->join_clauses)
+		set_append_subplan_indexes(node, node->join_clauses);
+

It just does not, due to node->join_clauses being NULL.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
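A sketch of the missing piece, under assumed names: prune_param_ids would be the set of param IDs the pruning expressions depend on, gathered at plan time, and all_pruning_clauses would include the join clauses rather than only the baserestrictinfo ones:

/*
 * In ExecReScanAppend(): reprune whenever a parameter that the pruning
 * expressions depend on has changed, including join parameters.
 */
if (node->ps.chgParam != NULL &&
	bms_overlap(node->ps.chgParam, node->prune_param_ids))
	set_append_subplan_indexes(node, node->all_pruning_clauses);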
On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
> I have added the partition quals that are used for pruning.
>
> PFA the updated patch. I have changed the names of variables to make
> it more appropriate, along with adding more code comments and doing
> some refactoring and other code cleanups.

- initClauses() seems to be a duplicate of the existing function
ExecInitExprList(), except for the extra NULL test, which isn't
necessary.

- The executor already has a system for making sure that relations get
opened and locked, and it's different from the new scheme which
set_append_subplan_indexes() implements. Relations should be locked
during the executor initialization phase (i.e. ExecInitAppend) and not
when the first tuple is requested (i.e. ExecAppend). Also, there's
already code to lock both child relations (i.e. the scans of those
relations, see InitScanRelation, ExecInitIndexScan) and non-leaf
partitions (ExecLockNonLeafAppendTables). The call to
find_all_inheritors() will lock all of that same stuff again *plus*
the leaf partitions that were pruned during planning - moreover, if
the Append is rescanned, we'll walk the partitioning structure again
for every rescan. I think RelationGetPartitionDispatchInfo should be
called directly from ExecInitAppend after the existing code to take
locks has been called, and store a pointer to the PartitionDispatch
object in the AppendState for future use.

- I am surprised that set_append_subplan_indexes() needs to worry
about multi-level partitioning directly. I would have thought that
Amit's patch would take care of that, just returning a Bitmapset of
indexes which this function could use directly. It also doesn't seem
like a very good idea to convert the Bitmapset (subplans) into a list
of integers (node->subplan_indexes), as set_append_subplan_indexes()
does at the bottom. The Bitmapset will be a lot more efficient; we
should be able to just iterate over that directly rather than
converting it into a List. Note that a Bitmapset can be created with
a single palloc, but an integer list needs one per list element plus
one for the list itself.

- I don't think it's a good idea for ExecInitAppend to copy so much
information into the appendstate. It copies append_paths_size,
append_paths_array, parentoid, base_params, es_param_list_info,
join_clauses, but it could just as well access them directly via
planstate->plan and planstate->state when they are needed. Maybe you
had some thought that this would be more efficient, but it probably
doesn't save much and it's unlike what we do elsewhere in the
executor.

- A good chunk of the rest of this logic in nodeAppend.c looks like
it's going to conflict heavily with the Parallel Append patch that
just went in. That's going to require some thought. There's no reason
why a parallel-aware Append can't do runtime partition pruning, but we
want to avoid as much overhead as possible when runtime pruning isn't
chosen. In the parallel-aware case, I think we should just try to
jigger things so that the plans we don't need to scan get marked
pa_finished. We don't want to hold pa_lock while doing the pruning, so
probably what should happen is add a new ParallelAppendState member
indicating whether pruning has been done; any process which needs to
choose a subplan and sees that pruning isn't done yet releases the
lock, performs pruning, then reacquires the lock, marks pa_finished on
all plans that we don't need to scan, marks pruning done, picks a
plan, and releases the lock.
There is a race condition where pruning gets conducted by multiple
workers around the same time, but it often won't happen and isn't a
big deal if it does; they should all get the same answer. In the
non-parallel-aware case, I think we should probably replace
choose_next_subplan_locally with choose_next_subplan_simply (for the
non-pruning case) and choose_next_subplan_with_pruning (for the other
case).

- On the optimizer side of things, the new calls to
find_all_inheritors() and get_leaf_part_recurse() in
set_base_rel_sizes() don't look good. As in the executor stuff, that's
work that the optimizer is already doing elsewhere, and we don't want
to redo it here. In the case of the optimizer, the most relevant place
is probably expand_partitioned_rtentry(). Another place where we
already have the relation open is get_relation_info(). Any information
you want to get from the relation descriptor needs to be saved in one
of those functions; don't re-open the relation elsewhere.

- I think you are perhaps doing the work a little too early here,
especially with regards to the join clauses. In set_base_rel_sizes()
we don't have a clear idea what the join order will be, or what type
of join will be used, so we don't know what join clauses are relevant
for run-time pruning. I don't think that trying to identify join
clauses at that stage makes sense. I think that the time when we know
(or can figure out) that stuff might be when we go to build a
parameterized append path - see the bottom of
add_paths_to_append_rel(). It's really the parameterization -- the
relids listed in required_outer -- that tells us which join clauses we
could potentially use for runtime pruning. I'm not sure in detail how
this should work, but it seems to me that generally the right formula
is probably: useful clauses from the appendrel's baserestrictinfo PLUS
joinclauses that mention only vars from the relation itself and
whatever's in required_outer.

- You can't use random fields in PlannerInfo *root as scratch space
for functions in different files to communicate with each other. I
mean, there's obviously some stuff that's valid to stick into the
PlannerInfo, but these are basically per-relation details which you're
counting on clearing before anybody gets confused. That's not a good
design, and it's probably a sign that not all of the code is in the
right place yet. For example, a lot of the logic you've added to
create_append_path() probably belongs in the caller -- look at how
simple create_.*_path() functions generally are. Similarly, think
about whether the chunk of logic added to set_rel_size() just after
the call to set_append_rel_size() doesn't really belong inside that
function, or conversely whether the chunk of code in
set_base_rel_sizes() just before the call to set_rel_size() shouldn't
be moved down. I'm hopeful that with some rejiggering of this sort you
can get the code that needs to communicate closer to being all in one
place and pass around whatever is needed via the parameter lists of
the functions involved, or even local variables, rather than via
PlannerInfo.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
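To make the Bitmapset point concrete, a small sketch of iterating the
surviving subplans directly (validsubplans is an invented variable
name; bms_next_member() is PostgreSQL's standard bitmapset iterator):

/*
 * Walk the set of un-pruned subplan indexes in place of an integer
 * List.  The whole set lives in a single palloc'd chunk, and
 * bms_next_member() hands back each member in ascending order.
 */
int			i = -1;

while ((i = bms_next_member(validsubplans, i)) >= 0)
{
	/* initialize or scan subplan i here */
}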
Hi Beena,

On 12/07/2017 02:22 AM, Beena Emerson wrote:
> I have added the partition quals that are used for pruning.
>
> PFA the updated patch. I have changed the names of variables to make
> it more appropriate, along with adding more code comments and doing
> some refactoring and other code cleanups.
>

As the current patch conflicts with [1] could you provide a rebased
version?

Thanks in advance!

[1] https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp

Best regards,
Jesper
Hello Robert,
Thank you for the comments. I have started working on it.

On Fri, Dec 8, 2017 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> I have added the partition quals that are used for pruning.
>>
>> PFA the updated patch. I have changed the names of variables to make
>> it more appropriate, along with adding more code comments and doing
>> some refactoring and other code cleanups.
>
> - initClauses() seems to be a duplicate of the existing function
> ExecInitExprList(), except for the extra NULL test, which isn't
> necessary.

initClauses() has been removed and ExecInitExprList() is used instead.

> - The executor already has a system for making sure that relations get
> opened and locked, and it's different from the new scheme which
> set_append_subplan_indexes() implements. Relations should be locked
> during the executor initialization phase (i.e. ExecInitAppend) and not
> when the first tuple is requested (i.e. ExecAppend). Also, there's
> already code to lock both child relations (i.e. the scans of those
> relations, see InitScanRelation, ExecInitIndexScan) and non-leaf
> partitions (ExecLockNonLeafAppendTables). The call to
> find_all_inheritors() will lock all of that same stuff again *plus*
> the leaf partitions that were pruned during planning - moreover, if
> the Append is rescanned, we'll walk the partitioning structure again
> for every rescan. I think RelationGetPartitionDispatchInfo should be
> called directly from ExecInitAppend after the existing code to take
> locks has been called, and store a pointer to the PartitionDispatch
> object in the AppendState for future use.

I have moved the call into ExecInitAppend. This still uses the previous
locking method; I will work on it in the next version of the patch.

> - I am surprised that set_append_subplan_indexes() needs to worry
> about multi-level partitioning directly. I would have thought that
> Amit's patch would take care of that, just returning a Bitmapset of
> indexes which this function could use directly. It also doesn't seem
> like a very good idea to convert the Bitmapset (subplans) into a list
> of integers (node->subplan_indexes), as set_append_subplan_indexes()
> does at the bottom. The Bitmapset will be a lot more efficient; we
> should be able to just iterate over that directly rather than
> converting it into a List. Note that a Bitmapset can be created with
> a single palloc, but an integer list needs one per list element plus
> one for the list itself.

The function get_partitions_from_clauses() returns the Bitmapset of
partitions for a single level of partitioning. So when a bit in the
Bitmapset indicates a child partitioned table,
set_append_subplan_indexes() loops through again until it gets the
list of all leaf indexes.

I am working on the other comments and will post the patch along with
rebasing to v14 of Amit's patch soon.

-- 
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
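In rough terms, the multi-level loop described here could look like the
sketch below. Everything in it is illustrative: the helpers
index_is_subpartition() and prune_subpartition() are invented stand-ins
for however the patch detects a sub-partitioned table in the result
bitmap and re-runs pruning at the next level.

/*
 * Hypothetical sketch: descend through partition levels, recursing
 * whenever a selected index refers to a sub-partitioned table, and
 * collecting leaf-partition indexes into *leaves.
 */
static void
collect_leaf_indexes(int level, Bitmapset *selected, Bitmapset **leaves)
{
	int			i = -1;

	while ((i = bms_next_member(selected, i)) >= 0)
	{
		if (index_is_subpartition(level, i))	/* invented helper */
			collect_leaf_indexes(level + 1,
								 prune_subpartition(level, i),	/* invented */
								 leaves);
		else
			*leaves = bms_add_member(*leaves, i);
	}
}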
On Tue, Dec 12, 2017 at 4:57 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello Robert,
> Thank you for the comments. I have started working on it.
>
> [...]
>
> I am working on the other comments and will post the patch along with
> rebasing to v14 of Amit's patch soon.

PFA the updated patch, this can be applied over the v13 patches [1]
over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3

[1] https://www.postgresql.org/message-id/df609168-b7fd-4c0b-e9b2-6e398d411e27%40lab.ntt.co.jp

-- 
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hello,

On Fri, Dec 8, 2017 at 3:37 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 December 2017 at 23:56, Beena Emerson <memissemerson@gmail.com> wrote:
>> Currently Amit's v13 patches do not apply on the HEAD and I was
>> working on 487a0c1518af2f3ae2d05b7fd23d636d687f28f3 which is the last
>> commit where all Amit's v13 patches applies cleanly.
>
> Thanks.
>
> I was just looking over this and was wondering about the following case:
>
> [...]
>
> but I expected to get (never executed) for p2, p3 and p4.
>
> [...]
>
> It just does not, due to node->join_clauses being NULL.

Thank you for your tests. I am working on this.

-- 
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Jesper, On Tue, Dec 12, 2017 at 4:04 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > Hi Beena, > > On 12/07/2017 02:22 AM, Beena Emerson wrote: >> >> I have added the partition quals that are used for pruning. >> >> PFA the updated patch. I have changed the names of variables to make >> it more appropriate, along with adding more code comments and doing >> some refactoring and other code cleanups. >> > > As the current patch conflicts with [1] could you provide a rebased version > ? > > Thanks in advance ! > > [1] > https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp > I am aware of this and will post a rebased version soon. Thank you, Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13 December 2017 at 00:33, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patch, this can be applied over the v13 patches [1]
> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3

Hi Beena,

Thanks for posting an updated patch.

I've been looking over this and I think that the use of the
PartitionDispatch in set_append_subplan_indexes is not correct. What
we need here is the index of the Append's subnode and that's not what
RelationGetPartitionDispatchInfo() gives you. Remember that some
partitions could have been pruned away already during planning.

This quick example shows that the partition selection is not correct.

create table p (a int, b int) partition by range (a);

create table p_a_neg partition of p for values from (minvalue) to (0)
partition by range (b);
create table p_a_pos partition of p for values from (0) to (maxvalue)
partition by range (b);

create table p_a_neg_b_neg partition of p_a_neg for values from
(minvalue) to (0);
create table p_a_neg_b_pos partition of p_a_neg for values from (0) to
(maxvalue);

create table p_a_pos_b_neg partition of p_a_pos for values from
(minvalue) to (0);
create table p_a_pos_b_pos partition of p_a_pos for values from (0) to
(maxvalue);

prepare q1 (int, int) as select * from p where a = $1 and b = $1;

explain analyze execute q1 (-1,-1); -- this works.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Append (cost=0.00..175.60 rows=4 width=8) (actual time=1.099..1.099
rows=0 loops=1)
Runtime Partition Pruning: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_neg_b_neg (cost=0.00..43.90 rows=1 width=8)
(actual time=0.023..0.023 rows=0 loops=1)
Filter: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_neg_b_pos (cost=0.00..43.90 rows=1 width=8)
(never executed)
Filter: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_pos_b_neg (cost=0.00..43.90 rows=1 width=8)
(never executed)
Filter: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_pos_b_pos (cost=0.00..43.90 rows=1 width=8)
(never executed)
Filter: ((a = $1) AND (b = $1))
(12 rows)

explain analyze execute q1 (-1,1); -- should scan p_a_neg_b_pos, but does not.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Append (cost=0.00..175.60 rows=4 width=8) (actual
time=758996.359..758996.359 rows=0 loops=1)
Runtime Partition Pruning: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_neg_b_neg (cost=0.00..43.90 rows=1 width=8)
(actual time=0.056..0.056 rows=0 loops=1)
Filter: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_neg_b_pos (cost=0.00..43.90 rows=1 width=8)
(never executed)
Filter: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_pos_b_neg (cost=0.00..43.90 rows=1 width=8)
(never executed)
Filter: ((a = $1) AND (b = $1))
-> Seq Scan on p_a_pos_b_pos (cost=0.00..43.90 rows=1 width=8)
(never executed)
Filter: ((a = $1) AND (b = $1))
(12 rows)

So, I started to look at what the best way to put this right might be.
I see that since Parallel Append was committed, the subnodes are now
sorted in cost order inside create_append_path(), so likely we'll want
to delay figuring out the subpath list indexes until after that's
done, since sorting would scramble our index arrays. We could simply
look at the subpaths at the end of create_append_path() and create
some sort of new matrix type that can accept the output of Amit's
get_partitions_from_clauses() and translate that Bitmapset into the
subpath indexes (another Bitmapset).
This will also need to work for sub-partitions too, so this matrix must be some sort of tree that we can descend into when we see that get_partitions_from_clauses returned a bit for a sub-partition instead of a leaf-partition. I bashed this idea around a bit and I came up with the attached. It's very far from complete and in a very WIP state. I've not really done anything to make the correct clause list available in nodeAppend.c yet, but I think the code that's there is worthy of a look. I've not done that much work on the new choose_next_subplan* functions in nodeAppend.c. I just modified choose_next_subplan_locally to show how this set of functions need to take into account the subnode bitmap set of valid partitions to scan. Perhaps some special case is needed to have these functions ignore the Bitmapset when runtime pruning is disabled (perhaps a completely new set of the functions is needed to support the selection of the next non-pruned partition). Although, probably that can be debated a bit later as it's a fairly minor detail for now. My patch also lacks any means to extract the Params during match_clauses_to_partkey(), or at least most of the cases. I've just added 1 case there. I did this because I thought it was better to extract the ParamIds rather than a bool to say we've matched params. This way we can only reevaluate which subplans to look at on rescan of an Append if and only if the params we actually care about have changed. I've not given this part a huge amount of thought yet. I'm a little unsure where to go from here. Obviously, this is quite a major rewrite of your patch. The parts that I've got missing likely can use quite a bit of the stuff you've already written, but that needs some review. I wanted to post this now as I know you're busy working on this to rebase it on parallel Append and to also address Robert's concerns, which all seem valid to me. What's the best way for us to coordinate our efforts on this? Maybe you could look at my patch and sanity check it to ensure I'm not taking anything in the wrong direction? I also think that MergeAppend needs similar work, but I think it's best to get the Append case working first, or perhaps that's another patch... Please find attached my patch, which is based directly on top of Amit's faster partition pruning v14 [1], which I patched against 4034db215b9 [1] https://www.postgresql.org/message-id/9b98fc47-34b8-0ab6-27fc-c8a0889f2e5b%40lab.ntt.co.jp -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
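The "matrix" described above is easiest to picture as a small tree; a
hedged sketch of one possible shape (the struct and field names are
invented for illustration, not taken from the attached patch):

/*
 * Hypothetical per-partitioned-table node: for each partition index
 * that pruning can return, either map it to an Append subpath index,
 * or descend into the node for a sub-partitioned table.
 */
typedef struct SubplanIndexNode
{
	int			nparts;			/* partitions at this level */
	int		   *subplan_index;	/* subpath index, or -1 if none */
	struct SubplanIndexNode **subpart;	/* non-NULL entry => descend */
} SubplanIndexNode;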
Hi.

On 2017/12/16 15:05, David Rowley wrote:
> On 13 December 2017 at 00:33, Beena Emerson <memissemerson@gmail.com> wrote:
>> PFA the updated patch, this can be applied over the v13 patches [1]
>> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
>
> Hi Beena,
>
> Thanks for posting an updated patch.
>
> I've been looking over this and I think that the use of the
> PartitionDispatch in set_append_subplan_indexes is not correct. What
> we need here is the index of the Append's subnode and that's not what
> RelationGetPartitionDispatchInfo() gives you. Remember that some
> partitions could have been pruned away already during planning.

A somewhat similar concern is being discussed on the "UPDATE partition
key" thread [1]. In that case, ExecInitModifyTable(), when initializing
tuple routing information to handle the "update partition key" case, will
have to deal with the fact that there might be fewer sub-plans in the
ModifyTable node than there are partitions in the partition tree. That
is, the source partitions that the planner would have determined after
pruning could be fewer than the possible target partitions for rows from
the source partitions to move to, of which the latter consists of *all*
partitions. So, we have to have a mapping from leaf partition indexes as
figured out by RelationGetPartitionDispatchInfo() (indexes that are
offsets into a global array for *all* partitions), to sub-plan indexes
which are offsets into the array for only those partitions that have a
sub-plan. Such a mapping is built (per the latest patch on that thread)
by ExecSetupPartitionTupleRouting() in execPartition.c.

We could do something similar here using a similar code structure. Maybe,
add an ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking
ExecSetupPartitionTupleRouting), that accepts the AppendState node.
Furthermore, it might be a good idea to have something similar to
ExecFindPartition(), say, ExecGetPartitions(). That is, we have new
functions for run-time pruning that are counterparts to corresponding
functions for tuple routing.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/c5e1d4ad-d243-52c5-608b-5dbb7183e465%40lab.ntt.co.jp
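To illustrate the kind of mapping being described, a minimal sketch
(all names here are invented; nleafparts and leaf_index_of_subplan are
assumptions standing in for whatever the patch actually tracks):

/*
 * Hypothetical sketch: map global leaf-partition indexes (offsets as
 * handed out by RelationGetPartitionDispatchInfo) to sub-plan indexes.
 * Partitions pruned at plan time have no sub-plan, so their entries
 * are left as -1.
 */
int		   *subplan_map = (int *) palloc(nleafparts * sizeof(int));
int			i;

for (i = 0; i < nleafparts; i++)
	subplan_map[i] = -1;

for (i = 0; i < nsubplans; i++)
	subplan_map[leaf_index_of_subplan[i]] = i;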
On 2017/12/09 0:57, Robert Haas wrote:
> On Thu, Dec 7, 2017 at 2:22 AM, Beena Emerson <memissemerson@gmail.com> wrote:
>> I have added the partition quals that are used for pruning.
>>
>> PFA the updated patch. I have changed the names of variables to make
>> it more appropriate, along with adding more code comments and doing
>> some refactoring and other code cleanups.
>
> - I am surprised that set_append_subplan_indexes() needs to worry
> about multi-level partitioning directly. I would have thought that
> Amit's patch would take care of that, just returning a Bitmapset of
> indexes which this function could use directly.

Actually, the partition.c code that my patch adds is limited to
considering one partitioned table at a time, not the whole tree. As of
0a480502b09 [1], we call set_append_rel_size() separately for each
partitioned table in a partition tree. In each such call, we call
partition.c to perform partition pruning for the given partitioned
table.

In the run-time pruning case, we should get, via Append, a list of
pruning clauses for each partitioned table in the tree that survived
plan-time pruning. Then, just like ExecFindPartition() calls
get_partition_for_tuple() for each partitioned table until we get to a
leaf partition, we should call partition.c for each un-pruned
partitioned table that has run-time pruning clauses associated.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a480502b09
On 18 December 2017 at 21:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/12/16 15:05, David Rowley wrote: >> I've been looking over this and I think that the use of the >> PartitionDispatch in set_append_subplan_indexes is not correct. What >> we need here is the index of the Append's subnode and that's not what >> RelationGetPartitionDispatchInfo() gives you. Remember that some >> partitions could have been pruned away already during planning. > > A somewhat similar concern is being discussed on the "UPDATE partition > key" thread [1]. In that case, ExecInitModifyTable(), when initializing > tuple routing information to handle the "update partition key" case, will > have to deal with the fact that there might be fewer sub-plans in the > ModifyTable node than there are partitions in the partition tree. That > is, source partitions that planner would have determined after pruning, > could be fewer than possible target partitions for rows from the source > partitions to move to, of which the latter consists of *all* partitions. > So, we have to have a mapping from leaf partition indexes as figured out > by RelationGetPartitionDispatchInfo() (indexes that are offsets into a > global array for *all* partitions), to sub-plan indexes which are offsets > into the array for only those partitions that have a sub-plan. Such > mapping is built (per the latest patch on that thread) by > ExecSetupPartitionTupleRouting() in execPartition.c. Surely this is a different problem? With UPDATE of a partition key, if the planner eliminates all but 1 partition the UPDATE could cause that tuple to be "moved" into any leaf partition, very possibly one that's been eliminated during planning. In the case of runtime Append pruning, we can forget about all partitions that the planner managed to eliminate, we'll never need to touch those, ever. All we care about here is trying to reduce the number of partitions down further using values that were not available during planning. > We could do something similar here using a similar code structure. Maybe, > add a ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking > ExecSetupPartitionTupleRouting), that accepts AppendState node. > Furthermore, it might be a good idea to have something similar to > ExecFindPartition(), say, ExecGetPartitions(). That is, we have new > functions for run-time pruning that are counterparts to corresponding > functions for tuple routing. Seems to me in this case we're better to build this structure during planning and save it with the plan so that it can be used over and over, rather than building it again and again each time the plan is executed. Likely a common use case for run-time pruning is when the plan is going to be used multiple times with different parameters, so we really don't want to repeat any work that we don't have to here. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi David,
Thank you for reviewing and looking at this. I have attached the WIP
patch which incorporates some of Robert's comments and is rebased over
Amit's v14 patch.

On Sat, Dec 16, 2017 at 11:35 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 13 December 2017 at 00:33, Beena Emerson <memissemerson@gmail.com> wrote:
>> PFA the updated patch, this can be applied over the v13 patches [1]
>> over commit 487a0c1518af2f3ae2d05b7fd23d636d687f28f3
>
> Hi Beena,
>
> Thanks for posting an updated patch.
>
> I've been looking over this and I think that the use of the
> PartitionDispatch in set_append_subplan_indexes is not correct. What
> we need here is the index of the Append's subnode and that's not what
> RelationGetPartitionDispatchInfo() gives you. Remember that some
> partitions could have been pruned away already during planning.
>
> [...]
>
> So, I started to look at what the best way to put this right might be.
> I see that since Parallel Append was committed, the subnodes are
> now sorted in cost order inside create_append_path(), so likely we'll
> want to delay figuring out the subpath list indexes until after that's
> done, since sorting would scramble our index arrays. We could simply
> look at the subpaths at the end of create_append_path() and create
> some sort of new matrix type that can accept the output of Amit's
> get_partitions_from_clauses() and translate that Bitmapset into the
> subpath indexes (another Bitmapset). This will also need to work for
> sub-partitions too, so this matrix must be some sort of tree that we
> can descend into when we see that get_partitions_from_clauses returned
> a bit for a sub-partition instead of a leaf-partition.

Yes, the change in sort order means that the current
append_paths_array cannot be used for Parallel Append and new logic
has to be devised. I have still not thought about it but your method
seems like a good way to go. Currently I have worked on the Parallel
bit considering that the append_paths_array holds the correct
subplan_index.

> I bashed this idea around a bit and I came up with the attached. It's
> very far from complete and in a very WIP state. I've not really done
> anything to make the correct clause list available in nodeAppend.c
> yet, but I think the code that's there is worthy of a look. I've not
> done that much work on the new choose_next_subplan* functions in
> nodeAppend.c. I just modified choose_next_subplan_locally to show how
> this set of functions need to take into account the subnode bitmap set
> of valid partitions to scan. Perhaps some special case is needed to
> have these functions ignore the Bitmapset when runtime pruning is
> disabled (perhaps a completely new set of the functions is needed to
> support the selection of the next non-pruned partition). Although,
> probably that can be debated a bit later as it's a fairly minor detail
> for now.
>
> My patch also lacks any means to extract the Params during
> match_clauses_to_partkey(), or at least most of the cases. I've just
> added 1 case there. I did this because I thought it was better to
> extract the ParamIds rather than a bool to say we've matched params.
> This way we can only reevaluate which subplans to look at on rescan of
> an Append if and only if the params we actually care about have
> changed. I've not given this part a huge amount of thought yet.

Currently, during ReScan the patch prunes whenever the Param changes,
and I had this 'rescan pruning optimization' in mind but had not
worked on it.

> I'm a little unsure where to go from here. Obviously, this is quite a
> major rewrite of your patch. The parts that I've got missing likely
> can use quite a bit of the stuff you've already written, but that
> needs some review. I wanted to post this now as I know you're busy
> working on this to rebase it on parallel Append and to also address
> Robert's concerns, which all seem valid to me.
>
> What's the best way for us to coordinate our efforts on this? Maybe
> you could look at my patch and sanity check it to ensure I'm not
> taking anything in the wrong direction?

I have not seen the patch in depth but this approach seems good. Dilip
has already worked on the join equality bug you pointed out before; I
am yet to merge that patch and will work on Robert's optimizer
comments. Maybe I can incorporate your patch while working on it.

-- 
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello, On Mon, Dec 18, 2017 at 4:03 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > >> We could do something similar here using a similar code structure. Maybe, >> add a ExecSetupPartitionRuntimePruning() in execPartition.c (mimicking >> ExecSetupPartitionTupleRouting), that accepts AppendState node. >> Furthermore, it might be a good idea to have something similar to >> ExecFindPartition(), say, ExecGetPartitions(). That is, we have new >> functions for run-time pruning that are counterparts to corresponding >> functions for tuple routing. > > Seems to me in this case we're better to build this structure during > planning and save it with the plan so that it can be used over and > over, rather than building it again and again each time the plan is > executed. Likely a common use case for run-time pruning is when the > plan is going to be used multiple times with different parameters, so > we really don't want to repeat any work that we don't have to here. > I agree. It would be better to avoid building the structure during execution. PFA the updated patch. -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:
> PFA the updated patch.

Hi Beena,

Thanks for updating the patch. I've merged a chunk of your latest
patch with what I was working on and cleaned up some mistakes that I
had made in my WIP version.

I've now done the work to make this work with parallel append, which
wasn't so difficult. The 3 functions which choose the next subplan
just needed to be made aware that they need to pay attention to the
Bitmapset which lists which subplans need to be scanned. I did have to
write a bms_prev_member() function to get the next lower set bit in a
Bitmapset to make this easier. I think what I have is fairly clean.

I've also solved the problem we had discussed about only needing to
reselect the subplans during ExecReScanAppend() when a parameter has
changed that is actually used in run-time pruning. If it's not used,
we now no longer wastefully reselect the same matching subplans (a
sketch of this check follows the test output below).

The patch still does need more work in match_clauses_to_partkey() to
ensure we extract all paramids from the clauses. Right now, I'm only
grabbing the paramids from an OpExpr. There's a bunch of other places
Params could be hidden in there. For anything Amit is supporting for
Consts, we need to grab the Param IDs when it's a Param instead of a
Const.

There are also a few things about the patch which I didn't change
around too much, as I didn't want to change Amit's v15 patch too much.
He's still working on this and I didn't want too many conflicts.
Basically, I had to make match_clauses_to_partkey() an external
function so that I could use it in createplan.c in
create_append_plan() in order to extract the Param IDs of any
parameters in the parameterized path clause. This function is
currently in allpaths.c in Amit's patch, but I feel it does not really
belong there. I'll discuss with Amit on the faster partition pruning
thread.

The attached patch is pretty fresh. I've not given it a huge amount of
testing so far. I do know there's still work to do in order to make
cases like:

prepare q2 (int) as select * from p where a in(10000,20000,$1);
explain (costs off, analyze) execute q2 (45678);

work correctly. To fix that requires a bit more invasion into the
faster partition pruning v15 patch which this is based on. I don't
really want to touch that too much just yet.

Here are some examples of what the patch does:

-- Test 1 Setup
-- This shows the patch working with multiple partition levels.
drop table if exists pt; create table pt (a int, b int) partition by range (a); create table pt_a_neg partition of pt for values from (minvalue) to (0) partition by range (b); create table pt_a_pos partition of pt for values from (0) to (maxvalue) partition by range (b); create table pt_a_neg_b_neg partition of pt_a_neg for values from (minvalue) to (0); create table pt_a_neg_b_pos partition of pt_a_neg for values from (0) to (maxvalue); create table pt_a_pos_b_neg partition of pt_a_pos for values from (minvalue) to (0); create table pt_a_pos_b_pos partition of pt_a_pos for values from (0) to (maxvalue); insert into pt select x,x from generate_series(-1000,1000) x; analyze pt; prepare q1 (int, int) as select * from pt where a = $1 and b = $2; -- Test 1 Result (first 5 custom plan executions removed) postgres=# explain (costs off, analyze) execute q1 (-10,10); QUERY PLAN ---------------------------------------------------------------------------- Append (actual time=0.007..0.007 rows=0 loops=1) -> Seq Scan on pt_a_neg_b_neg (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_neg_b_pos (actual time=0.007..0.007 rows=0 loops=1) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_neg (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_pos (never executed) Filter: ((a = $1) AND (b = $2)) Planning time: 0.366 ms Execution time: 0.077 ms (11 rows) postgres=# explain (costs off, analyze) execute q1 (-10,-10); QUERY PLAN ---------------------------------------------------------------------------- Append (actual time=0.235..0.237 rows=1 loops=1) -> Seq Scan on pt_a_neg_b_neg (actual time=0.234..0.236 rows=1 loops=1) Filter: ((a = $1) AND (b = $2)) Rows Removed by Filter: 999 -> Seq Scan on pt_a_neg_b_pos (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_neg (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_pos (never executed) Filter: ((a = $1) AND (b = $2)) Planning time: 0.025 ms Execution time: 0.313 ms (12 rows) postgres=# explain (costs off, analyze) execute q1 (10,-10); QUERY PLAN ---------------------------------------------------------------------------- Append (actual time=0.014..0.014 rows=0 loops=1) -> Seq Scan on pt_a_neg_b_neg (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_neg_b_pos (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_neg (actual time=0.013..0.013 rows=0 loops=1) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_pos (never executed) Filter: ((a = $1) AND (b = $2)) Planning time: 0.025 ms Execution time: 0.091 ms (11 rows) postgres=# explain (costs off, analyze) execute q1 (10,10); QUERY PLAN ---------------------------------------------------------------------------- Append (actual time=0.032..0.222 rows=1 loops=1) -> Seq Scan on pt_a_neg_b_neg (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_neg_b_pos (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_neg (never executed) Filter: ((a = $1) AND (b = $2)) -> Seq Scan on pt_a_pos_b_pos (actual time=0.031..0.221 rows=1 loops=1) Filter: ((a = $1) AND (b = $2)) Rows Removed by Filter: 1000 Planning time: 0.026 ms Execution time: 0.297 ms (12 rows) -- Test 2 Setup -- This test shows the patch working with parameterized paths. Note that only subplans which actually have a value matching a value from the outer side of the join get scanned. 
create table p (a int not null) partition by range(a); select 'create table p'||x|| ' partition of p for values from ('||x * 10000 || ') to (' || (x+1)*10000 || ');' from generate_Series(0,9)x; \gexec insert into p select generate_Series(0,99999); select 'create index on p'||x||' (a)' from generate_Series(0,9)x; \gexec create table t (a int not null); insert into t values(4),(5); analyze p,t; set enable_mergejoin=0; explain (costs off, analyze) select * from t inner join p p on p.a=t.a; insert into t values(45678); explain (costs off, analyze) select * from t inner join p p on p.a=t.a; -- Test 2 Result postgres=# explain (costs off, analyze) select * from t inner join p p on p.a=t.a; QUERY PLAN ---------------------------------------------------------------------------------------------- Nested Loop (actual time=0.058..0.069 rows=2 loops=1) -> Seq Scan on t (actual time=0.011..0.012 rows=2 loops=1) -> Append (actual time=0.020..0.021 rows=1 loops=2) -> Index Only Scan using p0_a_idx on p0 p (actual time=0.019..0.020 rows=1 loops=2) Index Cond: (a = t.a) Heap Fetches: 2 -> Index Only Scan using p1_a_idx on p1 p_1 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p2_a_idx on p2 p_2 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p3_a_idx on p3 p_3 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p4_a_idx on p4 p_4 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p5_a_idx on p5 p_5 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p6_a_idx on p6 p_6 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p7_a_idx on p7 p_7 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p8_a_idx on p8 p_8 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 -> Index Only Scan using p9_a_idx on p9 p_9 (never executed) Index Cond: (a = t.a) Heap Fetches: 0 Planning time: 1.080 ms Execution time: 0.315 ms (35 rows) postgres=# insert into t values(45678); -- add a record to show that p4 gets scanned. 
INSERT 0 1
postgres=# explain (costs off, analyze) select * from t inner join p p on p.a=t.a;
QUERY PLAN
------------------------------------------------------------------------------------------------
Nested Loop (actual time=0.041..0.106 rows=3 loops=1)
-> Seq Scan on t (actual time=0.014..0.014 rows=3 loops=1)
-> Append (actual time=0.024..0.025 rows=1 loops=3)
-> Index Only Scan using p0_a_idx on p0 p (actual time=0.010..0.011 rows=1 loops=2)
Index Cond: (a = t.a)
Heap Fetches: 2
-> Index Only Scan using p1_a_idx on p1 p_1 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p2_a_idx on p2 p_2 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p3_a_idx on p3 p_3 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p4_a_idx on p4 p_4 (actual time=0.049..0.050 rows=1 loops=1)
Index Cond: (a = t.a)
Heap Fetches: 1
-> Index Only Scan using p5_a_idx on p5 p_5 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p6_a_idx on p6 p_6 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p7_a_idx on p7 p_7 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p8_a_idx on p8 p_8 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
-> Index Only Scan using p9_a_idx on p9 p_9 (never executed)
Index Cond: (a = t.a)
Heap Fetches: 0
Planning time: 0.635 ms
Execution time: 0.233 ms
(35 rows)

There is, however, still a fundamental problem with the patch, or
maybe the idea in general (this was also pointed out by Amit Langote
in an off-list discussion):

The problem is down to the logic in choose_custom_plan() only choosing
a generic plan if the cost of the generic plan is less than the
average custom plan cost. The problem is that the generic plan can
have many extra Append subnodes in comparison to the custom plan, all
of which are taken into account in the total plan cost, but these may
be pruned during execution. The logic in choose_custom_plan() has no
idea about this. I don't have any bright ideas on how to fix this yet,
as, suppose a PREPAREd statement like the following comes along:

PREPARE q3 (int, int) AS SELECT * FROM partitioned_table WHERE partkey
BETWEEN $1 AND $2;

the run-time pruning may prune it down to no subplans, all subplans,
or any number in between. So we can't do anything like take the total
Append cost to be the highest cost of its subplans, and likely using
the average cost might not be a good idea either. It might work
sometimes, but likely won't be very stable. If this is not fixed then
choose_custom_plan() has a very low probability of choosing a generic
plan which has run-time partition pruning enabled, which in a way
defeats the purpose of this whole patch.

I'm a bit uncertain on the best way to resolve this. It needs to be
discussed here.

One more thing. The attached is not yet set up to work with
MergeAppend. It's likely just a small amount of additional work to
make this happen, so it likely should be something that we do.

Anyway, I've attached the latest version of the patch. This is based
on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
apply to f94eec490

[1] https://www.postgresql.org/message-id/06cde8a5-0ac7-dcf5-ad66-1ca781623e0c@lab.ntt.co.jp

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
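The rescan optimization mentioned above amounts to checking the
changed-parameter set before re-running pruning. A minimal sketch,
assuming a field like prune_paramids exists on the AppendState to hold
the extracted Param IDs (the field and helper names here are
assumptions, not the patch's actual names):

/*
 * In ExecReScanAppend(): only re-select the valid subplans when one of
 * the Params that run-time pruning depends on has actually changed.
 * chgParam is the executor's Bitmapset of changed Param IDs.
 */
if (node->ps.chgParam != NULL &&
	bms_overlap(node->ps.chgParam, node->prune_paramids))
	set_valid_subplans(node);	/* hypothetical re-pruning step */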
Hello David,

On Thu, Dec 21, 2017 at 2:31 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:
>
> The problem is down to the logic in choose_custom_plan() only choosing
> a generic plan if the cost of the generic plan is less than the
> average custom plan cost. The problem is that the generic plan can
> have many extra Append subnodes in comparison to the custom plan, all
> of which are taken into account in the total plan cost, but these may
> be pruned during execution. The logic in choose_custom_plan() has no
> idea about this. I don't have any bright ideas on how to fix this
> yet, as, suppose a PREPAREd statement like the following comes along:
>
> PREPARE q3 (int, int) AS SELECT * FROM partitioned_table WHERE partkey
> BETWEEN $1 AND $2;
>
> the run-time pruning may prune it down to no subplans, all subplans, or
> any number in between. So we can't do anything like take the total
> Append cost to be the highest cost of its subplans, and likely
> using the average cost might not be a good idea either. It might work
> sometimes, but likely won't be very stable. If this is not fixed then
> choose_custom_plan() has a very low probability of choosing a
> generic plan which has run-time partition pruning enabled, which in a way
> defeats the purpose of this whole patch.
>
> I'm a bit uncertain on the best way to resolve this. It needs to be
> discussed here.

I had mentioned in the first mail on this thread that the custom plan
is always preferred over the generic one, and Robert said that it is
not in the scope of this patch. Maybe we can start a new thread for
this.

> One more thing. The attached is not yet set up to work with
> MergeAppend. It's likely just a small amount of additional work to
> make this happen, so it likely should be something that we do.
>
> Anyway, I've attached the latest version of the patch. This is based
> on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
> apply to f94eec490

Thank you for working on this. I will look into this and merge it with
my current version of the patch and Amit's v16 patches, and post a new
patch soon.

-- 
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21 December 2017 at 23:51, Beena Emerson <memissemerson@gmail.com> wrote:
> On Thu, Dec 21, 2017 at 2:31 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> On 19 December 2017 at 21:54, Beena Emerson <memissemerson@gmail.com> wrote:
>
>> The problem is down to the logic in choose_custom_plan() only choosing
>> a generic plan if the cost of the generic plan is less than the
>> average custom plan cost. The problem is that the generic plan can
>> have many extra Append subnodes in comparison to the custom plan, all
>> of which are taken into account in the total plan cost, but these may
>> be pruned during execution. The logic in choose_custom_plan() has no
>> idea about this..
>
> I had mentioned in the first mail on this thread that the custom plan
> is always preferred over the generic one, and Robert said that it is
> not in the scope of this patch. Maybe we can start a new thread for
> this.

Sorry, I missed that. I knew you had modified choose_custom_plan()
too, but I didn't know the diagnosis of the problem had made its way
here yet.

I wonder if Robert understands the extent of the problem. The patch
will be useful to prune away partitions when the partitioned table is
on the inside of a parameterised nested loop join, but I think we'll
likely get some complaints about PREPARE statements always using a
custom plan. It's going to make this pretty hard to write regression
tests for unless we go and invent some GUC "prefer_generic_plan", or
"generic_plan_cost_multiplier" to provide a way to coax
choose_custom_plan() into not choosing a custom plan. I admit to not
having the answer to this but I don't think we should rule out
discussing possible fixes.

> Thank you for working on this. I will look into this and merge it with
> my current version of the patch and Amit's v16 patches, and post a new
> patch soon.

okay. I've just rebased mine on top of Amit's v16 now and will send
shortly. I already took some code from your v7 patch to make my v2
work. Which parts do you think still need to be merged?

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello,

On Thu, Dec 21, 2017 at 6:01 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> [...]
>
> I wonder if Robert understands the extent of the problem. The patch
> will be useful to prune away partitions when the partitioned table is
> on the inside of a parameterised nested loop join, but I think we'll
> likely get some complaints about PREPARE statements always using a
> custom plan. It's going to make this pretty hard to write regression
> tests for unless we go and invent some GUC "prefer_generic_plan", or
> "generic_plan_cost_multiplier" to provide a way to coax
> choose_custom_plan() into not choosing a custom plan. I admit to not
> having the answer to this but I don't think we should rule out
> discussing possible fixes.

I have attached the regression tests I am currently using to check my
patch. It has a few prepared statements using runtime pruning.

-- 
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 21 December 2017 at 22:01, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached the latest version of the patch. This is based
> on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
> apply to f94eec490

Well, that went out of date pretty quickly. Amit has now posted v16 of
the faster partition pruning patch [1] which conflicts with my changes
in my v2 patch.

I've attached a new version of the patch to resolve these conflicts.

In v16 Amit added some code to eliminate LIST partitions when there
are not-equal (<>) expressions which cause all of the items in the
partition's IN list to be unmatchable. This appears to work fine with
the run-time pruning in the attached, per:

create table ta (a int not null) partition by list (a);
create table ta1 partition of ta for values in(1,2);
create table ta2 partition of ta for values in(3,4);

explain select * from ta where a <> 1 and a <> 2;

create table ta_null partition of ta for values in(null);

prepare q1 (int) as select * from ta where a <> 1 and a <> $1;

explain (costs off, analyze) execute q1(2);
QUERY PLAN
-----------------------------------------------------------------
Append (actual time=0.012..0.012 rows=0 loops=1)
-> Seq Scan on ta1 (never executed)
Filter: ((a <> 1) AND (a <> $1))
-> Seq Scan on ta2 (actual time=0.012..0.012 rows=0 loops=1)
Filter: ((a <> 1) AND (a <> $1))
Planning time: 0.019 ms
Execution time: 0.057 ms
(7 rows)

explain (costs off, analyze) execute q1(1);
QUERY PLAN
-----------------------------------------------------------------
Append (actual time=0.017..0.017 rows=0 loops=1)
-> Seq Scan on ta1 (actual time=0.013..0.013 rows=0 loops=1)
Filter: ((a <> 1) AND (a <> $1))
-> Seq Scan on ta2 (actual time=0.003..0.003 rows=0 loops=1)
Filter: ((a <> 1) AND (a <> $1))
Planning time: 0.021 ms
Execution time: 0.068 ms
(7 rows)

[1] https://www.postgresql.org/message-id/5ebae4cf-8145-975c-ad75-16eb7f756f32%40lab.ntt.co.jp

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 22 December 2017 at 01:43, Beena Emerson <memissemerson@gmail.com> wrote:
>> I wonder if Robert understands the extent of the problem. The patch
>> will be useful to prune away partitions when the partitioned table is
>> on the inside of a parameterised nested loop join, but I think we'll
>> likely get some complaints about PREPARE statements always using a
>> custom plan. It's going to make this pretty hard to write regression
>> tests for unless we go and invent some GUC "prefer_generic_plan" or
>> "generic_plan_cost_multiplier" to provide a way to coax
>> choose_custom_plan() into not choosing a custom plan. I admit to not
>> having the answer to this but I don't think we should rule out
>> discussing possible fixes.
>
> I have attached the regression tests I am currently using to check my
> patch. It has a few prepared statements using runtime pruning.

Thanks for sending. Will the tests still pass if you remove the hack
from choose_custom_plan()?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hello,

On Thu, Dec 21, 2017 at 6:30 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 22 December 2017 at 01:43, Beena Emerson <memissemerson@gmail.com> wrote:
>>> I wonder if Robert understands the extent of the problem. The patch
>>> will be useful to prune away partitions when the partitioned table is
>>> on the inside of a parameterised nested loop join, but I think we'll
>>> likely get some complaints about PREPARE statements always using a
>>> custom plan. It's going to make this pretty hard to write regression
>>> tests for unless we go and invent some GUC "prefer_generic_plan" or
>>> "generic_plan_cost_multiplier" to provide a way to coax
>>> choose_custom_plan() into not choosing a custom plan. I admit to not
>>> having the answer to this but I don't think we should rule out
>>> discussing possible fixes.
>>
>> I have attached the regression tests I am currently using to check my
>> patch. It has a few prepared statements using runtime pruning.
>
> Thanks for sending. Will the tests still pass if you remove the hack
> from choose_custom_plan()?

The hack has been removed in the v7 patch sent, and yes, the tests work
without any additional adjustments.

Thank you,
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 December 2017 at 02:06, Beena Emerson <memissemerson@gmail.com> wrote:
>> Thanks for sending. Will the tests still pass if you remove the hack
>> from choose_custom_plan()?
>
> The hack has been removed in the v7 patch sent, and yes, the tests work
> without any additional adjustments.

Oh nice. I see you've included parameters to force the 5 custom plans
to have a higher cost, then for the first generic plan attempt, you're
choosing params that'll result in a cheaper plan. Great idea. I'll
retract what I said about this being impossible to test now :)

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hello,

On Thu, Dec 21, 2017 at 6:26 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 21 December 2017 at 22:01, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've attached the latest version of the patch. This is based
>> on Amit's v15 of faster-partition-pruning [1] which I found to cleanly
>> apply to f94eec490
>
> Well, that went out of date pretty quickly. Amit has now posted v16 of
> the faster partition pruning patch [1] which conflicts with my changes
> in my v2 patch.
>
> I've attached a new version of the patch to resolve these conflicts.

Thank you for the patch. I applied it over Amit's v16 patches on commit
180428404. I found that make check crashes, as does the following SQL
you mentioned.

postgres=# create table ta (a int not null) partition by list (a);
CREATE TABLE
postgres=# create table ta1 partition of ta for values in(1,2);
CREATE TABLE
postgres=# create table ta2 partition of ta for values in(3,4);
CREATE TABLE
postgres=# explain select * from ta where a <> 1 and a <> 2;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

It seems like we have two different patches for the same feature. I
will post my version of the patch, which uses the struct
PartitionPruneInfo from your patch, and I will add the other additional
features you added, like optimizing the pruning on rescan. I will try
to post the patch tomorrow. If there are more suggestions, you can give
them over that; otherwise it seems like we are duplicating effort.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 December 2017 at 03:02, Beena Emerson <memissemerson@gmail.com> wrote:
> postgres=# create table ta (a int not null) partition by list (a);
> CREATE TABLE
> postgres=# create table ta1 partition of ta for values in(1,2);
> CREATE TABLE
> postgres=# create table ta2 partition of ta for values in(3,4);
> CREATE TABLE
> postgres=# explain select * from ta where a <> 1 and a <> 2;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

I just get "ERROR: negative bitmapset member not allowed" here. I
reported that one to Amit over on the other thread. Not sure why
you're getting a crash. Can you get a backtrace?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 21, 2017 at 4:01 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> The problem is down to the logic in choose_custom_plan() only choosing
> a generic plan if the average cost of the generic plan is less than
> the average custom plan cost. The problem is that the generic plan can
> have many extra Append subnodes in comparison to the custom plan, all
> of which are taken into account in the total plan cost, but these may
> be pruned during execution. The logic in choose_custom_plan() has no
> idea about this. I don't have any bright ideas on how to fix this
> yet, as, suppose a PREPAREd statement like the following comes along:
>
> PREPARE q3 (int, int) AS SELECT * FROM partitioned_table WHERE partkey
> BETWEEN $1 AND $2;
>
> the run-time pruning may prune it down to no subplans, all subplans, or
> any number in between. So we can't do anything like take the total
> Append cost to be the highest costing of its subplans, and likely
> using the average cost might not be a good idea either.

Well, I do think we need to make some kind of estimate. It may be a
bad estimate, but if we do nothing, we're estimating that no pruning
at all will happen, which is probably not right either. I mean, if we
have...

PREPARE q3 (int, int) AS SELECT * FROM unpartitioned_table WHERE
partkey BETWEEN $1 AND $2;

...that has to decide whether to use an index. And to do that it has
to estimate what fraction of the table will match the BETWEEN clause.
That may be an uninformed guess, but it guesses something. We probably
want to do something here that makes the guess for a partitioned_table
similar to the guess for an unpartitioned_table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22 December 2017 at 03:02, Beena Emerson <memissemerson@gmail.com> wrote:
> It seems like we have two different patches for the same feature. I
> will post my version of the patch, which uses the struct
> PartitionPruneInfo from your patch, and I will add the other additional
> features you added, like optimizing the pruning on rescan. I will try
> to post the patch tomorrow.

I apologise for persisting in making these parallel efforts. I do have
time right now to dedicate to reviewing this patch, but that time is
running out. At the end of this time, I was really hoping that there
would be a patch that's worthy of being committed (or at least one
worthy of a committer's time). During my review of v5, because I found
the patch to still need quite a bit of work, I thought the best use of
that time was to make it work myself, which to my knowledge I have
done. Although I'm sure my patch will still have bugs, it appears to me
to be quite a bit further ahead than your v7 WIP patch.

> If there are more suggestions, you can give them over that; otherwise
> it seems like we are duplicating effort.

Much of the things I did differently from you could be taken as
suggestions. There were a number of things in the v7 patch that were
still not in a workable state:

1. Using the PlannerInfo to record details about Append. How will this
work with a plan containing multiple Appends scanning partitioned
tables?

2. The use of AppendState->subplan_indexes List. Please use a Bitmapset
to mark the valid subplans (see the sketch below). Lists are not
particularly efficient for getting the nth item.

3. Use of PlannerInfo to store details specific to a single partitioned
table in set_base_rel_sizes.

4. Use of a new PlannerInfo->join_clauses in set_rel_size(). How will
this work when there are multiple partitioned tables being scanned in a
single plan?

5. In match_clauses_to_partkey() you're using a new
PlannerInfo->baserestrictinfo_param_indexes List to store ParamIds. How
will this work when there are multiple partitioned tables being scanned
in a single plan? A Bitmapset would be a better choice to store
paramids in.

6. In set_append_rel_pathlist you're using more PlannerInfo members to
handle a specific Append rel. Again, how will it work for plans
scanning multiple different partitioned tables?

7. Your changes to get_appendrel_parampathinfo() ignore equivalence
join clauses; don't you need to look at these too? If so, maybe it
might be worth just using get_baserel_parampathinfo()?

8. Lack of ability to detect if set_append_subplan_indexes() needs to
be called in ExecReScanAppend(). Some parameters that change might not
have an effect on which partitions to scan.

If you go and find a new way to solve all those problems, then please
consider which one of us it is that's making the duplicate effort.

Again, I'm sorry that I have been standing on your toes with my work
here. I'm certainly not out to try to take any glory here. I just want
the patch to be in a working state, and the time I have to do that is
fast running out. Please consider my efforts as an offer of assistance
rather than a threat to your work.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
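As a minimal sketch of the Bitmapset usage suggested in point 2; all
variable and function names below are illustrative, not taken from
either patch:

    #include "postgres.h"
    #include "nodes/bitmapset.h"

    /*
     * Illustrative only: shows why a Bitmapset suits "which subplans
     * are valid".  Membership is a couple of bit operations, whereas
     * list_nth() must walk the list from the head.
     */
    static void
    demo_valid_subplans(void)
    {
        Bitmapset  *validsubplans = NULL;
        int         i;

        /* suppose run-time pruning left subplans 0 and 3 */
        validsubplans = bms_add_member(validsubplans, 0);
        validsubplans = bms_add_member(validsubplans, 3);

        if (bms_is_member(3, validsubplans))
            elog(DEBUG1, "subplan 3 survived pruning");

        /* visit surviving subplan indexes in ascending order */
        i = -1;
        while ((i = bms_next_member(validsubplans, i)) >= 0)
            elog(DEBUG1, "subplan %d is valid", i);

        bms_free(validsubplans);
    }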
On 22 December 2017 at 12:45, Robert Haas <robertmhaas@gmail.com> wrote:
> PREPARE q3 (int, int) AS SELECT * FROM unpartitioned_table WHERE
> partkey BETWEEN $1 AND $2;
>
> ...that has to decide whether to use an index. And to do that it has
> to estimate what fraction of the table will match the BETWEEN clause.
> That may be an uninformed guess, but it guesses something. We
> probably want to do something here that makes the guess for a
> partitioned_table similar to the guess for an unpartitioned_table.

Are you considering some sort of clauselist_selectivity() estimate on
the given parameters and comparing that to the same selectivities that
were determined for the previous custom plans?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 21, 2017 at 6:53 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 22 December 2017 at 12:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> PREPARE q3 (int, int) AS SELECT * FROM unpartitioned_table WHERE
>> partkey BETWEEN $1 AND $2;
>>
>> ...that has to decide whether to use an index. And to do that it has
>> to estimate what fraction of the table will match the BETWEEN clause.
>> That may be an uninformed guess, but it guesses something. We
>> probably want to do something here that makes the guess for a
>> partitioned_table similar to the guess for an unpartitioned_table.
>
> Are you considering some sort of clauselist_selectivity() estimate on
> the given parameters and comparing that to the same selectivities that
> were determined for the previous custom plans?

No, I don't think comparing to previous custom plans is a workable
approach. I was thinking, rather, that if we know for example that
we're doing pruning on partition_column = $1, then we know that only
one partition will match. That's probably a common case. If we've
got partition_column > $1, we could assume that, say, 75% of the
partitions would match. partition_column BETWEEN $1 AND $2 is
probably a bit more selective, so maybe we assume 50% of the
partitions would match.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
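To make that rule of thumb concrete, a minimal sketch of what it could
look like; the function name and the strategy-to-fraction mapping are
purely illustrative, encoding the guesses above rather than anything in
the patch:

    #include "postgres.h"
    #include "access/stratnum.h"

    /*
     * Purely illustrative: guess what fraction of an Append's
     * partitions run-time pruning will leave, given only the btree
     * strategy of the parameterised clause matched to the partition
     * key.  The figures are the suggestions above, not measurements.
     */
    static double
    guessed_partition_fraction(StrategyNumber strategy, int nparts)
    {
        switch (strategy)
        {
            case BTEqualStrategyNumber:     /* partition_column = $1 */
                return 1.0 / nparts;
            case BTLessStrategyNumber:      /* one-sided inequality */
            case BTLessEqualStrategyNumber:
            case BTGreaterStrategyNumber:
            case BTGreaterEqualStrategyNumber:
                return 0.75;
            default:                        /* e.g. BETWEEN $1 AND $2 */
                return 0.5;
        }
    }

The Append path's cost could then be scaled by this fraction before
choose_custom_plan() compares generic and custom costs, though where
exactly to apply it is one of the open questions in this thread.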
On 22 December 2017 at 14:29, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 21, 2017 at 6:53 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Are you considering some sort of clauselist_selectivity() estimate on
>> the given parameters and comparing that to the same selectivities that
>> were determined for the previous custom plans?
>
> No, I don't think comparing to previous custom plans is a workable
> approach. I was thinking, rather, that if we know for example that
> we're doing pruning on partition_column = $1, then we know that only
> one partition will match. That's probably a common case. If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match. partition_column BETWEEN $1 AND $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.

Okay. Do you think this is something we need to solve for this patch?
When I complained originally, I didn't quite see any way to even test
the majority of this patch with the regression tests, but Beena has
since proven me wrong about that.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 21, 2017 at 8:37 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> No, I don't think comparing to previous custom plans is a workable
>> approach. I was thinking, rather, that if we know for example that
>> we're doing pruning on partition_column = $1, then we know that only
>> one partition will match. That's probably a common case. If we've
>> got partition_column > $1, we could assume that, say, 75% of the
>> partitions would match. partition_column BETWEEN $1 AND $2 is
>> probably a bit more selective, so maybe we assume 50% of the
>> partitions would match.
>
> Okay. Do you think this is something we need to solve for this patch?
> When I complained originally, I didn't quite see any way to even test
> the majority of this patch with the regression tests, but Beena has
> since proven me wrong about that.

Although I have done one round of review of this patch, I haven't
really got my head around it completely yet and haven't spent much
time on it, so my opinions are not as well-formed as maybe they should
be. I'm glad, by the way, that you are putting some effort into it, as
I think that will help move this forward more quickly.

At a high level, I want to avoid trying to solve too many problems in
one patch (which was the motivation behind my comment near the top of
the thread), but I also want to end up with something useful (which I
think is your concern). Leaving aside the difficulty of
implementation, I have some questions about what the right thing to do
actually is. In a simple case, I'm guessing that the cost of creating
a custom plan will exceed the amount that the plan saves, but in more
complex cases, I'm not sure that will be true. For instance, if we
know the particular parameter value at plan time, we can make a more
accurate estimate of how many times that value appears, which can then
feed into our choice of what plan shape to use. That is, for a query
like SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y = $1, the generic
plan might choose, say, a nested loop with b on the inner side, but if
we know that a particular value for $1 will match a lot of rows in a,
we might prefer a hash or merge join for that specific case. Run-time
pruning doesn't give us that flexibility. My intuition is that the
more complex we make the query, the more point there will be to making
custom plans, and the simpler the query, the more likely it is that a
generic plan will be good enough that it's not worth replanning every
time.

Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and that
results in some pretty terrible behavior. It will do things like form
a custom plan every time because the estimated cost of the custom plan
is lower than the estimated cost of the generic plan, even though the
two plans are structurally identical; only the estimates differ. It
will waste gobs of CPU cycles by replanning a primary key lookup 5
times just on the off chance that a lookup on the primary key index
isn't the best option. But this patch isn't going to fix any of that.
The best we can probably do is try to adjust the costing for Append
paths in some way that reflects the costs and benefits of pruning. I'm
tentatively in favor of trying to do something modest in that area,
but I don't have a detailed proposal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

Please find attached my 4th version of this patch. This is now based
on v17 of Amit's faster partition pruning patch [1]. It also now
includes Beena's tests, to which I've made some mostly cosmetic
changes.

I've also fixed a few bugs, one in a case where I was not properly
handling zero matching partitions in nodeAppend.c.

Another change I've made is to now perform the partition pruning at
run-time using a new memory context that's reset each time we
redetermine the matching partitions. This was required since we're
calling a planner function which might not be too careful about
pfreeing memory it allocates. A test case I was running before making
this change ended up failing to palloc memory due to OOM.

I've not done anything about reducing the cost of the Append path when
runtime pruning is enabled. I'm still thinking over the best way to
handle that.

[1] https://www.postgresql.org/message-id/58c3e20a-a964-4fdb-4e7d-bd833e9bead1@lab.ntt.co.jp

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
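The reset-per-redetermination memory context pattern described above
might look roughly like the following sketch; 'prune_context' and
'determine_matching_partitions' are illustrative names, not necessarily
what the patch uses:

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * Sketch only.  Everything the pruning code allocates lands in
     * prune_context, so it can all be discarded wholesale by the next
     * reset; nothing inside it needs to be individually pfree'd.
     *
     * At node initialization, something like:
     *     node->prune_context =
     *         AllocSetContextCreate(CurrentMemoryContext,
     *                               "partition prune",
     *                               ALLOCSET_DEFAULT_SIZES);
     */
    static Bitmapset *
    redetermine_matching_partitions(AppendState *node)
    {
        MemoryContext oldcontext;
        Bitmapset  *result;

        MemoryContextReset(node->prune_context);
        oldcontext = MemoryContextSwitchTo(node->prune_context);

        /* the planner's pruning code may allocate freely in here */
        result = determine_matching_partitions(node);

        MemoryContextSwitchTo(oldcontext);
        return result;          /* valid until the next reset */
    }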
Hello David,

On Wed, Dec 27, 2017 at 8:36 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Hi,
>
> Please find attached my 4th version of this patch.

Thanks for the patch.

> This is now based on v17 of Amit's faster partition pruning patch [1].
> It also now includes Beena's tests, to which I've made some mostly
> cosmetic changes.
>
> I've also fixed a few bugs, one in a case where I was not properly
> handling zero matching partitions in nodeAppend.c.
>
> Another change I've made is to now perform the partition pruning at
> run-time using a new memory context that's reset each time we
> redetermine the matching partitions. This was required since we're
> calling a planner function which might not be too careful about
> pfreeing memory it allocates. A test case I was running before making
> this change ended up failing to palloc memory due to OOM.
>
> I've not done anything about reducing the cost of the Append path when
> runtime pruning is enabled. I'm still thinking over the best way to
> handle that.

I think you are testing without asserts. The following assert fails:

src/backend/optimizer/plan/setrefs.c : set_plan_refs : line 921
    Assert(splan->plan.qual == NIL);

The Append node now has runtime partition quals.

Also, since the valid subplans are set in ExecInitAppend, queries with
InitPlans do not work. I had moved it to ExecAppend in my patch to
handle the InitPlans as well.

DROP TABLE IF EXISTS prun_test_part;
CREATE TABLE prun_test_part (sal int) PARTITION BY RANGE(sal);
CREATE TABLE prun_test_part_p1 PARTITION OF prun_test_part FOR VALUES FROM (0) TO (100);
CREATE TABLE prun_test_part_p2 PARTITION OF prun_test_part FOR VALUES FROM (100) TO (200);
CREATE TABLE prun_test_part_p3 PARTITION OF prun_test_part FOR VALUES FROM (200) TO (300);
CREATE TABLE prun_test_part_p4 PARTITION OF prun_test_part FOR VALUES FROM (300) TO (400);

INSERT INTO prun_test_part VALUES (90), (100), (110), (200), (210), (300), (310);

=# explain (analyze, costs off, summary off, timing off) SELECT * FROM
prun_test_part WHERE sal < (SELECT sal FROM prun_test_part WHERE sal = 200);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1 January 2018 at 19:22, Beena Emerson <memissemerson@gmail.com> wrote:
> I think you are testing without asserts.

Yeah, I was indeed. Oops.

> The following assert fails:
> src/backend/optimizer/plan/setrefs.c : set_plan_refs : line 921
>     Assert(splan->plan.qual == NIL);
> The Append node now has runtime partition quals.
>
> Also, since the valid subplans are set in ExecInitAppend, queries with
> InitPlans do not work. I had moved it to ExecAppend in my patch to
> handle the InitPlans as well.

Thanks for noticing. I've now changed things around so this case works
as it should, and I've added a test too.

I've attached an updated patch which also fixes a number of other
problems with my previous patch:

1. The Bitmapset I was using in nodeAppend.c to mark the valid subplans
was pretty bogus for Parallel Append, since the memory for the set was
not in shared memory. I've changed things around to reuse the
pa_finished[] array; the patch now just sets pa_finished to true for
any invalid subplans (see the sketch below).

2. I've added a new memory context to use in nodeAppend.c which is used
to call the planner code to determine which partitions are valid. I'd
been trying to have Amit be careful to pfree() everything in his v17
patch, but I realised it was just not possible to get everything
pfree'd. I found it pretty easy to construct a test case which caused
an OOM.

3. I've added support for IN lists to be pruned when the IN() list
contains a parameter. The changes I made to support this case probably
mostly belong in Amit's faster partition pruning patch, but I've put
them here for now to get this case working. There's a bunch of new
tests to test this.

4. Various other cosmetic improvements.

The attached patch should be applied after patching master with Amit's
v17 faster partition pruning patch [1].

[1] https://www.postgresql.org/message-id/58c3e20a-a964-4fdb-4e7d-bd833e9bead1@lab.ntt.co.jp

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
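A minimal sketch of the pa_finished[] reuse described in point 1;
pa_finished itself is the real shared-memory flag array that parallel
append workers consult when choosing their next subplan, while the
surrounding variable names ('pstate', 'validsubplans') are illustrative
for this example:

    /*
     * Illustrative sketch: flag each pruned subplan as already
     * finished in shared memory, so no worker will ever pick it.
     * Because pa_finished[] lives in the shared ParallelAppendState,
     * the decision is visible to all workers, unlike a backend-local
     * Bitmapset would be.
     */
    int     i;

    for (i = 0; i < node->as_nplans; i++)
    {
        if (!bms_is_member(i, validsubplans))
            pstate->pa_finished[i] = true;
    }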
I tried this patch (applying it on Amit's last current version on top of 4e2970f8807f which is the latest it applies to) and regression tests fail with the attached diff; in all cases it appears to be an off-by-one in row count. Would you please give it a look? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 5 January 2018 at 05:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I tried this patch (applying it on Amit's last current version on top of
> 4e2970f8807f which is the latest it applies to) and regression tests
> fail with the attached diff; in all cases it appears to be an off-by-one
> in row count. Would you please give it a look?

Looks like it's down to ExplainPropertyFloat() having
machine-dependent behaviour.

On the machine that I was working with when testing this, the
following code outputs "1":

#include <stdio.h>

double nfiltered = 1.0;
double nloops = 2.0;

int main(void)
{
    printf("%.*f", 0, nfiltered / nloops);
    return 0;
}

but on your machine it must be outputting "0"?

I'm not sure if I should just try to have even row numbers in the
tests, or if we should be fixing EXPLAIN so it is consistent.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
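The divergence is most likely in how each C library rounds the exact
halfway value at precision 0; the explanation below is an assumption
based on a reading of the standard, not something verified on both
machines:

    #include <stdio.h>

    /*
     * C99 leaves the rounding of ties in printf to the implementation
     * and the current FP rounding mode.  glibc rounds half to even, so
     * 0.5 prints as "0" and 1.5 as "2"; some other libcs round half
     * away from zero, printing "1" and "2".  A value not exactly on
     * the boundary, such as 8362.0 from 16724/2, prints identically
     * everywhere, hence the idea of keeping test row counts evenly
     * divisible.
     */
    int main(void)
    {
        printf("%.0f %.0f %.0f\n", 0.5, 1.5, 2.5);
        return 0;
    }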
On 5 January 2018 at 05:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I tried this patch (applying it on Amit's last current version on top of
> 4e2970f8807f which is the latest it applies to) and regression tests
> fail with the attached diff; in all cases it appears to be an off-by-one
> in row count. Would you please give it a look?

Thanks for testing. I've attached an updated patch which hopefully
fixes this.

The only thing I did to fix it was to alter the tests a bit so that
the row counts in explain are evenly divisible by the nloops or the
number of parallel workers. Looks like it was failing due to
platform-dependent behaviour in printf.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
David Rowley wrote:
> Looks like it's down to ExplainPropertyFloat() having
> machine-dependent behaviour.
>
> On the machine that I was working with when testing this, the
> following code outputs "1":
> [ sample code ]
>
> but on your machine it must be outputting "0"?

Yeah, it does.

Thanks for updating --- I'll look at your patch tomorrow.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello,

On Fri, Jan 5, 2018 at 6:24 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 5 January 2018 at 05:37, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> I tried this patch (applying it on Amit's last current version on top of
>> 4e2970f8807f which is the latest it applies to) and regression tests
>> fail with the attached diff; in all cases it appears to be an off-by-one
>> in row count. Would you please give it a look?
>
> Thanks for testing. I've attached an updated patch which hopefully
> fixes this.
>
> The only thing I did to fix it was to alter the tests a bit so that
> the row counts in explain are evenly divisible by the nloops or the
> number of parallel workers. Looks like it was failing due to
> platform-dependent behaviour in printf.

It does not handle a change in column order (varattno) in
subpartitions. In the following case, abc_a2 has a different column
order:

drop table ab_c;
create table ab_c (a int not null, b int) partition by list(a);

--a2 with different col order
create table abc_a2 (b int, a int not null) partition by list(b);
create table abc_a2_b1 partition of abc_a2 for values in (1);
create table abc_a2_b2 partition of abc_a2 for values in (2);
create table abc_a2_b3 partition of abc_a2 for values in (3);
alter table ab_c attach partition abc_a2 for values in (2);

--a1 and a3 with same col order as the parent
create table abc_a1 partition of ab_c for values in(1) partition by list (b);
create table abc_a1_b1 partition of abc_a1 for values in (1);
create table abc_a1_b2 partition of abc_a1 for values in (2);
create table abc_a1_b3 partition of abc_a1 for values in (3);
create table abc_a3 partition of ab_c for values in(3) partition by list (b);
create table abc_a3_b1 partition of abc_a3 for values in (1);
create table abc_a3_b2 partition of abc_a3 for values in (2);
create table abc_a3_b3 partition of abc_a3 for values in (3);

deallocate abc_q1;
prepare abc_q1 (int, int, int) as select * from ab_c where a BETWEEN $1 and $2 AND b <= $3;

--optimizer pruning
explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 3, 1);
                      QUERY PLAN
------------------------------------------------------
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 3) AND (b <= 1))
   ->  Seq Scan on abc_a2_b1 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 3) AND (b <= 1))
   ->  Seq Scan on abc_a3_b1 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 3) AND (b <= 1))
(7 rows)

--runtime pruning after 5 runs
explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 3, 1);
                        QUERY PLAN
---------------------------------------------------------
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b1 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b2 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b1 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
(19 rows)

As seen, partition abc_a2 is not pruned like the other two
subpartitions, abc_a1 and abc_a3.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6 January 2018 at 07:31, Beena Emerson <memissemerson@gmail.com> wrote:
> It does not handle a change in column order (varattno) in subpartitions.

Thanks for testing and finding that. I completely overlooked applying
translation to the prune qual so that it's compatible with the
sub-partition.

I've fixed this in the attached, but I did so by calling
adjust_appendrel_attrs() from nodeAppend.c, which did, of course, mean
that the AppendRelInfo needed to be given to the executor. I was also
a bit unsure what exactly I should be doing in primnodes.h, since I've
put PartitionPruneInfo in there, but AppendRelInfo is not. I stuck a
quick declaration of AppendRelInfo in primnodes.h with an XXX comment
so we don't forget to think about that again.

In all honesty, this calling of planner code from the executor seems
like quite new ground for PostgreSQL, so I'm really not sure if we're
breaking any major rules or not with what we've got now. Perhaps the
saving grace here is that we're not teaching the executor how to do
anything smart with these data structures; they're just given to it to
pass back to the planner function at the appropriate moment.

I've also borrowed and simplified your test case to ensure this fix
remains working. I added another level of partitioning with another
partition that has the columns in a different order again. This is to
ensure the translation code translates from the quals of the previous
level up, not the top level. That's required since the AppendRelInfo
only translates one level at a time.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 7 January 2018 at 00:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've fixed this in the attached, but I did so by calling
> adjust_appendrel_attrs() from nodeAppend.c, which did, of course, mean
> that the AppendRelInfo needed to be given to the executor. I was also
> a bit unsure what exactly I should be doing in primnodes.h, since I've
> put PartitionPruneInfo in there, but AppendRelInfo is not. I stuck a
> quick declaration of AppendRelInfo in primnodes.h with an XXX comment
> so we don't forget to think about that again.

Actually, this was not a very smart fix for the problem. It seems much
better to make the prune qual part of PartitionPruneInfo and just have
the planner translate the qual to what's required for the partition
that the PartitionPruneInfo belongs to. This means we no longer need
to use the Append's qual to store the prune qual, and all the pruning
information for one partition is now neatly in a single struct.

I've attached a patch which does things like this.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
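A minimal sketch of the planner-side translation being described,
assuming adjust_appendrel_attrs() in its post-partition-wise-join form
that takes an array of AppendRelInfos; 'pinfo', 'parent_prunequals',
and 'appinfo' are illustrative names, not necessarily what the patch
uses:

    /*
     * Sketch only: while building the PartitionPruneInfo for one
     * sub-partitioned child, translate the parent's prune quals into
     * the child's attnos so the executor never has to translate.
     * 'appinfo' is the AppendRelInfo mapping parent -> this child.
     */
    PartitionPruneInfo *pinfo = makeNode(PartitionPruneInfo);

    pinfo->prunequals = (List *)
        adjust_appendrel_attrs(root,
                               (Node *) parent_prunequals,
                               1, &appinfo);

Note that adjust_appendrel_attrs() returns a freshly built tree;
assigning its result into a dedicated field, rather than back over the
caller's qual list, avoids translating the same list twice, which is
exactly the slip described a couple of mails below.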
Hello,

On Sun, Jan 7, 2018 at 5:31 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Actually, this was not a very smart fix for the problem. It seems much
> better to make the prune qual part of PartitionPruneInfo and just have
> the planner translate the qual to what's required for the partition
> that the PartitionPruneInfo belongs to. This means we no longer need
> to use the Append's qual to store the prune qual, and all the pruning
> information for one partition is now neatly in a single struct.
>
> I've attached a patch which does things like this.

The pruning does not work well with the char type.

Case 1: A subpartition has a different column order and the
subpartitioned column is of type char.

drop table ab_c;
create table ab_c (a int not null, b char) partition by list(a);
create table abc_a2 (b char, a int not null) partition by list(b);
create table abc_a2_b1 partition of abc_a2 for values in ('1');
create table abc_a2_b2 partition of abc_a2 for values in ('2');
create table abc_a2_b3 partition of abc_a2 for values in ('3');
alter table ab_c attach partition abc_a2 for values in (2);
create table abc_a1 partition of ab_c for values in(1) partition by list (b);
create table abc_a1_b1 partition of abc_a1 for values in ('1');
create table abc_a1_b2 partition of abc_a1 for values in ('2');
create table abc_a1_b3 partition of abc_a1 for values in ('3');
create table abc_a3 partition of ab_c for values in(3) partition by list (b);
create table abc_a3_b1 partition of abc_a3 for values in ('1');
create table abc_a3_b2 partition of abc_a3 for values in ('2');
create table abc_a3_b3 partition of abc_a3 for values in ('3');
deallocate abc_q1;
INSERT INTO ab_c VALUES (1,'1'), (1,'2'), (1,'3');
INSERT INTO ab_c VALUES (2,'1'), (2,'2'), (2,'3');
INSERT INTO ab_c VALUES (3,'1'), (3,'2'), (3,'3');
prepare abc_q1 (int, int, char) as select * from ab_c where a BETWEEN $1 and $2 AND b <= $3;

--after 5 runs: abc_a2_b3 is not pruned
# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 2, '2');
                        QUERY PLAN
---------------------------------------------------------
 Append (actual rows=4 loops=1)
   ->  Seq Scan on abc_a1_b1 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a1_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b1 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b2 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
         Rows Removed by Filter: 1
   ->  Seq Scan on abc_a3_b1 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
   ->  Seq Scan on abc_a3_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b <= $3))
(20 rows)

Case 2: Case with optimizer pruning.

drop table ab_c;
create table ab_c (a int not null, b int) partition by list(a);
create table abc_a2 (b int, a int not null) partition by list(b);
create table abc_a2_b1 partition of abc_a2 for values in (1);
create table abc_a2_b2 partition of abc_a2 for values in (2);
create table abc_a2_b3 partition of abc_a2 for values in (3);
alter table ab_c attach partition abc_a2 for values in (2);
create table abc_a1 partition of ab_c for values in(1) partition by list (b);
create table abc_a1_b1 partition of abc_a1 for values in (1);
create table abc_a1_b2 partition of abc_a1 for values in (2);
create table abc_a1_b3 partition of abc_a1 for values in (3);
create table abc_a3 partition of ab_c for values in(3) partition by list (b);
create table abc_a3_b1 partition of abc_a3 for values in (1);
create table abc_a3_b2 partition of abc_a3 for values in (2);
create table abc_a3_b3 partition of abc_a3 for values in (3);
deallocate abc_q1;

=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Append (actual rows=2 loops=1)
   ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a1_b3 (actual rows=1 loops=1)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a2_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a2_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a3_b2 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
   ->  Seq Scan on abc_a3_b3 (never executed)
         Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
(13 rows)

postgres=# explain (analyze, costs off, summary off, timing off)
execute abc_q1 (1, 2);
ERROR: partition missing from Append subplans

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

The mail was accidentally sent before I could complete it.

On Tue, Jan 9, 2018 at 2:24 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> Hello,
>
> The pruning does not work well with the char type.
>
> Case 2: Case with optimizer pruning.
>
> drop table ab_c;
> create table ab_c (a int not null, b int) partition by list(a);
> create table abc_a2 (b int, a int not null) partition by list(b);
> create table abc_a2_b1 partition of abc_a2 for values in (1);
> create table abc_a2_b2 partition of abc_a2 for values in (2);
> create table abc_a2_b3 partition of abc_a2 for values in (3);
> alter table ab_c attach partition abc_a2 for values in (2);
> create table abc_a1 partition of ab_c for values in(1) partition by list (b);
> create table abc_a1_b1 partition of abc_a1 for values in (1);
> create table abc_a1_b2 partition of abc_a1 for values in (2);
> create table abc_a1_b3 partition of abc_a1 for values in (3);
> create table abc_a3 partition of ab_c for values in(3) partition by list (b);
> create table abc_a3_b1 partition of abc_a3 for values in (1);
> create table abc_a3_b2 partition of abc_a3 for values in (2);
> create table abc_a3_b3 partition of abc_a3 for values in (3);
> deallocate abc_q1;

The prepared statement for Case 2 is missing from the previous mail:

prepare abc_q1 (int, int) as select a,b from ab_c where a BETWEEN $1
and $2 AND b IN (3, 2);

> =# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Append (actual rows=2 loops=1)
>    ->  Seq Scan on abc_a1_b2 (actual rows=1 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a1_b3 (actual rows=1 loops=1)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a2_b2 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a2_b3 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a3_b2 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
>    ->  Seq Scan on abc_a3_b3 (never executed)
>          Filter: ((a >= $1) AND (a <= $2) AND (b = ANY ('{3,2}'::integer[])))
> (13 rows)
>
> postgres=# explain (analyze, costs off, summary off, timing off)
> execute abc_q1 (1, 2);
> ERROR: partition missing from Append subplans

These cases work fine when the column order of the subpartitions is
not changed.

Case 3: Optimizer pruning with char types. Same as Case 1, with all
subpartitions having the same column order as the parent.
drop table ab_c;
create table ab_c (a int not null, b char) partition by list(a);
create table abc_a2 ( a int not null, b char) partition by list(b);
create table abc_a2_b1 partition of abc_a2 for values in ('1');
create table abc_a2_b2 partition of abc_a2 for values in ('2');
create table abc_a2_b3 partition of abc_a2 for values in ('3');
alter table ab_c attach partition abc_a2 for values in (2);
create table abc_a1 partition of ab_c for values in(1) partition by list (b);
create table abc_a1_b1 partition of abc_a1 for values in ('1');
create table abc_a1_b2 partition of abc_a1 for values in ('2');
create table abc_a1_b3 partition of abc_a1 for values in ('3');
create table abc_a3 partition of ab_c for values in(3) partition by list (b);
create table abc_a3_b1 partition of abc_a3 for values in ('1');
create table abc_a3_b2 partition of abc_a3 for values in ('2');
create table abc_a3_b3 partition of abc_a3 for values in ('3');
deallocate abc_q1;
prepare abc_q1 (int, int) as select a,b from ab_c where a BETWEEN $1
and $2 AND b IN ('3', '2');

-- before runtime pruning
=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 8);
                                 QUERY PLAN
---------------------------------------------------------------------------
 Append (actual rows=0 loops=1)
   ->  Seq Scan on abc_a1_b2 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a1_b3 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a2_b2 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a2_b3 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a3_b2 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
   ->  Seq Scan on abc_a3_b3 (actual rows=0 loops=1)
         Filter: ((a >= 1) AND (a <= 8) AND (b = ANY ('{3,2}'::bpchar[])))
(13 rows)

-- after 5 runs
=# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
ERROR: operator 1057 is not a member of opfamily 1976

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9 January 2018 at 22:22, Beena Emerson <memissemerson@gmail.com> wrote:
> ERROR: operator 1057 is not a member of opfamily 1976

Thanks for finding these. I'm looking into the above and the other
ones you've mentioned now.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 9, 2018 at 2:24 PM, Beena Emerson <memissemerson@gmail.com> wrote:
> prepare abc_q1 (int, int, char) as select * from ab_c where a BETWEEN
> $1 and $2 AND b <= $3;
>
> --after 5 runs: abc_a2_b3 is not pruned

This seems to be down to an existing bug. I'm not yet sure if it's
caused by faster_partition_prune_v17, or if it exists in master.
Basically, RelOptInfo->partition_rels can contain duplicate relations.
In your example, while debugging make_partition_pruneinfo I see:

list_nth_int(best_path->partitioned_rels,0)  1
list_nth_int(best_path->partitioned_rels,1)  3
list_nth_int(best_path->partitioned_rels,2)  8
list_nth_int(best_path->partitioned_rels,3)  13
list_nth_int(best_path->partitioned_rels,4)  3
list_nth_int(best_path->partitioned_rels,5)  8
list_nth_int(best_path->partitioned_rels,6)  13

There should only be 4 items in this list, not 7.
make_partition_pruneinfo might have been a bit naive to assume this
couldn't happen, so I've coded it to be a bit more resilient to this
happening. It'll still end up creating another sub-PartitionPruneInfo
and slotting it into the same place, but it'll no longer attempt to
translate the prunequals twice... which was what was causing the
problem. I'd been a bit sloppy and assigned the output of
adjust_appendrel_attrs() back to the prunequals, which is a parameter
to the function, instead of assigning it to a local variable like I've
done now.

On 9 January 2018 at 22:22, Beena Emerson <memissemerson@gmail.com> wrote:
>> postgres=# explain (analyze, costs off, summary off, timing off)
>> execute abc_q1 (1, 2);
>> ERROR: partition missing from Append subplans

This also seems to be fixed by the above fix.

> =# explain (analyze, costs off, summary off, timing off) execute abc_q1 (1, 1);
> ERROR: operator 1057 is not a member of opfamily 1976

This seems to be broken in faster_partition_prune_v17, where in
classify_partition_bounding_keys() the code properly checks if the
clause matches the partition key for OpExpr, but fails to do the same
for ScalarArrayOpExpr. I'll report it to Amit on the thread for that
patch.

I'll also investigate the duplication in RelOptInfo->partition_rels
and report that in another thread.

Can you confirm that cases 1 and 2 are working with the attached?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi,

I've attached an updated patch, now at v10. v9 was short-lived due to
the evolution of Amit's patch, on which this is based.

This version is based on Amit's v27 of faster partition pruning [1],
which can be applied atop of ad7dbee36.

I've done a self-review of this, but I've not yet done any final
polishing work as the faster partition pruning patch is still
evolving. I will, for example, likely need to do some more work in
nodeAppend.c to add a few more comments and probably think of better
names for a few new things that have made a first appearance in this
version of the patch.

[1] https://www.postgresql.org/message-id/520f8a71-286d-e36d-34cf-363fd74366e1@lab.ntt.co.jp

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Sat, Feb 17, 2018 at 2:27 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Hi,
>
> I've attached an updated patch, now at v10. v9 was short-lived due to
> the evolution of Amit's patch, on which this is based.
>
> This version is based on Amit's v27 of faster partition pruning [1],
> which can be applied atop of ad7dbee36.
Hi,
I have applied the v10 patch on Amit's v27 over head ad7dbee36. I got "ERROR: partition missing from Append subplans" with the patch; on head with only Amit's patches applied, the query works fine. Please find the test case below.
CREATE TABLE part ( c1 INT2, c2 DATE) PARTITION BY RANGE (c1);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES FROM (0) TO (141) PARTITION BY RANGE(c2);
CREATE TABLE part_p11 PARTITION OF part_p1 FOR VALUES FROM ('1/1/1997') TO ('2/1/1999');
CREATE TABLE part_p12 PARTITION OF part_p1 FOR VALUES FROM ('2/1/1999') TO ('2/1/2000');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES FROM (141) TO (211) PARTITION BY RANGE(c2);
CREATE TABLE part_p21 PARTITION OF part_p2 FOR VALUES FROM ('1/1/2000') TO ('2/1/2001');
CREATE TABLE part_p22 PARTITION OF part_p2 FOR VALUES FROM ('2/1/2001') TO ('2/1/2006');
INSERT INTO part VALUES (100,'1/1/1999');
INSERT INTO part VALUES (110,'1/1/1998');
INSERT INTO part VALUES (130,'1/1/2000');
INSERT INTO part VALUES (170,'1/1/2000');
INSERT INTO part VALUES (180,'1/1/2001');
INSERT INTO part VALUES (190,'1/1/2006');
INSERT INTO part VALUES (200,'1/1/2000');
INSERT INTO part VALUES (210,'1/1/2002');
postgres=# PREPARE RTP AS SELECT * FROM PART WHERE c2 BETWEEN '1/1/1998' AND '1/1/1999';
PREPARE
postgres=# EXPLAIN execute RTP;
QUERY PLAN
-----------------------------------------------------------------------------
Append (cost=0.00..46.00 rows=12 width=6)
-> Seq Scan on part_p11 (cost=0.00..46.00 rows=12 width=6)
Filter: ((c2 >= '1998-01-01'::date) AND (c2 <= '1999-01-01'::date))
(3 rows)
postgres=# execute RTP;
ERROR: partition missing from Append subplans
deallocate RTP;
DROP TABLE part;
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
On 20 February 2018 at 23:46, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
> I have applied the v10 patch on Amit's v27 over head ad7dbee36. I got
> "ERROR: partition missing from Append subplans" with the patch; on
> head with only Amit's patches applied, the query works fine. Please
> find the test case below.

Thanks for the test case. I can recreate it locally. This is down to
the fact that make_partition_pruneinfo() only makes a
sub-PartitionPruneInfo for actual subpaths found in the Append. Your
test case happens to match both the part_p1 and part_p2 partitions on
the first level of iteration, but since no child of part_p2 is found
in make_partition_pruneinfo, that element in the subpartindex never
gets set.

The fix might be to just remove the error and silently ignore those
cases, but I was hoping to keep that around as it might catch other
bugs. I'm just not sure yet how to do both.

I'll rebase this on Amit's latest patch and have a think about it
while doing that.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 21 February 2018 at 13:44, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 20 February 2018 at 23:46, Rajkumar Raghuwanshi
> <rajkumar.raghuwanshi@enterprisedb.com> wrote:
>> I have applied the v10 patch on Amit's v27 over head ad7dbee36. I got
>> "ERROR: partition missing from Append subplans" with the patch; on
>> head with only Amit's patches applied, the query works fine. Please
>> find the test case below.
>
> Thanks for the test case. I can recreate it locally. This is down to
> the fact that make_partition_pruneinfo() only makes a
> sub-PartitionPruneInfo for actual subpaths found in the Append. Your
> test case happens to match both the part_p1 and part_p2 partitions on
> the first level of iteration, but since no child of part_p2 is found
> in make_partition_pruneinfo, that element in the subpartindex never
> gets set.
>
> The fix might be to just remove the error and silently ignore those
> cases, but I was hoping to keep that around as it might catch other
> bugs. I'm just not sure yet how to do both.
>
> I'll rebase this on Amit's latest patch and have a think about it
> while doing that.

I ended up fixing this another way.

The patch currently builds a PartitionPruneInfo, and includes it in
the Append node, for any Append which belongs to a partitioned table
that has a non-empty set of clauses to attempt to use for pruning.
Currently, at no point during planning does the patch verify those
quals to see if they'd be any use for partition pruning during
execution. Really, to be of any use, they would need to contain at
least one Param in a clause which matches a partitioned key of that
partition or some subpartition thereof.

At the moment, these clauses are only verified during the first call
of set_valid_runtime_subplans(), which happens the first time the
choose_next_subplan() function is called in nodeAppend.c. It would be
possible to determine this during planning, but it would mean doing an
extra call of generate_partition_clauses(), most of whose work would
be thrown away, as we'd only really want to know if any Params match
the partition key. I thought it was best to do this during execution,
as then we can actually make full use of the work done in that
function and cache the result for reuse each time we need to
redetermine the newly matching subplans when any param matching a
partition key changes. I don't think this is entirely perfect, but
since we can't send the PartitionClauseInfo over to the executor from
the planner, it seems like the best option.

The way I fixed the reported error was to cache all the subplans for
the partition at the current hierarchy level; if the quals don't
contain any Params matching the partition key, then we can safely
assume that all of the subplans for that partition must match. The
planner should have pruned any unneeded partitions for this
partitioned table already, since there are no Params whose values
would only be known at run-time. Doing it this way allowed me to keep
the sanity check that alerted you to this bug in the first place.

One other thing I don't particularly like about the current patch is
how I had to construct the partition key expressions in
set_valid_runtime_subplans_recurse(). This pretty much uses code
borrowed from set_baserel_partition_key_exprs(). One way around that
would be to change the partprune.c code to deal with
partkey->partattrs and consume an expression from the list on attnum
= 0. I didn't do that as I want to minimise how much I touch Amit's
patch before it gets committed, as doing so would likely just cause
more headaches for me keeping this merged with his work. Another
option to resolve this would be to put the partition key expressions
into the PartitionPruneInfo.

I've attached v11 of the patch.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachment
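A minimal sketch of the kind of Param detection described above; both
function names are made up for illustration, and the patch's real test
must additionally match the Param against the partition key rather
than merely find one anywhere in the quals:

    #include "postgres.h"
    #include "nodes/nodeFuncs.h"
    #include "nodes/pg_list.h"

    /*
     * Illustrative only: returns true if 'node' contains any Param.
     * If a partitioned table's prune quals contain none, the planner
     * has already pruned everything it could, so the executor can
     * assume all remaining subplans at this level match.
     */
    static bool
    contains_param_walker(Node *node, void *context)
    {
        if (node == NULL)
            return false;
        if (IsA(node, Param))
            return true;
        return expression_tree_walker(node, contains_param_walker,
                                      context);
    }

    static bool
    quals_contain_param(List *quals)
    {
        /* expression_tree_walker descends through List nodes too */
        return contains_param_walker((Node *) quals, NULL);
    }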
On Wed, Feb 21, 2018 at 2:36 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
Hi,
I have applied the attached patch on head "6f1d723b6359507ef55a81617167507bc25e3e2b" over Amit's v30 patches. While testing further, I got a server crash with the below test case. Please take a look.
CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0, 599, 2) i;
CREATE INDEX iprt1_p1_a on prt1_p1(a);
CREATE INDEX iprt1_p2_a on prt1_p2(a);
CREATE INDEX iprt1_p3_a on prt1_p3(a);
ANALYZE prt1;
CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (250) TO (500);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (500) TO (600);
INSERT INTO prt2 SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0, 599, 3) i;
CREATE INDEX iprt2_p1_b on prt2_p1(b);
CREATE INDEX iprt2_p2_b on prt2_p2(b);
CREATE INDEX iprt2_p3_b on prt2_p3(b);
ANALYZE prt2;
CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0000', '0003', '0004', '0010');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0001', '0005', '0002', '0009');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN ('0006', '0007', '0008', '0011');
INSERT INTO plt1 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 599, 2) i;
CREATE INDEX iplt1_p1_c on plt1_p1(c);
CREATE INDEX iplt1_p2_c on plt1_p2(c);
CREATE INDEX iplt1_p3_c on plt1_p3(c);
ANALYZE plt1;
CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0000', '0003', '0004', '0010');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0001', '0005', '0002', '0009');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0006', '0007', '0008', '0011');
INSERT INTO plt2 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 599, 3) i;
CREATE INDEX iplt2_p1_c on plt2_p1(c);
CREATE INDEX iplt2_p2_c on plt2_p2(c);
CREATE INDEX iplt2_p3_c on plt2_p3(c);
ANALYZE plt2;
select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 t1,prt2 t2 where t1.a=t2.b)
and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);
/*
postgres=# select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 t1,prt2 t2 where t1.a=t2.b)
postgres-# and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
*/
Stack trace given below:
/*
(gdb) bt
#0 0x00000000006ce6dc in ExecEvalParamExec (state=0x26e9ee0, op=0x26e9f78, econtext=0x26ea390) at execExprInterp.c:2222
#1 0x00000000006cc66a in ExecInterpExpr (state=0x26e9ee0, econtext=0x26ea390, isnull=0x7ffe0f75d77f "") at execExprInterp.c:1024
#2 0x00000000006cdd8c in ExecInterpExprStillValid (state=0x26e9ee0, econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at execExprInterp.c:1819
#3 0x00000000007db078 in ExecEvalExprSwitchContext (state=0x26e9ee0, econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at ../../../../src/include/executor/executor.h:305
#4 0x00000000007e2072 in evaluate_expr (expr=0x26a3cb0, result_type=25, result_typmod=-1, result_collation=0) at clauses.c:4890
#5 0x00000000007e588a in partkey_datum_from_expr (context=0x26d3180, parttypid=25, expr=0x26a3cb0, value=0x7ffe0f75da00) at partprune.c:1504
#6 0x00000000007e5243 in extract_bounding_datums (context=0x26d3180, minimalclauses=0x7ffe0f75d900, keys=0x7ffe0f75da00) at partprune.c:1307
#7 0x00000000007e377d in get_partitions_from_clauses (context=0x26d3180) at partprune.c:273
#8 0x00000000006ea2ec in set_valid_runtime_subplans_recurse (node=0x269bf90, pinfo=0x7f6cf6765cf0, ctxcache=0x26d3158, validsubplans=0x7ffe0f75de10) at nodeAppend.c:771
#9 0x00000000006e9ebf in set_valid_runtime_subplans (node=0x269bf90) at nodeAppend.c:640
#10 0x00000000006e99b5 in choose_next_subplan_locally (node=0x269bf90) at nodeAppend.c:426
#11 0x00000000006e9598 in ExecAppend (pstate=0x269bf90) at nodeAppend.c:224
#12 0x00000000006deb3a in ExecProcNodeFirst (node=0x269bf90) at execProcnode.c:446
#13 0x00000000006fb9ee in ExecProcNode (node=0x269bf90) at ../../../src/include/executor/executor.h:239
#14 0x00000000006fbcc4 in ExecHashJoinImpl (pstate=0x2697808, parallel=0 '\000') at nodeHashjoin.c:262
#15 0x00000000006fc3fd in ExecHashJoin (pstate=0x2697808) at nodeHashjoin.c:565
#16 0x00000000006deb3a in ExecProcNodeFirst (node=0x2697808) at execProcnode.c:446
#17 0x000000000070c376 in ExecProcNode (node=0x2697808) at ../../../src/include/executor/executor.h:239
#18 0x000000000070c70e in ExecNestLoop (pstate=0x262c0a0) at nodeNestloop.c:160
#19 0x00000000006deb3a in ExecProcNodeFirst (node=0x262c0a0) at execProcnode.c:446
#20 0x00000000006fb9ee in ExecProcNode (node=0x262c0a0) at ../../../src/include/executor/executor.h:239
#21 0x00000000006fbcc4 in ExecHashJoinImpl (pstate=0x262bec8, parallel=0 '\000') at nodeHashjoin.c:262
#22 0x00000000006fc3fd in ExecHashJoin (pstate=0x262bec8) at nodeHashjoin.c:565
#23 0x00000000006deb3a in ExecProcNodeFirst (node=0x262bec8) at execProcnode.c:446
#24 0x00000000006ea5bd in ExecProcNode (node=0x262bec8) at ../../../src/include/executor/executor.h:239
#25 0x00000000006eaab0 in fetch_input_tuple (aggstate=0x262ba18) at nodeAgg.c:406
#26 0x00000000006ecd40 in agg_retrieve_direct (aggstate=0x262ba18) at nodeAgg.c:1736
#27 0x00000000006ec932 in ExecAgg (pstate=0x262ba18) at nodeAgg.c:1551
#28 0x00000000006deb3a in ExecProcNodeFirst (node=0x262ba18) at execProcnode.c:446
#29 0x00000000006d59cd in ExecProcNode (node=0x262ba18) at ../../../src/include/executor/executor.h:239
#30 0x00000000006d8326 in ExecutePlan (estate=0x262b7c8, planstate=0x262ba18, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0,
direction=ForwardScanDirection, dest=0x7f6cf676c7f0, execute_once=1 '\001') at execMain.c:1721
#31 0x00000000006d5f9f in standard_ExecutorRun (queryDesc=0x258aa98, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at execMain.c:361
#32 0x00000000006d5dbb in ExecutorRun (queryDesc=0x258aa98, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at execMain.c:304
#33 0x00000000008b588b in PortalRunSelect (portal=0x25caa58, forward=1 '\001', count=0, dest=0x7f6cf676c7f0) at pquery.c:932
#34 0x00000000008b5519 in PortalRun (portal=0x25caa58, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x7f6cf676c7f0, altdest=0x7f6cf676c7f0,
completionTag=0x7ffe0f75e5e0 "") at pquery.c:773
#35 0x00000000008af540 in exec_simple_query (
query_string=0x2565728 "select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1 t1,prt2 t2 where t1.a=t2.b) \nand (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);") at postgres.c:1120
#36 0x00000000008b37d4 in PostgresMain (argc=1, argv=0x25910e0, dbname=0x2590f40 "postgres", username=0x2562228 "edb") at postgres.c:4144
#37 0x0000000000812afa in BackendRun (port=0x2588ea0) at postmaster.c:4412
#38 0x000000000081226e in BackendStartup (port=0x2588ea0) at postmaster.c:4084
*/
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
On 21 February 2018 at 22:45, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
> select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1
> t1,prt2 t2 where t1.a=t2.b)
> and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

Thanks for the test case. It seems like the x.c Param for some reason has a ParamId of 0. I need to go learn a bit more about Params to understand why this is.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David.

On 2018/02/21 18:06, David Rowley wrote:
> Other things I don't particularly like about the current patch are how
> I had to construct the partition key expressions in
> set_valid_runtime_subplans_recurse(). This pretty much uses code
> borrowed from set_baserel_partition_key_exprs(). One way around that
> would be to change the partprune.c code to deal with the
> partkey->partattrs and consume an expression from the list on attnum =
> 0. I didn't do that as I want to minimise how much I touch Amit's
> patch before it gets committed; doing so would likely just cause
> more headaches for me keeping this in sync with his work. Another
> option to resolve this would be to put the partition key expressions
> into the PartitionPruneInfo.

Another way could be to refactor the code you've borrowed from set_baserel_partition_key_exprs() into its own function that is exported by some optimizer header. I removed PartitionKey/Relation from the signatures of various functions of my patch to avoid having to re-heap_open() the relation, per [1].

> I've attached v11 of the patch.

Some comments:

* I noticed that the patch adds a function to bitmapset.c which you might want to extract into its own patch, like your patch to add bms_add_range() that got committed as 84940644d [2].

* Maybe it's better to add `default: break;` in the two switches you've added in partkey_datum_from_expr():

partprune.c: In function ‘partkey_datum_from_expr’:
partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in switch [-Wswitch]
  switch (nodeTag(expr))
partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled in switch [-Wswitch]
  switch (((Param *) expr)->paramkind)

* I see that you moved PartitionClauseInfo's definition from partprune.c to partition.h. Isn't it better to move it to partprune.h?

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=84940644d
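The suggested fix is the usual idiom for keeping -Wswitch quiet while still enumerating only the node tags of interest; roughly this shape (case bodies elided):

switch (nodeTag(expr))
{
	case T_Const:
		/* extract the datum directly from the Const */
		break;
	case T_Param:
		/* look the datum up from the Param at run time */
		break;
	default:
		break;			/* not an expression type we can handle */
}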
On 22 February 2018 at 22:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Some comments:

Hi Amit,

Thanks for looking at this. I'll work through your comments and produce a patch sometime in the near future.

One problem that I'm facing now is down to the way I'm gathering the ParamIds that match the partkeys. As you'll see from the patch, I've added a 'paramids' field to PartitionPruneContext and I'm populating this when the clauses are being pre-processed in extract_partition_clauses(). The problem is that the or_clauses are not pre-processed at all, so the current patch will not properly perform run-time pruning when the Params are "hidden" in OR branches.

One way I thought to fix this was to change the clause processing to create an array of PartitionClauseInfos, one for each OR branch. This would also improve the performance of the run-time pruning, meaning that all of the OR clauses would already be matched to the partition keys once, rather than having to redo that again each time a Param changes its value.

If I go and write a patch to do that, would you want it in your patch, or would you rather I kept it over here? Or perhaps you have a better idea on how to solve...?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David,

On 02/21/2018 04:06 AM, David Rowley wrote:
> I've attached v11 of the patch.

Are UPDATE and DELETE supposed to be supported?

With

-- test.sql --
CREATE TABLE test (a integer NOT NULL, b integer) PARTITION BY HASH(a);
CREATE TABLE test_p00 PARTITION OF test FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE test_p01 PARTITION OF test FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE INDEX idx_test_a ON test (a);
CREATE INDEX idx_test_b ON test (b);
INSERT INTO test (SELECT i,i FROM generate_series(1, 1000000) AS i);
ANALYZE;
-- test.sql --

and

UPDATE test SET b = 1 WHERE a = ?
DELETE FROM test WHERE a = ?

both show that all partitions are scanned:

Update on test
  Update on test_p00
  Update on test_p01
  ->  Index Scan using test_p00_a_idx on test_p00
        Index Cond: (a = 1)
  ->  Index Scan using test_p01_a_idx on test_p01
        Index Cond: (a = 1)

Using prune_v32 and runtime_v11 with conflicts resolved.

Best regards,
Jesper
On 23 February 2018 at 01:15, David Rowley <david.rowley@2ndquadrant.com> wrote:
> One problem that I'm facing now is down to the way I'm gathering the
> ParamIds that match the partkeys. As you'll see from the patch, I've
> added a 'paramids' field to PartitionPruneContext and I'm populating
> this when the clauses are being pre-processed in
> extract_partition_clauses(). The problem is that the or_clauses are
> not pre-processed at all, so the current patch will not properly
> perform run-time pruning when the Params are "hidden" in OR branches.
>
> One way I thought to fix this was to change the clause processing to
> create an array of PartitionClauseInfos, one for each OR branch. This
> would also improve the performance of the run-time pruning, meaning
> that all of the OR clauses would already be matched to the partition
> keys once, rather than having to redo that again each time a Param
> changes its value.
>
> If I go and write a patch to do that, would you want it in your patch,
> or would you rather I kept it over here? Or perhaps you have a better
> idea on how to solve...?

Hi Amit,

I've attached a patch which does this. For now, the patch is only intended to assist in the discussion of the above idea.

The patch is based on a WIP version of run-time pruning that I'm not quite ready to post yet, but with a small amount of work you could probably take it and base it on your faster partition pruning v31 patch [1].

I ended up pulling the PartitionPruneInfo out of the PartitionPruneContext. This was required due to how I've now made extract_partition_clauses() recursively call itself. We don't want to overwrite the context's clauseinfo with the one from the recursive call. A side effect of this is that the subcontext is no longer required when processing the OR clauses. You only did this so that the context's clauseinfo was not overwritten. I also think it's better to separate out the inputs and outputs; anything in the context was more intended to be an input field.

Let me know your thoughts about this. If you don't want it for faster partition pruning, then I'll probably go and tidy it up and include it for run-time pruning.

[1] https://www.postgresql.org/message-id/00ae2273-bb6b-1287-9ebc-5459b37c9078%40lab.ntt.co.jp

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
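To illustrate the idea under discussion, pre-processing each OR arm into its own clause-info struct might look roughly like the sketch below. This is hypothetical only: preprocess_or_branches() and preprocess_partition_clauses() are invented names, not the patch's actual API.

/*
 * Hypothetical sketch: pre-process each arm of an OR clause into its
 * own PartitionClauseInfo, so matching clauses to the partition key is
 * done once rather than every time a Param changes value.
 */
static List *
preprocess_or_branches(PartitionPruneContext *context, BoolExpr *orclause)
{
	List	   *branchinfos = NIL;
	ListCell   *lc;

	foreach(lc, orclause->args)
	{
		Expr	   *arm = (Expr *) lfirst(lc);
		PartitionClauseInfo *info;

		/* stand-in for a recursive extract_partition_clauses() call */
		info = preprocess_partition_clauses(context, list_make1(arm));
		branchinfos = lappend(branchinfos, info);
	}

	return branchinfos;
}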
Hi David.

On 2018/02/23 0:11, David Rowley wrote:
> On 23 February 2018 at 01:15, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> One problem that I'm facing now is down to the way I'm gathering the
>> ParamIds that match the partkeys. As you'll see from the patch, I've
>> added a 'paramids' field to PartitionPruneContext and I'm populating
>> this when the clauses are being pre-processed in
>> extract_partition_clauses(). The problem is that the or_clauses are
>> not pre-processed at all, so the current patch will not properly
>> perform run-time pruning when the Params are "hidden" in OR branches.
>>
>> One way I thought to fix this was to change the clause processing to
>> create an array of PartitionClauseInfos, one for each OR branch. This
>> would also improve the performance of the run-time pruning, meaning
>> that all of the OR clauses would already be matched to the partition
>> keys once, rather than having to redo that again each time a Param
>> changes its value.
>>
>> If I go and write a patch to do that, would you want it in your patch,
>> or would you rather I kept it over here? Or perhaps you have a better
>> idea on how to solve...?
>
> I've attached a patch which does this. For now, the patch is only
> intended to assist in the discussion of the above idea.
>
> The patch is based on a WIP version of run-time pruning that I'm not
> quite ready to post yet, but with a small amount of work you could
> probably take it and base it on your faster partition pruning v31
> patch [1].
>
> I ended up pulling the PartitionPruneInfo out of the
> PartitionPruneContext. This was required due to how I've now made
> extract_partition_clauses() recursively call itself. We don't want to
> overwrite the context's clauseinfo with the one from the recursive
> call. A side effect of this is that the subcontext is no longer
> required when processing the OR clauses. You only did this so that the
> context's clauseinfo was not overwritten. I also think it's better to
> separate out the inputs and outputs; anything in the context was more
> intended to be an input field.
>
> Let me know your thoughts about this. If you don't want it for faster
> partition pruning, then I'll probably go and tidy it up and include it
> for run-time pruning.

Thanks for the patch. I don't have time today to look at the patch closely, but at first blush, it seems like something I should incorporate into my own patch, because it's restructuring the partprune.c code. I will study the issue and your proposed solution in the form of this restructuring more closely over the weekend and reply (probably with a new version of my patch) on Monday.

Thanks,
Amit
On 22 February 2018 at 22:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Some comments:
>
> * I noticed that the patch adds a function to bitmapset.c which you might
> want to extract into its own patch, like your patch to add bms_add_range()
> that got committed as 84940644d [2].

I've made that 0001 in the series.

> * Maybe it's better to add `default: break;` in the two switches
> you've added in partkey_datum_from_expr():
>
> partprune.c: In function ‘partkey_datum_from_expr’:
> partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in
> switch [-Wswitch]
>   switch (nodeTag(expr))
>
> partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled
> in switch [-Wswitch]
>   switch (((Param *) expr)->paramkind)

I wasn't aware of that gcc flag. I was also surprised to see a clean compile from master with it enabled. This area has been changed a bit from the last patch, but the remaining switch now has a default: return false;

> * I see that you moved PartitionClauseInfo's definition from partprune.c
> to partition.h. Isn't it better to move it to partprune.h?

Moved. I'd done it the other way to try to reduce the number of planner headers included in the executor, but I will defer to your better judgement, as I see you're busy working on improving this area in another patch set.

I've attached an updated set of patches.

I hope this also addresses Rajkumar's reported crash. I ended up making some changes to how the Param values are determined, by reusing more of the existing executor code rather than duplicating it in partkey_datum_from_expr. I really could use a sanity check on my changes to that function now, especially the cross-type portion.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
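Reusing the executor here presumably means initialising the expression once and evaluating it through the regular interpreter rather than hand-rolling the Param lookup. A minimal sketch of that pattern ("expr", "parent" and "econtext" are assumed to come from the surrounding plumbing):

/*
 * Sketch: evaluate an expression that may contain Params using the
 * standard executor machinery instead of duplicating the logic.
 */
ExprState  *exprstate = ExecInitExpr(expr, parent);
Datum		value;
bool		isnull;

value = ExecEvalExprSwitchContext(exprstate, econtext, &isnull);
if (!isnull)
{
	/* use "value" as the datum to compare against partition bounds */
}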
On 23 February 2018 at 04:11, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Are UPDATE and DELETE supposed to be supported?

To be honest, I had not even considered those. Without looking in detail, I imagine it may be possible to allow this simply by setting the AppendPath->trypartitionprune in the correct cases in inheritance_planner(). I would need to look into this in some detail to find out for sure.

Another case which is likely simple to implement is the exact same processing for MergeAppends. I currently see no reason why the same pruning cannot be done for subnodes of that node type too; I've just not done so yet.

I'd rather get more sanity-check reviews on the current scope of the patch before I widen it out to other areas, but at the same time I also don't want to leave very simple things to PG12 which can easily be done in PG11. So I'll try to look at this and get back to you, or perhaps release a new set of patches to support the additional features.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 23 February 2018 at 11:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 23 February 2018 at 04:11, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Are UPDATE and DELETE supposed to be supported?
>
> To be honest, I had not even considered those.

I can say that I had considered those. Handling of UPDATE and DELETE with partitions is significantly different, so it's not just an oversight; it's a different branch of the project.

We need SELECT to work first, and then we have a chance of making UPDATE/DELETE work.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
(Sorry, I had mistakenly replied only to Simon on Friday)

On Fri, Feb 23, 2018 at 9:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 23 February 2018 at 11:40, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 23 February 2018 at 04:11, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>> Are UPDATE and DELETE supposed to be supported?
>>
>> To be honest, I had not even considered those.
>
> I can say that I had considered those. Handling of UPDATE and DELETE
> with partitions is significantly different, so it's not just an
> oversight; it's a different branch of the project.
>
> We need SELECT to work first, and then we have a chance of making
> UPDATE/DELETE work.

+1

Thanks,
Amit
On 2018/02/23 20:40, David Rowley wrote:
> On 23 February 2018 at 04:11, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Are UPDATE and DELETE supposed to be supported?
>
> To be honest, I had not even considered those. Without looking in
> detail, I imagine it may be possible to allow this simply by setting
> the AppendPath->trypartitionprune in the correct cases in
> inheritance_planner(). I would need to look into this in some detail
> to find out for sure.

I guess you meant in ModifyTablePath.

Thanks,
Amit
Hi David.

On 2018/02/23 20:34, David Rowley wrote:
> On 22 February 2018 at 22:31, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Some comments:
>>
>> * I noticed that the patch adds a function to bitmapset.c which you might
>> want to extract into its own patch, like your patch to add bms_add_range()
>> that got committed as 84940644d [2].
>
> I've made that 0001 in the series.

Thanks.

> I've attached an updated set of patches.
>
> I hope this also addresses Rajkumar's reported crash. I ended up making
> some changes to how the Param values are determined, by reusing more of
> the existing executor code rather than duplicating it in
> partkey_datum_from_expr. I really could use a sanity check on my
> changes to that function now, especially the cross-type portion.

I've incorporated portions of 0002 and 0003 into my patch on the other thread (v34) posted at [1]. That is, mostly the changes around handling OR clauses and the interface changes resulting from it.

Attached are revised versions of your patches after the aforementioned rearrangements. Note that after I took out the optimizer portion of the 0003 patch to incorporate it into my patch (the OR clause processing bits), not much was left in it, so I squashed it into 0002. So there are only 0001 and 0002.

As a review comment on 0002, I think trypartitionprune is better written as try_partition_prune.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/158f04ce-9deb-0457-ddcc-78fb73db4ebc%40lab.ntt.co.jp
Attachment
On 27 February 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've incorporated portions of 0002 and 0003 into my patch on the other
> thread (v34) posted at [1]. That is, mostly the changes around handling
> OR clauses and the interface changes resulting from it.

Thanks. I was just in the middle of swapping the order of the patches so that the OR clause patch was directly based on yours.

> Attached are revised versions of your patches after the aforementioned
> rearrangements. Note that after I took out the optimizer portion of the
> 0003 patch to incorporate it into my patch (the OR clause processing
> bits), not much was left in it, so I squashed it into 0002. So there
> are only 0001 and 0002.

I've locally got a patch which is significantly different from the v12 patch; it moves lots of code into nodePartition.c and fixes up the missing node read/write functions too.

> As a review comment on 0002, I think trypartitionprune is better written
> as try_partition_prune.

That no longer exists in the new version...

Will post soonish, just need to base it all on your v34 [1] now! :)

[1] https://www.postgresql.org/message-id/158f04ce-9deb-0457-ddcc-78fb73db4ebc%40lab.ntt.co.jp

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I've attached v14 of the patch. This is based on Amit's faster partition pruning patch v35 [1].

There's quite a bit of code churn in this version from the last one. I've now moved most of the run-time pruning code into execPartition.c and aimed to make it more generic, so it can apply to node types other than Append. I also had to make a few changes to the PartitionPruneInfo node type so that it could get some support in readfuncs.c and outfuncs.c, which I had previously missed.

As proof that the code in execPartition.c is fairly generic and applies to any subnode type that supports a List of subnodes, I went ahead and wrote a small additional patch to add support for run-time pruning for MergeAppend. This just takes an extra 100 lines of code.

I've also split the patch out a bit more into logical parts in the hope it makes things easier to review.

[1] https://www.postgresql.org/message-id/0f96dd16-f5d5-7301-4ddf-858d41a6cbe3@lab.ntt.co.jp

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
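For readers unfamiliar with what "support in readfuncs.c and outfuncs.c" involves: each field of a plan node gets serialised and deserialised with the WRITE_*/READ_* macro families. A sketch of the shape only; the field names below are invented for illustration, not the patch's actual struct:

/* In outfuncs.c -- field names here are purely illustrative */
static void
_outPartitionPruneInfo(StringInfo str, const PartitionPruneInfo *node)
{
	WRITE_NODE_TYPE("PARTITIONPRUNEINFO");

	WRITE_OID_FIELD(reloid);				/* partitioned table's OID */
	WRITE_NODE_FIELD(prunesteps);			/* a sub-Node */
	WRITE_BITMAPSET_FIELD(present_parts);	/* a Bitmapset */
	WRITE_INT_FIELD(nparts);
}

A matching _readPartitionPruneInfo() in readfuncs.c would use the corresponding READ_* macros for the same fields, in the same order.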
Hi David,

On 03/01/2018 08:04 AM, David Rowley wrote:
> I've also split the patch out a bit more into logical parts in the
> hope it makes things easier to review.

A small typo in 0001:

+ * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one bit in a

..."_one_"...

0004 fails "make check-world" due to:

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE boolp_f jpedersen
pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at or near "false"
LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

Do you require https://commitfest.postgresql.org/17/1410/ as well?

I'll look more at 0002-0005 over the coming days.

Thanks for working on this!

Best regards,
Jesper
On 2 March 2018 at 07:17, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> A small typo in 0001:
>
> + * leftmost_ons_pos[x] gives the bit number (0-7) of the leftmost one bit
> in a
>
> ..."_one_"...

Oops. I'll fix that.

> 0004 fails "make check-world" due to
>
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
> boolp_f jpedersen
> pg_restore: [archiver (db)] could not execute query: ERROR: syntax error at
> or near "false"
> LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);

The tests seem to have stumbled on a pg_dump bug which causes it to produce syntax that's not valid (currently).

I should be able to stop my patch failing the test by dropping that table, which I should have been doing anyway.

> Do you require https://commitfest.postgresql.org/17/1410/ as well?

I'll look at that thread and see if there's any discussion of pg_dump being broken.

> I'll look more at 0002-0005 over the coming days.

Thanks for the review, and in advance for the future review. I'll delay releasing a new patch as there's some discussion over on the faster partition pruning thread which affects this too [1].

[1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David,

On 03/01/2018 08:29 PM, David Rowley wrote:
>> 0004 fails "make check-world" due to
>>
>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>> pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
>> boolp_f jpedersen
>> pg_restore: [archiver (db)] could not execute query: ERROR: syntax error at
>> or near "false"
>> LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);
>
> The tests seem to have stumbled on a pg_dump bug which causes it to
> produce syntax that's not valid (currently).
>
> I should be able to stop my patch failing the test by dropping that
> table, which I should have been doing anyway.

I've added that thread to the Open Items list.

> Thanks for the review, and in advance for the future review.
>
> I'll delay releasing a new patch as there's some discussion over on
> the faster partition pruning thread which affects this too [1]
>
> [1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com

Sure, 0003-0005 depend on that thread. 0002 is refactoring, so that one is ready.

In 0004 should we add a HASH based test case?

-- test.sql --
-- verify pruning in hash partitions
create table hashp (a int primary key, b int) partition by hash (a);
create table hashp_0 partition of hashp for values with (modulus 2, remainder 0);
create table hashp_1 partition of hashp for values with (modulus 2, remainder 1);
insert into hashp values (0,0), (1,1), (2,2), (3,3);
prepare q1 (int) as select * from hashp where a = $1;
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
explain (analyze, costs off, summary off, timing off) execute q1 (1);
explain (analyze, costs off, summary off, timing off) execute q1 (3);
deallocate q1;
drop table hashp;
-- test.sql --

Also, should 0004 consider the "Parallel Append" case, aka

-- parallel.sql --
CREATE TABLE t1 (
  a integer NOT NULL,
  b integer NOT NULL
) PARTITION BY HASH (b);
CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 3);
INSERT INTO t1 (SELECT i, i FROM generate_series(1, 1000000) AS i);
PREPARE q1 (int) AS SELECT * FROM t1 WHERE a = $1;
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) EXECUTE q1 (5432);
DEALLOCATE q1;
DROP TABLE t1;
-- parallel.sql --

Best regards,
Jesper
On 10 March 2018 at 07:50, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> On 03/01/2018 08:29 PM, David Rowley wrote:
>> I'll delay releasing a new patch as there's some discussion over on
>> the faster partition pruning thread which affects this too [1]
>>
>> [1] https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=MTcXKxTwBnsLRHD3rig@mail.gmail.com
>
> Sure, 0003-0005 depend on that thread. 0002 is refactoring, so that one is
> ready.

Okay, back to this again. The faster partition pruning patch has undergone a large refactor which caused me to delay basing this patch on top of it until that patch had stabilised again. It's now getting close, so this seems like a good time to send a new version of this.

> In 0004 should we add a HASH based test case?

I don't think we really need to test each partition type in too much detail in the run-time pruning patch. I'm more interested in testing multiple partition levels; there's important new code in this patch which does need lots of additional testing for multi-level partitions. I think adding a HASH partition case would be aiming more to test the planner's pruning code, to which this patch makes no behavioural changes.

> Also, should 0004 consider the "Parallel Append" case, aka

There is a parallel Append test case in there already; see the queries below the "-- parallel append" comment.

I've attached a new version of the patch. I'm now at v18 after having some versions of the patch that I didn't release which were based on various versions of Amit's faster partition pruning patch. The attached patchset is based on Amit's v45 faster partition pruning [1].

I've made a few changes since the v14 version. Since Amit's v45 patch now creates the partition pruning details in a data structure that can be copied from the planner over to the executor, it means this patch is now able to do much of the setup work for run-time pruning in the planner. Doing this now allows us to determine if run-time pruning is even possible at plan time, rather than the executor attempting it and sometimes wasting effort when it fails to find Params matching the partition key.

Another change from the last version is that I've separated out the handling of exec Params and external Params. The new patch now will perform a pruning step at executor startup if some external params match the partition key. This is very beneficial to a PREPAREd OLTP type query against a partitioned table as it means we can skip sub-plan initialisation for all non-matching partitions.

Initialising Append/MergeAppend/ModifyTable nodes with fewer subnodes than were originally planned did require a small change in explain.c in some code that was assuming the subnode arrays were the same length as the subplan list. I also ended up adding an Int property to EXPLAIN to show the number of subnodes that have been removed during execution. Unfortunately, there is a small corner case with this: where all subnodes are removed, it leaves EXPLAIN with no subnodes to search for outer side Vars in. I didn't really see any sane way to handle this in EXPLAIN, so I think the best fix for this is what I've done, and that's just to always initialise at least 1 subnode, even when none match the external Params. Cost-wise, I don't think this is such a big deal as the cost savings here are coming from saving on initialising tens or hundreds of subnodes, not 1.
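The "always initialise at least 1 subnode" fallback is presumably just a guard applied after the start-up pruning step; a sketch of the idea, not the patch's exact code:

/*
 * Sketch: if pruning with the external Params removed every subplan,
 * keep the first one anyway so that EXPLAIN still has a subnode to
 * resolve outer-side Vars against.
 */
if (bms_is_empty(validsubplans))
	validsubplans = bms_add_member(validsubplans, 0);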
To put the new patch to the test, I tried pgbench -S -M prepared -s 100 with and without having modified pgbench_accounts to separate into 10 RANGE partitions of equal size.

A non-partitioned table was getting 12503 TPS.
With partitioned tables, the old version of this patch was getting: 5470 TPS.
With partitioned tables, the attached version gets 11247 TPS.
For perspective, today's master with a partitioned table gets 4719 TPS.

So you can see it's a pretty good performance boost by skipping initialisation of the 9 non-matching subplans. It's not hard to imagine the gains getting more significant with a larger number of partitions.

Ideally, the performance of a partitioned table would be faster than the non-partitioned case, but please remember this query is a single row PK lookup SELECT, so it is a very fast query in both cases. Partitioning cases should improve more as the table grows and indexes struggle to fit in RAM, or when many rows are being taken from the partition and aggregated.

Unfortunately, the story is not quite as good for the non "-M prepared" version of the benchmark. This shows that planning time of partitioned table queries could still use some improvement. Amit's v45 patch certainly makes a dent in the slow planner performance here, but it's still nothing like as fast as the non-partitioned case. More work is required there in the future.

This patchset also contains a patch to improve the performance of inheritance planning of UPDATE/DELETE queries, and it implements run-time pruning for UPDATE/DELETE as well. This brings a significant performance improvement for planning of UPDATE/DELETE operations on partitioned tables with a large number of partitions. Performance is as follows (values in transactions per second):

Partitions = 1     Unpatched: 7323.3  Patched: 6573.2  (-10.24%)
Partitions = 2     Unpatched: 6784.8  Patched: 6377.1  (-6.01%)
Partitions = 4     Unpatched: 5903.0  Patched: 6106.8  (3.45%)
Partitions = 8     Unpatched: 4582.0  Patched: 5579.9  (21.78%)
Partitions = 16    Unpatched: 3131.5  Patched: 4521.2  (44.38%)
Partitions = 32    Unpatched: 1779.8  Patched: 3387.8  (90.35%)
Partitions = 64    Unpatched: 821.9   Patched: 2245.4  (173.18%)
Partitions = 128   Unpatched: 322.2   Patched: 1319.6  (309.56%)
Partitions = 256   Unpatched: 84.3    Patched: 731.7   (768.27%)
Partitions = 512   Unpatched: 22.5    Patched: 382.8   (1597.74%)
Partitions = 1024  Unpatched: 5.5     Patched: 150.1   (2607.83%)

This shows a small regression in planner performance of 10% for a table with 1 partition, but performance starts improving again by just 4 partitions. By the time we get to 1024 partitions, the patched planner is 26 times faster than unpatched. It's likely not a no-brainer, since some people may get a small performance regression, but it certainly makes having larger numbers of partitions much more usable than it was previously.

There are still a few bugs lingering in v45 of Amit's faster partition pruning patch, one of which is causing a regression test to fail in the attached patch set. I've also attached a patch which must be applied after Amit's v45 patchset to fix up a couple of things it is missing. Hopefully, this won't be needed once v46 is out.

To test this, apply all patches in [1], then apply faster_part_prune_v45_fixups.patch, then the attached 0001-0005 patches.

[1] https://www.postgresql.org/message-id/fc73cef4-6879-26c3-6859-2f910640234a@lab.ntt.co.jp

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- faster_part_prune_v45_fixups.patch
- v18-0001-Provide-infrastructure-to-allow-partition-prunin.patch
- v18-0002-Add-bms_prev_member-function.patch
- v18-0003-Allow-Append-subnodes-to-be-pruned-during-execut.patch
- v18-0004-Allow-MergeAppend-s-subnodes-to-be-pruned-during.patch
- v18-0005-Improve-planning-speed-of-partitioned-table-UPDA.patch
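Regarding the 0002 patch: bms_prev_member() is the reverse-iteration counterpart of bms_next_member(). Assuming the final form matches what was later committed, the intended usage pattern is:

/* Visit a Bitmapset's members from highest to lowest. */
int		x;

x = -1;						/* -1 asks for the largest member first */
while ((x = bms_prev_member(inputset, x)) >= 0)
{
	/* process member x */
}
/* the loop ends when bms_prev_member() returns -2 (no smaller member) */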
David Rowley wrote:

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
>
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.
>
> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans. It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions.

This is excellent news!

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi David,

On 03/31/2018 09:52 AM, David Rowley wrote:
> I've attached a new version of the patch. I'm now at v18 after having
> some versions of the patch that I didn't release which were based on
> various versions of Amit's faster partition pruning patch.

Thank you for the updated patch set!

I have tested this together with Amit's v46 patch.

The attached case doesn't trigger a generic plan, so basically all time is spent in GetCachedPlan.

The standard table case (std.sql) gives:

  generic_cost = 8.4525
  avg_custom_cost = 13.4525
  total_custom_cost = 67.2625

whereas the 64 hash partition case (hash.sql) gives:

  generic_cost = 540.32
  avg_custom_cost = 175.9425
  total_custom_cost = 879.7125

I tested with pgbench -M prepared -f select.sql.

Also, I'm seeing a regression for check-world in src/test/regress/results/inherit.out:

***************
*** 642,648 ****
  ---------------------+---+---+-----
   mlparted_tab_part1  | 1 | a |
   mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b | xxx
   mlparted_tab_part3  | 3 | a | xxx
  (4 rows)

--- 642,648 ----
  ---------------------+---+---+-----
   mlparted_tab_part1  | 1 | a |
   mlparted_tab_part2a | 2 | a |
!  mlparted_tab_part2b | 2 | b |
   mlparted_tab_part3  | 3 | a | xxx
  (4 rows)

I'll spend some more time tomorrow.

Thanks for working on this!

Best regards,
Jesper
Attachment
Hello,

On Tue, Apr 3, 2018 at 11:14 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Hi David,
>
> On 03/31/2018 09:52 AM, David Rowley wrote:
>> I've attached a new version of the patch. I'm now at v18 after having
>> some versions of the patch that I didn't release which were based on
>> various versions of Amit's faster partition pruning patch.
>
> Thank you for the updated patch set!
>
> I have tested this together with Amit's v46 patch.
>
> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out

With Amit's v46 patch, the following query in partition_prune was crashing during make check:

explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> I have tested this together with Amit's v46 patch.

Thanks for testing this.

> The attached case doesn't trigger a generic plan, so basically all time is
> spent in GetCachedPlan.

Yeah, there's still no resolution to the fact that a generic plan + runtime pruning might be cheaper than a custom plan. The problem is that the generic plan appears expensive to the custom vs generic plan comparison due to it containing more Append subnodes and the run-time pruning not being taken into account by that comparison. There's been some discussion about this on this thread somewhere.

I think the best solution is probably the one suggested by Robert [1], and that's to alter the Append plan's cost when run-time pruning is enabled to try to account for the run-time pruning. This would be a bit of a blind guess, akin to what we do for clause selectivity estimates for Params, but it's probably better than nothing, and likely better than doing nothing.

> Also, I'm seeing a regression for check-world in
> src/test/regress/results/inherit.out
>
> ***************
> *** 642,648 ****
>   ---------------------+---+---+-----
>    mlparted_tab_part1  | 1 | a |
>    mlparted_tab_part2a | 2 | a |
> !  mlparted_tab_part2b | 2 | b | xxx
>    mlparted_tab_part3  | 3 | a | xxx
>   (4 rows)
>
> --- 642,648 ----
>   ---------------------+---+---+-----
>    mlparted_tab_part1  | 1 | a |
>    mlparted_tab_part2a | 2 | a |
> !  mlparted_tab_part2b | 2 | b |
>    mlparted_tab_part3  | 3 | a | xxx
>   (4 rows)
>
> I'll spend some more time tomorrow.

Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for that with Amit over on [2]. If you patch v46 with the patch I attached to that thread, it should work.

[1] https://www.postgresql.org/message-id/CA%2BTgmoZv8sd9cKyYtHwmd_13%2BBAjkVKo%3DECe7G98tBK5Ejwatw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 05:50, Beena Emerson <memissemerson@gmail.com> wrote:
> With Amit's v46 patch, the following query in partition_prune was
> crashing during make check:
> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);

Hi Beena,

Thanks for looking. Does it work correctly if you apply [1] to Amit's v46 patch before patching with v18 run-time partition pruning?

[1] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 14:10, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> The attached case doesn't trigger a generic plan, so basically all time is
>> spent in GetCachedPlan.
>
> Yeah, there's still no resolution to the fact that a generic plan +
> runtime pruning might be cheaper than a custom plan. The problem is
> that the generic plan appears expensive to the custom vs generic plan
> comparison due to it containing more Append subnodes and the run-time
> pruning not being taken into account by that comparison.

Just for the record, some of the benchmarks I did above also used the attached patch for the -M prepared case. I didn't intend the patch for PostgreSQL, but I am starting to think that it would be useful to have something that saves us from having to EXECUTE PREPAREd statements 5 times before getting a generic plan. Doing that is starting to seem a bit fragile to me. It would be nice to have some solution, but I've so far not thought of anything better than the attached (incomplete) patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi David,

On Wed, Apr 4, 2018 at 7:57 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 05:50, Beena Emerson <memissemerson@gmail.com> wrote:
>> With Amit's v46 patch, the following query in partition_prune was
>> crashing during make check:
>> explain (analyze, costs off, summary off, timing off) execute ab_q1 (2, 2, 3);
>
> Thanks for looking. Does it work correctly if you apply [1] to Amit's
> v46 patch before patching with v18 run-time partition pruning?
>
> [1] https://www.postgresql.org/message-id/CAKJS1f_6%2BgXB%3DQ%2BDryeB62yW7N19sY8hH_dBSjPFjm2ifdgoCw%40mail.gmail.com

Thanks for working on it. make check passes when the patch [1] is also applied.

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi David.

On 2018/04/04 11:10, David Rowley wrote:
> On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> Also, I'm seeing a regression for check-world in
>> src/test/regress/results/inherit.out
>>
>> ***************
>> *** 642,648 ****
>>   ---------------------+---+---+-----
>>    mlparted_tab_part1  | 1 | a |
>>    mlparted_tab_part2a | 2 | a |
>> !  mlparted_tab_part2b | 2 | b | xxx
>>    mlparted_tab_part3  | 3 | a | xxx
>>   (4 rows)
>>
>> --- 642,648 ----
>>   ---------------------+---+---+-----
>>    mlparted_tab_part1  | 1 | a |
>>    mlparted_tab_part2a | 2 | a |
>> !  mlparted_tab_part2b | 2 | b |
>>    mlparted_tab_part3  | 3 | a | xxx
>>   (4 rows)
>>
>> I'll spend some more time tomorrow.
>
> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
> that with Amit over on [2].

I'm not sure if we've yet discussed anything that'd be related to this on the faster pruning thread.

It seems that the difference arises from mlparted_tab_part2b not being selected for an update query that's executed just before this test. When I execute an equivalent select query to check if mlparted_tab_part2b is inadvertently pruned due to the new code, I don't see the latest faster pruning patch doing it:

explain (costs off) select * from mlparted_tab mlp, (select a from some_tab union all select a+1 from some_tab) ss (a)
where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Nested Loop
   Join Filter: (((mlp.a = some_tab.a) AND (mlp.b = 'b'::bpchar)) OR (mlp.a = 3))
   ->  Append
         ->  Seq Scan on some_tab
         ->  Seq Scan on some_tab some_tab_1
   ->  Materialize
         ->  Append
               ->  Seq Scan on mlparted_tab_part1 mlp
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
               ->  Seq Scan on mlparted_tab_part2b mlp_1
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
               ->  Seq Scan on mlparted_tab_part3 mlp_2
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
(13 rows)

For the original update query, constraint exclusion selects the same set of partitions:

explain (costs off) update mlparted_tab mlp set c = 'xxx' from (select a from some_tab union all select a+1 from some_tab) ss (a)
where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;
                                           QUERY PLAN
----------------------------------------------------------------------------------------------
 Update on mlparted_tab mlp
   Update on mlparted_tab_part1 mlp_1
   Update on mlparted_tab_part2b mlp_2
   Update on mlparted_tab_part3 mlp_3
   ->  Nested Loop
         Join Filter: (((mlp_1.a = some_tab.a) AND (mlp_1.b = 'b'::bpchar)) OR (mlp_1.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part1 mlp_1
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
         Join Filter: (((mlp_2.a = some_tab.a) AND (mlp_2.b = 'b'::bpchar)) OR (mlp_2.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part2b mlp_2
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
   ->  Nested Loop
         Join Filter: (((mlp_3.a = some_tab.a) AND (mlp_3.b = 'b'::bpchar)) OR (mlp_3.a = 3))
         ->  Append
               ->  Seq Scan on some_tab
               ->  Seq Scan on some_tab some_tab_1
         ->  Materialize
               ->  Seq Scan on mlparted_tab_part3 mlp_3
                     Filter: ((b = 'b'::bpchar) OR (a = 3))
(28 rows)

What am I missing?

Thanks,
Amit
On 4 April 2018 at 18:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/04 11:10, David Rowley wrote:
>> On 4 April 2018 at 05:44, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>>> Also, I'm seeing a regression for check-world in
>>> src/test/regress/results/inherit.out
>>>
>>> ***************
>>> *** 642,648 ****
>>>   ---------------------+---+---+-----
>>>    mlparted_tab_part1  | 1 | a |
>>>    mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b | xxx
>>>    mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> --- 642,648 ----
>>>   ---------------------+---+---+-----
>>>    mlparted_tab_part1  | 1 | a |
>>>    mlparted_tab_part2a | 2 | a |
>>> !  mlparted_tab_part2b | 2 | b |
>>>    mlparted_tab_part3  | 3 | a | xxx
>>>   (4 rows)
>>>
>>> I'll spend some more time tomorrow.
>>
>> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
>> that with Amit over on [2].
>
> I'm not sure if we've yet discussed anything that'd be related to this on
> the faster pruning thread.

hmm, yeah, I didn't really explain the context, but the report was in [1].

Basically, the OR clause in the following SQL fragment was overwriting the scan_all_nonnull value:

where (mlp.a = ss.a and mlp.b = 'b') or mlp.a = 3;

Basically, the line:

result->scan_all_nonnull = step_result->scan_all_nonnull;

The minimum fix would have been to change that line to:

result->scan_all_nonnull |= step_result->scan_all_nonnull;

Anyway, it's all irrelevant now as that code has all changed.

[1] https://www.postgresql.org/message-id/CAKJS1f_SHPuqDhQWJq-_P1kpPQn7BJt71yPbDP_8b3rhwFQyGA@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/04/04 16:04, David Rowley wrote:
> On 4 April 2018 at 18:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I'm not sure if we've yet discussed anything that'd be related to this on
>> the faster pruning thread.
>
> hmm, yeah, I didn't really explain the context, but the report was in [1].
>
> [1] https://www.postgresql.org/message-id/CAKJS1f_SHPuqDhQWJq-_P1kpPQn7BJt71yPbDP_8b3rhwFQyGA@mail.gmail.com

Oh, I see. Hopefully it is no longer an issue.

Thanks,
Amit
Hi David,

On 04/03/2018 10:10 PM, David Rowley wrote:
>> The attached case doesn't trigger a generic plan, so basically all time is
>> spent in GetCachedPlan.
>
> Yeah, there's still no resolution to the fact that a generic plan +
> runtime pruning might be cheaper than a custom plan. The problem is
> that the generic plan appears expensive to the custom vs generic plan
> comparison due to it containing more Append subnodes and the run-time
> pruning not being taken into account by that comparison.
>
> There's been some discussion about this on this thread somewhere.

Forgot about that, sorry.

> I think the best solution is probably the one suggested by Robert [1],
> and that's to alter the Append plan's cost when run-time pruning is
> enabled to try to account for the run-time pruning. This would be a
> bit of a blind guess, akin to what we do for clause selectivity
> estimates for Params, but it's probably better than nothing, and
> likely better than doing nothing.

Yeah, something based on the number of WHERE clauses, and on whether the partition type has a DEFAULT / NULL partition, could help. Forcing choose_custom_plan() to return false does help a lot (> 400%) for the HASH case. But maybe this area is best left for PG12.

> Yeah, it's a bug in v46 faster partition pruning. Discussing a fix for
> that with Amit over on [2].

I was running with a version of faster_part_prune_v45_fixups.patch. Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

Thanks again,
Jesper
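For anyone wanting to reproduce the "forced generic plan" numbers: choose_custom_plan() is a static function in plancache.c, and the hack presumably amounts to short-circuiting it, roughly as below. This is a debugging aid only, not something proposed for commit:

/* plancache.c -- testing hack: always prefer the generic plan */
static bool
choose_custom_plan(CachedPlanSource *plansource, ParamListInfo boundParams)
{
	return false;			/* skip the custom-vs-generic cost heuristics */
}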
On 5 April 2018 at 05:31, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> Patch v49 with v18 (0001-0004) works. 0005 needs a rebase.

I've attached an updated patchset against Amit's v49 faster partition
pruning patch [1]. v49 should also be patched with [2] and [3].

I'm pretty happy with patches 0001 to 0004. At the moment I'm still a
bit unsure of 0005. I need to do a bit more sanity checking on it,
mostly around the changes in planner.c. Although I am keen to see 0005
make it into PG11, as it makes running larger numbers of partitions
more usable for UPDATE/DELETE too.

Amit's improvements are really good alone, but unfortunately, anyone
who's waiting on us speeding up the partition pruning in SELECT queries
is probably also waiting on us doing the same for UPDATE/DELETE. So I
do think 0005 is important for PG11, providing it's correct, of course.

[1] https://www.postgresql.org/message-id/c5331ff6-8b31-0742-758e-bd7b9aeddf07%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/CAKJS1f_ad%3DB9rVf5dPD27%3DxTN1Ob7xJi6N4BFBNytj93rCPqzg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/3eedafaa-840f-bf72-2fa9-dadb0852e959%40redhat.com

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi David.

On 2018/03/31 22:52, David Rowley wrote:
> The attached patchset is based on Amit's v45 faster partition pruning [1].
>
> I've made a few changes since the v14 version. Since Amit's v45 patch
> now creates the partition pruning details in a data structure that can
> be copied from the planner over to the executor, it means this patch
> is now able to do much of the setup work for run-time pruning in the
> planner. Doing this now allows us to determine if run-time pruning is
> even possible at plan time, rather than the executor attempting it and
> sometimes wasting effort when it failed to find Params matching the
> partition key.
>
> Another change from the last version is that I've separated out the
> handling of exec Params and external Params. The new patch now will
> perform a pruning step at executor startup if some external params
> match the partition key. This is very beneficial to a PREPAREd OLTP
> type query against a partitioned table as it means we can skip
> sub-plan initialisation for all non-matching partitions.

This is quite nice.

> Initialising
> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
> originally planned did require a small change in explain.c in some
> code that was assuming the subnode arrays were the same length as the
> subplan list. I also ended up adding an Int property to EXPLAIN to show
> the number of subnodes that have been removed during execution.
> Unfortunately, there is a small corner case with this: in the case
> where all subnodes are removed, it leaves EXPLAIN with no subnodes to
> search for outer side Vars in. I didn't really see any sane way to
> handle this in EXPLAIN, so I think the best fix for this is what I've
> done, and that's just to always initialise at least 1 subnode, even
> when none match the external Params. Cost-wise, I don't think this is
> such a big deal as the cost savings here are coming from saving on
> initialising tens or hundreds of subnodes, not 1.

I have wondered about the possibility of setting a valid (non-dummy)
targetlist in Append and MergeAppend if they're created for a partitioned
table. Currently they're overwritten by a dummy one using
set_dummy_tlist_references() in set_plan_refs(), citing the following reason:

 * set_dummy_tlist_references
 *    Replace the targetlist of an upper-level plan node with a simple
 *    list of OUTER_VAR references to its child.
 *
 * This is used for plan types like Sort and Append that don't evaluate
 * their targetlists.  Although the executor doesn't care at all what's in
 * the tlist, EXPLAIN needs it to be realistic.

In fact, when I noticed this EXPLAIN behavior, I wondered if it's
something we should have discussed when d3cc37f1d801a [1] went in.

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
>
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.

Quite nice!

> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans. It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions.
> Ideally, the performance of a partitioned table would be
> faster than the non-partitioned case, but please remember this query
> is a single row PK lookup SELECT, so is a very fast query in both
> cases. Partitioning cases should improve more as the table grows and
> indexes struggle to fit in RAM, or when many rows are being taken from
> the partition and being aggregated.
>
> Unfortunately, the story is not quite as good for the non -M prepared
> version of the benchmark. This shows that planning time of partitioned
> table queries could still use some improvements. Amit's v45 patch
> certainly makes a dent in the planner's slow performance here, but it's
> still nothing like as fast as the non-partitioned case. More work is
> required there in the future.

Certainly. Things like the issue that we both replied to yesterday
should be addressed for starters [2].

> This patchset also contains a patch to improve the performance of
> inheritance planning of UPDATE/DELETE queries. This patch also
> implements run-time pruning for UPDATE/DELETE too. This also has a
> significant performance improvement for planning of UPDATE/DELETE
> operations on partitioned tables with a large number of partitions.
> Performance is as follows, in transactions per second:
>
> Partitions   Unpatched    Patched
>          1      7323.3     6573.2  (-10.24%)
>          2      6784.8     6377.1  (-6.01%)
>          4      5903.0     6106.8  (3.45%)
>          8      4582.0     5579.9  (21.78%)
>         16      3131.5     4521.2  (44.38%)
>         32      1779.8     3387.8  (90.35%)
>         64       821.9     2245.4  (173.18%)
>        128       322.2     1319.6  (309.56%)
>        256        84.3      731.7  (768.27%)
>        512        22.5      382.8  (1597.74%)
>       1024         5.5      150.1  (2607.83%)

Great!

I will post comments on your v19 later today.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d801a
[2] https://www.postgresql.org/message-id/20180403194613.GY28454%40telsasoft.com
On 5 April 2018 at 15:14, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/03/31 22:52, David Rowley wrote:
>> Initialising
>> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
>> originally planned did require a small change in explain.c in some
>> code that was assuming the subnode arrays were the same length as the
>> subplan list. I also ended up adding an Int property to EXPLAIN to show
>> the number of subnodes that have been removed during execution.
>> Unfortunately, there is a small corner case with this: in the case
>> where all subnodes are removed, it leaves EXPLAIN with no subnodes to
>> search for outer side Vars in. I didn't really see any sane way to
>> handle this in EXPLAIN, so I think the best fix for this is what I've
>> done, and that's just to always initialise at least 1 subnode, even
>> when none match the external Params. Cost-wise, I don't think this is
>> such a big deal as the cost savings here are coming from saving on
>> initialising tens or hundreds of subnodes, not 1.
>
> I have wondered about the possibility of setting a valid (non-dummy)
> targetlist in Append and MergeAppend if they're created for a partitioned
> table. Currently they're overwritten by a dummy one using
> set_dummy_tlist_references() in set_plan_refs(), citing the following reason:
>
>  * set_dummy_tlist_references
>  *    Replace the targetlist of an upper-level plan node with a simple
>  *    list of OUTER_VAR references to its child.
>  *
>  * This is used for plan types like Sort and Append that don't evaluate
>  * their targetlists.  Although the executor doesn't care at all what's in
>  * the tlist, EXPLAIN needs it to be realistic.
>
> In fact, when I noticed this EXPLAIN behavior, I wondered if it's
> something we should have discussed when d3cc37f1d801a [1] went in.

I had a quick hack at this to see if it would work and it does seem to
on my very simple test. However, it would mean removing
set_dummy_tlist_references from more than just Append/MergeAppend.

create table listp (a int, b int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);

prepare q1 (int, int) as select * from listp where a in($1,$2) order by b limit 1;

explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);
explain execute q1(1,2);

explain (verbose, costs off) execute q1(0,0);
                       QUERY PLAN
--------------------------------------------------------
 Limit
   Output: listp.a, listp.b
   ->  Sort
         Output: listp.a, listp.b
         Sort Key: listp.b
         ->  Append
               Subplans Pruned: 2
(7 rows)

The downside is that if we were to do this it would mean changing the
output in cases like:

explain (verbose, costs off) (select a z, b y from listp union all
select * from listp) order by y;
                                       QUERY PLAN
--------------------------------------------------------------------------------------
 Sort
   Output: z, y
   Sort Key: y
   ->  Append
         ->  Seq Scan on public.listp1
               Output: listp1.a, listp1.b
         ->  Seq Scan on public.listp2
               Output: listp2.a, listp2.b
         ->  Seq Scan on public.listp1 listp1_1
               Output: listp1_1.a, listp1_1.b
         ->  Seq Scan on public.listp2 listp2_1
               Output: listp2_1.a, listp2_1.b

Notice the sort key now refers to the alias rather than a column from
the first append child.

It sure is an interesting thought, and one I'd not considered, but I
don't think trying for something like this is going to be the path of
least resistance.
It may also add quite a bit of complexity if we just try to do it when
the OUTER_VAR would lead to an Append/MergeAppend which belongs to a
partitioned table scan.

I think this idea has good merit, and some form of it might be the
nicest way to allow run-time pruning of all subnodes in the future.
Perhaps we can put it on the shelf and think about it again for PG12.
However, it might not be the most interesting optimization to work on,
as I think run-time pruning away of all subnodes is probably far less
common than pruning some, or all but one, and the cost of initializing
the one unneeded subnode is not so big.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 5 April 2018 at 14:01, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've attached an updated patchset against Amit's v49 faster partition
> pruning patch.

Well, v19 was short-lived. 0005 conflicted with some recent changes to
MERGE.

The attached v20 patch set is now based on Amit's runtime partition
prune v50 [1]. There are a few changes since v19:

1. I've run pgindent on the entire patchset.

2. Made a pass over the comments and fixed a few things that were not
quite right. Also improved some wording.

3. Fixed a small error in make_partition_pruneinfo where I had
accidentally used the 'i' variable for two purposes at once. There was
no live bug there, but I've now changed things around, as what was
there was wrong regardless of whether it was causing issues.

4. Updated typedefs.list with new typedefs added in the patchset.

[1] https://www.postgresql.org/message-id/77a518ac-e4a0-4cd1-9988-e5d754a6501f%40lab.ntt.co.jp

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/04/05 12:14, Amit Langote wrote:
> I will post comments on your v19 later today.

I looked at it, and here are my thoughts after having looked very
closely at it for the first time.

* Regarding PartitionPruneInfo:

I think the names of the fields could be improved -- like pruning_steps
instead of prunesteps, unpruned_parts instead of allpartindexs. The
latter is even a bit misleading, because it doesn't in fact contain
*all* partition indexes, only those that are unpruned; each either has
a subpath or appears in the (unpruned) partitioned_rels list. Also, I
didn't like the names subnodeindex and subpartindex very much.
subplan_indexes and parent_indexes would sound more informative to me.

* make_partition_pruneinfo has a parameter resultRelations that's not
used anywhere

* In make_partition_pruneinfo()

    allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
    allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);

I think these arrays need to have root->simple_rel_array_size + 1
elements, because they're subscripted using RT indexes, which start at 1.

* Also in make_partition_pruneinfo()

    /* Initialize to -1 to indicate the rel was not found */
    for (i = 0; i < root->simple_rel_array_size; i++)
    {
        allsubnodeindex[i] = -1;
        allsubpartindex[i] = -1;
    }

Maybe, allocate the arrays above using palloc0 and don't do this
initialization. Instead, make the indexes that are stored in these
start with 1 and consider 0 as an invalid entry.

* Regarding the code added in execPartition.c and execPartition.h:

I wondered why PartitionedRelPruning is named the way it is. I saw many
parallels with PartitionDispatchData, both in terms of the main thing it
consists of -- that is, the map that translates partition indexes as in
the partition descriptor to those of subplans or of some other executor
structure. Also, I imagine you tried to mimic PartitionTupleRouting with
PartitionPruning but not throughout. For example, tuple routing struct
pointer variables are throughout called proute, whereas PartitionPruning
ones are called partprune instead of, say, pprune. Consistency would
help, imho.

* Instead of nesting PartitionedRelPruning inside another, just store
them in a global flat array in the PartitionPruning, like
PartitionTupleRouting does for the PartitionDispatch of individual
partitioned tables in the tree.

    typedef struct PartitionedRelPruning
    {
        int         nparts;
        int        *subnodeindex;
        struct PartitionedRelPruning **subpartprune;

* I don't see why there needs to be nparts in the above, because it
already has a PartitionPruneContext member which has that information.

In fact, I made most of these changes myself while going through the
code. Please see attached the delta patch. It also tweaks quite a few
other things, including various comments. I think parts of it apply to
the 0001, 0003, and 0004 patches. See if this looks good to you.

Thanks,
Amit
Attachment
Hi Amit,

Thanks for having a look at this.

On 6 April 2018 at 00:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I looked at it, and here are my thoughts after having looked very
> closely at it for the first time.
>
> * Regarding PartitionPruneInfo:
>
> I think the names of the fields could be improved -- like pruning_steps
> instead of prunesteps, unpruned_parts instead of allpartindexs. The
> latter is even a bit misleading, because it doesn't in fact contain
> *all* partition indexes, only those that are unpruned; each either has
> a subpath or appears in the (unpruned) partitioned_rels list. Also, I
> didn't like the names subnodeindex and subpartindex very much.
> subplan_indexes and parent_indexes would sound more informative to me.

Seems mostly fair. I'm not a fan of using the term "unpruned" though.
I'll have a think. The "all" is meant in terms of what exists as
subnodes.

subplan_indexes and parent_indexes seem like better names, I agree.

> * make_partition_pruneinfo has a parameter resultRelations that's not
> used anywhere

It gets used in 0005.

I guess I could only add it in 0005, but make_partition_pruneinfo is
only used in 0003, so you could say the same about that entire
function.

Do you think I should delay adding that parameter until the 0005 patch?

> * In make_partition_pruneinfo()
>
>     allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
>     allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);
>
> I think these arrays need to have root->simple_rel_array_size + 1
> elements, because they're subscripted using RT indexes, which start at 1.

RT indexes are always 1-based. See setup_simple_rel_arrays. It already
sets the array size to list_length(rtable) + 1.

> * Also in make_partition_pruneinfo()
>
>     /* Initialize to -1 to indicate the rel was not found */
>     for (i = 0; i < root->simple_rel_array_size; i++)
>     {
>         allsubnodeindex[i] = -1;
>         allsubpartindex[i] = -1;
>     }
>
> Maybe, allocate the arrays above using palloc0 and don't do this
> initialization. Instead, make the indexes that are stored in these
> start with 1 and consider 0 as an invalid entry.

0 is a valid subplan index. It is possible to make this happen, but
I'd need to subtract 1 everywhere I used the map. That does not seem
very nice. It seems more likely to result in bugs where we might forget
to do the - 1.

Did you want this because you'd rather have the palloc0() than the for
loop setting the array elements to -1? Or is there another reason?

> * Regarding the code added in execPartition.c and execPartition.h:
>
> I wondered why PartitionedRelPruning is named the way it is. I saw many
> parallels with PartitionDispatchData, both in terms of the main thing it
> consists of -- that is, the map that translates partition indexes as in
> the partition descriptor to those of subplans or of some other executor
> structure. Also, I imagine you tried to mimic PartitionTupleRouting with
> PartitionPruning but not throughout. For example, tuple routing struct
> pointer variables are throughout called proute, whereas PartitionPruning
> ones are called partprune instead of, say, pprune. Consistency would
> help, imho.

Yes, I saw similarities and did end up moving all the code into
execPartition a while back. I'll look into this renaming.

> * Instead of nesting PartitionedRelPruning inside another, just store
> them in a global flat array in the PartitionPruning, like
> PartitionTupleRouting does for the PartitionDispatch of individual
> partitioned tables in the tree.
>
>     typedef struct PartitionedRelPruning
>     {
>         int         nparts;
>         int        *subnodeindex;
>         struct PartitionedRelPruning **subpartprune;

There is a flat array in PartitionPruning. subpartprune contains
pointers into that array. I want to have this pointer array so I can
directly reference the flat array in PartitionPruning.

Maybe I've misunderstood what you mean here.

> * I don't see why there needs to be nparts in the above, because it
> already has a PartitionPruneContext member which has that information.

Good point. I'll remove that.

> In fact, I made most of these changes myself while going through the
> code. Please see attached the delta patch. It also tweaks quite a few
> other things, including various comments. I think parts of it apply to
> the 0001, 0003, and 0004 patches. See if this looks good to you.

Thanks. I'll look. It's late over this side now, so will look tomorrow.

Thanks again for reviewing this.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi David,

First of all: solid patch set with good documentation.

On 04/05/2018 09:41 AM, David Rowley wrote:
> Seems mostly fair. I'm not a fan of using the term "unpruned" though.
> I'll have a think. The "all" is meant in terms of what exists as
> subnodes.

'included_parts' / 'excluded_parts' probably isn't better...

> subplan_indexes and parent_indexes seem like better names, I agree.

More clear.

>> * Also in make_partition_pruneinfo()
>>
>>     /* Initialize to -1 to indicate the rel was not found */
>>     for (i = 0; i < root->simple_rel_array_size; i++)
>>     {
>>         allsubnodeindex[i] = -1;
>>         allsubpartindex[i] = -1;
>>     }
>>
>> Maybe, allocate the arrays above using palloc0 and don't do this
>> initialization. Instead, make the indexes that are stored in these
>> start with 1 and consider 0 as an invalid entry.
>
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. It seems more likely to result in bugs where we might forget
> to do the - 1.
>
> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?

I think that doing palloc0 would be confusing; -1 is clearer,
especially since it is used with 'allpartindexes', which is a Bitmapset.

Doing the variable renames as Amit suggests would be good.

I ran some tests (v50_v20) (make check-world passes), with and without
forcing choose_custom_plan() to return false, and saw good performance
results without running into issues.

Maybe 0005 should be expanded in partition_prune.sql with the supported
cases to make those more clear.

Thanks!

Best regards,
Jesper
Hi David.

On 2018/04/05 22:41, David Rowley wrote:
>> * make_partition_pruneinfo has a parameter resultRelations that's not
>> used anywhere
>
> It gets used in 0005.
>
> I guess I could only add it in 0005, but make_partition_pruneinfo is
> only used in 0003, so you could say the same about that entire
> function.
>
> Do you think I should delay adding that parameter until the 0005 patch?

Yes, I think so.

>> * In make_partition_pruneinfo()
>>
>>     allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
>>     allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);
>>
>> I think these arrays need to have root->simple_rel_array_size + 1
>> elements, because they're subscripted using RT indexes, which start at 1.
>
> RT indexes are always 1-based. See setup_simple_rel_arrays. It already
> sets the array size to list_length(rtable) + 1.

Oh, I missed that simple_rel_array_size itself is set to consider
1-based RT indexes.

relnode.c:73
    root->simple_rel_array_size = list_length(root->parse->rtable) + 1;

>> * Also in make_partition_pruneinfo()
>>
>>     /* Initialize to -1 to indicate the rel was not found */
>>     for (i = 0; i < root->simple_rel_array_size; i++)
>>     {
>>         allsubnodeindex[i] = -1;
>>         allsubpartindex[i] = -1;
>>     }
>>
>> Maybe, allocate the arrays above using palloc0 and don't do this
>> initialization. Instead, make the indexes that are stored in these
>> start with 1 and consider 0 as an invalid entry.
>
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. It seems more likely to result in bugs where we might forget
> to do the - 1.

You can subtract 1 right here in make_partition_pruneinfo, before
setting the values in PartitionPruneInfo's subplan_indexes and
parent_indexes. I'm only proposing to make make_partition_pruneinfo() a
bit faster by not looping over both the local map arrays setting them
to -1. IOW, I'm not saying that we should emit PartitionPruneInfo nodes
that contain 1-based indexes.

> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?

Yeah, that's it.

>> * Instead of nesting PartitionedRelPruning inside another, just store
>> them in a global flat array in the PartitionPruning, like
>> PartitionTupleRouting does for the PartitionDispatch of individual
>> partitioned tables in the tree.
>>
>>     typedef struct PartitionedRelPruning
>>     {
>>         int         nparts;
>>         int        *subnodeindex;
>>         struct PartitionedRelPruning **subpartprune;
>
> There is a flat array in PartitionPruning. subpartprune contains
> pointers into that array. I want to have this pointer array so I can
> directly reference the flat array in PartitionPruning.
>
> Maybe I've misunderstood what you mean here.

I think we can save some space here by not having the pointers stored
here. Instead of passing the pointer itself in the recursive calls to
find_subplans_for_extparams_recurse, et al, just pass the entire array
and an offset to use for the given recursive invocation.

If you look at ExecFindPartition, used for tuple routing, you may see
that there is no recursion at all. Similarly,
find_subplans_for_extparams_recurse, et al, might be able to avoid
recursion if similar tricks are used.

Finally, about having two different functions for different sets of
Params: can't we have just one, named
find_subplans_for_params_recurse(), and use the appropriate Params
based on the value of some parameter? I can't help but notice the
duplication of code.
Thanks, Amit
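A sketch of the convention Amit seems to be proposing; the variable and field names come from the mails, while the exact conversion sites are illustrative:

/*
 * Sketch: palloc0 the local map, store subplan indexes 1-based so that
 * 0 means "rel not found", and convert back to the usual 0-based
 * indexes when filling the PartitionPruneInfo.
 */
allsubnodeindex = (int *) palloc0(sizeof(int) * root->simple_rel_array_size);

/* recording a subplan for range table index 'rti' */
allsubnodeindex[rti] = subplan_index + 1;

/* emitting the node: subtract 1, yielding -1 for "not found" */
pinfo->subnode_map[i] = allsubnodeindex[rti] - 1;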
(sending my reply in parts for concurrency)

On 6 April 2018 at 14:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think we can save some space here by not having the pointers stored
> here. Instead of passing the pointer itself in the recursive calls to
> find_subplans_for_extparams_recurse, et al, just pass the entire array
> and an offset to use for the given recursive invocation.

hmm, where would those offsets be stored? I don't want to have to do
any linear searching to determine the offset, which is why I just
stored the pointer to the flat array. It seems very efficient to me to
do this. Remember that this pruning can occur per tuple in cases like
parameterized nested loops.

Are you worried about memory consumption? It's one pointer per
partition. I imagine there's lots more allocated for DML on a
partitioned table, as it needs to store maps to map attribute numbers.

Or are you thinking that storing an array of 32-bit int values is
better than an array of (probably) 64-bit pointers, since it requires
half the space?

> If you look at ExecFindPartition, used for tuple routing, you may see
> that there is no recursion at all. Similarly,
> find_subplans_for_extparams_recurse, et al, might be able to avoid
> recursion if similar tricks are used.

That seems pretty different. That's looking for a single node in a
tree, so it just follows a single path from the root; it never has
to go back up a level and look down any other paths.

What we need for find_subplans_for_extparams_recurse is to find
all nodes in the entire tree which match the given clause. Doing this
without recursion would require some sort of stack so we can go back
up a level and search again down another branch. There are ways
around this without using recursion, sure, but I don't think any of
them will be quite as convenient and simple. The best I can think of
is to palloc some stack manually and use some depth_level to track
which element to use. An actual stack seems simpler. I can't
quite think of a good way to know in advance how many elements we'd
need to palloc.

> Finally, about having two different functions for different sets of
> Params: can't we have just one, named
> find_subplans_for_params_recurse(), and use the appropriate Params
> based on the value of some parameter? I can't help but notice the
> duplication of code.

I had decided not to make this one function previously as I didn't
really want to add unnecessary branching in the code. After
implementing it, it does not look as bad as I thought.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
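As an aside, one possible bound for such a stack is the number of PartitionedRelPruning structs, since each partitioned table has exactly one parent and so is pushed at most once. A hypothetical non-recursive sketch follows (the patch keeps the recursive form; only the bitmapset calls and get_matching_partitions are real APIs, everything else follows the mails or is invented):

/*
 * Hypothetical non-recursive walk over the pruning hierarchy, with the
 * stack sized to one entry per partitioned table.
 */
Bitmapset  *validsubplans = NULL;
int        *stack = (int *) palloc(sizeof(int) * npartrelprunes);
int         depth = 0;

stack[depth++] = 0;                     /* index of the root parent */
while (depth > 0)
{
    PartitionedRelPruning *pprune = &partrelprunes[stack[--depth]];
    Bitmapset  *partset = get_matching_partitions(&pprune->context,
                                                  pprune->pruning_steps);
    int         i = -1;

    while ((i = bms_next_member(partset, i)) >= 0)
    {
        if (pprune->subnodeindex[i] >= 0)
            validsubplans = bms_add_member(validsubplans,
                                           pprune->subnodeindex[i]);
        else if (pprune->subpartindex[i] >= 0)
            stack[depth++] = pprune->subpartindex[i];
    }
}
pfree(stack);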
Hi David,

On 2018/04/06 12:27, David Rowley wrote:
> (sending my reply in parts for concurrency)
>
> On 6 April 2018 at 14:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think we can save some space here by not having the pointers stored
>> here. Instead of passing the pointer itself in the recursive calls to
>> find_subplans_for_extparams_recurse, et al, just pass the entire array
>> and an offset to use for the given recursive invocation.
>
> hmm, where would those offsets be stored? I don't want to have to do
> any linear searching to determine the offset, which is why I just
> stored the pointer to the flat array. It seems very efficient to me to
> do this. Remember that this pruning can occur per tuple in cases like
> parameterized nested loops.
>
> Are you worried about memory consumption? It's one pointer per
> partition. I imagine there's lots more allocated for DML on a
> partitioned table, as it needs to store maps to map attribute numbers.
>
> Or are you thinking that storing an array of 32-bit int values is
> better than an array of (probably) 64-bit pointers, since it requires
> half the space?

Yeah, just copy it from the PartitionPruneInfo like you're doing for
subnodeindex:

    memcpy(partrelprune->subpartindex, pinfo->subpartindex,
           sizeof(int) * pinfo->nparts);

Instead, I see ExecSetupPartitionPruning is now doing this:

    /*
     * Setup the PartitionedRelPruning's subpartprune so that we can
     * quickly find sub-PartitionedRelPruning details for any
     * sub-partitioned tables that this partitioned table contains.
     * We need to be able to find these quickly during our recursive
     * search to find all matching subnodes.
     */
    for (j = 0; j < pinfo->nparts; j++)
    {
        int     subpartidx = pinfo->subpartindex[j];

        Assert(subpartidx < list_length(partitionpruneinfo));

        if (subpartidx >= 0)
            partrelprune->subpartprune[j] = &partrelprunes[subpartidx];
        else
            partrelprune->subpartprune[j] = NULL;
    }

With that in place, pass the index/offset instead of the pointer to the
next recursive invocation of find_subplans_*, along with the array
containing all PartitionedRelPrunings. So, where you have in each of
find_subplans_*:

    if (partrelprune->subnodeindex[i] >= 0)
        *validsubplans = bms_add_member(*validsubplans,
                                        partrelprune->subnodeindex[i]);
    else if (partrelprune->subpartprune[i] != NULL)
        find_subplans_for_allparams_recurse(partrelprune->subpartprune[i],
                                            validsubplans);

I'm proposing that you do:

    if (partrelprune->subnodeindex[i] >= 0)
        *validsubplans = bms_add_member(*validsubplans,
                                        partrelprune->subnodeindex[i]);
    else if (partrelprune->subpartindex[i] >= 0)
        find_subplans_for_allparams_recurse(all_partrelprune,
                                            partrelprune->subpartindex[i],
                                            validsubplans);

And at the top of each of find_subplans_*:

    PartitionedRelPruning *partrelprune = &all_partrelprune[offset];

where the signature is:

    static void
    find_subplans_for_allparams_recurse(PartitionedRelPruning *all_partrelprune,
                                        int offset,
                                        Bitmapset **validsubplans)

The all_partrelprune above refers to the flat array in
PartitionPruning. On the first call from ExecFindMatchingSubPlans, et
al, you'd pass 0 for offset to start pruning with the root parent's
PartitionedRelPruning.

All the values contained in subnodeindex and subpartindex are indexes
into the global array (whole-tree, that is) anyway, and that fact would
be more apparent if we used this code structure, imho.

>> If you look at ExecFindPartition, used for tuple routing, you may see
>> that there is no recursion at all.
>> Similarly, find_subplans_for_extparams_recurse, et al, might be able
>> to avoid recursion if similar tricks are used.
>
> That seems pretty different. That's looking for a single node in a
> tree, so it just follows a single path from the root; it never has
> to go back up a level and look down any other paths.
>
> What we need for find_subplans_for_extparams_recurse is to find
> all nodes in the entire tree which match the given clause. Doing this
> without recursion would require some sort of stack so we can go back
> up a level and search again down another branch. There are ways
> around this without using recursion, sure, but I don't think any of
> them will be quite as convenient and simple. The best I can think of
> is to palloc some stack manually and use some depth_level to track
> which element to use. An actual stack seems simpler. I can't
> quite think of a good way to know in advance how many elements we'd
> need to palloc.

Hmm, yeah. I just remembered that I had to give up suggesting this a
while back on this thread. So, okay, you don't need to do anything
about this.

>> Finally, about having two different functions for different sets of
>> Params: can't we have just one, named
>> find_subplans_for_params_recurse(), and use the appropriate Params
>> based on the value of some parameter? I can't help but notice the
>> duplication of code.
>
> I had decided not to make this one function previously as I didn't
> really want to add unnecessary branching in the code. After
> implementing it, it does not look as bad as I thought.

You could, at the top of, say, find_subplans_for_params_recurse(...,
bool extparam), do:

    Bitmapset *params = extparam ? partrelprune->extparams
                                 : partrelprune->allparams;

ISTT, find_subplans_for_extparams_recurse and
find_subplans_for_allparams_recurse contain the exact same code except
these two lines in the two functions, respectively:

    if (!bms_is_empty(partrelprune->extparams))
    {
        context->safeparams = partrelprune->extparams;

    if (!bms_is_empty(partrelprune->allparams))
    {
        context->safeparams = partrelprune->allparams;

I didn't suggest the same for, say, ExecFindInitialMatchingSubPlans and
ExecFindMatchingSubPlans, because they seem to be at least a bit
different in their missions. IIUC, the former has to do index value
adjustment (for those in subnodeindex and subpartindex) after having
discovered a new set of matching partitions in the tree, after pruning
with external params at Append/MergeAppend startup; those params won't
change after that point.

Thanks,
Amit
On 6 April 2018 at 00:54, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> * Regarding PartitionPruneInfo:
>
> I think the names of the fields could be improved -- like pruning_steps
> instead of prunesteps, unpruned_parts instead of allpartindexs. The
> latter is even a bit misleading, because it doesn't in fact contain
> *all* partition indexes, only those that are unpruned; each either has
> a subpath or appears in the (unpruned) partitioned_rels list. Also, I
> didn't like the names subnodeindex and subpartindex very much.
> subplan_indexes and parent_indexes would sound more informative to me.

I've done a load of renaming. I went with subnode_map and subpart_map
instead of _indexes. I thought this was better. Let me know if you
disagree.

> * make_partition_pruneinfo has a parameter resultRelations that's not
> used anywhere

I've taken this out of 0001 and it is now introduced again in 0005,
where it's used.

> * Also in make_partition_pruneinfo()
>
>     /* Initialize to -1 to indicate the rel was not found */
>     for (i = 0; i < root->simple_rel_array_size; i++)
>     {
>         allsubnodeindex[i] = -1;
>         allsubpartindex[i] = -1;
>     }
>
> Maybe, allocate the arrays above using palloc0 and don't do this
> initialization. Instead, make the indexes that are stored in these
> start with 1 and consider 0 as an invalid entry.

I've made this change.

> * Regarding the code added in execPartition.c and execPartition.h:
>
> I wondered why PartitionedRelPruning is named the way it is. I saw many
> parallels with PartitionDispatchData, both in terms of the main thing it
> consists of -- that is, the map that translates partition indexes as in
> the partition descriptor to those of subplans or of some other executor
> structure. Also, I imagine you tried to mimic PartitionTupleRouting with
> PartitionPruning but not throughout. For example, tuple routing struct
> pointer variables are throughout called proute, whereas PartitionPruning
> ones are called partprune instead of, say, pprune. Consistency would
> help, imho.

I've made a series of changes here too, but didn't use the word
"Dispatch" anywhere. I'm not really sure what the origin of this word
is. To me, it means to send something somewhere, which I thought might
be why we see it in tuple routing, since the tuple is being "dispatched"
to the correct partition. We're not dispatching anything anywhere in
partition pruning, so I don't really think the term can be used here.
Although, if you see some other reason for using that word, please
explain.

> * Instead of nesting PartitionedRelPruning inside another, just store
> them in a global flat array in the PartitionPruning, like
> PartitionTupleRouting does for the PartitionDispatch of individual
> partitioned tables in the tree.
>
>     typedef struct PartitionedRelPruning
>     {
>         int         nparts;
>         int        *subnodeindex;
>         struct PartitionedRelPruning **subpartprune;

I've kept the same subpart_map from the PartitionPruneInfo. I actually
ended up using that one without performing a memcpy() on it, since
we're not changing it anywhere. That might or might not be a good idea,
since we might one day think we can change it, which would alter the
plan's copy; but on the other hand, there's a little performance boost
in not performing a copy.

> * I don't see why there needs to be nparts in the above, because it
> already has a PartitionPruneContext member which has that information.

I've removed that field.

I've also been doing a bit of work on the 0005 patch:
1. It now properly supports skipping subplans when exec Params can
eliminate certain partitions. Previously I'd only coded it to work with
external Params.

2. Fixed a bug where statement-level triggers would not fire when all
partitions were pruned.

3. Added tests.

The patch is still based on v50 of the runtime pruning patch [1].

[1] https://www.postgresql.org/message-id/77a518ac-e4a0-4cd1-9988-e5d754a6501f%40lab.ntt.co.jp

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
I rebased this series on top of the committed version of the other patch. Here's v22, with no other changes than rebasing. I did not include 0005, though. Git changed the author tag for 0001. Not intentional. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera wrote: > I rebased this series on top of the committed version of the other patch. > Here's v22, with no other changes than rebasing. I did not include > 0005, though. Apologies, I forgot to "git add" one fixup for 0001. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 7 April 2018 at 09:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Alvaro Herrera wrote: >> I rebased this series on top of the committed version of the other patch. >> Here's v22, with no other changes than rebasing. I did not include >> 0005, though. > > Apologies, I forgot to "git add" one fixup for 0001. 0003 I think. I'm looking over the rebased patches now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 10:45, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'm looking over the rebased patches now.

I've made a complete read of 0001 and 0002 so far. Your rebase looks
fine. After the complete read, I only have the following comments:

0001:

1. Missing "the" before "partition key":

    * Extract Params matching partition key and record if we got any.

2. Is this the property name we're going to stick with?

    ExplainPropertyInteger("Subplans Pruned", NULL, nplans - nsubnodes,
                           es);

Another idea is "Subplans Removed".

3. In the following comment I've used the word "hierarchy", but maybe
we need to add the word "flattened" before it.

    * PartitionPruning - Encapsulates a hierarchy of PartitionRelPruning

4. The comment mentions "after init plan", but really we can only know
the value of an exec Param during actual execution. So:

    * Parameters that are safe to be used for partition pruning. execparams
    * are not safe to use until after init plan.

might be better as:

    * Parameters that are safe to be used for partition pruning. execparams
    * are not safe to use until the executor is running.

0002:

Looks fine. But if I was committing this, to give me confidence, I'd
want to know how the left_most_one table was generated. I used:

#include <stdio.h>

int
main(void)
{
    int     i = 1;

    printf("0, ");
    while (i < 256)
    {
        printf("%d, ", 31 - __builtin_clz(i));
        if ((i & 0xf) == 0xf)
            putchar('\n');
        i++;
    }
    return 0;
}

Continuing to read 0003 and 0004 now.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
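For context, a 256-entry leftmost-one table like the one generated above is typically consumed along these lines; the helper below is illustrative, not the patch's actual code:

/*
 * Find the highest set bit of a nonzero 32-bit word, one byte at a
 * time, using the left_most_one lookup table.  Entry 0 of the table
 * is a placeholder, so the caller must ensure word != 0.
 */
static inline int
leftmost_one_pos32(uint32 word)
{
    int     shift = 24;

    Assert(word != 0);
    while ((word >> shift) == 0)
        shift -= 8;
    return shift + left_most_one[(word >> shift) & 0xFF];
}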
On 7 April 2018 at 12:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Continuing to read 0003 and 0004 now.

0003:

1. "setup" -> "set":

    /* If run-time partition pruning is enabled, then setup that up now */

2. We should be able to get rid of as_noopscan and just have another
special negative value for as_whichplan. I've attached a patch to do
this.

3. I've forgotten to drop table boolvalues in the tests. Patch attached
to fix.

0004:

1. "ms_valid_subplans" -> "valid_subplans" in:

    * ms_valid_subplans for runtime pruning, valid mergeplans indexes to
    * scan.

All the other fields are not being prefixed with ms_ in these comments.

Everything else looks fine from my point of view.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
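The idea in point 2 of the 0003 comments might look roughly like this sketch; the second sentinel's name and the early return are illustrative of the approach, not quoted from the attached patch:

/*
 * Fold the no-op scan state into as_whichplan by adding a second
 * sentinel value next to the existing one.
 */
#define INVALID_SUBPLAN_INDEX       -1  /* no subplan chosen yet */
#define NO_MATCHING_SUBPLANS        -2  /* pruning removed them all */

/* in ExecAppend: nothing to scan if pruning eliminated every subplan */
if (node->as_whichplan == NO_MATCHING_SUBPLANS)
    return ExecClearTuple(node->ps.ps_ResultTupleSlot);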
On Sat, Apr 7, 2018 at 11:26 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > Everything else looks fine from my point of view. Me too, although I still think having struct names PartitionPruning and PartitionRelPruning is going to be a bit confusing. We should think about naming the latter to something else. I suggested PartitionPruningDispatch(Data), but David doesn't seem to like it. Maybe, PartitionPruneState, because it parallels the PartitionPruneInfo that comes from the planner for every partitioned table in the tree. Thanks, Amit
On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Everything else looks fine from my point of view.
>
> Me too, although I still think having struct names PartitionPruning
> and PartitionRelPruning is going to be a bit confusing. We should
> think about naming the latter to something else. I suggested
> PartitionPruningDispatch(Data), but David doesn't seem to like it.
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

I've not followed this thread/feature at all, but I don't find the
comments atop partprune.c even remotely sufficient. Unless there's a
README or such hidden somewhere?

Greetings,

Andres Freund
On 7 April 2018 at 16:26, Amit Langote <amitlangote09@gmail.com> wrote: > Maybe, PartitionPruneState, because it parallels the > PartitionPruneInfo that comes from the planner for every partitioned > table in the tree. I like that. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 16:31, Andres Freund <andres@anarazel.de> wrote:
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's a
> README or such hidden somewhere?

There's not a README file. As for the comments atop partprune.c, do you
want them to explain more about how partition pruning works, or about
how this patch uses the existing code?

Probably, if we need to explain more there about how pruning works, then
it should be a fixup patch to 9fdb675fc, no?

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
>> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>> Everything else looks fine from my point of view.
>>
>> Me too, although I still think having struct names PartitionPruning
>> and PartitionRelPruning is going to be a bit confusing. We should
>> think about naming the latter to something else. I suggested
>> PartitionPruningDispatch(Data), but David doesn't seem to like it.
>> Maybe, PartitionPruneState, because it parallels the
>> PartitionPruneInfo that comes from the planner for every partitioned
>> table in the tree.
>
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's a
> README or such hidden somewhere?

Sorry, there isn't a README, and I agree partprune.c's header comment
could be improved quite a bit.

Just to be clear, that's the fault of the patch that was already
committed earlier today (9fdb675fc "Faster partition pruning"), not
this patch, which just extends partprune.c's functionality to
implement additional planner and executor support for runtime pruning.

I'm drafting a patch that expands the partprune.c comment and will post
it shortly.

Thanks,
Amit
On Sat, Apr 7, 2018 at 1:58 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Probably, if we need to explain more there about how pruning works, then
> it should be a fixup patch to 9fdb675fc, no?

Yes, I just replied, and I'm working on a patch.

Thanks,
Amit
On 2018-04-07 16:58:01 +1200, David Rowley wrote:
> On 7 April 2018 at 16:31, Andres Freund <andres@anarazel.de> wrote:
>> I've not followed this thread/feature at all, but I don't find the
>> comments atop partprune.c even remotely sufficient. Unless there's a
>> README or such hidden somewhere?
>
> There's not a README file. As for the comments atop partprune.c, do you
> want them to explain more about how partition pruning works, or about
> how this patch uses the existing code?

Primarily the first. This isn't trivial, straightforward code.

> Probably, if we need to explain more there about how pruning works, then
> it should be a fixup patch to 9fdb675fc, no?

Yea, it's about that. Sorry for accidentally jumping on the wrong
thread.

Greetings,

Andres Freund
On Sat, Apr 7, 2018 at 1:58 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
>> I've not followed this thread/feature at all, but I don't find the
>> comments atop partprune.c even remotely sufficient. Unless there's a
>> README or such hidden somewhere?
>
> Sorry, there isn't a README, and I agree partprune.c's header comment
> could be improved quite a bit.
>
> Just to be clear, that's the fault of the patch that was already
> committed earlier today (9fdb675fc "Faster partition pruning"), not
> this patch, which just extends partprune.c's functionality to
> implement additional planner and executor support for runtime pruning.
>
> I'm drafting a patch that expands the partprune.c comment and will post
> it shortly.

See if the attached makes it any better.

Now, I know we don't have the runtime pruning in yet, but since the
proposed patch would extend its functionality, I have included its
description in the comment.

Thanks,
Amit
Attachment
On 7 April 2018 at 16:26, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Everything else looks fine from my point of view.
>
> Me too, although I still think having struct names PartitionPruning
> and PartitionRelPruning is going to be a bit confusing. We should
> think about naming the latter to something else. I suggested
> PartitionPruningDispatch(Data), but David doesn't seem to like it.
> Maybe, PartitionPruneState, because it parallels the
> PartitionPruneInfo that comes from the planner for every partitioned
> table in the tree.

Ok, so I've gone and done this.

PartitionPruning has become PartitionPruneState
PartitionRelPruning has become PartitionPruningData

I've changed pointers to PartitionPruneStates to be named prunestate,
sometimes with the node prefix (as_, ma_); in these cases, prune and
state are separated with a _, which seems to be the general rule for
executor state struct members.

Generally, pointers to PartitionPruningData are now named pprune.
Hopefully that's ok, as this was the name previously used for
PartitionPruning pointers.

I applied the patch to get rid of as_noop_scan in favour of using a
special as_whichplan value. There was already one special value
(INVALID_SUBPLAN_INDEX), so it seemed better to build on that rather
than inventing something new. This also means we don't have to make the
AppendState struct any wider, which seems like a good thing to try
to do.

I made all the fixups which I mentioned in my review earlier, and also
re-removed the resultRelations parameter from make_partition_pruneinfo.
It sneaked back into v22.

v23 is attached.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Amit Langote wrote: > See if the attached makes it any better. > > Now I know we don't have the runtime pruning in yet, but since the > proposed patch would extend its functionality I have included its > description in the comment. Thanks! I edited it as attached, to 1. avoid mentioning functionality that doesn't yet exist, and 2. avoid excessive internal detail (we want a high-level overview here), which from experience gets outdated pretty quickly. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 8 April 2018 at 00:23, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I edited it as attached, to 1. avoid mentioning functionality that
> doesn't yet exist, and 2. avoid excessive internal detail (we want a
> high-level overview here), which from experience gets outdated pretty
> quickly.

It's not exactly wrong, but:

+ * are turned into a set of "pruning steps", which are then executed to
+ * produce a set of RTIs of partitions whose bounds satisfy the constraints in
+ * the step.  Partitions not in the set are said to have been pruned.

It's only prune_append_rel_partitions, which is used solely for the
planner's pruning needs, that converts the partition indexes to RTIs.

Would it be better to mention that the output is partition indexes?
Maybe:

"which are then executed to produce a set of partition indexes whose
bounds satisfy the constraints in the step. These partition indexes
may then be translated into RTIs", or maybe even not mention the RTIs.

-- 
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley wrote:
> It's not exactly wrong, but:
>
> + * are turned into a set of "pruning steps", which are then executed to
> + * produce a set of RTIs of partitions whose bounds satisfy the constraints in
> + * the step.  Partitions not in the set are said to have been pruned.
>
> It's only prune_append_rel_partitions, which is used solely for the
> planner's pruning needs, that converts the partition indexes to RTIs.
>
> Would it be better to mention that the output is partition indexes?
> Maybe:
>
> "which are then executed to produce a set of partition indexes whose
> bounds satisfy the constraints in the step. These partition indexes
> may then be translated into RTIs", or maybe even not mention the RTIs.

Amit had it as "indexes" also in his original. I wanted to avoid using
the "indexes" word alone, whose meaning is so overloaded.

How about this?

"... which are then executed to produce a set of partitions (as indexes
of resultRelInfo->part_rels array) that satisfy the constraints in the
step".

Maybe "the boundinfo array" instead of part_rels, which, as I understand
it, also uses the same indexing as the other array; and partprune mostly
works based on boundinfo anyway?

Not mentioning RTIs seems fine.

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8 April 2018 at 01:18, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Amit had it as "indexes" also in his original. I wanted to avoid using > the "indexes" word alone, whose meaning is so overloaded. hmm, good point. > How about this? > "... which are then executed to produce a set of partitions (as indexes > of resultRelInfo->part_rels array) that satisfy the constraints in the > step". Works for me, but with RelOptInfo rather than resultRelInfo. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley wrote: > On 8 April 2018 at 01:18, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Amit had it as "indexes" also in his original. I wanted to avoid using > > the "indexes" word alone, whose meaning is so overloaded. > > hmm, good point. > > > How about this? > > "... which are then executed to produce a set of partitions (as indexes > > of resultRelInfo->part_rels array) that satisfy the constraints in the > > step". > > Works for me, but with RelOptInfo rather than resultRelInfo. Oops, sorry about that. Pushed now, adding one line to get_matching_partitions also. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 04/07/2018 04:45 AM, David Rowley wrote: > Ok, so I've gone and done this. > > PartitionPruning has become PartitionPruneState > PartitionRelPruning has become PartitionPruningData > > I've changed pointers to PartitionPruneStates to be named prunestate, > sometimes having the node prefix; as_, ma_, in these cases prune and > state are separated with a _ which seems to be the general rule for > executor state struct members. > > Generally, pointers to PartitionPruningData are now named pprune. > Hopefully, that's ok, as this was the name previously used for > PartitionPruning pointers. > > I applied the patch to get rid of as_noop_scan in favour of using a > special as_whichplan value. There was already one special value > (INVALID_SUBPLAN_INDEX), so seemed better to build on that rather than > inventing something new. This also means we don't have to make the > AppendState struct and wider too, which seems like a good thing to try > to do. > > I made all the fixups which I mentioned in my review earlier and also > re-removed the resultRelation parameter from make_partition_pruneinfo. > It sneaked back into v22. > > v23 is attached. > Passes check-world. Changing explain.c to "Subplans Removed" as suggested by you in [1] is a good idea. [1] https://www.postgresql.org/message-id/CAKJS1f99JnkbOshdV_4zoJZ96DPtKeHMHv43JRL_ZdHRkkVKCA%40mail.gmail.com Best regards, Jesper
I pushed this patch -- 0001, 0002 and 0003 only. I did not include
anything from 0004 and 0005; I didn't even get to the point of reading
them, so that I could focus on the first part. I did not find anything
to complain about. I made a few adjustments and asked David to supply a
paragraph for perform.sgml (the "Using EXPLAIN" section), which is
included here. I also adopted Jesper's (actually David's) suggestion of
changing "Subplans Pruned" to "Subplans Removed" in the EXPLAIN output.

I had reservations about a relation_open() in the new executor code. It
seemed a bit odd; we don't have any other relation_open in the executor
anywhere. However, setting up the pruneinfo needs some stuff from
relcache for which I don't see a reasonable mechanism to pass through
the planner. I asked Andres about it on IM, and while he didn't endorse
the patch in any way, his quick opinion was that "it wasn't entirely
insane". I verified that we already hold lock on the relation.

While we didn't get fast pruning support for MergeAppend or the
DELETE/UPDATE parts, I think those are valuable and recommend
resubmitting those for PG12.

Thank you!

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
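The pattern being discussed is roughly the following sketch, assuming the executor already holds a lock on the partitioned table so NoLock is safe; "pinfo->reloid" is an assumed field name, not necessarily what the committed code uses:

/*
 * Briefly re-open the partitioned table to fetch relcache information
 * while setting up the prune state, relying on the lock held since
 * parse/plan time.
 */
Relation        partrel;
PartitionDesc   partdesc;

partrel = relation_open(pinfo->reloid, NoLock);
partdesc = RelationGetPartitionDesc(partrel);
/* ... copy what the prune context needs out of partdesc ... */
relation_close(partrel, NoLock);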
On 8 April 2018 at 09:13, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I pushed this patch -- 0001, 0002 and 0003 only. I did not include > anything from 0004 and 0005; I didn't even get to the point of reading > them, so that I could focus on the first part. Oh great! Thank you for working on this and pushing it, especially so during your weekend. > While we didn't get fast pruning support for MergeAppend or the > DELETE/UPDATE parts, I think those are valuable and recommend to > resubmit those for PG12. Thanks. I'll certainly be doing that. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I had reservations about a relation_open() in the new executor code. It > seemed a bit odd; we don't have any other relation_open in the executor > anywhere. However, setting up the pruneinfo needs some stuff from > relcache that I don't see a reasonable mechanism to pass through > planner. I asked Andres about it on IM and while he didn't endorse the > patch in any way, his quick opinion was that "it wasn't entirely > insane". I verified that we already hold lock on the relation. I don't get this. The executor surely had to (and did) open all of the relations somewhere even before this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I had reservations about a relation_open() in the new executor code. It > > seemed a bit odd; we don't have any other relation_open in the executor > > anywhere. However, setting up the pruneinfo needs some stuff from > > relcache that I don't see a reasonable mechanism to pass through > > planner. I asked Andres about it on IM and while he didn't endorse the > > patch in any way, his quick opinion was that "it wasn't entirely > > insane". I verified that we already hold lock on the relation. > > I don't get this. The executor surely had to (and did) open all of > the relations somewhere even before this patch. Yeah. I was worried that this coding could be seen as breaking modularity, or trying to do excessive work. However, after looking closer at it, it doesn't really look like it's the case. So, nevermind. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Robert Haas wrote: >> On Sat, Apr 7, 2018 at 5:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> > I had reservations about a relation_open() in the new executor code. It >> > seemed a bit odd; we don't have any other relation_open in the executor >> > anywhere. However, setting up the pruneinfo needs some stuff from >> > relcache that I don't see a reasonable mechanism to pass through >> > planner. I asked Andres about it on IM and while he didn't endorse the >> > patch in any way, his quick opinion was that "it wasn't entirely >> > insane". I verified that we already hold lock on the relation. >> >> I don't get this. The executor surely had to (and did) open all of >> the relations somewhere even before this patch. > > Yeah. > > I was worried that this coding could be seen as breaking modularity, or > trying to do excessive work. However, after looking closer at it, it > doesn't really look like it's the case. So, nevermind. Well what I'm saying is that it shouldn't be necessary. If the relations are being opened already and the pointers to the relcache entries are being saved someplace, you shouldn't need to re-open them elsewhere to get pointers to the relcache entries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Robert Haas wrote: > >> I don't get this. The executor surely had to (and did) open all of > >> the relations somewhere even before this patch. > > > > I was worried that this coding could be seen as breaking modularity, or > > trying to do excessive work. However, after looking closer at it, it > > doesn't really look like it's the case. So, nevermind. > > Well what I'm saying is that it shouldn't be necessary. If the > relations are being opened already and the pointers to the relcache > entries are being saved someplace, you shouldn't need to re-open them > elsewhere to get pointers to the relcache entries. Oh, okay. I can look into that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas wrote: > On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> I don't get this. The executor surely had to (and did) open all of > >> the relations somewhere even before this patch. > > I was worried that this coding could be seen as breaking modularity, or > > trying to do excessive work. However, after looking closer at it, it > > doesn't really look like it's the case. So, nevermind. > > Well what I'm saying is that it shouldn't be necessary. If the > relations are being opened already and the pointers to the relcache > entries are being saved someplace, you shouldn't need to re-open them > elsewhere to get pointers to the relcache entries. I looked a bit more into this. It turns out that we have indeed opened the relation before -- first in parserOpenTable (for addRangeTableEntry), then in expandRTE, then in QueryRewrite, then in subquery_planner, then in get_relation_info. So, frankly, since each module thinks it's okay to open it every once in a while, I'm not sure we should be terribly stressed about doing it once more for partition pruning. Particularly since communicating the pointer seems to be quite troublesome. To figure out, I used the attached patch (not intended for application) to add a backtrace to each log message, plus a couple of accusatory elog() calls in relation_open and ExecSetupPartitionPruneState. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/04/11 6:32, Alvaro Herrera wrote: > Robert Haas wrote: >> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >>>> I don't get this. The executor surely had to (and did) open all of >>>> the relations somewhere even before this patch. > >>> I was worried that this coding could be seen as breaking modularity, or >>> trying to do excessive work. However, after looking closer at it, it >>> doesn't really look like it's the case. So, nevermind. >> >> Well what I'm saying is that it shouldn't be necessary. If the >> relations are being opened already and the pointers to the relcache >> entries are being saved someplace, you shouldn't need to re-open them >> elsewhere to get pointers to the relcache entries. > > I looked a bit more into this. It turns out that we have indeed opened > the relation before -- first in parserOpenTable (for addRangeTableEntry), > then in expandRTE, then in QueryRewrite, then in subquery_planner, then > in get_relation_info. > > So, frankly, since each module thinks it's okay to open it every once in > a while, I'm not sure we should be terribly stressed about doing it once > more for partition pruning. Particularly since communicating the > pointer seems to be quite troublesome. Maybe Robert was saying they are opened somewhere in the executor itself, before ExecInitAppend, or more precisely ExecSetupPartitionPruneState, is called. But that doesn't seem to be the case. For the result relation case (insert/update/delete on a partitioned table), we don't need to do extra relation_opens, as InitPlan opens the needed relations when building the ResultRelInfo(s), from where they're later accessed, for example, in ExecInitModifyTable. However, in the Append/MergeAppend cases, we don't, at any earlier point in the executor, open the partitioned tables. InitPlan doesn't touch them. In ExecInitAppend, ExecLockNonLeafAppendTables, which we call before calling ExecSetupPartitionPruneState, does not open them, just locks them using LockRelationOid. So, relation_open on partitioned tables in ExecSetupPartitionPruneState seems to be the first time they're opened *in the executor*. Thanks, Amit
On 11 April 2018 at 09:32, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Robert Haas wrote: >> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> >> I don't get this. The executor surely had to (and did) open all of >> >> the relations somewhere even before this patch. > >> > I was worried that this coding could be seen as breaking modularity, or >> > trying to do excessive work. However, after looking closer at it, it >> > doesn't really look like it's the case. So, nevermind. >> >> Well what I'm saying is that it shouldn't be necessary. If the >> relations are being opened already and the pointers to the relcache >> entries are being saved someplace, you shouldn't need to re-open them >> elsewhere to get pointers to the relcache entries. > > I looked a bit more into this. It turns out that we have indeed opened > the relation before -- first in parserOpenTable (for addRangeTableEntry), > then in expandRTE, then in QueryRewrite, then in subquery_planner, then > in get_relation_info. > > So, frankly, since each module thinks it's okay to open it every once in > a while, I'm not sure we should be terribly stressed about doing it once > more for partition pruning. Particularly since communicating the > pointer seems to be quite troublesome. I guess the problem there would be there's nothing to say that parse analysis will shortly be followed by a call to the planner, and a call to the planner does not mean the plan is about to be executed. So I don't think it would be possible to keep pointers to relcache entries between these modules, and it would be hard to determine whose responsibility it would be to call relation_close(). It might be possible to do something better in each module by keeping an array indexed by RTI which has each entry NULL initially and then, on first relation_open, sets the element in the array to that pointer. This might mean we'd save a few relation_open calls, but I don't know if there would be a way to somehow remove the Relation from the array on relation_close. Having something like this might mean we could detect lock upgrade hazards more easily, but the whole thing is a cache on top of a cache which does seem a bit weird. relation_open() should be pretty cheap if the relation is already open. It's just a hash table lookup. What is described above just changes that to an array lookup. It also does nothing for index_open. However, something like the above would simplify ExecLockNonLeafAppendTables() a bit and get rid of the O(N^2) check that the partition is not a result relation. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
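To make that idea concrete, here is a minimal sketch of the RTI-indexed array; the function name, the es_cached_rels field, and the assumption that the array is allocated NULL-filled alongside the range table are hypothetical, not from any posted patch:

/* Hypothetical: memoize Relation pointers in an EState array indexed by RTI. */
#include "postgres.h"
#include "access/heapam.h"      /* relation_open */
#include "nodes/execnodes.h"    /* EState */
#include "parser/parsetree.h"   /* getrelid */

Relation
ExecGetCachedRelation(EState *estate, Index rti)
{
    Relation    rel = estate->es_cached_rels[rti - 1];  /* assumed field */

    if (rel == NULL)
    {
        Oid         relid = getrelid(rti, estate->es_range_table);

        /* the appropriate lock is assumed to be held already, per the discussion above */
        rel = relation_open(relid, NoLock);
        estate->es_cached_rels[rti - 1] = rel;
    }

    return rel;
}

Closing could then happen in one sweep at ExecEndPlan time, which would also settle whose job it is to call relation_close(); entries that were never opened simply stay NULL.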
On Thu, Apr 12, 2018 at 12:40 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I guess the problem there would be there's nothing to say that parse > analysis will shortly be followed by a call to the planner, and a call > to the planner does not mean the plan is about to be executed. So I > don't think it would be possible to keep pointers to relcache entries > between these modules, and it would be hard to determine whose > responsibility it would be to call relation_close(). Yeah, that's definitely a non-starter. > It might be possible to do something better in each module by keeping > an array indexed by RTI which have each entry NULL initially then on > first relation_open set the element in the array to that pointer. I'm not sure that makes a lot of sense in the planner, but in the executor it might be a good idea. See also https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com for related discussion. I think that a coding pattern where we rely on relation_open(..., NoLock) is inherently dangerous -- it's too easy to be wrong about whether the lock is sure to have been taken. It would be much better to open the relation once and hold onto the pointer, not just for performance reasons, but for robustness. BTW, looking at ExecSetupPartitionPruneState: /* * Create a sub memory context which we'll use when making calls to the * query planner's function to determine which partitions will match. The * planner is not too careful about freeing memory, so we'll ensure we * call the function in this context to avoid any memory leaking in the * executor's memory context. */ This is a sloppy cut-and-paste job, not only because somebody changed one copy of the word "planner" to "executor" and left the others untouched, but also because the rationale isn't really correct for the executor anyway, which has memory contexts all over the place and frees them all the time. I don't know whether the context is not needed at all or whether the context is needed but the rationale is different, but I don't buy that explanation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13 April 2018 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote: > BTW, looking at ExecSetupPartitionPruneState: > > /* > * Create a sub memory context which we'll use when making calls to the > * query planner's function to determine which partitions will > match. The > * planner is not too careful about freeing memory, so we'll ensure we > * call the function in this context to avoid any memory leaking in the > * executor's memory context. > */ > > This is a sloppy cut-and-paste job, not only because somebody changed > one copy of the word "planner" to "executor" and left the others > untouched, but also because the rationale isn't really correct for the > executor anyway, which has memory contexts all over the place and > frees them all the time. I don't know whether the context is not > needed at all or whether the context is needed but the rationale is > different, but I don't buy that explanation. The comment is written exactly as intended. Unsure which of the "planner"s you think should be "executor". The context is needed. I can easily produce an OOM without it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/04/13 1:57, Robert Haas wrote: >> It might be possible to do something better in each module by keeping >> an array indexed by RTI which have each entry NULL initially then on >> first relation_open set the element in the array to that pointer. > > I'm not sure that makes a lot of sense in the planner, but in the > executor it might be a good idea. See also > https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com > for related discussion. I think that a coding pattern where we rely > on relation_open(..., NoLock) is inherently dangerous -- it's too easy > to be wrong about whether the lock is sure to have been taken. It > would be much better to open the relation once and hold onto the > pointer, not just for performance reasons, but for robustness. About the specific relation_open(.., NoLock) under question, I think there might be a way to address this by opening the tables with the appropriate lock mode in partitioned_rels list in ExecLockNonleafAppendTables and keeping the Relation pointers around until ExecEndNode. Then instead of ExecSetupPartitionPruneState doing relation_open/close(.., NoLock), it just reuses the one that's passed by the caller. Attached a PoC patch. David, thoughts? Thanks, Amit
On 2018/04/13 14:38, Amit Langote wrote: > About the specific relation_open(.., NoLock) under question, I think there > might be a way to address this by opening the tables with the appropriate > lock mode in partitioned_rels list in ExecLockNonleafAppendTables That may have sounded a bit confusing: I meant to say: "by opening the tables in partitioned_rels list with the appropriate lock mode in ExecLockNonleafAppendTables" > Attached a PoC patch. There were a couple of unnecessary hunks, which removed in the attached. Thanks, Amit
On 2018/04/13 14:48, Amit Langote wrote: > On 2018/04/13 14:38, Amit Langote wrote: >> About the specific relation_open(.., NoLock) under question, I think there >> might be a way to address this by opening the tables with the appropriate >> lock mode in partitioned_rels list in ExecLockNonleafAppendTables > > That may have sounded a bit confusing: > > I meant to say: "by opening the tables in partitioned_rels list with the > appropriate lock mode in ExecLockNonleafAppendTables" > >> Attached a PoC patch. > > There were a couple of unnecessary hunks, which removed in the attached. Sorry, still a couple more were unnecessary. Thanks, Amit
On Thu, Apr 12, 2018 at 6:01 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 13 April 2018 at 04:57, Robert Haas <robertmhaas@gmail.com> wrote: >> BTW, looking at ExecSetupPartitionPruneState: >> >> /* >> * Create a sub memory context which we'll use when making calls to the >> * query planner's function to determine which partitions will >> match. The >> * planner is not too careful about freeing memory, so we'll ensure we >> * call the function in this context to avoid any memory leaking in the >> * executor's memory context. >> */ >> >> This is a sloppy cut-and-paste job, not only because somebody changed >> one copy of the word "planner" to "executor" and left the others >> untouched, but also because the rationale isn't really correct for the >> executor anyway, which has memory contexts all over the place and >> frees them all the time. I don't know whether the context is not >> needed at all or whether the context is needed but the rationale is >> different, but I don't buy that explanation. > > The comment is written exactly as intended. Unsure which of the > "planner"s you think should be "executor". > > The context is needed. I can easily produce an OOM without it. Oh, crap. You know, I totally misread what that comment was trying to say. Sorry. But I wonder why it's the executor's job to clean up after the planner, instead of adjusting the relevant planner functions to avoid leaking memory? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 14 April 2018 at 05:04, Robert Haas <robertmhaas@gmail.com> wrote: > But I wonder why it's the executor's job to clean up after the > planner, instead of adjusting the relevant planner functions to avoid > leaking memory? It might be possible, but it might also be risky and difficult. For a while, during my review of the faster partition pruning patch I was asking Amit to add pfree() calls in various places for this exact reason, but in the end, I gave up and decided it was easier to just create a new memory context to call the planner function from. I've now forgotten the exact reason why I finally decided it was too much trouble. The pruning code now works using your step logic so perhaps that reason no longer applies, although, on a quick scan of the pruning code now, it seems to require that get_matching_partitions performs a deep pfree of each PruneStepResult. However, there is still partkey_datum_from_expr which performs ExecInitExpr, although perhaps that can just be done once, and the result stashed in the PartitionPruneContext. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley wrote: > For a while, during my review of the faster partition pruning patch I > was asking Amit to add pfree() calls in various places for this exact > reason, but in the end, I gave up and decided it was easier to just > create a new memory context to call the planner function from. I've > now forgotten the exact reason why I finally decided it was too much > trouble. The pruning code now works using your step logic so perhaps > that reason no longer applies, although, on a quick scan of the > pruning code now, it seems to require that get_matching_partitions > performs a deep pfree of each PruneStepResult. However, there is still > partkey_datum_from_expr which performs ExecInitExpr, although perhaps > that can just be done once, and the result stashed in the > PartitionPruneContext. I think trying to replace a well-placed MemoryContextReset (or Delete) with a bunch of individual pfrees is pointless. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
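In code form, the pattern being defended looks roughly like this; get_matching_partitions and the memory-context calls are the existing APIs mentioned above, while the wrapper function and the prune_cxt context (created once at setup time with AllocSetContextCreate) are illustrative:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "partitioning/partprune.h" /* assumed header for get_matching_partitions */
#include "utils/memutils.h"

static Bitmapset *
prune_with_reset(MemoryContext prune_cxt,
                 PartitionPruneContext *context, List *pruning_steps)
{
    MemoryContext oldcxt = MemoryContextSwitchTo(prune_cxt);
    Bitmapset  *partset = get_matching_partitions(context, pruning_steps);

    MemoryContextSwitchTo(oldcxt);

    /* copy the result into the caller's context... */
    partset = bms_copy(partset);

    /* ...then reclaim everything the pruning code leaked, in one sweep */
    MemoryContextReset(prune_cxt);

    return partset;
}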
On 17 April 2018 at 14:33, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > David Rowley wrote: > >> For a while, during my review of the faster partition pruning patch I >> was asking Amit to add pfree() calls in various places for this exact >> reason, but in the end, I gave up and decided it was easier to just >> create a new memory context to call the planner function from. I've >> now forgotten the exact reason why I finally decided it was too much >> trouble. The pruning code now works using your step logic so perhaps >> that reason no longer applies, although, on a quick scan of the >> pruning code now, it seems to require that get_matching_partitions >> performs a deep pfree of each PruneStepResult. However, there is still >> partkey_datum_from_expr which performs ExecInitExpr, although perhaps >> that can just be done once, and the result stashed in the >> PartitionPruneContext. > > I think trying to replace a well-placed MemoryContextReset (or Delete) > with a bunch of individual pfrees is pointless. I agree. I think I'd sleep better at night with the context reset in there rather than hoping we've managed to pfree everything. I did go and start working on a patch to test how possible this would be and came up with the attached. I've left a stray MemoryContextStatsDetail call in there which does indicate that something is not being freed. I'm just not sure what it is yet. The patch does happen to improve performance slightly, but that is most likely due to the caching of the ExprStates rather than the change of memory management. It's not really possible to do that with the reset unless we stored the executor's memory context in PartitionPruneContext and did a context switch back inside partkey_datum_from_expr before calling ExecInitExpr. My test case was as follows: create table p (a int, value int) partition by hash (a); select 'create table p'||x|| ' partition of p for values with (modulus 10, remainder '||x||');' from generate_series(0,9) x; \gexec create table t1 (a int); insert into p select x,x from generate_Series(1,10000000) x; insert into t1 select x from generate_series(1,10000000) x; create index on p(a); set enable_hashjoin = 0; set enable_mergejoin = 0; explain analyze select count(*) from t1 inner join p on t1.a=p.a; -- Unpatched Execution Time: 19725.981 ms Execution Time: 19533.655 ms Execution Time: 19542.854 ms -- Patched Execution Time: 17389.537 ms Execution Time: 17603.802 ms Execution Time: 17618.670 ms -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Apr 16, 2018 at 10:46 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I did go and start working on a patch to test how possible this would > be and came up with the attached. I've left a stray > MemoryContextStatsDetail call in there which does indicate that > something is not being freed. I'm just not sure what it is yet. > > The patch does happen to improve performance slightly, but that is > most likely due to the caching of the ExprStates rather than the > change of memory management. It's not really possible to do that with > the reset unless we stored the executor's memory context in > PartitionPruneContext and did a context switch back inside > partkey_datum_from_expr before calling ExecInitExpr. 10% is more than a "slight" improvement, I'd say! It's certainly got to be worth avoiding the repeated calls to ExecInitExpr, whatever we do about the memory contexts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19 April 2018 at 03:13, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 16, 2018 at 10:46 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> I did go and start working on a patch to test how possible this would >> be and came up with the attached. I've left a stray >> MemoryContextStatsDetail call in there which does indicate that >> something is not being freed. I'm just not sure what it is yet. >> >> The patch does happen to improve performance slightly, but that is >> most likely due to the caching of the ExprStates rather than the >> change of memory management. It's not really possible to do that with >> the reset unless we stored the executor's memory context in >> PartitionPruneContext and did a context switch back inside >> partkey_datum_from_expr before calling ExecInitExpr. > > 10% is more than a "slight" improvement, I'd say! It's certainly got > to be worth avoiding the repeated calls to ExecInitExpr, whatever we > do about the memory contexts. I've attached a patch which does just this. On benchmarking again with this single change performance has improved 15% over master. Also, out of curiosity, I also checked what this performed like before the run-time pruning patch was committed (5c0675215). Taking the average of the times below, it seems without this patch the performance of this case has improved about 356% and about 410% with this patch. So, I agree, it might be worth considering. create table p (a int, value int) partition by hash (a); select 'create table p'||x|| ' partition of p for values with (modulus 10, remainder '||x||');' from generate_series(0,9) x; \gexec create table t1 (a int); insert into p select x,x from generate_Series(1,1000) x; insert into t1 select x from generate_series(1,1000) x; create index on p(a); set enable_hashjoin = 0; set enable_mergejoin = 0; explain analyze select count(*) from t1 inner join p on t1.a=p.a; -- Unpatched Execution Time: 20413.975 ms Execution Time: 20232.050 ms Execution Time: 20229.116 ms -- Patched Execution Time: 17758.111 ms Execution Time: 17645.151 ms Execution Time: 17492.260 ms -- 5c0675215e153ba1297fd494b34af2fdebd645d1 Execution Time: 72875.161 ms Execution Time: 71817.757 ms Execution Time: 72411.730 ms -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 19 April 2018 at 12:04, David Rowley <david.rowley@2ndquadrant.com> wrote: > insert into p select x,x from generate_Series(1,1000) x; > insert into t1 select x from generate_series(1,1000) x; Correction. These were meant to read: insert into p select x,x from generate_Series(1,10000000) x; insert into t1 select x from generate_series(1,10000000) x; -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi David. On 2018/04/19 9:04, David Rowley wrote: > On 19 April 2018 at 03:13, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Apr 16, 2018 at 10:46 PM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>> The patch does happen to improve performance slightly, but that is >>> most likely due to the caching of the ExprStates rather than the >>> change of memory management. It's not really possible to do that with >>> the reset unless we stored the executor's memory context in >>> PartitionPruneContext and did a context switch back inside >>> partkey_datum_from_expr before calling ExecInitExpr. >> >> 10% is more than a "slight" improvement, I'd say! It's certainly got >> to be worth avoiding the repeated calls to ExecInitExpr, whatever we >> do about the memory contexts. > > I've attached a patch which does just this. On benchmarking again with > this single change performance has improved 15% over master. > > Also, out of curiosity, I also checked what this performed like before > the run-time pruning patch was committed (5c0675215). Taking the > average of the times below, it seems without this patch the > performance of this case has improved about 356% and about 410% with > this patch. So, I agree, it might be worth considering. > > create table p (a int, value int) partition by hash (a); > select 'create table p'||x|| ' partition of p for values with (modulus > 10, remainder '||x||');' from generate_series(0,9) x; > \gexec > create table t1 (a int); > > insert into p select x,x from generate_Series(1,1000) x; > insert into t1 select x from generate_series(1,1000) x; > > create index on p(a); > > set enable_hashjoin = 0; > set enable_mergejoin = 0; > explain analyze select count(*) from t1 inner join p on t1.a=p.a; > > -- Unpatched > Execution Time: 20413.975 ms > Execution Time: 20232.050 ms > Execution Time: 20229.116 ms > > -- Patched > Execution Time: 17758.111 ms > Execution Time: 17645.151 ms > Execution Time: 17492.260 ms > > -- 5c0675215e153ba1297fd494b34af2fdebd645d1 > Execution Time: 72875.161 ms > Execution Time: 71817.757 ms > Execution Time: 72411.730 ms That's neat! Definitely agree that we should call ExecInitExpr just once here. The patch looks good too, except the long line. Maybe: @@ -1514,13 +1514,15 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) foreach(lc3, step->exprs) { Expr *expr = (Expr *) lfirst(lc3); + int step_id = step->step.step_id; /* * partkey_datum_from_expr does not need an expression state * to evaluate a Const. */ if (!IsA(expr, Const)) - context->exprstates[step->step.step_id * partnatts + keyno] = ExecInitExpr(expr, context->planstate); + context->exprstates[step_id * partnatts + keyno] = + ExecInitExpr(expr, context->planstate); Thanks, Amit
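Seen from the evaluation side, the cached ExprStates turn partkey_datum_from_expr into little more than an array lookup; stateidx here is the step_id * partnatts + keyno index from the hunk above. A simplified sketch (the return convention is condensed for brevity, so don't read it as the committed signature):

static Datum
partkey_datum_from_expr(PartitionPruneContext *context,
                        Expr *expr, int stateidx, bool *isnull)
{
    if (IsA(expr, Const))
    {
        /* Consts never needed an ExprState in the first place */
        Const      *con = (Const *) expr;

        *isnull = con->constisnull;
        return con->constvalue;
    }
    else
    {
        /* built once in ExecSetupPartitionPruneState, reused on every re-prune */
        ExprState  *exprstate = context->exprstates[stateidx];
        ExprContext *ectx = context->planstate->ps_ExprContext;

        return ExecEvalExprSwitchContext(exprstate, ectx, isnull);
    }
}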
Anybody wanna argue against pushing this patch now? I'm inclined to push it on the grounds of being closure for already committed work, but there are possible arguments about this being new development after feature freeze. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Langote wrote: > That's neat! Definitely agree that we should call ExecInitExpr just once > here. The patch looks good too, except the long line. How about this as a small tweak? Determine the array index using a macro, which serves as documentation. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Pushed. Thanks! -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-19 12:04:35 +1200, David Rowley wrote: > On 19 April 2018 at 03:13, Robert Haas <robertmhaas@gmail.com> wrote: > > 10% is more than a "slight" improvement, I'd say! It's certainly got > > to be worth avoiding the repeated calls to ExecInitExpr, whatever we > > do about the memory contexts. Yea, that seems important. Good that that got in. What I wonder, after skimming this change, is where the relevant expression context is reset? That's not really related to this change but the wider thread, I just noticed it while looking at this. Greetings, Andres Freund
Andres Freund wrote: > What I wonder, after skimming this change, is where the relevant > expression context is reset? That's not really related to this change > but the wider thread, I just noticed it while looking at this. Do you mean ResetExprContext? We use the plan's context, so it should occur together with the plan reset itself, i.e. nodeAppend.c should do it somewhere. ... Hmm, it appears we don't do it anywhere there actually. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 25 April 2018 at 06:21, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Andres Freund wrote: > >> What I wonder, after skimming this change, is where the relevant >> expression context is reset? That's not really related to this change >> but the wider thread, I just noticed it while looking at this. > > Do you mean ResetExprContext? We use the plan's context, so it should > occur together with the plan reset itself, i.e. nodeAppend.c should do > it somewhere. > > ... Hmm, it appears we don't do it anywhere there actually. It's not immediately obvious to me why this is required. All the expressions that are initialized here must live the entire length of the executor run, and since they're allocated in the ExecutorState context they'll be destroyed in FreeExecutorState(). If we do need to call ResetExprContext for these, then we'd just need to invent a teardown for ExecSetupPartitionPruneState which would free off the memory allocated and call ResetExprContext for all non-NULL ExprStates in each context->exprstates. This function would need to be called from the node's End function for any node that's set up a PartitionPruneState. Do we really need to do this given that the memory context these are allocated in will be released a moment later anyway? Just to ensure I'm not dreaming, I ran the following and couldn't see the backend's memory consumption move. create table lp (a int, value int) partition by list(a); create table lp_1 partition of lp for values in(1); create table lp_2 partition of lp for values in(2); create function lp_value(p_a int) returns int as $$ select value from lp where a = p_a; $$ language sql; insert into lp values(1,10),(2,20); select sum(lp_value(x)) from generate_Series(1,2) x, generate_series(1,10000000); -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 25 April 2018 at 05:10, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Pushed. Thanks! Thanks! -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-04-10 23:32, Alvaro Herrera wrote: > To figure out, I used the attached patch (not intended for application) > to add a backtrace to each log message, plus a couple of accusatory > elog() calls in relation_open and ExecSetupPartitionPruneState. What do people think about adding something like this errbacktrace() from Álvaro's patch to core PostgreSQL? If we could devise a way to selectively enable it, it might be an easier way for users to provide backtraces from errors. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 2018-04-10 23:32, Alvaro Herrera wrote: >> To figure out, I used the attached patch (not intended for application) >> to add a backtrace to each log message, plus a couple of accusatory >> elog() calls in relation_open and ExecSetupPartitionPruneState. > What do people think about adding something like this errbacktrace() > from Álvaro's patch to core PostgreSQL? If we could devise a way to > selectively enable it, it might be an easier way for users to provide > backtraces from errors. I think we did discuss it right after that, or somewhere nearby, and concluded that the output is so imprecise that it's not really going to be worth whatever portability issues we'd have to deal with. I'd be all for a better version, but glibc's backtrace() just sucks, at least given our coding style with lots of static functions. regards, tom lane
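The imprecision is easy to reproduce with a standalone toy program, nothing PostgreSQL-specific; note that -rdynamic is what lets backtrace_symbols() see non-static names at all:

/* Compile with: gcc -g -rdynamic demo.c -o demo */
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void
inner(void)                 /* static: typically misattributed in the output */
{
    void       *frames[32];
    int         nframes = backtrace(frames, 32);
    char      **symbols = backtrace_symbols(frames, nframes);

    if (symbols != NULL)
    {
        for (int i = 0; i < nframes; i++)
            printf("%s\n", symbols[i]);
        free(symbols);
    }
}

void
outer(void)                 /* global: resolves to its own name */
{
    inner();
}

int
main(void)
{
    outer();
    return 0;
}

Because backtrace_symbols() resolves addresses against the dynamic symbol table, the frame inside inner() comes out as an offset from the nearest exported symbol, which in a backend full of static functions can be a long way from the truth.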
I wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >> What do people think about adding something like this errbacktrace() >> from Álvaro's patch to core PostgreSQL? > I think we did discuss it right after that, or somewhere nearby, and > concluded that the output is so imprecise that it's not really going > to be worth whatever portability issues we'd have to deal with. Hmm, after some digging in the archives, the closest thing I can find is this thread: https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com where we discussed using libunwind instead, but people didn't like the extra dependency. However, I stand by the assertion that glibc's backtrace() is too imprecise to be useful; I've experimented with it and despaired of being able to tell where control had actually been. regards, tom lane
Hi, On 2019-05-24 11:34:58 -0400, Tom Lane wrote: > I wrote: > > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > >> What do people think about adding something like this errbacktrace() > >> from Álvaro's patch to core PostgreSQL? > > > I think we did discuss it right after that, or somewhere nearby, and > > concluded that the output is so imprecise that it's not really going > > to be worth whatever portability issues we'd have to deal with. > > Hmm, after some digging in the archives, the closest thing I can find > is this thread: > > https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com > > where we discussed using libunwind instead, but people didn't like > the extra dependency. Hm, I didn't actually see that much concern about that. I still think we should just go for libunwind. At least on debian it's likely to already be installed: andres@alap4:~$ apt rdepends libunwind8 libunwind8 Reverse Depends: Depends: libunwind-dev (= 1.2.1-9) Depends: linux-perf-4.16 Depends: linux-perf-4.15 Depends: linux-perf-4.14 Depends: rspamd Depends: linux-perf-5.0 Depends: libjulia1 Depends: julia Depends: geary Depends: libunwind8-dbgsym (= 1.2.1-9) Depends: xwayland Depends: xvfb Depends: xserver-xorg-core Depends: xserver-xephyr Depends: xnest Depends: xdmx Depends: trafficserver Depends: tigervnc-standalone-server Depends: tarantool Depends: strace Depends: spring Depends: rspamd Depends: linux-perf-4.19 Depends: libunwind-setjmp0 Depends: libeina1a Depends: libjulia1 Depends: julia Depends: intel-gpu-tools Depends: libheaptrack Depends: libgoogle-perftools4 Depends: libgoogle-glog0v5 Depends: gdnsd Depends: libevas1-engines-x Depends: libevas1 In particular strace, xserver-xorg-core, perf are reasonably likely to already installed. It's also not a large library. I'd bet if we made it an optional build-time dependency it'd get included by just about every distro. Greetings, Andres Freund
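For comparison, the local-unwinding flavour of the libunwind API in the same kind of standalone toy (again, not proposed PostgreSQL code):

/* Compile with: gcc demo.c -lunwind -o demo (some platforms also need the
 * architecture-specific library, e.g. -lunwind-x86_64). */
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

static void
print_backtrace(void)
{
    unw_context_t ctx;
    unw_cursor_t cursor;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);

    while (unw_step(&cursor) > 0)       /* walk up one frame at a time */
    {
        char        name[256];
        unw_word_t  ip,
                    off;

        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        if (unw_get_proc_name(&cursor, name, sizeof(name), &off) == 0)
            printf("0x%lx: %s+0x%lx\n",
                   (unsigned long) ip, name, (unsigned long) off);
        else
            printf("0x%lx: ??\n", (unsigned long) ip);
    }
}

int
main(void)
{
    print_backtrace();
    return 0;
}

How much better unw_get_proc_name() does on static functions depends on what symbol information is available to it, which is presumably part of Tom's question below.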
Andres Freund <andres@anarazel.de> writes: > On 2019-05-24 11:34:58 -0400, Tom Lane wrote: >> Hmm, after some digging in the archives, the closest thing I can find >> is this thread: >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com >> where we discussed using libunwind instead, but people didn't like >> the extra dependency. > Hm, I didn't actually see that much concern about that. I still think we > should just go for libunwind. Is it actually better? The basic problem with backtrace() is that it only knows about global functions, and so reports call sites in static functions as if they were in whatever global function physically precedes the static one. I think doing materially better requires depending on debug symbols, which (at least in the Red Hat world) aren't going to be there in a typical production situation. regards, tom lane
Hi, On 2019-05-24 12:08:57 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2019-05-24 11:34:58 -0400, Tom Lane wrote: > >> Hmm, after some digging in the archives, the closest thing I can find > >> is this thread: > >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com > >> where we discussed using libunwind instead, but people didn't like > >> the extra dependency. > > > Hm, I didn't actually see that much concern about that. I still think we > > should just go for libunwind. > > Is it actually better? I've not looked in a while, but I think at some point it was. > The basic problem with backtrace() is that it > only knows about global functions, and so reports call sites in static > functions as if they were in whatever global function physically precedes > the static one. Does that depend on whether the program was compiled with -fno-omit-frame-pointer? At least some distros now compile with frame pointers enabled, to get reasonably fast perf profiles (at a basically immeasurable slowdown, on modern-ish CPUs). > I think doing materially better requires depending on > debug symbols, which (at least in the Red Hat world) aren't going to > be there in a typical production situation. It's still a lot easier to install debug symbols than to attach a debugger and get a backtrace that way. Especially when the problem is hard to reproduce. Greetings, Andres Freund
On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote: >Hi, > >On 2019-05-24 12:08:57 -0400, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >> > On 2019-05-24 11:34:58 -0400, Tom Lane wrote: >> >> Hmm, after some digging in the archives, the closest thing I can find >> >> is this thread: >> >> https://www.postgresql.org/message-id/flat/CAMsr%2BYGL%2ByfWE%3DJvbUbnpWtrRZNey7hJ07%2BzT4bYJdVp4Szdrg%40mail.gmail.com >> >> where we discussed using libunwind instead, but people didn't like >> >> the extra dependency. >> >> > Hm, I didn't actually see that much concern about that. I still think we >> > should just go for libunwind. >> >> Is it actually better? > >I've not looked in a while, but I think at some point it was. > > >> The basic problem with backtrace() is that it >> only knows about global functions, and so reports call sites in static >> functions as if they were in whatever global function physically precedes >> the static one. > >Does that depend on whether the program was compiled with >-fno-omit-frame-pointer? At least some distros now compile with frame >pointers enabled, to get reasonably fast perf profiles (at a basically >immeasurable slowdown, on modern-ish CPUs). > I doubt that, because if that were the case we'd not be able to get accurate profiles from perf, no? And AFAICS that's not the case, irrespective of whether -fno-omit-frame-pointer is used. > >> I think doing materially better requires depending on >> debug symbols, which (at least in the Red Hat world) aren't going to >> be there in a typical production situation. > >It's still a lot easier to install debug symbols than to attach a >debugger and get a backtrace that way. Especially when the problem is >hard to reproduce. > Right. A debugger requires interaction with a running process, while having it integrated would make that unnecessary. But I think Peter also suggested this might require the ability to selectively enable the backtraces, and I think he's right. I doubt we want to log them for every log message, right? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-05-25 00:42:39 +0200, Tomas Vondra wrote: > On Fri, May 24, 2019 at 09:24:28AM -0700, Andres Freund wrote: > > On 2019-05-24 12:08:57 -0400, Tom Lane wrote: > > > Andres Freund <andres@anarazel.de> writes: > > > The basic problem with backtrace() is that it > > > only knows about global functions, and so reports call sites in static > > > functions as if they were in whatever global function physically precedes > > > the static one. > > > > Does that depend on whether the program was compiled with > > -fno-omit-frame-pointer? At least some distros now compile with frame > > pointers enabled, to get reasonably fast perf profiles (at a basically > > immeasurable slowdown, on modern-ish CPUs). > > > > I doubt that, because if that were the case we'd not be able to get > accurate profiles from perf, no? And AFAICS that's not the case, > irrespective of whether -fno-omit-frame-pointer is used. I can't parse this. With perf you can get accurate call-graph profiles if you either use -fno-omit-frame-pointer, to force frame pointers to be present (so the call graph can cheaply be assembled during profiling), or with dwarf (the entire stack is saved, and then dwarf unwinding happens at perf report time - very large), or with lbr (CPU saves traces of branches taken, enabling call graphs to be computed, but it needs the call chains to not be very deep). > > > I think doing materially better requires depending on > > > debug symbols, which (at least in the Red Hat world) aren't going to > > > be there in a typical production situation. > > > > It's still a lot easier to install debug symbols than to attach a > > debugger and get a backtrace that way. Especially when the problem is > > hard to reproduce. > > > > Right. A debugger requires interaction with a running process, while > having it integrated would make that unnecessary. > > But I think Peter also suggested this might require the ability to > selectively enable the backtraces, and I think he's right. I doubt we > want to log them for every log message, right? Well, I think that if we had PANIC, SIGSEGV/SIGBUS, and most FATALs covered, we'd be off to a very good start. I'm not sure it's wise to give users control over the computation of stack traces. Greetings, Andres Freund
On Fri, May 24, 2019 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Is it actually better? The basic problem with backtrace() is that it > only knows about global functions, and so reports call sites in static > functions as if they were in whatever global function physically precedes > the static one. I think doing materially better requires depending on > debug symbols, which (at least in the Red Hat world) aren't going to > be there in a typical production situation. I don't have an opinion on glibc vs. libunwind, but I don't understand this argument. If you are unlucky enough to have a production server that is crashing due to some hitherto-unknown bug, and if it's not possible to get a good backtrace without installing debugging symbols, then you are going to have to pick between (1) installing those debugging symbols and (2) getting a poor backtrace. I don't really see that as a problem so much as just the way life is. You can't expect to get good debugging output without debugging symbols, just as you can't expect to get a clean audit without bank statements or a clear idea of how to find your way to an unknown destination without a map. If you don't have the thing that contains the information that is needed, you can't get the information; c'est la vie. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, May 24, 2019 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Is it actually better? The basic problem with backtrace() is that it >> only knows about global functions, and so reports call sites in static >> functions as if they were in whatever global function physically precedes >> the static one. I think doing materially better requires depending on >> debug symbols, which (at least in the Red Hat world) aren't going to >> be there in a typical production situation. > I don't have an opinion on glibc vs. libunwind, but I don't understand > this argument. If you are unlucky enough to have a production server > that is crashing due to some hitherto-unknown bug, and if it's not > possible to get a good backtrace without installing debugging symbols, > then you are going to have to pick between (1) installing those > debugging symbols and (2) getting a poor backtrace. I don't really > see that as a problem so much as just the way life is. Well, it *is* a problem. The whole point of this discussion I think is to try to get better information "by default" for routine bug reports. So if those come from production servers without debug symbols, which I believe will be the usual case, then it seems likely to me that libunwind will produce no better results than glibc. (But perhaps I'm wrong about that --- I have not experimented with libunwind.) Now it's true that "install debug symbols" is less of an ask than "install debug symbols, *and* gdb, and make sure server core dumps are enabled, and then go through this arcane manual procedure next time you get a core dump". But we shouldn't fool ourselves that it isn't an ask that's going to be hard for people with corporate policies against installing extra stuff on production servers. regards, tom lane
On Wed, May 29, 2019 at 6:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, it *is* a problem. The whole point of this discussion I think is > to try to get better information "by default" for routine bug reports. > So if those come from production servers without debug symbols, which > I believe will be the usual case, then it seems likely to me that > libunwind will produce no better results than glibc. (But perhaps > I'm wrong about that --- I have not experimented with libunwind.) Sure, I agree. > Now it's true that "install debug symbols" is less of an ask than > "install debug symbols, *and* gdb, and make sure server core dumps are > enabled, and then go through this arcane manual procedure next time > you get a core dump". But we shouldn't fool ourselves that it isn't > an ask that's going to be hard for people with corporate policies > against installing extra stuff on production servers. There may be cases where that is true, but as you say, it's better than what we have now. Plus, what exactly is the alternative? We could: - encourage packagers to install debug symbols by default (but they might not; it might even be against policy), or - invent our own system for generating backtraces and ignore what the OS toolchain knows how to do (sounds painfully complex and expensive), or - just live with the fact that it's imperfect. Is there a fourth option? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior. It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ. It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option. But this patch isn't going
to fix any of that. The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning. I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.
I just realized this issue recently and reported it at [1], then Amit pointed
me to this issue being discussed here, so I would like to continue this topic
here.
I think we can split the issue into 2 issues. One is the pruning done at
the initial partition prune step, which may happen in the custom plan case
only and causes the above issue. The other one happens in the "Run-Time"
partition prune; I admit that is an important issue to resolve as well, but
it looks harder. So I think we can fix the first one first.
The proposal to fix the first one is that we can remember how many partitions
survived plan-time pruning for a RelOptInfo in a custom plan, and record
that information in the CachedPlanSource. When we compute the cost of a
generic plan, we can reduce the cost based on that information.
Any thoughts about this? Apologies if I missed some already existing discussion
on this topic.
Best Regards
Andy Fan
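A minimal sketch of the bookkeeping that proposal implies; the struct, its fields, and both functions are hypothetical names, since no patch exists for this yet:

#include "postgres.h"

/* Hypothetical per-partitioned-rel pruning stats, kept in CachedPlanSource. */
typedef struct PartPruneStats
{
    Oid         partrelid;      /* OID of the partitioned table */
    int         nparts_total;   /* partitions before plan-time pruning */
    double      avg_survivors;  /* running average of surviving partitions */
    int         nsamples;       /* custom plans sampled so far */
} PartPruneStats;

/* After each custom plan: fold in how many partitions survived pruning. */
static void
record_prune_stats(PartPruneStats *stats, int nparts_total, int nsurvived)
{
    stats->nparts_total = nparts_total;
    stats->avg_survivors =
        (stats->avg_survivors * stats->nsamples + nsurvived) /
        (stats->nsamples + 1);
    stats->nsamples++;
}

/* Fraction by which a generic plan's Append cost could be discounted. */
static double
prune_discount(const PartPruneStats *stats)
{
    if (stats->nsamples == 0 || stats->nparts_total == 0)
        return 1.0;             /* no evidence yet: assume no pruning */
    return stats->avg_survivors / stats->nparts_total;
}

The generic-vs-custom comparison in plancache.c could then scale the Append portion of the generic plan's estimated cost by prune_discount() before comparing.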
On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior. It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ. It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option. But this patch isn't going
to fix any of that. The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning. I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.

I just realized this issue recently and reported it at [1], then Amit pointed
me to this issue being discussed here, so I would like to continue this topic
here.

I think we can split the issue into 2 issues. One is the pruning done at
the initial partition prune step, which may happen in the custom plan case
only and causes the above issue. The other one happens in the "Run-Time"
partition prune; I admit that is an important issue to resolve as well, but
it looks harder. So I think we can fix the first one first.

... When we compute the cost of a generic plan, we can reduce the cost based
on that information.
This way doesn't work since, after the initial partition prune, not only should
the cost of the Append node be reduced, but the cost of other plans should
be reduced as well [1].
However, I think if we can use partition prune information from a custom plan
at the cost_append_path stage, it looks like the issue can be fixed. If so, the
idea is similar to David's idea in [2]; however, Robert didn't agree with it [2].
Can anyone elaborate on this objection? For partkey > $1 or BETWEEN cases,
some real results from the past are probably better than some hard-coded
assumptions, IMO.
Best Regards
Andy Fan
On Wed, Oct 7, 2020 at 5:05 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior. It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ. It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option. But this patch isn't going
to fix any of that. The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning. I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.

I just realized this issue recently and reported it at [1], then Amit pointed
me to this issue being discussed here, so I would like to continue this topic
here.

I think we can split the issue into 2 issues. One is the pruning done at
the initial partition prune step, which may happen in the custom plan case
only and causes the above issue. The other one happens in the "Run-Time"
partition prune; I admit that is an important issue to resolve as well, but
it looks harder. So I think we can fix the first one first.

... When we compute the cost of a generic plan, we can reduce the cost based
on that information.

This way doesn't work since, after the initial partition prune, not only should
the cost of the Append node be reduced, but the cost of other plans should
be reduced as well [1].

However, I think if we can use partition prune information from a custom plan
at the cost_append_path stage, it looks like the issue can be fixed. If so, the
idea is similar to David's idea in [2]; however, Robert didn't agree with it [2].
Can anyone elaborate on this objection? For partkey > $1 or BETWEEN cases,
some real results from the past are probably better than some hard-coded
assumptions, IMO.
I can understand Robert's idea now: he intended to resolve both the
"initial-partition-prune" case and the "runtime partition prune" case, while I
was always thinking about the former case (Amit reminded me about that at the
beginning, but I only just woke up to it..)
With Robert's idea, I think we can do some hacking in create_append_path:
we can figure out the pruneinfo at that stage (it is done in create_append_plan
now), and then do a fix_append_path with that pruneinfo. We need to fix not
only the cost of the AppendPath, but also the rows of the AppendPath. For
example:
SELECT .. FROM t1, t2, p WHERE t1.a = p.partkey AND t1.b = t2.b, if the
plan is:

Nest Loop
  -> Nest Loop
       -> t1
       -> Append (p)
  -> t2

Then the rows of Append (p) will be very important.
For this idea, how to estimate how many partitions/rows can be pruned is a key
part. Robert said "I was thinking, rather, that if we know for example that
we're doing pruning on partition_column = $1, then we know that only one
partition will match. That's probably a common case. If we've
got partition_column > $1, we could assume that, say, 75% of the
partitions would match. partition_column BETWEEN $1 and $2 is
probably a bit more selective, so maybe we assume 50% of the
partitions would match.". I think we can't say that 75% or 50% is a good
number, but the key point may be that "partition_column = $1" is a common
case. And for the other cases, we probably won't make them worse.
I think we need to do something here; any thoughts? Personally, I like
this idea more now.
Best Regards
Andy Fan
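The shape of that fix_append_path idea, as a sketch; the function and its survive_frac input are hypothetical, and estimating that fraction is exactly the open question discussed next:

/* Hypothetical: scale an AppendPath's output and cost once we have an
 * estimate of the fraction of partitions that run-time pruning will keep,
 * so that joins above the Append plan against realistic numbers. */
#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"    /* clamp_row_est */

static void
fix_append_path(AppendPath *apath, double survive_frac)
{
    Assert(survive_frac >= 0.0 && survive_frac <= 1.0);

    apath->path.rows = clamp_row_est(apath->path.rows * survive_frac);

    /* crude: assumes cost is spread evenly across surviving subpaths */
    apath->path.startup_cost *= survive_frac;
    apath->path.total_cost *= survive_frac;
}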
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Wed, Oct 7, 2020 at 5:05 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>>>
>>>> Now, in my experience, the current system for custom plans vs. generic
>>>> plans doesn't approach the problem in this way at all, and in my
>>>> experience that results in some pretty terrible behavior. It will do
>>>> things like form a custom plan every time because the estimated cost
>>>> of the custom plan is lower than the estimated cost of the generic
>>>> plan even though the two plans are structurally identical; only the
>>>> estimates differ. It will waste gobs of CPU cycles by replanning a
>>>> primary key lookup 5 times just on the off chance that a lookup on the
>>>> primary key index isn't the best option. But this patch isn't going
>>>> to fix any of that. The best we can probably do is try to adjust the
>>>> costing for Append paths in some way that reflects the costs and
>>>> benefits of pruning. I'm tentatively in favor of trying to do
>>>> something modest in that area, but I don't have a detailed proposal.
>>>>
>>> I just realized this issue recently and reported it at [1], then Amit pointed
>>> me to this issue being discussed here, so I would like to continue this topic
>>> here.
>>>
>>> I think we can split the issue into 2 issues. One is the partition prune in initial
>>> partition prune, which maybe happen in custom plan case only and caused
>>> the above issue. The other one happens in the "Run-Time" partition prune,
>>> I admit that is an important issue to resolve as well, but looks harder. So I
>>> think we can fix the first one at first.
>>>
>>> ... When we count for the cost of a
>>> generic plan, we can reduce the cost based on such information.
>>
>> This way doesn't work since after the initial partition prune, not only the
>> cost of the Append node should be reduced, the cost of other plans should
>> be reduced as well [1]
>>
>> However I think if we can use partition prune information from a custom plan
>> at the cost_append_path stage, it looks the issue can be fixed. If so, the idea
>> is similar to David's idea in [2], however Robert didn't agree with this[2].
>> Can anyone elaborate this objection? for a partkey > $1 or BETWEEN cases,
>> some real results from the past are probably better than some hard-coded
>> assumptions IMO.
>
> I can understand Robert's idea now, he intended to resolve both the
> "initial-partition-prune" case and "runtime partition prune" case while I always think
> about the former case (Amit reminded me about that at the beginning, but I just
> wake up right now..)
>
> With the Robert's idea, I think we can do some hack at create_append_path,
> we can figure out the pruneinfo (it was done at create_append_plan now) at that
> stage, and then did a fix_append_path with such pruneinfo. We need to fix not
> only the cost of AppendPath, but also rows of AppendPath, For example:
> SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> plan is:
> Nest Loop
>   Nest Loop
>     t1
>     Append (p)
>   t2
>
> Then the rows of Append (p) will be very important.
>
> For this idea, how to estimate how many partitions/rows can be pruned is a key
> part. Robert said "I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match. That's probably a common case. If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match. partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.". I think we can't say the 75% or 50% is a good
> number, but the keypoint may be "partition_column = $1" is a common
> case. And for the others case, we probably don't make it worse.
>
> I think we need to do something here, any thoughts? Personally I'm more
> like this idea now.

Yes, I think we have to do something on those lines. But I am
wondering why our stats machinery is failing to do that already. For
equality I think it's straightforward. But even for other operators
we should use the statistics from the partitioned table to estimate
the selectivity and then adjust: number of scanned partitions = (number
of partitions * fraction of rows scanned). That might be better than
50% or 75%.

--
Best Wishes,
Ashutosh Bapat
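For illustration, a minimal sketch of the estimate Ashutosh describes,
assuming the pruning quals are available as a list of clauses. The helper
name and signature are invented for this sketch; clauselist_selectivity()
and clamp_row_est() are existing planner APIs:

/*
 * Hypothetical helper: estimate how many partitions a run-time-pruned
 * Append would actually scan, as nparts * (fraction of the parent's
 * rows that satisfy the pruning quals, per the parent's statistics).
 */
static double
estimate_scanned_partitions(PlannerInfo *root, RelOptInfo *parent,
                            List *prune_quals, int nparts)
{
    Selectivity frac = clauselist_selectivity(root, prune_quals,
                                              parent->relid, JOIN_INNER,
                                              NULL);

    /* clamp_row_est() keeps the estimate finite and at least 1 */
    return Min((double) nparts, clamp_row_est(nparts * frac));
}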
Hi Ashutosh:
On Thu, Oct 8, 2020 at 7:25 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Wed, Oct 7, 2020 at 5:05 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>>
>>
>> On Sun, Oct 4, 2020 at 3:10 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Now, in my experience, the current system for custom plans vs. generic
>>>> plans doesn't approach the problem in this way at all, and in my
>>>> experience that results in some pretty terrible behavior. It will do
>>>> things like form a custom plan every time because the estimated cost
>>>> of the custom plan is lower than the estimated cost of the generic
>>>> plan even though the two plans are structurally identical; only the
>>>> estimates differ. It will waste gobs of CPU cycles by replanning a
>>>> primary key lookup 5 times just on the off chance that a lookup on the
>>>> primary key index isn't the best option. But this patch isn't going
>>>> to fix any of that. The best we can probably do is try to adjust the
>>>> costing for Append paths in some way that reflects the costs and
>>>> benefits of pruning. I'm tentatively in favor of trying to do
>>>> something modest in that area, but I don't have a detailed proposal.
>>>>
>>>
>>> I just realized this issue recently and reported it at [1], then Amit pointed
>>> me to this issue being discussed here, so I would like to continue this topic
>>> here.
>>>
>>> I think we can split the issue into 2 issues. One is the partition prune in initial
>>> partition prune, which maybe happen in custom plan case only and caused
>>> the above issue. The other one happens in the "Run-Time" partition prune,
>>> I admit that is an important issue to resolve as well, but looks harder. So I
>>> think we can fix the first one at first.
>>>
>>> ... When we count for the cost of a
>>> generic plan, we can reduce the cost based on such information.
>>
>>
>> This way doesn't work since after the initial partition prune, not only the
>> cost of the Append node should be reduced, the cost of other plans should
>> be reduced as well [1]
>>
>> However I think if we can use partition prune information from a custom plan
>> at the cost_append_path stage, it looks the issue can be fixed. If so, the idea
>> is similar to David's idea in [2], however Robert didn't agree with this[2].
>> Can anyone elaborate this objection? for a partkey > $1 or BETWEEN cases,
>> some real results from the past are probably better than some hard-coded
>> assumptions IMO.
>
>
> I can understand Robert's idea now, he intended to resolve both the
> "initial-partition-prune" case and "runtime partition prune" case while I always think
> about the former case (Amit reminded me about that at the beginning, but I just
> wake up right now..)
>
> With the Robert's idea, I think we can do some hack at create_append_path,
> we can figure out the pruneinfo (it was done at create_append_plan now) at that
> stage, and then did a fix_append_path with such pruneinfo. We need to fix not
> only the cost of AppendPath, but also rows of AppendPath, For example:
> SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> plan is:
> Nest Loop
>   Nest Loop
>     t1
>     Append (p)
>   t2
>
> Then the rows of Append (p) will be very important.
>
> For this idea, how to estimate how many partitions/rows can be pruned is a key
> part. Robert said "I was thinking, rather, that if we know for example that
> we've doing pruning on partition_column = $1, then we know that only
> one partition will match. That's probably a common case. If we've
> got partition_column > $1, we could assume that, say, 75% of the
> partitions would match. partition_column BETWEEN $1 and $2 is
> probably a bit more selective, so maybe we assume 50% of the
> partitions would match.". I think we can't say the 75% or 50% is a good
> number, but the keypoint may be "partition_column = $1" is a common
> case. And for the others case, we probably don't make it worse.
>
> I think we need to do something here, any thoughts? Personally I'm more
> like this idea now.
Yes, I think we have to do something on those lines. But I am
wondering why our stats machinery is failing to do that already. For
equality I think it's straightforward. But even for other operators
we should use the statistics from the partitioned table to estimate
the selectivity and then adjust: number of scanned partitions = (number
of partitions * fraction of rows scanned). That might be better than
50% or 75%.
Sorry for the late reply! Suppose we have a partitioned table defined
like this:

p
- p1
- p2

When you talk about "the statistics from the partitioned table", do you
mean the statistics from p or from p1/p2? I just confirmed there are no
statistics for p (at least pg_class.reltuples = -1), so I think you are
talking about p1/p2.

Here we are talking about partkey = $1 or partkey = RunTimeValue.
Even though such a value can hit only one partition, we don't know the
exact value, so we treat all the partitions equally; nothing is wrong
with the partition-level estimation itself. However, when we cost the
Append path, we need to know that just one of the partitions can be
hit, and do something with that knowledge. Both AppendPath->rows and
AppendPath->total_cost should be adjusted (that doesn't happen now).
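As a minimal sketch of that adjustment for the equality case, assuming
exactly one of nparts equally sized children survives (the function is
hypothetical; rows and total_cost are the real Path fields):

/*
 * Hypothetical adjustment for partkey = $1: if run-time pruning
 * guarantees that only one of nparts children will be scanned, scale
 * the Append's estimates accordingly.  Crude, since it assumes the
 * children are equally sized.
 */
static void
adjust_append_path_for_pruning(AppendPath *apath, int nparts)
{
    if (nparts > 1)
    {
        apath->path.rows /= nparts;
        apath->path.total_cost /= nparts;
    }
}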
Best Regards
Andy Fan
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Sorry for the late reply! Suppose we have a partitioned table defined
> like this:
>
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or from p1/p2? I just confirmed there are no
> statistics for p (at least pg_class.reltuples = -1), so I think you are
> talking about p1/p2.

I am talking about p when I say statistics from the partitioned table.
I see that the pg_statistic row for p is well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based
on the rows in the individual unpruned partitions.

> Here we are talking about partkey = $1 or partkey = RunTimeValue.
> Even though such a value can hit only one partition, we don't know the
> exact value, so we treat all the partitions equally; nothing is wrong
> with the partition-level estimation itself. However, when we cost the
> Append path, we need to know that just one of the partitions can be
> hit, and do something with that knowledge. Both AppendPath->rows and
> AppendPath->total_cost should be adjusted (that doesn't happen now).

I think in this case we can safely assume that only one partition will
remain, so normalize the costs considering that only one partition will
survive.

--
Best Wishes,
Ashutosh Bapat
On Mon, Oct 12, 2020 at 5:48 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Sorry for the late reply! Suppose we have a partitioned table defined
> like this:
>
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or from p1/p2? I just confirmed there are no
> statistics for p (at least pg_class.reltuples = -1), so I think you are
> talking about p1/p2.
I am talking about p when I say statistics from the partitioned table.
I see that the pg_statistic row for p is well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based
on the rows in the individual unpruned partitions.
Glad to know that, thanks for this info!
>
> Here we are talking about partkey = $1 or partkey = RunTimeValue.
> Even though such a value can hit only one partition, we don't know the
> exact value, so we treat all the partitions equally; nothing is wrong
> with the partition-level estimation itself. However, when we cost the
> Append path, we need to know that just one of the partitions can be
> hit, and do something with that knowledge. Both AppendPath->rows and
> AppendPath->total_cost should be adjusted (that doesn't happen now).
I think in this case we can safely assume that only one partition will
remain, so normalize the costs considering that only one partition will
survive.
Exactly. What I am trying to do is fix this at create_append_path;
do you have different suggestions? As for the partkey > $1 case, I
think that even if we use partition-level statistics, the estimate
would still be hard-coded in effect, since we don't know what value
$1 is.

I have gone through the main part of the run-time partition prune
code and hope to post a runnable patch soon. The main idea is to fix
the rows/costs at the create_append_path stage, so any suggestion in
a different direction would be very useful.
Best Regards
Andy Fan
On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > I can understand Robert's idea now, he intended to resolve both the
> > "initial-partition-prune" case and "runtime partition prune" case while I always think
> > about the former case (Amit reminded me about that at the beginning, but I just
> > wake up right now..)
> >
> > With the Robert's idea, I think we can do some hack at create_append_path,
> > we can figure out the pruneinfo (it was done at create_append_plan now) at that
> > stage, and then did a fix_append_path with such pruneinfo. We need to fix not
> > only the cost of AppendPath, but also rows of AppendPath, For example:
> > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > plan is:
> > Nest Loop
> >   Nest Loop
> >     t1
> >     Append (p)
> >   t2
> >
> > Then the rows of Append (p) will be very important.
> >
> > For this idea, how to estimate how many partitions/rows can be pruned is a key
> > part. Robert said "I was thinking, rather, that if we know for example that
> > we've doing pruning on partition_column = $1, then we know that only
> > one partition will match. That's probably a common case. If we've
> > got partition_column > $1, we could assume that, say, 75% of the
> > partitions would match. partition_column BETWEEN $1 and $2 is
> > probably a bit more selective, so maybe we assume 50% of the
> > partitions would match.". I think we can't say the 75% or 50% is a good
> > number, but the keypoint may be "partition_column = $1" is a common
> > case. And for the others case, we probably don't make it worse.
> >
> > I think we need to do something here, any thoughts? Personally I'm more
> > like this idea now.
>
> Yes, I think we have to do something on those lines. But I am
> wondering why our stats machinery is failing to do that already. For
> equality I think it's straightforward. But even for other operators
> we should use the statistics from the partitioned table to estimate
> the selectivity and then adjust: number of scanned partitions = (number
> of partitions * fraction of rows scanned). That might be better than
> 50% or 75%.

This is an interesting idea, that is, the idea of consulting the parent
relation's stats to guess at the number of partitions that might be
scanned.

However, we don't currently consult an inheritance parent relation's
stats at all during "base" relation planning, which is why you don't
see the parent relation's statistics reflected in the row count and
costs assigned to its (Append et al.) paths. The hard-coded rule is to
sum up the children's rows and their paths' costs; see cost_append().

My thinking on this was that we'd just extend that hard-coded rule to
take into account that run-time pruning, if applicable in a given
plan, will cause some of those child paths to be discarded. Maybe
Andy is headed in that direction?

--
Amit Langote
EDB: http://www.enterprisedb.com
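As a rough sketch of the hard-coded rule Amit refers to (the real
cost_append() in costsize.c also handles startup costs, ordered appends,
partial paths, and parallelism), an unordered Append's estimates are
essentially sums over its children, which is where any run-time pruning
discount would have to be applied:

/*
 * Simplified version of cost_append()'s rule for an unordered Append:
 * accumulate the children's rows and costs.  A run-time-pruning
 * discount would multiply these sums by the expected fraction of
 * surviving children.
 */
static void
sum_append_children(AppendPath *apath)
{
    ListCell   *l;

    apath->path.rows = 0;
    apath->path.total_cost = 0;

    foreach(l, apath->subpaths)
    {
        Path       *subpath = (Path *) lfirst(l);

        apath->path.rows += subpath->rows;
        apath->path.total_cost += subpath->total_cost;
    }
}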
On Tue, Oct 13, 2020 at 3:48 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, Oct 8, 2020 at 8:25 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Wed, Oct 7, 2020 at 7:00 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > I can understand Robert's idea now, he intended to resolve both the
> > "initial-partition-prune" case and "runtime partition prune" case while I always think
> > about the former case (Amit reminded me about that at the beginning, but I just
> > wake up right now..)
> >
> > With the Robert's idea, I think we can do some hack at create_append_path,
> > we can figure out the pruneinfo (it was done at create_append_plan now) at that
> > stage, and then did a fix_append_path with such pruneinfo. We need to fix not
> > only the cost of AppendPath, but also rows of AppendPath, For example:
> > SELECT .. FROM t1, t2, p where t1.a = p.partkey and t1.b = t2.b, if the
> > plan is:
> > Nest Loop
> >   Nest Loop
> >     t1
> >     Append (p)
> >   t2
> >
> > Then the rows of Append (p) will be very important.
> >
> > For this idea, how to estimate how many partitions/rows can be pruned is a key
> > part. Robert said "I was thinking, rather, that if we know for example that
> > we've doing pruning on partition_column = $1, then we know that only
> > one partition will match. That's probably a common case. If we've
> > got partition_column > $1, we could assume that, say, 75% of the
> > partitions would match. partition_column BETWEEN $1 and $2 is
> > probably a bit more selective, so maybe we assume 50% of the
> > partitions would match.". I think we can't say the 75% or 50% is a good
> > number, but the keypoint may be "partition_column = $1" is a common
> > case. And for the others case, we probably don't make it worse.
> >
> > I think we need to do something here, any thoughts? Personally I'm more
> > like this idea now.
>
> Yes, I think we have to do something on those lines. But I am
> wondering why our stats machinery is failing to do that already. For
> equality I think it's straightforward. But even for other operators
> we should use the statistics from the partitioned table to estimate
> the selectivity and then adjust: number of scanned partitions = (number
> of partitions * fraction of rows scanned). That might be better than
> 50% or 75%.
This is an interesting idea, that is, the idea of consulting the parent
relation's stats to guess at the number of partitions that might be
scanned.

However, we don't currently consult an inheritance parent relation's
stats at all during "base" relation planning, which is why you don't
see the parent relation's statistics reflected in the row count and
costs assigned to its (Append et al.) paths. The hard-coded rule is to
sum up the children's rows and their paths' costs; see cost_append().
My thinking on this was that we'd just extend that hard-coded rule to
take into account that run-time pruning, if applicable in a given
plan, will cause some of those child paths to be discarded. Maybe
Andy is headed in that direction?
Yes, that's what I am trying to do. The minimum code in my mind is:
create_append_path(...)
{
    double  run_time_prune_est = 1.0;

    if (enable_partition_pruning && ...)
    {
        List   *partition_prune_step_ops = cal_prune_step_ops(rel, partitioned_rels);

        run_time_prune_est = estimate_runtime_prune_ratio(partition_prune_step_ops);
    }

    /* adjust the rows/costs of the AppendPath based on run_time_prune_est */
}
The key new functions would be cal_prune_step_ops and
estimate_runtime_prune_ratio. Since I don't have a runnable version yet,
some data structures may change, but the overall idea is something like
the above.
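For concreteness, one way estimate_runtime_prune_ratio() could apply
the heuristics discussed upthread. This is a sketch only: the step
representation reuses the existing PartitionPruneStepOp node, but the
function itself, its extra nparts argument, and the 0.75 fallback are
assumptions:

/*
 * Hypothetical estimate_runtime_prune_ratio(): map the pruning steps
 * to an expected surviving fraction of partitions.  Equality keeps a
 * single partition; other strategies keep a fixed, guessed fraction.
 */
static double
estimate_runtime_prune_ratio(List *prune_step_ops, int nparts)
{
    double      ratio = 1.0;
    ListCell   *lc;

    foreach(lc, prune_step_ops)
    {
        PartitionPruneStepOp *step = (PartitionPruneStepOp *) lfirst(lc);

        if (step->opstrategy == BTEqualStrategyNumber)
            ratio = Min(ratio, 1.0 / nparts);   /* partkey = $1 */
        else
            ratio = Min(ratio, 0.75);           /* partkey > $1, etc. */
    }

    return ratio;
}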
--
Best Regards
Andy Fan