Thread: speeding up planning with partitions
It is more or less well known that the planner doesn't perform well with
more than a few hundred partitions even when only a handful of partitions
are ultimately included in the plan.  The situation improved a bit in PG
11, where we replaced the older method of pruning partitions one-by-one
using constraint exclusion with a much faster method that finds relevant
partitions by using partitioning metadata.  However, we could only use it
for SELECT queries, because UPDATE/DELETE are handled by a completely
different code path, whose structure doesn't allow it to call the new
pruning module's functionality.  Actually, not being able to use the new
pruning is not the only problem for UPDATE/DELETE, more on which further
below.

While the situation improved with new pruning where it could, there are
still overheads in the way the planner handles partitions.  As things
stand today, it will spend cycles and allocate memory for partitions even
before pruning is performed, meaning most of that effort could be for
partitions that were better left untouched.  Currently, the planner will
lock and heap_open *all* partitions, create range table entries and
AppendRelInfos for them, and finally initialize RelOptInfos for them, even
touching the disk file of each partition in the process, in an earlier
phase of planning.  All of that processing is in vain for partitions that
are pruned, because they won't be included in the final plan.  This
problem grows worse as the number of partitions grows beyond thousands,
because the range table grows too big.

That could be fixed by delaying all that per-partition activity to a point
where pruning has already been performed, so that we know the partitions
to open and create planning data structures for, such as somewhere
downstream of query_planner.
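To make the difference between the two pruning styles concrete, here is a toy Python sketch for the hash-partitioned case. It is only an illustration, not PostgreSQL code: both function names are made up, and plain `key % nparts` stands in for the real hash partitioning rule, which hashes the key before taking the modulus.

```python
from typing import List


def prune_one_by_one(nparts: int, key: int) -> List[int]:
    """Constraint-exclusion style: test every partition's constraint in turn."""
    survivors = []
    for remainder in range(nparts):
        # Toy constraint: partition `remainder` accepts keys with
        # key % nparts == remainder.
        if key % nparts == remainder:
            survivors.append(remainder)
    return survivors


def prune_with_metadata(nparts: int, key: int) -> List[int]:
    """Metadata style: compute the matching partition directly from the rule."""
    return [key % nparts]
```

Both functions return the same answer, but the first does work proportional to the number of partitions while the second does a constant amount, which is the property the new pruning method is described as exploiting.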
But before we can do that we must do something about the fact that
UPDATE/DELETE won't be able to cope with that, because the code path that
currently handles the planning of UPDATE/DELETE on partitioned tables
(inheritance_planner called from subquery_planner) relies on
AppendRelInfos for all partitions having been initialized by an earlier
planning phase.  Delaying it to query_planner would be too late, because
inheritance_planner calls query_planner for each partition, not for the
parent.  That is, if query_planner, which is downstream of
inheritance_planner, was in charge of determining which partitions to
open, the latter wouldn't know which partitions to call the former for. :)

That would be fixed if there is no longer this ordering dependency, which
is what I propose to do with the attached patch 0001.  I've tried to
describe how the patch manages to do that in its commit message, but I'll
summarize here.  As things stand today, inheritance_planner modifies the
query for each leaf partition to make the partition act as the query's
result relation instead of the original partitioned table and calls
grouping_planner on the query.  That means anything that's joined to the
partitioned table looks to instead be joined to the partition and join
paths are generated likewise.  Also, the resulting path's targetlist is
adjusted to be suitable for the result partition.  Upon studying how this
works, I concluded that the same result can be achieved if we call
grouping_planner only once and repeat the portions of query_planner's and
grouping_planner's processing that generate the join paths and appropriate
target list, respectively, for each partition.  That way, we can rely on
query_planner determining result partitions for us, which in turn relies
on the faster partprune.c based method of pruning.  That speeds things up
in two ways: pruning is faster, and we no longer repeat common processing
for each partition.

With 0001 in place, there is nothing that requires that partitions be
opened by an earlier planning phase, so I propose patch 0002, which
refactors the opening and creation of planner data structures for
partitions such that it is now performed after pruning.  However, it
doesn't do anything about the fact that partitions are all still locked in
the earlier phase.

With various overheads gone thanks to 0001 and 0002, locking of all
partitions via find_all_inheritors can be seen as the single largest
bottleneck, which 0003 tries to address.  I've kept it a separate patch,
because I'll need to think a bit more to say that it's actually safe to
defer locking to late planning, due mainly to the concern about the change
in the order of locking from the current method.  I'm attaching it here,
because I also want to show the performance improvement we can expect with
it.

I measured the gain in performance due to each patch on a modest virtual
machine.  Details of the measurement and results follow.

* Benchmark scripts

update.sql
update ht set a = 0 where b = 1;

select.sql
select * from ht where b = 1;

* Table:

create table ht (a int, b int) partition by hash (b)
create table ht_1 partition of ht for values with (modulus N, remainder 0)
..
create table ht_N partition of ht for values with (modulus N, remainder N-1)

* Rounded tps with update.sql and select.sql against regular table (nparts
= 0) and partitioned table with various partition counts:

pgbench -n -T 60 -f update.sql

nparts  master    0001    0002    0003
======  ======    ====    ====    ====
0         2856    2893    2862    2816
8          507    1115    1447    1872
16         260     765    1173    1892
32         119     483     922    1884
64          59     282     615    1881
128         29     153     378    1835
256         14      79     210    1803
512          5      40     113    1728
1024         2      17      57    1616
2048         0*      9      30    1471
4096         0+      4      15    1236
8192         0=      2       7     975

* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)

As can be seen here, 0001 is a big help for update queries.

pgbench -n -T 60 -f select.sql

For a select query that doesn't contain join and needs to scan only one
partition:

nparts  master    0001    0002    0003
======  ======    ====    ====    ====
0         2290    2329    2319    2268
8         1058    1077    1414    1788
16         711     729    1124    1789
32         450     475     879    1773
64         265     272     603    1765
128        146     149     371    1685
256         76      77     214    1678
512         39      39     112    1636
1024        16      17      59    1525
2048         8       9      29    1416
4096         4       4      15    1195
8192         2       2       7     932

Actually, here we get almost same numbers with 0001 as with master,
because 0001 changes nothing for SELECT queries.  We start seeing
improvement with 0002, the patch to delay opening partitions.

Thanks,
Amit
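For anyone wanting to recreate the benchmark tables above, the DDL pattern in the message can be expanded mechanically. The following is a small Python sketch that prints the statements for a given partition count; the function name `hash_partition_ddl` is hypothetical, and the statement text simply follows the `ht` / `ht_1` .. `ht_N` pattern from the message (with `nparts = 0` taken to mean a plain, unpartitioned `ht`).

```python
from typing import List


def hash_partition_ddl(nparts: int, table: str = "ht") -> List[str]:
    """Return CREATE TABLE statements for the benchmark table.

    nparts == 0 yields a regular table; otherwise a hash-partitioned
    parent plus nparts partitions with remainders 0 .. nparts-1.
    """
    if nparts == 0:
        return [f"create table {table} (a int, b int);"]
    stmts = [f"create table {table} (a int, b int) partition by hash (b);"]
    for remainder in range(nparts):
        stmts.append(
            f"create table {table}_{remainder + 1} partition of {table} "
            f"for values with (modulus {nparts}, remainder {remainder});"
        )
    return stmts
```

Something like `print("\n".join(hash_partition_ddl(8)))` would then give a script that can be fed to psql before running pgbench with update.sql or select.sql.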
Attachment
On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2856    2893    2862    2816
> 8          507    1115    1447    1872
> 16         260     765    1173    1892
> 32         119     483     922    1884
> 64          59     282     615    1881
> 128         29     153     378    1835
> 256         14      79     210    1803
> 512          5      40     113    1728
> 1024         2      17      57    1616
> 2048         0*      9      30    1471
> 4096         0+      4      15    1236
> 8192         0=      2       7     975

Those look promising. Are you going to submit these patches to the
September 'fest?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> I measured the gain in performance due to each patch on a modest virtual
> machine.  Details of the measurement and results follow.

Amazing!

UPDATE
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2856    2893    2862    2816
> 8          507    1115    1447    1872

SELECT
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2290    2329    2319    2268
> 8         1058    1077    1414    1788

Even a small number of partitions still introduces a not-small overhead (UPDATE:34%, SELECT:22%).  Do you think this overhead can be reduced further?  What part do you guess would be relevant?  This one?

> that it is now performed after pruning.  However, it doesn't do anything
> about the fact that partitions are all still locked in the earlier phase.

Regards
Takayuki Tsunakawa
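The message does not say how the 34% and 22% figures were derived. One reading that reproduces them is to compare the 8-partition tps with all three patches (column 0003) against the unpartitioned master figure (nparts = 0). The Python sketch below works that out; the helper name `overhead_pct` and the choice of baseline are assumptions, not something stated in the thread.

```python
def overhead_pct(baseline_tps: float, tps: float) -> float:
    """Percentage of throughput lost relative to baseline_tps."""
    return (1.0 - tps / baseline_tps) * 100.0


# Assumed baseline: master with nparts = 0; compared value: 0003 with nparts = 8.
update_overhead = overhead_pct(2856, 1872)
select_overhead = overhead_pct(2290, 1788)
print(f"UPDATE: {update_overhead:.1f}%  SELECT: {select_overhead:.1f}%")
```

Rounded to whole percentages, these match the figures quoted in the message.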
On 2018/08/30 7:27, David Rowley wrote:
> On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> nparts  master    0001    0002    0003
>> ======  ======    ====    ====    ====
>> 0         2856    2893    2862    2816
>> 8          507    1115    1447    1872
>> 16         260     765    1173    1892
>> 32         119     483     922    1884
>> 64          59     282     615    1881
>> 128         29     153     378    1835
>> 256         14      79     210    1803
>> 512          5      40     113    1728
>> 1024         2      17      57    1616
>> 2048         0*      9      30    1471
>> 4096         0+      4      15    1236
>> 8192         0=      2       7     975
>
> Those look promising. Are you going to submit these patches to the
> September 'fest?

Thanks, I just did.

https://commitfest.postgresql.org/19/1778/

Regards,
Amit
On 2018/08/30 10:09, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
>> I measured the gain in performance due to each patch on a modest virtual
>> machine.  Details of the measurement and results follow.
>
> Amazing!

Thanks.

> UPDATE
>> nparts  master    0001    0002    0003
>> ======  ======    ====    ====    ====
>> 0         2856    2893    2862    2816
>> 8          507    1115    1447    1872
>
> SELECT
>> nparts  master    0001    0002    0003
>> ======  ======    ====    ====    ====
>> 0         2290    2329    2319    2268
>> 8         1058    1077    1414    1788
>
> Even a small number of partitions still introduces a not-small overhead (UPDATE:34%, SELECT:22%).

Yeah, that's true.

> Do you think this overhead can be reduced further?

We can definitely try, but I'm not immediately sure if the further
improvements will come from continuing to fix the planner.  Maybe, the
overhead of partitioning could be attributed to other parts of the system.

> What part do you guess would be relevant?  This one?
>
>> that it is now performed after pruning.  However, it doesn't do anything
>> about the fact that partitions are all still locked in the earlier phase.

Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
fix that.  So, the overhead you're seeing is even after making sure that
only the selected partitions are locked.

Thanks,
Amit
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
> We can definitely try, but I'm not immediately sure if the further
> improvements will come from continuing to fix the planner.  Maybe, the
> overhead of partitioning could be attributed to other parts of the system.

> Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
> fix that.  So, the overhead you're seeing is even after making sure that
> only the selected partitions are locked.

Thanks for sharing your thoughts.  I understand we should find the bottleneck with profiling first.

Regards
Takayuki Tsunakawa
On 2018/08/29 21:06, Amit Langote wrote:
> I measured the gain in performance due to each patch on a modest virtual
> machine.  Details of the measurement and results follow.
>
> UPDATE:
>
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2856    2893    2862    2816
> 8          507    1115    1447    1872
> 16         260     765    1173    1892
> 32         119     483     922    1884
> 64          59     282     615    1881
> 128         29     153     378    1835
> 256         14      79     210    1803
> 512          5      40     113    1728
> 1024         2      17      57    1616
> 2048         0*      9      30    1471
> 4096         0+      4      15    1236
> 8192         0=      2       7     975
>
> For SELECT:
>
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2290    2329    2319    2268
> 8         1058    1077    1414    1788
> 16         711     729    1124    1789
> 32         450     475     879    1773
> 64         265     272     603    1765
> 128        146     149     371    1685
> 256         76      77     214    1678
> 512         39      39     112    1636
> 1024        16      17      59    1525
> 2048         8       9      29    1416
> 4096         4       4      15    1195
> 8192         2       2       7     932

Prompted by Tsunakawa-san's comment, I tried to look at the profiles when
running the benchmark with partitioning and noticed a few things that made
clear why, even with 0003 applied, tps numbers decreased as the number of
partitions increased.  Some functions that appeared high up in the
profiles were related to partitioning:

* set_relation_partition_info calling partition_bounds_copy(), which calls
  datumCopy() on N Datums, where N is the number of partitions.  The more
  partitions there are, the higher up it is in profiles.  I suspect that
  this copying might be redundant; the planner can keep using the same
  pointer as relcache.

There are a few existing and newly introduced sites in the planner where
the code iterates over *all* partitions of a table where processing just
the partition selected for scanning would suffice.
I observed the following functions in profiles:

* make_partitionedrel_pruneinfo, which goes over all partitions to
  generate subplan_map and subpart_map arrays to put into the
  PartitionedRelPruneInfo data structure that it's in charge of
  generating

* apply_scanjoin_target_to_paths, which goes over all partitions to adjust
  their Paths for applying required scanjoin target, although most of
  those are dummy ones that won't need the adjustment

* For UPDATE, a couple of functions I introduced in patch 0001 were doing
  the same thing as apply_scanjoin_target_to_paths, which is unnecessary

To fix the above three instances of redundant processing, I added a
Bitmapset 'live_parts' to the RelOptInfo which stores the set of indexes
of only the unpruned partitions (into the RelOptInfo.part_rels array) and
replaced the for (i = 0; i < rel->nparts; i++) loops in those sites with
the loop that iterates over the members of 'live_parts'.

Results looked promising indeed, especially after applying 0003 which gets
rid of locking all partitions.

UPDATE:

nparts  master    0001    0002    0003
======  ======    ====    ====    ====
0         2856    2893    2862    2816
8          507    1115    1466    1845
16         260     765    1161    1876
32         119     483     910    1862
64          59     282     609    1895
128         29     153     376    1884
256         14      79     212    1874
512          5      40     115    1859
1024         2      17      58    1847
2048         0       9      29    1883
4096         0       4      15    1867
8192         0       2       7    1826

SELECT:

nparts  master    0001    0002    0003
======  ======    ====    ====    ====
0         2290    2329    2319    2268
8         1058    1077    1431    1800
16         711     729    1158    1781
32         450     475     908    1777
64         265     272     612    1791
128        146     149     379    1777
256         76      77     213    1785
512         39      39     114    1776
1024        16      17      59    1756
2048         8       9      30    1746
4096         4       4      15    1722
8192         2       2       7    1706

Note that with 0003, tps doesn't degrade as the number of partitions
increases.

Attached updated patches, with 0002 containing the changes mentioned above.

Thanks,
Amit
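The effect of switching from a loop over all partitions to a loop over 'live_parts' can be shown with a toy Python model. This is only a sketch of the idea, not the patch's C code: `part_rels` is modelled as a list with `None` for pruned slots, `live_parts` as a Python set standing in for the Bitmapset, and both function names are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def visit_all(part_rels: List[Optional[str]]) -> Tuple[List[int], int]:
    """Old shape: step through every slot, skipping pruned (None) entries."""
    work, steps = [], 0
    for i in range(len(part_rels)):
        steps += 1
        if part_rels[i] is not None:
            work.append(i)
    return work, steps


def visit_live(part_rels: List[Optional[str]],
               live_parts: Set[int]) -> Tuple[List[int], int]:
    """New shape: step only through the indexes of unpruned partitions."""
    work, steps = [], 0
    for i in sorted(live_parts):
        steps += 1
        assert part_rels[i] is not None  # live indexes always point at a rel
        work.append(i)
    return work, steps
```

With 8192 slots and a single surviving partition, both functions select the same index, but the first takes 8192 loop steps and the second takes one, which is consistent with the flat 0003 numbers in the second set of tables.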
Attachment
Hi, Amit

Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying all three patches, performance improved by 20% for 100 partitions.

I have the same number of records for each partition, do you do the same?

Also, in my case, performance was better when not prepared.
I think these patches do not improve the execute case, so we need the faster runtime pruning patch[1], right?

Details of measurement conditions and results are as follows.

- base source
  master(@777e6ddf17) + Speeding up Insert v8 patch[1]

- table definition(e.g. 100 partition)
  create table test.accounts(aid serial, abalance int)
  partition by range(aid);
  create table test.accounts_history(id serial, aid int, delta int,
  mtime timestamp without time zone)
  partition by range(aid);

  create table test.account_part_1
  partition of test.accounts for values from (1) to (65);
  create table test.account_part_2
  partition of test.accounts for values from (65) to (129);
  ...
  create table test.account_part_100
  partition of test.accounts for values from (6337) to (6400);

  create table test.ah_part_1
  partition of test.accounts_history for values from (1) to (65);
  create table test.ah_part_2
  partition of test.accounts_history for values from (65) to (129);
  ...
  create table test.ah_part_100
  partition of test.accounts_history for values from (6337) to (6400);

- benchmark script
  \set aid random(1, 6400)
  \set delta random(-5000, 5000)
  BEGIN;
  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
  SELECT abalance FROM test.accounts WHERE aid = :aid;
  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
  END;

- command option
  pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
  pgbench -d testdb -f benchmark.pgbench -T 180 -r -n

- results
base source no prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 | 662.414805 |          0.357 |          0.265 |          0.421
      200 | 494.478431 |          0.439 |          0.349 |          0.579
      400 | 307.982089 |          0.651 |          0.558 |          0.927
      800 | 191.360676 |          0.979 |          0.876 |          1.548
     1600 |  75.344947 |          2.253 |          2.003 |          4.301
     3200 |  30.643902 |          5.716 |          4.955 |         10.118
     6400 |  16.726056 |         12.512 |          8.582 |         18.054

0001 no prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 | 429.816329 |          0.811 |           0.75 |          0.365
      200 | 275.211531 |          1.333 |          1.248 |          0.501
      400 | 160.499833 |          2.384 |          2.252 |          0.754
      800 |  79.387776 |          4.935 |          4.698 |          1.468
     1600 |  24.787377 |         16.593 |         15.954 |          4.302
     3200 |   9.846421 |          42.96 |         42.139 |          8.848
     6400 |   4.919772 |          87.43 |         83.109 |          16.56

0001 prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 | 245.100776 |          2.728 |          0.374 |          0.476
      200 | 140.249283 |          5.116 |          0.603 |          0.686
      400 |  67.608559 |         11.383 |          1.055 |          1.179
      800 |  23.085806 |         35.781 |          2.585 |          2.677
     1600 |   6.211247 |        141.093 |          7.096 |          6.785
     3200 |   1.808214 |        508.045 |         15.741 |         13.243
     6400 |   0.495635 |       1919.362 |         37.691 |         28.177

0001 + 0002 no prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 |  682.53091 |          0.388 |           0.35 |           0.35
      200 | 469.906601 |          0.543 |          0.496 |           0.51
      400 | 321.915349 |           0.78 |          0.721 |          0.752
      800 | 201.620975 |          1.246 |          1.156 |          1.236
     1600 |  94.438204 |          2.612 |          2.335 |          2.745
     3200 |  38.292922 |          6.657 |          5.579 |          6.808
     6400 |   21.48462 |         11.989 |         10.104 |         12.601

0001 + 0002 prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 |  591.10863 |          0.433 |          0.342 |          0.422
      200 | 393.223638 |          0.625 |          0.522 |          0.614
      400 | 253.672736 |          0.946 |          0.828 |          0.928
      800 | 146.840262 |          1.615 |          1.448 |          1.604
     1600 |  52.805593 |          4.656 |          3.811 |          4.473
     3200 |  21.461606 |          11.48 |           9.56 |         10.661
     6400 |  11.888232 |         22.869 |         16.841 |         18.871

0001 + 0002 + 0003 no prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 | 798.962029 |          0.304 |          0.267 |          0.339
      200 | 577.893396 |          0.384 |          0.346 |          0.487
      400 | 426.542177 |          0.472 |          0.435 |          0.717
      800 | 288.616213 |           0.63 |          0.591 |          1.162
     1600 | 154.247034 |          1.056 |          0.987 |          2.384
     3200 |  59.711446 |          2.416 |          2.233 |          6.514
     6400 |  37.109761 |          3.387 |          3.099 |         11.762

0001 + 0002 + 0003 prepared
 part_num |   tps_ex   | update_latency | select_latency | insert_latency
----------+------------+----------------+----------------+----------------
      100 | 662.414805 |          0.357 |          0.265 |          0.421
      200 | 494.478431 |          0.439 |          0.349 |          0.579
      400 | 307.982089 |          0.651 |          0.558 |          0.927
      800 | 191.360676 |          0.979 |          0.876 |          1.548
     1600 |  75.344947 |          2.253 |          2.003 |          4.301
     3200 |  30.643902 |          5.716 |          4.955 |         10.118
     6400 |  16.726056 |         12.512 |          8.582 |         18.054

Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.

11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
 part_num |   tps_ex    | update_latency | select_latency | insert_latency
----------+-------------+----------------+----------------+----------------
      100 |   756.07228 |          0.942 |          0.091 |          0.123

[1] https://www.postgresql.org/message-id/CAKJS1f_QN-nmf6jCQ4gRU_8ab0zrd0ms-U%3D_Dj0KUARJiuGpOA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

regards,
Sho Kato

-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
Sent: Wednesday, August 29, 2018 9:06 PM
To: Pg Hackers <pgsql-hackers@postgresql.org>
Subject: speeding up planning with partitions

[ ... ]
Thank you Kato-san for testing.

On 2018/08/31 19:48, Kato, Sho wrote:
> Hi, Amit
>
> Great!
> With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
> Applying all three patches, performance improved by 20% for 100 partitions.
>
> I have the same number of records for each partition, do you do the same?

I didn't load any data into tables when running the tests, because these
patches are meant for reducing planner latency.  More specifically,
they're meant to fix the current planner behavior whereby its latency
increases with the number of partitions, with focus on the common case
where only a single partition will need to be scanned by a given query.

I'd try to avoid using a benchmark whose results are affected by anything
other than the planning latency.  It would be a good idea to leave the
tables empty and just vary the number of partitions and nothing else.

Also, we're interested in planning latency, so using just SELECT and
UPDATE in your benchmark script will be enough, because their planning
time is affected by the number of partitions.  No need to try to measure
the INSERT latency, because its planning latency is not affected by the
number of partitions.  Moreover, I'd rather suggest you take out the
INSERT statement from the benchmark for now, because its execution time
does vary unfavorably with increasing number of partitions.  Sure, there
are other patches to address that, but it's better not to mix the patches
to avoid confusion.

> Also, in my case, performance was better when not prepared.

Patches in this thread do nothing for the execution, so there is no need
to compare prepared vs non-prepared.  In fact, just measure non-prepared
tps and latency, because we're only interested in planning time here.

> I think these patches do not improve the execute case, so we need the faster runtime pruning patch[1], right?

We already have run-time pruning code (that is the code in the patch you
linked) committed into the tree in PG 11, so the master tree also has it.
But since we're not interested in execution time, no need to worry about
run-time pruning.

> Details of measurement conditions and results are as follows.
> - base source
>   master(@777e6ddf17) + Speeding up Insert v8 patch[1]
>
> - table definition(e.g. 100 partition)
>   create table test.accounts(aid serial, abalance int)
>   partition by range(aid);
>   create table test.accounts_history(id serial, aid int, delta int,
>   mtime timestamp without time zone)
>   partition by range(aid);
>
> - command option
>   pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
>   pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
>
> - results
> base source no prepared
>  part_num |   tps_ex   | update_latency | select_latency | insert_latency
> ----------+------------+----------------+----------------+----------------
>       100 | 662.414805 |          0.357 |          0.265 |          0.421
>       200 | 494.478431 |          0.439 |          0.349 |          0.579
>       400 | 307.982089 |          0.651 |          0.558 |          0.927
>       800 | 191.360676 |          0.979 |          0.876 |          1.548
>      1600 |  75.344947 |          2.253 |          2.003 |          4.301
>      3200 |  30.643902 |          5.716 |          4.955 |         10.118
>      6400 |  16.726056 |         12.512 |          8.582 |         18.054
>
> 0001 no prepared
>  part_num |   tps_ex   | update_latency | select_latency | insert_latency
> ----------+------------+----------------+----------------+----------------
>       100 | 429.816329 |          0.811 |           0.75 |          0.365
>       200 | 275.211531 |          1.333 |          1.248 |          0.501
>       400 | 160.499833 |          2.384 |          2.252 |          0.754
>       800 |  79.387776 |          4.935 |          4.698 |          1.468
>      1600 |  24.787377 |         16.593 |         15.954 |          4.302
>      3200 |   9.846421 |          42.96 |         42.139 |          8.848
>      6400 |   4.919772 |          87.43 |         83.109 |          16.56

Hmm, since 0001 is meant to improve update planning time, it seems odd
that you'd get poorer results compared to base source.
But, it seems base source is actually master + the patch to improve the
execution time of update, so maybe that patch is playing a part here,
although I'm not sure why even that's making this much of a difference.

I suggest that you use un-patched master as base source, that is, leave
out any patches to improve execution time.

[ ... ]

> 0001 + 0002 no prepared
>  part_num |   tps_ex   | update_latency | select_latency | insert_latency
> ----------+------------+----------------+----------------+----------------
>       100 |  682.53091 |          0.388 |           0.35 |           0.35
>       200 | 469.906601 |          0.543 |          0.496 |           0.51
>       400 | 321.915349 |           0.78 |          0.721 |          0.752
>       800 | 201.620975 |          1.246 |          1.156 |          1.236
>      1600 |  94.438204 |          2.612 |          2.335 |          2.745
>      3200 |  38.292922 |          6.657 |          5.579 |          6.808
>      6400 |   21.48462 |         11.989 |         10.104 |         12.601
>

[ ... ]

>
> 0001 + 0002 + 0003 no prepared
>  part_num |   tps_ex   | update_latency | select_latency | insert_latency
> ----------+------------+----------------+----------------+----------------
>       100 | 798.962029 |          0.304 |          0.267 |          0.339
>       200 | 577.893396 |          0.384 |          0.346 |          0.487
>       400 | 426.542177 |          0.472 |          0.435 |          0.717
>       800 | 288.616213 |           0.63 |          0.591 |          1.162
>      1600 | 154.247034 |          1.056 |          0.987 |          2.384
>      3200 |  59.711446 |          2.416 |          2.233 |          6.514
>      6400 |  37.109761 |          3.387 |          3.099 |         11.762
>

[ ... ]

By the way, as you may have noticed, I posted a version 2 of the patches
on this thread.  If you apply them, you will be able to see almost the
same TPS for any number of partitions with master + 0001 + 0002 + 0003.

> Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.
>
> 11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
>  part_num |   tps_ex    | update_latency | select_latency | insert_latency
> ----------+-------------+----------------+----------------+----------------
>       100 |   756.07228 |          0.942 |          0.091 |          0.123

I guess if you had applied the latest version of
"Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8
posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better
performance too.  But, as mentioned above, we're interested in measuring
planning latency, not execution latency, so we should leave out any
patches that are meant toward improving execution latency to avoid
confusion.

Thanks again.

Regards,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g%40mail.gmail.com
On Wed, Aug 29, 2018 at 5:36 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It is more or less well known that the planner doesn't perform well with
> more than a few hundred partitions even when only a handful of partitions
> are ultimately included in the plan.  Situation has improved a bit in PG
> 11 where we replaced the older method of pruning partitions one-by-one
> using constraint exclusion with a much faster method that finds relevant
> partitions by using partitioning metadata.  However, we could only use it
> for SELECT queries, because UPDATE/DELETE are handled by a completely
> different code path, whose structure doesn't allow it to call the new
> pruning module's functionality.  Actually, not being able to use the new
> pruning is not the only problem for UPDATE/DELETE, more on which further
> below.
>
>
> pgbench -n -T 60 -f update.sql
>
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2856    2893    2862    2816
> 8          507    1115    1447    1872
> 16         260     765    1173    1892
> 32         119     483     922    1884
> 64          59     282     615    1881
> 128         29     153     378    1835
> 256         14      79     210    1803
> 512          5      40     113    1728
> 1024         2      17      57    1616
> 2048         0*      9      30    1471
> 4096         0+      4      15    1236
> 8192         0=      2       7     975
>
> * 0.46
> + 0.0064
> = 0 (OOM on a virtual machine with 4GB RAM)
>

The idea looks interesting.  While going through the patch, I observed
this comment:

/*
 * inheritance_planner
 *   Generate Paths in the case where the result relation is an
 *   inheritance set.
 *
 * We have to handle this case differently from cases where a source relation
 * is an inheritance set. Source inheritance is expanded at the bottom of the
 * plan tree (see allpaths.c), but target inheritance has to be expanded at
 * the top.

I think with your patch this comment needs to be changed?

  if (parse->resultRelation &&
-     rt_fetch(parse->resultRelation, parse->rtable)->inh)
+     rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+     rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+     RELKIND_PARTITIONED_TABLE)
      inheritance_planner(root);
  else
      grouping_planner(root, false, tuple_fraction);

I think we can add some comments to explain why, if the target rel itself
is a partitioned rel, we can go directly to the grouping planner.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi, Amit. Thank you for your reply.

> I didn't load any data into tables when running the tests, because these
> patches are meant for reducing planner latency.  More specifically,
> they're addressed to fix the current planner behavior that its latency
> increases with increasing number of partitions, with focus on the common
> case where only a single partition will need to be scanned by a given
> query.

Thank you for telling me the details. It is very helpful.

> No need to try to measure the INSERT latency, because its planning
> latency is not affected by the number of partitions.  Moreover, I'd
> rather suggest you take out the INSERT statement from the benchmark for
> now, because its execution time does vary unfavorably with increasing
> number of partitions.

Thank you for your advice.

> In fact, just measure non-prepared tps and latency, because we're only
> interested in planning time here.

I got it.

> Hmm, since 0001 is meant to improve update planning time, it seems odd
> that you'd get poorer results compared to base source.  But, it seems
> base source is actually master + the patch to improve the execution time
> of update, so maybe that patch is playing a part here, although I'm not
> sure why even that's making this much of a difference.
> I suggest that you use un-patched master as base source, that is, leave
> out any patches to improve execution time.

> By the way, as you may have noticed, I posted a version 2 of the patches
> on this thread.  If you apply them, you will be able to see almost the
> same TPS for any number of partitions with master + 0001 + 0002 + 0003.

> I guess if you had applied the latest version of
> "Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch" (which is v8
> posted at [1]) on top of master + 0001 + 0002 + 0003, you'd get better
> performance too.

Thank you. I will try out these cases.
Thanks,
--
Sho Kato

-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp]
Sent: Monday, September 3, 2018 2:12 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; Pg Hackers <pgsql-hackers@postgresql.org>
Subject: Re: speeding up planning with partitions

[ ... ]
Hi Dilip,

Thanks for taking a look.

On 2018/09/03 20:57, Dilip Kumar wrote:
> The idea looks interesting.  While going through the patch, I observed
> this comment:
>
> /*
>  * inheritance_planner
>  *   Generate Paths in the case where the result relation is an
>  *   inheritance set.
>  *
>  * We have to handle this case differently from cases where a source relation
>  * is an inheritance set. Source inheritance is expanded at the bottom of the
>  * plan tree (see allpaths.c), but target inheritance has to be expanded at
>  * the top.
>
> I think with your patch this comment needs to be changed?

Yes, maybe a good idea to mention that partitioned table result relations
are not handled here.

Actually, I've been wondering if this patch (0001) shouldn't get rid of
inheritance_planner altogether and implement the new approach for *all*
inheritance sets, not just partitioned tables, but haven't spent much time
on that idea yet.

>   if (parse->resultRelation &&
> -     rt_fetch(parse->resultRelation, parse->rtable)->inh)
> +     rt_fetch(parse->resultRelation, parse->rtable)->inh &&
> +     rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
> +     RELKIND_PARTITIONED_TABLE)
>       inheritance_planner(root);
>   else
>       grouping_planner(root, false, tuple_fraction);
>
> I think we can add some comments to explain why, if the target rel itself
> is a partitioned rel, we can go directly to the grouping planner.

Okay, I will try to add more explanatory comments here and there in the
next version I will post later this week.

Thanks,
Amit
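
For readers following the hunk quoted above, here is a minimal standalone sketch of the dispatch test, with the range table entry looked up once rather than three times. Everything named Mock* and the function choose_planner are stand-ins invented for this illustration and are not PostgreSQL's real definitions; only the shape of the condition (inheritance set, but not a partitioned table) is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the relkind codes used in the condition. */
#define MOCK_RELKIND_RELATION          'r'
#define MOCK_RELKIND_PARTITIONED_TABLE 'p'

/* Hypothetical, heavily reduced stand-in for a range table entry. */
typedef struct MockRTE
{
    bool inh;      /* does this entry have inheritance children? */
    char relkind;  /* kind of relation */
} MockRTE;

typedef enum MockPlannerChoice
{
    MOCK_GROUPING_PLANNER,
    MOCK_INHERITANCE_PLANNER
} MockPlannerChoice;

/*
 * Decide which planning path to take for the result relation.
 *
 * rtable is a 0-based array; result_relation is a 1-based index into it,
 * with 0 meaning "no result relation" (for example, a plain SELECT).
 */
MockPlannerChoice
choose_planner(const MockRTE *rtable, size_t rtable_len,
               size_t result_relation)
{
    const MockRTE *rte;

    if (result_relation == 0)
        return MOCK_GROUPING_PLANNER;

    assert(result_relation <= rtable_len);

    /* Fetch the entry once and reuse it for both tests. */
    rte = &rtable[result_relation - 1];

    if (rte->inh && rte->relkind != MOCK_RELKIND_PARTITIONED_TABLE)
        return MOCK_INHERITANCE_PLANNER;

    return MOCK_GROUPING_PLANNER;
}
```

With a range table holding a plain inheritance parent at index 1 and a partitioned table at index 2, the sketch picks the inheritance path only for index 1, which is the behavior the hunk is aiming for.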
On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> With various overheads gone thanks to 0001 and 0002, locking of all
> partitions via find_all_inheritors can be seen as the single largest
> bottleneck, which 0003 tries to address.  I've kept it a separate patch,
> because I'll need to think a bit more to say that it's actually safe to
> defer locking to late planning, due mainly to the concern about the change
> in the order of locking from the current method.  I'm attaching it here,
> because I also want to show the performance improvement we can expect with it.

For now, find_all_inheritors() locks the tables in ascending Oid
order.  This makes sense with inheritance parent tables as it's much
cheaper to sort on this rather than on something like the table's
namespace and name.  I see no reason why what we sort on really
matters, as long as we always sort on the same thing and the key we
sort on is always unique so that the locking order is well defined.

For partitioned tables, there's really not much sense in sticking to
the same lock by Oid order. The order of the PartitionDesc is already
well defined so I don't see any reason why we can't just perform the
locking in PartitionDesc order. This would mean you could perform the
locking of the partitions once pruning is complete somewhere around
add_rel_partitions_to_query(). Also, doing this would remove the need
for scanning pg_inherits during find_all_inheritors() and would likely
further speed up the planning of queries to partitioned tables with
many partitions. I wrote a function named
get_partition_descendants_worker() to do this in patch 0002 in [1] (it
may have been a bad choice to have made this a separate function
rather than just part of find_all_inheritors() as it meant I had to
change a bit too much code in tablecmds.c). There might be something
you can salvage from that patch to help here.

[1] https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
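
The lock-ordering argument above can be illustrated with a small standalone sketch: the locking order is well defined when every session sorts the same set of tables on the same unique key before locking, whichever key that is. The names below (MockOid, lock_order_by_oid, same_lock_order) are hypothetical and nothing here calls PostgreSQL's lock manager; ascending Oid order is used only because it is the order find_all_inheritors() is described as using.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a table identifier. */
typedef unsigned int MockOid;

/* qsort comparator giving ascending order without overflow. */
static int
mock_oid_cmp(const void *a, const void *b)
{
    MockOid x = *(const MockOid *) a;
    MockOid y = *(const MockOid *) b;

    return (x > y) - (x < y);
}

/*
 * Rearrange the tables a session wants to lock into ascending Oid order.
 * The key is unique, so the resulting sequence is fully determined by the
 * set of tables and not by the order in which the session collected them.
 */
void
lock_order_by_oid(MockOid *oids, size_t n)
{
    assert(oids != NULL || n == 0);
    if (n > 1)
        qsort(oids, n, sizeof(MockOid), mock_oid_cmp);
}

/* Return 1 if two sessions would take their locks in the same sequence. */
int
same_lock_order(const MockOid *a, const MockOid *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

Two sessions that gather the same three tables in different orders end up with an identical lock sequence after sorting. The same reasoning would apply to PartitionDesc order, provided every code path that locks partitions uses it.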
On 2018/09/04 22:24, David Rowley wrote:
> On 30 August 2018 at 00:06, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> With various overheads gone thanks to 0001 and 0002, locking of all
>> partitions via find_all_inheritors can be seen as the single largest
>> bottleneck, which 0003 tries to address.  I've kept it a separate patch,
>> because I'll need to think a bit more to say that it's actually safe to
>> defer locking to late planning, due mainly to the concern about the change
>> in the order of locking from the current method.  I'm attaching it here,
>> because I also want to show the performance improvement we can expect with it.
>
> For now, find_all_inheritors() locks the tables in ascending Oid
> order.  This makes sense with inheritance parent tables as it's much
> cheaper to sort on this rather than on something like the table's
> namespace and name.  I see no reason why what we sort on really
> matters, as long as we always sort on the same thing and the key we
> sort on is always unique so that the locking order is well defined.
>
> For partitioned tables, there's really not much sense in sticking to
> the same lock by Oid order. The order of the PartitionDesc is already
> well defined so I don't see any reason why we can't just perform the
> locking in PartitionDesc order.

The reason we do the locking with find_all_inheritors for regular queries
(planner (expand_inherited_rtentry) does it for select/update/delete and
executor (ExecSetupPartitionTupleRouting) for insert) is that that's the
order used by various DDL commands when locking partitions, such as when
adding a column.  So, we'd have one piece of code locking partitions by
Oid order and others by canonical partition bound order or PartitionDesc
order.  I'm no longer sure if that's problematic though, about which more
below.

> This would mean you could perform the
> locking of the partitions once pruning is complete somewhere around
> add_rel_partitions_to_query(). Also, doing this would remove the need
> for scanning pg_inherits during find_all_inheritors() and would likely
> further speed up the planning of queries to partitioned tables with
> many partitions.

That's what happens with 0003; note the following hunk in it:

@@ -1555,14 +1555,15 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
         lockmode = AccessShareLock;

     /* Scan for all members of inheritance set, acquire needed locks */
-    inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+    if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+        inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

> I wrote a function named
> get_partition_descendants_worker() to do this in patch 0002 in [1] (it
> may have been a bad choice to have made this a separate function
> rather than just part of find_all_inheritors() as it meant I had to
> change a bit too much code in tablecmds.c). There might be something
> you can salvage from that patch to help here.
>
> [1] https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

Actually, I had written a similar patch to replace the usages of
find_all_inheritors and find_inheritance_children by different
partitioning-specific functions which would collect the partition OIDs
from the already open parent table's PartitionDesc, more or less like the
patch you mention does.  But I think we don't need any new function(s) to
do that, that is, we don't need to change all the sites that call
find_all_inheritors or find_inheritance_children in favor of new functions
that return partition OIDs in PartitionDesc order, if only *because* we
want to change planner to lock partitions in the PartitionDesc order.  I'm
failing to see why the difference in locking order matters.  I understood
the concern to be that locking partitions in a different order could lead
to a deadlock if concurrent backends request mutually conflicting locks,
but only one of the conflicting backends, the one that got the lock on the
parent, would be allowed to lock the children.  Thoughts on that?

Thanks,
Amit
On 30 August 2018 at 21:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches, with 0002 containing the changes mentioned above.

Here's my first pass review on this:

0001:

1. I think the following paragraph should probably mention something
about some performance difference between the two methods:

   <para>
    Constraint exclusion works in a very similar way to partition
    pruning, except that it uses each table's <literal>CHECK</literal>
    constraints &mdash; which gives it its name &mdash; whereas partition
    pruning uses the table's partition bounds, which exist only in the
    case of declarative partitioning.  Another difference is that
    constraint exclusion is only applied at plan time; there is no attempt
    to remove partitions at execution time.
   </para>

Perhaps tagging on: "Constraint exclusion is also a much less
efficient way of eliminating unneeded partitions as metadata for each
partition must be loaded in the planner before constraint exclusion
can be applied.  This is not a requirement for partition pruning."

2. I think "rootrel" should be named "targetpart" in:

+ RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation];

3. Why does the following need to list_copy()?

+ List    *saved_join_info_list = list_copy(root->join_info_list);

4. Is the "root->parse->commandType != CMD_INSERT" required in:

@@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist)

     /*
      * Generate access paths for the entire join tree.
+     *
+     * If we're doing this for an UPDATE or DELETE query whose target is a
+     * partitioned table, we must do the join planning against each of its
+     * leaf partitions instead.
      */
-    rel = make_rel_from_joinlist(root, joinlist);
+    if (root->parse->resultRelation &&
+        root->parse->commandType != CMD_INSERT &&
+        root->simple_rel_array[root->parse->resultRelation] &&
+        root->simple_rel_array[root->parse->resultRelation]->part_scheme)
+    {

Won't the simple_rel_array entry for the resultRelation always be NULL
for an INSERT?

5. In regards to:

+ /*
+  * Hack to make the join planning code believe that 'partrel' can
+  * be joined against.
+  */
+ partrel->reloptkind = RELOPT_BASEREL;

Have you thought about other implications of join planning for other
member rels, for example, equivalence classes and em_is_child?

6. It would be good to not have to rt_fetch the same RangeTblEntry twice here:

@@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
      * needs special processing, else go straight to grouping_planner.
      */
     if (parse->resultRelation &&
-        rt_fetch(parse->resultRelation, parse->rtable)->inh)
+        rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+        rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+        RELKIND_PARTITIONED_TABLE)
         inheritance_planner(root);

7. Why don't you just pass the Parse into the function as a parameter
instead of overwriting PlannerInfo's copy in:

+ root->parse = partition_parse;
+ partitionwise_adjust_scanjoin_target(root, child_rel,
+                                      subroots,
+                                      partitioned_rels,
+                                      resultRelations,
+                                      subpaths,
+                                      WCOLists,
+                                      returningLists,
+                                      rowMarks);
+ /* Restore the Query for processing the next partition. */
+ root->parse = parse;

8. Can you update the following comment to mention why you're not
calling add_paths_to_append_rel for this case?

@@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
     }

     /* Build new paths for this relation by appending child paths. */
-    if (live_children != NIL)
+    if (live_children != NIL &&
+        !(rel->reloptkind == RELOPT_BASEREL &&
+          rel->relid == root->parse->resultRelation))
         add_paths_to_append_rel(root, rel, live_children);

9. The following should use >= 0, not > 0

+ while ((relid = bms_next_member(child_rel->relids, relid)) > 0)

0002:

10. I think moving the PlannerInfo->total_table_pages calculation
needs to be done in its own patch. This is a behavioural change where
we no longer count pruned relations in the calculation which can
result in plan changes. I posted a patch in [1] to fix this as a bug
fix as I think the current code is incorrect. My patch also updates
the first paragraph of the comment. You've not done that.

11. "pruning"

+ * And do prunin.  Note that this adds AppendRelInfo's of only the

12. It's much more efficient just to do bms_add_range() outside the loop in:

+ for (i = 0; i < rel->nparts; i++)
+ {
+     rel->part_rels[i] = build_partition_rel(root, rel,
+                                             rel->part_oids[i]);
+     rel->live_parts = bms_add_member(rel->live_parts, i);
+ }

13. In set_append_rel_size() the foreach(l, root->append_rel_list)
loop could be made to loop over RelOptInfo->live_parts instead which
would save having to skip over AppendRelInfos that don't belong to
this parent. You'd need to make live_parts more general purpose and
also use it to mark children of inheritance parents.

14. I think you can skip the following if both are NULL. You could
likely add more smarts for different join types, but I think you
should at least skip when both are NULL. Perhaps the loop could be
driven off of bms_intersect of the two Rel's live_parts?

+ if (child_rel1 == NULL)
+     child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts);
+ if (child_rel2 == NULL)
+     child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts);

15. The following is not required when append_rel_array was previously NULL.

+ MemSet(root->append_rel_array + root->simple_rel_array_size,
+        0, sizeof(AppendRelInfo *) * num_added_parts);

16. I don't think scan_all_parts is worth the extra code.  The cost of
doing bms_num_members is going to be pretty small in comparison to
building paths for and maybe doing a join search for all partitions.

+ num_added_parts = scan_all_parts ? rel->nparts :
+                       bms_num_members(partindexes);

In any case, you've got a bug in prune_append_rel_partitions() where
you're setting scan_all_parts to true instead of returning when
contradictory == true.

17. lockmode should be of type LOCKMODE, not int.

+     Oid childOID, int lockmode,

18. Comment contradicts the code.

+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ {
+     childrel = heap_open(childOID, lockmode);

I think you should be using NoLock here.

19. Comment contradicts the code.

+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ {
+     Assert(childrel != NULL);
+     heap_close(childrel, lockmode);
+ }

20. I assume translated_vars can now be NIL due to
build_dummy_partition_rel() not setting it.

- if (var->varlevelsup == 0 && appinfo)
+ if (var->varlevelsup == 0 && appinfo && appinfo->translated_vars)

It might be worth writing a comment to explain that, otherwise it's
not quite clear why you're doing this.

21. Unrelated change;

  Assert(relid > 0 && relid < root->simple_rel_array_size);
+

22. The following comment mentions that Oids are copied, but that does
not happen in the code.

+ /*
+  * For partitioned tables, we just allocate space for RelOptInfo's.
+  * pointers for all partitions and copy the partition OIDs from the
+  * relcache.  Actual RelOptInfo is built for a partition only if it is
+  * not pruned.
+  */

The Oid copying already happened during get_relation_info().

23. Traditionally translated_vars is populated with a sequence of Vars in
the same order to mean no translation. Here you're changing how that
works:

+ /* leaving translated_vars to NIL to mean no translation needed */

This seems to be about the extent of your documentation on this, which
is not good enough.

24. "each" -> "reach"? ... Actually, I don't understand the comment.
In a partitioned hierarchy, how can the one before the top-level
partitioned table not be a partitioned table?

  /*
   * Keep moving up until we each the parent rel that's not a
   * partitioned table.  The one before that one would be the root
   * parent.
   */

25. "already"

+ * expand_inherited_rtentry alreay locked all partitions, so pass

26. Your changes to make_partitionedrel_pruneinfo() look a bit broken.
You're wrongly assuming live_parts is the same as present_parts.  If a
CHECK constraint eliminated the partition then those will be present
in live_parts but won't be part of the Append/MergeAppend subplans.
You might be able to maintain some of this optimisation by checking
for dummy rels inside the loop, but you're going to need to put back
the code that sets present_parts.

+ present_parts = bms_copy(subpart->live_parts);

27. Comment contradicts the code:

+ Bitmapset  *live_parts; /* unpruned parts; NULL if all are live */

in add_rel_partitions_to_query() you're doing:

+ if (scan_all_parts)
+ {
+     for (i = 0; i < rel->nparts; i++)
+     {
+         rel->part_rels[i] = build_partition_rel(root, rel,
+                                                 rel->part_oids[i]);
+         rel->live_parts = bms_add_member(rel->live_parts, i);
+     }
+ }

so the NULL part seems untrue.

28. Missing comments:

+ TupleDesc tupdesc;
+ Oid reltype;

29. The comment for prune_append_rel_partitions claims it "Returns RT
indexes", but that's not the case, per:

-Relids
-prune_append_rel_partitions(RelOptInfo *rel)
+void
+prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)

0003:

30. 2nd test can be tested inside the first test to remove redundant
partition check.

- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+     inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

  /*
   * Check that there's at least one descendant, else treat as no-child
   * case.  This could happen despite above has_subclass() check, if table
   * once had a child but no longer does.
   */
- if (list_length(inhOIDs) < 2)
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE && list_length(inhOIDs) < 2)
  {

31. The following code is wrong:

+ /* Determine the correct lockmode to use. */
+ if (rootRTindex == root->parse->resultRelation)
+     lockmode = RowExclusiveLock;
+ else if (rootrc && RowMarkRequiresRowShareLock(rootrc->markType))
+     lockmode = RowShareLock;
+ else
+     lockmode = AccessShareLock;

rootRTindex remains at 0 if there are no row marks and resultRelation
will be 0 for SELECT queries, this means you'll use a RowExclusiveLock
for a SELECT instead of an AccessShareLock.

Why not just check the lockmode of the parent and use that?

[1] https://commitfest.postgresql.org/19/1768/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
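
The problem described in point 31 can be reproduced in a standalone sketch. The enum values and both functions below are invented for illustration and do not use PostgreSQL's lock definitions. The guarded variant shows just one possible way to avoid the 0 == 0 match and is an assumption of this sketch; reusing the parent's lock mode, as suggested above, may well be the better fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three table lock modes in the hunk. */
typedef enum MockLockMode
{
    MOCK_ACCESS_SHARE_LOCK = 1,
    MOCK_ROW_SHARE_LOCK = 2,
    MOCK_ROW_EXCLUSIVE_LOCK = 3
} MockLockMode;

/*
 * Mirrors the shape of the quoted code: a range table index of 0 means
 * "none", so a plain SELECT with no row marks has both indexes equal to 0
 * and the first branch is taken by accident.
 */
MockLockMode
lockmode_as_quoted(int root_rt_index, int result_relation,
                   bool rowmark_requires_row_share)
{
    assert(root_rt_index >= 0 && result_relation >= 0);

    if (root_rt_index == result_relation)
        return MOCK_ROW_EXCLUSIVE_LOCK;
    else if (rowmark_requires_row_share)
        return MOCK_ROW_SHARE_LOCK;
    else
        return MOCK_ACCESS_SHARE_LOCK;
}

/* Same logic, but only treats the rel as a target if there is a target. */
MockLockMode
lockmode_guarded(int root_rt_index, int result_relation,
                 bool rowmark_requires_row_share)
{
    assert(root_rt_index >= 0 && result_relation >= 0);

    if (result_relation != 0 && root_rt_index == result_relation)
        return MOCK_ROW_EXCLUSIVE_LOCK;
    else if (rowmark_requires_row_share)
        return MOCK_ROW_SHARE_LOCK;
    else
        return MOCK_ACCESS_SHARE_LOCK;
}
```

Calling both functions with two zero indexes and no row mark shows the difference: the quoted shape returns the row-exclusive mode, the guarded one returns the access-share mode.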
On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It is more or less well known that the planner doesn't perform well with
> more than a few hundred partitions even when only a handful of partitions
> are ultimately included in the plan.  Situation has improved a bit in PG
> 11 where we replaced the older method of pruning partitions one-by-one
> using constraint exclusion with a much faster method that finds relevant
> partitions by using partitioning metadata.  However, we could only use it
> for SELECT queries, because UPDATE/DELETE are handled by a completely
> different code path, whose structure doesn't allow it to call the new
> pruning module's functionality.  Actually, not being able to use the new
> pruning is not the only problem for UPDATE/DELETE, more on which further
> below.

This was a big problem for the SQL MERGE patch. I hope that this
problem can be fixed.

--
Peter Geoghegan
On 2018/09/11 10:11, Peter Geoghegan wrote:
> On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It is more or less well known that the planner doesn't perform well with
>> more than a few hundred partitions even when only a handful of partitions
>> are ultimately included in the plan.  Situation has improved a bit in PG
>> 11 where we replaced the older method of pruning partitions one-by-one
>> using constraint exclusion with a much faster method that finds relevant
>> partitions by using partitioning metadata.  However, we could only use it
>> for SELECT queries, because UPDATE/DELETE are handled by a completely
>> different code path, whose structure doesn't allow it to call the new
>> pruning module's functionality.  Actually, not being able to use the new
>> pruning is not the only problem for UPDATE/DELETE, more on which further
>> below.
>
> This was a big problem for the SQL MERGE patch. I hope that this
> problem can be fixed.

Sorry if I've missed some prior discussion about this, but could you
clarify which aspect of UPDATE/DELETE planning got in the way of the SQL
MERGE patch?  It'd be great if you can point to an email or portion of the
SQL MERGE discussion thread where this aspect was discussed.

In the updated patch that I'm still hacking on (also need to look at
David's comments earlier today before posting it), I have managed to
eliminate the separation of code paths handling SELECT vs UPDATE/DELETE on
inheritance tables, but I can't be sure if the new approach (also) solves
any problems that were faced during SQL MERGE development.

Thanks,
Amit
On Tue, Sep 4, 2018 at 12:44 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Dilip,
>
> Thanks for taking a look.
>
> On 2018/09/03 20:57, Dilip Kumar wrote:
>> The idea looks interesting.  While going through the patch, I observed
>> this comment:
>>
>> /*
>>  * inheritance_planner
>>  *   Generate Paths in the case where the result relation is an
>>  *   inheritance set.
>>  *
>>  * We have to handle this case differently from cases where a source relation
>>  * is an inheritance set. Source inheritance is expanded at the bottom of the
>>  * plan tree (see allpaths.c), but target inheritance has to be expanded at
>>  * the top.
>>
>> I think with your patch this comment needs to be changed?
>
> Yes, maybe a good idea to mention that partitioned table result relations
> are not handled here.
>
> Actually, I've been wondering if this patch (0001) shouldn't get rid of
> inheritance_planner altogether and implement the new approach for *all*
> inheritance sets, not just partitioned tables, but haven't spent much time
> on that idea yet.

That will be interesting.

>
>>   if (parse->resultRelation &&
>> -     rt_fetch(parse->resultRelation, parse->rtable)->inh)
>> +     rt_fetch(parse->resultRelation, parse->rtable)->inh &&
>> +     rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
>> +     RELKIND_PARTITIONED_TABLE)
>>       inheritance_planner(root);
>>   else
>>       grouping_planner(root, false, tuple_fraction);
>>
>> I think we can add some comments to explain why, if the target rel itself
>> is a partitioned rel, we can go directly to the grouping planner.
>
> Okay, I will try to add more explanatory comments here and there in the
> next version I will post later this week.

Okay.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Thanks a lot for your detailed review. On 2018/09/11 8:23, David Rowley wrote: > On 30 August 2018 at 21:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated patches, with 0002 containing the changes mentioned above. > > Here's my first pass review on this: I've gone through your comments on 0001 so far, but didn't go through others yet, to which I'll reply separately. > 0001: > > 1. I think the following paragraph should probably mention something > about some performance difference between the two methods: > > <para> > Constraint exclusion works in a very similar way to partition > pruning, except that it uses each table's <literal>CHECK</literal> > constraints — which gives it its name — whereas partition > pruning uses the table's partition bounds, which exist only in the > case of declarative partitioning. Another difference is that > constraint exclusion is only applied at plan time; there is no attempt > to remove partitions at execution time. > </para> > > Perhaps tagging on. "Constraint exclusion is also a much less > efficient way of eliminating unneeded partitions as metadata for each > partition must be loaded in the planner before constraint exclusion > can be applied. This is not a requirement for partition pruning." Hmm, isn't that implied by the existing text? It already says that constraint exclusion uses *each* table's/partition's CHECK constraints, which should make it clear that for a whole lot of partitions, that will be a slower than partition pruning which requires accessing only one table's metadata. If we will have dissociated constraint exclusion completely from partitioned tables with these patches, I'm not sure if we have to stress that it is inefficient for large number of tables. > 2. I think "rootrel" should be named "targetpart" in: > > + RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation]; To me, "targetpart" makes a partitioned table sound like a partition, which it is not. 
I get that using "root" can be ambiguous, because a query can specify a non-root partitioned table. How about "targetrel"? > 3. Why does the following need to list_copy()? > > + List *saved_join_info_list = list_copy(root->join_info_list); In earlier versions of this code, root->join_info_list would be modified during make_one_rel_from_joinlist, but that no longer seems true. Removed the list_copy. > 4. Is the "root->parse->commandType != CMD_INSERT" required in: > > @@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist) > > /* > * Generate access paths for the entire join tree. > + * > + * If we're doing this for an UPDATE or DELETE query whose target is a > + * partitioned table, we must do the join planning against each of its > + * leaf partitions instead. > */ > - rel = make_rel_from_joinlist(root, joinlist); > + if (root->parse->resultRelation && > + root->parse->commandType != CMD_INSERT && > + root->simple_rel_array[root->parse->resultRelation] && > + root->simple_rel_array[root->parse->resultRelation]->part_scheme) > + { > > Won't the simple_rel_array entry for the resultRelation always be NULL > for an INSERT? Yep, you're right. Removed. > 5. In regards to: > > + /* > + * Hack to make the join planning code believe that 'partrel' can > + * be joined against. > + */ > + partrel->reloptkind = RELOPT_BASEREL; > > Have you thought about other implications of join planning for other > member rels, for example, equivalence classes and em_is_child? Haven't really, but that seems like an important point. I will study it more closely. > 6. It would be good to not have to rt_fetch the same RangeTblEntry twice here: > > @@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse, > * needs special processing, else go straight to grouping_planner. 
> */ > if (parse->resultRelation && > - rt_fetch(parse->resultRelation, parse->rtable)->inh) > + rt_fetch(parse->resultRelation, parse->rtable)->inh && > + rt_fetch(parse->resultRelation, parse->rtable)->relkind != > + RELKIND_PARTITIONED_TABLE) > inheritance_planner(root); The new version doesn't call inheritance_planner at all; there is no inheritance_planner in the new version. > 7. Why don't you just pass the Parse into the function as a parameter > instead of overwriting PlannerInfo's copy in: > > + root->parse = partition_parse; > + partitionwise_adjust_scanjoin_target(root, child_rel, > + subroots, > + partitioned_rels, > + resultRelations, > + subpaths, > + WCOLists, > + returningLists, > + rowMarks); > + /* Restore the Query for processing the next partition. */ > + root->parse = parse; Okay, done. > 8. Can you update the following comment to mention why you're not > calling add_paths_to_append_rel for this case? > > @@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root, > } > > /* Build new paths for this relation by appending child paths. */ > - if (live_children != NIL) > + if (live_children != NIL && > + !(rel->reloptkind == RELOPT_BASEREL && > + rel->relid == root->parse->resultRelation)) > add_paths_to_append_rel(root, rel, live_children); Oops, stray code from a previous revision. Removed this hunk. > > 9. The following should use >= 0, not > 0 > > + while ((relid = bms_next_member(child_rel->relids, relid)) > 0) Yeah, fixed. Sorry, I haven't yet worked on your comments on 0002 and 0003. For time being, I'd like to report what I've been up to these past couple of days, because starting tomorrow until the end of this month, I won't be able to reply to emails on -hackers due to personal vacation. So, I spent a better part of last few days on trying to revise the patches so that it changes the planning code for *all* inheritance cases instead of just focusing on partitioning. 
Because, really, the only difference between the partitioning code and regular inheritance code should be that partitioning is able to use faster pruning, and all the other parts should look and work more or less the same. There shouldn't be code here that deals with partitioning and code there that deals with inheritance. Minimizing code this way should be a good end to aim for, imho. Attached is what I have at the moment. As part of the effort mentioned above, I made 0001 remove inheritance_planner altogether, instead of just taking the partitioning case out of it. Then there is the WIP patch 0004, which tries to move even regular inheritance expansion to a late planning stage, just like the earlier versions did for partitioning. It will need quite a bit of polishing before we could consider it for merging with 0002. Of course, I'll need to address your comments before seriously considering that. Thanks, Amit
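A minimal sketch of the point behind review comment 9 above (loop on >= 0, not > 0), assuming a toy 64-member set in place of PostgreSQL's Bitmapset. The names toy_set, toy_next_member, toy_count_ge_zero and toy_count_gt_zero are hypothetical and exist only for this illustration; the one property borrowed from bms_next_member is that it returns a negative value once the members are exhausted, so 0 is a legal member and a "> 0" test stops the loop early whenever member 0 is present.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 64-member set; a stand-in for Bitmapset, not the real structure. */
typedef uint64_t toy_set;

/* Return the smallest member greater than prev, or -2 if there is none. */
int
toy_next_member(toy_set set, int prev)
{
    assert(prev >= -1 && prev < 64);

    for (int i = prev + 1; i < 64; i++)
    {
        if (set & (UINT64_C(1) << i))
            return i;
    }
    return -2;
}

/* Count the members visited by the conventional ">= 0" loop. */
int
toy_count_ge_zero(toy_set set)
{
    int n = 0;
    int m = -1;

    while ((m = toy_next_member(set, m)) >= 0)
        n++;
    return n;
}

/* Count the members visited by a "> 0" loop, which exits at member 0. */
int
toy_count_gt_zero(toy_set set)
{
    int n = 0;
    int m = -1;

    while ((m = toy_next_member(set, m)) > 0)
        n++;
    return n;
}
```

With the set {0, 3, 5} the first loop visits all three members, while the second visits none, because the very first call returns 0 and the loop condition fails. For a set of range table indexes, which start at 1, the two loops happen to agree, which is probably why the slip is easy to miss.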
On Fri, Sep 14, 2018 at 3:58 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks a lot for your detailed review. > I was going through your patch (v3-0002) and I have some suggestions 1. - if (nparts > 0) + /* + * For partitioned tables, we just allocate space for RelOptInfo's. + * pointers for all partitions and copy the partition OIDs from the + * relcache. Actual RelOptInfo is built for a partition only if it is + * not pruned. + */ + if (rte->relkind == RELKIND_PARTITIONED_TABLE) + { rel->part_rels = (RelOptInfo **) - palloc(sizeof(RelOptInfo *) * nparts); + palloc0(sizeof(RelOptInfo *) * rel->nparts); + return rel; + } I think we can delay allocating memory for rel->part_rels? And we can allocate in add_rel_partitions_to_query only for those partitions which survive pruning. 2. add_rel_partitions_to_query ... + /* Expand the PlannerInfo arrays to hold new partition objects. */ + num_added_parts = scan_all_parts ? rel->nparts : + bms_num_members(partindexes); + new_size = root->simple_rel_array_size + num_added_parts; + root->simple_rte_array = (RangeTblEntry **) + repalloc(root->simple_rte_array, + sizeof(RangeTblEntry *) * new_size); + root->simple_rel_array = (RelOptInfo **) + repalloc(root->simple_rel_array, + sizeof(RelOptInfo *) * new_size); + if (root->append_rel_array) + root->append_rel_array = (AppendRelInfo **) + repalloc(root->append_rel_array, + sizeof(AppendRelInfo *) * new_size); + else + root->append_rel_array = (AppendRelInfo **) + palloc0(sizeof(AppendRelInfo *) * + new_size); Instead of re-pallocing for every partitioned relation can't we first count the total number of surviving partitions and repalloc at once. 3. + /* + * And do prunin. Note that this adds AppendRelInfo's of only the + * partitions that are not pruned. + */ prunin/pruning -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Sep 29, 2018 at 07:00:02PM +0530, Dilip Kumar wrote: > I think we can delay allocating memory for rel->part_rels? And we can > allocate in add_rel_partitions_to_query only > for those partitions which survive pruning. This last review set has not been answered, and as it is recent I am moving the patch to next CF, waiting on author. -- Michael
On 2018/10/02 10:20, Michael Paquier wrote: > On Sat, Sep 29, 2018 at 07:00:02PM +0530, Dilip Kumar wrote: >> I think we can delay allocating memory for rel->part_rels? And we can >> allocate in add_rel_partitions_to_query only >> for those partitions which survive pruning. > > This last review set has not been answered, and as it is recent I am > moving the patch to next CF, waiting on author. I was thinking of doing that myself today, thanks for taking care of that. Will get back to this by the end of this week. Thanks, Amit
On Fri, Sep 14, 2018 at 10:29 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached is what I have at the moment. I realise this is a WIP but FYI the docs don't build (you removed a <note> element that is still needed, when removing a paragraph). -- Thomas Munro http://www.enterprisedb.com
On 2018/10/03 8:31, Thomas Munro wrote: > On Fri, Sep 14, 2018 at 10:29 PM Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached is what I have at the moment. > > I realise this is a WIP but FYI the docs don't build (you removed a > <note> element that is still needed, when removing a paragraph). Thanks Thomas for the heads up, will fix. Regards, Amit
Hi, Amit! On Thu, Sept 13, 2018 at 10:29 PM, Amit Langote wrote: > Attached is what I have at the moment. I also do the code review of the patch. I could only see a v3-0001.patch so far, so below are all about v3-0001.patch. I am new to inheritance/partitioning codes and code review, so my review below might have some mistakes. If there are mistakes, please point out that kindly :) v3-0001: 1. Is there any reason inheritance_make_rel_from_joinlist returns "parent" that is passed to it? I think we can refer parent in the caller even if inheritance_make_rel_from_joinlist returns nothing. +static RelOptInfo * +inheritance_make_rel_from_joinlist(PlannerInfo *root, + RelOptInfo *parent, + List *joinlist) +{ ... + return parent; +} 2. Is there the possible case that IS_DUMMY_REL(child_joinrel) is true in inheritance_adjust_scanjoin_target()? +inheritance_adjust_scanjoin_target(PlannerInfo *root, ... +{ ... + foreach(lc, root->inh_target_child_rels) + { ... + /* + * If the child was pruned, its corresponding joinrel will be empty, + * which we can ignore safely. + */ + if (IS_DUMMY_REL(child_joinrel)) + continue; I think inheritance_make_rel_from_joinlist() doesn't make dummy joinrel for the child that was pruned. +inheritance_make_rel_from_joinlist(PlannerInfo *root, ... +{ ... + foreach(lc, root->append_rel_list) + { + RelOptInfo *childrel; ... + /* Ignore excluded/pruned children. */ + if (IS_DUMMY_REL(childrel)) + continue; ... + /* + * Save child joinrel to be processed later in + * inheritance_adjust_scanjoin_target, which adjusts its paths to + * be able to emit expressions in query's top-level target list. + */ + root->inh_target_child_rels = lappend(root->inh_target_child_rels, + childrel); + } +} 3. Is the "root->parse->commandType != CMD_INSERT" required in: @@ -2018,13 +1514,45 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, ... 
+ /* + * For UPDATE/DELETE on an inheritance parent, we must generate and + * apply scanjoin target based on targetlist computed using each + * of the child relations. + */ + if (parse->commandType != CMD_INSERT && + current_rel->reloptkind == RELOPT_BASEREL && + current_rel->relid == root->parse->resultRelation && + root->simple_rte_array[current_rel->relid]->inh) ... @@ -2137,92 +1665,123 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, final_rel->fdwroutine = current_rel->fdwroutine; ... - foreach(lc, current_rel->pathlist) + if (current_rel->reloptkind == RELOPT_BASEREL && + current_rel->relid == root->parse->resultRelation && + !IS_DUMMY_REL(current_rel) && + root->simple_rte_array[current_rel->relid]->inh && + parse->commandType != CMD_INSERT) I think if a condition would be "current_rel->relid == root->parse->resultRelation && parse->commandType != CMD_INSERT" at the above if clause, elog() is called in the query_planner and planner don't reach above if clause. Of course there is the case that query_planner returns the dummy joinrel and elog is not called, but in that case, current_rel->relid may be 0 (current_rel is dummy joinrel) and root->parse->resultRelation may be not 0 (a query is INSERT). 4. Can't we use define value (IS_PARTITIONED or something like IS_INHERITANCE?) to identify inheritance and partitioned table in below code? It was little confusing to me that which code is related to inheritance/partitioned when looking codes. In make_one_rel(): + if (root->parse->resultRelation && + root->simple_rte_array[root->parse->resultRelation]->inh) + { ... 
+ rel = inheritance_make_rel_from_joinlist(root, targetrel, joinlist); In inheritance_make_rel_from_joinlist(): + if (childrel->part_scheme != NULL) + childrel = + inheritance_make_rel_from_joinlist(root, childrel, + translated_joinlist); I can't review inheritance_adjust_scanjoin_target() deeply, because it is difficult to me to understand fully codes about join processing. -- Yoshikazu Imai
Imai-san, On 2018/10/04 17:11, Imai, Yoshikazu wrote: > Hi, Amit! > > On Thu, Sept 13, 2018 at 10:29 PM, Amit Langote wrote: >> Attached is what I have at the moment. > > I also do the code review of the patch. Thanks a lot for your review. I'm working on updating the patches as mentioned upthread and should be able to post a new version sometime next week. I will consider your comments along with David's and Dilip's before posting the new patch. Regards, Amit
Hi Dilip, Thanks for your review comments. Sorry it took me a while replying to them. On 2018/09/29 22:30, Dilip Kumar wrote: > I was going through your patch (v3-0002) and I have some suggestions > > 1. > - if (nparts > 0) > + /* > + * For partitioned tables, we just allocate space for RelOptInfo's. > + * pointers for all partitions and copy the partition OIDs from the > + * relcache. Actual RelOptInfo is built for a partition only if it is > + * not pruned. > + */ > + if (rte->relkind == RELKIND_PARTITIONED_TABLE) > + { > rel->part_rels = (RelOptInfo **) > - palloc(sizeof(RelOptInfo *) * nparts); > + palloc0(sizeof(RelOptInfo *) * rel->nparts); > + return rel; > + } > > I think we can delay allocating memory for rel->part_rels? And we can > allocate in add_rel_partitions_to_query only > for those partitions which survive pruning. Unfortunately, we can't do that. The part_rels array is designed such that there is one entry in it for each partition of a partitioned table that physically exists. Since the pruning code returns a set of indexes into the array of *all* partitions, the planner's array must also be able to hold *all* partitions, even if most of them would be NULL in the common case. Also, partition wise join code depends on finding a valid (even if dummy) RelOptInfo for *all* partitions of a partitioned table to support outer join planning. Maybe, we can change partition wise join code to not depend on such dummy-marked RelOptInfos on the nullable side, but until then we'll have to leave part_rels like this. > 2. > add_rel_partitions_to_query > ... > + /* Expand the PlannerInfo arrays to hold new partition objects. */ > + num_added_parts = scan_all_parts ? 
rel->nparts : > + bms_num_members(partindexes); > + new_size = root->simple_rel_array_size + num_added_parts; > + root->simple_rte_array = (RangeTblEntry **) > + repalloc(root->simple_rte_array, > + sizeof(RangeTblEntry *) * new_size); > + root->simple_rel_array = (RelOptInfo **) > + repalloc(root->simple_rel_array, > + sizeof(RelOptInfo *) * new_size); > + if (root->append_rel_array) > + root->append_rel_array = (AppendRelInfo **) > + repalloc(root->append_rel_array, > + sizeof(AppendRelInfo *) * new_size); > + else > + root->append_rel_array = (AppendRelInfo **) > + palloc0(sizeof(AppendRelInfo *) * > + new_size); > > Instead of re-pallocing for every partitioned relation can't we first > count the total number of surviving partitions and > repalloc at once. Hmm, doing this seems a bit hard too. Since the function to perform partition pruning (prune_append_rel_partitions), which determines the number of partitions that will be added, currently contains a RelOptInfo argument for using the partitioning info created by set_relation_partition_info, we cannot call it on a partitioned table unless we've created a RelOptInfo for it. So, we cannot delay creating partition RelOptInfos to a point where we've figured out all the children, because some of them might be partitioned tables themselves. IOW, we must create them as we process each partitioned table (and its partitioned children recursively) and expand the PlannerInfo arrays as we go. I've thought about the alternative(s) that will allow us to do what you suggest, but it cannot be implemented without breaking how we initialize partitioning info in RelOptInfo. For example, we could refactor prune_append_rel_array's interface to accept a Relation pointer instead of RelOptInfo, but we don't have a Relation pointer handy at the point where we can do pruning without re-opening it, which is something to avoid. Actually, that's not the only refactoring needed as I've come to know when trying to implement it. 
On a related note, with the latest patch, I've also delayed regular inheritance expansion as a whole, to avoid making the partitioning expansion a special case. That means we'll expand the planner arrays every time an inheritance parent is encountered in the range table. The only aspect where partitioning becomes a special case in the new code is that we can call the pruning code even before we've expanded the planner arrays, and the pruning limits the size to which the arrays must be expanded. Regular inheritance requires that the planner arrays be expanded by the amount given by list_length(find_all_inheritors(root_parent_oid)). > 3. > + /* > + * And do prunin. Note that this adds AppendRelInfo's of only the > + * partitions that are not pruned. > + */ > > prunin/pruning I've since rewritten this comment and fixed the misspelling. I'm still working on some other comments. Thanks, Amit
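The reply to Dilip's first point can be sketched as follows, assuming toy types rather than the planner's real ones: the array of per-partition pointers has one zero-initialized slot for every partition that exists, and only the slots whose index survived pruning get an object. ToyRel, ToyParent, toy_build_live_parts and toy_free_parts are hypothetical names, calloc stands in for palloc0, and a 64-bit mask stands in for the Bitmapset of partition indexes that the pruning code returns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a partition's RelOptInfo. */
typedef struct ToyRel
{
    int part_index;             /* position in the parent's part_rels array */
} ToyRel;

/* Toy stand-in for the partitioned parent's RelOptInfo. */
typedef struct ToyParent
{
    int      nparts;            /* number of partitions that exist */
    ToyRel **part_rels;         /* nparts slots; NULL means pruned */
} ToyParent;

/*
 * Allocate a slot for every partition, but build an entry only for those
 * whose bit is set in live_parts.  Returns the number of entries built,
 * or -1 if an allocation fails.
 */
int
toy_build_live_parts(ToyParent *parent, int nparts, uint64_t live_parts)
{
    int built = 0;

    assert(parent != NULL && nparts > 0 && nparts <= 64);

    parent->nparts = nparts;
    parent->part_rels = calloc((size_t) nparts, sizeof(ToyRel *));
    if (parent->part_rels == NULL)
        return -1;

    for (int i = 0; i < nparts; i++)
    {
        if ((live_parts & (UINT64_C(1) << i)) == 0)
            continue;           /* pruned: the slot stays NULL */

        parent->part_rels[i] = malloc(sizeof(ToyRel));
        if (parent->part_rels[i] == NULL)
            return -1;
        parent->part_rels[i]->part_index = i;
        built++;
    }
    return built;
}

/* Release everything toy_build_live_parts allocated. */
void
toy_free_parts(ToyParent *parent)
{
    for (int i = 0; i < parent->nparts; i++)
        free(parent->part_rels[i]);
    free(parent->part_rels);
    parent->part_rels = NULL;
    parent->nparts = 0;
}
```

Because the pruning result is expressed as indexes into the full partition list, a lookup such as part_rels[5] stays valid without any renumbering. That is the reason given above for keeping the array at full size even though most slots are NULL in the common case. The toy loop visits every slot only for brevity; walking just the set bits would avoid that.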
Hi David, I've managed to get back to the rest of your comments. Sorry that it took me a while. On 2018/09/11 8:23, David Rowley wrote: > On 30 August 2018 at 21:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated patches, with 0002 containing the changes mentioned above. > > Here's my first pass review on this: > > 0002: > > 10. I think moving the PlannerInfo->total_table_pages calculation > needs to be done in its own patch. This is a behavioural change where > we no longer count pruned relations in the calculation which can > result in plan changes. I posted a patch in [1] to fix this as a bug > fix as I think the current code is incorrect. My patch also updates > the first paragraph of the comment. You've not done that. As mentioned on the other thread, I've included your patch in my own series. > 11. "pruning" > > + * And do prunin. Note that this adds AppendRelInfo's of only the Fixed. > 12. It's much more efficient just to do bms_add_range() outside the loop in: > > + for (i = 0; i < rel->nparts; i++) > + { > + rel->part_rels[i] = build_partition_rel(root, rel, > + rel->part_oids[i]); > + rel->live_parts = bms_add_member(rel->live_parts, i); > + } You're right, I almost forgot about bms_add_range that we just recently added. > 13. In set_append_rel_size() the foreach(l, root->append_rel_list) > loop could be made to loop over RelOptInfo->live_parts instead which > would save having to skip over AppendRelInfos that don't belong to > this parent. You'd need to make live_parts more general purpose and > also use it to mark children of inheritance parents. Hmm, I've thought about it, but live_parts is a set of indexes into the part_rels array, which is meaningful only for partitioned tables. I'm not really sure if we should generalize part_rels to child_rels and set it even for regular inheritance, if only for the efficiency of set_append_rel_size and set_append_rel_pathlists. 
In the updated patches, I've managed to make this whole effort applicable to regular inheritance in general (as I said I would last month), where the only special case code needed for partitioning is to utilize the functionality of partprune.c, but I wasn't able to generalize everything. We can do that in a follow-on patch, maybe. > 14. I think you can skip the following if both are NULL. You could > likely add more smarts for different join types, but I think you > should at least skip when both are NULL. Perhaps the loop could be > driven off of bms_intersec of the two Rel's live_parts? > > + if (child_rel1 == NULL) > + child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts); > + if (child_rel2 == NULL) > + child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts); Actually, not needing to build the dummy partition rel here is something we could get back to doing someday, but not in this patch. The logic involved is too closely tied to join planning, so I thought we could pursue it later. > 15. The following is not required when append_rel_array was previously NULL. > > + MemSet(root->append_rel_array + root->simple_rel_array_size, > + 0, sizeof(AppendRelInfo *) * num_added_parts); Yeah, I've fixed that. If previously NULL, it's palloc'd. > 16. I don't think scan_all_parts is worth the extra code. The cost of > doing bms_num_members is going to be pretty small in comparison to > building paths for and maybe doing a join search for all partitions. > > + num_added_parts = scan_all_parts ? rel->nparts : > + bms_num_members(partindexes); > > In any case, you've got a bug in prune_append_rel_partitions() where > you're setting scan_all_parts to true instead of returning when > contradictory == true. I too started disliking it, so got rid of scan_all_parts. > 17. lockmode be of type LOCKMODE, not int. 
> > + Oid childOID, int lockmode, This might no longer be the same code as you saw when reviewing, but I've changed all lockmode variables in the patch to use LOCKMODE. > 18. Comment contradicts the code. > > + /* Open rel if needed; we already have required locks */ > + if (childOID != parentOID) > + { > + childrel = heap_open(childOID, lockmode); > > I think you should be using NoLock here. > > 19. Comment contradicts the code. > > + /* Close child relations, but keep locks */ > + if (childOID != parentOID) > + { > + Assert(childrel != NULL); > + heap_close(childrel, lockmode); > + } These two blocks, too. There is no such code with the latest patch. Code's been moved such that there is no need for such special handling. > 20. I assume translated_vars can now be NIL due to > build_dummy_partition_rel() not setting it. > > - if (var->varlevelsup == 0 && appinfo) > + if (var->varlevelsup == 0 && appinfo && appinfo->translated_vars) > > It might be worth writing a comment to explain that, otherwise it's > not quite clear why you're doing this. Regarding this and and one more comment below, I've replied below. > 21. Unrelated change; > > Assert(relid > 0 && relid < root->simple_rel_array_size); > + Couldn't find it in the latest patch, so must be gone. > 22. The following comment mentions that Oids are copied, but that does > not happen in the code. > > + /* > + * For partitioned tables, we just allocate space for RelOptInfo's. > + * pointers for all partitions and copy the partition OIDs from the > + * relcache. Actual RelOptInfo is built for a partition only if it is > + * not pruned. > + */ > > The Oid copying already happened during get_relation_info(). Couldn't find this comment in the new code. > 23. Traditionally translated_vars populated with a sequence of Vars in > the same order to mean no translation. 
Here you're changing how that > works: > > + /* leaving translated_vars to NIL to mean no translation needed */ > > This seems to be about the extent of your documentation on this, which > is not good enough. I've changed the other code such that only the existing meaning of no-op translation list is preserved, that is, the one you wrote above. I modified build_dummy_partition_rel such that a no-op translated vars is built based on parent's properties, instead of leaving it NIL. I've also removed the special case code in adjust_appendrel_attrs_mutator which checked translated_vars for NIL. > 24. "each" -> "reach"? ... Actually, I don't understand the comment. > In a partitioned hierarchy, how can the one before the top-level > partitioned table not be a partitioned table? > > /* > * Keep moving up until we each the parent rel that's not a > * partitioned table. The one before that one would be the root > * parent. > */ This comment and the code has since been rewritten, but to clarify, this is the new comment: * Figuring out if we're the root partitioned table is a bit involved, * because the root table mentioned in the query itself might be a * child of UNION ALL subquery. */ > 25. "already" > > + * expand_inherited_rtentry alreay locked all partitions, so pass No longer appears in the latest patch. > 26. Your changes to make_partitionedrel_pruneinfo() look a bit broken. > You're wrongly assuming live_parts is the same as present_parts. If a > CHECK constraint eliminated the partition then those will be present > in live_parts but won't be part of the Append/MergeAppend subplans. > You might be able to maintain some of this optimisation by checking > for dummy rels inside the loop, but you're going to need to put back > the code that sets present_parts. > > + present_parts = bms_copy(subpart->live_parts); You're right. 
The new code structure is such that it allows to delete the partition index of a partition that's excluded by constraints, so I fixed it so that live_parts no longer contains the partitions that are excluded by constraints. The existing code that sets present parts goes through all partitions by looping over nparts members of part_rels, which is a pattern I think we should avoid, as I'd think you'd agree. > 27. Comment contradicts the code: > > + Bitmapset *live_parts; /* unpruned parts; NULL if all are live */ > > in add_rel_partitions_to_query() you're doing: > > + if (scan_all_parts) > + { > + for (i = 0; i < rel->nparts; i++) > + { > + rel->part_rels[i] = build_partition_rel(root, rel, > + rel->part_oids[i]); > + rel->live_parts = bms_add_member(rel->live_parts, i); > + } > + } > > so the NULL part seems untrue. I've rewritten the comment. > > 28. Missing comments: > > + TupleDesc tupdesc; > + Oid reltype; > > 29. The comment for prune_append_rel_partitions claims it "Returns RT > indexes", but that's not the case, per: > > -Relids > -prune_append_rel_partitions(RelOptInfo *rel) > +void > +prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel) Has changed in the latest patch and now it returns a bitmapset of partition indexes, same as what get_matching_partitions does. > 0003: > > 30. 2nd test can be tested inside the first test to remove redundant > partition check. > > - inhOIDs = find_all_inheritors(parentOID, lockmode, NULL); > + if (rte->relkind != RELKIND_PARTITIONED_TABLE) > + inhOIDs = find_all_inheritors(parentOID, lockmode, NULL); > > /* > * Check that there's at least one descendant, else treat as no-child > * case. This could happen despite above has_subclass() check, if table > * once had a child but no longer does. > */ > - if (list_length(inhOIDs) < 2) > + if (rte->relkind != RELKIND_PARTITIONED_TABLE && list_length(inhOIDs) < 2) > { This code has changed quite a bit in the latest patch, so the comment no longer applies. > 31. 
The following code is wrong: > > + /* Determine the correct lockmode to use. */ > + if (rootRTindex == root->parse->resultRelation) > + lockmode = RowExclusiveLock; > + else if (rootrc && RowMarkRequiresRowShareLock(rootrc->markType)) > + lockmode = RowShareLock; > + else > + lockmode = AccessShareLock; > > rootRTIndex remains at 0 if there are no row marks and resultRelation > will be 0 for SELECT queries, this means you'll use a RowExclusiveLock > for a SELECT instead of an AccessShareLock. > > Why not just check the lockmode of the parent and use that? No such logic exists anymore due to recent developments (addition of rellockmode to RTE). I'll post the latest patches in this week. Thanks, Amit
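To make comment 31 above concrete, here is a small sketch of why the quoted test picks the wrong lock for a plain SELECT. It assumes toy enum values and function names (ToyLockMode, toy_lockmode_as_quoted, toy_lockmode_guarded) that are hypothetical. The guarded variant is only one illustrative way to avoid the problem; per the reply above, the patch dropped this logic entirely once rellockmode was added to the RTE.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for AccessShareLock, RowShareLock and RowExclusiveLock. */
typedef enum ToyLockMode
{
    TOY_ACCESS_SHARE,
    TOY_ROW_SHARE,
    TOY_ROW_EXCLUSIVE
} ToyLockMode;

/*
 * Mirrors the structure of the quoted hunk: both indexes are 0 for a
 * SELECT without row marks, so the first test succeeds by accident.
 */
ToyLockMode
toy_lockmode_as_quoted(int root_rt_index, int result_relation,
                       bool rowmark_needs_row_share)
{
    if (root_rt_index == result_relation)
        return TOY_ROW_EXCLUSIVE;
    else if (rowmark_needs_row_share)
        return TOY_ROW_SHARE;
    else
        return TOY_ACCESS_SHARE;
}

/*
 * Hypothetical guarded variant: treat 0 as "no result relation" before
 * comparing the two indexes.
 */
ToyLockMode
toy_lockmode_guarded(int root_rt_index, int result_relation,
                     bool rowmark_needs_row_share)
{
    assert(root_rt_index >= 0 && result_relation >= 0);

    if (result_relation != 0 && root_rt_index == result_relation)
        return TOY_ROW_EXCLUSIVE;
    else if (rowmark_needs_row_share)
        return TOY_ROW_SHARE;
    else
        return TOY_ACCESS_SHARE;
}
```

Calling either function with (0, 0, false) models the SELECT case from the review: the first returns the row-exclusive mode, the second the access-share mode that a read-only scan should take.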
Imai-san, Thank you for reviewing. On 2018/10/04 17:11, Imai, Yoshikazu wrote: > Hi, Amit! > > On Thu, Sept 13, 2018 at 10:29 PM, Amit Langote wrote: >> Attached is what I have at the moment. > > I also do the code review of the patch. > I could only see a v3-0001.patch so far, so below are all about v3-0001.patch. > > I am new to inheritance/partitioning codes and code review, so my review below might have some mistakes. If there are mistakes,please point out that kindly :) > > > v3-0001: > 1. Is there any reason inheritance_make_rel_from_joinlist returns "parent" that is passed to it? I think we can refer parentin the caller even if inheritance_make_rel_from_joinlist returns nothing. > > +static RelOptInfo * > +inheritance_make_rel_from_joinlist(PlannerInfo *root, > + RelOptInfo *parent, > + List *joinlist) > +{ > ... > + return parent; > +} There used to be a reason to do that in previous versions, but seems it doesn't hold anymore. I've changed it to not return any value. > 2. Is there the possible case that IS_DUMMY_REL(child_joinrel) is true in inheritance_adjust_scanjoin_target()? > > +inheritance_adjust_scanjoin_target(PlannerInfo *root, > ... > +{ > ... > + foreach(lc, root->inh_target_child_rels) > + { > ... > + /* > + * If the child was pruned, its corresponding joinrel will be empty, > + * which we can ignore safely. > + */ > + if (IS_DUMMY_REL(child_joinrel)) > + continue; > > I think inheritance_make_rel_from_joinlist() doesn't make dummy joinrel for the child that was pruned. > > +inheritance_make_rel_from_joinlist(PlannerInfo *root, > ... > +{ > ... > + foreach(lc, root->append_rel_list) > + { > + RelOptInfo *childrel; > ... > + /* Ignore excluded/pruned children. */ > + if (IS_DUMMY_REL(childrel)) > + continue; > ... > + /* > + * Save child joinrel to be processed later in > + * inheritance_adjust_scanjoin_target, which adjusts its paths to > + * be able to emit expressions in query's top-level target list. 
> + */ > + root->inh_target_child_rels = lappend(root->inh_target_child_rels, > + childrel); > + } > +} You're right. Checking that in inheritance_adjust_scanjoin_target was redundant. > 3. > Is the "root->parse->commandType != CMD_INSERT" required in: > > @@ -2018,13 +1514,45 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, > ... > + /* > + * For UPDATE/DELETE on an inheritance parent, we must generate and > + * apply scanjoin target based on targetlist computed using each > + * of the child relations. > + */ > + if (parse->commandType != CMD_INSERT && > + current_rel->reloptkind == RELOPT_BASEREL && > + current_rel->relid == root->parse->resultRelation && > + root->simple_rte_array[current_rel->relid]->inh) > ... > > @@ -2137,92 +1665,123 @@ grouping_planner(PlannerInfo *root, bool inheritance_update, > final_rel->fdwroutine = current_rel->fdwroutine; > > ... > - foreach(lc, current_rel->pathlist) > + if (current_rel->reloptkind == RELOPT_BASEREL && > + current_rel->relid == root->parse->resultRelation && > + !IS_DUMMY_REL(current_rel) && > + root->simple_rte_array[current_rel->relid]->inh && > + parse->commandType != CMD_INSERT) > > > I think if a condition would be "current_rel->relid == root->parse->resultRelation && parse->commandType != CMD_INSERT"at the above if clause, elog() is called in the query_planner and planner don't reach above if clause. > Of course there is the case that query_planner returns the dummy joinrel and elog is not called, but in that case, current_rel->relidmay be 0(current_rel is dummy joinrel) and root->parse->resultRelation may be not 0(a query is INSERT). Yeah, I realized that we can actually Assert(parse->commandType != CMD_INSERT) if the inh flag of the target/resultRelation RTE is true. So, checking it explicitly is redundant. > 4. Can't we use define value(IS_PARTITIONED or something like IS_INHERITANCE?) to identify inheritance and partitionedtable in below code? 
It was little confusing to me that which code is related to inheritance/partitioned whenlooking codes. > > In make_one_rel(): > + if (root->parse->resultRelation && > + root->simple_rte_array[root->parse->resultRelation]->inh) > + { > ... > + rel = inheritance_make_rel_from_joinlist(root, targetrel, joinlist); > > In inheritance_make_rel_from_joinlist(): > + if (childrel->part_scheme != NULL) > + childrel = > + inheritance_make_rel_from_joinlist(root, childrel, > + translated_joinlist); As it might have been clear, the value of inh flag is checked to detect whether the operation uses inheritance, because it requires special handling -- the operation must be applied to every child relation that satisfies the conditions of the query. part_scheme is checked to detect partitioning which has some special behavior with respect to how the AppendRelInfos are set up. For partitioning, AppendRelInfos map partitions to their immediate parent, not the top-most root parent, so the current code uses recursion to apply certain transformations. That's not required for non-partitioned case, because all children are mapped to the root inheritance parent. I'm trying to unify the two so that the partitioning case doesn't need any special handling. > I can't review inheritance_adjust_scanjoin_target() deeply, because it is difficult to me to understand fully codes aboutjoin processing. Thanks for your comments, they are valuable. Regards, Amit
And here is the latest set of patches. Sorry it took me a while.

* To reiterate, this set of patches is intended to delay adding partitions to the query's range table and PlannerInfo, which leads to significantly reduced overheads in the common case where queries accessing partitioned tables need to scan only one or a few partitions. As I mentioned before on this thread [1], I wanted to change the patches such that the new approach is adopted for all inheritance tables, not just partitioned tables, because most of the code and data structures are shared and it no longer made sense to me to create a divergence between the two cases by making partitioning a special case in a number of places in the planner code. I think I've managed to get to that point with the latest patches. The new code that performs inheritance expansion is common to partitioning, inheritance, and also flattened UNION ALL to some degree, because all of these cases use more or less the same code. For partitioning, child relations are opened and RTEs are added for them *after* performing pruning, whereas for inheritance that's not possible, because pruning using constraint exclusion cannot be performed without opening the children. This is the main difference between the handling of the two cases; the other differences are minor details.

* With the unification described above, inheritance expansion is no longer part of the planner's prep phase, so I've located the new expansion code in a new file under optimizer/utils named append.c, because it manages creation of an appendrel's child relations. That means a lot of helper code that once was in prepunion.c is now in the new file. Actually, I've moved *some* of the helper routines, such as adjust_appendrel_*, that do the mapping of expressions between appendrel parent and children to a different file, appendinfo.c.
* I've also rewritten the patch to change how inherited update/delete planning is done, so that it doesn't end up messing too much with grouping_planner. Instead of removing inheritance_planner altogether as the earlier patches did, I've rewritten it such that the CPU and memory intensive query_planner portion is run only once at the beginning to create scan/join paths for non-excluded/unpruned child target relations, followed by running grouping_planner to apply top-level targetlist suitable for each child target relation. grouping_planner is modified slightly such that when it runs for every child target relation, it doesn't again need to perform query planning. It instead receives the RelOptInfos that contain the scan/join paths of child target relations that were generated during the initial query_planner run. Beside this main change, I've removed quite a bit of code in inheritance_planner that relied on the existing way of redoing planning for each child target relation. The changes are described in the commit message of patch 0002. * A few notes on the patches: 0001 is a separate patch, because it might be useful in some other projects like [2]. 0003 is David's patch that he posted in [3]. I didn't get time today to repeat the performance tests, but I'm planning to next week. In the meantime, please feel free to test them and provide comments on the code changes. Thanks, Amit [1] https://www.postgresql.org/message-id/8097576f-f795-fc42-3c00-073f68bfb0b4%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/de957e5b-faa0-6fb9-c0ab-0b389d71cf5a%40lab.ntt.co.jp [3] Re: Calculate total_table_pages after set_base_rel_sizes() https://www.postgresql.org/message-id/CAKJS1f9NiQXO9KCv_cGgBShwqwT78wmArOht-5kWL%2BBt0v-AnQ%40mail.gmail.com
Attachment
- 0001-Store-inheritance-root-parent-index-in-otherrel-s-Re.patch
- 0002-Overhaul-inheritance-update-delete-planning.patch
- 0003-Calculate-total_table_pages-after-set_base_rel_sizes.patch
- 0004-Lazy-creation-of-RTEs-for-inheritance-children.patch
- 0005-Teach-planner-to-only-process-unpruned-partitions.patch
- 0006-Do-not-lock-all-partitions-at-the-beginning.patch
Hi, On Thu, Oct 25, 2018 at 10:38 PM, Amit Langote wrote: > And here is the latest set of patches. Sorry it took me a while. Thanks for revising the patch! > I didn't get time today to repeat the performance tests, but I'm planning > to next week. In the meantime, please feel free to test them and provide > comments on the code changes. Since it takes me a lot of time to see all of the patches, I will post comments little by little from easy parts like typo check. 1. 0002: + * inheritance_make_rel_from_joinlist + * Perform join planning for all non-dummy leaf inheritance chilren + * in their role as query's target relation "inheritance chilren" -> "inheritance children" 2. 0002: + /* + * Sub-partitions have to be processed recursively, because + * AppendRelInfoss link sub-partitions to their immediate parents, not + * the root partitioned table. + */ AppendRelInfoss -> AppendRelInfos (?) 3. 0002: + /* Reset interal join planning data structures. */ interal -> internal 4. 0004: -static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, - Index rti); Comments referring to expand_inherited_rtentry() is left. backend/optimizer/plan/planner.c:1310: * Because of the way expand_inherited_rtentry works, that should be backend/optimizer/plan/planner.c:1317: * Instead the duplicate child RTE created by expand_inherited_rtentry backend/optimizer/util/plancat.c:118: * the rewriter or when expand_inherited_rtentry() added it to the query's backend/optimizer/util/plancat.c:640: * the rewriter or when expand_inherited_rtentry() added it to the query's About the last two comments in the above, "the rewriter or when expand_inherited_rtentry() added it to the query's" would be "the rewriter or when add_inheritance_child_rel() added it to the query's". I don't know how to correct the first two comments. 5. 
0004: -static void expand_partitioned_rtentry(PlannerInfo *root, - RangeTblEntry *parentrte, - Index parentRTindex, Relation parentrel, - PlanRowMark *top_parentrc, LOCKMODE lockmode, - List **appinfos); Comments referring to expand_partitioned_rtentry() is also left. backend/executor/execPartition.c:941: /* * get_partition_dispatch_recurse * Recursively expand partition tree rooted at rel * * As the partition tree is expanded in a depth-first manner, we maintain two * global lists: of PartitionDispatch objects corresponding to partitioned * tables in *pds and of the leaf partition OIDs in *leaf_part_oids. * * Note that the order of OIDs of leaf partitions in leaf_part_oids matches * the order in which the planner's expand_partitioned_rtentry() processes * them. It's not necessarily the case that the offsets match up exactly, * because constraint exclusion might prune away some partitions on the * planner side, whereas we'll always have the complete list; but unpruned * partitions will appear in the same order in the plan as they are returned * here. */ I think the second paragraph of the comments is no longer correct. expand_partitioned_rtentry() expands the partition tree in a depth-first manner, whereas expand_append_rel() doesn't neither in a depth-first manner nor a breadth-first manner as below. partitioned table tree image: pt sub1 sub1_1 leaf0 leaf1 sub2 leaf2 leaf3 append_rel_list(expanded by expand_partitioned_rtentry): [sub1, sub1_1, leaf0, leaf1, sub2, leaf2, leaf3] append_rel_list(expanded by expand_append_rel): [sub1, sub2, leaf3, sub1_1, leaf1, leaf0, leaf2] 6. 0004 + /* + * A partitioned child table with 0 children is a dummy rel, so don't + * bother creating planner objects for it. 
+ */ + if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + { + PartitionDesc partdesc = RelationGetPartitionDesc(childrel); + + Assert(!RELATION_IS_OTHER_TEMP(childrel)); + if (partdesc->nparts == 0) + { + heap_close(childrel, NoLock); + return NULL; + } + } + + /* + * If childrel doesn't belong to this session, skip it, also relinquishing + * the lock. + */ + if (RELATION_IS_OTHER_TEMP(childrel)) + { + heap_close(childrel, lockmode); + return NULL; + } If we process the latter if block before the former one, Assert can be excluded from the code. It might be difficult to me to examine the codes whether there exists any logical wrongness along with significant planner code changes, but I will try to look it. Since it passes make check-world successfully, I think as of now there might be not any wrong points. If there's time, I will also do the performance tests. -- Yoshikazu Imai
About inheritance_make_rel_from_joinlist(), I considered how it processes joins for a sub-partitioned table.

sub-partitioned-table image:
part
  sub1
    leaf1
    leaf2

inheritance_make_rel_from_joinlist() translates join_list and join_info_list for each leaf (leaf1 and leaf2 in the above image). To translate those two lists for leaf1, inheritance_make_rel_from_joinlist() translates the lists from part to sub1 and then from sub1 to leaf1. For leaf2, it translates the lists from part to sub1 and then from sub1 to leaf2. The translation from part to sub1 is done in both cases, and I think it is better if we can eliminate the repetition. I attached 0002-delta.patch.

In the patch, inheritance_make_rel_from_joinlist() translates the lists not only for leaves but also for mid-nodes, in a depth-first manner, so it can use the lists of a mid-node when translating from that mid-node to its leaves, which eliminates the repeated translation.

I think it might be better if we can apply the same logic to inheritance_planner(), which once implemented the same logic, as below.

    foreach(lc, root->append_rel_list)
    {
        AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
        ...
        /*
         * expand_inherited_rtentry() always processes a parent before any of
         * that parent's children, so the parent_root for this relation should
         * already be available.
         */
        parent_root = parent_roots[appinfo->parent_relid];
        Assert(parent_root != NULL);
        parent_parse = parent_root->parse;
        ...
        subroot->parse = (Query *)
            adjust_appendrel_attrs(parent_root,
                                   (Node *) parent_parse,
                                   1, &appinfo);

--
Yoshikazu Imai
Attachment
ISTM patch 0004 is impossible to review just because of size -- I suppose the bulk of it is just code that moves from one file to another, but that there are also code changes in it. Maybe it would be better to split it so that one patch moves the code to the new files without changing it, then another patch does the actual functional changes? If git shows the first half as a rename, it's easier to be confident about it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks for looking. On 2018/11/07 12:32, Alvaro Herrera wrote: > ISTM patch 0004 is impossible to review just because of size -- I > suppose the bulk of it is just code that moves from one file to another, > but that there are also code changes in it. Maybe it would be better to > split it so that one patch moves the code to the new files without > changing it, then another patch does the actual functional changes? If > git shows the first half as a rename, it's easier to be confident about > it. Okay, will post a new version structured that way shortly. Thanks, Amit
Imai-san, Thank you for reviewing. On 2018/11/05 17:28, Imai, Yoshikazu wrote: > Since it takes me a lot of time to see all of the patches, I will post comments > little by little from easy parts like typo check. I've fixed the typos you pointed out. > 4. > 0004: > -static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, > - Index rti); > > Comments referring to expand_inherited_rtentry() is left. > > backend/optimizer/plan/planner.c:1310: > * Because of the way expand_inherited_rtentry works, that should be > backend/optimizer/plan/planner.c:1317: > * Instead the duplicate child RTE created by expand_inherited_rtentry > backend/optimizer/util/plancat.c:118: > * the rewriter or when expand_inherited_rtentry() added it to the query's > backend/optimizer/util/plancat.c:640: > * the rewriter or when expand_inherited_rtentry() added it to the query's > > About the last two comments in the above, > "the rewriter or when expand_inherited_rtentry() added it to the query's" > would be > "the rewriter or when add_inheritance_child_rel() added it to the query's". > > I don't know how to correct the first two comments. I've since modified the patch to preserve expand_inherited_rtentry name, although with a new implementation. > 5. > 0004: > -static void expand_partitioned_rtentry(PlannerInfo *root, > - RangeTblEntry *parentrte, > - Index parentRTindex, Relation parentrel, > - PlanRowMark *top_parentrc, LOCKMODE lockmode, > - List **appinfos); > > Comments referring to expand_partitioned_rtentry() is also left. I've preserved this name as well. > backend/executor/execPartition.c:941: > /* > * get_partition_dispatch_recurse > * Recursively expand partition tree rooted at rel > * > * As the partition tree is expanded in a depth-first manner, we maintain two > * global lists: of PartitionDispatch objects corresponding to partitioned > * tables in *pds and of the leaf partition OIDs in *leaf_part_oids. 
> * > * Note that the order of OIDs of leaf partitions in leaf_part_oids matches > * the order in which the planner's expand_partitioned_rtentry() processes > * them. It's not necessarily the case that the offsets match up exactly, > * because constraint exclusion might prune away some partitions on the > * planner side, whereas we'll always have the complete list; but unpruned > * partitions will appear in the same order in the plan as they are returned > * here. > */ > > I think the second paragraph of the comments is no longer correct. > expand_partitioned_rtentry() expands the partition tree in a depth-first > manner, whereas expand_append_rel() doesn't neither in a depth-first manner > nor a breadth-first manner as below. Well, expand_append_rel doesn't process a whole partitioning tree on its own, only one partitioned table at a time. As set_append_rel_size() traverses the root->append_rel_list, leaf partitions get added to the resulting plan in what's effectively a depth-first order. I've updated this comment as far this patch is concerned, although the patch in the other thread about removal of executor overhead in partitioning is trying to remove this function (get_partition_dispatch_recurse) altogether anyway. > 6. > 0004 > + /* > + * A partitioned child table with 0 children is a dummy rel, so don't > + * bother creating planner objects for it. > + */ > + if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) > + { > + PartitionDesc partdesc = RelationGetPartitionDesc(childrel); > + > + Assert(!RELATION_IS_OTHER_TEMP(childrel)); > + if (partdesc->nparts == 0) > + { > + heap_close(childrel, NoLock); > + return NULL; > + } > + } > + > + /* > + * If childrel doesn't belong to this session, skip it, also relinquishing > + * the lock. > + */ > + if (RELATION_IS_OTHER_TEMP(childrel)) > + { > + heap_close(childrel, lockmode); > + return NULL; > + } > > If we process the latter if block before the former one, Assert can be excluded > from the code. 
I've since moved the partitioning-related code to a partitioning specific function, so this no longer applies. Thanks, Amit
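For illustration, here is a toy model of the ordering question discussed in this exchange. It is only a sketch under stated assumptions, not code from the patch: the tree shape is inferred from the two append_rel_list orderings quoted above, and expand_and_visit, Node, and Trace are hypothetical names. It assumes that visiting a partitioned table first appends its direct children to the list and then recurses into each child, which is one way to get a list that is neither depth-first nor breadth-first while the leaves are still reached in depth-first order.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 16
#define MAX_KIDS 4

/* Hypothetical stand-in for a relation in a partition tree. */
typedef struct Node
{
    const char *name;
    int         nkids;          /* 0 means leaf partition in this toy model */
    int         kids[MAX_KIDS]; /* indexes into the node array */
} Node;

/* Records what the traversal did, in order. */
typedef struct Trace
{
    int nappended;
    int appended[MAX_NODES];    /* order in which children joined the list */
    int nleaves;
    int leaves[MAX_NODES];      /* order in which leaves were reached */
} Trace;

void
trace_init(Trace *trace)
{
    memset(trace, 0, sizeof(*trace));
}

/*
 * Visit one relation: append all of its direct children to the list
 * (expanding a single partitioned table), then recurse into each child.
 */
void
expand_and_visit(const Node *nodes, int self, Trace *trace)
{
    const Node *node = &nodes[self];

    if (node->nkids == 0)
    {
        assert(trace->nleaves < MAX_NODES);
        trace->leaves[trace->nleaves++] = self;
        return;
    }
    for (int i = 0; i < node->nkids; i++)
    {
        assert(trace->nappended < MAX_NODES);
        trace->appended[trace->nappended++] = node->kids[i];
    }
    for (int i = 0; i < node->nkids; i++)
        expand_and_visit(nodes, node->kids[i], trace);
}

/*
 * Assumed tree shape (inferred, not stated explicitly in the thread):
 * pt -> { sub1 -> { sub1_1 -> { leaf0 }, leaf1 }, sub2 -> { leaf2 }, leaf3 }
 */
enum { PT, SUB1, SUB1_1, LEAF0, LEAF1, SUB2, LEAF2, LEAF3, NNODES };

const Node example_tree[NNODES] = {
    [PT]     = {"pt",     3, {SUB1, SUB2, LEAF3}},
    [SUB1]   = {"sub1",   2, {SUB1_1, LEAF1}},
    [SUB1_1] = {"sub1_1", 1, {LEAF0}},
    [LEAF0]  = {"leaf0",  0, {0}},
    [LEAF1]  = {"leaf1",  0, {0}},
    [SUB2]   = {"sub2",   1, {LEAF2}},
    [LEAF2]  = {"leaf2",  0, {0}},
    [LEAF3]  = {"leaf3",  0, {0}},
};
```

Calling trace_init() and then expand_and_visit(example_tree, PT, &trace) records the list order sub1, sub2, leaf3, sub1_1, leaf1, leaf0, leaf2, while the leaves are reached as leaf0, leaf1, leaf2, leaf3. In this model, that is how a list built one partitioned table at a time can still yield leaves in what is effectively depth-first order.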
Imai-san, On 2018/11/07 10:00, Imai, Yoshikazu wrote: > About inheritance_make_rel_from_joinlist(), I considered how it processes > joins for sub-partitioned-table. > > sub-partitioned-table image: > part > sub1 > leaf1 > leaf2 > > inheritance_make_rel_from_joinlist() translates join_list and join_info_list > for each leafs(leaf1, leaf2 in above image). To translate those two lists for > leaf1, inheritance_make_rel_from_joinlist() translates lists from part to sub1 > and nextly from sub1 to leaf1. For leaf2, inheritance_make_rel_from_joinlist() > translates lists from part to sub1 and from sub1 to leaf2. There are same > translation from part to sub1, and I think it is better if we can eliminate it. > I attached 0002-delta.patch. > > In the patch, inheritance_make_rel_from_joinlist() translates lists not only for > leafs but for mid-nodes, in a depth-first manner, so it can use lists of > mid-nodes for translating lists from mid-node to leafs, which eliminates same > translation. I don't think the translation happens twice for the same leaf partitions. Applying adjust_appendrel_attrs_*multilevel* for only leaf nodes, as happens with the current code, is same as first translating using adjust_appendrel_attrs from part to sub1 and then from sub1 to leaf1 and leaf2 during recursion with sub1 as the parent. > I think it might be better if we can apply same logic to inheritance_planner(), > which once implemented the same logic as below. > > > foreach(lc, root->append_rel_list) > { > AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); > ... > /* > * expand_inherited_rtentry() always processes a parent before any of > * that parent's children, so the parent_root for this relation should > * already be available. > */ > parent_root = parent_roots[appinfo->parent_relid]; > Assert(parent_root != NULL); > parent_parse = parent_root->parse; > ... 
> subroot->parse = (Query *) > adjust_appendrel_attrs(parent_root, > (Node *) parent_parse, > 1, &appinfo); Actually, inheritance_planner is also using adjust_appendrel_attrs_multilevel. I'm not sure if you're looking at the latest (10/26) patch. Thanks, Amit
On 2018/11/07 12:44, Amit Langote wrote: > Thanks for looking. > > On 2018/11/07 12:32, Alvaro Herrera wrote: >> ISTM patch 0004 is impossible to review just because of size -- I >> suppose the bulk of it is just code that moves from one file to another, >> but that there are also code changes in it. Maybe it would be better to >> split it so that one patch moves the code to the new files without >> changing it, then another patch does the actual functional changes? If >> git shows the first half as a rename, it's easier to be confident about >> it. > > Okay, will post a new version structured that way shortly. And here are patches structured that way. I've addressed some of the comments posted by Imai-san upthread. Also, since David's patch to postpone PlannerInfo.total_pages calculation went into the tree earlier this week, it's no longer included in this set. With this breakdown, patches are as follows: v5-0001-Store-inheritance-root-parent-index-in-otherrel-s.patch Adds a inh_root_parent field that's set in inheritance child otherrel RelOptInfos to store the RT index of their root parent v5-0002-Overhaul-inheritance-update-delete-planning.patch Patch that adjusts planner so that inheritance_planner can use partition pruning. v5-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch Patch that adjusts planner such that inheritance expansion step is postponed from early subquery_planner to the beginning of set_append_rel_size, so that child tables are added to the Query and PlannerInfo after partition pruning. While the move only benefits partitioning, because non-partition children cannot be pruned until after they're fully opened, the new code handles both partitioned tables and regular inheritance parents. v5-0004-Move-append-expansion-code-into-its-own-file.patch With patch 0003, inheritance expansion is no longer a part of the prep phase of planning, so it seems odd that inheritance expansion code is in optimizer/prep/prepunion.c. 
This patch moves the code to a new file optimizer/utils/append.c. Also, some of the appendrel translation subroutines are moved over to optimizer/utils/appendinfo.c. No functional changes with this patch.

v5-0005-Teach-planner-to-only-process-unpruned-partitions.patch

Patch that adds a live_parts field to RelOptInfo, which is set in partitioned rels to contain the partition indexes of only the non-dummy children, and replaces loops of the following form:

    for (i = 0; i < rel->nparts; i++)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        ...some processing
    }

at various points within the planner with:

    i = -1;
    while ((i = bms_next_member(rel->live_parts, i)) >= 0)
    {
        RelOptInfo *partrel = rel->part_rels[i];

        ...some processing
    }

v5-0006-Do-not-lock-all-partitions-at-the-beginning.patch

A simple patch that removes the find_all_inheritors call made for the partitioned root parent only to lock *all* partitions, in favor of locking only the unpruned partitions.

Thanks,
Amit
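As a rough, self-contained sketch of the loop shape described for 0005: the code below is not from the patch. LiveParts, next_live_part, and collect_live_parts are hypothetical names, and a single 64-bit word stands in for the Bitmapset. The helper follows the calling convention of PostgreSQL's bms_next_member() (start with -1, stop when the result is negative).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a Bitmapset limited to 64 members:
 * bit i set means partition i survived pruning.
 */
typedef uint64_t LiveParts;

/*
 * Return the smallest member greater than prevbit, or -2 if there is none.
 * This mirrors the convention of bms_next_member(); the bit-by-bit scan is
 * only for clarity and is not how a real bitmap set would be scanned.
 */
int
next_live_part(LiveParts live, int prevbit)
{
    assert(prevbit >= -1);
    for (int bit = prevbit + 1; bit < 64; bit++)
    {
        if (live & (UINT64_C(1) << bit))
            return bit;
    }
    return -2;
}

/*
 * Visit only the live partitions, storing their indexes in visited[]
 * (which must have room for 64 entries). Returns how many were visited.
 */
int
collect_live_parts(LiveParts live, int *visited)
{
    int nvisited = 0;
    int i = -1;

    while ((i = next_live_part(live, i)) >= 0)
        visited[nvisited++] = i;    /* "...some processing" goes here */

    return nvisited;
}
```

With only partitions 3 and 17 live, the loop body runs twice no matter how large nparts is, which is the point of iterating over the set of live partitions rather than over every slot of part_rels.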
Attachment
- v5-0001-Store-inheritance-root-parent-index-in-otherrel-s.patch
- v5-0002-Overhaul-inheritance-update-delete-planning.patch
- v5-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v5-0004-Move-append-expansion-code-into-its-own-file.patch
- v5-0005-Teach-planner-to-only-process-unpruned-partitions.patch
- v5-0006-Do-not-lock-all-partitions-at-the-beginning.patch
Hi Amit,

On 11/9/18 3:55 AM, Amit Langote wrote:
> And here are patches structured that way. I've addressed some of the
> comments posted by Imai-san upthread. Also, since David's patch to
> postpone PlannerInfo.total_pages calculation went into the tree earlier
> this week, it's no longer included in this set.
>

Thanks for doing the split this way.

The patch passes check-world.

I ran a SELECT test using hash partitions, and got

         Master    v5
  64:    6k        59k
  1024:  283       59k

The non-partitioned case gives 77k.

The difference in TPS between the partition case vs. the non-partitioned case comes down to set_plain_rel_size() vs. set_append_rel_size() under set_base_rel_sizes(); flamegraphs for this sent off-list.

Best regards,
Jesper
On 9 November 2018 at 21:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > v5-0001-Store-inheritance-root-parent-index-in-otherrel-s.patch > > Adds a inh_root_parent field that's set in inheritance child otherrel > RelOptInfos to store the RT index of their root parent > > v5-0002-Overhaul-inheritance-update-delete-planning.patch > > Patch that adjusts planner so that inheritance_planner can use partition > pruning. I've started looking at these two, but only so far made it through 0001 and 0002. Here's what I noted down during the review. 0001: 1. Why do we need the new field that this patch adds? I see in 0002 it's used like: + if (childrel->inh_root_parent > 0 && + childrel->inh_root_parent == root->parse->resultRelation) Would something like... int relid; if (childrel->part_schema == NULL && bms_get_singleton_member(childrel->top_parent_relids, &relid) && relid == root->parse->resultRelation) ...not do the trick? 0002: 2. What's wrong with childrel->relids? + child_relids = bms_make_singleton(appinfo->child_relid); 3. Why not use childrel->top_parent_relids? + top_parent_relids = bms_make_singleton(childrel->inh_root_parent); 4. The following comment in inheritance_make_rel_from_joinlist() implies that the function will not be called for SELECT, but the comment above the function does not mention that. /* * For UPDATE/DELETE queries, the top parent can only ever be a table. * As a contrast, it could be a UNION ALL subquery in the case of SELECT. */ Assert(planner_rt_fetch(top_parent_relid, root)->rtekind == RTE_RELATION); 5. Should the following comment talk about "Sub-partitioned tables" rather than "sub-partitions"? + * Sub-partitions have to be processed recursively, because + * AppendRelInfos link sub-partitions to their immediate parents, not + * the root partitioned table. 6. Can't you just pass childrel->relids and childrel->top_parent_relids instead of making new ones? 
+ child_relids = bms_make_singleton(appinfo->child_relid); + Assert(childrel->inh_root_parent > 0); + top_parent_relids = bms_make_singleton(childrel->inh_root_parent); + translated_joinlist = (List *) + adjust_appendrel_attrs_multilevel(root, + (Node *) joinlist, + child_relids, + top_parent_relids); 7. I'm just wondering what your idea is here? + /* Reset join planning specific data structures. */ + root->join_rel_list = NIL; + root->join_rel_hash = NULL; Is there a reason to nullify these? You're not releasing any memory and the new structures that will be built won't overlap with the ones built last round. I don't mean to imply that the code is wrong, it just does not appear to be particularly right. 8. In regards to: + * NB: Do we need to change the child EC members to be marked + * as non-child somehow? + */ + childrel->reloptkind = RELOPT_BASEREL; I know we talked a bit about this before, but this time I put together a crude patch that runs some code each time we skip an em_is_child == true EquivalenceMember. The code checks if any of the em_relids are RELOPT_BASEREL. What I'm looking for here are places where we erroneously skip the member when we shouldn't. Running the regression tests with this patch in place shows a number of problems. Likely I should only trigger the warning when bms_membership(em->em_relids) == BMS_SINGLETON, but it never-the-less appears to highlight various possible issues. Applying the same on master only appears to show the cases where em->em_relids isn't a singleton set. I've attached the patch to let you see what I mean. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi Amit, On Thu, Nov 8, 2018 at 8:26 PM, Amit Langote wrote: > On 2018/11/07 10:00, Imai, Yoshikazu wrote: > > About inheritance_make_rel_from_joinlist(), I considered how it processes > > joins for sub-partitioned-table. > > > > sub-partitioned-table image: > > part > > sub1 > > leaf1 > > leaf2 > > > > inheritance_make_rel_from_joinlist() translates join_list and join_info_list > > for each leafs(leaf1, leaf2 in above image). To translate those two lists for > > leaf1, inheritance_make_rel_from_joinlist() translates lists from part to sub1 > > and nextly from sub1 to leaf1. For leaf2, inheritance_make_rel_from_joinlist() > > translates lists from part to sub1 and from sub1 to leaf2. There are same > > translation from part to sub1, and I think it is better if we can eliminate it. > > I attached 0002-delta.patch. > > > > In the patch, inheritance_make_rel_from_joinlist() translates lists not only for > > leafs but for mid-nodes, in a depth-first manner, so it can use lists of > > mid-nodes for translating lists from mid-node to leafs, which eliminates same > > translation. > > I don't think the translation happens twice for the same leaf partitions. > > Applying adjust_appendrel_attrs_*multilevel* for only leaf nodes, as > happens with the current code, is same as first translating using > adjust_appendrel_attrs from part to sub1 and then from sub1 to leaf1 and > leaf2 during recursion with sub1 as the parent. Thanks for replying. I interpreted your thoughts about translation as below. adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1 adjust_appendrel_attrs_multilevel for leaf2: sub1(produced at above) -> leaf2 But I wonder translation of leaf2 actually reuses the results of sub1 which is produced at leaf1 translation. ISTM translation for leaf1, leaf2 are executed as below. 
adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1 adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2 > > I think it might be better if we can apply same logic to inheritance_planner(), > > which once implemented the same logic as below. > > > > > > foreach(lc, root->append_rel_list) > > { > > AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); > > ... > > /* > > * expand_inherited_rtentry() always processes a parent before any of > > * that parent's children, so the parent_root for this relation should > > * already be available. > > */ > > parent_root = parent_roots[appinfo->parent_relid]; > > Assert(parent_root != NULL); > > parent_parse = parent_root->parse; > > ... > > subroot->parse = (Query *) > > adjust_appendrel_attrs(parent_root, > > (Node *) parent_parse, > > 1, &appinfo); > > Actually, inheritance_planner is also using > adjust_appendrel_attrs_multilevel. I'm not sure if you're looking at the > latest (10/26) patch. Sorry for my poor explanation. I described the above code as old code which is not patch applied. Since it is difficult to explain my thoughts with words, I will show the performance degration case. Partition tables are below two sets. Set1: [create 800 partitions directly under root] CREATE TABLE rt (a int, b int) PARTITION BY RANGE (a); \o /dev/null SELECT 'CREATE TABLE leaf' || x::text || ' PARTITION OF rt FOR VALUES FROM (' || (x)::text || ') TO (' || (x+1)::text || ');' FROM generate_series(0, 800) x; \gexec \o Set2: [create 800 partitions under a partitioned table which is under root] CREATE TABLE rt (a int, b int) PARTITION BY RANGE (a); CREATE TABLE sub1 PARTITION OF rt FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b); \o /dev/null SELECT 'CREATE TABLE leaf' || x::text || ' PARTITION OF sub1 FOR VALUES FROM (' || (x)::text || ') TO (' || (x+1)::text || ');' FROM generate_series(0, 800) x; \gexec \o Create a generic plan of updation or deletion. 
[create a delete generic plan]
    set plan_cache_mode = 'force_generic_plan';
    prepare delete_stmt(int) as delete from rt where b = $1;
    execute delete_stmt(1);

When creating a generic plan, paths/plans for all partitions are created, because we don't know which partitions will be used before the EXECUTE command runs. While creating paths in inheritance_planner(), adjust_appendrel_attrs()/adjust_appendrel_attrs_multilevel() is executed many times, and it allocates a lot of memory in total if there are many partitions.

The amount of memory used in the above tests is:

                        Set1     Set2
  without v5 patches    242MB    247MB
  with v5 patches       420MB    820MB

# Thanks for supplying v5 patches :)

Without sub-partitions (Set1), memory allocated by adjust_appendrel_attrs() with the v5 patches is 1.7x larger than without the v5 patches. With sub-partitions (Set2), memory allocated by adjust_appendrel_attrs() with the v5 patches is 3.3x larger than without the v5 patches.

I think the reason memory usage in "with v5 patches, Set2" is almost 2 times that of "with v5 patches, Set1" is that adjust_appendrel_attrs() from root to sub1 is performed again in the translation for each partition in "with v5 patches, Set2".

Currently, a query on a partitioned table is processed faster with a custom plan than with a generic plan, so we would not normally use a generic plan; for that reason I don't know whether we should solve this large memory allocation problem.

--
Yoshikazu Imai
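To make the repeated-translation argument above concrete, here is a small counting sketch. It is a toy model, not planner code: the function names and the parent-array representation are hypothetical, and a "translation" is simply counted as one parent-to-child step. It compares translating every leaf all the way from the root with translating each non-root relation once from its parent's already-translated copy.

```c
#include <assert.h>

/*
 * The tree is given as a parent array: parent[i] is the index of node i's
 * parent, and the root has parent -1. is_leaf[i] is nonzero for leaves.
 */

/*
 * Count parent-to-child translation steps when every leaf is translated
 * independently from the root, one step per level on the way down.
 */
long
translations_multilevel(const int *parent, const int *is_leaf, int nnodes)
{
    long count = 0;

    assert(nnodes >= 0);
    for (int i = 0; i < nnodes; i++)
    {
        if (!is_leaf[i])
            continue;
        /* Walk up to the root; each edge is one translation step. */
        for (int n = i; parent[n] >= 0; n = parent[n])
            count++;
    }
    return count;
}

/*
 * Count steps when each non-root relation (mid-node or leaf) is translated
 * exactly once, starting from its parent's translated copy.
 */
long
translations_with_reuse(const int *parent, int nnodes)
{
    long count = 0;

    assert(nnodes >= 0);
    for (int i = 0; i < nnodes; i++)
    {
        if (parent[i] >= 0)
            count++;
    }
    return count;
}
```

For a Set2-like shape with N leaves under a single sub-partitioned table, the first function counts 2N steps and the second N + 1; for a flat Set1-like shape both count N. That is in line with the rough factor of two suspected above, though the sketch says nothing about how much memory each step actually allocates.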
David, Imai-san, Thanks for reviewing. I've included both replies in this email so that I can attach the latest patch as well. On 2018/11/10 20:59, David Rowley wrote: > I've started looking at these two, but only so far made it through > 0001 and 0002. > > Here's what I noted down during the review. > > 0001: > > 1. Why do we need the new field that this patch adds? I see in 0002 > it's used like: > > + if (childrel->inh_root_parent > 0 && > + childrel->inh_root_parent == root->parse->resultRelation) > > Would something like... > int relid; > if (childrel->part_schema == NULL && > bms_get_singleton_member(childrel->top_parent_relids, &relid) && > relid == root->parse->resultRelation) > > ...not do the trick? Actually, that's one way and earlier patches relied on that, but it gets a bit ugly given that it's not always the top_parent_relid we're looking for in the partitioning/inheritance specific code. Consider this: select * from (select a from parted_table p union all select a from inherited_table i) s where s.a = 1; top_parent_relids refers to the union all parent subquery 's RT index, which makes the partitioning/inheritance code scramble through a chain of parent relation relids to figure out the RT index of the table in the query. So, inh_root_parent is useful to distinguish the inheritance root parent from the Append relation root parent, without much code. That said, in the latest version, I've modified the UPDATE/DELETE planning patch such that it doesn't use inh_root_parent (or the new code doesn't need to refer to root parent where it previously did), so I've moved the patch that adds inh_root_parent to the 2nd in the series. > 0002: > > 2. What's wrong with childrel->relids? > > + child_relids = bms_make_singleton(appinfo->child_relid); I've modified the code to not have to use child_relids. > 3. Why not use childrel->top_parent_relids? > > + top_parent_relids = bms_make_singleton(childrel->inh_root_parent); Ditto. > 4. 
The following comment in inheritance_make_rel_from_joinlist() > implies that the function will not be called for SELECT, but the > comment above the function does not mention that. > > /* > * For UPDATE/DELETE queries, the top parent can only ever be a table. > * As a contrast, it could be a UNION ALL subquery in the case of SELECT. > */ > Assert(planner_rt_fetch(top_parent_relid, root)->rtekind == RTE_RELATION); Fixed the function's header comment to be clear. > 5. Should the following comment talk about "Sub-partitioned tables" > rather than "sub-partitions"? > > + * Sub-partitions have to be processed recursively, because > + * AppendRelInfos link sub-partitions to their immediate parents, not > + * the root partitioned table. Okay, done. > 6. Can't you just pass childrel->relids and > childrel->top_parent_relids instead of making new ones? > > + child_relids = bms_make_singleton(appinfo->child_relid); > + Assert(childrel->inh_root_parent > 0); > + top_parent_relids = bms_make_singleton(childrel->inh_root_parent); > + translated_joinlist = (List *) > + adjust_appendrel_attrs_multilevel(root, > + (Node *) joinlist, > + child_relids, > + top_parent_relids); The new code uses adjust_appendrel_attrs, so those Relids variables are not needed. > 7. I'm just wondering what your idea is here? > > + /* Reset join planning specific data structures. */ > + root->join_rel_list = NIL; > + root->join_rel_hash = NULL; > > Is there a reason to nullify these? You're not releasing any memory > and the new structures that will be built won't overlap with the ones > built last round. I don't mean to imply that the code is wrong, it > just does not appear to be particularly right. In initial versions of the patch, the same top-level PlannerInfo was used when calling make_rel_from_joinlist for all child tables. So, join_rel_hash built for a given child would be assumed to be valid for the next which isn't true, because joinrels would be different across children. 
In the latest patch, inheritance_make_rel_from_joinlist uses different PlannerInfos for different child target relations, so the above problem doesn't exist. IOW, you won't see the above two lines in the latest patch.

> 8. In regards to:
>
> + * NB: Do we need to change the child EC members to be marked
> + * as non-child somehow?
> + */
> + childrel->reloptkind = RELOPT_BASEREL;
>
> I know we talked a bit about this before, but this time I put together
> a crude patch that runs some code each time we skip an em_is_child ==
> true EquivalenceMember. The code checks if any of the em_relids are
> RELOPT_BASEREL. What I'm looking for here are places where we
> erroneously skip the member when we shouldn't. Running the regression
> tests with this patch in place shows a number of problems. Likely I
> should only trigger the warning when bms_membership(em->em_relids) ==
> BMS_SINGLETON, but it never-the-less appears to highlight various
> possible issues. Applying the same on master only appears to show the
> cases where em->em_relids isn't a singleton set. I've attached the
> patch to let you see what I mean.

Thanks for this. I've been thinking about what to do about it, but haven't decided what that is yet. Please let me spend some more time thinking on it. AFAICT, dealing with this will ensure that join planning against target child relations can use EC-based optimizations, but it's not incorrect as is per se.

On 2018/11/12 13:35, Imai, Yoshikazu wrote:
> adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1
> adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2

Ah, I see what you mean.

The root -> sub1 translation will be repeated for each leaf partition if done via adjust_appendrel_attrs_multilevel. On the other hand, we could do the root to sub1 translation once and pass the result to the recursive call that uses sub1 as the parent.

I've changed the patch to use adjust_appendrel_attrs.
> Since it is difficult to explain my thoughts with words, I will show the
> performance degration case.
>
> Partition tables are below two sets.

[ ... ]

> Create a generic plan of updation or deletion.
>
> [create a delete generic plan]
> set plan_cache_mode = 'force_generic_plan';
> prepare delete_stmt(int) as delete from rt where b = $1;
> execute delete_stmt(1);

[ ... ]

> How amount of memory is used with above tests is...
>
> without v5 patches, Set1: 242MB
> without v5 patches, Set2: 247MB
> with v5 patches, Set1: 420MB
> with v5 patches, Set2: 820MB

Although I didn't aim to fix planning for the generic plan case where no pruning occurs, the above result is not acceptable. That is, the new implementation of inheritance update/delete planning shouldn't consume more memory than the previous one. In fact, it should've consumed less, because the patch claims that it gets rid of redundant processing per partition.

I understood why update/delete planning consumed more memory with the patch. It was due to a problem with the patch that modifies inheritance update/delete planning. The exact problem was that the query tree would be translated (hence copied) *twice* for every partition! First during query planning, where the query tree would be translated to figure out a targetlist for partitions, and then again before calling grouping_planner. Also, adjust_appendrel_attrs_multilevel made it worse for the multi-level partitioning case, because of repeated copying for root to intermediate partitioned tables, as Imai-san pointed out.

I've fixed that by making sure that the query tree is translated only once and saved for later steps to use. Imai-san, please check the memory consumption with the latest patch.

Attached updated patches. Significant revisions are as follows (note that I reversed the order of 0001 and 0002):

v6-0001-Overhaul-inheritance-update-delete-planning.patch

The major change is fixing the problem above.
v6-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch

No change.

v6-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch

Made inheritance expansion a separate step of make_one_rel whereas previously it would be invoked at the beginning of set_append_rel_size. Now, it runs just before set_base_rel_sizes. The same step also recursively expands (and performs pruning for) any child partitioned tables that were added by the expansion of partitioned tables originally mentioned in the query. With this change, we don't need to worry about the range table changing as set_base_rel_size is executing, which could lead to problems.

v6-0004-Move-append-expansion-code-into-its-own-file.patch
v6-0005-Teach-planner-to-only-process-unpruned-partitions.patch
v6-0006-Do-not-lock-all-partitions-at-the-beginning.patch

No change.

Thanks,
Amit
Attachment
- v6-0001-Overhaul-inheritance-update-delete-planning.patch
- v6-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch
- v6-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v6-0004-Move-append-expansion-code-into-its-own-file.patch
- v6-0005-Teach-planner-to-only-process-unpruned-partitions.patch
- v6-0006-Do-not-lock-all-partitions-at-the-beginning.patch
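The difference between the two translation strategies discussed above (re-translating the whole root -> sub1 -> leaf chain for every leaf, versus translating each relation once from its immediate parent) can be sketched with a toy model in plain C. This is not PostgreSQL code: the parent[] array, the is_leaf[] flags, and both function names are hypothetical, and each "step" merely stands in for one single-level translation (one copy of the query tree).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy partition tree, NOT PostgreSQL structures: parent[i] is the index of
 * relation i's immediate parent, or -1 for the root.
 */

/* Number of single-level translations needed to walk from the root to rel. */
static int depth_of(const int *parent, int rel)
{
    int depth = 0;

    while (parent[rel] >= 0)
    {
        rel = parent[rel];
        depth++;
    }
    return depth;
}

/*
 * "Multilevel" style: every leaf re-translates the whole chain starting at
 * the root, so shared prefixes such as root -> sub1 are repeated per leaf.
 */
int steps_multilevel(const int *parent, const int *is_leaf, size_t n)
{
    int steps = 0;

    for (size_t i = 0; i < n; i++)
        if (is_leaf[i])
            steps += depth_of(parent, (int) i);
    return steps;
}

/*
 * "Reuse parent" style: each non-root relation is translated exactly once,
 * starting from the already-translated copy kept for its immediate parent.
 */
int steps_reuse_parent(const int *parent, size_t n)
{
    int steps = 0;

    for (size_t i = 0; i < n; i++)
        if (parent[i] >= 0)
            steps++;
    return steps;
}
```

For a tree with one sub-partitioned table and N leaves under it, the first function counts 2N steps and the second N + 1, which is in line with the "almost 2 times" observation for Set2, assuming the copies dominate memory use.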
On 2018/11/14 19:28, Amit Langote wrote:
> Attached updated patches. Significant revisions are as follows (note that
> I reversed the order of 0001 and 0002):
>
> v6-0001-Overhaul-inheritance-update-delete-planning.patch
>
> The major changes is fixing the problem above.
>
> v6-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch
>
> No change.
>
> v6-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
>
> Made inheritance expansion a separate step of make_one_rel whereas
> previously it would be invoked at the beginning of set_append_rel_size.
> Now, it runs just before set_base_rel_sizes. The same step also
> recursively expands (and performs pruning for) any child partitioned
> tables that were added by the expansion of partitioned tables originally
> mentioned in the query. With this change, we don't need to worry about
> the range table changing as set_base_rel_size is executing, which could
> lead to problems.
>
> v6-0004-Move-append-expansion-code-into-its-own-file.patch
> v6-0005-Teach-planner-to-only-process-unpruned-partitions.patch
> v6-0006-Do-not-lock-all-partitions-at-the-beginning.patch

This went out sooner than it should have. I hadn't waited for make check-world to finish, which showed that a file_fdw test exercising inheritance crashed with the 0001 patch due to a bogus Assert.

Fixed it in the attached version.

Thanks,
Amit
Attachment
- v7-0001-Overhaul-inheritance-update-delete-planning.patch
- v7-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch
- v7-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v7-0004-Move-append-expansion-code-into-its-own-file.patch
- v7-0005-Teach-planner-to-only-process-unpruned-partitions.patch
- v7-0006-Do-not-lock-all-partitions-at-the-beginning.patch
Hi Amit, On Tue, Nov 13, 2018 at 10:29 PM, Amit Langote wrote: > On 2018/11/12 13:35, Imai, Yoshikazu wrote: > > adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1 > > adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2 > > Ah, I see what you mean. > > The root -> sub1 translation will be repeated for each leaf partition > if done via adjust_appendrel_attrs_multilevel. On the other hand, if > we could do the root to sub1 translation once and pass it to the recursive > call using sub1 as the parent. > > I've changed the patch use adjust_appendrel_attrs. > > > Since it is difficult to explain my thoughts with words, I will show > > the performance degration case. > > > > Partition tables are below two sets. > > [ ... ] > > > Create a generic plan of updation or deletion. > > > > [create a delete generic plan] > > set plan_cache_mode = 'force_generic_plan'; prepare delete_stmt(int) > > as delete from rt where b = $1; execute delete_stmt(1); > > [ ... ] > > > How amount of memory is used with above tests is... > > > > without v5 patches, Set1: 242MB > > without v5 patches, Set2: 247MB > > with v5 patches, Set1: 420MB > > with v5 patches, Set2: 820MB > > Although I didn't aim to fix planning for the generic plan case where > no pruning occurs, the above result is not acceptable. That is, the new > implementation of inheritance update/delete planning shouldn't consume > more memory than the previous. In fact, it should've consumed less, > because the patch claims that it gets rid of redundant processing per > partition. > > I understood why update/delete planning consumed more memory with the > patch. It was due to a problem with the patch that modifies inheritance > update/delete planning. The exact problem was that the query tree would > be translated (hence copied) *twice* for every partition! 
First during
> query planning where the query tree would be translated to figure out
> a targetlist for partitions and then again before calling
> grouping_planner.
> Also, the adjust_appendrel_attrs_multilevel made it worse for
> multi-level partitioning case, because of repeated copying for root to
> intermediate partitioned tables, as Imai-san pointed out.
>
> I've fixed that making sure that query tree is translated only once and
> saved for later steps to use. Imai-san, please check the memory
> consumption with the latest patch.

Thanks for fixing!
Memory consumption is now lower than before:

with v7 patches, Set1: 223MB
with v7 patches, Set2: 226MB

Thanks,

--
Yoshikazu Imai
On 2018/11/15 10:19, Imai, Yoshikazu wrote:
> On Tue, Nov 13, 2018 at 10:29 PM, Amit Langote wrote:
>> On 2018/11/12 13:35, Imai, Yoshikazu wrote:
>>> How amount of memory is used with above tests is...
>>>
>>> without v5 patches, Set1: 242MB
>>> without v5 patches, Set2: 247MB
>>> with v5 patches, Set1: 420MB
>>> with v5 patches, Set2: 820MB
>>
>> I understood why update/delete planning consumed more memory with the
>> patch. It was due to a problem with the patch that modifies inheritance
>> update/delete planning. The exact problem was that the query tree would
>> be translated (hence copied) *twice* for every partition! First during
>> query planning where the query tree would be translated to figure out
>> a targetlist for partitions and then again before calling
>> grouping_planner.
>> Also, the adjust_appendrel_attrs_multilevel made it worse for
>> multi-level partitioning case, because of repeated copying for root to
>> intermediate partitioned tables, as Imai-san pointed out.
>>
>> I've fixed that making sure that query tree is translated only once and
>> saved for later steps to use. Imai-san, please check the memory
>> consumption with the latest patch.
>
> Thanks for fixing!
> Now, memory consumption is lower than the previous.
>
> with v7 patches, Set1: 223MB
> with v7 patches, Set2: 226MB

Thanks for checking. So at least we no longer have any memory over-allocation bug with the patch, but perhaps other bugs are still lurking. :)

Regards,
Amit
On 2018/11/14 19:28, Amit Langote wrote:
> On 2018/11/10 20:59, David Rowley wrote:
>> 8. In regards to:
>>
>> + * NB: Do we need to change the child EC members to be marked
>> + * as non-child somehow?
>> + */
>> + childrel->reloptkind = RELOPT_BASEREL;
>>
>> I know we talked a bit about this before, but this time I put together
>> a crude patch that runs some code each time we skip an em_is_child ==
>> true EquivalenceMember. The code checks if any of the em_relids are
>> RELOPT_BASEREL. What I'm looking for here are places where we
>> erroneously skip the member when we shouldn't. Running the regression
>> tests with this patch in place shows a number of problems. Likely I
>> should only trigger the warning when bms_membership(em->em_relids) ==
>> BMS_SINGLETON, but it never-the-less appears to highlight various
>> possible issues. Applying the same on master only appears to show the
>> cases where em->em_relids isn't a singleton set. I've attached the
>> patch to let you see what I mean.
>
> Thanks for this. I've been thinking about what to do about it, but
> haven't decided what's that yet. Please let me spend some more time
> thinking on it. AFAICT, dealing with this will ensure that join planning
> against target child relations can use EC-based optimizations, but it's
> not incorrect as is per se.

I've considered this a bit more and have some observations to share. I found that the new inheritance_planner causes a regression when the query involves equivalence classes referencing the target relation, such as in the following example:

create table ht (a int, b int) partition by hash (a);
create table ht1 partition of ht for values with (modulus 1024, remainder 0);
...
create table ht1024 partition of ht for values with (modulus 1024, remainder 1023);
create table foo (a int, b int);
update ht set a = foo.a from foo where ht.b = foo.b;

For the above query, an EC containing ht.b and foo.b would be built. With the new approach this EC needs to be expanded to add em_is_child EC members for all un-pruned child tables, whereas with the previous approach there would be no child members, because the EC would be built for the child as the query's target relation to begin with. So, with the old approach there will be {ht1.b, foo.b} for the query with ht1 as target relation, {ht2.b, foo.b} for the query with ht2 as target relation, and so on. With the new approach there is just one query_planner run and the resulting EC is {foo.b, ht.b, ht1.b, ht2.b, ...}. So, the planning steps that manipulate ECs now have to iterate through many members and become a bottleneck if there are many un-pruned children. To my surprise, those bottlenecks are worse than having to rerun query_planner for each child table.

So with master, I get the following planning times for the above update query, which btw doesn't prune (3 repeated runs):

Planning Time: 688.830 ms
Planning Time: 690.950 ms
Planning Time: 704.702 ms

And with the previous v7 patch:

Planning Time: 1373.398 ms
Planning Time: 1360.685 ms
Planning Time: 1356.313 ms

I've fixed that in the attached by utilizing the fact that we now build the child PlannerInfo before we add child EC members: add_child_rel_equivalences is modified such that, if the child is the target relation, it *replaces* the parent EC member with the corresponding child member instead of appending it to the list. That happens inside the child target's private PlannerInfo, so it's harmless. Also, the member is no longer marked as em_is_child=true, so as a whole this more or less restores the original behavior wrt ECs (also shown by the fact that I now get the same regression.diffs by applying David's verify_em_child.diff patch [1] to the patched tree as with applying it to master, modulo the varno differences due to the patches).

With the attached updated patch (again this is with 0 partitions pruned):

Planning Time: 332.503 ms
Planning Time: 334.003 ms
Planning Time: 334.212 ms

If I add an additional condition so that only 1 partition is joined with foo and the rest pruned, such as the following query (the case we're trying to optimize):

update ht set a = foo.a from foo where foo.a = ht.b and foo.a = 1

I get the following numbers with master (no change from the 0-pruned case):

Planning Time: 727.473 ms
Planning Time: 726.145 ms
Planning Time: 734.458 ms

But with the patches:

Planning Time: 0.797 ms
Planning Time: 0.751 ms
Planning Time: 0.801 ms

Attached v8 patches.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f8g9_BzE678BLBm-eoMMEYUUXhDABSpqtAHRUUTrm_vFA%40mail.gmail.com
Attachment
- v8-0001-Overhaul-inheritance-update-delete-planning.patch
- v8-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch
- v8-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v8-0004-Move-append-expansion-code-into-its-own-file.patch
- v8-0005-Teach-planner-to-only-process-unpruned-partitions.patch
- v8-0006-Do-not-lock-all-partitions-at-the-beginning.patch
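As a rough illustration of the "append versus replace" distinction described in the message above, here is a toy model in plain C. It is only a sketch under stated assumptions: ToyEC, ToyMember, MAX_MEMBERS, and both functions are hypothetical and do not mirror PostgreSQL's EquivalenceClass or add_child_rel_equivalences; relations are reduced to integer ids.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MEMBERS 64          /* arbitrary capacity for the toy model */

/* Hypothetical stand-ins for EC members; relations are just integer ids. */
typedef struct ToyMember
{
    int relid;
    int is_child;               /* 1 if added as a child member */
} ToyMember;

typedef struct ToyEC
{
    ToyMember members[MAX_MEMBERS];
    size_t    nmembers;
} ToyEC;

/*
 * Shared-EC style: the child member is appended next to the parent member,
 * so the member list grows by one for every un-pruned child.
 * Returns 0 on success, -1 if the toy array is full.
 */
int toy_append_child(ToyEC *ec, int child_relid)
{
    if (ec->nmembers >= MAX_MEMBERS)
        return -1;
    ec->members[ec->nmembers].relid = child_relid;
    ec->members[ec->nmembers].is_child = 1;
    ec->nmembers++;
    return 0;
}

/*
 * Private-copy style: within a per-child copy of the EC, the parent's member
 * is overwritten with the child's and left unmarked as a child, so the list
 * keeps its original length. Returns the number of members replaced.
 */
int toy_replace_parent(ToyEC *ec, int parent_relid, int child_relid)
{
    int replaced = 0;

    for (size_t i = 0; i < ec->nmembers; i++)
    {
        if (ec->members[i].relid == parent_relid)
        {
            ec->members[i].relid = child_relid;
            ec->members[i].is_child = 0;
            replaced++;
        }
    }
    return replaced;
}
```

Starting from a two-member class such as {foo.b, ht.b}, appending 1024 children would leave 1026 members for every later scan of the list, whereas each private copy after replacement still has two. That is the intuition only; the actual planning-time figures are the ones reported in the message.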
Hi Amit, On Tue, Nov 20, 2018 at 10:24 PM, Amit Langote wrote: > Attached v8 patches. Thanks for the patch. I took a look 0003, 0005, 0006 of v8 patch. 1. 0003: line 267-268 + * Child relation may have be marked dummy if build_append_child_rel + * found self-contradictory quals. /s/have be marked/have been marked/ 2. 0003: around line 1077 In append.c(or prepunion.c) 228 * Check that there's at least one descendant, else treat as no-child 229 * case. This could happen despite above has_subclass() check, if table 230 * once had a child but no longer does. has_subclass() check is moved to subquery_planner from above this code, so the comments need to be modified like below. s/above has_subclass() check/has_subclass() check in subquery_planner/ 3. 0003: line 1241-1244 0006: line ? In comments of expand_partitioned_rtentry: + * Note: even though only the unpruned partitions will be added to the + * resulting plan, this still locks *all* partitions via find_all_inheritors + * in order to avoid partitions being locked in a different order than other + * places in the backend that may lock partitions. This comments is no more needed if 0006 patch is applied because find_all_inheritors is removed in the 0006 patch. 4. 0003: line 1541-1544 + * Add the RelOptInfo. Even though we may not really scan this relation + * for reasons such as contradictory quals, we still need need to create + * one, because for every RTE in the query's range table, there must be an + * accompanying RelOptInfo. s/need need/need/ 5. 0003: line 1620-1621 + * After creating the RelOptInfo for the given child RT index, it goes on to + * initialize some of its fields base on the parent RelOptInfo. s/fields base on/fields based on/ 6. parsenodes.h 906 * inh is true for relation references that should be expanded to include 907 * inheritance children, if the rel has any. This *must* be false for 908 * RTEs other than RTE_RELATION entries. 
I think inh can become true now even if RTEKind equals RTE_SUBQUERY, so latter sentence need to be modified. 7. 0005: line 109-115 + /* + * If partition is excluded by constraints, remove it from + * live_parts, too. + */ + if (IS_DUMMY_REL(childrel)) + parentrel->live_parts = bms_del_member(parentrel->live_parts, i); + When I read this comment, I imagined that relation_excluded_by_constraints() would be called before this code. childrel is marked dummy if build_append_child_rel found self-contradictory quals, so comments can be modified more correctly like another comments in your patch as below. In 0003: line 267-271 + * Child relation may have be marked dummy if build_append_child_rel + * found self-contradictory quals. + */ + if (IS_DUMMY_REL(childrel)) + continue; 8. 0003: line 705-711 + * Query planning may have added some columns to the top-level tlist, + * which happens when there are row marks applied to inheritance + * parent relations (additional junk columns needed for applying row + * marks are added after expanding inheritance.) + */ + if (list_length(tlist) < list_length(root->processed_tlist)) + tlist = root->processed_tlist; In grouping_planner(): if (planned_rel == NULL) { ... root->processed_tlist = tlist; } else tlist = root->processed_tlist; ... if (current_rel == NULL) current_rel = query_planner(root, tlist, standard_qp_callback, &qp_extra); ... /* * Query planning may have added some columns to the top-level tlist, * which happens when there are row marks applied to inheritance * parent relations (additional junk columns needed for applying row * marks are added after expanding inheritance.) */ if (list_length(tlist) < list_length(root->processed_tlist)) tlist = root->processed_tlist; Are there any case tlist points to an address different from root->processed_tlist after calling query_planner? 
Junk columns are possibly added to root->processed_tlist after expanding inheritance, but that adding process don't change the root->processed_tlist's pointer address. I checked "Assert(tlist == root->processed_tlist)" after calling query_planner passes "make check". 9. 0003: line 1722-1763 In build_append_child_rel(): + /* + * In addition to the quals inherited from the parent, we might + * have securityQuals associated with this particular child node. + * (Currently this can only happen in appendrels originating from + * UNION ALL; inheritance child tables don't have their own + * securityQuals.) Pull any such securityQuals up into the ... + foreach(lc, childRTE->securityQuals) + { ... + } + Assert(security_level <= root->qual_security_level); + } This foreach loop loops only once in the current regression tests. I checked "Assert(childRTE->securityQuals->length == 1)" passes "make check". I think there are no need to change codes, I state this fact only for sharing. -- Yoshikazu Imai
Imai-san, Thanks for the review again. On 2018/12/05 11:29, Imai, Yoshikazu wrote: > On Tue, Nov 20, 2018 at 10:24 PM, Amit Langote wrote: >> Attached v8 patches. > > Thanks for the patch. I took a look 0003, 0005, 0006 of v8 patch. > > 1. > 0003: line 267-268 > + * Child relation may have be marked dummy if build_append_child_rel > + * found self-contradictory quals. > > /s/have be marked/have been marked/ > > 2. > 0003: around line 1077 > In append.c(or prepunion.c) > 228 * Check that there's at least one descendant, else treat as no-child > 229 * case. This could happen despite above has_subclass() check, if table > 230 * once had a child but no longer does. > > has_subclass() check is moved to subquery_planner from above this code, > so the comments need to be modified like below. > > s/above has_subclass() check/has_subclass() check in subquery_planner/ > > 3. > 0003: line 1241-1244 > 0006: line ? > > In comments of expand_partitioned_rtentry: > + * Note: even though only the unpruned partitions will be added to the > + * resulting plan, this still locks *all* partitions via find_all_inheritors > + * in order to avoid partitions being locked in a different order than other > + * places in the backend that may lock partitions. > > This comments is no more needed if 0006 patch is applied because > find_all_inheritors is removed in the 0006 patch. > > 4. > 0003: line 1541-1544 > > + * Add the RelOptInfo. Even though we may not really scan this relation > + * for reasons such as contradictory quals, we still need need to create > + * one, because for every RTE in the query's range table, there must be an > + * accompanying RelOptInfo. > > s/need need/need/ > > 5. > 0003: line 1620-1621 > > + * After creating the RelOptInfo for the given child RT index, it goes on to > + * initialize some of its fields base on the parent RelOptInfo. > > s/fields base on/fields based on/ Fixed all of 1-5. > 6. 
> parsenodes.h > 906 * inh is true for relation references that should be expanded to include > 907 * inheritance children, if the rel has any. This *must* be false for > 908 * RTEs other than RTE_RELATION entries. > > I think inh can become true now even if RTEKind equals RTE_SUBQUERY, so latter > sentence need to be modified. Seems like an existing comment bug. Why don't you send a patch as you discovered it? :) > 7. > 0005: line 109-115 > + /* > + * If partition is excluded by constraints, remove it from > + * live_parts, too. > + */ > + if (IS_DUMMY_REL(childrel)) > + parentrel->live_parts = bms_del_member(parentrel->live_parts, i); > + > > When I read this comment, I imagined that relation_excluded_by_constraints() > would be called before this code. childrel is marked dummy if > build_append_child_rel found self-contradictory quals, so comments can be > modified more correctly like another comments in your patch as below. I realized that bms_del_member statement is unnecessary. I've revised the comment describing live_parts to say that it contains indexes into part_rels array of the non-NULL RelOptInfos contained in it, that is, RelOptInfos of un-pruned partitions (rest of the entries are NULL.) Un-pruned partitions may become dummy due to contradictory constraints or constraint exclusion using normal CHECK constraints later and whether it's dummy is checked properly by functions that iterate over live_parts. > 8. > 0003: line 705-711 > + * Query planning may have added some columns to the top-level tlist, > + * which happens when there are row marks applied to inheritance > + * parent relations (additional junk columns needed for applying row > + * marks are added after expanding inheritance.) > + */ > + if (list_length(tlist) < list_length(root->processed_tlist)) > + tlist = root->processed_tlist; > > In grouping_planner(): > > if (planned_rel == NULL) > { > ... > root->processed_tlist = tlist; > } > else > tlist = root->processed_tlist; > ... 
> if (current_rel == NULL) > current_rel = query_planner(root, tlist, > standard_qp_callback, &qp_extra); > ... > /* > * Query planning may have added some columns to the top-level tlist, > * which happens when there are row marks applied to inheritance > * parent relations (additional junk columns needed for applying row > * marks are added after expanding inheritance.) > */ > if (list_length(tlist) < list_length(root->processed_tlist)) > tlist = root->processed_tlist; > > > Are there any case tlist points to an address different from > root->processed_tlist after calling query_planner? Junk columns are possibly > added to root->processed_tlist after expanding inheritance, but that adding > process don't change the root->processed_tlist's pointer address. > I checked "Assert(tlist == root->processed_tlist)" after calling query_planner > passes "make check". You're right. I think that may not have been true in some version of the patch that I didn't share on the list, but it is with the latest patch. I've removed that block of code and adjusted nearby comments to mention that targetlist may change during query_planner. > 9. > 0003: line 1722-1763 > In build_append_child_rel(): > > + /* > + * In addition to the quals inherited from the parent, we might > + * have securityQuals associated with this particular child node. > + * (Currently this can only happen in appendrels originating from > + * UNION ALL; inheritance child tables don't have their own > + * securityQuals.) Pull any such securityQuals up into the > ... > + foreach(lc, childRTE->securityQuals) > + { > ... > + } > + Assert(security_level <= root->qual_security_level); > + } > > This foreach loop loops only once in the current regression tests. I checked > "Assert(childRTE->securityQuals->length == 1)" passes "make check". > I think there are no need to change codes, I state this fact only for sharing. Thanks for the information. 
The patch doesn't change that code itself; it is just moved from one place to another.

Attached updated patches. I have a few other changes in mind to make to 0001 such that the range table in each child's version of Query contains only that child table in place of the original target relation, instead of *all* child tables, which is the current behavior. The current behavior makes range_table_mutator a big bottleneck when the number of un-pruned target children is large. But I'm saving it for next week so that I can prepare for PGConf.ASIA, which starts on Monday next week. See you there. :)

Thanks,
Amit
Attachment
- v9-0001-Overhaul-inheritance-update-delete-planning.patch
- v9-0002-Store-inheritance-root-parent-index-in-otherrel-s.patch
- v9-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v9-0004-Move-append-expansion-code-into-its-own-file.patch
- v9-0005-Teach-planner-to-only-process-unpruned-partitions.patch
- v9-0006-Do-not-lock-all-partitions-at-the-beginning.patch
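The live_parts convention described in the message above (set bits index the non-NULL entries of part_rels, and callers still check for dummy rels while iterating) might look roughly like the following toy sketch in plain C. It makes several simplifying assumptions: a 64-bit mask stands in for PostgreSQL's Bitmapset, and ToyRel and toy_count_scannable are hypothetical names that do not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a partition's RelOptInfo. */
typedef struct ToyRel
{
    int is_dummy;               /* proven empty after pruning, e.g. by quals */
} ToyRel;

/*
 * Count partitions that would still be scanned. Bit i of live_parts is set
 * only when part_rels[i] is non-NULL (the partition survived pruning); a
 * surviving partition may nevertheless have been marked dummy later, so the
 * loop checks for that itself rather than expecting the bit to be cleared.
 * Only the first 64 partitions are representable in this toy mask.
 */
int toy_count_scannable(ToyRel *const *part_rels, size_t nparts,
                        uint64_t live_parts)
{
    int count = 0;

    for (size_t i = 0; i < nparts && i < 64; i++)
    {
        if ((live_parts & (UINT64_C(1) << i)) == 0)
            continue;           /* pruned: no entry was built */
        assert(part_rels[i] != NULL);
        if (part_rels[i]->is_dummy)
            continue;           /* un-pruned, but nothing to scan */
        count++;
    }
    return count;
}
```

With four partitions where index 1 is pruned and index 2 is dummy, a mask of 0xD (bits 0, 2, 3) yields a count of 2; the dummy partition is skipped without removing its bit from the mask.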
Hi, Amit

Thanks for the quick modification.

On Wed, Dec 5, 2018 at 8:26 PM, Amit Langote wrote:
> > 1.
...
> > 5.
> > 0003: line 1620-1621
> >
> > + * After creating the RelOptInfo for the given child RT index, it goes on to
> > + * initialize some of its fields base on the parent RelOptInfo.
> >
> > s/fields base on/fields based on/
>
> Fixed all of 1-5.

Thanks for fixing.

> > 6.
> > parsenodes.h
> > 906 * inh is true for relation references that should be expanded to include
> > 907 * inheritance children, if the rel has any. This *must* be false for
> > 908 * RTEs other than RTE_RELATION entries.
> >
> > I think inh can become true now even if RTEKind equals RTE_SUBQUERY, so latter
> > sentence need to be modified.
> >
> > Seems like an existing comment bug. Why don't you send a patch as you
> discovered it? :)

Thanks, I am pleased with your proposal. I'll post it as a small fix of the comment.

> > 7.
> > 0005: line 109-115
...
> Un-pruned partitions may become dummy due to contradictory constraints or
> constraint exclusion using normal CHECK constraints later and whether it's
> dummy is checked properly by functions that iterate over live_parts.

Ah, I understand: partitions are eliminated either by contradictory constraints or by constraint exclusion, both of which use constraints.

> Attached updated patches. I have a few other changes in mind to make to
> 0001 such that the range table in each child's version of Query contains
> only that child table in place of the original target relation, instead of
> *all* child tables which is the current behavior. The current behavior
> makes range_table_mutator a big bottleneck when the number of un-pruned
> target children is large. But I'm saving it for the next week so that I

OK. I will continue the review of 0001, before or after you supply the next patch, keeping those points in mind.

> can prepare for the PGConf.ASIA that's starting on Monday next week. See
> you there. :)

Yeah, see you there. :)

--
Yoshikazu Imai
Hi, Amit, On Fri, Dec 7, 2018 at 0:57 AM, Imai, Yoshikazu wrote: > OK. I will continue the review of 0001 before/after your supplying of > next patch with keeping those in mind. Here's the continuation of the review. Almost all of below comments are little fixes. --- 0001: line 76-77 In commit message: exclusion for target child relation, which is no longer is no longer needed. Constraint exclusion runs during query_planner s/which is no longer is no longer needed/which is no longer needed/ --- 0001: line 464 + if (IS_DUMMY_REL(find_base_rel(root, resultRelation ))) s/resultRelation )))/resultRelation)))/ (There is an extra space.) --- 0001: line 395-398 + * Reset inh_target_child_roots to not be same as parent root's so that + * the subroots for this child's own children (if any) don't end up in + * root parent's list. We'll eventually merge all entries into one list, + * but that's now now. s/that's now now/that's not now/ --- 0001: line 794 + * are put into a list that will be controlled by a single ModifyTable s/are put into a list/are put into a list/ (There are two spaces between "into" and "a".) --- 0001: line 241-242, 253-254, 291-294 (In set_append_rel_size()) + if (appinfo->parent_relid == root->parse->resultRelation) + subroot = adjust_inherit_target_child(root, childrel, appinfo); + add_child_rel_equivalences(subroot, appinfo, rel, childrel, + root != subroot); + if (subroot != root) + { + root->inh_target_child_roots = + lappend(root->inh_target_child_roots, subroot); A boolean value of "appinfo->parent_relid == root->parse->resultRelation" is same with "subroot != root"(because of line 241-242), so we can use bool variable here like child_is_target = (appinfo->parent_relid == root->parse->resultRelation). The name of child_is_target is brought from arguments of add_child_rel_equivalences() in your patch. I attached a little diff as "v9-0001-delta.diff". 
--- 0001: line 313-431 adjust_inherit_target_child() is in allpaths.c in your patch, but it contains code like the lines below, which on master (without the patch applied) lives in planner.c or planmain.c, so could it be moved to planner.c (or planmain.c)? This point is less important; I was just wondering whether adjust_inherit_target_child() should be in allpaths.c, planner.c or planmain.c. + /* Translate the original query's expressions to this child. */ + subroot = makeNode(PlannerInfo); + memcpy(subroot, root, sizeof(PlannerInfo)); + root->parse->targetList = root->unexpanded_tlist; + subroot->parse = (Query *) adjust_appendrel_attrs(root, + (Node *) root->parse, + 1, &appinfo); + tlist = preprocess_targetlist(subroot); + subroot->processed_tlist = tlist; + build_base_rel_tlists(subroot, tlist); --- 0001: line 57-70 In commit message: This removes some existing code in inheritance_planner that dealt with any subquery RTEs in the query. The rationale of that code was that the subquery RTEs may change during each iteration of planning (that is, for different children), so different iterations better use different copies of those RTEs. ... Since with the new code we perform planning just once, I think we don't need this special handling. 0001: line 772-782 - * controlled by a single ModifyTable node. All the instances share the - * same rangetable, but each instance must have its own set of subquery - * RTEs within the finished rangetable because (1) they are likely to get - * scribbled on during planning, and (2) it's not inconceivable that - * subqueries could get planned differently in different cases. We need - * not create duplicate copies of other RTE kinds, in particular not the - * target relations, because they don't have either of those issues. 
Not - * having to duplicate the target relations is important because doing so - * (1) would result in a rangetable of length O(N^2) for N targets, with - * at least O(N^3) work expended here; and (2) would greatly complicate - * management of the rowMarks list. I spent considerable time trying to confirm that the new code does not need to copy subquery RTEs, but so far I couldn't. If the copied RTEs are only used during planning, the new code might not need them, because we perform planning just once. I tried to find where the copied RTEs are used or overwritten, but I gave up. Of course, there are comments about this, - * same rangetable, but each instance must have its own set of subquery - * RTEs within the finished rangetable because (1) they are likely to get - * scribbled on during planning, and (2) it's not inconceivable that so the copied RTEs might be used during planning, but I couldn't find the actual code. I also checked the commits [1, 2] related to this code. I'll keep looking, but it would be better to have someone else review this as well, and I'd appreciate help here. --- I think I have now gone through all of the v9 patch except the EquivalenceClass code (0001: line 567-638). I'll consider whether there are any cases of performance degradation, but I have no ideas for now. Do you have any points you are concerned about? If so, I'll check them. [1] https://github.com/postgres/postgres/commit/b3aaf9081a1a95c245fd605dcf02c91b3a5c3a29 [2] https://github.com/postgres/postgres/commit/c03ad5602f529787968fa3201b35c119bbc6d782 -- Yoshikazu Imai
Attachment
Thank you Imai-san. On 2018/12/25 16:47, Imai, Yoshikazu wrote: > Here's the continuation of the review. Almost all of below comments are > little fixes. > > --- > 0001: line 76-77 > In commit message: > exclusion for target child relation, which is no longer > is no longer needed. Constraint exclusion runs during query_planner > > s/which is no longer is no longer needed/which is no longer needed/ > > --- > 0001: line 464 > + if (IS_DUMMY_REL(find_base_rel(root, resultRelation ))) > > s/resultRelation )))/resultRelation)))/ > (There is an extra space.) > > --- > 0001: line 395-398 > + * Reset inh_target_child_roots to not be same as parent root's so that > + * the subroots for this child's own children (if any) don't end up in > + * root parent's list. We'll eventually merge all entries into one list, > + * but that's now now. > > s/that's now now/that's not now/ > > --- > 0001: line 794 > + * are put into a list that will be controlled by a single ModifyTable > > s/are put into a list/are put into a list/ > (There are two spaces between "into" and "a".) All fixed in my local repository. > --- > 0001: line 241-242, 253-254, 291-294 (In set_append_rel_size()) > > + if (appinfo->parent_relid == root->parse->resultRelation) > + subroot = adjust_inherit_target_child(root, childrel, appinfo); > > + add_child_rel_equivalences(subroot, appinfo, rel, childrel, > + root != subroot); > > + if (subroot != root) > + { > + root->inh_target_child_roots = > + lappend(root->inh_target_child_roots, subroot); > > A boolean value of "appinfo->parent_relid == root->parse->resultRelation" is > same with "subroot != root"(because of line 241-242), so we can use bool > variable here like > child_is_target = (appinfo->parent_relid == root->parse->resultRelation). > The name of child_is_target is brought from arguments of > add_child_rel_equivalences() in your patch. > > I attached a little diff as "v9-0001-delta.diff". Sounds like a good idea for clarifying the code, so done. 
> --- > 0001: line 313-431 > > adjust_inherit_target_child() is in allpaths.c in your patch, but it has the > code like below ones which are in master's(not patch applied) planner.c or > planmain.c, so could it be possible in planner.c(or planmain.c)? > This point is less important, but I was just wondering whether > adjust_inherit_target_child() should be in allpaths.c, planner.c or planmain.c. > > + /* Translate the original query's expressions to this child. */ > + subroot = makeNode(PlannerInfo); > + memcpy(subroot, root, sizeof(PlannerInfo)); > > + root->parse->targetList = root->unexpanded_tlist; > + subroot->parse = (Query *) adjust_appendrel_attrs(root, > + (Node *) root->parse, > + 1, &appinfo); > > + tlist = preprocess_targetlist(subroot); > + subroot->processed_tlist = tlist; > + build_base_rel_tlists(subroot, tlist); I'm thinking of changing where adjust_inherit_target_child gets called from. In the current patch, it's in the middle of set_rel_size, which I'm starting to think is not a good place for it. Maybe I'll place the call near where inheritance is expanded. > --- > 0001: line 57-70 > > In commit message: > This removes some existing code in inheritance_planner that dealt > with any subquery RTEs in the query. The rationale of that code > was that the subquery RTEs may change during each iteration of > planning (that is, for different children), so different iterations > better use different copies of those RTEs. > ... > Since with the new code > we perform planning just once, I think we don't need this special > handling. > > 0001: line 772-782 > - * controlled by a single ModifyTable node. All the instances share the > - * same rangetable, but each instance must have its own set of subquery > - * RTEs within the finished rangetable because (1) they are likely to get > - * scribbled on during planning, and (2) it's not inconceivable that > - * subqueries could get planned differently in different cases. 
We need > - * not create duplicate copies of other RTE kinds, in particular not the > - * target relations, because they don't have either of those issues. Not > - * having to duplicate the target relations is important because doing so > - * (1) would result in a rangetable of length O(N^2) for N targets, with > - * at least O(N^3) work expended here; and (2) would greatly complicate > - * management of the rowMarks list. > > I used considerable time to confirm there are no needs copying subquery RTEs in > the new code, but so far I couldn't. If copied RTEs are only used when planning, > it might not need to copy RTEs in the new code because we perform planning just > once, so I tried to detect when copied RTEs are used or overwritten, but I gave > up. > > Of course, there are comments about this, > > - * same rangetable, but each instance must have its own set of subquery > - * RTEs within the finished rangetable because (1) they are likely to get > - * scribbled on during planning, and (2) it's not inconceivable that > > so copied RTEs might be used when planning, but I couldn't find the actual codes. > I also checked commits[1, 2] related to these codes. I'll check these for more > time but it would be better there are other's review and I also want a help here. > > [1] https://github.com/postgres/postgres/commit/b3aaf9081a1a95c245fd605dcf02c91b3a5c3a29 > [2] https://github.com/postgres/postgres/commit/c03ad5602f529787968fa3201b35c119bbc6d782 Thank you very much for spending time on that. I'd really like someone else to consider this as well. Here's the summary of what I'm proposing to change with respect to the above: because we perform scan-level planning only once for any given relation including for sub-queries with the patch applied, we no longer need to make copies of sub-query RTEs that are currently needed due to repeated planning, with one copy per child target relation. 
Since there are no new copies, there is no need to process various nodes to change the RT index being used to refer to the sub-query. I have removed the code that does the copying of subquery RTEs and the code that does the translation needed because a new RT index was assigned every time a new copy was made. > --- > Maybe I checked all the way of the v9 patch excluding the codes about > EquivalenceClass codes(0001: line 567-638). > I'll consider whether there are any performance degration case, but I have > no idea for now. Do you have any points you concerns? If there, I'll check it. I will send an updated patch hopefully before my new year vacation that starts on Friday this week. Thanks, Amit
Hi, On 2018/12/26 13:28, Amit Langote wrote: > On 2018/12/25 16:47, Imai, Yoshikazu wrote: >> adjust_inherit_target_child() is in allpaths.c in your patch, but it has the >> code like below ones which are in master's(not patch applied) planner.c or >> planmain.c, so could it be possible in planner.c(or planmain.c)? >> This point is less important, but I was just wondering whether >> adjust_inherit_target_child() should be in allpaths.c, planner.c or planmain.c. >> >> + /* Translate the original query's expressions to this child. */ >> + subroot = makeNode(PlannerInfo); >> + memcpy(subroot, root, sizeof(PlannerInfo)); >> >> + root->parse->targetList = root->unexpanded_tlist; >> + subroot->parse = (Query *) adjust_appendrel_attrs(root, >> + (Node *) root->parse, >> + 1, &appinfo); >> >> + tlist = preprocess_targetlist(subroot); >> + subroot->processed_tlist = tlist; >> + build_base_rel_tlists(subroot, tlist); > > I'm thinking of changing where adjust_inherit_target_child gets called > from. In the current patch, it's in the middle of set_rel_size which I'm > starting to think is a not a good place for it. Maybe, I'll place the > call call near to where inheritance is expanded. So what I did here is that I added a new step to query_planner() itself that adds the PlannerInfos for inheritance child target relations and hence moved the function adjust_inherit_target_child() to planmain.c and renamed it to add_inherit_target_child_root(). I've removed its RelOptInfo argument and decided to modify the child RelOptInfo's targetlist in allpaths.c. So, add_inherit_target_child_root() only builds the child PlannerInfo now. Child PlannerInfo's are now stored in an array of simple_rel_array_size elements, instead of a list. It's filled with NULLs except for the slots corresponding to child target relations. In allpaths.c, I've added set_inherit_target_rel_sizes and set_inherit_target_rel_pathlists, which look like set_append_rel_size and set_append_rel_pathlist, respectively. 
The reason to add separate functions for the target relation case is that we don't want to combine the outputs of children to form an "append relation" in that case. So, this patch no longer modifies set_append_rel_size for handling the case where the parent relation is the query's target relation. > I will send an updated patch hopefully before my new year vacation that > starts on Friday this week. Here's the v10 of the patch. I didn't get a chance to make more changes than those described above or to address Imai-san's review comments yesterday. Have a wonderful new year! I'll be back on January 7 to reply on this thread. Thanks, Amit
Attachment
- v10-0001-Overhaul-inheritance-update-delete-planning.patch
- v10-0002-Store-inheritance-root-parent-index-in-otherrel-.patch
- v10-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v10-0004-Move-append-expansion-code-into-its-own-file.patch
- v10-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v10-0006-Do-not-lock-all-partitions-at-the-beginning.patch
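The v10 message above describes storing child PlannerInfos in an array of simple_rel_array_size elements that is NULL everywhere except at the slots of child target relations. A minimal sketch of that layout follows, assuming the array is indexed by range table index; the Mock* types and the three helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a child PlannerInfo. */
typedef struct MockPlannerInfo
{
    int result_relation;    /* RT index this child root targets */
} MockPlannerInfo;

/* Hypothetical stand-in for the parent PlannerInfo. */
typedef struct MockParentRoot
{
    int simple_rel_array_size;
    MockPlannerInfo **inh_target_child_roots;   /* indexed by RT index */
} MockParentRoot;

/* Allocate the sparse array with every slot set to NULL. Returns 0 on success. */
static int
init_child_roots(MockParentRoot *root, int size)
{
    int i;

    assert(size > 0);
    root->inh_target_child_roots = malloc((size_t) size * sizeof(MockPlannerInfo *));
    if (root->inh_target_child_roots == NULL)
        return -1;
    for (i = 0; i < size; i++)
        root->inh_target_child_roots[i] = NULL;
    root->simple_rel_array_size = size;
    return 0;
}

/* Create a child root and store it at the slot of its RT index. */
static MockPlannerInfo *
add_child_root(MockParentRoot *root, int child_relid)
{
    MockPlannerInfo *subroot;

    assert(child_relid > 0 && child_relid < root->simple_rel_array_size);
    subroot = malloc(sizeof(*subroot));
    if (subroot == NULL)
        return NULL;
    subroot->result_relation = child_relid;
    root->inh_target_child_roots[child_relid] = subroot;
    return subroot;
}

/* Constant-time lookup; NULL means "not a child target relation". */
static MockPlannerInfo *
lookup_child_root(const MockParentRoot *root, int relid)
{
    if (relid <= 0 || relid >= root->simple_rel_array_size)
        return NULL;
    return root->inh_target_child_roots[relid];
}
```

Compared with a list, the array trades some memory for direct lookup by RT index, at the cost of needing a comment that states its size and what indexes it.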
Hi, I don't think I can help with code review, but I loaded our largest customer's schema into pg12dev and tested with this patch. It's working well - thanks for your work. Details follow. We have too many tables+indices so this vastly improves planning speed. Our large customers have ~300 parent tables and total ~20k child tables with total ~80k indices. Our largest table hierarchies have ~1200 child tables, which may have as many as 6-8 indices each. And 5-10 of the largest hierarchies are unioned together in a view. Running pg11.1, explaining a query on the largest view with a condition that eliminates all but today's tables can take several minutes with a cold cache, due to not only stat()ing every file in every table in a partition hierarchy, before pruning, but also actually open()ing all their indices. Running 11dev with your v10 patch applied, this takes 2244ms with empty buffer cache after postmaster restarted on a totally untuned instance (and a new backend, with no cached opened files). I was curious why it took even 2sec, and why it did so many opens() (but not 20k of them that PG11 does): [pryzbyj@database postgresql]$ cut -d'(' -f1 /tmp/strace-12dev-explain |sort |uniq -c |sort -nr 2544 lseek 1263 open ... It turns out 1050 open()s are due to historic data which is no longer being loaded and therefore never converted to relkind=p (but hasn't exceeded the retention interval so not yet DROPped, either). Cheers, Justin
On Fri, 4 Jan 2019 at 04:39, Justin Pryzby <pryzby@telsasoft.com> wrote: > Running 11dev with your v10 patch applied, this takes 2244ms with empty buffer > cache after postmaster restarted on a totally untuned instance (and a new > backend, with no cached opened files). > > I was curious why it took even 2sec, and why it did so many opens() (but not > 20k of them that PG11 does): It would be pretty hard to know that without seeing the query plan. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Thanks Justin for reporting the results of your testing. On 2019/01/07 17:40, David Rowley wrote: > On Fri, 4 Jan 2019 at 04:39, Justin Pryzby <pryzby@telsasoft.com> wrote: >> Running 11dev with your v10 patch applied, this takes 2244ms with empty buffer >> cache after postmaster restarted on a totally untuned instance (and a new >> backend, with no cached opened files). >> >> I was curious why it took even 2sec, and why it did so many opens() (but not >> 20k of them that PG11 does): > > It would be pretty hard to know that without seeing the query plan. Yeah, I too would be curious to see if the resulting plan really needs to do those open()s. If all the open()s being seen here are accounted for by un-pruned partitions (across the UNION ALL) contained in the plan and their indexes, then the patch has done enough to help. If the open()s can be traced to pruned partitions, then there's something wrong with the patch. Thanks, Amit
On Mon, Jan 07, 2019 at 09:40:50PM +1300, David Rowley wrote: > On Fri, 4 Jan 2019 at 04:39, Justin Pryzby <pryzby@telsasoft.com> wrote: > > Running 11dev with your v10 patch applied, this takes 2244ms with empty buffer > > cache after postmaster restarted on a totally untuned instance (and a new > > backend, with no cached opened files). > > > > I was curious why it took even 2sec, and why it did so many opens() (but not > > 20k of them that PG11 does): > > It would be pretty hard to know that without seeing the query plan. The issue was this: > > It turns out 1050 open()s are due to historic data which is no longer being > > loaded and therefor never converted to relkind=p (but hasn't exceeded the > > retention interval so not yet DROPped, either). So there's no evidence of any issue with the patch. Justin
On 2019/01/07 23:13, Justin Pryzby wrote: > The issue was this: >>> It turns out 1050 open()s are due to historic data which is no longer being >>> loaded and therefor never converted to relkind=p (but hasn't exceeded the >>> retention interval so not yet DROPped, either). > > So there's no evidence of any issue with the patch. Ah, so by this you meant that the historic data is still an old-style (9.6-style) inheritance hierarchy, which gets folded under the UNION ALL whose other children are new-style partitioned tables. Thanks, Amit
On 2018/12/27 20:25, Amit Langote wrote: > Here's the v10 of the patch. I didn't get chance to do more changes than > those described above and address Imai-san's review comments yesterday. > > Have a wonderful new year! I'll be back on January 7 to reply on this thread. Rebased due to changed copyright year in prepunion.c. Also realized that the new files added by patch 0004 still had 2018 in them. Thanks, Amit
Attachment
- v11-0001-Overhaul-inheritance-update-delete-planning.patch
- v11-0002-Store-inheritance-root-parent-index-in-otherrel-.patch
- v11-0003-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v11-0004-Move-inheritance-expansion-code-into-its-own-fil.patch
- v11-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v11-0006-Do-not-lock-all-partitions-at-the-beginning.patch
On Tue, 8 Jan 2019 at 19:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Rebased due to changed copyright year in prepunion.c. Also realized that > the new files added by patch 0004 still had 2018 in them. I've made a pass over 0001. There's probably enough for you to look at while I look at 0002 and the others. 0001 1. In your doc changes, just below a paragraph that you removed, there's a paragraph starting "Both of these behaviors are likely to be changed in a future release". This needs to be fixed since you've removed the first of the two reasons. 2. This part should be left alone. - technique similar to partition pruning. While it is primarily used - for partitioning implemented using the legacy inheritance method, it can be - used for other purposes, including with declarative partitioning. + technique similar to partition pruning. It is primarily used + for partitioning implemented using the legacy inheritance method. Looking at set_inherit_target_rel_sizes(), constraint exclusion still is applied to partitions, it's just applied after pruning, according to: if (did_pruning && !bms_is_member(appinfo->child_relid, live_children)) { /* This partition was pruned; skip it. */ set_dummy_rel_pathlist(childrel); continue; } if (relation_excluded_by_constraints(root, childrel, childRTE)) 3. add_child_rel_equivalences(). You're replacing parent EMs with their child equivalent, but only when the eclass has no volatile functions. Is this really what you want? I think this would misbehave if we ever allowed: UPDATE ... SET .. ORDER BY, of which there's a legitimate use case of wanting to reduce the chances of deadlocks caused by non-deterministic UPDATE order. Or if you think skipping the volatile eclasses is fine today, then I think the comment you've added to add_child_rel_equivalences should mention that. 4. Do you think it's okay that add_child_rel_equivalences() does not update the ec_relids when removing the member? 
UPDATE: I see you're likely leaving this alone since you're only doing a shallow copy of the eclasses in adjust_inherited_target_child_root(). It seems like a pretty bad idea to do a shallow copy there. 5. What's CE? + /* CE failed, so finish copying/modifying join quals. */ 6. Typo: + * ass dummy. We must do this in this phase so that the rel's ass -> as 7. There's no accumulation going on here: + /* + * Accumulate size information from each live child. + */ + Assert(childrel->rows > 0); 8. Any point in this? We're about to loop again anyway... + /* + * If child is dummy, ignore it. + */ + if (IS_DUMMY_REL(childrel)) + continue; + } 9. It's a bit confusing to mention SELECT in this comment. The Assert ensures it's an UPDATE or DELETE. + /* + * For UPDATE/DELETE queries, the top parent can only ever be a table. + * As a contrast, it could be a UNION ALL subquery in the case of SELECT. + */ + Assert(root->parse->commandType == CMD_UPDATE || + root->parse->commandType == CMD_DELETE); 10. I'd say the subroot assignment can go after the IS_DUMMY_REL check. Keeping that loop as tight as possible for pruned rels seems like a good idea. + subroot = root->inh_target_child_roots[appinfo->child_relid]; + Assert(subroot->parse->resultRelation > 0); + childrel = find_base_rel(root, appinfo->child_relid); + + /* Ignore excluded/pruned children. */ + if (IS_DUMMY_REL(childrel)) + continue; 11. I don't think you should reuse the childrel variable here: + childrel->reloptkind = RELOPT_BASEREL; + + Assert(subroot->join_rel_list == NIL); + Assert(subroot->join_rel_hash == NULL); + + /* Perform join planning and save the resulting RelOptInfo. */ + childrel = make_rel_from_joinlist(subroot, translated_joinlist); + + /* + * Remember this child target rel. inheritance_planner will perform + * the remaining steps of planning for each child relation separately. 
+ * Specifically, it will call grouping_planner on every + * RelOptInfo contained in the inh_target_child_rels list, each of + * which represents the source of tuples to be modified for a given + * target child rel. + */ + root->inh_target_child_joinrels = + lappend(root->inh_target_child_joinrels, childrel); 12. The following comment makes less sense now that you've modified the previous paragraph: + * Fortunately, the UPDATE/DELETE target can never be the nullable side of an + * outer join, so it's OK to generate the plan this way. This text used to refer to: but target inheritance has to be expanded at * the top. The reason is that for UPDATE, each target relation needs a * different targetlist matching its own column set. Fortunately, * the UPDATE/DELETE target can never be the nullable side of an outer join, * so it's OK to generate the plan this way. you no longer describe plan as being expanded from the top rather than at the bottom, which IMO is what "this way" refers to. 13. "tree is" -> "tree are" (references is plural) + * references in the join tree to the original target relation that's the + * root parent of the inheritance tree is replaced by each of its 14. Any reason to move this line from its original location? Assert(parse->commandType != CMD_INSERT); + parent_rte = planner_rt_fetch(top_parentRTindex, root); Previously it was assigned just before it was needed and there's a fast path out after where you moved it to and where it was. 15. relation_excluded_by_constraints(), the switch (constraint_exclusion), you could consider turning that into if (constraint_exclusion == CONSTRAINT_EXCLUSION_OFF) return false; /* * When constraint_exclusion is set to 'partition' we only handle * OTHER_MEMBER_RELs. */ else if (constraint_exclusion == CONSTRAINT_EXCLUSION_PARTITION && rel->reloptkind != RELOPT_OTHER_MEMBER_REL) return false; When I wrote that code I was trying my best to make the complex rules as simple as possible by separating them out. 
The rules have become quite simple after your change, so it probably does not warrant having the switch. 16. I think the following comment needs to explain how large this array is and what indexes it. The current comment would have you think there are only enough elements to store PlannerInfos for child rels and leaves you guessing about what they're indexed by. + /* + * PlannerInfos corresponding to each inheritance child targets. + * Content of each PlannerInfo is same as the parent PlannerInfo, except + * for the parse tree which is a translated copy of the parent's parse + * tree. + */ + struct PlannerInfo **inh_target_child_roots; 17. I'm getting an assert failure in add_paths_to_append_rel() for Assert(parallel_workers > 0) during the partition_join tests. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
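Item 15 above proposes replacing the switch on constraint_exclusion with two early returns. A minimal, self-contained sketch of that shape is given below. The enum definitions are local mock copies written for this example (they reuse the names quoted in the review but are not taken from the PostgreSQL headers), and should_try_constraint_exclusion() is a hypothetical helper that covers only the early-exit part of the function.

```c
#include <assert.h>
#include <stdbool.h>

/* Local mock definitions; names mirror those quoted in the review. */
typedef enum
{
    CONSTRAINT_EXCLUSION_OFF,
    CONSTRAINT_EXCLUSION_ON,
    CONSTRAINT_EXCLUSION_PARTITION
} MockConstraintExclusionType;

typedef enum
{
    RELOPT_BASEREL,
    RELOPT_OTHER_MEMBER_REL
} MockRelOptKind;

/*
 * Returns true when constraint exclusion should be attempted at all for a
 * relation of the given kind under the given setting.
 */
static bool
should_try_constraint_exclusion(MockConstraintExclusionType setting,
                                MockRelOptKind kind)
{
    assert(setting >= CONSTRAINT_EXCLUSION_OFF &&
           setting <= CONSTRAINT_EXCLUSION_PARTITION);

    if (setting == CONSTRAINT_EXCLUSION_OFF)
        return false;

    /* With 'partition', only inheritance/partition members are handled. */
    if (setting == CONSTRAINT_EXCLUSION_PARTITION &&
        kind != RELOPT_OTHER_MEMBER_REL)
        return false;

    return true;
}
```

The if-chain expresses the same three-way rule as a switch would, but keeps the two "do nothing" cases at the top so the remaining body is the common path.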
Hi, I ran performance tests for non-prepared queries and for prepared queries with plan_cache_mode='auto' and plan_cache_mode='force_custom_plan'. I also varied the number of partitions between 256 and 4096. I ran the tests on master and v9-patched.

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[pgbench(with 4096 partitions)]
no prepared: pgbench -n -f select4096.sql -T 60
prepared: pgbench -n -f select4096.sql -T 60 -M prepared

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[results]
master:
part_num  no-prepared  auto  force_custom_plan  (1-auto/force_custom_plan)
     256          604   571                576                        0.01
    4096         17.5  17.5               15.1                       -0.16

patched:
part_num  no-prepared  auto  force_custom_plan  (1-auto/force_custom_plan)
     256         8614  9446               9384                      -0.006
    4096         7158  7165               7864                       0.089

There is almost no difference between auto and force_custom_plan with 256 partitions, but there is a difference between them with 4096 partitions. While auto is faster than force_custom_plan on master, force_custom_plan is faster than auto on patched. I wonder why force_custom_plan is faster than auto after applying the patch. When we use PREPARE-EXECUTE with plan_cache_mode='auto', a generic plan is created and used if its cost is cheaper than creating and using a custom plan, while with plan_cache_mode='force_custom_plan' a custom plan is always created and used. So one might think the difference in the above results comes from creating or using a generic plan.
I checked how many times a generic plan is created while pgbench is running and found that, with plan_cache_mode='auto', a generic plan is created only once and custom plans are created at all other times. I also checked how long creating the generic plan takes, but it didn't take much time (250ms or so with 4096 partitions). So the time spent creating a generic plan does not affect the performance. Currently, a generic plan is created at the sixth execution of the EXECUTE query. I changed that to happen much later (e.g. at the 400,000th execution of the EXECUTE query on master with 4096 partitions, because 7000TPS x 60sec=420,000 transactions are run during the pgbench run), and then there is almost no difference between auto and force_custom_plan. I think that the creation of a generic plan affects the execution time of the queries that run after the generic plan has been created. If my assumption is right, we can expect that some additional processing occurs when executing queries after the generic plan is created, which makes auto slower than force_custom_plan. But looking at the above results, on master with 4096 partitions, auto is faster than force_custom_plan. So now I am confused. Do you have any idea what affects the performance? -- Yoshikazu Imai
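The sixth-execution behaviour described above can be pictured with the following minimal sketch. It is only loosely modeled on how the custom-versus-generic decision is understood to work and is not the actual plan cache source: the Mock* types and both functions are hypothetical, the costs are supplied by the caller, and the "five custom plans first, then compare the generic cost against the average custom cost" rule is stated here as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    MOCK_PLAN_CACHE_AUTO,
    MOCK_PLAN_CACHE_FORCE_CUSTOM_PLAN
} MockPlanCacheMode;

/* Hypothetical per-statement bookkeeping. */
typedef struct MockPlanSource
{
    int    num_custom_plans;    /* custom plans built so far */
    double total_custom_cost;   /* sum of their costs */
    double generic_cost;        /* negative until a generic plan is built */
    int    generic_builds;      /* how many times a generic plan was built */
} MockPlanSource;

/* Assumed rule: 5 custom plans first, then prefer whichever is cheaper. */
static bool
choose_custom_plan(const MockPlanSource *src, MockPlanCacheMode mode)
{
    double avg_custom_cost;

    if (mode == MOCK_PLAN_CACHE_FORCE_CUSTOM_PLAN)
        return true;
    if (src->num_custom_plans < 5)
        return true;

    avg_custom_cost = src->total_custom_cost / src->num_custom_plans;
    return !(src->generic_cost < avg_custom_cost);
}

/* Simulate one EXECUTE; returns true if a custom plan was used. */
static bool
execute_once(MockPlanSource *src, MockPlanCacheMode mode,
             double custom_plan_cost, double generic_plan_cost)
{
    bool use_custom;

    assert(generic_plan_cost >= 0.0);
    use_custom = choose_custom_plan(src, mode);

    if (!use_custom && src->generic_cost < 0.0)
    {
        /* First time the generic plan is wanted: build it, then re-check. */
        src->generic_cost = generic_plan_cost;
        src->generic_builds++;
        use_custom = choose_custom_plan(src, mode);
    }

    if (use_custom)
    {
        src->num_custom_plans++;
        src->total_custom_cost += custom_plan_cost;
    }
    return use_custom;
}
```

Under these assumptions, a statement whose generic plan is far more expensive than its custom plans builds the generic plan exactly once, on the sixth execution, and keeps using custom plans afterwards, which matches the observation reported above.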
> From 3b86331dd5a2368adc39c9fef92f3dd09d817a08 Mon Sep 17 00:00:00 2001 > From: amit <amitlangote09@gmail.com> > Date: Wed, 7 Nov 2018 16:51:31 +0900 > Subject: [PATCH v11 4/6] Move inheritance expansion code into its own file I wonder if it would make sense to commit 0004 ahead of the rest? To avoid carrying this huge one over and over ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 9, 2019 at 11:41 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > From 3b86331dd5a2368adc39c9fef92f3dd09d817a08 Mon Sep 17 00:00:00 2001 > > From: amit <amitlangote09@gmail.com> > > Date: Wed, 7 Nov 2018 16:51:31 +0900 > > Subject: [PATCH v11 4/6] Move inheritance expansion code into its own file > > I wonder if it would make sense to commit 0004 ahead of the rest? To > avoid carrying this huge one over and over ... Maybe a good idea. I will rearrange the patches that way tomorrow. Thanks, Amit
Hi Amit, On Mon, Jan 7, 2019 at 6:30 PM, Amit Langote wrote: > On 2018/12/27 20:25, Amit Langote wrote: > > Here's the v10 of the patch. I didn't get chance to do more changes > > than those described above and address Imai-san's review comments > yesterday. > > > > Have a wonderful new year! I'll be back on January 7 to reply on this > thread. > > Rebased due to changed copyright year in prepunion.c. Also realized that > the new files added by patch 0004 still had 2018 in them. Thank you for the new patches. I also have some comments on 0001, set_inherit_target_rel_sizes(). In set_inherit_target_rel_sizes(): Some code is not placed in the same order as in set_append_rel_size(). 0001: at line 325-326, + ListCell *l; + bool has_live_children; In set_append_rel_size(), "has_live_children" is declared above "ListCell *l"; 0001: at line 582-603 + if (IS_DUMMY_REL(childrel)) + continue; + ... + Assert(childrel->rows > 0); + + /* We have at least one live child. */ + has_live_children = true; In set_append_rel_size(), + /* We have at least one live child. */ + has_live_children = true; is directly below + if (IS_DUMMY_REL(childrel)) + continue; 0001: at line 606-622, + if (!has_live_children) + { + /* + * All children were excluded by constraints, so mark the relation + * ass dummy. We must do this in this phase so that the rel's + * dummy-ness is visible when we generate paths for other rels. + */ + set_dummy_rel_pathlist(rel); + } + else + /* + * Set a non-zero value here to cope with the caller's requirement + * that non-dummy relations are actually not empty. We don't try to + * be accurate here, because we're not going to create a path that + * combines the children outputs. + */ + rel->rows = 1; In set_append_rel_size(), the condition of the if clause is has_live_children, not !has_live_children. I also noticed that the else block has no braces while the if block does. -- Yoshikazu Imai
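The control flow discussed for set_inherit_target_rel_sizes() (skip pruned children, then apply constraint exclusion, then track whether any child survives) can be sketched as below. This is a simplified illustration assembled from the snippets quoted in the reviews, not the patch itself; MockChildRel and size_target_children() are hypothetical, and the boolean fields stand in for the real pruning and constraint-exclusion checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a child RelOptInfo. */
typedef struct MockChildRel
{
    bool pruned;                    /* not among the live children after pruning */
    bool excluded_by_constraints;   /* constraints contradict the query's quals */
    bool is_dummy;                  /* set by this sketch */
} MockChildRel;

/*
 * Returns the row estimate to store for the parent: 1 if at least one child
 * survives, or 0 as a signal that the parent should be marked dummy.
 */
static double
size_target_children(MockChildRel *children, int nchildren, bool did_pruning)
{
    bool has_live_children = false;
    int  i;

    assert(nchildren >= 0);
    for (i = 0; i < nchildren; i++)
    {
        MockChildRel *child = &children[i];

        /* Pruned partitions are skipped before any further work. */
        if (did_pruning && child->pruned)
        {
            child->is_dummy = true;
            continue;
        }

        /* Constraint exclusion still runs, but only for unpruned children. */
        if (child->excluded_by_constraints)
        {
            child->is_dummy = true;
            continue;
        }

        /* We have at least one live child. */
        has_live_children = true;
    }

    /* No accurate estimate is attempted; any non-zero value will do. */
    return has_live_children ? 1.0 : 0.0;
}
```

Setting has_live_children immediately after the dummy checks mirrors the ordering that the review asks to keep consistent with set_append_rel_size().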
On 2019/01/10 0:16, Amit Langote wrote: > On Wed, Jan 9, 2019 at 11:41 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> From 3b86331dd5a2368adc39c9fef92f3dd09d817a08 Mon Sep 17 00:00:00 2001 >>> From: amit <amitlangote09@gmail.com> >>> Date: Wed, 7 Nov 2018 16:51:31 +0900 >>> Subject: [PATCH v11 4/6] Move inheritance expansion code into its own file >> >> I wonder if it would make sense to commit 0004 ahead of the rest? To >> avoid carrying this huge one over and over ... > > Maybe a good idea. I will rearrange the patches that way tomorrow. Here's v12, which is more or less the same as v11 but with the order of patches switched so that the code movement patch is first. Note that the attached v12-0001 contains no functional changes (but there's a tiny note in the commit message mentioning the addition of a tiny function which is just old code). In the v13 that I will try to post tomorrow, I will have hopefully addressed David's and Imai-san's review comments. Thank you both! Regards, Amit
Attachment
- v12-0001-Move-inheritance-expansion-code-into-its-own-fil.patch
- v12-0002-Overhaul-inheritance-update-delete-planning.patch
- v12-0003-Store-inheritance-root-parent-index-in-otherrel-.patch
- v12-0004-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v12-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v12-0006-Do-not-lock-all-partitions-at-the-beginning.patch
On Thu, 10 Jan 2019 at 21:41, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In the v13 that I will try to post tomorrow, I will have hopefully
> addressed David's and Imai-san's review comments.  Thank you both!

I'd been looking at v11's 0002 and started on 0003 too. I'll include
my notes so far if you're about to send a v13.

v11 0002

18. There's a missing case in the following code. I understand that
makeNode() will 0 the member here, but that does not stop the various
other initialisations that set the default value for the field.  Below
there's a missing case where parent != NULL && parent->rtekind !=
RTE_RELATION. You might be better just always zeroing the field below
"rel->partitioned_child_rels = NIL;"

+
+        /*
+         * For inheritance child relations, we also set inh_root_parent.
+         * Note that 'parent' might itself be a child (a sub-partitioned
+         * partition), in which case we simply use its value of
+         * inh_root_parent.
+         */
+        if (parent->rtekind == RTE_RELATION)
+            rel->inh_root_parent = parent->inh_root_parent > 0 ?
+                                        parent->inh_root_parent :
+                                        parent->relid;
     }
     else
+    {
         rel->top_parent_relids = NULL;
+        rel->inh_root_parent = 0;
+    }

19. Seems strange to have a patch that adds a new field that is
unused. I also don't quite understand yet why top_parent_relids can't
be used instead. I think I recall being confused about that before, so
maybe it's worth writing a comment to mention why it cannot be used.

v11 0003

20. This code looks wrong:

+    /*
+     * expand_inherited_tables may have proved that the relation is empty, so
+     * check if it's so.
+     */
+    else if (rte->inh && !IS_DUMMY_REL(rel))

Likely you'll want:

else if (rte->inh)
{
    if (IS_DUMMY_REL(rel))
        return;
    // other stuff
}

otherwise, you'll end up in the else clause when you shouldn't be.

21. is -> was

+         * The child rel's RelOptInfo is created during
+         * expand_inherited_tables().
          */
         childrel = find_base_rel(root, childRTindex);

since you're talking about something that already happened.

I'll continue looking at v12.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
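Editor's sketch for item 18 above: one way to read the suggestion is "reset the field unconditionally, then overwrite it only for children of a plain relation", so that a parent with some other rtekind (for example a UNION ALL subquery) no longer leaves the field at whatever an earlier initialisation put there. The snippet below is a minimal, self-contained illustration and is not code from the patch; SketchRel, SketchRteKind and sketch_set_inh_root_parent are hypothetical, pared-down stand-ins for RelOptInfo, RTEKind and the logic in build_simple_rel().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-ins for the planner structs. */
typedef enum { SKETCH_RTE_RELATION, SKETCH_RTE_SUBQUERY } SketchRteKind;

typedef struct SketchRel
{
    int           relid;            /* range table index */
    SketchRteKind rtekind;
    int           inh_root_parent;  /* 0 means "not an inheritance child" */
} SketchRel;

/* Initialise inh_root_parent so that every case ends up with a value. */
static void
sketch_set_inh_root_parent(SketchRel *rel, const SketchRel *parent)
{
    assert(rel != NULL);

    /* Always start from the default, as suggested in item 18. */
    rel->inh_root_parent = 0;

    /*
     * Only children of a plain relation inherit a root parent.  A parent
     * that is itself a child passes its own root along.
     */
    if (parent != NULL && parent->rtekind == SKETCH_RTE_RELATION)
        rel->inh_root_parent = parent->inh_root_parent > 0 ?
            parent->inh_root_parent : parent->relid;
}
```

With this shape, the three cases (no parent, relation parent, non-relation parent) all pass through the same initial assignment, so none of them can be forgotten.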
On 2019-Jan-10, Amit Langote wrote:

> Here's v12, which is more or less the same as v11 but with the order of
> patches switched so that the code movement patch is first.  Note that the
> attached v12-0001 contains no functional changes (but there's a tiny note in
> the commit message mentioning the addition of a tiny function which is
> just old code).

Pushed 0001 with some minor tweaks, thanks.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 11 Jan 2019 at 06:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Pushed 0001 with some minor tweaks, thanks.

Thanks for pushing.  I had looked at 0001 last night and there wasn't
much to report, just:

v12 0001

1. I see you've moved translate_col_privs() out of prepunion.c into
appendinfo.c. This required making it an external function, but its
only use is in inherit.c, so would it not be better to put it there
and keep it static?

2. The following two lines I think need to swap their order.

+#include "utils/rel.h"
+#include "utils/lsyscache.h"

Both are pretty minor details but thought I'd post them anyway.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2019/01/11 2:56, Alvaro Herrera wrote:
> On 2019-Jan-10, Amit Langote wrote:
>
>> Here's v12, which is more or less the same as v11 but with the order of
>> patches switched so that the code movement patch is first.  Note that the
>> attached v12-0001 contains no functional changes (but there's a tiny note in
>> the commit message mentioning the addition of a tiny function which is
>> just old code).
>
> Pushed 0001 with some minor tweaks, thanks.

Thank you for the tweaks and committing.

Regards,
Amit
Sorry, I hadn't read this email before sending my earlier "thank you for
committing" email.

On 2019/01/11 6:47, David Rowley wrote:
> On Fri, 11 Jan 2019 at 06:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Pushed 0001 with some minor tweaks, thanks.
>
> Thanks for pushing.  I had looked at 0001 last night and there wasn't
> much to report, just:
>
> v12 0001
>
> 1. I see you've moved translate_col_privs() out of prepunion.c into
> appendinfo.c. This required making it an external function, but its
> only use is in inherit.c, so would it not be better to put it there
> and keep it static?

Actually, I *was* a bit puzzled where to put it.  I tend to agree with you
now that it might be better to define it locally within inherit.c, as it
might not be needed elsewhere.  Let's hear what Alvaro thinks.  I'm
attaching a patch anyway.

> 2. The following two lines I think need to swap their order.
>
> +#include "utils/rel.h"
> +#include "utils/lsyscache.h"

Oops, fixed.

> Both are pretty minor details but thought I'd post them anyway.

Thank you for reporting.

Attached find the patch.

Regards,
Amit
Attachment
On Thu, 10 Jan 2019 at 21:41, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Here's v12, which is more or less the same as v11 but with the order of
> patches switched so that the code movement patch is first.  Note that the
> attached v12-0001 contains no functional changes (but there's a tiny note in
> the commit message mentioning the addition of a tiny function which is
> just old code).

A few more comments based on reading v12.

v12 0002:

1. Missing braces around the else clause. (Should be consistent with
the "if" above)

+    if (!has_live_children)
+    {
+        /*
+         * All children were excluded by constraints, so mark the relation
+         * ass dummy.  We must do this in this phase so that the rel's
+         * dummy-ness is visible when we generate paths for other rels.
+         */
+        set_dummy_rel_pathlist(rel);
+    }
+    else
+        /*
+         * Set a non-zero value here to cope with the caller's requirement
+         * that non-dummy relations are actually not empty.  We don't try to
+         * be accurate here, because we're not going to create a path that
+         * combines the children outputs.
+         */
+        rel->rows = 1;

v12 0004:

2. I wonder if there's a better way, instead of doing this:

+    if (child_rel1 == NULL)
+        child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts);
+    if (child_rel2 == NULL)
+        child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts);

maybe add some logic in populate_joinrel_with_paths() to allow NULL
rels to mean dummy rels. There's a bit of a problem there as the
joinrel has already been created by that time, but perhaps
populate_joinrel_with_paths is a better place to decide if the dummy
rel needs to be built or not.

3. I wonder if there's a better way to handle what
build_dummy_partition_rel() does. I see it sets the child's relid to the
parent's relid and makes up a fake AppendRelInfo and puts it in the
parent's slot.  What's going to happen when the parent is not the
top-level parent? It'll already have an AppendRelInfo set.

I had thought something like the following could break this, but of
course, it does not since we build the dummy rel for the pruned
sub_parent2, so we don't hit the NULL relation case when doing the
next level. i.e. we only make dummies for the top-level, never dummies
of joinrels.

Does that not mean that the if (parent->top_parent_relids) will always
be false in build_dummy_partition_rel() and it'll only ever get
rtekind == RTE_RELATION?

drop table if exists parent;
create table parent (id int, a int, b text, c float) partition by range (a);
create table sub_parent1 (b text, c float, a int, id int) partition by range (a);
create table sub_parent2 (c float, b text, id int, a int) partition by range (a);
alter table parent attach partition sub_parent1 for values from (0) to (10);
alter table parent attach partition sub_parent2 for values from (10) to (20);

create table child11 (id int, b text, c float, a int);
create table child12 (id int, b text, c float, a int);
create table child21 (id int, b text, c float, a int);
create table child22 (id int, b text, c float, a int);

alter table sub_parent1 attach partition child11 for values from (0) to (5);
alter table sub_parent1 attach partition child12 for values from (5) to (10);
alter table sub_parent2 attach partition child21 for values from (10) to (15);
alter table sub_parent2 attach partition child22 for values from (15) to (20);

insert into parent values(0,1,'test',100.0);

select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 10;

4. How are dummy rels handled in grouping_planner()?

I see you have this:

-    if (IS_DUMMY_REL(planned_rel))
+    if (!parent_rte->inh || IS_DUMMY_REL(planned_rel))
     {
         grouping_planner(root, false, planned_rel, 0.0);
         return;

With the above example I tried to see how it was handled by doing:

postgres=# update parent set c = c where a = 333;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

I didn't look into what's causing the crash.

5. Wondering why you removed the if (childOID != parentOID) from:

-        if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
-        {
-            heap_close(newrelation, lockmode);
-            continue;
-        }

Isn't that releasing the only lock we hold on the rel defined in the query?

I tested with:

-- session 1
create temp table a1(a int);
create temp table a2(a int) inherits(a1);

-- session 2
select oid::regclass from pg_class where relname = 'a1';
     oid
--------------
 pg_temp_3.a1
(1 row)

explain select * from pg_temp_3.a1;
WARNING:  you don't own a lock of type AccessShareLock
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=4)
   One-Time Filter: false
(2 rows)

6. expand_planner_arrays() zeros a portion of the append_rel_array
even if it just palloc0'd the array. While it's not a bug, it is
repeat work.  It should be okay to move the MemSet() up to the
repalloc().

7. I see get_relation_info() grew an extra parameter.  Would it now be
better just to pass rte instead of doing;

get_relation_info(root, rte->relid, rte->inh, rte->updatedCols,
  rel);

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
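Editor's sketch for item 6 above: the point is that a freshly zero-allocated array needs no further clearing, and after growing an existing array only the newly added tail is uninitialised. The snippet below illustrates that pattern with the standard C allocator rather than PostgreSQL's palloc0()/repalloc()/MemSet(); SketchArray and sketch_expand_array are hypothetical names and do not appear in the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of pointers, indexed by range table index. */
typedef struct SketchArray
{
    void  **items;      /* NULL until the first expansion */
    int     size;       /* number of allocated slots; 0 while items is NULL */
} SketchArray;

/*
 * Grow arr by add_size slots.  Returns 0 on success, -1 on allocation
 * failure.  Like the code under review, this relies on an all-zero-bits
 * pointer being NULL, which holds on mainstream platforms.
 */
static int
sketch_expand_array(SketchArray *arr, int add_size)
{
    int     new_size;

    assert(arr != NULL && add_size > 0);
    new_size = arr->size + add_size;

    if (arr->items == NULL)
    {
        /* Fresh allocation: calloc already zeroes every slot. */
        arr->items = calloc((size_t) new_size, sizeof(void *));
        if (arr->items == NULL)
            return -1;
    }
    else
    {
        void  **grown = realloc(arr->items,
                                (size_t) new_size * sizeof(void *));

        if (grown == NULL)
            return -1;
        arr->items = grown;

        /* Only the newly added tail is uninitialised; zero just that. */
        memset(arr->items + arr->size, 0,
               (size_t) add_size * sizeof(void *));
    }

    arr->size = new_size;
    return 0;
}
```

Keeping the memset next to the realloc makes it visible that the zeroing belongs to the "grow" path only, which is the restructuring the review comment asks for.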
Hi David,

On Thu, Jan 10, 2019 at 4:02 PM, David Rowley wrote:
> 3. I wonder if there's a better way to handle what
> build_dummy_partition_rel() does. I see it sets the child's relid to the
> parent's relid and makes up a fake AppendRelInfo and puts it in the
> parent's slot.  What's going to happen when the parent is not the
> top-level parent? It'll already have an AppendRelInfo set.
...
>
> select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 10;

I think there is a mistake in the select SQL.
"p1.id < 10" doesn't prune any partition because the tables are partitioned
by column "a" in your definition. Isn't it?

> Does that not mean that the if (parent->top_parent_relids) will always
> be false in build_dummy_partition_rel() and it'll only ever get
> rtekind == RTE_RELATION?

At least, I checked that if (parent->top_parent_relids) can be true if I
execute the SQL below.

select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 15;

I couldn't check the other points you mentioned, but I also think
build_dummy_partition_rel() needs more consideration because I felt its
logic was complicated when I was checking around here.


Amit,
I also realized there are some mistakes in the comments around this function.

+ * build_dummy_partition_rel
+ *        Build a RelOptInfo and AppendRelInfo for a pruned partition
s/and AppendRelInfo/and an AppendRelInfo/

+     * Now we'll need a (no-op) AppendRelInfo for parent, because we're
+     * setting the dummy partition's relid to be same as the parent's.
s/a \(no-op\) AppendRelInfo/an \(no-op\) AppendRelInfo/

--
Yoshikazu Imai
On 2019/01/11 11:07, Amit Langote wrote:
> On 2019/01/11 6:47, David Rowley wrote:
>> On Fri, 11 Jan 2019 at 06:56, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Pushed 0001 with some minor tweaks, thanks.
>>
>> Thanks for pushing.  I had looked at 0001 last night and there wasn't
>> much to report, just:
>>
>> v12 0001
>>
>> 1. I see you've moved translate_col_privs() out of prepunion.c into
>> appendinfo.c. This required making it an external function, but its
>> only use is in inherit.c, so would it not be better to put it there
>> and keep it static?
>
> Actually, I *was* a bit puzzled where to put it.  I tend to agree with you
> now that it might be better to define it locally within inherit.c, as it
> might not be needed elsewhere.  Let's hear what Alvaro thinks.  I'm
> attaching a patch anyway.
>
>> 2. The following two lines I think need to swap their order.
>>
>> +#include "utils/rel.h"
>> +#include "utils/lsyscache.h"
>
> Oops, fixed.
>
>> Both are pretty minor details but thought I'd post them anyway.
>
> Thank you for reporting.
>
> Attached find the patch.

Looking around a bit more, I started thinking even build_child_join_sjinfo
doesn't belong in appendinfo.c (just to be clear, it was defined in
prepunion.c before this commit), so maybe we should move it to joinrels.c
and instead export adjust_child_relids that's required by it from
appendinfo.c.  There's already adjust_child_relids_multilevel in
appendinfo.h, so having adjust_child_relids next to it isn't too bad.  At
least not as bad as appendinfo.c exporting build_child_join_sjinfo for
joinrels.c to use.

I have updated the patch.

Thanks,
Amit
Attachment
On Thu, Jan 10, 2019 at 6:10 PM, Imai, Yoshikazu wrote:
> > Does that not mean that the if (parent->top_parent_relids) will always
> > be false in build_dummy_partition_rel() and it'll only ever get
> > rtekind == RTE_RELATION?
>
> At least, I checked if (parent->top_parent_relids) can be true if I
> execute below SQL.
>
> select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 15;

Sorry, I also made a mistake. What I actually executed was:

select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.a < 15;

--
Yoshikazu Imai
Thanks for reviewing, David, Imai-san.  Replying to all reviews (up to and
including David's comments earlier today) with this one email so that I
can attach the finished patch here.

On 2019/01/09 9:09, David Rowley wrote:
> On Tue, 8 Jan 2019 at 19:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Rebased due to changed copyright year in prepunion.c.  Also realized that
>> the new files added by patch 0004 still had 2018 in them.
>
> I've made a pass over 0001. There's probably enough for you to look at
> while I look at 0002 and the others.
>
> 0001
>
> 1. In your doc changes, just below a paragraph that you removed,
> there's a paragraph starting "Both of these behaviors are likely to be
> changed in a future release". This needs to be fixed since you've
> removed the first of the two reasons.

OK, I've fixed the sentence and moved it to the previous paragraph.

> 2. This part should be left alone.
>
> -    technique similar to partition pruning.  While it is primarily used
> -    for partitioning implemented using the legacy inheritance method, it can be
> -    used for other purposes, including with declarative partitioning.
> +    technique similar to partition pruning.  It is primarily used
> +    for partitioning implemented using the legacy inheritance method.
>
> Looking at set_inherit_target_rel_sizes(), constraint exclusion still
> is applied to partitions, it's just applied after pruning, according
> to:

OK, I've restored the sentence.

> 3. add_child_rel_equivalences(). You're replacing parent EMs with
> their child equivalent, but only when the eclass has no volatile
> functions. Is this really what you want? I think this would misbehave
> if we ever allowed: UPDATE ... SET .. ORDER BY, of which there's a
> legitimate use case of wanting to reduce the chances of deadlocks
> caused by non-deterministic UPDATE order.  Or if you think skipping
> the volatile eclasses is fine today, then I think the comment you've
> added to add_child_rel_equivalences should mention that.

To be honest, I hadn't considered this aspect before.  I think it would be
OK to create a new copy of the EC even if it's a volatile one, as we're
creating an entirely new one for the child's planning.  Maybe this will
also ensure that whoever works on implementing UPDATE SET ORDER BY in the
future won't have to fiddle with this code.

> 4. Do you think it's okay that add_child_rel_equivalences() does not
> update the ec_relids when removing the member?

That's an oversight.  Fixed by making add_child_rel_equivalences do this:

+            cur_ec->ec_relids = bms_difference(cur_ec->ec_relids,
+                                               parent_rel->relids);
+            cur_ec->ec_relids = bms_add_members(cur_ec->ec_relids,
+                                                child_rel->relids);

>
> UPDATE: I see you're likely leaving this alone since you're only doing
> a shallow copy of the eclasses in
> adjust_inherited_target_child_root(). It seems like a pretty bad idea
> to do a shallow copy there.

So, you're talking about this code:

    /*
     * Child root should get its own copy of ECs, because they'll be modified
     * to replace parent EC expressions by child expressions in
     * add_child_rel_equivalences.
     */
    subroot->eq_classes = NIL;
    foreach(lc, root->eq_classes)
    {
        EquivalenceClass *ec = lfirst(lc);
        EquivalenceClass *new_ec = makeNode(EquivalenceClass);

        memcpy(new_ec, ec, sizeof(EquivalenceClass));
        new_ec->ec_members = list_copy(ec->ec_members);
        subroot->eq_classes = lappend(subroot->eq_classes, new_ec);
    }

Can you say what you think is wrong with this way of making a copy of the ECs?

>
> 5. What's CE?
>
> + /* CE failed, so finish copying/modifying join quals. */

Constraint exclusion.  It seems I needed to fix comments around here.

> 6. Typo:
>
> + * ass dummy.  We must do this in this phase so that the rel's
>
> ass -> as

Oops!  Fixed.

>
> 7. There's no accumulation going on here:
>
> + /*
> + * Accumulate size information from each live child.
> + */
> + Assert(childrel->rows > 0);

Removed the comment.

>
> 8. Any point in this? We're about to loop again anyway...
>
> + /*
> + * If child is dummy, ignore it.
> + */
> + if (IS_DUMMY_REL(childrel))
> + continue;
> + }

Removed this code.

> 9. It's a bit confusing to mention SELECT in this comment. The Assert
> ensures it's an UPDATE or DELETE.
>
> + /*
> + * For UPDATE/DELETE queries, the top parent can only ever be a table.
> + * As a contrast, it could be a UNION ALL subquery in the case of SELECT.
> + */
> + Assert(root->parse->commandType == CMD_UPDATE ||
> +    root->parse->commandType == CMD_DELETE);

I guess we don't need the 2nd sentence.  Removed.

> 10. I'd say the subroot assignment can go after the IS_DUMMY_REL
> check. Keeping that loop as tight as possible for pruned rels seems
> like a good idea.
>
> + subroot = root->inh_target_child_roots[appinfo->child_relid];
> + Assert(subroot->parse->resultRelation > 0);
> + childrel = find_base_rel(root, appinfo->child_relid);
> +
> + /* Ignore excluded/pruned children. */
> + if (IS_DUMMY_REL(childrel))
> + continue;

Agreed, done.

> 11. I don't think you should reuse the childrel variable here:
>
> + childrel->reloptkind = RELOPT_BASEREL;
> +
> + Assert(subroot->join_rel_list == NIL);
> + Assert(subroot->join_rel_hash == NULL);
> +
> + /* Perform join planning and save the resulting RelOptInfo. */
> + childrel = make_rel_from_joinlist(subroot, translated_joinlist);
> +
> + /*
> + * Remember this child target rel.  inheritance_planner will perform
> + * the remaining steps of planning for each child relation separately.
> + * Specifically, it will call grouping_planner on every
> + * RelOptInfo contained in the inh_target_child_rels list, each of
> + * which represents the source of tuples to be modified for a given
> + * target child rel.
> + */
> + root->inh_target_child_joinrels =
> + lappend(root->inh_target_child_joinrels, childrel);

OK, I've added a new variable called childjoinrel.

> 12. The following comment makes less sense now that you've modified
> the previous paragraph:
>
> + * Fortunately, the UPDATE/DELETE target can never be the nullable side of an
> + * outer join, so it's OK to generate the plan this way.
>
> This text used to refer to:
>
> but target inheritance has to be expanded at
>  * the top.  The reason is that for UPDATE, each target relation needs a
>  * different targetlist matching its own column set.  Fortunately,
>  * the UPDATE/DELETE target can never be the nullable side of an outer join,
>  * so it's OK to generate the plan this way.
>
> you no longer describe plan as being expanded from the top rather than
> at the bottom, which IMO is what "this way" refers to.

To be honest, I never quite understood what that line really means, which
is why I kept it around.  What I do know though is that, even with the
patch, we still generate subpaths for each child whose targetlist
matches the child, so I think the part about "this way" that prompted
someone to write that line still remains.  Does that make sense to you?

> 13. "tree is" -> "tree are" (references is plural)
>
> + * references in the join tree to the original target relation that's the
> + * root parent of the inheritance tree is replaced by each of its

Fixed.

> 14. Any reason to move this line from its original location?
>
>   Assert(parse->commandType != CMD_INSERT);
> + parent_rte = planner_rt_fetch(top_parentRTindex, root);
>
> Previously it was assigned just before it was needed and there's a
> fast path out after where you moved it to and where it was.

Next patch in the series needs it moved, but no reason for this patch to
move it.  Put it back where it was.

> 15. relation_excluded_by_constraints(), the switch
> (constraint_exclusion), you could consider turning that into
>
> if (constraint_exclusion == CONSTRAINT_EXCLUSION_OFF)
>     return false;
> /*
>  * When constraint_exclusion is set to 'partition' we only handle
>  * OTHER_MEMBER_RELs.
>  */
> else if (constraint_exclusion == CONSTRAINT_EXCLUSION_PARTITION &&
>          rel->reloptkind != RELOPT_OTHER_MEMBER_REL)
>     return false;
>
> When I wrote that code I was trying my best to make the complex rules
> as simple as possible by separating them out.  The rules have become
> quite simple after your change, so it probably does not warrant having
> the switch.

OK, done.

> 16. I think the following comment needs to explain how large this
> array is and what indexes it.  The current comment would have you
> think there are only enough elements to store PlannerInfos for child
> rels and leaves you guessing about what they're indexed by.
>
> + /*
> + * PlannerInfos corresponding to each inheritance child targets.
> + * Content of each PlannerInfo is same as the parent PlannerInfo, except
> + * for the parse tree which is a translated copy of the parent's parse
> + * tree.
> + */
> + struct PlannerInfo **inh_target_child_roots;

I've added those details to the comment.

>
> 17. I'm getting an assert failure in add_paths_to_append_rel() for
> Assert(parallel_workers > 0) during the partition_join tests.

I guess that was due to not using the correct root in
inherit_target_rel_pathlists.  Fixed.

On 2019/01/10 18:12, David Rowley wrote:
> I'd been looking at v11's 0002 and started on 0003 too. I'll include
> my notes so far if you're about to send a v13.
>
>
> v11 0002
>
> 18. There's a missing case in the following code. I understand that
> makeNode() will 0 the member here, but that does not stop the various
> other initialisations that set the default value for the field.  Below
> there's a missing case where parent != NULL && parent->rtekind !=
> RTE_RELATION. You might be better just always zeroing the field below
> "rel->partitioned_child_rels = NIL;"
>
> +
> + /*
> + * For inheritance child relations, we also set inh_root_parent.
> + * Note that 'parent' might itself be a child (a sub-partitioned
> + * partition), in which case we simply use its value of
> + * inh_root_parent.
> + */
> + if (parent->rtekind == RTE_RELATION)
> + rel->inh_root_parent = parent->inh_root_parent > 0 ?
> + parent->inh_root_parent :
> + parent->relid;
>   }
>   else
> + {
>   rel->top_parent_relids = NULL;
> + rel->inh_root_parent = 0;
> + }

Okay, done.  Actually, also did the same for top_parent_relids.

> 19. Seems strange to have a patch that adds a new field that is
> unused. I also don't quite understand yet why top_parent_relids can't
> be used instead. I think I recall being confused about that before, so
> maybe it's worth writing a comment to mention why it cannot be used.

See if this updated comment makes it any clearer:

    /*
     * For inheritance children, this is the RT index of inheritance table
     * mentioned in the query from which this relation originated.
     * top_parent_relids cannot be used for this, because if the inheritance
     * root table is itself under UNION ALL, top_parent_relids contains the
     * RT index of UNION ALL parent subquery.
     */

This is its own patch, because it was thought it could be useful to
another thread which has since stalled:

https://www.postgresql.org/message-id/be766794-eb16-b798-52ec-1f786b26b61b%40lab.ntt.co.jp

> v11 0003
>
> 20. This code looks wrong:
>
> + /*
> + * expand_inherited_tables may have proved that the relation is empty, so
> + * check if it's so.
> + */
> + else if (rte->inh && !IS_DUMMY_REL(rel))
>
>
> Likely you'll want:
>
> else if (rte->inh)
> {
>     if (IS_DUMMY_REL(rel))
>         return;
>     // other stuff
> }
>
> otherwise, you'll end up in the else clause when you shouldn't be.

OK, done that way.

> 21. is -> was
>
> + * The child rel's RelOptInfo is created during
> + * expand_inherited_tables().
>   */
>   childrel = find_base_rel(root, childRTindex);
>
> since you're talking about something that already happened.

OK, fixed.

On 2019/01/11 13:01, David Rowley wrote:
> A few more comments based on reading v12.
>
> v12 0002:
>
> 1. Missing braces around the else clause. (Should be consistent with
> the "if" above)
>
> + if (!has_live_children)
> + {
> + /*
> + * All children were excluded by constraints, so mark the relation
> + * ass dummy.  We must do this in this phase so that the rel's
> + * dummy-ness is visible when we generate paths for other rels.
> + */
> + set_dummy_rel_pathlist(rel);
> + }
> + else
> + /*
> + * Set a non-zero value here to cope with the caller's requirement
> + * that non-dummy relations are actually not empty.  We don't try to
> + * be accurate here, because we're not going to create a path that
> + * combines the children outputs.
> + */
> + rel->rows = 1;

Agreed, done.

> v12 0004:
>
> 2. I wonder if there's a better way, instead of doing this:
>
> + if (child_rel1 == NULL)
> + child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts);
> + if (child_rel2 == NULL)
> + child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts);
>
> maybe add some logic in populate_joinrel_with_paths() to allow NULL
> rels to mean dummy rels. There's a bit of a problem there as the
> joinrel has already been created by that time, but perhaps
> populate_joinrel_with_paths is a better place to decide if the dummy
> rel needs to be built or not.

Hmm, I'd thought about that, but concluded that I shouldn't mix that work
with this refactoring project.  We can try to hack the join planning code
later, but until then build_dummy_partition_rel can keep things working.
Once we have a solution that works without having to create this
dummy RelOptInfo, we can remove build_dummy_partition_rel.

> 3. I wonder if there's a better way to handle what
> build_dummy_partition_rel() does. I see it sets the child's relid to the
> parent's relid and makes up a fake AppendRelInfo and puts it in the
> parent's slot.  What's going to happen when the parent is not the
> top-level parent? It'll already have an AppendRelInfo set.

Yeah, the parent's AppendRelInfo would already be present and it won't be
replaced:

    if (root->append_rel_array[parent->relid] == NULL)
    {
        AppendRelInfo *appinfo = make_append_rel_info(parent, parentrte,
                                                      parent->tupdesc,
                                                      parentrte->relid,
                                                      parent->reltype,
                                                      parent->relid);

        root->append_rel_array[parent->relid] = appinfo;
    }

Now you may ask why the discrepancy.  For the top-level parent's dummy
children, the appinfo generated is actually a no-op, because it's generated
using the above code.  But for an intermediate parent's dummy children
there already exists a "real" appinfo.  It doesn't make much difference as
far as I can tell, because the appinfo is not used for anything significant.

> I had thought something like the following could break this, but of
> course, it does not since we build the dummy rel for the pruned
> sub_parent2, so we don't hit the NULL relation case when doing the
> next level. i.e. we only make dummies for the top-level, never dummies
> of joinrels.
>
> Does that not mean that the if (parent->top_parent_relids) will always
> be false in build_dummy_partition_rel() and it'll only ever get
> rtekind == RTE_RELATION?

Well, parentrte->rtekind should always be RTE_RELATION in
build_dummy_partition_rel, because partitionwise join is considered only
for partitioned tables and joinrels resulting from joining partitioned tables.

parent->top_parent_relids might be non-NULL if it's an intermediate parent.

> 4. How are dummy rels handled in grouping_planner()?
>
> I see you have this:
>
> - if (IS_DUMMY_REL(planned_rel))
> + if (!parent_rte->inh || IS_DUMMY_REL(planned_rel))
>   {
>   grouping_planner(root, false, planned_rel, 0.0);
>   return;
>
> With the above example I tried to see how it was handled by doing:
>
> postgres=# update parent set c = c where a = 333;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>
> I didn't look into what's causing the crash.

I tried your example, but it didn't crash for me:

explain update parent set c = c where a = 333;
                     QUERY PLAN
────────────────────────────────────────────────────
 Update on parent  (cost=0.00..0.00 rows=0 width=0)
   ->  Result  (cost=0.00..0.00 rows=0 width=54)
         One-Time Filter: false
(3 rows)

> 5. Wondering why you removed the if (childOID != parentOID) from:
>
> - if (childOID != parentOID && RELATION_IS_OTHER_TEMP(newrelation))
> - {
> - heap_close(newrelation, lockmode);
> - continue;
> - }
>
> Isn't that releasing the only lock we hold on the rel defined in the query?

I think I mistakenly removed it.  Added it back.

> 6. expand_planner_arrays() zeros a portion of the append_rel_array
> even if it just palloc0'd the array. While it's not a bug, it is
> repeat work.  It should be okay to move the MemSet() up to the
> repalloc().

Done.  Also moved other MemSets to their respective repallocs.

> 7. I see get_relation_info() grew an extra parameter.  Would it now be
> better just to pass rte instead of doing;
>
> get_relation_info(root, rte->relid, rte->inh, rte->updatedCols,
>   rel);
>

OK, done.

On 2019/01/10 15:37, Imai, Yoshikazu wrote:
> I also have some comments on 0001, set_inherit_target_rel_sizes().
>
> In set_inherit_target_rel_sizes():
>
> Some code is not placed in the same order as in set_append_rel_size().
>
> 0001: at line 325-326,
> +    ListCell   *l;
> +    bool        has_live_children;
>
> In set_append_rel_size(), "has_live_children" is above the "ListCell *l";

OK, moved to look like set_append_rel_size.

> 0001: at line 582-603
> +        if (IS_DUMMY_REL(childrel))
> +            continue;
> +
> ...
> +        Assert(childrel->rows > 0);
> +
> +        /* We have at least one live child. */
> +        has_live_children = true;
>
> In set_append_rel_size(),
> +        /* We have at least one live child. */
> +        has_live_children = true;
> is directly under
> +        if (IS_DUMMY_REL(childrel))
> +            continue;

OK, done.

> 0001: at line 606-622,
> +    if (!has_live_children)
> +    {
> +        /*
> +         * All children were excluded by constraints, so mark the relation
> +         * ass dummy.  We must do this in this phase so that the rel's
> +         * dummy-ness is visible when we generate paths for other rels.
> +         */
> +        set_dummy_rel_pathlist(rel);
> +    }
> +    else
> +        /*
> +         * Set a non-zero value here to cope with the caller's requirement
> +         * that non-dummy relations are actually not empty.  We don't try to
> +         * be accurate here, because we're not going to create a path that
> +         * combines the children outputs.
> +         */
> +        rel->rows = 1;
>
> In set_append_rel_size(), the condition of the if clause is not
> !has_live_children but has_live_children.
>
> I also noticed there is no else block there, only an if block.

Things that set_inherit_target_rel_sizes and set_append_rel_size need to
do, respectively, are different, so the code looks different.  Anyway,
I've switched the above if else blocks' code so that it looks like this:

    if (has_live_children)
        rel->rows = 1;
    else
        set_dummy_rel_pathlist(rel);

Here are the updated patches.  I've also attached the patch I posted
earlier today here as 0001-Some-fixups-for-b60c397599-v2.patch.

Thank you again.

Regards,
Amit
Attachment
- 0001-Some-fixups-for-b60c397599-v2.patch
- v13-0002-Overhaul-inheritance-update-delete-planning.patch
- v13-0003-Store-inheritance-root-parent-index-in-otherrel-.patch
- v13-0004-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v13-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v13-0006-Do-not-lock-all-partitions-at-the-beginning.patch
On Sat, 12 Jan 2019 at 02:00, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > That's an oversight. Fixed by making add_child_rel_equivalences do this: > > + cur_ec->ec_relids = bms_difference(cur_ec->ec_relids, > + parent_rel->relids); > + cur_ec->ec_relids = bms_add_members(cur_ec->ec_relids, > + child_rel->relids); > > > > > UPDATE: I see you're likely leaving this alone since you're only doing > > a shallow copy of the eclasses in > > adjust_inherited_target_child_root(). It seems like a pretty bad idea > > to do a shallow copy there. > > So, you're talking about this code: > > /* > * Child root should get its own copy of ECs, because they'll be modified > * to replace parent EC expressions by child expressions in > * add_child_rel_equivalences. > */ > subroot->eq_classes = NIL; > foreach(lc, root->eq_classes) > { > EquivalenceClass *ec = lfirst(lc); > EquivalenceClass *new_ec = makeNode(EquivalenceClass); > > memcpy(new_ec, ec, sizeof(EquivalenceClass)); > new_ec->ec_members = list_copy(ec->ec_members); > subroot->eq_classes = lappend(subroot->eq_classes, new_ec); > } > > Can you say what you think is wrong with this way of making a copy of the ECs? If you shallow copy with memcpy(new_ec, ec, sizeof(EquivalenceClass));, then fields such as ec_relids will just point to the same memory as the parent PlannerInfo's EquivalenceClasses, so when you do your adjustment (as above) on the child eclasses, you'll modify memory belonging to the parent. I'd assume you'd not want to do that since you need to keep the parent intact at least to make copies for other child rels. You'd have gotten away with it before since you performed a list_copy() and only were deleting the parent's EMs with list_delete_cell() which was modifying the copy, not the original list. 
Since you were missing the alteration to ec_relids, then you might not have seen issues with the shallow copy, but now that you are changing that field, are you not modifying the ec_relids field that still set in the parent's PlannerInfo? In this instance, you've sidestepped that issue by using bms_difference() which creates a copy of the Bitmapset and modifies that. I think you'd probably see issues if you tried to use bms_del_members(). I think not doing the deep copy is going to give other people making changes in this are a hard time in the future. > > 12. The following comment makes less sense now that you've modified > > the previous paragraph: > > > > + * Fortunately, the UPDATE/DELETE target can never be the nullable side of an > > + * outer join, so it's OK to generate the plan this way. > > > > This text used to refer to: > > > > but target inheritance has to be expanded at > > * the top. The reason is that for UPDATE, each target relation needs a > > * different targetlist matching its own column set. Fortunately, > > * the UPDATE/DELETE target can never be the nullable side of an outer join, > > * so it's OK to generate the plan this way. > > > > you no longer describe plan as being expanded from the top rather than > > at the bottom, which IMO is what "this way" refers to. > > To be honest, I never quite understood what that line really means, which > is why I kept it around. What I do know though is that, even with the > patched, we still generate subpaths for each child whose targetlist > patches the child, so I think the part about "this way" that prompted > someone to write that line still remains. Does that make sense to you? Not quite. I thought that "OK to generate the plan this way" is talking about expanding the children at the top of the plan, e.g. 
explain (costs off) update listp set b = b + 1 from t where listp.a = t.a; QUERY PLAN -------------------------------------- Update on listp Update on listp1 Update on listp2 -> Merge Join Merge Cond: (listp1.a = t.a) -> Sort Sort Key: listp1.a -> Seq Scan on listp1 -> Sort Sort Key: t.a -> Seq Scan on t -> Merge Join Merge Cond: (listp2.a = t.a) -> Sort Sort Key: listp2.a -> Seq Scan on listp2 -> Sort Sort Key: t.a -> Seq Scan on t i.e each non-pruned partition gets joined to the other tables (expanded from the top of the plan tree). The other tables appear once for each child target relation. instead of: postgres=# explain (costs off) select * from listp inner join t on listp.a = t.a; QUERY PLAN -------------------------------------- Merge Join Merge Cond: (t.a = listp1.a) -> Sort Sort Key: t.a -> Seq Scan on t -> Sort Sort Key: listp1.a -> Append -> Seq Scan on listp1 -> Seq Scan on listp2 where the children are expanded from the bottom (or at the leaf side of the plan tree), since we always plant our trees up-side-down, in computer science. In my view, since you removed the part about where the children are expanded then it does not quite make sense anymore. Of course, I might not be reading the comment as it was intended. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 12 Jan 2019 at 21:49, David Rowley <david.rowley@2ndquadrant.com> wrote: > > Can you say what you think is wrong with this way of making a copy of the ECs? > > If you shallow copy with memcpy(new_ec, ec, > sizeof(EquivalenceClass));, then fields such as ec_relids will just > point to the same memory as the parent PlannerInfo's > EquivalenceClasses, so when you do your adjustment (as above) on the > child eclasses, you'll modify memory belonging to the parent. I'd > assume you'd not want to do that since you need to keep the parent > intact at least to make copies for other child rels. You'd have > gotten away with it before since you performed a list_copy() and only > were deleting the parent's EMs with list_delete_cell() which was > modifying the copy, not the original list. Since you were missing the > alteration to ec_relids, then you might not have seen issues with the > shallow copy, but now that you are changing that field, are you not > modifying the ec_relids field that still set in the parent's > PlannerInfo? In this instance, you've sidestepped that issue by using > bms_difference() which creates a copy of the Bitmapset and modifies > that. I think you'd probably see issues if you tried to use > bms_del_members(). I think not doing the deep copy is going to give > other people making changes in this are a hard time in the future. Setting to waiting on author pending clarification about shallow vs deep copying of ECs. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
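The aliasing hazard described above can be illustrated with a small, self-contained C sketch. It is not PostgreSQL code: ToySet and ToyEC are hypothetical stand-ins for Bitmapset and EquivalenceClass, and the helper names are made up for illustration. It only shows why a memcpy-style shallow copy leaves pointer fields shared with the parent, so that an in-place update (in the spirit of bms_add_member) leaks into the parent, while a copying update (in the spirit of bms_difference) or a deep copy does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a Bitmapset: one heap-allocated word of relid bits. */
typedef struct ToySet
{
    unsigned long bits;
} ToySet;

/* Hypothetical stand-in for EquivalenceClass; only the field of interest. */
typedef struct ToyEC
{
    ToySet *relids;     /* pointer field: memcpy copies the pointer, not the set */
    int     nmembers;
} ToyEC;

ToySet *toyset_make(unsigned long bits)
{
    ToySet *s = malloc(sizeof(ToySet));

    assert(s != NULL);
    s->bits = bits;
    return s;
}

int toyset_has(const ToySet *s, int relid)
{
    return (s->bits >> relid) & 1UL;
}

/* In-place update: mutates the set the caller passed in. */
void toyset_add_inplace(ToySet *s, int relid)
{
    s->bits |= 1UL << relid;
}

/* Copying update: returns a freshly allocated set and leaves inputs untouched. */
ToySet *toyset_difference(const ToySet *a, const ToySet *b)
{
    return toyset_make(a->bits & ~b->bits);
}

/* Shallow copy: the new struct shares src->relids with the original. */
ToyEC *toyec_shallow_copy(const ToyEC *src)
{
    ToyEC *dst = malloc(sizeof(ToyEC));

    assert(dst != NULL);
    memcpy(dst, src, sizeof(ToyEC));
    return dst;
}

/* Deep copy: additionally duplicates the set that the pointer field refers to. */
ToyEC *toyec_deep_copy(const ToyEC *src)
{
    ToyEC *dst = toyec_shallow_copy(src);

    dst->relids = toyset_make(src->relids->bits);
    return dst;
}
```

With a shallow copy, calling toyset_add_inplace() on the child's relids also changes the parent's set, whereas assigning the result of toyset_difference() to the child's field, or starting from toyec_deep_copy(), leaves the parent intact.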
Thanks for the comments. On 2019/01/12 17:49, David Rowley wrote: > On Sat, 12 Jan 2019 at 02:00, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> That's an oversight. Fixed by making add_child_rel_equivalences do this: >> >> + cur_ec->ec_relids = bms_difference(cur_ec->ec_relids, >> + parent_rel->relids); >> + cur_ec->ec_relids = bms_add_members(cur_ec->ec_relids, >> + child_rel->relids); >> >>> >>> UPDATE: I see you're likely leaving this alone since you're only doing >>> a shallow copy of the eclasses in >>> adjust_inherited_target_child_root(). It seems like a pretty bad idea >>> to do a shallow copy there. >> >> So, you're talking about this code: >> >> /* >> * Child root should get its own copy of ECs, because they'll be modified >> * to replace parent EC expressions by child expressions in >> * add_child_rel_equivalences. >> */ >> subroot->eq_classes = NIL; >> foreach(lc, root->eq_classes) >> { >> EquivalenceClass *ec = lfirst(lc); >> EquivalenceClass *new_ec = makeNode(EquivalenceClass); >> >> memcpy(new_ec, ec, sizeof(EquivalenceClass)); >> new_ec->ec_members = list_copy(ec->ec_members); >> subroot->eq_classes = lappend(subroot->eq_classes, new_ec); >> } >> >> Can you say what you think is wrong with this way of making a copy of the ECs? > > If you shallow copy with memcpy(new_ec, ec, > sizeof(EquivalenceClass));, then fields such as ec_relids will just > point to the same memory as the parent PlannerInfo's > EquivalenceClasses, so when you do your adjustment (as above) on the > child eclasses, you'll modify memory belonging to the parent. I'd > assume you'd not want to do that since you need to keep the parent > intact at least to make copies for other child rels. You'd have > gotten away with it before since you performed a list_copy() and only > were deleting the parent's EMs with list_delete_cell() which was > modifying the copy, not the original list. 
Since you were missing the > alteration to ec_relids, then you might not have seen issues with the > shallow copy, but now that you are changing that field, are you not > modifying the ec_relids field that still set in the parent's > PlannerInfo? Yep. add_eq_member, when adding the child member to the child's copy of EC, will end up modifying ec_relids that it shares with the parent's copy, because it uses bms_add_member which modifies the input bitmapset in-place. > In this instance, you've sidestepped that issue by using > bms_difference() which creates a copy of the Bitmapset and modifies > that. I think you'd probably see issues if you tried to use > bms_del_members(). I think not doing the deep copy is going to give > other people making changes in this are a hard time in the future. I thought of inventing a _copyEquivalenceClass but noticed this in _copyPathKey: /* EquivalenceClasses are never moved, so just shallow-copy the pointer */ COPY_SCALAR_FIELD(pk_eclass); I think that won't be a problem in our case, because this comment seems to be talking about identical copies of a given EC. In our case, we're talking about making copies that are essentially different because of parent-to-child translation. However, providing a function in copyfuncs.c to copy ECs would make it sound like making identical copies of ECs is OK (which it apparently isn't), so I added the copying code in the place where the patch wants to use it. The new code takes care to make copies of all the fields that might change, not just ec_members. Also, realizing another problem with where the copying code is placed now in the patch (adjust_inherited_target_child_root), I have moved it to adjust_inherit_target_rel_sizes so that translation from parent-to-child of EC members works correctly. 
Concretely, the problem is that a level 2 partition's EC members won't be created with the patch's current way of initializing eq_classes in the partition subroots -- all initially get the copy of root parent's EC members, whereas translation can only occur between a partition and its immediate parent. Without EC members, a level 2 leaf partition will not be able to merge join. So, for example: explain update p set a = foo.a+1 from foo where p.a = foo.a; QUERY PLAN ────────────────────────────────────────────────────────────────────────── Update on p (cost=0.00..98556.15 rows=6535012 width=16) Update on p11 Update on p2 -> Nested Loop (cost=0.00..97614.88 rows=6502500 width=16) -> Seq Scan on p11 (cost=0.00..35.50 rows=2550 width=10) -> Materialize (cost=0.00..48.25 rows=2550 width=10) -> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=10) -> Merge Join (cost=359.57..941.28 rows=32512 width=16) Merge Cond: (p2.a = foo.a) -> Sort (cost=179.78..186.16 rows=2550 width=10) Sort Key: p2.a -> Seq Scan on p2 (cost=0.00..35.50 rows=2550 width=10) -> Sort (cost=179.78..186.16 rows=2550 width=10) Sort Key: foo.a -> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=10) vs. 
explain update p set a = foo.a+1 from foo where p.a = foo.a; QUERY PLAN ────────────────────────────────────────────────────────────────────────── Update on p (cost=359.57..1882.55 rows=65024 width=16) Update on p11 Update on p2 -> Merge Join (cost=359.57..941.28 rows=32512 width=16) Merge Cond: (p11.a = foo.a) -> Sort (cost=179.78..186.16 rows=2550 width=10) Sort Key: p11.a -> Seq Scan on p11 (cost=0.00..35.50 rows=2550 width=10) -> Sort (cost=179.78..186.16 rows=2550 width=10) Sort Key: foo.a -> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=10) -> Merge Join (cost=359.57..941.28 rows=32512 width=16) Merge Cond: (p2.a = foo.a) -> Sort (cost=179.78..186.16 rows=2550 width=10) Sort Key: p2.a -> Seq Scan on p2 (cost=0.00..35.50 rows=2550 width=10) -> Sort (cost=179.78..186.16 rows=2550 width=10) Sort Key: foo.a -> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=10) (19 rows) after fixing. Note that p11 is a partition of p1 which is partition of p. p2 is partition of p. >>> 12. The following comment makes less sense now that you've modified >>> the previous paragraph: >>> >>> + * Fortunately, the UPDATE/DELETE target can never be the nullable side of an >>> + * outer join, so it's OK to generate the plan this way. >>> >>> This text used to refer to: >>> >>> but target inheritance has to be expanded at >>> * the top. The reason is that for UPDATE, each target relation needs a >>> * different targetlist matching its own column set. Fortunately, >>> * the UPDATE/DELETE target can never be the nullable side of an outer join, >>> * so it's OK to generate the plan this way. >>> >>> you no longer describe plan as being expanded from the top rather than >>> at the bottom, which IMO is what "this way" refers to. >> >> To be honest, I never quite understood what that line really means, which >> is why I kept it around. 
What I do know though is that, even with the >> patched, we still generate subpaths for each child whose targetlist >> patches the child, so I think the part about "this way" that prompted >> someone to write that line still remains. Does that make sense to you? > > Not quite. I thought that "OK to generate the plan this way" is > talking about expanding the children at the top of the plan, e.g. > > explain (costs off) update listp set b = b + 1 from t where listp.a = t.a; > QUERY PLAN > -------------------------------------- > Update on listp > Update on listp1 > Update on listp2 > -> Merge Join > Merge Cond: (listp1.a = t.a) > -> Sort > Sort Key: listp1.a > -> Seq Scan on listp1 > -> Sort > Sort Key: t.a > -> Seq Scan on t > -> Merge Join > Merge Cond: (listp2.a = t.a) > -> Sort > Sort Key: listp2.a > -> Seq Scan on listp2 > -> Sort > Sort Key: t.a > -> Seq Scan on t > > i.e each non-pruned partition gets joined to the other tables > (expanded from the top of the plan tree). The other tables appear once > for each child target relation. > > instead of: > > postgres=# explain (costs off) select * from listp inner join t on > listp.a = t.a; > QUERY PLAN > -------------------------------------- > Merge Join > Merge Cond: (t.a = listp1.a) > -> Sort > Sort Key: t.a > -> Seq Scan on t > -> Sort > Sort Key: listp1.a > -> Append > -> Seq Scan on listp1 > -> Seq Scan on listp2 > > where the children are expanded from the bottom (or at the leaf side > of the plan tree), since we always plant our trees up-side-down, in > computer science. > > In my view, since you removed the part about where the children are > expanded then it does not quite make sense anymore. Thanks for explaining. I think my previous reply was confusing -- I wanted to say that we *do* still expand the children at the top, in the sense that the resulting plan shape hasn't changed due to the patch. 
I suspect that it's the resulting plan shape that "generating plan this way" refers to, which hasn't changed with the patch. It's true that the patch optimizes away repeated query_planner calls but only because it makes other arrangements to produce the same output as before. Attached updated patches with mainly fixes around EC copying as described above and other cosmetic fixes like comment updates. I've also moved around the code a bit, putting the new functions add_inherit_target_child_roots, etc. in inherit.c instead of planmain.c. Thanks, Amit
Attachment
- 0001-Some-fixups-for-b60c397599-v2.patch
- v14-0002-Overhaul-inheritance-update-delete-planning.patch
- v14-0003-Store-inheritance-root-parent-index-in-otherrel-.patch
- v14-0004-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v14-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v14-0006-Do-not-lock-all-partitions-at-the-beginning.patch
On 2019-Jan-11, Amit Langote wrote: > Looking around a bit more, I started thinking even build_child_join_sjinfo > doesn't belong in appendinfo.c (just to be clear, it was defined in > prepunion.c before this commit), so maybe we should move it to joinrels.c > and instead export adjust_child_relids that's required by it from > appendinfo.c. There's already adjust_child_relids_multilevel in > appendinfo.h, so having adjust_child_relids next to it isn't too bad. At > least not as bad as appendinfo.c exporting build_child_join_sjinfo for > joinrels.c to use. OK, pushed, thanks. I may have caused merge conflicts with the rest of the series, because I reordered some functions -- sorry about that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019/01/17 4:32, Alvaro Herrera wrote: > On 2019-Jan-11, Amit Langote wrote: > >> Looking around a bit more, I started thinking even build_child_join_sjinfo >> doesn't belong in appendinfo.c (just to be clear, it was defined in >> prepunion.c before this commit), so maybe we should move it to joinrels.c >> and instead export adjust_child_relids that's required by it from >> appendinfo.c. There's already adjust_child_relids_multilevel in >> appendinfo.h, so having adjust_child_relids next to it isn't too bad. At >> least not as bad as appendinfo.c exporting build_child_join_sjinfo for >> joinrels.c to use. > > OK, pushed, thanks. Thank you. > I may have caused merge conflicts with the rest of > the series, because I reordered some functions -- sorry about that. No problem. I have rebased the patches. Other than rebasing, I've squashed the patch to add inh_root_parent field to RelOptInfo which contained no other code changes to use that field into the patch that contains changes to start using it. Updated the commit message of lazy-partition-initialization patch to be accurate (it contained old details that have changed since I first wrote that commit message). Thanks, Amit
Attachment
Thank you Imai-san for testing. Sorry it totally slipped my mind to reply to this email. On 2019/01/09 11:08, Imai, Yoshikazu wrote: > I wonder why force_custom_plan is faster than auto after applied the patch. > > When we use PREPARE-EXECUTE, a generic plan is created and used if its cost is > cheaper than creating and using a custom plan with plan_cache_mode='auto', > while a custom plan is always created and used with plan_cache_mode='force_custom_plan'. > So one can think the difference in above results is because of creating or > using a generic plan. > > I checked how many times a generic plan is created during executing pgbench and > found a generic plan is created only once and custom plans are created at other > times with plan_cache_mode='auto'. I also checked the time of creating a > generic plan, but it didn't take so much(250ms or so with 4096 partitions). So > the time of creating a generic plan does not affect the performance. > > Currently, a generic plan is created at sixth time of executing EXECUTE query. > I changed it to more later (ex. at 400,000th time of executing EXECUTE query on > master with 4096 partitions, because 7000TPS x 60sec=420,0000 transactions are > run while executing pgbench.), then there are almost no difference between auto > and force_custom_plan. I think that creation of a generic plan affects the time > of executing queries which are ordered after creating generic plan. > > If my assumption is right, we can expect some additional process is occurred at > executing queries ordered after creating a generic plan, which results in auto is > slower than force_custom_plan because of additional process. But looking at > above results, on master with 4096 partitions, auto is faster than force_custom_plan. > So now I am confused. > > Do you have any ideas what does affect the performance? Are you saying that, when using auto mode, all executions of the query starting from 7th are slower than the first 5 executions? 
That is, the latency of creating and using a custom plan increases *after* a generic plan is created and discarded on the 6th execution of the query? If so, that is inexplicable to me. Thanks, Amit
Imai-san, From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] > On 2019/01/09 11:08, Imai, Yoshikazu wrote: > > I wonder why force_custom_plan is faster than auto after applied the patch. > > > > When we use PREPARE-EXECUTE, a generic plan is created and used if its > cost is > > cheaper than creating and using a custom plan with plan_cache_mode='auto', > > while a custom plan is always created and used with > plan_cache_mode='force_custom_plan'. > > So one can think the difference in above results is because of creating > or > > using a generic plan. > > > > I checked how many times a generic plan is created during executing pgbench > and > > found a generic plan is created only once and custom plans are created > at other > > times with plan_cache_mode='auto'. I also checked the time of creating > a > > generic plan, but it didn't take so much(250ms or so with 4096 partitions). > So > > the time of creating a generic plan does not affect the performance. > > > > Currently, a generic plan is created at sixth time of executing EXECUTE > query. > > I changed it to more later (ex. at 400,000th time of executing EXECUTE > query on > > master with 4096 partitions, because 7000TPS x 60sec=420,0000 > transactions are > > run while executing pgbench.), then there are almost no difference between > auto > > and force_custom_plan. I think that creation of a generic plan affects > the time > > of executing queries which are ordered after creating generic plan. > > > > If my assumption is right, we can expect some additional process is > occurred at > > executing queries ordered after creating a generic plan, which results > in auto is > > slower than force_custom_plan because of additional process. But looking > at > > above results, on master with 4096 partitions, auto is faster than > force_custom_plan. > > So now I am confused. > > > > Do you have any ideas what does affect the performance? 
> > Are you saying that, when using auto mode, all executions of the query > starting from 7th are slower than the first 5 executions? That is, the > latency of creating and using a custom plan increases *after* a generic > plan is created and discarded on the 6th execution of the query? If so, > that is inexplicable to me. Isn't CheckCachedPlan() (and AcquireExecutorLocks() therein) called in every EXECUTE after the 6th one due to some unknown issue? Does choose_custom_plan() always return true after the 6th EXECUTE? Regards, Takayuki Tsunakawa
Tsunakawa-san, On 2019/01/18 14:12, Tsunakawa, Takayuki wrote: > From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] >> Are you saying that, when using auto mode, all executions of the query >> starting from 7th are slower than the first 5 executions? That is, the >> latency of creating and using a custom plan increases *after* a generic >> plan is created and discarded on the 6th execution of the query? If so, >> that is inexplicable to me. > > Isn't CheckCachedPlan() (and AcquireExecutorLocks() therein) called in every EXECUTE after 6th one due to some unknown issue? CheckCachedPlan is only called if choose_custom_plan() returns false, resulting in a generic plan being created or reused. With plan_cache_mode = auto, I see it always returns true, because a custom plan containing a single partition to scan is way cheaper than the generic plan. > Does choose_custom_plan() always return true after 6th EXECUTE? Yes. Thanks, Amit
Hi Amit-san, On Wed, Jan 17, 2019 at 10:25 AM, Amit Langote wrote: > Thank you Imai-san for testing. Sorry it totally slipped my mind to reply to this email. Thanks for replying, and sorry for my late reply. I was in on-the-job training last week. > Are you saying that, when using auto mode, all executions of the query > starting from 7th are slower than the first 5 executions? That is, the > latency of creating and using a custom plan increases *after* a generic > plan is created and discarded on the 6th execution of the query? If so, > that is inexplicable to me. Yes. And it's also inexplicable to me. I'll check whether this is really correct by measuring the time of the first 5 queries, before the generic plan is created, and of the queries executed after the generic plan is created. Yoshikazu Imai
Tsunakawa-san, On Thu, Jan 18, 2019 at 5:29 AM, Amit Langote wrote: > On 2019/01/18 14:12, Tsunakawa, Takayuki wrote: ... > > Isn't CheckCachedPlan() (and AcquireExecutorLocks() therein) called > in every EXECUTE after 6th one due to some unknown issue? > > CheckCachedPlan is only called if choose_custom_plan() returns false > resulting in generic plan being created/reused. With plan_cache_mode > = auto, I see it always returns true, because a custom plan containing > a single partition to scan is way cheaper than the generic plan. > > > Does choose_custom_plan() always return true after 6th EXECUTE? > > Yes. Yes. I also confirmed that choose_custom_plan() always returns true except on the 6th EXECUTE, and that CheckCachedPlan() is called only on the 6th EXECUTE. Yoshikazu Imai
On Mon, 21 Jan 2019 at 13:45, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote: > On Wed, Jan 17, 2019 at 10:25 AM, Amit Langote wrote: > > Are you saying that, when using auto mode, all executions of the query > > starting from 7th are slower than the first 5 executions? That is, the > > latency of creating and using a custom plan increases *after* a generic > > plan is created and discarded on the 6th execution of the query? If so, > > that is inexplicable to me. > > Yes. And it's also inexplicable to me. > > I'll check if this fact is really correct by measuring the time of the > first 5 queries before generic plan is created and the other queries > after generic plan is created. It would be interesting to see the profiles of having the generic plan being built on the 6th execution vs the 400,000th execution. I'd thought maybe one difference would be the relcache hash table having been expanded greatly after the generic plan was created, but I see even the generic plan is selecting a random partition, so the cache would have ended up with that many items eventually anyway, and since we're talking in the region of 7.8k TPS with 4k partitions, it would only have taken about 2-3 seconds out of the 60 second run to hit most or all of those partitions anyway. So I doubt it could be that. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Imai-san, On 2019/01/21 9:45, Imai, Yoshikazu wrote: > I'll check if this fact is really correct by measuring the time of the > first 5 queries before generic plan is created and the other queries > after generic plan is created. That would really help. If you are able to recreate that behavior consistently, that might be something to try to fix at some point. Thanks, Amit
On 2019/01/17 11:17, Amit Langote wrote: > I have rebased the patches. Other than rebasing, I've squashed the patch > to add inh_root_parent field to RelOptInfo which contained no other code > changes to use that field into the patch that contains changes to start > using it. Updated the commit message of lazy-partition-initialization > patch to be accurate (it contained old details that have changed since I > first wrote that commit message). Rebased again due to 8d8dcead1295. Thanks, Amit
Attachment
Hi, On 1/21/19 4:01 AM, Amit Langote wrote: > Rebased again due to 8d8dcead1295. > Passes check-world on 8d8dcead1295. I ran a workload on the series, measuring the time to load the data plus the run time to complete the workload.

            Load     Run
master      3m39s    144s
1778        3m12s    36s
NP [*]      2m20s    11s

[*] Non-partitioned case.

Best regards, Jesper
I measured the latency of queries executed before and after creating the generic plan with master + v15-patch. In this test, the table is partitioned into 4k partitions. I executed 400,001 queries with pgbench. I changed the point at which the generic plan is created to the 1st, 10,001st, 20,001st, 50,001st, ..., 390,001st execution by modifying the source code. I ran the test with both plan_cache_mode = auto and plan_cache_mode = force_custom_plan.

The results are below. The "before" column shows the average latency, in microseconds, of queries executed before the generic plan is created. The "after" column shows the average latency, in microseconds, of queries executed after the generic plan is created.

[auto]
generic plan created at | before[usec] | after[usec]
1st                     |          531 |         142
10,001st                |          144 |         141
20,001st                |          141 |         144
50,001st                |          133 |         140
200,001st               |          131 |         143
390,001st               |          130 |         138

[force_custom_plan]
generic plan created at | before[usec] | after[usec]
1st                     |              |
10,001st                |          144 |         129
20,001st                |          137 |         131
50,001st                |          133 |         134
200,001st               |          131 |         133
390,001st               |          132 |         131

* A generic plan is actually not created with plan_cache_mode = force_custom_plan.

Looking at the results of force_custom_plan, the latency of the first 10,000 transactions is 144 microsec, and the latency of the first 50,000 transactions is 133 microsec. I think that is because the relcache hash table is not yet hot during the early transactions (as David mentioned?).

Comparing the latencies in the "after" column between auto and force_custom_plan, the auto values are about 8% higher than the force_custom_plan ones. That is, it seems that creating the generic plan affects the latency of queries executed after it is created.

On Mon, Jan 21, 2019 at 1:32 AM, David Rowley wrote:
> It would be interesting to see the profiles of having the generic plan
> being built on the 6th execution vs the 400,000th execution.
>
> I'd thought maybe one difference would be the relcache hash table
> having been expanded greatly after the generic plan was created

Does it mean that, in queries executed after the generic plan was created, searching for an entry in the relcache hash table becomes slower and that increases the latency?

> but I
> see even the generic plan is selecting a random partition, so the
> cache would have ended up with that many items eventually anyway, and
> since we're talking in the odds of 7.8k TPS with 4k partitions, it
> would have only have taken about 2-3 seconds out of the 60 second run
> to hit most or all of those partitions anyway.

And does it mean that even if we execute a lot of custom plans without creating the generic plan, the cache would end up at the same size as it has after creating the generic plan?

Anyway, I'll check the relcache size. Since I don't know how to get a profile at the exact time the generic plan is built, I'll use the MemoryContextStats(MemoryContext*) function to check the relcache size before and after building the generic plan, and after executing a lot of custom plans.

Yoshikazu Imai
Imai-san,

If the time for EXECUTE differs clearly before and after the creation of the generic plan, why don't you try to count the function calls for each EXECUTE? I haven't tried it, but you can probably do it with SystemTap, like this:

    probe process("your_postgres_path").function("*").call {
        /* accumulate the call count in an associative array */
    }

Then, sort the functions by their call counts. You may find some notable difference between the 5th and 7th EXECUTE.

Regards
Takayuki Tsunakawa
Tsunakawa-san,

Thanks for the information. I haven't used it yet, but I used perf to look at the difference before and after the creation of the generic plan, and I noticed that the usage of hash_seq_search() increases by about 3% in EXECUTE queries after the creation of the generic plan.

What I understand so far is that about 10,000 while-loop iterations in total (4098+4098+some extra) are needed in hash_seq_search() in an EXECUTE query after the creation of the generic plan. 10,000 while-loop iterations take about 10 microsec (of course, we can't estimate the exact time), and the difference in latency between the 5th and 7th EXECUTE is about 8 microsec, so I currently think this causes the difference.

I don't know whether this problem relates to Amit-san's patch, but I'll continue to investigate it.

Yoshikazu Imai
On Tue, 22 Jan 2019 at 20:01, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote:
> I didn't use it yet, but I just used perf to clarify the difference of before and after the creation of the generic plan, and I noticed that usage of hash_seq_search() is increased about 3% in EXECUTE queries after the creation of the generic plan.
>
> What I understand so far is about 10,000 while loops at total (4098+4098+some extra) is needed in hash_seq_search() in EXECUTE query after the creation of the generic plan.
> 10,000 while loops takes about 10 microsec (of course, we can't estimate correct time), and the difference of the latency between 5th and 7th EXECUTE is about 8 microsec, I currently think this causes the difference.
>
> I don't know this problem relates to Amit-san's patch, but I'll continue to investigate it.

I had another thought... when you're making a custom plan you're only grabbing locks on partitions that were not pruned (just 1 partition in your case), but when making the generic plan, locks will be acquired on all partitions (all 4000 of them). This likely means that when building the generic plan for the first time that the LockMethodLocalHash table is expanded to fit all those locks, and since we never shrink those down again, it'll remain that size for the rest of your run. I imagine the reason for the slowdown is that during LockReleaseAll(), a sequential scan is performed over the entire hash table. I see from looking at the hash_seq_search() code that the value of max_bucket is pretty critical to how it'll perform. The while ((curElem = segp[segment_ndx]) == NULL) loop will need to run fewer times with a lower max_bucket.

I just did a quick and crude test by throwing the following line into the end of hash_seq_init():

    elog(NOTICE, "%s %u", hashp->tabname, hashp->hctl->max_bucket);

With a 5000 partition table I see:

    postgres=# set plan_cache_mode = 'auto';
    postgres=# prepare q1 (int) as select * from listp where a = $1;
    postgres=# explain analyze execute q1(1);
    NOTICE: Portal hash 15
    NOTICE: LOCALLOCK hash 15
    NOTICE: LOCALLOCK hash 15
    ...
    <skip forward to the 6th execution>

    postgres=# explain analyze execute q1(1);
    NOTICE: LOCALLOCK hash 5002
    NOTICE: Portal hash 15
    NOTICE: LOCALLOCK hash 5002
    NOTICE: LOCALLOCK hash 5002
                                                   QUERY PLAN
    --------------------------------------------------------------------------------------------------------
     Append (cost=0.00..41.94 rows=13 width=4) (actual time=0.005..0.005 rows=0 loops=1)
       -> Seq Scan on listp1 (cost=0.00..41.88 rows=13 width=4) (actual time=0.005..0.005 rows=0 loops=1)
            Filter: (a = 1)
     Planning Time: 440.822 ms
     Execution Time: 0.017 ms
    (5 rows)

I've not gone as far to try to recreate the performance drop you've mentioned but I believe there's a very high chance that this is the cause, and if so it's not for Amit to do anything with on this patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
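A minimal sketch of the effect described above, assuming a toy chained hash table rather than PostgreSQL's dynahash: a sequential scan has to probe every bucket slot, so its cost follows the table size (max_bucket + 1) and not the number of entries currently stored. The names ToyEntry and toy_seq_scan are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy bucket: a singly linked chain of entries (NULL when empty). */
typedef struct ToyEntry
{
    struct ToyEntry *next;
} ToyEntry;

/*
 * Walk every bucket of a toy hash table, as a sequential scan must,
 * and return the number of bucket slots inspected.  *nfound receives
 * the number of entries actually present.
 */
size_t
toy_seq_scan(ToyEntry *const *buckets, size_t nbuckets, size_t *nfound)
{
    size_t      visited = 0;
    size_t      found = 0;

    assert(buckets != NULL && nfound != NULL);

    for (size_t i = 0; i < nbuckets; i++)
    {
        visited++;              /* an empty bucket still costs one probe */
        for (const ToyEntry *e = buckets[i]; e != NULL; e = e->next)
            found++;
    }

    *nfound = found;
    return visited;
}
```

With one entry stored, a 16-slot table (max_bucket 15) costs 16 probes per scan, while a 5003-slot table (max_bucket 5002) costs 5003 probes to find the same single entry. That matches the shape of the slowdown reported after the generic plan has expanded the table.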
On 2019/01/22 09:47, David Rowley wrote:
> On Tue, 22 Jan 2019 at 20:01, Imai, Yoshikazu
> <imai.yoshikazu@jp.fujitsu.com> wrote:
>> I didn't use it yet, but I just used perf to clarify the difference of before and after the creation of the generic plan, and I noticed that usage of hash_seq_search() is increased about 3% in EXECUTE queries after the creation of the generic plan.
>>
>> What I understand so far is about 10,000 while loops at total (4098+4098+some extra) is needed in hash_seq_search() in EXECUTE query after the creation of the generic plan.
>> 10,000 while loops takes about 10 microsec (of course, we can't estimate correct time), and the difference of the latency between 5th and 7th EXECUTE is about 8 microsec, I currently think this causes the difference.
>>
>> I don't know this problem relates to Amit-san's patch, but I'll continue to investigate it.
>
> I had another thought... when you're making a custom plan you're only
> grabbing locks on partitions that were not pruned (just 1 partition in
> your case), but when making the generic plan, locks will be acquired
> on all partitions (all 4000 of them). This likely means that when
> building the generic plan for the first time that the
> LockMethodLocalHash table is expanded to fit all those locks, and
> since we never shrink those down again, it'll remain that size for the
> rest of your run. I imagine the reason for the slowdown is that
> during LockReleaseAll(), a sequential scan is performed over the
> entire hash table. I see from looking at the hash_seq_search() code
> that the value of max_bucket is pretty critical to how it'll perform.
> The while ((curElem = segp[segment_ndx]) == NULL) loop will need to
> run fewer times with a lower max_bucket.

Thank you very much for the detailed explanation. Yes, the 10,000 while-loop iterations I mentioned are exactly the scan of the entire hash table in hash_seq_search(), which is called in LockReleaseAll(). I also checked that expand_table(), which increases max_bucket, is called while building the generic plan.

> I just did a quick and crude test by throwing the following line into
> the end of hash_seq_init():
>
> elog(NOTICE, "%s %u", hashp->tabname, hashp->hctl->max_bucket);
>
> With a 5000 partition table I see:
>
> postgres=# set plan_cache_mode = 'auto';
> postgres=# prepare q1 (int) as select * from listp where a = $1;
> postgres=# explain analyze execute q1(1);
> NOTICE: Portal hash 15
> NOTICE: LOCALLOCK hash 15
> NOTICE: LOCALLOCK hash 15
> ...
> <skip forward to the 6th execution>
>
> postgres=# explain analyze execute q1(1);
> NOTICE: LOCALLOCK hash 5002
> NOTICE: Portal hash 15
> NOTICE: LOCALLOCK hash 5002
> NOTICE: LOCALLOCK hash 5002
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------
> Append (cost=0.00..41.94 rows=13 width=4) (actual time=0.005..0.005
> rows=0 loops=1)
> -> Seq Scan on listp1 (cost=0.00..41.88 rows=13 width=4) (actual
> time=0.005..0.005 rows=0 loops=1)
> Filter: (a = 1)
> Planning Time: 440.822 ms
> Execution Time: 0.017 ms
> (5 rows)
>
> I've not gone as far to try to recreate the performance drop you've
> mentioned but I believe there's a very high chance that this is the
> cause, and if so it's not for Amit to do anything with on this patch.

Since expand_table() is called if we need more locks than max_bucket, I think the performance would also drop after executing a query like "select * from listp where a >= 1 and a <= 5000", but I've not tried that either.

> cause, and if so it's not for Amit to do anything with on this patch.

+1

One concern I have is that force_custom_plan is slower than auto in master, which is the opposite of the result with the patch applied. I am investigating it, but its cause would also be somewhere other than Amit's patch.

Thanks
--
Yoshikazu Imai
On 2019/01/22 18:47, David Rowley wrote:
> On Tue, 22 Jan 2019 at 20:01, Imai, Yoshikazu
>> What I understand so far is about 10,000 while loops at total (4098+4098+some extra) is needed in hash_seq_search() in EXECUTE query after the creation of the generic plan.
>> 10,000 while loops takes about 10 microsec (of course, we can't estimate correct time), and the difference of the latency between 5th and 7th EXECUTE is about 8 microsec, I currently think this causes the difference.
>>
>> I don't know this problem relates to Amit-san's patch, but I'll continue to investigate it.
>
> I had another thought... when you're making a custom plan you're only
> grabbing locks on partitions that were not pruned (just 1 partition in
> your case), but when making the generic plan, locks will be acquired
> on all partitions (all 4000 of them). This likely means that when
> building the generic plan for the first time that the
> LockMethodLocalHash table is expanded to fit all those locks, and
> since we never shrink those down again, it'll remain that size for the
> rest of your run. I imagine the reason for the slowdown is that
> during LockReleaseAll(), a sequential scan is performed over the
> entire hash table. I see from looking at the hash_seq_search() code
> that the value of max_bucket is pretty critical to how it'll perform.
> The while ((curElem = segp[segment_ndx]) == NULL) loop will need to
> run fewer times with a lower max_bucket.

I too think that that might be behind that slight drop in performance. So, it's good to know what one of the performance bottlenecks is when dealing with large number of relations in queries.

Thanks,
Amit
On 2019/01/21 18:01, Amit Langote wrote:
> On 2019/01/17 11:17, Amit Langote wrote:
>> I have rebased the patches. Other than rebasing, I've squashed the patch
>> to add inh_root_parent field to RelOptInfo which contained no other code
>> changes to use that field into the patch that contains changes to start
>> using it. Updated the commit message of lazy-partition-initialization
>> patch to be accurate (it contained old details that have changed since I
>> first wrote that commit message).
>
> Rebased again due to 8d8dcead1295.

Rebased due to the heap_open/close() -> table_open/close() change.

Thanks,
Amit
On Wed, Jan 23, 2019 at 1:35 AM, Amit Langote wrote:
> Rebased due to the heap_open/close() -> table_open/close() change.

There may not be many things I can point out by reviewing the patch, so I ran a performance test against the v17 patches instead of reviewing the code. There are already a lot of tests of the partition pruning case, and we confirmed that performance improves in those cases. This time, I tested the case where all partitions are accessed.

I tested with master, master + 0001, master + 0001 + 0002, ..., master + 0001 + 0002 + 0003 + 0004. I ran pgbench 3 times for each test case, and the results below are the averages of those runs.

[postgresql.conf]
max_parallel_workers = 0
max_parallel_workers_per_gather = 0

[partition table definitions(8192 partitions case)]
create table rt (a int, b int, c int) partition by range (a);
create table rt_1 partition of rt for values from (1) to (2);
...
create table rt_8192 partition of rt for values from (8191) to (8192);

[pgbench commands]
pgbench -n -f update.sql -T 30 postgres

[update.sql(updating partkey case)]
update rt set a = 1;

[update.sql(updating non-partkey case)]
update rt set b = 1;

[results]
updating partkey case:

part-num   master     0001     0002     0003     0004
       1  8215.34  7924.99  7931.15  8407.40  8475.65
       2  7137.49  7026.45  7128.84  7583.08  7593.73
       4  5880.54  5896.47  6014.82  6405.33  6398.71
       8  4222.96  4446.40  4518.54  4802.43  4785.82
      16  2634.91  2891.51  2946.99  3085.81  3087.91
      32   935.12  1125.28  1169.17  1199.44  1202.04
      64   352.37   405.27   417.09   425.78   424.53
     128   236.26   310.01   307.70   315.29   312.81
     256    65.36    86.84    87.67    84.39    89.27
     512    18.34    24.84    23.55    23.91    23.91
    1024     4.83     6.93     6.51     6.45     6.49

updating non-partkey case:

part-num   master     0001     0002     0003      0004
       1  8862.58  8421.49  8575.35  9843.71  10065.30
       2  7715.05  7575.78  7654.28  8800.84   8720.60
       4  6249.95  6321.32  6470.26  7278.14   7280.10
       8  4514.82  4730.48  4823.37  5382.93   5341.10
      16  2815.21  3123.27  3162.51  3422.36   3393.94
      32   968.45  1702.47  1722.38  1809.89   1799.88
      64   364.17   420.48   432.87   440.20    435.31
     128   119.94   148.77   150.47   152.18    143.35
     256    45.09    46.35    46.93    48.30     45.85
     512     8.74    10.59    10.23    10.27     10.13
    1024     2.28     2.60     2.56     2.57      2.51

Looking at the results, if we apply only 0001 or 0001 + 0002 and the number of partitions is small, like 1 or 2, performance degrades compared to master (the maximum reduction is about 5%, which is 8863->8421). In all other cases, performance improves compared to master.

--
Yoshikazu Imai
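As a small worked check of the "about 5%" figure, here is a sketch of the relative-change arithmetic applied to two rows of the tables above. The function name pct_change is hypothetical and not part of any tool used in the thread.

```c
#include <assert.h>

/*
 * Relative change of a patched result against the master result,
 * in percent.  Negative means the patched build is slower.
 */
double
pct_change(double master_tps, double patched_tps)
{
    assert(master_tps > 0.0);
    return (patched_tps - master_tps) / master_tps * 100.0;
}
```

For the non-partkey case with 1 partition, pct_change(8862.58, 8421.49) comes out at roughly -5%, which agrees with the reduction quoted above. For the partkey case with 1024 partitions, pct_change(4.83, 6.93) is roughly +43%.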
On Thu, Jan 24, 2019 at 6:10 AM, Imai, Yoshikazu wrote:
> updating partkey case:
>
> part-num   master     0001     0002     0003     0004
>        1  8215.34  7924.99  7931.15  8407.40  8475.65
>        2  7137.49  7026.45  7128.84  7583.08  7593.73
>        4  5880.54  5896.47  6014.82  6405.33  6398.71
>        8  4222.96  4446.40  4518.54  4802.43  4785.82
>       16  2634.91  2891.51  2946.99  3085.81  3087.91
>       32   935.12  1125.28  1169.17  1199.44  1202.04
>       64   352.37   405.27   417.09   425.78   424.53
>      128   236.26   310.01   307.70   315.29   312.81
>      256    65.36    86.84    87.67    84.39    89.27
>      512    18.34    24.84    23.55    23.91    23.91
>     1024     4.83     6.93     6.51     6.45     6.49

I also tested the non-partitioned table case.

updating partkey case:

part-num   master     0001     0002     0003     0004
       0  10956.7  10370.5  10472.6  10571.0  10581.5
       1  8215.34  7924.99  7931.15  8407.40  8475.65
     ...
    1024     4.83     6.93     6.51     6.45     6.49

In my performance results, update performance seems to degrade in the non-partitioned case with the v17 patch applied. But this degradation does not seem to have happened with the v2 patch.

On Thu, Aug 30, 2018 at 1:45 AM, Amit, Langote wrote:
> UPDATE:
>
> nparts  master    0001    0002    0003
> ======  ======    ====    ====    ====
> 0         2856    2893    2862    2816

Does this degradation occur only in my tests? Or, if this result is correct, what may cause the degradation?

--
Yoshikazu Imai
On Sat, 12 Jan 2019 at 02:00, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/01/09 9:09, David Rowley wrote:
> > postgres=# update parent set c = c where a = 333;
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >
> > I didn't look into what's causing the crash.
>
> I tried your example, but it didn't crash for me:
>
> explain update parent set c = c where a = 333;
> QUERY PLAN
> ────────────────────────────────────────────────────
> Update on parent (cost=0.00..0.00 rows=0 width=0)
> -> Result (cost=0.00..0.00 rows=0 width=54)
> One-Time Filter: false
> (3 rows)

I had a closer look. The crash is due to set_inherit_target_rel_sizes() forgetting to set has_live_children to false. This results in the relation not properly being set to a dummy rel and the code then making a modify table node without any subnodes. That crashes due to getTargetResultRelInfo() returning NULL due to rootResultRelInfo and resultRelInfo both being NULL.

The attached fixes it. If you were not seeing the crash then has_live_children must have been zero/false by chance during your test.

A simple case of:

    create table listp (a int, b int) partition by list(a);
    create table listp1 partition of listp for values in(1);
    update listp set b = b + 1 where a = 42;

was crashing for me.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
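The class of bug described above can be sketched in isolation. This is only an illustration of the pattern, assuming a flag that is set inside a per-child loop and read afterwards. It is not the attached fix, and any_live_children is a hypothetical name, not a planner function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when at least one child survives pruning.  The flag must
 * start out false: if every child is pruned the loop body never assigns
 * it, and without the initialisation the caller would read whatever
 * happened to be on the stack.
 */
bool
any_live_children(const bool *child_is_pruned, size_t nchildren)
{
    bool        has_live_children = false;  /* the easily forgotten part */

    assert(child_is_pruned != NULL || nchildren == 0);

    for (size_t i = 0; i < nchildren; i++)
    {
        if (child_is_pruned[i])
            continue;           /* pruned children contribute nothing */
        has_live_children = true;
    }

    return has_live_children;
}
```

An uninitialised flag of this kind would also explain why the crash depends on the test: the stale stack value can be zero in one session and nonzero in another.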
Imai-san,

Thanks for testing.

On 2019/01/24 15:09, Imai, Yoshikazu wrote:
> [pgbench commands]
> pgbench -n -f update.sql -T 30 postgres
>
> [update.sql(updating partkey case)]
> update rt set a = 1;
>
> [update.sql(updating non-partkey case)]
> update rt set b = 1;
>
> [results]
> updating partkey case:
>
> part-num   master     0001     0002     0003     0004
>        1  8215.34  7924.99  7931.15  8407.40  8475.65
>        2  7137.49  7026.45  7128.84  7583.08  7593.73
>        4  5880.54  5896.47  6014.82  6405.33  6398.71
>        8  4222.96  4446.40  4518.54  4802.43  4785.82
>       16  2634.91  2891.51  2946.99  3085.81  3087.91
>       32   935.12  1125.28  1169.17  1199.44  1202.04
>       64   352.37   405.27   417.09   425.78   424.53
>      128   236.26   310.01   307.70   315.29   312.81
>      256    65.36    86.84    87.67    84.39    89.27
>      512    18.34    24.84    23.55    23.91    23.91
>     1024     4.83     6.93     6.51     6.45     6.49
>
>
> updating non-partkey case:
>
> part-num   master     0001     0002     0003      0004
>        1  8862.58  8421.49  8575.35  9843.71  10065.30
>        2  7715.05  7575.78  7654.28  8800.84   8720.60
>        4  6249.95  6321.32  6470.26  7278.14   7280.10
>        8  4514.82  4730.48  4823.37  5382.93   5341.10
>       16  2815.21  3123.27  3162.51  3422.36   3393.94
>       32   968.45  1702.47  1722.38  1809.89   1799.88
>       64   364.17   420.48   432.87   440.20    435.31
>      128   119.94   148.77   150.47   152.18    143.35
>      256    45.09    46.35    46.93    48.30     45.85
>      512     8.74    10.59    10.23    10.27     10.13
>     1024     2.28     2.60     2.56     2.57      2.51
>
>
> Looking at the results, if we only apply 0001 or 0001 + 0002 and if number of partition is few like 1 or 2, performance degrades compare to master(A maximum reduction is about 5%, which is 8863->8421).
> In all other cases, performance improves compare to master.

Just to be clear, these are cases where pruning *doesn't* occur, though we'll still need to at least figure out why the degradation occurs for small number of partitions.

Thanks,
Amit
Hi David,

On 2019/01/28 13:18, David Rowley wrote:
> On Sat, 12 Jan 2019 at 02:00, Amit Langote wrote:
>> On 2019/01/09 9:09, David Rowley wrote:
>>> postgres=# update parent set c = c where a = 333;
>>> server closed the connection unexpectedly
>>> This probably means the server terminated abnormally
>>> before or while processing the request.
>>>
>>> I didn't look into what's causing the crash.
>>
>> I tried your example, but it didn't crash for me:
>>
>> explain update parent set c = c where a = 333;
>> QUERY PLAN
>> ────────────────────────────────────────────────────
>> Update on parent (cost=0.00..0.00 rows=0 width=0)
>> -> Result (cost=0.00..0.00 rows=0 width=54)
>> One-Time Filter: false
>> (3 rows)
>
> I had a closer look. The crash is due to
> set_inherit_target_rel_sizes() forgetting to set has_live_children to
> false. This results in the relation not properly being set to a dummy
> rel and the code then making a modify table node without any subnodes.
> That crashes due to getTargetResultRelInfo() returning NULL due to
> rootResultRelInfo and resultRelInfo both being NULL.

Oops, you're right.

> The attached fixes it. If you were not seeing the crash then
> has_live_children must have been zero/false by chance during your
> test.

Thanks for the fix, I'll incorporate it in the next version I'll post by tomorrow.

Regards,
Amit
On Mon, 28 Jan 2019 at 21:21, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for the fix, I'll incorporate it in the next version I'll post by
> tomorrow.

I just started reading 0001 again. I made a few notes:

1. Should use bms_del_members() instead of bms_difference() if you don't mind modifying in place, which it sounds like you don't, going by the comment.

    /*
     * Now fix up EC's relids set. It's OK to modify EC like this,
     * because caller must have made a copy of the original EC.
     * For example, see adjust_inherited_target_child_root.
     */
    cur_ec->ec_relids = bms_difference(cur_ec->ec_relids,
                                       parent_rel->relids);
    cur_ec->ec_relids = bms_add_members(cur_ec->ec_relids,
                                        child_rel->relids);

2. In the comment:

    /*
     * Now fix up EC's relids set. It's OK to modify EC like this,
     * because caller must have made a copy of the original EC.
     * For example, see adjust_inherited_target_child_root.
     */

You mention that the caller must have made a copy, but the header comment mentions nothing about that to warn callers.

3. Would it be better to do the following in make_rel_from_joinlist()?

    /*
     * Check that we got at least one usable path. In the case of an
     * inherited update/delete operation, no path has been created for
     * the query's actual target relation yet.
     */
    if (!root->inherited_update &&
        (!final_rel ||
         !final_rel->cheapest_total_path ||
         final_rel->cheapest_total_path->param_info != NULL))
        elog(ERROR, "failed to construct the join relation");

That way the test is also performed for each partition's join problem and you don't need that weird special case to disable the check.

4. Do you think it would be nicer if inheritance_make_rel_from_joinlist returned rel?

    inheritance_make_rel_from_joinlist(root, joinlist);

    /*
     * Return the RelOptInfo of original target relation, although this
     * doesn't really contain the final path. inheritance_planner
     * from where we got here will generate the final path, but it will
     * do so by iterative over child subroots, not through this
     * RelOptInfo.
     */
    rel = find_base_rel(root, root->parse->resultRelation);

5. "iterative over child subroots" -> "iterating over the child subroots".

6. In set_inherit_target_rel_sizes() the !bms_is_member(appinfo->child_relid, live_children) check could be moved much earlier, likely just after the if (appinfo->parent_relid != parentRTindex) check.

I understand you're copying set_append_rel_size() here, but I don't quite understand the reason why that function is doing it that way. Seems like wasted effort building the quals for pruned partitions.

7. In set_inherit_target_rel_sizes() I still don't really like the way you're adjusting the EquivalenceClasses. Would it not be better to invent a function similar to adjust_appendrel_attrs(), or just use that function?

8. Still a typo in this comment:

    + * ass dummy. We must do this in this phase so that the rel's

9. "parent's" -> "the parent's"

    /* Also propagate this child's own children into parent's list. */

10. Too many "of":

    * For each child of of the query's result relation, this translates the

11. Badly formed multiline comment.

    /* Save the just translated targetlist as unexpanded_tlist in the child's
     * subroot, so that this child's own children can use it. Must use copy

12. DEELETE -> DELETE

    /*
     * The following fields are set during query planning portion of an
     * inherited UPDATE/DEELETE operation.
     */

13. Surely this is not true?

    * non-NULL. Content of each PlannerInfo is same as the parent
    * PlannerInfo, except for the parse tree which is a translated copy of
    * the parent's parse tree.

Are you not modifying the EquivalenceClasses?

14. This comment plus the variable name is confusing me:

    /*
     * RelOptInfos corresponding to each child target rel. For leaf children,
     * it's the RelOptInfo representing the output of make_rel_from_joinlist()
     * called with the parent rel in the original join tree replaced by a
     * given leaf child. For non-leaf children, it's the baserel RelOptInfo
     * itself, left as a placeholder.
     */
    List *inh_target_child_joinrels;

The variable name mentions joinrels, but the comment target rels. A join rel can't be the target of an UPDATE/DELETE.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2019/01/29 11:23, David Rowley wrote:
> On Mon, 28 Jan 2019 at 21:21, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks for the fix, I'll incorporate it in the next version I'll post by
>> tomorrow.
>
> I just started reading 0001 again. I made a few notes:

Thanks.

> 1. Should bms_del_members() instead of bms_difference() if you don't
> mind modifying in place, which it sounds like you don't, going by the
> comment.
>
> /*
> * Now fix up EC's relids set. It's OK to modify EC like this,
> * because caller must have made a copy of the original EC.
> * For example, see adjust_inherited_target_child_root.
> */
> cur_ec->ec_relids = bms_difference(cur_ec->ec_relids,
> parent_rel->relids);
> cur_ec->ec_relids = bms_add_members(cur_ec->ec_relids,
> child_rel->relids);

Fixed.

> 2. In the comment:
>
> /*
> * Now fix up EC's relids set. It's OK to modify EC like this,
> * because caller must have made a copy of the original EC.
> * For example, see adjust_inherited_target_child_root.
> */
>
> You mention that the caller must have made a copy, but the header
> comment mentions nothing about that to warn callers.

OK, added a sentence.

> 3. Would it be better to do the following in make_rel_from_joinlist()?
>
> /*
> * Check that we got at least one usable path. In the case of an
> * inherited update/delete operation, no path has been created for
> * the query's actual target relation yet.
> */
> if (!root->inherited_update &&
> (!final_rel ||
> !final_rel->cheapest_total_path ||
> final_rel->cheapest_total_path->param_info != NULL))
> elog(ERROR, "failed to construct the join relation");
>
> That way the test is also performed for each partition's join problem
> and you don't need that weird special case to disable the check.

Moving it inside make_rel_from_joinlist() seems a bit awkward as there are many return statements, with those appearing earlier in the function being for fast-path cases. How about in the direct callers of make_rel_from_joinlist() viz. make_one_rel() and inheritance_make_rel_from_joinlist()?

> 4. Do you think it would be nicer if
> inheritance_make_rel_from_joinlist returned rel?
>
> inheritance_make_rel_from_joinlist(root, joinlist);
>
> /*
> * Return the RelOptInfo of original target relation, although this
> * doesn't really contain the final path. inheritance_planner
> * from where we got here will generate the final path, but it will
> * do so by iterative over child subroots, not through this
> * RelOptInfo.
> */
> rel = find_base_rel(root, root->parse->resultRelation);

OK, done. It now returns the RelOptInfo of root->parse->resultRelation.

> 5. "iterative over child subroots" -> "iterating over the child subroots".

Sentence no longer exists after above adjustments.

> 6. In set_inherit_target_rel_sizes() the
> !bms_is_member(appinfo->child_relid, live_children) check could be
> moved much earlier, likely just after the if (appinfo->parent_relid !=
> parentRTindex) check.
>
> I understand you're copying set_append_rel_size() here, but I don't
> quite understand the reason why that function is doing it that way.
> Seems like wasted effort building the quals for pruned partitions.

Yeah, fixed that in both functions. Actually, as you might be aware, the next patch obviates the need for this stanza at all, because pruned partitions aren't added to root->append_rel_list.

> 7. In set_inherit_target_rel_sizes() I still don't really like the way
> you're adjusting the EquivalenceClasses. Would it not be better to
> invent a function similar to adjust_appendrel_attrs(), or just use
> that function?

OK, I added a function copy_eq_classes_for_child_root() that simply makes a copy of eq_classes from the source root (deep-copying where applicable) and returns the list.

> 8. Still a typo in this comment:
>
> + * ass dummy. We must do this in this phase so that the rel's

Sorry, fixed.

> 9. "parent's" -> "the parent's"
>
> /* Also propagate this child's own children into parent's list. */
>
> 10. Too many "of":
>
> * For each child of of the query's result relation, this translates the
>
> 11. Badly formed multiline comment.
>
> /* Save the just translated targetlist as unexpanded_tlist in the child's
> * subroot, so that this child's own children can use it. Must use copy
>
> 12. DEELETE -> DELETE
>
> /*
> * The following fields are set during query planning portion of an
> * inherited UPDATE/DEELETE operation.
> */
>
> 13. Surely this is not true?
>
> * non-NULL. Content of each PlannerInfo is same as the parent
> * PlannerInfo, except for the parse tree which is a translated copy of
> * the parent's parse tree.
>
> Are you not modifying the EquivalenceClasses?

Yep, fixed. Also fixed items 9-12.

> 14. This comment plus the variable name is confusing me:
>
> /*
> * RelOptInfos corresponding to each child target rel. For leaf children,
> * it's the RelOptInfo representing the output of make_rel_from_joinlist()
> * called with the parent rel in the original join tree replaced by a
> * given leaf child. For non-leaf children, it's the baserel RelOptInfo
> * itself, left as a placeholder.
> */
> List *inh_target_child_joinrels;
>
> The variable name mentions joinrels, but the comment target rels. A
> join rel can't be the target of an UPDATE/DELETE.

How about inh_target_child_path_rels? The final ModifyTable's child "subpaths" have to hang off of some RelOptInfo until we've done the final grouping_planner() steps. Those RelOptInfos are stored here.

Attached updated patches.

Thanks,
Amit

PS: replies might be slow until Feb 5 (attending FOSDEM for the first time!)
Imai-san,

On 2019/01/28 10:44, Imai, Yoshikazu wrote:
> On Thu, Jan 24, 2019 at 6:10 AM, Imai, Yoshikazu wrote:
>> updating partkey case:
>>
>> part-num   master     0001     0002     0003     0004
>>        1  8215.34  7924.99  7931.15  8407.40  8475.65
>>        2  7137.49  7026.45  7128.84  7583.08  7593.73
>>        4  5880.54  5896.47  6014.82  6405.33  6398.71
>>        8  4222.96  4446.40  4518.54  4802.43  4785.82
>>       16  2634.91  2891.51  2946.99  3085.81  3087.91
>>       32   935.12  1125.28  1169.17  1199.44  1202.04
>>       64   352.37   405.27   417.09   425.78   424.53
>>      128   236.26   310.01   307.70   315.29   312.81
>>      256    65.36    86.84    87.67    84.39    89.27
>>      512    18.34    24.84    23.55    23.91    23.91
>>     1024     4.83     6.93     6.51     6.45     6.49
>
> I also tested with non-partitioned table case.
>
> updating partkey case:
>
> part-num   master     0001     0002     0003     0004
>        0  10956.7  10370.5  10472.6  10571.0  10581.5
>        1  8215.34  7924.99  7931.15  8407.40  8475.65
>      ...
>     1024     4.83     6.93     6.51     6.45     6.49
>
>
> In my performance results, it seems update performance degrades in non-partitioned case with v17-patch applied.
> But it seems this degrades did not happen at v2-patch.
>
> On Thu, Aug 30, 2018 at 1:45 AM, Amit, Langote wrote:
>> UPDATE:
>>
>> nparts  master    0001    0002    0003
>> ======  ======    ====    ====    ====
>> 0         2856    2893    2862    2816
>
> Does this degradation only occur in my tests? Or if this result is correct, what may causes the degradation?

I re-ran tests with v18 using the following setup [1]:

    create table ht (a int, b int) partition by hash (b);
    create table ht_0 partition of ht for values with (modulus N, remainder 0);
    ...

    $ cat update-noprune.sql
    update ht set a = 0;

    pgbench -n -T 60 -f update-noprune.sql

TPS:

nparts  master    0001    0002    0003    0004
======  ======    ====    ====    ====    ====
0         4408    4335    4423    4379    4314
1         3883    3873    3679    3856    4007
2         3495    3476    3477    3500    3627

I can see some degradation for small number of partitions, but maybe it's just noise? At least, I don't yet have a systematic explanation for that happening.

Thanks,
Amit

[1] I changed the compile flags in build scripts to drop -DCATCACHE_FORCE_RELEASE, which would cause many syscache misses in my test runs
On Tue, 29 Jan 2019 at 22:32, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/01/29 11:23, David Rowley wrote:
> > 7. In set_inherit_target_rel_sizes() I still don't really like the way
> > you're adjusting the EquivalenceClasses. Would it not be better to
> > invent a function similar to adjust_appendrel_attrs(), or just use
> > that function?
>
> OK, I added a function copy_eq_classes_for_child_root() that simply makes
> a copy of eq_classes from the source root (deep-copying where applicable)
> and returns the list.

hmm, but you've added a case for SpecialJoinInfo, is there a good reason not just to do the translation by just adding an EquivalenceClass case to adjust_appendrel_attrs_mutator() then just get rid of the new "replace" flag in add_child_rel_equivalences()?

That way you'd also remove the churn in the couple of places you've had to modify the existing calls to add_child_rel_equivalences().

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Amit-san, On Tue, Jan 29, 2019 at 10:11 AM, Amit Langote wrote: > On 2019/01/28 10:44, Imai, Yoshikazu wrote: > > On Thu, Jan 24, 2019 at 6:10 AM, Imai, Yoshikazu wrote: > >> updating partkey case: > >> > >> part-num master 0001 0002 0003 0004 > >> 1 8215.34 7924.99 7931.15 8407.40 8475.65 > >> 2 7137.49 7026.45 7128.84 7583.08 7593.73 > >> 4 5880.54 5896.47 6014.82 6405.33 6398.71 > >> 8 4222.96 4446.40 4518.54 4802.43 4785.82 > >> 16 2634.91 2891.51 2946.99 3085.81 3087.91 > >> 32 935.12 1125.28 1169.17 1199.44 1202.04 > >> 64 352.37 405.27 417.09 425.78 424.53 > >> 128 236.26 310.01 307.70 315.29 312.81 > >> 256 65.36 86.84 87.67 84.39 89.27 > >> 512 18.34 24.84 23.55 23.91 23.91 > >> 1024 4.83 6.93 6.51 6.45 6.49 > > > > I also tested with non-partitioned table case. > > > > updating partkey case: > > > > part-num master 0001 0002 0003 0004 > > 0 10956.7 10370.5 10472.6 10571.0 10581.5 > > 1 8215.34 7924.99 7931.15 8407.40 8475.65 > > ... > > 1024 4.83 6.93 6.51 6.45 6.49 > > > > > > In my performance results, it seems update performance degrades in > non-partitioned case with v17-patch applied. > > But it seems this degrades did not happen at v2-patch. > > > > On Thu, Aug 30, 2018 at 1:45 AM, Amit, Langote wrote: > >> UPDATE: > >> > >> nparts master 0001 0002 0003 > >> ====== ====== ==== ==== ==== > >> 0 2856 2893 2862 2816 > > > > Does this degradation only occur in my tests? Or if this result is correct, > what may causes the degradation? > > I re-ran tests with v18 using the following setup [1]: > > create table ht (a int, b int) partition by hash (b); create table ht_0 > partition of ht for values with (modulus N, remainder 0); ... 
> $ cat update-noprune.sql
> update ht set a = 0;
>
> pgbench -n -T 60 -f update-noprune.sql
>
> TPS:
>
> nparts  master    0001    0002    0003    0004
> ======  ======    ====    ====    ====    ====
> 0         4408    4335    4423    4379    4314
> 1         3883    3873    3679    3856    4007
> 2         3495    3476    3477    3500    3627
>
> I can see some degradation for small number of partitions, but maybe it's
> just noise? At least, I don't yet have a systematic explanation for that
> happening.

Thanks for testing. I also re-ran the tests with v18, using almost the same settings as before, but this time I ran pgbench for 60 seconds instead of the 30 seconds used in the previous test.

TPS:

nparts  master    0001    0002    0003    0004
======  ======    ====    ====    ====    ====
0        10794   11018   10761   10552   11066
1         7574    7625    7558    8071    8219
2         6745    6778    6746    7281    7344

I can see no degradation, so I also think that the performance degradation in my previous test and in your test was just noise.

Why I did these tests is that I wanted to confirm that even if we apply each patch one by one, there's no performance problem. Because patches are quite large, I just felt it might be difficult to commit these patches all at once and I thought committing patch one by one would be another option to commit these patches. I don't know there is the rule in the community how patches should be committed, and if there, my thoughts above may be bad.

Anyway, I'll restart code reviewing :)

-- Yoshikazu Imai
On 2019-Jan-30, Imai, Yoshikazu wrote: > Why I did these tests is that I wanted to confirm that even if we > apply each patch one by one, there's no performance problem. Because > patches are quite large, I just felt it might be difficult to commit > these patches all at once and I thought committing patch one by one > would be another option to commit these patches. I don't know there is > the rule in the community how patches should be committed, and if > there, my thoughts above may be bad. There are no absolute rules, but if I was committing it, I would certainly commit each separately, mostly because reviewing the whole series at once looks daunting ... and given the proposed commit messages, I'd guess that writing a combined commit message would also be very difficult. So thanks for doing these tests. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 30, 2019 at 3:26 PM, Alvaro Herrera wrote: > On 2019-Jan-30, Imai, Yoshikazu wrote: > > > Why I did these tests is that I wanted to confirm that even if we > > apply each patch one by one, there's no performance problem. Because > > patches are quite large, I just felt it might be difficult to commit > > these patches all at once and I thought committing patch one by one > > would be another option to commit these patches. I don't know there > is > > the rule in the community how patches should be committed, and if > > there, my thoughts above may be bad. > > There are no absolute rules, but if I was committing it, I would certainly > commit each separately, mostly because reviewing the whole series at once > looks daunting ... and given the proposed commit messages, I'd guess that > writing a combined commit message would also be very difficult. Ah, I see. > So thanks for doing these tests. I'm glad to hear that! Thanks -- Yoshikazu Imai
On Wed, Jan 30, 2019 at 3:20 AM David Rowley <david.rowley@2ndquadrant.com> wrote: > > On Tue, 29 Jan 2019 at 22:32, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > > > On 2019/01/29 11:23, David Rowley wrote: > > > 7. In set_inherit_target_rel_sizes() I still don't really like the way > > > you're adjusting the EquivalenceClasses. Would it not be better to > > > invent a function similar to adjust_appendrel_attrs(), or just use > > > that function? > > > > OK, I added a function copy_eq_classes_for_child_root() that simply makes > > a copy of eq_classes from the source root (deep-copying where applicable) > > and returns the list. > > hmm, but you've added a case for SpecialJoinInfo, is there a good > reason not just to do the translation by just adding an > EquivalenceClass case to adjust_appendrel_attrs_mutator() then just > get rid of the new "replace" flag in add_child_rel_equivalences()? > That way you'd also remove the churn in the couple of places you've > had to modify the existing calls to add_child_rel_equivalences(). Sounds like something worth trying to make work. Will do, thanks for the idea. Thanks, Amit
On Tue, 29 Jan 2019 at 22:32, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated patches. I think we could make the 0001 patch a bit smaller if we were to apply the attached first. Details in the commit message. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2019-Jan-31, David Rowley wrote: > On Tue, 29 Jan 2019 at 22:32, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > Attached updated patches. > > I think we could make the 0001 patch a bit smaller if we were to apply > the attached first. > > Details in the commit message. > > What do you think? Amit didn't say what he thinks, but I think it looks good and the rationale for it makes sense, so I pushed it. I only amended some comments a little bit. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Please rebase. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 1 Feb 2019 at 22:52, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Jan-31, David Rowley wrote: > > > > I think we could make the 0001 patch a bit smaller if we were to apply > > the attached first. > > > > Details in the commit message. > > > > What do you think? > > Amit didn't say what he thinks, but I think it looks good and the > rationale for it makes sense, so I pushed it. I only amended some > comments a little bit. Thanks. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 1 Feb 2019 at 23:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Please rebase.

I had a short list of other things I noticed when making a partial pass over the patch again.

I may as well send these now if there's a new version on the way:

1. I think it's okay to convert the following:

    /*
     * Adjust all_baserels to replace the original target relation with the
     * child target relation. Copy it before modifying though.
     */
    subroot->all_baserels = bms_copy(root->all_baserels);
    subroot->all_baserels = bms_del_member(subroot->all_baserels,
                                           root->parse->resultRelation);
    subroot->all_baserels = bms_add_member(subroot->all_baserels,
                                           subroot->parse->resultRelation);

into:

    /* Adjust all_baserels */
    subroot->all_baserels = adjust_child_relids(root->all_baserels, 1, &appinfo);

2. Any reason to do:

    /*
     * Generate access paths for the entire join tree.
     *
     * For UPDATE/DELETE on an inheritance parent, join paths should be
     * generated for each child result rel separately.
     */
    if (root->parse->resultRelation &&
        root->simple_rte_array[root->parse->resultRelation]->inh)

instead of just checking:

    if (root->inherited_update)

3. This seems like useless code in set_inherit_target_rel_sizes().

    /*
     * If parallelism is allowable for this query in general, see whether
     * it's allowable for this childrel in particular. For consistency,
     * do this before calling set_rel_size() for the child.
     */
    if (root->glob->parallelModeOK)
        set_rel_consider_parallel(subroot, childrel, childRTE);

parallelModeOK is only ever set for SELECT. Likely it's fine just to replace these with:

    + /* We don't consider parallel paths for UPDATE/DELETE statements */
    + childrel->consider_parallel = false;

or perhaps it's fine to leave it out since build_simple_rel() sets it to false.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Thanks David for the patch that Alvaro committed yesterday. Agree that it's a good idea. On Fri, Feb 1, 2019 at 3:11 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > On Fri, 1 Feb 2019 at 23:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Please rebase. > > I had a short list of other things I noticed when making a partial > pass over the patch again. > > I may as well send these now if there's a new version on the way: Thanks. > 1. I think it's okay to convert the following: > > /* > * Adjust all_baserels to replace the original target relation with the > * child target relation. Copy it before modifying though. > */ > subroot->all_baserels = bms_copy(root->all_baserels); > subroot->all_baserels = bms_del_member(subroot->all_baserels, > root->parse->resultRelation); > subroot->all_baserels = bms_add_member(subroot->all_baserels, > subroot->parse->resultRelation); > > into: > /* Adjust all_baserels */ > subroot->all_baserels = adjust_child_relids(root->all_baserels, 1, &appinfo); Makes sense, done. > 2. Any reason to do: > > /* > * Generate access paths for the entire join tree. > * > * For UPDATE/DELETE on an inheritance parent, join paths should be > * generated for each child result rel separately. > */ > if (root->parse->resultRelation && > root->simple_rte_array[root->parse->resultRelation]->inh) > > instead of just checking: if (root->inherited_update) Good reminder, done. > 3. This seems like useless code in set_inherit_target_rel_sizes(). > > /* > * If parallelism is allowable for this query in general, see whether > * it's allowable for this childrel in particular. For consistency, > * do this before calling set_rel_size() for the child. > */ > if (root->glob->parallelModeOK) > set_rel_consider_parallel(subroot, childrel, childRTE); > > > parallelModeOK is only ever set for SELECT. 
Likely it's fine just to > replace these with: > > + /* We don't consider parallel paths for UPDATE/DELETE > statements */ > + childrel->consider_parallel = false; > > or perhaps it's fine to leave it out since build_simple_rel() sets it to false. OK, removed that code. Attached updated patches. It took me a bit longer than expected to rebase the patches as I hit a mysterious bug that I couldn't pinpoint until this afternoon. One big change is related to how ECs are transferred to child PlannerInfos. As David suggested upthread, I created a block in adjust_appendrel_attrs_mutator that creates a translated copy of a given EC wherein the parent expression in the original ec_members list is replaced by the corresponding child expression. With that in place, we no longer need the changes to add_child_rel_equivalences(). Instead there's just: subroot->eq_classes = adjust_appendrel_attrs(root, root->eq_classes, ...), just as David described upthread. Thanks, Amit
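To make the all_baserels change in item 1 concrete, here is a minimal standalone sketch of the idea. It is not PostgreSQL code: RelidSet is a toy 64-bit stand-in for a Bitmapset, and ParentChildMap and translate_relids() are hypothetical names that only loosely mirror AppendRelInfo and the adjust_child_relids() call quoted above. The point is that one helper can do the "copy, delete parent, add child" sequence in a single step without touching the caller's set.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a Bitmapset: bit i set means relid i is a member. */
typedef uint64_t RelidSet;

/* Hypothetical, simplified counterpart of AppendRelInfo. */
typedef struct ParentChildMap
{
    int parent_relid;
    int child_relid;
} ParentChildMap;

static RelidSet
relid_bit(int relid)
{
    assert(relid >= 0 && relid < 64);
    return (RelidSet) 1 << relid;
}

/*
 * Return a new set in which every parent relid present in "relids" is
 * replaced by the corresponding child relid.  The input is passed by
 * value, so the caller's set is never modified; that plays the role of
 * the explicit bms_copy() in the three-statement version.
 */
static RelidSet
translate_relids(RelidSet relids, int nmaps, const ParentChildMap *maps)
{
    RelidSet removed = 0;
    RelidSet added = 0;

    for (int i = 0; i < nmaps; i++)
    {
        if (relids & relid_bit(maps[i].parent_relid))
        {
            removed |= relid_bit(maps[i].parent_relid);
            added |= relid_bit(maps[i].child_relid);
        }
    }
    return (relids & ~removed) | added;
}
```

With a single mapping from relid 1 to relid 4, the set {1, 2, 3} becomes {2, 3, 4}, and a set that does not contain the parent is returned unchanged.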
Attachment
Imai-san, From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] > On 2019/01/22 18:47, David Rowley wrote: > > On Tue, 22 Jan 2019 at 20:01, Imai, Yoshikazu > >> What I understand so far is about 10,000 while loops at total > (4098+4098+some extra) is needed in hash_seq_search() in EXECUTE query > after the creation of the generic plan. > >> 10,000 while loops takes about 10 microsec (of course, we can't estimate > correct time), and the difference of the latency between 5th and 7th EXECUTE > is about 8 microsec, I currently think this causes the difference. > > > > >> I don't know this problem relates to Amit-san's patch, but I'll continue > to investigate it. > > > > I had another thought... when you're making a custom plan you're only > > grabbing locks on partitions that were not pruned (just 1 partition in > > your case), but when making the generic plan, locks will be acquired > > on all partitions (all 4000 of them). This likely means that when > > building the generic plan for the first time that the > > LockMethodLocalHash table is expanded to fit all those locks, and > > since we never shrink those down again, it'll remain that size for the > > rest of your run. I imagine the reason for the slowdown is that > > during LockReleaseAll(), a sequential scan is performed over the > > entire hash table. I see from looking at the hash_seq_search() code > > that the value of max_bucket is pretty critical to how it'll perform. > > The while ((curElem = segp[segment_ndx]) == NULL) loop will need to > > run fewer times with a lower max_bucket. > > I too think that that might be behind that slight drop in performance. > So, it's good to know what one of the performance bottlenecks is when > dealing with large number of relations in queries. Can you compare the performance of auto and force_custom_plan again with the attached patch? It uses PGPROC's LOCALLOCKlist instead of the hash table. Regards Takayuki Tsunakawa
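The effect David describes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the dynahash or lock manager code: LockTable, Entry, scan_by_buckets() and scan_by_list() are hypothetical names. It assumes a table whose bucket array has grown and never shrinks, and compares a scan that visits every bucket with a scan that follows a list threaded through the live entries, which is roughly the alternative the attached patch is said to take.

```c
#include <assert.h>
#include <stddef.h>

/* One tracked lock; hypothetical simplification of a local lock entry. */
typedef struct Entry
{
    int           key;
    struct Entry *bucket_next;  /* chain within one hash bucket */
    struct Entry *list_next;    /* chain through all live entries */
} Entry;

typedef struct LockTable
{
    Entry **buckets;    /* bucket array; assumed never to shrink */
    size_t  nbuckets;
    Entry  *live_head;  /* head of the list of live entries */
} LockTable;

/*
 * Visit every bucket, empty or not.  Returns the number of bucket
 * slots inspected; *nfound receives the number of entries seen.
 */
static size_t
scan_by_buckets(const LockTable *t, size_t *nfound)
{
    size_t inspected = 0;

    *nfound = 0;
    for (size_t b = 0; b < t->nbuckets; b++)
    {
        inspected++;
        for (const Entry *e = t->buckets[b]; e != NULL; e = e->bucket_next)
            (*nfound)++;
    }
    return inspected;
}

/*
 * Follow the list of live entries only.  Returns the number of list
 * nodes inspected; *nfound receives the number of entries seen.
 */
static size_t
scan_by_list(const LockTable *t, size_t *nfound)
{
    size_t inspected = 0;

    *nfound = 0;
    for (const Entry *e = t->live_head; e != NULL; e = e->list_next)
    {
        inspected++;
        (*nfound)++;
    }
    return inspected;
}
```

In this model, a table that once grew to 4096 buckets but now holds a single entry costs 4096 bucket inspections per bucket scan and one inspection per list scan, which matches the shape of the slowdown discussed above.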
Attachment
On 2019/02/02 22:52, Amit Langote wrote: > Attached updated patches. > > One big change is related to how ECs are transferred to child > PlannerInfos. As David suggested upthread, I created a block in > adjust_appendrel_attrs_mutator that creates a translated copy of a > given EC wherein the parent expression in the original > ec_members list is replaced by the corresponding child expression. > With that in place, we no longer need the changes to > add_child_rel_equivalences(). Instead there's just: > subroot->eq_classes = adjust_appendrel_attrs(root, root->eq_classes, > ...), just as David described upthread. Rebased over bdd9a99aac. That commit fixes the bug that lateral_relids were not propagated to grandchildren of an appendrel in some cases due to the way parent rels were mapped to child rels in a nested loop over root->simple_rel_array and root->append_rel_list. The problem that was fixed with that commit was not present with the patches here to begin with. With the patch 0002 here, lateral_relids are propagated from parent rel to child rel directly when the latter's RelOptInfo is built, so lateral_relids are properly propagated from the (possibly RTE_SUBQUERY) top-most parent rel to all the child rels. Thanks, Amit
Attachment
Tsunakawa-san, On Wed, Feb 6, 2019 at 2:04 AM, Tsunakawa, Takayuki wrote: > Can you compare the performance of auto and force_custom_plan again with > the attached patch? It uses PGPROC's LOCALLOCK list instead of the hash > table. Thanks for the patch, but it seems to have some problems. When I executed create/drop/select commands on a large number of partitions, like more than 512 partitions, the backend died unexpectedly. Since I could see the difference in performance between auto and force_custom_plan when the number of partitions is large, the patch needs to be modified before I can check whether performance is improved or not. Thanks -- Yoshikazu Imai
On Fri, 8 Feb 2019 at 14:34, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote: > > On Wed, Feb 6, 2019 at 2:04 AM, Tsunakawa, Takayuki wrote: > > Can you compare the performance of auto and force_custom_plan again with > > the attached patch? It uses PGPROC's LOCALLOCK list instead of the hash > > table. > > Thanks for the patch, but it seems to have some problems. > When I executed create/drop/select commands to large partitions, like over than 512 partitions, backend died unexpectedly. Since I could see the difference of the performance of auto and force_custom_plan when partitions is large, patch needs to be modified to check whether performance is improved or not. It's good to see work being done to try and improve this, but I think it's best to do it on another thread. I think there was some agreement upthread about this not being Amit's patch's problem. Doing it here will make keeping track of this more complex than it needs to be. There's also Amit's issue of keeping his patch series up to date. The CFbot is really useful to alert patch authors when that's required, but having other patches posted to the same thread can cause the CFbot to check the wrong patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
From: David Rowley [mailto:david.rowley@2ndquadrant.com] > It's good to see work being done to try and improve this, but I think > it's best to do it on another thread. I think there was some agreement > upthread about this not being Amit's patch's problem. Doing it here > will make keeping track of this more complex than it needs to be. > There's also Amit's issue of keeping his patch series up to date. The > CFbot is really useful to alert patch authors when that's required, > but having other patches posted to the same thread can cause the CFbot > to check the wrong patch. OK, you're right. I'll continue this on another thread. Regards Takayuki Tsunakawa
Tsunakawa-san, On 2019/02/08 10:50, Tsunakawa, Takayuki wrote: > From: David Rowley [mailto:david.rowley@2ndquadrant.com] >> It's good to see work being done to try and improve this, but I think >> it's best to do it on another thread. I think there was some agreement >> upthread about this not being Amit's patch's problem. Doing it here >> will make keeping track of this more complex than it needs to be. >> There's also Amit's issue of keeping his patch series up to date. The >> CFbot is really useful to alert patch authors when that's required, >> but having other patches posted to the same thread can cause the CFbot >> to check the wrong patch. > > OK, you're right. I'll continue this on another thread. Thank you. I do appreciate that Imai-san has persistently tried to find interesting problems to solve beyond the patches we're working on here. Maybe I chose the subject line of this thread poorly when I began working on it. It should perhaps have been something like "speeding up planning of point-lookup queries with many partitions" or something like that. There are important use cases beyond point lookup even with partitioned tables (or maybe more so with partitioned tables), but perhaps unsurprisingly, the bottlenecks in those cases are not *just* in the planner. Thanks, Amit
Hi Amit, From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] > Maybe I chose the subject line of this thread poorly when I began > working on it. It should perhaps have been something like "speeding up > planning of point-lookup queries with many partitions" or something like > that. There are important use cases beyond point lookup even with > partitioned tables (or maybe more so with partitioned tables), but perhaps > unsurprisingly, the bottlenecks in those cases are not *just* in the > planner. No, it's simply my fault. I wasn't aware of the CF Bot and the CF entry page that act on the latest submitted patch. I'm relieved to see you have submitted the revised patch. Regards Takayuki Tsunakawa
Amit-san,

On Thu, Feb 7, 2019 at 10:22 AM, Amit Langote wrote:
> Rebased over bdd9a99aac.

I did a code review of 0001 and I have some suggestions. Could you check them?

1.
0001: line 418

    + * add_inherit_target_child_root() would've added only those that are

add_inherit_target_child_root() doesn't exist now, so the above comment needs to be modified.

2.
0001: line 508-510

In set_inherit_target_rel_pathlists():

    + /* Nothing to do if all the children were excluded. */
    + if (IS_DUMMY_REL(rel))
    +     return;

This code isn't needed, or can be replaced by an Assert, because set_inherit_target_rel_pathlists is only called from set_rel_pathlist, which executes IS_DUMMY_REL(rel) before calling set_inherit_target_rel_pathlists, as follows.

    set_rel_pathlist(...)
    {
        ...
        if (IS_DUMMY_REL(rel))
        {
            /* We already proved the relation empty, so nothing more to do */
        }
        else if (rte->inh)
        {
            /*
             * If it's a target relation, set the pathlists of children instead.
             * Otherwise, we'll append the outputs of children, so process it as
             * an "append relation".
             */
            if (root->inherited_update && root->parse->resultRelation == rti)
            {
                inherited_update = true;
                set_inherit_target_rel_pathlists(root, rel, rti, rte);

3.
0001: line 1919-1920

    - case CONSTRAINT_EXCLUSION_ON:
    -     break; /* always try to exclude */

CONSTRAINT_EXCLUSION_ON is no longer used, so should we remove it also from guc parameters?

4.
0001:

Can we remove enum InheritanceKind which is no longer used?

I also looked at the patch from the perspective of separating out the code in 0001 that is not essential to "Overhaul inheritance update/delete planning". Although almost all of the code is related to each other, I found that the two things below can be moved to another patch.

---
0001: line 550-608

This section seems to be just a refinement of set_append_rel_size(). So can we separate this from 0001 into another patch?

---
0001: line 812-841, 940-947, 1525-1536, 1938-1947

This code is related to the removal of InheritanceKind from relation_excluded_by_constraints(), so I think it is something like cleaning up unneeded code. Can we separate this into a patch such as some-code-clearnups-of-0001.patch? Of course, we can do that only if removing this code from 0001 would not break "make check" for 0001.
I also think that what I pointed out in 3. and 4. above can be included in some-code-clearnups-of-0001.patch.

What do you think?

-- Yoshikazu Imai
Hi Amit, I'm afraid v20-0001 fails to apply to the current HEAD (precisely, ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz). Could you check it? I'm trying to reproduce what Imai-san hit with my patch. His environment is master@Jan/28 + v18 of your patches. When he added my patch there, CREATE TABLE crashed. On the other hand, the current master + my patch succeeds. So, I wanted to test with the current HEAD + v20 of your patch + my patch. Regards Takayuki Tsunakawa
Tsunakawa-san, On 2019/02/08 15:40, Tsunakawa, Takayuki wrote: > Hi Amit, > > I'm afraid v20-0001 fails to apply to the current HEAD (precisely, ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz). Could you check it? Hmm, I had rebased v20 over HEAD as of yesterday evening. CF bot seemed to be happy with it too: http://cfbot.cputube.org/amit-langote.html Also, I rebased the patches again on the latest HEAD as of this morning and there were no issues. Thanks, Amit
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] > Sent: Friday, February 08, 2019 3:52 PM > Hmm, I had rebased v20 over HEAD as of yesterday evening. CF bot seemed > to be happy with it too: > > http://cfbot.cputube.org/amit-langote.html > > Also, I rebased the patches again on the latest HEAD as of this morning > and there were no issues. There seem to have been some modifications between the latest snapshot tarball and HEAD. I've just managed to apply your v20 to HEAD. Thanks. Regards Takayuki Tsunakawa
On Fri, Feb 8, 2019 at 1:34 AM, I wrote:
> On Wed, Feb 6, 2019 at 2:04 AM, Tsunakawa, Takayuki wrote:
> > Can you compare the performance of auto and force_custom_plan again
> > with the attached patch? It uses PGPROC's LOCALLOCK list instead of
> > the hash table.
>
> Thanks for the patch, but it seems to have some problems.

I just missed compiling.

The performance degradation I saw before is improved! The results are below.

[v20 + faster-locallock-scan.patch]
auto:   9,069 TPS
custom: 9,015 TPS

[v20]
auto:   8,037 TPS
custom: 8,933 TPS

As David and I mentioned, this patch should be discussed on another thread, so Tsunakawa-san, could you launch another thread, please?

Thanks

-- Yoshikazu Imai
Imai-san,

Thanks for the comments.

On 2019/02/08 13:44, Imai, Yoshikazu wrote:
> 3.
> 0001: line 1919-1920
>
> - case CONSTRAINT_EXCLUSION_ON:
> -     break; /* always try to exclude */
>
> CONSTRAINT_EXCLUSION_ON is no longer used, so should we remove it also from guc parameters?

Well, we haven't removed the "on" setting itself.

> 4.
> 0001:
>
> Can we remove enum InheritanceKind which is no longer used?

That we can do.

> I also see the patch from a perspective of separating codes from 0001 which are not essential of Overhaul inheritance update/delete planning. Although almost all of codes are related each other, but I found below two things can be moved to another patch.
>
> ---
> 0001: line 550-608
>
> This section seems to be just refinement of set_append_rel_size().
> So can we separate this from 0001 to another patch?
>
> ---
> 0001: line 812-841, 940-947, 1525-1536, 1938-1947
>
> These codes are related to removement of InheritanceKind from relation_excluded_by_constraints(), so I think it is something like cleaning of unneeded codes. Can we separate this to patch as some-code-clearnups-of-0001.patch? Of course, we can do that only if removing of these codes from 0001 would not bother success of "make check" of 0001.
> I also think that what I pointed out at above 3. and 4. can also be included in some-code-clearnups-of-0001.patch.

Okay, I've broken down those changes into separate patches, so that cleanup hunks are not mixed with other complex changes.

0001 is now a patch to remove duplicate code from set_append_rel_size. It combines multiple blocks that have the same body doing set_dummy_rel_pathlist().

0002 is the "overhaul inherited update/delete planning"

0003 is a cleanup patch that gets rid of some code that is rendered useless due to 0002 (partitioned tables no longer use constraint exclusion)

I think 0001 can be committed on its own. 0002+0003 can be committed together.

0004-0006 are the patches that were previously 0002-0004.
The new version also contains a change to what was previously 0001 patch (attached 0002) -- eq_classes is now copied right when the child subroot is created, that is, in create_inherited_target_child_root(), no longer in the loop in set_inherit_target_rel_sizes(). Thanks, Amit
Attachment
- v21-0001-Reduce-code-duplication-in-set_append_rel_size.patch
- v21-0002-Overhaul-inheritance-update-delete-planning.patch
- v21-0003-Get-rid-of-some-useless-code.patch
- v21-0004-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v21-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v21-0006-Do-not-lock-all-partitions-at-the-beginning.patch
On Fri, 8 Feb 2019 at 22:12, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 0001 is now a patch to remove duplicate code from set_append_rel_size. It
> combines multiple blocks that have the same body doing
> set_dummy_rel_pathlist().
>
> 0002 is the "overhaul inherited update/delete planning"

I had started looking at v20's 0001. I've not done a complete pass over it yet, but I'll likely just start again since v21 is out now:

I've removed the things you've fixed in v21. I think most of these will apply to the 0002 patch, apart from maybe #2.

1. In set_rel_pathlist(), I wonder if you should be skipping the set_rel_pathlist_hook call when inherited_update is true.

Another comment mentions:

     * not this rel. Also, this rel's sole path (ModifyTable) will be set
     * by inheritance_planner later, so we can't check its paths yet.

So you're adding any paths for this rel

2. The comment here mentions "partition", but it might just be a child of an inheritance parent:

    if (childpruned ||
        !apply_child_basequals(root, rel, childrel, childRTE, appinfo) ||
        relation_excluded_by_constraints(root, childrel, childRTE))
    {
        /* This partition needn't be scanned; skip it. */
        set_dummy_rel_pathlist(childrel);
        continue;
    }

This occurs in both set_inherit_target_rel_sizes() and set_append_rel_size()

3. I think the following comment:

    /* If the child itself is partitioned it may turn into a dummy rel. */

It might be better to have it as:

    /*
     * If the child is a partitioned table it may have been marked
     * as a dummy rel from having all its partitions pruned.
     */

I mostly think that by saying "if the child itself is partitioned", then you're implying that we're currently operating on a partitioned table, but we might be working on an inheritance parent.

4. In set_inherit_target_rel_pathlists(), you have:

    /*
     * If set_append_rel_size() decided the parent appendrel was
     * parallel-unsafe at some point after visiting this child rel, we
     * need to propagate the unsafety marking down to the child, so that
     * we don't generate useless partial paths for it.
     */
    if (!rel->consider_parallel)
        childrel->consider_parallel = false;

But I don't quite see why set_append_rel_size() would have ever been called for that rel. It should have been set_inherit_target_rel_sizes(). It seems like the entire chunk can be removed since set_inherit_target_rel_sizes() does not set consider_parallel.

5. In planner.c you mention:

     * inherit_make_rel_from_joinlist - this translates the jointree, replacing
     * the target relation in the original jointree (the root parent) by
     * individual child target relations and performs join planning on the
     * resulting join tree, saving the RelOptInfos of resulting join relations
     * into the top-level PlannerInfo

I can't see a function named inherit_make_rel_from_joinlist().

6. No such function:

     * First, save the unexpanded version of the query's targetlist.
     * create_inherit_target_child_root will use it as base when expanding
     * it for a given child relation as the query's target relation instead
     * of the parent.
     */

and in:

    /*
     * Add missing Vars to child's reltarget.
     *
     * create_inherit_target_child_root() would've added only those that
     * are needed to be present in the top-level tlist (or ones that
     * preprocess_targetlist thinks are needed to be in the tlist.) We
     * may need other attributes such as those contained in WHERE clauses,
     * which are already computed for the parent during
     * deconstruct_jointree processing of the original query (child's
     * query never goes through deconstruct_jointree.)
     */

7. Where is "false" being passed here?

    /* We haven't expanded inheritance yet, so pass false. */
    tlist = preprocess_targetlist(root);
    root->processed_tlist = tlist;
    qp_extra.tlist = tlist;
    qp_extra.activeWindows = NIL;
    qp_extra.groupClause = NIL;
    planned_rel = query_planner(root, tlist, standard_qp_callback, &qp_extra);

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, 8 Feb 2019 at 22:27, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I had started looking at v20's 0001. I've not done a complete pass
> over it yet, but I'll likely just start again since v21 is out now:

I've now done a complete pass over v21.

Here's what I wrote down.

8. Is this code in the wrong patch? I don't see any function named build_dummy_partition_rel in this patch.

     * Make child entries in the EquivalenceClass as well. If the childrel
     * appears to be a dummy one (one built by build_dummy_partition_rel()),
     * no need to make any new entries, because anything that'd need those
     * can instead use the parent's (rel).
     */
    if (childrel->relid != rel->relid &&

9. "to use" seems out of place here. It makes more sense if you remove those words.

     * Add child subroots needed to use during planning for individual child
     * targets

10. Is this comment really needed?

    /*
     * This is needed, because inheritance_make_rel_from_joinlist needs to
     * translate root->join_info_list executing make_rel_from_joinlist for a
     * given child.
     */

None of the other node types mention what they're used for. Seems like something that might get outdated pretty quickly.

11. adjust_appendrel_attrs_mutator: This does not seem very robust:

    /*
     * No point in searching if parent rel not mentioned in eclass; but we
     * can't tell that for sure if parent rel is itself a child.
     */
    for (cnt = 0; cnt < nappinfos; cnt++)
    {
        if (bms_is_member(appinfos[cnt]->parent_relid, ec->ec_relids))
        {
            appinfo = appinfos[cnt];
            break;
        }
    }

What if the caller passes multiple appinfos and actually wants them all processed? You'll only process the first one you find that has an eclass member. I think you should just loop over each appinfo and process all the ones that have a match, not just the first.

I understand the current caller only passes 1, but I don't think that gives you an excuse to take a shortcut on the implementation. I think probably you've done this as that's what is done for Var in adjust_appendrel_attrs_mutator(), but a Var can only belong to a single relation. An EquivalenceClass can have members for multiple relations.

13. adjust_appendrel_attrs_mutator: This seems wrong:

    /*
     * We have found and replaced the parent expression, so done
     * with EC.
     */
    break;

Surely there could be multiple members for the parent. Say:

    UPDATE parted SET ... WHERE x = y AND y = 1;

14. adjust_appendrel_attrs_mutator: Comment is wrong. There's no function called adjust_inherited_target_child_root and the EC is copied in the function, not the caller.

    /*
     * Now fix up EC's relids set. It's OK to modify EC like this,
     * because caller must have made a copy of the original EC.
     * For example, see adjust_inherited_target_child_root.
     */

15. I don't think "Copy it before modifying though." should be part of this comment.

    /*
     * Adjust all_baserels to replace the original target relation with the
     * child target relation. Copy it before modifying though.
     */
    subroot->all_baserels = adjust_child_relids(subroot->all_baserels,
                                                1, &appinfo);

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
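The concern in items 11 and 13 can be shown with a small standalone sketch. It is not the patch's code: Member, RelMap and translate_members() are hypothetical names, each member is assumed to reference exactly one relation, and column numbers are assumed to be the same in parent and child. The loop checks every member against every parent-to-child mapping, so an equivalence class such as {x, y, 1} built from "x = y AND y = 1" has both parent members translated, rather than stopping after the first hit.

```c
#include <assert.h>

/* Hypothetical, simplified equivalence-class member: one rel, one column. */
typedef struct Member
{
    int relid;  /* 0 is used here for a constant with no relation */
    int attno;
} Member;

/* Hypothetical, simplified counterpart of AppendRelInfo. */
typedef struct RelMap
{
    int parent_relid;
    int child_relid;
} RelMap;

/*
 * Translate, in place, every member that belongs to any parent listed
 * in "maps".  Returns the number of members rewritten.  The inner
 * "break" only ends the search for the current member; the outer loop
 * keeps going so that later members of the same parent are translated
 * as well.
 */
static int
translate_members(Member *members, int nmembers,
                  const RelMap *maps, int nmaps)
{
    int nchanged = 0;

    for (int m = 0; m < nmembers; m++)
    {
        for (int i = 0; i < nmaps; i++)
        {
            if (members[m].relid == maps[i].parent_relid)
            {
                members[m].relid = maps[i].child_relid;
                nchanged++;
                break;
            }
        }
    }
    return nchanged;
}
```

For the class {parted.x, parted.y, constant} with a mapping from relid 1 to relid 5, both x and y end up on relid 5 and the constant is left alone. A version that stopped after the first replaced member would leave y pointing at the parent.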
Thanks for the review. On 2019/02/08 18:27, David Rowley wrote: > On Fri, 8 Feb 2019 at 22:12, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> 0001 is now a patch to remove duplicate code from set_append_rel_size. It >> combines multiple blocks that have the same body doing >> set_dummy_rel_pathlist(). >> >> 0002 is the "overhaul inherited update/delete planning" > > I had started looking at v20's 0001. I've not done a complete pass > over it yet, but I'll likely just start again since v21 is out now: > > I've removed the things you've fixed in v21. I think most of these > will apply to the 0002 patch, apart from maybe #2. > > 1. In set_rel_pathlist(), I wonder if you should be skipping the > set_rel_pathlist_hook call when inherited_update is true. > > Another comment mentions: > > * not this rel. Also, this rel's sole path (ModifyTable) will be set > * by inheritance_planner later, so we can't check its paths yet. > > So you're adding any paths for this rel I've changed this part such that set_rel_pathlist returns right after the last else{...} block if inherited_update is true. > 2. The comment here mentions "partition", but it might just be a child > of an inheritance parent: > > if (childpruned || > !apply_child_basequals(root, rel, childrel, childRTE, appinfo) || > relation_excluded_by_constraints(root, childrel, childRTE)) > { > /* This partition needn't be scanned; skip it. */ > set_dummy_rel_pathlist(childrel); > continue; > } > > This occurs in both set_inherit_target_rel_sizes() and set_append_rel_size() Fixed. > 3. I think the following comment: > > /* If the child itself is partitioned it may turn into a dummy rel. */ > > It might be better to have it as: > > /* > * If the child is a partitioned table it may have been marked > * as a dummy rel from having all its partitions pruned. 
> */ > > I mostly think that by saying "if the child itself is partitioned", > then you're implying that we're currently operating on a partitioned > table, but we might be working on an inheritance parent. Agreed. Your suggested wording is clearer, so rewrote the comment that way. > 4. In set_inherit_target_rel_pathlists(), you have: > > /* > * If set_append_rel_size() decided the parent appendrel was > * parallel-unsafe at some point after visiting this child rel, we > * need to propagate the unsafety marking down to the child, so that > * we don't generate useless partial paths for it. > */ > if (!rel->consider_parallel) > childrel->consider_parallel = false; > > But I don't quite see why set_append_rel_size() would have ever been > called for that rel. It should have been > set_inherit_target_rel_sizes(). It seems like the entire chunk can be > removed since set_inherit_target_rel_sizes() does not set > consider_parallel. This is a copy-pasted bit of code that's apparently useless. Removed. > 5. In planner.c you mention: > > * inherit_make_rel_from_joinlist - this translates the jointree, replacing > * the target relation in the original jointree (the root parent) by > * individual child target relations and performs join planning on the > * resulting join tree, saving the RelOptInfos of resulting join relations > * into the top-level PlannerInfo > > > I can't see a function named inherit_make_rel_from_joinlist(). > > 6. No such function: > > * First, save the unexpanded version of the query's targetlist. > * create_inherit_target_child_root will use it as base when expanding > * it for a given child relation as the query's target relation instead > * of the parent. > */ > > and in: > > /* > * Add missing Vars to child's reltarget. > * > * create_inherit_target_child_root() would've added only those that > * are needed to be present in the top-level tlist (or ones that > * preprocess_targetlist thinks are needed to be in the tlist.) 
We > * may need other attributes such as those contained in WHERE clauses, > * which are already computed for the parent during > * deconstruct_jointree processing of the original query (child's > * query never goes through deconstruct_jointree.) > */ Oops, fixed. So, these are the new functions: set_inherit_target_rel_sizes set_inherit_target_rel_pathlists add_inherited_target_child_roots create_inherited_target_child_root inheritance_make_rel_from_joinlist I've renamed set_inherit_target_* functions to set_inherited_target_* to avoid having multiple styles of naming inheritance-related functions. > 7. Where is "false" being passed here? > > /* We haven't expanded inheritance yet, so pass false. */ > tlist = preprocess_targetlist(root); > root->processed_tlist = tlist; > qp_extra.tlist = tlist; > qp_extra.activeWindows = NIL; > qp_extra.groupClause = NIL; > planned_rel = query_planner(root, tlist, standard_qp_callback, &qp_extra); Hmm, thought I'd fixed this but maybe messed up again while rebasing. > 8. Is this code in the wrong patch? I don't see any function named > build_dummy_partition_rel in this patch. > > * Make child entries in the EquivalenceClass as well. If the childrel > * appears to be a dummy one (one built by build_dummy_partition_rel()), > * no need to make any new entries, because anything that'd need those > * can instead use the parent's (rel). > */ > if (childrel->relid != rel->relid && Again, appears to be a rebasing mistake. Moved that hunk to the "Lazy creation of..." patch which necessitates it. > 9. "to use" seems out of place here. It makes more sense if you remove > those words. > > * Add child subroots needed to use during planning for individual child > * targets Removed. > 10. Is this comment really needed? > > /* > * This is needed, because inheritance_make_rel_from_joinlist needs to > * translate root->join_info_list executing make_rel_from_joinlist for a > * given child. 
> */ > > None of the other node types mention what they're used for. Seems > like something that might get outdated pretty quickly. OK, removed. > 11. adjust_appendrel_attrs_mutator: This does not seem very robust: > > /* > * No point in searching if parent rel not mentioned in eclass; but we > * can't tell that for sure if parent rel is itself a child. > */ > for (cnt = 0; cnt < nappinfos; cnt++) > { > if (bms_is_member(appinfos[cnt]->parent_relid, ec->ec_relids)) > { > appinfo = appinfos[cnt]; > break; > } > } > > What if the caller passes multiple appinfos and actually wants them > all processed? You'll only process the first one you find that has an > eclass member. I think you should just loop over each appinfo and > process all the ones that have a match, not just the first. > > I understand the current caller only passes 1, but I don't think that > gives you an excuse to take a shortcut on the implementation. I think > probably you've done this as that's what is done for Var in > adjust_appendrel_attrs_mutator(), but a Var can only belong to a > single relation. An EquivalenceClass can have members for multiple > relations. OK, I've refactored the code such that translation is carried out with *all* appinfos passed to adjust_appendrel_attrs_mutator. > 13. adjust_appendrel_attrs_mutator: This seems wrong: > > /* > * We have found and replaced the parent expression, so done > * with EC. > */ > break; > > Surely there could be multiple members for the parent. Say: > > UPDATE parted SET ... WHERE x = y AND y = 1; I hadn't considered that. Removed the break so that *all* members of a given EC are considered for translating. > 14. adjust_appendrel_attrs_mutator: Comment is wrong. There's no > function called adjust_inherited_target_child_root and the EC is > copied in the function, not the caller. > > /* > * Now fix up EC's relids set. It's OK to modify EC like this, > * because caller must have made a copy of the original EC. 
> * For example, see adjust_inherited_target_child_root. > */ > > > 15. I don't think "Copy it before modifying though." should be part of > this comment. > > /* > * Adjust all_baserels to replace the original target relation with the > * child target relation. Copy it before modifying though. > */ > subroot->all_baserels = adjust_child_relids(subroot->all_baserels, > 1, &appinfo); Updated both of these stale comments. Attached updated patches. Thanks, Amit
Attachment
- v22-0001-Reduce-code-duplication-in-set_append_rel_size.patch
- v22-0002-Overhaul-inheritance-update-delete-planning.patch
- v22-0003-Get-rid-of-some-useless-code.patch
- v22-0004-Lazy-creation-of-RTEs-for-inheritance-children.patch
- v22-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v22-0006-Do-not-lock-all-partitions-at-the-beginning.patch
Amit-san, Sorry for my late reply. I had other work to do. On Fri, Feb 8, 2019 at 9:13 AM, Amit Langote wrote: > On 2019/02/08 13:44, Imai, Yoshikazu wrote: > > 3. > > 0001: line 1919-1920 > > > > - case CONSTRAINT_EXCLUSION_ON: > > - break; /* always try to exclude */ > > > > CONSTRAINT_EXCLUSION_ON is no longer used, so should we remove it also > from guc parameters? > > Well, we haven't removed the "on" setting itself. Ah, I understand. > Okay, I've broken down those changes into separate patches, so that > cleanup hunks are not mixed with other complex changes. > > 0001 is now a patch to remove duplicate code from set_append_rel_size. > It combines multiple blocks that have the same body doing > set_dummy_rel_pathlist(). > > 0002 is the "overhaul inherited update/delete planning" > > 0003 is a cleanup patch that gets rid of some code that is rendered useless > due to 0002 (partitioned tables no longer use constraint exclusion) Thanks for doing these. > I think 0001 can be committed on its own. +1. In commit message: s/contradictory quals found/contradictory quals are found/ s/child excluded/child is excluded/ I think the rest of 0001 is OK. > 0002+0003 can be committed > together. > > 0004-0006 are the patches that were previously 0002-0004. I will review the v22 patches again and send notes as soon as possible. Yoshikazu Imai
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > [ v22 patch set ] I started to look at this, and immediately choked on the 0001 patch: if (childpruned || !apply_child_basequals(root, rel, childrel, childRTE, appinfo) || relation_excluded_by_constraints(root, childrel, childRTE)) { Frankly, that code is just horrid. Having a function with side effects in an if-test is questionable at the best of times, and having it be the second of three conditions (which the third condition silently depends on) is unreadable and unmaintainable. I think the existing code here is considerably cleaner than what this patch proposes. I suppose you are doing this because you intend to jam some additional cleanup code into the successfully-pruned-it code path, but if said code is really too bulky to have multiple copies of, couldn't you put it into a subroutine? You're not going to be able to get to only one copy of such cleanup anyhow, because there is another early-exit further down, for the case where set_rel_size detects dummy-ness. regards, tom lane
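The objection above is about an ordering dependency hidden inside a short-circuited `||` chain. A minimal, self-contained sketch of the alternative shape, with each step as its own statement and the shared cleanup in a subroutine, could look like the following. All names (`ToyRel`, `toy_apply_basequals`, and so on) are hypothetical stand-ins and not PostgreSQL functions; the sketch only illustrates the control-flow style being asked for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a child relation's planning state. */
typedef struct ToyRel
{
    bool pruned;                  /* removed by partition pruning */
    bool quals_applied;           /* set by toy_apply_basequals() */
    bool quals_contradictory;     /* translated quals can never be true */
    bool excluded_by_constraints; /* constraint exclusion proves it empty */
    bool dummy;                   /* marked as not needing a scan */
} ToyRel;

/* Has a side effect: attaches quals to the rel. False if contradictory. */
bool
toy_apply_basequals(ToyRel *rel)
{
    rel->quals_applied = true;
    return !rel->quals_contradictory;
}

/* Silently depends on the side effect of toy_apply_basequals(). */
bool
toy_excluded_by_constraints(const ToyRel *rel)
{
    assert(rel->quals_applied);
    return rel->excluded_by_constraints;
}

/* The shared "successfully pruned it" cleanup, kept in one subroutine. */
void
toy_mark_dummy(ToyRel *rel)
{
    rel->dummy = true;
}

/* Returns true if the child still has to be scanned. */
bool
toy_child_needs_scan(ToyRel *rel)
{
    bool quals_ok;

    if (rel->pruned)
    {
        toy_mark_dummy(rel);
        return false;
    }

    /* The side-effecting step is a statement of its own, not an if-test. */
    quals_ok = toy_apply_basequals(rel);
    if (!quals_ok)
    {
        toy_mark_dummy(rel);
        return false;
    }

    /* Only now is it valid to consult the constraint-exclusion check. */
    if (toy_excluded_by_constraints(rel))
    {
        toy_mark_dummy(rel);
        return false;
    }
    return true;
}
```

The three early exits repeat only a one-line call, which is the trade-off suggested above: multiple exit points, but a single copy of the cleanup logic.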
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > [ v22 patch set ] I did some review of the 0002 patch. I like the general idea, but there are a lot of details I don't much like. Probably the biggest complaint is that I don't like what you did to the API of grouping_planner(): it's ugly and unprincipled. Considering that only a tiny fraction of what grouping_planner does is actually relevant to inherited UPDATES/DELETEs, maybe we should factor that part out --- or, heavens, just duplicate some code --- so that inheritance_planner doesn't need to call grouping_planner at all. (I see that you have another patch in the queue that proposes to fix this through a rather massive refactoring of grouping_planner, but I do not think I like that approach. grouping_planner does perhaps need refactoring, but I don't want to drive such work off an arbitrary-and-dubious requirement that it shouldn't call query_planner.) Another point is that I don't like this division of labor between equivclass.c and appendinfo.c. I don't like exposing add_eq_member globally --- that's just an invitation for code outside equivclass.c to break the fundamental invariants of ECs --- and I also think you've taught appendinfo.c far more than it ought to know about the innards of ECs. I'd suggest putting all of that logic into equivclass.c, with a name along the lines of translate_equivalence_class_to_child. That would reverse the dependency, in that equivclass.c would now need to call adjust_appendrel_attrs ... but that function is already globally exposed. I don't much care for re-calling build_base_rel_tlists to add extra Vars to the appropriate relations; 99% of the work it does will be wasted, and with enough child rels you could run into an O(N^2) problem. Maybe you could call add_vars_to_targetlist directly, since you know exactly what Vars you're adding? What is the point of moving the calculation of all_baserels? 
The earlier you construct that, the more likelihood that code will have to be written to modify it (like, say, what you had to put into create_inherited_target_child_root), and I do not see anything in this patch series that needs it to be available earlier. The business with translated_exprs and child_target_exprs in set_inherited_target_rel_sizes seems to be dead code --- nothing is done with the list. Should that be updating childrel->reltarget? Or is that now done elsewhere, and if so, why isn't the elsewhere also handling childrel->joininfo? + * Set a non-zero value here to cope with the caller's requirement + * that non-dummy relations are actually not empty. We don't try to What caller is that? Perhaps we should change that rather than inventing a bogus value here? Couldn't inh_target_child_rels list be removed in favor of looking at the relid fields of the inh_target_child_path_rels entries? Having to keep those two lists in sync seems messy. If you're adding fields to PlannerInfo (or pretty much any other planner data structure) you should update outfuncs.c to print them if feasible. Also, please avoid "add new fields at the end" syndrome. Put them where they logically belong. For example, if inh_target_child_roots has to be the same length as simple_rel_array, it's not just confusing for it not to be near that field, it's outright dangerous: it increases the risk that somebody will forget to manipulate both fields. regards, tom lane
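The last point above, that fields which must stay the same length belong next to each other, can be sketched in isolation. The struct and function below are hypothetical (`ToyPlannerInfo`, `toy_expand_arrays`) and only loosely modeled on the description in the message; the assumption illustrated is that declaring the arrays side by side and resizing them in one routine makes it harder to update one and forget the other.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in; the related fields are declared together. */
typedef struct ToyPlannerInfo
{
    int    array_size;   /* allocated length of BOTH arrays below */
    void **rel_array;    /* indexed by range-table index */
    void **child_roots;  /* must always be as long as rel_array */
} ToyPlannerInfo;

/*
 * Grow both arrays by add_size entries in one place.
 * Returns 0 on success, or -1 on allocation failure, in which case
 * the struct is left unchanged.
 */
int
toy_expand_arrays(ToyPlannerInfo *root, int add_size)
{
    int    new_size;
    void **new_rels;
    void **new_roots;

    assert(root != NULL && add_size > 0);
    new_size = root->array_size + add_size;

    /* calloc zero-fills; this sketch treats zeroed slots as NULL pointers */
    new_rels = calloc((size_t) new_size, sizeof(void *));
    new_roots = calloc((size_t) new_size, sizeof(void *));
    if (new_rels == NULL || new_roots == NULL)
    {
        free(new_rels);
        free(new_roots);
        return -1;
    }

    if (root->array_size > 0)
    {
        memcpy(new_rels, root->rel_array,
               (size_t) root->array_size * sizeof(void *));
        memcpy(new_roots, root->child_roots,
               (size_t) root->array_size * sizeof(void *));
    }

    free(root->rel_array);
    free(root->child_roots);
    root->rel_array = new_rels;
    root->child_roots = new_roots;
    root->array_size = new_size;
    return 0;
}
```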
On Mon, Feb 18, 2019 at 5:28 PM, Tom Lane wrote: > Frankly, that code is just horrid. Having a function with side effects > in an if-test is questionable at the best of times, and having it be the > second of three conditions (which the third condition silently depends > on) is unreadable and unmaintainable. When I reviewed this, I thought there were no problems in the code, but I googled what Tom pointed out [1], read it, and was ashamed of my ignorance. > I think the existing code here is considerably cleaner than what this > patch proposes. > > I suppose you are doing this because you intend to jam some additional > cleanup code into the successfully-pruned-it code path, but if said code > is really too bulky to have multiple copies of, couldn't you put it into > a subroutine? ISTM the 0004 patch eventually removes this code from multiple places (set_append_rel_size and set_inherited_target_rel_sizes), so we might be better off not struggling here? [1] https://www.teamten.com/lawrence/programming/keep-if-clauses-side-effect-free.html -- Yoshikazu Imai
Thanks for looking. On 2019/02/19 2:27, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> [ v22 patch set ] > > I started to look at this, and immediately choked on the 0001 patch: > > if (childpruned || > !apply_child_basequals(root, rel, childrel, childRTE, appinfo) || > relation_excluded_by_constraints(root, childrel, childRTE)) > { > > Frankly, that code is just horrid. Having a function with side effects > in an if-test is questionable at the best of times, and having it be > the second of three conditions (which the third condition silently depends > on) is unreadable and unmaintainable. > > I think the existing code here is considerably cleaner than what this > patch proposes. OK, I think we can just skip this patch. > I suppose you are doing this because you intend to jam some additional > cleanup code into the successfully-pruned-it code path, but if said > code is really too bulky to have multiple copies of, couldn't you > put it into a subroutine? Actually, one of the later patches (lazy creation of partition RTEs) *replaces* the above code block with: if (IS_DUMMY_REL(childrel)) continue; because with that patch, the step that prunes/excludes children will occur earlier than set_rel_size / set_append_rel_size. For pruned children, there won't be any RTE/RelOptInfo/AppendRelInfo to begin with. Children that survive partition pruning but get excluded due to contradictory quals (apply_child_basequals returning false) or constraint exclusion will be marked dummy before even getting to set_append_rel_size. I'll adjust the patches accordingly. Thanks, Amit
"Imai, Yoshikazu" <imai.yoshikazu@jp.fujitsu.com> writes: > ISTM the 0004 patch eventually removes these codes from multiple places (set_append_rel_size and set_inherited_target_rel_sizes)so we might be better to not be struggling here? Yeah, Amit just pointed that out (and I'd not read 0004 before reacting to 0001). If the final state of the code isn't going to look like this, then whether the intermediate state is good style becomes far less important. Still, maybe it'd be better to drop the 0001 patch and absorb its effects into the later patch that makes that if-test go away entirely. regards, tom lane
Thanks for the review. On 2019/02/19 4:42, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> [ v22 patch set ] > > I did some review of the 0002 patch. I like the general idea, > but there are a lot of details I don't much like. > > Probably the biggest complaint is that I don't like what you did to the > API of grouping_planner(): it's ugly and unprincipled. Considering that > only a tiny fraction of what grouping_planner does is actually relevant to > inherited UPDATES/DELETEs, maybe we should factor that part out --- or, > heavens, just duplicate some code --- so that inheritance_planner doesn't > need to call grouping_planner at all. (I see that you have another patch > in the queue that proposes to fix this through a rather massive > refactoring of grouping_planner, but I do not think I like that approach. > grouping_planner does perhaps need refactoring, but I don't want to drive > such work off an arbitrary-and-dubious requirement that it shouldn't call > query_planner.) OK, modified the patch to leave grouping_planner unchanged (except got rid of inheritance_update argument). Now inheritance_planner directly modifies the paths produced by query_planner (per child target relation that is), so that they have a PathTarget suitable for producing query's top-level targetlist. I refactored portion of apply_scanjoin_target_to_paths() that does that work and inheritance_planner calls it directly. > Another point is that I don't like this division of labor between > equivclass.c and appendinfo.c. I don't like exposing add_eq_member > globally --- that's just an invitation for code outside equivclass.c to > break the fundamental invariants of ECs --- and I also think you've taught > appendinfo.c far more than it ought to know about the innards of ECs. > I'd suggest putting all of that logic into equivclass.c, with a name along > the lines of translate_equivalence_class_to_child. 
That would reverse the > dependency, in that equivclass.c would now need to call > adjust_appendrel_attrs ... but that function is already globally exposed. That makes sense. I've tried implementing EC translation to occur this way in the updated patch. > I don't much care for re-calling build_base_rel_tlists to add extra > Vars to the appropriate relations; 99% of the work it does will be > wasted, and with enough child rels you could run into an O(N^2) > problem. Maybe you could call add_vars_to_targetlist directly, > since you know exactly what Vars you're adding? Assuming you're talking about the build_base_rel_tlists() call in create_inherited_target_child_root(), it's necessary because *all* tlist Vars are new, because the targetlist has just been translated at that point. But maybe I missed your point? By the way, later patch in the series will cause partition pruning to occur before these child PlannerInfos are generated, so create_inherited_target_child_root will be called only as many times as there are un-pruned child target relations. > What is the point of moving the calculation of all_baserels? The earlier > you construct that, the more likelihood that code will have to be written > to modify it (like, say, what you had to put into > create_inherited_target_child_root), and I do not see anything in this > patch series that needs it to be available earlier. all_baserels needs to be built in original PlannerInfo before child PlannerInfos are constructed, so that they can simply copy it and have the parent target baserel RT index in it replaced by child target baserel RT index. set_inherited_target_rel_sizes/pathlists use child PlannerInfos, so all_baserels must be set in them just like it is in the original PlannerInfo. > The business with translated_exprs and child_target_exprs in > set_inherited_target_rel_sizes seems to be dead code --- nothing is done > with the list. Should that be updating childrel->reltarget? 
Or is that > now done elsewhere, and if so, why isn't the elsewhere also handling > childrel->joininfo? Actually, child_target_exprs simply refers to childrel->reltarget->exprs, so modifying it modifies the latter too, but I've found that confusing myself. So, I have removed it. childrel->reltarget->exprs would only contain the targetlist Vars at this point (added by build_base_rel_tlists called by create_inherited_target_child_root), but translated_exprs will also contain Vars that are referenced in WHERE clauses, which have not been added to childrel->reltarget->exprs yet. That's what's getting added to childrel->reltarget in this code. > + * Set a non-zero value here to cope with the caller's requirement > + * that non-dummy relations are actually not empty. We don't try to > > What caller is that? Perhaps we should change that rather than inventing > a bogus value here? OK, modified relevant Asserts in the callers. One was set_rel_size and other set_inherited_target_rel_sizes itself (may be called recursively). > Couldn't inh_target_child_rels list be removed in favor of looking > at the relid fields of the inh_target_child_path_rels entries? > Having to keep those two lists in sync seems messy. Don't like having two lists either, but inh_target_child_path_rels entries can be RELOPT_JOINREL, so relid can be 0. > If you're adding fields to PlannerInfo (or pretty much any other > planner data structure) you should update outfuncs.c to print them > if feasible. Also, please avoid "add new fields at the end" syndrome. > Put them where they logically belong. For example, if > inh_target_child_roots has to be the same length as simple_rel_array, > it's not just confusing for it not to be near that field, it's > outright dangerous: it increases the risk that somebody will forget > to manipulate both fields. I've moved these fields around in the struct definition. Also, I've added unexpanded_tlist and inh_target_child_rels to _outPlannerInfo. Attached updated patches. 
0001 that I'd previously posted is no longer included, as I said in the other email. Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/02/19 4:42, Tom Lane wrote: >> I don't much care for re-calling build_base_rel_tlists to add extra >> Vars to the appropriate relations; 99% of the work it does will be >> wasted, and with enough child rels you could run into an O(N^2) >> problem. Maybe you could call add_vars_to_targetlist directly, >> since you know exactly what Vars you're adding? > Assuming you're talking about the build_base_rel_tlists() call in > create_inherited_target_child_root(), it's necessary because *all* tlist > Vars are new, because the targetlist has just been translated at that > point. But maybe I missed your point? Hmm, I'll take a closer look --- I thought it was just there to add the new ctid-or-equivalent columns. Didn't our earlier translation of the whole subroot structure reach the reltargetlists? > By the way, later patch in the series will cause partition pruning to > occur before these child PlannerInfos are generated, so > create_inherited_target_child_root will be called only as many times as > there are un-pruned child target relations. Well, that helps, but only for "point update" queries. You still need to be wary of not causing O(N^2) behavior when there are lots of unpruned partitions. >> What is the point of moving the calculation of all_baserels? The earlier >> you construct that, the more likelihood that code will have to be written >> to modify it (like, say, what you had to put into >> create_inherited_target_child_root), and I do not see anything in this >> patch series that needs it to be available earlier. > all_baserels needs to be built in original PlannerInfo before child > PlannerInfos are constructed, so that they can simply copy it and have the > parent target baserel RT index in it replaced by child target baserel RT > index. set_inherited_target_rel_sizes/pathlists use child PlannerInfos, > so all_baserels must be set in them just like it is in the original > PlannerInfo. 
No, I think you're not getting my point: if you build all_baserels in the same place where it already is being built, you don't need to do any of that because it's already correct. It looks to me like this code motion is left over from some earlier iteration of the patch where it probably was necessary, but now it's just making your life harder. >> Couldn't inh_target_child_rels list be removed in favor of looking >> at the relid fields of the inh_target_child_path_rels entries? >> Having to keep those two lists in sync seems messy. > Don't like having two lists either, but inh_target_child_path_rels entries > can be RELOPT_JOINREL, so relid can be 0. So? For a join, there's no particularly relevant integer to put into inh_target_child_rels either. (I might like these two lists better if they had better-chosen names. Merely making them long enough to induce carpal tunnel syndrome isn't helping distinguish them.) > Attached updated patches. 0001 that I'd previously posted is no longer > included, as I said in the other email. OK, I'll make another pass over 0001 today. regards, tom lane
I wrote: > OK, I'll make another pass over 0001 today. So I started the day with high hopes for this, but the more I looked at it the less happy I got, and finally I ran into something that looks to be a complete show-stopper. Namely, that the patch does not account for the possibility of an inherited target rel being the outside for a parameterized path to some other rel. Like this example in the regression database: explain update a set aa = aa + 1 from tenk1 t where a.aa = t.unique2; With HEAD, you get a perfectly nice plan that consists of an append of per-child plans like this: -> Nested Loop (cost=0.29..8.31 rows=1 width=16) -> Seq Scan on a (cost=0.00..0.00 rows=1 width=10) -> Index Scan using tenk1_unique2 on tenk1 t (cost=0.29..8.30 rows=1 width=10) Index Cond: (unique2 = a.aa) With the 0001 patch, this gets an Assert during set_base_rel_pathlists, because indxpath.c tries to make a parameterized path for tenk1 with "a" as the outer rel. Since tenk1's joinlist hasn't been touched, it's still referencing the inheritance parent, and the code notices that we haven't made a rowcount estimate for that. Even if we had, we'd generate a Path referencing Vars of the parent rel, which would not work. Conceivably, such a Path could be fixed up later (say by applying adjust_appendrel_attrs to it during createplan.c), but that is not going to fix the fundamental problem: the cost estimate for such a Path should vary depending on how large we think the outer rel is, and we don't have a reasonable way to set that if we're trying to make a one-size-fits-all Path for something that's being joined to an inheritance tree with a widely varying set of relation sizes. So I do not see any way to make this approach work without a significant(?) sacrifice in the quality of plans. I've got other issues with the patch too, but it's probably not worth getting into them unless we can answer this objection. regards, tom lane
On 2019/02/20 5:57, Tom Lane wrote: > I wrote: >> OK, I'll make another pass over 0001 today. > > So I started the day with high hopes for this, but the more I looked at > it the less happy I got, and finally I ran into something that looks to > be a complete show-stopper. Namely, that the patch does not account > for the possibility of an inherited target rel being the outside for a > parameterized path to some other rel. Like this example in the > regression database: > > explain update a set aa = aa + 1 > from tenk1 t where a.aa = t.unique2; > > With HEAD, you get a perfectly nice plan that consists of an append > of per-child plans like this: > > -> Nested Loop (cost=0.29..8.31 rows=1 width=16) > -> Seq Scan on a (cost=0.00..0.00 rows=1 width=10) > -> Index Scan using tenk1_unique2 on tenk1 t (cost=0.29..8.30 rows=1 > width=10) > Index Cond: (unique2 = a.aa) > > With the 0001 patch, this gets an Assert during set_base_rel_pathlists, > > because indxpath.c tries to make a parameterized path for tenk1 > with "a" as the outer rel. Since tenk1's joinlist hasn't been > touched, it's still referencing the inheritance parent, and the > code notices that we haven't made a rowcount estimate for that. Hmm, yeah. It wouldn't have crashed with an earlier version of the patch, because with it, we were setting the parent relation's rows to a dummy value of 1 at the end of set_inherited_target_rel_sizes, which I removed in the latest patch after your comment upthread. > Even if we had, we'd generate a Path referencing Vars of the parent > rel, which would not work. > > Conceivably, such a Path could be fixed up later (say by applying > adjust_appendrel_attrs to it during createplan.c), Actually, reparameterize_path_by_child (invented by partitionwise-join commit) seems to take care of fixing up the Path to have child attributes, so the plan comes out exactly as on HEAD. 
But to be honest, that means this new approach of inherited update join planning only appears to work by accident. > but that is not > going to fix the fundamental problem: the cost estimate for such a > Path should vary depending on how large we think the outer rel is, > and we don't have a reasonable way to set that if we're trying to > make a one-size-fits-all Path for something that's being joined to > an inheritance tree with a widely varying set of relation sizes. What if we set the parent target relation's rows to an average of child target relation's rows, that is, instead of setting it to dummy 1 that previous versions of the patches were doing? Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/02/20 5:57, Tom Lane wrote: >> but that is not >> going to fix the fundamental problem: the cost estimate for such a >> Path should vary depending on how large we think the outer rel is, >> and we don't have a reasonable way to set that if we're trying to >> make a one-size-fits-all Path for something that's being joined to >> an inheritance tree with a widely varying set of relation sizes. > What if we set the parent target relation's rows to an average of child > target relation's rows, that is, instead of setting it to dummy 1 that > previous versions of the patches were doing? Well, if somebody were holding a gun to our collective heads and saying you must do inherited UPDATE/DELETE this way, we could probably limp along with that; or maybe it'd be better to use the sum of the children's row counts. That depends on how many of the per-child join plans end up using the parameterized path, which is something we couldn't hope to guess so early. (Arguably, the way the code is now, it's overestimating the true costs of such paths, since it doesn't account for different child plans possibly using the same indexscan and thereby getting caching benefits.) In any case there'd be side-effects on code that currently expects appendrels to have size zero, eg the total_table_pages calculation in make_one_rel. However, there are other reasons why I'm not really happy with the approach proposed in this patch. The main problem is that cloning the PlannerInfo while still sharing a lot of infrastructure between the clones is a horrid hack that I think will be very buggy and unmaintainable. 
We've gotten away with it so far in inheritance_planner because (a) the complexity is all local to that function and (b) the cloning happens very early in the planning process, so that there's not much shared subsidiary data to worry about; mostly just the parse tree, which in fact isn't shared because the first thing we do is push it through adjust_appendrel_attrs. This patch proposes to clone much later, and both the actual cloning and the consequences of that are spread all over, and I don't think we're nearly done with the consequences :-(. I found the parameterized-path problem while wondering why it was okay to not worry about adjusting attrs in data structures used during path construction for other baserels ... turns out it isn't. There's a lot of other stuff in PlannerInfo that you're not touching, for instance pathkeys and placeholders; and I'm afraid much of that represents either current bugs or future problems. So what I feel we should do is set this aside for now and see if we can make something of the other idea I proposed. If we could get rid of expand-inheritance-at-the-top altogether, and plan inherited UPDATE/DELETE the same as inherited SELECT, that would be a large reduction in planner complexity, hence much to be preferred over this approach which is a large increase in planner complexity. If that approach crashes and burns, we can come back to this. There might be parts of this work we can salvage, though. It seems like the idea of postponing expand_inherited_tables() might be something we could use anyway. regards, tom lane
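To make the average-versus-sum question raised above a little more concrete, here is a minimal standalone sketch in C. It is not planner code: the function names `child_rows_sum` and `child_rows_avg` are hypothetical, and the row counts used in the note below are made up purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a PostgreSQL function): total of the per-child
 * row estimates, i.e. the "sum of the children's row counts" option.
 */
double
child_rows_sum(const double *child_rows, size_t nchildren)
{
    double sum = 0.0;

    for (size_t i = 0; i < nchildren; i++)
        sum += child_rows[i];
    return sum;
}

/*
 * Hypothetical helper: mean of the per-child row estimates, i.e. the
 * "average of child target relation's rows" option.
 */
double
child_rows_avg(const double *child_rows, size_t nchildren)
{
    assert(nchildren > 0);
    return child_rows_sum(child_rows, nchildren) / (double) nchildren;
}
```

With made-up child estimates of 10, 1000 and 1000000 rows, the sum is 1001010 and the average is 333670. Neither number is close to what the two smaller children would actually see, which is the one-size-fits-all concern in the message above.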
On 2019/02/21 0:50, Tom Lane wrote: > However, there are other reasons why I'm not really happy with the > approach proposed in this patch. > > The main problem is that cloning the PlannerInfo while still sharing a lot > of infrastructure between the clones is a horrid hack that I think will be > very buggy and unmaintainable. We've gotten away with it so far in > inheritance_planner because (a) the complexity is all local to that > function and (b) the cloning happens very early in the planning process, > so that there's not much shared subsidiary data to worry about; mostly > just the parse tree, which in fact isn't shared because the first thing > we do is push it through adjust_appendrel_attrs. This patch proposes > to clone much later, and both the actual cloning and the consequences > of that are spread all over, and I don't think we're nearly done with > the consequences :-(. I found the parameterized-path problem while > wondering why it was okay to not worry about adjusting attrs in data > structures used during path construction for other baserels ... turns > out it isn't. There's a lot of other stuff in PlannerInfo that you're > not touching, for instance pathkeys and placeholders; and I'm afraid > much of that represents either current bugs or future problems. > > So what I feel we should do is set this aside for now and see if we > can make something of the other idea I proposed. If we could get > rid of expand-inheritance-at-the-top altogether, and plan inherited > UPDATE/DELETE the same as inherited SELECT, that would be a large > reduction in planner complexity, hence much to be preferred over this > approach which is a large increase in planner complexity. If that > approach crashes and burns, we can come back to this. OK, I agree that the other approach might be a better way forward. 
It'll not just improve the performance in an elegant manner, but will also make other projects more feasible, such as, MERGE, what Fujita-san mentioned on the other thread, etc. > There might be parts of this work we can salvage, though. It seems > like the idea of postponing expand_inherited_tables() might be > something we could use anyway. +1. So, let's try to do things in this order: 1. Make inheritance-expansion-at-bottom case perform better now, addressing at least SELECT performance in PG 12, provided we manage to get the patches in order in time (I'll try to post the updated lazy-inheritance-expansion patch later this week.) 2. Overhaul inherited UPDATE/DELETE planning to use inheritance-expansion-at-bottom (PG 13) It's unfortunate that UPDATE/DELETE won't perform as well as SELECTs even couple of releases after declarative partitioning was introduced, but I agree that we should solve the underlying issues in an elegant way. Thanks, Amit
On 2019/02/21 11:31, Amit Langote wrote: > On 2019/02/21 0:50, Tom Lane wrote: >> There might be parts of this work we can salvage, though. It seems >> like the idea of postponing expand_inherited_tables() might be >> something we could use anyway. > > +1. So, let's try to do things in this order: > > 1. Make inheritance-expansion-at-bottom case perform better now, > addressing at least SELECT performance in PG 12, provided we manage to get > the patches in order in time (I'll try to post the updated > lazy-inheritance-expansion patch later this week.) I have updated the inheritance expansion patch. Patch 0001 rewrites optimizer/utils/inherit.c, so that it allows inheritance expansion to be invoked from make_one_rel(). Although the rewrite in this version of the patch is a bit different from earlier versions, because I needed to account for the fact that inheritance_planner (whose rewrite I'm withdrawing) will use the same expansion code to expand target inheritance. So, the code now needs to treat source-inheritance-expansion and target-inheritance-expansion cases a bit differently. I wanted to polish the code and various comments a bit more because the rewritten expansion code looks different from earlier versions as I mentioned above, but I needed to rush out today due to a family emergency and won't be able to reply until Wednesday next week. Sorry. Thanks, Amit
Attachment
On Fri, Feb 22, 2019 at 09:45:38PM +0900, Amit Langote wrote:
> I have updated the inheritance expansion patch.
>
> Patch 0001 rewrites optimizer/utils/inherit.c, so that it allows

Thanks for your continued work on this.

I applied v23 patch and imported one of our customers' schema, and ran explain on a table with 210 partitions. With patch applied there are 10x fewer system calls, as intended.

with patch:
    173 pread64
     76 lseek
     47 open
     38 brk

without patch:
   1276 lseek
    693 pread64
    647 open
    594 brk

> + if (IS_SIMPLE_REL(rel1) && child_rel1 == NULL)
> +     child_rel1 = build_dummy_partition_rel(root, rel1, baserel1,
> +                                            cnt_parts);
> + if (IS_SIMPLE_REL(rel1) && child_rel2 == NULL)
> +     child_rel2 = build_dummy_partition_rel(root, rel2, baserel2,
> +                                            cnt_parts);

Should 2nd "if" say IS_SIMPLE_REL(rel2) ?

Justin
Hi Justin, Thanks for checking. On Sat, Feb 23, 2019 at 1:59 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Fri, Feb 22, 2019 at 09:45:38PM +0900, Amit Langote wrote: > > I have updated the inheritance expansion patch. > > > > Patch 0001 rewrites optimizer/utils/inherit.c, so that it allows > > Thanks for your continued work on this. > > I applied v23 patch and imported one of our customers' schema, and ran explain > on a table with 210 partitions. With patch applied there are 10x fewer system > calls, as intended. > > with patch: > 173 pread64 > 76 lseek > 47 open > 38 brk > > without patch: > 1276 lseek > 693 pread64 > 647 open > 594 brk OK, great. I guess you were running SELECT? Just in case you missed, the patch to improve UPDATE/DELETE scalability is no longer part of this patch series. > > + if (IS_SIMPLE_REL(rel1) && child_rel1 == NULL) > > + child_rel1 = build_dummy_partition_rel(root, rel1, baserel1, > > + cnt_parts); > > + if (IS_SIMPLE_REL(rel1) && child_rel2 == NULL) > > + child_rel2 = build_dummy_partition_rel(root, rel2, baserel2, > > + cnt_parts); > > Should 2nd "if" say IS_SIMPLE_REL(rel2) ? Good catch, fixed. Apparently not guarded by a test, but I haven't bothered to add new tests with this patch series. Please find attached updated patches. I've made a few updates in last couple of hours such as improving comments, fixing a few thinkos in inheritance_planner changes, etc. Thanks, Amit
Attachment
On Sat, Feb 23, 2019 at 02:54:35AM +0900, Amit Langote wrote: > > On Fri, Feb 22, 2019 at 09:45:38PM +0900, Amit Langote wrote: > > > I have updated the inheritance expansion patch. > > > > > > Patch 0001 rewrites optimizer/utils/inherit.c, so that it allows > > > > Thanks for your continued work on this. > > > > I applied v23 patch and imported one of our customers' schema, and ran explain > > on a table with 210 partitions. With patch applied there are 10x fewer system > > calls, as intended. > > OK, great. I guess you were running SELECT? Just in case you missed, > the patch to improve UPDATE/DELETE scalability is no longer part of > this patch series. Yes, I understand the scope is reduced. Actually, when I tested last month, I don't think I realized that UPDATE/DELETE was included. Otherwise I would've also tested to see if it resolves the excessive RAM use with many partitions, with the explanation given that query is being replanned for every partition. I set target version to v12. https://commitfest.postgresql.org/22/1778/ Justin
Hi Amit-san.

On Fri, Feb 22, 2019 at 5:55 PM, Amit Langote wrote:
>
> Please find attached updated patches. I've made a few updates in last
> couple of hours such as improving comments, fixing a few thinkos in
> inheritance_planner changes, etc.

Thanks for the patch.

While doing code review of v24-0001, I found a performance degradation case.

[creating tables]
drop table rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 3) x;
\gexec
\o

drop table if exists jrt;
create table jrt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table jrt' || x::text || ' partition of jrt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[update_pt_with_joining_another_pt.sql]
update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;

[pgbench]
pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60 postgres

[results]
(part_num_rt, part_num_jrt)  master  patched(0001)
---------------------------  ------  -------------
(3, 1024)                    8.06    5.89
(3, 2048)                    1.52    0.87
(6, 1024)                    4.11    1.77

With HEAD, we add target inheritance and source inheritance to parse->rtable in inheritance_planner and copy and adjust it for child tables at beginning of each planning of child tables.

With the 0001 patch, we add target inheritance to parse->rtable in inheritance_planner and add source inheritance to parse->rtable in make_one_rel (under grouping_planner()) during each planning of child tables.
Adding source inheritance to parse->rtable may be the same process between each planning of child tables and it might be useless. OTOH, from the POV of making inheritance-expansion-at-bottom better, expanding source inheritance in make_one_rel seems correct design to me.

How should we do that...?

--
Yoshikazu Imai
Imai-san,

Thanks for testing and sorry it took me a while to reply.

On 2019/02/25 15:24, Imai, Yoshikazu wrote:
> [update_pt_with_joining_another_pt.sql]
> update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;
>
> [pgbench]
> pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60 postgres
>
> [results]
> (part_num_rt, part_num_jrt)  master  patched(0001)
> ---------------------------  ------  -------------
> (3, 1024)                    8.06    5.89
> (3, 2048)                    1.52    0.87
> (6, 1024)                    4.11    1.77
>
> With HEAD, we add target inheritance and source inheritance to parse->rtable in inheritance_planner and copy and adjust it for child tables at beginning of each planning of child tables.
>
> With the 0001 patch, we add target inheritance to parse->rtable in inheritance_planner and add source inheritance to parse->rtable in make_one_rel (under grouping_planner()) during each planning of child tables.
> Adding source inheritance to parse->rtable may be the same process between each planning of child tables and it might be useless. OTOH, from the POV of making inheritance-expansion-at-bottom better, expanding source inheritance in make_one_rel seems correct design to me.
>
> How should we do that...?

To solve this problem, I ended up teaching inheritance_planner to reuse the objects for *source* inheritance child relations (RTEs, AppendRelInfos, and PlanRowMarks) created during the planning of the 1st child query for the planning of subsequent child queries. The set of source child relations doesn't change between different planning runs, so it's okay to do so. Of course, I had to make sure that query_planner (which is not in the charge of adding source inheritance child objects) can notice that.

Please find attached updated patches. Will update source code comments, commit message and perform other fine-tuning over the weekend if possible.

Thanks,
Amit
Attachment
On 2019/03/01 22:01, Amit Langote wrote: > Of course, I had to make sure that query_planner (which is not > in the charge of adding source inheritance child objects) can notice that. Oops, I meant to write "query_planner (which *is* in the charge of adding source inheritance child objects)..." Thanks, Amit
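The effect of reusing source child objects across child queries can be pictured with a small counting model. This is a standalone sketch in C, not inheritance_planner code; the function `count_source_children_built` and its parameters are hypothetical and only count how often a set of source children would be built.

```c
#include <assert.h>

/*
 * Hypothetical model, not PostgreSQL code: returns how many source child
 * objects get built while planning ntargets target child queries, each of
 * which joins to a source table that has nsources partitions.
 *
 * reuse == 0 models rebuilding the source children for every child query;
 * reuse != 0 models building them once, during the first child query.
 */
int
count_source_children_built(int ntargets, int nsources, int reuse)
{
    int built = 0;
    int have_cached = 0;    /* set once the first child query built them */

    assert(ntargets >= 0 && nsources >= 0);

    for (int t = 0; t < ntargets; t++)
    {
        if (reuse && have_cached)
            continue;       /* later child queries reuse the first set */
        built += nsources;
        have_cached = 1;
    }
    return built;
}
```

For the (3, 1024) case from the benchmark, the model builds 3072 source child objects without reuse and 1024 with it. It says nothing about the real cost of each object, only about how the count scales with the number of target children.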
Amit-san, On Fri, Mar 1, 2019 at 1:02 PM, Amit Langote wrote: > Thanks for testing and sorry it took me a while to reply. Thanks for working about this late at night. I know you have a lot of things to do. > On 2019/02/25 15:24, Imai, Yoshikazu wrote: > > [update_pt_with_joining_another_pt.sql] > > update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b; > > > > [pgbench] > > pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60 > > postgres > > > > [results] > > (part_num_rt, part_num_jrt) master patched(0001) > > --------------------------- ------ ------------- > > (3, 1024) 8.06 5.89 > > (3, 2048) 1.52 0.87 > > (6, 1024) 4.11 1.77 > > > > > > > > With HEAD, we add target inheritance and source inheritance to > parse->rtable in inheritance_planner and copy and adjust it for child > tables at beginning of each planning of child tables. > > > > With the 0001 patch, we add target inheritance to parse->rtable in > inheritance_planner and add source inheritance to parse->rtable in > make_one_rel(under grouping_planner()) during each planning of child > tables. > > Adding source inheritance to parse->rtable may be the same process > between each planning of child tables and it might be useless. OTOH, from > the POV of making inheritance-expansion-at-bottom better, expanding > source inheritance in make_one_rel seems correct design to me. > > > > How should we do that...? > > To solve this problem, I ended up teaching inheritance_planner to reuse > the objects for *source* inheritance child relations (RTEs, > AppendRelInfos, and PlanRowMarks) created during the planning of the 1st > child query for the planning of subsequent child queries. Set of source > child relations don't change between different planning runs, so it's > okay to do so. Of course, I had to make sure that query_planner (which > is not in the charge of adding source inheritance child objects) can notice > that. I did above test again with v25 patch and checked the problem is solved. 
[results]
(part_num_rt, part_num_jrt)  master  patched(0001)
---------------------------  ------  -------------
(3, 1024)                    6.11    6.82
(3, 2048)                    1.05    1.48
(6, 1024)                    3.05    3.45

Sorry that I haven't checked the code related to this problem yet, but I'll check it later.

> Please find attached updated patches. Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I've taken a glance at the code and have some minor comments about the patch.

* You changed the usage/arguments of get_relation_info, but why did you do it? I reverted that code to the original and checked that it passes make check. So ISTM there are no logical problems with not changing it. Or if you change it, how about also changing the usage/arguments of get_relation_info_hook in the same way?

-get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
-                  RelOptInfo *rel)
+get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
 {
+    bool        inhparent = rte->inh;
-    relation = table_open(relationObjectId, NoLock);
+    relation = heap_open(rte->relid, NoLock);
 ...
     if (get_relation_info_hook)
-        (*get_relation_info_hook) (root, relationObjectId, inhparent, rel);
+        (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);

@@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
     /* Check type of rtable entry */
     switch (rte->rtekind)
     {
         case RTE_RELATION:
             /* Table --- retrieve statistics from the system catalogs */
-            get_relation_info(root, rte->relid, rte->inh, rel);
+            get_relation_info(root, rte, rel);

* You moved the code that initializes the append rel's partitioned_child_rels from set_append_rel_size() to build_simple_rel(), but is it better to do so? I also checked that the original code passes make check by doing like above.

@@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
     Assert(IS_SIMPLE_REL(rel));

     /*
-     * Initialize partitioned_child_rels to contain this RT index.
-     *
-     * Note that during the set_append_rel_pathlist() phase, we will bubble up
-     * the indexes of partitioned relations that appear down in the tree, so
-     * that when we've created Paths for all the children, the root
-     * partitioned table's list will contain all such indexes.
-     */
-    if (rte->relkind == RELKIND_PARTITIONED_TABLE)
-        rel->partitioned_child_rels = list_make1_int(rti);

@@ -274,55 +327,287 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
                                    list_length(rte->securityQuals));

     /*
-     * If this rel is an appendrel parent, recurse to build "other rel"
-     * RelOptInfos for its children. They are "other rels" because they are
-     * not in the main join tree, but we will need RelOptInfos to plan access
-     * to them.
+     * Add the parent to partitioned_child_rels.
+     *
+     * Note that during the set_append_rel_pathlist() phase, values of the
+     * indexes of partitioned relations that appear down in the tree will
+     * be bubbled up into root parent's list so that when we've created
+     * Paths for all the children, the root table's list will contain all
+     * such indexes.
      */
-    if (rte->inh)
+    if (rel->part_scheme)
+        rel->partitioned_child_rels = list_make1_int(relid);

I'll review the rest of the code.

--
Yoshikazu Imai
Hi,

On 2019/03/01 22:01, Amit Langote wrote:
> Please find attached updated patches. Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I realized when "fine-tuning" that the patch 0001 contained too many changes that seem logically separable. I managed to divide it into the following patches, which also amounts to a much shorter overall diff against the master. Also, the smaller individual patches made it easier to spot a lot of useless diffs of inherit.c.

Attached patches are as follows:

1. Create the "otherrel" RelOptInfos of the appendrels as a separate step of query_planner. Newly added function add_other_rels_to_query() which adds "other rel" RelOptInfos of child relations is called after query_planner has finished distributing restrict clauses to baserels. Child relations in this case include both those of flattened UNION ALL subqueries and inheritance child tables. Child RangeTblEntrys and AppendRelInfos are added early in subquery_planner for both types of child relations. Of the two, we'd like to delay adding the inheritance children, which is done in the next patch.

See patch 0001.

2. Defer inheritance expansion to add_other_rels_to_query(). Although the purpose of doing this is to perform partition pruning before adding the children, this patch doesn't change when the pruning occurs. It deals with other issues that must be taken care of due to adding children during query_planner instead of during subquery_planner. Especially, inheritance_planner now has to add the child target relations on its own. Also, delaying adding children also affects adding junk columns to the query's targetlist based on PlanRowMarks, because preprocess_targetlist can no longer finalize which junk columns to add for a "parent" PlanRowMark; that must be delayed until all child PlanRowMarks are added and their allMarkTypes propagated to the parent PlanRowMark.

See patch 0002.

3. Because inheritance_planner calls query_planner separately for each target child relation and the patch 0002 above puts query_planner in charge of inheritance expansion, that means child tables of source inheritance sets will be added as many times as there are target children. This makes update/delete queries containing inherited source tables somewhat slow. This patch adjusts inheritance_planner to reuse source inheritance children added during the planning of 1st child query for the planning of subsequent child queries.

See patch 0003.

4. Now that all the issues arising due to late inheritance expansion have been taken care of, this patch moves where partition pruning occurs. Today it's in set_append_rel_size() and this patch moves it to expand_partitioned_rtentry() where partitions are added. Only the partitions remaining after pruning are added, so some members of part_rels can remain NULL. Some of the places that access partition RelOptInfos using that array needed to be made aware of that.

See patch 0004.

5. There are a few places which loop over *all* members of part_rels array of a partitioned parent's RelOptInfo to do something with the partition rels. Some of those loops run even for point-lookup queries where only one partition would be accessed, which is inefficient. This patch adds a Bitmapset member named 'live_parts' to RelOptInfo, whose value is the set of indexes of unpruned partitions in the parent's RelOptInfo. The aforementioned loops are now replaced by looping over live_parts Bitmapset instead.

See patch 0005.

6. set_relation_partition_info today copies the PartitionBoundInfo from the relcache using partition_bounds_copy. Doing partition_bounds_copy gets expensive as the number of partitions increases and it really doesn't seem necessary for the planner to create its own copy. This patch removes the partition_bounds_copy() and simply uses the relcache pointer.

See patch 0006.

Hope that the above division makes the changes easier to review.
Thanks, Amit
Attachment
- v26-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v26-0002-Delay-adding-inheritance-child-tables-until-quer.patch
- v26-0003-Adjust-inheritance_planner-to-reuse-source-child.patch
- v26-0004-Perform-pruning-in-expand_partitioned_rtentry.patch
- v26-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v26-0006-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
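The idea behind patch 0005 (visit only the unpruned partition indexes rather than every slot of part_rels) can be sketched with a toy bit set. This is a standalone illustration in C, not the patch's code: `LivePartSet` and its functions are hypothetical stand-ins limited to 64 partitions, whereas PostgreSQL's Bitmapset is variable-length and has its own API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a set of partition indexes (0..63 only). */
typedef struct LivePartSet
{
    uint64_t    bits;
} LivePartSet;

/* Mark partition index idx as live (i.e. not pruned). */
void
live_parts_add(LivePartSet *set, int idx)
{
    assert(idx >= 0 && idx < 64);
    set->bits |= (UINT64_C(1) << idx);
}

/*
 * Return the smallest live index greater than prev, or -1 if there is
 * none.  Pass prev = -1 to start the iteration.
 */
int
live_parts_next(const LivePartSet *set, int prev)
{
    for (int i = prev + 1; i < 64; i++)
    {
        if (set->bits & (UINT64_C(1) << i))
            return i;
    }
    return -1;
}

/*
 * Copy up to max live indexes into out, in ascending order, and return
 * how many were written.  A caller would use these indexes to look up
 * only the non-NULL entries of its partition array.
 */
int
collect_live_parts(const LivePartSet *set, int *out, int max)
{
    int n = 0;
    int i = -1;

    while (n < max && (i = live_parts_next(set, i)) >= 0)
        out[n++] = i;
    return n;
}
```

For a point lookup that leaves, say, only partition 41 unpruned, a caller iterating this way touches one partition entry instead of checking every slot of the array for NULL.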
Imai-san,

Thanks for the review.

On 2019/03/04 18:14, Imai, Yoshikazu wrote:
> I've taken a glance at the code and have some minor comments about the patch.
>
> * You changed the usage/arguments of get_relation_info, but why did you do it? I reverted that code to the original and checked that it passes make check. So ISTM there are no logical problems with not changing it. Or if you change it, how about also changing the usage/arguments of get_relation_info_hook in the same way?
>
> -get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
> -                  RelOptInfo *rel)
> +get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
>  {
> +    bool        inhparent = rte->inh;
> -    relation = table_open(relationObjectId, NoLock);
> +    relation = heap_open(rte->relid, NoLock);
>  ...
>      if (get_relation_info_hook)
> -        (*get_relation_info_hook) (root, relationObjectId, inhparent, rel);
> +        (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);
>
> @@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
>      /* Check type of rtable entry */
>      switch (rte->rtekind)
>      {
>          case RTE_RELATION:
>              /* Table --- retrieve statistics from the system catalogs */
> -            get_relation_info(root, rte->relid, rte->inh, rel);
> +            get_relation_info(root, rte, rel);
>
> * You moved the code that initializes the append rel's partitioned_child_rels from set_append_rel_size() to build_simple_rel(), but is it better to do so? I also checked that the original code passes make check by doing like above.
>
> @@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
>      Assert(IS_SIMPLE_REL(rel));
>
>      /*
> -     * Initialize partitioned_child_rels to contain this RT index.
> -     *
> -     * Note that during the set_append_rel_pathlist() phase, we will bubble up
> -     * the indexes of partitioned relations that appear down in the tree, so
> -     * that when we've created Paths for all the children, the root
> -     * partitioned table's list will contain all such indexes.
> -     */
> -    if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> -        rel->partitioned_child_rels = list_make1_int(rti);
>
> @@ -274,55 +327,287 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
>                                     list_length(rte->securityQuals));
>
>      /*
> -     * If this rel is an appendrel parent, recurse to build "other rel"
> -     * RelOptInfos for its children. They are "other rels" because they are
> -     * not in the main join tree, but we will need RelOptInfos to plan access
> -     * to them.
> +     * Add the parent to partitioned_child_rels.
> +     *
> +     * Note that during the set_append_rel_pathlist() phase, values of the
> +     * indexes of partitioned relations that appear down in the tree will
> +     * be bubbled up into root parent's list so that when we've created
> +     * Paths for all the children, the root table's list will contain all
> +     * such indexes.
>       */
> -    if (rte->inh)
> +    if (rel->part_scheme)
> +        rel->partitioned_child_rels = list_make1_int(relid);

Neither of these changes is present in the latest patches I posted, where I also got rid of a lot of unnecessary diffs.

Thanks,
Amit
Hi Amit,

Passes check-world.

On 3/4/19 5:38 AM, Amit Langote wrote:
> See patch 0001.
>

+* members of any appendrels we find here are added built later when

s/built//

> See patch 0002.
>

+ grouping_planner(root, false, 0.0 /* retrieve all tuples */);

Move comment out of function call.

+ if (root->simple_rte_array[childRTindex])
+     elog(ERROR, "rel %d already exists", childRTindex);
+ root->simple_rte_array[childRTindex] = childrte;
+ if (root->append_rel_array[childRTindex])
+     elog(ERROR, "child relation %d already exists", childRTindex);
+ root->append_rel_array[childRTindex] = appinfo;

Could the "if"s be made into Assert's instead ?

+ * the newly added bytes with zero

Extra spaces

+ if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)

s/TAB/space

> See patch 0003.
>

+ * because they correspond to flattneed UNION ALL subqueries. Especially,

s/flattneed/flatten

> See patch 0004.
>

+ * no need to make any new entries, because anything that'd need those

Use "would" explicit

+ * this case, since it needn't be scanned.

, since it doesn't need to be scanned

> See patch 0005.
>
> See patch 0006.
>

I'll run some tests using a hash partitioned setup.

Best regards,
Jesper
Hi Jesper, Thanks for the review. I've made all of the changes per your comments in my local repository. I'll post the updated patches after diagnosing what I'm suspecting a memory over-allocation bug in one of the patches. If you configure build with --enable-cassert, you'll see that throughput decreases as number of partitions run into many thousands, but it doesn't when asserts are turned off. On 2019/03/05 1:20, Jesper Pedersen wrote: > I'll run some tests using a hash partitioned setup. Thanks. Regards, Amit
Amit-san, On Tue, Mar 5, 2019 at 0:51 AM, Amit Langote wrote: > Hi Jesper, > > Thanks for the review. I've made all of the changes per your comments > in my local repository. I'll post the updated patches after diagnosing > what I'm suspecting a memory over-allocation bug in one of the patches. > If you configure build with --enable-cassert, you'll see that throughput > decreases as number of partitions run into many thousands, but it doesn't > when asserts are turned off. > > On 2019/03/05 1:20, Jesper Pedersen wrote: > > I'll run some tests using a hash partitioned setup. > > Thanks. I've also done code review of 0001 and 0002 patch so far. [0001] 1. Do we need to open/close a relation in add_appendrel_other_rels()? + if (rel->part_scheme) { - ListCell *l; ... - Assert(cnt_parts == nparts); + rel->part_rels = (RelOptInfo **) + palloc0(sizeof(RelOptInfo *) * rel->nparts); + relation = table_open(rte->relid, NoLock); } + if (relation) + table_close(relation, NoLock); 2. We can sophisticate the usage of Assert in add_appendrel_other_rels(). + if (rel->part_scheme) { ... + rel->part_rels = (RelOptInfo **) + palloc0(sizeof(RelOptInfo *) * rel->nparts); + relation = table_open(rte->relid, NoLock); } ... + i = 0; + foreach(l, root->append_rel_list) + { ... + if (rel->part_scheme != NULL) + { + Assert(rel->nparts > 0 && i < rel->nparts); + rel->part_rels[i] = childrel; + } + + i++; as below; + if (rel->part_scheme) { ... Assert(rel->nparts > 0); + rel->part_rels = (RelOptInfo **) + palloc0(sizeof(RelOptInfo *) * rel->nparts); + relation = table_open(rte->relid, NoLock); } ... + i = 0; + foreach(l, root->append_rel_list) + { ... + if (rel->part_scheme != NULL) + { + Assert(i < rel->nparts); + rel->part_rels[i] = childrel; + } + + i++; [0002] 3. If using variable like is_source_inh_expansion, the code would be easy to read. (I might mistakenly understand root->simple_rel_array_size> 0 means source inheritance expansion though.) 
In expand_inherited_rtentry() and expand_partitioned_rtentry(): + * Expand planner arrays for adding the child relations. Can't do + * this if we're not being called from query_planner. + */ + if (root->simple_rel_array_size > 0) + { + /* Needed when creating child RelOptInfos. */ + rel = find_base_rel(root, rti); + expand_planner_arrays(root, list_length(inhOIDs)); + } + /* Create the otherrel RelOptInfo too. */ + if (rel) + (void) build_simple_rel(root, childRTindex, rel); would be: + * Expand planner arrays for adding the child relations. Can't do + * this if we're not being called from query_planner. + */ + if (is_source_inh_expansion) + { + /* Needed when creating child RelOptInfos. */ + rel = find_base_rel(root, rti); + expand_planner_arrays(root, list_length(inhOIDs)); + } + /* Create the otherrel RelOptInfo too. */ + if (is_source_inh_expansion) + (void) build_simple_rel(root, childRTindex, rel); 4. I didn't see much carefully, but should the introduction of contains_inherit_children be in 0003 patch...? I'll continue to do code review of rest patches. -- Yoshikazu Imai
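The quoted hunks call expand_planner_arrays() before building child RelOptInfos, and an earlier review comment in this thread quotes a patch comment about filling "the newly added bytes with zero". A grow-and-zero step of that general kind can be sketched as follows. This is a standalone illustration in C using realloc, under the assumption that new slots must start out NULL; `grow_ptr_array` is a hypothetical name and the sketch is not the patch's implementation, which works on the planner's own arrays and memory contexts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not PostgreSQL code: enlarge an array of pointers
 * from old_size to old_size + add_size slots and zero the new tail, so
 * that every newly added slot reads as NULL (this relies on all-zero bits
 * being a null pointer, which holds on mainstream platforms).
 *
 * Returns the possibly moved array, or NULL on allocation failure, in
 * which case the original array is left untouched and still owned by the
 * caller.
 */
void **
grow_ptr_array(void **array, size_t old_size, size_t add_size)
{
    size_t      new_size = old_size + add_size;
    void      **result;

    assert(new_size > 0);

    result = realloc(array, new_size * sizeof(void *));
    if (result == NULL)
        return NULL;

    /* Existing entries are preserved by realloc; only the tail is cleared. */
    memset(result + old_size, 0, add_size * sizeof(void *));
    return result;
}
```

Because the new slots start out NULL, a caller can check whether a slot is already occupied before storing into it, which is the kind of check the "already exists" tests quoted earlier in the thread perform.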
Imai-san,

Thanks for checking.

On 2019/03/05 15:03, Imai, Yoshikazu wrote:
> I've also done code review of 0001 and 0002 patch so far.
>
> [0001]
> 1. Do we need to open/close a relation in add_appendrel_other_rels()?
>
> +   if (rel->part_scheme)
>     {
> -       ListCell *l;
> ...
> -       Assert(cnt_parts == nparts);
> +       rel->part_rels = (RelOptInfo **)
> +           palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +       relation = table_open(rte->relid, NoLock);
>     }
>
> +   if (relation)
> +       table_close(relation, NoLock);

Ah, it should be moved to another patch. Actually, to patch 0003, which
makes it necessary to inspect the PartitionDesc.

> 2. We can sophisticate the usage of Assert in add_appendrel_other_rels().
>
> +   if (rel->part_scheme)
>     {
> ...
> +       rel->part_rels = (RelOptInfo **)
> +           palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +       relation = table_open(rte->relid, NoLock);
>     }
> ...
> +   i = 0;
> +   foreach(l, root->append_rel_list)
> +   {
> ...
> +       if (rel->part_scheme != NULL)
> +       {
> +           Assert(rel->nparts > 0 && i < rel->nparts);
> +           rel->part_rels[i] = childrel;
> +       }
> +
> +       i++;
>
> as below;
>
> +   if (rel->part_scheme)
>     {
> ...
>         Assert(rel->nparts > 0);
> +       rel->part_rels = (RelOptInfo **)
> +           palloc0(sizeof(RelOptInfo *) * rel->nparts);
> +       relation = table_open(rte->relid, NoLock);
>     }
> ...
> +   i = 0;
> +   foreach(l, root->append_rel_list)
> +   {
> ...
> +       if (rel->part_scheme != NULL)
> +       {
> +           Assert(i < rel->nparts);
> +           rel->part_rels[i] = childrel;
> +       }
> +
> +       i++;

You're right. That's what makes sense in this context.

> [0002]
> 3. If using variable like is_source_inh_expansion, the code would be easy to read. (I might mistakenly understand root->simple_rel_array_size > 0 means source inheritance expansion though.)

root->simple_rel_array_size > 0 *does* mean source inheritance expansion,
so you're not mistaken. As you can see, expand_inherited_rtentry is
called by inheritance_planner() to expand target inheritance and by
add_appendrel_other_rels() to expand source inheritance. Since the latter
is called by query_planner, simple_rel_array would have been initialized
and hence root->simple_rel_array_size > 0 is true.

Maybe it'd be a good idea to introduce an is_source_inh_expansion variable
for clarity as you suggest and set it to (root->simple_rel_array_size > 0).

> 4. I didn't see much carefully, but should the introduction of contains_inherit_children be in 0003 patch...?

You're right.

Thanks again for the comments. I've made changes to my local repository.

Thanks,
Amit
On 2019/03/05 9:50, Amit Langote wrote:
> I'll post the updated patches after diagnosing what
> I'm suspecting a memory over-allocation bug in one of the patches. If you
> configure build with --enable-cassert, you'll see that throughput
> decreases as number of partitions run into many thousands, but it doesn't
> when asserts are turned off.

Attached an updated version. This incorporates fixes for both Jesper's
and Imai-san's review. I haven't been able to pin down the bug (or
whatever) that makes throughput go down as the partition count increases,
when tested with a --enable-cassert build.

Thanks,
Amit
Attachment
- v27-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v27-0002-Delay-adding-inheritance-child-tables-until-quer.patch
- v27-0003-Adjust-inheritance_planner-to-reuse-source-child.patch
- v27-0004-Perform-pruning-in-expand_partitioned_rtentry.patch
- v27-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v27-0006-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
On 2019/03/04 19:38, Amit Langote wrote:
> 2. Defer inheritance expansion to add_other_rels_to_query(). Although the
> purpose of doing this is to perform partition pruning before adding the
> children, this patch doesn't change when the pruning occurs. It deals
> with other issues that must be taken care of due to adding children during
> query_planner instead of during subquery_planner. Especially,
> inheritance_planner now has to add the child target relations on its own.
> Also, delaying adding children also affects adding junk columns to the
> query's targetlist based on PlanRowMarks, because preprocess_targetlist
> can no longer finalize which junk columns to add for a "parent"
> PlanRowMark; that must be delayed until all child PlanRowMarks are added
> and their allMarkTypes propagated to the parent PlanRowMark.

I thought more on this and started wondering why we can't call
preprocess_targetlist() from query_planner() instead of from
grouping_planner()? We don't have to treat parent row marks specially if
preprocess_targetlist() is called after adding other rels (and hence all
child row marks). This will change the order in which expressions are
added to baserels targetlists and hence the order of expressions in their
Path's targetlist, because the expressions contained in targetlist
(including RETURNING) and other junk expressions will be added after
expressions referenced in WHERE clauses, whereas the order is reversed
today. But if we do what I propose above, the order will be uniform for
all cases, that is, not one for regular table baserels and another for
inherited table baserels.

Thoughts?

Thanks,
Amit
On 3/5/19 5:24 AM, Amit Langote wrote:
> Attached an updated version. This incorporates fixes for both Jesper's
> and Imai-san's review. I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count increases,
> when tested with a --enable-cassert build.
>

Thanks!

I'm seeing the throughput going down as well, but are you sure it isn't
just the extra calls of MemoryContextCheck you are seeing? A flamegraph
diff highlights that area -- sent offline.

A non cassert build shows the same profile for 64 and 1024 partitions.

Best regards,
Jesper
On 2019/03/06 0:57, Jesper Pedersen wrote:
> On 3/5/19 5:24 AM, Amit Langote wrote:
>> Attached an updated version. This incorporates fixes for both Jesper's
>> and Imai-san's review. I haven't been able to pin down the bug (or
>> whatever) that makes throughput go down as the partition count increases,
>> when tested with a --enable-cassert build.
>>
>
> Thanks !
>
> I'm seeing the throughput going down as well, but are you sure it isn't
> just the extra calls of MemoryContextCheck you are seeing ? A flamegraph
> diff highlights that area -- sent offline.
>
> A non cassert build shows the same profile for 64 and 1024 partitions.

Thanks for testing.

I now see what's happening here. I was aware that it's MemoryContextCheck
overhead but didn't quite understand why the time it takes should increase
as the number of partitions increases, especially given that the patch
makes it so that only one partition is accessed if that's what the query
needs to do.

What I had forgotten however is that MemoryContextCheck checks *all*
contexts in every transaction, including the "partition descriptor"
context which stores a partitioned table's PartitionDesc. PartitionDesc
gets bigger as the number of partitions increases, and so does the time to
check the context it's allocated in.

So, the decrease in throughput in a cassert build as the number of
partitions increases is not due to any fault of this patch series as I had
suspected. Phew!

Thanks,
Amit
Amit-san,

On Tue, Mar 5, 2019 at 10:24 AM, Amit Langote wrote:
> On 2019/03/05 9:50, Amit Langote wrote:
> > I'll post the updated patches after diagnosing what I'm suspecting a
> > memory over-allocation bug in one of the patches. If you configure
> > build with --enable-cassert, you'll see that throughput decreases as
> > number of partitions run into many thousands, but it doesn't when
> > asserts are turned off.
>
> Attached an updated version. This incorporates fixes for both Jesper's
> and Imai-san's review.

Thanks for updating patches!
Here is the code review for previous v26 patches.

[0002]
In expand_inherited_rtentry():

expand_inherited_rtentry()
{
    ...
+   RelOptInfo *rel = NULL;

can be declared later:

    if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    ...
    else
    {
        List *appinfos = NIL;
        RangeTblEntry *childrte;
        Index childRTindex;
+       RelOptInfo *rel = NULL;

[0003]
In inheritance_planner:

+   rtable_with_target = list_copy(root->parse->rtable);

can be:

+   rtable_with_target = list_copy(parse->rtable);

[0004 or 0005]
There is redundant processing in add_appendrel_other_rels (or expand_xxx_rtentry()?).
I modified add_appendrel_other_rels like below and it actually worked.

add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
{
    ListCell *l;
    RelOptInfo *rel;

    /*
     * Add inheritance children to the query if not already done. For child
     * tables that are themselves partitioned, their children will be added
     * recursively.
     */
    if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
    {
        expand_inherited_rtentry(root, rte, rti);
        return;
    }

    rel = find_base_rel(root, rti);

    foreach(l, root->append_rel_list)
    {
        AppendRelInfo *appinfo = lfirst(l);
        Index childRTindex = appinfo->child_relid;
        RangeTblEntry *childrte;
        RelOptInfo *childrel;

        if (appinfo->parent_relid != rti)
            continue;

        Assert(childRTindex < root->simple_rel_array_size);
        childrte = root->simple_rte_array[childRTindex];
        Assert(childrte != NULL);
        build_simple_rel(root, childRTindex, rel);

        /* Child may itself be an inherited relation. */
        if (childrte->inh)
            add_appendrel_other_rels(root, childrte, childRTindex);
    }
}

> and Imai-san's review. I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count increases,
> when tested with a --enable-cassert build.

I didn't investigate that problem, but there is another memory increase issue, which is because of the 0003 patch I think. I'll try to solve the latter issue.

--
Yoshikazu Imai
On Wed, Mar 6, 2019 at 2:10 AM, Imai, Yoshikazu wrote: > > and Imai-san's review. I haven't been able to pin down the bug (or > > whatever) that makes throughput go down as the partition count > > increases, when tested with a --enable-cassert build. > > I didn't investigate that problem, but there is another memory increase > issue, which is because of 0003 patch I think. I'll try to solve the latter > issue. Ah, I noticed Amit-san identified the cause of problem with --enable-cassert build just now. -- Yoshikazu Imai
Imai-san,

Thanks for the review.

On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> Here is the code review for previous v26 patches.
>
> [0002]
> In expand_inherited_rtentry():
>
> expand_inherited_rtentry()
> {
>     ...
> +   RelOptInfo *rel = NULL;
>
> can be declared at more later:
>
>     if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
>     ...
>     else
>     {
>         List *appinfos = NIL;
>         RangeTblEntry *childrte;
>         Index childRTindex;
> +       RelOptInfo *rel = NULL;
>
>

This is fixed in v27.

> [0003]
> In inheritance_planner:
>
> +   rtable_with_target = list_copy(root->parse->rtable);
>
> can be:
>
> +   rtable_with_target = list_copy(parse->rtable);

Sure.

> [0004 or 0005]
> There are redundant process in add_appendrel_other_rels (or expand_xxx_rtentry()?).
> I modified add_appendrel_other_rels like below and it actually worked.
>
>
> add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
> {
>     ListCell *l;
>     RelOptInfo *rel;
>
>     /*
>      * Add inheritance children to the query if not already done. For child
>      * tables that are themselves partitioned, their children will be added
>      * recursively.
>      */
>     if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
>     {
>         expand_inherited_rtentry(root, rte, rti);
>         return;
>     }
>
>     rel = find_base_rel(root, rti);
>
>     foreach(l, root->append_rel_list)
>     {
>         AppendRelInfo *appinfo = lfirst(l);
>         Index childRTindex = appinfo->child_relid;
>         RangeTblEntry *childrte;
>         RelOptInfo *childrel;
>
>         if (appinfo->parent_relid != rti)
>             continue;
>
>         Assert(childRTindex < root->simple_rel_array_size);
>         childrte = root->simple_rte_array[childRTindex];
>         Assert(childrte != NULL);
>         build_simple_rel(root, childRTindex, rel);
>
>         /* Child may itself be an inherited relation. */
>         if (childrte->inh)
>             add_appendrel_other_rels(root, childrte, childRTindex);
>     }
> }

If you don't initialize part_rels here, then it will break any code in the
planner that expects a partitioned rel with non-zero number of partitions
to have part_rels set to non-NULL. At the moment, that includes the code
that implements partitioning-specific optimizations such as partitionwise
aggregate and join, run-time pruning etc. No existing regression tests
cover the cases where source inherited rels participate in those
optimizations, so you didn't find anything that broke. For an example,
consider the following update query:

update p set a = p1.a + 1 from p p1 where p1.a = (select 1);

Planner will create a run-time pruning aware Append node for p (aliased
p1), for which it will need to use the part_rels array. Note that p in
this case is a source relation which the above code initializes.

Maybe we should add such regression tests.

> I didn't investigate that problem, but there is another memory increase issue, which is because of 0003 patch I think. I'll try to solve the latter issue.

Interested in details as it seems to be a separate problem.

Thanks,
Amit
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
> On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> > [0004 or 0005]
> > There are redundant process in add_appendrel_other_rels (or expand_xxx_rtentry()?).
> > I modified add_appendrel_other_rels like below and it actually worked.
> >
> >
> > add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti)
> > {
> >     ListCell *l;
> >     RelOptInfo *rel;
> >
> >     /*
> >      * Add inheritance children to the query if not already done. For child
> >      * tables that are themselves partitioned, their children will be added
> >      * recursively.
> >      */
> >     if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
> >     {
> >         expand_inherited_rtentry(root, rte, rti);
> >         return;
> >     }
> >
> >     rel = find_base_rel(root, rti);
> >
> >     foreach(l, root->append_rel_list)
> >     {
> >         AppendRelInfo *appinfo = lfirst(l);
> >         Index childRTindex = appinfo->child_relid;
> >         RangeTblEntry *childrte;
> >         RelOptInfo *childrel;
> >
> >         if (appinfo->parent_relid != rti)
> >             continue;
> >
> >         Assert(childRTindex < root->simple_rel_array_size);
> >         childrte = root->simple_rte_array[childRTindex];
> >         Assert(childrte != NULL);
> >         build_simple_rel(root, childRTindex, rel);
> >
> >         /* Child may itself be an inherited relation. */
> >         if (childrte->inh)
> >             add_appendrel_other_rels(root, childrte, childRTindex);
> >     }
> > }
>
> If you don't initialize part_rels here, then it will break any code in
> the planner that expects a partitioned rel with non-zero number of
> partitions to have part_rels set to non-NULL. At the moment, that
> includes the code that implements partitioning-specific optimizations
> such as partitionwise aggregate and join, run-time pruning etc. No existing
> regression tests cover the cases where source inherited rels
> participate in those optimizations, so you didn't find anything that
> broke. For an example, consider the following update query:
>
> update p set a = p1.a + 1 from p p1 where p1.a = (select 1);
>
> Planner will create a run-time pruning aware Append node for p (aliased
> p1), for which it will need to use the part_rels array. Note that p in
> this case is a source relation which the above code initializes.
>
> Maybe we should add such regression tests.

Ah, now I understand that the code following the expand_inherited_rtentry()
call initializes part_rels for the child queries after the first child
target, and that those part_rels are used in partitioning-specific
optimizations. Thanks for the explanation.

--
Yoshikazu Imai
Hi, On 3/5/19 5:24 AM, Amit Langote wrote: > Attached an updated version. Need a rebase due to 898e5e3290. Best regards, Jesper
On 2019/03/08 2:21, Jesper Pedersen wrote: > Hi, > > On 3/5/19 5:24 AM, Amit Langote wrote: >> Attached an updated version. > > Need a rebase due to 898e5e3290. Rebased patches attached. Thanks, Amit
Attachment
- v28-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v28-0002-Delay-adding-inheritance-child-tables-until-quer.patch
- v28-0003-Adjust-inheritance_planner-to-reuse-source-child.patch
- v28-0004-Perform-pruning-in-expand_partitioned_rtentry.patch
- v28-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v28-0006-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
...
> > I didn't investigate that problem, but there is another memory
> > increase issue, which is because of 0003 patch I think. I'll try to
> > solve the latter issue.
>
> Interested in details as it seems to be a separate problem.

I solved this problem.
I think we don't need to do list_copy in the below code.

+       /*
+        * No need to copy of the RTEs themselves, but do copy the List
+        * structure.
+        */
+       subroot->parse->rtable = list_copy(rtable_with_target);

Because subroot->parse->rtable will be copied again by:

        subroot->parse = (Query *)
            adjust_appendrel_attrs(parent_root,
-                                  (Node *) parent_parse,
+                                  (Node *) subroot->parse,
                                   1, &appinfo);

So I modified the code and ran a test to confirm that the memory increase doesn't happen. The test and results are below.

[test]
* Create partitioned table with 1536 partitions.
* Execute "update rt set a = random();"

[results]
A backend uses the following amount of memory in the update transaction:

HEAD: 807MB
With v26-0001, 0002: 790MB
With v26-0001, 0002, 0003: 860MB
With v26-0003 modified: 790MB

I attached the diff of the modification for the v26-0003 patch, which also contains some refactoring.
Please see if it is ok.
(Sorry, it is a modification for the v26 patch though the latest ones are v28.)

--
Yoshikazu Imai
Attachment
On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> So I modified the code and ran a test to confirm that the memory increase doesn't happen. The test and results are below.
>
> [test]
> * Create partitioned table with 1536 partitions.
> * Execute "update rt set a = random();"
>
> [results]
> A backend uses the following amount of memory in the update transaction:
>
> HEAD: 807MB
> With v26-0001, 0002: 790MB
> With v26-0001, 0002, 0003: 860MB
> With v26-0003 modified: 790MB

Can you measure with v28, or better attached v29 (about which more below)?

> I attached the diff of the modification for the v26-0003 patch, which also contains some refactoring.
> Please see if it is ok.

I like the is_first_child variable which somewhat improves readability, so
updated the patch to use it.

Maybe you know that range_table_mutator() spends quite a long time if
there are many target children, but I realized there's no need for
range_table_mutator() to copy/mutate child target RTEs. First, there's
nothing to translate in their case. Second, copying them is not necessary
either, because they're not modified across different planning cycles. If
we pass to adjust_appendrel_attrs only the RTEs in the original range
table (that is, before any child target RTEs were added), then
range_table_mutator() has to do significantly less work and allocates a
lot less memory than before. I've broken this change into its own patch;
see patch 0004.

Thanks,
Amit
Attachment
- v29-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v29-0002-Delay-adding-inheritance-child-tables-until-quer.patch
- v29-0003-Adjust-inheritance_planner-to-reuse-source-child.patch
- v29-0004-Further-tweak-inheritance_planner-to-avoid-usele.patch
- v29-0005-Perform-pruning-in-expand_partitioned_rtentry.patch
- v29-0006-Teach-planner-to-only-process-unpruned-partition.patch
- v29-0007-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
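As a rough illustration of the idea behind patch 0004 described in the message above (copy only the original range table entries for each child's planning cycle, and share the child target entries by pointer), here is a minimal C sketch. It is not PostgreSQL code: ToyRTE, toy_build_cycle_rtable() and toy_free_cycle_rtable() are hypothetical names invented for this example, and a real RangeTblEntry and range_table_mutator() are far more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for a range table entry. */
typedef struct ToyRTE
{
    int relid;
} ToyRTE;

/*
 * Build the range table used for one child's planning cycle.
 *
 * Entries [0, n_orig) are the original range table: they are copied,
 * because planning may modify them.  Entries [n_orig, n_total) are child
 * target entries: they are shared by pointer, on the assumption (taken
 * from the thread) that they are not modified across planning cycles.
 *
 * Returns the number of entries copied, or -1 on allocation failure.
 */
static int
toy_build_cycle_rtable(ToyRTE **rtable, int n_total, int n_orig,
                       ToyRTE **out)
{
    int i;

    for (i = 0; i < n_orig; i++)
    {
        out[i] = malloc(sizeof(ToyRTE));
        if (out[i] == NULL)
        {
            /* Release the copies made so far before giving up. */
            while (i-- > 0)
                free(out[i]);
            return -1;
        }
        *out[i] = *rtable[i];
    }

    /* Child target entries are shared, not copied. */
    for (; i < n_total; i++)
        out[i] = rtable[i];

    return n_orig;
}

/* Free only the copied prefix; shared entries belong to the caller. */
static void
toy_free_cycle_rtable(ToyRTE **out, int n_orig)
{
    for (int i = 0; i < n_orig; i++)
        free(out[i]);
}
```

In this toy model, a table with one original entry and 1536 child target entries copies 1 entry per planning cycle instead of 1537. How much the real patch saves depends on what range_table_mutator() does per entry, which this sketch does not model.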
Amit-san,

On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> > So I modified the code and ran a test to confirm that the memory
> > increase doesn't happen. The test and results are below.
> >
> > [test]
> > * Create partitioned table with 1536 partitions.
> > * Execute "update rt set a = random();"
> >
> > [results]
> > A backend uses the following amount of memory in the update transaction:
> >
> > HEAD: 807MB
> > With v26-0001, 0002: 790MB
> > With v26-0001, 0002, 0003: 860MB
> > With v26-0003 modified: 790MB
>
> Can you measure with v28, or better attached v29 (about which more below)?
>
> > I attached the diff of the modification for the v26-0003 patch, which
> > also contains some refactoring.
> > Please see if it is ok.
>
> I like the is_first_child variable which somewhat improves readability,
> so updated the patch to use it.
>
> Maybe you know that range_table_mutator() spends quite a long time if
> there are many target children, but I realized there's no need for
> range_table_mutator() to copy/mutate child target RTEs. First, there's
> nothing to translate in their case. Second, copying them is not necessary
> either, because they're not modified across different planning cycles. If
> we pass to adjust_appendrel_attrs only the RTEs in the original range
> table (that is, before any child target RTEs were added), then
> range_table_mutator() has to do significantly less work and allocates a
> lot less memory than before. I've broken this change into its own patch;
> see patch 0004.

Cool!
I tested with the v29 patches and confirmed that they save a lot of memory.

HEAD: 807MB
With v29-0001, 0002, 0003, 0004: 167MB

Maybe planning time in this case is also improved, but I didn't check it.

> but I realized there's no need for
> range_table_mutator() to copy/mutate child target RTEs. First, there's
> nothing to translate in their case. Second, copying them is not necessary
> either, because they're not modified across different planning cycles.

Yeah, although I couldn't check the code in detail, from the comments below in inheritance_planner(), ISTM we need copies of subquery RTEs but need not copy other RTEs in each planning cycle.

    /*
     * We generate a modified instance of the original Query for each target
     * relation, plan that, and put all the plans into a list that will be
     * controlled by a single ModifyTable node. All the instances share the
     * same rangetable, but each instance must have its own set of subquery
     * RTEs within the finished rangetable because (1) they are likely to get
     * scribbled on during planning, and (2) it's not inconceivable that
     * subqueries could get planned differently in different cases. We need
     * not create duplicate copies of other RTE kinds, in particular not the
     * target relations, because they don't have either of those issues. Not
     * having to duplicate the target relations is important because doing so
     * (1) would result in a rangetable of length O(N^2) for N targets, with
     * at least O(N^3) work expended here; and (2) would greatly complicate
     * management of the rowMarks list.

Thanks
--
Yoshikazu Imai
On 2019/03/05 19:25, Amit Langote wrote:
> On 2019/03/04 19:38, Amit Langote wrote:
>> 2. Defer inheritance expansion to add_other_rels_to_query(). ...
>>
>> Also, delaying adding children also affects adding junk columns to the
>> query's targetlist based on PlanRowMarks, because preprocess_targetlist
>> can no longer finalize which junk columns to add for a "parent"
>> PlanRowMark; that must be delayed until all child PlanRowMarks are added
>> and their allMarkTypes propagated to the parent PlanRowMark.
>
> I thought more on this and started wondering why we can't call
> preprocess_targetlist() from query_planner() instead of from
> grouping_planner()? We don't have to treat parent row marks specially if
> preprocess_targetlist() is called after adding other rels (and hence all
> child row marks). This will change the order in which expressions are
> added to baserels targetlists and hence the order of expressions in their
> Path's targetlist, because the expressions contained in targetlist
> (including RETURNING) and other junk expressions will be added after
> expressions referenced in WHERE clauses, whereas the order is reversed
> today. But if we do what I propose above, the order will be uniform for
> all cases, that is, not one for regular table baserels and another for
> inherited table baserels.

I wrestled with this idea a bit and concluded that we don't have to
postpone *all* of preprocess_targetlist() processing to query_planner,
only the part that adds row mark "junk" Vars, because only those matter
for the problem being solved. To recap, the problem is that delaying
adding inheritance children (and hence their row marks if any) means we
can't add "junk" columns needed to implement row marks, because which ones
to add is not clear until we've seen all the children.

I propose that we delay the addition of "junk" Vars to query_planner() so
that it doesn't stand in the way of deferring inheritance expansion to
query_planner() too. That means the order of reltarget expressions for
row-marked rels will change, but I assume that's OK. At least it will be
consistent for both non-inherited baserels and inherited ones.

Attached updated version of the patches with the above proposal
implemented by patch 0002. To summarize, the patches are as follows:

0001: make building of "other rels" a separate step that runs after
deconstruct_jointree(), implemented by a new subroutine of query_planner
named add_other_rels_to_query()

0002: delay adding "junk" Vars until after add_other_rels_to_query()

0003: delay inheritance expansion to add_other_rels_to_query()

0004, 0005: adjust inheritance_planner() to account for the fact that
inheritance is now expanded by query_planner(), not subquery_planner()

0006: perform partition pruning while inheritance is being expanded,
instead of during set_append_rel_size()

0007: add a 'live_parts' field to RelOptInfo to store partition indexes
(not RT indexes) of unpruned partitions, which speeds up looping over
part_rels array of the partitioned parent

0008: avoid copying PartitionBoundInfo struct for planning

Thanks,
Amit
Attachment
- v30-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v30-0002-Add-rowmark-junk-targetlist-entries-in-query_pla.patch
- v30-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v30-0004-Adjust-inheritance_planner-to-reuse-source-child.patch
- v30-0005-Further-tweak-inheritance_planner-to-avoid-usele.patch
- v30-0006-Perform-pruning-in-expand_partitioned_rtentry.patch
- v30-0007-Teach-planner-to-only-process-unpruned-partition.patch
- v30-0008-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
Hi Amit,

On 3/12/19 4:22 AM, Amit Langote wrote:
> I wrestled with this idea a bit and concluded that we don't have to
> postpone *all* of preprocess_targetlist() processing to query_planner,
> only the part that adds row mark "junk" Vars, because only those matter
> for the problem being solved. To recap, the problem is that delaying
> adding inheritance children (and hence their row marks if any) means we
> can't add "junk" columns needed to implement row marks, because which ones
> to add is not clear until we've seen all the children.
>
> I propose that we delay the addition of "junk" Vars to query_planner() so
> that it doesn't stand in the way of deferring inheritance expansion to
> query_planner() too. That means the order of reltarget expressions for
> row-marked rels will change, but I assume that's OK. At least it will be
> consistent for both non-inherited baserels and inherited ones.
>
> Attached updated version of the patches with the above proposal
> implemented by patch 0002. To summarize, the patches are as follows:
>
> 0001: make building of "other rels" a separate step that runs after
> deconstruct_jointree(), implemented by a new subroutine of query_planner
> named add_other_rels_to_query()
>
> 0002: delay adding "junk" Vars until after add_other_rels_to_query()
>
> 0003: delay inheritance expansion to add_other_rels_to_query()
>
> 0004, 0005: adjust inheritance_planner() to account for the fact that
> inheritance is now expanded by query_planner(), not subquery_planner()
>
> 0006: perform partition pruning while inheritance is being expanded,
> instead of during set_append_rel_size()
>
> 0007: add a 'live_parts' field to RelOptInfo to store partition indexes
> (not RT indexes) of unpruned partitions, which speeds up looping over
> part_rels array of the partitioned parent
>
> 0008: avoid copying PartitionBoundInfo struct for planning
>

After applying 0004, check-world fails with the attached. CFBot agrees [1].

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/505107884

Best regards,
Jesper
Attachment
Thanks for the heads up. On Tue, Mar 12, 2019 at 10:23 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > After applying 0004 check-world fails with the attached. CFBot agrees [1]. Fixed. I had forgotten to re-run postgres_fdw tests after making a change just before submitting. Thanks, Amit
Attachment
- v31-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v31-0005-Further-tweak-inheritance_planner-to-avoid-usele.patch
- v31-0004-Adjust-inheritance_planner-to-reuse-source-child.patch
- v31-0002-Add-rowmark-junk-targetlist-entries-in-query_pla.patch
- v31-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v31-0006-Perform-pruning-in-expand_partitioned_rtentry.patch
- v31-0007-Teach-planner-to-only-process-unpruned-partition.patch
- v31-0008-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
Amit-san,

On Tue, Mar 12, 2019 at 2:34 PM, Amit Langote wrote:
> Thanks for the heads up.
>
> On Tue, Mar 12, 2019 at 10:23 PM Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
> > After applying 0004 check-world fails with the attached. CFBot agrees [1].
>
> Fixed. I had forgotten to re-run postgres_fdw tests after making a change
> just before submitting.

I have done code review of v31 patches from 0001 to 0004.
Below I describe what I confirmed and my thoughts.

0001: This seems ok.

0002:
* I don't really know whether delaying adding resjunk output columns to the plan affects any processing in the planner. From the comments of PlanRowMark, those columns are used only in the executor, so I think delaying adding junk Vars wouldn't be harmful. Is that right?

0003:
* Is there a need to call CreatePartitionDirectory() if rte->inh becomes false?

+       else if (rte->rtekind == RTE_RELATION && rte->inh)
+       {
+           rte->inh = has_subclass(rte->relid);
+
+           /*
+            * While at it, initialize the PartitionDesc infrastructure for
+            * this query, if not done yet.
+            */
+           if (root->glob->partition_directory == NULL)
+               root->glob->partition_directory =
+                   CreatePartitionDirectory(CurrentMemoryContext);
+       }

0004:
* In addition to what the commit message describes, this patch also changes the method of adjusting AppendRelInfos that reference the subquery RTEs, and the new one seems logically correct.

* In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so the code concatenating the append_rel_list lists may be able to be moved before ChangeVarNodes(), and then the code concatenating the lists of rowmarks, rtable and append_rel_list can be in one block (or one bunch).

--
Yoshikazu Imai
Amit-san,

I have done code review of v31 patches from 0004 to 0008.

0004:
* s/childern/children

0005:
* This seems reasonable for not using a lot of memory in a specific case, although it needs a careful look by planner experts.

0006:
* The code initializing/setting RelOptInfo's part_rels looks a bit complicated, but I haven't come up with any good design so far.

0007:
* This changes some processing from a "for loop" to "while(bms_next_member())", which speeds up processing when we scan few partitions in one statement, but when we scan a lot of partitions in one statement, its performance will likely degrade. I measured the performance of both cases.
I executed a select statement on a table which has 4096 partitions.

[scanning 1 partition]
Without 0007 : 3,450 TPS
With 0007    : 3,723 TPS

[scanning 4096 partitions]
Without 0007 : 10.8 TPS
With 0007    : 10.5 TPS

In the above result, performance degrades by 3% when scanning 4096 partitions, comparing before and after applying the 0007 patch. I think when scanning a lot of tables, executor time would also be longer, so the increase in planner time would be relatively small compared to it. So we might not have to care about this performance degradation.

0008:
This seems ok.

--
Yoshikazu Imai
Imai-san, On 2019/03/13 19:34, Imai, Yoshikazu wrote: > I have done code review of v31 patches from 0001 to 0004. > I described below what I confirmed or thoughts. Thanks for checking. > 0001: This seems ok. > > 0002: > * I don't really know that delaying adding resjunk output columns to the plan doesn't affect any process in the planner.From the comments of PlanRowMark, those columns are used in only the executor so I think delaying adding junk Varswouldn't be harm, is that right? I think so. The only visible change in behavior is that "rowmark" junk columns are now placed later in the final plan's targetlist. > 0003: > * Is there need to do CreatePartitionDirectory() if rte->inh becomes false? > > + else if (rte->rtekind == RTE_RELATION && rte->inh) > + { > + rte->inh = has_subclass(rte->relid); > + > + /* > + * While at it, initialize the PartitionDesc infrastructure for > + * this query, if not done yet. > + */ > + if (root->glob->partition_directory == NULL) > + root->glob->partition_directory = > + CreatePartitionDirectory(CurrentMemoryContext); > + } We need to have create "partition directory" before we access a partitioned table's PartitionDesc for planning. These days, we rely solely on PartitionDesc to determine a partitioned table's children. So, we need to see the PartitionDesc before we can tell there are zero children and set inh to false. In other words, we need the "partition directory" to have been set up in advance. > 0004: > * In addition to commit message, this patch also changes the method of adjusting AppendRelInfos which reference to thesubquery RTEs, and new one seems logically correct. Do you mean I should mention it in the patch's commit message? > * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so the codes concatenating lists of append_rel_listmay be able to be moved before doing ChangeVarNodes() and then the codes concatenating lists of rowmarks,rtable and append_rel_list can be in one block (or one bunch). 
Yeah, perhaps. I'll check. On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, > I have done code review of v31 patches from 0004 to 0008. > > 0004: > * s/childern/children Will fix. > 0005: > * This seems reasonable for not using a lot of memory in specific case, > although it needs special looking of planner experts. Certainly. > 0006: > * The codes initializing/setting RelOptInfo's part_rels looks like a bit > complicated, but I didn't come up with any good design so far. I guess you mean in add_appendrel_other_rels(). The other option is not have that code there and expand partitioned tables freshly for every target child, which we got rid of in patch 0004. But we don't want to do that. > 0007: > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the > performance of both cases. > I executed select statement to the table which has 4096 partitions. > > [scanning 1 partition] > Without 0007 : 3,450 TPS > With 0007 : 3,723 TPS > > [scanning 4096 partitions] > Without 0007 : 10.8 TPS > With 0007 : 10.5 TPS > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we > might not have to care this performance degradation. Well, live_parts bitmapset is optimized for queries scanning only few partitions. It's true that it's slightly more expensive than a simple for loop over part_rels, but not too much as you're also saying. Thanks, Amit
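The trade-off discussed for 0007 can be illustrated with a small standalone sketch. It is not the PostgreSQL Bitmapset code: DemoLiveParts, demo_next_member() and the two counting helpers are hypothetical names made up for this illustration, and only the calling pattern (start at -1, loop until a negative result) is meant to resemble how bms_next_member() is used. The plain loop touches every slot of a part_rels-like array, while the bitmap loop runs its body only for unpruned partitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NPARTS 4096                /* partition count used in the thread's tests */
#define DEMO_NWORDS (DEMO_NPARTS / 64)

/* Hypothetical stand-in for a bitmap of unpruned partition indexes. */
typedef struct DemoLiveParts
{
    uint64_t words[DEMO_NWORDS];
} DemoLiveParts;

static void
demo_add_member(DemoLiveParts *set, int index)
{
    assert(index >= 0 && index < DEMO_NPARTS);
    set->words[index / 64] |= (uint64_t) 1 << (index % 64);
}

/*
 * Return the smallest member greater than prev, or -1 if there is none.
 * Callers start with prev = -1.
 */
static int
demo_next_member(const DemoLiveParts *set, int prev)
{
    int i = prev + 1;

    while (i < DEMO_NPARTS)
    {
        /* Bit 0 of "rest" corresponds to index i. */
        uint64_t rest = set->words[i / 64] >> (i % 64);

        if (rest == 0)
        {
            i = (i / 64 + 1) * 64;      /* nothing left in this word: skip it */
            continue;
        }
        while ((rest & 1) == 0)
        {
            rest >>= 1;
            i++;
        }
        return i;
    }
    return -1;
}

/* Plain loop: visits every slot, whether the partition was pruned (NULL) or not. */
static int
demo_count_by_array(void *const part_rels[], int nparts, long *steps)
{
    int live = 0;

    for (int i = 0; i < nparts; i++)
    {
        (*steps)++;
        if (part_rels[i] != NULL)
            live++;
    }
    return live;
}

/* Bitmap loop: the body runs once per unpruned partition; empty words are skipped. */
static int
demo_count_by_bitmap(const DemoLiveParts *live_parts, long *steps)
{
    int live = 0;
    int i = -1;

    while ((i = demo_next_member(live_parts, i)) >= 0)
    {
        (*steps)++;
        live++;
    }
    return live;
}
```

With three unpruned partitions out of 4096, the array loop takes 4096 steps and the bitmap loop takes 3. When every partition is unpruned, both loops take 4096 steps but the bitmap version pays extra bit manipulation per step, which is consistent with the small slowdown reported for the scan-everything case.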
On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote: > 0007: > * This changes some processes using "for loop" to using "while(bms_next_member())" which speeds up processing when we scan few partitions in one statement, but when we scan a lot of partitions in one statement, its performance will likely degraded. I measured the performance of both cases. > I executed select statement to the table which has 4096 partitions. > > [scanning 1 partition] > Without 0007 : 3,450 TPS > With 0007 : 3,723 TPS > > [scanning 4096 partitions] > Without 0007 : 10.8 TPS > With 0007 : 10.5 TPS > > In the above result, performance degrades 3% in case of scanning 4096 partitions compared before and after applying 0007 patch. I think when scanning a lot of tables, executor time would be also longer, so the increasement of planner time would be relatively smaller than it. So we might not have to care this performance degradation. I think it's better to focus on the fewer partitions case due to the fact that execution initialisation time and actual execution are likely to take much longer when more partitions are scanned. I did some work on run-time pruning to tune it for this case. Tom did make a similar argument in [1] and I explained my reasoning in [2]. bms_next_member has gotten a good performance boost since then and the cases are not exactly the same, since in the old version the loop in run-time pruning checked bms_is_member(), but the fact is, we did end up tuning for the few partitions case in the end. However, it would be good to see the performance results for plan+execution time of say a table with 4k parts looking up a single indexed value. You could have two columns, one that's the partition key which allows the pruning to take place, and one that's not and results in scanning all partitions. I'll be surprised if you even notice the difference between with and without 0007 with the latter case.
[1] https://www.postgresql.org/message-id/16107.1542307838%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/CAKJS1f8ZnAW9VJNpJW16t5CtXSq3eAseyJXdumLaYb8DiTbhXA%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Amit-san, On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: > > 0002: > > * I don't really know that delaying adding resjunk output columns to > the plan doesn't affect any process in the planner. From the comments > of PlanRowMark, those columns are used in only the executor so I think > delaying adding junk Vars wouldn't be harm, is that right? > > I think so. The only visible change in behavior is that "rowmark" junk > columns are now placed later in the final plan's targetlist. ok, thanks. > > 0003: > > * Is there need to do CreatePartitionDirectory() if rte->inh becomes > false? > > > > + else if (rte->rtekind == RTE_RELATION && rte->inh) > > + { > > + rte->inh = has_subclass(rte->relid); > > + > > + /* > > + * While at it, initialize the PartitionDesc > infrastructure for > > + * this query, if not done yet. > > + */ > > + if (root->glob->partition_directory == NULL) > > + root->glob->partition_directory = > > + > CreatePartitionDirectory(CurrentMemoryContext); > > + } > > We need to have create "partition directory" before we access a > partitioned table's PartitionDesc for planning. These days, we rely > solely on PartitionDesc to determine a partitioned table's children. So, > we need to see the PartitionDesc before we can tell there are zero children > and set inh to false. In other words, we need the "partition directory" > to have been set up in advance. Ah, I see. > > 0004: > > * In addition to commit message, this patch also changes the method > of adjusting AppendRelInfos which reference to the subquery RTEs, and > new one seems logically correct. > > Do you mean I should mention it in the patch's commit message? Actually I firstly thought that it's better to mention it in the patch's commit message, but I didn't mean that here. 
I only wanted to state that I confirmed new method of adjusting AppendRelInfos seems logically correct :) > > * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, > so the codes concatenating lists of append_rel_list may be able to be > moved before doing ChangeVarNodes() and then the codes concatenating > lists of rowmarks, rtable and append_rel_list can be in one block (or > one bunch). > > Yeah, perhaps. I'll check. > > On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, > > I have done code review of v31 patches from 0004 to 0008. > > > > 0004: > > * s/childern/children > > Will fix. > > > 0005: > > * This seems reasonable for not using a lot of memory in specific > > case, although it needs special looking of planner experts. > > Certainly. Thanks for these. > > 0006: > > * The codes initializing/setting RelOptInfo's part_rels looks like a > > bit complicated, but I didn't come up with any good design so far. > > I guess you mean in add_appendrel_other_rels(). The other option is not > have that code there and expand partitioned tables freshly for every > target child, which we got rid of in patch 0004. But we don't want to > do that. Yeah, that's right. > > 0007: > > * This changes some processes using "for loop" to using > > "while(bms_next_member())" which speeds up processing when we scan few > > partitions in one statement, but when we scan a lot of partitions in > > one statement, its performance will likely degraded. I measured the > > performance of both cases. > > I executed select statement to the table which has 4096 partitions. > > > > [scanning 1 partition] > > Without 0007 : 3,450 TPS > > With 0007 : 3,723 TPS > > > > [scanning 4096 partitions] > > Without 0007 : 10.8 TPS > > With 0007 : 10.5 TPS > > > > In the above result, performance degrades 3% in case of scanning 4096 > > partitions compared before and after applying 0007 patch.
I think when > > scanning a lot of tables, executor time would be also longer, so the > > increasement of planner time would be relatively smaller than it. So > > we might not have to care this performance degradation. > > Well, live_parts bitmapset is optimized for queries scanning only few > partitions. It's true that it's slightly more expensive than a simple > for loop over part_rels, but not too much as you're also saying. Thanks for the comments. -- Yoshikazu Imai
Hi, David On Thu, Mar 14, 2019 at 9:04 AM, David Rowley wrote: > On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu > <imai.yoshikazu@jp.fujitsu.com> wrote: > > 0007: > > * This changes some processes using "for loop" to using > "while(bms_next_member())" which speeds up processing when we scan few > partitions in one statement, but when we scan a lot of partitions in one > statement, its performance will likely degraded. I measured the > performance of both cases. > > I executed select statement to the table which has 4096 partitions. > > > > [scanning 1 partition] > > Without 0007 : 3,450 TPS > > With 0007 : 3,723 TPS > > > > [scanning 4096 partitions] > > Without 0007 : 10.8 TPS > > With 0007 : 10.5 TPS > > > > In the above result, performance degrades 3% in case of scanning 4096 > partitions compared before and after applying 0007 patch. I think when > scanning a lot of tables, executor time would be also longer, so the > increasement of planner time would be relatively smaller than it. So we > might not have to care this performance degradation. > > I think it's better to focus on the fewer partitions case due to the fact > that execution initialisation time and actual execution are likely to > take much longer when more partitions are scanned. I did some work on > run-time pruning to tune it for this case. Tom did make a similar argument > in [1] and I explained my reasoning in [2]. Thanks for quoting these threads. Actually, I recalled this argument, so I tested this just to make sure. > bms_next_member has gotten a good performance boost since then and the > cases are not exactly the same since the old version the loop in run-time > pruning checked bms_is_member(), but the fact is, we did end up tuning > for the few partitions case in the end. Wow, I didn't know that. > However, it would be good to see the performance results for > plan+execution time of say a table with 4k parts looking up a single > indexed value. 
You could have two columns, one that's the partition key > which allows the pruning to take place, and one that's not and results > in scanning all partitions. I'll be surprised if you even notice the > difference between with and without 0007 with the latter case.

So I tested the performance of plan+execution time.

[set up table and indexes]
create table rt (a int, b int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

create index b_idx on rt (b);

insert into rt select a, b from generate_series(1, 4096) a, generate_series(1, 1000) b;

[select_indexed_values.sql]
\set b random(1, 1000)
select count(*) from rt where b = :b;

[pgbench]
pgbench -n -f select_indexed_values.sql -T 60 postgres

[results]
Without 0007: 3.18 TPS (3.25, 3.13, 3.15)
With 0007:    3.21 TPS (3.21, 3.23, 3.18)

From the results, we didn't see any performance degradation in this case. Actually, the performance increased by 1% after applying 0007, but that would be just a measurement error.
So, generally, we can think that the performance difference between bms_next_member and a for loop is absorbed by other processing (execution initialisation and actual execution) when scanning many partitions.

--
Yoshikazu Imai
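The "measurement error" reading can be checked with simple arithmetic on the figures quoted in this thread. The sketch below is only a calculation aid; demo_percent_change() and demo_run_spread_percent() are hypothetical helper names, not part of pgbench or PostgreSQL.

```c
#include <assert.h>

/* Relative change from "before" to "after", in percent (positive means faster). */
static double
demo_percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Spread between the slowest and fastest run, as a percentage of the mean. */
static double
demo_run_spread_percent(const double runs[], int nruns)
{
    double min, max, sum;

    assert(nruns > 0);
    min = max = sum = runs[0];
    for (int i = 1; i < nruns; i++)
    {
        if (runs[i] < min)
            min = runs[i];
        if (runs[i] > max)
            max = runs[i];
        sum += runs[i];
    }
    return (max - min) / (sum / nruns) * 100.0;
}
```

Feeding in the reported values, demo_percent_change(3.18, 3.21) is a little under 1%, while the three "without 0007" runs (3.25, 3.13, 3.15) already differ from each other by nearly 4% of their mean, so the difference between the two builds is smaller than the run-to-run noise.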
Amit-san, I have another little comments about v31-patches. * We don't need is_first_child in inheritance_planner(). On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > On 2019/03/08 16:16, Imai, Yoshikazu wrote: > > I attached the diff of modification for v26-0003 patch which also > contains some refactoring. > > Please see if it is ok. > > I like the is_first_child variable which somewhat improves readability, > so updated the patch to use it. I noticed that detecting first child query in inheritance_planner() is already done by "final_rtable == NIL" /* * If this is the first non-excluded child, its post-planning rtable * becomes the initial contents of final_rtable; otherwise, append * just its modified subquery RTEs to final_rtable. */ if (final_rtable == NIL) So I think we can use that instead of using is_first_child. -- Yoshikazu Imai
Imai-san, On 2019/03/15 14:40, Imai, Yoshikazu wrote: > Amit-san, > > I have another little comments about v31-patches. > > * We don't need is_first_child in inheritance_planner(). > > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: >> On 2019/03/08 16:16, Imai, Yoshikazu wrote: >>> I attached the diff of modification for v26-0003 patch which also >> contains some refactoring. >>> Please see if it is ok. >> >> I like the is_first_child variable which somewhat improves readability, >> so updated the patch to use it. > > I noticed that detecting first child query in inheritance_planner() is already done by "final_rtable == NIL" > > /* > * If this is the first non-excluded child, its post-planning rtable > * becomes the initial contents of final_rtable; otherwise, append > * just its modified subquery RTEs to final_rtable. > */ > if (final_rtable == NIL) > > So I think we can use that instead of using is_first_child. That's a good suggestion. One problem I see with that is that final_rtable is set only once we've found the first *non-dummy* child. So, if all the children except the last one are dummy, we'd end up never reusing the source child objects. Now, if final_rtable was set for the first child irrespective of whether it turned out to be dummy or not, which it seems OK to do, then we can implement your suggestion. Thanks, Amit
Amit-san, On Mon, Mar 18, 2019 at 9:56 AM, Amit Langote wrote: > On 2019/03/15 14:40, Imai, Yoshikazu wrote: > > Amit-san, > > > > I have another little comments about v31-patches. > > > > * We don't need is_first_child in inheritance_planner(). > > > > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote: > >> On 2019/03/08 16:16, Imai, Yoshikazu wrote: > >>> I attached the diff of modification for v26-0003 patch which also > >> contains some refactoring. > >>> Please see if it is ok. > >> > >> I like the is_first_child variable which somewhat improves > >> readability, so updated the patch to use it. > > > > I noticed that detecting first child query in inheritance_planner() > is already done by "final_rtable == NIL" > > > > /* > > * If this is the first non-excluded child, its post-planning > rtable > > * becomes the initial contents of final_rtable; otherwise, > append > > * just its modified subquery RTEs to final_rtable. > > */ > > if (final_rtable == NIL) > > > > So I think we can use that instead of using is_first_child. > > That's a good suggestion. One problem I see with that is that > final_rtable is set only once we've found the first *non-dummy* child. Ah... I overlooked that. > So, if all the children except the last one are dummy, we'd end up never > reusing the source child objects. Now, if final_rtable was set for the > first child irrespective of whether it turned out to be dummy or not, > which it seems OK to do, then we can implement your suggestion. I thought you mean we can remove or move below code to under setting final_rtable. /* * If this child rel was excluded by constraint exclusion, exclude it * from the result plan. */ if (IS_DUMMY_REL(sub_final_rel)) continue; It seems logically ok, but I also thought there are the case where useless process happens. If we execute below update statement, final_rtable would be unnecessarily expanded by adding placeholder. * partition table rt with 1024 partitions. * partition table pt with 2 partitions. 
* update rt set c = ss.c from ( select b from pt union all select b + 3 from pt) ss where a = 1024 and rt.b = ss.b;

After all, I think it might be better to use is_first_child because the meanings of is_first_child and final_rtable are different.
is_first_child == true represents that we are currently processing the first child query, and final_rtable == NIL represents that we haven't yet found the first non-excluded child query.

--
Yoshikazu Imai
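The difference between the two conditions can be shown with a simplified model of the per-child loop. This is not the actual inheritance_planner() code: demo_plan_children() is a hypothetical function, the child_is_dummy array stands in for the IS_DUMMY_REL(sub_final_rel) check, and a NULL pointer stands in for NIL. It only counts, for a given sequence of children, how many iterations see each condition as true.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk the children in order and count how many iterations observe
 * "is_first_child" and how many observe "final_rtable == NIL".
 */
static void
demo_plan_children(const bool child_is_dummy[], int nchildren,
                   int *n_first_child, int *n_final_rtable_nil)
{
    static const int demo_rtable = 0;   /* placeholder for a real range table */
    const void *final_rtable = NULL;    /* NULL plays the role of NIL */
    bool        is_first_child = true;

    *n_first_child = 0;
    *n_final_rtable_nil = 0;

    for (int i = 0; i < nchildren; i++)
    {
        if (is_first_child)
            (*n_first_child)++;
        if (final_rtable == NULL)
            (*n_final_rtable_nil)++;

        /* Any child after this one is no longer the first, dummy or not. */
        is_first_child = false;

        /* Excluded children are skipped before final_rtable is touched. */
        if (child_is_dummy[i])
            continue;

        if (final_rtable == NULL)
            final_rtable = &demo_rtable;
    }
}
```

For the children {dummy, dummy, live}, is_first_child is true in one iteration only, whereas final_rtable == NIL is true in all three. That is the case Amit points out: using final_rtable == NIL as the "first child" test would treat every child as the first until a non-dummy one is found.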
Imai-san, On 2019/03/15 9:33, Imai, Yoshikazu wrote: > On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: >>> * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, >> so the codes concatenating lists of append_rel_list may be able to be >> moved before doing ChangeVarNodes() and then the codes concatenating >> lists of rowmarks, rtable and append_rel_list can be in one block (or >> one bunch). >> >> Yeah, perhaps. I'll check. I'm inclined to add source_appinfos to subroot->append_rel_list after finishing the ChangeVarNodes(subroot->append_rel_list) step, because if there are many entries in source_appinfos that would unnecessarily make ChangeVarNodes take longer. >> On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san, >>> I have done code review of v31 patches from 0004 to 0008. >>> >>> 0004: >>> * s/childern/children >> >> Will fix. Fixed. I've attached updated patches. In the new version, I've moved some code from 0004 to 0005 patch, so as to avoid mixing unrelated modifications in one patch. Especially, orig_rtable now only appears after applying 0005. I appreciate your continued interest in these patches. Thanks, Amit
Attachment
- v32-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v32-0002-Add-rowmark-junk-targetlist-entries-in-query_pla.patch
- v32-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v32-0004-Adjust-inheritance_planner-to-reuse-source-child.patch
- v32-0005-Further-tweak-inheritance_planner-to-avoid-usele.patch
- v32-0006-Perform-pruning-in-expand_partitioned_rtentry.patch
- v32-0007-Teach-planner-to-only-process-unpruned-partition.patch
- v32-0008-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
Amit-san, On Tue, Mar 19, 2019 at 6:53 AM, Amit Langote wrote: > On 2019/03/15 9:33, Imai, Yoshikazu wrote: > > On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote: > >>> * In inheritance_planner(), we do ChangeVarNodes() only for > >>> orig_rtable, > >> so the codes concatenating lists of append_rel_list may be able to > be > >> moved before doing ChangeVarNodes() and then the codes concatenating > >> lists of rowmarks, rtable and append_rel_list can be in one block (or > >> one bunch). > >> > >> Yeah, perhaps. I'll check. > > I'm inclined to add source_appinfos to subroot->append_rel_list after > finishing the ChangeVarNodes(subroot->append_rel_list) step, because if > there are many entries in source_appinfos that would unnecessarily make > ChangeVarNodes take longer. OK, thanks. > I've attached updated patches. In the new version, I've moved some code > from 0004 to 0005 patch, so as to avoid mixing unrelated modifications > in one patch. Especially, orig_rtable now only appears after applying > 0005. > > I appreciate your continued interest in these patches. Thanks for new patches. I looked over them and there are little comments. [0002] * s/regresion/regression/g (in commit message.) [0003] * I thought "inh flag is it" is "inh flag is set" ...? + * For RTE_RELATION rangetable entries whose inh flag is it, adjust the * Below comments are correct when after applying 0004. + * the query's target relation and no other. Especially, children of any + * source relations are added when the loop below calls grouping_planner + * on the *1st* child target relation. [0004] * s/contain contain/contain/ + * will contain contain references to the subquery RTEs that we've * s/find them children/find their children/ + * AppendRelInfos needed to find them children, so the next [0006] * s/recursivly/recursively/ (in commit message) I have no more comments about codes other than above :) -- Yoshikazu Imai
Imai-san, Thanks for the review. On 2019/03/19 20:13, Imai, Yoshikazu wrote: > Thanks for new patches. > I looked over them and there are little comments. > > [0002] > * s/regresion/regression/g > (in commit message.) > > [0003] > * I thought "inh flag is it" is "inh flag is set" ...? > > + * For RTE_RELATION rangetable entries whose inh flag is it, adjust the > > > * Below comments are correct when after applying 0004. > > + * the query's target relation and no other. Especially, children of any > + * source relations are added when the loop below calls grouping_planner > + * on the *1st* child target relation. > > [0004] > * s/contain contain/contain/ > > + * will contain contain references to the subquery RTEs that we've > > > * s/find them children/find their children/ > > + * AppendRelInfos needed to find them children, so the next > > [0006] > * s/recursivly/recursively/ > (in commit message) > > > I have no more comments about codes other than above :) I have fixed all. Attached updated patches. Thanks, Amit
Attachment
- v33-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v33-0002-Add-rowmark-junk-targetlist-entries-in-query_pla.patch
- v33-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v33-0004-Adjust-inheritance_planner-to-reuse-source-child.patch
- v33-0005-Further-tweak-inheritance_planner-to-avoid-usele.patch
- v33-0006-Perform-pruning-in-expand_partitioned_rtentry.patch
- v33-0007-Teach-planner-to-only-process-unpruned-partition.patch
- v33-0008-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
On Fri, Mar 8, 2019 at 4:18 AM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Maybe you know that range_table_mutator() spends quite a long time if > there are many target children, but I realized there's no need for > range_table_mutator() to copy/mutate child target RTEs. First, there's > nothing to translate in their case. Second, copying them is not necessary > too, because they're not modified across different planning cycles. If we > pass to adjust_appendrel_attrs only the RTEs in the original range table > (that is, before any child target RTEs were added), then > range_table_mutator() has to do significantly less work and allocates lot > less memory than before. I've broken this change into its own patch; see > patch 0004. Was just glancing over 0001: - * every non-join RTE that is used in the query. Therefore, this routine - * is the only place that should call build_simple_rel with reloptkind - * RELOPT_BASEREL. (Note: build_simple_rel recurses internally to build - * "other rel" RelOptInfos for the members of any appendrels we find here.) + * every non-join RTE that is specified in the query. Therefore, this + * routine is the only place that should call build_simple_rel with + * reloptkind RELOPT_BASEREL. (Note: "other rel" RelOptInfos for the + * members of any appendrels we find here are built later when query_planner + * calls add_other_rels_to_query().) Used -> specified doesn't seem to change the meaning, so I'm not sure what the motivation is there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019/03/20 9:49, Robert Haas wrote: > On Fri, Mar 8, 2019 at 4:18 AM Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Maybe you know that range_table_mutator() spends quite a long time if >> there are many target children, but I realized there's no need for >> range_table_mutator() to copy/mutate child target RTEs. First, there's >> nothing to translate in their case. Second, copying them is not necessary >> too, because they're not modified across different planning cycles. If we >> pass to adjust_appendrel_attrs only the RTEs in the original range table >> (that is, before any child target RTEs were added), then >> range_table_mutator() has to do significantly less work and allocates lot >> less memory than before. I've broken this change into its own patch; see >> patch 0004. > > Was just glancing over 0001: > > - * every non-join RTE that is used in the query. Therefore, this routine > - * is the only place that should call build_simple_rel with reloptkind > - * RELOPT_BASEREL. (Note: build_simple_rel recurses internally to build > - * "other rel" RelOptInfos for the members of any appendrels we find here.) > + * every non-join RTE that is specified in the query. Therefore, this > + * routine is the only place that should call build_simple_rel with > + * reloptkind RELOPT_BASEREL. (Note: "other rel" RelOptInfos for the > + * members of any appendrels we find here are built later when query_planner > + * calls add_other_rels_to_query().) > > Used -> specified doesn't seem to change the meaning, so I'm not sure > what the motivation is there. Well, I thought it would clarify that now add_base_rels_to_query() only adds "baserel" RelOptInfos, that is, those for the relations that are directly mentioned in the query. Maybe: ...that is mentioned in the query. or ...that is directly mentioned in the query. ? Thanks, Amit
Amit-san, On Wed, Mar 20, 2019 at 0:42 AM, Amit Langote wrote: > On 2019/03/19 20:13, Imai, Yoshikazu wrote: > > Thanks for new patches. > > I looked over them and there are little comments. > > > > ... > > > > I have no more comments about codes other than above :) > > I have fixed all. Attached updated patches.

Thanks for the quick modification.

I did the performance tests again. I ran the four tests below.
Judging from the results of the last test, there might be some point where we can improve performance further.

(1)
v33-0003 slows down queries when there are inherited tables in UPDATE/DELETE's FROM clause.
v33-0004 fixes this problem.

* rt with 32 partitions.
* jrt with 32 partitions.
* update rt set c = jrt.c from jrt where rt.b = jrt.b;

patch     TPS
-----   -----
master   66.8 (67.2, 66.8, 66.4)
0003     47.5 (47.2, 47.6, 47.7)
0004     68.8 (69.2, 68.9, 68.4)

It seems to fix the performance problem correctly.

(2)
v33-0005 speeds up UPDATE/DELETE by removing useless copy of child target RTEs in adjust_appendrel_attrs().

* rt with 32 partitions.
* update rt set b = b + 1;

patch     TPS
-----   -----
master    986 (999, 984, 974)
0005    1,589 (1608, 1577, 1582)

It seems to speed up the performance as expected.

(3)
v33-0006, 0007, 0008 speed up the case when few partitions are scanned among many partitions.

* rt with 4096 partitions, partitioned by column 'a'.
* select rt where rt.a = :a (:a ranges from 1 to 4096)

patch     TPS
-----   -----
master   22.4 (22.4, 22.5, 22.2)
0005     20.9 (20.9, 21.2, 20.6)
0006    2,951 (2958, 2958, 2937)
0007    3,141 (3146, 3141, 3136)
0008    6,472 (6434, 6565, 6416)

Certainly, the patches seem to speed up this case.

(4)
We expect the performance not to depend on the number of partitions after applying all patches, if possible.

num of part    TPS
-----------  -----
1024         7,257 (7274, 7246, 7252)
2048         6,718 (6627, 6780, 6747)
4096         6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192         6,008 (6018, 5999, 6007)

It seems the performance still depends on the number of partitions. At the moment, I don't have any idea what causes this problem, but can we improve this more?

--
Yoshikazu Imai
Imai-san, On 2019/03/20 11:21, Imai, Yoshikazu wrote: > (4) > We expect the performance does not depend on the number of partitions after applying all patches, if possible. > > num of part TPS > ----------- ----- > 1024 7,257 (7274, 7246, 7252) > 2048 6,718 (6627, 6780, 6747) > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > 8192 6,008 (6018, 5999, 6007) > > It seems the performance still depend on the number of partitions. At the moment, I don't have any idea what cause this problem but can we improve this more? I've noticed [1] this kind of degradation when the server is built with Asserts enabled. Did you? Thanks, Amit [1] https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18b1770d%40lab.ntt.co.jp
Amit-san, On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: > On 2019/03/20 11:21, Imai, Yoshikazu wrote: > > (4) > > We expect the performance does not depend on the number of partitions > after applying all patches, if possible. > > > > num of part TPS > > ----------- ----- > > 1024 7,257 (7274, 7246, 7252) > > 2048 6,718 (6627, 6780, 6747) > > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > > 8192 6,008 (6018, 5999, 6007) > > > > It seems the performance still depend on the number of partitions. At > the moment, I don't have any idea what cause this problem but can we improve > this more? > > I've noticed [1] this kind of degradation when the server is built with > Asserts enabled. Did you? > ... > [1] > https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 > b1770d%40lab.ntt.co.jp No. I did test again from configuring without --enable-cassert but problem I mentioned still happens. -- Yoshikazu Imai
On 2019/03/20 11:51, Imai, Yoshikazu wrote: > Amit-san, > > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: >> On 2019/03/20 11:21, Imai, Yoshikazu wrote: >>> (4) >>> We expect the performance does not depend on the number of partitions >> after applying all patches, if possible. >>> >>> num of part TPS >>> ----------- ----- >>> 1024 7,257 (7274, 7246, 7252) >>> 2048 6,718 (6627, 6780, 6747) >>> 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) >>> 8192 6,008 (6018, 5999, 6007) >>> >>> It seems the performance still depend on the number of partitions. At >> the moment, I don't have any idea what cause this problem but can we improve >> this more? >> >> I've noticed [1] this kind of degradation when the server is built with >> Asserts enabled. Did you? >> ... >> [1] >> https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 >> b1770d%40lab.ntt.co.jp > > No. I did test again from configuring without --enable-cassert but problem I mentioned still happens. Hmm, OK. Can you describe your test setup with more details? Thanks, Amit
Amit-san, On Wed, Mar 20, 2019 at 3:01 PM, Amit Langote wrote: > > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote: > >> On 2019/03/20 11:21, Imai, Yoshikazu wrote: > >>> (4) > >>> We expect the performance does not depend on the number of > >>> partitions > >> after applying all patches, if possible. > >>> > >>> num of part TPS > >>> ----------- ----- > >>> 1024 7,257 (7274, 7246, 7252) > >>> 2048 6,718 (6627, 6780, 6747) > >>> 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s > results) > >>> 8192 6,008 (6018, 5999, 6007) > >>> > >>> It seems the performance still depend on the number of partitions. > >>> At > >> the moment, I don't have any idea what cause this problem but can we > >> improve this more? > >> > >> I've noticed [1] this kind of degradation when the server is built > >> with Asserts enabled. Did you? > >> ... > >> [1] > >> > https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18 > >> b1770d%40lab.ntt.co.jp > > > > No. I did test again from configuring without --enable-cassert but > problem I mentioned still happens. > > Hmm, OK. Can you describe your test setup with more details?

Here are the details.

[creating partitioned tables (with 1024 partitions)]
drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[select1024.sql]
\set a random (1, 1024)
select * from rt where a = :a;

[pgbench]
pgbench -n -f select1024.sql -T 60

What I have noticed so far is that it also might depend on the query. I created a table with 8192 partitions and ran select statements like "select * from a = :a (which ranges from 1 to 1024)" and "select * from a = :a (which ranges from 1 to 8192)", and the results of those were different.

I'll send perf results off-list.

--
Yoshikazu Imai
Imai-san, On 2019/03/20 12:15, Imai, Yoshikazu wrote: > Here the details. > > [creating partitioned tables (with 1024 partitions)] > drop table if exists rt; > create table rt (a int, b int, c int) partition by range (a); > \o /dev/null > select 'create table rt' || x::text || ' partition of rt for values from (' || > (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x; > \gexec > \o > > [select1024.sql] > \set a random (1, 1024) > select * from rt where a = :a; > > [pgbench] > pgbench -n -f select1024.sql -T 60 Thank you. Could you please try with running pgbench for a bit longer than 60 seconds? Thanks, Amit
Amit-san, On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: > On 2019/03/20 12:15, Imai, Yoshikazu wrote: > > Here the details. > > > > [creating partitioned tables (with 1024 partitions)] drop table if > > exists rt; create table rt (a int, b int, c int) partition by range > > (a); \o /dev/null select 'create table rt' || x::text || ' partition > > of rt for values from (' || (x)::text || ') to (' || (x+1)::text || > > ');' from generate_series(1, 1024) x; \gexec \o > > > > [select1024.sql] > > \set a random (1, 1024) > > select * from rt where a = :a; > > > > [pgbench] > > pgbench -n -f select1024.sql -T 60 > > Thank you. > > Could you please try with running pgbench for a bit longer than 60 seconds?

I ran pgbench for 180 seconds, but there is still a difference.

1024: 7,004 TPS
8192: 5,859 TPS

I also tested other partition counts by running pgbench for 60 seconds.

num of part    TPS
-----------  -----
        128  7,579
        256  7,528
        512  7,512
       1024  7,257  (7274, 7246, 7252)
       2048  6,718  (6627, 6780, 6747)
       4096  6,472  (6434, 6565, 6416)  (quoted from above (3)'s results)
       8192  6,008  (6018, 5999, 6007)

I checked whether there is any processing that loops over all the partitions, but I couldn't find any. I'm really wondering why this degradation happens.

Yoshikazu Imai
Imai-san, On 2019/03/20 17:36, Imai, Yoshikazu wrote: > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: >> On 2019/03/20 12:15, Imai, Yoshikazu wrote: >>> [select1024.sql] >>> \set a random (1, 1024) >>> select * from rt where a = :a; >>> >>> [pgbench] >>> pgbench -n -f select1024.sql -T 60 >> >> Thank you. >> >> Could you please try with running pgbench for a bit longer than 60 seconds? > > I run pgbench for 180 seconds but there are still difference. Thank you very much. > 1024: 7,004 TPS > 8192: 5,859 TPS > > > I also tested for another number of partitions by running pgbench for 60 seconds. > > num of part TPS > ----------- ----- > 128 7,579 > 256 7,528 > 512 7,512 > 1024 7,257 (7274, 7246, 7252) > 2048 6,718 (6627, 6780, 6747) > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > 8192 6,008 (6018, 5999, 6007) > > > I checked whether there are the process which go through the number of partitions, but I couldn't find. I'm really wonderingwhy this degradation happens. Indeed, it's quite puzzling why. Will look into this. Thanks, Amit
Amit-san, On Wed, Mar 20, 2019 at 9:07 AM, Amit Langote wrote: > On 2019/03/20 17:36, Imai, Yoshikazu wrote: > > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote: > >> On 2019/03/20 12:15, Imai, Yoshikazu wrote: > >>> [select1024.sql] > >>> \set a random (1, 1024) > >>> select * from rt where a = :a; > >>> > >>> [pgbench] > >>> pgbench -n -f select1024.sql -T 60 > >> > >> Thank you. > >> > >> Could you please try with running pgbench for a bit longer than 60 > seconds? > > > > I run pgbench for 180 seconds but there are still difference. > > Thank you very much. > > > 1024: 7,004 TPS > > 8192: 5,859 TPS > > > > > > I also tested for another number of partitions by running pgbench for > 60 seconds. > > > > num of part TPS > > ----------- ----- > > 128 7,579 > > 256 7,528 > > 512 7,512 > > 1024 7,257 (7274, 7246, 7252) > > 2048 6,718 (6627, 6780, 6747) > > 4096 6,472 (6434, 6565, 6416) (quoted from above (3)'s results) > > 8192 6,008 (6018, 5999, 6007) > > > > > > I checked whether there are the process which go through the number > of partitions, but I couldn't find. I'm really wondering why this > degradation happens. > > Indeed, it's quite puzzling why. Will look into this. I don't know whether it is useful, but I noticed the usage of get_tabstat_entry increased when many partitions are scanned. -- Yoshikazu Imai
Hi, On 3/19/19 11:15 PM, Imai, Yoshikazu wrote: > Here the details. > > [creating partitioned tables (with 1024 partitions)] > drop table if exists rt; > create table rt (a int, b int, c int) partition by range (a); > \o /dev/null > select 'create table rt' || x::text || ' partition of rt for values from (' || > (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x; > \gexec > \o > > [select1024.sql] > \set a random (1, 1024) > select * from rt where a = :a; > > [pgbench] > pgbench -n -f select1024.sql -T 60 > >

My tests, using hash partitions, show that the extra time is spent in make_partition_pruneinfo() for the relid_subplan_map variable.

64 partitions: make_partition_pruneinfo() 2.52%
8192 partitions: make_partition_pruneinfo() 5.43%

TPS goes down ~8% between the two. The test setup is similar to the above.

Given that Tom is planning to change the List implementation [1] in 13, I think using the palloc0 structure is OK for 12 before trying other implementation options.

perf sent off-list.

[1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us

Best regards, Jesper
Hi Amit, On 3/19/19 8:41 PM, Amit Langote wrote: > I have fixed all. Attached updated patches. > The comments in the thread have been addressed, so I have moved the CF entry to Ready for Committer. Best regards, Jesper
Hi Jesper, On 2019/03/20 23:25, Jesper Pedersen wrote: > Hi, > > On 3/19/19 11:15 PM, Imai, Yoshikazu wrote: >> Here the details. >> >> [creating partitioned tables (with 1024 partitions)] >> drop table if exists rt; >> create table rt (a int, b int, c int) partition by range (a); >> \o /dev/null >> select 'create table rt' || x::text || ' partition of rt for values >> from (' || >> (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, >> 1024) x; >> \gexec >> \o >> >> [select1024.sql] >> \set a random (1, 1024) >> select * from rt where a = :a; >> >> [pgbench] >> pgbench -n -f select1024.sql -T 60 >> >> > > My tests - using hash partitions - shows that the extra time is spent in > make_partition_pruneinfo() for the relid_subplan_map variable. > > 64 partitions: make_partition_pruneinfo() 2.52% > 8192 partitions: make_partition_pruneinfo() 5.43% > > TPS goes down ~8% between the two. The test setup is similar to the above. > > Given that Tom is planning to change the List implementation [1] in 13 I > think using the palloc0 structure is ok for 12 before trying other > implementation options. > > perf sent off-list. > > [1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us > > Best regards, > Jesper > >

Thanks for testing. Yeah, after looking at the code, I think the bottleneck is lurking somewhere that this patch doesn't touch. I also think the patch is OK as it is for 12, and that this problem will be fixed in 13.

-- Yoshikazu Imai
Jesper, Imai-san, Thanks for testing and reporting your findings. On 2019/03/21 23:10, Imai Yoshikazu wrote: > On 2019/03/20 23:25, Jesper Pedersen wrote:> Hi, > > My tests - using hash partitions - shows that the extra time is spent in > > make_partition_pruneinfo() for the relid_subplan_map variable. > > > > 64 partitions: make_partition_pruneinfo() 2.52% > > 8192 partitions: make_partition_pruneinfo() 5.43% > > > > TPS goes down ~8% between the two. The test setup is similar to the > above. > > > > Given that Tom is planning to change the List implementation [1] in 13 I > > think using the palloc0 structure is ok for 12 before trying other > > implementation options. Hmm, relid_subplan_map's size should be constant (number of partitions scanned) even as the actual partition count grows. However, looking into make_partitionedrel_pruneinfo(), it seems that it's unconditionally allocating 3 arrays that all have nparts elements: subplan_map = (int *) palloc0(nparts * sizeof(int)); subpart_map = (int *) palloc0(nparts * sizeof(int)); relid_map = (Oid *) palloc0(nparts * sizeof(int)); So, that part has got to cost more as the partition count grows. This is the code for runtime pruning, which is not exercised in our tests, so it might seem odd that we're spending any time here at all. I've been told that we have to perform at least *some* work here if only to conclude that runtime pruning is not needed and it seems that above allocations occur before making that conclusion. Maybe, that's salvageable by rearranging this code a bit. David may be in a better position to help with that. > Thanks for testing. > Yeah, after code looking, I think bottleneck seems be lurking in another > place where this patch doesn't change. I also think the patch is ok as > it is for 12, and this problem will be fixed in 13. 
If this drop in performance can be attributed to the fact that having too many partitions starts stressing other parts of the backend like stats collector, lock manager, etc. as time passes, then I agree that we'll have to tackle them later. Thanks, Amit
Hi, On 2019/03/22 11:17, Amit Langote wrote: > However, looking into make_partitionedrel_pruneinfo(), it seems that it's > unconditionally allocating 3 arrays that all have nparts elements: > > subplan_map = (int *) palloc0(nparts * sizeof(int)); > subpart_map = (int *) palloc0(nparts * sizeof(int)); > relid_map = (Oid *) palloc0(nparts * sizeof(int)); > > So, that part has got to cost more as the partition count grows. > > This is the code for runtime pruning, which is not exercised in our tests, > so it might seem odd that we're spending any time here at all. I've been > told that we have to perform at least *some* work here if only to conclude > that runtime pruning is not needed and it seems that above allocations > occur before making that conclusion. Maybe, that's salvageable by > rearranging this code a bit. David may be in a better position to help > with that.

I took a stab at this. I think rearranging the code in make_partitionedrel_pruneinfo() slightly will take care of this.

The problem is that make_partitionedrel_pruneinfo() does some work which involves allocating arrays containing nparts elements and then looping over the part_rels array to fill in the values in those arrays that will be used by run-time pruning. It does all that even *before* it has checked whether run-time pruning will be needed at all; if it is not needed, the result of that work is simply discarded.

Patch 0008 of 0009 rearranges the code such that we check whether we will need run-time pruning at all before doing any of the work that's necessary for run-time pruning to work.

Thanks, Amit
Attachment
- v34-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v34-0002-Add-rowmark-junk-targetlist-entries-in-query_pla.patch
- v34-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v34-0004-Adjust-inheritance_planner-to-reuse-source-child.patch
- v34-0005-Further-tweak-inheritance_planner-to-avoid-usele.patch
- v34-0006-Perform-pruning-in-expand_partitioned_rtentry.patch
- v34-0007-Teach-planner-to-only-process-unpruned-partition.patch
- v34-0008-Rearrange-the-code-in-make_partitionedrel_prunei.patch
- v34-0009-Don-t-copy-PartitionBoundInfo-in-set_relation_pa.patch
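The reordering Amit describes for make_partitionedrel_pruneinfo() can be illustrated in isolation. What follows is only a toy sketch, not the code in the patch: `ToyPruneInfo`, `toy_build_pruneinfo`, `toy_array_allocs` and the `needs_runtime_pruning` flag are hypothetical names, and plain `calloc` stands in for `palloc0`. The point it shows is that the cheap "is run-time pruning needed?" test runs before any array of nparts elements is allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-parent data used by run-time pruning. */
typedef struct ToyPruneInfo
{
    int  nparts;
    int *subplan_map;           /* partition index -> subplan index, or -1 */
} ToyPruneInfo;

/* Counts O(nparts) allocations so the effect of the reordering is visible. */
static int toy_array_allocs = 0;

/* Zero-filled allocation that aborts on failure (toy replacement for palloc0). */
static void *
toy_alloc0(size_t size)
{
    void *p = calloc(1, size > 0 ? size : 1);

    if (p == NULL)
        abort();
    return p;
}

/*
 * Build pruning info for a parent with nparts partitions.
 * needs_runtime_pruning is an assumption of this sketch: it stands in for
 * whatever check decides that some pruning step must run at execution time.
 * Returns NULL when run-time pruning is not needed.
 */
static ToyPruneInfo *
toy_build_pruneinfo(int nparts, const int *part_to_subplan,
                    bool needs_runtime_pruning)
{
    ToyPruneInfo *info;
    int           i;

    assert(nparts >= 0);

    /* Cheap test first: nothing proportional to nparts happens on this path. */
    if (!needs_runtime_pruning)
        return NULL;

    info = toy_alloc0(sizeof(ToyPruneInfo));
    info->nparts = nparts;
    info->subplan_map = toy_alloc0(sizeof(int) * (size_t) nparts);
    toy_array_allocs++;

    /* Only now pay for the loop over all partitions. */
    for (i = 0; i < nparts; i++)
        info->subplan_map[i] = part_to_subplan[i];

    return info;
}

/* Release everything toy_build_pruneinfo() allocated. */
static void
toy_free_pruneinfo(ToyPruneInfo *info)
{
    if (info == NULL)
        return;
    free(info->subplan_map);
    free(info);
}
```

With the check placed first, a query that needs no run-time pruning costs the same whether the parent has 64 or 8192 partitions, at least as far as this function is concerned.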
Hi Amit, On 3/22/19 3:39 AM, Amit Langote wrote: > I took a stab at this. I think rearranging the code in > make_partitionedrel_pruneinfo() slightly will take care of this. > > The problem is that make_partitionedrel_pruneinfo() does some work which > involves allocating arrays containing nparts elements and then looping > over the part_rels array to fill values in those arrays that will be used > by run-time pruning. It does all that even *before* it has checked that > run-time pruning will be needed at all, which if it is not, it will simply > discard the result of the aforementioned work. > > Patch 0008 of 0009 rearranges the code such that we check whether we will > need run-time pruning at all before doing any of the work that's necessary > for run-time to work. > Yeah, this is better :) Increasing number of partitions leads to a TPS within the noise level. Passes check-world. Best regards, Jesper
On Fri, 22 Mar 2019 at 20:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > The problem is that make_partitionedrel_pruneinfo() does some work which > involves allocating arrays containing nparts elements and then looping > over the part_rels array to fill values in those arrays that will be used > by run-time pruning. It does all that even *before* it has checked that > run-time pruning will be needed at all, which if it is not, it will simply > discard the result of the aforementioned work. > > Patch 0008 of 0009 rearranges the code such that we check whether we will > need run-time pruning at all before doing any of the work that's necessary > for run-time to work. I had a quick look at the make_partitionedrel_pruneinfo() code earlier before this patch appeared and I agree that something like this could be done. I've not gone over the patch in detail though. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > [ v34 patch set ] I had a bit of a look through this. I went ahead and pushed 0008 and 0009, as they seem straightforward and independent of the rest. (BTW, 0009 makes 0003's dubious optimization in set_relation_partition_info quite unnecessary.) As for the rest: 0001: OK in principle, but I wonder why you implemented it by adding another recursive scan of the jointree rather than just iterating over the baserel array, as in make_one_rel() for instance. Seems like this way is more work, and more code that we'll have to touch if we ever change the jointree representation. I also feel like you used a dartboard while deciding where to insert the call in query_planner(); dropping it into the middle of a sequence of equivalence-class-related operations seems quite random. Maybe we could delay that step all the way to just before make_one_rel, since the other stuff in between seems to only care about baserels? For example, it'd certainly be better if we could run remove_useless_joins before otherrel expansion, so that we needn't do otherrel expansion at all on a table that gets thrown away as being a useless join. BTW, it strikes me that we could take advantage of the fact that baserels must all appear before otherrels in the rel array, by having loops over that array stop early if they're only interested in baserels. We could either save the index of the last baserel, or just extend the loop logic to fall out upon hitting an otherrel. Seems like this'd save some cycles when there are lots of partitions, although perhaps loops like that would be fast enough to not matter. 0002: I seriously dislike this from a system-structure viewpoint. For one thing it makes root->processed_tlist rather moot, since it doesn't get set till after most of the work is done. (I don't know if there are any FDWs out there that look at processed_tlist, but they'd be broken by this if so. 
It'd be better to get rid of the field if we can, so that at least such breakage is obvious. Or, maybe, use root->processed_tlist as the communication field rather than having the tlist be a param to query_planner?) I also don't like the undocumented way that you've got build_base_rel_tlists working on something that's not the final tlist, and then going back and doing more work of the same sort later. I wonder whether we can postpone build_base_rel_tlists until after the other stuff happens, so that it can continue to do all of the tlist-distribution work. 0003: I haven't studied this in great detail, but it seems like there's some pretty random things in it, like this addition in inheritance_planner: + /* grouping_planner doesn't need to see the target children. */ + subroot->append_rel_list = list_copy(orig_append_rel_list); That sure seems like a hack unrelated to the purpose of the patch ... and since list_copy is a shallow copy, I'm not even sure it's safe. Won't we be mutating the contained (and shared) AppendRelInfos as we go along? 0004 and 0005: I'm not exactly thrilled about loading more layers of overcomplication into inheritance_planner, especially not when the complication then extends out into new global planner flags with questionable semantics. We should be striving to get rid of that function, as previously discussed, and I fear this is just throwing more roadblocks in the way of doing so. Can we skip these and come back to the question after we replace inheritance_planner with expand-at-the-bottom? 0006: Seems mostly OK, but I'm not very happy about the additional table_open calls it's throwing into random places in the planner. Those can be rather expensive. Why aren't we collecting the appropriate info during the initial plancat.c consultation of the relcache? 0007: Not really sold on this; it seems like it's going to be a net negative for cases where there aren't a lot of partitions. regards, tom lane
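Tom's concern about list_copy being a shallow copy can be shown with a small model. This is a sketch under stated assumptions, not PostgreSQL's List code: `ToyAppInfo`, `ToyList`, `toy_list_copy_shallow` and `toy_list_copy_deep` are hypothetical, with the shallow variant playing the role of list_copy and the deep variant the role of copyObject.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an AppendRelInfo. */
typedef struct ToyAppInfo
{
    int parent_relid;
    int child_relid;
} ToyAppInfo;

/* Hypothetical list of pointers to ToyAppInfo. */
typedef struct ToyList
{
    int          length;
    ToyAppInfo **items;
} ToyList;

/* Zero-filled allocation that aborts on failure. */
static void *
toy_alloc0(size_t size)
{
    void *p = calloc(1, size > 0 ? size : 1);

    if (p == NULL)
        abort();
    return p;
}

/* Create a list of 'length' freshly allocated, zeroed elements. */
static ToyList *
toy_list_make(int length)
{
    ToyList *list = toy_alloc0(sizeof(ToyList));
    int      i;

    list->length = length;
    list->items = toy_alloc0(sizeof(ToyAppInfo *) * (size_t) length);
    for (i = 0; i < length; i++)
        list->items[i] = toy_alloc0(sizeof(ToyAppInfo));
    return list;
}

/* Shallow copy: a new pointer array, but the elements are shared. */
static ToyList *
toy_list_copy_shallow(const ToyList *src)
{
    ToyList *dst = toy_alloc0(sizeof(ToyList));

    dst->length = src->length;
    dst->items = toy_alloc0(sizeof(ToyAppInfo *) * (size_t) src->length);
    if (src->length > 0)
        memcpy(dst->items, src->items,
               sizeof(ToyAppInfo *) * (size_t) src->length);
    return dst;
}

/* Deep copy: a new pointer array and a private copy of every element. */
static ToyList *
toy_list_copy_deep(const ToyList *src)
{
    ToyList *dst = toy_list_make(src->length);
    int      i;

    for (i = 0; i < src->length; i++)
        *dst->items[i] = *src->items[i];
    return dst;
}
```

Writing through an element of the shallow copy changes what the original list sees, which is the hazard raised for the shared AppendRelInfos; the deep copy is unaffected.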
I wrote: > ... I also > don't like the undocumented way that you've got build_base_rel_tlists > working on something that's not the final tlist, and then going back > and doing more work of the same sort later. I wonder whether we can > postpone build_base_rel_tlists until after the other stuff happens, > so that it can continue to do all of the tlist-distribution work. On further reflection: we don't really need the full build_base_rel_tlists pushups for these extra variables. The existence of the original rowmark variable will be sufficient, for example, to make sure that we don't decide the parent partitioned table can be thrown away as a useless join. All we need to do is add the new Var to the parent's reltarget and adjust its attr_needed, and we can do that very late in query_planner. So now I'm thinking leave build_base_rel_tlists() where it is, and then fix up the tlist/reltarget data on the fly when we discover that new child rels need more rowmark types than we had before. (This'd be another reason to keep the tlist in root->processed_tlist throughout, likely.) regards, tom lane
On Fri, Mar 22, 2019 at 9:07 PM, Tom Lane wrote: > BTW, it strikes me that we could take advantage of the fact that baserels > must all appear before otherrels in the rel array, by having loops over > that array stop early if they're only interested in baserels. We could > either save the index of the last baserel, or just extend the loop logic > to fall out upon hitting an otherrel. > Seems like this'd save some cycles when there are lots of partitions, > although perhaps loops like that would be fast enough to not matter.

Actually, this speeds up planning a little when scanning a lot of otherrels, for example when selecting from thousands of partitions. I tested as follows.

* rt with 8192 partitions
* execute "select * from rt;" for 60 seconds

[results]
HEAD: 4.19 TPS (4.06, 4.34, 4.17)
(v34 patches) + (looping over only baserels): 4.26 TPS (4.31, 4.28, 4.19)

Attached is the diff I used for this test.

-- Yoshikazu Imai
Attachment
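The loop change discussed here can be sketched with a toy rel array. This is not the attached diff, which is not reproduced in the thread; `ToyRel`, `toy_scan_all` and `toy_scan_until_otherrel` are hypothetical names. The sketch relies on the property Tom states, that all baserels appear before any otherrel, and it ignores details of the real array such as unused slots.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ToyRelKind
{
    TOY_BASEREL,
    TOY_OTHERREL
} ToyRelKind;

/* Hypothetical stand-in for an entry of the planner's rel array. */
typedef struct ToyRel
{
    ToyRelKind kind;
    int        relid;
} ToyRel;

/*
 * Visit every entry and skip the ones that are not baserels.
 * Stores the number of baserels in *nbase and returns how many
 * entries were examined.
 */
static int
toy_scan_all(const ToyRel *rels, int nrels, int *nbase)
{
    int examined = 0;
    int i;

    assert(nbase != NULL);
    *nbase = 0;
    for (i = 0; i < nrels; i++)
    {
        examined++;
        if (rels[i].kind != TOY_BASEREL)
            continue;           /* keeps walking past every otherrel */
        (*nbase)++;
    }
    return examined;
}

/*
 * Same result, but fall out of the loop upon hitting the first otherrel.
 * Only valid while baserels are guaranteed to precede otherrels.
 */
static int
toy_scan_until_otherrel(const ToyRel *rels, int nrels, int *nbase)
{
    int examined = 0;
    int i;

    assert(nbase != NULL);
    *nbase = 0;
    for (i = 0; i < nrels; i++)
    {
        examined++;
        if (rels[i].kind != TOY_BASEREL)
            break;              /* everything after this is an otherrel */
        (*nbase)++;
    }
    return examined;
}
```

With one baserel and thousands of partition otherrels, the second loop examines two entries instead of thousands, which fits the small gain reported above.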
Thanks a lot for reviewing. On 2019/03/23 6:07, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> [ v34 patch set ] > > I had a bit of a look through this. I went ahead and pushed 0008 and > 0009, as they seem straightforward and independent of the rest. (BTW, > 0009 makes 0003's dubious optimization in set_relation_partition_info > quite unnecessary.) As for the rest: > > > 0001: OK in principle, but I wonder why you implemented it by adding > another recursive scan of the jointree rather than just iterating > over the baserel array, as in make_one_rel() for instance. Seems > like this way is more work, and more code that we'll have to touch > if we ever change the jointree representation. I've changed this to work by iterating over baserel array. I was mostly worried about looping over potentially many elements of baserel array that we simply end up skipping, but considering the other patch that delays adding inheritance children, we don't need to worry about that. > I also feel like you used a dartboard while deciding where to insert the > call in query_planner(); dropping it into the middle of a sequence of > equivalence-class-related operations seems quite random. Maybe we could > delay that step all the way to just before make_one_rel, since the other > stuff in between seems to only care about baserels? For example, > it'd certainly be better if we could run remove_useless_joins before > otherrel expansion, so that we needn't do otherrel expansion at all > on a table that gets thrown away as being a useless join. create_lateral_join_info() expects otherrels to be present to propagate the information it creates. I have moved add_other_rels_to_query() followed by create_lateral_join_info() to just before make_one_rel(). > BTW, it strikes me that we could take advantage of the fact that > baserels must all appear before otherrels in the rel array, by > having loops over that array stop early if they're only interested > in baserels. 
We could either save the index of the last baserel, > or just extend the loop logic to fall out upon hitting an otherrel. > Seems like this'd save some cycles when there are lots of partitions, > although perhaps loops like that would be fast enough to not matter. As Imai-san's testing shows, there's certainly a slight speedup to be expected. If we want that, maybe we could use his patch. > 0002: I seriously dislike this from a system-structure viewpoint. > For one thing it makes root->processed_tlist rather moot, since it > doesn't get set till after most of the work is done. (I don't know > if there are any FDWs out there that look at processed_tlist, but > they'd be broken by this if so. It'd be better to get rid of the > field if we can, so that at least such breakage is obvious. Or, > maybe, use root->processed_tlist as the communication field rather > than having the tlist be a param to query_planner?) Getting rid of processed_tlist altogether seems rather daunting to me, so I've adopted your suggestion of adding any new junk TLEs to processed_tlist directly. > I also > don't like the undocumented way that you've got build_base_rel_tlists > working on something that's not the final tlist, and then going back > and doing more work of the same sort later. I wonder whether we can > postpone build_base_rel_tlists until after the other stuff happens, > so that it can continue to do all of the tlist-distribution work. Seeing your other reply to this part, I withdraw 0002 as previously proposed. Instead, I modified 0003 to teach expand_inherited_rtentry() to add "wholerow" junk TLE if any foreign table children need it, and a "tableoid" junk TLE needed for the inheritance case. It also calls add_vars_to_targetlist() directly to add those vars to parent's reltarget. The newly added TLEs are added directly to processed_tlist. "ctid" junk TLE is added by preprocess_targetlist() as before. 
Maybe, we can remove the code in preprocess_targetlist() for adding "wholerow" and "tableoid" junk TLEs, as it's essentially dead code after this patch. > 0003: I haven't studied this in great detail, but it seems like there's > some pretty random things in it, like this addition in > inheritance_planner: > > + /* grouping_planner doesn't need to see the target children. */ > + subroot->append_rel_list = list_copy(orig_append_rel_list); > > That sure seems like a hack unrelated to the purpose of the patch ... and > since list_copy is a shallow copy, I'm not even sure it's safe. Won't > we be mutating the contained (and shared) AppendRelInfos as we go along? Sorry, the comment wasn't very clear. As for how this is related to this patch, we need subroot to have its own append_rel_list, because query_planner() will add new entries to it if there are any inheritance parents that are source relations. We wouldn't want them to be added to the parent root's append_rel_list, because the next child will want to start with same append_rel_list in its subroot as the first child. As for why orig_append_rel_list and not parent_root->append_rel_list, the latter contains appinfos for target child relations, which need not be visible to query_planner(). In fact, all loops over append_rel_list during query planning simply ignore those appinfos, because their parent relation is not part of the translated query's join tree. Indeed, we can do copyObject(orig_append_rel_list) and get rid of some instances of code specific to subqueryRTindexes != NULL. The first block of such code makes copies of appinfos that reference subquery RTEs, which can simply be deleted because orig_append_rel_list contains only the appinfos pertaining to flattened UNION ALL. The second block applies ChangeVarNodes() to appinfos that reference subquery RTEs, necessitating copying in the first place, which currently loops over subroot->append_rel_list to filter out those that don't need to be changed. 
If subroot->append_rel_list is a copy of orig_append_rel_list, this filtering is unnecessary, so we can simply do: ChangeVarNodes((Node *) subroot->append_rel_list, rti, newrti, 0); I've made the above changes and updated the comment to reflect that. > 0004 and 0005: I'm not exactly thrilled about loading more layers of > overcomplication into inheritance_planner, especially not when the > complication then extends out into new global planner flags with > questionable semantics. We should be striving to get rid of that > function, as previously discussed, and I fear this is just throwing > more roadblocks in the way of doing so. Can we skip these and come > back to the question after we replace inheritance_planner with > expand-at-the-bottom? As you know, since we're changing things so that source inheritance is expanded in query_planner(), any UPDATE/DELETE queries containing inheritance parents as source relations will regress to some degree, because source inheritance expansion will be repeated for every child query. I don't like the new flag contains_inherit_children either, because it will be rendered unnecessary the day we get rid of inheritance_planner, but I don't see any other way to get what we want. If we're to forego 0004 because of that complexity, at least we should consider 0005, because its changes are fairly local to inheritance_planner(). The speedup and savings in memory consumption by avoiding putting target child RTEs in the child query are significant, as discussed upthread. I have moved 0004 to be the last patch in the series to make way for other simpler patches to be considered first. > 0006: Seems mostly OK, but I'm not very happy about the additional > table_open calls it's throwing into random places in the planner. > Those can be rather expensive. Why aren't we collecting the appropriate > info during the initial plancat.c consultation of the relcache? 
Hmm, you're seeing those because we're continuing to use the old internal interfaces of inherit.c. Especially, with the existing interfaces, we need both the parent and child relations to be open to build the AppendRelInfo. Note that we are using the same code for initializing both target child relations and source child relations, and because we don't have RelOptInfos available in the former case, we can't change the internal interfaces to use RelOptInfos for passing information around. Previous versions of this patch did add TupleDesc and Oid fields to RelOptInfo to store a relation's rd_att and rd_rel->reltype, respectively, that are needed to construct AppendRelInfo. It also performed massive refactoring of the internal interfaces of inherit.c to work by passing around parent RelOptInfo, also considering that one caller (inheritance_planner) doesn't build RelOptInfos before calling. But I thought all that refactoring would be a harder sell than adding a table_open() call in joinrels.c to be able to call make_append_rel_info() directly for building what's really a no-op AppendRelInfo.

> 0007: Not really sold on this; it seems like it's going to be a net > negative for cases where there aren't a lot of partitions.

Performance loss for smaller number of partitions is in the noise range, but what we gain for large number of partitions seems pretty significant to me:

nparts  no live_parts  live_parts
======  =============  ==========
     2           3397        3391
     8           3365        3337
    32           3316        3379
   128           3338        3399
   512           3273        3321
  1024           3439        3517
  4096           3113        3227
  8192           2849        3215

Attached find updated patches. 0002 is a new patch to get rid of the duplicate RTE and PlanRowMark that's created for partitioned parent table, as it's pointless. I was planning to propose it later, but decided to post it now if we're modifying the nearby code anyway.

Thanks, Amit
Attachment
- v35-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v35-0002-Get-rid-of-duplicate-child-RTE-for-partitioned-t.patch
- v35-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v35-0004-Perform-pruning-in-expand_partitioned_rtentry.patch
- v35-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v35-0006-Fix-inheritance_planner-to-avoid-useless-work.patch
- v35-0007-Adjust-inheritance_planner-to-reuse-source-child.patch
On 2019/03/25 20:34, Amit Langote wrote: > Performance loss for smaller number of partitions is in the noise range, > but what we gain for large number of partitions seems pretty significant > to me:

I didn't specify the benchmark setup instructions:

partitioned table creation (N: 2...8192):

create table rt (a int, b int, c int) partition by range (a);
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, N) x;
\gexec

select.sql:

\set param random(1, N)
select * from rt where a = :param;

pgbench -n -T 120 -f select.sql

> nparts  no live_parts  live_parts
> ======  =============  ==========
>      2           3397        3391
>      8           3365        3337
>     32           3316        3379
>    128           3338        3399
>    512           3273        3321
>   1024           3439        3517
>   4096           3113        3227
>   8192           2849        3215
>
> Attached find updated patches.

Rebased patches attached.

Thanks, Amit
Attachment
- v36-0001-Build-other-rels-of-appendrel-baserels-in-a-sepa.patch
- v36-0002-Get-rid-of-duplicate-child-RTE-for-partitioned-t.patch
- v36-0003-Delay-adding-inheritance-child-tables-until-quer.patch
- v36-0004-Perform-pruning-in-expand_partitioned_rtentry.patch
- v36-0005-Teach-planner-to-only-process-unpruned-partition.patch
- v36-0006-Fix-inheritance_planner-to-avoid-useless-work.patch
- v36-0007-Adjust-inheritance_planner-to-reuse-source-child.patch
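To make the "noise range" versus "pretty significant" comparison concrete, the quoted figures can be turned into relative changes. The sketch below is only arithmetic on the numbers posted above; it assumes they are pgbench TPS values (the message does not label the unit), and `ToyBenchRow`, `toy_rows`, `toy_find_row` and `toy_pct_change` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table quoted above. */
typedef struct ToyBenchRow
{
    int    nparts;
    double without_live_parts;  /* "no live_parts" column */
    double with_live_parts;     /* "live_parts" column */
} ToyBenchRow;

/* Figures copied from the message; assumed here to be TPS. */
static const ToyBenchRow toy_rows[] = {
    {2, 3397, 3391},
    {8, 3365, 3337},
    {32, 3316, 3379},
    {128, 3338, 3399},
    {512, 3273, 3321},
    {1024, 3439, 3517},
    {4096, 3113, 3227},
    {8192, 2849, 3215},
};

static const size_t toy_nrows = sizeof(toy_rows) / sizeof(toy_rows[0]);

/* Return the row for the given partition count, or NULL if there is none. */
static const ToyBenchRow *
toy_find_row(int nparts)
{
    size_t i;

    for (i = 0; i < toy_nrows; i++)
    {
        if (toy_rows[i].nparts == nparts)
            return &toy_rows[i];
    }
    return NULL;
}

/* Relative change from 'before' to 'after', in percent. */
static double
toy_pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For 8192 partitions this gives (3215 - 2849) / 2849, a gain of roughly 13%, while for 2 partitions it gives (3391 - 3397) / 3397, a loss of under 0.2%.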
Amit-san, On Tue, Mar 26, 2019 at 7:17 AM, Amit Langote wrote: > Rebased patches attached. I could only do the code review of v36-0001. On Mon, Mar 25, 2019 at 11:35 AM, Amit Langote wrote: > On 2019/03/23 6:07, Tom Lane wrote: > > Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> writes: > >> [ v34 patch set ] > > > > I had a bit of a look through this. I went ahead and pushed 0008 and > > 0009, as they seem straightforward and independent of the rest. (BTW, > > 0009 makes 0003's dubious optimization in set_relation_partition_info > > quite unnecessary.) As for the rest: > > > > 0001: OK in principle, but I wonder why you implemented it by adding > > another recursive scan of the jointree rather than just iterating > > over the baserel array, as in make_one_rel() for instance. Seems > > like this way is more work, and more code that we'll have to touch > > if we ever change the jointree representation. > > I've changed this to work by iterating over baserel array. I was mostly > worried about looping over potentially many elements of baserel array that > we simply end up skipping, but considering the other patch that delays > adding inheritance children, we don't need to worry about that. > > > I also feel like you used a dartboard while deciding where to insert the > > call in query_planner(); dropping it into the middle of a sequence of > > equivalence-class-related operations seems quite random. Maybe we could > > delay that step all the way to just before make_one_rel, since the other > > stuff in between seems to only care about baserels? For example, > > it'd certainly be better if we could run remove_useless_joins before > > otherrel expansion, so that we needn't do otherrel expansion at all > > on a table that gets thrown away as being a useless join. > > create_lateral_join_info() expects otherrels to be present to propagate > the information it creates. 
> > I have moved add_other_rels_to_query() followed by > create_lateral_join_info() to just before make_one_rel().

I checked that the 0001 patch correctly makes the changes discussed above. The only problem I found is a small typo: s/propgated/propagated/

I'm really sorry that I don't have more time to review right now...

-- Yoshikazu Imai
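For readers following the 0001 discussion above, here is a minimal sketch of the "iterate over the baserel array" shape that Tom suggested, as opposed to a recursive jointree scan. The MockRel type and mock_add_other_rels() are hypothetical stand-ins invented for this illustration, not the planner's real RelOptInfo or add_other_rels_to_query(); the only things taken from the thread are that the array is indexed by relid and that only base relations marked "inh" get expanded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the planner's structs. */
typedef enum
{
	MOCK_BASEREL,
	MOCK_OTHER_MEMBER_REL
} MockRelKind;

typedef struct MockRel
{
	MockRelKind kind;
	bool		inh;		/* does the RTE claim to have children? */
	int			expanded;	/* how many times this rel was expanded */
} MockRel;

/*
 * Walk the flat array once.  Slot 0 is unused because relids start at 1,
 * and NULL slots stand for RTEs that have no rel of their own (joins,
 * for instance).  Returns the number of rels that were expanded.
 */
int
mock_add_other_rels(MockRel **rel_array, int array_size)
{
	int			nexpanded = 0;

	assert(array_size >= 1);

	for (int rti = 1; rti < array_size; rti++)
	{
		MockRel    *rel = rel_array[rti];

		/* Nothing to do for empty slots. */
		if (rel == NULL)
			continue;

		/* Children are added by expanding their parent, so skip them. */
		if (rel->kind != MOCK_BASEREL)
			continue;

		/* Only rels marked as inheritance parents need expansion. */
		if (rel->inh)
		{
			rel->expanded++;
			nexpanded++;
		}
	}

	return nexpanded;
}
```

The loop costs one pass over the array regardless of how the jointree is shaped, which is the maintenance argument made above: nothing here needs to change if the jointree representation does.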
David Rowley <david.rowley@2ndquadrant.com> writes: > On Wed, 27 Mar 2019 at 18:39, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2019/03/27 14:26, David Rowley wrote: >>> Perhaps the way to make this work, at least in the long term is to do >>> in the planner what we did in the executor in d73f4c74dd34. >> Maybe you meant 9ddef36278a9? > Probably. Yeah, there's something to be said for having plancat.c open each table *and store the Relation pointer in the RelOptInfo*, and then close that again at the end of planning rather than immediately. If we can't avoid these retail table_opens without a great deal of pain, that's the direction I'd tend to go in. However it would add some overhead, in the form of a need to traverse the RelOptInfo array an additional time. regards, tom lane
On 2019/03/27 23:57, Tom Lane wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> On Wed, 27 Mar 2019 at 18:39, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> On 2019/03/27 14:26, David Rowley wrote: >>>> Perhaps the way to make this work, at least in the long term is to do >>>> in the planner what we did in the executor in d73f4c74dd34. > >>> Maybe you meant 9ddef36278a9? > >> Probably. > > Yeah, there's something to be said for having plancat.c open each table > *and store the Relation pointer in the RelOptInfo*, and then close that > again at the end of planning rather than immediately. If we can't avoid > these retail table_opens without a great deal of pain, that's the > direction I'd tend to go in. However it would add some overhead, in > the form of a need to traverse the RelOptInfo array an additional time. Just to be sure, do you mean we should do that now or later (David said "in the long term")? Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/03/27 23:57, Tom Lane wrote: >> Yeah, there's something to be said for having plancat.c open each table >> *and store the Relation pointer in the RelOptInfo*, and then close that >> again at the end of planning rather than immediately. If we can't avoid >> these retail table_opens without a great deal of pain, that's the >> direction I'd tend to go in. However it would add some overhead, in >> the form of a need to traverse the RelOptInfo array an additional time. > Just to be sure, do you mean we should do that now or later (David said > "in the long term")? It's probably not high priority, though I wonder how much time is being eaten by the repeated table_opens. regards, tom lane
On 2019/03/28 14:03, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> On 2019/03/27 23:57, Tom Lane wrote: >>> Yeah, there's something to be said for having plancat.c open each table >>> *and store the Relation pointer in the RelOptInfo*, and then close that >>> again at the end of planning rather than immediately. If we can't avoid >>> these retail table_opens without a great deal of pain, that's the >>> direction I'd tend to go in. However it would add some overhead, in >>> the form of a need to traverse the RelOptInfo array an additional time. > >> Just to be sure, do you mean we should do that now or later (David said >> "in the long term")? > > It's probably not high priority, though I wonder how much time is being > eaten by the repeated table_opens.

It took me a couple of hours to confirm this, but the time spent in the repeated table_opens seems to be significantly less than the time it takes to go over simple_rel_array one more time to close the relations. The scan of simple_rel_array to close the relations needs to be done once per query_planner() invocation, whereas I'd hoped this could be done, say, in add_rtes_to_flat_rtable(), which has to iterate over the range table anyway. That's because we call query_planner() multiple times on the same query in a few cases, viz. build_minmax_path() and inheritance_planner().

Within query_planner(), a table is opened in expand_inherited_rtentry(), get_relation_constraints(), get_relation_data_width(), and build_physical_tlist(), of which the first two are more common. So, we end up doing table_open() 3 times on average for any given relation, just inside query_planner().

Thanks, Amit
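To make the trade-off in this sub-thread concrete, here is a small self-contained sketch of the "open once, keep the pointer in the RelOptInfo, close everything at the end" idea. Everything in it is a mock written for this illustration: MockRelation, MockRelOptInfo and the mock_* functions are hypothetical names, and mock_table_open() takes a pointer rather than an OID and lock mode as the real table_open() does. It only aims to show where the saving (fewer opens) and the cost (one extra array traversal) come from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an open relation handle. */
typedef struct MockRelation
{
	unsigned int oid;
	int			refcount;	/* number of outstanding opens */
} MockRelation;

/* Hypothetical stand-in for a RelOptInfo that caches the handle. */
typedef struct MockRelOptInfo
{
	MockRelation *relation; /* NULL until the first caller needs it */
} MockRelOptInfo;

/* Counts how many times the (mock) open routine actually ran. */
int			mock_open_calls = 0;

MockRelation *
mock_table_open(MockRelation *rel)
{
	mock_open_calls++;
	rel->refcount++;
	return rel;
}

void
mock_table_close(MockRelation *rel)
{
	assert(rel->refcount > 0);
	rel->refcount--;
}

/* Open on first use only; later callers reuse the cached handle. */
MockRelation *
mock_get_relation(MockRelOptInfo *rel, MockRelation *target)
{
	if (rel->relation == NULL)
		rel->relation = mock_table_open(target);
	return rel->relation;
}

/*
 * The extra traversal mentioned above: visit every slot once at the end
 * of planning and close whatever was left open.  Slot 0 is unused.
 * Returns the number of relations closed.
 */
int
mock_close_cached_relations(MockRelOptInfo **rel_array, int array_size)
{
	int			nclosed = 0;

	for (int rti = 1; rti < array_size; rti++)
	{
		MockRelOptInfo *rel = rel_array[rti];

		if (rel == NULL || rel->relation == NULL)
			continue;
		mock_table_close(rel->relation);
		rel->relation = NULL;
		nclosed++;
	}

	return nclosed;
}
```

With three callers asking for the same relation, the mock opens it once instead of three times, but mock_close_cached_relations() has to walk the whole array, including slots that never cached anything. That is the overhead being weighed above when query_planner() runs more than once for the same query.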
I've been hacking on this pretty hard over the last couple of days, because I really didn't like the contortions you'd made to allow inheritance_planner to call expand_inherited_rtentry in a completely different context than the regular code path did. I eventually got rid of that by having inheritance_planner run one cycle of planning the query as if it were a SELECT, and extracting the list of unpruned children from that. I had to rearrange the generation of the final rtable a bit to make that work, but I think that inheritance_planner winds up somewhat cleaner and safer this way; it's making (slightly) fewer assumptions about how much the results of planning the child queries can vary.

Perhaps somebody will object that that's one more planning pass than we had before, but I'm not very concerned, because

(a) at least for partitioned tables that we can prune successfully, this should still be better than v11, since we avoid the planning passes for pruned children.

(b) inheritance_planner is horridly inefficient anyway, in that it has to run a near-duplicate planning pass for each child table. If we're concerned about its cost, we should be working to get rid of the function altogether, as per [1]. In the meantime, I do not want to contort other code to make life easier for inheritance_planner.

There are still some loose ends:

1. I don't like 0003 much, and omitted it from the attached. I think that what we ought to be doing instead is not having holes in the part_rels[] arrays to begin with, i.e. they should only include the unpruned partitions. If we are actually encoding any important information in those array positions, I suspect that is broken anyway in view of 898e5e329: we can't assume that the association of child rels with particular PartitionDesc slots will hold still from planning to execution.

2. I seriously dislike what's been done in joinrels.c, too. That really seems like a kluge (and I haven't had time to study it closely).

3. 
It's not entirely clear to me why the patch has to touch execPartition.c. That implies that the planner-to-executor API changed, but how so, and why is there no comment update clarifying it? Given the short amount of time left in this CF, there may not be time to address the first two points, and I won't necessarily insist that those be changed before committing. I'd like at least a comment about point 3 though. Attached is updated patch as a single patch --- I didn't think the division into multiple patches was terribly helpful, due to the flapping in expected regression results. regards, tom lane [1] https://www.postgresql.org/message-id/357.1550612935@sss.pgh.pa.us diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 9ad9035..310c715 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -7116,9 +7116,9 @@ select * from bar where f1 in (select f1 from foo) for update; QUERY PLAN ---------------------------------------------------------------------------------------------- LockRows - Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid + Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid -> Hash Join - Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid + Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid Inner Unique: true Hash Cond: (bar.f1 = foo.f1) -> Append @@ -7128,15 +7128,15 @@ select * from bar where f1 in (select f1 from foo) for update; Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> 
Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (23 rows) @@ -7154,9 +7154,9 @@ select * from bar where f1 in (select f1 from foo) for share; QUERY PLAN ---------------------------------------------------------------------------------------------- LockRows - Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid + Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid -> Hash Join - Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid + Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid Inner Unique: true Hash Cond: (bar.f1 = foo.f1) -> Append @@ -7166,15 +7166,15 @@ select * from bar where f1 in (select f1 from foo) for share; Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE -> Hash - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (23 rows) @@ -7203,15 +7203,15 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); -> Seq Scan on public.bar Output: bar.f1, bar.f2, bar.ctid -> Hash - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate - Output: foo.ctid, 
foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid @@ -7221,15 +7221,15 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo); Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo - Output: foo.ctid, foo.*, foo.tableoid, foo.f1 + Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 - Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 + Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (39 rows) @@ -8435,7 +8435,7 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) Foreign Scan Output: t1.a, ftprt2_p1.b, ftprt2_p1.c Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2) - Remote SQL: SELECT r5.a, r7.b, r7.c FROM (public.fprt1_p1 r5 LEFT JOIN public.fprt2_p1 r7 ON (((r5.a = r7.b)) AND ((r5.b= r7.a)) AND ((r7.a < 10)))) WHERE ((r5.a < 10)) ORDER BY r5.a ASC NULLS LAST, r7.b ASC NULLS LAST, r7.c ASC NULLSLAST + Remote SQL: SELECT r5.a, r6.b, r6.c FROM (public.fprt1_p1 r5 LEFT JOIN public.fprt2_p1 r6 ON (((r5.a = r6.b)) AND ((r5.b= r6.a)) AND ((r6.a < 10)))) WHERE ((r5.a < 10)) ORDER BY r5.a ASC NULLS LAST, r6.b ASC NULLS LAST, r6.c ASC NULLSLAST (4 rows) SELECT 
t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHEREt1.a < 10 ORDER BY 1,2,3; diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index cfad8a3..b72db85 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -1654,9 +1654,17 @@ ExecCreatePartitionPruneState(PlanState *planstate, memcpy(pprune->subplan_map, pinfo->subplan_map, sizeof(int) * pinfo->nparts); - /* Double-check that list of relations has not changed. */ - Assert(memcmp(partdesc->oids, pinfo->relid_map, - pinfo->nparts * sizeof(Oid)) == 0); + /* + * Double-check that the list of unpruned relations has not + * changed. (Pruned partitions are not in relid_map[].) + */ +#ifdef USE_ASSERT_CHECKING + for (int k = 0; k < pinfo->nparts; k++) + { + Assert(partdesc->oids[k] == pinfo->relid_map[k] || + pinfo->subplan_map[k] == -1); + } +#endif } else { diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 56a5084..09f5f0c 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -139,9 +139,6 @@ static void subquery_push_qual(Query *subquery, static void recurse_push_qual(Node *setOp, Query *topquery, RangeTblEntry *rte, Index rti, Node *qual); static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel); -static bool apply_child_basequals(PlannerInfo *root, RelOptInfo *rel, - RelOptInfo *childrel, - RangeTblEntry *childRTE, AppendRelInfo *appinfo); /* @@ -946,8 +943,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, double *parent_attrsizes; int nattrs; ListCell *l; - Relids live_children = NULL; - bool did_pruning = false; /* Guard against stack overflow due to overly deep inheritance tree. 
*/ check_stack_depth(); @@ -966,21 +961,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, rel->partitioned_child_rels = list_make1_int(rti); /* - * If the partitioned relation has any baserestrictinfo quals then we - * attempt to use these quals to prune away partitions that cannot - * possibly contain any tuples matching these quals. In this case we'll - * store the relids of all partitions which could possibly contain a - * matching tuple, and skip anything else in the loop below. - */ - if (enable_partition_pruning && - rte->relkind == RELKIND_PARTITIONED_TABLE && - rel->baserestrictinfo != NIL) - { - live_children = prune_append_rel_partitions(rel); - did_pruning = true; - } - - /* * If this is a partitioned baserel, set the consider_partitionwise_join * flag; currently, we only consider partitionwise joins with the baserel * if its targetlist doesn't contain a whole-row Var. @@ -1034,30 +1014,17 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, childrel = find_base_rel(root, childRTindex); Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL); - if (did_pruning && !bms_is_member(appinfo->child_relid, live_children)) - { - /* This partition was pruned; skip it. */ - set_dummy_rel_pathlist(childrel); + /* We may have already proven the child to be dummy. */ + if (IS_DUMMY_REL(childrel)) continue; - } /* * We have to copy the parent's targetlist and quals to the child, - * with appropriate substitution of variables. If any constant false - * or NULL clauses turn up, we can disregard the child right away. If - * not, we can apply constraint exclusion with just the - * baserestrictinfo quals. + * with appropriate substitution of variables. However, the + * baserestrictinfo quals were already copied/substituted when the + * child RelOptInfo was built. So we don't need any additional setup + * before applying constraint exclusion. 
*/ - if (!apply_child_basequals(root, rel, childrel, childRTE, appinfo)) - { - /* - * Some restriction clause reduced to constant FALSE or NULL after - * substitution, so this child need not be scanned. - */ - set_dummy_rel_pathlist(childrel); - continue; - } - if (relation_excluded_by_constraints(root, childrel, childRTE)) { /* @@ -1069,7 +1036,8 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, } /* - * CE failed, so finish copying/modifying targetlist and join quals. + * Constraint exclusion failed, so copy the parent's join quals and + * targetlist to the child, with appropriate variable substitutions. * * NB: the resulting childrel->reltarget->exprs may contain arbitrary * expressions, which otherwise would not occur in a rel's targetlist. @@ -3594,133 +3562,6 @@ generate_partitionwise_join_paths(PlannerInfo *root, RelOptInfo *rel) list_free(live_children); } -/* - * apply_child_basequals - * Populate childrel's quals based on rel's quals, translating them using - * appinfo. - * - * If any of the resulting clauses evaluate to false or NULL, we return false - * and don't apply any quals. Caller can mark the relation as a dummy rel in - * this case, since it needn't be scanned. - * - * If any resulting clauses evaluate to true, they're unnecessary and we don't - * apply then. - */ -static bool -apply_child_basequals(PlannerInfo *root, RelOptInfo *rel, - RelOptInfo *childrel, RangeTblEntry *childRTE, - AppendRelInfo *appinfo) -{ - List *childquals; - Index cq_min_security; - ListCell *lc; - - /* - * The child rel's targetlist might contain non-Var expressions, which - * means that substitution into the quals could produce opportunities for - * const-simplification, and perhaps even pseudoconstant quals. Therefore, - * transform each RestrictInfo separately to see if it reduces to a - * constant or pseudoconstant. (We must process them separately to keep - * track of the security level of each qual.) 
- */ - childquals = NIL; - cq_min_security = UINT_MAX; - foreach(lc, rel->baserestrictinfo) - { - RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc); - Node *childqual; - ListCell *lc2; - - Assert(IsA(rinfo, RestrictInfo)); - childqual = adjust_appendrel_attrs(root, - (Node *) rinfo->clause, - 1, &appinfo); - childqual = eval_const_expressions(root, childqual); - /* check for flat-out constant */ - if (childqual && IsA(childqual, Const)) - { - if (((Const *) childqual)->constisnull || - !DatumGetBool(((Const *) childqual)->constvalue)) - { - /* Restriction reduces to constant FALSE or NULL */ - return false; - } - /* Restriction reduces to constant TRUE, so drop it */ - continue; - } - /* might have gotten an AND clause, if so flatten it */ - foreach(lc2, make_ands_implicit((Expr *) childqual)) - { - Node *onecq = (Node *) lfirst(lc2); - bool pseudoconstant; - - /* check for pseudoconstant (no Vars or volatile functions) */ - pseudoconstant = - !contain_vars_of_level(onecq, 0) && - !contain_volatile_functions(onecq); - if (pseudoconstant) - { - /* tell createplan.c to check for gating quals */ - root->hasPseudoConstantQuals = true; - } - /* reconstitute RestrictInfo with appropriate properties */ - childquals = lappend(childquals, - make_restrictinfo((Expr *) onecq, - rinfo->is_pushed_down, - rinfo->outerjoin_delayed, - pseudoconstant, - rinfo->security_level, - NULL, NULL, NULL)); - /* track minimum security level among child quals */ - cq_min_security = Min(cq_min_security, rinfo->security_level); - } - } - - /* - * In addition to the quals inherited from the parent, we might have - * securityQuals associated with this particular child node. (Currently - * this can only happen in appendrels originating from UNION ALL; - * inheritance child tables don't have their own securityQuals, see - * expand_inherited_rtentry().) Pull any such securityQuals up into the - * baserestrictinfo for the child. 
This is similar to - * process_security_barrier_quals() for the parent rel, except that we - * can't make any general deductions from such quals, since they don't - * hold for the whole appendrel. - */ - if (childRTE->securityQuals) - { - Index security_level = 0; - - foreach(lc, childRTE->securityQuals) - { - List *qualset = (List *) lfirst(lc); - ListCell *lc2; - - foreach(lc2, qualset) - { - Expr *qual = (Expr *) lfirst(lc2); - - /* not likely that we'd see constants here, so no check */ - childquals = lappend(childquals, - make_restrictinfo(qual, - true, false, false, - security_level, - NULL, NULL, NULL)); - cq_min_security = Min(cq_min_security, security_level); - } - security_level++; - } - Assert(security_level <= root->qual_security_level); - } - - /* - * OK, we've got all the baserestrictinfo quals for this child. - */ - childrel->baserestrictinfo = childquals; - childrel->baserestrict_min_security = cq_min_security; - - return true; -} /***************************************************************************** * DEBUG SUPPORT diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 9604a54..d17cc05 100644 --- a/src/backend/optimizer/path/joinrels.c +++ b/src/backend/optimizer/path/joinrels.c @@ -14,6 +14,7 @@ */ #include "postgres.h" +#include "access/table.h" #include "miscadmin.h" #include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" @@ -21,6 +22,8 @@ #include "optimizer/joininfo.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" +#include "optimizer/tlist.h" +#include "parser/parsetree.h" #include "partitioning/partbounds.h" #include "utils/lsyscache.h" #include "utils/memutils.h" @@ -51,6 +54,9 @@ static SpecialJoinInfo *build_child_join_sjinfo(PlannerInfo *root, Relids left_relids, Relids right_relids); static int match_expr_to_partition_keys(Expr *expr, RelOptInfo *rel, bool strict_op); +static RelOptInfo *build_dummy_partition_rel(PlannerInfo *root, + RelOptInfo *parent, Relation 
parentrel, + int partidx); /* @@ -1346,6 +1352,8 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo, List *parent_restrictlist) { + Relation baserel1 = NULL, + baserel2 = NULL; bool rel1_is_simple = IS_SIMPLE_REL(rel1); bool rel2_is_simple = IS_SIMPLE_REL(rel2); int nparts; @@ -1396,6 +1404,19 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, nparts = joinrel->nparts; + if (rel1_is_simple) + { + RangeTblEntry *rte = planner_rt_fetch(rel1->relid, root); + + baserel1 = table_open(rte->relid, NoLock); + } + if (rel2_is_simple) + { + RangeTblEntry *rte = planner_rt_fetch(rel2->relid, root); + + baserel2 = table_open(rte->relid, NoLock); + } + /* * Create child-join relations for this partitioned join, if those don't * exist. Add paths to child-joins for a pair of child relations @@ -1412,6 +1433,13 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, AppendRelInfo **appinfos; int nappinfos; + if (rel1_is_simple && child_rel1 == NULL) + child_rel1 = build_dummy_partition_rel(root, rel1, baserel1, + cnt_parts); + if (rel2_is_simple && child_rel2 == NULL) + child_rel2 = build_dummy_partition_rel(root, rel2, baserel2, + cnt_parts); + /* * If a child table has consider_partitionwise_join=false, it means * that it's a dummy relation for which we skipped setting up tlist @@ -1472,6 +1500,11 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, child_joinrel, child_sjinfo, child_restrictlist); } + + if (baserel1) + table_close(baserel1, NoLock); + if (baserel2) + table_close(baserel2, NoLock); } /* @@ -1490,8 +1523,14 @@ update_child_rel_info(PlannerInfo *root, (Node *) rel->reltarget->exprs, 1, &appinfo); - /* Make child entries in the EquivalenceClass as well */ - if (rel->has_eclass_joins || has_useful_pathkeys(root, rel)) + /* + * Make child entries in the EquivalenceClass as well. 
If the childrel + * appears to be a dummy one (one built by build_dummy_partition_rel()), + * no need to make any new entries, because anything that would need those + * can instead use the parent's (rel). + */ + if (childrel->relid != rel->relid && + (rel->has_eclass_joins || has_useful_pathkeys(root, rel))) add_child_rel_equivalences(root, appinfo, rel, childrel); childrel->has_eclass_joins = rel->has_eclass_joins; } @@ -1702,3 +1741,53 @@ match_expr_to_partition_keys(Expr *expr, RelOptInfo *rel, bool strict_op) return -1; } + +/* + * build_dummy_partition_rel + * Build a RelOptInfo and AppendRelInfo for a pruned partition + * + * This does not result in opening the relation or a range table entry being + * created. Also, the RelOptInfo thus created is not stored anywhere else + * beside the parent's part_rels array. + * + * The only reason this exists is because partition-wise join, in some cases, + * needs a RelOptInfo to represent an empty relation that's on the nullable + * side of an outer join, so that a Path representing the outer join can be + * created. + */ +static RelOptInfo * +build_dummy_partition_rel(PlannerInfo *root, RelOptInfo *parent, + Relation parentrel, int partidx) +{ + RelOptInfo *rel; + + Assert(parent->part_rels[partidx] == NULL); + + /* Create minimally valid-looking RelOptInfo with parent's relid. */ + rel = makeNode(RelOptInfo); + rel->reloptkind = RELOPT_OTHER_MEMBER_REL; + rel->relid = parent->relid; + rel->relids = bms_copy(parent->relids); + if (parent->top_parent_relids) + rel->top_parent_relids = parent->top_parent_relids; + else + rel->top_parent_relids = bms_copy(parent->relids); + rel->reltarget = copy_pathtarget(parent->reltarget); + parent->part_rels[partidx] = rel; + mark_dummy_rel(rel); + + /* + * Now we'll need a (no-op) AppendRelInfo for parent, because we're + * setting the dummy partition's relid to be same as the parent's. 
+ */ + if (root->append_rel_array[parent->relid] == NULL) + { + AppendRelInfo *appinfo = make_append_rel_info(parentrel, parentrel, + parent->relid, + parent->relid); + + root->append_rel_array[parent->relid] = appinfo; + } + + return rel; +} diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c index 62dfac6..b3f264a 100644 --- a/src/backend/optimizer/plan/initsplan.c +++ b/src/backend/optimizer/plan/initsplan.c @@ -20,6 +20,7 @@ #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" +#include "optimizer/inherit.h" #include "optimizer/joininfo.h" #include "optimizer/optimizer.h" #include "optimizer/pathnode.h" @@ -161,9 +162,10 @@ add_other_rels_to_query(PlannerInfo *root) if (rte->inh) { /* Only relation and subquery RTEs can have children. */ - Assert(rte->rtekind == RTE_RELATION || - rte->rtekind == RTE_SUBQUERY); - add_appendrel_other_rels(root, rel, rti); + if (rte->rtekind == RTE_RELATION) + expand_inherited_rtentry(root, rel, rte, rti); + else + expand_appendrel_subquery(root, rel, rte, rti); } } } diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index ca7a0fb..c4d00b4 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -25,6 +25,7 @@ #include "access/table.h" #include "access/xact.h" #include "catalog/pg_constraint.h" +#include "catalog/pg_inherits.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "executor/executor.h" @@ -679,12 +680,14 @@ subquery_planner(PlannerGlobal *glob, Query *parse, flatten_simple_union_all(root); /* - * Detect whether any rangetable entries are RTE_JOIN kind; if not, we can - * avoid the expense of doing flatten_join_alias_vars(). Likewise check - * whether any are RTE_RESULT kind; if not, we can skip - * remove_useless_result_rtes(). Also check for outer joins --- if none, - * we can skip reduce_outer_joins(). And check for LATERAL RTEs, too. 
- * This must be done after we have done pull_up_subqueries(), of course. + * Check rangetable entries marked "inh" to see if they really need to be + * treated as inheritance parents. Also detect whether any rangetable + * entries are RTE_JOIN kind; if not, we can avoid the expense of doing + * flatten_join_alias_vars(). Also check for outer joins --- if none, we + * can skip reduce_outer_joins(). Likewise check whether any RTEs are + * RTE_RESULT kind; if not, we can skip remove_useless_result_rtes(). And + * check for LATERAL RTEs, too. This must be done after we have done + * pull_up_subqueries(), of course. */ root->hasJoinRTEs = false; root->hasLateralRTEs = false; @@ -694,15 +697,36 @@ subquery_planner(PlannerGlobal *glob, Query *parse, { RangeTblEntry *rte = lfirst_node(RangeTblEntry, l); - if (rte->rtekind == RTE_JOIN) + switch (rte->rtekind) { - root->hasJoinRTEs = true; - if (IS_OUTER_JOIN(rte->jointype)) - hasOuterJoins = true; - } - else if (rte->rtekind == RTE_RESULT) - { - hasResultRTEs = true; + case RTE_RELATION: + if (rte->inh) + { + /* + * Check to see if the relation actually has any children; + * if not, clear the inh flag so we can treat it as a + * plain base relation. + * + * Note: this could give a false-positive result, if the + * rel once had children but no longer does. We used to + * be able to reset rte->inh later on when we discovered + * that, but no more; we have to handle such cases as + * full-fledged inheritance. + */ + rte->inh = has_subclass(rte->relid); + } + break; + case RTE_JOIN: + root->hasJoinRTEs = true; + if (IS_OUTER_JOIN(rte->jointype)) + hasOuterJoins = true; + break; + case RTE_RESULT: + hasResultRTEs = true; + break; + default: + /* No work here for other RTE types */ + break; } if (rte->lateral) root->hasLateralRTEs = true; @@ -710,23 +734,11 @@ subquery_planner(PlannerGlobal *glob, Query *parse, /* * Preprocess RowMark information. 
We need to do this after subquery - * pullup (so that all non-inherited RTEs are present) and before - * inheritance expansion (so that the info is available for - * expand_inherited_tables to examine and modify). + * pullup, so that all base relations are present. */ preprocess_rowmarks(root); /* - * Expand any rangetable entries that are inheritance sets into "append - * relations". This can add entries to the rangetable, but they must be - * plain RTE_RELATION entries, so it's OK (and marginally more efficient) - * to do it after checking for joins and other special RTEs. We must do - * this after pulling up subqueries, else we'd fail to handle inherited - * tables in subqueries. - */ - expand_inherited_tables(root); - - /* * Set hasHavingQual to remember if HAVING clause is present. Needed * because preprocess_expression will reduce a constant-true condition to * an empty qual list ... but "HAVING TRUE" is not a semantic no-op. @@ -1180,11 +1192,17 @@ inheritance_planner(PlannerInfo *root) { Query *parse = root->parse; int top_parentRTindex = parse->resultRelation; + List *select_rtable; + List *select_appinfos; + List *child_appinfos; + List *old_child_rtis; + List *new_child_rtis; Bitmapset *subqueryRTindexes; - Bitmapset *modifiableARIindexes; + Index next_subquery_rti; int nominalRelation = -1; Index rootRelation = 0; List *final_rtable = NIL; + List *final_rowmarks = NIL; int save_rel_array_size = 0; RelOptInfo **save_rel_array = NULL; AppendRelInfo **save_append_rel_array = NULL; @@ -1196,14 +1214,15 @@ inheritance_planner(PlannerInfo *root) List *rowMarks; RelOptInfo *final_rel; ListCell *lc; + ListCell *lc2; Index rti; RangeTblEntry *parent_rte; - PlannerInfo *parent_root; - Query *parent_parse; - Bitmapset *parent_relids = bms_make_singleton(top_parentRTindex); - PlannerInfo **parent_roots = NULL; + Bitmapset *parent_relids; + Query **parent_parses; - Assert(parse->commandType != CMD_INSERT); + /* Should only get here for UPDATE or DELETE */ + 
Assert(parse->commandType == CMD_UPDATE || + parse->commandType == CMD_DELETE); /* * We generate a modified instance of the original Query for each target @@ -1234,39 +1253,14 @@ inheritance_planner(PlannerInfo *root) } /* - * Next, we want to identify which AppendRelInfo items contain references - * to any of the aforesaid subquery RTEs. These items will need to be - * copied and modified to adjust their subquery references; whereas the - * other ones need not be touched. It's worth being tense over this - * because we can usually avoid processing most of the AppendRelInfo - * items, thereby saving O(N^2) space and time when the target is a large - * inheritance tree. We can identify AppendRelInfo items by their - * child_relid, since that should be unique within the list. - */ - modifiableARIindexes = NULL; - if (subqueryRTindexes != NULL) - { - foreach(lc, root->append_rel_list) - { - AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); - - if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) || - bms_is_member(appinfo->child_relid, subqueryRTindexes) || - bms_overlap(pull_varnos((Node *) appinfo->translated_vars), - subqueryRTindexes)) - modifiableARIindexes = bms_add_member(modifiableARIindexes, - appinfo->child_relid); - } - } - - /* * If the parent RTE is a partitioned table, we should use that as the * nominal target relation, because the RTEs added for partitioned tables * (including the root parent) as child members of the inheritance set do * not appear anywhere else in the plan, so the confusion explained below * for non-partitioning inheritance cases is not possible. 
 	 */
-	parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable);
+	parent_rte = rt_fetch(top_parentRTindex, parse->rtable);
+	Assert(parent_rte->inh);
 	if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		nominalRelation = top_parentRTindex;
@@ -1274,48 +1268,218 @@ inheritance_planner(PlannerInfo *root)
 	}

 	/*
-	 * The PlannerInfo for each child is obtained by translating the relevant
-	 * members of the PlannerInfo for its immediate parent, which we find
-	 * using the parent_relid in its AppendRelInfo. We save the PlannerInfo
-	 * for each parent in an array indexed by relid for fast retrieval. Since
-	 * the maximum number of parents is limited by the number of RTEs in the
-	 * query, we use that number to allocate the array. An extra entry is
-	 * needed since relids start from 1.
+	 * Before generating the real per-child-relation plans, do a cycle of
+	 * planning as though the query were a SELECT. The objective here is to
+	 * find out which child relations need to be processed, using the same
+	 * expansion and pruning logic as for a SELECT. We'll then pull out the
+	 * RangeTblEntry-s generated for the child rels, and make use of the
+	 * AppendRelInfo entries for them to guide the real planning. (This is
+	 * rather inefficient; we could perhaps stop short of making a full Path
+	 * tree. But this whole function is inefficient and slated for
+	 * destruction, so let's not contort query_planner for that.)
+	 */
+	{
+		PlannerInfo *subroot;
+
+		/*
+		 * Flat-copy the PlannerInfo to prevent modification of the original.
+		 */
+		subroot = makeNode(PlannerInfo);
+		memcpy(subroot, root, sizeof(PlannerInfo));
+
+		/*
+		 * Make a deep copy of the parsetree for this planning cycle to mess
+		 * around with, and change it to look like a SELECT. (Hack alert: the
+		 * target RTE still has updatedCols set if this is an UPDATE, so that
+		 * expand_partitioned_rtentry will correctly update
+		 * subroot->partColsUpdated.)
+		 */
+		subroot->parse = copyObject(root->parse);
+
+		subroot->parse->commandType = CMD_SELECT;
+		subroot->parse->resultRelation = 0;
+
+		/*
+		 * Ensure the subroot has its own copy of the original
+		 * append_rel_list, since it'll be scribbled on. (Note that at this
+		 * point, the list only contains AppendRelInfos for flattened UNION
+		 * ALL subqueries.)
+		 */
+		subroot->append_rel_list = copyObject(root->append_rel_list);
+
+		/*
+		 * Better make a private copy of the rowMarks, too.
+		 */
+		subroot->rowMarks = copyObject(root->rowMarks);
+
+		/* There shouldn't be any OJ info to translate, as yet */
+		Assert(subroot->join_info_list == NIL);
+		/* and we haven't created PlaceHolderInfos, either */
+		Assert(subroot->placeholder_list == NIL);
+
+		/* Generate Path(s) for accessing this result relation */
+		grouping_planner(subroot, true, 0.0 /* retrieve all tuples */ );
+
+		/* Extract the info we need. */
+		select_rtable = subroot->parse->rtable;
+		select_appinfos = subroot->append_rel_list;
+
+		/*
+		 * We need to propagate partColsUpdated back, too. (The later
+		 * planning cycles will not set this because they won't run
+		 * expand_partitioned_rtentry for the UPDATE target.)
+		 */
+		root->partColsUpdated = subroot->partColsUpdated;
+	}
+
+	/*----------
+	 * Since only one rangetable can exist in the final plan, we need to make
+	 * sure that it contains all the RTEs needed for any child plan. This is
+	 * complicated by the need to use separate subquery RTEs for each child.
+	 * We arrange the final rtable as follows:
+	 * 1. All original rtable entries (with their original RT indexes).
+	 * 2. All the relation RTEs generated for children of the target table.
+	 * 3. Subquery RTEs for children after the first. We need N * (K - 1)
+	 *    RT slots for this, if there are N subqueries and K child tables.
+	 * 4. Additional RTEs generated during the child planning runs, such as
+	 *    children of inheritable RTEs other than the target table.
+	 * We assume that each child planning run will create an identical set
+	 * of type-4 RTEs.
+	 *
+	 * So the next thing to do is append the type-2 RTEs (the target table's
+	 * children) to the original rtable. We look through select_appinfos
+	 * to find them.
+	 *
+	 * To identify which AppendRelInfos are relevant as we thumb through
+	 * select_appinfos, we need to look for both direct and indirect children
+	 * of top_parentRTindex, so we use a bitmap of known parent relids.
+	 * expand_inherited_rtentry() always processes a parent before any of that
+	 * parent's children, so we should see an intermediate parent before its
+	 * children.
+	 *----------
+	 */
+	child_appinfos = NIL;
+	old_child_rtis = NIL;
+	new_child_rtis = NIL;
+	parent_relids = bms_make_singleton(top_parentRTindex);
+	foreach(lc, select_appinfos)
+	{
+		AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+		RangeTblEntry *child_rte;
+
+		/* append_rel_list contains all append rels; ignore others */
+		if (!bms_is_member(appinfo->parent_relid, parent_relids))
+			continue;
+
+		/* remember relevant AppendRelInfos for use below */
+		child_appinfos = lappend(child_appinfos, appinfo);
+
+		/* extract RTE for this child rel */
+		child_rte = rt_fetch(appinfo->child_relid, select_rtable);
+
+		/* and append it to the original rtable */
+		parse->rtable = lappend(parse->rtable, child_rte);
+
+		/* remember child's index in the SELECT rtable */
+		old_child_rtis = lappend_int(old_child_rtis, appinfo->child_relid);
+
+		/* and its new index in the final rtable */
+		new_child_rtis = lappend_int(new_child_rtis, list_length(parse->rtable));
+
+		/* if child is itself partitioned, update parent_relids */
+		if (child_rte->inh)
+		{
+			Assert(child_rte->relkind == RELKIND_PARTITIONED_TABLE);
+			parent_relids = bms_add_member(parent_relids, appinfo->child_relid);
+		}
+	}
+
+	/*
+	 * It's possible that the RTIs we just assigned for the child rels in the
+	 * final rtable are different from where they were in the SELECT query.
+	 * Adjust the AppendRelInfos so that they will correctly map RT indexes to
+	 * the final indexes. We can do this left-to-right since no child rel's
+	 * final RT index could be greater than what it had in the SELECT query.
 	 */
-	parent_roots = (PlannerInfo **) palloc0((list_length(parse->rtable) + 1) *
-											sizeof(PlannerInfo *));
-	parent_roots[top_parentRTindex] = root;
+	forboth(lc, old_child_rtis, lc2, new_child_rtis)
+	{
+		int old_child_rti = lfirst_int(lc);
+		int new_child_rti = lfirst_int(lc2);
+
+		if (old_child_rti == new_child_rti)
+			continue;			/* nothing to do */
+
+		Assert(old_child_rti > new_child_rti);
+
+		ChangeVarNodes((Node *) child_appinfos,
+					   old_child_rti, new_child_rti, 0);
+	}
+
+	/*
+	 * Now set up rangetable entries for subqueries for additional children
+	 * (the first child will just use the original ones). These all have to
+	 * look more or less real, or EXPLAIN will get unhappy; so we just make
+	 * them all clones of the original subqueries.
+	 */
+	next_subquery_rti = list_length(parse->rtable) + 1;
+	if (subqueryRTindexes != NULL)
+	{
+		int n_children = list_length(child_appinfos);
+
+		while (n_children-- > 1)
+		{
+			int oldrti = -1;
+
+			while ((oldrti = bms_next_member(subqueryRTindexes, oldrti)) >= 0)
+			{
+				RangeTblEntry *subqrte;
+
+				subqrte = rt_fetch(oldrti, parse->rtable);
+				parse->rtable = lappend(parse->rtable, copyObject(subqrte));
+			}
+		}
+	}
+
+	/*
+	 * The query for each child is obtained by translating the query for its
+	 * immediate parent, since the AppendRelInfo data we have shows deltas
+	 * between parents and children. We use the parent_parses array to
+	 * remember the appropriate query trees. This is indexed by parent relid.
+	 * Since the maximum number of parents is limited by the number of RTEs in
+	 * the SELECT query, we use that number to allocate the array. An extra
+	 * entry is needed since relids start from 1.
+	 */
+	parent_parses = (Query **) palloc0((list_length(select_rtable) + 1) *
+									   sizeof(Query *));
+	parent_parses[top_parentRTindex] = parse;

 	/*
 	 * And now we can get on with generating a plan for each child table.
 	 */
-	foreach(lc, root->append_rel_list)
+	foreach(lc, child_appinfos)
 	{
 		AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+		Index this_subquery_rti = next_subquery_rti;
+		Query *parent_parse;
 		PlannerInfo *subroot;
 		RangeTblEntry *child_rte;
 		RelOptInfo *sub_final_rel;
 		Path *subpath;

-		/* append_rel_list contains all append rels; ignore others */
-		if (!bms_is_member(appinfo->parent_relid, parent_relids))
-			continue;
-
 		/*
 		 * expand_inherited_rtentry() always processes a parent before any of
-		 * that parent's children, so the parent_root for this relation should
-		 * already be available.
+		 * that parent's children, so the parent query for this relation
+		 * should already be available.
 		 */
-		parent_root = parent_roots[appinfo->parent_relid];
-		Assert(parent_root != NULL);
-		parent_parse = parent_root->parse;
+		parent_parse = parent_parses[appinfo->parent_relid];
+		Assert(parent_parse != NULL);

 		/*
 		 * We need a working copy of the PlannerInfo so that we can control
 		 * propagation of information back to the main copy.
 		 */
 		subroot = makeNode(PlannerInfo);
-		memcpy(subroot, parent_root, sizeof(PlannerInfo));
+		memcpy(subroot, root, sizeof(PlannerInfo));

 		/*
 		 * Generate modified query with this rel as target. We first apply
@@ -1324,7 +1488,7 @@ inheritance_planner(PlannerInfo *root)
 		 * then fool around with subquery RTEs.
 		 */
 		subroot->parse = (Query *)
-			adjust_appendrel_attrs(parent_root,
+			adjust_appendrel_attrs(subroot,
 								   (Node *) parent_parse,
 								   1, &appinfo);

@@ -1360,9 +1524,7 @@ inheritance_planner(PlannerInfo *root)
 		if (child_rte->inh)
 		{
 			Assert(child_rte->relkind == RELKIND_PARTITIONED_TABLE);
-			parent_relids = bms_add_member(parent_relids, appinfo->child_relid);
-			parent_roots[appinfo->child_relid] = subroot;
-
+			parent_parses[appinfo->child_relid] = subroot->parse;
 			continue;
 		}

@@ -1383,108 +1545,38 @@ inheritance_planner(PlannerInfo *root)
 		 * is used elsewhere in the plan, so using the original parent RTE
 		 * would give rise to confusing use of multiple aliases in EXPLAIN
 		 * output for what the user will think is the "same" table. OTOH,
-		 * it's not a problem in the partitioned inheritance case, because the
-		 * duplicate child RTE added for the parent does not appear anywhere
-		 * else in the plan tree.
+		 * it's not a problem in the partitioned inheritance case, because
+		 * there is no duplicate RTE for the parent.
 		 */
 		if (nominalRelation < 0)
 			nominalRelation = appinfo->child_relid;

 		/*
-		 * The rowMarks list might contain references to subquery RTEs, so
-		 * make a copy that we can apply ChangeVarNodes to. (Fortunately, the
-		 * executor doesn't need to see the modified copies --- we can just
-		 * pass it the original rowMarks list.)
+		 * As above, each child plan run needs its own append_rel_list and
+		 * rowmarks, which should start out as pristine copies of the
+		 * originals. There can't be any references to UPDATE/DELETE target
+		 * rels in them; but there could be subquery references, which we'll
+		 * fix up in a moment.
 		 */
-		subroot->rowMarks = copyObject(parent_root->rowMarks);
+		subroot->append_rel_list = copyObject(root->append_rel_list);
+		subroot->rowMarks = copyObject(root->rowMarks);

 		/*
-		 * The append_rel_list likewise might contain references to subquery
-		 * RTEs (if any subqueries were flattenable UNION ALLs). So prepare
-		 * to apply ChangeVarNodes to that, too. As explained above, we only
-		 * want to copy items that actually contain such references; the rest
-		 * can just get linked into the subroot's append_rel_list.
-		 *
-		 * If we know there are no such references, we can just use the outer
-		 * append_rel_list unmodified.
-		 */
-		if (modifiableARIindexes != NULL)
-		{
-			ListCell *lc2;
-
-			subroot->append_rel_list = NIL;
-			foreach(lc2, parent_root->append_rel_list)
-			{
-				AppendRelInfo *appinfo2 = lfirst_node(AppendRelInfo, lc2);
-
-				if (bms_is_member(appinfo2->child_relid, modifiableARIindexes))
-					appinfo2 = copyObject(appinfo2);
-
-				subroot->append_rel_list = lappend(subroot->append_rel_list,
-												   appinfo2);
-			}
-		}
-
-		/*
-		 * Add placeholders to the child Query's rangetable list to fill the
-		 * RT indexes already reserved for subqueries in previous children.
-		 * These won't be referenced, so there's no need to make them very
-		 * valid-looking.
-		 */
-		while (list_length(subroot->parse->rtable) < list_length(final_rtable))
-			subroot->parse->rtable = lappend(subroot->parse->rtable,
-											 makeNode(RangeTblEntry));
-
-		/*
-		 * If this isn't the first child Query, generate duplicates of all
-		 * subquery RTEs, and adjust Var numbering to reference the
-		 * duplicates. To simplify the loop logic, we scan the original rtable
-		 * not the copy just made by adjust_appendrel_attrs; that should be OK
-		 * since subquery RTEs couldn't contain any references to the target
-		 * rel.
+		 * If this isn't the first child Query, adjust Vars and jointree
+		 * entries to reference the appropriate set of subquery RTEs.
 		 */
 		if (final_rtable != NIL && subqueryRTindexes != NULL)
 		{
-			ListCell *lr;
+			int oldrti = -1;

-			rti = 1;
-			foreach(lr, parent_parse->rtable)
+			while ((oldrti = bms_next_member(subqueryRTindexes, oldrti)) >= 0)
 			{
-				RangeTblEntry *rte = lfirst_node(RangeTblEntry, lr);
-
-				if (bms_is_member(rti, subqueryRTindexes))
-				{
-					Index newrti;
-
-					/*
-					 * The RTE can't contain any references to its own RT
-					 * index, except in its securityQuals, so we can save a
-					 * few cycles by applying ChangeVarNodes to the rest of
-					 * the rangetable before we append the RTE to it.
-					 */
-					newrti = list_length(subroot->parse->rtable) + 1;
-					ChangeVarNodes((Node *) subroot->parse, rti, newrti, 0);
-					ChangeVarNodes((Node *) subroot->rowMarks, rti, newrti, 0);
-					/* Skip processing unchanging parts of append_rel_list */
-					if (modifiableARIindexes != NULL)
-					{
-						ListCell *lc2;
-
-						foreach(lc2, subroot->append_rel_list)
-						{
-							AppendRelInfo *appinfo2 = lfirst_node(AppendRelInfo, lc2);
+				Index newrti = next_subquery_rti++;

-							if (bms_is_member(appinfo2->child_relid,
-											  modifiableARIindexes))
-								ChangeVarNodes((Node *) appinfo2, rti, newrti, 0);
-						}
-					}
-					rte = copyObject(rte);
-					ChangeVarNodes((Node *) rte->securityQuals, rti, newrti, 0);
-					subroot->parse->rtable = lappend(subroot->parse->rtable,
-													 rte);
-				}
-				rti++;
+				ChangeVarNodes((Node *) subroot->parse, oldrti, newrti, 0);
+				ChangeVarNodes((Node *) subroot->append_rel_list,
+							   oldrti, newrti, 0);
+				ChangeVarNodes((Node *) subroot->rowMarks, oldrti, newrti, 0);
 			}
 		}

@@ -1514,22 +1606,43 @@ inheritance_planner(PlannerInfo *root)

 		/*
 		 * If this is the first non-excluded child, its post-planning rtable
-		 * becomes the initial contents of final_rtable; otherwise, append
-		 * just its modified subquery RTEs to final_rtable.
+		 * becomes the initial contents of final_rtable; otherwise, copy its
+		 * modified subquery RTEs into final_rtable, to ensure we have sane
+		 * copies of those. Also save the first non-excluded child's version
+		 * of the rowmarks list; we assume all children will end up with
+		 * equivalent versions of that.
 		 */
 		if (final_rtable == NIL)
+		{
 			final_rtable = subroot->parse->rtable;
+			final_rowmarks = subroot->rowMarks;
+		}
 		else
-			final_rtable = list_concat(final_rtable,
-									   list_copy_tail(subroot->parse->rtable,
-													  list_length(final_rtable)));
+		{
+			Assert(list_length(final_rtable) ==
+				   list_length(subroot->parse->rtable));
+			if (subqueryRTindexes != NULL)
+			{
+				int oldrti = -1;
+
+				while ((oldrti = bms_next_member(subqueryRTindexes, oldrti)) >= 0)
+				{
+					Index newrti = this_subquery_rti++;
+					RangeTblEntry *subqrte;
+					ListCell *newrticell;
+
+					subqrte = rt_fetch(newrti, subroot->parse->rtable);
+					newrticell = list_nth_cell(final_rtable, newrti - 1);
+					lfirst(newrticell) = subqrte;
+				}
+			}
+		}

 		/*
 		 * We need to collect all the RelOptInfos from all child plans into
 		 * the main PlannerInfo, since setrefs.c will need them. We use the
-		 * last child's simple_rel_array (previous ones are too short), so we
-		 * have to propagate forward the RelOptInfos that were already built
-		 * in previous children.
+		 * last child's simple_rel_array, so we have to propagate forward the
+		 * RelOptInfos that were already built in previous children.
 		 */
 		Assert(subroot->simple_rel_array_size >= save_rel_array_size);
 		for (rti = 1; rti < save_rel_array_size; rti++)
@@ -1543,7 +1656,11 @@ inheritance_planner(PlannerInfo *root)
 		save_rel_array = subroot->simple_rel_array;
 		save_append_rel_array = subroot->append_rel_array;

-		/* Make sure any initplans from this rel get into the outer list */
+		/*
+		 * Make sure any initplans from this rel get into the outer list. Note
+		 * we're effectively assuming all children generate the same
+		 * init_plans.
+		 */
 		root->init_plans = subroot->init_plans;

 		/* Build list of sub-paths */
@@ -1626,6 +1743,9 @@ inheritance_planner(PlannerInfo *root)

 			root->simple_rte_array[rti++] = rte;
 		}
+
+		/* Put back adjusted rowmarks, too */
+		root->rowMarks = final_rowmarks;
 	}

 	/*
@@ -6128,8 +6248,9 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	 * Build a minimal RTE.
 	 *
 	 * Set the target's table to be an inheritance parent. This is a kludge
-	 * that prevents problems within get_relation_info(), which does not
-	 * expect that any IndexOptInfo is currently undergoing REINDEX.
+	 * to prevent get_relation_info() from fetching index information, which
+	 * is needed because it does not expect that any IndexOptInfo is currently
+	 * undergoing REINDEX.
 	 */
 	rte = makeNode(RangeTblEntry);
 	rte->rtekind = RTE_RELATION;
@@ -6993,6 +7114,10 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 			List *child_scanjoin_targets = NIL;
 			ListCell *lc;

+			/* Skip processing pruned partitions. */
+			if (child_rel == NULL)
+				continue;
+
 			/* Translate scan/join targets for this child. */
 			appinfos = find_appinfos_by_relids(root, child_rel->relids,
 											   &nappinfos);
@@ -7093,6 +7218,10 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 		RelOptInfo *child_grouped_rel;
 		RelOptInfo *child_partially_grouped_rel;

+		/* Skip processing pruned partitions. */
+		if (child_input_rel == NULL)
+			continue;
+
 		/* Input child rel must have a path */
 		Assert(child_input_rel->pathlist != NIL);

diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 5392d1a..66e6ad9 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -121,7 +121,9 @@ preprocess_targetlist(PlannerInfo *root)
 	/*
 	 * Add necessary junk columns for rowmarked rels. These values are needed
 	 * for locking of rels selected FOR UPDATE/SHARE, and to do EvalPlanQual
-	 * rechecking. See comments for PlanRowMark in plannodes.h.
+	 * rechecking. See comments for PlanRowMark in plannodes.h. If you
+	 * change this stanza, see also expand_inherited_rtentry(), which has to
+	 * be able to add on junk columns equivalent to these.
 	 */
 	foreach(lc, root->rowMarks)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 1d1e506..31c1bec 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -18,110 +18,72 @@
 #include "access/table.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
+#include "catalog/pg_type.h"
 #include "miscadmin.h"
+#include "nodes/makefuncs.h"
 #include "optimizer/appendinfo.h"
 #include "optimizer/inherit.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/planmain.h"
 #include "optimizer/planner.h"
 #include "optimizer/prep.h"
+#include "optimizer/restrictinfo.h"
+#include "parser/parsetree.h"
 #include "partitioning/partdesc.h"
+#include "partitioning/partprune.h"
 #include "utils/rel.h"


-static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
-						 Index rti);
-static void expand_partitioned_rtentry(PlannerInfo *root,
+static void expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
 						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
-						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos);
+						   PlanRowMark *top_parentrc, LOCKMODE lockmode);
 static void expand_single_inheritance_child(PlannerInfo *root,
 								RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *top_parentrc, Relation childrel,
-								List **appinfos, RangeTblEntry **childrte_p,
+								RangeTblEntry **childrte_p,
 								Index *childRTindex_p);
 static Bitmapset *translate_col_privs(const Bitmapset *parent_privs,
 					List *translated_vars);


 /*
- * expand_inherited_tables
- *		Expand each rangetable entry that represents an inheritance set
- *		into an "append relation". At the conclusion of this process,
- *		the "inh" flag is set in all and only those RTEs that are append
- *		relation parents.
- */
-void
-expand_inherited_tables(PlannerInfo *root)
-{
-	Index nrtes;
-	Index rti;
-	ListCell *rl;
-
-	/*
-	 * expand_inherited_rtentry may add RTEs to parse->rtable. The function is
-	 * expected to recursively handle any RTEs that it creates with inh=true.
-	 * So just scan as far as the original end of the rtable list.
-	 */
-	nrtes = list_length(root->parse->rtable);
-	rl = list_head(root->parse->rtable);
-	for (rti = 1; rti <= nrtes; rti++)
-	{
-		RangeTblEntry *rte = (RangeTblEntry *) lfirst(rl);
-
-		expand_inherited_rtentry(root, rte, rti);
-		rl = lnext(rl);
-	}
-}
-
-/*
  * expand_inherited_rtentry
- *		Check whether a rangetable entry represents an inheritance set.
- *		If so, add entries for all the child tables to the query's
- *		rangetable, and build AppendRelInfo nodes for all the child tables
- *		and add them to root->append_rel_list. If not, clear the entry's
- *		"inh" flag to prevent later code from looking for AppendRelInfos.
+ *		The given rangetable entry represents an inheritance set.
+ *		Add entries for all the child tables to the query's rangetable,
+ *		and build additional planner data structures for them, including
+ *		RelOptInfos, AppendRelInfos, and possibly PlanRowMarks.
  *
- * Note that the original RTE is considered to represent the whole
- * inheritance set. The first of the generated RTEs is an RTE for the same
- * table, but with inh = false, to represent the parent table in its role
- * as a simple member of the inheritance set.
- *
- * A childless table is never considered to be an inheritance set. For
- * regular inheritance, a parent RTE must always have at least two associated
- * AppendRelInfos: one corresponding to the parent table as a simple member of
- * the inheritance set and one or more corresponding to the actual children.
- * (But a partitioned table might have only one associated AppendRelInfo,
- * since it's not itself scanned and hence doesn't need a second RTE to
- * represent itself as a member of the set.)
+ * Note that the original RTE is considered to represent the whole inheritance
+ * set. In the case of traditional inheritance, the first of the generated
+ * RTEs is an RTE for the same table, but with inh = false, to represent the
+ * parent table in its role as a simple member of the inheritance set. For
+ * partitioning, we don't need a second RTE because the partitioned table
+ * itself has no data and need not be scanned.
  */
-static void
-expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
+void
+expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
+						 RangeTblEntry *rte, Index rti)
 {
 	Oid parentOID;
-	PlanRowMark *oldrc;
 	Relation oldrelation;
 	LOCKMODE lockmode;
-	List *inhOIDs;
-	ListCell *l;
+	PlanRowMark *oldrc;
+	bool old_isParent = false;
+	int old_allMarkTypes = 0;
+
+	/* Should only come here for plain relations with inh bit set */
+	Assert(rte->inh);
+	Assert(rte->rtekind == RTE_RELATION);

-	/* Does RT entry allow inheritance? */
-	if (!rte->inh)
-		return;
-	/* Ignore any already-expanded UNION ALL nodes */
-	if (rte->rtekind != RTE_RELATION)
-	{
-		Assert(rte->rtekind == RTE_SUBQUERY);
-		return;
-	}
-	/* Fast path for common case of childless table */
 	parentOID = rte->relid;
-	if (!has_subclass(parentOID))
-	{
-		/* Clear flag before returning */
-		rte->inh = false;
-		return;
-	}
+
+	/*
+	 * We used to check has_subclass() here, but there's no longer any need
+	 * to, because subquery_planner already did.
+	 */

 	/*
 	 * The rewriter should already have obtained an appropriate lock on each
@@ -141,7 +103,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 	 */
 	oldrc = get_plan_rowmark(root->rowMarks, rti);
 	if (oldrc)
+	{
+		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
+		/* Save initial value of allMarkTypes before children add to it */
+		old_allMarkTypes = oldrc->allMarkTypes;
+	}

 	/* Scan the inheritance set and expand it */
 	if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
@@ -151,17 +118,12 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		 */
 		Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);

-		if (root->glob->partition_directory == NULL)
-			root->glob->partition_directory =
-				CreatePartitionDirectory(CurrentMemoryContext);
-
 		/*
-		 * If this table has partitions, recursively expand and lock them.
-		 * While at it, also extract the partition key columns of all the
-		 * partitioned tables.
+		 * Recursively expand and lock the partitions. While at it, also
+		 * extract the partition key columns of all the partitioned tables.
 		 */
-		expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc,
-								   lockmode, &root->append_rel_list);
+		expand_partitioned_rtentry(root, rel, rte, rti,
+								   oldrelation, oldrc, lockmode);
 	}
 	else
 	{
@@ -170,25 +132,25 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		 * that partitioned tables are not allowed to have inheritance
 		 * children, so it's not possible for both cases to apply.)
 		 */
-		List *appinfos = NIL;
-		RangeTblEntry *childrte;
-		Index childRTindex;
+		List *inhOIDs;
+		ListCell *l;

 		/* Scan for all members of inheritance set, acquire needed locks */
 		inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

 		/*
-		 * Check that there's at least one descendant, else treat as no-child
-		 * case. This could happen despite above has_subclass() check, if the
-		 * table once had a child but no longer does.
+		 * We used to special-case the situation where the table no longer has
+		 * any children, by clearing rte->inh and exiting. That no longer
+		 * works, because this function doesn't get run until after decisions
+		 * have been made that depend on rte->inh. We have to treat such
+		 * situations as normal inheritance. The table itself should always
+		 * have been found, though.
 		 */
-		if (list_length(inhOIDs) < 2)
-		{
-			/* Clear flag before returning */
-			rte->inh = false;
-			heap_close(oldrelation, NoLock);
-			return;
-		}
+		Assert(inhOIDs != NIL);
+		Assert(linitial_oid(inhOIDs) == parentOID);
+
+		/* Expand simple_rel_array and friends to hold child objects. */
+		expand_planner_arrays(root, list_length(inhOIDs));

 		/*
 		 * Expand inheritance children in the order the OIDs were returned by
@@ -198,6 +160,8 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 		{
 			Oid childOID = lfirst_oid(l);
 			Relation newrelation;
+			RangeTblEntry *childrte;
+			Index childRTindex;

 			/* Open rel if needed; we already have required locks */
 			if (childOID != parentOID)
@@ -217,29 +181,78 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 				continue;
 			}

-			expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc,
-											newrelation,
-											&appinfos, &childrte,
-											&childRTindex);
+			/* Create RTE and AppendRelInfo, plus PlanRowMark if needed. */
+			expand_single_inheritance_child(root, rte, rti, oldrelation,
+											oldrc, newrelation,
+											&childrte, &childRTindex);
+
+			/* Create the otherrel RelOptInfo too. */
+			(void) build_simple_rel(root, childRTindex, rel);

 			/* Close child relations, but keep locks */
 			if (childOID != parentOID)
 				table_close(newrelation, NoLock);
 		}
+	}
+
+	/*
+	 * Some children might require different mark types, which would've been
+	 * reported into oldrc. If so, add relevant entries to the top-level
+	 * targetlist and update parent rel's reltarget. This should match what
+	 * preprocess_targetlist() would have added if the mark types had been
+	 * requested originally.
+	 */
+	if (oldrc)
+	{
+		int new_allMarkTypes = oldrc->allMarkTypes;
+		Var *var;
+		TargetEntry *tle;
+		char resname[32];
+		List *newvars = NIL;
+
+		/* The old PlanRowMark should already have necessitated adding TID */
+		Assert(old_allMarkTypes & ~(1 << ROW_MARK_COPY));
+
+		/* Add whole-row junk Var if needed, unless we had it already */
+		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
+			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		{
+			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
+								  oldrc->rti,
+								  0,
+								  false);
+			snprintf(resname, sizeof(resname), "wholerow%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
+		/* Add tableoid junk Var, unless we had it already */
+		if (!old_isParent)
+		{
+			var = makeVar(oldrc->rti,
+						  TableOidAttributeNumber,
+						  OIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "tableoid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}

 		/*
-		 * If all the children were temp tables, pretend it's a
-		 * non-inheritance situation; we don't need Append node in that case.
-		 * The duplicate RTE we added for the parent table is harmless, so we
-		 * don't bother to get rid of it; ditto for the useless PlanRowMark
-		 * node.
+		 * Add the newly added Vars to parent's reltarget. We needn't worry
+		 * about the childrens' reltargets, they'll be made later.
 		 */
-		if (list_length(appinfos) < 2)
-			rte->inh = false;
-		else
-			root->append_rel_list = list_concat(root->append_rel_list,
-												appinfos);
-
+		add_vars_to_targetlist(root, newvars, bms_make_singleton(0), false);
 	}

 	table_close(oldrelation, NoLock);
@@ -250,25 +263,36 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
  *		Recursively expand an RTE for a partitioned table.
  */
 static void
-expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
+expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
+						   RangeTblEntry *parentrte,
 						   Index parentRTindex, Relation parentrel,
-						   PlanRowMark *top_parentrc, LOCKMODE lockmode,
-						   List **appinfos)
+						   PlanRowMark *top_parentrc, LOCKMODE lockmode)
 {
-	int i;
-	RangeTblEntry *childrte;
-	Index childRTindex;
 	PartitionDesc partdesc;
+	Bitmapset *live_parts;
+	int num_live_parts;
+	int i;
+
+	check_stack_depth();
+
+	Assert(parentrte->inh);

 	partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
 										parentrel);

-	check_stack_depth();
-
 	/* A partitioned table should always have a partition descriptor. */
 	Assert(partdesc);

-	Assert(parentrte->inh);
+	/*
+	 * If the partitioned table has no partitions, treat this as the
+	 * non-inheritance case.
+	 */
+	if (partdesc->nparts == 0)
+	{
+		/* XXX wrong? */
+		parentrte->inh = false;
+		return;
+	}

 	/*
 	 * Note down whether any partition key cols are being updated. Though it's
@@ -282,24 +306,40 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 			has_partition_attrs(parentrel, parentrte->updatedCols, NULL);

 	/*
-	 * If the partitioned table has no partitions, treat this as the
-	 * non-inheritance case.
+	 * Perform partition pruning using restriction clauses assigned to parent
+	 * relation. live_parts will contain PartitionDesc indexes of partitions
+	 * that survive pruning. Below, we will initialize child objects for the
+	 * surviving partitions.
 	 */
-	if (partdesc->nparts == 0)
-	{
-		parentrte->inh = false;
-		return;
-	}
+	live_parts = prune_append_rel_partitions(relinfo);
+
+	/* Expand simple_rel_array and friends to hold child objects. */
+	num_live_parts = bms_num_members(live_parts);
+	if (num_live_parts > 0)
+		expand_planner_arrays(root, num_live_parts);

 	/*
-	 * Create a child RTE for each partition. Note that unlike traditional
-	 * inheritance, we don't need a child RTE for the partitioned table
-	 * itself, because it's not going to be scanned.
+	 * We also store partition RelOptInfo pointers in the parent relation.
+	 * Since we're palloc0'ing, slots corresponding to pruned partitions will
+	 * contain NULL.
 	 */
-	for (i = 0; i < partdesc->nparts; i++)
+	Assert(relinfo->part_rels == NULL);
+	relinfo->part_rels = (RelOptInfo **)
+		palloc0(relinfo->nparts * sizeof(RelOptInfo *));
+
+	/*
+	 * Create a child RTE for each live partition. Note that unlike
+	 * traditional inheritance, we don't need a child RTE for the partitioned
+	 * table itself, because it's not going to be scanned.
+	 */
+	i = -1;
+	while ((i = bms_next_member(live_parts, i)) >= 0)
 	{
 		Oid childOID = partdesc->oids[i];
 		Relation childrel;
+		RangeTblEntry *childrte;
+		Index childRTindex;
+		RelOptInfo *childrelinfo;

 		/* Open rel, acquiring required locks */
 		childrel = table_open(childOID, lockmode);
@@ -312,15 +352,20 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
 		if (RELATION_IS_OTHER_TEMP(childrel))
 			elog(ERROR, "temporary relation from another session found as partition");

+		/* Create RTE and AppendRelInfo, plus PlanRowMark if needed. */
 		expand_single_inheritance_child(root, parentrte, parentRTindex,
 										parentrel, top_parentrc, childrel,
-										appinfos, &childrte, &childRTindex);
+										&childrte, &childRTindex);
+
+		/* Create the otherrel RelOptInfo too. */
+		childrelinfo = build_simple_rel(root, childRTindex, relinfo);
+		relinfo->part_rels[i] = childrelinfo;

 		/* If this child is itself partitioned, recurse */
 		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			expand_partitioned_rtentry(root, childrte, childRTindex,
-									   childrel, top_parentrc, lockmode,
-									   appinfos);
+			expand_partitioned_rtentry(root, childrelinfo,
+									   childrte, childRTindex,
+									   childrel, top_parentrc, lockmode);

 		/* Close child relation, but keep locks */
 		table_close(childrel, NoLock);
@@ -351,7 +396,7 @@ static void
 expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 								Index parentRTindex, Relation parentrel,
 								PlanRowMark *top_parentrc, Relation childrel,
-								List **appinfos, RangeTblEntry **childrte_p,
+								RangeTblEntry **childrte_p,
 								Index *childRTindex_p)
 {
 	Query *parse = root->parse;
@@ -363,8 +408,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,

 	/*
 	 * Build an RTE for the child, and attach to query's rangetable list. We
-	 * copy most fields of the parent's RTE, but replace relation OID and
-	 * relkind, and set inh = false. Also, set requiredPerms to zero since
+	 * copy most fields of the parent's RTE, but replace relation OID,
+	 * relkind, and inh for the child. Also, set requiredPerms to zero since
 	 * all required permissions checks are done on the original RTE. Likewise,
 	 * set the child's securityQuals to empty, because we only want to apply
 	 * the parent's RLS conditions regardless of what RLS properties
@@ -396,7 +441,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	 */
 	appinfo = make_append_rel_info(parentrel, childrel,
 								   parentRTindex, childRTindex);
-	*appinfos = lappend(*appinfos, appinfo);
+	root->append_rel_list = lappend(root->append_rel_list, appinfo);

 	/*
 	 * Translate the column permissions bitmaps to the child's attnums (we
@@ -418,6 +463,16 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	}

 	/*
+	 * Store the RTE and appinfo in the respective PlannerInfo arrays, which
+	 * the caller must already have allocated space for.
+	 */
+	Assert(childRTindex < root->simple_rel_array_size);
+	Assert(root->simple_rte_array[childRTindex] == NULL);
+	root->simple_rte_array[childRTindex] = childrte;
+	Assert(root->append_rel_array[childRTindex] == NULL);
+	root->append_rel_array[childRTindex] = appinfo;
+
+	/*
 	 * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE.
 	 */
 	if (top_parentrc)
@@ -437,7 +492,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		/*
 		 * We mark RowMarks for partitioned child tables as parent RowMarks so
 		 * that the executor ignores them (except their existence means that
-		 * the child tables be locked using appropriate mode).
+		 * the child tables will be locked using the appropriate mode).
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);

@@ -499,3 +554,129 @@ translate_col_privs(const Bitmapset *parent_privs,

 	return child_privs;
 }
+
+
+/*
+ * apply_child_basequals
+ *		Populate childrel's base restriction quals from parent rel's quals,
+ *		translating them using appinfo.
+ *
+ * If any of the resulting clauses evaluate to constant false or NULL, we
+ * return false and don't apply any quals. Caller should mark the relation as
+ * a dummy rel in this case, since it doesn't need to be scanned.
+ */ +bool +apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel, + RelOptInfo *childrel, RangeTblEntry *childRTE, + AppendRelInfo *appinfo) +{ + List *childquals; + Index cq_min_security; + ListCell *lc; + + /* + * The child rel's targetlist might contain non-Var expressions, which + * means that substitution into the quals could produce opportunities for + * const-simplification, and perhaps even pseudoconstant quals. Therefore, + * transform each RestrictInfo separately to see if it reduces to a + * constant or pseudoconstant. (We must process them separately to keep + * track of the security level of each qual.) + */ + childquals = NIL; + cq_min_security = UINT_MAX; + foreach(lc, parentrel->baserestrictinfo) + { + RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc); + Node *childqual; + ListCell *lc2; + + Assert(IsA(rinfo, RestrictInfo)); + childqual = adjust_appendrel_attrs(root, + (Node *) rinfo->clause, + 1, &appinfo); + childqual = eval_const_expressions(root, childqual); + /* check for flat-out constant */ + if (childqual && IsA(childqual, Const)) + { + if (((Const *) childqual)->constisnull || + !DatumGetBool(((Const *) childqual)->constvalue)) + { + /* Restriction reduces to constant FALSE or NULL */ + return false; + } + /* Restriction reduces to constant TRUE, so drop it */ + continue; + } + /* might have gotten an AND clause, if so flatten it */ + foreach(lc2, make_ands_implicit((Expr *) childqual)) + { + Node *onecq = (Node *) lfirst(lc2); + bool pseudoconstant; + + /* check for pseudoconstant (no Vars or volatile functions) */ + pseudoconstant = + !contain_vars_of_level(onecq, 0) && + !contain_volatile_functions(onecq); + if (pseudoconstant) + { + /* tell createplan.c to check for gating quals */ + root->hasPseudoConstantQuals = true; + } + /* reconstitute RestrictInfo with appropriate properties */ + childquals = lappend(childquals, + make_restrictinfo((Expr *) onecq, + rinfo->is_pushed_down, + rinfo->outerjoin_delayed, + pseudoconstant, + 
rinfo->security_level, + NULL, NULL, NULL)); + /* track minimum security level among child quals */ + cq_min_security = Min(cq_min_security, rinfo->security_level); + } + } + + /* + * In addition to the quals inherited from the parent, we might have + * securityQuals associated with this particular child node. (Currently + * this can only happen in appendrels originating from UNION ALL; + * inheritance child tables don't have their own securityQuals, see + * expand_inherited_rtentry().) Pull any such securityQuals up into the + * baserestrictinfo for the child. This is similar to + * process_security_barrier_quals() for the parent rel, except that we + * can't make any general deductions from such quals, since they don't + * hold for the whole appendrel. + */ + if (childRTE->securityQuals) + { + Index security_level = 0; + + foreach(lc, childRTE->securityQuals) + { + List *qualset = (List *) lfirst(lc); + ListCell *lc2; + + foreach(lc2, qualset) + { + Expr *qual = (Expr *) lfirst(lc2); + + /* not likely that we'd see constants here, so no check */ + childquals = lappend(childquals, + make_restrictinfo(qual, + true, false, false, + security_level, + NULL, NULL, NULL)); + cq_min_security = Min(cq_min_security, security_level); + } + security_level++; + } + Assert(security_level <= root->qual_security_level); + } + + /* + * OK, we've got all the baserestrictinfo quals for this child. 
+ */ + childrel->baserestrictinfo = childquals; + childrel->baserestrict_min_security = cq_min_security; + + return true; +} diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 702c4f8..89b281f 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -2082,7 +2082,10 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, { PartitionDesc partdesc; - Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); + /* Create the PartitionDirectory infrastructure if we didn't already */ + if (root->glob->partition_directory == NULL) + root->glob->partition_directory = + CreatePartitionDirectory(CurrentMemoryContext); partdesc = PartitionDirectoryLookup(root->glob->partition_directory, relation); diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index 0d40b8d..f0f1811 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -20,11 +20,11 @@ #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" +#include "optimizer/inherit.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" #include "optimizer/placeholder.h" #include "optimizer/plancat.h" -#include "optimizer/prep.h" #include "optimizer/restrictinfo.h" #include "optimizer/tlist.h" #include "partitioning/partbounds.h" @@ -132,6 +132,49 @@ setup_append_rel_array(PlannerInfo *root) } /* + * expand_planner_arrays + * Expand the PlannerInfo's per-RTE arrays by add_size members + * and initialize the newly added entries to NULLs + */ +void +expand_planner_arrays(PlannerInfo *root, int add_size) +{ + int new_size; + + Assert(add_size > 0); + + new_size = root->simple_rel_array_size + add_size; + + root->simple_rte_array = (RangeTblEntry **) + repalloc(root->simple_rte_array, + sizeof(RangeTblEntry *) * new_size); + MemSet(root->simple_rte_array + root->simple_rel_array_size, + 0, sizeof(RangeTblEntry *) * 
add_size); + + root->simple_rel_array = (RelOptInfo **) + repalloc(root->simple_rel_array, + sizeof(RelOptInfo *) * new_size); + MemSet(root->simple_rel_array + root->simple_rel_array_size, + 0, sizeof(RelOptInfo *) * add_size); + + if (root->append_rel_array) + { + root->append_rel_array = (AppendRelInfo **) + repalloc(root->append_rel_array, + sizeof(AppendRelInfo *) * new_size); + MemSet(root->append_rel_array + root->simple_rel_array_size, + 0, sizeof(AppendRelInfo *) * add_size); + } + else + { + root->append_rel_array = (AppendRelInfo **) + palloc0(sizeof(AppendRelInfo *) * new_size); + } + + root->simple_rel_array_size = new_size; +} + +/* * build_simple_rel * Construct a new RelOptInfo for a base relation or 'other' relation. */ @@ -281,49 +324,60 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent) break; } - /* Save the finished struct in the query's simple_rel_array */ - root->simple_rel_array[relid] = rel; - /* * This is a convenient spot at which to note whether rels participating * in the query have any securityQuals attached. If so, increase * root->qual_security_level to ensure it's larger than the maximum - * security level needed for securityQuals. + * security level needed for securityQuals. (Must do this before we call + * apply_child_basequals, else we'll hit an Assert therein.) */ if (rte->securityQuals) root->qual_security_level = Max(root->qual_security_level, list_length(rte->securityQuals)); + /* + * Copy the parent's quals to the child, with appropriate substitution of + * variables. If any constant false or NULL clauses turn up, we can mark + * the child as dummy right away. (We must do this immediately so that + * pruning works correctly when recursing in expand_partitioned_rtentry.) 
+ */ + if (parent) + { + AppendRelInfo *appinfo = root->append_rel_array[relid]; + + Assert(appinfo != NULL); + if (!apply_child_basequals(root, parent, rel, rte, appinfo)) + { + /* + * Some restriction clause reduced to constant FALSE or NULL after + * substitution, so this child need not be scanned. + */ + mark_dummy_rel(rel); + } + } + + /* Save the finished struct in the query's simple_rel_array */ + root->simple_rel_array[relid] = rel; + return rel; } /* - * add_appendrel_other_rels + * expand_appendrel_subquery * Add "other rel" RelOptInfos for the children of an appendrel baserel * - * "rel" is a relation that (still) has the rte->inh flag set, meaning it - * has appendrel children listed in root->append_rel_list. We need to build + * "rel" is a subquery relation that has the rte->inh flag set, meaning it + * is a UNION ALL subquery that's been flattened into an appendrel, with + * child subqueries listed in root->append_rel_list. We need to build * a RelOptInfo for each child relation so that we can plan scans on them. - * (The parent relation might be a partitioned table, a table with - * traditional inheritance children, or a flattened UNION ALL subquery.) */ void -add_appendrel_other_rels(PlannerInfo *root, RelOptInfo *rel, Index rti) +expand_appendrel_subquery(PlannerInfo *root, RelOptInfo *rel, + RangeTblEntry *rte, Index rti) { - int cnt_parts = 0; ListCell *l; - /* - * If rel is a partitioned table, then we also need to build a part_rels - * array so that the child RelOptInfos can be conveniently accessed from - * the parent. 
- */ - if (rel->part_scheme != NULL) - { - Assert(rel->nparts > 0); - rel->part_rels = (RelOptInfo **) - palloc0(sizeof(RelOptInfo *) * rel->nparts); - } + Assert(rte->rtekind == RTE_SUBQUERY); foreach(l, root->append_rel_list) { @@ -341,33 +395,18 @@ add_appendrel_other_rels(PlannerInfo *root, RelOptInfo *rel, Index rti) childrte = root->simple_rte_array[childRTindex]; Assert(childrte != NULL); - /* build child RelOptInfo, and add to main query data structures */ + /* Build the child RelOptInfo. */ childrel = build_simple_rel(root, childRTindex, rel); - /* - * If rel is a partitioned table, fill in the part_rels array. The - * order in which child tables appear in append_rel_list is the same - * as the order in which they appear in the parent's PartitionDesc, so - * assigning partitions like this works. - */ - if (rel->part_scheme != NULL) - { - Assert(cnt_parts < rel->nparts); - rel->part_rels[cnt_parts++] = childrel; - } - - /* Child may itself be an inherited relation. */ + /* Child may itself be an inherited rel, either table or subquery. */ if (childrte->inh) { - /* Only relation and subquery RTEs can have children. 
*/ - Assert(childrte->rtekind == RTE_RELATION || - childrte->rtekind == RTE_SUBQUERY); - add_appendrel_other_rels(root, childrel, childRTindex); + if (childrte->rtekind == RTE_RELATION) + expand_inherited_rtentry(root, childrel, childrte, childRTindex); + else + expand_appendrel_subquery(root, childrel, childrte, childRTindex); } } - - /* We should have filled all of the part_rels array if it's partitioned */ - Assert(cnt_parts == rel->nparts); } /* diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index af3f911..aecea82 100644 --- a/src/backend/partitioning/partprune.c +++ b/src/backend/partitioning/partprune.c @@ -45,6 +45,7 @@ #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" +#include "optimizer/cost.h" #include "optimizer/optimizer.h" #include "optimizer/pathnode.h" #include "parser/parsetree.h" @@ -474,18 +475,24 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel, * is, not pruned already). */ subplan_map = (int *) palloc(nparts * sizeof(int)); + memset(subplan_map, -1, nparts * sizeof(int)); subpart_map = (int *) palloc(nparts * sizeof(int)); - relid_map = (Oid *) palloc(nparts * sizeof(Oid)); + memset(subpart_map, -1, nparts * sizeof(Oid)); + relid_map = (Oid *) palloc0(nparts * sizeof(Oid)); present_parts = NULL; for (i = 0; i < nparts; i++) { RelOptInfo *partrel = subpart->part_rels[i]; - int subplanidx = relid_subplan_map[partrel->relid] - 1; - int subpartidx = relid_subpart_map[partrel->relid] - 1; + int subplanidx; + int subpartidx; - subplan_map[i] = subplanidx; - subpart_map[i] = subpartidx; + /* Skip processing pruned partitions. 
*/ + if (partrel == NULL) + continue; + + subplan_map[i] = subplanidx = relid_subplan_map[partrel->relid] - 1; + subpart_map[i] = subpartidx = relid_subpart_map[partrel->relid] - 1; relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid; if (subplanidx >= 0) { @@ -567,23 +574,20 @@ gen_partprune_steps(RelOptInfo *rel, List *clauses, bool *contradictory) /* * prune_append_rel_partitions - * Returns RT indexes of the minimum set of child partitions which must - * be scanned to satisfy rel's baserestrictinfo quals. + * Returns indexes into rel->part_rels of the minimum set of child + * partitions which must be scanned to satisfy rel's baserestrictinfo + * quals. * * Callers must ensure that 'rel' is a partitioned table. */ -Relids +Bitmapset * prune_append_rel_partitions(RelOptInfo *rel) { - Relids result; List *clauses = rel->baserestrictinfo; List *pruning_steps; bool contradictory; PartitionPruneContext context; - Bitmapset *partindexes; - int i; - Assert(clauses != NIL); Assert(rel->part_scheme != NULL); /* If there are no partitions, return the empty set */ @@ -591,6 +595,13 @@ prune_append_rel_partitions(RelOptInfo *rel) return NULL; /* + * If pruning is disabled or if there are no clauses to prune with, return + * all partitions. + */ + if (!enable_partition_pruning || clauses == NIL) + return bms_add_range(NULL, 0, rel->nparts - 1); + + /* * Process clauses. If the clauses are found to be contradictory, we can * return the empty set. */ @@ -617,15 +628,7 @@ prune_append_rel_partitions(RelOptInfo *rel) context.evalexecparams = false; /* Actual pruning happens here. */ - partindexes = get_matching_partitions(&context, pruning_steps); - - /* Add selected partitions' RT indexes to result. 
*/ - i = -1; - result = NULL; - while ((i = bms_next_member(partindexes, i)) >= 0) - result = bms_add_member(result, rel->part_rels[i]->relid); - - return result; + return get_matching_partitions(&context, pruning_steps); } /* diff --git a/src/include/optimizer/inherit.h b/src/include/optimizer/inherit.h index d2418f1..02a23e5 100644 --- a/src/include/optimizer/inherit.h +++ b/src/include/optimizer/inherit.h @@ -17,6 +17,11 @@ #include "nodes/pathnodes.h" -extern void expand_inherited_tables(PlannerInfo *root); +extern void expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel, + RangeTblEntry *rte, Index rti); + +extern bool apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel, + RelOptInfo *childrel, RangeTblEntry *childRTE, + AppendRelInfo *appinfo); #endif /* INHERIT_H */ diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h index 9e79e1c..21d0e67 100644 --- a/src/include/optimizer/pathnode.h +++ b/src/include/optimizer/pathnode.h @@ -277,10 +277,11 @@ extern Path *reparameterize_path_by_child(PlannerInfo *root, Path *path, */ extern void setup_simple_rel_arrays(PlannerInfo *root); extern void setup_append_rel_array(PlannerInfo *root); +extern void expand_planner_arrays(PlannerInfo *root, int add_size); extern RelOptInfo *build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent); -extern void add_appendrel_other_rels(PlannerInfo *root, RelOptInfo *rel, - Index rti); +extern void expand_appendrel_subquery(PlannerInfo *root, RelOptInfo *rel, + RangeTblEntry *rte, Index rti); extern RelOptInfo *find_base_rel(PlannerInfo *root, int relid); extern RelOptInfo *find_join_rel(PlannerInfo *root, Relids relids); extern RelOptInfo *build_join_rel(PlannerInfo *root, diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out index 6bc1068..1450cef 100644 --- a/src/test/regress/expected/partition_aggregate.out +++ b/src/test/regress/expected/partition_aggregate.out @@ 
-144,7 +144,7 @@ SELECT c, sum(a) FROM pagg_tab WHERE 1 = 2 GROUP BY c; QUERY PLAN -------------------------------- HashAggregate - Group Key: pagg_tab.c + Group Key: c -> Result One-Time Filter: false (4 rows) @@ -159,7 +159,7 @@ SELECT c, sum(a) FROM pagg_tab WHERE c = 'x' GROUP BY c; QUERY PLAN -------------------------------- GroupAggregate - Group Key: pagg_tab.c + Group Key: c -> Result One-Time Filter: false (4 rows) diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out index 50ca03b..7806ba1 100644 --- a/src/test/regress/expected/partition_prune.out +++ b/src/test/regress/expected/partition_prune.out @@ -2568,6 +2568,60 @@ table ab; 1 | 3 (1 row) +-- Test UPDATE where source relation has run-time pruning enabled +truncate ab; +insert into ab values (1, 1), (1, 2), (1, 3), (2, 1); +explain (analyze, costs off, summary off, timing off) +update ab_a1 set b = 3 from ab_a2 where ab_a2.b = (select 1); + QUERY PLAN +---------------------------------------------------------------------- + Update on ab_a1 (actual rows=0 loops=1) + Update on ab_a1_b1 + Update on ab_a1_b2 + Update on ab_a1_b3 + InitPlan 1 (returns $0) + -> Result (actual rows=1 loops=1) + -> Nested Loop (actual rows=1 loops=1) + -> Seq Scan on ab_a1_b1 (actual rows=1 loops=1) + -> Materialize (actual rows=1 loops=1) + -> Append (actual rows=1 loops=1) + -> Seq Scan on ab_a2_b1 (actual rows=1 loops=1) + Filter: (b = $0) + -> Seq Scan on ab_a2_b2 (never executed) + Filter: (b = $0) + -> Seq Scan on ab_a2_b3 (never executed) + Filter: (b = $0) + -> Nested Loop (actual rows=1 loops=1) + -> Seq Scan on ab_a1_b2 (actual rows=1 loops=1) + -> Materialize (actual rows=1 loops=1) + -> Append (actual rows=1 loops=1) + -> Seq Scan on ab_a2_b1 (actual rows=1 loops=1) + Filter: (b = $0) + -> Seq Scan on ab_a2_b2 (never executed) + Filter: (b = $0) + -> Seq Scan on ab_a2_b3 (never executed) + Filter: (b = $0) + -> Nested Loop (actual rows=1 loops=1) + -> Seq Scan 
on ab_a1_b3 (actual rows=1 loops=1) + -> Materialize (actual rows=1 loops=1) + -> Append (actual rows=1 loops=1) + -> Seq Scan on ab_a2_b1 (actual rows=1 loops=1) + Filter: (b = $0) + -> Seq Scan on ab_a2_b2 (never executed) + Filter: (b = $0) + -> Seq Scan on ab_a2_b3 (never executed) + Filter: (b = $0) +(36 rows) + +select tableoid::regclass, * from ab; + tableoid | a | b +----------+---+--- + ab_a1_b3 | 1 | 3 + ab_a1_b3 | 1 | 3 + ab_a1_b3 | 1 | 3 + ab_a2_b1 | 2 | 1 +(4 rows) + drop table ab, lprt_a; -- Join create table tbl1(col1 int); diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql index a5514c7..2e4d2b4 100644 --- a/src/test/regress/sql/partition_prune.sql +++ b/src/test/regress/sql/partition_prune.sql @@ -588,6 +588,13 @@ explain (analyze, costs off, summary off, timing off) update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a; table ab; +-- Test UPDATE where source relation has run-time pruning enabled +truncate ab; +insert into ab values (1, 1), (1, 2), (1, 3), (2, 1); +explain (analyze, costs off, summary off, timing off) +update ab_a1 set b = 3 from ab_a2 where ab_a2.b = (select 1); +select tableoid::regclass, * from ab; + drop table ab, lprt_a; -- Join
Thanks a lot for hacking on the patch.  I'm really happy with the direction
you took for inheritance_planner, as it allows UPDATE/DELETE to use
partition pruning.

On 2019/03/29 7:38, Tom Lane wrote:
> I've been hacking on this pretty hard over the last couple days,
> because I really didn't like the contortions you'd made to allow
> inheritance_planner to call expand_inherited_rtentry in a completely
> different context than the regular code path did.  I eventually got
> rid of that

Good riddance.

> by having inheritance_planner run one cycle of planning
> the query as if it were a SELECT, and extracting the list of unpruned
> children from that.

This is somewhat like my earlier patch that we decided not to pursue,
minus all the hackery within query_planner() that was in that patch, which
is great.

(I can't find the link, but David Rowley had posted a patch for allowing
UPDATE/DELETE to use partition pruning in the late stages of PG 11
development, which had taken a similar approach.)

> I had to rearrange the generation of the final
> rtable a bit to make that work, but I think that inheritance_planner
> winds up somewhat cleaner and safer this way; it's making (slightly)
> fewer assumptions about how much the results of planning the child
> queries can vary.
>
> Perhaps somebody will object that that's one more planning pass than
> we had before, but I'm not very concerned, because
> (a) at least for partitioned tables that we can prune successfully,
> this should still be better than v11, since we avoid the planning
> passes for pruned children.

Certainly.  Note that previously we'd always scan *all* hash partitions
for UPDATE and DELETE queries, because constraint exclusion can't exclude
hash partitions due to the shape of their partition constraint.

I ran my usual benchmark with up to 8192 partitions.
N: 2..8192

create table rt (a int, b int, c int) partition by range (a);
select 'create table rt' || x::text || ' partition of rt for values from ('
|| (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, N) x;
\gexec

update.sql:

\set param random(1, N)
update rt set a = 0 where a = :param;

pgbench -n -T 120 -f select.sql

nparts    v38   HEAD
======   ====   ====
2        2971   2969
8        2980   1949
32       2955    733
128      2946    145
512      2924     11
1024     2986      3
4096     2702      0
8192     2531    OOM

Obviously, you'll get similar numbers with hash or list partitioning.

> (b) inheritance_planner is horridly inefficient anyway, in that it
> has to run a near-duplicate planning pass for each child table.
> If we're concerned about its cost, we should be working to get rid of
> the function altogether, as per [1].  In the meantime, I do not want
> to contort other code to make life easier for inheritance_planner.

Agreed.

> There's still some loose ends:
>
> 1. I don't like 0003 much, and omitted it from the attached.
> I think that what we ought to be doing instead is not having holes
> in the rel_parts[] arrays to begin with, ie they should only include
> the unpruned partitions.  If we are actually encoding any important
> information in those array positions, I suspect that is broken anyway
> in view of 898e5e329: we can't assume that the association of child
> rels with particular PartitionDesc slots will hold still from planning
> to execution.

It's useful for the part_rels array to be indexed in the same way as
PartitionDesc.  First, the partition pruning code returns the
PartitionDesc-defined indexes of unpruned partitions.  Second, once the
partitionwise join code decides that two partitioned tables are compatible
for partitionwise joining, it must join partitions that have identical
*PartitionDesc* indexes, which it does by walking the part_rels arrays of
both sides in one loop.
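To make the indexing argument concrete, here is a minimal sketch, not code
from the patch.  DemoRel and count_joinable_pairs are hypothetical names, and
the sketch assumes the inner-join case only, where a pair with a pruned
(NULL) slot on either side can simply be skipped.  Slot i in both arrays
stands for the partition at bound-order position i, which is what lets one
loop pair up matching partitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a planner RelOptInfo; not the PostgreSQL struct. */
typedef struct DemoRel
{
    int relid;              /* range-table index of the partition */
} DemoRel;

/*
 * Walk two part_rels-style arrays in lockstep.  Slot i of each array
 * corresponds to partition position i of its parent; a NULL slot means the
 * partition was pruned.  Returns the number of partition pairs that would
 * be joined, assuming an inner join (an assumption of this sketch).
 */
int
count_joinable_pairs(DemoRel *const *left, DemoRel *const *right, int nparts)
{
    int npairs = 0;

    for (int i = 0; i < nparts; i++)
    {
        /* A pruned partition on either side yields no rows for this pair. */
        if (left[i] == NULL || right[i] == NULL)
            continue;
        npairs++;
    }
    return npairs;
}
```

If the arrays held only the unpruned partitions, packed together, position i
on one side would no longer identify the matching partition on the other
side, and the loop above would need a separate lookup step.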
Regarding the impact of 898e5e329 on this, I think it invented
PartitionDirectory exactly to avoid PartitionDesc changing under us
affecting the planning or execution of a given query.  As for
PartitionDesc indexes being different between planning and execution, it
only affects PartitionPruneInfo and the commit did make changes to
ExecCreatePartitionPruneState to remap the old indexes of unpruned
partitions in PartitionPruneInfo (as they were during planning) to the new
ones.

> 2. I seriously dislike what's been done in joinrels.c, too.  That
> really seems like a kluge (and I haven't had time to study it
> closely).

Those hunks account for the fact that pruned partitions, for which we no
longer create RangeTblEntry and RelOptInfo, may appear on the nullable
side of an outer join.  We'll need a RelOptInfo holding a dummy path, so
that outer join paths can be created with one side of the join being a
dummy result path; the patch builds these with build_dummy_partition_rel().

> 3. It's not entirely clear to me why the patch has to touch
> execPartition.c.  That implies that the planner-to-executor API
> changed, but how so, and why is there no comment update clarifying it?

The change is that make_partitionedrel_pruneinfo() no longer adds the OIDs
of pruned partitions to the PartitionPruneInfo.relid_map array, a field
which 898e5e329 added.  898e5e329 had also added an Assert in
ExecCreatePartitionPruneState, which is the one you're seeing changed in
the patch.  In the case that the partition count hasn't changed between
planning and execution (*), it asserts that the partition OIDs in
PartitionPruneInfo.relid_map, which is essentially an exact copy of the
partition OIDs in the PartitionDesc that the planner retrieved, are the
same as the OIDs in the PartitionDesc that the executor retrieves.
* this seems a bit shaky, because the partition count not having changed
doesn't rule out the possibility that the partitions themselves have
changed

> Given the short amount of time left in this CF, there may not be
> time to address the first two points, and I won't necessarily
> insist that those be changed before committing.  I'd like at least
> a comment about point 3 though.
>
> Attached is updated patch as a single patch --- I didn't think the
> division into multiple patches was terribly helpful, due to the
> flapping in expected regression results.

Thanks again for the new patch.  I'm reading it now and will send comments
later today if I find something.

Thanks,
Amit
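As an illustration of the index remapping discussed under point 3 above, the
following is a rough sketch and not the executor's actual code.  DemoOid,
DEMO_INVALID_OID and remap_subplans are hypothetical names.  It assumes that
pruned planning-time slots carry an invalid (zero) OID, as the palloc0'd
relid_map in the patch would, and that partitions can only have been added
between planning and execution, so surviving partitions keep their relative
order.

```c
#include <assert.h>

typedef unsigned int DemoOid;       /* stand-in for PostgreSQL's Oid */
#define DEMO_INVALID_OID 0          /* marks a slot pruned at plan time */

/*
 * plan_oids[i]    OID of the partition in planning-time slot i, or
 *                 DEMO_INVALID_OID if that partition was pruned
 * plan_subplan[i] subplan index chosen for slot i, or -1
 * exec_oids[j]    OID of the partition in execution-time slot j
 * exec_subplan[j] output: subplan index for slot j, or -1 if the partition
 *                 was pruned or did not exist at planning time
 */
void
remap_subplans(const DemoOid *plan_oids, const int *plan_subplan, int plan_n,
               const DemoOid *exec_oids, int exec_n, int *exec_subplan)
{
    int p = 0;

    for (int e = 0; e < exec_n; e++)
    {
        /* Skip planning-time slots that were pruned. */
        while (p < plan_n && plan_oids[p] == DEMO_INVALID_OID)
            p++;

        if (p < plan_n && plan_oids[p] == exec_oids[e])
        {
            /* Same partition: carry the subplan index over. */
            exec_subplan[e] = plan_subplan[p];
            p++;
        }
        else
            exec_subplan[e] = -1;
    }
}
```

Matching by OID rather than by position is what keeps the mapping valid when
the partition count differs between planning and execution; if a surviving
partition could disappear in between, this sketch would no longer be correct.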
On Fri, Mar 29, 2019 at 3:45 PM, Amit Langote wrote:
> Thanks a lot for hacking on the patch.  I'm really happy with the direction
> you took for inheritance_planner, as it allows UPDATE/DELETE to use
> partition pruning.

I was astonished by Tom's awesome work and really thank him.

> Certainly.  Note that previously we'd always scan *all* hash partitions
> for UPDATE and DELETE queries, because constraint exclusion can't exclude
> hash partitions due to the shape of their partition constraint.
>
> I ran my usual benchmark with up to 8192 partitions.
>
> N: 2..8192
>
> create table rt (a int, b int, c int) partition by range (a); select 'create
> table rt' || x::text || ' partition of rt for values from (' || (x)::text
> || ') to (' || (x+1)::text || ');' from generate_series(1,
> N) x;
> \gexec
>
> update.sql:
>
> \set param random(1, N)
> update rt set a = 0 where a = :param;
>
> pgbench -n -T 120 -f select.sql
>
> nparts    v38   HEAD
> ======   ====   ====
> 2        2971   2969
> 8        2980   1949
> 32       2955    733
> 128      2946    145
> 512      2924     11
> 1024     2986      3
> 4096     2702      0
> 8192     2531    OOM
>
> Obviously, you'll get similar numbers with hash or list partitioning.

I also ran the test for hash partitioning, just to make sure.

N: 2..8192

create table ht (a int, b int, c int) partition by hash (a);
select 'create table ht' || x::text ||
' partition of ht for values with (MODULUS N, REMAINDER ' || (x)::text ||
');' from generate_series(0, N-1) x;
\gexec

update.sql:

\set param random(1, N * 100)
update ht set b = b + 1 where a = :param;

pgbench -n -T 60 -f update.sql

[updating one partition]
nparts    v38    HEAD
======   =====  =====
0:       10538  10487
2:        6942   7028
4:        7043   5645
8:        6981   3954
16:       6932   2440
32:       6897   1243
64:       6897    309
128:      6753    120
256:      6727     46
512:      6708     12
1024:     6063      3
2048:     5894      1
4096:     5374    OOM
8192:     4572    OOM

The performance for hash is also improved, though the rate at which
performance drops with a large number of partitions seems higher than that
of range partitioning.

Thanks
--
Imai Yoshikazu
Here are some comments on v38.

On 2019/03/29 12:44, Amit Langote wrote:
> Thanks again for the new patch.  I'm reading it now and will send comments
> later today if I find something.

-            Assert(rte->rtekind == RTE_RELATION ||
-                   rte->rtekind == RTE_SUBQUERY);
-            add_appendrel_other_rels(root, rel, rti);
+            if (rte->rtekind == RTE_RELATION)
+                expand_inherited_rtentry(root, rel, rte, rti);
+            else
+                expand_appendrel_subquery(root, rel, rte, rti);

Wouldn't it be a good idea to keep the Assert?

+     * It's possible that the RTIs we just assigned for the child rels in the
+     * final rtable are different from where they were in the SELECT query.

In the 2nd sentence, maybe you meant "...from what they were"

+    forboth(lc, old_child_rtis, lc2, new_child_rtis)
+    {
+        int         old_child_rti = lfirst_int(lc);
+        int         new_child_rti = lfirst_int(lc2);
+
+        if (old_child_rti == new_child_rti)
+            continue;           /* nothing to do */
+
+        Assert(old_child_rti > new_child_rti);
+
+        ChangeVarNodes((Node *) child_appinfos,
+                       old_child_rti, new_child_rti, 0);
+    }

This seems unnecessary?  RTEs of children of the target table should be in
the same position in the final_rtable as they are in the select_rtable.
It seems that they can be added to parse->rtable simply as:

    orig_rtable_len = list_length(parse->rtable);
    parse->rtable = list_concat(parse->rtable,
                                list_copy_tail(select_rtable,
                                               orig_rtable_len));

That is, after the block of code that plans the query as SELECT.

+     * about the childrens' reltargets, they'll be made later

Should it be children's?

+    /*
+     * If the partitioned table has no partitions, treat this as the
+     * non-inheritance case.
+     */
+    if (partdesc->nparts == 0)
+    {
+        /* XXX wrong? */
+        parentrte->inh = false;
+        return;
+    }

About the XXX: I think resetting the inh flag is unnecessary, so we should
just remove the line.
If we do that, we can also get rid of the following code in
set_rel_size():

    else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
    {
        /*
         * A partitioned table without any partitions is marked as
         * a dummy rel.
         */
        set_dummy_rel_pathlist(rel);
    }

Finally, it's not in the patch, but how about visiting
get_relation_constraints() for revising this block of code:

    /*
     * Append partition predicates, if any.
     *
     * For selects, partition pruning uses the parent table's partition bound
     * descriptor, instead of constraint exclusion which is driven by the
     * individual partition's partition constraint.
     */
    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
    {
        List       *pcqual = RelationGetPartitionQual(relation);

        if (pcqual)
        {
            /*
             * Run the partition quals through const-simplification similar to
             * check constraints.  We skip canonicalize_qual, though, because
             * partition quals should be in canonical form already; also,
             * since the qual is in implicit-AND format, we'd have to
             * explicitly convert it to explicit-AND format and back again.
             */
            pcqual = (List *) eval_const_expressions(root, (Node *) pcqual);

            /* Fix Vars to have the desired varno */
            if (varno != 1)
                ChangeVarNodes((Node *) pcqual, 1, varno, 0);

            result = list_concat(result, pcqual);
        }
    }

We will no longer need to load the partition constraints for "other rel"
partitions, not even for UPDATE and DELETE queries.  Now, we won't load
them with the patch applied, because we're cheating by first planning the
query as SELECT, so that's not an issue.  But we should change the
condition here to check if the input relation is a "baserel", in which
case, this should still load the partition constraint so that constraint
exclusion can use it when running with constraint_exclusion = on.  In
fact, I recently reported [1] on -hackers, as a bug introduced in PG 11,
that we don't load the partition constraint even when the partition is
being accessed directly.
Thanks,
Amit

[1] https://www.postgresql.org/message-id/9813f079-f16b-61c8-9ab7-4363cab28d80%40lab.ntt.co.jp
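As an aside, the list_concat()/list_copy_tail() suggestion in the message above can be pictured with plain arrays. This is only a sketch under stated assumptions: the int arrays stand in for range tables, append_rtable_tail() is a hypothetical helper (not a planner function), and no PostgreSQL headers are involved. It shows that appending the tail of select_rtable beyond the original length leaves every child entry at the same index it had during the SELECT-style planning pass.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical array analogue of
 *   list_concat(parse->rtable, list_copy_tail(select_rtable, orig_len))
 *
 * rt currently holds orig_len entries; the entries
 * select_rt[orig_len .. select_len - 1] are copied onto its end.
 * The caller must make sure rt has room for select_len entries.
 * Returns the new length of rt.
 */
size_t
append_rtable_tail(int *rt, size_t orig_len,
                   const int *select_rt, size_t select_len)
{
    /* The SELECT-pass table can only have grown by child entries. */
    assert(select_len >= orig_len);

    memcpy(rt + orig_len, select_rt + orig_len,
           (select_len - orig_len) * sizeof(int));
    return select_len;
}
```

If the two tables agree on their first orig_len entries, the result is entry-for-entry identical to select_rt, which is the premise behind asking whether the ChangeVarNodes loop is needed at all.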
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> Here are some comments on v38.

Thanks for looking it over!  I'll just reply to points worth discussing:

> -       Assert(rte->rtekind == RTE_RELATION ||
> -              rte->rtekind == RTE_SUBQUERY);
> -       add_appendrel_other_rels(root, rel, rti);
> +       if (rte->rtekind == RTE_RELATION)
> +           expand_inherited_rtentry(root, rel, rte, rti);
> +       else
> +           expand_appendrel_subquery(root, rel, rte, rti);

> Wouldn't it be a good idea to keep the Assert?

There's an Assert in expand_appendrel_subquery that what it got is an
RTE_SUBQUERY, so I thought the one at the call site was redundant.
I suppose another way to do this would be like

    if (rte->rtekind == RTE_RELATION)
        expand_inherited_rtentry(root, rel, rte, rti);
    else if (rte->rtekind == RTE_SUBQUERY)
        expand_appendrel_subquery(root, rel, rte, rti);
    else
        Assert(false);

Not sure if that's better or not.  Or we could go back to the
design of just having one function and letting it dispatch the
case it doesn't want to the other function --- though I think
I'd make expand_inherited_rtentry be the primary function,
rather than the other way around as you had it in v37.

> +   forboth(lc, old_child_rtis, lc2, new_child_rtis)
> +   {
> +       int old_child_rti = lfirst_int(lc);
> +       int new_child_rti = lfirst_int(lc2);
> +
> +       if (old_child_rti == new_child_rti)
> +           continue;           /* nothing to do */
> +
> +       Assert(old_child_rti > new_child_rti);
> +
> +       ChangeVarNodes((Node *) child_appinfos,
> +                      old_child_rti, new_child_rti, 0);
> +   }

> Is this necessary?  RTEs of children of the target table should be in
> the same position in the final_rtable as they are in the select_rtable.

Well, that's what I'm not very convinced of.  I have observed that
the regression tests don't reach this ChangeVarNodes call, but
I think that might just be lack of test cases rather than a proof
that it's impossible.
The question is whether it'd ever be possible for
the update/delete target to not be the first "inh" table that gets
expanded.  Since that expansion is done in RTE order, it reduces to
"is the target always before any other RTE entries that could need
inheritance expansion?"  Certainly that would typically be true, but
I don't feel very comfortable about assuming that it must be true,
when you start thinking about things like updatable views, rules,
WITH queries, and so on.

It might be worth trying to devise a test case that does reach this
code.  If we could convince ourselves that it's really impossible,
I'd be willing to drop it in favor of putting a test-and-elog check
in the earlier loop that the RTI pairs are all equal.  But I'm not
willing to do it without more investigation.

> +       /* XXX wrong? */
> +       parentrte->inh = false;

> About the XXX: I think resetting inh flag is unnecessary, so we should
> just remove the line.

Possibly.  I hadn't had time to follow up the XXX annotation.

> If we do that, we can also get rid of the following
> code in set_rel_size():

>     else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
>     {
>         /*
>          * A partitioned table without any partitions is marked as
>          * a dummy rel.
>          */
>         set_dummy_rel_pathlist(rel);
>     }

Not following?  Surely we need to mark the childless parent as dummy at
some point, and that seems like as good a place as any.

> Finally, it's not in the patch, but how about visiting
> get_relation_constraints() for revising this block of code:

That seems like probably an independent patch --- do you want to write it?

			regards, tom lane
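For readers following along, the two options weighed in this reply (remap when a child's RTI has moved, versus merely verifying that every old/new pair matches) can be sketched in standalone C. This is a minimal illustration, not planner code: the int arrays stand in for the old_child_rtis and new_child_rtis lists from the quoted hunk, and count_moved_rtis() and rti_pairs_all_equal() are hypothetical helpers invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the quoted forboth() loop: count the children
 * whose range-table index differs between the two passes.  In the planner,
 * each such pair is where ChangeVarNodes() would be called.
 */
size_t
count_moved_rtis(const int *old_rtis, const int *new_rtis, size_t n)
{
    size_t      moved = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (old_rtis[i] == new_rtis[i])
            continue;           /* nothing to do */

        /* Mirrors the Assert in the hunk: the new index is always lower. */
        assert(old_rtis[i] > new_rtis[i]);
        moved++;
    }
    return moved;
}

/*
 * The alternative: if moves can be shown to be impossible, only verify
 * that all pairs match (a test-and-elog check in the real code).
 */
bool
rti_pairs_all_equal(const int *old_rtis, const int *new_rtis, size_t n)
{
    return count_moved_rtis(old_rtis, new_rtis, n) == 0;
}
```

A regression test that makes count_moved_rtis() return nonzero in this model corresponds to the test case that would reach the ChangeVarNodes call.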
I wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> About the XXX: I think resetting inh flag is unnecessary, so we should
>> just remove the line.

> Possibly.  I hadn't had time to follow up the XXX annotation.

Now I have ...

Yeah, it seems we can just drop that and leave the flag alone.  We'll
end up running through set_append_rel_size and finding no relevant
AppendRelInfos, but that's not going to take long enough to be a problem.
It seems better to have the principle that rte->inh doesn't change after
subquery_planner's initial scan of the rtable, so I'll make it so.

>> If we do that, we can also get rid of the following
>> code in set_rel_size():

No, we can't --- that's still reachable if somebody says
"SELECT FROM ONLY partitioned_table".

			regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/03/29 7:38, Tom Lane wrote:
>> 2. I seriously dislike what's been done in joinrels.c, too.  That
>> really seems like a kluge (and I haven't had time to study it
>> closely).

> Those hunks account for the fact that pruned partitions, for which we no
> longer create RangeTblEntry and RelOptInfo, may appear on the nullable
> side of an outer join.  We'll need a RelOptInfo holding a dummy path, so
> that outer join paths can be created with one side of join being dummy
> result path, which are built in the patch by build_dummy_partition_rel().

Just for the record, that code is completely broken: it falls over
badly under GEQO.  (Try running the regression tests with
"alter system set geqo_threshold to 2".)  However, the partitionwise
join code was completely broken for GEQO before this patch, too, so
I'm just going to log that as an open issue for the moment.

			regards, tom lane
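To make the "nullable side of an outer join" point in the quoted explanation concrete, here is a minimal sketch based only on ordinary SQL join semantics. The enum and the function join_segment_is_empty() are hypothetical names made up for this illustration; this is not PostgreSQL's JoinType and not code from the patch. It answers one question: given which inputs of a per-partition join are known to be empty (pruned or dummy), can the join of that pair be proven to produce no rows?

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical join kinds for this sketch; not PostgreSQL's JoinType. */
typedef enum
{
    SK_INNER,
    SK_SEMI,
    SK_LEFT,
    SK_ANTI,
    SK_FULL
} SketchJoinType;

/*
 * Return true if joining one pair of partitions provably yields no rows,
 * given which inputs are known to be empty.
 */
bool
join_segment_is_empty(SketchJoinType jt, bool outer_empty, bool inner_empty)
{
    switch (jt)
    {
        case SK_INNER:
        case SK_SEMI:
            /* A match is needed on both sides. */
            return outer_empty || inner_empty;
        case SK_LEFT:
        case SK_ANTI:
            /* Every output row comes from an outer row. */
            return outer_empty;
        case SK_FULL:
            /* Unmatched rows from either side are emitted. */
            return outer_empty && inner_empty;
    }
    return false;               /* unknown kinds are never assumed empty */
}
```

With a pruned partition on the nullable side of a left join, join_segment_is_empty(SK_LEFT, false, true) is false: the outer rows still have to be emitted, null-extended, so the planner needs some way to represent the empty side for that pair or has to give up on joining partition by partition.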
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/03/29 7:38, Tom Lane wrote:
>> 2. I seriously dislike what's been done in joinrels.c, too.  That
>> really seems like a kluge (and I haven't had time to study it
>> closely).

> Those hunks account for the fact that pruned partitions, for which we no
> longer create RangeTblEntry and RelOptInfo, may appear on the nullable
> side of an outer join.  We'll need a RelOptInfo holding a dummy path, so
> that outer join paths can be created with one side of join being dummy
> result path, which are built in the patch by build_dummy_partition_rel().

Now that I've had a chance to look closer, there's no way I'm committing
that change in joinrels.c.  If it works at all, it's accidental, because
it's breaking all sorts of data structure invariants.  The business with
an AppendRelInfo that maps from the parentrel to itself is particularly
ugly; and I doubt that you can get away with assuming that
root->append_rel_array[parent->relid] is available for use for that.
(What if the parent is an intermediate partitioned table?)

There's also the small problem of the GEQO crash.  It's possible that
that could be gotten around by switching into the long-term planner
context in update_child_rel_info and build_dummy_partition_rel, but
then you're creating a memory leak across GEQO cycles.  It'd be much
better to avoid touching base-relation data structures during join
planning.

What I propose we do about the GEQO problem is shown in 0001 attached
(which would need to be back-patched into v11).  This is based on the
observation that, if we know an input relation is empty, we can often
prove the join is empty and then skip building it at all.  (In the
existing partitionwise-join code, the same cases are detected by
populate_joinrel_with_paths, but we do a fair amount of work before
discovering that.)  The cases where that's not true are where we
have a pruned partition on the inside of a left join, or either side
of a full join ...
but frankly, what the existing code produces for those cases is not short of embarrassing: -> Hash Left Join Hash Cond: (pagg_tab1_p1.x = y) Filter: ((pagg_tab1_p1.x > 5) OR (y < 20)) -> Seq Scan on pagg_tab1_p1 Filter: (x < 20) -> Hash -> Result One-Time Filter: false That's just dumb. What we *ought* to be doing in such degenerate outer-join cases is just emitting the non-dummy side, ie -> Seq Scan on pagg_tab1_p1 Filter: (x < 20) AND ((pagg_tab1_p1.x > 5) OR (y < 20)) in this example. I would envision handling this by teaching the code to generate a path for the joinrel that's basically just a ProjectionPath atop a path for the non-dummy input rel, with the projection set up to emit nulls for the columns of the dummy side. (Note that this would be useful for outer joins against dummy rels in regular planning contexts, not only partitionwise joins.) Pending somebody doing the work for that, though, I do not have a problem with just being unable to generate partitionwise joins in such cases, so 0001 attached just changes the expected outputs for the affected regression test cases. 0002 attached is then the rest of the partition-planning patch; it doesn't need to mess with joinrels.c at all. I've addressed the other points discussed today in that, except for the business about whether we want your 0003 bitmap-of-live-partitions patch. I'm still inclined to think that that's not really worth it, especially in view of your performance results. If people are OK with this approach to solving the GEQO problem, I think these are committable. regards, tom lane diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 56a5084..3c9d84f 100644 *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *************** set_append_rel_size(PlannerInfo *root, R *** 1112,1122 **** * for partitioned child rels. * * Note: here we abuse the consider_partitionwise_join flag by setting ! 
* it *even* for child rels that are not partitioned. In that case, ! * we set it to tell try_partitionwise_join() that it doesn't need to ! * generate their targetlists and EC entries as they have already been ! * generated here, as opposed to the dummy child rels for which the ! * flag is left set to false so that it will generate them. */ if (rel->consider_partitionwise_join) childrel->consider_partitionwise_join = true; --- 1112,1122 ---- * for partitioned child rels. * * Note: here we abuse the consider_partitionwise_join flag by setting ! * it for child rels that are not themselves partitioned. We do so to ! * tell try_partitionwise_join() that the child rel is sufficiently ! * valid to be used as a per-partition input, even if it later gets ! * proven to be dummy. (It's not usable until we've set up the ! * reltarget and EC entries, which we just did.) */ if (rel->consider_partitionwise_join) childrel->consider_partitionwise_join = true; *************** generate_partitionwise_join_paths(Planne *** 3564,3570 **** { RelOptInfo *child_rel = part_rels[cnt_parts]; ! Assert(child_rel != NULL); /* Add partitionwise join paths for partitioned child-joins. */ generate_partitionwise_join_paths(root, child_rel); --- 3564,3572 ---- { RelOptInfo *child_rel = part_rels[cnt_parts]; ! /* If it's been pruned entirely, it's certainly dummy. */ ! if (child_rel == NULL) ! continue; /* Add partitionwise join paths for partitioned child-joins. 
*/ generate_partitionwise_join_paths(root, child_rel); diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 9604a54..34cc7da 100644 *** a/src/backend/optimizer/path/joinrels.c --- b/src/backend/optimizer/path/joinrels.c *************** *** 15,23 **** #include "postgres.h" #include "miscadmin.h" - #include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" - #include "optimizer/clauses.h" #include "optimizer/joininfo.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" --- 15,21 ---- *************** static void try_partitionwise_join(Plann *** 44,51 **** RelOptInfo *rel2, RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo, List *parent_restrictlist); - static void update_child_rel_info(PlannerInfo *root, - RelOptInfo *rel, RelOptInfo *childrel); static SpecialJoinInfo *build_child_join_sjinfo(PlannerInfo *root, SpecialJoinInfo *parent_sjinfo, Relids left_relids, Relids right_relids); --- 42,47 ---- *************** try_partitionwise_join(PlannerInfo *root *** 1405,1410 **** --- 1401,1410 ---- { RelOptInfo *child_rel1 = rel1->part_rels[cnt_parts]; RelOptInfo *child_rel2 = rel2->part_rels[cnt_parts]; + bool rel1_empty = (child_rel1 == NULL || + IS_DUMMY_REL(child_rel1)); + bool rel2_empty = (child_rel2 == NULL || + IS_DUMMY_REL(child_rel2)); SpecialJoinInfo *child_sjinfo; List *child_restrictlist; RelOptInfo *child_joinrel; *************** try_partitionwise_join(PlannerInfo *root *** 1413,1436 **** int nappinfos; /* ! * If a child table has consider_partitionwise_join=false, it means * that it's a dummy relation for which we skipped setting up tlist ! * expressions and adding EC members in set_append_rel_size(), so do ! * that now for use later. */ if (rel1_is_simple && !child_rel1->consider_partitionwise_join) { Assert(child_rel1->reloptkind == RELOPT_OTHER_MEMBER_REL); Assert(IS_DUMMY_REL(child_rel1)); ! update_child_rel_info(root, rel1, child_rel1); ! 
child_rel1->consider_partitionwise_join = true; } if (rel2_is_simple && !child_rel2->consider_partitionwise_join) { Assert(child_rel2->reloptkind == RELOPT_OTHER_MEMBER_REL); Assert(IS_DUMMY_REL(child_rel2)); ! update_child_rel_info(root, rel2, child_rel2); ! child_rel2->consider_partitionwise_join = true; } /* We should never try to join two overlapping sets of rels. */ --- 1413,1481 ---- int nappinfos; /* ! * Check for cases where we can prove that this segment of the join ! * returns no rows, due to one or both inputs being empty (including ! * inputs that have been pruned away entirely). If so just ignore it. ! * These rules are equivalent to populate_joinrel_with_paths's rules ! * for dummy input relations. ! */ ! switch (parent_sjinfo->jointype) ! { ! case JOIN_INNER: ! case JOIN_SEMI: ! if (rel1_empty || rel2_empty) ! continue; /* ignore this join segment */ ! break; ! case JOIN_LEFT: ! case JOIN_ANTI: ! if (rel1_empty) ! continue; /* ignore this join segment */ ! break; ! case JOIN_FULL: ! if (rel1_empty && rel2_empty) ! continue; /* ignore this join segment */ ! break; ! default: ! /* other values not expected here */ ! elog(ERROR, "unrecognized join type: %d", ! (int) parent_sjinfo->jointype); ! break; ! } ! ! /* ! * If a child has been pruned entirely then we can't generate paths ! * for it, so we have to reject partitionwise joining unless we were ! * able to eliminate this partition above. ! */ ! if (child_rel1 == NULL || child_rel2 == NULL) ! { ! /* ! * Mark the joinrel as unpartitioned so that later functions treat ! * it correctly. ! */ ! joinrel->nparts = 0; ! return; ! } ! ! /* ! * If a leaf relation has consider_partitionwise_join=false, it means * that it's a dummy relation for which we skipped setting up tlist ! * expressions and adding EC members in set_append_rel_size(), so ! * again we have to fail here. 
*/ if (rel1_is_simple && !child_rel1->consider_partitionwise_join) { Assert(child_rel1->reloptkind == RELOPT_OTHER_MEMBER_REL); Assert(IS_DUMMY_REL(child_rel1)); ! joinrel->nparts = 0; ! return; } if (rel2_is_simple && !child_rel2->consider_partitionwise_join) { Assert(child_rel2->reloptkind == RELOPT_OTHER_MEMBER_REL); Assert(IS_DUMMY_REL(child_rel2)); ! joinrel->nparts = 0; ! return; } /* We should never try to join two overlapping sets of rels. */ *************** try_partitionwise_join(PlannerInfo *root *** 1475,1502 **** } /* - * Set up tlist expressions for the childrel, and add EC members referencing - * the childrel. - */ - static void - update_child_rel_info(PlannerInfo *root, - RelOptInfo *rel, RelOptInfo *childrel) - { - AppendRelInfo *appinfo = root->append_rel_array[childrel->relid]; - - /* Make child tlist expressions */ - childrel->reltarget->exprs = (List *) - adjust_appendrel_attrs(root, - (Node *) rel->reltarget->exprs, - 1, &appinfo); - - /* Make child entries in the EquivalenceClass as well */ - if (rel->has_eclass_joins || has_useful_pathkeys(root, rel)) - add_child_rel_equivalences(root, appinfo, rel, childrel); - childrel->has_eclass_joins = rel->has_eclass_joins; - } - - /* * Construct the SpecialJoinInfo for a child-join by translating * SpecialJoinInfo for the join between parents. left_relids and right_relids * are the relids of left and right side of the join respectively. --- 1520,1525 ---- diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index ca7a0fb..031e709 100644 *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *************** apply_scanjoin_target_to_paths(PlannerIn *** 6993,6998 **** --- 6993,7002 ---- List *child_scanjoin_targets = NIL; ListCell *lc; + /* Pruned or dummy children can be ignored. */ + if (child_rel == NULL || IS_DUMMY_REL(child_rel)) + continue; + /* Translate scan/join targets for this child. 
*/ appinfos = find_appinfos_by_relids(root, child_rel->relids, &nappinfos); *************** create_partitionwise_grouping_paths(Plan *** 7093,7100 **** RelOptInfo *child_grouped_rel; RelOptInfo *child_partially_grouped_rel; ! /* Input child rel must have a path */ ! Assert(child_input_rel->pathlist != NIL); /* * Copy the given "extra" structure as is and then override the --- 7097,7105 ---- RelOptInfo *child_grouped_rel; RelOptInfo *child_partially_grouped_rel; ! /* Pruned or dummy children can be ignored. */ ! if (child_input_rel == NULL || IS_DUMMY_REL(child_input_rel)) ! continue; /* * Copy the given "extra" structure as is and then override the *************** create_partitionwise_grouping_paths(Plan *** 7136,7149 **** extra->target_parallel_safe, child_extra.havingQual); - /* Ignore empty children. They contribute nothing. */ - if (IS_DUMMY_REL(child_input_rel)) - { - mark_dummy_rel(child_grouped_rel); - - continue; - } - /* Create grouping paths for this child relation. */ create_ordinary_grouping_paths(root, child_input_rel, child_grouped_rel, --- 7141,7146 ---- diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out index 6bc1068..9783281 100644 *** a/src/test/regress/expected/partition_aggregate.out --- b/src/test/regress/expected/partition_aggregate.out *************** SELECT a.x, sum(b.x) FROM pagg_tab1 a FU *** 721,752 **** -- non-nullable columns EXPLAIN (COSTS OFF) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a LEFT JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; ! QUERY PLAN ! ----------------------------------------------------------------------------- Sort ! Sort Key: pagg_tab1_p1.x, y ! -> Append ! -> HashAggregate ! Group Key: pagg_tab1_p1.x, y ! -> Hash Left Join ! Hash Cond: (pagg_tab1_p1.x = y) ! 
Filter: ((pagg_tab1_p1.x > 5) OR (y < 20)) -> Seq Scan on pagg_tab1_p1 Filter: (x < 20) - -> Hash - -> Result - One-Time Filter: false - -> HashAggregate - Group Key: pagg_tab1_p2.x, pagg_tab2_p2.y - -> Hash Left Join - Hash Cond: (pagg_tab1_p2.x = pagg_tab2_p2.y) - Filter: ((pagg_tab1_p2.x > 5) OR (pagg_tab2_p2.y < 20)) -> Seq Scan on pagg_tab1_p2 Filter: (x < 20) ! -> Hash -> Seq Scan on pagg_tab2_p2 Filter: (y > 10) ! (23 rows) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a LEFT JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; x | y | count --- 721,747 ---- -- non-nullable columns EXPLAIN (COSTS OFF) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a LEFT JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; ! QUERY PLAN ! ----------------------------------------------------------------------- Sort ! Sort Key: pagg_tab1_p1.x, pagg_tab2_p2.y ! -> HashAggregate ! Group Key: pagg_tab1_p1.x, pagg_tab2_p2.y ! -> Hash Left Join ! Hash Cond: (pagg_tab1_p1.x = pagg_tab2_p2.y) ! Filter: ((pagg_tab1_p1.x > 5) OR (pagg_tab2_p2.y < 20)) ! -> Append -> Seq Scan on pagg_tab1_p1 Filter: (x < 20) -> Seq Scan on pagg_tab1_p2 Filter: (x < 20) ! -> Hash ! -> Append -> Seq Scan on pagg_tab2_p2 Filter: (y > 10) ! -> Seq Scan on pagg_tab2_p3 ! Filter: (y > 10) ! (18 rows) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a LEFT JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; x | y | count *************** SELECT a.x, b.y, count(*) FROM (SELECT * *** 765,808 **** -- nullable columns EXPLAIN (COSTS OFF) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a FULL JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; ! QUERY PLAN ! 
----------------------------------------------------------------------------------- ! Finalize GroupAggregate ! Group Key: pagg_tab1_p1.x, y ! -> Sort ! Sort Key: pagg_tab1_p1.x, y ! -> Append ! -> Partial HashAggregate ! Group Key: pagg_tab1_p1.x, y ! -> Hash Full Join ! Hash Cond: (pagg_tab1_p1.x = y) ! Filter: ((pagg_tab1_p1.x > 5) OR (y < 20)) ! -> Seq Scan on pagg_tab1_p1 ! Filter: (x < 20) ! -> Hash ! -> Result ! One-Time Filter: false ! -> Partial HashAggregate ! Group Key: pagg_tab1_p2.x, pagg_tab2_p2.y ! -> Hash Full Join ! Hash Cond: (pagg_tab1_p2.x = pagg_tab2_p2.y) ! Filter: ((pagg_tab1_p2.x > 5) OR (pagg_tab2_p2.y < 20)) ! -> Seq Scan on pagg_tab1_p2 ! Filter: (x < 20) ! -> Hash ! -> Seq Scan on pagg_tab2_p2 ! Filter: (y > 10) ! -> Partial HashAggregate ! Group Key: x, pagg_tab2_p3.y ! -> Hash Full Join ! Hash Cond: (pagg_tab2_p3.y = x) ! Filter: ((x > 5) OR (pagg_tab2_p3.y < 20)) -> Seq Scan on pagg_tab2_p3 Filter: (y > 10) ! -> Hash ! -> Result ! One-Time Filter: false ! (35 rows) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a FULL JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; x | y | count --- 760,786 ---- -- nullable columns EXPLAIN (COSTS OFF) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a FULL JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; ! QUERY PLAN ! ----------------------------------------------------------------------- ! Sort ! Sort Key: pagg_tab1_p1.x, pagg_tab2_p2.y ! -> HashAggregate ! Group Key: pagg_tab1_p1.x, pagg_tab2_p2.y ! -> Hash Full Join ! Hash Cond: (pagg_tab1_p1.x = pagg_tab2_p2.y) ! Filter: ((pagg_tab1_p1.x > 5) OR (pagg_tab2_p2.y < 20)) ! -> Append ! -> Seq Scan on pagg_tab1_p1 ! Filter: (x < 20) ! -> Seq Scan on pagg_tab1_p2 ! Filter: (x < 20) ! -> Hash ! -> Append ! -> Seq Scan on pagg_tab2_p2 ! 
Filter: (y > 10) -> Seq Scan on pagg_tab2_p3 Filter: (y > 10) ! (18 rows) SELECT a.x, b.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x < 20) a FULL JOIN (SELECT * FROM pagg_tab2 WHERE y > 10)b ON a.x = b.y WHERE a.x > 5 or b.y < 20 GROUP BY a.x, b.y ORDER BY 1, 2; x | y | count diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out index e19535d..e228e3f 100644 *** a/src/test/regress/expected/partition_join.out --- b/src/test/regress/expected/partition_join.out *************** SELECT t1.a, t1.c, t2.b, t2.c FROM (SELE *** 210,232 **** QUERY PLAN ----------------------------------------------------------- Sort ! Sort Key: prt1_p1.a, b ! -> Append ! -> Hash Left Join ! Hash Cond: (prt1_p1.a = b) ! -> Seq Scan on prt1_p1 ! Filter: ((a < 450) AND (b = 0)) ! -> Hash ! -> Result ! One-Time Filter: false ! -> Hash Right Join ! Hash Cond: (prt2_p2.b = prt1_p2.a) -> Seq Scan on prt2_p2 Filter: (b > 250) ! -> Hash -> Seq Scan on prt1_p2 Filter: ((a < 450) AND (b = 0)) ! (17 rows) SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; a | c | b | c --- 210,230 ---- QUERY PLAN ----------------------------------------------------------- Sort ! Sort Key: prt1_p1.a, prt2_p2.b ! -> Hash Right Join ! Hash Cond: (prt2_p2.b = prt1_p1.a) ! -> Append -> Seq Scan on prt2_p2 Filter: (b > 250) ! -> Seq Scan on prt2_p3 ! Filter: (b > 250) ! -> Hash ! -> Append ! -> Seq Scan on prt1_p1 ! Filter: ((a < 450) AND (b = 0)) -> Seq Scan on prt1_p2 Filter: ((a < 450) AND (b = 0)) ! 
(15 rows) SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; a | c | b | c *************** SELECT t1.a, t1.c, t2.b, t2.c FROM (SELE *** 244,279 **** EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 FULL JOIN (SELECT * FROM prt2 WHERE b > 250) t2ON t1.a = t2.b WHERE t1.b = 0 OR t2.a = 0 ORDER BY t1.a, t2.b; ! QUERY PLAN ! ------------------------------------------------------------ Sort ! Sort Key: prt1_p1.a, b ! -> Append ! -> Hash Full Join ! Hash Cond: (prt1_p1.a = b) ! Filter: ((prt1_p1.b = 0) OR (a = 0)) -> Seq Scan on prt1_p1 Filter: (a < 450) - -> Hash - -> Result - One-Time Filter: false - -> Hash Full Join - Hash Cond: (prt1_p2.a = prt2_p2.b) - Filter: ((prt1_p2.b = 0) OR (prt2_p2.a = 0)) -> Seq Scan on prt1_p2 Filter: (a < 450) ! -> Hash -> Seq Scan on prt2_p2 Filter: (b > 250) ! -> Hash Full Join ! Hash Cond: (prt2_p3.b = a) ! Filter: ((b = 0) OR (prt2_p3.a = 0)) ! -> Seq Scan on prt2_p3 ! Filter: (b > 250) ! -> Hash ! -> Result ! One-Time Filter: false ! (27 rows) SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 FULL JOIN (SELECT * FROM prt2 WHERE b > 250) t2ON t1.a = t2.b WHERE t1.b = 0 OR t2.a = 0 ORDER BY t1.a, t2.b; a | c | b | c --- 242,266 ---- EXPLAIN (COSTS OFF) SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 FULL JOIN (SELECT * FROM prt2 WHERE b > 250) t2ON t1.a = t2.b WHERE t1.b = 0 OR t2.a = 0 ORDER BY t1.a, t2.b; ! QUERY PLAN ! ------------------------------------------------------ Sort ! Sort Key: prt1_p1.a, prt2_p2.b ! -> Hash Full Join ! Hash Cond: (prt1_p1.a = prt2_p2.b) ! Filter: ((prt1_p1.b = 0) OR (prt2_p2.a = 0)) ! -> Append -> Seq Scan on prt1_p1 Filter: (a < 450) -> Seq Scan on prt1_p2 Filter: (a < 450) ! -> Hash ! -> Append -> Seq Scan on prt2_p2 Filter: (b > 250) ! -> Seq Scan on prt2_p3 ! Filter: (b > 250) ! 
(16 rows) SELECT t1.a, t1.c, t2.b, t2.c FROM (SELECT * FROM prt1 WHERE a < 450) t1 FULL JOIN (SELECT * FROM prt2 WHERE b > 250) t2ON t1.a = t2.b WHERE t1.b = 0 OR t2.a = 0 ORDER BY t1.a, t2.b; a | c | b | c *************** SELECT t1.a, t2.b FROM (SELECT * FROM pr *** 997,1025 **** QUERY PLAN ----------------------------------------------------------- Sort ! Sort Key: prt1_p1.a, b ! -> Append ! -> Merge Left Join ! Merge Cond: (prt1_p1.a = b) ! -> Sort ! Sort Key: prt1_p1.a -> Seq Scan on prt1_p1 Filter: ((a < 450) AND (b = 0)) - -> Sort - Sort Key: b - -> Result - One-Time Filter: false - -> Merge Left Join - Merge Cond: (prt1_p2.a = prt2_p2.b) - -> Sort - Sort Key: prt1_p2.a -> Seq Scan on prt1_p2 Filter: ((a < 450) AND (b = 0)) ! -> Sort ! Sort Key: prt2_p2.b -> Seq Scan on prt2_p2 Filter: (b > 250) ! (23 rows) SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a =t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; a | b --- 984,1007 ---- QUERY PLAN ----------------------------------------------------------- Sort ! Sort Key: prt1_p1.a, prt2_p2.b ! -> Merge Left Join ! Merge Cond: (prt1_p1.a = prt2_p2.b) ! -> Sort ! Sort Key: prt1_p1.a ! -> Append -> Seq Scan on prt1_p1 Filter: ((a < 450) AND (b = 0)) -> Seq Scan on prt1_p2 Filter: ((a < 450) AND (b = 0)) ! -> Sort ! Sort Key: prt2_p2.b ! -> Append -> Seq Scan on prt2_p2 Filter: (b > 250) ! -> Seq Scan on prt2_p3 ! Filter: (b > 250) ! 
(18 rows) SELECT t1.a, t2.b FROM (SELECT * FROM prt1 WHERE a < 450) t1 LEFT JOIN (SELECT * FROM prt2 WHERE b > 250) t2 ON t1.a =t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b; a | b diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 9ad9035..310c715 100644 *** a/contrib/postgres_fdw/expected/postgres_fdw.out --- b/contrib/postgres_fdw/expected/postgres_fdw.out *************** select * from bar where f1 in (select f1 *** 7116,7124 **** QUERY PLAN ---------------------------------------------------------------------------------------------- LockRows ! Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid -> Hash Join ! Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true Hash Cond: (bar.f1 = foo.f1) -> Append --- 7116,7124 ---- QUERY PLAN ---------------------------------------------------------------------------------------------- LockRows ! Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid -> Hash Join ! Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid Inner Unique: true Hash Cond: (bar.f1 = foo.f1) -> Append *************** select * from bar where f1 in (select f1 *** 7128,7142 **** Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> HashAggregate ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (23 rows) --- 7128,7142 ---- Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash ! 
Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (23 rows) *************** select * from bar where f1 in (select f1 *** 7154,7162 **** QUERY PLAN ---------------------------------------------------------------------------------------------- LockRows ! Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid -> Hash Join ! Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid Inner Unique: true Hash Cond: (bar.f1 = foo.f1) -> Append --- 7154,7162 ---- QUERY PLAN ---------------------------------------------------------------------------------------------- LockRows ! Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid -> Hash Join ! Output: bar.f1, bar.f2, bar.ctid, foo.ctid, bar.*, bar.tableoid, foo.*, foo.tableoid Inner Unique: true Hash Cond: (bar.f1 = foo.f1) -> Append *************** select * from bar where f1 in (select f1 *** 7166,7180 **** Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE -> Hash ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> HashAggregate ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (23 rows) --- 7166,7180 ---- Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE -> Hash ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate ! 
Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (23 rows) *************** update bar set f2 = f2 + 100 where f1 in *** 7203,7217 **** -> Seq Scan on public.bar Output: bar.f1, bar.f2, bar.ctid -> Hash ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> HashAggregate ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid --- 7203,7217 ---- -> Seq Scan on public.bar Output: bar.f1, bar.f2, bar.ctid -> Hash ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 -> Hash Join Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid *************** update bar set f2 = f2 + 100 where f1 in *** 7221,7235 **** Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> HashAggregate ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.*, foo.tableoid, foo.f1 -> Foreign Scan on public.foo2 ! 
Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (39 rows) --- 7221,7235 ---- Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE -> Hash ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> HashAggregate ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid Group Key: foo.f1 -> Append -> Seq Scan on public.foo ! Output: foo.ctid, foo.f1, foo.*, foo.tableoid -> Foreign Scan on public.foo2 ! Output: foo2.ctid, foo2.f1, foo2.*, foo2.tableoid Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1 (39 rows) *************** SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT *** 8435,8441 **** Foreign Scan Output: t1.a, ftprt2_p1.b, ftprt2_p1.c Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2) ! Remote SQL: SELECT r5.a, r7.b, r7.c FROM (public.fprt1_p1 r5 LEFT JOIN public.fprt2_p1 r7 ON (((r5.a = r7.b)) AND ((r5.b = r7.a)) AND ((r7.a < 10)))) WHERE ((r5.a < 10)) ORDER BY r5.a ASC NULLS LAST, r7.b ASC NULLS LAST, r7.c ASC NULLS LAST (4 rows) SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3; --- 8435,8441 ---- Foreign Scan Output: t1.a, ftprt2_p1.b, ftprt2_p1.c Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2) !
Remote SQL: SELECT r5.a, r6.b, r6.c FROM (public.fprt1_p1 r5 LEFT JOIN public.fprt2_p1 r6 ON (((r5.a = r6.b)) AND ((r5.b = r6.a)) AND ((r6.a < 10)))) WHERE ((r5.a < 10)) ORDER BY r5.a ASC NULLS LAST, r6.b ASC NULLS LAST, r6.c ASC NULLS LAST (4 rows) SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3; diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index cfad8a3..b72db85 100644 *** a/src/backend/executor/execPartition.c --- b/src/backend/executor/execPartition.c *************** ExecCreatePartitionPruneState(PlanState *** 1654,1662 **** memcpy(pprune->subplan_map, pinfo->subplan_map, sizeof(int) * pinfo->nparts); ! /* Double-check that list of relations has not changed. */ ! Assert(memcmp(partdesc->oids, pinfo->relid_map, ! pinfo->nparts * sizeof(Oid)) == 0); } else { --- 1654,1670 ---- memcpy(pprune->subplan_map, pinfo->subplan_map, sizeof(int) * pinfo->nparts); ! /* ! * Double-check that the list of unpruned relations has not ! * changed. (Pruned partitions are not in relid_map[].) ! */ ! #ifdef USE_ASSERT_CHECKING ! for (int k = 0; k < pinfo->nparts; k++) ! { ! Assert(partdesc->oids[k] == pinfo->relid_map[k] || ! pinfo->subplan_map[k] == -1); ! } !
#endif } else { diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 3c9d84f..727da33 100644 *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *************** static void subquery_push_qual(Query *su *** 139,147 **** static void recurse_push_qual(Node *setOp, Query *topquery, RangeTblEntry *rte, Index rti, Node *qual); static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel); - static bool apply_child_basequals(PlannerInfo *root, RelOptInfo *rel, - RelOptInfo *childrel, - RangeTblEntry *childRTE, AppendRelInfo *appinfo); /* --- 139,144 ---- *************** set_rel_size(PlannerInfo *root, RelOptIn *** 396,403 **** else if (rte->relkind == RELKIND_PARTITIONED_TABLE) { /* ! * A partitioned table without any partitions is marked as ! * a dummy rel. */ set_dummy_rel_pathlist(rel); } --- 393,401 ---- else if (rte->relkind == RELKIND_PARTITIONED_TABLE) { /* ! * We could get here if asked to scan a partitioned table ! * with ONLY. In that case we shouldn't scan any of the ! * partitions, so mark it as a dummy rel. */ set_dummy_rel_pathlist(rel); } *************** set_append_rel_size(PlannerInfo *root, R *** 946,953 **** double *parent_attrsizes; int nattrs; ListCell *l; - Relids live_children = NULL; - bool did_pruning = false; /* Guard against stack overflow due to overly deep inheritance tree. */ check_stack_depth(); --- 944,949 ---- *************** set_append_rel_size(PlannerInfo *root, R *** 966,986 **** rel->partitioned_child_rels = list_make1_int(rti); /* - * If the partitioned relation has any baserestrictinfo quals then we - * attempt to use these quals to prune away partitions that cannot - * possibly contain any tuples matching these quals. In this case we'll - * store the relids of all partitions which could possibly contain a - * matching tuple, and skip anything else in the loop below. 
- */ - if (enable_partition_pruning && - rte->relkind == RELKIND_PARTITIONED_TABLE && - rel->baserestrictinfo != NIL) - { - live_children = prune_append_rel_partitions(rel); - did_pruning = true; - } - - /* * If this is a partitioned baserel, set the consider_partitionwise_join * flag; currently, we only consider partitionwise joins with the baserel * if its targetlist doesn't contain a whole-row Var. --- 962,967 ---- *************** set_append_rel_size(PlannerInfo *root, R *** 1034,1063 **** childrel = find_base_rel(root, childRTindex); Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL); ! if (did_pruning && !bms_is_member(appinfo->child_relid, live_children)) ! { ! /* This partition was pruned; skip it. */ ! set_dummy_rel_pathlist(childrel); continue; - } /* * We have to copy the parent's targetlist and quals to the child, ! * with appropriate substitution of variables. If any constant false ! * or NULL clauses turn up, we can disregard the child right away. If ! * not, we can apply constraint exclusion with just the ! * baserestrictinfo quals. */ - if (!apply_child_basequals(root, rel, childrel, childRTE, appinfo)) - { - /* - * Some restriction clause reduced to constant FALSE or NULL after - * substitution, so this child need not be scanned. - */ - set_dummy_rel_pathlist(childrel); - continue; - } - if (relation_excluded_by_constraints(root, childrel, childRTE)) { /* --- 1015,1031 ---- childrel = find_base_rel(root, childRTindex); Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL); ! /* We may have already proven the child to be dummy. */ ! if (IS_DUMMY_REL(childrel)) continue; /* * We have to copy the parent's targetlist and quals to the child, ! * with appropriate substitution of variables. However, the ! * baserestrictinfo quals were already copied/substituted when the ! * child RelOptInfo was built. So we don't need any additional setup ! * before applying constraint exclusion. 
*/ if (relation_excluded_by_constraints(root, childrel, childRTE)) { /* *************** set_append_rel_size(PlannerInfo *root, R *** 1069,1075 **** } /* ! * CE failed, so finish copying/modifying targetlist and join quals. * * NB: the resulting childrel->reltarget->exprs may contain arbitrary * expressions, which otherwise would not occur in a rel's targetlist. --- 1037,1044 ---- } /* ! * Constraint exclusion failed, so copy the parent's join quals and ! * targetlist to the child, with appropriate variable substitutions. * * NB: the resulting childrel->reltarget->exprs may contain arbitrary * expressions, which otherwise would not occur in a rel's targetlist. *************** generate_partitionwise_join_paths(Planne *** 3596,3728 **** list_free(live_children); } - /* - * apply_child_basequals - * Populate childrel's quals based on rel's quals, translating them using - * appinfo. - * - * If any of the resulting clauses evaluate to false or NULL, we return false - * and don't apply any quals. Caller can mark the relation as a dummy rel in - * this case, since it needn't be scanned. - * - * If any resulting clauses evaluate to true, they're unnecessary and we don't - * apply then. - */ - static bool - apply_child_basequals(PlannerInfo *root, RelOptInfo *rel, - RelOptInfo *childrel, RangeTblEntry *childRTE, - AppendRelInfo *appinfo) - { - List *childquals; - Index cq_min_security; - ListCell *lc; - - /* - * The child rel's targetlist might contain non-Var expressions, which - * means that substitution into the quals could produce opportunities for - * const-simplification, and perhaps even pseudoconstant quals. Therefore, - * transform each RestrictInfo separately to see if it reduces to a - * constant or pseudoconstant. (We must process them separately to keep - * track of the security level of each qual.) 
- */ - childquals = NIL; - cq_min_security = UINT_MAX; - foreach(lc, rel->baserestrictinfo) - { - RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc); - Node *childqual; - ListCell *lc2; - - Assert(IsA(rinfo, RestrictInfo)); - childqual = adjust_appendrel_attrs(root, - (Node *) rinfo->clause, - 1, &appinfo); - childqual = eval_const_expressions(root, childqual); - /* check for flat-out constant */ - if (childqual && IsA(childqual, Const)) - { - if (((Const *) childqual)->constisnull || - !DatumGetBool(((Const *) childqual)->constvalue)) - { - /* Restriction reduces to constant FALSE or NULL */ - return false; - } - /* Restriction reduces to constant TRUE, so drop it */ - continue; - } - /* might have gotten an AND clause, if so flatten it */ - foreach(lc2, make_ands_implicit((Expr *) childqual)) - { - Node *onecq = (Node *) lfirst(lc2); - bool pseudoconstant; - - /* check for pseudoconstant (no Vars or volatile functions) */ - pseudoconstant = - !contain_vars_of_level(onecq, 0) && - !contain_volatile_functions(onecq); - if (pseudoconstant) - { - /* tell createplan.c to check for gating quals */ - root->hasPseudoConstantQuals = true; - } - /* reconstitute RestrictInfo with appropriate properties */ - childquals = lappend(childquals, - make_restrictinfo((Expr *) onecq, - rinfo->is_pushed_down, - rinfo->outerjoin_delayed, - pseudoconstant, - rinfo->security_level, - NULL, NULL, NULL)); - /* track minimum security level among child quals */ - cq_min_security = Min(cq_min_security, rinfo->security_level); - } - } - - /* - * In addition to the quals inherited from the parent, we might have - * securityQuals associated with this particular child node. (Currently - * this can only happen in appendrels originating from UNION ALL; - * inheritance child tables don't have their own securityQuals, see - * expand_inherited_rtentry().) Pull any such securityQuals up into the - * baserestrictinfo for the child. 
This is similar to - * process_security_barrier_quals() for the parent rel, except that we - * can't make any general deductions from such quals, since they don't - * hold for the whole appendrel. - */ - if (childRTE->securityQuals) - { - Index security_level = 0; - - foreach(lc, childRTE->securityQuals) - { - List *qualset = (List *) lfirst(lc); - ListCell *lc2; - - foreach(lc2, qualset) - { - Expr *qual = (Expr *) lfirst(lc2); - - /* not likely that we'd see constants here, so no check */ - childquals = lappend(childquals, - make_restrictinfo(qual, - true, false, false, - security_level, - NULL, NULL, NULL)); - cq_min_security = Min(cq_min_security, security_level); - } - security_level++; - } - Assert(security_level <= root->qual_security_level); - } - - /* - * OK, we've got all the baserestrictinfo quals for this child. - */ - childrel->baserestrictinfo = childquals; - childrel->baserestrict_min_security = cq_min_security; - - return true; - } /***************************************************************************** * DEBUG SUPPORT --- 3565,3570 ---- diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c index 62dfac6..9798dca 100644 *** a/src/backend/optimizer/plan/initsplan.c --- b/src/backend/optimizer/plan/initsplan.c *************** *** 20,25 **** --- 20,26 ---- #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" + #include "optimizer/inherit.h" #include "optimizer/joininfo.h" #include "optimizer/optimizer.h" #include "optimizer/pathnode.h" *************** add_other_rels_to_query(PlannerInfo *roo *** 159,170 **** /* If it's marked as inheritable, look for children. */ if (rte->inh) ! { ! /* Only relation and subquery RTEs can have children. */ ! Assert(rte->rtekind == RTE_RELATION || ! rte->rtekind == RTE_SUBQUERY); ! add_appendrel_other_rels(root, rel, rti); ! } } } --- 160,166 ---- /* If it's marked as inheritable, look for children. */ if (rte->inh) ! 
expand_inherited_rtentry(root, rel, rte, rti); } } diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 031e709..f08f1cd 100644 *** a/src/backend/optimizer/plan/planner.c --- b/src/backend/optimizer/plan/planner.c *************** *** 25,30 **** --- 25,31 ---- #include "access/table.h" #include "access/xact.h" #include "catalog/pg_constraint.h" + #include "catalog/pg_inherits.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" #include "executor/executor.h" *************** subquery_planner(PlannerGlobal *glob, Qu *** 679,690 **** flatten_simple_union_all(root); /* ! * Detect whether any rangetable entries are RTE_JOIN kind; if not, we can ! * avoid the expense of doing flatten_join_alias_vars(). Likewise check ! * whether any are RTE_RESULT kind; if not, we can skip ! * remove_useless_result_rtes(). Also check for outer joins --- if none, ! * we can skip reduce_outer_joins(). And check for LATERAL RTEs, too. ! * This must be done after we have done pull_up_subqueries(), of course. */ root->hasJoinRTEs = false; root->hasLateralRTEs = false; --- 680,693 ---- flatten_simple_union_all(root); /* ! * Survey the rangetable to see what kinds of entries are present. We can ! * skip some later processing if relevant SQL features are not used; for ! * example if there are no JOIN RTEs we can avoid the expense of doing ! * flatten_join_alias_vars(). This must be done after we have finished ! * adding rangetable entries, of course. (Note: actually, processing of ! * inherited or partitioned rels can cause RTEs for their child tables to ! * get added later; but those must all be RTE_RELATION entries, so they ! * don't invalidate the conclusions drawn here.) */ root->hasJoinRTEs = false; root->hasLateralRTEs = false; *************** subquery_planner(PlannerGlobal *glob, Qu *** 694,732 **** { RangeTblEntry *rte = lfirst_node(RangeTblEntry, l); ! if (rte->rtekind == RTE_JOIN) ! { ! root->hasJoinRTEs = true; ! 
if (IS_OUTER_JOIN(rte->jointype)) ! hasOuterJoins = true; ! } ! else if (rte->rtekind == RTE_RESULT) { ! hasResultRTEs = true; } if (rte->lateral) root->hasLateralRTEs = true; } /* * Preprocess RowMark information. We need to do this after subquery ! * pullup (so that all non-inherited RTEs are present) and before ! * inheritance expansion (so that the info is available for ! * expand_inherited_tables to examine and modify). */ preprocess_rowmarks(root); /* - * Expand any rangetable entries that are inheritance sets into "append - * relations". This can add entries to the rangetable, but they must be - * plain RTE_RELATION entries, so it's OK (and marginally more efficient) - * to do it after checking for joins and other special RTEs. We must do - * this after pulling up subqueries, else we'd fail to handle inherited - * tables in subqueries. - */ - expand_inherited_tables(root); - - /* * Set hasHavingQual to remember if HAVING clause is present. Needed * because preprocess_expression will reduce a constant-true condition to * an empty qual list ... but "HAVING TRUE" is not a semantic no-op. --- 697,745 ---- { RangeTblEntry *rte = lfirst_node(RangeTblEntry, l); ! switch (rte->rtekind) { ! case RTE_RELATION: ! if (rte->inh) ! { ! /* ! * Check to see if the relation actually has any children; ! * if not, clear the inh flag so we can treat it as a ! * plain base relation. ! * ! * Note: this could give a false-positive result, if the ! * rel once had children but no longer does. We used to ! * be able to clear rte->inh later on when we discovered ! * that, but no more; we have to handle such cases as ! * full-fledged inheritance. ! */ ! rte->inh = has_subclass(rte->relid); ! } ! break; ! case RTE_JOIN: ! root->hasJoinRTEs = true; ! if (IS_OUTER_JOIN(rte->jointype)) ! hasOuterJoins = true; ! break; ! case RTE_RESULT: ! hasResultRTEs = true; ! break; ! default: ! /* No work here for other RTE types */ ! 
break; } + if (rte->lateral) root->hasLateralRTEs = true; } /* * Preprocess RowMark information. We need to do this after subquery ! * pullup, so that all base relations are present. */ preprocess_rowmarks(root); /* * Set hasHavingQual to remember if HAVING clause is present. Needed * because preprocess_expression will reduce a constant-true condition to * an empty qual list ... but "HAVING TRUE" is not a semantic no-op. *************** inheritance_planner(PlannerInfo *root) *** 1180,1190 **** { Query *parse = root->parse; int top_parentRTindex = parse->resultRelation; Bitmapset *subqueryRTindexes; ! Bitmapset *modifiableARIindexes; int nominalRelation = -1; Index rootRelation = 0; List *final_rtable = NIL; int save_rel_array_size = 0; RelOptInfo **save_rel_array = NULL; AppendRelInfo **save_append_rel_array = NULL; --- 1193,1209 ---- { Query *parse = root->parse; int top_parentRTindex = parse->resultRelation; + List *select_rtable; + List *select_appinfos; + List *child_appinfos; + List *old_child_rtis; + List *new_child_rtis; Bitmapset *subqueryRTindexes; ! Index next_subquery_rti; int nominalRelation = -1; Index rootRelation = 0; List *final_rtable = NIL; + List *final_rowmarks = NIL; int save_rel_array_size = 0; RelOptInfo **save_rel_array = NULL; AppendRelInfo **save_append_rel_array = NULL; *************** inheritance_planner(PlannerInfo *root) *** 1196,1209 **** List *rowMarks; RelOptInfo *final_rel; ListCell *lc; Index rti; RangeTblEntry *parent_rte; ! PlannerInfo *parent_root; ! Query *parent_parse; ! Bitmapset *parent_relids = bms_make_singleton(top_parentRTindex); ! PlannerInfo **parent_roots = NULL; ! Assert(parse->commandType != CMD_INSERT); /* * We generate a modified instance of the original Query for each target --- 1215,1229 ---- List *rowMarks; RelOptInfo *final_rel; ListCell *lc; + ListCell *lc2; Index rti; RangeTblEntry *parent_rte; ! Bitmapset *parent_relids; ! Query **parent_parses; ! /* Should only get here for UPDATE or DELETE */ ! 
Assert(parse->commandType == CMD_UPDATE || ! parse->commandType == CMD_DELETE); /* * We generate a modified instance of the original Query for each target *************** inheritance_planner(PlannerInfo *root) *** 1234,1272 **** } /* - * Next, we want to identify which AppendRelInfo items contain references - * to any of the aforesaid subquery RTEs. These items will need to be - * copied and modified to adjust their subquery references; whereas the - * other ones need not be touched. It's worth being tense over this - * because we can usually avoid processing most of the AppendRelInfo - * items, thereby saving O(N^2) space and time when the target is a large - * inheritance tree. We can identify AppendRelInfo items by their - * child_relid, since that should be unique within the list. - */ - modifiableARIindexes = NULL; - if (subqueryRTindexes != NULL) - { - foreach(lc, root->append_rel_list) - { - AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); - - if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) || - bms_is_member(appinfo->child_relid, subqueryRTindexes) || - bms_overlap(pull_varnos((Node *) appinfo->translated_vars), - subqueryRTindexes)) - modifiableARIindexes = bms_add_member(modifiableARIindexes, - appinfo->child_relid); - } - } - - /* * If the parent RTE is a partitioned table, we should use that as the * nominal target relation, because the RTEs added for partitioned tables * (including the root parent) as child members of the inheritance set do * not appear anywhere else in the plan, so the confusion explained below * for non-partitioning inheritance cases is not possible. */ ! 
parent_rte = rt_fetch(top_parentRTindex, root->parse->rtable); if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE) { nominalRelation = top_parentRTindex; --- 1254,1267 ---- } /* * If the parent RTE is a partitioned table, we should use that as the * nominal target relation, because the RTEs added for partitioned tables * (including the root parent) as child members of the inheritance set do * not appear anywhere else in the plan, so the confusion explained below * for non-partitioning inheritance cases is not possible. */ ! parent_rte = rt_fetch(top_parentRTindex, parse->rtable); ! Assert(parent_rte->inh); if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE) { nominalRelation = top_parentRTindex; *************** inheritance_planner(PlannerInfo *root) *** 1274,1321 **** } /* ! * The PlannerInfo for each child is obtained by translating the relevant ! * members of the PlannerInfo for its immediate parent, which we find ! * using the parent_relid in its AppendRelInfo. We save the PlannerInfo ! * for each parent in an array indexed by relid for fast retrieval. Since ! * the maximum number of parents is limited by the number of RTEs in the ! * query, we use that number to allocate the array. An extra entry is ! * needed since relids start from 1. */ ! parent_roots = (PlannerInfo **) palloc0((list_length(parse->rtable) + 1) * ! sizeof(PlannerInfo *)); ! parent_roots[top_parentRTindex] = root; /* * And now we can get on with generating a plan for each child table. */ ! foreach(lc, root->append_rel_list) { AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); PlannerInfo *subroot; RangeTblEntry *child_rte; RelOptInfo *sub_final_rel; Path *subpath; - /* append_rel_list contains all append rels; ignore others */ - if (!bms_is_member(appinfo->parent_relid, parent_relids)) - continue; - /* * expand_inherited_rtentry() always processes a parent before any of ! * that parent's children, so the parent_root for this relation should ! * already be available. */ ! 
parent_root = parent_roots[appinfo->parent_relid]; ! Assert(parent_root != NULL); ! parent_parse = parent_root->parse; /* * We need a working copy of the PlannerInfo so that we can control * propagation of information back to the main copy. */ subroot = makeNode(PlannerInfo); ! memcpy(subroot, parent_root, sizeof(PlannerInfo)); /* * Generate modified query with this rel as target. We first apply --- 1269,1486 ---- } /* ! * Before generating the real per-child-relation plans, do a cycle of ! * planning as though the query were a SELECT. The objective here is to ! * find out which child relations need to be processed, using the same ! * expansion and pruning logic as for a SELECT. We'll then pull out the ! * RangeTblEntry-s generated for the child rels, and make use of the ! * AppendRelInfo entries for them to guide the real planning. (This is ! * rather inefficient; we could perhaps stop short of making a full Path ! * tree. But this whole function is inefficient and slated for ! * destruction, so let's not contort query_planner for that.) */ ! { ! PlannerInfo *subroot; ! ! /* ! * Flat-copy the PlannerInfo to prevent modification of the original. ! */ ! subroot = makeNode(PlannerInfo); ! memcpy(subroot, root, sizeof(PlannerInfo)); ! ! /* ! * Make a deep copy of the parsetree for this planning cycle to mess ! * around with, and change it to look like a SELECT. (Hack alert: the ! * target RTE still has updatedCols set if this is an UPDATE, so that ! * expand_partitioned_rtentry will correctly update ! * subroot->partColsUpdated.) ! */ ! subroot->parse = copyObject(root->parse); ! ! subroot->parse->commandType = CMD_SELECT; ! subroot->parse->resultRelation = 0; ! ! /* ! * Ensure the subroot has its own copy of the original ! * append_rel_list, since it'll be scribbled on. (Note that at this ! * point, the list only contains AppendRelInfos for flattened UNION ! * ALL subqueries.) ! */ ! subroot->append_rel_list = copyObject(root->append_rel_list); ! ! /* ! 
* Better make a private copy of the rowMarks, too. ! */ ! subroot->rowMarks = copyObject(root->rowMarks); ! ! /* There shouldn't be any OJ info to translate, as yet */ ! Assert(subroot->join_info_list == NIL); ! /* and we haven't created PlaceHolderInfos, either */ ! Assert(subroot->placeholder_list == NIL); ! ! /* Generate Path(s) for accessing this result relation */ ! grouping_planner(subroot, true, 0.0 /* retrieve all tuples */ ); ! ! /* Extract the info we need. */ ! select_rtable = subroot->parse->rtable; ! select_appinfos = subroot->append_rel_list; ! ! /* ! * We need to propagate partColsUpdated back, too. (The later ! * planning cycles will not set this because they won't run ! * expand_partitioned_rtentry for the UPDATE target.) ! */ ! root->partColsUpdated = subroot->partColsUpdated; ! } ! ! /*---------- ! * Since only one rangetable can exist in the final plan, we need to make ! * sure that it contains all the RTEs needed for any child plan. This is ! * complicated by the need to use separate subquery RTEs for each child. ! * We arrange the final rtable as follows: ! * 1. All original rtable entries (with their original RT indexes). ! * 2. All the relation RTEs generated for children of the target table. ! * 3. Subquery RTEs for children after the first. We need N * (K - 1) ! * RT slots for this, if there are N subqueries and K child tables. ! * 4. Additional RTEs generated during the child planning runs, such as ! * children of inheritable RTEs other than the target table. ! * We assume that each child planning run will create an identical set ! * of type-4 RTEs. ! * ! * So the next thing to do is append the type-2 RTEs (the target table's ! * children) to the original rtable. We look through select_appinfos ! * to find them. ! * ! * To identify which AppendRelInfos are relevant as we thumb through ! * select_appinfos, we need to look for both direct and indirect children ! * of top_parentRTindex, so we use a bitmap of known parent relids. ! 
* expand_inherited_rtentry() always processes a parent before any of that ! * parent's children, so we should see an intermediate parent before its ! * children. ! *---------- ! */ ! child_appinfos = NIL; ! old_child_rtis = NIL; ! new_child_rtis = NIL; ! parent_relids = bms_make_singleton(top_parentRTindex); ! foreach(lc, select_appinfos) ! { ! AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); ! RangeTblEntry *child_rte; ! ! /* append_rel_list contains all append rels; ignore others */ ! if (!bms_is_member(appinfo->parent_relid, parent_relids)) ! continue; ! ! /* remember relevant AppendRelInfos for use below */ ! child_appinfos = lappend(child_appinfos, appinfo); ! ! /* extract RTE for this child rel */ ! child_rte = rt_fetch(appinfo->child_relid, select_rtable); ! ! /* and append it to the original rtable */ ! parse->rtable = lappend(parse->rtable, child_rte); ! ! /* remember child's index in the SELECT rtable */ ! old_child_rtis = lappend_int(old_child_rtis, appinfo->child_relid); ! ! /* and its new index in the final rtable */ ! new_child_rtis = lappend_int(new_child_rtis, list_length(parse->rtable)); ! ! /* if child is itself partitioned, update parent_relids */ ! if (child_rte->inh) ! { ! Assert(child_rte->relkind == RELKIND_PARTITIONED_TABLE); ! parent_relids = bms_add_member(parent_relids, appinfo->child_relid); ! } ! } ! ! /* ! * It's possible that the RTIs we just assigned for the child rels in the ! * final rtable are different from what they were in the SELECT query. ! * Adjust the AppendRelInfos so that they will correctly map RT indexes to ! * the final indexes. We can do this left-to-right since no child rel's ! * final RT index could be greater than what it had in the SELECT query. ! */ ! forboth(lc, old_child_rtis, lc2, new_child_rtis) ! { ! int old_child_rti = lfirst_int(lc); ! int new_child_rti = lfirst_int(lc2); ! ! if (old_child_rti == new_child_rti) ! continue; /* nothing to do */ ! ! Assert(old_child_rti > new_child_rti); ! ! 
ChangeVarNodes((Node *) child_appinfos, ! old_child_rti, new_child_rti, 0); ! } ! ! /* ! * Now set up rangetable entries for subqueries for additional children ! * (the first child will just use the original ones). These all have to ! * look more or less real, or EXPLAIN will get unhappy; so we just make ! * them all clones of the original subqueries. ! */ ! next_subquery_rti = list_length(parse->rtable) + 1; ! if (subqueryRTindexes != NULL) ! { ! int n_children = list_length(child_appinfos); ! ! while (n_children-- > 1) ! { ! int oldrti = -1; ! ! while ((oldrti = bms_next_member(subqueryRTindexes, oldrti)) >= 0) ! { ! RangeTblEntry *subqrte; ! ! subqrte = rt_fetch(oldrti, parse->rtable); ! parse->rtable = lappend(parse->rtable, copyObject(subqrte)); ! } ! } ! } ! ! /* ! * The query for each child is obtained by translating the query for its ! * immediate parent, since the AppendRelInfo data we have shows deltas ! * between parents and children. We use the parent_parses array to ! * remember the appropriate query trees. This is indexed by parent relid. ! * Since the maximum number of parents is limited by the number of RTEs in ! * the SELECT query, we use that number to allocate the array. An extra ! * entry is needed since relids start from 1. ! */ ! parent_parses = (Query **) palloc0((list_length(select_rtable) + 1) * ! sizeof(Query *)); ! parent_parses[top_parentRTindex] = parse; /* * And now we can get on with generating a plan for each child table. */ ! foreach(lc, child_appinfos) { AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc); + Index this_subquery_rti = next_subquery_rti; + Query *parent_parse; PlannerInfo *subroot; RangeTblEntry *child_rte; RelOptInfo *sub_final_rel; Path *subpath; /* * expand_inherited_rtentry() always processes a parent before any of ! * that parent's children, so the parent query for this relation ! * should already be available. */ ! parent_parse = parent_parses[appinfo->parent_relid]; ! 
Assert(parent_parse != NULL); /* * We need a working copy of the PlannerInfo so that we can control * propagation of information back to the main copy. */ subroot = makeNode(PlannerInfo); ! memcpy(subroot, root, sizeof(PlannerInfo)); /* * Generate modified query with this rel as target. We first apply *************** inheritance_planner(PlannerInfo *root) *** 1324,1330 **** * then fool around with subquery RTEs. */ subroot->parse = (Query *) ! adjust_appendrel_attrs(parent_root, (Node *) parent_parse, 1, &appinfo); --- 1489,1495 ---- * then fool around with subquery RTEs. */ subroot->parse = (Query *) ! adjust_appendrel_attrs(subroot, (Node *) parent_parse, 1, &appinfo); *************** inheritance_planner(PlannerInfo *root) *** 1360,1368 **** if (child_rte->inh) { Assert(child_rte->relkind == RELKIND_PARTITIONED_TABLE); ! parent_relids = bms_add_member(parent_relids, appinfo->child_relid); ! parent_roots[appinfo->child_relid] = subroot; ! continue; } --- 1525,1531 ---- if (child_rte->inh) { Assert(child_rte->relkind == RELKIND_PARTITIONED_TABLE); ! parent_parses[appinfo->child_relid] = subroot->parse; continue; } *************** inheritance_planner(PlannerInfo *root) *** 1383,1490 **** * is used elsewhere in the plan, so using the original parent RTE * would give rise to confusing use of multiple aliases in EXPLAIN * output for what the user will think is the "same" table. OTOH, ! * it's not a problem in the partitioned inheritance case, because the ! * duplicate child RTE added for the parent does not appear anywhere ! * else in the plan tree. */ if (nominalRelation < 0) nominalRelation = appinfo->child_relid; /* ! * The rowMarks list might contain references to subquery RTEs, so ! * make a copy that we can apply ChangeVarNodes to. (Fortunately, the ! * executor doesn't need to see the modified copies --- we can just ! * pass it the original rowMarks list.) ! */ ! subroot->rowMarks = copyObject(parent_root->rowMarks); ! ! /* ! 
* The append_rel_list likewise might contain references to subquery ! * RTEs (if any subqueries were flattenable UNION ALLs). So prepare ! * to apply ChangeVarNodes to that, too. As explained above, we only ! * want to copy items that actually contain such references; the rest ! * can just get linked into the subroot's append_rel_list. ! * ! * If we know there are no such references, we can just use the outer ! * append_rel_list unmodified. ! */ ! if (modifiableARIindexes != NULL) ! { ! ListCell *lc2; ! ! subroot->append_rel_list = NIL; ! foreach(lc2, parent_root->append_rel_list) ! { ! AppendRelInfo *appinfo2 = lfirst_node(AppendRelInfo, lc2); ! ! if (bms_is_member(appinfo2->child_relid, modifiableARIindexes)) ! appinfo2 = copyObject(appinfo2); ! ! subroot->append_rel_list = lappend(subroot->append_rel_list, ! appinfo2); ! } ! } ! ! /* ! * Add placeholders to the child Query's rangetable list to fill the ! * RT indexes already reserved for subqueries in previous children. ! * These won't be referenced, so there's no need to make them very ! * valid-looking. */ ! while (list_length(subroot->parse->rtable) < list_length(final_rtable)) ! subroot->parse->rtable = lappend(subroot->parse->rtable, ! makeNode(RangeTblEntry)); /* ! * If this isn't the first child Query, generate duplicates of all ! * subquery RTEs, and adjust Var numbering to reference the ! * duplicates. To simplify the loop logic, we scan the original rtable ! * not the copy just made by adjust_appendrel_attrs; that should be OK ! * since subquery RTEs couldn't contain any references to the target ! * rel. */ if (final_rtable != NIL && subqueryRTindexes != NULL) { ! ListCell *lr; ! rti = 1; ! foreach(lr, parent_parse->rtable) { ! RangeTblEntry *rte = lfirst_node(RangeTblEntry, lr); ! ! if (bms_is_member(rti, subqueryRTindexes)) ! { ! Index newrti; ! ! /* ! * The RTE can't contain any references to its own RT ! * index, except in its securityQuals, so we can save a ! 
* few cycles by applying ChangeVarNodes to the rest of ! * the rangetable before we append the RTE to it. ! */ ! newrti = list_length(subroot->parse->rtable) + 1; ! ChangeVarNodes((Node *) subroot->parse, rti, newrti, 0); ! ChangeVarNodes((Node *) subroot->rowMarks, rti, newrti, 0); ! /* Skip processing unchanging parts of append_rel_list */ ! if (modifiableARIindexes != NULL) ! { ! ListCell *lc2; ! ! foreach(lc2, subroot->append_rel_list) ! { ! AppendRelInfo *appinfo2 = lfirst_node(AppendRelInfo, lc2); ! if (bms_is_member(appinfo2->child_relid, ! modifiableARIindexes)) ! ChangeVarNodes((Node *) appinfo2, rti, newrti, 0); ! } ! } ! rte = copyObject(rte); ! ChangeVarNodes((Node *) rte->securityQuals, rti, newrti, 0); ! subroot->parse->rtable = lappend(subroot->parse->rtable, ! rte); ! } ! rti++; } } --- 1546,1583 ---- * is used elsewhere in the plan, so using the original parent RTE * would give rise to confusing use of multiple aliases in EXPLAIN * output for what the user will think is the "same" table. OTOH, ! * it's not a problem in the partitioned inheritance case, because ! * there is no duplicate RTE for the parent. */ if (nominalRelation < 0) nominalRelation = appinfo->child_relid; /* ! * As above, each child plan run needs its own append_rel_list and ! * rowmarks, which should start out as pristine copies of the ! * originals. There can't be any references to UPDATE/DELETE target ! * rels in them; but there could be subquery references, which we'll ! * fix up in a moment. */ ! subroot->append_rel_list = copyObject(root->append_rel_list); ! subroot->rowMarks = copyObject(root->rowMarks); /* ! * If this isn't the first child Query, adjust Vars and jointree ! * entries to reference the appropriate set of subquery RTEs. */ if (final_rtable != NIL && subqueryRTindexes != NULL) { ! int oldrti = -1; ! while ((oldrti = bms_next_member(subqueryRTindexes, oldrti)) >= 0) { ! Index newrti = next_subquery_rti++; ! 
ChangeVarNodes((Node *) subroot->parse, oldrti, newrti, 0); ! ChangeVarNodes((Node *) subroot->append_rel_list, ! oldrti, newrti, 0); ! ChangeVarNodes((Node *) subroot->rowMarks, oldrti, newrti, 0); } } *************** inheritance_planner(PlannerInfo *root) *** 1514,1535 **** /* * If this is the first non-excluded child, its post-planning rtable ! * becomes the initial contents of final_rtable; otherwise, append ! * just its modified subquery RTEs to final_rtable. */ if (final_rtable == NIL) final_rtable = subroot->parse->rtable; else ! final_rtable = list_concat(final_rtable, ! list_copy_tail(subroot->parse->rtable, ! list_length(final_rtable))); /* * We need to collect all the RelOptInfos from all child plans into * the main PlannerInfo, since setrefs.c will need them. We use the ! * last child's simple_rel_array (previous ones are too short), so we ! * have to propagate forward the RelOptInfos that were already built ! * in previous children. */ Assert(subroot->simple_rel_array_size >= save_rel_array_size); for (rti = 1; rti < save_rel_array_size; rti++) --- 1607,1649 ---- /* * If this is the first non-excluded child, its post-planning rtable ! * becomes the initial contents of final_rtable; otherwise, copy its ! * modified subquery RTEs into final_rtable, to ensure we have sane ! * copies of those. Also save the first non-excluded child's version ! * of the rowmarks list; we assume all children will end up with ! * equivalent versions of that. */ if (final_rtable == NIL) + { final_rtable = subroot->parse->rtable; + final_rowmarks = subroot->rowMarks; + } else ! { ! Assert(list_length(final_rtable) == ! list_length(subroot->parse->rtable)); ! if (subqueryRTindexes != NULL) ! { ! int oldrti = -1; ! ! while ((oldrti = bms_next_member(subqueryRTindexes, oldrti)) >= 0) ! { ! Index newrti = this_subquery_rti++; ! RangeTblEntry *subqrte; ! ListCell *newrticell; ! ! subqrte = rt_fetch(newrti, subroot->parse->rtable); ! 
newrticell = list_nth_cell(final_rtable, newrti - 1); ! lfirst(newrticell) = subqrte; ! } ! } ! } /* * We need to collect all the RelOptInfos from all child plans into * the main PlannerInfo, since setrefs.c will need them. We use the ! * last child's simple_rel_array, so we have to propagate forward the ! * RelOptInfos that were already built in previous children. */ Assert(subroot->simple_rel_array_size >= save_rel_array_size); for (rti = 1; rti < save_rel_array_size; rti++) *************** inheritance_planner(PlannerInfo *root) *** 1543,1549 **** save_rel_array = subroot->simple_rel_array; save_append_rel_array = subroot->append_rel_array; ! /* Make sure any initplans from this rel get into the outer list */ root->init_plans = subroot->init_plans; /* Build list of sub-paths */ --- 1657,1667 ---- save_rel_array = subroot->simple_rel_array; save_append_rel_array = subroot->append_rel_array; ! /* ! * Make sure any initplans from this rel get into the outer list. Note ! * we're effectively assuming all children generate the same ! * init_plans. ! */ root->init_plans = subroot->init_plans; /* Build list of sub-paths */ *************** inheritance_planner(PlannerInfo *root) *** 1626,1631 **** --- 1744,1752 ---- root->simple_rte_array[rti++] = rte; } + + /* Put back adjusted rowmarks, too */ + root->rowMarks = final_rowmarks; } /* *************** plan_create_index_workers(Oid tableOid, *** 6127,6135 **** /* * Build a minimal RTE. * ! * Set the target's table to be an inheritance parent. This is a kludge ! * that prevents problems within get_relation_info(), which does not ! * expect that any IndexOptInfo is currently undergoing REINDEX. */ rte = makeNode(RangeTblEntry); rte->rtekind = RTE_RELATION; --- 6248,6257 ---- /* * Build a minimal RTE. * ! * Mark the RTE with inh = true. This is a kludge to prevent ! * get_relation_info() from fetching index info, which is necessary ! * because it does not expect that any IndexOptInfo is currently ! * undergoing REINDEX. 
*/ rte = makeNode(RangeTblEntry); rte->rtekind = RTE_RELATION; diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c index 5392d1a..66e6ad9 100644 *** a/src/backend/optimizer/prep/preptlist.c --- b/src/backend/optimizer/prep/preptlist.c *************** preprocess_targetlist(PlannerInfo *root) *** 121,127 **** /* * Add necessary junk columns for rowmarked rels. These values are needed * for locking of rels selected FOR UPDATE/SHARE, and to do EvalPlanQual ! * rechecking. See comments for PlanRowMark in plannodes.h. */ foreach(lc, root->rowMarks) { --- 121,129 ---- /* * Add necessary junk columns for rowmarked rels. These values are needed * for locking of rels selected FOR UPDATE/SHARE, and to do EvalPlanQual ! * rechecking. See comments for PlanRowMark in plannodes.h. If you ! * change this stanza, see also expand_inherited_rtentry(), which has to ! * be able to add on junk columns equivalent to these. */ foreach(lc, root->rowMarks) { diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c index 1d1e506..6b5709a 100644 *** a/src/backend/optimizer/util/inherit.c --- b/src/backend/optimizer/util/inherit.c *************** *** 18,127 **** #include "access/table.h" #include "catalog/partition.h" #include "catalog/pg_inherits.h" #include "miscadmin.h" #include "optimizer/appendinfo.h" #include "optimizer/inherit.h" #include "optimizer/planner.h" #include "optimizer/prep.h" #include "partitioning/partdesc.h" #include "utils/rel.h" ! static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, ! Index rti); ! static void expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, ! PlanRowMark *top_parentrc, LOCKMODE lockmode, ! List **appinfos); static void expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, PlanRowMark *top_parentrc, Relation childrel, ! 
List **appinfos, RangeTblEntry **childrte_p, Index *childRTindex_p); static Bitmapset *translate_col_privs(const Bitmapset *parent_privs, List *translated_vars); /* - * expand_inherited_tables - * Expand each rangetable entry that represents an inheritance set - * into an "append relation". At the conclusion of this process, - * the "inh" flag is set in all and only those RTEs that are append - * relation parents. - */ - void - expand_inherited_tables(PlannerInfo *root) - { - Index nrtes; - Index rti; - ListCell *rl; - - /* - * expand_inherited_rtentry may add RTEs to parse->rtable. The function is - * expected to recursively handle any RTEs that it creates with inh=true. - * So just scan as far as the original end of the rtable list. - */ - nrtes = list_length(root->parse->rtable); - rl = list_head(root->parse->rtable); - for (rti = 1; rti <= nrtes; rti++) - { - RangeTblEntry *rte = (RangeTblEntry *) lfirst(rl); - - expand_inherited_rtentry(root, rte, rti); - rl = lnext(rl); - } - } - - /* * expand_inherited_rtentry ! * Check whether a rangetable entry represents an inheritance set. ! * If so, add entries for all the child tables to the query's ! * rangetable, and build AppendRelInfo nodes for all the child tables ! * and add them to root->append_rel_list. If not, clear the entry's ! * "inh" flag to prevent later code from looking for AppendRelInfos. * ! * Note that the original RTE is considered to represent the whole ! * inheritance set. The first of the generated RTEs is an RTE for the same ! * table, but with inh = false, to represent the parent table in its role ! * as a simple member of the inheritance set. * ! * A childless table is never considered to be an inheritance set. For ! * regular inheritance, a parent RTE must always have at least two associated ! * AppendRelInfos: one corresponding to the parent table as a simple member of ! * the inheritance set and one or more corresponding to the actual children. ! 
* (But a partitioned table might have only one associated AppendRelInfo, ! * since it's not itself scanned and hence doesn't need a second RTE to ! * represent itself as a member of the set.) */ ! static void ! expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) { Oid parentOID; - PlanRowMark *oldrc; Relation oldrelation; LOCKMODE lockmode; ! List *inhOIDs; ! ListCell *l; ! /* Does RT entry allow inheritance? */ ! if (!rte->inh) ! return; ! /* Ignore any already-expanded UNION ALL nodes */ ! if (rte->rtekind != RTE_RELATION) { ! Assert(rte->rtekind == RTE_SUBQUERY); return; } ! /* Fast path for common case of childless table */ parentOID = rte->relid; ! if (!has_subclass(parentOID)) ! { ! /* Clear flag before returning */ ! rte->inh = false; ! return; ! } /* * The rewriter should already have obtained an appropriate lock on each --- 18,107 ---- #include "access/table.h" #include "catalog/partition.h" #include "catalog/pg_inherits.h" + #include "catalog/pg_type.h" #include "miscadmin.h" + #include "nodes/makefuncs.h" #include "optimizer/appendinfo.h" #include "optimizer/inherit.h" + #include "optimizer/optimizer.h" + #include "optimizer/pathnode.h" + #include "optimizer/planmain.h" #include "optimizer/planner.h" #include "optimizer/prep.h" + #include "optimizer/restrictinfo.h" + #include "parser/parsetree.h" #include "partitioning/partdesc.h" + #include "partitioning/partprune.h" #include "utils/rel.h" ! static void expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, ! PlanRowMark *top_parentrc, LOCKMODE lockmode); static void expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, PlanRowMark *top_parentrc, Relation childrel, ! 
RangeTblEntry **childrte_p, Index *childRTindex_p); static Bitmapset *translate_col_privs(const Bitmapset *parent_privs, List *translated_vars); + static void expand_appendrel_subquery(PlannerInfo *root, RelOptInfo *rel, + RangeTblEntry *rte, Index rti); /* * expand_inherited_rtentry ! * Expand a rangetable entry that has the "inh" bit set. * ! * "inh" is only allowed in two cases: RELATION and SUBQUERY RTEs. * ! * "inh" on a plain RELATION RTE means that it is a partitioned table or the ! * parent of a traditional-inheritance set. In this case we must add entries ! * for all the interesting child tables to the query's rangetable, and build ! * additional planner data structures for them, including RelOptInfos, ! * AppendRelInfos, and possibly PlanRowMarks. ! * ! * Note that the original RTE is considered to represent the whole inheritance ! * set. In the case of traditional inheritance, the first of the generated ! * RTEs is an RTE for the same table, but with inh = false, to represent the ! * parent table in its role as a simple member of the inheritance set. For ! * partitioning, we don't need a second RTE because the partitioned table ! * itself has no data and need not be scanned. ! * ! * "inh" on a SUBQUERY RTE means that it's the parent of a UNION ALL group, ! * which is treated as an appendrel similarly to inheritance cases; however, ! * we already made RTEs and AppendRelInfos for the subqueries. We only need ! * to build RelOptInfos for them, which is done by expand_appendrel_subquery. */ ! void ! expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel, ! RangeTblEntry *rte, Index rti) { Oid parentOID; Relation oldrelation; LOCKMODE lockmode; ! PlanRowMark *oldrc; ! bool old_isParent = false; ! int old_allMarkTypes = 0; ! Assert(rte->inh); /* else caller error */ ! ! if (rte->rtekind == RTE_SUBQUERY) { ! expand_appendrel_subquery(root, rel, rte, rti); return; } ! ! Assert(rte->rtekind == RTE_RELATION); ! parentOID = rte->relid; ! ! /* ! 
* We used to check has_subclass() here, but there's no longer any need ! * to, because subquery_planner already did. ! */ /* * The rewriter should already have obtained an appropriate lock on each *************** expand_inherited_rtentry(PlannerInfo *ro *** 141,147 **** --- 121,132 ---- */ oldrc = get_plan_rowmark(root->rowMarks, rti); if (oldrc) + { + old_isParent = oldrc->isParent; oldrc->isParent = true; + /* Save initial value of allMarkTypes before children add to it */ + old_allMarkTypes = oldrc->allMarkTypes; + } /* Scan the inheritance set and expand it */ if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) *************** expand_inherited_rtentry(PlannerInfo *ro *** 151,167 **** */ Assert(rte->relkind == RELKIND_PARTITIONED_TABLE); - if (root->glob->partition_directory == NULL) - root->glob->partition_directory = - CreatePartitionDirectory(CurrentMemoryContext); - /* ! * If this table has partitions, recursively expand and lock them. ! * While at it, also extract the partition key columns of all the ! * partitioned tables. */ ! expand_partitioned_rtentry(root, rte, rti, oldrelation, oldrc, ! lockmode, &root->append_rel_list); } else { --- 136,147 ---- */ Assert(rte->relkind == RELKIND_PARTITIONED_TABLE); /* ! * Recursively expand and lock the partitions. While at it, also ! * extract the partition key columns of all the partitioned tables. */ ! expand_partitioned_rtentry(root, rel, rte, rti, ! oldrelation, oldrc, lockmode); } else { *************** expand_inherited_rtentry(PlannerInfo *ro *** 170,194 **** * that partitioned tables are not allowed to have inheritance * children, so it's not possible for both cases to apply.) */ ! List *appinfos = NIL; ! RangeTblEntry *childrte; ! Index childRTindex; /* Scan for all members of inheritance set, acquire needed locks */ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL); /* ! * Check that there's at least one descendant, else treat as no-child ! * case. 
This could happen despite above has_subclass() check, if the ! * table once had a child but no longer does. */ ! if (list_length(inhOIDs) < 2) ! { ! /* Clear flag before returning */ ! rte->inh = false; ! heap_close(oldrelation, NoLock); ! return; ! } /* * Expand inheritance children in the order the OIDs were returned by --- 150,174 ---- * that partitioned tables are not allowed to have inheritance * children, so it's not possible for both cases to apply.) */ ! List *inhOIDs; ! ListCell *l; /* Scan for all members of inheritance set, acquire needed locks */ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL); /* ! * We used to special-case the situation where the table no longer has ! * any children, by clearing rte->inh and exiting. That no longer ! * works, because this function doesn't get run until after decisions ! * have been made that depend on rte->inh. We have to treat such ! * situations as normal inheritance. The table itself should always ! * have been found, though. */ ! Assert(inhOIDs != NIL); ! Assert(linitial_oid(inhOIDs) == parentOID); ! ! /* Expand simple_rel_array and friends to hold child objects. */ ! expand_planner_arrays(root, list_length(inhOIDs)); /* * Expand inheritance children in the order the OIDs were returned by *************** expand_inherited_rtentry(PlannerInfo *ro *** 198,203 **** --- 178,185 ---- { Oid childOID = lfirst_oid(l); Relation newrelation; + RangeTblEntry *childrte; + Index childRTindex; /* Open rel if needed; we already have required locks */ if (childOID != parentOID) *************** expand_inherited_rtentry(PlannerInfo *ro *** 217,245 **** continue; } ! expand_single_inheritance_child(root, rte, rti, oldrelation, oldrc, ! newrelation, ! &appinfos, &childrte, ! &childRTindex); /* Close child relations, but keep locks */ if (childOID != parentOID) table_close(newrelation, NoLock); } /* ! * If all the children were temp tables, pretend it's a ! * non-inheritance situation; we don't need Append node in that case. ! 
* The duplicate RTE we added for the parent table is harmless, so we ! * don't bother to get rid of it; ditto for the useless PlanRowMark ! * node. */ ! if (list_length(appinfos) < 2) ! rte->inh = false; ! else ! root->append_rel_list = list_concat(root->append_rel_list, ! appinfos); ! } table_close(oldrelation, NoLock); --- 199,276 ---- continue; } ! /* Create RTE and AppendRelInfo, plus PlanRowMark if needed. */ ! expand_single_inheritance_child(root, rte, rti, oldrelation, ! oldrc, newrelation, ! &childrte, &childRTindex); ! ! /* Create the otherrel RelOptInfo too. */ ! (void) build_simple_rel(root, childRTindex, rel); /* Close child relations, but keep locks */ if (childOID != parentOID) table_close(newrelation, NoLock); } + } + + /* + * Some children might require different mark types, which would've been + * reported into oldrc. If so, add relevant entries to the top-level + * targetlist and update parent rel's reltarget. This should match what + * preprocess_targetlist() would have added if the mark types had been + * requested originally. 
+ */ + if (oldrc) + { + int new_allMarkTypes = oldrc->allMarkTypes; + Var *var; + TargetEntry *tle; + char resname[32]; + List *newvars = NIL; + + /* The old PlanRowMark should already have necessitated adding TID */ + Assert(old_allMarkTypes & ~(1 << ROW_MARK_COPY)); + + /* Add whole-row junk Var if needed, unless we had it already */ + if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) && + !(old_allMarkTypes & (1 << ROW_MARK_COPY))) + { + var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root), + oldrc->rti, + 0, + false); + snprintf(resname, sizeof(resname), "wholerow%u", oldrc->rowmarkId); + tle = makeTargetEntry((Expr *) var, + list_length(root->processed_tlist) + 1, + pstrdup(resname), + true); + root->processed_tlist = lappend(root->processed_tlist, tle); + newvars = lappend(newvars, var); + } + + /* Add tableoid junk Var, unless we had it already */ + if (!old_isParent) + { + var = makeVar(oldrc->rti, + TableOidAttributeNumber, + OIDOID, + -1, + InvalidOid, + 0); + snprintf(resname, sizeof(resname), "tableoid%u", oldrc->rowmarkId); + tle = makeTargetEntry((Expr *) var, + list_length(root->processed_tlist) + 1, + pstrdup(resname), + true); + root->processed_tlist = lappend(root->processed_tlist, tle); + newvars = lappend(newvars, var); + } /* ! * Add the newly added Vars to parent's reltarget. We needn't worry ! * about the children's reltargets, they'll be made later. */ ! add_vars_to_targetlist(root, newvars, bms_make_singleton(0), false); } table_close(oldrelation, NoLock); *************** expand_inherited_rtentry(PlannerInfo *ro *** 250,275 **** * Recursively expand an RTE for a partitioned table. */ static void ! expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, ! PlanRowMark *top_parentrc, LOCKMODE lockmode, ! 
List **appinfos) { - int i; - RangeTblEntry *childrte; - Index childRTindex; PartitionDesc partdesc; partdesc = PartitionDirectoryLookup(root->glob->partition_directory, parentrel); - check_stack_depth(); - /* A partitioned table should always have a partition descriptor. */ Assert(partdesc); - Assert(parentrte->inh); - /* * Note down whether any partition key cols are being updated. Though it's * the root partitioned table's updatedCols we are interested in, we --- 281,306 ---- * Recursively expand an RTE for a partitioned table. */ static void ! expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo, ! RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, ! PlanRowMark *top_parentrc, LOCKMODE lockmode) { PartitionDesc partdesc; + Bitmapset *live_parts; + int num_live_parts; + int i; + + check_stack_depth(); + + Assert(parentrte->inh); partdesc = PartitionDirectoryLookup(root->glob->partition_directory, parentrel); /* A partitioned table should always have a partition descriptor. */ Assert(partdesc); /* * Note down whether any partition key cols are being updated. Though it's * the root partitioned table's updatedCols we are interested in, we *************** expand_partitioned_rtentry(PlannerInfo * *** 281,305 **** root->partColsUpdated = has_partition_attrs(parentrel, parentrte->updatedCols, NULL); ! /* ! * If the partitioned table has no partitions, treat this as the ! * non-inheritance case. ! */ if (partdesc->nparts == 0) - { - parentrte->inh = false; return; - } /* ! * Create a child RTE for each partition. Note that unlike traditional ! * inheritance, we don't need a child RTE for the partitioned table ! * itself, because it's not going to be scanned. */ ! for (i = 0; i < partdesc->nparts; i++) { Oid childOID = partdesc->oids[i]; Relation childrel; /* Open rel, acquiring required locks */ childrel = table_open(childOID, lockmode); --- 312,356 ---- root->partColsUpdated = has_partition_attrs(parentrel, parentrte->updatedCols, NULL); ! 
/* Nothing further to do here if there are no partitions. */ if (partdesc->nparts == 0) return; /* ! * Perform partition pruning using restriction clauses assigned to parent ! * relation. live_parts will contain PartitionDesc indexes of partitions ! * that survive pruning. Below, we will initialize child objects for the ! * surviving partitions. */ ! live_parts = prune_append_rel_partitions(relinfo); ! ! /* Expand simple_rel_array and friends to hold child objects. */ ! num_live_parts = bms_num_members(live_parts); ! if (num_live_parts > 0) ! expand_planner_arrays(root, num_live_parts); ! ! /* ! * We also store partition RelOptInfo pointers in the parent relation. ! * Since we're palloc0'ing, slots corresponding to pruned partitions will ! * contain NULL. ! */ ! Assert(relinfo->part_rels == NULL); ! relinfo->part_rels = (RelOptInfo **) ! palloc0(relinfo->nparts * sizeof(RelOptInfo *)); ! ! /* ! * Create a child RTE for each live partition. Note that unlike ! * traditional inheritance, we don't need a child RTE for the partitioned ! * table itself, because it's not going to be scanned. ! */ ! i = -1; ! while ((i = bms_next_member(live_parts, i)) >= 0) { Oid childOID = partdesc->oids[i]; Relation childrel; + RangeTblEntry *childrte; + Index childRTindex; + RelOptInfo *childrelinfo; /* Open rel, acquiring required locks */ childrel = table_open(childOID, lockmode); *************** expand_partitioned_rtentry(PlannerInfo * *** 312,326 **** if (RELATION_IS_OTHER_TEMP(childrel)) elog(ERROR, "temporary relation from another session found as partition"); expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel, top_parentrc, childrel, ! appinfos, &childrte, &childRTindex); /* If this child is itself partitioned, recurse */ if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ! expand_partitioned_rtentry(root, childrte, childRTindex, ! childrel, top_parentrc, lockmode, ! 
appinfos); /* Close child relation, but keep locks */ table_close(childrel, NoLock); --- 363,382 ---- if (RELATION_IS_OTHER_TEMP(childrel)) elog(ERROR, "temporary relation from another session found as partition"); + /* Create RTE and AppendRelInfo, plus PlanRowMark if needed. */ expand_single_inheritance_child(root, parentrte, parentRTindex, parentrel, top_parentrc, childrel, ! &childrte, &childRTindex); ! ! /* Create the otherrel RelOptInfo too. */ ! childrelinfo = build_simple_rel(root, childRTindex, relinfo); ! relinfo->part_rels[i] = childrelinfo; /* If this child is itself partitioned, recurse */ if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) ! expand_partitioned_rtentry(root, childrelinfo, ! childrte, childRTindex, ! childrel, top_parentrc, lockmode); /* Close child relation, but keep locks */ table_close(childrel, NoLock); *************** static void *** 351,357 **** expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, PlanRowMark *top_parentrc, Relation childrel, ! List **appinfos, RangeTblEntry **childrte_p, Index *childRTindex_p) { Query *parse = root->parse; --- 407,413 ---- expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte, Index parentRTindex, Relation parentrel, PlanRowMark *top_parentrc, Relation childrel, ! RangeTblEntry **childrte_p, Index *childRTindex_p) { Query *parse = root->parse; *************** expand_single_inheritance_child(PlannerI *** 363,370 **** /* * Build an RTE for the child, and attach to query's rangetable list. We ! * copy most fields of the parent's RTE, but replace relation OID and ! * relkind, and set inh = false. Also, set requiredPerms to zero since * all required permissions checks are done on the original RTE. 
Likewise, * set the child's securityQuals to empty, because we only want to apply * the parent's RLS conditions regardless of what RLS properties --- 419,426 ---- /* * Build an RTE for the child, and attach to query's rangetable list. We ! * copy most fields of the parent's RTE, but replace relation OID, ! * relkind, and inh for the child. Also, set requiredPerms to zero since * all required permissions checks are done on the original RTE. Likewise, * set the child's securityQuals to empty, because we only want to apply * the parent's RLS conditions regardless of what RLS properties *************** expand_single_inheritance_child(PlannerI *** 396,402 **** */ appinfo = make_append_rel_info(parentrel, childrel, parentRTindex, childRTindex); ! *appinfos = lappend(*appinfos, appinfo); /* * Translate the column permissions bitmaps to the child's attnums (we --- 452,458 ---- */ appinfo = make_append_rel_info(parentrel, childrel, parentRTindex, childRTindex); ! root->append_rel_list = lappend(root->append_rel_list, appinfo); /* * Translate the column permissions bitmaps to the child's attnums (we *************** expand_single_inheritance_child(PlannerI *** 418,423 **** --- 474,489 ---- } /* + * Store the RTE and appinfo in the respective PlannerInfo arrays, which + * the caller must already have allocated space for. + */ + Assert(childRTindex < root->simple_rel_array_size); + Assert(root->simple_rte_array[childRTindex] == NULL); + root->simple_rte_array[childRTindex] = childrte; + Assert(root->append_rel_array[childRTindex] == NULL); + root->append_rel_array[childRTindex] = appinfo; + + /* * Build a PlanRowMark if parent is marked FOR UPDATE/SHARE. */ if (top_parentrc) *************** expand_single_inheritance_child(PlannerI *** 437,443 **** /* * We mark RowMarks for partitioned child tables as parent RowMarks so * that the executor ignores them (except their existence means that ! * the child tables be locked using appropriate mode). 
*/ childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE); --- 503,509 ---- /* * We mark RowMarks for partitioned child tables as parent RowMarks so * that the executor ignores them (except their existence means that ! * the child tables will be locked using the appropriate mode). */ childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE); *************** translate_col_privs(const Bitmapset *par *** 499,501 **** --- 565,733 ---- return child_privs; } + + /* + * expand_appendrel_subquery + * Add "other rel" RelOptInfos for the children of an appendrel baserel + * + * "rel" is a subquery relation that has the rte->inh flag set, meaning it + * is a UNION ALL subquery that's been flattened into an appendrel, with + * child subqueries listed in root->append_rel_list. We need to build + * a RelOptInfo for each child relation so that we can plan scans on them. + */ + static void + expand_appendrel_subquery(PlannerInfo *root, RelOptInfo *rel, + RangeTblEntry *rte, Index rti) + { + ListCell *l; + + foreach(l, root->append_rel_list) + { + AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); + Index childRTindex = appinfo->child_relid; + RangeTblEntry *childrte; + RelOptInfo *childrel; + + /* append_rel_list contains all append rels; ignore others */ + if (appinfo->parent_relid != rti) + continue; + + /* find the child RTE, which should already exist */ + Assert(childRTindex < root->simple_rel_array_size); + childrte = root->simple_rte_array[childRTindex]; + Assert(childrte != NULL); + + /* Build the child RelOptInfo. */ + childrel = build_simple_rel(root, childRTindex, rel); + + /* Child may itself be an inherited rel, either table or subquery. */ + if (childrte->inh) + expand_inherited_rtentry(root, childrel, childrte, childRTindex); + } + } + + + /* + * apply_child_basequals + * Populate childrel's base restriction quals from parent rel's quals, + * translating them using appinfo. 
+ * + * If any of the resulting clauses evaluate to constant false or NULL, we + * return false and don't apply any quals. Caller should mark the relation as + * a dummy rel in this case, since it doesn't need to be scanned. + */ + bool + apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel, + RelOptInfo *childrel, RangeTblEntry *childRTE, + AppendRelInfo *appinfo) + { + List *childquals; + Index cq_min_security; + ListCell *lc; + + /* + * The child rel's targetlist might contain non-Var expressions, which + * means that substitution into the quals could produce opportunities for + * const-simplification, and perhaps even pseudoconstant quals. Therefore, + * transform each RestrictInfo separately to see if it reduces to a + * constant or pseudoconstant. (We must process them separately to keep + * track of the security level of each qual.) + */ + childquals = NIL; + cq_min_security = UINT_MAX; + foreach(lc, parentrel->baserestrictinfo) + { + RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc); + Node *childqual; + ListCell *lc2; + + Assert(IsA(rinfo, RestrictInfo)); + childqual = adjust_appendrel_attrs(root, + (Node *) rinfo->clause, + 1, &appinfo); + childqual = eval_const_expressions(root, childqual); + /* check for flat-out constant */ + if (childqual && IsA(childqual, Const)) + { + if (((Const *) childqual)->constisnull || + !DatumGetBool(((Const *) childqual)->constvalue)) + { + /* Restriction reduces to constant FALSE or NULL */ + return false; + } + /* Restriction reduces to constant TRUE, so drop it */ + continue; + } + /* might have gotten an AND clause, if so flatten it */ + foreach(lc2, make_ands_implicit((Expr *) childqual)) + { + Node *onecq = (Node *) lfirst(lc2); + bool pseudoconstant; + + /* check for pseudoconstant (no Vars or volatile functions) */ + pseudoconstant = + !contain_vars_of_level(onecq, 0) && + !contain_volatile_functions(onecq); + if (pseudoconstant) + { + /* tell createplan.c to check for gating quals */ + 
root->hasPseudoConstantQuals = true; + } + /* reconstitute RestrictInfo with appropriate properties */ + childquals = lappend(childquals, + make_restrictinfo((Expr *) onecq, + rinfo->is_pushed_down, + rinfo->outerjoin_delayed, + pseudoconstant, + rinfo->security_level, + NULL, NULL, NULL)); + /* track minimum security level among child quals */ + cq_min_security = Min(cq_min_security, rinfo->security_level); + } + } + + /* + * In addition to the quals inherited from the parent, we might have + * securityQuals associated with this particular child node. (Currently + * this can only happen in appendrels originating from UNION ALL; + * inheritance child tables don't have their own securityQuals, see + * expand_single_inheritance_child().) Pull any such securityQuals up + * into the baserestrictinfo for the child. This is similar to + * process_security_barrier_quals() for the parent rel, except that we + * can't make any general deductions from such quals, since they don't + * hold for the whole appendrel. + */ + if (childRTE->securityQuals) + { + Index security_level = 0; + + foreach(lc, childRTE->securityQuals) + { + List *qualset = (List *) lfirst(lc); + ListCell *lc2; + + foreach(lc2, qualset) + { + Expr *qual = (Expr *) lfirst(lc2); + + /* not likely that we'd see constants here, so no check */ + childquals = lappend(childquals, + make_restrictinfo(qual, + true, false, false, + security_level, + NULL, NULL, NULL)); + cq_min_security = Min(cq_min_security, security_level); + } + security_level++; + } + Assert(security_level <= root->qual_security_level); + } + + /* + * OK, we've got all the baserestrictinfo quals for this child. 
+ */ + childrel->baserestrictinfo = childquals; + childrel->baserestrict_min_security = cq_min_security; + + return true; + } diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index a0a7c54..2946e50 100644 *** a/src/backend/optimizer/util/plancat.c --- b/src/backend/optimizer/util/plancat.c *************** set_relation_partition_info(PlannerInfo *** 2094,2100 **** { PartitionDesc partdesc; ! Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); partdesc = PartitionDirectoryLookup(root->glob->partition_directory, relation); --- 2094,2103 ---- { PartitionDesc partdesc; ! /* Create the PartitionDirectory infrastructure if we didn't already */ ! if (root->glob->partition_directory == NULL) ! root->glob->partition_directory = ! CreatePartitionDirectory(CurrentMemoryContext); partdesc = PartitionDirectoryLookup(root->glob->partition_directory, relation); diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index 0d40b8d..f86f39c 100644 *** a/src/backend/optimizer/util/relnode.c --- b/src/backend/optimizer/util/relnode.c *************** *** 20,30 **** #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" #include "optimizer/placeholder.h" #include "optimizer/plancat.h" - #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" #include "optimizer/tlist.h" #include "partitioning/partbounds.h" --- 20,30 ---- #include "optimizer/appendinfo.h" #include "optimizer/clauses.h" #include "optimizer/cost.h" + #include "optimizer/inherit.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" #include "optimizer/placeholder.h" #include "optimizer/plancat.h" #include "optimizer/restrictinfo.h" #include "optimizer/tlist.h" #include "partitioning/partbounds.h" *************** setup_append_rel_array(PlannerInfo *root *** 132,137 **** --- 132,180 ---- } /* + * expand_planner_arrays + * 
Expand the PlannerInfo's per-RTE arrays by add_size members + * and initialize the newly added entries to NULLs + */ + void + expand_planner_arrays(PlannerInfo *root, int add_size) + { + int new_size; + + Assert(add_size > 0); + + new_size = root->simple_rel_array_size + add_size; + + root->simple_rte_array = (RangeTblEntry **) + repalloc(root->simple_rte_array, + sizeof(RangeTblEntry *) * new_size); + MemSet(root->simple_rte_array + root->simple_rel_array_size, + 0, sizeof(RangeTblEntry *) * add_size); + + root->simple_rel_array = (RelOptInfo **) + repalloc(root->simple_rel_array, + sizeof(RelOptInfo *) * new_size); + MemSet(root->simple_rel_array + root->simple_rel_array_size, + 0, sizeof(RelOptInfo *) * add_size); + + if (root->append_rel_array) + { + root->append_rel_array = (AppendRelInfo **) + repalloc(root->append_rel_array, + sizeof(AppendRelInfo *) * new_size); + MemSet(root->append_rel_array + root->simple_rel_array_size, + 0, sizeof(AppendRelInfo *) * add_size); + } + else + { + root->append_rel_array = (AppendRelInfo **) + palloc0(sizeof(AppendRelInfo *) * new_size); + } + + root->simple_rel_array_size = new_size; + } + + /* * build_simple_rel * Construct a new RelOptInfo for a base relation or 'other' relation. */ *************** build_simple_rel(PlannerInfo *root, int *** 281,373 **** break; } - /* Save the finished struct in the query's simple_rel_array */ - root->simple_rel_array[relid] = rel; - /* * This is a convenient spot at which to note whether rels participating * in the query have any securityQuals attached. If so, increase * root->qual_security_level to ensure it's larger than the maximum ! * security level needed for securityQuals. 
*/ if (rte->securityQuals) root->qual_security_level = Max(root->qual_security_level, list_length(rte->securityQuals)); - return rel; - } - - /* - * add_appendrel_other_rels - * Add "other rel" RelOptInfos for the children of an appendrel baserel - * - * "rel" is a relation that (still) has the rte->inh flag set, meaning it - * has appendrel children listed in root->append_rel_list. We need to build - * a RelOptInfo for each child relation so that we can plan scans on them. - * (The parent relation might be a partitioned table, a table with - * traditional inheritance children, or a flattened UNION ALL subquery.) - */ - void - add_appendrel_other_rels(PlannerInfo *root, RelOptInfo *rel, Index rti) - { - int cnt_parts = 0; - ListCell *l; - /* ! * If rel is a partitioned table, then we also need to build a part_rels ! * array so that the child RelOptInfos can be conveniently accessed from ! * the parent. */ ! if (rel->part_scheme != NULL) ! { ! Assert(rel->nparts > 0); ! rel->part_rels = (RelOptInfo **) ! palloc0(sizeof(RelOptInfo *) * rel->nparts); ! } ! ! foreach(l, root->append_rel_list) { ! AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); ! Index childRTindex = appinfo->child_relid; ! RangeTblEntry *childrte; ! RelOptInfo *childrel; ! ! /* append_rel_list contains all append rels; ignore others */ ! if (appinfo->parent_relid != rti) ! continue; ! ! /* find the child RTE, which should already exist */ ! Assert(childRTindex < root->simple_rel_array_size); ! childrte = root->simple_rte_array[childRTindex]; ! Assert(childrte != NULL); ! ! /* build child RelOptInfo, and add to main query data structures */ ! childrel = build_simple_rel(root, childRTindex, rel); ! ! /* ! * If rel is a partitioned table, fill in the part_rels array. The ! * order in which child tables appear in append_rel_list is the same ! * as the order in which they appear in the parent's PartitionDesc, so ! * assigning partitions like this works. ! */ ! if (rel->part_scheme != NULL) ! { ! 
Assert(cnt_parts < rel->nparts); ! rel->part_rels[cnt_parts++] = childrel; ! } ! /* Child may itself be an inherited relation. */ ! if (childrte->inh) { ! /* Only relation and subquery RTEs can have children. */ ! Assert(childrte->rtekind == RTE_RELATION || ! childrte->rtekind == RTE_SUBQUERY); ! add_appendrel_other_rels(root, childrel, childRTindex); } } ! /* We should have filled all of the part_rels array if it's partitioned */ ! Assert(cnt_parts == rel->nparts); } /* --- 324,365 ---- break; } /* * This is a convenient spot at which to note whether rels participating * in the query have any securityQuals attached. If so, increase * root->qual_security_level to ensure it's larger than the maximum ! * security level needed for securityQuals. (Must do this before we call ! * apply_child_basequals, else we'll hit an Assert therein.) */ if (rte->securityQuals) root->qual_security_level = Max(root->qual_security_level, list_length(rte->securityQuals)); /* ! * Copy the parent's quals to the child, with appropriate substitution of ! * variables. If any constant false or NULL clauses turn up, we can mark ! * the child as dummy right away. (We must do this immediately so that ! * pruning works correctly when recursing in expand_partitioned_rtentry.) */ ! if (parent) { ! AppendRelInfo *appinfo = root->append_rel_array[relid]; ! Assert(appinfo != NULL); ! if (!apply_child_basequals(root, parent, rel, rte, appinfo)) { ! /* ! * Some restriction clause reduced to constant FALSE or NULL after ! * substitution, so this child need not be scanned. ! */ ! mark_dummy_rel(rel); } } ! /* Save the finished struct in the query's simple_rel_array */ ! root->simple_rel_array[relid] = rel; ! ! 
return rel; } /* diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index af3f911..aecea82 100644 *** a/src/backend/partitioning/partprune.c --- b/src/backend/partitioning/partprune.c *************** *** 45,50 **** --- 45,51 ---- #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" #include "optimizer/appendinfo.h" + #include "optimizer/cost.h" #include "optimizer/optimizer.h" #include "optimizer/pathnode.h" #include "parser/parsetree.h" *************** make_partitionedrel_pruneinfo(PlannerInf *** 474,491 **** * is, not pruned already). */ subplan_map = (int *) palloc(nparts * sizeof(int)); subpart_map = (int *) palloc(nparts * sizeof(int)); ! relid_map = (Oid *) palloc(nparts * sizeof(Oid)); present_parts = NULL; for (i = 0; i < nparts; i++) { RelOptInfo *partrel = subpart->part_rels[i]; ! int subplanidx = relid_subplan_map[partrel->relid] - 1; ! int subpartidx = relid_subpart_map[partrel->relid] - 1; ! subplan_map[i] = subplanidx; ! subpart_map[i] = subpartidx; relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid; if (subplanidx >= 0) { --- 475,498 ---- * is, not pruned already). */ subplan_map = (int *) palloc(nparts * sizeof(int)); + memset(subplan_map, -1, nparts * sizeof(int)); subpart_map = (int *) palloc(nparts * sizeof(int)); ! memset(subpart_map, -1, nparts * sizeof(Oid)); ! relid_map = (Oid *) palloc0(nparts * sizeof(Oid)); present_parts = NULL; for (i = 0; i < nparts; i++) { RelOptInfo *partrel = subpart->part_rels[i]; ! int subplanidx; ! int subpartidx; ! /* Skip processing pruned partitions. */ ! if (partrel == NULL) ! continue; ! ! subplan_map[i] = subplanidx = relid_subplan_map[partrel->relid] - 1; ! subpart_map[i] = subpartidx = relid_subpart_map[partrel->relid] - 1; relid_map[i] = planner_rt_fetch(partrel->relid, root)->relid; if (subplanidx >= 0) { *************** gen_partprune_steps(RelOptInfo *rel, Lis *** 567,589 **** /* * prune_append_rel_partitions ! 
* Returns RT indexes of the minimum set of child partitions which must ! * be scanned to satisfy rel's baserestrictinfo quals. * * Callers must ensure that 'rel' is a partitioned table. */ ! Relids prune_append_rel_partitions(RelOptInfo *rel) { - Relids result; List *clauses = rel->baserestrictinfo; List *pruning_steps; bool contradictory; PartitionPruneContext context; - Bitmapset *partindexes; - int i; - Assert(clauses != NIL); Assert(rel->part_scheme != NULL); /* If there are no partitions, return the empty set */ --- 574,593 ---- /* * prune_append_rel_partitions ! * Returns indexes into rel->part_rels of the minimum set of child ! * partitions which must be scanned to satisfy rel's baserestrictinfo ! * quals. * * Callers must ensure that 'rel' is a partitioned table. */ ! Bitmapset * prune_append_rel_partitions(RelOptInfo *rel) { List *clauses = rel->baserestrictinfo; List *pruning_steps; bool contradictory; PartitionPruneContext context; Assert(rel->part_scheme != NULL); /* If there are no partitions, return the empty set */ *************** prune_append_rel_partitions(RelOptInfo * *** 591,596 **** --- 595,607 ---- return NULL; /* + * If pruning is disabled or if there are no clauses to prune with, return + * all partitions. + */ + if (!enable_partition_pruning || clauses == NIL) + return bms_add_range(NULL, 0, rel->nparts - 1); + + /* * Process clauses. If the clauses are found to be contradictory, we can * return the empty set. */ *************** prune_append_rel_partitions(RelOptInfo * *** 617,631 **** context.evalexecparams = false; /* Actual pruning happens here. */ ! partindexes = get_matching_partitions(&context, pruning_steps); ! ! /* Add selected partitions' RT indexes to result. */ ! i = -1; ! result = NULL; ! while ((i = bms_next_member(partindexes, i)) >= 0) ! result = bms_add_member(result, rel->part_rels[i]->relid); ! ! return result; } /* --- 628,634 ---- context.evalexecparams = false; /* Actual pruning happens here. */ ! 
return get_matching_partitions(&context, pruning_steps); } /* diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index 24740c3..1cce762 100644 *** a/src/include/nodes/plannodes.h --- b/src/include/nodes/plannodes.h *************** typedef struct PartitionPruneInfo *** 1103,1108 **** --- 1103,1109 ---- * it is -1 if the partition is a leaf or has been pruned. Note that subplan * indexes, as stored in 'subplan_map', are global across the parent plan * node, but partition indexes are valid only within a particular hierarchy. + * relid_map[p] contains the partition's OID, or 0 if the partition was pruned. */ typedef struct PartitionedRelPruneInfo { *************** typedef struct PartitionedRelPruneInfo *** 1115,1121 **** int nexprs; /* Length of hasexecparam[] */ int *subplan_map; /* subplan index by partition index, or -1 */ int *subpart_map; /* subpart index by partition index, or -1 */ ! Oid *relid_map; /* relation OID by partition index, or -1 */ bool *hasexecparam; /* true if corresponding pruning_step contains * any PARAM_EXEC Params. */ bool do_initial_prune; /* true if pruning should be performed --- 1116,1122 ---- int nexprs; /* Length of hasexecparam[] */ int *subplan_map; /* subplan index by partition index, or -1 */ int *subpart_map; /* subpart index by partition index, or -1 */ ! Oid *relid_map; /* relation OID by partition index, or 0 */ bool *hasexecparam; /* true if corresponding pruning_step contains * any PARAM_EXEC Params. */ bool do_initial_prune; /* true if pruning should be performed diff --git a/src/include/optimizer/inherit.h b/src/include/optimizer/inherit.h index d2418f1..02a23e5 100644 *** a/src/include/optimizer/inherit.h --- b/src/include/optimizer/inherit.h *************** *** 17,22 **** #include "nodes/pathnodes.h" ! extern void expand_inherited_tables(PlannerInfo *root); #endif /* INHERIT_H */ --- 17,27 ---- #include "nodes/pathnodes.h" ! extern void expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel, ! 
RangeTblEntry *rte, Index rti); ! ! extern bool apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel, ! RelOptInfo *childrel, RangeTblEntry *childRTE, ! AppendRelInfo *appinfo); #endif /* INHERIT_H */ diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h index 9e79e1c..3a803b3 100644 *** a/src/include/optimizer/pathnode.h --- b/src/include/optimizer/pathnode.h *************** extern Path *reparameterize_path_by_chil *** 277,286 **** */ extern void setup_simple_rel_arrays(PlannerInfo *root); extern void setup_append_rel_array(PlannerInfo *root); extern RelOptInfo *build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent); - extern void add_appendrel_other_rels(PlannerInfo *root, RelOptInfo *rel, - Index rti); extern RelOptInfo *find_base_rel(PlannerInfo *root, int relid); extern RelOptInfo *find_join_rel(PlannerInfo *root, Relids relids); extern RelOptInfo *build_join_rel(PlannerInfo *root, --- 277,285 ---- */ extern void setup_simple_rel_arrays(PlannerInfo *root); extern void setup_append_rel_array(PlannerInfo *root); + extern void expand_planner_arrays(PlannerInfo *root, int add_size); extern RelOptInfo *build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent); extern RelOptInfo *find_base_rel(PlannerInfo *root, int relid); extern RelOptInfo *find_join_rel(PlannerInfo *root, Relids relids); extern RelOptInfo *build_join_rel(PlannerInfo *root, diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out index 9783281..a7305fc 100644 *** a/src/test/regress/expected/partition_aggregate.out --- b/src/test/regress/expected/partition_aggregate.out *************** SELECT c, sum(a) FROM pagg_tab WHERE 1 = *** 144,150 **** QUERY PLAN -------------------------------- HashAggregate ! Group Key: pagg_tab.c -> Result One-Time Filter: false (4 rows) --- 144,150 ---- QUERY PLAN -------------------------------- HashAggregate ! 
Group Key: c -> Result One-Time Filter: false (4 rows) *************** SELECT c, sum(a) FROM pagg_tab WHERE c = *** 159,165 **** QUERY PLAN -------------------------------- GroupAggregate ! Group Key: pagg_tab.c -> Result One-Time Filter: false (4 rows) --- 159,165 ---- QUERY PLAN -------------------------------- GroupAggregate ! Group Key: c -> Result One-Time Filter: false (4 rows) diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out index 50ca03b..7806ba1 100644 *** a/src/test/regress/expected/partition_prune.out --- b/src/test/regress/expected/partition_prune.out *************** table ab; *** 2568,2573 **** --- 2568,2627 ---- 1 | 3 (1 row) + -- Test UPDATE where source relation has run-time pruning enabled + truncate ab; + insert into ab values (1, 1), (1, 2), (1, 3), (2, 1); + explain (analyze, costs off, summary off, timing off) + update ab_a1 set b = 3 from ab_a2 where ab_a2.b = (select 1); + QUERY PLAN + ---------------------------------------------------------------------- + Update on ab_a1 (actual rows=0 loops=1) + Update on ab_a1_b1 + Update on ab_a1_b2 + Update on ab_a1_b3 + InitPlan 1 (returns $0) + -> Result (actual rows=1 loops=1) + -> Nested Loop (actual rows=1 loops=1) + -> Seq Scan on ab_a1_b1 (actual rows=1 loops=1) + -> Materialize (actual rows=1 loops=1) + -> Append (actual rows=1 loops=1) + -> Seq Scan on ab_a2_b1 (actual rows=1 loops=1) + Filter: (b = $0) + -> Seq Scan on ab_a2_b2 (never executed) + Filter: (b = $0) + -> Seq Scan on ab_a2_b3 (never executed) + Filter: (b = $0) + -> Nested Loop (actual rows=1 loops=1) + -> Seq Scan on ab_a1_b2 (actual rows=1 loops=1) + -> Materialize (actual rows=1 loops=1) + -> Append (actual rows=1 loops=1) + -> Seq Scan on ab_a2_b1 (actual rows=1 loops=1) + Filter: (b = $0) + -> Seq Scan on ab_a2_b2 (never executed) + Filter: (b = $0) + -> Seq Scan on ab_a2_b3 (never executed) + Filter: (b = $0) + -> Nested Loop (actual rows=1 loops=1) + -> Seq Scan on 
ab_a1_b3 (actual rows=1 loops=1) + -> Materialize (actual rows=1 loops=1) + -> Append (actual rows=1 loops=1) + -> Seq Scan on ab_a2_b1 (actual rows=1 loops=1) + Filter: (b = $0) + -> Seq Scan on ab_a2_b2 (never executed) + Filter: (b = $0) + -> Seq Scan on ab_a2_b3 (never executed) + Filter: (b = $0) + (36 rows) + + select tableoid::regclass, * from ab; + tableoid | a | b + ----------+---+--- + ab_a1_b3 | 1 | 3 + ab_a1_b3 | 1 | 3 + ab_a1_b3 | 1 | 3 + ab_a2_b1 | 2 | 1 + (4 rows) + drop table ab, lprt_a; -- Join create table tbl1(col1 int); diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql index a5514c7..2e4d2b4 100644 *** a/src/test/regress/sql/partition_prune.sql --- b/src/test/regress/sql/partition_prune.sql *************** explain (analyze, costs off, summary off *** 588,593 **** --- 588,600 ---- update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a; table ab; + -- Test UPDATE where source relation has run-time pruning enabled + truncate ab; + insert into ab values (1, 1), (1, 2), (1, 3), (2, 1); + explain (analyze, costs off, summary off, timing off) + update ab_a1 set b = 3 from ab_a2 where ab_a2.b = (select 1); + select tableoid::regclass, * from ab; + drop table ab, lprt_a; -- Join
Thanks for the new patches.

On Sat, Mar 30, 2019 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> > On 2019/03/29 7:38, Tom Lane wrote:
> >> 2. I seriously dislike what's been done in joinrels.c, too. That
> >> really seems like a kluge (and I haven't had time to study it
> >> closely).
>
> > Those hunks account for the fact that pruned partitions, for which we no
> > longer create RangeTblEntry and RelOptInfo, may appear on the nullable
> > side of an outer join. We'll need a RelOptInfo holding a dummy path, so
> > that outer join paths can be created with one side of join being dummy
> > result path, which are built in the patch by build_dummy_partition_rel().
>
> Now that I've had a chance to look closer, there's no way I'm committing
> that change in joinrels.c. If it works at all, it's accidental, because
> it's breaking all sorts of data structure invariants. The business with
> an AppendRelInfo that maps from the parentrel to itself is particularly
> ugly; and I doubt that you can get away with assuming that
> root->append_rel_array[parent->relid] is available for use for that.
> (What if the parent is an intermediate partitioned table?)
>
> There's also the small problem of the GEQO crash. It's possible that
> that could be gotten around by switching into the long-term planner
> context in update_child_rel_info and build_dummy_partition_rel, but
> then you're creating a memory leak across GEQO cycles. It'd be much
> better to avoid touching base-relation data structures during join
> planning.
>
> What I propose we do about the GEQO problem is shown in 0001 attached
> (which would need to be back-patched into v11). This is based on the
> observation that, if we know an input relation is empty, we can often
> prove the join is empty and then skip building it at all.
> (In the existing partitionwise-join code, the same cases are detected by
> populate_joinrel_with_paths, but we do a fair amount of work before
> discovering that.) The cases where that's not true are where we
> have a pruned partition on the inside of a left join, or either side
> of a full join ... but frankly, what the existing code produces for
> those cases is not short of embarrassing:
>
>          ->  Hash Left Join
>                Hash Cond: (pagg_tab1_p1.x = y)
>                Filter: ((pagg_tab1_p1.x > 5) OR (y < 20))
>                ->  Seq Scan on pagg_tab1_p1
>                      Filter: (x < 20)
>                ->  Hash
>                      ->  Result
>                            One-Time Filter: false
>
> That's just dumb. What we *ought* to be doing in such degenerate
> outer-join cases is just emitting the non-dummy side, ie
>
>          ->  Seq Scan on pagg_tab1_p1
>                Filter: (x < 20) AND ((pagg_tab1_p1.x > 5) OR (y < 20))
>
> in this example. I would envision handling this by teaching the
> code to generate a path for the joinrel that's basically just a
> ProjectionPath atop a path for the non-dummy input rel, with the
> projection set up to emit nulls for the columns of the dummy side.
> (Note that this would be useful for outer joins against dummy rels
> in regular planning contexts, not only partitionwise joins.)
>
> Pending somebody doing the work for that, though, I do not
> have a problem with just being unable to generate partitionwise
> joins in such cases, so 0001 attached just changes the expected
> outputs for the affected regression test cases.

Fwiw, I agree that we should fix join planning so that we get the
ProjectionPath atop scan path of non-nullable relation instead of a
full-fledged join path with dummy path on the nullable side. It seems
to me that the "fix" would mostly be localized to
try_partitionwise_join() at least insofar as detecting whether we
should generate a join or the other plan shape is concerned, right?

By the way, does it make sense to remove the tests whose output
changes altogether and reintroduce them when we fix join planning?
Especially, in partitionwise_aggregate.out, there are comments near
changed plans which are no longer true.

> 0002 attached is then the rest of the partition-planning patch;
> it doesn't need to mess with joinrels.c at all. I've addressed
> the other points discussed today in that, except for the business
> about whether we want your 0003 bitmap-of-live-partitions patch.
> I'm still inclined to think that that's not really worth it,
> especially in view of your performance results.

I think the performance results did prove that degradation due to
those loops over part_rels becomes significant for very large
partition counts. Is there a better solution than the bitmapset that
you have in mind?

> If people are OK with this approach to solving the GEQO problem,
> I think these are committable.

Thanks again. Really appreciate that you are putting so much of your
time into this.

Regards,
Amit
Amit Langote <amitlangote09@gmail.com> writes:
> On Sat, Mar 30, 2019 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What I propose we do about the GEQO problem is shown in 0001 attached
>> (which would need to be back-patched into v11).
>> ...
>> That's just dumb. What we *ought* to be doing in such degenerate
>> outer-join cases is just emitting the non-dummy side, ie

> Fwiw, I agree that we should fix join planning so that we get the
> ProjectionPath atop scan path of non-nullable relation instead of a
> full-fledged join path with dummy path on the nullable side. It seems
> to me that the "fix" would mostly be localized to
> try_partitionwise_join() at least insofar as detecting whether we
> should generate a join or the other plan shape is concerned, right?

Well, if we're going to do something about that, I would like to see
it work for non-partition cases too, ie we're not smart about this
either:

regression=# explain select * from tenk1 left join (select 1 where false) ss(x) on unique1=x;
                            QUERY PLAN
-------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..570.00 rows=10000 width=248)
   Join Filter: (tenk1.unique1 = 1)
   ->  Seq Scan on tenk1  (cost=0.00..445.00 rows=10000 width=244)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
         One-Time Filter: false
(5 rows)

A general solution would presumably involve new logic in
populate_joinrel_with_paths for the case where the nullable side is
dummy. I'm not sure whether that leaves anything special to do in
try_partitionwise_join or not. Maybe it would, since that would have
a requirement to build the joinrel without any RHS input RelOptInfo,
but I don't think that's the place to begin working on this.

> By the way, does it make sense to remove the tests whose output
> changes altogether and reintroduce them when we fix join planning?
> Especially, in partitionwise_aggregate.out, there are comments near
> changed plans which are no longer true.
Good point about the comments, but we shouldn't just remove those test
cases; they're useful to exercise the give-up-on-partitionwise-join
code paths. I'll tweak the comments.

>> 0002 attached is then the rest of the partition-planning patch;
>> it doesn't need to mess with joinrels.c at all. I've addressed
>> the other points discussed today in that, except for the business
>> about whether we want your 0003 bitmap-of-live-partitions patch.
>> I'm still inclined to think that that's not really worth it,
>> especially in view of your performance results.

> I think the performance results did prove that degradation due to
> those loops over part_rels becomes significant for very large
> partition counts. Is there a better solution than the bitmapset that
> you have in mind?

Hm, I didn't see much degradation in what you posted in
<5c83dbca-12b5-1acf-0e85-58299e464a26@lab.ntt.co.jp>.
I am curious as to why there seems to be more degradation
for hash cases, as per Yoshikazu-san's results in
<0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
but whatever's accounting for the difference probably
is not that. Anyway I still believe that getting rid of
these sparse arrays would be a better answer.

Before that, though, I remain concerned that the PartitionPruneInfo
data structure the planner is transmitting to the executor is unsafe
against concurrent ATTACH PARTITION operations. The comment for
PartitionedRelPruneInfo says in so many words that it's relying on
indexes in the table's PartitionDesc; how is that not broken by
898e5e329?

regards, tom lane
On Sat, Mar 30, 2019 at 11:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Before that, though, I remain concerned that the PartitionPruneInfo
> data structure the planner is transmitting to the executor is unsafe
> against concurrent ATTACH PARTITION operations. The comment for
> PartitionedRelPruneInfo says in so many words that it's relying on
> indexes in the table's PartitionDesc; how is that not broken by
> 898e5e329?

The only problem with PartitionPruneInfo structures of which I am
aware is that they rely on PartitionDesc offsets not changing. But I
added code in that commit in ExecCreatePartitionPruneState to handle
that exact problem. See also paragraph 5 of the commit message, which
begins with "Although in general..."

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Mar 30, 2019 at 11:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Before that, though, I remain concerned that the PartitionPruneInfo
>> data structure the planner is transmitting to the executor is unsafe
>> against concurrent ATTACH PARTITION operations. The comment for
>> PartitionedRelPruneInfo says in so many words that it's relying on
>> indexes in the table's PartitionDesc; how is that not broken by
>> 898e5e329?

> The only problem with PartitionPruneInfo structures of which I am
> aware is that they rely on PartitionDesc offsets not changing. But I
> added code in that commit in ExecCreatePartitionPruneState to handle
> that exact problem. See also paragraph 5 of the commit message, which
> begins with "Although in general..."

Ah. Grotty, but I guess it will cover the issue.

regards, tom lane
On Sat, Mar 30, 2019 at 11:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The only problem with PartitionPruneInfo structures of which I am
> > aware is that they rely on PartitionDesc offsets not changing. But I
> > added code in that commit in ExecCreatePartitionPruneState to handle
> > that exact problem. See also paragraph 5 of the commit message, which
> > begins with "Although in general..."
>
> Ah. Grotty, but I guess it will cover the issue.

I suppose it is. I am a little suspicious of the decision to make
PartitionPruneInfo structures depend on PartitionDesc indexes. First,
it's really non-obvious that the dependency exists, and I do not think
I would have spotted it had not Alvaro pointed the problem out.
Second, I wonder whether it is really a good idea in general to make a
plan depend on array indexes when the array is not stored in the plan.
In one sense, we do that all the time, because attnums are arguably
just indexes into what is conceptually an array of attributes.
However, I feel that's not quite the same, because the attnum is
explicitly stored in the catalogs, and PartitionDesc array indexes are
not stored anywhere, but rather are the result of a fairly complex
calculation. Now I guess it's probably OK because we will probably
have lots of other problems if we don't get the same answer every time
we do that calculation, but it still makes me a little nervous. I
would try to propose something better but I don't have a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
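[Editorial illustration, not part of the original message.] To make the index dependency discussed above concrete, here is a minimal standalone sketch of re-keying a plan-time map by partition OID when the partition list has changed between planning and execution. It is a simplified model under stated assumptions, not the code in ExecCreatePartitionPruneState: the function name remap_subplans_by_oid, its parameters, and the stand-in Oid typedef are all hypothetical, and the quadratic lookup is chosen for clarity rather than speed.

```c
#include <stddef.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid type */

/*
 * remap_subplans_by_oid (hypothetical helper)
 *
 * plan_relids[i]    OID of the partition at plan-time index i, 0 if pruned
 * plan_subplans[i]  subplan index for plan-time index i, or -1
 * exec_oids[j]      OID of the partition at execution-time index j
 * exec_subplans[j]  output: subplan index for execution-time index j, or -1
 *
 * Partitions the plan does not know about (attached later, or pruned at
 * plan time) end up with -1.
 */
void
remap_subplans_by_oid(const Oid *plan_relids, const int *plan_subplans,
					  size_t n_plan,
					  const Oid *exec_oids, int *exec_subplans,
					  size_t n_exec)
{
	for (size_t j = 0; j < n_exec; j++)
	{
		exec_subplans[j] = -1;	/* unknown to the plan until matched */

		for (size_t i = 0; i < n_plan; i++)
		{
			/* OID 0 marks a partition that was pruned at plan time */
			if (plan_relids[i] != 0 && plan_relids[i] == exec_oids[j])
			{
				exec_subplans[j] = plan_subplans[i];
				break;
			}
		}
	}
}
```

The point of the sketch is only that the OID, not the array position, is the stable key: once the map is rebuilt against the execution-time ordering, a concurrently attached partition simply has no subplan.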
On Sun, Mar 31, 2019 at 12:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > I think the performance results did prove that degradation due to
> > those loops over part_rels becomes significant for very large
> > partition counts. Is there a better solution than the bitmapset that
> > you have in mind?
>
> Hm, I didn't see much degradation in what you posted in
> <5c83dbca-12b5-1acf-0e85-58299e464a26@lab.ntt.co.jp>.

Sorry that I didn't mention the link to begin with, but I meant to
point to numbers that I reported on Monday this week.

https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp

You were complaining of the bitmapset being useless overhead for small
partition counts, but the numbers I get tend to suggest that any
degradation in performance is within noise range, whereas the
performance benefit from having them looks pretty significant for very
large partition counts.

> I am curious as to why there seems to be more degradation
> for hash cases, as per Yoshikazu-san's results in
> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
> but whatever's accounting for the difference probably
> is not that.

I suspected it may have been the lack of bitmapsets, but maybe only
Imai-san could've confirmed that by applying the live_parts patch too.

Thanks,
Amit
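[Editorial illustration, not part of the original message.] The trade-off being argued here is between scanning every slot of a sparse part_rels array and walking only the set bits of a bitmap of live partitions. The sketch below models both in standalone C. It assumes a toy fixed-width bitmap; the LiveParts struct and the live_parts_next, count_live_by_scan and count_live_by_bitmap functions are hypothetical names for illustration, and PostgreSQL's real Bitmapset API is richer than this.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Hypothetical minimal bitmap of unpruned partition indexes. */
typedef struct LiveParts
{
	size_t		nwords;
	uint64_t   *words;
} LiveParts;

/*
 * Return the smallest member greater than prev, or -1 if there is none.
 * Pass prev = -1 to start the iteration.  Empty words are skipped whole.
 */
int
live_parts_next(const LiveParts *lp, int prev)
{
	size_t		bit = (size_t) (prev + 1);

	while (bit < lp->nwords * BITS_PER_WORD)
	{
		uint64_t	w = lp->words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD);

		if (w == 0)
		{
			/* nothing left in this word; jump to the start of the next */
			bit = (bit / BITS_PER_WORD + 1) * BITS_PER_WORD;
			continue;
		}
		while ((w & 1) == 0)
		{
			w >>= 1;
			bit++;
		}
		return (int) bit;
	}
	return -1;
}

/* Visit every slot of the sparse array, skipping pruned (NULL) entries. */
int
count_live_by_scan(void *const *part_rels, int nparts)
{
	int			count = 0;

	for (int i = 0; i < nparts; i++)
	{
		if (part_rels[i] != NULL)
			count++;
	}
	return count;
}

/* Visit only the partitions recorded as live in the bitmap. */
int
count_live_by_bitmap(const LiveParts *lp)
{
	int			count = 0;
	int			i = -1;

	while ((i = live_parts_next(lp, i)) >= 0)
		count++;
	return count;
}
```

With thousands of partitions and only a handful surviving pruning, the scan touches every array slot while the bitmap walk skips empty 64-bit words in one step, which is the kind of difference the benchmark numbers in this thread are trying to measure.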
On Sun, Mar 31, 2019 at 12:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, Mar 30, 2019 at 11:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > The only problem with PartitionPruneInfo structures of which I am > > > aware is that they rely on PartitionDesc offsets not changing. But I > > > added code in that commit in ExecCreatePartitionPruneState to handle > > > that exact problem. See also paragraph 5 of the commit message, which > > > begins with "Although in general..." > > > > Ah. Grotty, but I guess it will cover the issue. > > I suppose it is. I am a little suspicious of the decision to make > PartitionPruneInfo structures depend on PartitionDesc indexes. Fwiw, I had complained when reviewing the run-time pruning patch that creating those maps in the planner and putting them in PartitionPruneInfo might not be a good idea, but David insisted that it'd be good for performance (in the context of using cached plans) to compute this information during planning. Thanks, Amit
On Sat, Mar 30, 2019 at 12:16 PM Amit Langote <amitlangote09@gmail.com> wrote: > Fwiw, I had complained when reviewing the run-time pruning patch that > creating those maps in the planner and putting them in > PartitionPruneInfo might not be a good idea, but David insisted that > it'd be good for performance (in the context of using cached plans) to > compute this information during planning. Well, he's not wrong about that, I expect. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019/03/31 1:06, Amit Langote wrote:
> On Sun, Mar 31, 2019 at 12:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Langote <amitlangote09@gmail.com> writes:
>>> I think the performance results did prove that degradation due to
>>> those loops over part_rels becomes significant for very large
>>> partition counts. Is there a better solution than the bitmapset that
>>> you have in mind?
>>
>> Hm, I didn't see much degradation in what you posted in
>> <5c83dbca-12b5-1acf-0e85-58299e464a26@lab.ntt.co.jp>.
>
> Sorry that I didn't mention the link to begin with, but I meant to
> point to numbers that I reported on Monday this week.
>
> https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp
>
> You were complaining of the bitmapset being useless overhead for small
> partition counts, but the numbers I get tend to suggest that any
> degradation in performance is within noise range, whereas the
> performance benefit from having them looks pretty significant for very
> large partition counts.
>
>> I am curious as to why there seems to be more degradation
>> for hash cases, as per Yoshikazu-san's results in
>> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
>> but whatever's accounting for the difference probably
>> is not that.
>
> I suspected it may have been the lack of bitmapsets, but maybe only
> Imai-san could've confirmed that by applying the live_parts patch too.

Yeah, I forgot to apply the live_parts patch. I reran the same test that
I did for hash before.
(BTW, thanks for committing the speed-up patches!)

[HEAD(428b260)]
nparts    TPS
======  =====
     2: 13134 (13240, 13290, 13071, 13172, 12896)
  1024: 12627 (12489, 12635, 12716, 12732, 12562)
  8192: 10289 (10216, 10265, 10171, 10278, 10514)

[HEAD(428b260) + live_parts.diff]
nparts    TPS
======  =====
     2: 13277 (13112, 13290, 13241, 13360, 13382)
  1024: 12821 (12930, 12849, 12909, 12700, 12716)
  8192: 11102 (11134, 11158, 11114, 10997, 11109)

The performance degradation in each case is summarized below.

My test results from above (with live_parts, HEAD(428b260) +
live_parts.diff)
nparts  live_parts   HEAD
======  ==========   ====
     2:      13277  13134
  1024:      12821  12627
  8192:      11102  10289

11102/13277 = 83.6 %

Amit-san's test results (with live_parts)
> nparts     v38   HEAD
> ======    ====   ====
> 2         2971   2969
> 8         2980   1949
> 32        2955    733
> 128       2946    145
> 512       2924     11
> 1024      2986      3
> 4096      2702      0
> 8192      2531    OOM

2531/2971 = 85.2 %

My test results I posted before (without live_parts)
> nparts     v38   HEAD
> ======    ====   ====
> 0:       10538  10487
> 2:        6942   7028
> 4:        7043   5645
> 8:        6981   3954
> 16:       6932   2440
> 32:       6897   1243
> 64:       6897    309
> 128:      6753    120
> 256:      6727     46
> 512:      6708     12
> 1024:     6063      3
> 2048:     5894      1
> 4096:     5374    OOM
> 8192:     4572    OOM

4572/6942 = 65.9 %

Certainly, using the bitmapset helps performance when scanning one
partition (or a few partitions) out of a large number of partitions.

Thanks
--
Imai Yoshikazu
Attachment
On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i443@live.jp> wrote: > On 2019/03/31 1:06, Amit Langote wrote: > > On Sun, Mar 31, 2019 at 12:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I am curious as to why there seems to be more degradation > >> for hash cases, as per Yoshikazu-san's results in > >> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>, > >> but whatever's accounting for the difference probably > >> is not that. > > > > I suspected it may have been the lack of bitmapsets, but maybe only > > Imai-san could've confirmed that by applying the live_parts patch too. > > Yeah, I forgot to applying live_parts patch. I did same test again which > I did for hash before. > (BTW, thanks for committing speeding up patches!) Thanks a lot for committing, Tom. I wish you had listed yourself as an author though. I will send the patch for get_relation_constraints() mentioned upthread tomorrow. > [HEAD(428b260)] > nparts TPS > ====== ===== > 2: 13134 (13240, 13290, 13071, 13172, 12896) > 1024: 12627 (12489, 12635, 12716, 12732, 12562) > 8192: 10289 (10216, 10265, 10171, 10278, 10514) > > [HEAD(428b260) + live_parts.diff] > nparts TPS > ====== ===== > 2: 13277 (13112, 13290, 13241, 13360, 13382) > 1024: 12821 (12930, 12849, 12909, 12700, 12716) > 8192: 11102 (11134, 11158, 11114, 10997, 11109) > > > Degradations of performance are below. 
> > > My test results from above (with live_parts, HEAD(428b260) + > live_parts.diff) > nparts live_parts HEAD > ====== ========== ==== > 2: 13277 13134 > 1024: 12821 12627 > 8192: 11102 10289 > > 11102/13277 = 83.6 % > > > Amit-san's test results (with live_parts) > > nparts v38 HEAD > > ====== ==== ==== > > 2 2971 2969 > > 8 2980 1949 > > 32 2955 733 > > 128 2946 145 > > 512 2924 11 > > 1024 2986 3 > > 4096 2702 0 > > 8192 2531 OOM > > 2531/2971 = 85.2 % > > > My test results I posted before (without live_parts) > > nparts v38 HEAD > > ====== ==== ==== > > 0: 10538 10487 > > 2: 6942 7028 > > 4: 7043 5645 > > 8: 6981 3954 > > 16: 6932 2440 > > 32: 6897 1243 > > 64: 6897 309 > > 128: 6753 120 > > 256: 6727 46 > > 512: 6708 12 > > 1024: 6063 3 > > 2048: 5894 1 > > 4096: 5374 OOM > > 8192: 4572 OOM > > 4572/6942 = 65.9 % > > > Certainly, using bitmapset contributes to the performance when scanning > one partition(few partitions) from large partitions. Thanks Imai-san for testing. Regards, Amit
Amit Langote <amitlangote09@gmail.com> writes:
> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i443@live.jp> wrote:
>> Certainly, using bitmapset contributes to the performance when scanning
>> one partition(few partitions) from large partitions.

> Thanks Imai-san for testing.

I tried to replicate these numbers with the code as-committed, and
could not.  What I get, using the same table-creation code as you
posted and a pgbench script file like

\set param random(1, :N)
select * from rt where a = :param;

is scaling like this:

   N    tps, range      tps, hash

   2    10520.519932    10415.230400
   8    10443.361457    10480.987665
  32    10341.196768    10462.551167
 128    10370.953849    10383.885128
 512    10207.578413    10214.049394
1024    10042.794340    10121.683993
4096     8937.561825     9214.993778
8192     8247.614040     8486.728918

If I use "-M prepared" the numbers go up a bit for lower N, but
drop at high N:

   N    tps, range      tps, hash

   2    11449.920527    11462.253871
   8    11530.513146    11470.812476
  32    11372.412999    11450.213753
 128    11289.351596    11322.698856
 512    11095.428451    11200.683771
1024    10757.646108    10805.052480
4096     8689.165875     8930.690887
8192     7301.609147     7502.806455

Digging into that, it seems like the degradation with -M prepared is
mostly in LockReleaseAll's hash_seq_search over the locallock hash table.
What I think must be happening is that with -M prepared, at some point the
plancache decides to try a generic plan, which causes opening/locking all
the partitions, resulting in permanent bloat in the locallock hash table.
We immediately go back to using custom plans, but hash_seq_search has
more buckets to look through for the remainder of the process' lifetime.

I do see some cycles getting spent in apply_scanjoin_target_to_paths
that look to be due to scanning over the long part_rels array,
which your proposal would ameliorate.  But (a) that's pretty small
compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
is a remarkable display of brute force inefficiency to begin with.
I think we should see if we can't nuke that function altogether in
favor of generating the paths with the right target the first time.

BTW, the real elephant in the room is the O(N^2) cost of creating
these tables in the first place.  The runtime for the table-creation
scripts looks like

   N    range           hash

   2    0m0.011s        0m0.011s
   8    0m0.015s        0m0.014s
  32    0m0.032s        0m0.030s
 128    0m0.132s        0m0.099s
 512    0m0.969s        0m0.524s
1024    0m3.306s        0m1.442s
4096    0m46.058s       0m15.522s
8192    3m11.995s       0m58.720s

This seems to be down to the expense of doing RelationBuildPartitionDesc
to rebuild the parent's relcache entry for each child CREATE TABLE.
Not sure we can avoid that, but maybe we should consider adopting a
cheaper-to-read representation of partition descriptors.  The fact that
range-style entries seem to be 3X more expensive to load than hash-style
entries is strange.

			regards, tom lane
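
[Editorial sketch] The quadratic growth described above can be illustrated with a toy counting model. This is not PostgreSQL code: it only assumes that creating the k-th partition triggers one rebuild of the parent's descriptor, and that the rebuild costs work proportional to the k entries it then holds. The function name total_rebuild_work is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy cost model (an assumption, not PostgreSQL code): creating the k-th
 * partition rebuilds a parent descriptor that then holds k entries, so
 * that step is charged k units of work.
 */
uint64_t
total_rebuild_work(uint64_t nparts)
{
    uint64_t    work = 0;

    /* keep the sum comfortably inside 64 bits */
    assert(nparts <= UINT32_MAX);

    for (uint64_t k = 1; k <= nparts; k++)
        work += k;              /* one rebuild touching k entries */

    return work;                /* equals nparts * (nparts + 1) / 2 */
}
```

Under this model, doubling N from 4096 to 8192 multiplies the total work by about four, which is roughly the jump seen in the range column of the table above (about 46 s to about 3m12 s). The model says nothing about why range entries cost more than hash entries.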
One thing that I intentionally left out of the committed patch was changes
to stop short of scanning the whole simple_rel_array when looking only for
baserels.  I thought that had been done in a rather piecemeal fashion
and it'd be better to address it holistically, which I've now done in the
attached proposed patch.

This probably makes little if any difference in the test cases we've
mostly focused on in this thread, since there wouldn't be very many
otherrels anyway now that we don't create them for pruned partitions.
However, in a case where there's a lot of partitions that we can't prune,
this could be useful.

I have not done any performance testing to see if this is actually
worth the trouble, though.  Anybody want to do that?

			regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 727da33..7a9aa12 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -157,7 +157,7 @@ make_one_rel(PlannerInfo *root, List *joinlist)
      * Construct the all_baserels Relids set.
      */
     root->all_baserels = NULL;
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
 
@@ -290,7 +290,7 @@ set_base_rel_sizes(PlannerInfo *root)
 {
     Index       rti;
 
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *rel = root->simple_rel_array[rti];
         RangeTblEntry *rte;
@@ -333,7 +333,7 @@ set_base_rel_pathlists(PlannerInfo *root)
 {
     Index       rti;
 
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *rel = root->simple_rel_array[rti];
 
@@ -1994,7 +1994,7 @@ has_multiple_baserels(PlannerInfo *root)
     int         num_base_rels = 0;
     Index       rti;
 
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
 
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 61b5b11..723643c 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -828,11 +828,11 @@ generate_base_implied_equalities(PlannerInfo *root)
      * This is also a handy place to mark base rels (which should all exist by
      * now) with flags showing whether they have pending eclass joins.
      */
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
 
-        if (brel == NULL)
+        if (brel == NULL || brel->reloptkind != RELOPT_BASEREL)
             continue;
 
         brel->has_eclass_joins = has_relevant_eclass_joinclause(root, brel);
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 9798dca..c5459b6 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -145,7 +145,7 @@ add_other_rels_to_query(PlannerInfo *root)
 {
     int         rti;
 
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *rel = root->simple_rel_array[rti];
         RangeTblEntry *rte = root->simple_rte_array[rti];
@@ -312,7 +312,7 @@ find_lateral_references(PlannerInfo *root)
     /*
      * Examine all baserels (the rel array has been set up by now).
      */
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
 
@@ -460,7 +460,7 @@ create_lateral_join_info(PlannerInfo *root)
     /*
      * Examine all baserels (the rel array has been set up by now).
      */
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
         Relids      lateral_relids;
@@ -580,7 +580,7 @@ create_lateral_join_info(PlannerInfo *root)
      * The outer loop considers each baserel, and propagates its lateral
      * dependencies to those baserels that have a lateral dependency on it.
      */
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
         Relids      outer_lateral_relids;
@@ -595,7 +595,7 @@ create_lateral_join_info(PlannerInfo *root)
             continue;
 
         /* else scan all baserels */
-        for (rti2 = 1; rti2 < root->simple_rel_array_size; rti2++)
+        for (rti2 = 1; rti2 <= root->last_base_relid; rti2++)
         {
             RelOptInfo *brel2 = root->simple_rel_array[rti2];
 
@@ -614,7 +614,7 @@ create_lateral_join_info(PlannerInfo *root)
      * with the set of relids of rels that reference it laterally (possibly
      * indirectly) --- that is, the inverse mapping of lateral_relids.
      */
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *brel = root->simple_rel_array[rti];
         Relids      lateral_relids;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 3a1b846..26aa7fa 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1215,6 +1215,7 @@ inheritance_planner(PlannerInfo *root)
     List       *final_rtable = NIL;
     List       *final_rowmarks = NIL;
     int         save_rel_array_size = 0;
+    int         save_last_base_relid = 0;
     RelOptInfo **save_rel_array = NULL;
     AppendRelInfo **save_append_rel_array = NULL;
     List       *subpaths = NIL;
@@ -1664,6 +1665,8 @@ inheritance_planner(PlannerInfo *root)
                 subroot->simple_rel_array[rti] = brel;
         }
         save_rel_array_size = subroot->simple_rel_array_size;
+        save_last_base_relid = Max(save_last_base_relid,
+                                   subroot->last_base_relid);
         save_rel_array = subroot->simple_rel_array;
         save_append_rel_array = subroot->append_rel_array;
 
@@ -1741,6 +1744,7 @@ inheritance_planner(PlannerInfo *root)
      */
     parse->rtable = final_rtable;
     root->simple_rel_array_size = save_rel_array_size;
+    root->last_base_relid = save_last_base_relid;
     root->simple_rel_array = save_rel_array;
     root->append_rel_array = save_append_rel_array;
 
diff --git a/src/backend/optimizer/util/orclauses.c b/src/backend/optimizer/util/orclauses.c
index b671581..c04c91d 100644
--- a/src/backend/optimizer/util/orclauses.c
+++ b/src/backend/optimizer/util/orclauses.c
@@ -78,7 +78,7 @@ extract_restriction_or_clauses(PlannerInfo *root)
     Index       rti;
 
     /* Examine each baserel for potential join OR clauses */
-    for (rti = 1; rti < root->simple_rel_array_size; rti++)
+    for (rti = 1; rti <= root->last_base_relid; rti++)
     {
         RelOptInfo *rel = root->simple_rel_array[rti];
         ListCell   *lc;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 272e2eb..a79c738 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -82,6 +82,9 @@ setup_simple_rel_arrays(PlannerInfo *root)
     root->simple_rel_array = (RelOptInfo **)
         palloc0(root->simple_rel_array_size * sizeof(RelOptInfo *));
 
+    /* obviously, there are no RELOPT_BASEREL entries yet */
+    root->last_base_relid = 0;
+
     /* simple_rte_array is an array equivalent of the rtable list */
     root->simple_rte_array = (RangeTblEntry **)
         palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
@@ -348,6 +351,14 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
     /* Save the finished struct in the query's simple_rel_array */
     root->simple_rel_array[relid] = rel;
 
+    /*
+     * Track the highest index of any BASEREL entry.  This is useful since
+     * many places scan the simple_rel_array for baserels and don't care about
+     * otherrels; they can stop before scanning otherrels.
+     */
+    if (rel->reloptkind == RELOPT_BASEREL)
+        root->last_base_relid = Max(root->last_base_relid, relid);
+
     return rel;
 }
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 88c8973..fcccffc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -201,6 +201,7 @@ struct PlannerInfo
      */
     struct RelOptInfo **simple_rel_array;   /* All 1-rel RelOptInfos */
     int         simple_rel_array_size;  /* allocated size of array */
+    int         last_base_relid;    /* index of last baserel in array */
 
     /*
      * simple_rte_array is the same length as simple_rel_array and holds
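
[Editorial sketch] The idea in the patch above, stripped of planner detail, is to remember the highest array index that holds a baserel and stop scanning there. Below is a minimal standalone model of that bookkeeping. The types Rel and Planner and the functions planner_add_rel and count_baserels are hypothetical stand-ins, not the real RelOptInfo or PlannerInfo; only the last_base_relid field name is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the planner structures; not PostgreSQL code. */
typedef enum { KIND_BASEREL, KIND_OTHERREL } RelKind;

typedef struct Rel
{
    RelKind     kind;
} Rel;

typedef struct Planner
{
    Rel       **rels;           /* 1-based; slot 0 unused, NULL = no rel */
    int         array_size;     /* allocated size of rels[] */
    int         last_base_relid;    /* highest index holding a baserel */
} Planner;

/* Store a rel and keep last_base_relid up to date, as the patch does. */
void
planner_add_rel(Planner *p, int relid, Rel *rel)
{
    assert(relid >= 1 && relid < p->array_size);

    p->rels[relid] = rel;
    if (rel->kind == KIND_BASEREL && relid > p->last_base_relid)
        p->last_base_relid = relid;
}

/* Scan only as far as the last baserel instead of the whole array. */
int
count_baserels(const Planner *p)
{
    int         n = 0;

    for (int rti = 1; rti <= p->last_base_relid; rti++)
    {
        const Rel  *rel = p->rels[rti];

        /* slots below the bound may still be empty or hold otherrels */
        if (rel == NULL || rel->kind != KIND_BASEREL)
            continue;
        n++;
    }
    return n;
}
```

Note that the bound alone does not guarantee every visited slot is a baserel, which is why the loop keeps the kind check, much as the equivclass.c hunk adds a reloptkind test.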
On Sun, 31 Mar 2019 at 05:50, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Mar 30, 2019 at 12:16 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > Fwiw, I had complained when reviewing the run-time pruning patch that
> > creating those maps in the planner and putting them in
> > PartitionPruneInfo might not be a good idea, but David insisted that
> > it'd be good for performance (in the context of using cached plans) to
> > compute this information during planning.
>
> Well, he's not wrong about that, I expect.

I'm aware that there have been various objections to having these
arrays and/or editing them during execution.

I don't recall Amit's complaint, but I do recall Tom's.  He suggested
we not resequence the arrays in the executor and just maintain NULL
elements in the Append/MergeAppend subplans. I did consider this when
writing run-time pruning but found that performance suffers. I
demonstrated this on a thread somewhere.

IIRC, I wrote this code because there was no way to translate the
result of the pruning code into Append/MergeAppend subplan indexes.
Robert has since added a map of Oids to allow the executor to have
those details, so it perhaps would be possible to take the result of
the pruning code, look up the Oids of the partitions that survived
pruning, and then map those to the subplans using the array Robert
added. Using the array for that wouldn't be very efficient, since each
lookup is O(n) per surviving partition. Maybe it could be thrown into a
hashtable to make that faster.  This solution would need to take into
account mixed hierarchy Appends, e.g. SELECT * FROM partitioned_table
WHERE partkey = $1 UNION ALL SELECT * FROM something_else; so it would
likely need to be a hashtable per partitioned table.  If the pruning
code returned a partition whose Oid we didn't know about, then it must
be from a partition that was added concurrently since the plan was
built... However, that shouldn't happen today since Robert went to
great lengths for it not to.

Further discussions are likely best put in their own thread. As far as
I know, nothing is broken with the code today.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
(I've closed the CF entry: https://commitfest.postgresql.org/22/1778/) On 2019/04/01 2:04, Tom Lane wrote: > Amit Langote <amitlangote09@gmail.com> writes: >> On Sun, Mar 31, 2019 at 11:45 AM Imai Yoshikazu <yoshikazu_i443@live.jp> wrote: >>> Certainly, using bitmapset contributes to the performance when scanning >>> one partition(few partitions) from large partitions. > >> Thanks Imai-san for testing. > > I tried to replicate these numbers with the code as-committed, and > could not. Thanks for that. > What I get, using the same table-creation code as you > posted and a pgbench script file like > > \set param random(1, :N) > select * from rt where a = :param; > > is scaling like this: > > N tps, range tps, hash > > 2 10520.519932 10415.230400 > 8 10443.361457 10480.987665 > 32 10341.196768 10462.551167 > 128 10370.953849 10383.885128 > 512 10207.578413 10214.049394 > 1024 10042.794340 10121.683993 > 4096 8937.561825 9214.993778 > 8192 8247.614040 8486.728918 > > If I use "-M prepared" the numbers go up a bit for lower N, but > drop at high N: > > N tps, range tps, hash > > 2 11449.920527 11462.253871 > 8 11530.513146 11470.812476 > 32 11372.412999 11450.213753 > 128 11289.351596 11322.698856 > 512 11095.428451 11200.683771 > 1024 10757.646108 10805.052480 > 4096 8689.165875 8930.690887 > 8192 7301.609147 7502.806455 > > Digging into that, it seems like the degradation with -M prepared is > mostly in LockReleaseAll's hash_seq_search over the locallock hash table. > What I think must be happening is that with -M prepared, at some point the > plancache decides to try a generic plan, which causes opening/locking all > the partitions, resulting in permanent bloat in the locallock hash table. > We immediately go back to using custom plans, but hash_seq_search has > more buckets to look through for the remainder of the process' lifetime. 

Ah, we did find this to be a problem upthread [1] and Tsunakawa-san then
even posted a patch which is being discussed at:

https://commitfest.postgresql.org/22/1993/

> I do see some cycles getting spent in apply_scanjoin_target_to_paths
> that look to be due to scanning over the long part_rels array,
> which your proposal would ameliorate.  But (a) that's pretty small
> compared to other effects, and (b) IMO, apply_scanjoin_target_to_paths
> is a remarkable display of brute force inefficiency to begin with.
> I think we should see if we can't nuke that function altogether in
> favor of generating the paths with the right target the first time.

That's an option if we can make it work.

Shouldn't we look at *all* of the places that have code that now looks
like this:

  for (i = 0; i < rel->nparts; i++)
  {
      RelOptInfo *partrel = rel->part_rels[i];

      if (partrel == NULL)
          continue;
      ...
  }

Besides apply_scanjoin_target_to_paths(), there are:

create_partitionwise_grouping_paths()
make_partitionedrel_pruneinfo()

> BTW, the real elephant in the room is the O(N^2) cost of creating
> these tables in the first place.  The runtime for the table-creation
> scripts looks like
>
>    N    range           hash
>
>    2    0m0.011s        0m0.011s
>    8    0m0.015s        0m0.014s
>   32    0m0.032s        0m0.030s
>  128    0m0.132s        0m0.099s
>  512    0m0.969s        0m0.524s
> 1024    0m3.306s        0m1.442s
> 4096    0m46.058s       0m15.522s
> 8192    3m11.995s       0m58.720s
>
> This seems to be down to the expense of doing RelationBuildPartitionDesc
> to rebuild the parent's relcache entry for each child CREATE TABLE.
> Not sure we can avoid that, but maybe we should consider adopting a
> cheaper-to-read representation of partition descriptors.  The fact that
> range-style entries seem to be 3X more expensive to load than hash-style
> entries is strange.

I've noticed this many times too, but never prioritized doing something
about it.  I'll try sometime.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-dn1hDZqObwdMrYdV7-cELJwWCPRWet6EQX_WaV8JLgw%40mail.gmail.com
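
[Editorial sketch] The alternative to the NULL-skipping loop shown above is to keep a bitmap of surviving partitions (the "live_parts" idea discussed in this thread) and visit only the set bits. The standalone sketch below illustrates that iteration with a plain array of 64-bit words. It is loosely modeled on the calling style of PostgreSQL's bms_next_member but is not that implementation; bitmap_next_member and count_live_parts are hypothetical names, and the end-of-set return value here is -1 by assumption.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 64

/*
 * Return the smallest set bit greater than prev, or -1 if there is none.
 * Pass prev = -1 to start.  Empty words are skipped in one step, which is
 * what makes this cheaper than testing every array slot for NULL.
 */
int
bitmap_next_member(const uint64_t *words, int nwords, int prev)
{
    assert(prev >= -1);

    for (int bit = prev + 1; bit < nwords * WORD_BITS; bit++)
    {
        uint64_t    w = words[bit / WORD_BITS];

        if (w == 0)
        {
            /* jump to the last bit of this word; the loop's ++ moves on */
            bit |= WORD_BITS - 1;
            continue;
        }
        if (w & (UINT64_C(1) << (bit % WORD_BITS)))
            return bit;
    }
    return -1;
}

/* Visit only live partitions instead of looping over all nparts slots. */
int
count_live_parts(const uint64_t *live_parts, int nwords)
{
    int         n = 0;

    for (int i = bitmap_next_member(live_parts, nwords, -1);
         i >= 0;
         i = bitmap_next_member(live_parts, nwords, i))
        n++;                    /* real code would process part_rels[i] */

    return n;
}
```

With 8192 partitions and one survivor, the loop body runs once and the scan touches 128 words rather than 8192 pointers. Whether that difference is measurable in practice is exactly what the benchmark numbers in this thread are arguing about.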
On 2019/04/01 3:46, Tom Lane wrote:
> One thing that I intentionally left out of the committed patch was changes
> to stop short of scanning the whole simple_rel_array when looking only for
> baserels.  I thought that had been done in a rather piecemeal fashion
> and it'd be better to address it holistically, which I've now done in the
> attached proposed patch.
>
> This probably makes little if any difference in the test cases we've
> mostly focused on in this thread, since there wouldn't be very many
> otherrels anyway now that we don't create them for pruned partitions.
> However, in a case where there's a lot of partitions that we can't prune,
> this could be useful.
>
> I have not done any performance testing to see if this is actually
> worth the trouble, though.  Anybody want to do that?

Thanks for creating the patch.

I spent some time looking for cases where this patch would provide
recognizable benefit, but couldn't find one.  As one would suspect, it's
hard to notice it if only looking at the overall latency of the query,
because time spent doing other things with such plans tends to be pretty
huge anyway (both in the planner itself and other parts of the backend).
I even devised a query on a partitioned table such that the planner has
to process all partitions, but ExecInitAppend can prune all but one, thus
reducing the time spent in the executor. Even then, I wasn't able to see
an improvement in the overall latency of the query from the planner no
longer looping over the long simple_rel_array.

Thanks,
Amit
On 2019/03/30 0:29, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> Finally, it's not in the patch, but how about visiting
>> get_relation_constraints() for revising this block of code:
>
> That seems like probably an independent patch --- do you want to write it?

Here is that patch.

It revises get_relation_constraints() such that the partition constraint
is loaded in only the intended cases.  To summarize:

* PG 11 currently misses one such intended case (select * from partition),
  causing a *bug* where constraint exclusion fails to exclude the
  partition with constraint_exclusion = on

* HEAD loads the partition constraint even in some cases where 428b260f87
  rendered doing that unnecessary

Thanks,
Amit
Attachment
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > On 2019/04/01 3:46, Tom Lane wrote: >> One thing that I intentionally left out of the committed patch was changes >> to stop short of scanning the whole simple_rel_array when looking only for >> baserels. I thought that had been done in a rather piecemeal fashion >> and it'd be better to address it holistically, which I've now done in the >> attached proposed patch. >> I have not done any performance testing to see if this is actually >> worth the trouble, though. Anybody want to do that? > Thanks for creating the patch. > I spent some time looking for cases where this patch would provide > recognizable benefit, but couldn't find one. Yeah, I was afraid of that. In cases where we do have a ton of otherrels, the processing that's actually needed on them would probably swamp any savings from this patch. The only place where that might possibly not be true is create_lateral_join_info, since that has nested loops that could potentially impose an O(N^2) cost. However, since your patch went in, that runs before inheritance expansion anyway. So this probably isn't worth even the minuscule cost it imposes. regards, tom lane
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/03/30 0:29, Tom Lane wrote:
>> That seems like probably an independent patch --- do you want to write it?

> Here is that patch.
> It revises get_relation_constraints() such that the partition constraint
> is loaded in only the intended cases.

So I see the problem you're trying to solve here, but I don't like this
patch a bit, because it depends on root->inhTargetKind which IMO is a
broken bit of junk that we need to get rid of.  Here is an example of
why, with this patch applied:

regression=# create table p (a int) partition by list (a);
CREATE TABLE
regression=# create table p1 partition of p for values in (1);
CREATE TABLE
regression=# set constraint_exclusion to on;
SET
regression=# explain select * from p1 where a = 2;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=0)
   One-Time Filter: false
(2 rows)

So far so good, but watch what happens when we include the same case
in an UPDATE on some other partitioned table:

regression=# create table prtab (a int, b int) partition by list (a);
CREATE TABLE
regression=# create table prtab2 partition of prtab for values in (2);
CREATE TABLE
regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and p1.a=2;
                                QUERY PLAN
---------------------------------------------------------------------------
 Update on prtab  (cost=0.00..82.30 rows=143 width=20)
   Update on prtab2
   ->  Nested Loop  (cost=0.00..82.30 rows=143 width=20)
         ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
               Filter: (a = 2)
         ->  Materialize  (cost=0.00..38.30 rows=11 width=14)
               ->  Seq Scan on prtab2  (cost=0.00..38.25 rows=11 width=14)
                     Filter: (a = 2)
(8 rows)

No constraint exclusion, while in v10 you get

 Update on prtab  (cost=0.00..0.00 rows=0 width=0)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
         One-Time Filter: false

The reason is that this logic supposes that root->inhTargetKind describes
*all* partitioned tables in the query, which is obviously wrong.

Now maybe we could make it work by doing something like

	if (rel->reloptkind == RELOPT_BASEREL &&
	    (root->inhTargetKind == INHKIND_NONE ||
	     rel->relid != root->parse->resultRelation))

but I find that pretty messy, plus it's violating the concept that we
shouldn't be allowing messiness from inheritance_planner to leak into
other places.  What I'd rather do is have this test just read

	if (rel->reloptkind == RELOPT_BASEREL)

Making it be that way causes some changes in the partition_prune results,
as attached, which suggest that removing the enable_partition_pruning
check as you did wasn't such a great idea either.  However, if I add
that back in, then it breaks the proposed new regression test case.

I'm not at all clear on what we think the interaction between
enable_partition_pruning and constraint_exclusion ought to be,
so I'm not sure what the appropriate resolution is here.  Thoughts?

BTW, just about all the other uses of root->inhTargetKind seem equally
broken from here; none of them are accounting for whether the rel in
question is the query target.

			regards, tom lane

diff -U3 /home/postgres/pgsql/src/test/regress/expected/partition_prune.out /home/postgres/pgsql/src/test/regress/results/partition_prune.out
--- /home/postgres/pgsql/src/test/regress/expected/partition_prune.out	2019-04-01 12:39:52.613109088 -0400
+++ /home/postgres/pgsql/src/test/regress/results/partition_prune.out	2019-04-01 13:18:02.852615395 -0400
@@ -3409,24 +3409,18 @@
 --------------------------
  Update on pp_lp
    Update on pp_lp1
-   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
         QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
-   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 set constraint_exclusion = 'off'; -- this should not affect the result.
 explain (costs off) select * from pp_lp where a = 1;
@@ -3444,24 +3438,18 @@
 --------------------------
  Update on pp_lp
    Update on pp_lp1
-   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
         QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
-   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-   ->  Seq Scan on pp_lp2
-         Filter: (a = 1)
-(7 rows)
+(4 rows)
 
 drop table pp_lp;
 -- Ensure enable_partition_prune does not affect non-partitioned tables.
Thanks for taking a look.

On 2019/04/02 2:34, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2019/03/30 0:29, Tom Lane wrote:
>>> That seems like probably an independent patch --- do you want to write it?
>
>> Here is that patch.
>> It revises get_relation_constraints() such that the partition constraint
>> is loaded in only the intended cases.
>
> So I see the problem you're trying to solve here, but I don't like this
> patch a bit, because it depends on root->inhTargetKind which IMO is a
> broken bit of junk that we need to get rid of. Here is an example of
> why, with this patch applied:
>
> regression=# create table p (a int) partition by list (a);
> CREATE TABLE
> regression=# create table p1 partition of p for values in (1);
> CREATE TABLE
> regression=# set constraint_exclusion to on;
> SET
> regression=# explain select * from p1 where a = 2;
>                 QUERY PLAN
> ------------------------------------------
>  Result  (cost=0.00..0.00 rows=0 width=0)
>    One-Time Filter: false
> (2 rows)
>
> So far so good, but watch what happens when we include the same case
> in an UPDATE on some other partitioned table:
>
> regression=# create table prtab (a int, b int) partition by list (a);
> CREATE TABLE
> regression=# create table prtab2 partition of prtab for values in (2);
> CREATE TABLE
> regression=# explain update prtab set b=b+1 from p1 where prtab.a=p1.a and p1.a=2;
>                                 QUERY PLAN
> ---------------------------------------------------------------------------
>  Update on prtab  (cost=0.00..82.30 rows=143 width=20)
>    Update on prtab2
>    ->  Nested Loop  (cost=0.00..82.30 rows=143 width=20)
>          ->  Seq Scan on p1  (cost=0.00..41.88 rows=13 width=10)
>                Filter: (a = 2)
>          ->  Materialize  (cost=0.00..38.30 rows=11 width=14)
>                ->  Seq Scan on prtab2  (cost=0.00..38.25 rows=11 width=14)
>                      Filter: (a = 2)
> (8 rows)
>
> No constraint exclusion, while in v10 you get
>
>  Update on prtab  (cost=0.00..0.00 rows=0 width=0)
>    ->  Result  (cost=0.00..0.00 rows=0 width=0)
>          One-Time Filter: false
>
> The reason is that this logic supposes that root->inhTargetKind describes
> *all* partitioned tables in the query, which is obviously wrong.
>
> Now maybe we could make it work by doing something like
>
>     if (rel->reloptkind == RELOPT_BASEREL &&
>         (root->inhTargetKind == INHKIND_NONE ||
>          rel->relid != root->parse->resultRelation))

Ah, you're right. inhTargetKind has to be checked in conjunction with
checking whether the relation is the target relation.

> but I find that pretty messy, plus it's violating the concept that we
> shouldn't be allowing messiness from inheritance_planner to leak into
> other places.

I'm afraid that we'll have to live with this particular hack as long as we
have inheritance_planner(), but we maybe could somewhat reduce the extent
to which the hack is spread into other planner files.

How about we move the part of get_relation_constraints() that loads the
partition constraint to its only caller relation_excluded_by_constraints()?
If we do that, all uses of root->inhTargetKind will be confined to one
place. Attached updated patch does that.

> What I'd rather do is have this test just read
>
>     if (rel->reloptkind == RELOPT_BASEREL)
>
> Making it be that way causes some changes in the partition_prune results,
> as attached, which suggest that removing the enable_partition_pruning
> check as you did wasn't such a great idea either. However, if I add
> that back in, then it breaks the proposed new regression test case.
>
> I'm not at all clear on what we think the interaction between
> enable_partition_pruning and constraint_exclusion ought to be,
> so I'm not sure what the appropriate resolution is here. Thoughts?

Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
DELETE queries is realized by applying constraint exclusion to the
partition constraint of the target partition. The conclusion of the
discussion when adding the enable_partition_pruning GUC [1] was that
whether or not constraint exclusion is carried out (to facilitate
partition pruning) should be governed by the new GUC, not
constraint_exclusion, if only to avoid confusing users.

428b260f87 has obviated the need to check enable_partition_pruning in
relation_excluded_by_constraints(), because inheritance_planner() now runs
the query as if it were SELECT, which does partition pruning using
partprune.c, governed by the setting of enable_partition_pruning. So,
there's no need to check it again in relation_excluded_by_constraints(),
because we won't be consulting the partition constraint again; well, at
least after applying the proposed patch.

> BTW, just about all the other uses of root->inhTargetKind seem equally
> broken from here; none of them are accounting for whether the rel in
> question is the query target.

There's only one other use of its value, AFAICS:

    switch (constraint_exclusion)
    {
        case CONSTRAINT_EXCLUSION_OFF:

            /*
             * Don't prune if feature turned off -- except if the relation is
             * a partition. While partprune.c-style partition pruning is not
             * yet in use for all cases (update/delete is not handled), it
             * would be a UI horror to use different user-visible controls
             * depending on such a volatile implementation detail. Therefore,
             * for partitioned tables we use enable_partition_pruning to
             * control this behavior.
             */
            if (root->inhTargetKind == INHKIND_PARTITIONED)
                break;

Updated patch removes it though. Which other uses are there?

Attached patch is only for HEAD this time. I'll post one for PG 11 (if
you'd like) once we reach consensus on the best thing to do here is.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/flat/CAFjFpRcwq7G16J_w%2Byy_xiE7daD0Bm6iYTnhz81f79yrSOn4DA%40mail.gmail.com
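For readers following the discussion of the target-relation check, here is a minimal, self-contained C sketch of the condition Tom floated in the message above. It is only an illustration: the Mock* structs and the function name should_load_partition_constraint are hypothetical stand-ins that carry just the fields the condition reads, not the planner's real definitions, and this is not the code from the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the planner's types. */
typedef enum { RELOPT_BASEREL, RELOPT_OTHER_MEMBER_REL } RelOptKind;
typedef enum { INHKIND_NONE, INHKIND_INHERITED, INHKIND_PARTITIONED } InheritanceKind;

typedef struct { int resultRelation; } MockQuery;
typedef struct { MockQuery *parse; InheritanceKind inhTargetKind; } MockPlannerInfo;
typedef struct { RelOptKind reloptkind; int relid; } MockRelOptInfo;

/*
 * Returns true when the partition constraint of "rel" would be consulted
 * under the condition quoted in the thread: the rel is a baserel, and
 * either the query is not an inherited UPDATE/DELETE at all, or the rel
 * is not that query's result relation.
 */
static bool
should_load_partition_constraint(const MockPlannerInfo *root,
                                 const MockRelOptInfo *rel)
{
    assert(root != NULL && root->parse != NULL && rel != NULL);

    return rel->reloptkind == RELOPT_BASEREL &&
           (root->inhTargetKind == INHKIND_NONE ||
            rel->relid != root->parse->resultRelation);
}
```

In the UPDATE example above, p1 is a baserel whose relid differs from the result relation, so under this condition its partition constraint would still be consulted, whereas checking inhTargetKind alone skips it.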
Hi all,

First of all I would like to thank everyone involved in this patch for their hard work on this. This is a big step forward. I've done some performance and functionality testing with the patch that was committed to master and it looks very good.

I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider

select * from partitioned_table where ts >= now()
or
select * from partitioned_table where ts >= '2019-04-04'

The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.

I'm not too familiar with the code that handles this, but is there a possibility for improvement in this area? Or is the stage at which exclusion for now()/current_date occurs already too far in the process to make any good improvements to this? My apologies if this is considered off-topic for this patch, but I ran into this issue specifically when I was testing this patch, so I thought I'd ask here about it. I do think a large number of use-cases for tables with a large number of partitions involve a timestamp for partition key, and naturally people will start writing queries for this that use functions such as now() and current_date.

Thanks again for your work on this patch!

-Floris
On Fri, 5 Apr 2019 at 07:33, Floris Van Nee <florisvannee@optiver.com> wrote:
> I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider
> select * from partitioned_table where ts >= now()
> or
> select * from partitioned_table where ts >= '2019-04-04'
>
> The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.

The patch here only aims to improve the performance of queries to
partitioned tables when partitions can be pruned during planning. The
now() version of the query is unable to do that since we don't know
what that value will be during the execution of the query. In that
version, you're most likely seeing "Subplans Removed: <n>". This means
run-time pruning did some pruning and the planner generated subplans
for what you see plus <n> others. Since planning for all partitions is
still slow, you're getting a larger performance difference than
before, but only due to the fact that the other plan is now faster to
generate.

If you're never using prepared statements, i.e., always planning right
before execution, then you might want to consider using "where ts >=
'today'::timestamp". This will evaluate to the current date during
parse and make the value available to the planner. You'll need to be
pretty careful with that though, as if you do prepare queries or
change to do that in the future then the bugs in your application
could be very subtle and only do the wrong thing just after midnight
on some day when the current time progresses over your partition
boundary.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
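To make the plan-time versus run-time distinction above concrete, here is a toy C model. It is a sketch under stated assumptions, not PostgreSQL code: ToyPartition, partitions_matching_ge and subplans_built_at_plan_time are hypothetical names, partitions are modelled as half-open integer ranges, and the only query shape considered is "key >= value".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical range partition covering keys in [lower, upper). */
typedef struct { long lower; long upper; } ToyPartition;

/* Count the partitions that can hold rows satisfying "key >= value". */
static size_t
partitions_matching_ge(const ToyPartition *parts, size_t nparts, long value)
{
    size_t kept = 0;

    assert(parts != NULL || nparts == 0);
    for (size_t i = 0; i < nparts; i++)
    {
        if (parts[i].upper > value)
            kept++;
    }
    return kept;
}

/*
 * Number of subplans the planner has to build.  If the comparison value
 * is known while planning (a literal), pruning happens first; if it is
 * only known at execution (as with now()), a subplan is built for every
 * partition and the pruning is left to the executor.
 */
static size_t
subplans_built_at_plan_time(const ToyPartition *parts, size_t nparts,
                            bool value_known, long value)
{
    if (!value_known)
        return nparts;
    return partitions_matching_ge(parts, nparts, value);
}
```

With four partitions and a value that falls in the last one, the literal case builds one subplan while the unknown-value case builds four, of which three would later show up as "Subplans Removed: 3".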
On 2019/04/05 6:59, David Rowley wrote:
> On Fri, 5 Apr 2019 at 07:33, Floris Van Nee <florisvannee@optiver.com> wrote:
>> I had a question about the performance of pruning of functions like now() and current_date. I know these are handled differently, as they cannot be excluded during the first phases of planning. However, currently, this new patch makes the performance difference between the static timestamp variant and now() very obvious (even more than before). Consider
>> select * from partitioned_table where ts >= now()
>> or
>> select * from partitioned_table where ts >= '2019-04-04'
>>
>> The second plans in less than a millisecond, whereas the first takes +- 180ms for a table with 1000 partitions. Both end up with the same plan.
>
> The patch here only aims to improve the performance of queries to
> partitioned tables when partitions can be pruned during planning. The
> now() version of the query is unable to do that since we don't know
> what that value will be during the execution of the query. In that
> version, you're most likely seeing "Subplans Removed: <n>". This means
> run-time pruning did some pruning and the planner generated subplans
> for what you see plus <n> others. Since planning for all partitions is
> still slow, you're getting a larger performance difference than
> before, but only due to the fact that the other plan is now faster to
> generate.

Yeah, the time for generating plan for a query that *can* use pruning but
not during planning is still very much dependent on the number of
partitions, because access plans must be created for all partitions, even
if only one of those plans will actually be used and the rest pruned away
during execution.

> If you're never using prepared statements,

Or if using prepared statements is an option, the huge planning cost
mentioned above need not be paid repeatedly. Although, we still have ways
to go in terms of scaling generic plan execution to larger partition
counts, solution(s) for which have been proposed by David but haven't made
it into master yet.

Thanks,
Amit
On Fri, 5 Apr 2019 at 16:09, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Although, we still have ways
> to go in terms of scaling generic plan execution to larger partition
> counts, solution(s) for which have been proposed by David but haven't made
> it into master yet.

Is that a reference to the last paragraph in [1]? That idea has not
gone beyond me writing that text yet! :-( It was more of a passing
comment on the only way I could think of to solve the problem.

[1] https://www.postgresql.org/message-id/CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=VsYipw@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2019/04/05 12:18, David Rowley wrote:
> On Fri, 5 Apr 2019 at 16:09, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Although, we still have ways
>> to go in terms of scaling generic plan execution to larger partition
>> counts, solution(s) for which have been proposed by David but haven't made
>> it into master yet.
>
> Is that a reference to the last paragraph in [1]? That idea has not
> gone beyond me writing that text yet! :-( It was more of a passing
> comment on the only way I could think of to solve the problem.
>
> [1] https://www.postgresql.org/message-id/CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=VsYipw@mail.gmail.com

Actually, I meant to refer to the following:

https://commitfest.postgresql.org/22/1897/

Of course, we should pursue all available options. :)

Thanks,
Amit
On 2019/04/02 14:50, Amit Langote wrote:
> Attached patch is only for HEAD this time. I'll post one for PG 11 (if
> you'd like) once we reach consensus on the best thing to do here is.

While we're on the topic of the relation between constraint exclusion and
partition pruning, I'd like to (re-) propose this documentation update
patch. The partitioning chapter in ddl.sgml says update/delete of
partitioned tables uses constraint exclusion internally to emulate
partition pruning, which is no longer true as of 428b260f8.

The v2-0001 patch hasn't changed.

Thanks,
Amit
Thanks for the details! Indeed the versions with now()/current_date use the runtime pruning rather than planning time. I wasn't aware of the use of 'today' though - that could be useful in case we're sure statements won't be prepared.

Previously (v10/ partly v11) it was necessary to make sure that statements on partitioned tables were never prepared, because run-time pruning wasn't available - using a generic plan was almost always a bad option. Now in v12 it seems to be a tradeoff between whether or not run-time pruning can occur. If pruning is possible at planning time it's probably still better not to prepare statements, whereas if run-time pruning has to occur, it's better to prepare them.

One unrelated thing I noticed (but I'm not sure if it's worth a separate email thread) is that the changed default of jit=on in v12 doesn't work very well with a large number of partitions at run-time, for which a large number get excluded at run-time. A query that has an estimated cost above jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000 partitions), because JIT is optimizing the full plan. Without JIT it's barely 20ms (+400ms planning). I can give more details in a separate thread if it's deemed interesting.

Planning Time: 411.321 ms
JIT:
  Functions: 5005
  Options: Inlining false, Optimization true, Expressions true, Deforming true
  Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

-Floris
Hi,

On 2019/04/05 18:13, Floris Van Nee wrote:
> One unrelated thing I noticed (but I'm not sure if it's worth a separate email thread) is that the changed default of jit=on in v12 doesn't work very well with a large number of partitions at run-time, for which a large number get excluded at run-time. A query that has an estimated cost above jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000 partitions), because JIT is optimizing the full plan. Without JIT it's barely 20ms (+400ms planning). I can give more details in a separate thread if it's deemed interesting.
>
> Planning Time: 411.321 ms
> JIT:
>   Functions: 5005
>   Options: Inlining false, Optimization true, Expressions true, Deforming true
>   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms

I've noticed a similar problem but in the context of interaction with
parallel query mechanism. The planner, seeing that all partitions will be
scanned (after failing to prune with clauses containing CURRENT_TIMESTAMP
etc.), prepares a parallel plan (containing Parallel Append in this case).
As you can imagine, parallel query initialization (Gather+workers) will
take large amount of time relative to the time it will take to scan the
partitions that remain after pruning (often just one).

The problem in this case is that the planner is oblivious to the
possibility of partition pruning occurring during execution, which may be
common to both parallel query and JIT. If it wasn't oblivious, it
would've set the cost of pruning-capable Append such that parallel query
and/or JIT won't be invoked. We are going to have to fix that sooner or
later.

Thanks,
Amit
On Fri, 5 Apr 2019 at 23:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Hi,
>
> On 2019/04/05 18:13, Floris Van Nee wrote:
> > One unrelated thing I noticed (but I'm not sure if it's worth a separate email thread) is that the changed default of jit=on in v12 doesn't work very well with a large number of partitions at run-time, for which a large number get excluded at run-time. A query that has an estimated cost above jit_optimize_above_cost takes about 30 seconds to run (for a table with 1000 partitions), because JIT is optimizing the full plan. Without JIT it's barely 20ms (+400ms planning). I can give more details in a separate thread if it's deemed interesting.
> >
> > Planning Time: 411.321 ms
> > JIT:
> >   Functions: 5005
> >   Options: Inlining false, Optimization true, Expressions true, Deforming true
> >   Timing: Generation 721.472 ms, Inlining 0.000 ms, Optimization 16312.195 ms, Emission 12533.611 ms, Total 29567.278 ms
>
> I've noticed a similar problem but in the context of interaction with
> parallel query mechanism. The planner, seeing that all partitions will be
> scanned (after failing to prune with clauses containing CURRENT_TIMESTAMP
> etc.), prepares a parallel plan (containing Parallel Append in this case).
> As you can imagine, parallel query initialization (Gather+workers) will
> take large amount of time relative to the time it will take to scan the
> partitions that remain after pruning (often just one).
>
> The problem in this case is that the planner is oblivious to the
> possibility of partition pruning occurring during execution, which may be
> common to both parallel query and JIT. If it wasn't oblivious, it
> would've set the cost of pruning-capable Append such that parallel query
> and/or JIT won't be invoked. We are going to have to fix that sooner or
> later.

Robert and I had a go at discussing this in [1]. Some ideas were
thrown around in the nature of contorting the Append/MergeAppend's
total_cost in a similar way to how clauselist_selectivity does its
estimates for unknown values. Perhaps it is possible to actually
multiply the total_cost by the clauselist_selectivity for the
run-time pruning quals. That's pretty crude and highly unusual, but
it's probably going to give something more sane than what's there
today. The run-time prune quals would likely need to be determined
earlier than createplan.c for that to work though. IIRC the reason it
was done there is, because at the time, there wasn't a need to do it
per path.

I don't really have any better ideas right now, so if someone does
then maybe we should take it up on a new thread. It would be good to
leave this thread alone for unrelated things. It's long enough
already.

[1] https://www.postgresql.org/message-id/CA%2BTgmobhXJGMuHxKjbaKcEJXacxVZHG4%3DhEGFfPF_FrGt37T_Q%40mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
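The cost-scaling idea in the message above can be illustrated with a few lines of C. This is only a back-of-the-envelope sketch of the arithmetic being floated, not a proposed patch: scaled_append_cost and exceeds_cost_threshold are hypothetical names, the selectivity is assumed to be handed in by the caller, and the threshold check is a simplified stand-in for how a cost-based setting such as jit_above_cost might be compared against a plan's cost.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Scale an Append's total cost by the estimated fraction of its subplans
 * expected to survive run-time pruning (0.0 .. 1.0).
 */
static double
scaled_append_cost(double total_cost, double runtime_prune_selectivity)
{
    assert(total_cost >= 0.0);
    assert(runtime_prune_selectivity >= 0.0 &&
           runtime_prune_selectivity <= 1.0);

    return total_cost * runtime_prune_selectivity;
}

/*
 * Simplified threshold test: a negative threshold means "disabled",
 * otherwise the feature kicks in when the cost exceeds the threshold.
 */
static bool
exceeds_cost_threshold(double plan_cost, double threshold)
{
    return threshold >= 0.0 && plan_cost > threshold;
}
```

For instance, an unscaled cost of 1000.0 exceeds a threshold of 500.0, while the same cost scaled by a selectivity of 0.25 (250.0) does not, which is the kind of effect that would keep JIT or parallel query from being chosen for a plan that is mostly pruned at run time.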
On Fri, 5 Apr 2019 at 19:50, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> While we're on the topic of the relation between constraint exclusion and
> partition pruning, I'd like to (re-) propose this documentation update
> patch. The partitioning chapter in ddl.sgml says update/delete of
> partitioned tables uses constraint exclusion internally to emulate
> partition pruning, which is no longer true as of 428b260f8.

Update-docs-that-update-delete-no-longer-use-cons.patch looks good to
me. It should be changed as what the docs say is no longer true.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2019/04/11 14:03, David Rowley wrote:
> On Fri, 5 Apr 2019 at 19:50, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> While we're on the topic of the relation between constraint exclusion and
>> partition pruning, I'd like to (re-) propose this documentation update
>> patch. The partitioning chapter in ddl.sgml says update/delete of
>> partitioned tables uses constraint exclusion internally to emulate
>> partition pruning, which is no longer true as of 428b260f8.
>
> Update-docs-that-update-delete-no-longer-use-cons.patch looks good to
> me. It should be changed as what the docs say is no longer true.

Thanks for the quick review. :)

Regards,
Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/04/02 2:34, Tom Lane wrote:
>> I'm not at all clear on what we think the interaction between
>> enable_partition_pruning and constraint_exclusion ought to be,
>> so I'm not sure what the appropriate resolution is here. Thoughts?

> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
> DELETE queries is realized by applying constraint exclusion to the
> partition constraint of the target partition. The conclusion of the
> discussion when adding the enable_partition_pruning GUC [1] was that
> whether or not constraint exclusion is carried out (to facilitate
> partition pruning) should be governed by the new GUC, not
> constraint_exclusion, if only to avoid confusing users.

I got back to thinking about how this ought to work. It appears to me
that we've got half a dozen different behaviors that depend on one or both
of these settings:

1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
   that is relation_excluded_by_constraints for baserels.
   This is enabled by constraint_exclusion = on.

2. Use of partition constraints in baserel pruning (applicable only
   when a partition is accessed directly).
   This is currently partly broken, and it's what your patch wants to
   change.

3. Use of ordinary table constraints in appendrel pruning,
   that is relation_excluded_by_constraints for appendrel members.
   This is enabled by constraint_exclusion >= partition.

4. Use of partition constraints in appendrel pruning.
   This is enabled by the combination of enable_partition_pruning AND
   constraint_exclusion >= partition. However, it looks to me like this
   is now nearly if not completely useless because of #5.

5. Use of partition constraints in expand_partitioned_rtentry.
   Enabled by enable_partition_pruning (see prune_append_rel_partitions).

6. Use of partition constraints in run-time partition pruning.
   This is also enabled by enable_partition_pruning, cf
   create_append_plan, create_merge_append_plan.

Now in addition to what I mention above, there are assorted random
differences in behavior depending on whether we are in an inherited
UPDATE/DELETE or not. I consider these differences to be so bogus
that I'm not even going to include them in this taxonomy; they should
not exist. The UPDATE/DELETE target ought to act the same as a baserel.

I think this is ridiculously overcomplicated even without said random
differences. I propose that we do the following:

* Get rid of point 4 by not considering partition constraints for
  appendrel members in relation_excluded_by_constraints. It's just
  useless cycles in view of point 5, or nearly so. (Possibly there
  are corner cases where we could prove contradictions between a
  relation's partition constraints and regular constraints ... but is
  it really worth spending planner cycles to look for that?)

* Make point 2 like point 1 by treating partition constraints for
  baserels like ordinary table constraints, ie, they are considered
  only when constraint_exclusion = on (independently of whether
  enable_partition_pruning is on).

* Treat an inherited UPDATE/DELETE target table as if it were an
  appendrel member for the purposes of relation_excluded_by_constraints,
  thus removing the behavioral differences between SELECT and UPDATE/DELETE.

With this, constraint_exclusion would act pretty much as it traditionally
has, and in most cases would not have any special impact on partitions
compared to old-style inheritance. The behaviors that
enable_partition_pruning would control are expand_partitioned_rtentry
pruning and run-time pruning, neither of which have any applicability to
old-style inheritance.

Thoughts?

regards, tom lane
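The three-part proposal in the message above can be restated as a small decision function. The following C sketch is one reading of that proposal, not committed planner code: the Toy* types and proposed_exclusion_inputs are hypothetical, and the enum is deliberately ordered off < partition < on so that "constraint_exclusion >= partition" can be written as a comparison, which need not match the ordering of the real GUC enum.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ordering chosen so that ">= partition" is a comparison. */
typedef enum { TOY_CE_OFF, TOY_CE_PARTITION, TOY_CE_ON } ToyConstraintExclusion;

/* An inherited UPDATE/DELETE target is treated as an appendrel member. */
typedef enum { TOY_BASEREL, TOY_APPENDREL_MEMBER } ToyRelKind;

typedef struct
{
    bool use_ordinary_constraints;   /* CHECK, NOT NULL */
    bool use_partition_constraint;
} ToyExclusionInputs;

/*
 * Which constraints relation_excluded_by_constraints would look at under
 * the proposal.  enable_partition_pruning plays no role here: it would
 * only control expand_partitioned_rtentry pruning and run-time pruning.
 */
static ToyExclusionInputs
proposed_exclusion_inputs(ToyConstraintExclusion ce, ToyRelKind kind)
{
    ToyExclusionInputs result = { false, false };

    assert(kind == TOY_BASEREL || kind == TOY_APPENDREL_MEMBER);

    if (kind == TOY_BASEREL)
    {
        /* Points 1 and 2: both kinds of constraint need "on". */
        result.use_ordinary_constraints = (ce == TOY_CE_ON);
        result.use_partition_constraint = (ce == TOY_CE_ON);
    }
    else
    {
        /* Point 3 stays; point 4 is dropped in favour of point 5. */
        result.use_ordinary_constraints = (ce >= TOY_CE_PARTITION);
        result.use_partition_constraint = false;
    }
    return result;
}
```

Under this reading, a directly accessed partition with constraint_exclusion = partition would use neither kind of constraint, while an appendrel member with the same setting would use only its ordinary constraints.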
On 2019/04/23 7:08, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2019/04/02 2:34, Tom Lane wrote:
>>> I'm not at all clear on what we think the interaction between
>>> enable_partition_pruning and constraint_exclusion ought to be,
>>> so I'm not sure what the appropriate resolution is here. Thoughts?
>
>> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
>> DELETE queries is realized by applying constraint exclusion to the
>> partition constraint of the target partition. The conclusion of the
>> discussion when adding the enable_partition_pruning GUC [1] was that
>> whether or not constraint exclusion is carried out (to facilitate
>> partition pruning) should be governed by the new GUC, not
>> constraint_exclusion, if only to avoid confusing users.
>
> I got back to thinking about how this ought to work.

Thanks a lot for taking time to look at this.

> It appears to me
> that we've got half a dozen different behaviors that depend on one or both
> of these settings:
>
> 1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
> that is relation_excluded_by_constraints for baserels.
> This is enabled by constraint_exclusion = on.
>
> 2. Use of partition constraints in baserel pruning (applicable only
> when a partition is accessed directly).
> This is currently partly broken, and it's what your patch wants to
> change.

Yes. Any fix we come up with for this will need to be back-patched to 11,
because it's a regression introduced in 11 when the then-new partition
pruning feature was committed (9fdb675fc).

> 3. Use of ordinary table constraints in appendrel pruning,
> that is relation_excluded_by_constraints for appendrel members.
> This is enabled by constraint_exclusion >= partition.
>
> 4. Use of partition constraints in appendrel pruning.
> This is enabled by the combination of enable_partition_pruning AND
> constraint_exclusion >= partition. However, it looks to me like this
> is now nearly if not completely useless because of #5.
>
> 5. Use of partition constraints in expand_partitioned_rtentry.
> Enabled by enable_partition_pruning (see prune_append_rel_partitions).

Right, #5 obviates #4.

> 6. Use of partition constraints in run-time partition pruning.
> This is also enabled by enable_partition_pruning, cf
> create_append_plan, create_merge_append_plan.
>
> Now in addition to what I mention above, there are assorted random
> differences in behavior depending on whether we are in an inherited
> UPDATE/DELETE or not. I consider these differences to be so bogus
> that I'm not even going to include them in this taxonomy; they should
> not exist. The UPDATE/DELETE target ought to act the same as a baserel.

The *partition* constraint of UPDATE/DELETE targets would never be refuted
by the query, because we process only those partition targets that remain
after applying partition pruning during the initial planning of the query
as if it were a SELECT. I'm saying we should recognize such targets for
what they are when addressing #2.

Not sure if you'll like it but maybe we could ignore even regular
inheritance child targets that are proven to be empty (is_dummy_rel()) for
a given query during the initial SELECT planning. That way, we can avoid
re-running relation_excluded_by_constraints() a second time for *all*
child target relations.

> I think this is ridiculously overcomplicated even without said random
> differences. I propose that we do the following:
>
> * Get rid of point 4 by not considering partition constraints for
> appendrel members in relation_excluded_by_constraints. It's just
> useless cycles in view of point 5, or nearly so. (Possibly there
> are corner cases where we could prove contradictions between a
> relation's partition constraints and regular constraints ... but is
> it really worth spending planner cycles to look for that?)

I guess not. If a partition constraint contradicts the regular
constraints, there wouldn't be any data in such partitions to begin with,
no?

> * Make point 2 like point 1 by treating partition constraints for
> baserels like ordinary table constraints, ie, they are considered
> only when constraint_exclusion = on (independently of whether
> enable_partition_pruning is on).

Right, enable_partition_pruning should only apply to appendrel pruning.
If a partition is accessed directly, and hence is a baserel to the
planner, we only consider constraint_exclusion and perform constraint
exclusion only if the setting is on.

Another opinion on this is that we treat partition constraints differently
from regular constraints and ignore the setting of constraint_exclusion,
that is, always perform constraint exclusion using partition constraints.

> * Treat an inherited UPDATE/DELETE target table as if it were an
> appendrel member for the purposes of relation_excluded_by_constraints,
> thus removing the behavioral differences between SELECT and UPDATE/DELETE.

As I mentioned above, the planner encounters any given UPDATE/DELETE
*child* target *twice*: once during the initial SELECT planning, and then
again when planning the query with a given child target relation as its
resultRelation. For partition targets, since the initial run only selects
those that survive pruning, their partition constraint need not be
considered in the 2nd encounter as the query's baserel.

Also, it's during the 2nd encounter that we check the inhTargetKind
setting to distinguish partition target baserels from SELECT baserels.
Its value is INHKIND_INHERITED or INHKIND_PARTITIONED for the former,
whereas it's INHKIND_NONE for the latter.

> With this, constraint_exclusion would act pretty much as it traditionally
> has, and in most cases would not have any special impact on partitions
> compared to old-style inheritance. The behaviors that
> enable_partition_pruning would control are expand_partitioned_rtentry
> pruning and run-time pruning, neither of which have any applicability to
> old-style inheritance.

That's right.

Do you want me to update my patch considering the above summary?

Thanks,
Amit
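
The three-point proposal quoted in the message above can be read as a small decision table. What follows is only a sketch of that reading, not planner code: the type and function names (CeSetting, RelKind, ExclusionPolicy, choose_policy) are hypothetical, and the check for constant-FALSE restriction clauses, which the planner performs regardless of the setting, is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the constraint_exclusion GUC values. */
typedef enum { CE_OFF, CE_PARTITION, CE_ON } CeSetting;

/* Hypothetical stand-ins for the two relation kinds discussed. */
typedef enum { REL_BASEREL, REL_OTHER_MEMBER } RelKind;

typedef struct
{
    bool apply_exclusion;    /* run generic constraint exclusion at all? */
    bool include_partition;  /* also use the rel's partition constraint? */
} ExclusionPolicy;

/*
 * Model of the proposed rules:
 *  - an inherited UPDATE/DELETE target counts as an appendrel member;
 *  - appendrel members never have their partition constraint rechecked
 *    (point 4 is dropped, since pruning at expansion already used it);
 *  - a directly named baserel uses its partition constraint only when
 *    constraint_exclusion = on (point 2 behaves like point 1).
 */
ExclusionPolicy
choose_policy(CeSetting ce, RelKind kind, bool is_inherited_target)
{
    ExclusionPolicy policy = {false, false};
    bool appendrel_member = (kind == REL_OTHER_MEMBER) || is_inherited_target;

    switch (ce)
    {
        case CE_OFF:
            /* never examine constraints */
            break;
        case CE_PARTITION:
            /* only appendrel members are examined */
            policy.apply_exclusion = appendrel_member;
            break;
        case CE_ON:
            policy.apply_exclusion = true;
            /* a directly named rel has not had its partition constraint used */
            policy.include_partition = !appendrel_member;
            break;
    }
    return policy;
}
```

Under this model, a partition named directly in a query is examined only with constraint_exclusion = on, while enable_partition_pruning does not enter the decision at all, since it governs pruning at expansion time and at run time.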
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/04/23 7:08, Tom Lane wrote:
>> [ a bunch of stuff ]

> Not sure if you'll like it but maybe we could ignore even regular
> inheritance child targets that are proven to be empty (is_dummy_rel()) for
> a given query during the initial SELECT planning. That way, we can avoid
> re-running relation_excluded_by_constraints() a second time for *all*
> child target relations.

My thought was to keep traditional inheritance working more or less
as it has. To do what you're suggesting, we'd have to move generic
constraint-exclusion logic up into the RTE expansion phase, and I don't
think that's a particularly great idea. I think what we should be
doing is applying partition pruning (which is a very specialized form
of constraint exclusion) during RTE expansion, then applying generic
constraint exclusion in relation_excluded_by_constraints, and not
examining partition constraints again there if we already used them.

> Do you want me to update my patch considering the above summary?

Yes please. However, I wonder whether you're thinking differently in
light of what you wrote in [1]:

>>> Pruning in 10.2 works using internally generated partition constraints
>>> (which for this purpose are same as CHECK constraints). With the new
>>> pruning logic introduced in 11, planner no longer considers partition
>>> constraint because it's redundant to check them in most cases, because
>>> pruning would've selected the right set of partitions. Given that the new
>>> pruning logic is still unable to handle the cases like above, maybe we
>>> could change the planner to consider them, at least until we fix the
>>> pruning logic to handle such cases.

If we take that seriously then it would suggest not ignoring partition
constraints in relation_excluded_by_constraints. However, I'm of the
opinion that we shouldn't let temporary deficiencies in the
partition-pruning logic drive what we do here. I don't think the set
of cases where we could get a win by reconsidering the partition
constraints is large enough to justify the cycles expended in doing so;
and it'll get even smaller as pruning gets smarter.

regards, tom lane

[1] https://www.postgresql.org/message-id/358cd54d-c018-60f8-7d76-55780eef6678@lab.ntt.co.jp
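
The ordering described above (partition pruning during expansion, then generic constraint exclusion that does not look at partition constraints again) can be illustrated with a toy model. This is a sketch under stated assumptions only: ToyPartition, prune_by_bounds, excluded_by_check and plan_scan_count are hypothetical names, each partition accepts a single key value, and each carries one CHECK constraint of the form lo <= b AND b <= hi. The query is assumed to be WHERE a = a_value AND b = b_value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy partition, not a planner structure. */
typedef struct
{
    int bound_value;  /* partition bound: accepts rows with a = bound_value */
    int check_lo;     /* ordinary CHECK (b >= check_lo AND b <= check_hi) */
    int check_hi;
} ToyPartition;

/*
 * Stage 1: pruning at expansion time, using partition bounds only.
 * Writes the indexes of surviving partitions and returns their count.
 */
size_t
prune_by_bounds(const ToyPartition *parts, size_t nparts, int a_value,
                size_t *survivors)
{
    size_t n = 0;

    for (size_t i = 0; i < nparts; i++)
    {
        if (parts[i].bound_value == a_value)
            survivors[n++] = i;
    }
    return n;
}

/*
 * Stage 2: generic constraint exclusion for one survivor. Only the
 * ordinary CHECK constraint is consulted; the partition bound was
 * already used in stage 1 and is not examined again.
 */
bool
excluded_by_check(const ToyPartition *part, int b_value)
{
    return b_value < part->check_lo || b_value > part->check_hi;
}

/* Run both stages; survivors ends up holding the partitions to scan. */
size_t
plan_scan_count(const ToyPartition *parts, size_t nparts,
                int a_value, int b_value, size_t *survivors)
{
    size_t n = prune_by_bounds(parts, nparts, a_value, survivors);
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!excluded_by_check(&parts[survivors[i]], b_value))
            survivors[kept++] = survivors[i];
    }
    return kept;
}
```

The point of the sketch is only the division of labour: stage 2 never touches bound_value, which mirrors the argument that re-proving partition constraints after pruning is mostly wasted effort.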
On Sun, Apr 28, 2019 at 8:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> > Not sure if you'll like it but maybe we could ignore even regular
> > inheritance child targets that are proven to be empty (is_dummy_rel()) for
> > a given query during the initial SELECT planning. That way, we can avoid
> > re-running relation_excluded_by_constraints() a second time for *all*
> > child target relations.
>
> My thought was to keep traditional inheritance working more or less
> as it has. To do what you're suggesting, we'd have to move generic
> constraint-exclusion logic up into the RTE expansion phase, and I don't
> think that's a particularly great idea. I think what we should be
> doing is applying partition pruning (which is a very specialized form
> of constraint exclusion) during RTE expansion, then applying generic
> constraint exclusion in relation_excluded_by_constraints, and not
> examining partition constraints again there if we already used them.

Just to clarify, I wasn't suggesting that we change query_planner(), but
rather the blocks in inheritance_planner() that perform initial planning
as if the query were a SELECT and gather child target relations from that
planner run; that is, the following consecutive blocks:

    /*
     * Before generating the real per-child-relation plans, do a cycle of
     * planning as though the query were a SELECT. ...
     */
    {
        PlannerInfo *subroot;

and:

    /*----------
     * Since only one rangetable can exist in the final plan, we need to make
     * sure that it contains all the RTEs needed for any child plan. ...
    child_appinfos = NIL;
    old_child_rtis = NIL;
    new_child_rtis = NIL;
    parent_relids = bms_make_singleton(top_parentRTindex);
    foreach(lc, select_appinfos)
    {
        AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
        RangeTblEntry *child_rte;

        /* append_rel_list contains all append rels; ignore others */
        if (!bms_is_member(appinfo->parent_relid, parent_relids))
            continue;

        /* remember relevant AppendRelInfos for use below */
        child_appinfos = lappend(child_appinfos, appinfo);

I'm suggesting that we don't add the child relations that are dummy due to
constraint exclusion to child_appinfos. Maybe we'll need to combine the
two blocks so that the latter can use the PlannerInfo defined in the
former to look up the child relation and check whether it is dummy.

> > Do you want me to update my patch considering the above summary?
>
> Yes please.

I will try to get that done hopefully by tomorrow.

(On extended holidays that those of us who are in Japan have around this
time of year.)

> However, I wonder whether you're thinking differently in
> light of what you wrote in [1]:

Thanks for checking that thread.

> >>> Pruning in 10.2 works using internally generated partition constraints
> >>> (which for this purpose are same as CHECK constraints). With the new
> >>> pruning logic introduced in 11, planner no longer considers partition
> >>> constraint because it's redundant to check them in most cases, because
> >>> pruning would've selected the right set of partitions. Given that the new
> >>> pruning logic is still unable to handle the cases like above, maybe we
> >>> could change the planner to consider them, at least until we fix the
> >>> pruning logic to handle such cases.
>
> If we take that seriously then it would suggest not ignoring partition
> constraints in relation_excluded_by_constraints. However, I'm of the
> opinion that we shouldn't let temporary deficiencies in the
> partition-pruning logic drive what we do here. I don't think the set
> of cases where we could get a win by reconsidering the partition
> constraints is large enough to justify the cycles expended in doing so;
> and it'll get even smaller as pruning gets smarter.

Yeah, maybe we could get away with that by telling users to define
equivalent CHECK constraints for corner cases like that, although that's
not really great.

Thanks,
Amit
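
The suggestion in the message above, to leave out child relations that the SELECT-style pass already proved empty when gathering child targets, might look roughly like the following. This is a simplified sketch and not the inheritance_planner() code: ToyChild and collect_child_targets are hypothetical, a single parent index stands in for the parent_relids set (so multi-level hierarchies are not modelled), and the is_dummy flag is assumed to have been filled in by the initial planning pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one appendrel child after the initial pass. */
typedef struct
{
    int  child_relid;   /* range table index of the child */
    int  parent_relid;  /* range table index of its parent */
    bool is_dummy;      /* proven empty during SELECT-style planning */
} ToyChild;

/*
 * Collect the children of top_parent_relid that still need a
 * per-child planning cycle. Children of other append rels are ignored,
 * and children already proven empty are skipped, so that constraint
 * exclusion need not be run on them a second time.
 * Returns the number of relids written to out_relids.
 */
size_t
collect_child_targets(const ToyChild *children, size_t nchildren,
                      int top_parent_relid, int *out_relids)
{
    size_t n = 0;

    for (size_t i = 0; i < nchildren; i++)
    {
        /* the list contains all append rels; ignore others */
        if (children[i].parent_relid != top_parent_relid)
            continue;

        /* the proposed addition: drop children proven empty earlier */
        if (children[i].is_dummy)
            continue;

        out_relids[n++] = children[i].child_relid;
    }
    return n;
}
```

Whether carrying the dummy status across the two blocks is worth the restructuring is exactly the open question in the exchange; the sketch only shows the filtering step itself.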
Amit Langote <amitlangote09@gmail.com> writes:
> Here is the patch. I've also included the patch to update the text in
> ddl.sgml regarding constraint exclusion and partition pruning.

I thought this was a bit messy. In particular, IMV the reason to have a
split between get_relation_constraints and its only caller
relation_excluded_by_constraints is to create a policy vs mechanism
separation: relation_excluded_by_constraints figures out what kinds of
constraints we need to look at, while get_relation_constraints does the
gruntwork of digging them out of the catalog data. Somebody had already
ignored this principle to the extent of putting this very-much-policy
test into get_relation_constraints:

    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)

but the way to improve that is to add another flag parameter to convey
the policy choice, not to move the whole chunk of mechanism out to the
caller.

It also struck me while looking at the code that we were being
unnecessarily stupid about non-inheritable constraints: rather than just
throwing up our hands for traditional inheritance situations, we can
still apply constraint exclusion, as long as we consider only constraints
that aren't marked ccnoinherit. (attnotnull constraints have to be
considered as if they were ccnoinherit, for ordinary tables but not
partitioned ones.)

So, I propose the attached revised patch.

I'm not sure how much of this, if anything, we should back-patch to
v11. It definitely doesn't seem like we should back-patch the
improvement just explained. I tried diking out that change, as in
the v11 variant attached, and found that this still causes quite a
few other changes in v11's expected results, most of them not for the
better. So I'm thinking that we'd better conclude that v11's ship
has sailed. Its behavior is in some ways weird, but I am not sure
that anyone will appreciate our changing it on the fourth minor
release.
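
The policy-versus-mechanism split and the NO INHERIT rule described earlier in this message can be sketched in isolation. The following is a minimal illustration under assumptions, not the patch itself: ToyCheck, ConstraintFlags, choose_flags and count_usable_checks are hypothetical names, and scanning_children stands for the case where the relation is being scanned together with its inheritance children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one CHECK constraint's catalog flags. */
typedef struct
{
    bool valid;      /* fully validated? */
    bool noinherit;  /* marked NO INHERIT? */
} ToyCheck;

/* Policy choices handed from the caller to the mechanism. */
typedef struct
{
    bool include_noinherit;  /* may NO INHERIT constraints be used? */
    bool include_notnull;    /* may attnotnull columns be used? */
} ConstraintFlags;

/*
 * Policy: NO INHERIT constraints say nothing about children, so they are
 * usable only when just this table is scanned. NOT NULL markings are
 * treated the same way, except for partitioned tables.
 */
ConstraintFlags
choose_flags(bool scanning_children, bool is_partitioned_table)
{
    ConstraintFlags flags;

    flags.include_noinherit = !scanning_children;
    flags.include_notnull = !scanning_children || is_partitioned_table;
    return flags;
}

/*
 * Mechanism: count the CHECK constraints that may be used, given the
 * flags. No policy decision is made here; it only applies the flags.
 */
size_t
count_usable_checks(const ToyCheck *checks, size_t nchecks,
                    ConstraintFlags flags)
{
    size_t n = 0;

    for (size_t i = 0; i < nchecks; i++)
    {
        if (!checks[i].valid)
            continue;           /* not yet validated, must ignore */
        if (checks[i].noinherit && !flags.include_noinherit)
            continue;           /* does not hold for the children */
        n++;
    }
    return n;
}
```

Keeping choose_flags separate from count_usable_checks is the design point being argued: a new policy case means another flag, while the code that digs out constraints stays unchanged.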
It's somewhat interesting that we get these other changes in v11 but not HEAD. I think the reason is that we reimplemented so much of inheritance_planner in 428b260f8; that is, it seems the weird decisions we find in relation_excluded_by_constraints are mostly there to band-aid over the old weird behavior of inheritance_planner. Anyway, my current thought is to apply this to HEAD and do nothing in v11. I include the v11 patch just for amusement. (I did not check v11's behavior outside the core regression tests; it might possibly have additional test diffs in contrib.) regards, tom lane diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8ddab75..84341a3 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5084,10 +5084,11 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class=" The allowed values of <varname>constraint_exclusion</varname> are <literal>on</literal> (examine constraints for all tables), <literal>off</literal> (never examine constraints), and - <literal>partition</literal> (examine constraints only for inheritance child - tables and <literal>UNION ALL</literal> subqueries). + <literal>partition</literal> (examine constraints only for inheritance + child tables and <literal>UNION ALL</literal> subqueries). <literal>partition</literal> is the default setting. - It is often used with inheritance tables to improve performance. + It is often used with traditional inheritance trees to improve + performance. </para> <para> @@ -5111,15 +5112,19 @@ SELECT * FROM parent WHERE key = 2400; <para> Currently, constraint exclusion is enabled by default only for cases that are often used to implement table partitioning via - inheritance tables. Turning it on for all tables imposes extra + inheritance trees. Turning it on for all tables imposes extra planning overhead that is quite noticeable on simple queries, and most often will yield no benefit for simple queries. 
If you have no - inheritance partitioned tables you might prefer to turn it off entirely. + tables that are partitioned using traditional inheritance, you might + prefer to turn it off entirely. (Note that the equivalent feature for + partitioned tables is controlled by a separate parameter, + <xref linkend="guc-enable-partition-pruning"/>.) </para> <para> Refer to <xref linkend="ddl-partitioning-constraint-exclusion"/> for - more information on using constraint exclusion and partitioning. + more information on using constraint exclusion to implement + partitioning. </para> </listitem> </varlistentry> diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index cba2ea9..a0a7435 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -4535,24 +4535,11 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01'; <note> <para> - Currently, pruning of partitions during the planning of an - <command>UPDATE</command> or <command>DELETE</command> command is - implemented using the constraint exclusion method (however, it is - controlled by the <literal>enable_partition_pruning</literal> rather than - <literal>constraint_exclusion</literal>) — see the following section - for details and caveats that apply. - </para> - - <para> Execution-time partition pruning currently only occurs for the <literal>Append</literal> and <literal>MergeAppend</literal> node types. It is not yet implemented for the <literal>ModifyTable</literal> node - type. - </para> - - <para> - Both of these behaviors are likely to be changed in a future release - of <productname>PostgreSQL</productname>. + type, but that is likely to be changed in a future release of + <productname>PostgreSQL</productname>. 
</para> </note> </sect2> diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 0a6710c..eb6f5a3 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -1513,8 +1513,9 @@ inheritance_planner(PlannerInfo *root) parent_rte->securityQuals = NIL; /* - * Mark whether we're planning a query to a partitioned table or an - * inheritance parent. + * HACK: setting this to a value other than INHKIND_NONE signals to + * relation_excluded_by_constraints() to treat the result relation as + * being an appendrel member. */ subroot->inhTargetKind = (rootRelation != 0) ? INHKIND_PARTITIONED : INHKIND_INHERITED; diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index 3301331..3215c29 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -67,7 +67,9 @@ static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel, List *idxExprs); static List *get_relation_constraints(PlannerInfo *root, Oid relationObjectId, RelOptInfo *rel, - bool include_notnull); + bool include_noinherit, + bool include_notnull, + bool include_partition); static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index, Relation heapRelation); static List *get_relation_statistics(RelOptInfo *rel, Relation relation); @@ -1134,16 +1136,22 @@ get_relation_data_width(Oid relid, int32 *attr_widths) /* * get_relation_constraints * - * Retrieve the validated CHECK constraint expressions of the given relation. + * Retrieve the applicable constraint expressions of the given relation. * * Returns a List (possibly empty) of constraint expressions. Each one * has been canonicalized, and its Vars are changed to have the varno * indicated by rel->relid. This allows the expressions to be easily * compared to expressions taken from WHERE. * + * If include_noinherit is true, it's okay to include constraints that + * are marked NO INHERIT. 
+ * * If include_notnull is true, "col IS NOT NULL" expressions are generated * and added to the result for each column that's marked attnotnull. * + * If include_partition is true, and the relation is a partition, + * also include the partitioning constraints. + * * Note: at present this is invoked at most once per relation per planner * run, and in many cases it won't be invoked at all, so there seems no * point in caching the data in RelOptInfo. @@ -1151,7 +1159,9 @@ get_relation_data_width(Oid relid, int32 *attr_widths) static List * get_relation_constraints(PlannerInfo *root, Oid relationObjectId, RelOptInfo *rel, - bool include_notnull) + bool include_noinherit, + bool include_notnull, + bool include_partition) { List *result = NIL; Index varno = rel->relid; @@ -1175,10 +1185,13 @@ get_relation_constraints(PlannerInfo *root, /* * If this constraint hasn't been fully validated yet, we must - * ignore it here. + * ignore it here. Also ignore if NO INHERIT and we weren't told + * that that's safe. */ if (!constr->check[i].ccvalid) continue; + if (constr->check[i].ccnoinherit && !include_noinherit) + continue; cexpr = stringToNode(constr->check[i].ccbin); @@ -1243,13 +1256,9 @@ get_relation_constraints(PlannerInfo *root, } /* - * Append partition predicates, if any. - * - * For selects, partition pruning uses the parent table's partition bound - * descriptor, instead of constraint exclusion which is driven by the - * individual partition's partition constraint. + * Add partitioning constraints, if requested. */ - if (enable_partition_pruning && root->parse->commandType != CMD_SELECT) + if (include_partition && relation->rd_rel->relispartition) { List *pcqual = RelationGetPartitionQual(relation); @@ -1366,7 +1375,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation) * * Detect whether the relation need not be scanned because it has either * self-inconsistent restrictions, or restrictions inconsistent with the - * relation's validated CHECK constraints. 
+ * relation's applicable constraints. * * Note: this examines only rel->relid, rel->reloptkind, and * rel->baserestrictinfo; therefore it can be called before filling in @@ -1376,6 +1385,9 @@ bool relation_excluded_by_constraints(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte) { + bool include_noinherit; + bool include_notnull; + bool include_partition = false; List *safe_restrictions; List *constraint_pred; List *safe_constraints; @@ -1385,6 +1397,13 @@ relation_excluded_by_constraints(PlannerInfo *root, Assert(IS_SIMPLE_REL(rel)); /* + * If there are no base restriction clauses, we have no hope of proving + * anything below, so fall out quickly. + */ + if (rel->baserestrictinfo == NIL) + return false; + + /* * Regardless of the setting of constraint_exclusion, detect * constant-FALSE-or-NULL restriction clauses. Because const-folding will * reduce "anything AND FALSE" to just "FALSE", any such case should @@ -1410,35 +1429,41 @@ relation_excluded_by_constraints(PlannerInfo *root, switch (constraint_exclusion) { case CONSTRAINT_EXCLUSION_OFF: - - /* - * Don't prune if feature turned off -- except if the relation is - * a partition. While partprune.c-style partition pruning is not - * yet in use for all cases (update/delete is not handled), it - * would be a UI horror to use different user-visible controls - * depending on such a volatile implementation detail. Therefore, - * for partitioned tables we use enable_partition_pruning to - * control this behavior. - */ - if (root->inhTargetKind == INHKIND_PARTITIONED) - break; + /* In 'off' mode, never make any further tests */ return false; case CONSTRAINT_EXCLUSION_PARTITION: /* * When constraint_exclusion is set to 'partition' we only handle - * OTHER_MEMBER_RELs, or BASERELs in cases where the result target - * is an inheritance parent or a partitioned table. + * appendrel members. 
Normally, they are RELOPT_OTHER_MEMBER_REL + * relations, but we also consider inherited target relations as + * appendrel members for the purposes of constraint exclusion + * (since, indeed, they were appendrel members earlier in + * inheritance_planner). + * + * In both cases, partition pruning was already applied, so there + * is no need to consider the rel's partition constraints here. */ - if ((rel->reloptkind != RELOPT_OTHER_MEMBER_REL) && - !(rel->reloptkind == RELOPT_BASEREL && - root->inhTargetKind != INHKIND_NONE && - rel->relid == root->parse->resultRelation)) - return false; - break; + if (rel->reloptkind == RELOPT_OTHER_MEMBER_REL || + (rel->relid == root->parse->resultRelation && + root->inhTargetKind != INHKIND_NONE)) + break; /* appendrel member, so process it */ + return false; case CONSTRAINT_EXCLUSION_ON: + + /* + * In 'on' mode, always apply constraint exclusion. If we are + * considering a baserel that is a partition (i.e., it was + * directly named rather than expanded from a parent table), then + * its partition constraints haven't been considered yet, so + * include them in the processing here. + */ + if (rel->reloptkind == RELOPT_BASEREL && + !(rel->relid == root->parse->resultRelation && + root->inhTargetKind != INHKIND_NONE)) + include_partition = true; break; /* always try to exclude */ } @@ -1467,24 +1492,33 @@ relation_excluded_by_constraints(PlannerInfo *root, return true; /* - * Only plain relations have constraints. In a partitioning hierarchy, - * but not with regular table inheritance, it's OK to assume that any - * constraints that hold for the parent also hold for every child; for - * instance, table inheritance allows the parent to have constraints - * marked NO INHERIT, but table partitioning does not. We choose to check - * whether the partitioning parents can be excluded here; doing so - * consumes some cycles, but potentially saves us the work of excluding - * each child individually. 
+ * Only plain relations have constraints, so stop here for other rtekinds. */ - if (rte->rtekind != RTE_RELATION || - (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)) + if (rte->rtekind != RTE_RELATION) return false; /* - * OK to fetch the constraint expressions. Include "col IS NOT NULL" - * expressions for attnotnull columns, in case we can refute those. + * If we are scanning just this table, we can use NO INHERIT constraints, + * but not if we're scanning its children too. (Note that partitioned + * tables should never have NO INHERIT constraints; but it's not necessary + * for us to assume that here.) + */ + include_noinherit = !rte->inh; + + /* + * Currently, attnotnull constraints must be treated as NO INHERIT unless + * this is a partitioned table. In future we might track their + * inheritance status more accurately, allowing this to be refined. + */ + include_notnull = (!rte->inh || rte->relkind == RELKIND_PARTITIONED_TABLE); + + /* + * Fetch the appropriate set of constraint expressions. 
*/ - constraint_pred = get_relation_constraints(root, rte->relid, rel, true); + constraint_pred = get_relation_constraints(root, rte->relid, rel, + include_noinherit, + include_notnull, + include_partition); /* * We do not currently enforce that CHECK constraints contain only diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out index 0789b31..bd64bed 100644 --- a/src/test/regress/expected/partition_prune.out +++ b/src/test/regress/expected/partition_prune.out @@ -3639,4 +3639,46 @@ select * from listp where a = (select 2) and b <> 10; -> Result (never executed) (4 rows) +-- +-- check that a partition directly accessed in a query is excluded with +-- constraint_exclusion = on +-- +-- turn off partition pruning, so that it doesn't interfere +set enable_partition_pruning to off; +-- setting constraint_exclusion to 'partition' disables exclusion +set constraint_exclusion to 'partition'; +explain (costs off) select * from listp1 where a = 2; + QUERY PLAN +-------------------- + Seq Scan on listp1 + Filter: (a = 2) +(2 rows) + +explain (costs off) update listp1 set a = 1 where a = 2; + QUERY PLAN +-------------------------- + Update on listp1 + -> Seq Scan on listp1 + Filter: (a = 2) +(3 rows) + +-- constraint exclusion enabled +set constraint_exclusion to 'on'; +explain (costs off) select * from listp1 where a = 2; + QUERY PLAN +-------------------------- + Result + One-Time Filter: false +(2 rows) + +explain (costs off) update listp1 set a = 1 where a = 2; + QUERY PLAN +-------------------------------- + Update on listp1 + -> Result + One-Time Filter: false +(3 rows) + +reset constraint_exclusion; +reset enable_partition_pruning; drop table listp; diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql index c30e58e..246c627 100644 --- a/src/test/regress/sql/partition_prune.sql +++ b/src/test/regress/sql/partition_prune.sql @@ -990,4 +990,24 @@ create table listp2_10 
partition of listp2 for values in (10); explain (analyze, costs off, summary off, timing off) select * from listp where a = (select 2) and b <> 10; +-- +-- check that a partition directly accessed in a query is excluded with +-- constraint_exclusion = on +-- + +-- turn off partition pruning, so that it doesn't interfere +set enable_partition_pruning to off; + +-- setting constraint_exclusion to 'partition' disables exclusion +set constraint_exclusion to 'partition'; +explain (costs off) select * from listp1 where a = 2; +explain (costs off) update listp1 set a = 1 where a = 2; +-- constraint exclusion enabled +set constraint_exclusion to 'on'; +explain (costs off) select * from listp1 where a = 2; +explain (costs off) update listp1 set a = 1 where a = 2; + +reset constraint_exclusion; +reset enable_partition_pruning; + drop table listp; diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d94d033..37e21e8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4410,10 +4410,11 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class=" The allowed values of <varname>constraint_exclusion</varname> are <literal>on</literal> (examine constraints for all tables), <literal>off</literal> (never examine constraints), and - <literal>partition</literal> (examine constraints only for inheritance child - tables and <literal>UNION ALL</literal> subqueries). + <literal>partition</literal> (examine constraints only for inheritance + child tables and <literal>UNION ALL</literal> subqueries). <literal>partition</literal> is the default setting. - It is often used with inheritance tables to improve performance. + It is often used with traditional inheritance trees to improve + performance. </para> <para> @@ -4437,15 +4438,19 @@ SELECT * FROM parent WHERE key = 2400; <para> Currently, constraint exclusion is enabled by default only for cases that are often used to implement table partitioning via - inheritance tables. 
Turning it on for all tables imposes extra + inheritance trees. Turning it on for all tables imposes extra planning overhead that is quite noticeable on simple queries, and most often will yield no benefit for simple queries. If you have no - inheritance partitioned tables you might prefer to turn it off entirely. + tables that are partitioned using traditional inheritance, you might + prefer to turn it off entirely. (Note that the equivalent feature for + partitioned tables is controlled by a separate parameter, + <xref linkend="guc-enable-partition-pruning"/>.) </para> <para> Refer to <xref linkend="ddl-partitioning-constraint-exclusion"/> for - more information on using constraint exclusion and partitioning. + more information on using constraint exclusion to implement + partitioning. </para> </listitem> </varlistentry> diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 59685d7..f53e3c6 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3918,22 +3918,11 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01'; <note> <para> - Currently, pruning of partitions during the planning of an - <command>UPDATE</command> or <command>DELETE</command> command is - implemented using the constraint exclusion method (however, it is - controlled by the <literal>enable_partition_pruning</literal> rather than - <literal>constraint_exclusion</literal>) — see the following section - for details and caveats that apply. - </para> - - <para> - Also, execution-time partition pruning currently only occurs for the - <literal>Append</literal> node type, not <literal>MergeAppend</literal>. - </para> - - <para> - Both of these behaviors are likely to be changed in a future release - of <productname>PostgreSQL</productname>. + Execution-time partition pruning currently only occurs for the + <literal>Append</literal> node type, not + for <literal>MergeAppend</literal> or <literal>ModifyTable</literal> + nodes. 
That is likely to be changed in a future release of
+      <productname>PostgreSQL</productname>.
      </para>
     </note>
    </sect2>
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 94b962b..0f46914 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1324,8 +1324,9 @@ inheritance_planner(PlannerInfo *root)
         parent_rte->securityQuals = NIL;
 
         /*
-         * Mark whether we're planning a query to a partitioned table or an
-         * inheritance parent.
+         * HACK: setting this to a value other than INHKIND_NONE signals to
+         * relation_excluded_by_constraints() to treat the result relation as
+         * being an appendrel member.
          */
         subroot->inhTargetKind =
             partitioned_relids ? INHKIND_PARTITIONED : INHKIND_INHERITED;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3a..2453953 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -66,7 +66,9 @@ static bool infer_collation_opclass_match(InferenceElem *elem, Relation idxRel,
 static int32 get_rel_data_width(Relation rel, int32 *attr_widths);
 static List *get_relation_constraints(PlannerInfo *root,
                          Oid relationObjectId, RelOptInfo *rel,
-                         bool include_notnull);
+                         bool include_noinherit,
+                         bool include_notnull,
+                         bool include_partition);
 static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index,
                   Relation heapRelation);
 static List *get_relation_statistics(RelOptInfo *rel, Relation relation);
@@ -1157,16 +1159,22 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
 /*
  * get_relation_constraints
  *
- * Retrieve the validated CHECK constraint expressions of the given relation.
+ * Retrieve the applicable constraint expressions of the given relation.
  *
  * Returns a List (possibly empty) of constraint expressions. Each one
  * has been canonicalized, and its Vars are changed to have the varno
  * indicated by rel->relid. This allows the expressions to be easily
  * compared to expressions taken from WHERE.
  *
+ * If include_noinherit is true, it's okay to include constraints that
+ * are marked NO INHERIT.
+ *
  * If include_notnull is true, "col IS NOT NULL" expressions are generated
  * and added to the result for each column that's marked attnotnull.
  *
+ * If include_partition is true, and the relation is a partition,
+ * also include the partitioning constraints.
+ *
  * Note: at present this is invoked at most once per relation per planner
  * run, and in many cases it won't be invoked at all, so there seems no
  * point in caching the data in RelOptInfo.
@@ -1174,7 +1182,9 @@ get_relation_data_width(Oid relid, int32 *attr_widths)
 static List *
 get_relation_constraints(PlannerInfo *root,
                          Oid relationObjectId, RelOptInfo *rel,
-                         bool include_notnull)
+                         bool include_noinherit,
+                         bool include_notnull,
+                         bool include_partition)
 {
     List       *result = NIL;
     Index       varno = rel->relid;
@@ -1198,10 +1208,13 @@ get_relation_constraints(PlannerInfo *root,
 
             /*
              * If this constraint hasn't been fully validated yet, we must
-             * ignore it here.
+             * ignore it here. Also ignore if NO INHERIT and we weren't told
+             * that that's safe.
              */
             if (!constr->check[i].ccvalid)
                 continue;
+            if (constr->check[i].ccnoinherit && !include_noinherit)
+                continue;
 
             cexpr = stringToNode(constr->check[i].ccbin);
 
@@ -1266,13 +1279,9 @@ get_relation_constraints(PlannerInfo *root,
     }
 
     /*
-     * Append partition predicates, if any.
-     *
-     * For selects, partition pruning uses the parent table's partition bound
-     * descriptor, instead of constraint exclusion which is driven by the
-     * individual partition's partition constraint.
+     * Add partitioning constraints, if requested.
      */
-    if (enable_partition_pruning && root->parse->commandType != CMD_SELECT)
+    if (include_partition && relation->rd_rel->relispartition)
     {
         List       *pcqual = RelationGetPartitionQual(relation);
 
@@ -1377,7 +1386,7 @@ get_relation_statistics(RelOptInfo *rel, Relation relation)
  *
  * Detect whether the relation need not be scanned because it has either
  * self-inconsistent restrictions, or restrictions inconsistent with the
- * relation's validated CHECK constraints.
+ * relation's applicable constraints.
  *
  * Note: this examines only rel->relid, rel->reloptkind, and
  * rel->baserestrictinfo; therefore it can be called before filling in
@@ -1387,6 +1396,9 @@ bool
 relation_excluded_by_constraints(PlannerInfo *root,
                                  RelOptInfo *rel, RangeTblEntry *rte)
 {
+    bool        include_noinherit;
+    bool        include_notnull;
+    bool        include_partition = false;
     List       *safe_restrictions;
     List       *constraint_pred;
     List       *safe_constraints;
@@ -1396,6 +1408,13 @@ relation_excluded_by_constraints(PlannerInfo *root,
     Assert(IS_SIMPLE_REL(rel));
 
     /*
+     * If there are no base restriction clauses, we have no hope of proving
+     * anything below, so fall out quickly.
+     */
+    if (rel->baserestrictinfo == NIL)
+        return false;
+
+    /*
      * Regardless of the setting of constraint_exclusion, detect
      * constant-FALSE-or-NULL restriction clauses. Because const-folding will
      * reduce "anything AND FALSE" to just "FALSE", any such case should
@@ -1421,35 +1440,41 @@ relation_excluded_by_constraints(PlannerInfo *root,
     switch (constraint_exclusion)
     {
         case CONSTRAINT_EXCLUSION_OFF:
-
-            /*
-             * Don't prune if feature turned off -- except if the relation is
-             * a partition. While partprune.c-style partition pruning is not
-             * yet in use for all cases (update/delete is not handled), it
-             * would be a UI horror to use different user-visible controls
-             * depending on such a volatile implementation detail. Therefore,
-             * for partitioned tables we use enable_partition_pruning to
-             * control this behavior.
-             */
-            if (root->inhTargetKind == INHKIND_PARTITIONED)
-                break;
+            /* In 'off' mode, never make any further tests */
             return false;
 
         case CONSTRAINT_EXCLUSION_PARTITION:
 
             /*
              * When constraint_exclusion is set to 'partition' we only handle
-             * OTHER_MEMBER_RELs, or BASERELs in cases where the result target
-             * is an inheritance parent or a partitioned table.
+             * appendrel members. Normally, they are RELOPT_OTHER_MEMBER_REL
+             * relations, but we also consider inherited target relations as
+             * appendrel members for the purposes of constraint exclusion
+             * (since, indeed, they were appendrel members earlier in
+             * inheritance_planner).
+             *
+             * In both cases, partition pruning was already applied, so there
+             * is no need to consider the rel's partition constraints here.
              */
-            if ((rel->reloptkind != RELOPT_OTHER_MEMBER_REL) &&
-                !(rel->reloptkind == RELOPT_BASEREL &&
-                  root->inhTargetKind != INHKIND_NONE &&
-                  rel->relid == root->parse->resultRelation))
-                return false;
-            break;
+            if (rel->reloptkind == RELOPT_OTHER_MEMBER_REL ||
+                (rel->relid == root->parse->resultRelation &&
+                 root->inhTargetKind != INHKIND_NONE))
+                break;          /* appendrel member, so process it */
+            return false;
 
         case CONSTRAINT_EXCLUSION_ON:
+
+            /*
+             * In 'on' mode, always apply constraint exclusion. If we are
+             * considering a baserel that is a partition (i.e., it was
+             * directly named rather than expanded from a parent table), then
+             * its partition constraints haven't been considered yet, so
+             * include them in the processing here.
+             */
+            if (rel->reloptkind == RELOPT_BASEREL &&
+                !(rel->relid == root->parse->resultRelation &&
+                  root->inhTargetKind != INHKIND_NONE))
+                include_partition = true;
             break;              /* always try to exclude */
     }
 
@@ -1478,24 +1503,40 @@ relation_excluded_by_constraints(PlannerInfo *root,
         return true;
 
     /*
-     * Only plain relations have constraints. In a partitioning hierarchy,
-     * but not with regular table inheritance, it's OK to assume that any
-     * constraints that hold for the parent also hold for every child; for
-     * instance, table inheritance allows the parent to have constraints
-     * marked NO INHERIT, but table partitioning does not. We choose to check
-     * whether the partitioning parents can be excluded here; doing so
-     * consumes some cycles, but potentially saves us the work of excluding
-     * each child individually.
+     * Only plain relations have constraints, so stop here for other rtekinds.
      */
-    if (rte->rtekind != RTE_RELATION ||
-        (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE))
+    if (rte->rtekind != RTE_RELATION)
         return false;
 
     /*
-     * OK to fetch the constraint expressions. Include "col IS NOT NULL"
-     * expressions for attnotnull columns, in case we can refute those.
+     * In a partitioning hierarchy, but not with regular table inheritance,
+     * it's OK to assume that any constraints that hold for the parent also
+     * hold for every child; for instance, table inheritance allows the parent
+     * to have constraints marked NO INHERIT, but table partitioning does not.
+     * We choose to check whether the partitioning parents can be excluded
+     * here; doing so consumes some cycles, but potentially saves us the work
+     * of excluding each child individually.
+     *
+     * This is unnecessarily stupid, but making it smarter seems out of scope
+     * for v11.
+     */
+    if (rte->inh && rte->relkind != RELKIND_PARTITIONED_TABLE)
+        return false;
+
+    /*
+     * Given the above restriction, we can always include NO INHERIT and NOT
+     * NULL constraints.
+     */
+    include_noinherit = true;
+    include_notnull = true;
+
+    /*
+     * Fetch the appropriate set of constraint expressions.
      */
-    constraint_pred = get_relation_constraints(root, rte->relid, rel, true);
+    constraint_pred = get_relation_constraints(root, rte->relid, rel,
+                                               include_noinherit,
+                                               include_notnull,
+                                               include_partition);
 
     /*
      * We do not currently enforce that CHECK constraints contain only
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 078b5fd..0e9373c 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -1681,6 +1681,8 @@ WHERE EXISTS (
 ---------------------------------------------------------------
  Delete on prt1_l
    Delete on prt1_l_p1
+   Delete on prt1_l_p2_p1
+   Delete on prt1_l_p2_p2
    Delete on prt1_l_p3_p1
    Delete on prt1_l_p3_p2
    ->  Nested Loop Semi Join
@@ -1692,7 +1694,7 @@ WHERE EXISTS (
                      ->  Limit
                            ->  Seq Scan on int8_tbl
    ->  Nested Loop Semi Join
-         ->  Seq Scan on prt1_l_p3_p1
+         ->  Seq Scan on prt1_l_p2_p1
                Filter: (c IS NULL)
          ->  Nested Loop
                ->  Seq Scan on int4_tbl
@@ -1700,14 +1702,30 @@ WHERE EXISTS (
                      ->  Limit
                            ->  Seq Scan on int8_tbl int8_tbl_1
    ->  Nested Loop Semi Join
-         ->  Seq Scan on prt1_l_p3_p2
+         ->  Seq Scan on prt1_l_p2_p2
                Filter: (c IS NULL)
          ->  Nested Loop
                ->  Seq Scan on int4_tbl
                ->  Subquery Scan on ss_2
                      ->  Limit
                            ->  Seq Scan on int8_tbl int8_tbl_2
-(28 rows)
+   ->  Nested Loop Semi Join
+         ->  Seq Scan on prt1_l_p3_p1
+               Filter: (c IS NULL)
+         ->  Nested Loop
+               ->  Seq Scan on int4_tbl
+               ->  Subquery Scan on ss_3
+                     ->  Limit
+                           ->  Seq Scan on int8_tbl int8_tbl_3
+   ->  Nested Loop Semi Join
+         ->  Seq Scan on prt1_l_p3_p2
+               Filter: (c IS NULL)
+         ->  Nested Loop
+               ->  Seq Scan on int4_tbl
+               ->  Subquery Scan on ss_4
+                     ->  Limit
+                           ->  Seq Scan on int8_tbl int8_tbl_4
+(46 rows)
 
 --
 -- negative testcases
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 79e29e7..00f076b 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3047,18 +3047,24 @@ explain (costs off) update pp_arrpart set a = a where a = '{1}';
 ----------------------------------------
  Update on pp_arrpart
    Update on pp_arrpart1
+   Update on pp_arrpart2
    ->  Seq Scan on pp_arrpart1
          Filter: (a = '{1}'::integer[])
-(4 rows)
+   ->  Seq Scan on pp_arrpart2
+         Filter: (a = '{1}'::integer[])
+(7 rows)
 
 explain (costs off) delete from pp_arrpart where a = '{1}';
                QUERY PLAN
 ----------------------------------------
  Delete on pp_arrpart
    Delete on pp_arrpart1
+   Delete on pp_arrpart2
    ->  Seq Scan on pp_arrpart1
          Filter: (a = '{1}'::integer[])
-(4 rows)
+   ->  Seq Scan on pp_arrpart2
+         Filter: (a = '{1}'::integer[])
+(7 rows)
 
 drop table pp_arrpart;
 -- array type hash partition key
@@ -3184,18 +3190,24 @@ explain (costs off) update pp_lp set value = 10 where a = 1;
 --------------------------
  Update on pp_lp
    Update on pp_lp1
+   Update on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-(4 rows)
+   ->  Seq Scan on pp_lp2
+         Filter: (a = 1)
+(7 rows)
 
 explain (costs off) delete from pp_lp where a = 1;
         QUERY PLAN
 --------------------------
  Delete on pp_lp
    Delete on pp_lp1
+   Delete on pp_lp2
    ->  Seq Scan on pp_lp1
          Filter: (a = 1)
-(4 rows)
+   ->  Seq Scan on pp_lp2
+         Filter: (a = 1)
+(7 rows)
 
 set enable_partition_pruning = off;
 set constraint_exclusion = 'partition'; -- this should not affect the result.
@@ -3417,4 +3429,46 @@ select * from listp where a = (select 2) and b <> 10;
          Filter: ((b <> 10) AND (a = $0))
 (5 rows)
 
+--
+-- check that a partition directly accessed in a query is excluded with
+-- constraint_exclusion = on
+--
+-- turn off partition pruning, so that it doesn't interfere
+set enable_partition_pruning to off;
+-- setting constraint_exclusion to 'partition' disables exclusion
+set constraint_exclusion to 'partition';
+explain (costs off) select * from listp1 where a = 2;
+     QUERY PLAN
+--------------------
+ Seq Scan on listp1
+   Filter: (a = 2)
+(2 rows)
+
+explain (costs off) update listp1 set a = 1 where a = 2;
+        QUERY PLAN
+--------------------------
+ Update on listp1
+   ->  Seq Scan on listp1
+         Filter: (a = 2)
+(3 rows)
+
+-- constraint exclusion enabled
+set constraint_exclusion to 'on';
+explain (costs off) select * from listp1 where a = 2;
+        QUERY PLAN
+--------------------------
+ Result
+   One-Time Filter: false
+(2 rows)
+
+explain (costs off) update listp1 set a = 1 where a = 2;
+           QUERY PLAN
+--------------------------------
+ Update on listp1
+   ->  Result
+         One-Time Filter: false
+(3 rows)
+
+reset constraint_exclusion;
+reset enable_partition_pruning;
 drop table listp;
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 6aecf25..eafbec6 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -899,4 +899,24 @@ create table listp2_10 partition of listp2 for values in (10);
 explain (analyze, costs off, summary off, timing off)
 select * from listp where a = (select 2) and b <> 10;
 
+--
+-- check that a partition directly accessed in a query is excluded with
+-- constraint_exclusion = on
+--
+
+-- turn off partition pruning, so that it doesn't interfere
+set enable_partition_pruning to off;
+
+-- setting constraint_exclusion to 'partition' disables exclusion
+set constraint_exclusion to 'partition';
+explain (costs off) select * from listp1 where a = 2;
+explain (costs off) update listp1 set a = 1 where a = 2;
+-- constraint exclusion enabled
+set constraint_exclusion to 'on';
+explain (costs off) select * from listp1 where a = 2;
+explain (costs off) update listp1 set a = 1 where a = 2;
+
+reset constraint_exclusion;
+reset enable_partition_pruning;
+
 drop table listp;
Amit Langote <amitlangote09@gmail.com> writes:
> On Tue, Apr 30, 2019 at 1:26 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> It would be nice if at least we fix the bug that directly accessed
>> partitions are not excluded with constraint_exclusion = on, without
>> removing PG 11's contortions in relation_excluded_by_constraints to
>> work around the odd requirements imposed by inheritance_planner, which
>> is causing the additional diffs in the regression expected output.

> FWIW, attached is a delta patch that applies on top of your patch for
> v11 branch that shows what may be one way to go about this.

OK, I tweaked that a bit and pushed both versions.

regards, tom lane
On Wed, May 1, 2019 at 4:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> OK, I tweaked that a bit and pushed both versions.

Thank you.

Regards,
Amit