Thread: Small performance tweak to run-time partition pruning
While reviewing some other patches to improve partitioning performance, I noticed that one of the loops in ExecFindInitialMatchingSubPlans() could be coded a bit more efficiently. The current code loops over all the original subplans checking if each subplan is newly pruned; if it is, the code sets the new_subplan_indexes array element to -1, otherwise it assigns the new subplan index. This can be done more efficiently if we make this array 1-based, initialise the whole thing to 0, and then loop over only the non-pruned subplans instead of all subplans. Pruning all but 1 subplan is quite common.

In profiles, I'd seen ExecFindInitialMatchingSubPlans() consume about 5.2% of CPU time. With the patch that dropped to 0.72%. A quick test with just 300 partitions shows about a 2.3% performance improvement. Hardly groundbreaking, but it seems like a small enough change to be worth it.

The test was conducted as follows:

postgresql.conf:
plan_cache_mode = 'force_generic_plan'
max_parallel_workers_per_gather = 0

setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION BY RANGE (id);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (' || (x*100000)::text || ') TO (' || ((x+1)*100000)::text || ');' from generate_Series(0,299) x;
\gexec
\o

select.sql:
\set p_id 29999999
select * from partbench where id = :p_id;

Test:
$ pgbench -n -f select.sql -M prepared -T 60 postgres

Unpatched:
tps = 6946.940678 (excluding connections establishing)
tps = 6913.993655 (excluding connections establishing)
tps = 6854.693214 (excluding connections establishing)

Patched:
tps = 7066.854267 (excluding connections establishing)
tps = 7082.890458 (excluding connections establishing)
tps = 7052.255429 (excluding connections establishing)

Patch attached. I'll park this here until the November 'fest.
I've also included an additional test to ensure that other_subplans gets updated correctly. The other tests for this seem to only perform run-time pruning during init plan and do no further pruning, so they don't fully test that other_subplans gets updated correctly.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 September 2018 at 19:29, David Rowley <david.rowley@2ndquadrant.com> wrote:
> While reviewing some other patches to improve partitioning performance
> I noticed that one of the loops in ExecFindInitialMatchingSubPlans()
> could be coded a bit more efficiently. The current code loops over
> all the original subplans checking if each subplan is newly pruned; if
> it is, the code sets the new_subplan_indexes array element to -1,
> otherwise it assigns the new subplan index. This can be done more
> efficiently if we make this array 1-based and initialise the whole
> thing to 0, then just loop over the non-pruned subplans instead of all
> subplans. Pruning all but 1 subplan is quite common.

I was looking at this again and I realised that we can completely skip the re-sequencing of the subplan map when we're not going to perform any further pruning during execution.

We could possibly also skip making a copy of the subplan_map entirely in this case in ExecCreatePartitionPruneState(), and just take the planner's copy verbatim as we do for the subpart_map. I was just unable to see any performance gains from doing this, so I've left it for now.

Currently, this improves performance by about 2% with prepared queries and 300 partitions.

Patched:
tps = 5169.169452 (excluding connections establishing)
tps = 5155.914286 (excluding connections establishing)

Unpatched:
tps = 5059.511370 (excluding connections establishing)
tps = 5082.851062 (excluding connections establishing)

However, with other patches to remove partitioning bottlenecks in the executor, the TPS goes to about 25,000, so 2% becomes 10%, which seems more meaningful.

I've attached an updated patch which skips the re-sequencing work when it is not required for anything.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, David. Thanks for the patch!

On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> I was looking at this again and I realised that we can completely skip
> the re-sequence of the subplan map when we're not going to perform any
> further pruning during execution.

I checked the code and I think so too.

To confirm my understanding, and for the information of others: the subplan map is used when we call ExecFindInitialMatchingSubPlans or ExecFindMatchingSubPlans. ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append. ExecFindMatchingSubPlans is called from the four functions below, which run after ExecInit(Merge)Append, and only when as_valid_subplans (or ms_valid_subplans) is NULL:

1. choose_next_subplan_locally(AppendState *node)
2. choose_next_subplan_for_leader(AppendState *node)
3. choose_next_subplan_for_worker(AppendState *node)
4. ExecMergeAppend(PlanState *pstate)

as_valid_subplans (or ms_valid_subplans) is set in ExecInit(Merge)Append if pruning during execution is not required. That's why we can completely skip the re-sequencing of the subplan map when we're not going to perform any further pruning during execution.

> I've attached an updated patch which skips the re-sequence work when
> doing that is not required for anything.

I saw the patch and the code seems good to me. I still haven't checked the additional test cases in the patch.

BTW, while looking at the code, I thought there is a further possible improvement in ExecInitAppend, described below.

j = i = 0;
firstvalid = nplans;
foreach(lc, node->appendplans)
{
    if (bms_is_member(i, validsubplans))
    {
        Plan *initNode = (Plan *) lfirst(lc);

        /*
         * Record the lowest appendplans index which is a valid partial
         * plan.
         */
        if (i >= node->first_partial_plan && j < firstvalid)
            firstvalid = j;

        appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
    }
    i++;
}

If the number of valid subplans is small compared to node->appendplans, it is a waste to check bms_is_member(i, validsubplans) for every element of node->appendplans. Of course, node->appendplans is a List, and we have to walk it to find each valid plan, so we can't simply use a while (bms_next_member(i, validsubplans)) loop. I don't have any good idea for it now, but can we improve it?

-- Yoshikazu Imai
On 9 October 2018 at 14:23, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote:
> I checked codes and I think so too.
>
> Confirmation of my understanding and for the information of others:
>
> The subplan map is used when we call ExecFindInitialMatchingSubPlans or
> ExecFindMatchingSubPlans.
> ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append.
> ExecFindMatchingSubPlans is called from the four functions below, which
> run after ExecInit(Merge)Append, and only when
> as_valid_subplans (or ms_valid_subplans) is NULL.

Thanks for looking at this.

Yeah, so subplan_map is just an array that stores the List index of the Append or MergeAppend's subplans. When we perform run-time pruning during executor initialisation, if we prune away some of these subplans then we don't initialise those pruned subplans at all, which results in missing executor nodes for those plans. Instead of maintaining an array with a bunch of NULLs in it to represent the pruned subnodes, for performance reasons, we make a gapless array and store them in there. This means that for the run-time pruning that we could do during actual execution (ExecFindMatchingSubPlans), the old subplan_map would be out of date, as it would contain the indexes of the subplans as if we'd not pruned any.

We can simply not bother adjusting the subplan_map if no further pruning is required. This could leave the map pointing to subplans that don't exist, but we only need to care about that when the map is actually going to be used for something. The good news is, we know in advance if the map will be used again.

> I saw the patch and it seems good to me about the codes.
> I still couldn't check additional test cases in the patch.
>
> BTW, when I was looking at the codes, I thought there is a further
> possible improvement in ExecInitAppend, described below.
>
> j = i = 0;
> firstvalid = nplans;
> foreach(lc, node->appendplans)
> {
>     if (bms_is_member(i, validsubplans))
>     {
>         Plan *initNode = (Plan *) lfirst(lc);
>
>         /*
>          * Record the lowest appendplans index which is a valid partial
>          * plan.
>          */
>         if (i >= node->first_partial_plan && j < firstvalid)
>             firstvalid = j;
>
>         appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
>     }
>     i++;
> }
>
> If the number of valid subplans is small compared to node->appendplans,
> it is a waste to check bms_is_member(i, validsubplans) for all
> node->appendplans. Of course, node->appendplans is a List, and we have
> to walk it to find each valid plan, so we can't simply use a
> while (bms_next_member(i, validsubplans)) loop.
> I don't have any good idea for it now, but can we improve it?

I do have other ideas for that, but I'm not ready to share code yet; it basically requires reimplementing List to use arrays as the underlying implementation. This would allow a bms_next_member() type loop, and list_nth() would be O(1) instead of O(N).

It might be possible to squeeze a bit more performance out of this code with the current List implementation, but it might actually slow performance in some cases, for example, if the only surviving partition was one of the last ones in the List. Getting the final element with list_nth() is optimized, but consider a time-based, say monthly, RANGE partitioned table: a DBA might maintain a partition ready for the next month of data, so it might be very common to access the 2nd-last element in the list for all queries looking at "this month's" data. In that case, list_nth(), in its current form, is as slow as can be.

You'd also need to do a bms_num_members() or bms_get_singleton_member() in order to decide if the alternative method is going to be any faster. That call is not going to be free. It might also be possible to form the loop so that it calls bms_next_member(), stores the result, and loops until we reach that number. That would only save the bms_is_member() call per loop, but I'd much rather do the array idea as it should also speed up plenty of other things.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 9, 2018 at 2:02 AM, David Rowley wrote:
> Yeah, so subplan_map is just an array that stores the List index of
> the Append or MergeAppend's subplans. When we perform run-time pruning
> during executor initialisation, if we prune away some of these
> subplans then we don't initialise those pruned subplans at all, which
> results in missing executor nodes for those plans. Instead of
> maintaining an array with a bunch of NULLs in it to represent the
> pruned subnodes, for performance reasons, we make a gapless array and
> store them in there. This means that for the run-time pruning that we
> could do during actual execution (ExecFindMatchingSubPlans), the old
> subplan_map would be out of date, as it would contain the indexes of
> the subplans as if we'd not pruned any. We can simply not bother
> adjusting the subplan_map if no further pruning is required. This
> could leave the map pointing to subplans that don't exist, but we only
> need to care about that when the map is actually going to be used for
> something. The good news is, we know in advance if the map will be
> used again.

Thanks for the detailed explanation! I understand it fully now.

> > I saw the patch and it seems good to me about the codes.
> > I still couldn't check additional test cases in the patch.
> >
> > BTW, when I was looking at the codes, I thought there is a further
> > possible improvement in ExecInitAppend, described below.
> >
> > [the ExecInitAppend loop over node->appendplans]
> >
> > If the number of valid subplans is small compared to node->appendplans,
> > it is a waste to check bms_is_member(i, validsubplans) for all
> > node->appendplans. Of course, node->appendplans is a List, and we have
> > to walk it to find each valid plan, so we can't simply use a
> > while (bms_next_member(i, validsubplans)) loop.
> > I don't have any good idea for it now, but can we improve it?
>
> I do have other ideas for that, but I'm not ready to share code yet; it
> basically requires reimplementing List to use arrays as the
> underlying implementation. This would allow a bms_next_member() type
> loop, and list_nth() would be O(1) instead of O(N).

Great! So we might lose a little on memory use, but we gain a large performance improvement.

> It might be possible to squeeze a bit more performance out of this
> code with the current List implementation, but it might actually
> slow performance in some cases, for example, if the only surviving
> partition was one of the last ones in the List. Getting the final
> element with list_nth() is optimized, but consider a time-based, say
> monthly, RANGE partitioned table: a DBA might maintain a partition
> ready for the next month of data, so it might be very common to access
> the 2nd-last element in the list for all queries looking at "this
> month's" data. In that case, list_nth(), in its current form, is as
> slow as can be.

I understand that accessing the 2nd-last element is slow in the current List implementation, but I don't know whether your new implementation would also be slow in that case. I may have misunderstood the "squeeze ~ out of ~ with ~" sentence.

> You'd also need to do a bms_num_members() or
> bms_get_singleton_member(), in order to decide if the alternative
> method is going to be any faster. That call is not going to be free.

Yeah, I would need to use bms_num_members() or bms_get_singleton_member() to check that the number of valid subplans is small enough, compared to the number of append plans, for the alternative method to be faster.

> It might also be possible to form the loop so that it calls
> bms_next_member(), stores the result, and loops until we reach that
> number. That would only save the bms_is_member() call per loop, but
> I'd much rather do the array idea as it should also speed up plenty of
> other things.

Actually, it is possible to refactor that loop with the method you described, but it would be complex and hacky code just to save the wasted looping, so I also like your array idea.

-- Yoshikazu Imai
On Tue, Oct 9, 2018 at 1:24 AM, I wrote:
> On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> > I've attached an updated patch which skips the re-sequence work when
> > doing that is not required for anything.
>
> I saw the patch and the code seems good to me.
> I still haven't checked the additional test cases in the patch.

I have now checked the additional test, which is:

On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
> I've also included an additional test to ensure the other_subplans
> gets updated correctly. The other tests for this seem to only perform
> run-time pruning during init plan and do no further pruning, so don't
> fully test that other_subplans gets updated correctly.

I executed the SQL in this test under gdb and confirmed that it does test that other_subplans gets updated correctly. It performs exec run-time pruning and actually goes through the code in the patch which updates other_subplans.

I also ran "check world" on the latest master (e9edc1ba0b) and all tests passed successfully.

It seems to me that there is no problem in this patch so far. Is there anything else I need to do for the review?

-- Yoshikazu Imai
On 11 October 2018 at 16:00, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote:
> On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
>> I've also included an additional test to ensure the other_subplans
>> gets updated correctly. The other tests for this seem to only perform
>> run-time pruning during init plan and do no further pruning, so don't
>> fully test that other_subplans gets updated correctly.
>
> I execute the sql in this test with gdb and confirmed that it tests
> other_subplans gets updated correctly. It also performs exec run-time pruning
> and actually is through the codes in the patch which update other_subplans.
>
> I also did "check world" at the latest master e9edc1ba0b and all tests passed
> successfully.

Many thanks for checking this in detail.

> It seems to me that there is no problem in this patch as far.
> Is there another thing I have to do for the review?

There's a checklist in [1]. Perhaps there's something mentioned there that you've missed.

[1] https://wiki.postgresql.org/wiki/Reviewing_a_Patch

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Sorry for the delay in replying.

On Wed, Oct 10, 2018 at 4:05 PM, David Rowley wrote:
> > It seems to me that there is no problem in this patch as far.
> > Is there another thing I have to do for the review?
>
> There's a checklist in [1]. Perhaps there's something mentioned there
> that you've missed.
>
> [1] https://wiki.postgresql.org/wiki/Reviewing_a_Patch

Thanks for the URL.

I did a performance test which is almost the same as yours, except that I changed select.sql to "\set p_id 1" for another performance test whose result I will send in my next mail. That probably doesn't affect the performance, so it should be ok. I tested with master (28d750c) + the v2 patch.

[Unpatched] ave 3669 TPS
tps = 3700.319242 (excluding connections establishing)
tps = 3642.287089 (excluding connections establishing)
tps = 3668.243399 (excluding connections establishing)
tps = 3689.457722 (excluding connections establishing)
tps = 3714.309178 (excluding connections establishing)
tps = 3697.488958 (excluding connections establishing)
tps = 3573.372327 (excluding connections establishing)
tps = 3620.473191 (excluding connections establishing)
tps = 3689.794860 (excluding connections establishing)
tps = 3692.317099 (excluding connections establishing)

[Patched] ave 3718 TPS
tps = 3751.639616 (excluding connections establishing)
tps = 3736.482071 (excluding connections establishing)
tps = 3747.613223 (excluding connections establishing)
tps = 3745.578446 (excluding connections establishing)
tps = 3662.612013 (excluding connections establishing)
tps = 3715.271028 (excluding connections establishing)
tps = 3718.671552 (excluding connections establishing)
tps = 3698.766946 (excluding connections establishing)
tps = 3639.026099 (excluding connections establishing)
tps = 3760.507508 (excluding connections establishing)

The patch improves the performance by about 1.3%, which is less than David's result, but it still seems to improve performance.

The above performance test doesn't trigger exec run-time pruning, so it can't exercise the loop code the patch modified. I therefore also ran the test below, which does perform exec run-time pruning.

setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION BY RANGE (id);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (' || (x*100000)::text || ') TO (' || ((x+1)*100000)::text || ') PARTITION BY RANGE (i1);' from generate_Series(0,299) x;
\gexec
\o
\o /dev/null
select 'CREATE TABLE partbench' || x::text || '_i1a PARTITION OF partbench' || x::text || ' FOR VALUES FROM (0) TO (1);' from generate_Series(0,299) x;
\gexec
\o
\o /dev/null
select 'CREATE TABLE partbench' || x::text || '_i1b PARTITION OF partbench' || x::text || ' FOR VALUES FROM (1) TO (2);' from generate_Series(0,299) x;
\gexec
\o

select-exec-init.sql:
\set p_id 1
select * from partbench where id = :p_id and i1 = (select 0);

[Unpatched] ave 1060 TPS
tps = 1067.286500 (excluding connections establishing)
tps = 1052.705136 (excluding connections establishing)
tps = 1056.684966 (excluding connections establishing)
tps = 1059.803865 (excluding connections establishing)
tps = 1053.418776 (excluding connections establishing)
tps = 1053.383518 (excluding connections establishing)
tps = 1058.542617 (excluding connections establishing)
tps = 1071.875455 (excluding connections establishing)
tps = 1058.064092 (excluding connections establishing)
tps = 1066.869393 (excluding connections establishing)

[Patched] ave 1069 TPS
tps = 1071.621247 (excluding connections establishing)
tps = 1067.881709 (excluding connections establishing)
tps = 1073.274357 (excluding connections establishing)
tps = 1058.648528 (excluding connections establishing)
tps = 1068.490598 (excluding connections establishing)
tps = 1064.739885 (excluding connections establishing)
tps = 1069.189778 (excluding connections establishing)
tps = 1070.253092 (excluding connections establishing)
tps = 1070.395411 (excluding connections establishing)
tps = 1071.003647 (excluding connections establishing)

The patch also improves the performance by about 0.85% in this case.

-- Yoshikazu Imai
On 18 October 2018 at 16:13, Imai, Yoshikazu <imai.yoshikazu@jp.fujitsu.com> wrote:
> The patch improves the performance about 1.3% which is less than David's
> result, but it seems still improves the performance.

Thanks for doing these benchmarks. The speedup is small, but it becomes much more significant once other bottlenecks are removed. More partitions may show a larger increase, but more partitions also means that a larger range table array gets built during ExecInitRangeTable(), which is also slow.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 22, 2018 at 8:31 PM, David Rowley wrote:
> On 18 October 2018 at 16:13, Imai, Yoshikazu
> <imai.yoshikazu@jp.fujitsu.com> wrote:
> > The patch improves the performance about 1.3% which is less than
> > David's result, but it seems still improves the performance.
>
> Thanks for doing these benchmarks.
>
> The speedup is small, but it becomes much more significant once other
> bottlenecks are removed. More partitions may show a larger increase, but
> more partitions also means that a larger range table array gets built
> during ExecInitRangeTable(), which is also slow.

Since I have done enough testing and ISTM the patch is in good shape, I changed the state of this patch to "Ready for Committer".

--
Yoshikazu Imai
"Imai, Yoshikazu" <imai.yoshikazu@jp.fujitsu.com> writes:
> I changed the state of this patch to "Ready for Committer".

I pushed this with a bit of extra tweaking --- notably, I added an Assert to ExecFindMatchingSubPlans to guard against the possibility that someone calls it when the remapping hasn't been done.

I think that the main win here is the recognition that we only need remapping when both do_initial_prune and do_exec_prune are true. I suspect that's an unusual case, so that we'll usually get to skip this code altogether.

I am not really convinced that the original idea of replacing the loop-over-subplans with a bms_next_member() loop is a good one. Yeah, it wins in the best case where there are a lot of subplans and initial pruning got rid of nearly all of them --- but how often is that true, given the context that both do_initial_prune and do_exec_prune are needed? And does it still win if most of the subplans *didn't* get pruned here? bms_next_member() is not cheap compared to an increment-and-compare. I am distressed by the fact that your performance testing seems to have examined only this best case. IMO performance testing should usually worry more about the possible downside in the worst case.

Rather than arguing about that, though, I think we should focus on something else: I'd be a lot happier if this remapping logic just disappeared completely. I doubt it's a performance issue now that we run it only when initial and exec pruning are both needed. But it seems like a source of complexity and bugs, most especially the latter now that common cases won't run it. Why is it that we can't fix things so that ExecFindMatchingSubPlans just uses the original planner output data?

			regards, tom lane
On Fri, 16 Nov 2018 at 07:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I pushed this with a bit of extra tweaking --- notably, I added an
> Assert to ExecFindMatchingSubPlans to guard against the possibility
> that someone calls it when the remapping hasn't been done.

Thanks for pushing.

> I am not really convinced that the original idea of replacing the
> loop-over-subplans with a bms_next_member() loop is a good one.
> Yeah, it wins in the best case where there are a lot of subplans
> and initial pruning got rid of nearly all of them --- but how often
> is that true, given the context that both do_initial_prune and
> do_exec_prune are needed? And does it still win if most of the
> subplans *didn't* get pruned here? bms_next_member() is not cheap
> compared to an increment-and-compare. I am distressed by the fact
> that your performance testing seems to have examined only this
> best case. IMO performance testing should usually worry more about
> the possible downside in the worst case.

You are correct that a for loop over a Bitmapset with all members set up to N, testing bms_is_member(), is faster than a while loop over bms_next_member(). However, I don't think the case where many partitions remain after initial pruning is an interesting one to focus on. The additional overhead of initialising plans for a large number of partitions is most likely to more than drown out anything we can do to improve the performance of ExecFindInitialMatchingSubPlans().
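[Editor's note: the two loop shapes being compared above can be sketched in a self-contained form. This is only an illustration using a plain uint64_t word as a stand-in for PostgreSQL's Bitmapset; the function names are made up here, and the real executor code uses bms_is_member() and bms_next_member() instead.]

```c
#include <stdint.h>

/*
 * Bit i set in 'validplans' means subplan i survived initial pruning.
 */

/* Old shape: visit every subplan index 0..nplans-1 and test membership. */
static int
count_valid_linear(uint64_t validplans, int nplans)
{
    int count = 0;

    for (int i = 0; i < nplans; i++)
    {
        if (validplans & (UINT64_C(1) << i))
            count++;
    }
    return count;
}

/*
 * New shape: jump from one surviving subplan straight to the next.  This
 * wins when nearly everything was pruned, since the loop body runs once
 * per *surviving* subplan rather than once per original subplan.
 */
static int
count_valid_next_member(uint64_t validplans)
{
    int count = 0;

    while (validplans != 0)
    {
        validplans &= validplans - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

When only one of, say, 300 subplans survives, the second loop executes once while the first still executes 300 times, which is the shape of the win being argued over in this sub-thread.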
As an example, I tried on master with:

create table listp (b bool, z int) partition by list(b);
create table listp_false partition of listp for values in(false);
create table listp_true partition of listp for values in(true) partition by list (z);
select 'create table listp_true' || x::text || ' partition of listp_true for values in(' || x::text || ');' from generate_series(1,1000) x;
\gexec

Having changed postgresql.conf:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0

I ran this through pgbench:

\set p_b true
prepare q1 (bool) as select * from listp where b = :p_b and z = (select 1);

tps = 230.317940 (excluding connections establishing)

With my original test case, I reported about a 4-microsecond-per-query performance improvement from this patch. That does not account for very much in a query that takes well over 4000 microseconds to complete. I've not benchmarked after reverting the patch, as I don't think it would be easy to measure something that only takes 0.1% of the total query time.

If we could have pruned all those other partitions during executor startup, i.e.:

\set p_b true
\set p_z 1
select * from listp where b = :p_b and z = :p_z

then we'd get something more like:

tps = 1952.976045 (excluding connections establishing)

The saving here is not having to init/shutdown nearly as many scan nodes. Any additional overhead in ExecFindInitialMatchingSubPlans() is going to be noticed much more in a fast query such as this one, and in this case, where most or all of the partitions were pruned, the bms_next_member() loop will be significantly faster.

> Rather than arguing about that, though, I think we should focus
> on something else: I'd be a lot happier if this remapping logic
> just disappeared completely. I doubt it's a performance issue
> now that we run it only when initial and exec pruning are both
> needed. But it seems like a source of complexity and bugs, most
> especially the latter now that common cases won't run it.
> Why is it that we can't fix things so that ExecFindMatchingSubPlans
> just uses the original planner output data?

I understand you'd rather not have this code, but I believe it's worth doing things this way for the performance we gain by not having to allocate a large, mostly-empty array to store the initialized subnodes.

I wrote a quick patch to get rid of the re-map code and instead palloc0 an array for all subnodes, initializing only the ones that were not pruned. Testing this against master and the patched version with a partitioned table with 10k hash partitions, I get:

$ pgbench -n -f bench.sql -T 60 -M prepared postgres

-- Don't re-map
tps = 95.459176 (excluding connections establishing)
tps = 96.320146 (excluding connections establishing)

-- master
tps = 102.019177 (excluding connections establishing)
tps = 104.052788 (excluding connections establishing)

So there's a non-zero performance penalty if we remove the re-map code.

As far as I can see, we've still got lots left to improve in the executor to speed this up further: we populate the large 10k-element range table array in ExecInitRangeTable(), run ExecCheckRTPerms() on a range table that requires no permission checks for any of the partitions, and lock each partition in AcquireExecutorLocks() instead of delaying the locking of partitions until after pruning takes place. I have WIP patches locally to resolve each of these, which take performance to about 900 tps. The overhead of the palloc0() on the almost entirely empty Append subnode array becomes a much larger percentage then.

Perhaps a compromise would be to somehow move the re-map code into partprune.c so as to keep all that logic in one place.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
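[Editor's note: for reference, the core idea of the committed change described at the top of the thread -- a 1-based map array, zero-filled up front, populated by looping only over the surviving subplans -- can be sketched in a simplified, self-contained form. The names below are illustrative; the real logic lives in ExecFindInitialMatchingSubPlans() and operates on the executor's Bitmapset and subplan arrays.]

```c
#include <string.h>

/*
 * Build the old-to-new subplan index map after initial pruning.
 *
 * The map is 1-based and zero-filled first, so an entry of 0 means
 * "this subplan was pruned".  Only the surviving subplans (given here
 * as a sorted list of original indexes) are then visited, rather than
 * looping over all nplans original subplans as the old code did.
 */
static void
build_subplan_map(const int *validplans, int nvalid,
                  int *new_index,   /* 1-based, length nplans + 1 */
                  int nplans)
{
    memset(new_index, 0, sizeof(int) * (nplans + 1));

    for (int i = 0; i < nvalid; i++)
        new_index[validplans[i] + 1] = i + 1;   /* new index, also 1-based */
}
```

When pruning removes all but one of, say, 300 subplans, the loop runs once instead of 300 times, which matches the "pruning all but 1 subplan is quite common" observation that motivated the patch.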