Re: Add proper planner support for ORDER BY / DISTINCT aggregates - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Add proper planner support for ORDER BY / DISTINCT aggregates |
Date | |
Msg-id | CAApHDvpkbvDEDJPNZrUqCsCLVC57ZY8qrPn8nGpQSrn4RYstZw@mail.gmail.com Whole thread Raw |
In response to | Re: Add proper planner support for ORDER BY / DISTINCT aggregates (Ronan Dunklau <ronan.dunklau@aiven.io>) |
Responses |
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Re: Add proper planner support for ORDER BY / DISTINCT aggregates Re: Add proper planner support for ORDER BY / DISTINCT aggregates |
List | pgsql-hackers |
On Tue, 8 Nov 2022 at 03:53, Ronan Dunklau <ronan.dunklau@aiven.io> wrote: > I can't see why an incrementalsort could be more expensive than a full sort, > using the same presorted path. It looks to me that in that case we should > always prefer an incrementalsort. Maybe we should bound incremental sorts cost > to make sure they are never more expensive than the full sort ? The only thing that I could think of that would cause incremental sort to be more expensive than sort is the tuplesort_reset() calls that are performed between sorts. However, I see cost_incremental_sort() accounts for those already with: run_cost += 2.0 * cpu_tuple_cost * input_groups; Also, I see at the top of incremental_sort.sql there's a comment claiming: -- When we have to sort the entire table, incremental sort will -- be slower than plain sort, so it should not be used. I'm just unable to verify that's true by doing the following: $ echo "select * from (select * from tenk1 order by four) t order by four, ten;" > bench.sql $ pgbench -n -f bench.sql -T 60 -M prepared regression | grep -E "^tps" tps = 102.136151 (without initial connection time) $ # disable sort so that the test performs Sort -> Incremental Sort rather $ # than Sort -> Sort $ psql -c "alter system set enable_sort=0;" regression $ psql -c "select pg_reload_conf();" regression $ pgbench -n -f bench.sql -T 60 -M prepared regression | grep -E "^tps" tps = 112.378761 (without initial connection time) When I disable sort, the plan changes to use Incremental Sort and execution becomes faster, not slower like the comment claims it will. Perhaps this was true during the development of Incremental sort and then something was changed to speed things up. I do recall reviewing that patch many years ago and hinting about the invention of tuplesort_reset(). I don't recall, but I assume the patch must have been creating a new tuplesort each group before that. Also, I was looking at add_paths_to_grouping_rel() and I saw that if presorted_keys > 0 that we'll create both a Sort and Incremental Sort path. If we assume Incremental Sort is always better when it can be done, then it seems useless to create the Sort path when Incremental Sort is possible. When working on making Incremental Sorts work for window functions I did things that way. Maybe we should just make add_paths_to_grouping_rel() work the same way. Regarding the 1.5 factor in cost_incremental_sort(), I assume this is for skewed groups. Imagine there's 1 huge group and 99 tiny ones. However, even if that were the case, I imagine the performance would still be around the same performance as the non-incremental variant of sort. I've been playing around with the attached patch which does: 1. Adjusts add_paths_to_grouping_rel so that we don't add a Sort path when we can add an Incremental Sort path instead. This removes quite a few redundant lines of code. 2. Removes the * 1.5 fuzz-factor in cost_incremental_sort() 3. Does various other code tidy stuff in cost_incremental_sort(). 4. Removes the test from incremental_sort.sql that was ensuring the inferior Sort -> Sort plan was being used instead of the superior Sort -> Incremental Sort plan. I'm not really that 100% confident in the removal of the * 1.5 thing. I wonder if there's some reason we're not considering that might cause a performance regression if we're to remove it. David
Attachment
pgsql-hackers by date: