Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date:
Msg-id: 20190720152512.jpdi2g3notbjcxsd@development
In response to: Re: [PATCH] Incremental sort (was: PoC: Partial sort) (James Coleman <jtc331@gmail.com>)
Responses: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
           Re: [PATCH] Incremental sort (was: PoC: Partial sort)
List: pgsql-hackers
On Sat, Jul 20, 2019 at 10:33:02AM -0400, James Coleman wrote:
>On Sat, Jul 20, 2019 at 9:22 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Jul 19, 2019 at 04:59:21PM -0400, James Coleman wrote:
>> >On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >> Now, consider this example:
>> >>
>> >>   create table t (a int, b int, c int);
>> >>   insert into t select mod(i,100),mod(i,100),i from generate_series(1,10000000) s(i);
>> >>   create index on t (a);
>> >>   analyze t;
>> >>   explain select a,b,sum(c) from t group by 1,2 order by 1,2,3 limit 1;
>> >>
>> >> With 0001+0002+0003 patches, I get a plan like this:
>> >>
>> >>                                                      QUERY PLAN
>> >> --------------------------------------------------------------------------------------------------------------------
>> >>  Limit  (cost=10375.39..10594.72 rows=1 width=16)
>> >>    ->  Incremental Sort  (cost=10375.39..2203675.71 rows=10000 width=16)
>> >>          Sort Key: a, b, (sum(c))
>> >>          Presorted Key: a, b
>> >>          ->  GroupAggregate  (cost=10156.07..2203225.71 rows=10000 width=16)
>> >>                Group Key: a, b
>> >>                ->  Gather Merge  (cost=10156.07..2128124.39 rows=10000175 width=12)
>> >>                      Workers Planned: 2
>> >>                      ->  Incremental Sort  (cost=9156.04..972856.05 rows=4166740 width=12)
>> >>                            Sort Key: a, b
>> >>                            Presorted Key: a
>> >>                            ->  Parallel Index Scan using t_a_idx on t  (cost=0.43..417690.30 rows=4166740 width=12)
>> >> (12 rows)
>> >>
>> >>
>> >> and with 0004, I get this:
>> >>
>> >>                                               QUERY PLAN
>> >> ------------------------------------------------------------------------------------------------------
>> >>  Limit  (cost=20443.84..20665.32 rows=1 width=16)
>> >>    ->  Incremental Sort  (cost=20443.84..2235250.05 rows=10000 width=16)
>> >>          Sort Key: a, b, (sum(c))
>> >>          Presorted Key: a, b
>> >>          ->  GroupAggregate  (cost=20222.37..2234800.05 rows=10000 width=16)
>> >>                Group Key: a, b
>> >>                ->  Incremental Sort  (cost=20222.37..2159698.74 rows=10000175 width=12)
>> >>                      Sort Key: a, b
>> >>                      Presorted Key: a
>> >>                      ->  Index Scan using t_a_idx on t  (cost=0.43..476024.65 rows=10000175 width=12)
>> >> (10 rows)
>> >>
>> >> Notice that the cost of the second plan is almost double the first one.
>> >> That means 0004 does not even generate the first plan, i.e. there are
>> >> cases where we don't try to add the explicit sort before passing the
>> >> path to generate_gather_paths().
>> >>
>> >> And I think I know why that is - while gather_grouping_paths() tries to
>> >> add an explicit sort below the gather merge, there are other places that
>> >> call generate_gather_paths() that don't do that. In this case it's
>> >> probably apply_scanjoin_target_to_paths() which simply builds
>> >>
>> >>     parallel (seq|index) scan + gather merge
>> >>
>> >> and that's it. The problem is likely the same - the code does not know
>> >> which pathkeys are "interesting" at that point. We probably need to
>> >> teach the planner to do this.
>> >
>> >I've been working on figuring out sample queries for each of the
>> >places we're looking at adding create_increment_sort() (starting with
>> >the cases enabled by gather-merge nodes). The
>> >generate_useful_gather_paths() call in
>> >apply_scanjoin_target_to_paths() is required to generate the above
>> >preferred plan.
>
>As I continue this, I've added a couple of test cases (notably for
>generate_useful_gather_paths() in both standard_join_search() and
>apply_scanjoin_target_to_paths()). Those, plus the current WIP state
>of my hacking on your patch adding generate_useful_gather_paths(), are
>attached as 0001-parallel-and-more-paths.patch.
>
>My current line of investigation is whether we need to do anything in
>the parallel portion of create_ordered_paths(). I noticed that the
>first-pass patch adding generate_useful_gather_paths() modified that
>section but wasn't actually adding any new gather-merge paths (just
>bare incremental sort paths).
>That seems pretty clearly just a
>prototype miss, so I modified the prototype to build gather-merge
>paths instead (as a side note, that change seems to fix an oddity I was
>seeing where plans would include a parallel index scan node even
>though they weren't parallel plans). While the resulting plan for
>something like:
>

Yes, that seems to be a bug. The block above it clearly has gather
merge nodes, so this one should too.

>explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
>t.a, t.b limit 50;
>
>changes cost (to be cheaper) ever so slightly with the gather-merge
>addition to create_ordered_paths(), the plan itself is otherwise
>identical (including row estimates):
>
>Limit
>  -> Gather Merge
>       -> Incremental Sort
>            -> Parallel Index Scan
>
>(Note: I'm forcing parallel plans here with: set
>max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
>set parallel_tuple_cost=0; set parallel_setup_cost=0; set
>min_parallel_index_scan_size=0;)
>
>I can't seem to come up with a case where adding these gather-merge
>paths in create_ordered_paths() isn't entirely duplicative of paths
>already created by generate_useful_gather_paths() -- which I _think_
>makes sense given that both apply_scanjoin_target_to_paths() and
>create_ordered_paths() are called by grouping_planner().
>
>Can you think of a case I'm missing here that would make it valuable
>to generate new parallel plans in create_ordered_paths()?
>

Good question. Not sure. I think such a path would need to do something
on a relation that is neither a join nor a scan - in which case the path
should not be created by apply_scanjoin_target_to_paths().

So for example a query like this:

    SELECT a, b, sum(expensive_function(c))
    FROM t
    GROUP BY a, b
    ORDER BY a, sum(...)
    LIMIT 10;

should be able to produce a plan like this:

    -> Limit
       -> Gather Merge
          -> Incremental Sort (pathkeys: a, sum)
             -> Group Aggregate (a, b, sum(expensive_function(c)))
                -> Index Scan (pathkeys: a, b)

or something like that, maybe. I haven't tried this, though. The other
question is whether such queries are useful in practice ...

>> ...
>>
>> I think this may be a thinko, as this plan demonstrates - but I'm not
>> sure about it. I wonder if this might be penalizing some other types of
>> plans (essentially anything with limit + gather).
>>
>> Attached is a WIP patch fixing this by considering both startup and
>> total cost (by calling compare_path_costs_fuzzily).
>
>It seems to me that this is likely a bug, and not just a change
>needed for this. Do you think it's better addressed in a separate
>thread? Or retain it as part of this patch for now (and possibly break
>it out later)? On the other hand, it's entirely possible that someone
>more familiar with parallel plan limitations could explain why the
>above comment holds true. That makes me lean towards asking in a new
>thread.
>

Maybe. I think creating a separate thread would be useful, provided we
manage to demonstrate the issue without an incremental sort.

>I've also attached a new base patch (incremental-sort-30.patch) which
>includes some of the other obvious fixes (costing, etc.) that you'd
>previously proposed.
>

Thanks!

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services