Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id 20190720165951.m6adwbqi26bsfbjb@development
Whole thread Raw
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)
List pgsql-hackers
On Sat, Jul 20, 2019 at 12:21:01PM -0400, James Coleman wrote:
>On Sat, Jul 20, 2019 at 12:12 PM James Coleman <jtc331@gmail.com> wrote:
>>
>> On Sat, Jul 20, 2019 at 11:25 AM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> > >My current line of investigation is whether we need to do anything in
>> > >the parallel portion of create_ordered_paths(). I noticed that the
>> > >first-pass patch adding generate_useful_gather_paths() modified that
>> > >section but wasn't actually adding any new gather-merge paths (just
>> > >bare incremental sort paths). That seems pretty clearly just a
>> > >prototype miss, so I modified the prototype to build gather-merge
>> > >paths instead (as a side note that change seems to fix an oddity I was
>> > >seeing where plans would include a parallel index scan node even
>> > >though they weren't parallel plans). While the resulting plan for
>> > >something like:
>> > >
>> >
>> > Yes, that seems to be a bug. The block above it clealy has a gather
>> > merge nodes, so this one should too.
>> >
>> > >explain analyze select * from t where t.a in (1,2,3,4,5,6) order by
>> > >t.a, t.b limit 50;
>> > >
>> > >changes cost (to be cheaper) ever so slightly with the gather-merge
>> > >addition to create_ordered_paths(), the plan itself is otherwise
>> > >identical (including row estimates):
>> > >
>> > >Limit
>> > >  -> Gather Merge
>> > >       -> Incremental Sort
>> > >          -> Parallel Index Scan
>> > >
>> > >(Note: I'm forcing parallel plans here with: set
>> > >max_parallel_workers_per_gather=4; set min_parallel_table_scan_size=0;
>> > >set parallel_tuple_cost=0; set parallel_setup_cost=0; set
>> > >min_parallel_index_scan_size=0;)
>> > >
>> > >I can't seem to come up with a case where adding these gather-merge
>> > >paths in create_ordered_paths() isn't entirely duplicative of paths
>> > >already created by generate_useful_gather_paths() as called from
>> > >apply_scanjoin_target_to_paths() -- which I _think_ makes sense given
>> > >that both apply_scanjoin_target_to_paths() and create_ordered_paths()
>> > >are called by grouping_planner().
>> > >
>> > >Can you think of a case I'm missing here that would make it valuable
>> > >to generate new parallel plans in create_ordered_paths()?
>> > >
>> >
>> > Good question. Not sure. I think such path would need to do something on
>> > a relation that is neither a join nor a scan - in which case the path
>> > should not be created by apply_scanjoin_target_to_paths().
>> >
>> > So for example a query like this:
>> >
>> >   SELECT
>> >       a, b, sum(expensive_function(c))
>> >   FROM
>> >       t
>> >   GROUP BY a,b
>> >   ORDER BY a,sum(...) LIMIT 10;
>> >
>> > should be able to produce a plan like this:
>> >
>> >   -> Limit
>> >       -> Gather Merge
>> >         -> Incremental Sort (pathkeys: a, sum)
>> >           -> Group Aggregate
>> >              a, b, sum(expensive_function(c))
>> >             -> Index Scan (pathkeys: a, b)
>> >
>> > or something like that, maybe. I haven't tried this, though. The other
>> > question is whether such queries are useful in practice ...
>>
>> Hmm, when I step through on that example input_rel->partial_pathlist
>> != NIL is false, so we don't even attempt to consider any extra
>> parallel paths in create_ordered_paths(). Nonetheless we get a
>> parallel plan, but with a different shape:
>>
>>  Limit
>>    ->  Incremental Sort
>>          Sort Key: a, (sum(expensive_function(c)))
>>          Presorted Key: a
>>          ->  Finalize GroupAggregate
>>                Group Key: a, b
>>                ->  Gather Merge
>>                      Workers Planned: 4
>>                      ->  Partial GroupAggregate
>>                            Group Key: a, b
>>                            ->  Sort
>>                                  Sort Key: a, b
>>                                  ->  Parallel Seq Scan on t
>>
>> (or if I disable seqscan then the sort+seq scan above becomes inc sort
>> + index scan)
>>
>> To be honest, I don't think I understand how you would get a plan like
>> that anyway since the index here only has "a" and so can't provide
>> (pathkeys: a, b).
>>
>> Perhaps there's something I'm still missing though.

I wasn't stricly adhering to the example we used before, and I imagined
there would be an index on (a,b). Sorry if that wasn't clear.

>
>Also just realized I don't think (?) we can order by the sum inside a
>gather-merge -- at least not without having another sort above the
>parallel portion? Or is the group aggregate able to also provide
>ordering on the final sum after aggregating the partial sums?
>

Yes, you're right - an extra sort node would be necessary. But I don't
think that changes the idea behind that example. The question is whether
the extra sorts below the gather would be cheaper than doing a large sort
on top of it - but I don't see why wouldn't that be the case, and if we
only need a couple of rows from the beginning ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Sergei Kornilov
Date:
Subject: Re: [HACKERS] Block level parallel vacuum
Next
From: Tomas Vondra
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)