Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id 20200328185406.vizs5jndlhgxbh26@development
Whole thread Raw
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)
List pgsql-hackers
On Sat, Mar 28, 2020 at 10:19:04AM -0400, James Coleman wrote:
>On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> ...
>> >> As a side note here, I'm wondering if this (determining useful pathkeys)
>> >> can be made a bit smarter by looking both at query_pathkeys and pathkeys
>> >> useful for merging, similarly to what truncate_useless_pathkeys() does.
>> >> But that can be seen as an improvement of what we do now.
>> >
>> >Unless your comment below about looking at truncate_useless_pathkeys
>> >is implying you're considering aiming to get this in now, I wonder if
>> >we should just expand the comment to reference pathkeys useful for
>> >merging as a possible future extension.
>> >
>>
>> Maybe. I've been thinking about how to generate those path keys, but
>> it's a bit tricky, because pathkeys_useful_for_merging() - the thing
>> that backs truncate_useless_pathkeys() actually gets pathkeys and merely
>> verifies if those are useful for merging. But this code needs to do the
>> opposite - generate those pathkeys.
>>
>> But let's say we know we have a join on columns (a,b,c). For each of
>> those PathKey values we know it's useful for merging, but should we
>> generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
>> suppose we can look at pathkeys for existing paths of the relation to
>> prune the possibilities a bit, but what then?
>
>I'm not convinced it's worth it this time around. Both because of the
>late hour in the CF etc., but also because it seems likely to become
>pretty complex quickly, and also far more likely to raise performance
>questions in planning if there's not a lot of work done to keep it
>limited.
>

Agreed. There'll be time to add more optimizations in the future.

>> BTW I wonder if we actually need the ec_has_volatile check in
>> get_useful_pathkeys_for_relation. The comment says we can't but is it
>> really true? pathkeys_useful_for_ordering doesn't do any such checks, so
>> I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
>> Opinions?
>
>I hadn't really looked at that part in depth before, but after reading
>it over, re-reading the definition of volatility, and running a few
>basic queries, I agree.
>
>For example: we already do allow volatile pathkeys in a simple query like:
>-- t(a, b) with index on (a)
>select * from t order by a, random();
>
>It makes sense that you couldn't push down such a sort to a foreign
>server, given there's no constraints said function isn't operating
>directly on the primary database (in the fdw case). But that obviously
>wouldn't apply here.
>

Thanks for confirming my reasoning.

>> >> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
>> >> >* Considering query_pathkeys is always worth it, because it might let us
>> >> >* avoid a local sort.
>> >> >
>> >> >That originally was a copy from the fdw code, but since the two
>> >> >functions have diverged (Is that concerning? I could be confusing, but
>> >> >isn't a compilation problem) I didn't move the function.
>> >> >
>> >>
>> >> I think it's OK the two functions diverged, it's simply because the FDW
>> >> one needs to check other things too. But I might rework this once I look
>> >> closer at truncate_useless_pathkeys.
>> >
>> >Agreed, for now at least. It's tempting to think they should always be
>> >shared, but I'm not convinced (without a lot more digging) that this
>> >represents structural rather than incidental duplication.
>> >
>>
>> The more I look at pathkeys_useful_for_ordering() the more I think the
>> get_useful_pathkeys_for_relation() function should look more like it
>> than the postgres_fdw one ...
>
>If we go down that path (and indeed this is actually implied by
>removing the volatile check too), doesn't that really just mean (at
>least for now) that get_useful_pathkeys_for_relation goes away
>entirely and we effectively use root->query_pathkeys directly?

Yes, basically.

>The only thing your lose them is the check that each eclass has a
>member in the rel. But that check probably wasn't quite right anyway:
>at least for incremental sort (maybe not regular sort), I think we
>probably don't necessarily care that the entire list has members in the
>rel, but rather that some prefix does, right?

Probably, although I always forget how exactly this EC business works.
My reasoning is more "If pathkeys_useful_for_ordering does not need
that, why should we need it here?"

>The idea there would be that we could create a gather merge here that
>supplies a partial ordering (that prefix of query_pathkeys) and then
>above it the planner might place another incremental sort (say above a
>join), or perhaps even a join above that would actually be sufficient
>(since many joins are capable of providing an ordering anyway).
>

Not sure. Isn't the idea that we do the Incremental Sort below the
Gather Merge, because then it's actually done in parallel?

>I've attached (added prefix .txt to avoid the cfbot assuming this is a
>full series) an idea in that direction to see if we're thinking along
>the same route. You'll want to apply and view with `git diff -w`
>probably.
>
>The attached also adds a few XXX comments. In particular, I wonder if
>we should verify that some prefix of the root->query_pathkeys is
>actually made up of eclass members for the current rel, because
>otherwise I think we can skip the loop on the subpaths entirely.
>

Hmmm, that's an interesting possible optimization. I wonder if that
actually saves anything, though, because the number of paths in the loop
is likely fairly low.

>> >> >I did notice though that find_em_expr_for_rel() is wholesale
>> >> >copied (and unchanged) from the fdw code, so I moved it to
>> >> >equivclass.c so both places can share it.
>> >> >
>> >>
>> >> +1
>> >>
>>
>> ... which would also get rid of find_em_expr_for_rel().
>
>... which, I think, would retain the need for find_em_expr_for_rel().
>
>Note: The attached applied to the previous series compiles and runs
>make check...but I haven't really tested it; it's meant more for "is
>this the direction we want to go".
>

Thanks, I'll take a look.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: pgbench - refactor init functions with buffers
Next
From: Nikolay Shaplov
Date:
Subject: Re: [PATCH] Finally split StdRdOptions into HeapOptions and ToastOptions