Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers

From James Coleman
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id CAAaqYe-34ZX5Q_+wU4t4=SvrE6Ej3mFgiEArSD+-Tocpku7rHA@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Tue, Mar 31, 2020 at 07:09:04PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 6:54 PM Tomas Vondra
> >> ><tomas.vondra@2ndquadrant.com> wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 06:35:32PM -0400, Tom Lane wrote:
> >> >> >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> >> >> >> In general, I think it'd be naive that we can make planner smarter with
> >> >> >> no extra overhead spent on planning, and we can never accept patches
> >> >> >> adding even tiny overhead. With that approach we'd probably end up with
> >> >> >> a trivial planner that generates just a single query plan, because
> >> >> >> that's going to be the fastest planner. A realistic approach needs to
> >> >> >> consider both the planning and execution phase, and benefits of this
> >> >> >> patch seem to be clear - if you have queries that do benefit from it.
> >> >> >
> >> >> >I think that's kind of attacking a straw man, though.  The thing that
> >> >> >people push back on, or should push back on IMO, is when a proposed
> >> >> >patch adds significant slowdown to queries that it has no or very little
> >> >> >hope of improving.  The trick is to do expensive stuff only when
> >> >> >there's a good chance of getting a better plan out of it.
> >> >> >
> >> >>
> >> >> Yeah, I agree with that. I think the main issue is that we don't really
> >> >> know what the "expensive stuff" is in this case, so it's not really
> >> >> clear how to be smarter :-(
> >> >
> >> >To add to this: I agree that ideally you'd check cheaply to know
> >> >you're in a situation that might help, and then do more work. But here
> >> >the question is always going to be simply "would we benefit from an
> >> >ordering, and, if so, do we have it already partially sorted". It's
> >> >hard to imagine that reducing much conceptually, so we're left with
> >> >optimizations of that check.
> >> >
> >>
> >> I think it depends on what exactly is the expensive part. For example if
> >> it's the construction of IncrementalSort paths, then maybe we could try
> >> do a quick/check check if the path can even be useful by estimating the
> >> cost and only then building the path.
> >>
> >> That's what we do for join algorithms, for example - we first compute
> >> initial_cost_nestloop and only when that seems cheap enough we do the
> >> more expensive stuff.
> >>
> >> But I'm not sure the path construction is the expensive part, as it
> >> should be disabled by enable_incrementalsort=off. But the regression
> >> does not seem to disappear, at least not entirely.
> >>
> >>
> >> >> One possibility is that it's just one of those regressions due to change
> >> >> in binary layout, but I'm not sure know how to verify that.
> >> >
> >> >If we are testing with a case that can't actually add more paths (due
> >> >to it checking the guc before building them), doesn't that effectively
> >> >leave one of these two options:
> >> >
> >> >1. Binary layout/cache/other untraceable change, or
> >> >2. Changes due to refactored function calls.
> >> >
> >>
> >> Hmm, but in case of (1) the overhead should be there even with tests
> >> that don't really have any additional paths to consider, right? I've
> >> tried with such test (single table with no indexes) and I don't quite
> >> see any regression (maybe ~1%).
> >
> >Not necessarily, if the cost is in sort costing or useful pathkeys
> >checking, right? We have run that code even without incremental sort,
> >but it's changed from master.
> >
>
> Ah, I should have mentioned I've done most of the tests on just the
> basic incremental sort patch (0001+0002), without the additional useful
> paths. I initially tested the whole patch series, but after discovering
> the regression I removed the last part (which I suspected might be the
> root cause). But the regression is still there, so it's not that.
>
> It might be in the reworked costing, yeah. But then I'd expect those
> function to show in the perf profile.

Right. I'm just grasping at straws on that.

> >> (2) might have impact, but I don't see any immediate supects. Did you
> >> have some functions in mind?
> >
> >I guess this is where the lines blur: I didn't see anything obvious
> >either, but the changes to sort costing...should probably not have
> >real impact...but...
> >
>
> :-(
>
> >> BTW I see the patch adds pathkeys_common but it's not called from
> >> anywhere. It's probably leftover from an earlier patch version.
> >>

BTW, I think I'm going to rename the pathkeys_common_contained_in
function to something like pathkeys_count_contained_in, unless you
have an objection to that. The name doesn't seem obvious at all to me.

> >> >There's not anything obvious in point (2) that would be a big cost,
> >> >but there are definitely changes there. I was surprised that just
> >> >eliminating the loop through the pathkeys on the query and the index
> >> >was enough to save us ~4%.
> >> >
> >> >Tomas: Earlier you'd wondered about if we should try to shortcut the
> >> >changes in costing...I was skeptical of that originally, but maybe
> >> >it's worth looking into? I'm going to try backing that out and see
> >> >what the numbers look like.
> >> >
> >
> >BTW, I did this test, and it looks like we can get back something
> >close to 1% by reverting that initial fix on partial path costing. But
> >we can't get rid of it all the time, at the very least. *Maybe* we
> >could condition it on incremental sort, but I'm not convinced that's
> >the only place it's needed as a fix.
> >
>
> Sounds interesting. I actually tried how much the add_partial_path
> change accounts for, and you're right it was quite a bit. But I forgot
> about that when investigating the rest.
>
> I wonder how large would the regression be without add_partial_path and
> with the fix in pathkeys_common_contained_in.
>
> I'm not sure how much we want to make add_partial_path() dependent on
> particular GUCs, but I guess if it gets rid of the regression, allows us
> to commit incremental sort and we can reasonably justify that only
> incremental sort needs those paths, it might be acceptable.

That's a good point.

> >> I've described the idea about something like initial_cost_nestloop and
> >> so on. But I'm a bit skeptical about it, considering that the GUC only
> >> has limited effect.
> >>
> >>
> >> A somewhat note is that the number of indexes has pretty significant
> >> impact on planning time, even on master. This is timing of the same
> >> explain script (similar to the one shown before) with different number
> >> of indexes on master:
> >>
> >>      0 indexes        7 indexes       49 indexes
> >>      -------------------------------------------
> >>          10.85            12.56            27.83
> >>
> >> The way I look at incremental sort is that it allows using indexes for
> >> queries that couldn't use it before, possibly requiring a separate
> >> index. So incremental sort might easily reduce the number of indexes
> >> needed, compensating for the overhead we're discussing here. Of course,
> >> that may or may not be true.
> >
> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> >query pathkeys is only length 1, then we could skip the additional
> >path creation.
> >
>
> I don't follow. Why would we create incremental sort in this case at
> all? With single-element query_pathkeys the path is either unsorted or
> fully sorted - there's no room for incremental sort. No?

Well, we shouldn't, that's what I'm getting. But I didn't see anything
in the code now that explicitly excludes that case when decided
whether or not to create an incremental sort path, unless I'm missing
something obvious.

James



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Next
From: Andy Fan
Date:
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey