
From: Tomas Vondra
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Msg-id: 20190614181135.tdj7aq4iffy5ghl2@development
In response to: Re: [PATCH] Incremental sort (was: PoC: Partial sort) (Rafia Sabih <rafia.pghackers@gmail.com>)
List: pgsql-hackers
On Wed, Jun 05, 2019 at 06:14:14PM +0200, Rafia Sabih wrote:
>On Mon, 3 Jun 2019 at 15:39, James Coleman <jtc331@gmail.com> wrote:
>>
>> On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> > Currently, the costing logic (cost_incremental_sort) assumes all groups
>> > have the same size, which is fine for uniform distributions. But for
>> > skewed distributions, that may be an issue.
>> >
>> > Consider for example a table with 1M rows, two columns, 100 groups in each
>> > column, and an index on the first one.
>> >
>> >     CREATE table t (a INT, b INT);
>> >
>> >     INSERT INTO t SELECT 100*random(), 100*random()
>> >       FROM generate_series(1,1000000);
>> >
>> > Now, let's do a simple limit query to find the first row:
>> >
>> >     SELECT * FROM t ORDER BY a, b LIMIT 1;
>> >
>> > In this case the current costing logic is fine - the groups are close to
>> > average size, and we only need to sort the first group, i.e. 1% of data.
>> >
>> > Now, let's say the first group is much larger:
>> >
>> >
>> >     INSERT INTO t SELECT 0, 100*random()
>> >       FROM generate_series(1,900000) s(i);
>> >
>> >     INSERT INTO t SELECT 100*random(), 100*random()
>> >       FROM generate_series(1,100000);
>> >
>> > That is, the first group is now roughly 90% of the data, while the number
>> > of groups (and hence the average group size) is still the same. Yet we
>> > need to scan 90% of the data before producing the first row, and the
>> > current cost model is oblivious to this.
>>
>> Thinking out loud here: the current implementation doesn't guarantee
>> that sort groups always have the same prefix column values because
>> (from code comments) "Sorting many small groups with tuplesort is
>> inefficient"). While this seems a reasonable optimization, I think it's
>> possible that this thinking steered us away from an optimization in the
>> opposite direction. Perhaps we should always track whether or not all
>> prefix tuples are the same (currently we only do that after reaching
>> DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to
>> have tuplesort only care about the non-prefix columns (where currently
>> it has to sort on all pathkey columns even though for a large group
>> the prefix columns are guaranteed to be equal).
>>
>+1 for passing only the non-prefix columns to tuplesort.
>> Essentially I'm trying to think of ways that would get us to
>> comparable performance with regular sort in the case of large batch
>> sizes.
>>
>> One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the
>> case where you have a very small group and then a very large batch,
>> we'd lose the ability to optimize in the above way. That makes me
>> wonder if we shouldn't intentionally optimize for the possibility of
>> large batch sizes, since a little extra expense per group/tuple is
>> more likely to be a non-concern with small groups anyway when there
>> are large numbers of input tuples but a relatively small limit.
>>
>What about using the knowledge of the MCVs here? If we know the next
>value is in the MCV list, then we take the overhead of sorting this
>small group alone, and then leverage the optimization for the larger
>group by passing only the non-prefix columns.

Not sure. It very much depends on how expensive the comparisons are for
that particular data type. If the comparisons are cheap, then I'm not sure
it matters very much whether the group is small or large. For expensive
comparisons it may not be a win either, because we need to search the MCV
lists whenever the group changes.
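
FWIW here's a rough sketch of what such a probe on every group change
might look like - not actual patch code, and the n_mcv / mcv_values /
eq_proc fields are made up for illustration:

    static bool
    prefix_in_mcv_list(IncrementalSortState *node, Datum prefix)
    {
        /* one equality-function call per MCV entry, on every group change */
        for (int i = 0; i < node->n_mcv; i++)
        {
            if (DatumGetBool(FunctionCall2Coll(&node->eq_proc,
                                               node->collation,
                                               prefix,
                                               node->mcv_values[i])))
                return true;
        }
        return false;
    }

For cheap comparisons that loop is probably noise; for expensive ones it
may eat whatever we save by sorting fewer columns.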

I guess we'll need to make some benchmarks and see if it's a win or not.

>> Thoughts?
>>
>> > So I think we should look at the MCV list, and use that information when
>> > computing the startup/total cost. I think using the first/largest group to
>> > compute the startup cost, and the average group size for total cost would
>> > do the trick.
>>
>+1
>> I think this sounds very reasonable.
>>
>> > I don't think we can do much better than this during planning. There will
>> > inevitably be cases where the costing model will push us to do the wrong
>> > thing, in either direction. The question is how serious an issue that is, and
>> > whether we could remedy that during execution somehow.
>> >
>> > When we "incorrectly" pick full sort (when the incremental sort would be
>> > faster), that's obviously sad but I think it's mostly fine because it's
>> > not a regression.
>> >
>> > The opposite direction (picking incremental sort, while full sort being
>> > faster) is probably more serious, because it's a regression between
>> > releases.
>> >
>> > I don't think we can fully fix that by refining the cost model. We have
>> > two basic options:
>> >
>> > 1) Argue/show that this is not an issue in practice, because (a) such
>> > cases are very rare, and/or (b) the regression is limited. In short, the
>> > benefits of the patch outweigh the losses.
>>
>> My comments above go in this direction. If we can improve performance
>> in the worst case, I think it's plausible this concern becomes a
>> non-issue.
>>
>> > 2) Provide some fallback at execution time. For example, we might watch
>> > the size of the group, and if we run into an unexpectedly large one we
>> > might abandon the incremental sort and switch to a "full sort" mode.
>>
>> Are there good examples of our doing this in other types of nodes
>> (whether the fallback is an entirely different algorithm/node type)? I
>> like this idea in theory, but I also think it's likely it would add a
>> very significant amount of complexity. The other problem is knowing
>> where to draw the line: you end up creating these kinds of cliffs
>> where pulling one more tuple through the incremental sort would give
>> you your batch and result in not having to pull many more tuples in a
>> regular sort node, but the fallback logic kicks in anyway.
>>

I don't think we have nodes where we'd switch to an entirely different
algorithm - say from hash-join to nested-loop join (as is often proposed
as a solution for excessive memory consumption). That's obviously a very
complex thing to implement.

But I don't think that's the type of fallback we'd need here - IMO it's
more similar to switching from in-memory to on-disk sort. Essentially,
we'd need to disable the extra logic (detecting the prefix grouping), and
just stash all the remaining tuples into tuplesort and do regular sort.
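
In hand-wavy pseudo-code, something like this (the fallback,
groupTupleCount and expectedGroupSize fields are made up, and the actual
patch may structure this quite differently):

    /* current group grew way beyond the estimate - give up on detecting
     * prefix groups for the rest of the input */
    if (!node->fallback && node->groupTupleCount > node->expectedGroupSize)
        node->fallback = true;

    if (node->fallback)
    {
        TupleTableSlot *slot;

        /* stash everything that's left, then do one regular sort */
        while (!TupIsNull(slot = ExecProcNode(outerPlanState(node))))
            tuplesort_puttupleslot(node->tuplesortstate, slot);

        tuplesort_performsort(node->tuplesortstate);
    }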

Of course, I haven't actually implemented this, so maybe it's trickier.

>What about having some simple mechanism for this: if we encounter a
>group with more tuples than estimated, simply switch to normal sort for
>the remaining tuples, as the estimates do not hold true anyway. At least
>this avoids the issue of incremental sort regressing too badly compared
>to the normal sort. I mean having something like this: populate the
>tuplesortstate and keep counting the number of tuples in a group; if
>they are within the budget, call tuplesort_performsort, otherwise put
>all the tuples in the tuplesort and then call tuplesort_performsort. We
>may have an additional field in IncrementalSortState to save the
>estimated size of each group. I am assuming that we use MCV lists to
>better approximate the group sizes, as suggested above by Tomas.
>

Maybe. I suggest we try to implement the simplest solution first, trying
to do as much as possible during planning, and then try to be smart at
execution time only if necessary.
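
To be concrete, the planning-side tweak I mentioned above (largest group
for startup cost, average group size for the rest) might look roughly
like this, where cost_sort_group() is a made-up placeholder for the cost
of sorting a single group:

    /* assume the first group may be the largest MCV group */
    largest_group = input_tuples * max_mcv_freq;    /* from pg_statistic */
    avg_group = input_tuples / num_groups;          /* current assumption */

    startup_cost += cost_sort_group(largest_group);
    run_cost += (num_groups - 1) * cost_sort_group(avg_group);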


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



