Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers

From James Coleman
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id CAAaqYe_PXXHnOavS_X+C+yJnWesQzWeCv80jsp33SW4osGRUWg@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
List pgsql-hackers
On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>I think the first thing to do is get some concrete numbers on performance if we:
>
>1. Only sort one group at a time.
>2. Update the costing to prefer traditional sort unless we have very
>high confidence we'll win with incremental sort.
>
>It'd be nice not to have to add additional complexity if at all possible.

I've been focusing my efforts so far on seeing how much we can
eliminate performance penalties (relative to traditional sort). It
seems that if we can improve things enough there that we'd limit the
amount of adjustment needed to costing -- we'd still need to consider
cases where the lower startup cost results in picking significantly
different plans in a broad sense (presumably due to lower startup cost
and the ability to short circuit on a limit). But I'm hopeful then we
might be able to avoid having to consult MCV lists (and we wouldn't
have that available in all cases anyway)

As I see it the two most significant concerning cases right now are:
1. Very large batches (in particular where the batch is effectively
all of the matching rows such that we're really just doing a standard
sort).
2. Many very small batches.

(From reading the whole thread through again, there's a third case:
skewed group sizes, but I'm ignoring that for now because my intuition
is that if the sort can reasonably handle the above two cases *in
execution not just planning* the skew problem will become a
non-issue-)

For (1), I'd really like to be able to find a way to still benefit
from knowing we have prefix columns already sorted. Currently the
patch doesn't do that because of the optimization for small batches
precludes knowing that all prefixes in a batch are indeed equal. As
such I've temporarily removed that optimization in my testing to see
how much of a gain we get from not needing to sort prefix columns.
Intuitively it seems plausible that this would always beat a standard
sort, since a.) equality is sometimes much cheaper than ordering, and
b.) it reduces the likelihood of spilling to disk. In addition my fix
to make top-n heapsort improves things. The one thing I'm not sure
about here is the fact that I haven't seen parallel incremental sort
happen at all, and a parallel regular sort can sometimes beat single
process incremental sort (though at the cost of higher system
resources of course). As noted upthread though I haven't really
investigated the parallel component yet at all.

I did confirm a measurable speedup with suffix-only sorting (roughly
5%, relative to the current patch).

For (2), I'd love to have an optimization for the "batches are a
single tuple" case (indeed in my most common real-world use case,
that's virtually guaranteed). But that optimization is also
incompatible with the current minimum batch size optimization. There
was previously reference to the cost of copying the tuple slot being
the problem factor with small batches, and I've confirmed there is a
performance penalty for small batches if you handle every prefix as a
separate batch (relative to the current patch), but what I plan to
investigate next is whether or not that penalty is primarily due to
the bookkeeping of frequent copying of a pivot tuple or whether it's
due to running the sort algorithm itself so frequently. If it's just
the overhead of copying the pivot tuple, then I'm wondering if adding
functionality to tuplesort to allow peeking at the first inserted
tuple might be worth it.

Alternatively I've been thinking about ways to achieve a hybrid
approach: using single-batch suffix-only sort for large batches and
multi-batch full sort for very small batches. For example, I could
imagine maintaining two different tuplesorts (one for full sort and
one for suffix-only sort) and inserting tuples into the full sorter
until the minimum batch size, and then (like now) checking every tuple
to see when we've finished the batch. If we finish the batch by, say,
2 * minimum batch size, then we perform the full sort. If we don't
find a new batch by that point, we'd go back and check to see if the
tuples we've accumulated so far are actually all the same batch. If
so, we'd move all of them to the suffix-only sorter and continue
adding rows until we hit a new batch. If not, we'd perform the full
sort, but only return tuples out of the full sorter until we encounter
the current batch, and then we'd move the remainder into the
suffix-only sorter.

Using this kind of approach you can also imagine further optimizations
like checking the current tuple against the pivot tuple only every 10
processed tuples or so. That adds another level of complication,
obviously, but could be interesting.

Unrelated: There was some discussion upthread about whether or not
abbreviated column support could/should be supported. The comments
around tuplesort_set_bound in tuplesort.c claim that for bounded sort
abbreviated keys are not useful, and, in fact, the code there disables
abbreviated sort. Given that the single largest use case for this
patch is when we have a LIMIT, I think we can conclude not supporting
abbreviated keys is reasonable.

One other (meta) question: Do we care mostly about "real" queries on
synthetic data (and comparing them to master)? Or do we care more
about comparing solely the difference in sorting cost between the two
approaches? I think some of both is necessary, but I'm curious what
you all think.

James Coleman



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Disconnect from SPI manager on error
Next
From: Robert Haas
Date:
Subject: Re: Usage of epoch in txid_current