Re: parallelism and sorting - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: parallelism and sorting
Msg-id: CAM3SWZTsw0noq9nS_23LUTShVa8XVH78840dpaaTWCg1pwd+0w@mail.gmail.com
In response to: parallelism and sorting (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

On Mon, Nov 23, 2015 at 2:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I've been thinking about how parallelism interacts with sorting over
> the last few days and I wanted to share a few preliminary thoughts.  I
> definitely don't have all the answers worked out here yet, so thoughts
> are welcome.

I think it's definitely a good idea to have some high-level discussion
of these issues now. My responses will in some cases also be
high-level and aspirational.

> 2. Within parallel query, there are two reasons to care about data
> that is in sorted order.  First, we might need to deliver the results
> to the user in a particular order, because they've specified ORDER BY
> whatever.  Second, the optimal join strategy might be a merge join,
> which requires that both relations be sorted according to the join
> key.[1]

I gather the distinction you're making here is between a Sort node
and a node that happens to use a tuplesort without an explicit Sort
node (like a "COUNT(DISTINCT foo)" *Aggregate* node -- *not* a
GroupAggregate node). I am a little concerned that cases like this
might accidentally fail to benefit, simply because they lack an
explicit Sort node, as you yourself note below. Beyond that, the
CREATE INDEX and CLUSTER utility cases will also need to be
parallelized without all this executor infrastructure.
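
To make the Aggregate case concrete, here is a simplified sketch of
the pattern I mean -- an executor node driving tuplesort.c directly,
with no Sort plan node anywhere. The tuplesort_* calls are the
existing tuplesort.c interface; the surrounding scaffolding is made up
for illustration:

static void
count_distinct_datum(Oid datumType, Oid sortOperator, Oid collation,
                     int workMem)
{
    Tuplesortstate *sort;
    Datum       value;
    bool        isNull;

    sort = tuplesort_begin_datum(datumType, sortOperator, collation,
                                 false,  /* nulls last */
                                 workMem,
                                 false); /* no random access needed */

    /* ... feed each input value with tuplesort_putdatum() ... */

    tuplesort_performsort(sort);

    /* Scan in sorted order, counting each distinct value once */
    while (tuplesort_getdatum(sort, true, &value, &isNull))
    {
        /* ... compare against previous value; advance count ... */
    }

    tuplesort_end(sort);
}

No Sort node ever appears in EXPLAIN output for such a plan, which is
exactly why a Gather Merge executor node alone can't help it.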

> 3. The current Gather node reads tuples from the workers in
> round-robin fashion, skipping over workers that don't have a tuple
> ready yet when it needs one.  It seems potentially useful to have a
> Gather Merge node which would assume that the results from each worker
> are ordered with respect to each other, and do a final merge pass over
> those.  Then we could get the toplevel query ordering we want using a
> plan like this:
>
> Gather Merge
> -> Sort
>   -> Parallel Seq Scan on foo
>       Filter: something

I am of course strongly of the opinion that extending the new,
improved, but pending approach to external sorts [1] is the way to go.
Using the filesystem as "poor man's shared memory" when you can
actually afford real shared memory now seems like much less of a
problem than I thought in the past. More on that later.

The problem I see here is that having real executor nodes, while
preserving various useful properties of an on-the-fly merge, implies a
degree of cross-node promiscuity that I think won't fly. For one
thing, all my tricks with memory pooling during the final on-the-fly
merge become iffy, to say the least. For another, the children cannot
very well feed SortTuples to the parent using the usual TupleTableSlot
mechanism -- we benefit plenty from reuse of SortSupport and so on
during the merge. Who would want to reconstruct something
SortTuple-like on the other side (within the Gather Merge)? Besides,
if one child cannot produce tuples in time, then unlike in most cases
there is a legitimate need to hold everything else up. I think we
should leave parallelizing the merge itself to a later release --
possibly a much later one.
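
For context, this is (roughly) the representation tuplesort.c passes
around internally today; note that datum1 is often an abbreviated key,
meaningful only alongside the sorting backend's SortSupport state and
memory contexts:

typedef struct
{
    void       *tuple;          /* the tuple proper */
    Datum       datum1;         /* value of first key column */
    bool        isnull1;        /* is first key column NULL? */
    int         tupindex;       /* see notes in tuplesort.c */
} SortTuple;

Shipping these through TupleTableSlots would mean throwing that
surrounding context away and rebuilding it on the other side.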

It should probably still be presented to users more or less as you
outline -- it will just be an implementation detail.  IOW, what
explain.c calls "special child plans".

Actually, the answer here is probably simple -- as you suggest
separately, the "Gather Merge" actually does almost the same thing as
our existing on-the-fly merge step within tuplesort.c. The difference
is only that it gets information about runs to merge from the workers
when they finish the sort. There is a little bit of bookkeeping in
shared memory, plus we revise the tuplesort.c interface to allow what
is essentially my new approach to external sorts to happen in phases
managed at a slightly lower level by the tuplesort client. The
existing interface is preserved, and a "build your own sort" interface
is added alongside it. Clients continue to pass back and forth a
little opaque state for tuplesort's benefit, some of which is stored
by a Gather-like node in shared memory, but that's it. Just for this,
we need a new tuplesort_end() variant that frees memory early without
releasing tapes, and the caller needs to know a bit about runs (or at
least that partitioning falls out of the number of workers actually
available).
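
To sketch what I mean (purely hypothetical -- none of these functions
or types exist, and the names are made up for discussion):

/* Opaque bookkeeping a worker publishes via shared memory once its
 * run(s) are sorted: which tapes exist, and where they live on disk */
typedef struct SharedSortRunSet SharedSortRunSet;

/* Worker side: after tuplesort_performsort(), release memory early
 * WITHOUT releasing tapes (the tuplesort_end() variant), and export
 * the run bookkeeping for the leader */
extern SharedSortRunSet *tuplesort_export_runs(Tuplesortstate *state);

/* Leader side: build a merge-only tuplesort over all workers' runs.
 * This is essentially the existing final on-the-fly merge step, fed
 * from shared bookkeeping rather than backend-local state */
extern Tuplesortstate *tuplesort_begin_merge_runs(SharedSortRunSet **sets,
                                                  int nsets, int workMem);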

> 4. Gather Merge would be an executor node, and thus not available to
> any code that uses tuplesort.c directly.  Also, it seems fairly
> mediocre for merge joins.  The best we could do is something like
> this:[2]
>
> Merge Join
> -> Gather Merge
>   -> Sort
>     -> Parallel Seq Scan
> -> Gather Merge
>   -> Sort
>     -> Parallel Seq Scan
>
> The problem with this plan is that the join itself is not done in
> parallel, only the sorting.  That's not great, especially if there are
> more joins that need to be done afterwards, necessarily not in
> parallel.[2]  It's possible that one side of the join could be an
> Index Scan rather than Gather Merge -> Sort -> Parallel Seq Scan, but
> that doesn't change the overall picture here much.

That's not a huge problem, because at least the Sort is the really
expensive part.

Have you tried contriving a merge join test case with a really cheap
sort or pair of sorts? I'd come up with a case with perfectly sorted
inputs into the sort nodes, and compare that with an equivalent case
with random inputs. Our quicksort's precheck for already-sorted input
(in effect, a "bubble sort best case") will make the sorts
artificially very cheap. That might provide guidance on how much
parallelizing the synchronization of the two relations (the merging)
can be expected to help, since that cost grows linearly rather than at
O(n log n) -- an observation that crops up in a few places.
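
For reference, the precheck amounts to no more than this (the real
check lives inside our qsort implementation; this is just the shape of
the idea, not the actual code):

static bool
input_is_presorted(const int *a, size_t n)
{
    size_t      i;

    /* One linear pass; bail at the first inversion found */
    for (i = 1; i < n; i++)
    {
        if (a[i - 1] > a[i])
            return false;   /* must really sort after all */
    }
    return true;            /* n - 1 comparisons: bubble sort's best case */
}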

The funny thing about linearithmic growth is that if you have enough
memory to do a big enough sort (or a big enough sort of each run),
which these days you probably do when a parallel sort is recommended,
it comes to dominate everything: at a billion tuples, the log factor
alone is about 30, so the comparison cost dwarfs any single linear
pass over the data. Suddenly, problems like having to write out runs
matter far less, and asynchronous I/O becomes merely a nice-to-have,
where 10 or 15 years ago it was very important for parallel sort.
Merging might be in a similar category. Postgres development may
actually benefit from coming late to parallel sort, since today's
large memory sizes and large database sizes (and, to a lesser extent,
the properties of modern block devices) significantly reduce what were
once the hard problems. More research is required, but it seems worth
considering.

What I found really interesting during my experiments with the new
approach to sorting (a simple hybrid sort-merge strategy) was how
consistent performance was past a certain work_mem setting (about 1GB,
IIRC). Lower settings (so long as they stayed above that fuzzy
threshold) resulted in a shorter, maybe even much shorter time spent
sorting runs. And yet (at least without abbreviated keys, and
especially for the pass-by-value datum case), the later cost of
merging grew to absorb those savings surprisingly exactly. This
indicated to me that linearithmic growth dominated. It also suggests
that we can perhaps partition without regard for anything other than
keeping n workers busy, which makes assigning work to workers
relatively straightforward.

The whole notion that it's okay for there to be a huge gulf between
internal and external sort is a bad, old-fashioned idea that needs to
die. Blurring the distinction between the two has benefits all over
the place; for example, it greatly reduces the cost of a
misestimation. Besides, big sorts will tend to have to be external
sorts. I also think that sorting is more amenable to being usefully
parallelized than anything else -- nothing else is so computationally
intensive. As Nyberg et al put it in 1994:

"""
Reducing data cache misses can be an intractable problem if the data
references are made by a very large number of instructions. For
instance, code to execute the TPC-A benchmarks is usually
characterized by a very large number of basic blocks that do not loop.
In this environment, it is very difficult to understand the data
access patterns, let alone modify them to reduce cache misses.

In contrast, sorting belongs to a class of programs that make a very
large number of data accesses from a small amount of looping code. In
this environment, it is feasible to control data accesses via
algorithm or data structure modifications.
"""

More recently, it has been predicted that trends in CPU development
-- more cores, and *less* memory bandwidth per core (the per-core
trend for memory bandwidth is not mere stagnation; it's _regression_)
-- will tend to favor merge joins in the long term. Hash joins cannot
scale as well, primarily because of the memory bandwidth bottleneck,
but also because hashing is not amenable to SIMD instructions, which
we can hope to eventually benefit from when sorting.

The point is that parallel sorting is relatively easy to get
significant benefits from as compared to parallelizing other things,
and maybe an orientation towards using merge joins more frequently is
a good long term goal to have.

> 5. Really nailing the merge join case seems to require partitioning
> both relations in a fashion compatible with the join attribute, and
> then joining the partitions separately.  Consider an operator
> Repartition which reads rows from its child plan and returns those
> where hash(joincol) % NumberOfWorkers == MyWorkerNumber.

I'll have to think about that some more.

> 7. Another option, instead or in addition to introducing a Repartition
> operator, is to make the sort itself parallel-aware.  Each worker
> reads rows until it fills work_mem, quicksorts them, and dumps them
> out as a run.  Suppose there are few enough runs that we don't need
> multiple merge passes, and that we have some way of making every
> worker aware of every run performed by any worker.  Then any one
> or more of the workers can get the sorted results out by performing a
> final merge pass over the runs we produced.  We could support various
> models for reading the results of the sort: return every tuple to
> every worker; return every tuple to some worker, but don't return any
> given tuple to more than one worker; return all tuples in the leader.
> So if we just want to sort a big pile of tuples, the plan can now look
> like this:
>
> Gather
> -> Parallel Sort
>     Output Mode: Leader Only
>   -> Parallel Seq Scan
>
> I'm not sure if that's better or worse or exactly equivalent to the
> Gather Merge -> Sort -> Parallel Seq Scan approach.  If we want to do a
> parallel merge join, we now have options like this:

As I went into already, my tentative view is that this is better.

> OK, that's all I've got.  So in the space of one email, I've proposed
> executor nodes for Gather Merge, Repartition, Partial Index Scan, and
> Parallel Sort (with three different output modes).  And I don't know
> which ones are actually most interesting, or whether we need them all.
> Whee!  Nor do I know whether any of this can work for code that
> currently uses tuplesort.c directly.  Double whee!

Sometimes it's appropriate to talk about things in a hand-wavey
fashion. We don't do enough of that.

Closing thought: work_mem is a bad model for sorting in the long run,
since higher settings won't help much past a certain threshold.

We need to come up with a model that only allows a very high effective
work_mem setting when that is enough to do the sort fully internally
(and possibly when we weren't going to parallelize the sort anyway,
since parallelism may at least initially only work for external
sorts). Otherwise, we might as well size the effective work_mem
setting according to what our n workers require, or at a level just
past the threshold at which the benefits of a higher setting are
almost noise. This is perhaps also a stepping stone to admission
control. Finally, it empowers us to give a worker wiggle-room -- an
"effective work_mem burst" sufficient to avoid producing an additional
tiny run -- which seems like a good idea, and simplifies the
partitioning model that the planner needs.
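
To illustrate the kind of policy I have in mind (again purely
hypothetical -- nothing like this exists, and every name here is made
up):

static int64
effective_sort_mem(int64 requested_work_mem,
                   int64 estimated_input_bytes,
                   int nworkers,
                   int64 diminishing_returns_bytes)
{
    int64       per_worker;

    /* Fully internal sort: a high setting actually helps, so use it */
    if (estimated_input_bytes <= requested_work_mem)
        return requested_work_mem;

    /* External sort: divide among workers, and cap each near the
     * threshold past which a higher setting is almost noise */
    per_worker = requested_work_mem / nworkers;
    if (per_worker > diminishing_returns_bytes)
        per_worker = diminishing_returns_bytes;

    return per_worker;
}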

[1] https://commitfest.postgresql.org/7/317/
-- 
Peter Geoghegan


