Re: parallelism and sorting - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: parallelism and sorting
Msg-id: CA+TgmoYwD8ZUA1w1RiXiSZQUZPYXNQ0nMe7hwEqba0fZd0fa_A@mail.gmail.com
In response to: Re: parallelism and sorting (Peter Geoghegan <pg@heroku.com>)
Responses: Re: parallelism and sorting (Amit Kapila <amit.kapila16@gmail.com>)
           Re: parallelism and sorting (Ants Aasma <ants.aasma@eesti.ee>)
List: pgsql-hackers
On Mon, Nov 23, 2015 at 8:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> 2. Within parallel query, there are two reasons to care about data
>> that is in sorted order.  First, we might need to deliver the results
>> to the user in a particular order, because they've specified ORDER BY
>> whatever.  Second, the optimal join strategy might be a merge join,
>> which requires that both relations be sorted according to the join
>> key.[1]
>
> I gather the distinction you're making here is between a Sort node,
> and a node that happens to use a tuplesort without an explicit Sort
> node (like a "COUNT(DISTINCT(foo))" *Aggregate* node -- *not* a
> GroupAggregate node).

Yes.  Or things that aren't part of the executor at all.

> I am a little concerned cases like this might
> accidentally not benefit due to not explicitly having a Sort node, as
> you refer to below.

A valid concern.

> Beyond that, CREATE INDEX and CLUSTER utility
> cases will also need to be parallelized without all this executor
> infrastructure.

Or, alternatively, CREATE INDEX and CLUSTER could be refactored to use
the executor.  This might sound crazy, but maybe it's not.  Perhaps
we could have the executor tree output correctly-formed index tuples
that get funneled into a new kind of DestReceiver that puts them into
the index.  I don't know if that's a GOOD idea, but it's an idea.

> The problem I see here is that having real executor nodes, while
> preserving various useful properties of an on-the-fly merge implies a
> degree of cross-node promiscuity that I think won't fly. For one
> thing, all my tricks with memory pooling during the final on-the-fly
> merge become iffy, to say the least. For another, the children cannot
> very well feed SortTuples to the parent using the usual TupleTableSlot
> mechanism -- we benefit plenty from reuse of SortSupport and so on
> during the merge. Who would want to reconstruct something
> SortTuple-like on the other side (within the Gather Merge)? Besides,
> if one child cannot produce tuples in time, unlike many things there
> is a legitimate need to hold everything up. I think we should
> parallelize the Merge in a later release -- possibly much later.

The implementation I have in mind for Gather Merge is as follows.
Currently, a Gather node has two TupleTableSlots - one for tuples that
the leader generates itself by running the plan before the workers get
started or when they can't keep up, and a second for tuples read from
the workers.  What I plan to do is refactor it so that there is one
TupleTableSlot per worker.  If we're doing a standard Gather, we
simply return a tuple from whichever slot we manage to fill first.  If
we're doing a Gather Merge, we fill every slot, then build a heap of
the tuples and return the lowest one.  When we need the next tuple, we
refill that slot, restore the heap property, lather, rinse, repeat.
This is basically the same way MergeAppend works, but instead of
reading tuples from multiple subplans, we're reading them from
multiple workers.  There's really no cross-node promiscuity here -
whatever is under the Gather Merge neither knows nor cares what the
Gather Merge will do with the tuples, and it does not need to be fed
by an explicit sort any more than MergeAppend does.
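
Just to make the mechanics concrete, here is a toy standalone sketch
of that heap-driven merge - plain C, with pre-sorted integer arrays
standing in for the per-worker tuple streams, so none of this is
actual executor code:

#include <stdio.h>
#include <stdlib.h>

/* A stand-in for one worker's output stream: a pre-sorted array. */
typedef struct Source
{
    int    *vals;
    int     pos;
    int     len;
} Source;

/* One "slot": the current value from one source, plus its origin. */
typedef struct Slot
{
    int     val;
    int     src;
} Slot;

static void
sift_down(Slot *heap, int n, int i)
{
    for (;;)
    {
        int     l = 2 * i + 1;
        int     r = l + 1;
        int     min = i;
        Slot    tmp;

        if (l < n && heap[l].val < heap[min].val)
            min = l;
        if (r < n && heap[r].val < heap[min].val)
            min = r;
        if (min == i)
            break;
        tmp = heap[i];
        heap[i] = heap[min];
        heap[min] = tmp;
        i = min;
    }
}

/*
 * Fill one slot per source, heapify, then repeatedly emit the minimum
 * and refill the slot we just drained - the MergeAppend pattern.
 */
static void
gather_merge(Source *src, int nsrc)
{
    Slot   *heap = malloc(nsrc * sizeof(Slot));
    int     n = 0;
    int     i;

    for (i = 0; i < nsrc; i++)
    {
        if (src[i].pos < src[i].len)
        {
            heap[n].val = src[i].vals[src[i].pos++];
            heap[n].src = i;
            n++;
        }
    }
    for (i = n / 2 - 1; i >= 0; i--)
        sift_down(heap, n, i);

    while (n > 0)
    {
        int     s = heap[0].src;

        printf("%d\n", heap[0].val);
        if (src[s].pos < src[s].len)
            heap[0].val = src[s].vals[src[s].pos++];    /* refill slot */
        else
            heap[0] = heap[--n];    /* source exhausted; shrink heap */
        sift_down(heap, n, 0);
    }
    free(heap);
}

int
main(void)
{
    int     a[] = {1, 4, 7};
    int     b[] = {2, 5, 8};
    int     c[] = {3, 6, 9};
    Source  src[3] = {{a, 0, 3}, {b, 0, 3}, {c, 0, 3}};

    gather_merge(src, 3);
    return 0;
}

The thing to notice is that all of the merge state is local to the
merging node; the sources neither know nor care that they are being
merged, just as with MergeAppend.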

>> 4. Gather Merge would be an executor node, and thus not available to
>> any code that uses tuplesort.c directly.  Also, it seems fairly
>> mediocre for merge joins.  The best we could do is something like
>> this:[2]
>>
>> Merge Join
>> -> Gather Merge
>>   -> Sort
>>     -> Parallel Seq Scan
>> -> Gather Merge
>>   -> Sort
>>     -> Parallel Seq Scan
>>
>> The problem with this plan is that the join itself is not done in
>> parallel, only the sorting.  That's not great, especially if there are
>> more joins that need to be done afterwards, necessarily not in
>> parallel.[2]  It's possible that one side of the join could be an
>> Index Scan rather than Gather Merge -> Sort -> Parallel Seq Scan, but
>> that doesn't change the overall picture here much.
>
> That's not a huge problem, because at least the Sort is the really
> expensive part.

OK, but suppose you need to do a hash or nested loop join to another
table after the merge join.  With this approach, you cannot
parallelize that.

> Have you tried contriving a merge join test case with a really cheap
> sort or pair of sorts?

No.  My real-world experience, back before I became a full-time
hacker, was that hash joins were often faster than nested loops, and
merge joins were dog slow.  I dunno if that's representative of other
people's experience, or whether subsequent releases have changed the
picture.

> What I found really interesting during my experiments with the new
> approach to sorting (simple hybrid sort-merge strategy) was how
> performance was very consistent past a certain work_mem setting (about
> 1GB IIRC). Lower settings, so long as they stayed above that fuzzy
> threshold, resulted in a shorter, maybe even much shorter time spent
> sorting runs. And
> yet, (at least without abbreviated keys, and especially for the
> pass-by-value datum case) the later cost of merging grew surprisingly
> well in-line with whatever the reduced time spent sorting runs was.
> This indicated to me that linearithmic growth dominated. It also
> indicates that we can perhaps partition very strategically, without
> regard for anything other than keeping n workers busy, making
> assigning work to workers relatively straightforward.

Agreed on that last part.  It's interesting to think about what
further operations we can do with a built-in partitioning notion that
permits us to recast a join-of-appends as an append-of-joins, a
standard technique for making partitioned parallelism work well at
scale.  But in the meantime, and maybe even in the long term, the
algorithm we've actually implemented, where Parallel Seq Scan
partitions the relation block-by-block, has a lot to recommend it.
Worker imbalance is avoided because each worker slurps up data as fast
as it can; if that speed varies from worker to worker for whatever
reason, we still keep all the workers busy until the whole computation
is done.
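
The self-scheduling part is simple enough to sketch in a few lines.
This is just the shape of the idea - C11 atomics and threads standing
in for worker backends and the shared scan state, not what the actual
implementation looks like:

#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

#define NBLOCKS     1000
#define NWORKERS    4

/* Shared scan state: the lowest block number nobody has claimed yet. */
static atomic_int next_block;
static long nprocessed[NWORKERS];

/*
 * Each worker claims the next unclaimed block and "scans" it.  A slow
 * worker just ends up claiming fewer blocks; nobody ever waits for a
 * straggler to finish a pre-assigned chunk.
 */
static void *
worker(void *arg)
{
    int     id = *(int *) arg;

    for (;;)
    {
        int     blkno = atomic_fetch_add(&next_block, 1);

        if (blkno >= NBLOCKS)
            break;
        /* ... read and process block blkno here ... */
        nprocessed[id]++;
    }
    return NULL;
}

int
main(void)
{
    pthread_t   tid[NWORKERS];
    int         ids[NWORKERS];
    int         i;

    atomic_init(&next_block, 0);
    for (i = 0; i < NWORKERS; i++)
    {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < NWORKERS; i++)
    {
        pthread_join(tid[i], NULL);
        printf("worker %d scanned %ld blocks\n", i, nprocessed[i]);
    }
    return 0;
}

Because a slow worker simply claims fewer blocks, the final block is
handed out at about the same moment no matter how unevenly the workers
run, which is where the imbalance-resistance comes from.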

> The point is that parallel sorting is relatively easy to get
> significant benefits from as compared to parallelizing other things,
> and maybe an orientation towards using merge joins more frequently is
> a good long term goal to have.

Maybe.  I don't think anyone's done a lot of work to compare the speed
of merge joins to the speed of say hash joins on a modern version of
PostgreSQL in situations where both algorithms are practical.  That
seems like an essential prerequisite to any thought of changing the
cost model.

>> 7. Another option, instead or in addition to introducing a Repartition
>> operator, is to make the sort itself parallel-aware.  Each worker
>> reads rows until it fills work_mem, quicksorts them, and dumps them
>> out as a run.  Suppose there are few enough runs that we don't need
>> multiple merge passes, and that we have some way of making every
>> run performed by any worker available to every worker.  Then any one
>> or more of the workers can get the sorted results out by performing a
>> final merge pass over the runs we produced.  We could support various
>> models for reading the results of the sort: return every tuple to
>> every worker; return every tuple to some worker, but don't return any
>> given tuple to more than one worker; or return all tuples in the
>> leader.
>> So if we just want to sort a big pile of tuples, the plan can now look
>> like this:
>>
>> Gather
>> -> Parallel Sort
>>     Output Mode: Leader Only
>>   -> Parallel Seq Scan
>>
>> I'm not sure if that's better or worse or exactly equivalent to the
>> Gather Merge -> Sort -> Parallel Seq Scan approach.  If we want to do a
>> parallel merge join, we now have options like this:
>
> As I went into already, my tentative view is that I think this is better.

OK, that's helpful to know.  I'm quite sure we need Gather Merge no
matter what, because of stuff like:

Gather Merge
-> Nested Loop
   -> Index Scan on foo
   -> Index Scan on bar
        Index Cond: bar.x = foo.x

Without Gather Merge, there's no way to do a parallel join that
preserves the ordering provided by the outer index scan, so the query
will involve an explicit sort where it need not.  The interesting
question in my mind isn't so much whether we need Gather Merge,
because I am pretty well sure we do, but what else we need, and it
sounds like you're saying a parallel-aware tuplesort.c ought to be on
the list.  Good enough.

What would this actually look like, from an API point of view?  I
think probably:

1. Caller creates a ParallelContext.
2. Caller creates a parallel-aware tuplesort using some tuplesort.h API.
3. Caller calls LaunchParallelWorkers(pcxt), arranging for each worker
to "attach" to the parallel-aware tuplesort.
4. Workers, and caller if desired, put data into the tuplesort to be
sorted.  When done, they perform the sort.
5. When all workers have performed the sort, the sorted data can be
read by one or all workers in the usual way.
6. After the workers exit, original process destroys the parallel context.
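
None of the tuplesort.h entry points implied by steps 2-5 exist yet,
so the following is only a sketch of the intended flow, with threads
standing in for worker backends and a plain struct standing in for the
shared state; every name in it is invented:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define NWORKERS    3
#define PER_WORKER  4

/* Shared state every participant "attaches" to; in PostgreSQL this
 * would live in dynamic shared memory under the ParallelContext. */
typedef struct SharedSort
{
    int     data[NWORKERS][PER_WORKER];     /* each worker's input */
    pthread_barrier_t done;                 /* "all runs sorted" point */
} SharedSort;

typedef struct WorkerArg
{
    SharedSort *shared;
    int         id;
} WorkerArg;

static int
cmp_int(const void *a, const void *b)
{
    return (*(const int *) a > *(const int *) b) -
           (*(const int *) a < *(const int *) b);
}

/* Step 4: each worker sorts the data it put in, then waits. */
static void *
worker_main(void *arg)
{
    WorkerArg  *wa = arg;

    qsort(wa->shared->data[wa->id], PER_WORKER, sizeof(int), cmp_int);
    pthread_barrier_wait(&wa->shared->done);
    return NULL;
}

int
main(void)
{
    SharedSort  shared;
    pthread_t   tid[NWORKERS];
    WorkerArg   args[NWORKERS];
    int         i,
                j;

    /* Steps 1-2: set up the shared sort state ("parallel context"). */
    pthread_barrier_init(&shared.done, NULL, NWORKERS + 1);
    for (i = 0; i < NWORKERS; i++)
        for (j = 0; j < PER_WORKER; j++)
            shared.data[i][j] = rand() % 100;

    /* Step 3: launch workers; each one "attaches" to the shared state. */
    for (i = 0; i < NWORKERS; i++)
    {
        args[i].shared = &shared;
        args[i].id = i;
        pthread_create(&tid[i], NULL, worker_main, &args[i]);
    }

    /* Step 5: once everyone has sorted, the leader can read the
     * per-worker runs (a real implementation would merge them). */
    pthread_barrier_wait(&shared.done);
    for (i = 0; i < NWORKERS; i++)
    {
        printf("run %d:", i);
        for (j = 0; j < PER_WORKER; j++)
            printf(" %d", shared.data[i][j]);
        printf("\n");
    }

    /* Step 6: tear down. */
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&shared.done);
    return 0;
}

In the real thing the shared state would live in dynamic shared memory
under the ParallelContext, and the policy choice in step 5 is exactly
where the output modes quoted in point 7 above would plug in.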

> Closing thought: work_mem is a bad model for sorting in the long run,
> since higher settings won't help much past a certain threshold.
>
> We need to come up with a model that basically only allows a very high
> effective work_mem setting when that is enough to do the sort fully
> internally (and possibly when we weren't going to parallelize the sort
> anyway, since that may at least initially only work for external
> sorts). Otherwise, we might as well size an effective work_mem setting
> according to what our n workers require, or at a level past the
> threshold at which the benefits of a higher setting are almost noise.
> This is perhaps also a stepping stone to admission control. Finally,
> it also empowers us to provide wiggle-room to allow a worker an
> "effective work_mem burst", sufficient to not have an additional tiny
> run, which seems like a good idea, and simplifies the partitioning
> model that the planner needs.

Interesting idea.

One idea about parallel sort is that perhaps if multiple workers feed
data into the sort, they can each just sort what they have and then
merge the results.  So there's no real distinction between internal
and external for parallel sorts; but a parallel sort always involves a
final merge.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


