Thread: Parallel Sort
It would be great if one client session could take advantage of multiple CPU cores. EnterpriseDB wishes to start the trek into this problem space for 9.4 by implementing parallel internal (i.e. not spilling to disk) sort. This touches on a notable subset of the infrastructure components we'll need for parallel general query. My intent is to map out the key design topics, hear about critical topics I hadn't considered, and solicit feedback on the quality of the high-level plan. Full designs for key pieces will come later.

* Worker Backends

A multi-process model, rather than a multi-thread one, is best for introducing parallelism to a PostgreSQL session. With threading, all in-memory state would become shared by default; with processes, in-memory data structures remain unshared until we explicitly select them for sharing. Some data structures will require new sharing, but I expect unshared ones to remain the typical case. I refer to these additional processes as worker backends. They will have much in common with ordinary client backends, including a database connection and a PGPROC. They will have no client socket; instead, they will take direction from the client-connected backend, termed the master, via shared memory.

Each worker needs to make SnapshotNow visibility decisions coherent with the master. For sorting, this allows us to look up comparison functions, even when the current transaction created or modified those functions. This will also be an essential building block for any parallelism project that consults user tables. Implementing this means copying the subtransaction stack and the combocid hash to each worker. For the sake of completeness, we should also copy the global MVCC snapshot data (sorting probably won't care).
Making the workers' visibility coherent with the master also means forbidding, while a parallel task is in flight, operations that affect the transaction state:

- CommandCounterIncrement()
- GetCurrentCommandId(true)
- AssignTransactionId()
- subtransaction-management functions

I don't intend to commit us to a design fundamentally antagonistic to, for example, starting a subtransaction in a worker backend. That being said, I think we could achieve a lot in the parallel query space before we would deem it important to relax those restrictions.

Workers will copy all GUCs from the master. bttextcmp() needs lc_collate, and we'll inevitably reach the point of needing broad GUC coverage as we make more things run in parallel. Further GUC changes will be forbidden while a parallel task is in flight. (More generally, anything that mutates backend-local state without reference to shared state will either need to be forbidden or converted to route through shared state.)

The heavyweight locking mechanism will need to be aware of the association between the master and its workers. For sorting, workers will acquire heavyweight locks on system catalogs only. It is almost enough for the workers to behave as though they are independent of the master for locking purposes. But an undetected deadlock would arise if the master holds an AccessExclusiveLock on a system catalog a worker needs.

We can allocate a small amount of permanent shared memory for coordination among a group of processes, but sorting will benefit from a region as large as maintenance_work_mem. Expect on-demand memory sharing.

* Identifying Parallel-Compatible Functions

Not all functions can reasonably run on a worker backend. We should not presume that a VOLATILE function can tolerate the unstable execution order imposed by parallelism, though a function like clock_timestamp() is perfectly reasonable to run that way. STABLE does not have that problem, but neither does it constitute a promise that the function implementation is compatible with parallel execution.
Consider xid_age(), which would need code changes to operate correctly in parallel. IMMUTABLE almost guarantees enough; there may come a day when all IMMUTABLE functions can be presumed parallel-safe. For now, an IMMUTABLE function could cause trouble by starting a (read-only) subtransaction. The bottom line is that parallel-compatibility needs to be separate from volatility classes for the time being.

I'm not sure what the specific answer here should look like. Simply having a CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because the rules are liable to loosen over time.

* Planner & Similar Issues

We should decide whether to actually sort in parallel based on the comparator cost and the data size. The system currently has no information on comparator cost: bt*cmp (and indeed almost all built-in functions) all have procost=1, but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the procost estimates of all core B-tree and hash operators. This will have other benefits, but we will need to be cognizant of the risk of upsetting setups that have tuned cpu_operator_cost based on the present situation.

The choice of whether to parallelize can probably be made in a manner similar to the choice to do an external sort: the planner guesses the outcome for costing purposes, but the actual decision is made at execution time. The planner would determine a tuple count cutoff at which parallelism becomes favorable, and tuplesort would check that cutoff to establish its actual decision.

Use of parallelism within one process has a distributed effect on the system, similar to the use of work_mem. Add a PGC_USERSET GUC representing the number of worker processes the current session is willing to use. It would default to a small number, say zero or two. On a throughput-oriented system with high concurrent query counts, the DBA would tend to set/leave it at zero in postgresql.conf; parallelism can only help if the system has free resources.
Separately, add a GUC limiting the total number of workers across the system at any one time.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
Hi,

Interesting! Need to think about most, but one piece immediately came to mind:

On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions.

I don't really see how you can achieve that given how SnapshotNow works. There's nothing protecting you against one backend seeing changes made by another transaction while another doesn't see them. SnapshotNow doesn't even guarantee consistency within a single backend during a single scan...

If you are meaning the above to just apply to changes made by the local "master" backend, sure I can see that.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> writes:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions. This will
> also be an essential building block for any parallelism project that consults
> user tables. Implementing this means copying the subtransaction stack and the
> combocid hash to each worker.
> [ ... and GUC settings, and who knows what else ... ]

This approach seems to me to be likely to guarantee that the startup overhead for any parallel sort is so large that only fantastically enormous sorts will come out ahead.

I think you need to think in terms of restricting the problem space enough so that the worker startup cost can be trimmed to something reasonable. One obvious suggestion is to forbid the workers from doing any database access of their own at all --- the parent would have to do any required catalog lookups for sort functions etc. before forking the children.

I think we should also seriously think about relying on fork() and copy-on-write semantics to launch worker subprocesses, instead of explicitly copying so much state over to them. Yes, this would foreclose ever having parallel query on Windows, but that's okay with me (hm, now where did I put my asbestos longjohns ...)

Both of these lines of thought suggest that the workers should *not* be full-fledged backends.

			regards, tom lane
On 2013-05-13 10:57:39 -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Each worker needs to make SnapshotNow visibility decisions coherent with the
> > master. For sorting, this allows us to look up comparison functions, even
> > when the current transaction created or modified those functions. This will
> > also be an essential building block for any parallelism project that consults
> > user tables. Implementing this means copying the subtransaction stack and the
> > combocid hash to each worker.
>
> > [ ... and GUC settings, and who knows what else ... ]
>
> This approach seems to me to be likely to guarantee that the startup
> overhead for any parallel sort is so large that only fantastically
> enormous sorts will come out ahead.

I think if this is the way to go - and I am not sure it is - we need to use a worker pool whose members are (re-)used every time someone needs to do a sort. Which would be easier if backends could switch databases...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This approach seems to me to be likely to guarantee that the startup
> overhead for any parallel sort is so large that only fantastically
> enormous sorts will come out ahead.
>
> I think you need to think in terms of restricting the problem space
> enough so that the worker startup cost can be trimmed to something
> reasonable. One obvious suggestion is to forbid the workers from
> doing any database access of their own at all --- the parent would
> have to do any required catalog lookups for sort functions etc.
> before forking the children.
>
> I think we should also seriously think about relying on fork() and
> copy-on-write semantics to launch worker subprocesses, instead of
> explicitly copying so much state over to them. Yes, this would
> foreclose ever having parallel query on Windows, but that's okay
> with me (hm, now where did I put my asbestos longjohns ...)
>
> Both of these lines of thought suggest that the workers should *not*
> be full-fledged backends.

Eventually, PostgreSQL needs not only parallel sort, but a more general parallel query facility. The goal here is not to design something specific to parallel sort, but to provide a general infrastructure for server-side parallelism. If we restrict ourselves to a design where syscache lookups aren't possible from a worker backend, I have trouble seeing how that's ever gonna work. That's a very low-level facility that a lot of things rely on. Even if you could make the shuttling of requests between master and slave transparent to the backend code, it would cut into the amount of actually parallelizable work and add very significantly to the overhead.

I don't see any reason to panic about the worker startup cost. I don't know whether the stuff that Noah mentioned copying will take 10 microseconds or 100 milliseconds, but there are plenty of sorts that take many seconds or minutes to complete, so there's still plenty of opportunity for a win there. By definition, the things you want to run in parallel are the ones that take a long time if you don't run them in parallel.

Now, of course, if we can reduce the cost of starting new backends (e.g. by keeping them around from one parallel operation to the next, or by starting them via fork), that will expand the range of cases where parallelism is a win. But I think we could win in plenty of interesting real-world cases even if it took a full second to initialize each new worker, and surely it won't be nearly that bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 13, 2013 at 04:39:01PM +0200, Andres Freund wrote:
> On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> > Each worker needs to make SnapshotNow visibility decisions coherent with the
> > master. For sorting, this allows us to look up comparison functions, even
> > when the current transaction created or modified those functions.
>
> I don't really see how you can achieve that given how SnapshotNow
> works. There's nothing protecting you against one backend seeing changes
> made by another transaction while another doesn't see them. SnapshotNow
> doesn't even guarantee consistency within a single backend during a
> single scan...
>
> If you are meaning the above to just apply to changes made by the local
> "master" backend, sure I can see that.

Yes; it only makes the workers consistent with the master to the extent that the master is consistent with itself. However, your comment makes me see that this may not be enough. For an object not protected by locks, if parallel execution probes the syscache N times where the current code probes it only once, we'll introduce new anomalies. I don't think this affects sorting in particular, because we already make little effort to behave sanely when a comparison function is redefined mid-sort. It seems likely that this will need a better answer sooner or later as we move into the parallelism space.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On 13 May 2013 15:28, Noah Misch <noah@leadboat.com> wrote:
> The heavyweight locking mechanism will need to be aware of the association
> between the master and its workers.

Not sure I can see why that would be true.

ISTM that the workers need to be restricted in various ways from a full-functioned master. If the workers can do things that conflict with the master and/or each other, we're going to have all kinds of hurt.

The key is going to be deciding what the restrictions are and then working out how to enforce them.

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/5/13 Noah Misch <noah@leadboat.com>:
> * Planner & Similar Issues
>
> We should decide whether to actually sort in parallel based on the comparator
> cost and the data size. The system currently has no information on comparator
> cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
> but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the
> procost estimates of all core B-tree and hash operators. This will have other
> benefits, but we will need to be cognizant of the risk of upsetting setups
> that have tuned cpu_operator_cost based on the present situation.
>
> The choice of whether to parallelize can probably be made in a manner similar
> to the choice to do an external sort: the planner guesses the outcome for
> costing purposes, but the actual decision is made at execution time. The
> planner would determine a tuple count cutoff at which parallelism becomes
> favorable, and tuplesort would check that to establish its actual decision.

It probably crosses over with my problem consciousness about off-loading CPU-bound workloads, which I partially tried to implement in the writable foreign table feature. Not only for sorting: I think it may be worthwhile to have the capability to push heavy workloads (like sort, aggregate or complex target-lists) out to external computing resources.

However, I doubt whether the decision to parallelize should be made at execution time rather than at the plan stage. For example, when we have a large enough number of records and a 10-core multiprocessor, the wise plan may take parallel data load by 10 processors, partial sort by 10 processors individually, then a merge-sort. That needs a fundamentally different tree structure from the traditional single-processor plan-tree. So, it seems to me we should take an enhancement that allows injecting special-purpose parallel processing plan nodes into the plan-tree. How about your opinion?

Thanks,
On 13 May 2013 15:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you need to think in terms of restricting the problem space

+1

> One obvious suggestion is to forbid the workers from
> doing any database access of their own at all --- the parent would
> have to do any required catalog lookups for sort functions etc.

+1

> I think we should also seriously think about relying on fork() and
> copy-on-write semantics to launch worker subprocesses, instead of
> explicitly copying so much state over to them. Yes, this would
> foreclose ever having parallel query on Windows, but that's okay
> with me (hm, now where did I put my asbestos longjohns ...)

If we relied on some kind of inherited state, we could easily make the mistake of relying on something that isn't actually being maintained correctly in the worker. Luckily (?) that is exactly the stance we need to make this work on Windows. Other than that, releasing on Windows in a later release sounds sensible; otherwise we'll just delay the development so much that it will still happen in the "later" timeframe, and the chance of an earlier release on Linux/BSD will be missed.

For example, the idea of managing separate subtransactions in each worker sounds nuts. Impressive, if you're already thinking about parallel DML that can self-recover halfway through a statement and then continue processing, but that's a little advanced. The way to think about this is as a 10-year journey, not as a single feature.

-1 for forking

> Both of these lines of thought suggest that the workers should *not*
> be full-fledged backends.

+1 to the idea of workers != masters

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
> * Identifying Parallel-Compatible Functions
>
> Not all functions can reasonably run on a worker backend. We should not
> presume that a VOLATILE function can tolerate the unstable execution order
> imposed by parallelism, though a function like clock_timestamp() is perfectly
> reasonable to run that way. STABLE does not have that problem, but neither
> does it constitute a promise that the function implementation is compatible
> with parallel execution. Consider xid_age(), which would need code changes to
> operate correctly in parallel. IMMUTABLE almost guarantees enough; there may
> come a day when all IMMUTABLE functions can be presumed parallel-safe. For
> now, an IMMUTABLE function could cause trouble by starting a (read-only)
> subtransaction. The bottom line is that parallel-compatibility needs to be
> separate from volatility classes for the time being.

I am not sure that this problem is limited to functions only; it extends to all the expressions and clauses of queries that could be shipped to and evaluated on the worker backends when fetching tuples that could be used to accelerate a parallel sort. Let's imagine for example the case of a LIMIT clause that can be used by worker backends to limit the number of tuples to sort as the final result. In some ways, Postgres-XC has faced (and is still facing) similar challenges, and they have been partially solved.

> I'm not sure what the specific answer here should look like. Simply having a
> CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
> the rules are liable to loosen over time.

Having a flag would be enough to control parallelism, but cannot we also determine whether the execution of a function can be shipped safely to a worker based on its volatility alone? Immutable functions are presumably safe, as they do not modify the database state and always give the same result; volatile and stable functions are definitely not safe. For such reasons, it would be better to keep things simple and rely on simple rules to determine whether a given expression can be executed safely on a worker backend.

--
Michael
On Tue, May 14, 2013 at 12:51 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> I'm not sure what the specific answer here should look like. Simply having a
>> CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
>> the rules are liable to loosen over time.
>
> Having a flag would be enough to control parallelism, but cannot we also
> determine if the execution of a function can be shipped safely to a worker
> based on its volatility only? Immutable functions are presumably safe as
> they do not modify the database state and give always the same result;
> volatile and stable functions are definitely not safe. For such reasons, it
> would be better to keep things simple and rely on simple rules to determine
> if a given expression can be executed safely on a backend worker.

In the part of the text you didn't quote, Noah explained why not all immutable functions are parallel-safe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 13, 2013 at 07:55:16PM +0100, Simon Riggs wrote:
> On 13 May 2013 15:28, Noah Misch <noah@leadboat.com> wrote:
> > The heavyweight locking mechanism will need to be aware of the association
> > between the master and its workers.
>
> Not sure I can see why that would be true.
>
> ISTM that the workers need to be restricted in various ways from a
> full-functioned master. If the workers can do things that conflict
> with the master and/or each other, we're going to have all kinds of
> hurt.
>
> The key is going to be deciding what the restrictions are and then
> working out how to enforce them.

Yes, the specific lock manager needs will depend on what we permit workers to do.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Mon, May 13, 2013 at 09:52:43PM +0200, Kohei KaiGai wrote:
> 2013/5/13 Noah Misch <noah@leadboat.com>:
> > The choice of whether to parallelize can probably be made in a manner
> > similar to the choice to do an external sort: the planner guesses the
> > outcome for costing purposes, but the actual decision is made at execution
> > time. The planner would determine a tuple count cutoff at which parallelism
> > becomes favorable, and tuplesort would check that to establish its actual
> > decision.
>
> It probably crossovers my problem consciousness to off-load CPU bounds
> workloads; that I partially tried to implement on writable foreign table
> feature.
> Not only sorting stuff, I think it may be worthful to have capability to
> push heavy workload (like sort, aggregate or complex target-list) out to
> external computing resources.
> However, I doubt whether the decision to parallelize should be done in
> execution time, rather than plan stage. For example, in case when we
> have enough number of records and 10-core multiprocessor, the wise
> plan may take parallel data load by 10-processors, partial-sort by 10-
> processors individually, then merge-sort. It needs fundamental different
> tree structure from the traditional single-processors based plan-tree.

That's taking a few steps more in the direction of parallel general query; at some point, the planner would definitely become parallel-aware. For the narrower topic of parallel sort, I don't think it's necessary. The node tree supplying the sort can't run in parallel (yet), and the node pulling from the sort won't receive the first tuple until the sort is complete. For the time being, the planner just needs to know enough about the sort node's projected use of parallelism to estimate cost.

> So, it seems to me we should take an enhancement to allow to inject
> plan-tree special purpose parallel processing plan node.
> How about your opinion?

I'm not picturing how, specifically, this new node or class of nodes would be used. Could you elaborate?

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Tue, May 14, 2013 at 01:51:42PM +0900, Michael Paquier wrote:
> On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
> > * Identifying Parallel-Compatible Functions
> >
> > Not all functions can reasonably run on a worker backend. We should not
> > presume that a VOLATILE function can tolerate the unstable execution order
> > imposed by parallelism, though a function like clock_timestamp() is
> > perfectly reasonable to run that way. STABLE does not have that problem,
> > but neither does it constitute a promise that the function implementation
> > is compatible with parallel execution. Consider xid_age(), which would need
> > code changes to operate correctly in parallel. IMMUTABLE almost guarantees
> > enough; there may come a day when all IMMUTABLE functions can be presumed
> > parallel-safe. For now, an IMMUTABLE function could cause trouble by
> > starting a (read-only) subtransaction. The bottom line is that
> > parallel-compatibility needs to be separate from volatility classes for
> > the time being.
>
> I am not sure that this problem is only limited to functions, but to all
> the expressions and clauses of queries that could be shipped and evaluated
> on the worker backends when fetching tuples that could be used to accelerate
> a parallel sort. Let's imagine for example the case of a LIMIT clause that
> can be used by worker backends to limit the number of tuples to sort as
> final result.

It's true that the same considerations apply to other plan tree constructs; however, every such construct is known at build time, so we can study each one and decide how it fits with parallelism. Since functions are user-definable, it's preferable to reason about classes of functions.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Tue, May 14, 2013 at 11:50 AM, Noah Misch <noah@leadboat.com> wrote:
> That's taking a few steps more in the direction of parallel general query; at
> some point, the planner would definitely become parallel-aware. For the
> narrower topic of parallel sort, I don't think it's necessary. The node tree
> supplying the sort can't run in parallel (yet), and the node pulling from the
> sort won't receive the first tuple until the sort is complete. For the time
> being, the planner just needs to know enough about the sort node's projected
> use of parallelism to estimate cost.

You know what would be a low-hanging fruit that I've been thinking would benefit many of my own queries? "Parallel" sequential scan nodes. Even if there's no real parallelism involved, when a query has to scan the same table at multiple nodes, if it's big, it would be worth parallelizing the scans to turn them into synchronized scans. I have absolutely no idea how this would work easily without forked workers, because the scans might be buried in more complex execution trees. But still, it's worth considering that parallelizing may benefit more than core usage.

If execution nodes could be paused at arbitrary points, a "parallel scan" node could pause one branch that has consumed the circular buffer, letting the other branches consume their part, and thus "parallelize" branch execution. But this would perhaps be more complex than simply forking.
On Tue, May 14, 2013 at 01:51:42PM +0900, Michael Paquier wrote:
> On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
>
> > * Identifying Parallel-Compatible Functions
> >
> > Not all functions can reasonably run on a worker backend. We should not
> > presume that a VOLATILE function can tolerate the unstable execution order
> > imposed by parallelism, though a function like clock_timestamp() is
> > perfectly reasonable to run that way. STABLE does not have that problem,
> > but neither does it constitute a promise that the function implementation
> > is compatible with parallel execution. Consider xid_age(), which would
> > need code changes to operate correctly in parallel. IMMUTABLE almost
> > guarantees enough; there may come a day when all IMMUTABLE functions can
> > be presumed parallel-safe. For now, an IMMUTABLE function could cause
> > trouble by starting a (read-only) subtransaction. The bottom line is that
> > parallel-compatibility needs to be separate from volatility classes for
> > the time being.
> >
> I am not sure that this problem is limited to functions; it extends to all
> the expressions and clauses of queries that could be shipped and evaluated
> on the worker backends when fetching tuples that could be used to
> accelerate a parallel sort. Let's imagine for example the case of a LIMIT
> clause that can be used by worker backends to limit the number of tuples
> to sort as the final result.

It's true that the same considerations apply to other plan tree constructs; however, every such construct is known at build time, so we can study each one and decide how it fits with parallelism. Since functions are user-definable, it's preferable to reason about classes of functions.

On Tue, May 14, 2013 at 11:59 PM, Noah Misch <noah@leadboat.com> wrote:

The concept of clause parallelism for backend workers is close to the concept of clause shippability introduced in Postgres-XC. In the case of XC, the equivalent of the master backend is a backend located on a node called a Coordinator that merges and organizes results fetched in parallel from remote nodes where the data scans occur (on nodes called Datanodes). The backends used for tuple scans across Datanodes share the same data visibility, as they use the same snapshot and transaction ID as the backend on the Coordinator. This differs from the proposed parallelism in that there is no notion of importing a snapshot into worker backends.

However, the code in the XC planner used for clause shippability evaluation is definitely worth looking at, considering the many similarities it shares with parallelism when evaluating whether a given clause can be executed on a worker backend or not. It would be a waste to implement the same thing twice if there is code already available.

Yes. You are right.
--
Michael
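Noah's closing point, that we should reason about classes of functions and study each plan construct, amounts to a recursive walk over the expression tree deciding whether everything in a clause is safe to ship to a worker. Here is a toy sketch of such a check; the node layout and the safe-function whitelist are invented purely for illustration, and this is not PostgreSQL planner code:

```python
# Toy model of a parallel-safety check over an expression tree, sketching the
# idea discussed above. Node shapes and the whitelist are invented; in a real
# system the flag would live in the function's catalog entry.

PARALLEL_SAFE_FUNCS = {"int4lt", "texteq", "clock_timestamp"}  # hypothetical flags

def clause_is_parallel_safe(node):
    """Return True only if every function referenced by the clause is known
    parallel-safe; unrecognized constructs are conservatively assumed unsafe."""
    kind = node.get("kind")
    if kind in ("const", "var"):
        return True
    if kind == "funcexpr":
        return (node["funcname"] in PARALLEL_SAFE_FUNCS
                and all(clause_is_parallel_safe(arg) for arg in node["args"]))
    return False  # anything we cannot classify stays on the master

expr = {"kind": "funcexpr", "funcname": "texteq",
        "args": [{"kind": "var"}, {"kind": "const"}]}
print(clause_is_parallel_safe(expr))
```

The conservative default (treat unknown constructs as unsafe) mirrors the thread's conclusion: since the rules are liable to loosen over time, it is safer to whitelist than to blacklist.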
On 05/13/2013 07:10 PM, Robert Haas wrote:
> On Mon, May 13, 2013 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This approach seems to me to be likely to guarantee that the startup
>> overhead for any parallel sort is so large that only fantastically
>> enormous sorts will come out ahead.
>>
>> I think you need to think in terms of restricting the problem space
>> enough so that the worker startup cost can be trimmed to something
>> reasonable. One obvious suggestion is to forbid the workers from
>> doing any database access of their own at all --- the parent would
>> have to do any required catalog lookups for sort functions etc.
>> before forking the children.
>>
>> I think we should also seriously think about relying on fork() and
>> copy-on-write semantics to launch worker subprocesses, instead of
>> explicitly copying so much state over to them. Yes, this would
>> foreclose ever having parallel query on Windows, but that's okay
>> with me (hm, now where did I put my asbestos longjohns ...)
>>
>> Both of these lines of thought suggest that the workers should *not*
>> be full-fledged backends.
>
> Eventually, PostgreSQL needs not only parallel sort, but a more
> general parallel query facility. The goal here is not to design
> something specific to parallel sort, but to provide a general
> infrastructure for server-side parallelism. If we restrict ourselves
> to a design where syscache lookups aren't possible from a worker
> backend, I have trouble seeing how that's ever gonna work. That's a
> very low-level facility that a lot of things rely on. Even if you
> could make the shuttling of requests between master and slave
> transparent to the backend code, it's cutting into the amount of
> actual parallelizable stuff, and adding very significantly to the
> overhead.

Has anybody looked into making syscache MVCC compliant?

A lot of complexity would be avoided if there were some efficient way to have the same MVCC power in syscache as there is in the rest of the system. So instead of solving the cache transactional coherency problems separately in each user of syscache, why not bring the syscache to the level of the rest of postgres?

> I don't see any reason to panic about the worker startup cost. I
> don't know whether the stuff that Noah mentioned copying will take 10
> microseconds or 100 milliseconds, but there are plenty of sorts that
> take large numbers of seconds or minutes to happen, so there's still
> plenty of opportunity for win there. By definition, the things that
> you want to run in parallel are the ones that take a long time if you
> don't run them in parallel. Now, of course, if we can reduce the cost
> of starting new backends (e.g. by keeping them around from one
> parallel operation to the next, or by starting them via fork), that
> will expand the range of cases where parallelism is a win. But I
> think we could win in plenty of interesting real-world cases even if
> it took a full second to initialize each new worker, and surely it
> won't be nearly that bad.

+1 to there being lots of use cases for high startup cost parallelism. There are lots of non-parallel query plans as well which have high startup cost, and the postgresql optimiser already deals with them quite well.

Hannu
Hannu Krosing <hannu@krosing.net> writes:
> Has anybody looked into making syscache MVCC compliant ?

This is the wrong statement of the question. The right statement is "how would you like backend A to be updating table T in compliance with a view of table T's schema that is obsolete because of backend B's already-committed DDL changes?" For example, ignoring a just-committed CHECK constraint because it's not visible to your transaction's snapshot.

The point of SnapshotNow catalog lookups is to be sure that we see the current definition of a table whether or not we're in a transaction whose snapshot precedes the last DDL change to that table. There might be some other way to deal with that consistency problem, but just changing syscache's behavior is not going to make things better.

			regards, tom lane
On 05/15/2013 07:30 AM, Tom Lane wrote:
> Hannu Krosing <hannu@krosing.net> writes:
>> Has anybody looked into making syscache MVCC compliant ?
> This is the wrong statement of the question.
>
> The right statement is "how would you like backend A to be updating
> table T in compliance with a view of table T's schema that is obsolete
> because of backend B's already-committed DDL changes?"

Could we not treat it the same way we treat updated rows in serializable transaction mode? That is, roll back and let the client retry if it so wishes?

> For example, ignoring a just-committed CHECK constraint because it's
> not visible to your transaction's snapshot.

Is there anything in the SQL standard that tells us to handle schema changes in parallel to DML in any other way than by rollback? I agree it is nice to have a way to carry on in this case, but there is only so much you can sensibly do. For example, I can see no good way to handle a parallel DROP COLUMN, especially if you are trying to update that same column based on your old snapshot.

> The point of SnapshotNow catalog lookups is to be sure that we see the
> current definition of a table whether or not we're in a transaction
> whose snapshot precedes the last DDL change to that table.

Can't we just detect that "there have been changes" and roll back if trying any DML on changed tables? It does not seem like something that happens too often.

> There might
> be some other way to deal with that consistency problem, but just
> changing syscache's behavior is not going to make things better.

I was targeting mainly the read-only case of querying data from a changed table. I have not checked how SnapshotNow catalog lookups handle this, but I guess it is not trivial in case there have been changes to the catalog since the active snapshot was taken.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On Monday, May 13, 2013 7:59 PM Noah Misch wrote: > It would be great if one client session could take advantage of > multiple CPU > cores. EnterpriseDB wishes to start the trek into this problem space > for 9.4 > by implementing parallel internal (i.e. not spilling to disk) sort. > This > touches on a notable subset of the infrastructure components we'll need > for > parallel general query. My intent is to map out the key design topics, > hear > about critical topics I hadn't considered, and solicit feedback on the > quality > of the high-level plan. Full designs for key pieces will come later. > > > * Worker Backends > > A multi-process model, rather than a multi-thread, is best for > introducing > parallelism to a PostgreSQL session. With threading, all in-memory > state > would become shared by default; with processes, in-memory data > structures > remain unshared until we explicitly select them for sharing. Some data > structures will require new sharing, but I expect unshared ones to > remain the > typical case. I refer to these additional processes as worker > backends. They > will have much in common with ordinary client backends, including a > database > connection and a PGPROC. They will have no client socket; instead, > they will > take direction from the client-connected backend, termed the master, > via > shared memory. > We can allocate a small amount of permanent shared memory for > coordination > among a group of processes, but sorting will benefit from a region as > large as > maintenance_work_mem. Expect on-demand memory sharing. Will the shared memory used for coordinating tuples between master and worker be fixed or varying depending on size of tuples to be sorted or number of workers associated. 
If it is varying, then it can sometimes encounter a situation where the required memory is not available, in which case it has to revert to serial sorting.

> > * Identifying Parallel-Compatible Functions
> >
> > Not all functions can reasonably run on a worker backend. We should not
> > presume that a VOLATILE function can tolerate the unstable execution
> > order imposed by parallelism, though a function like clock_timestamp()
> > is perfectly reasonable to run that way. STABLE does not have that
> > problem, but neither does it constitute a promise that the function
> > implementation is compatible with parallel execution. Consider
> > xid_age(), which would need code changes to operate correctly in
> > parallel. IMMUTABLE almost guarantees enough; there may come a day when
> > all IMMUTABLE functions can be presumed parallel-safe. For now, an
> > IMMUTABLE function could cause trouble by starting a (read-only)
> > subtransaction. The bottom line is that parallel-compatibility needs
> > to be separate from volatility classes for the time being.
> >
> > I'm not sure what the specific answer here should look like. Simply
> > having a CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely
> > satisfying, because the rules are liable to loosen over time.
> >
> > * Planner & Similar Issues
> >
> > We should decide whether to actually sort in parallel based on the
> > comparator cost and the data size. The system currently has no
> > information on comparator cost: bt*cmp (and indeed almost all built-in
> > functions) all have procost=1, but bttextcmp is at least 1000x slower
> > than btint4cmp. Let's improve the procost estimates of all core B-tree
> > and hash operators. This will have other benefits, but we will need to
> > be cognizant of the risk of upsetting setups that have tuned
> > cpu_operator_cost based on the present situation.
> > The choice of whether to parallelize can probably be made in a manner
> > similar to the choice to do an external sort: the planner guesses the
> > outcome for costing purposes, but the actual decision is made at
> > execution time. The planner would determine a tuple count cutoff at
> > which parallelism becomes favorable, and tuplesort would check that to
> > establish its actual decision.
> >
> > Use of parallelism within one process has a distributed effect on the
> > system, similar to the use of work_mem. Add a PGC_USERSET GUC
> > representing the number of worker processes the current session is
> > willing to use. It would default to a small number, say zero or two.
> > On a throughput-oriented system with high concurrent query counts, the
> > DBA would tend to set/leave it at zero in postgresql.conf; parallelism
> > can only help if the system has free resources. Separately, add a GUC
> > limiting the total number of workers across the system at any one time.

How will the parallel sorting tasks be divided and assigned to each worker?

With Regards,
Amit Kapila.
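The cutoff idea quoted above (the planner picks a tuple count at which parallelism becomes favorable) can be caricatured with a toy cost model. Every constant and formula below is invented for illustration; this is not PostgreSQL's costing code:

```python
import math

# Back-of-envelope model of the "tuple count cutoff": the parallel plan wins
# once the comparator work saved across workers exceeds a fixed worker startup
# overhead. All numbers are made up for illustration.

def serial_cost(n, cmp_cost):
    return n * math.log2(max(n, 2)) * cmp_cost

def parallel_cost(n, cmp_cost, workers, startup):
    # Idealized: comparison work divides evenly; startup is paid once.
    return serial_cost(n, cmp_cost) / workers + startup

def tuple_cutoff(cmp_cost, workers, startup):
    """Smallest power-of-two n at which the parallel plan is cheaper."""
    n = 2
    while parallel_cost(n, cmp_cost, workers, startup) >= serial_cost(n, cmp_cost):
        n *= 2
    return n

# A comparator that is 1000x more expensive (bttextcmp vs. btint4cmp, per the
# thread) pulls the break-even tuple count far lower:
print(tuple_cutoff(cmp_cost=1.0, workers=4, startup=1e6))
print(tuple_cutoff(cmp_cost=1000.0, workers=4, startup=1e6))
```

With these made-up numbers, the expensive comparator reaches break-even at a tuple count several orders of magnitude smaller, which is exactly why the quoted text wants accurate procost estimates before making the call.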
On Tue, May 14, 2013 at 12:15:24PM -0300, Claudio Freire wrote: > You know what would be a low-hanging fruit that I've been thinking > would benefit many of my own queries? > > "Parallel" sequential scan nodes. Even if there's no real parallelism > involved, when a query has to scan the same table at multiple nodes, > if it's big, it would be worth parallelizing the scans to transform > them into synchro scans. > > I have absolutely no idea how this would work easily without forked > workers, because the scans might be buried in more complex execution > trees. But still, it's worth considering, that parallelizing may > benefit more than core usage. > > If execution nodes could be paused at arbitrary points, a "parallel > scan" node could pause one branch that has consumed the circular > buffer, letting another branches consume their part, and thus > "parallelizing" branch execution. But this would be perhaps more > complex than simply forking. Execution nodes do pause between every output tuple, at least nominally. Still, given the architecture of our executor and the planner work to implement such a thing, I would not classify it as low-hanging fruit. It would primarily apply to a plan with independent sequential scans of the same large (relative to total memory) relation. I'm sure that comes up, but it doesn't strike me as typical. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
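Noah's remark that "execution nodes do pause between every output tuple, at least nominally" refers to the pull-model executor, where each node yields one tuple at a time to its parent. A minimal sketch of that control flow using Python generators (purely illustrative; PostgreSQL's executor is C and works tuple-slot by tuple-slot):

```python
# Minimal pull-model ("Volcano"-style) executor sketch: each node is a
# generator that produces one tuple and then pauses until its parent demands
# the next. Node names and row layout are invented for illustration.

def seq_scan(table):
    for row in table:
        yield row  # the node suspends here between output tuples

def filter_node(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def limit_node(child, n):
    for i, row in enumerate(child):
        if i >= n:
            break
        yield row

table = [{"id": i} for i in range(100)]
plan = limit_node(filter_node(seq_scan(table), lambda r: r["id"] % 2 == 0), 3)
print(list(plan))
```

Because control returns to the top of the plan after every tuple, pausing "between tuples" is natural; what Claudio describes, pausing a branch mid-stream without emitting a tuple, is the part our executor protocol does not directly support.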
On Wed, May 15, 2013 at 08:12:34AM +0900, Michael Paquier wrote: > The concept of clause parallelism for backend worker is close to the > concept of clause shippability introduced in Postgres-XC. In the case of > XC, the equivalent of the master backend is a backend located on a node > called Coordinator that merges and organizes results fetched in parallel > from remote nodes where data scans occur (on nodes called Datanodes). The > backends used for tuple scans across Datanodes share the same data > visibility as they use the same snapshot and transaction ID as the backend > on Coordinator. This is different from the parallelism as there is no idea > of snapshot import to worker backends. Worker backends would indeed share snapshot and XID. > However, the code in XC planner used for clause shippability evaluation is > definitely worth looking at just considering the many similarities it > shares with parallelism when evaluating if a given clause can be executed > on a worker backend or not. It would be a waste to implement twice the same > thing is there is code already available. Agreed. Local parallel query is very similar to distributed query; the specific IPC cost multipliers differ, but that's about it. I hope we can benefit from XC's experience in this area. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Wed, May 15, 2013 at 12:26:52PM +0530, Amit Kapila wrote: > On Monday, May 13, 2013 7:59 PM Noah Misch wrote: > > We can allocate a small amount of permanent shared memory for > > coordination > > among a group of processes, but sorting will benefit from a region as > > large as > > maintenance_work_mem. Expect on-demand memory sharing. > > Will the shared memory used for coordinating tuples between master and > worker be fixed or varying depending on size of tuples to be sorted or > number of workers associated. > If it is varying, then it can sometimes encounter situation where required > memory is not available and in that case it has to revert to serial sorting > How will the parallel sorting tasks be divided and assigned to each worker? I haven't selected answers for those details, yet. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Mon, May 13, 2013 at 7:28 AM, Noah Misch <noah@leadboat.com> wrote:
> We should decide whether to actually sort in parallel based on the comparator
> cost and the data size. The system currently has no information on comparator
> cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
> but bttextcmp is at least 1000x slower than btint4cmp.

I think that this effort could justify itself independently of any attempt to introduce parallelism to in-memory sorting. I abandoned a patch to introduce timsort to Postgres, because I knew that there was no principled way to reap the benefits. Unless you introduce parallelism, it's probably going to be virtually impossible to come up with an algorithm that does in-memory sorting faster (in terms of the amount of system time taken) than a highly optimized quicksort when sorting integers. But sorting types with really expensive comparators (even considerably more expensive than bttextcmp) for pass-by-reference Datums (where the memory locality advantage of quicksort doesn't really help so much) makes timsort much more compelling. That's why it's used for Python lists.

--
Peter Geoghegan
On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions. This will
> also be an essential building block for any parallelism project that consults
> user tables. Implementing this means copying the subtransaction stack and the
> combocid hash to each worker. For the sake of completeness, we should also
> copy the global MVCC snapshot data (sorting probably won't care). It also
> means forbidding, while a parallel task is in flight, operations that affect
> the transaction state:

Btw, if you assume you can simply copy a snapshot from the normal backend to the worker backend to make visibility decisions in the general case: You're wrong. Unfortunately you need in-memory state to make sense of combocids... Not impossible to solve, but you should be aware of the issue.

Greetings,

Andres Freund

--
Andres Freund	                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 15, 2013 at 11:32 AM, Peter Geoghegan <pg@heroku.com> wrote:
> I think that this effort could justify itself independently of any
> attempt to introduce parallelism to in-memory sorting. I abandoned a
> patch to introduce timsort to Postgres, because I knew that there was
> no principled way to reap the benefits.

Just for the record, I attach a patch that introduces a timsort_arg function as a drop-in replacement for quicksort_arg (including replacing all of the specializations that went into 9.2). It has been rebased against master. For what it's worth, if anyone wanted to pick this up, that would be fine with me.

Don't be fooled by the superficial regression test failures. The tests in question are subtly wrong, because they rely on a certain ordering that isn't explicitly requested. Timsort is stable, whereas quicksort generally isn't stable (our implementation certainly isn't).

--
Peter Geoghegan
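The regression-test subtlety Peter describes comes down to stability. A quick illustration with Python's sorted(), which is stable because it is a timsort: equal keys keep their input order, so a test whose expected output silently depends on how an unstable quicksort happened to order ties will "fail" under a stable sort even though both orders are legal answers to the query.

```python
# Stability demo: a stable sort preserves the relative input order of rows
# that compare equal on the sort key. Row values are invented for illustration.

rows = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]
by_key = sorted(rows, key=lambda r: r[0])

# Ties within "a" and within "b" keep their original relative order:
print(by_key)
```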
On Wed, May 15, 2013 at 3:04 PM, Noah Misch <noah@leadboat.com> wrote: > On Tue, May 14, 2013 at 12:15:24PM -0300, Claudio Freire wrote: >> You know what would be a low-hanging fruit that I've been thinking >> would benefit many of my own queries? >> >> "Parallel" sequential scan nodes. Even if there's no real parallelism >> involved, when a query has to scan the same table at multiple nodes, >> if it's big, it would be worth parallelizing the scans to transform >> them into synchro scans. >> >> I have absolutely no idea how this would work easily without forked >> workers, because the scans might be buried in more complex execution >> trees. But still, it's worth considering, that parallelizing may >> benefit more than core usage. >> >> If execution nodes could be paused at arbitrary points, a "parallel >> scan" node could pause one branch that has consumed the circular >> buffer, letting another branches consume their part, and thus >> "parallelizing" branch execution. But this would be perhaps more >> complex than simply forking. > > Execution nodes do pause between every output tuple, at least nominally. > Still, given the architecture of our executor and the planner work to > implement such a thing, I would not classify it as low-hanging fruit. It > would primarily apply to a plan with independent sequential scans of the same > large (relative to total memory) relation. I'm sure that comes up, but it > doesn't strike me as typical. I found it rather typical of some of my workloads, but it could probably not be the case globally. It would be rather easier if it could pause without returning rows. I think ATM, not returning any rows means the node is finished doing its scan. The nodes that would have to be pausable like this wouldn't be sequential scans, but sorts, hashes, and in general those that take a long time to start returning rows. 
So, a plan that goes like:

    Seq on A -> Sort -> Merge -> result
    Seq on A -> Sort --/

Would be turned into:

    Seq on A -> Step Sort -> Parallel Merge -> result
    Seq on A -> Step Sort --/

Or even maybe:

    Seq on A -> Sort -> Tee X -> Parallel Merge
                            X --/

I think Tee and Parallel Merge should be doable with current infrastructure, because they don't require pausing without returning any tuples. Not sure how many meters above ground that is, or how many gotchas might be involved. But it's been circling in my head for a while.
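For what it's worth, the Tee and Merge dataflow sketched above has close standard-library analogues: itertools.tee buffers one iterator for two consumers, and heapq.merge combines pre-sorted streams. A single-process sketch, so it shows only the plumbing, not actual parallel execution:

```python
import heapq
import itertools

# Dataflow sketch of the "Tee X -> Parallel Merge" shape: sort once, feed two
# consumers from the Tee, then merge the (still sorted) branches. The filters
# are invented just to make the merge non-trivial.

def seq_scan(table):
    yield from table

sorted_a = iter(sorted(seq_scan([5, 1, 4, 2, 3])))  # "Seq on A -> Sort"
branch1, branch2 = itertools.tee(sorted_a)          # "Tee X": one sort, two consumers

evens = (v for v in branch1 if v % 2 == 0)
odds = (v for v in branch2 if v % 2 == 1)

# "Merge" over the two sorted branches:
print(list(heapq.merge(evens, odds)))
```

itertools.tee also illustrates Claudio's buffering concern: if one consumer runs far ahead of the other, the tee's internal buffer grows, which is the circular-buffer exhaustion problem he describes.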
On Wed, May 15, 2013 at 08:49:00PM +0200, Andres Freund wrote: > On 2013-05-13 10:28:59 -0400, Noah Misch wrote: > > Each worker needs to make SnapshotNow visibility decisions coherent with the > > master. For sorting, this allows us to look up comparison functions, even > > when the current transaction created or modified those functions. This will > > also be an essential building block for any parallelism project that consults > > user tables. Implementing this means copying the subtransaction stack and the > > combocid hash to each worker. For the sake of completeness, we should also > > copy the global MVCC snapshot data (sorting probably won't care). It also > > means forbidding, while a parallel task is in flight, operations that affect > > the transaction state: > > Btw, if you assume you can simply copy a snapshot from the normal > backend to the worker backend to make visibility decisions in the > general case: You're wrong. Unfortunately you need in-memory state to > make sense of combocids... Correct. If you think of any required state information that I did not list above, please let me know. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
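Andres's caveat can be pictured with a toy model of the combo command ID mechanism (heavily simplified from PostgreSQL's combocid.c; class and field names are invented): the tuple header stores only an index into a backend-local (cmin, cmax) array, so a worker that copies the snapshot but not that array cannot decode the tuple.

```python
# Toy model of combo command IDs: when one transaction both inserts and
# deletes a tuple, cmin and cmax must share a single on-tuple field, so the
# backend stores an index into a private array instead. Simplified for
# illustration; not PostgreSQL's actual data structures.

class Backend:
    def __init__(self):
        self.combocids = []  # backend-local state, invisible to other processes

    def make_combo(self, cmin, cmax):
        self.combocids.append((cmin, cmax))
        return len(self.combocids) - 1  # this index is what the tuple carries

    def decode(self, combo_id):
        return self.combocids[combo_id]

master = Backend()
tuple_field = master.make_combo(cmin=3, cmax=7)
print(master.decode(tuple_field))

worker = Backend()  # fresh worker: snapshot copied, combocid array not
try:
    worker.decode(tuple_field)
except IndexError:
    print("worker cannot interpret the combo CID without the master's array")
```

Hence the plan in the original mail to copy the combocid hash to each worker along with the snapshot.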
On Wed, May 15, 2013 at 11:11 AM, Noah Misch <noah@leadboat.com> wrote:
On Wed, May 15, 2013 at 08:12:34AM +0900, Michael Paquier wrote:Worker backends would indeed share snapshot and XID.
> The concept of clause parallelism for backend worker is close to the
> concept of clause shippability introduced in Postgres-XC. In the case of
> XC, the equivalent of the master backend is a backend located on a node
> called Coordinator that merges and organizes results fetched in parallel
> from remote nodes where data scans occur (on nodes called Datanodes). The
> backends used for tuple scans across Datanodes share the same data
> visibility as they use the same snapshot and transaction ID as the backend
> on Coordinator. This is different from the parallelism as there is no idea
> of snapshot import to worker backends.Agreed. Local parallel query is very similar to distributed query; the
> However, the code in XC planner used for clause shippability evaluation is
> definitely worth looking at just considering the many similarities it
> shares with parallelism when evaluating if a given clause can be executed
> on a worker backend or not. It would be a waste to implement twice the same
> thing is there is code already available.
specific IPC cost multipliers differ, but that's about it. I hope we can
benefit from XC's experience in this area.
I believe parallel execution is much easier to do if the data is partitioned. Of course it is possible to make only the sort operation parallel, but then the question would be how to split and pass each tuple to the workers. XC and Greenplum use the notion of a hash-distributed table that enables parallel sort (XC doesn't perform parallel sort on replicated tables, I guess). For postgres, I don't think a hash-distributed table is a foreseeable option, but MergeAppend over inheritance is a good choice to run in parallel. You won't even need to modify many lines of the sort execution code if you correctly dispatch the work, as it's just a matter of splitting and assigning the subnodes of the query plan to workers.

Transactions and locks will be tricky, though, and we might end up introducing a small set of snapshot-sharing infrastructure for the former and a notion of session id rather than process id for the latter. I don't think SnapshotNow is the problem, as the executor is reading catalogs with it today anyway.
Thanks,
Hitoshi
--
Hitoshi Harada
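Hitoshi's "split and assign the subnodes" shape, sort each partition in a worker and then MergeAppend the sorted runs, can be sketched in a few lines. A thread pool stands in for worker backends purely for illustration; the thread above is explicitly about worker processes, and thread-based sorting would not actually parallelize in CPython.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Sketch of "partition -> sort each part in a worker -> merge", the
# MergeAppend-over-partitions shape described above. The pool is a stand-in
# for worker backends; chunking is by position rather than by hash for
# simplicity.

def parallel_sort(data, workers=4):
    chunk = max(1, len(data) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, parts))  # each "worker" sorts one partition
    return list(heapq.merge(*sorted_parts))           # master merges the sorted runs

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(parallel_sort(data))
```

The merge step is cheap relative to the per-partition sorts, which is why dispatching subnodes to workers leaves the sort code itself largely untouched.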
On 5/13/13 9:28 AM, Noah Misch wrote:
> It would be great if one client session could take advantage of multiple CPU
> cores. EnterpriseDB wishes to start the trek into this problem space for 9.4
> by implementing parallel internal (i.e. not spilling to disk) sort. This
> touches on a notable subset of the infrastructure components we'll need for
> parallel general query. My intent is to map out the key design topics, hear
> about critical topics I hadn't considered, and solicit feedback on the quality
> of the high-level plan. Full designs for key pieces will come later.

Have you considered GPU-based sorting? I know there's been discussion in the past.

To me, the biggest advantage of GPU sorting is that most of the concerns you've laid out go away; a backend that needs to sort just throws data at the GPU to do the actual sorting; all the MVCC issues and what not remain within the scope of a single backend.

--
Jim C. Nasby, Data Architect
jim@nasby.net
512.569.9461 (cell)
http://jim.nasby.net
> Have you considered GPU-based sorting? I know there's been discussion in the past. If you use OpenCL, then you can use a CPU driver if there is no GPU, and that can allow you to leverage all the CPU cores without having to do the multi-thread stuff in the backend. While the compilation of a specific kernel can be quite expensive, it also has the effect of a JIT compiler in terms of system independence.
Let me introduce one thing we discussed in the developer meeting at Ottawa. We got a consensus that a pluggable exec-node may be useful to replace a part of the exec-node tree with an alternative one implemented by an extension, which would allow running something like "GpuSort" instead of the existing Sort.

http://wiki.postgresql.org/wiki/PgCon_2013_Developer_Meeting#Pluggable_plan.2Fexec_nodes

2013/5/24 james <james@mansionfamily.plus.com>:
>> Have you considered GPU-based sorting? I know there's been discussion in
>> the past.
>
> If you use OpenCL, then you can use a CPU driver if there is no GPU, and
> that can allow you to leverage all the CPU cores without having to do the
> multi-thread stuff in the backend.
>
> While the compilation of a specific kernel can be quite expensive, it also
> has the effect of a JIT compiler in terms of system independence.

--
KaiGai Kohei <kaigai@kaigai.gr.jp>
On Fri, May 24, 2013 at 01:13:21PM -0500, Jim Nasby wrote:
> On 5/13/13 9:28 AM, Noah Misch wrote:
>> It would be great if one client session could take advantage of multiple
>> CPU cores. EnterpriseDB wishes to start the trek into this problem space
>> for 9.4 by implementing parallel internal (i.e. not spilling to disk)
>> sort. This touches on a notable subset of the infrastructure components
>> we'll need for parallel general query. My intent is to map out the key
>> design topics, hear about critical topics I hadn't considered, and
>> solicit feedback on the quality of the high-level plan. Full designs for
>> key pieces will come later.
>
> Have you considered GPU-based sorting? I know there's been discussion in
> the past.

I had considered it briefly. Parallel sort is mainly valuable for expensive comparison operators. Sorting int4, for example, is too cheap for parallelism to be compelling. (In my test build of a 16 GiB int4 index, sorting took 11s of the 391s build time.) However, expensive operators are also liable to be difficult to reimplement for the GPU. In particular, implementing a GPU-based strcoll() for bttextcmp sounds like quite a project in its own right.

> To me, the biggest advantage of GPU sorting is that most of the concerns
> you've laid out go away; a backend that needs to sort just throws data at
> the GPU to do the actual sorting; all the MVCC issues and what not remain
> within the scope of a single backend.

Those are matters we would eventually need to address as we parallelize more things, so I regard confronting them as an advantage. Among other benefits, this project is a vehicle for emplacing some infrastructure without inviting the full complexity entailed by loftier goals.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
* Noah Misch (noah@leadboat.com) wrote:
> In particular, implementing a GPU-based strcoll() for bttextcmp
> sounds like quite a project in its own right.

It also wouldn't likely be helpful... To be able to use a GPU effectively, last I looked, you need to be able to move a large chunk of data to the GPU's memory, operate on it, and then bulk move the results back, because the cost of moving data between main memory and GPU memory is very high.

> Those are matters we would eventually need to address as we parallelize more
> things, so I regard confronting them as an advantage. Among other benefits,
> this project is a vehicle for emplacing some infrastructure without inviting
> the full complexity entailed by loftier goals.

Agreed.

Thanks,

Stephen