Thread: using custom scan nodes to prototype parallel sequential scan

using custom scan nodes to prototype parallel sequential scan

From
Robert Haas
Date:
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Something usable, with severe restrictions, is actually better than we
> have now. I understand the journey this work represents, so don't be
> embarrassed by submitting things with heuristics and good-enoughs in
> it. Our mentor, Mr.Lane, achieved much by spreading work over many
> releases, leaving others to join in the task.

It occurs to me that, now that the custom-scan stuff is committed, it
wouldn't be that hard to use that, plus the other infrastructure we
already have, to write a prototype of parallel sequential scan.  Given
where we are with the infrastructure, there would be a number of
unhandled problems, such as deadlock detection (needs group locking or
similar), assessment of quals as to parallel-safety (needs
proisparallel or similar), general waterproofing to make sure that
pushing down a qual we shouldn't doesn't do anything really dastardly
like crash the server (another written but yet-to-be-published patch
adds a bunch of relevant guards), and snapshot sharing (likewise).
But if you don't do anything weird, it should basically work.
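
To be concrete, the sort of prototype I have in mind would hang off the existing
hook for adding paths; a rough skeleton might look like the following
(parallel_scan_path_hook and its body are invented for illustration; only
set_rel_pathlist_hook and _PG_init are existing extension infrastructure):

#include "postgres.h"
#include "fmgr.h"
#include "optimizer/paths.h"

PG_MODULE_MAGIC;

/* keep any previously-installed hook so we can chain to it */
static set_rel_pathlist_hook_type prev_set_rel_pathlist_hook = NULL;

static void
parallel_scan_path_hook(PlannerInfo *root, RelOptInfo *rel,
                        Index rti, RangeTblEntry *rte)
{
    if (prev_set_rel_pathlist_hook)
        prev_set_rel_pathlist_hook(root, rel, rti, rte);

    /*
     * Here the prototype would build a custom path whose executor side
     * launches background workers, hands each one a range of blocks, and
     * funnels the results back, then add it with add_path() so the planner
     * can pick it when it looks cheaper than a plain seq scan.
     */
}

void
_PG_init(void)
{
    prev_set_rel_pathlist_hook = set_rel_pathlist_hook;
    set_rel_pathlist_hook = parallel_scan_path_hook;
}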

I think this would be useful for a couple of reasons.  First, it would
be a demonstrable show of progress, illustrating how close we are to
actually having something you can really deploy.  Second, we could use
it to demonstrate how the remaining infrastructure patches close up
gaps in the initial prototype.  Third, it would let us start doing
real performance testing.  It seems pretty clear that a parallel
sequential scan of data that's in memory (whether the page cache or
the OS cache) can be accelerated by having multiple processes scan it
in parallel.  But it's much less clear what will happen when the data
is being read in from disk.  Does parallelism help at all?  What
degree of parallelism helps?  Do we break OS readahead so badly that
performance actually regresses?  These are things that are likely to
need a fair amount of tuning before this is ready for prime time, so
being able to start experimenting with them in advance of all of the
infrastructure being completely ready seems like it might help.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using custom scan nodes to prototype parallel sequential scan

From
Andres Freund
Date:
Hi Robert, All,

On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Something usable, with severe restrictions, is actually better than we
> > have now. I understand the journey this work represents, so don't be
> > embarrassed by submitting things with heuristics and good-enoughs in
> > it. Our mentor, Mr.Lane, achieved much by spreading work over many
> > releases, leaving others to join in the task.
> 
> It occurs to me that, now that the custom-scan stuff is committed, it
> wouldn't be that hard to use that, plus the other infrastructure we
> already have, to write a prototype of parallel sequential scan.  Given
> where we are with the infrastructure, there would be a number of
> unhandled problems, such as deadlock detection (needs group locking or
> similar), assessment of quals as to parallel-safety (needs
> proisparallel or similar), general waterproofing to make sure that
> pushing down a qual we shouldn't does do anything really dastardly
> like crash the server (another written but yet-to-be-published patch
> adds a bunch of relevant guards), and snapshot sharing (likewise).
> But if you don't do anything weird, it should basically work.
> 
> I think this would be useful for a couple of reasons.  First, it would
> be a demonstrable show of progress, illustrating how close we are to
> actually having something you can really deploy.  Second, we could use
> it to demonstrate how the remaining infrastructure patches close up
> gaps in the initial prototype.  Third, it would let us start doing
> real performance testing.

I think it might be a useful experiment - as long as it's clear that
it's that. Which is, I think, what you're thinking about?

> It seems pretty clear that a parallel sequential scan of data that's
> in memory (whether the page cache or the OS cache) can be accelerated
> by having multiple processes scan it in parallel.

Right.

> But it's much less clear what will happen when the data is being read
> in from disk.

I think that *very* heavily depends on the IO subsystem.

> Does parallelism help at all?

I'm pretty damn sure. We can't even make a mildly powerful storage
system fully busy right now. Heck, I can't make my workstation's storage,
a RAID 10 of four spinning disks, fully busy.

I think some of that benefit also could be reaped by being better at
hinting the OS...

> What degree of parallelism helps?

That's quite a hard question. Generally the question about how much
parallelism for what is beneficial will be one of the most complicated
areas once the plumbing is in.

> Do we break OS readahead so badly that performance actually regresses?

I don't think it's likely that we break OS readahead - that works on a
per-task basis, at least on Linux, afaik. But it's nonetheless very easy
to have too many streams causing too many random reads.

> These are things that are likely to
> need a fair amount of tuning before this is ready for prime time, so
> being able to start experimenting with them in advance of all of the
> infrastructure being completely ready seems like it might help.

I'm not actually entirely sure how much that's going to help. I think
you could very well have a WIP patch ready reasonably quickly that doesn't
solve the issues you mention above, just by patching it in. For the kind of
testing we're talking about that seems likely to be sufficient - a git branch
somewhere is probably easier for people to compile than some
contrib module that needs to be loaded...
And I *do* think that you'll very quickly hit the limits of the custom
scan API. I'd much rather see you work on improving parallelism than
the custom scan stuff, just so you can prototype further ahead.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Haribabu Kommi
Date:
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
>> Does parallelism help at all?
>
> I'm pretty damn sure. We can't even make a mildly powerfull storage
> fully busy right now. Heck, I can't make my workstation's storage with a
> raid 10 out of four spinning disks fully busy.
>
> I think some of that benefit also could be reaped by being better at
> hinting the OS...

Yes, it definitely helps, and not only for IO-bound operations.
It also gives a good gain for queries with CPU-intensive WHERE conditions.

One more point we may need to consider: is there any overhead in passing
the data rows from the workers to the backend? I feel this may play a major
role when the selectivity is high.

Regards,
Hari Babu
Fujitsu Australia



Re: using custom scan nodes to prototype parallel sequential scan

From
Amit Kapila
Date:
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> >> Does parallelism help at all?
> >
> > I'm pretty damn sure. We can't even make a mildly powerfull storage
> > fully busy right now. Heck, I can't make my workstation's storage with a
> > raid 10 out of four spinning disks fully busy.
> >
> > I think some of that benefit also could be reaped by being better at
> > hinting the OS...
>
> Yes, it definitely helps but not only limited to IO bound operations.
> It gives a good gain for the queries having CPU intensive where conditions.
>
> One more point we may need to consider, is there any overhead in passing
> the data row from workers to backend?

I am not sure that overhead will be very visible if we improve the
use of the I/O subsystem by having parallel tasks work on it. However,
another idea here could be that instead of passing tuple data, we just
pass the tuple id, but in that case we have to retain the pin on the buffer
that contains the tuple until the master backend reads from it, which might
have its own kind of problems.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: using custom scan nodes to prototype parallel sequential scan

From
Haribabu Kommi
Date:
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com>
>> wrote:
>> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
>> >> Does parallelism help at all?
>> >
>> > I'm pretty damn sure. We can't even make a mildly powerfull storage
>> > fully busy right now. Heck, I can't make my workstation's storage with a
>> > raid 10 out of four spinning disks fully busy.
>> >
>> > I think some of that benefit also could be reaped by being better at
>> > hinting the OS...
>>
>> Yes, it definitely helps but not only limited to IO bound operations.
>> It gives a good gain for the queries having CPU intensive where
>> conditions.
>>
>> One more point we may need to consider, is there any overhead in passing
>> the data row from workers to backend?
>
> I am not sure if that overhead will be too much visible if we improve the
> use of I/O subsystem by making parallel tasks working on it.

I feel there may be an overhead because the workers need to put the result
data in shared memory and the backend has to read it from there to process
it further. If the cost of transferring data from worker to backend is more
than the cost of fetching a tuple from the scan, then the overhead becomes
visible when the selectivity is high.
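
For reference, the kind of worker-to-backend hand-off I have in mind would look
roughly like this on the worker side, using the existing shm_mq queue machinery
(row_data stands in for an already-serialized tuple; error handling and the
master-side read are omitted, so this is only a sketch):

#include "postgres.h"
#include "storage/shm_mq.h"

/*
 * Worker side: copy one serialized result row into the shared queue.
 * The master backend pays a second copy to pull it back out, which is
 * exactly the overhead being discussed above.
 */
static void
send_row(shm_mq_handle *outq, void *row_data, Size row_len)
{
    shm_mq_result res;

    res = shm_mq_send(outq, row_len, row_data, false);
    if (res != SHM_MQ_SUCCESS)
        elog(ERROR, "could not send tuple to master backend");
}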

> However
> another idea here could be that instead of passing tuple data, we just
> pass tuple id, but in that case we have to retain the pin on the buffer
> that contains tuple untill master backend reads from it that might have
> it's own kind of problems.

Transferring the tuple id doesn't cover the scenarios where the node needs
any projection.

Regards,
Hari Babu
Fujitsu Australia



Re: using custom scan nodes to prototype parallel sequential scan

From
Amit Kapila
Date:
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> > wrote:
> >>
> >> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com>
> >> wrote:
> >> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> >> >> Does parallelism help at all?
> >> >
> >> > I'm pretty damn sure. We can't even make a mildly powerfull storage
> >> > fully busy right now. Heck, I can't make my workstation's storage with a
> >> > raid 10 out of four spinning disks fully busy.
> >> >
> >> > I think some of that benefit also could be reaped by being better at
> >> > hinting the OS...
> >>
> >> Yes, it definitely helps but not only limited to IO bound operations.
> >> It gives a good gain for the queries having CPU intensive where
> >> conditions.
> >>
> >> One more point we may need to consider, is there any overhead in passing
> >> the data row from workers to backend?
> >
> > I am not sure if that overhead will be too much visible if we improve the
> > use of I/O subsystem by making parallel tasks working on it.
>
> I feel there may be an overhead because of workers needs to put the result
> data in the shared memory and the backend has to read from there to process
> it further. If the cost of transfering data from worker to backend is more than
> fetching a tuple from the scan, then the overhead is visible when the
> selectivity is more.
>
> > However
> > another idea here could be that instead of passing tuple data, we just
> > pass tuple id, but in that case we have to retain the pin on the buffer
> > that contains tuple untill master backend reads from it that might have
> > it's own kind of problems.
>
> Transfering tuple id doesn't solve the scenarios if the node needs any
> projection.

Hmm, that's why I said we need to retain the buffer pin, so that we can
get the tuple data.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Haribabu Kommi
> Sent: Tuesday, November 11, 2014 1:13 PM
> To: Amit Kapila
> Cc: Andres Freund; Robert Haas; Simon Riggs; Tom Lane;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] using custom scan nodes to prototype parallel
> sequential scan
> 
> On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi
> > <kommi.haribabu@gmail.com>
> > wrote:
> >>
> >> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund
> >> <andres@2ndquadrant.com>
> >> wrote:
> >> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> >> >> Does parallelism help at all?
> >> >
> >> > I'm pretty damn sure. We can't even make a mildly powerfull storage
> >> > fully busy right now. Heck, I can't make my workstation's storage
> >> > with a raid 10 out of four spinning disks fully busy.
> >> >
> >> > I think some of that benefit also could be reaped by being better
> >> > at hinting the OS...
> >>
> >> Yes, it definitely helps but not only limited to IO bound operations.
> >> It gives a good gain for the queries having CPU intensive where
> >> conditions.
> >>
> >> One more point we may need to consider, is there any overhead in
> >> passing the data row from workers to backend?
> >
> > I am not sure if that overhead will be too much visible if we improve
> > the use of I/O subsystem by making parallel tasks working on it.
> 
> I feel there may be an overhead because of workers needs to put the result
> data in the shared memory and the backend has to read from there to process
> it further. If the cost of transfering data from worker to backend is more
> than fetching a tuple from the scan, then the overhead is visible when the
> selectivity is more.
> 
In my experience, the data copy and the transformation to fit a TupleTableSlot
are the biggest overhead, rather than the scan or join itself...
A straightforward way is to construct an array of values/isnull in a
shared memory segment; the backend process then just switches the pointers of
tts_values/tts_isnull, with no data copy. It gave us a performance gain.
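
A minimal sketch of that hand-off, assuming the worker has already laid out the
datum and null arrays in shared memory (by-reference datums would also have to
live there, and the slot's original arrays would need to be restored before it
is released), could be:

#include "postgres.h"
#include "executor/tuptable.h"

/*
 * Point the master's virtual tuple slot at arrays that a worker has
 * already filled in shared memory, instead of copying each column.
 */
static void
store_shared_row(TupleTableSlot *slot,
                 Datum *shared_values, bool *shared_isnull)
{
    ExecClearTuple(slot);

    slot->tts_values = shared_values;
    slot->tts_isnull = shared_isnull;

    ExecStoreVirtualTuple(slot);
}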

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 10 November 2014 15:57, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Something usable, with severe restrictions, is actually better than we
>> have now. I understand the journey this work represents, so don't be
>> embarrassed by submitting things with heuristics and good-enoughs in
>> it. Our mentor, Mr.Lane, achieved much by spreading work over many
>> releases, leaving others to join in the task.
>
> It occurs to me that, now that the custom-scan stuff is committed, it
> wouldn't be that hard to use that, plus the other infrastructure we
> already have, to write a prototype of parallel sequential scan.

+1

> Given
> where we are with the infrastructure, there would be a number of
> unhandled problems, such as deadlock detection (needs group locking or
> similar), assessment of quals as to parallel-safety (needs
> proisparallel or similar), general waterproofing to make sure that
> pushing down a qual we shouldn't does do anything really dastardly
> like crash the server (another written but yet-to-be-published patch
> adds a bunch of relevant guards), and snapshot sharing (likewise).
> But if you don't do anything weird, it should basically work.

If we build something fairly restricted, but production ready we can
still do something useful to users.

* only functions marked as "CONTAINS NO SQL"
* parallel_workers = 2 or fixed
* only one Plan type, designed to maximise benefit and minimize optimizer change

Why?

* The above is all that is needed to test any infrastructure changes
we accept. We need both infrastructure and tests. We could do this as
a custom scan plugin, but why not make it work in core - that doesn't
prevent a plugin version also if you desire it.

* only functions marked as "CONTAINS NO SQL"
We don't really know what proisparallel is, but we do know what
CONTAINS NO SQL means and can easily check for it.
Plus I already have a patch for this, slightly bitrotted.

* parallel_workers = 2 (or at least not make it user settable)
By fixing the number of workers at 2 we avoid any problems caused by
having N variable, such as how to vary N fairly amongst users and
other such considerations. We get the main benefit of parallelism,
without causing other issues across the server.

* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type.
 SELECT something, list of aggregates
 FROM foo
 WHERE filters
 GROUP BY something
because we know that passing large amounts of data from worker to
master process will be slow, so focusing only on seq scan is not
sensible; we should focus on plans that significantly reduce the
number of rows passed upwards. We could just do this for very
selective WHERE clauses, but that is not an important class of query.
As soon as we include aggregates, we reduce data passing significantly
AND we hit a very important subset of queries:

This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.

The execution plan for that query type looks like this...
Hash Aggregate
   Gather From Workers
      {Worker Nodes workers = 2
        HashAggregate
        PartialScan}

Which is simple enough that we include a mechanism for the Gather
operation, plus it is simple enough to not need extensive optimizer
changes - just changes to set_plain_rel_pathlist() to consider
parallel plans.

Plan costs are easy to calculate for the above...
cpu_worker_startup_cost = 100 -- startup cost is easy to calculate by observation, but a reasonably large default will be OK
cpu_ipc_tuple_cost = 0.1 -- assume x10 the normal cost of cpu_tuple_cost
Partial scan costs are just the same as SeqScan, just with fewer blocks.
All other costs are the same.
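
As a very rough illustration of how those numbers combine for the plan above,
here is a standalone sketch (cpu_worker_startup_cost and cpu_ipc_tuple_cost are
the parameters proposed here, not existing GUCs; the other defaults are
hard-coded only to keep the sketch self-contained):

#include <stdio.h>

typedef double Cost;

/* proposed parameters (illustrative defaults, not existing GUCs) */
static const Cost cpu_worker_startup_cost = 100.0;
static const Cost cpu_ipc_tuple_cost = 0.1;    /* ~10x cpu_tuple_cost */

/* existing planner defaults, hard-coded so the sketch stands alone */
static const Cost seq_page_cost = 1.0;
static const Cost cpu_tuple_cost = 0.01;

/*
 * Cost of "HashAggregate over Gather over partial scans": each of
 * nworkers scans rel_pages/nworkers blocks, aggregates locally, and
 * ships its ngroups partial results back to the master.
 */
static Cost
parallel_agg_scan_cost(double rel_pages, double rel_tuples,
                       double ngroups, int nworkers)
{
    Cost cost = cpu_worker_startup_cost * nworkers;

    cost += seq_page_cost * (rel_pages / nworkers);    /* partial scan I/O */
    cost += cpu_tuple_cost * (rel_tuples / nworkers);  /* per-tuple CPU */
    cost += cpu_ipc_tuple_cost * ngroups * nworkers;   /* rows sent to master */

    return cost;
}

int
main(void)
{
    printf("%.2f\n", parallel_agg_scan_cost(4000.0, 1000000.0, 100.0, 2));
    return 0;
}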

We can submit the main patch by Dec 15, fix all the problems by Feb 15.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:
> > Given
> > where we are with the infrastructure, there would be a number of
> > unhandled problems, such as deadlock detection (needs group locking or
> > similar), assessment of quals as to parallel-safety (needs
> > proisparallel or similar), general waterproofing to make sure that
> > pushing down a qual we shouldn't does do anything really dastardly
> > like crash the server (another written but yet-to-be-published patch
> > adds a bunch of relevant guards), and snapshot sharing (likewise).
> > But if you don't do anything weird, it should basically work.
> 
> If we build something fairly restricted, but production ready we can still
> do something useful to users.
> 
> * only functions marked as "CONTAINS NO SQL"
> * parallel_workers = 2 or fixed
> * only one Plan type, designed to maximise benefit and minimize optimizer
> change
> 
> Why?
> 
> * The above is all that is needed to test any infrastructure changes we
> accept. We need both infrastructure and tests. We could do this as a custom
> scan plugin, but why not make it work in core - that doesn't prevent a plugin
> version also if you desire it.
> 
> * only functions marked as "CONTAINS NO SQL"
> We don't really know what proisparallel is, but we do know what CONTAINS
> NO SQL means and can easily check for it.
> Plus I already have a patch for this, slightly bitrotted.
> 
Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?

> * parallel_workers = 2 (or at least not make it user settable) By fixing
> the number of workers at 2 we avoid any problems caused by having N variable,
> such as how to vary N fairly amongst users and other such considerations.
> We get the main benefit of parallelism, without causing other issues across
> the server.
> 
> * Fixed Plan: aggregate-scan
> To make everything simpler, allow only plans of a single type.
>  SELECT something, list of aggregates
>  FROM foo
>  WHERE filters
>  GROUP BY something
> because we know that passing large amounts of data from worker to master
> process will be slow, so focusing only on seq scan is not sensible; we should
> focus on plans that significantly reduce the number of rows passed upwards.
> We could just do this for very selective WHERE clauses, but that is not
> an important class of query.
> As soon as include aggregates, we reduce data passing significantly AND
> we hit a very important subset of queries:
> 
> This plan type is widely used in reporting queries, so will hit the mainline
> of BI applications and many Mat View creations.
> This will allow SELECT count(*) FROM foo to go faster also.
> 
I agree with you. The key to performance is how many rows can be reduced
by the parallel processors; however, one thing I worry about is that the
condition to kick off parallel execution may be too restrictive.
I'm not sure how many workloads run aggregate functions against flat tables
rather than against multiple joined relations.

Sorry, it might be a topic to discuss in a separate thread.
Can't we have functionality to push down aggregate functions across table
joins? Even if we could do this only in some cases, not all, it would make
it sensible to run typical BI workloads in parallel.

Let me assume the following query:
 SELECT SUM(t1.X), SUM(t2.Y) FROM t1 JOIN t2 ON t1.pk_id = t2.fk_id GROUP BY t1.A, t2.B;

It is a usual query that joins two relations with PK and FK.

It can be broken down, like:
 SELECT SUM(t1.X), SUM(sq2.Y)
 FROM t1 JOIN (SELECT t2.fk_id, t2.B, SUM(t2.Y) Y FROM t2 GROUP BY t2.fk_id, t2.B) sq2 ON t1.pk_id = sq2.fk_id
 GROUP BY t1.A, sq2.B;
 

If the FK side has 100 tuples per PK on average, an aggregate that runs prior
to the join will massively reduce the number of tuples to be joined, and we can
leverage parallel processors to do the preliminary aggregation on a flat table.

I'm now reviewing the earlier academic work to identify the conditions under
which a join allows aggregate functions to be pushed down.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.7751

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: using custom scan nodes to prototype parallel sequential scan

From
Robert Haas
Date:
On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> * only functions marked as "CONTAINS NO SQL"
> We don't really know what proisparallel is, but we do know what
> CONTAINS NO SQL means and can easily check for it.
> Plus I already have a patch for this, slightly bitrotted.

Interestingly, I have a fairly solid idea of what proisparallel is,
but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
I would imagine that srandom() contains no SQL under any reasonable
definition of what that means, but it ain't parallel-safe.

> * parallel_workers = 2 (or at least not make it user settable)
> By fixing the number of workers at 2 we avoid any problems caused by
> having N variable, such as how to vary N fairly amongst users and
> other such considerations. We get the main benefit of parallelism,
> without causing other issues across the server.

I think this is a fairly pointless restriction.  The code
simplification we'll get out of it appears to me to be quite minor,
and we'll just end up putting the stuff back in anyway.

> * Fixed Plan: aggregate-scan
> To make everything simpler, allow only plans of a single type.
>  SELECT something, list of aggregates
>  FROM foo
>  WHERE filters
>  GROUP BY something
> because we know that passing large amounts of data from worker to
> master process will be slow, so focusing only on seq scan is not
> sensible; we should focus on plans that significantly reduce the
> number of rows passed upwards. We could just do this for very
> selective WHERE clauses, but that is not an important class of query.
> As soon as include aggregates, we reduce data passing significantly
> AND we hit a very important subset of queries:

This is moving the goalposts in a way that I'm not at all comfortable
with.  Parallel sequential-scan is pretty simple and may well be a win
if there's a restrictive filter condition involved.  Parallel
aggregation requires introducing new infrastructure into the aggregate
machinery to allow intermediate state values to be combined, and that
would be a great project for someone to do at some time, but it seems
like a distraction for me to do that right now.

> This plan type is widely used in reporting queries, so will hit the
> mainline of BI applications and many Mat View creations.
> This will allow SELECT count(*) FROM foo to go faster also.
>
> The execution plan for that query type looks like this...
> Hash Aggregate
>    Gather From Workers
>       {Worker Nodes workers = 2
>         HashAggregate
>         PartialScan}

I'm going to aim for the simpler:

Hash Aggregate
-> Parallel Seq Scan
    Workers: 4

Yeah, I know that won't perform as well as what you're proposing, but
I'm fairly sure it's simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using custom scan nodes to prototype parallel sequential scan

From
David Rowley
Date:
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.


We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means it could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

Regards

David Rowley

Re: using custom scan nodes to prototype parallel sequential scan

From
Atri Sharma
Date:


On Wed, Nov 12, 2014 at 1:24 PM, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.


We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions. 


It might make sense to add a new planner operator which is responsible for pulling from each of the individual parallel Agg nodes and then aggregating over the results.


A couple of things that might be worth considering are whether we want to enforce using parallel aggregation or let the planner decide whether it wants to do a parallel aggregate or go with a single plan. For example, the average estimated size of the groups might be one thing the planner could consider while deciding between a parallel and a single execution plan.

I don't see merging states as an easy problem, and it should perhaps be tackled apart from this thread.

Also, do we want to allow parallelism only with GroupAggs?

Regards,

Atri

Re: using custom scan nodes to prototype parallel sequential scan

From
Robert Haas
Date:
On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?

There are certainly things that are parallel-safe that are not
immutable.  It might be the case that everything immutable is
parallel-safe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using custom scan nodes to prototype parallel sequential scan

From
Michael Paquier
Date:
On Wed, Nov 12, 2014 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?
>
> There are certainly things that are parallel-safe that are not
> immutable.  It might be the case that everything immutable is
> parallel-safe.
FWIW, when working on the concept of expression and clause
shippability for Postgres-XC (i.e. the possibility of passing something safely
to another backend, though to another PG node in that case), we discussed
similar things, and if I recall correctly we even discussed
adding a flag to pg_proc to define whether a function was shippable or not.
What we finished with was not adding a new flag, and instead using as a
rule that all immutable functions can be safely shipped and
others not, even though some stable functions *could* be safe. Maybe
Ashutosh has more comments on that; my memory may be failing.
In the end, I think you would finish with something similar.
My 2c.
-- 
Michael



Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:

> I'm going to aim for the simpler:
>
> Hash Aggregate
> -> Parallel Seq Scan
>     Workers: 4
>
> Yeah, I know that won't perform as well as what you're proposing, but
> I'm fairly sure it's simpler.

Simple is best, so +1.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>
>> This plan type is widely used in reporting queries, so will hit the
>> mainline of BI applications and many Mat View creations.
>> This will allow SELECT count(*) FROM foo to go faster also.
>>
>
> We'd also need to add some infrastructure to merge aggregate states together
> for this to work properly. This means that could also work for avg() and
> stddev etc. For max() and min() the merge functions would likely just be the
> same as the transition functions.

Do you mean something like a "subtotal" or "intermediate combination" function?

I guess we'd need the same thing to make intermediate aggregates work
during a sort?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> * only functions marked as "CONTAINS NO SQL"
>> We don't really know what proisparallel is, but we do know what
>> CONTAINS NO SQL means and can easily check for it.
>> Plus I already have a patch for this, slightly bitrotted.
>
> Interestingly, I have a fairly solid idea of what proisparallel is,
> but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
> I would imagine that srandom() contains no SQL under any reasonable
> definition of what that means, but it ain't parallel-safe.

What is wrong with generating random numbers in parallel?

But I'm sure many volatile functions would be annoying to support, so
CONTAINS NO SQL plus STABLE/IMMUTABLE seems OK as a first step.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
David Rowley
Date:
On Fri, Nov 14, 2014 at 1:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>
>> This plan type is widely used in reporting queries, so will hit the
>> mainline of BI applications and many Mat View creations.
>> This will allow SELECT count(*) FROM foo to go faster also.
>>
>
> We'd also need to add some infrastructure to merge aggregate states together
> for this to work properly. This means that could also work for avg() and
> stddev etc. For max() and min() the merge functions would likely just be the
> same as the transition functions.

Do you mean something like a "subtotal" or "intermediate combination" functon?

 
If you had 4 parallel workers performing a seqscan, say the relation was 4000 pages in total, you could say that worker 1 would scan blocks 0-999, worker 2 blocks 1000-1999, etc. After each worker had finished, there would then be 4 sets of records that needed to be merged into 1 set.

Take int4_avg_accum() for example it does:

transdata->count++;
transdata->sum += newval;
 
The merge function would need to perform something like:

transdata->count += transdata2merge.count;
transdata->sum += transdata2merge.sum;

Then the final function could be called on the merged aggregate state.

The same can be applied when the query contains a GROUP BY clause; we'd just need to pay attention to which groups we merge together for that to work. Any HAVING clause would have to be applied after the groups have been merged.
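
To make that concrete, a merge step for avg(int4) could be little more than the following standalone sketch (AvgState mirrors the count/sum state kept by int4_avg_accum(), but the struct and function names are made up for illustration):

#include <stdint.h>

/* partial aggregate state for avg(int4): a running count and sum */
typedef struct AvgState
{
    int64_t count;
    int64_t sum;
} AvgState;

/* transition step, applied by each worker to its own rows */
static void
avg_transition(AvgState *state, int32_t newval)
{
    state->count++;
    state->sum += newval;
}

/* merge step, applied once per worker state when combining results */
static void
avg_merge(AvgState *dst, const AvgState *src)
{
    dst->count += src->count;
    dst->sum   += src->sum;
}

The final function (sum / count) would then run exactly once, on the fully merged state.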

This whole topic is pretty exciting for data warehouse type workloads.

Regards

David Rowley 

Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:
>     On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com>
> wrote:
>     > On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs
> <simon@2ndquadrant.com> wrote:
>     >>
>     >>
>     >> This plan type is widely used in reporting queries, so will hit
> the
>     >> mainline of BI applications and many Mat View creations.
>     >> This will allow SELECT count(*) FROM foo to go faster also.
>     >>
>     >
>     > We'd also need to add some infrastructure to merge aggregate
> states together
>     > for this to work properly. This means that could also work for
> avg() and
>     > stddev etc. For max() and min() the merge functions would likely
> just be the
>     > same as the transition functions.
> 
> 
>     Do you mean something like a "subtotal" or "intermediate
> combination" functon?
> 
> 
> 
> 
> If you had 4 parallel workers performing a seqscan, say the relation was
> 4000 pages in total, you could say that worker 1 would scan blocks 0-999,
> worker 2, 1000-1999 etc. After each worker had finished, there would then
> be 4 sets of records then needed to be merged into 1 set.
> 
> Take int4_avg_accum() for example it does:
> 
> 
> transdata->count++;
> transdata->sum += newval;
> 
> The merge function would need to perform something like:
> 
> transdata->count += transdata2merge.count; sum += transdata2merge.sum;
> 
> Then the final function could be called on the merged aggregate state.
> 
> The same can be applied when the query contains a GROUP BY clause, just
> we'd need pay attention to which groups we merge together for that to work
> Any HAVING clause would have to be applied after the groups have been merged.
> 
> This whole topic is pretty exciting for data warehouse type workloads.
> 
More simply, we can describe a parallel-aware aggregate function.
Assume an AVG(X) variant that takes nrows and the sum of X. Its transition
function performs as you described above, and the final function works as
usual.

The job of parallel seq scan needs to do is:
  1. replace AVG(X) by AVG(nrows, sum(X))
  2. generate count(*) on the partial relation being grouped.
  3. generate sum(X) on the partial relation being grouped.

It looks like the following query:
  SELECT AVG(nrows, sum_X) FROM (
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 0 and 999 GROUP BY cat
    UNION
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 1000 and 1999 GROUP BY cat
    UNION
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 2000 and 2999 GROUP BY cat
    UNION
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 3000 and 3999 GROUP BY cat
  );
 

Note that stuff inside sub-query is a job of parallel scan.

Do we need to invent a new infrastructure here?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: using custom scan nodes to prototype parallel sequential scan

From
David Rowley
Date:
On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>       On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com>
> Take int4_avg_accum() for example it does:
>
>
> transdata->count++;
> transdata->sum += newval;
>
> The merge function would need to perform something like:
>
> transdata->count += transdata2merge.count; sum += transdata2merge.sum;
>
> Then the final function could be called on the merged aggregate state.
>
More simplify, we can describe parallel aware aggregate function.
Please assume AVG(X) function that takes nrows and sum of X. Its transition
function performs as like you described above, then final function works as
usual.

The job of parallel seq scan needs to do is:
  1. replace AVG(X) by AVG(nrows, sum(X)
  2. generate count(*) on the partial relation being grouped.
  3. generate sum(X) on the partial relation being grouped.

It looks like the following query:
  SELECT AVG(nrows, sum_X) FROM (
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 0 and 999 GROUP BY cat
    UNION
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 1000 and 1999 GROUP BY cat
    UNION
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 2000 and 2999 GROUP BY cat
    UNION
    SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 3000 and 3999 GROUP BY cat
  );


Well, this would require giving the planner some kind of knowledge of what AVG() is. It currently knows nothing about that. It currently calls the transition function for each row, and the final function at the end and does not care or need to care about what either of those functions actually does. The transformation above looks like it would need to care and the logic to add that would be way more complex than aggregate merge functions.

Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the merge function would be the same as the transition function, as the state type is just the same as the input type. It would only be aggregates like avg(), stddev*(), bool_and() and bool_or() that would need a new merge function made... These would be no more complex than the transition functions... Which are just a few lines of code anyway.

We'd simply just not run parallel query if any aggregates used in the query didn't have a merge function.

When I mentioned this, I didn't mean to appear to be placing a road block. I was just bringing to the table the information that COUNT(*) + COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.

 Merge functions should be a simple patch to write. If I thought there was going to be a use for them in this release, I'd put my hand up to put a patch together.

Regards

David Rowley

Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:
> On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> 
> 
>     >       On 12 November 2014 07:54, David Rowley
> <dgrowleyml@gmail.com>
>     > Take int4_avg_accum() for example it does:
>     >
>     >
>     > transdata->count++;
>     > transdata->sum += newval;
>     >
>     > The merge function would need to perform something like:
>     >
> 
>     > transdata->count += transdata2merge.count; sum +=
> transdata2merge.sum;
>     >
>     > Then the final function could be called on the merged aggregate
> state.
>     >
>     More simplify, we can describe parallel aware aggregate function.
>     Please assume AVG(X) function that takes nrows and sum of X. Its
> transition
>     function performs as like you described above, then final function
> works as
>     usual.
> 
>     The job of parallel seq scan needs to do is:
>       1. replace AVG(X) by AVG(nrows, sum(X)
>       2. generate count(*) on the partial relation being grouped.
>       3. generate sum(X) on the partial relation being grouped.
> 
>     It looks like the following query:
>       SELECT AVG(nrows, sum_X) FROM (
>         SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno
> between 0 and 999 GROUP BY cat
>         UNION
>         SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno
> between 1000 and 1999 GROUP BY cat
>         UNION
>         SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno
> between 2000 and 2999 GROUP BY cat
>         UNION
>         SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno
> between 3000 and 3999 GROUP BY cat
>       );
> 
> 
> 
> 
> Well, this would require giving the planner some kind of knowledge of what
> AVG() is. It currently knows nothing about that. It currently calls the
> transition function for each row, and the final function at the end and
> does not care or need to care about what either of those functions actually
> does. The transformation above looks like it would need to care and the
> logic to add that would be way more complex than aggregate merge functions.
> 
It may make sense to have an extra catalog to indicate how a usual aggregate
function can be broken down (or that it cannot be).
Like, AVG(X) = {COUNT(X) + SUM(X)} x N-partitions

> Likely for most aggregates, like count, sum, max, min, bit_and and bit_or
> the merge function would be the same as the transition function, as the
> state type is just the same as the input type. It would only be aggregates
> like avg(), stddev*(), bool_and() and bool_or() that would need a new merge
> function made... These would be no more complex than the transition
> functions... Which are just a few lines of code anyway.
> 
> We'd simply just not run parallel query if any aggregates used in the query
> didn't have a merge function.
> 
> When I mentioned this, I didn't mean to appear to be placing a road block.I
> was just bringing to the table the information that COUNT(*) + COUNT(*)
> works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.
> 
>  Merge functions should be a simple patch to write. If I thought there was
> going to be a use for them in this release, I'd put my hand up to put a
> patch together.
> 
What I'm uncertain about is how the caller of the aggregate functions distinguishes
the context in which to call the usual transition function versus the new merge function.

Let's take an example: SELECT cat, COUNT(*), AVG(X) FROM t GROUP BY cat;

Its plan tree is probably as follows:
  HashAggregate
    Group Key: cat
    ->  Custom Scan (Parallel Scan)
          # of workers: 4
 

The caller of the transition/merge/final functions is the HashAggregate node.
It therefore has to know whether the underlying plan returns every
record of the underlying table, or sub-totals produced by the parallel scan.
Please correct me if my assumption is wrong.

Once HashAggregate knows that the sub-plan returns sub-totals of the relation,
it can choose the merge function instead of the transition function.
However, what I want to clarify is how to inform the HashAggregate node that its
sub-plan intends to return sub-totals instead of individual rows.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: using custom scan nodes to prototype parallel sequential scan

From
Jim Nasby
Date:
On 11/12/14, 1:54 AM, David Rowley wrote:
> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com <mailto:simon@2ndquadrant.com>> wrote:
>
>
>     This plan type is widely used in reporting queries, so will hit the
>     mainline of BI applications and many Mat View creations.
>     This will allow SELECT count(*) FROM foo to go faster also.
>
> We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means
> that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as
> the transition functions.
 

Sanity check: what % of a large aggregate query fed by a seqscan is actually
spent in the aggregate functions? Even if you look strictly at CPU cost, isn't
there more code involved in getting data to the aggregate function than in the
aggregation itself, except maybe for numeric?
 

In other words, I suspect that just having a dirt-simple parallel SeqScan could
be a win for CPU. It should certainly be a win IO-wise; in my experience we're
not very good at maxing out IO systems.
 

(I was curious and came up with the list below for just the page-level stuff
(ignoring IO). I don't see much code involved in per-tuple work, but I also
never came across detoasting code, so I suspect I'm missing something...)
 

ExecScanFetch, heapgettup_pagemode, ReadBuffer, BufferAlloc, heap_page_prune_opt,
LWLockAcquire... then you can finally do per-tuple work. HeapTupleSatisfiesVisibility.
 
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: using custom scan nodes to prototype parallel sequential scan

From
David Rowley
Date:
On 14 November 2014 20:37, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 11/12/14, 1:54 AM, David Rowley wrote:

We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

Sanity check: what % of a large aggregate query fed by a seqscan actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved to get data to the aggregate function than in the aggregation itself, except maybe for numeric?


You might be right, but that sounds like it would need all the parallel workers to send each matching tuple to a queue to be processed by some aggregate node. I guess this would have more advantage for wider tables or tables with many dead tuples, or if the query has quite a selective where clause, as less data would make it onto that queue.

Perhaps I've taken 1 step too far forward here. I had been thinking that each worker would perform the partial seqscan and, in the worker context, pass the tuples down to the aggregate node. Then later, once each worker had completed, some other perhaps-new node type (MergeAggregateStates) would merge all those intermediate agg states into the final agg state (which would then be ready for the final function to be called).

Are there any plans for what will be in charge of deciding how many workers are allocated to a parallel query? Will this be something that's done at planning time? Or should the planner just create a parallel-friendly plan if the plan is costly enough, and then allow the executor to decide how many workers to throw at the job based on how busy the system is with other tasks at execution time?

Regards

David Rowley

Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 14 November 2014 07:37, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 11/12/14, 1:54 AM, David Rowley wrote:
>>
>> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com
>> <mailto:simon@2ndquadrant.com>> wrote:
>>
>>
>>     This plan type is widely used in reporting queries, so will hit the
>>     mainline of BI applications and many Mat View creations.
>>     This will allow SELECT count(*) FROM foo to go faster also.
>>
>> We'd also need to add some infrastructure to merge aggregate states
>> together for this to work properly. This means that could also work for
>> avg() and stddev etc. For max() and min() the merge functions would likely
>> just be the same as the transition functions.
>
>
> Sanity check: what % of a large aggregate query fed by a seqscan actually
> spent in the aggregate functions? Even if you look strictly at CPU cost,
> isn't there more code involved to get data to the aggregate function than in
> the aggregation itself, except maybe for numeric?

Yes, which is why I suggested pre-aggregating before collecting the
streams together.

The point is not that the aggregation is expensive, it's that the
aggregation eats data and the required bandwidth for later steps is
reduced and hence does not then become a bottleneck that renders the
parallel Seq Scan ineffective.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 14 November 2014 01:51, David Rowley <dgrowleyml@gmail.com> wrote:

> When I mentioned this, I didn't mean to appear to be placing a road block.I
> was just bringing to the table the information that COUNT(*) + COUNT(*)
> works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.
>
>  Merge functions should be a simple patch to write. If I thought there was
> going to be a use for them in this release, I'd put my hand up to put a
> patch together.

The hard part is describing and then agreeing the semantics.

If you raise a separate post on this, copy me in and I will help.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:
> On 14 November 2014 07:37, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > On 11/12/14, 1:54 AM, David Rowley wrote:
> >>
> >> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com
> >> <mailto:simon@2ndquadrant.com>> wrote:
> >>
> >>
> >>     This plan type is widely used in reporting queries, so will hit the
> >>     mainline of BI applications and many Mat View creations.
> >>     This will allow SELECT count(*) FROM foo to go faster also.
> >>
> >> We'd also need to add some infrastructure to merge aggregate states
> >> together for this to work properly. This means that could also work
> >> for
> >> avg() and stddev etc. For max() and min() the merge functions would
> >> likely just be the same as the transition functions.
> >
> >
> > Sanity check: what % of a large aggregate query fed by a seqscan
> > actually spent in the aggregate functions? Even if you look strictly
> > at CPU cost, isn't there more code involved to get data to the
> > aggregate function than in the aggregation itself, except maybe for
> numeric?
> 
> Yes, which is why I suggested pre-aggregating before collecting the streams
> together.
> 
> The point is not that the aggregation is expensive, its that the aggregation
> eats data and the required bandwidth for later steps is reduced and hence
> does not then become a bottleneck that renders the parallel Seq Scan
> ineffective.
> 
I'd like to throw a question out to community folks.
Has there been any discussion in the past of the challenge of pushing aggregates
down across a relation join? It potentially reduces the number of rows to be joined.
If we already had one, I'd like to look up the discussion from that time.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: using custom scan nodes to prototype parallel sequential scan

From
David Rowley
Date:
On Fri, Nov 14, 2014 at 1:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:
> Interestingly, I have a fairly solid idea of what proisparallel is,
> but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
> I would imagine that srandom() contains no SQL under any reasonable
> definition of what that means, but it ain't parallel-safe.

What is wrong in generating random numbers in parallel?

I was just watching Robert's talk on Parallel query on youtube... I think the answer is at 41:09, the link below should take you there:

Regards

David Rowley

Re: using custom scan nodes to prototype parallel sequential scan

From
Robert Haas
Date:
On Thu, Nov 13, 2014 at 7:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> * only functions marked as "CONTAINS NO SQL"
>>> We don't really know what proisparallel is, but we do know what
>>> CONTAINS NO SQL means and can easily check for it.
>>> Plus I already have a patch for this, slightly bitrotted.
>>
>> Interestingly, I have a fairly solid idea of what proisparallel is,
>> but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
>> I would imagine that srandom() contains no SQL under any reasonable
>> definition of what that means, but it ain't parallel-safe.
>
> What is wrong in generating random numbers in parallel?

If they're random, nothing, provided that you use a different seed in
each backend.  But if the seed has been set using srandom(), then we
must generate the particular sequence of random numbers associated
with that seed.  If we don't, the behavior is incompatible with what
happens apart from parallel mode.

> But I'm sure many volatile functions would be annoying to support, so
> CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing.

It might be OK to assume that immutable functions are fine, but not
all stable functions look like they'll be safe, or not without
additional effort: e.g. inet_client_addr(), pg_backend_pid(),
pg_cursor(), pg_event_trigger_dropped_objects().  Or even, more
mundanely, generate_series(), unless we ensure all calls are in the
same backend.  On the other hand, there are quite a few stable
functions that it seems like we would want to allow in parallel
queries, like to_char(), concat(), and textanycat().

I still don't know what CONTAINS NO SQL means.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 14 November 2014 11:02, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

> I'd like to throw community folks a question.
> Did someone have a discussion to the challenge of aggregate push-down across
> relations join in the past? It potentially reduces number of rows to be joined.
> If we already had, I'd like to check up the discussion at that time.

Yes, I was looking at aggregate pushdown. I think it needs the same
changes to aggregates discussed upthread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
Simon Riggs
Date:
On 17 November 2014 16:01, Robert Haas <robertmhaas@gmail.com> wrote:

> I still don't know what CONTAINS NO SQL means.

It's a function marking that would indicate we aren't allowed to take
snapshots or run SQL.

I think you should publish the scheme for marking functions as safe
for parallelism, so we can judge that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: using custom scan nodes to prototype parallel sequential scan

From
David Rowley
Date:
On 18 November 2014 05:19, Simon Riggs <simon@2ndquadrant.com> wrote:
On 14 November 2014 11:02, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

> I'd like to throw community folks a question.
> Did someone have a discussion to the challenge of aggregate push-down across
> relations join in the past? It potentially reduces number of rows to be joined.
> If we already had, I'd like to check up the discussion at that time.

Yes, I was looking at aggregate pushdown. I think it needs the same
changes to aggregates discussed upthread.


I have something that I've been working on locally. It's far from ready, but it does work in very simple cases, and shows a nice performance boost. I'll start another thread soon and copy you both in. Perhaps we can share some ideas.

Regards

David Rowley 

Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:
> On 18 November 2014 05:19, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 14 November 2014 11:02, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>>
>>> I'd like to throw a question out to community folks.
>>> Has there been any discussion in the past about the challenge of aggregate
>>> push-down across relation joins? It potentially reduces the number of rows
>>> to be joined. If there has, I'd like to review that discussion.
>>
>> Yes, I was looking at aggregate pushdown. I think it needs the same
>> changes to aggregates discussed upthread.
>
> I have something that I've been working on locally. It's far from ready,
> but it does work in very simple cases, and shows a nice performance boost.
> I'll start another thread soon and copy you both in. Perhaps we can share
> some ideas.
>
Great, that's exactly the kind of functionality that would be valuable in core.

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: using custom scan nodes to prototype parallel sequential scan

From
Bruce Momjian
Date:
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
> Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the
> merge function would be the same as the transition function, as the state type
> is just the same as the input type. It would only be aggregates like avg(),
> stddev*(), bool_and() and bool_or() that would need a new merge function
> made... These would be no more complex than the transition functions... Which
> are just a few lines of code anyway.
> 
> We'd simply just not run parallel query if any aggregates used in the query
> didn't have a merge function.
> 
> When I mentioned this, I didn't mean to appear to be placing a road block. I was
> just bringing to the table the information that COUNT(*) + COUNT(*) works ok
> for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.

Sorry, late reply, but, FYI, I don't think our percentile functions
can be parallelized in the same way:
test=> \daS *percent*
                                                      List of aggregate functions
   Schema   |      Name       |  Result data type  |             Argument data types              |             Description
------------+-----------------+--------------------+----------------------------------------------+-------------------------------------
 pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
 pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
 pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
 pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
 pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
 pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
 pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles
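
(A toy illustration of why, not tied to any patch: with percentile_cont,
per-chunk results can't be merged the way partial sums can, because the
median of the chunk medians is not the overall median.)

    -- overall median of 1..5
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
      FROM generate_series(1, 5) AS t(x);                 -- => 3

    -- split into chunks {1,2} and {3,4,5}: the chunk medians are 1.5 and 4,
    -- and no merge of just those two numbers can recover 3
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
      FROM generate_series(1, 2) AS t(x);                 -- => 1.5
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
      FROM generate_series(3, 5) AS t(x);                 -- => 4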
 

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: using custom scan nodes to prototype parallel sequential scan

From
Kouhei Kaigai
Date:
> On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
> > Likely for most aggregates, like count, sum, max, min, bit_and and
> > bit_or the merge function would be the same as the transition
> > function, as the state type is just the same as the input type. It
> > would only be aggregates like avg(), stddev*(), bool_and() and
> > bool_or() that would need a new merge function made... These would be
> > no more complex than the transition functions... Which are just a few
> lines of code anyway.
> >
> > We'd simply just not run parallel query if any aggregates used in the
> > query didn't have a merge function.
> >
> > When I mentioned this, I didn't mean to appear to be placing a road
> > block. I was just bringing to the table the information that COUNT(*) +
> > COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n)
> does not.
>
> Sorry, late reply, but, FYI, I don't think our percentile functions can
> be parallelized in the same way:
>
>     test=> \daS *percent*
>                                                           List of
> aggregate functions
>        Schema   |      Name       |  Result data type  |
> Argument data types              |             Description
>     ------------+-----------------+--------------------+----------
> ------------------------------------+---------------------------------
> ----
>      pg_catalog | percent_rank    | double precision   | VARIADIC
> "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
>      pg_catalog | percentile_cont | double precision   | double
> precision ORDER BY double precision   | continuous distribution percentile
>      pg_catalog | percentile_cont | double precision[] | double
> precision[] ORDER BY double precision | multiple continuous percentiles
>      pg_catalog | percentile_cont | interval           | double
> precision ORDER BY interval           | continuous distribution
> percentile
>      pg_catalog | percentile_cont | interval[]         | double
> precision[] ORDER BY interval         | multiple continuous percentiles
>      pg_catalog | percentile_disc | anyelement         | double
> precision ORDER BY anyelement         | discrete percentile
>      pg_catalog | percentile_disc | anyarray           | double
> precision[] ORDER BY anyelement       | multiple discrete percentiles
>
Yep, those look like the type of aggregate function that is not obviously
splittable across multiple partitions.
I think it is still valuable if we can push down the subset of aggregate
functions that is well known to the core planner.
For example, we know count(*) = sum(nrows), and we also know avg(X) can
be rewritten into an enhanced avg() that takes both nrows and the partial
sum of X. We can use that knowledge to break down aggregate functions.
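
(As a rough sketch of the shape of that rewrite, in plain SQL rather than
planner code, and assuming a toy table t(x int); the predicate split is
just a stand-in for the per-worker chunks:)

    -- avg(x) over the whole table
    SELECT avg(x) FROM t;

    -- the same value assembled from two partial scans (stand-ins for two
    -- workers), each returning only (nrows, psum) to the leader:
    SELECT sum(psum)::numeric / sum(nrows) AS avg
      FROM ( SELECT count(x) AS nrows, sum(x) AS psum FROM t WHERE x % 2 = 0
             UNION ALL
             SELECT count(x),          sum(x)          FROM t WHERE x % 2 <> 0
           ) AS partial;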

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>




Re: using custom scan nodes to prototype parallel sequential scan

From
Michael Paquier
Date:
On Wed, Dec 3, 2014 at 3:23 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
>> > Likely for most aggregates, like count, sum, max, min, bit_and and
>> > bit_or the merge function would be the same as the transition
>> > function, as the state type is just the same as the input type. It
>> > would only be aggregates like avg(), stddev*(), bool_and() and
>> > bool_or() that would need a new merge function made... These would be
>> > no more complex than the transition functions... Which are just a few
>> lines of code anyway.
>> >
>> > We'd simply just not run parallel query if any aggregates used in the
>> > query didn't have a merge function.
>> >
>> > When I mentioned this, I didn't mean to appear to be placing a road
>> > block. I was just bringing to the table the information that COUNT(*) +
>> > COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n)
>> does not.
>>
>> Sorry, late reply, but, FYI, I don't think our percentile functions can
>> be parallelized in the same way:
>>
>>       test=> \daS *percent*
>>                                                             List of
>> aggregate functions
>>          Schema   |      Name       |  Result data type  |
>> Argument data types              |             Description
>>       ------------+-----------------+--------------------+----------
>> ------------------------------------+---------------------------------
>> ----
>>        pg_catalog | percent_rank    | double precision   | VARIADIC
>> "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
>>        pg_catalog | percentile_cont | double precision   | double
>> precision ORDER BY double precision   | continuous distribution percentile
>>        pg_catalog | percentile_cont | double precision[] | double
>> precision[] ORDER BY double precision | multiple continuous percentiles
>>        pg_catalog | percentile_cont | interval           | double
>> precision ORDER BY interval           | continuous distribution
>> percentile
>>        pg_catalog | percentile_cont | interval[]         | double
>> precision[] ORDER BY interval         | multiple continuous percentiles
>>        pg_catalog | percentile_disc | anyelement         | double
>> precision ORDER BY anyelement         | discrete percentile
>>        pg_catalog | percentile_disc | anyarray           | double
>> precision[] ORDER BY anyelement       | multiple discrete percentiles
>>
> Yep, those look like the type of aggregate function that is not obviously
> splittable across multiple partitions.
> I think it is still valuable if we can push down the subset of aggregate
> functions that is well known to the core planner.
> For example, we know count(*) = sum(nrows), and we also know avg(X) can
> be rewritten into an enhanced avg() that takes both nrows and the partial
> sum of X. We can use that knowledge to break down aggregate functions.
Postgres-XC (Postgres-XL) implemented such parallel aggregate logic some
time ago using a set of sub-functions and a finalization function to do
the work.
My 2c.
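
(For anyone unfamiliar with that split, here is a minimal stand-alone
sketch in ordinary SQL, nothing XC-specific; my_avg and its helper
functions are made-up names. The point is that the transition state keeps
(nrows, psum), so two partial states could be merged by adding the pairs
element-wise before the finalization step:)

    -- transition state: an int8 pair {nrows, psum}
    CREATE FUNCTION my_avg_trans(state int8[], val int8) RETURNS int8[]
        LANGUAGE sql IMMUTABLE STRICT AS
        $$ SELECT ARRAY[state[1] + 1, state[2] + val] $$;

    CREATE FUNCTION my_avg_final(state int8[]) RETURNS numeric
        LANGUAGE sql IMMUTABLE AS
        $$ SELECT state[2]::numeric / NULLIF(state[1], 0) $$;

    CREATE AGGREGATE my_avg(int8) (
        sfunc     = my_avg_trans,
        stype     = int8[],
        initcond  = '{0,0}',
        finalfunc = my_avg_final
    );

    SELECT my_avg(g) FROM generate_series(1, 10) AS g;    -- => 5.5

    -- merging two partial states from different workers would just be
    -- ARRAY[n1 + n2, s1 + s2], applied before my_avg_final runs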
-- 
Michael