Thread: using custom scan nodes to prototype parallel sequential scan
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Something usable, with severe restrictions, is actually better than we
> have now. I understand the journey this work represents, so don't be
> embarrassed by submitting things with heuristics and good-enoughs in
> it. Our mentor, Mr. Lane, achieved much by spreading work over many
> releases, leaving others to join in the task.

It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan. Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work.

I think this would be useful for a couple of reasons. First, it would be a demonstrable show of progress, illustrating how close we are to actually having something you can really deploy. Second, we could use it to demonstrate how the remaining infrastructure patches close up gaps in the initial prototype. Third, it would let us start doing real performance testing. It seems pretty clear that a parallel sequential scan of data that's in memory (whether in the page cache or the OS cache) can be accelerated by having multiple processes scan it in parallel. But it's much less clear what will happen when the data is being read in from disk. Does parallelism help at all? What degree of parallelism helps? Do we break OS readahead so badly that performance actually regresses? These are things that are likely to need a fair amount of tuning before this is ready for prime time, so being able to start experimenting with them in advance of all of the infrastructure being completely ready seems like it might help.

Thoughts?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Robert, All,

On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Something usable, with severe restrictions, is actually better than we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task.
>
> It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan. Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work.
>
> I think this would be useful for a couple of reasons. First, it would be a demonstrable show of progress, illustrating how close we are to actually having something you can really deploy. Second, we could use it to demonstrate how the remaining infrastructure patches close up gaps in the initial prototype. Third, it would let us start doing real performance testing.

I think it might be a useful experiment - as long as it's clear that it's that. Which is, I think, what you're thinking about?

> It seems pretty clear that a parallel sequential scan of data that's in memory (whether in the page cache or the OS cache) can be accelerated by having multiple processes scan it in parallel.

Right.

> But it's much less clear what will happen when the data is being read in from disk.

I think that *very* heavily depends on the IO subsystem.

> Does parallelism help at all?

I'm pretty damn sure. We can't even make a mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy.

I think some of that benefit also could be reaped by being better at hinting the OS...

> What degree of parallelism helps?

That's quite a hard question. Generally the question about how much parallelism for what is beneficial will be one of the most complicated areas once the plumbing is in.

> Do we break OS readahead so badly that performance actually regresses?

I don't think it's likely that we break OS readahead - that works on a per-task basis at least on linux afaik. But it's nonetheless very easy to have too many streams causing too many random reads.

> These are things that are likely to need a fair amount of tuning before this is ready for prime time, so being able to start experimenting with them in advance of all of the infrastructure being completely ready seems like it might help.

I'm not actually entirely sure how much that's going to help. I think you could very well have a WIP patch ready reasonably quickly that doesn't solve the issues you mention above by patching it in. For the kind of testing we're talking about that seems likely sufficient - a git branch somewhere probably is actually easier to compile for people than some contrib module that needs to be loaded...
And I *do* think that you'll very quickly hit the limits of the custom scan API. And I'd much rather see you work on improving parallelism than the custom scan stuff, just so you can prototype further ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-11-10 10:57:16 -0500, Robert Haas wrote: >> Does parallelism help at all? > > I'm pretty damn sure. We can't even make a mildly powerfull storage > fully busy right now. Heck, I can't make my workstation's storage with a > raid 10 out of four spinning disks fully busy. > > I think some of that benefit also could be reaped by being better at > hinting the OS... Yes, it definitely helps but not only limited to IO bound operations. It gives a good gain for the queries having CPU intensive where conditions. One more point we may need to consider, is there any overhead in passing the data row from workers to backend? I feel this may play a major role when the selectivity is more. Regards, Hari Babu Fujitsu Australia
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> >> Does parallelism help at all?
> >
> > I'm pretty damn sure. We can't even make a mildly powerfull storage
> > fully busy right now. Heck, I can't make my workstation's storage with a
> > raid 10 out of four spinning disks fully busy.
> >
> > I think some of that benefit also could be reaped by being better at
> > hinting the OS...
>
> Yes, it definitely helps but not only limited to IO bound operations.
> It gives a good gain for the queries having CPU intensive where conditions.
>
> One more point we may need to consider, is there any overhead in passing
> the data row from workers to backend?
>
I am not sure that overhead will be very visible if we improve the
use of the I/O subsystem by making parallel tasks work on it. However,
another idea here could be that instead of passing tuple data, we just
pass the tuple id, but in that case we have to retain the pin on the buffer
that contains the tuple until the master backend reads from it, which might
have its own kind of problems.
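
To make the trade-off concrete, here is a minimal, hedged sketch of the copy-the-whole-tuple approach, assuming a shm_mq queue has already been set up between worker and master; queue setup, error handling and the function name are illustrative only:

#include "postgres.h"
#include "access/htup_details.h"
#include "executor/tuptable.h"
#include "storage/shm_mq.h"

/*
 * Sketch only: copy one qualifying tuple from a worker into the shared
 * memory queue read by the master backend.  "mqh" is assumed to point at
 * a queue attached during worker startup.
 */
static void
send_tuple_to_master(shm_mq_handle *mqh, TupleTableSlot *slot)
{
    HeapTuple   tuple = ExecFetchSlotTuple(slot);   /* materializes if needed */
    shm_mq_result res;

    /*
     * The whole tuple body is copied into the queue; for wide rows this is
     * exactly the per-tuple overhead being discussed here.
     */
    res = shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);
    if (res != SHM_MQ_SUCCESS)
        ereport(ERROR, (errmsg("could not send tuple to master backend")));
}

Passing only the 6-byte ItemPointer (tuple->t_self) instead would avoid the copy, at the price of holding the buffer pin until the master has re-fetched the row, which is exactly the problem noted above.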
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com> > wrote: >> >> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> >> wrote: >> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote: >> >> Does parallelism help at all? >> > >> > I'm pretty damn sure. We can't even make a mildly powerfull storage >> > fully busy right now. Heck, I can't make my workstation's storage with a >> > raid 10 out of four spinning disks fully busy. >> > >> > I think some of that benefit also could be reaped by being better at >> > hinting the OS... >> >> Yes, it definitely helps but not only limited to IO bound operations. >> It gives a good gain for the queries having CPU intensive where >> conditions. >> >> One more point we may need to consider, is there any overhead in passing >> the data row from workers to backend? > > I am not sure if that overhead will be too much visible if we improve the > use of I/O subsystem by making parallel tasks working on it. I feel there may be an overhead because of workers needs to put the result data in the shared memory and the backend has to read from there to process it further. If the cost of transfering data from worker to backend is more than fetching a tuple from the scan, then the overhead is visible when the selectivity is more. > However > another idea here could be that instead of passing tuple data, we just > pass tuple id, but in that case we have to retain the pin on the buffer > that contains tuple untill master backend reads from it that might have > it's own kind of problems. Transfering tuple id doesn't solve the scenarios if the node needs any projection. Regards, Hari Babu Fujitsu Australia
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> > wrote:
> >>
> >> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com>
> >> wrote:
> >> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> >> >> Does parallelism help at all?
> >> >
> >> > I'm pretty damn sure. We can't even make a mildly powerfull storage
> >> > fully busy right now. Heck, I can't make my workstation's storage with a
> >> > raid 10 out of four spinning disks fully busy.
> >> >
> >> > I think some of that benefit also could be reaped by being better at
> >> > hinting the OS...
> >>
> >> Yes, it definitely helps but not only limited to IO bound operations.
> >> It gives a good gain for the queries having CPU intensive where
> >> conditions.
> >>
> >> One more point we may need to consider, is there any overhead in passing
> >> the data row from workers to backend?
> >
> > I am not sure if that overhead will be too much visible if we improve the
> > use of I/O subsystem by making parallel tasks working on it.
>
> I feel there may be an overhead because of workers needs to put the result
> data in the shared memory and the backend has to read from there to process
> it further. If the cost of transfering data from worker to backend is more than
> fetching a tuple from the scan, then the overhead is visible when the
> selectivity is more.
>
> > However
> > another idea here could be that instead of passing tuple data, we just
> > pass tuple id, but in that case we have to retain the pin on the buffer
> > that contains tuple untill master backend reads from it that might have
> > it's own kind of problems.
>
> Transfering tuple id doesn't solve the scenarios if the node needs any
> projection.
Hmm, that's why I said that we need to retain the buffer pin, so that we can
get the tuple data.
> -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Haribabu Kommi > Sent: Tuesday, November 11, 2014 1:13 PM > To: Amit Kapila > Cc: Andres Freund; Robert Haas; Simon Riggs; Tom Lane; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] using custom scan nodes to prototype parallel > sequential scan > > On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: > > On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi > > <kommi.haribabu@gmail.com> > > wrote: > >> > >> On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund > >> <andres@2ndquadrant.com> > >> wrote: > >> > On 2014-11-10 10:57:16 -0500, Robert Haas wrote: > >> >> Does parallelism help at all? > >> > > >> > I'm pretty damn sure. We can't even make a mildly powerfull storage > >> > fully busy right now. Heck, I can't make my workstation's storage > >> > with a raid 10 out of four spinning disks fully busy. > >> > > >> > I think some of that benefit also could be reaped by being better > >> > at hinting the OS... > >> > >> Yes, it definitely helps but not only limited to IO bound operations. > >> It gives a good gain for the queries having CPU intensive where > >> conditions. > >> > >> One more point we may need to consider, is there any overhead in > >> passing the data row from workers to backend? > > > > I am not sure if that overhead will be too much visible if we improve > > the use of I/O subsystem by making parallel tasks working on it. > > I feel there may be an overhead because of workers needs to put the result > data in the shared memory and the backend has to read from there to process > it further. If the cost of transfering data from worker to backend is more > than fetching a tuple from the scan, then the overhead is visible when the > selectivity is more. > In my experience, data copy and transformation to fit TupleTableSlot is the biggest overhead, rather than scan or join itself... Probably, a straight-forward way is to construct an array of values/isnull on a shared memory segment, then the backend process just switch pointers of tts_values/tts_isnull, with no data copy. It gave us a performance gain. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
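
As a rough sketch of that pointer-swapping idea (the helper name and the assumption that the worker has already deconstructed the row into shared values/isnull arrays are illustrative only; pass-by-reference datums would still have to live in the shared segment):

#include "postgres.h"
#include "executor/tuptable.h"

/*
 * Sketch: the worker has already written the deconstructed column values
 * and null flags for one row into shared memory; the master points its
 * slot at those arrays instead of copying them column by column.
 */
static TupleTableSlot *
store_shared_row(TupleTableSlot *slot, Datum *shared_values, bool *shared_isnull)
{
    ExecClearTuple(slot);

    /* swap in the shared arrays; no per-column copy is performed */
    slot->tts_values = shared_values;
    slot->tts_isnull = shared_isnull;

    return ExecStoreVirtualTuple(slot);
}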
On 10 November 2014 15:57, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Something usable, with severe restrictions, is actually better than we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task.
>
> It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan.

+1

> Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work.

If we build something fairly restricted, but production ready, we can still do something useful for users.

* only functions marked as "CONTAINS NO SQL"
* parallel_workers = 2 or fixed
* only one Plan type, designed to maximise benefit and minimize optimizer change

Why?

* The above is all that is needed to test any infrastructure changes we accept. We need both infrastructure and tests. We could do this as a custom scan plugin, but why not make it work in core - that doesn't prevent a plugin version also if you desire it.

* only functions marked as "CONTAINS NO SQL"
We don't really know what proisparallel is, but we do know what CONTAINS NO SQL means and can easily check for it. Plus I already have a patch for this, slightly bitrotted.

* parallel_workers = 2 (or at least not make it user settable)
By fixing the number of workers at 2 we avoid any problems caused by having N variable, such as how to vary N fairly amongst users and other such considerations. We get the main benefit of parallelism, without causing other issues across the server.

* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type:

SELECT something, list of aggregates
FROM foo
WHERE filters
GROUP BY something

because we know that passing large amounts of data from worker to master process will be slow, so focusing only on seq scan is not sensible; we should focus on plans that significantly reduce the number of rows passed upwards. We could just do this for very selective WHERE clauses, but that is not an important class of query. As soon as we include aggregates, we reduce data passing significantly AND we hit a very important subset of queries:

This plan type is widely used in reporting queries, so will hit the mainline of BI applications and many Mat View creations. This will allow SELECT count(*) FROM foo to go faster also.

The execution plan for that query type looks like this...

Hash Aggregate
  Gather From Workers
    {Worker Nodes workers = 2
      HashAggregate
        PartialScan}

Which is simple enough that we include a mechanism for the Gather operation, plus it is simple enough to not need extensive optimizer changes - just changes to set_plain_rel_pathlist() to consider parallel plans.
Plan costs are easy to calculate for the above...

cpu_worker_startup_cost = 100 -- startup cost is easy to calculate by observation, but a reasonably large default will be OK
cpu_ipc_tuple_cost = 0.1 -- assume x10 the normal cost of cpu_tuple_cost

Partial scan costs are just the same as SeqScan, only with fewer blocks. All other costs are the same.

We can submit the main patch by Dec 15, fix all the problems by Feb 15.

-- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
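
Rendered as code, the arithmetic above might look roughly like the sketch below; cpu_worker_startup_cost and cpu_ipc_tuple_cost are the proposed parameters rather than existing GUCs, so they are declared locally here, and qual-evaluation costs are left out:

#include "postgres.h"
#include "nodes/relation.h"
#include "optimizer/cost.h"

/* proposed parameters from above; in a real patch these would be GUCs */
static const Cost cpu_worker_startup_cost = 100.0;
static const Cost cpu_ipc_tuple_cost = 0.1;     /* ~10x cpu_tuple_cost */

static void
cost_partial_seqscan(Path *path, RelOptInfo *rel, int nworkers)
{
    double      pages_per_worker = rel->pages / (double) nworkers;
    double      tuples_per_worker = rel->tuples / (double) nworkers;
    Cost        run_cost = 0;

    /* same per-page and per-tuple costs as a plain SeqScan, fewer blocks */
    run_cost += seq_page_cost * pages_per_worker;
    run_cost += cpu_tuple_cost * tuples_per_worker;

    /* rows surviving the pushed-down quals have to cross the IPC boundary */
    run_cost += cpu_ipc_tuple_cost * path->rows;

    path->startup_cost = cpu_worker_startup_cost;
    path->total_cost = cpu_worker_startup_cost + run_cost;
}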
> > Given > > where we are with the infrastructure, there would be a number of > > unhandled problems, such as deadlock detection (needs group locking or > > similar), assessment of quals as to parallel-safety (needs > > proisparallel or similar), general waterproofing to make sure that > > pushing down a qual we shouldn't does do anything really dastardly > > like crash the server (another written but yet-to-be-published patch > > adds a bunch of relevant guards), and snapshot sharing (likewise). > > But if you don't do anything weird, it should basically work. > > If we build something fairly restricted, but production ready we can still > do something useful to users. > > * only functions marked as "CONTAINS NO SQL" > * parallel_workers = 2 or fixed > * only one Plan type, designed to maximise benefit and minimize optimizer > change > > Why? > > * The above is all that is needed to test any infrastructure changes we > accept. We need both infrastructure and tests. We could do this as a custom > scan plugin, but why not make it work in core - that doesn't prevent a plugin > version also if you desire it. > > * only functions marked as "CONTAINS NO SQL" > We don't really know what proisparallel is, but we do know what CONTAINS > NO SQL means and can easily check for it. > Plus I already have a patch for this, slightly bitrotted. > Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient? > * parallel_workers = 2 (or at least not make it user settable) By fixing > the number of workers at 2 we avoid any problems caused by having N variable, > such as how to vary N fairly amongst users and other such considerations. > We get the main benefit of parallelism, without causing other issues across > the server. > > * Fixed Plan: aggregate-scan > To make everything simpler, allow only plans of a single type. > SELECT something, list of aggregates > FROM foo > WHERE filters > GROUP BY something > because we know that passing large amounts of data from worker to master > process will be slow, so focusing only on seq scan is not sensible; we should > focus on plans that significantly reduce the number of rows passed upwards. > We could just do this for very selective WHERE clauses, but that is not > an important class of query. > As soon as include aggregates, we reduce data passing significantly AND > we hit a very important subset of queries: > > This plan type is widely used in reporting queries, so will hit the mainline > of BI applications and many Mat View creations. > This will allow SELECT count(*) FROM foo to go faster also. > I agree with you. The key of performance is how many rows can be reduced by parallel processor, however, one thing I doubt is that the condition to kick parallel execution may be too restrictive. I'm not sure how much workloads run aggregate functions towards flat tables rather than multiple joined relations. Sorry, it might be a topic to discuss in a separated thread. Can't we have a functionality to push-down aggregate functions across table joins? If we could do this in some cases, but not any cases, it makes sense to run typical BI workload in parallel. Let me assume the following query: SELECT SUM(t1.X), SUM(t2.Y) FROM t1 JOIN t2 ON t1.pk_id = t2.fk_id GROUP BY t1.A, t2.B; It is a usual query that joins two relations with PK and FK. 
It can be broken down like:

SELECT SUM(t1.X), SUM(sq2.Y)
  FROM t1 JOIN (SELECT t2.fk_id, t2.B, SUM(t2.Y) Y
                  FROM t2 GROUP BY t2.fk_id, t2.B) sq2
    ON t1.pk_id = sq2.fk_id
 GROUP BY t1.A, sq2.B;

If the FK has 100 tuples per PK on average, an aggregate that runs prior to the join will massively reduce the number of tuples to be joined, and we can leverage parallel processors to do a preliminary aggregate on a flat table. I'm now checking previous academic work to identify the conditions under which a join allows aggregate functions to be pushed down. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.7751

Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > * only functions marked as "CONTAINS NO SQL" > We don't really know what proisparallel is, but we do know what > CONTAINS NO SQL means and can easily check for it. > Plus I already have a patch for this, slightly bitrotted. Interestingly, I have a fairly solid idea of what proisparallel is, but I have no clear idea what CONTAINS NO SQL is or why it's relevant. I would imagine that srandom() contains no SQL under any reasonable definition of what that means, but it ain't parallel-safe. > * parallel_workers = 2 (or at least not make it user settable) > By fixing the number of workers at 2 we avoid any problems caused by > having N variable, such as how to vary N fairly amongst users and > other such considerations. We get the main benefit of parallelism, > without causing other issues across the server. I think this is a fairly pointless restriction. The code simplification we'll get out of it appears to me to be quite minor, and we'll just end up putting the stuff back in anyway. > * Fixed Plan: aggregate-scan > To make everything simpler, allow only plans of a single type. > SELECT something, list of aggregates > FROM foo > WHERE filters > GROUP BY something > because we know that passing large amounts of data from worker to > master process will be slow, so focusing only on seq scan is not > sensible; we should focus on plans that significantly reduce the > number of rows passed upwards. We could just do this for very > selective WHERE clauses, but that is not an important class of query. > As soon as include aggregates, we reduce data passing significantly > AND we hit a very important subset of queries: This is moving the goalposts in a way that I'm not at all comfortable with. Parallel sequential-scan is pretty simple and may well be a win if there's a restrictive filter condition involved. Parallel aggregation requires introducing new infrastructure into the aggregate machinery to allow intermediate state values to be combined, and that would be a great project for someone to do at some time, but it seems like a distraction for me to do that right now. > This plan type is widely used in reporting queries, so will hit the > mainline of BI applications and many Mat View creations. > This will allow SELECT count(*) FROM foo to go faster also. > > The execution plan for that query type looks like this... > Hash Aggregate > Gather From Workers > {Worker Nodes workers = 2 > HashAggregate > PartialScan} I'm going to aim for the simpler: Hash Aggregate -> Parallel Seq Scan Workers: 4 Yeah, I know that won't perform as well as what you're proposing, but I'm fairly sure it's simpler. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.
We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.
Regards
David Rowley
On Wed, Nov 12, 2014 at 1:24 PM, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.
We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.
It might make sense to make a new planner operator which can be responsible for pulling from each of the individual parallel Agg nodes and then aggregating over the results.
A couple of things that might be worth considering are whether we want to enforce using parallel aggregation or let planner decide if it wants to do a parallel aggregate or go with a single plan. For eg, the average estimated size of groups might be one thing that planner may consider while deciding between a parallel and a single execution plan.
I don't see merging states as an easy problem, and it should perhaps be tackled apart from this thread.
Also, do we want to allow parallelism only with GroupAggs?
Regards,
Atri
Regards,
Atri
On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote: > Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient? There are certainly things that are parallel-safe that are not immutable. It might be the case that everything immutable is parallel-safe. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Nov 12, 2014 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote: >> Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient? > > There are certainly things that are parallel-safe that are not > immutable. It might be the case that everything immutable is > parallel-safe. FWIW, when working on the concept of expression and clause shippability for Postgres-XC (aka the possibility to pass it safely to another backend, but in another PG node in this case), we discussed similar things and if I recall correctly we even discussed about adding a flag to pg_proc to define if a function was shippable or not. Finally what we finished with was not adding a new flag and use as rule that all the immutable functions can be safely shipped, and others not, even some stable functions that *could* be safe. Maybe Ashutosh has more comments on that, my memory may be failing. In the end, I think that you would finish with something similar. My 2c. -- Michael
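
A minimal sketch of that rule, reusing the existing expression walker (the helper name is invented; contain_mutable_functions() reports anything that is not immutable, so stable functions are conservatively excluded, matching the XC choice described above):

#include "postgres.h"
#include "nodes/nodes.h"
#include "optimizer/clauses.h"

/*
 * Sketch: treat a qual as safe to hand to a worker only if every function
 * (and operator) it contains is immutable.
 */
static bool
qual_is_worker_safe(Node *qual)
{
    return !contain_mutable_functions(qual);
}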
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote: > I'm going to aim for the simpler: > > Hash Aggregate > -> Parallel Seq Scan > Workers: 4 > > Yeah, I know that won't perform as well as what you're proposing, but > I'm fairly sure it's simpler. Simple is best, so +1. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> wrote: > On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> >> This plan type is widely used in reporting queries, so will hit the >> mainline of BI applications and many Mat View creations. >> This will allow SELECT count(*) FROM foo to go faster also. >> > > We'd also need to add some infrastructure to merge aggregate states together > for this to work properly. This means that could also work for avg() and > stddev etc. For max() and min() the merge functions would likely just be the > same as the transition functions. Do you mean something like a "subtotal" or "intermediate combination" function? I guess we'd need the same thing to make intermediate aggregates work during a sort? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> * only functions marked as "CONTAINS NO SQL" >> We don't really know what proisparallel is, but we do know what >> CONTAINS NO SQL means and can easily check for it. >> Plus I already have a patch for this, slightly bitrotted. > > Interestingly, I have a fairly solid idea of what proisparallel is, > but I have no clear idea what CONTAINS NO SQL is or why it's relevant. > I would imagine that srandom() contains no SQL under any reasonable > definition of what that means, but it ain't parallel-safe. What is wrong in generating random numbers in parallel? But I'm sure many volatile functions would be annoying to support, so CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 14, 2014 at 1:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>
>> This plan type is widely used in reporting queries, so will hit the
>> mainline of BI applications and many Mat View creations.
>> This will allow SELECT count(*) FROM foo to go faster also.
>>
>
> We'd also need to add some infrastructure to merge aggregate states together
> for this to work properly. This means that could also work for avg() and
> stddev etc. For max() and min() the merge functions would likely just be the
> same as the transition functions.

Do you mean something like a "subtotal" or "intermediate combination" function?
If you had 4 parallel workers performing a seqscan, say the relation was 4000 pages in total, you could say that worker 1 would scan blocks 0-999, worker 2 blocks 1000-1999, etc. After each worker had finished, there would be 4 sets of records that then need to be merged into 1 set.
Take int4_avg_accum() for example it does:
transdata->count++;
transdata->sum += newval;
The merge function would need to perform something like:
transdata->count += transdata2merge.count;
transdata->sum += transdata2merge.sum;
Then the final function could be called on the merged aggregate state.
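
Spelled out as a hedged C sketch, modelled on int4_avg_accum() in numeric.c; the function name int4_avg_merge and its wiring into CREATE AGGREGATE are invented for illustration, and a real version would check AggCheckCallContext() before modifying its input in place:

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

/*
 * Sketch of a merge (combine) function for the int4 avg() state.  The
 * state is a 2-element int8 array holding {count, sum}, as used by
 * int4_avg_accum().
 */
typedef struct Int8TransTypeData
{
    int64       count;
    int64       sum;
} Int8TransTypeData;

PG_FUNCTION_INFO_V1(int4_avg_merge);

Datum
int4_avg_merge(PG_FUNCTION_ARGS)
{
    ArrayType  *state1 = PG_GETARG_ARRAYTYPE_P(0);
    ArrayType  *state2 = PG_GETARG_ARRAYTYPE_P(1);
    Int8TransTypeData *d1 = (Int8TransTypeData *) ARR_DATA_PTR(state1);
    Int8TransTypeData *d2 = (Int8TransTypeData *) ARR_DATA_PTR(state2);

    /* fold the second worker's partial state into the first */
    d1->count += d2->count;
    d1->sum += d2->sum;

    PG_RETURN_ARRAYTYPE_P(state1);
}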
The same can be applied when the query contains a GROUP BY clause; we'd just need to pay attention to which groups we merge together for that to work. Any HAVING clause would have to be applied after the groups have been merged.
This whole topic is pretty exciting for data warehouse type workloads.
Regards
David Rowley
> On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> > wrote: > > On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs > <simon@2ndquadrant.com> wrote: > >> > >> > >> This plan type is widely used in reporting queries, so will hit > the > >> mainline of BI applications and many Mat View creations. > >> This will allow SELECT count(*) FROM foo to go faster also. > >> > > > > We'd also need to add some infrastructure to merge aggregate > states together > > for this to work properly. This means that could also work for > avg() and > > stddev etc. For max() and min() the merge functions would likely > just be the > > same as the transition functions. > > > Do you mean something like a "subtotal" or "intermediate > combination" function? > > > > > If you had 4 parallel workers performing a seqscan, say the relation was > 4000 pages in total, you could say that worker 1 would scan blocks 0-999, > worker 2, 1000-1999 etc. After each worker had finished, there would then > be 4 sets of records then needed to be merged into 1 set. > > Take int4_avg_accum() for example it does: > > > transdata->count++; > transdata->sum += newval; > > The merge function would need to perform something like: > > transdata->count += transdata2merge.count; > transdata->sum += transdata2merge.sum; > > Then the final function could be called on the merged aggregate state. > > The same can be applied when the query contains a GROUP BY clause, just > we'd need pay attention to which groups we merge together for that to work > Any HAVING clause would have to be applied after the groups have been merged. > > This whole topic is pretty exciting for data warehouse type workloads. >

More simply, we can describe a parallel-aware aggregate function. Please assume an AVG(X) function that takes nrows and the sum of X. Its transition function performs as you described above, then the final function works as usual.

The job the parallel seq scan needs to do is:
1. replace AVG(X) by AVG(nrows, sum(X))
2. generate count(*) on the partial relation being grouped.
3. generate sum(X) on the partial relation being grouped.

It looks like the following query:

SELECT AVG(nrows, sum_X) FROM (
  SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 0 and 999 GROUP BY cat
  UNION
  SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 1000 and 1999 GROUP BY cat
  UNION
  SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 2000 and 2999 GROUP BY cat
  UNION
  SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 3000 and 3999 GROUP BY cat
);

Note that the stuff inside the sub-query is the job of the parallel scan. Do we need to invent new infrastructure here?

Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> wrote:
> Take int4_avg_accum() for example it does:
>
> transdata->count++;
> transdata->sum += newval;
>
> The merge function would need to perform something like:
>
> transdata->count += transdata2merge.count;
> transdata->sum += transdata2merge.sum;
>
> Then the final function could be called on the merged aggregate state.
>
> More simplify, we can describe parallel aware aggregate function.
> Please assume AVG(X) function that takes nrows and sum of X. Its transition
> function performs as like you described above, then final function works as
> usual.
> The job of parallel seq scan needs to do is:
> 1. replace AVG(X) by AVG(nrows, sum(X)
> 2. generate count(*) on the partial relation being grouped.
> 3. generate sum(X) on the partial relation being grouped.
> It looks like the following query:
> SELECT AVG(nrows, sum_X) FROM (
> SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 0 and 999 GROUP BY cat
> UNION
> SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 1000 and 1999 GROUP BY cat
> UNION
> SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 2000 and 2999 GROUP BY cat
> UNION
> SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno between 3000 and 3999 GROUP BY cat
> );
Well, this would require giving the planner some kind of knowledge of what AVG() is. It currently knows nothing about that. It currently calls the transition function for each row, and the final function at the end and does not care or need to care about what either of those functions actually does. The transformation above looks like it would need to care and the logic to add that would be way more complex than aggregate merge functions.
Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the merge function would be the same as the transition function, as the state type is just the same as the input type. It would only be aggregates like avg(), stddev*(), bool_and() and bool_or() that would need a new merge function made... These would be no more complex than the transition functions... Which are just a few lines of code anyway.
We'd simply just not run parallel query if any aggregates used in the query didn't have a merge function.
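
A rough sketch of that check, assuming a hypothetical pg_aggregate column (aggmergefn, say) recording the merge function; no such column exists today, and the helper name is invented for illustration:

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_aggregate.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"
#include "utils/syscache.h"

/*
 * Sketch of the "no merge function, no parallel plan" rule.  The
 * pg_aggregate column aggmergefn is hypothetical.
 */
static bool
all_aggregates_mergeable(List *aggrefs)
{
    ListCell   *lc;

    foreach(lc, aggrefs)
    {
        Aggref     *aggref = (Aggref *) lfirst(lc);
        HeapTuple   tup;
        bool        ok;

        tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(aggref->aggfnoid));
        if (!HeapTupleIsValid(tup))
            elog(ERROR, "cache lookup failed for aggregate %u",
                 aggref->aggfnoid);
        ok = OidIsValid(((Form_pg_aggregate) GETSTRUCT(tup))->aggmergefn);
        ReleaseSysCache(tup);

        if (!ok)
            return false;       /* fall back to a non-parallel plan */
    }
    return true;
}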
When I mentioned this, I didn't mean to appear to be placing a road block. I was just bringing to the table the information that COUNT(*) + COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.
Merge functions should be a simple patch to write. If I thought there was going to be a use for them in this release, I'd put my hand up to put a patch together.
Regards
David Rowley
> On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote: > > > > On 12 November 2014 07:54, David Rowley > <dgrowleyml@gmail.com> > > Take int4_avg_accum() for example it does: > > > > > > transdata->count++; > > transdata->sum += newval; > > > > The merge function would need to perform something like: > > > > > transdata->count += transdata2merge.count; sum += > transdata2merge.sum; > > > > Then the final function could be called on the merged aggregate > state. > > > More simplify, we can describe parallel aware aggregate function. > Please assume AVG(X) function that takes nrows and sum of X. Its > transition > function performs as like you described above, then final function > works as > usual. > > The job of parallel seq scan needs to do is: > 1. replace AVG(X) by AVG(nrows, sum(X) > 2. generate count(*) on the partial relation being grouped. > 3. generate sum(X) on the partial relation being grouped. > > It looks like the following query: > SELECT AVG(nrows, sum_X) FROM ( > SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno > between 0 and 999 GROUP BY cat > UNION > SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno > between 1000 and 1999 GROUP BY cat > UNION > SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno > between 2000 and 2999 GROUP BY cat > UNION > SELECT count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno > between 3000 and 3999 GROUP BY cat > ); > > > > > Well, this would require giving the planner some kind of knowledge of what > AVG() is. It currently knows nothing about that. It currently calls the > transition function for each row, and the final function at the end and > does not care or need to care about what either of those functions actually > does. The transformation above looks like it would need to care and the > logic to add that would be way more complex than aggregate merge functions. > It may make sense to have an extra catalog to indicate how usual aggregate function can be broken down (or unavailable). Like, AVG(X) = {COUNT(X) + SUM(X)} x N-partitions > Likely for most aggregates, like count, sum, max, min, bit_and and bit_or > the merge function would be the same as the transition function, as the > state type is just the same as the input type. It would only be aggregates > like avg(), stddev*(), bool_and() and bool_or() that would need a new merge > function made... These would be no more complex than the transition > functions... Which are just a few lines of code anyway. > > We'd simply just not run parallel query if any aggregates used in the query > didn't have a merge function. > > When I mentioned this, I didn't mean to appear to be placing a road block.I > was just bringing to the table the information that COUNT(*) + COUNT(*) > works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not. > > Merge functions should be a simple patch to write. If I thought there was > going to be a use for them in this release, I'd put my hand up to put a > patch together. > Things I'm uncertain is, how caller of aggregate function distinguish a context to call usual translation function, or new merge function. Let's back an example of: SELECT cat, COUNT(*), AVG(X) FROM t GROUP BY cat; Its plan tree is probably as follows: HashAggregate Group Key: cat -> Custom Scan (Parallel Scan) # of workers:4 The caller of translation/merge/final function is HashAggregate node. On the other hand, it has to know whether the underlying plan returns every records of underlying table or sub-total by parallel scan. 
Please correct me if my assumption is wrong. Once HashAggregate knows that the sub-plan returns sub-totals of the relation, it can choose the merge function instead of the transition function. However, what I want to clarify is how to inform the HashAggregate node that its sub-plan intends to return sub-totals instead of individual rows. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
On 11/12/14, 1:54 AM, David Rowley wrote:
> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> This plan type is widely used in reporting queries, so will hit the
> mainline of BI applications and many Mat View creations.
> This will allow SELECT count(*) FROM foo to go faster also.
>
> We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.

Sanity check: what % of a large aggregate query fed by a seqscan is actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved to get data to the aggregate function than in the aggregation itself, except maybe for numeric?

In other words, I suspect that just having a dirt-simple parallel SeqScan could be a win for CPU. It should certainly be a win IO-wise; in my experience we're not very good at maxing out IO systems.

(I was curious and came up with the list below for just the page-level stuff (ignoring IO). I don't see much code involved in per-tuple work, but I also never came across detoasting code, so I suspect I'm missing something...)

ExecScanFetch, heapgettup_pagemode, ReadBuffer, BufferAlloc, heap_page_prune_opt, LWLockAcquire... then you can finally do per-tuple work. HeapTupleSatisfiesVisibility.

-- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 14 November 2014 20:37, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 11/12/14, 1:54 AM, David Rowley wrote:
We'd also need to add some infrastructure to merge aggregate states together for this to work properly. This means that could also work for avg() and stddev etc. For max() and min() the merge functions would likely just be the same as the transition functions.
Sanity check: what % of a large aggregate query fed by a seqscan actually spent in the aggregate functions? Even if you look strictly at CPU cost, isn't there more code involved to get data to the aggregate function than in the aggregation itself, except maybe for numeric?
You might be right, but that sounds like it would need all the parallel workers to send each matching tuple to a queue to be processed by some aggregate node. I guess this would have more advantage for wider tables or tables with many dead tuples, or if the query has quite a selective where clause, as less data would make it onto that queue.
Perhaps I've taken 1 step too far forward here. I had been thinking that each worker would perform the partial seqscan and in the worker context pass the tuple down to the aggregate node. Then later once each worker had complete some other perhaps new node type (MergeAggregateStates) would merge all those intermediate agg states into the final agg state (which would then be ready for the final function to be called).
Are there any plans for what will be in charge of deciding how many workers would be allocated to a parallel query? Will this be something that's done at planning time? Or should the planner just create a parallel friendly plan, iif the plan is costly enough and then just allow the executor decide how many workers to throw at the job based on how busy the system is with other tasks at execution time?
Regards
David Rowley
On 14 November 2014 07:37, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 11/12/14, 1:54 AM, David Rowley wrote: >> >> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com >> <mailto:simon@2ndquadrant.com>> wrote: >> >> >> This plan type is widely used in reporting queries, so will hit the >> mainline of BI applications and many Mat View creations. >> This will allow SELECT count(*) FROM foo to go faster also. >> >> We'd also need to add some infrastructure to merge aggregate states >> together for this to work properly. This means that could also work for >> avg() and stddev etc. For max() and min() the merge functions would likely >> just be the same as the transition functions. > > > Sanity check: what % of a large aggregate query fed by a seqscan actually > spent in the aggregate functions? Even if you look strictly at CPU cost, > isn't there more code involved to get data to the aggregate function than in > the aggregation itself, except maybe for numeric? Yes, which is why I suggested pre-aggregating before collecting the streams together. The point is not that the aggregation is expensive, its that the aggregation eats data and the required bandwidth for later steps is reduced and hence does not then become a bottleneck that renders the parallel Seq Scan ineffective. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 14 November 2014 01:51, David Rowley <dgrowleyml@gmail.com> wrote: > When I mentioned this, I didn't mean to appear to be placing a road block.I > was just bringing to the table the information that COUNT(*) + COUNT(*) > works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not. > > Merge functions should be a simple patch to write. If I thought there was > going to be a use for them in this release, I'd put my hand up to put a > patch together. The hard part is describing and then agreeing the semantics. If you raise a separate post on this, copy me in and I will help. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
> On 14 November 2014 07:37, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > > On 11/12/14, 1:54 AM, David Rowley wrote: > >> > >> On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com > >> <mailto:simon@2ndquadrant.com>> wrote: > >> > >> > >> This plan type is widely used in reporting queries, so will hit the > >> mainline of BI applications and many Mat View creations. > >> This will allow SELECT count(*) FROM foo to go faster also. > >> > >> We'd also need to add some infrastructure to merge aggregate states > >> together for this to work properly. This means that could also work > >> for > >> avg() and stddev etc. For max() and min() the merge functions would > >> likely just be the same as the transition functions. > > > > > > Sanity check: what % of a large aggregate query fed by a seqscan > > actually spent in the aggregate functions? Even if you look strictly > > at CPU cost, isn't there more code involved to get data to the > > aggregate function than in the aggregation itself, except maybe for > numeric? > > Yes, which is why I suggested pre-aggregating before collecting the streams > together. > > The point is not that the aggregation is expensive, its that the aggregation > eats data and the required bandwidth for later steps is reduced and hence > does not then become a bottleneck that renders the parallel Seq Scan > ineffective. > I'd like to throw community folks a question. Did someone have a discussion to the challenge of aggregate push-down across relations join in the past? It potentially reduces number of rows to be joined. If we already had, I'd like to check up the discussion at that time. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
On Fri, Nov 14, 2014 at 1:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:
> Interestingly, I have a fairly solid idea of what proisparallel is,
> but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
> I would imagine that srandom() contains no SQL under any reasonable
> definition of what that means, but it ain't parallel-safe.
What is wrong in generating random numbers in parallel?
I was just watching Robert's talk on Parallel query on youtube... I think the answer is at 41:09, the link below should take you there:
Regards
David Rowley
On Thu, Nov 13, 2014 at 7:27 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> * only functions marked as "CONTAINS NO SQL" >>> We don't really know what proisparallel is, but we do know what >>> CONTAINS NO SQL means and can easily check for it. >>> Plus I already have a patch for this, slightly bitrotted. >> >> Interestingly, I have a fairly solid idea of what proisparallel is, >> but I have no clear idea what CONTAINS NO SQL is or why it's relevant. >> I would imagine that srandom() contains no SQL under any reasonable >> definition of what that means, but it ain't parallel-safe. > > What is wrong in generating random numbers in parallel? If they're random, nothing, provide that you use a different seed in each backend. But if the seed has been set using srandom(), then we must generate the particular sequence of random numbers associated with that seed. If we don't, the behavior is incompatible with what happens apart from parallel mode. > But I'm sure many volatile functions would be annoying to support, so > CONTAINS NO SQL and STABLE/IMMUTABLE seems OK for the first thing. It might be OK to assume that immutable functions are fine, but not all stable functions look like they'll be safe, or not without additional effort: e.g. inet_client_addr(), pg_backend_pid(), pg_cursor(), pg_event_trigger_dropped_objects(). Or even, more mundanely, generate_series(), unless we ensure all calls are in the same backend. On the other hand, there are quite a few stable functions that it seems like we would want to allow in parallel queries, like to_char(), concat(), and textanycat(). I still don't know what CONTAINS NO SQL means. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 14 November 2014 11:02, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote: > I'd like to throw community folks a question. > Did someone have a discussion to the challenge of aggregate push-down across > relations join in the past? It potentially reduces number of rows to be joined. > If we already had, I'd like to check up the discussion at that time. Yes, I was looking at aggregate pushdown. I think it needs the same changes to aggregates discussed upthread. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 17 November 2014 16:01, Robert Haas <robertmhaas@gmail.com> wrote: > I still don't know what CONTAINS NO SQL means. It's a function marking that would indicate we aren't allowed to take snapshots or run SQL. I think you should publish the scheme for marking functions as safe for parallelism, so we can judge that. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 18 November 2014 05:19, Simon Riggs <simon@2ndquadrant.com> wrote:
On 14 November 2014 11:02, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I'd like to throw community folks a question.
> Did someone have a discussion to the challenge of aggregate push-down across
> relations join in the past? It potentially reduces number of rows to be joined.
> If we already had, I'd like to check up the discussion at that time.
Yes, I was looking at aggregate pushdown. I think it needs the same
changes to aggregates discussed upthread.
I have something that I've been working on locally. It's far from ready, but it does work in very simple cases, and shows a nice performance boost. I'll start another thread soon and copy you both in. Perhaps we can share some ideas.
Regards
David Rowley
> On 18 November 2014 05:19, Simon Riggs <simon@2ndquadrant.com> wrote: > > > On 14 November 2014 11:02, Kouhei Kaigai <kaigai@ak.jp.nec.com> > wrote: > > > I'd like to throw community folks a question. > > Did someone have a discussion to the challenge of aggregate > push-down across > > relations join in the past? It potentially reduces number of rows > to be joined. > > If we already had, I'd like to check up the discussion at that > time. > > Yes, I was looking at aggregate pushdown. I think it needs the same > changes to aggregates discussed upthread. > > > > > I have something that I've been working on locally. It's far from ready, > but it does work in very simple cases, and shows a nice performance boost. > I'll start another thread soon and copy you both in. Perhaps we can share > some ideas. > Great, it's exactly valuable functionality to be in the core. -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei <kaigai@ak.jp.nec.com>
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
> Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the
> merge function would be the same as the transition function, as the state type
> is just the same as the input type. It would only be aggregates like avg(),
> stddev*(), bool_and() and bool_or() that would need a new merge function
> made... These would be no more complex than the transition functions... Which
> are just a few lines of code anyway.
>
> We'd simply just not run parallel query if any aggregates used in the query
> didn't have a merge function.
>
> When I mentioned this, I didn't mean to appear to be placing a road block. I was
> just bringing to the table the information that COUNT(*) + COUNT(*) works ok
> for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.

Sorry, late reply, but, FYI, I don't think our percentile functions can be parallelized in the same way:

test=> \daS *percent*
                                          List of aggregate functions
   Schema   |      Name       |  Result data type  |             Argument data types              |             Description
------------+-----------------+--------------------+----------------------------------------------+-------------------------------------
 pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
 pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
 pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
 pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
 pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
 pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
 pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
> On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
> > Likely for most aggregates, like count, sum, max, min, bit_and and
> > bit_or the merge function would be the same as the transition
> > function, as the state type is just the same as the input type. It
> > would only be aggregates like avg(), stddev*(), bool_and() and
> > bool_or() that would need a new merge function made... These would be
> > no more complex than the transition functions... Which are just a few
> > lines of code anyway.
> >
> > We'd simply just not run parallel query if any aggregates used in the
> > query didn't have a merge function.
> >
> > When I mentioned this, I didn't mean to appear to be placing a road
> > block. I was just bringing to the table the information that COUNT(*) +
> > COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n)
> > does not.
>
> Sorry, late reply, but, FYI, I don't think our percentile functions can
> be parallelized in the same way:
>
> test=> \daS *percent*
> [ ... output listing percent_rank, percentile_cont and percentile_disc snipped ... ]

Yep, that looks like the kind of aggregate function that cannot obviously
be split across multiple partitions.

Still, I think this is valuable even if we can only push down the subset of
aggregate functions that is well known to the core planner. For example, we
know that count(*) = sum(nrows), and we also know that avg(X) can be
rewritten as an enhanced avg() that takes both nrows and the partial sum
of X. We can use that knowledge to break down aggregate functions.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
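As a rough sketch of the rewrite KaiGai describes (the per-worker split is emulated here with a modulo expression, which is of course made up; a real plan would divide the heap among workers):

    -- each "worker" produces a partial state (nrows, psum);
    -- the leader computes count(*) as sum(nrows) and avg(x) as sum(psum) / sum(nrows)
    SELECT sum(nrows)                      AS count_star,
           sum(psum)::numeric / sum(nrows) AS avg_x
    FROM  (SELECT count(*) AS nrows, sum(x) AS psum
           FROM   generate_series(1, 100) AS x
           GROUP  BY x % 3) AS partial;    -- x % 3 stands in for the per-worker chunks

This returns count_star = 100 and avg_x = 50.5, the same answers as plain count(*) and avg(x) over the whole set.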
On Wed, Dec 3, 2014 at 3:23 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
>> > [ ... discussion of merge functions for count/sum/avg, quoted above, snipped ... ]
>>
>> Sorry, late reply, but, FYI, I don't think our percentile functions can
>> be parallelized in the same way:
>>
>> test=> \daS *percent*
>> [ ... output snipped ... ]
>
> Yep, that looks like the kind of aggregate function that cannot obviously
> be split across multiple partitions.
>
> Still, I think this is valuable even if we can only push down the subset of
> aggregate functions that is well known to the core planner. For example, we
> know that count(*) = sum(nrows), and we also know that avg(X) can be
> rewritten as an enhanced avg() that takes both nrows and the partial sum
> of X. We can use that knowledge to break down aggregate functions.

Postgres-XC (Postgres-XL) implemented such parallel aggregate logic some time
ago, using a set of sub functions and a finalization function to do the work.
My 2c.
--
Michael
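For what it's worth, the per-aggregate merge step behind that kind of sub-function/finalization split really is only a few lines each. As a purely illustrative sketch (not Postgres-XC syntax, not attachable to CREATE AGGREGATE today, and with a made-up function name), a merge function for avg(float8)/stddev(float8) could look like this, assuming the existing float8_accum() transition state of {N, sum(X), sum(X*X)}:

    -- hypothetical merge of two partial float8_accum states: element-wise addition
    CREATE FUNCTION float8_accum_merge(state1 float8[], state2 float8[])
    RETURNS float8[]
    LANGUAGE sql IMMUTABLE AS
    $$ SELECT ARRAY[state1[1] + state2[1],
                    state1[2] + state2[2],
                    state1[3] + state2[3]] $$;

The existing final function (e.g. float8_avg) would then run once on the merged state, which is essentially the finalization step described above.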