Thread: Parallel Seq Scan
On 12/04/2014 07:35 AM, Amit Kapila wrote:
> [snip]
>
> The number of worker backends that can be used for
> parallel seq scan can be configured by using a new GUC
> parallel_seqscan_degree, the default value of which is zero
> and it means parallel seq scan will not be considered unless
> user configures this value.

The number of parallel workers should be capped (of course!) at the
maximum number of "processors" (cores/vCores, threads/hyperthreads)
available.

Moreover, when load goes up, the relative cost of parallel working
should go up as well. Something like:

    p = number of cores
    l = 1min-load

    additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(p-1)

    (for p > 1, of course)

> In ExecutorStart phase, initiate the required number of workers
> as per parallel seq scan plan and setup dynamic shared memory and
> share the information required for worker to execute the scan.
> Currently I have just shared the relId, targetlist and number
> of blocks to be scanned by worker, however I think we might want
> to generate a plan for each of the workers in master backend and
> then share the same to individual worker.

[snip]

> Attached patch is just to facilitate the discussion about the
> parallel seq scan and may be some other dependent tasks like
> sharing of various states like combocid, snapshot with parallel
> workers. It is by no means ready to do any complex test; of course
> I will work towards making it more robust both in terms of adding
> more stuff and doing performance optimizations.
>
> Thoughts/Suggestions?

Not directly (I haven't had the time to read the code yet), but I'm
thinking about the ability to simply *replace* executor methods from an
extension. This could be an alternative to providing additional nodes
that the planner can include in the final plan tree, ready to be
executed.

The parallel seq scan nodes are definitely the best approach for
"parallel query", since the planner can optimize them based on cost.
I'm wondering about the ability to modify the implementation of some
methods themselves once at execution time: given a previously planned
query, chances are that, at execution time (I'm specifically thinking
about prepared statements here), a different implementation of the same
"node" might be more suitable and could be used instead while the
condition holds.

If this latter line of thinking is too off-topic for this thread, and
there is any interest, we can move the comments to another thread and
I'd begin work on a PoC patch. It might as well make sense to implement
the executor overloading mechanism alongside the custom plan API,
though.

Any comments appreciated.

Thank you for your work, Amit


Regards,

    / J.L.
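To make the arithmetic above concrete, here is a minimal C sketch of the
proposed load-aware penalty; the function name and its call site are
invented, and getloadavg(3) is just one way to obtain the 1-minute load:

    #include <stdlib.h>          /* getloadavg() on most Unixen */

    /*
     * Hypothetical helper implementing the penalty above:
     *   additional_cost = tuple_estimate * cpu_tuple_cost * (l+1)/(p-1)
     * Returns 0 when the formula is undefined (p <= 1) or the load
     * average is unavailable.
     */
    static double
    parallel_load_penalty(double tuple_estimate, double cpu_tuple_cost,
                          int p)
    {
        double  l;

        if (p <= 1 || getloadavg(&l, 1) != 1)
            return 0.0;

        return tuple_estimate * cpu_tuple_cost * (l + 1.0) / (p - 1.0);
    }

For example, with p = 8 cores and a 1-minute load of 3, a scan returning
one million tuples at cpu_tuple_cost = 0.01 would be penalized by
1,000,000 * 0.01 * 4/7, roughly 5714 cost units.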
José,

* José Luis Tallón (jltallon@adv-solutions.net) wrote:
> On 12/04/2014 07:35 AM, Amit Kapila wrote:
> > The number of worker backends that can be used for
> > parallel seq scan can be configured by using a new GUC
> > parallel_seqscan_degree, the default value of which is zero
> > and it means parallel seq scan will not be considered unless
> > user configures this value.
>
> The number of parallel workers should be capped (of course!) at the
> maximum number of "processors" (cores/vCores, threads/hyperthreads)
> available.
>
> Moreover, when load goes up, the relative cost of parallel working
> should go up as well.
> Something like:
> p = number of cores
> l = 1min-load
>
> additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(p-1)
>
> (for p > 1, of course)

While I agree in general that we'll need to come up with appropriate
acceptance criteria, etc, I don't think we want to complicate this patch
with that initially. A SUSET GUC which caps the parallel GUC would be
enough for an initial implementation, imv.

> Not directly (I haven't had the time to read the code yet), but I'm
> thinking about the ability to simply *replace* executor methods from
> an extension.

You probably want to look at the CustomScan thread+patch directly
then..

Thanks,

Stephen
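For reference, a SUSET cap of the kind Stephen suggests would amount to
a few lines in guc.c's ConfigureNamesInt[] table; the GUC name, default,
and upper bound below are placeholders, not from the patch:

    /*
     * Hypothetical config_int entry for a superuser-settable cap on
     * the number of workers any parallel seq scan may use.
     */
    {
        {"parallel_seqscan_degree_limit", PGC_SUSET, RESOURCES,
            gettext_noop("Caps the number of workers a parallel seq scan may use."),
            NULL
        },
        &parallel_seqscan_degree_limit,
        0, 0, MAX_BACKENDS,
        NULL, NULL, NULL
    },

With something like this in place, a user-settable parallel_seqscan_degree
could stay PGC_USERSET while the administrator retains the final say.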
Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> postgres=# explain select c1 from t1;
>                        QUERY PLAN
> ------------------------------------------------------
>  Seq Scan on t1  (cost=0.00..101.00 rows=100 width=4)
> (1 row)
>
> postgres=# set parallel_seqscan_degree=4;
> SET
> postgres=# explain select c1 from t1;
>                            QUERY PLAN
> --------------------------------------------------------------
>  Parallel Seq Scan on t1  (cost=0.00..25.25 rows=100 width=4)
>    Number of Workers: 4
>    Number of Blocks Per Workers: 25
> (3 rows)

This is all great and interesting, but I feel like folks might be
waiting to see just what kind of performance results come from this
(and what kind of hardware is needed to see gains..). There are likely
to be situations where this change is an improvement while also being
cases where it makes things worse.

One really interesting case would be parallel seq scans which are
executing against foreign tables/FDWs..

Thanks!

Stephen
On 12/5/14, 9:08 AM, José Luis Tallón wrote:
> Moreover, when load goes up, the relative cost of parallel working
> should go up as well.
> Something like:
> p = number of cores
> l = 1min-load
>
> additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(p-1)
>
> (for p > 1, of course)

...

> The parallel seq scan nodes are definitely the best approach for
> "parallel query", since the planner can optimize them based on cost.
> I'm wondering about the ability to modify the implementation of some
> methods themselves once at execution time: given a previously planned
> query, chances are that, at execution time (I'm specifically thinking
> about prepared statements here), a different implementation of the
> same "node" might be more suitable and could be used instead while
> the condition holds.

These comments got me wondering... would it be better to decide on
parallelism during execution instead of at plan time? That would allow
us to dynamically scale parallelism based on system load. If we don't
even consider parallelism until we've pulled some number of
tuples/pages from a relation, this would also eliminate all parallel
overhead on small relations.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
>
> Amit,
>
> * Amit Kapila (amit.kapila16@gmail.com) wrote:
> > postgres=# explain select c1 from t1;
> > QUERY PLAN
> > ------------------------------------------------------
> > Seq Scan on t1 (cost=0.00..101.00 rows=100 width=4)
> > (1 row)
> >
> >
> > postgres=# set parallel_seqscan_degree=4;
> > SET
> > postgres=# explain select c1 from t1;
> > QUERY PLAN
> > --------------------------------------------------------------
> > Parallel Seq Scan on t1 (cost=0.00..25.25 rows=100 width=4)
> > Number of Workers: 4
> > Number of Blocks Per Workers: 25
> > (3 rows)
>
> This is all great and interesting, but I feel like folks might be
> waiting to see just what kind of performance results come from this (and
> what kind of hardware is needed to see gains..).
> There are likely to be situations where this change is an improvement
> while also being cases where it makes things worse.
Agreed, and I think that will become clearer after doing some
performance tests.
> One really interesting case would be parallel seq scans which are
> executing against foreign tables/FDWs..
>
>
> José,
>
> * José Luis Tallón (jltallon@adv-solutions.net) wrote:
> > On 12/04/2014 07:35 AM, Amit Kapila wrote:
> > >The number of worker backends that can be used for
> > >parallel seq scan can be configured by using a new GUC
> > >parallel_seqscan_degree, the default value of which is zero
> > >and it means parallel seq scan will not be considered unless
> > >user configures this value.
> >
> > The number of parallel workers should be capped (of course!) at the
> > maximum number of "processors" (cores/vCores, threads/hyperthreads)
> > available.
> >
> > Moreover, when load goes up, the relative cost of parallel working
> > should go up as well.
> > Something like:
> > p = number of cores
> > l = 1min-load
> >
> > additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(p-1)
> >
> > (for p > 1, of course)
>
> While I agree in general that we'll need to come up with appropriate
> acceptance criteria, etc, I don't think we want to complicate this patch
> with that initially. A SUSET GUC which caps the parallel GUC would be
> enough for an initial implementation, imv.
>
This is exactly what I have done in the patch.
> On 12/5/14, 9:08 AM, José Luis Tallón wrote:
>>
>>
>> Moreover, when load goes up, the relative cost of parallel working should go up as well.
>> Something like:
>> p = number of cores
>> l = 1min-load
>>
>> additional_cost = tuple estimate * cpu_tuple_cost * (l+1)/(p-1)
>>
>> (for p > 1, of course)
>
>
> ...
>
>> The parallel seq scan nodes are definitely the best approach for "parallel query", since the planner can optimize them based on cost.
>> I'm wondering about the ability to modify the implementation of some methods themselves once at execution time: given a previously planned query, chances are that, at execution time (I'm specifically thinking about prepared statements here), a different implementation of the same "node" might be more suitable and could be used instead while the condition holds.
>
>
> These comments got me wondering... would it be better to decide on
> parallelism during execution instead of at plan time? That would allow
> us to dynamically scale parallelism based on system load. If we don't
> even consider parallelism until we've pulled some number of
> tuples/pages from a relation, this would also eliminate all parallel
> overhead on small relations.

I think we have access to this information in the planner (RelOptInfo ->
pages); if we want, we can use that to eliminate small relations from
parallelism, but the question is how big a relation we want to consider
for parallelism. One way is to check via tests, which I am planning to
do; do you think we have any heuristic we can use to decide how big a
relation must be to be considered for parallelism?
On 4 December 2014 at 19:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Attached patch is just to facilitate the discussion about the
> parallel seq scan and may be some other dependent tasks like
> sharing of various states like combocid, snapshot with parallel
> workers. It is by no means ready to do any complex test; of course
> I will work towards making it more robust both in terms of adding
> more stuff and doing performance optimizations.
>
> Thoughts/Suggestions?
>

This is good news!
I've not gotten to look at the patch yet, but I thought you may be able
to make use of the attached at some point.

It's bare-bones core support for allowing aggregate states to be merged
together with another aggregate state. I would imagine that if a query
such as:

SELECT MAX(value) FROM bigtable;

was run, then a series of parallel workers could go off and each find
the max value from their portion of the table, and then perhaps some
other node type would take all the intermediate results from the
workers, once they're finished, and join all of the aggregate states
into one and return that. Naturally, you'd need to check that all
aggregates used in the targetlist had a merge function first.

This is just a few hours of work. I've not really tested the pg_dump
support or anything yet. I've also not added any new functions to allow
AVG() or COUNT() to work; I've really just re-used existing functions
where I could, as things like MAX() and BOOL_OR() can just make use of
the existing transition function. I thought that this might be enough
for early tests.

I'd imagine such a workload, ignoring IO overhead, should scale pretty
much linearly with the number of worker processes. Of course, if there
was a GROUP BY clause then the merger code would have to perform more
work.

If you think you might be able to make use of this, then I'm willing to
go off and write all the other merge functions required for the other
aggregates.
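To illustrate what David is describing, a merge function for MAX(int4)
is nearly trivial, since combining two partial MAX states just means
keeping the larger; this sketch is hypothetical (core's existing
int4larger already embodies the same logic as MAX's transition
function):

    /*
     * Hypothetical merge function for MAX(int4): each worker's partial
     * aggregate state is simply its local maximum, so merging two
     * states means returning the larger of the two.
     */
    Datum
    int4_max_merge(PG_FUNCTION_ARGS)
    {
        int32   state1 = PG_GETARG_INT32(0);
        int32   state2 = PG_GETARG_INT32(1);

        PG_RETURN_INT32(state1 > state2 ? state1 : state2);
    }

AVG() and COUNT() are the interesting cases, since their merge functions
must add counts and sums together rather than reuse the transition
function.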
* Amit Kapila (amit.kapila16@gmail.com) wrote:
> 1. As the patch currently stands, it just shares the relevant
> data (like relid, target list, block range each worker should
> perform on etc.) to the worker and then worker receives that
> data and form the planned statement which it will execute and
> send the results back to master backend. So the question
> here is do you think it is reasonable or should we try to form
> the complete plan for each worker and then share the same
> and may be other information as well like range table entries
> which are required. My personal gut feeling in this matter
> is that for long term it might be better to form the complete
> plan of each worker in master and share the same, however
> I think the current way as done in patch (okay that needs
> some improvement) is also not bad and quite easier to implement.

For my 2c, I'd like to see it support exactly what the SeqScan node
supports and then also what Foreign Scan supports. That would mean we'd
then be able to push filtering down to the workers, which would be
great. Even better would be figuring out how to parallelize an Append
node (perhaps only possible when the nodes underneath are all SeqScan
or ForeignScan nodes) since that would allow us to then parallelize the
work across multiple tables and remote servers.

One of the big reasons why I was asking about performance data is that,
today, we can't easily split a single relation across multiple i/o
channels. Sure, we can use RAID and get the i/o channel that the table
sits on faster than a single disk, and possibly fast enough that a
single CPU can't keep up, but that's not quite the same. The historical
recommendation for Hadoop nodes is around one CPU per drive (of course,
it'll depend on workload, etc, etc, but still) and while there's still
a lot of testing, etc, to be done before we can be sure about the
'right' answer for PG (and it'll also vary based on workload, etc),
that strikes me as a pretty reasonable rule-of-thumb to go on.

Of course, I'm aware that this won't be as easy to implement..

> 2. Next question related to above is what should be the
> output of ExplainPlan, as currently worker is responsible
> for forming its own plan, Explain Plan is not able to show
> the detailed plan for each worker, is that okay?

I'm not entirely following this. How can the worker be responsible for
its own "plan" when the information passed to it (per the above
paragraph..) is pretty minimal? In general, I don't think we need to
have specifics like "this worker is going to do exactly X" because we
will eventually need some communication to happen between the worker
and the master process, where the worker can ask for more work because
it's finished what it was tasked with and the master will need to give
it another chunk of work to do. I don't think we want exactly what each
worker process will do to be fully formed at the outset because, even
with the best information available, given concurrent load on the
system, it's not going to be perfect and we'll end up starving workers.
The plan, as formed by the master, should be more along the lines of
"this is what I'm gonna have my workers do" along w/ how many workers,
etc, and then it goes and does it. Perhaps for an 'explain analyze' we
return information about what workers actually *did* what, but that's
a whole different discussion.

> 3. Some places where optimizations are possible:
> - Currently after getting the tuple from heap, it is deformed by
> worker and sent via message queue to master backend, master
> backend then forms the tuple and send it to upper layer which
> before sending it to frontend again deforms it via
> slot_getallattrs(slot).

If this is done as I was proposing above, we might be able to avoid
this, but I don't know that it's a huge issue either way.. The bigger
issue is getting the filtering pushed down.

> - Master backend currently receives the data from multiple workers
> serially. We can optimize in a way that it can check other queues,
> if there is no data in current queue.

Yes, this is pretty critical. In fact, it's one of the recommendations
I made previously about how to change the Append node to parallelize
Foreign Scan node work.

> - Master backend is just responsible for coordination among workers
> It shares the required information to workers and then fetch the
> data processed by each worker, by using some more logic, we might
> be able to make master backend also fetch data from heap rather than
> doing just co-ordination among workers.

I don't think this is really necessary...

> I think in all above places we can do some optimisation, however
> we can do that later as well, unless they hit the performance badly for
> cases which people care most.

I agree that we can improve the performance through various
optimizations later, but it's important to get the general structure
and design right or we'll end up having to reimplement a lot of it.

> 4. Should parallel_seqscan_degree value be dependent on other
> backend processes like MaxConnections, max_worker_processes,
> autovacuum_max_workers do or should it be independent like
> max_wal_senders?

Well, we're not going to be able to spin off more workers than we have
process slots, but I'm not sure we need anything more than that? In
any case, this is definitely an area we can work on improving later
and I don't think it really impacts the rest of the design.

Thanks,

Stephen
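On the point about checking other queues when the current one is empty,
a rough sketch of such a round-robin receive loop using the existing
shm_mq API in non-blocking mode might look like this (the helper names
are hypothetical):

    /*
     * Sketch of a round-robin fetch across per-worker queues.  With
     * nowait = true, shm_mq_receive() returns SHM_MQ_WOULD_BLOCK
     * instead of sleeping, so the master can move on to the next
     * worker's queue.  A real implementation would WaitLatch() once
     * every queue has been found empty, rather than spinning.
     */
    static HeapTuple
    fetch_next_tuple(shm_mq_handle **queues, int nqueues, int *current)
    {
        for (;;)
        {
            Size            nbytes;
            void           *data;
            shm_mq_result   res;

            res = shm_mq_receive(queues[*current], &nbytes, &data, true);
            if (res == SHM_MQ_SUCCESS)
                return form_tuple_from_message(data, nbytes);  /* hypothetical */
            if (res == SHM_MQ_DETACHED)
                mark_worker_finished(*current);                /* hypothetical */

            *current = (*current + 1) % nqueues;
        }
    }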
On Sat, Dec 6, 2014 at 12:13 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> It's bare-bones core support for allowing aggregate states to be merged
> together with another aggregate state. I would imagine that if a query
> such as:
>
> SELECT MAX(value) FROM bigtable;
>
> was run, then a series of parallel workers could go off and each find
> the max value from their portion of the table and then perhaps some
> other node type would then take all the intermediate results from the
> workers, once they're finished, and join all of the aggregate states
> into one and return that. Naturally, you'd need to check that all
> aggregates used in the targetlist had a merge function first.

I think this is great infrastructure and could also be useful for
pushing down aggregates in cases involving foreign data wrappers. But
I suggest we discuss it on a separate thread because it's not related
to parallel seq scan per se.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think we have access to this information in the planner (RelOptInfo ->
> pages); if we want, we can use that to eliminate small relations from
> parallelism, but the question is how big a relation we want to consider
> for parallelism. One way is to check via tests, which I am planning to
> do; do you think we have any heuristic we can use to decide how big a
> relation must be to be considered for parallelism?

Surely the Path machinery needs to decide this in particular cases
based on cost. We should assign some cost to starting a parallel
worker via some new GUC, like parallel_startup_cost = 100,000. And
then we should also assign a cost to the act of relaying a tuple from
the parallel worker to the master, maybe cpu_tuple_cost (or some new
GUC). For a small relation, or a query with a LIMIT clause, the
parallel startup cost will make starting a lot of workers look
unattractive, but for bigger relations it will make sense from a cost
perspective, which is exactly what we want.

There are probably other important considerations based on goals for
overall resource utilization, and also because at a certain point
adding more workers won't help because the disk will be saturated. I
don't know exactly what we should do about those issues yet, but the
steps described in the previous paragraph seem like a good place to
start anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
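The shape of such costing might look something like the sketch below;
the GUC names follow Robert's suggestion and the function itself is
illustrative rather than from the patch:

    /*
     * Illustrative costing for a parallel seq scan path: charge a
     * startup cost per worker, divide the scan work across the master
     * plus its workers, and charge per tuple relayed to the master.
     */
    static void
    cost_parallel_seqscan(Path *path, RelOptInfo *baserel, int nworkers)
    {
        Cost    startup_cost = parallel_startup_cost * nworkers;
        Cost    run_cost;

        /* disk and CPU work is split across master + workers ... */
        run_cost = (seq_page_cost * baserel->pages +
                    cpu_tuple_cost * baserel->tuples) / (nworkers + 1);

        /* ... but every returned tuple pays a communication toll */
        run_cost += cpu_tuple_comm_cost * baserel->tuples;

        path->startup_cost = startup_cost;
        path->total_cost = startup_cost + run_cost;
    }

Under this shape, small relations and LIMIT queries naturally lose: the
startup term dominates, which is exactly the behavior described above.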
On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfrost@snowman.net> wrote:
> For my 2c, I'd like to see it support exactly what the SeqScan node
> supports and then also what Foreign Scan supports. That would mean we'd
> then be able to push filtering down to the workers which would be great.
> Even better would be figuring out how to parallelize an Append node
> (perhaps only possible when the nodes underneath are all SeqScan or
> ForeignScan nodes) since that would allow us to then parallelize the
> work across multiple tables and remote servers.

I don't see how we can support the stuff ForeignScan does; presumably
any parallelism there is up to the FDW to implement, using whatever
in-core tools we provide. I do agree that parallelizing Append nodes
is useful; but let's get one thing done first before we start trying
to do thing #2.

> I'm not entirely following this. How can the worker be responsible for
> its own "plan" when the information passed to it (per the above
> paragraph..) is pretty minimal? In general, I don't think we need to
> have specifics like "this worker is going to do exactly X" because we
> will eventually need some communication to happen between the worker and
> the master process where the worker can ask for more work because it's
> finished what it was tasked with and the master will need to give it
> another chunk of work to do. I don't think we want exactly what each
> worker process will do to be fully formed at the outset because, even
> with the best information available, given concurrent load on the
> system, it's not going to be perfect and we'll end up starving workers.
> The plan, as formed by the master, should be more along the lines of
> "this is what I'm gonna have my workers do" along w/ how many workers,
> etc, and then it goes and does it. Perhaps for an 'explain analyze' we
> return information about what workers actually *did* what, but that's a
> whole different discussion.

I agree with this. For a first version, I think it's OK to start a
worker up for a particular sequential scan and have it help with that
sequential scan until the scan is completed, and then exit. It should
not, as the present version of the patch does, assign a fixed block
range to each worker; instead, workers should allocate a block or
chunk of blocks to work on until no blocks remain. That way, even if
every worker but one gets stuck, the rest of the scan can still
finish. Eventually, we will want to be smarter about sharing workers
between multiple parts of the plan, but I think it is just fine to
leave that as a future enhancement for now.

>> - Master backend is just responsible for coordination among workers
>> It shares the required information to workers and then fetch the
>> data processed by each worker, by using some more logic, we might
>> be able to make master backend also fetch data from heap rather than
>> doing just co-ordination among workers.
>
> I don't think this is really necessary...

I think it would be an awfully good idea to make this work. The master
thread may be significantly faster than any of the others because it
has no IPC costs. We don't want to leave our best resource sitting on
the bench.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
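The "claim a chunk of blocks until none remain" scheme Robert describes
could be as simple as a shared counter under a spinlock; everything
below is an illustrative sketch, not code from the patch:

    /*
     * Hypothetical shared-memory state for dynamic block allocation:
     * workers call claim_next_chunk() until it returns false, so a
     * stuck worker only delays its current chunk, never the whole scan.
     */
    typedef struct ParallelScanState
    {
        slock_t     mutex;
        BlockNumber next_block;     /* first unclaimed block */
        BlockNumber nblocks;        /* blocks in the relation */
    } ParallelScanState;

    #define SCAN_CHUNK_SIZE 64      /* blocks handed out per claim */

    static bool
    claim_next_chunk(ParallelScanState *pss,
                     BlockNumber *start, BlockNumber *end)
    {
        bool    found = false;

        SpinLockAcquire(&pss->mutex);
        if (pss->next_block < pss->nblocks)
        {
            *start = pss->next_block;
            pss->next_block = Min(pss->next_block + SCAN_CHUNK_SIZE,
                                  pss->nblocks);
            *end = pss->next_block;     /* exclusive upper bound */
            found = true;
        }
        SpinLockRelease(&pss->mutex);

        return found;
    }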
>
> On Sat, Dec 6, 2014 at 1:50 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think we have access to this information in the planner (RelOptInfo ->
> > pages); if we want, we can use that to eliminate small relations from
> > parallelism, but the question is how big a relation we want to consider
> > for parallelism. One way is to check via tests, which I am planning to
> > do; do you think we have any heuristic we can use to decide how big a
> > relation must be to be considered for parallelism?
>
> Surely the Path machinery needs to decide this in particular cases
> based on cost. We should assign some cost to starting a parallel
> worker via some new GUC, like parallel_startup_cost = 100,000. And
> then we should also assign a cost to the act of relaying a tuple from
> the parallel worker to the master, maybe cpu_tuple_cost (or some new
> GUC). For a small relation, or a query with a LIMIT clause, the
> parallel startup cost will make starting a lot of workers look
> unattractive, but for bigger relations it will make sense from a cost
> perspective, which is exactly what we want.
>
> There are probably other important considerations based on goals for
> overall resource utilization, and also because at a certain point
> adding more workers won't help because the disk will be saturated. I
> don't know exactly what we should do about those issues yet, but the
> steps described in the previous paragraph seem like a good place to
> start anyway.
>
Agreed.
>
> On Sat, Dec 6, 2014 at 7:07 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > For my 2c, I'd like to see it support exactly what the SeqScan node
> > supports and then also what Foreign Scan supports. That would mean we'd
> > then be able to push filtering down to the workers which would be great.
> > Even better would be figuring out how to parallelize an Append node
> > (perhaps only possible when the nodes underneath are all SeqScan or
> > ForeignScan nodes) since that would allow us to then parallelize the
> > work across multiple tables and remote servers.
>
> I don't see how we can support the stuff ForeignScan does; presumably
> any parallelism there is up to the FDW to implement, using whatever
> in-core tools we provide. I do agree that parallelizing Append nodes
> is useful; but let's get one thing done first before we start trying
> to do thing #2.
>
> > I'm not entirely following this. How can the worker be responsible for
> > its own "plan" when the information passed to it (per the above
> > paragraph..) is pretty minimal? In general, I don't think we need to
> > have specifics like "this worker is going to do exactly X" because we
> > will eventually need some communication to happen between the worker and
> > the master process where the worker can ask for more work because it's
> > finished what it was tasked with and the master will need to give it
> > another chunk of work to do. I don't think we want exactly what each
> > worker process will do to be fully formed at the outset because, even
> > with the best information available, given concurrent load on the
> > system, it's not going to be perfect and we'll end up starving workers.
> > The plan, as formed by the master, should be more along the lines of
> > "this is what I'm gonna have my workers do" along w/ how many workers,
> > etc, and then it goes and does it. Perhaps for an 'explain analyze' we
> > return information about what workers actually *did* what, but that's a
> > whole different discussion.
>
> I agree with this. For a first version, I think it's OK to start a
> worker up for a particular sequential scan and have it help with that
> sequential scan until the scan is completed, and then exit. It should
> not, as the present version of the patch does, assign a fixed block
> range to each worker; instead, workers should allocate a block or
> chunk of blocks to work on until no blocks remain. That way, even if
> every worker but one gets stuck, the rest of the scan can still
> finish.
I will check on this point and see if it is feasible to do something on
those lines; basically, currently at Executor initialization phase, we
set the scan limits and then during Executor run phase use heap_getnext
to fetch the tuples accordingly, but doing it dynamically means at
ExecutorRun phase we need to reset the scan limit for which page/pages
to scan. I still have to check if there is any problem with such an
idea. Do you have any different idea in mind?
On Tue, Dec 9, 2014 at 12:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I agree with this. For a first version, I think it's OK to start a
>> worker up for a particular sequential scan and have it help with that
>> sequential scan until the scan is completed, and then exit. It should
>> not, as the present version of the patch does, assign a fixed block
>> range to each worker; instead, workers should allocate a block or
>> chunk of blocks to work on until no blocks remain. That way, even if
>> every worker but one gets stuck, the rest of the scan can still
>> finish.
>
> I will check on this point and see if it is feasible to do something on
> those lines; basically, currently at Executor initialization phase, we
> set the scan limits and then during Executor run phase use heap_getnext
> to fetch the tuples accordingly, but doing it dynamically means at
> ExecutorRun phase we need to reset the scan limit for which page/pages
> to scan. I still have to check if there is any problem with such an
> idea. Do you have any different idea in mind?

Hmm. Well, it looks like there are basically two choices: you can
either (as you propose) deal with this above the level of the
heap_beginscan/heap_getnext API, by scanning one or a few pages at a
time and then resetting the scan to a new starting page via
heap_setscanlimits; or alternatively, you can add a callback to
HeapScanDescData that, if non-NULL, will be invoked to get the next
block number to scan. I'm not entirely sure which is better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
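The second of those options might look roughly like this; the typedef
and fields are hypothetical additions to HeapScanDescData:

    /*
     * Hypothetical callback type: returns the next block to scan, or
     * InvalidBlockNumber when no blocks remain.
     */
    typedef BlockNumber (*NextBlockCallback) (HeapScanDesc scan, void *arg);

    /*
     * Imagined additions to HeapScanDescData:
     *     NextBlockCallback rs_nextblock;
     *     void             *rs_nextblock_arg;
     *
     * heapgettup() would then advance like this instead of assuming
     * the next block is simply the current one plus one:
     */
    if (scan->rs_nextblock != NULL)
        page = scan->rs_nextblock(scan, scan->rs_nextblock_arg);
    else
        page++;                     /* existing sequential behavior */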
>
> On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >
>
> So to summarize my understanding, below are the set of things
> which I should work on and in the order they are listed.
>
> 1. Push down qualification
> 2. Performance Data
> 3. Improve the way to push down the information related to worker.
> 4. Dynamic allocation of work for workers.
>
>
I have worked on the patch to accomplish the above-mentioned points 1,
2 and partly 3, and would like to share the progress with the community.
RAM = 492GB

Test-1: Selection_criteria – 1% of rows will be selected

 num_workers | exec_time (ms)
-------------+----------------
           0 |         229534
           2 |         121741
           4 |          67051
           8 |          35607
          16 |          24743

Test-2: Selection_criteria – 10% of rows will be selected

 num_workers | exec_time (ms)
-------------+----------------
           0 |         226671
           2 |         151587
           4 |          93648
           8 |          70540
          16 |          55466

Test-3:

 num_workers | exec_time (ms)
-------------+----------------
           0 |         232673
           2 |         197609
           4 |         142686
           8 |         111664
          16 |          98097

Test-4: Selection_criteria – 1% of rows will be selected

 num_workers | exec_time (ms)
-------------+----------------
           0 |          15505
           2 |           9155
           4 |           6030
           8 |           4523
          16 |           4459
          32 |           8259
          64 |          13388

Test-5: Selection_criteria – 5% of rows will be selected

 num_workers | exec_time (ms)
-------------+----------------
           0 |          18906
           2 |          13446
           4 |           8970
           8 |           7887
          16 |          10403

Test-6: Selection_criteria – 10% of rows will be selected

 num_workers | exec_time (ms)
-------------+----------------
           0 |          16132
           2 |          23780
           4 |          20275
           8 |          11390
          16 |          11418
Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> 1. Parallel workers help a lot when there is an expensive qualification
> to be evaluated; the more expensive the qualification, the better the
> results.

I'd certainly hope so. ;)

> 2. It works well for low-selectivity quals; as the selectivity
> increases, the benefit tends to go down due to the additional tuple
> communication cost between workers and the master backend.

I'm a bit sad to hear that the communication between workers and the
master backend is already being a bottleneck. Now, that said, the box
you're playing with looks to be pretty beefy and therefore the i/o
subsystem might be particularly good, but generally speaking, it's a
lot faster to move data in memory than it is to pull it off disk, and
so I wouldn't expect the tuple communication between processes to
really be the bottleneck...

> 3. After a certain point, having more workers won't help and will
> rather have a negative impact; refer Test-4.

Yes, I see that too and it's also interesting- have you been able to
identify why? What is the overhead (specifically) which is causing
that?

> I think as discussed previously we need to introduce 2 additional cost
> variables (parallel_startup_cost, cpu_tuple_communication_cost) to
> estimate the parallel seq scan cost so that when the tables are small
> or selectivity is high, it should increase the cost of parallel plan.

I agree that we need to figure out a way to cost out parallel plans,
but I have doubts about these being the right way to do that. There
has been quite a bit of literature regarding parallel execution and
planning- have you had a chance to review anything along those lines?
We certainly like to draw on previous experiences and analysis rather
than trying to pave our own way.

With these additional costs comes the consideration that we're looking
for a wall-clock runtime proxy and therefore, while we need to add
costs for parallel startup and tuple communication, we have to reduce
the overall cost because of the parallelism or we'd never end up
choosing a parallel plan. Is the thought to simply add up all the
costs and then divide? Or perhaps to divide the cost of the actual
plan but then add in the parallel startup cost and the tuple
communication cost?

Perhaps there has been prior discussion on these points, but I'm
thinking we need a README or similar which discusses all of this and
includes any references out to academic papers or similar as
appropriate.

Thanks!

Stephen
On Fri, Dec 19, 2014 at 7:51 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> 3. After a certain point, having more workers won't help and will
>> rather have a negative impact; refer Test-4.
>
> Yes, I see that too and it's also interesting- have you been able to
> identify why? What is the overhead (specifically) which is causing
> that?

Let's rewind. Amit's results show that, with a naive algorithm
(pre-distributing equal-sized chunks of the relation to every worker)
and a fairly naive first cut at how to pass tuples around (I believe
largely from what I did in pg_background), he can sequential-scan a
table with 8 workers at 6.4 times the speed of a single process, and
you're complaining because it's not efficient enough? It's a first
draft! Be happy we got 6.4x, for crying out loud!

The barrier to getting parallel sequential scan (or any parallel
feature at all) committed is not going to be whether an 8-way scan is
6.4 times faster or 7.1 times faster or 7.8 times faster. It's going
to be whether it's robust and won't break things. We should be
focusing most of our effort here on identifying and fixing robustness
problems. I'd vote to commit a feature like this with a 3x
performance speedup if I thought it was robust enough.

I'm not saying we shouldn't try to improve the performance here - we
definitely should. But I don't think we should say, oh, an 8-way scan
isn't good enough, we need a 16-way or 32-way scan in order for this
to be efficient. That is getting your priorities quite mixed up.

>> I think as discussed previously we need to introduce 2 additional cost
>> variables (parallel_startup_cost, cpu_tuple_communication_cost) to
>> estimate the parallel seq scan cost so that when the tables are small
>> or selectivity is high, it should increase the cost of parallel plan.
>
> I agree that we need to figure out a way to cost out parallel plans, but
> I have doubts about these being the right way to do that. There has
> been quite a bit of literature regarding parallel execution and
> planning- have you had a chance to review anything along those lines?
> We certainly like to draw on previous experiences and analysis rather
> than trying to pave our own way.

I agree that it would be good to review the literature, but am not
aware of anything relevant. Could you (or can anyone) provide some
links?

> With these additional costs comes the consideration that we're looking
> for a wall-clock runtime proxy and therefore, while we need to add costs
> for parallel startup and tuple communication, we have to reduce the
> overall cost because of the parallelism or we'd never end up choosing a
> parallel plan. Is the thought to simply add up all the costs and then
> divide? Or perhaps to divide the cost of the actual plan but then add
> in the parallel startup cost and the tuple communication cost?

This has been discussed, on this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Dec 19, 2014 at 7:51 AM, Stephen Frost <sfrost@snowman.net> wrote:
> >> 3. After a certain point, having more workers won't help and will
> >> rather have a negative impact; refer Test-4.
> >
> > Yes, I see that too and it's also interesting- have you been able to
> > identify why? What is the overhead (specifically) which is causing
> > that?
>
> Let's rewind. Amit's results show that, with a naive algorithm
> (pre-distributing equal-sized chunks of the relation to every worker)
> and a fairly naive first cut at how to pass tuples around (I believe
> largely from what I did in pg_background), he can sequential-scan a
> table with 8 workers at 6.4 times the speed of a single process, and
> you're complaining because it's not efficient enough? It's a first
> draft! Be happy we got 6.4x, for crying out loud!

He also showed cases where parallelizing a query even with just two
workers caused a serious increase in the total runtime (Test-6). Even
having four workers was slower in that case; a modest performance
improvement was reached at eight, but then no improvement from that
was seen when running with 16. Being able to understand what's
happening will inform how we cost this to, hopefully, achieve the 6.4x
gains where we can and avoid the pitfalls of performing worse than a
single thread in cases where parallelism doesn't help.

What would likely be very helpful in the analysis would be CPU time
information- when running with eight workers, were we using 800% CPU
(8x 100%), or something less (perhaps due to locking, i/o, or other
processes)?

Perhaps it's my fault for not being surprised that a naive first cut
gives us such gains, as my experience with parallel operations and PG
has generally been very good (through the use of multiple connections
to the DB and therefore independent transactions, of course). I'm very
excited that we're making such great progress towards having parallel
execution in the DB, as I've often used PG in data warehouse
use-cases.

> The barrier to getting parallel sequential scan (or any parallel
> feature at all) committed is not going to be whether an 8-way scan is
> 6.4 times faster or 7.1 times faster or 7.8 times faster. It's going
> to be whether it's robust and won't break things. We should be
> focusing most of our effort here on identifying and fixing robustness
> problems. I'd vote to commit a feature like this with a 3x
> performance speedup if I thought it was robust enough.

I don't have any problem if an 8-way scan is 6.4x faster or if it's
7.1 times faster, but what if that 3x performance speedup is only
achieved when running with 8 CPUs at 100%? We'd have to coach our
users to constantly be tweaking the enable_parallel_query (or
whatever) option for the queries where it helps and turning it off for
others. I'm not so excited about that.

> I'm not saying we shouldn't try to improve the performance here - we
> definitely should. But I don't think we should say, oh, an 8-way scan
> isn't good enough, we need a 16-way or 32-way scan in order for this
> to be efficient. That is getting your priorities quite mixed up.

I don't think I said that. What I was getting at is that we need a
cost system which accounts for the costs accurately enough that we
don't end up with worse performance than single-threaded operation.
In general, I don't expect that to be very difficult, and we can be
conservative in the initial releases to hopefully avoid regressions,
but it absolutely needs consideration.

> >> I think as discussed previously we need to introduce 2 additional cost
> >> variables (parallel_startup_cost, cpu_tuple_communication_cost) to
> >> estimate the parallel seq scan cost so that when the tables are small
> >> or selectivity is high, it should increase the cost of parallel plan.
> >
> > I agree that we need to figure out a way to cost out parallel plans, but
> > I have doubts about these being the right way to do that. There has
> > been quite a bit of literature regarding parallel execution and
> > planning- have you had a chance to review anything along those lines?
> > We certainly like to draw on previous experiences and analysis rather
> > than trying to pave our own way.
>
> I agree that it would be good to review the literature, but am not
> aware of anything relevant. Could you (or can anyone) provide some
> links?

There's certainly documentation available from the other RDBMSs which
already support parallel query, as one source. Other academic papers
exist (and once you've linked into one, the references and prior work
help bring in others). Sadly, I don't currently have ACM access (might
have to change that..), but there are publicly available papers also,
such as:

http://i.stanford.edu/pub/cstr/reports/cs/tr/96/1570/CS-TR-96-1570.pdf
http://www.vldb.org/conf/1998/p251.pdf
http://www.cs.uiuc.edu/class/fa05/cs591han/sigmodpods04/sigmod/pdf/I-001c.pdf

> > With these additional costs comes the consideration that we're looking
> > for a wall-clock runtime proxy and therefore, while we need to add costs
> > for parallel startup and tuple communication, we have to reduce the
> > overall cost because of the parallelism or we'd never end up choosing a
> > parallel plan. Is the thought to simply add up all the costs and then
> > divide? Or perhaps to divide the cost of the actual plan but then add
> > in the parallel startup cost and the tuple communication cost?
>
> This has been discussed, on this thread.

Fantastic. What I found in the patch was:

+ /*
+  * We simply assume that cost will be equally shared by parallel
+  * workers which might not be true especially for doing disk access.
+  * XXX - We would like to change these values based on some concrete
+  * tests.
+  */

What I asked for was:

----
I'm thinking we need a README or similar which discusses all of this
and includes any references out to academic papers or similar as
appropriate.
----

Perhaps it doesn't deserve its own README, but we clearly need more.

Thanks!

Stephen
On 12/19/14 3:27 PM, Stephen Frost wrote:
> We'd have to coach our users to
> constantly be tweaking the enable_parallel_query (or whatever) option
> for the queries where it helps and turning it off for others. I'm not
> so excited about that.

I'd be perfectly (that means 100%) happy if it just defaulted to off,
but I could turn it up to 11 whenever I needed it. I don't believe I'm
the only one with this opinion, either.


.marko
* Marko Tiikkaja (marko@joh.to) wrote:
> On 12/19/14 3:27 PM, Stephen Frost wrote:
> > We'd have to coach our users to
> > constantly be tweaking the enable_parallel_query (or whatever) option
> > for the queries where it helps and turning it off for others. I'm not
> > so excited about that.
>
> I'd be perfectly (that means 100%) happy if it just defaulted to off,
> but I could turn it up to 11 whenever I needed it. I don't believe I'm
> the only one with this opinion, either.

Perhaps we should reconsider our general position on hints then and
add them so users can define the plan to be used.. For my part, I
don't see this as all that much different.

Consider if we were just adding HashJoin support today as an example.
Would we be happy if we had to default to enable_hashjoin = off? Or if
users had to do that regularly because our costing was horrid? It's
bad enough that we have to resort to those tweaks today in rare cases.

Thanks,

Stephen
On Fri, Dec 19, 2014 at 9:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Perhaps we should reconsider our general position on hints then and
> add them so users can define the plan to be used.. For my part, I
> don't see this as all that much different.
>
> Consider if we were just adding HashJoin support today as an example.
> Would we be happy if we had to default to enable_hashjoin = off? Or if
> users had to do that regularly because our costing was horrid? It's
> bad enough that we have to resort to those tweaks today in rare cases.

If you're proposing that it is not reasonable to have a GUC that
limits the degree of parallelism, then I think that's outright crazy:
that is probably the very first GUC we need to add. New query
processing capabilities can entail new controlling GUCs, and
parallelism, being as complex as it is, will probably add several of
them.

But the big picture here is that if you want to ever have parallelism
in PostgreSQL at all, you're going to have to live with the first
version being pretty crude. I think it's quite likely that the first
version of parallel sequential scan will be just as buggy as Hot
Standby was when we first added it, or as buggy as the multi-xact code
was when it went in, and probably subject to an even greater variety
of taxing limitations than any feature we've committed in the 6 years
I've been involved in the project. We get to pick between that and
not having it at all.

I'll take a look at the papers you sent about parallel query
optimization, but personally I think that's putting the cart not only
before the horse but also before the road. For V1, we need a query
optimization model that does not completely suck - no more. The key
criterion here is that this has to WORK. There will be time enough to
improve everything else once we reach that goal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/19/2014 04:39 PM, Stephen Frost wrote:
> * Marko Tiikkaja (marko@joh.to) wrote:
> > I'd be perfectly (that means 100%) happy if it just defaulted to off,
> > but I could turn it up to 11 whenever I needed it. I don't believe I'm
> > the only one with this opinion, either.
>
> Perhaps we should reconsider our general position on hints then and
> add them so users can define the plan to be used.. For my part, I
> don't see this as all that much different.
>
> Consider if we were just adding HashJoin support today as an example.
> Would we be happy if we had to default to enable_hashjoin = off? Or if
> users had to do that regularly because our costing was horrid? It's
> bad enough that we have to resort to those tweaks today in rare cases.

This is somewhat different. Imagine that we achieve perfect
parallelization, so that when you set enable_parallel_query=8, every
query runs exactly 8x faster on an 8-core system, by using all eight
cores.

Now, you might still want to turn parallelization off, or at least set
it to a lower setting, on an OLTP system. You might not want a single
query to hog all CPUs to run one query faster; you'd want to leave
some for other queries. In particular, if you run a mix of short
transactions and some background-like tasks that run for minutes or
hours, you do not want to starve the short transactions by giving all
eight CPUs to the background task.

Admittedly, this is a rather crude knob to tune for such things, but
it's quite intuitive to a DBA: how many CPU cores is one query allowed
to utilize? And we don't really have anything better.

In real life, there's always some overhead to parallelization, so that
even if you can make one query run faster by doing it, you might hurt
overall throughput. To some extent, it's a latency vs. throughput
tradeoff, and it's quite reasonable to have a GUC for that because
people have different priorities.

- Heikki
On 20/12/14 03:54, Heikki Linnakangas wrote:
> This is somewhat different. Imagine that we achieve perfect
> parallelization, so that when you set enable_parallel_query=8, every
> query runs exactly 8x faster on an 8-core system, by using all eight
> cores.
>
> Now, you might still want to turn parallelization off, or at least set
> it to a lower setting, on an OLTP system. You might not want a single
> query to hog all CPUs to run one query faster; you'd want to leave
> some for other queries. In particular, if you run a mix of short
> transactions and some background-like tasks that run for minutes or
> hours, you do not want to starve the short transactions by giving all
> eight CPUs to the background task.
>
> Admittedly, this is a rather crude knob to tune for such things, but
> it's quite intuitive to a DBA: how many CPU cores is one query allowed
> to utilize? And we don't really have anything better.
>
> In real life, there's always some overhead to parallelization, so that
> even if you can make one query run faster by doing it, you might hurt
> overall throughput. To some extent, it's a latency vs. throughput
> tradeoff, and it's quite reasonable to have a GUC for that because
> people have different priorities.

How about 3 numbers:

    minCPUs           # > 0
    maxCPUs           # >= minCPUs
    fractionOfCPUs    # rounded up

If you just have the number of CPUs, then a setting that is appropriate
for a quad core may be too small for an octo core processor. If you
just have the fraction of CPUs, then a setting that is appropriate for
a quad core may be too large for an octo core processor.

Cheers,
Gavin
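Gavin's three knobs reduce to a small clamp; here is a sketch with
invented names, assuming the core count is already known:

    #include <math.h>       /* ceil() */

    /*
     * Hypothetical computation of the allowed parallel degree from the
     * three settings: scale by the fraction of cores (rounded up),
     * then clamp into [min_cpus, max_cpus].
     */
    static int
    effective_parallel_degree(int min_cpus, int max_cpus,
                              double fraction_of_cpus, int n_cores)
    {
        int     degree = (int) ceil(fraction_of_cpus * n_cores);

        if (degree < min_cpus)
            degree = min_cpus;
        if (degree > max_cpus)
            degree = max_cpus;

        return degree;
    }

On a quad core, fraction 0.5 with min 1 and max 8 yields 2; the same
settings yield 4 on an octo core, which addresses the portability
concern raised above.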
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Dec 19, 2014 at 9:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > Perhaps we should reconsider our general position on hints then and add them so users can define the plan to be used.. For my part, I don't see this as all that much different.
> >
> > Consider if we were just adding HashJoin support today as an example. Would we be happy if we had to default to enable_hashjoin = off? Or if users had to do that regularly because our costing was horrid? It's bad enough that we have to resort to those tweaks today in rare cases.
>
> If you're proposing that it is not reasonable to have a GUC that limits the degree of parallelism, then I think that's outright crazy:

I'm pretty sure that I didn't say anything along those lines. I'll try to be clearer. What I'd like is such a GUC that we can set at a reasonable default of, say, 4, and trust that our planner will generally do the right thing. Clearly, this may be something which admins have to tweak, but what I would really like to avoid is users having to set this GUC explicitly for each of their queries.

> that is probably the very first GUC we need to add. New query processing capabilities can entail new controlling GUCs, and parallelism, being as complex as it is, will probably add several of them.

That's fine if they're intended for debugging issues or dealing with unexpected bugs or issues, but let's not go into this thinking we should add GUCs which are geared with the expectation of users tweaking them regularly.

> But the big picture here is that if you want to ever have parallelism in PostgreSQL at all, you're going to have to live with the first version being pretty crude. I think it's quite likely that the first version of parallel sequential scan will be just as buggy as Hot Standby was when we first added it, or as buggy as the multi-xact code was when it went in, and probably subject to an even greater variety of taxing limitations than any feature we've committed in the 6 years I've been involved in the project. We get to pick between that and not having it at all.

If it's disabled by default then I'm worried it won't really improve until it is. Perhaps that's setting a higher bar than you feel is necessary but, for my part at least, it doesn't feel like a very high bar.

> I'll take a look at the papers you sent about parallel query optimization, but personally I think that's putting the cart not only before the horse but also before the road. For V1, we need a query optimization model that does not completely suck - no more. The key criterion here is that this has to WORK. There will be time enough to improve everything else once we reach that goal.

I agree that it's got to work, but it also needs to be generally well designed, and have the expectation of being on by default.

Thanks,

Stephen
* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
> On 12/19/2014 04:39 PM, Stephen Frost wrote:
> [snip]
>
> This is somewhat different. Imagine that we achieve perfect parallelization, so that when you set enable_parallel_query=8, every query runs exactly 8x faster on an 8-core system, by using all eight cores.

To be clear, as I mentioned to Robert just now, I'm not objecting to a GUC being added to turn off or control parallelization. I don't want such a GUC to be a crutch for us to lean on when it comes to questions about the optimizer though. We need to work through the optimizer questions of "should this be parallelized?" and, perhaps later, "how many ways is it sensible to parallelize this?" I'm worried we'll take such a GUC as a directive along the lines of "we are being told to parallelize to exactly this level every time and for every query which can be."

The GUC should be an input into the planner/optimizer much the way enable_hashjoin is, unless it's being done as a *limiting* factor for the administrator to be able to control, but we've generally avoided doing that (see: work_mem) and, if we're going to start, we should probably come up with an approach that addresses the considerations for other resources too.

Thanks,

Stephen
>
> Amit,
>
> * Amit Kapila (amit.kapila16@gmail.com) wrote:
> > 1. Parallel workers help a lot when there is an expensive qualification
> > to be evaluated; the more expensive the qualification, the better the
> > results.
>
> I'd certainly hope so. ;)
>
> > 2. It works well for low selectivity quals and as the selectivity increases,
> > the benefit tends to go down due to additional tuple communication cost
> > between workers and master backend.
>
> I'm a bit sad to hear that the communication between workers and the
> master backend is already being a bottleneck. Now, that said, the box
> you're playing with looks to be pretty beefy and therefore the i/o
> subsystem might be particularly good, but generally speaking, it's a lot
> faster to move data in memory than it is to pull it off disk, and so I
> wouldn't expect the tuple communication between processes to really be
> the bottleneck...
>
> > 3. After a certain point, having more workers won't help and will
> > rather have a negative impact; refer Test-4.
>
> Yes, I see that too and it's also interesting- have you been able to
> identify why? What is the overhead (specifically) which is causing
> that?
>
> > I think as discussed previously we need to introduce 2 additional cost
> > variables (parallel_startup_cost, cpu_tuple_communication_cost) to
> > estimate the parallel seq scan cost so that when the tables are small
> > or selectivity is high, it should increase the cost of parallel plan.
>
> I agree that we need to figure out a way to cost out parallel plans, but
> I have doubts about these being the right way to do that. There has
> been quite a bit of literature regarding parallel execution and
> planning- have you had a chance to review anything along those lines?
> I'd rather we build on that prior work than trying to pave our own way.
>
> With these additional costs comes the consideration that we're looking
> for a wall-clock runtime proxy and therefore, while we need to add costs
> for parallel startup and tuple communication, we have to reduce the
> overall cost because of the parallelism or we'd never end up choosing a
> parallel plan. Is the thought to simply add up all the costs and then
> divide? Or perhaps to divide the cost of the actual plan but then add
> in the parallel startup cost and the tuple communication cost?
>
> Perhaps there has been prior discussion on these points but I'm thinking
> we need a README or similar which discusses all of this and includes any
> references out to academic papers or similar as appropriate.
>
On 12/21/14, 12:42 AM, Amit Kapila wrote:
> On Fri, Dec 19, 2014 at 6:21 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> a. Instead of passing value array, just pass tuple id, but retain the buffer pin till master backend reads the tuple based on tupleid. This has side effect that we have to retain buffer pin for longer period of time, but again that might not have any problem in real world usage of parallel query.
>
> b. Instead of passing value array, pass directly the tuple which could be directly propagated by master backend to upper layer or otherwise in master backend change some code such that it could propagate the tuple array received via shared memory queue directly to frontend. Basically save the one extra cycle of form/deform tuple.
>
> Both these need some new message type and handling for same in Executor code.
>
> Having said above, I think we can try to optimize this in multiple ways, however we need additional mechanism and changes in Executor code which is error prone and doesn't seem to be important at this stage where we want the basic feature to work.

Would (b) require some means of ensuring we didn't try and pass raw tuples to frontends? Other than that potential wrinkle, it seems like less work than (a).

...

> I think there are mainly two things which can lead to benefit by employing parallel workers
> a. Better use of available I/O bandwidth
> b. Better use of available CPU's by doing expression evaluation by multiple workers.

...

> In the above tests, it seems to me that the maximum benefit due to 'a' is realized upto 4~8 workers

I'd think a good first estimate here would be to just use effective_io_concurrency.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
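If effective_io_concurrency were taken as that first estimate, the planner-side cap could be as small as the following sketch (parallel_seqscan_degree is the patch's GUC; effective_io_concurrency and the Min() macro already exist in PostgreSQL; whether this is the right place to apply the cap is an assumption):

    /*
     * Sketch: never plan more scan workers than the i/o subsystem is
     * configured to keep busy.
     */
    static int
    io_capped_workers(int requested)
    {
        return Min(requested, effective_io_concurrency);
    }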
On 18 December 2014 at 16:03, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Dec 18, 2014 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 8, 2014 at 10:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Dec 6, 2014 at 5:37 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > >
> >
> > So to summarize my understanding, below are the set of things
> > which I should work on and in the order they are listed.
> >
> > 1. Push down qualification
> > 2. Performance Data
> > 3. Improve the way to push down the information related to worker.
> > 4. Dynamic allocation of work for workers.
> >
> >
>
> I have worked on the patch to accomplish above mentioned points
> 1, 2 and partly 3 and would like to share the progress with community.
> Sorry forgot to attach updated patch in last mail, attaching it now.

When attempting to recreate the plan in your example, I get an error:
➤ psql://thom@[local]:5488/pgbench
# create table t1(c1 int, c2 char(500)) with (fillfactor=10);
CREATE TABLE
Time: 13.653 ms
➤ psql://thom@[local]:5488/pgbench
# insert into t1 values(generate_series(1,100),'amit');
INSERT 0 100
Time: 4.796 ms
➤ psql://thom@[local]:5488/pgbench
# explain select c1 from t1;
ERROR: could not register background process
HINT: You may need to increase max_worker_processes.
Time: 1.659 ms
➤ psql://thom@[local]:5488/pgbench
# show max_worker_processes ;
max_worker_processes
----------------------
8
(1 row)
Time: 0.199 ms
# show parallel_seqscan_degree ;
parallel_seqscan_degree
-------------------------
10
(1 row)
Should I really need to increase max_worker_processes to >= parallel_seqscan_degree? If so, shouldn't there be a hint here along with the error message pointing this out? And should the error be produced when only a *plan* is being requested?

Also, I noticed that where a table is partitioned, the plan isn't parallelised:
# explain select distinct bid from pgbench_accounts;
QUERY PLAN
----------------------------------------------------------------------------------------
HashAggregate (cost=1446639.00..1446643.99 rows=499 width=4)
Group Key: pgbench_accounts.bid
-> Append (cost=0.00..1321639.00 rows=50000001 width=4)
-> Seq Scan on pgbench_accounts (cost=0.00..0.00 rows=1 width=4)
-> Seq Scan on pgbench_accounts_1 (cost=0.00..4279.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_2 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_3 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_4 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_5 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_6 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_7 (cost=0.00..2640.00 rows=100000 width=4)
...
-> Seq Scan on pgbench_accounts_498 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_499 (cost=0.00..2640.00 rows=100000 width=4)
-> Seq Scan on pgbench_accounts_500 (cost=0.00..2640.00 rows=100000 width=4)
(504 rows)

Is this expected?

Another issue (FYI, pgbench2 initialised with: pgbench -i -s 100 -F 10 pgbench2):
➤ psql://thom@[local]:5488/pgbench2
# explain select distinct bid from pgbench_accounts;
QUERY PLAN
-------------------------------------------------------------------------------------------
HashAggregate (cost=245833.38..245834.38 rows=100 width=4)
Group Key: bid
-> Parallel Seq Scan on pgbench_accounts (cost=0.00..220833.38 rows=10000000 width=4)
Number of Workers: 8
Number of Blocks Per Workers: 208333
(5 rows)
Time: 7.476 ms
➤ psql://thom@[local]:5488/pgbench2
# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 14897.991 ms
2014-12-31 15:21:42 GMT [9164]: [240-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [241-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [242-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [243-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [244-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [245-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [246-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [247-1] user=,db=,client= LOG: registering background worker "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [248-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [249-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [250-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [251-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [252-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [253-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [254-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:42 GMT [9164]: [255-1] user=,db=,client= LOG: starting background worker process "backend_worker"
2014-12-31 15:21:46 GMT [9164]: [256-1] user=,db=,client= LOG: worker process: backend_worker (PID 10887) exited with exit code 1
2014-12-31 15:21:46 GMT [9164]: [257-1] user=,db=,client= LOG: unregistering background worker "backend_worker"
2014-12-31 15:21:50 GMT [9164]: [258-1] user=,db=,client= LOG: worker process: backend_worker (PID 10888) exited with exit code 1
2014-12-31 15:21:50 GMT [9164]: [259-1] user=,db=,client= LOG: unregistering background worker "backend_worker"
2014-12-31 15:21:57 GMT [9164]: [260-1] user=,db=,client= LOG: server process (PID 10869) was terminated by signal 9: Killed
2014-12-31 15:21:57 GMT [9164]: [261-1] user=,db=,client= DETAIL: Failed process was running: explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
2014-12-31 15:21:57 GMT [9164]: [262-1] user=,db=,client= LOG: terminating any other active server processes
>
> [snip]
>
> Should I really need to increase max_worker_processes to >= parallel_seqscan_degree?
> Also, I noticed that where a table is partitioned, the plan isn't parallelised:
>
>
> Is this expected?
>
>
> Another issue (FYI, pgbench2 initialised with: pgbench -i -s 100 -F 10 pgbench2):
> [snip]
I think one thing we could do to minimize the chance of such an error is to set the value of parallel workers to be used for the plan equal to max_worker_processes if parallel_seqscan_degree is greater than max_worker_processes. Even if we do this, still such an error can come if the user has registered a bgworker before we could start parallel plan execution.
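In code form, the clamp being described might look like this (a sketch only; where it would live in the planner is an assumption, while parallel_seqscan_degree is the patch's GUC and max_worker_processes the existing one):

    /*
     * Sketch: clamp the requested degree so a plan never asks for more
     * workers than the cluster could register under any circumstances.
     */
    static int
    clamp_degree(int requested)
    {
        return (requested > max_worker_processes) ? max_worker_processes
                                                  : requested;
    }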
On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
<fabriziomello@gmail.com> wrote:
> Can we check the number of free bgworkers slots to set the max workers?
The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers. It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all. I think a lot of this work will fall out
quite naturally if this patch is reworked to use the parallel
mode/parallel context stuff, the latest version of which includes an
example of how to set up a parallel scan in such a manner that it can
run with any number of workers.
That sounds like exactly what's needed.
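A rough sketch of that degrade-gracefully behaviour, using the existing dynamic background worker API, might look as follows (the function and its surroundings are hypothetical; RegisterDynamicBackgroundWorker() is the real entry point and returns false when no slot is free):

    #include "postgres.h"
    #include "postmaster/bgworker.h"

    /*
     * Sketch: try to start up to nworkers background workers, treating
     * "no free slot" as "use fewer workers" rather than as an error.
     * Returns the number actually registered; if it is zero, the scan
     * simply runs serially in the master backend.
     */
    static int
    launch_scan_workers(BackgroundWorker *worker, int nworkers,
                        BackgroundWorkerHandle **handles)
    {
        int     launched = 0;
        int     i;

        for (i = 0; i < nworkers; i++)
        {
            if (!RegisterDynamicBackgroundWorker(worker, &handles[launched]))
                break;          /* out of slots: degrade, don't ERROR */
            launched++;
        }
        return launched;
    }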
>
> [snip]
> Running it again, I get the same issue. This is with parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.
>
> This doesn't happen if I set the pgbench scale to 50. I suspect this is a OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale 100, and I don't have swap enabled. But I'm concerned it crashes the whole instance.
>

Isn't this a backend crash due to OOM? And after that, the server will restart automatically.
> I also notice that requesting BUFFERS in a parallel EXPLAIN output yields no such information.
> --
Yeah, and the reason for same is that all the work done related to BUFFERS is done by backend workers; the master backend doesn't read any pages, so it is not able to accumulate this information.

> Is that not possible to report?

It is not impossible to report such information; we can develop some way to share such information between master backend and workers. I think we can do this if required once the patch is more stabilized.
And will the planner be able to decide whether or not to use background workers? For example:
# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=89584.00..89584.05 rows=5 width=4) (actual time=228.222..228.224 rows=5 loops=1)
Output: bid
Group Key: pgbench_accounts.bid
Buffers: shared hit=83334
-> Seq Scan on public.pgbench_accounts (cost=0.00..88334.00 rows=500000 width=4) (actual time=0.008..136.522 rows=500000 loops=1)
Output: bid
Buffers: shared hit=83334
Planning time: 0.071 ms
Execution time: 228.265 ms
(9 rows)
# set parallel_seqscan_degree = 8;
SET
Time: 0.187 ms
# explain (analyse, buffers, verbose) select distinct bid from pgbench_accounts;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12291.75..12291.80 rows=5 width=4) (actual time=603.042..603.042 rows=1 loops=1)
Output: bid
Group Key: pgbench_accounts.bid
-> Parallel Seq Scan on public.pgbench_accounts (cost=0.00..11041.75 rows=500000 width=4) (actual time=2.445..529.284 rows=500000 loops=1)
Output: bid
Number of Workers: 8
Number of Blocks Per Workers: 10416
Planning time: 0.049 ms
Execution time: 663.103 ms
(9 rows)
Time: 663.437 ms
On Fri, Jan 2, 2015 at 4:09 PM, Thom Brown <thom@linux.com> wrote:
>
> On 1 January 2015 at 10:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> > Running it again, I get the same issue. This is with parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.
>> >
>> > This doesn't happen if I set the pgbench scale to 50. I suspect this is a OOM issue. My laptop has 16GB RAM, the table is around 13GB at scale 100, and I don't have swap enabled. But I'm concerned it crashes the whole instance.
>> >
>>
>> Isn't this a backend crash due to OOM?
>> And after that server will restart automatically.
>
>
> Yes, I'm fairly sure it is. I guess what I'm confused about is that 8 parallel sequential scans in separate sessions (1 per session) don't cause the server to crash, but in a single session (8 in 1 session), they do.
>
It could be possible that the master backend retains some memory for a longer period, which causes it to hit the OOM error. By the way, in your test does the master backend always hit OOM, or is it random (either master or worker)?
>
> Will there be a GUC to influence parallel scan cost? Or does it take into account effective_io_concurrency in the costs?
>> And will the planner be able to decide whether or not to use background workers? For example:
>
Yes, we are planning to introduce a cost model for parallel communication (there is some discussion about the same upthread), but it's still not there, and that's why you are seeing it choose a parallel plan when it shouldn't. Currently in the patch, if you set parallel_seqscan_degree, it will most probably choose a parallel plan.
Thanks.
On Fri, Jan 2, 2015 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 1, 2015 at 11:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
>> > Can we check the number of free bgworkers slots to set the max workers?
>>
>> The real solution here is that this patch can't throw an error if it's unable to obtain the desired number of background workers. It needs to be able to smoothly degrade to a smaller number of background workers, or none at all.
>
> I think handling this way can have one side effect which is that if we degrade to smaller number, then the cost of plan (which was decided by optimizer based on number of parallel workers) could be more than non-parallel scan.
> Ideally before finalizing the parallel plan we should reserve the bgworkers required to execute that plan, but I think as of now we can workout a solution without it.

I don't think this is very practical. When cached plans are in use, we can have a bunch of plans sitting around that may or may not get reused at some point in the future, possibly far in the future. The current situation, which I think we want to maintain, is that such plans hold no execution-time resources (e.g. locks) and, generally, don't interfere with other things people might want to execute on the system. Nailing down a bunch of background workers just in case we might want to use them in the future would be pretty unfriendly.

I think it's right to view this in the same way we view work_mem. We plan on the assumption that an amount of memory equal to work_mem will be available at execution time, without actually reserving it. If the plan happens to need that amount of memory and if it actually isn't available when needed, then performance will suck; conceivably, the OOM killer might trigger. But it's the user's job to avoid this by not setting work_mem too high in the first place. Whether this system is for the best is arguable: one can certainly imagine a system where, if there's not enough memory at execution time, we consider alternatives like (a) replanning with a lower memory target, (b) waiting until more memory is available, or (c) failing outright in lieu of driving the machine into swap. But devising such a system is complicated -- for example, replanning with a lower memory target might latch onto a far more expensive plan, such that we would have been better off waiting for more memory to be available; yet trying to wait until more memory is available might result in waiting forever. And that's why we don't have such a system.

We don't need to do any better here. The GUC should tell us how many parallel workers we should anticipate being able to obtain. If other settings on the system, or the overall system load, preclude us from obtaining that number of parallel workers, then the query will take longer to execute; and the plan might be sub-optimal. If that happens frequently, the user should lower the planner GUC to a level that reflects the resources actually likely to be available at execution time.

By the way, another area where this kind of effect crops up is with the presence of particular disk blocks in shared_buffers or the system buffer cache. Right now, the planner makes no attempt to cost a scan of a frequently-used, fully-cached relation differently than a rarely-used, probably-not-cached relation; and that sometimes leads to bad plans. But if it did try to do that, then we'd have the same kind of problem discussed here -- things might change between planning and execution, or even after the beginning of execution. Also, we might get nasty feedback effects: since the relation isn't cached, we view a plan that would involve reading it in as very expensive, and avoid such a plan. However, we might be better off picking the "slow" plan anyway, because it might be that once we've read the data once it will stay cached and run much more quickly than some plan that seems better starting from a cold cache.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote:
> I think it's right to view this in the same way we view work_mem. We plan on the assumption that an amount of memory equal to work_mem will be available at execution time, without actually reserving it.

Agreed- this seems like a good approach for how to address this. We should still be able to end up with plans which use less than the max possible parallel workers though, as I pointed out somewhere up-thread. This is also similar to work_mem- we certainly have plans which don't expect to use all of work_mem and others that expect to use all of it (per node, of course).

Thanks,

Stephen
> [snip]
> On Mon, Jan 5, 2015 at 8:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
Sorry for incomplete mail sent prior to this, I just hit the send button
4. Sending ReadyForQuery() after completely sending the tuples,
Attachment
On 1/5/15, 9:21 AM, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> I think it's right to view this in the same way we view work_mem. We plan on the assumption that an amount of memory equal to work_mem will be available at execution time, without actually reserving it.
>
> Agreed- this seems like a good approach for how to address this. We should still be able to end up with plans which use less than the max possible parallel workers though, as I pointed out somewhere up-thread. This is also similar to work_mem- we certainly have plans which don't expect to use all of work_mem and others that expect to use all of it (per node, of course).

I agree, but we should try and warn the user if they set parallel_seqscan_degree close to max_worker_processes, or at least give some indication of what's going on. This is something you could end up beating your head on wondering why it's not working.

Perhaps we could have EXPLAIN throw a warning if a plan is likely to get less than parallel_seqscan_degree number of workers.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> [snip]
>
> Perhaps we could have EXPLAIN throw a warning if a plan is likely to get less than parallel_seqscan_degree number of workers.

Yeah, if we come up with a plan for X workers and end up not being able to spawn that many then I could see that being worth a warning or notice or something. Not sure what EXPLAIN has to do with it..

Thanks,

Stephen
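Something as small as this at executor startup would cover the notice (a sketch; both variable names and the message wording are assumed, while ereport()/errmsg() are the existing error-reporting API):

    /*
     * Sketch: let the user know the plan assumed more workers than the
     * executor could actually start.
     */
    if (launched < plan_workers)
        ereport(NOTICE,
                (errmsg("planned %d parallel workers, but only %d could be started",
                        plan_workers, launched)));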
Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> On Fri, Dec 19, 2014 at 7:57 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > There's certainly documentation available from the other RDBMS' which already support parallel query, as one source. Other academic papers exist (and once you've linked into one, the references and prior work helps bring in others). Sadly, I don't currently have ACM access (might have to change that..), but there are publicly available papers also,
>
> I have gone through couple of papers and what some other databases do in case of parallel sequential scan and here is brief summarization of same and how I am planning to handle in the patch:

Great, thanks!

> Costing:
> In one of the papers [1] suggested by you, below is the summarisation:
> a. Startup costs are negligible if processes can be reused rather than created afresh.
> b. Communication cost consists of the CPU cost of sending and receiving messages.
> c. Communication costs can exceed the cost of operators such as scanning, joining or grouping.
> These findings lead to the important conclusion that query optimization should be concerned with communication costs but not with startup costs.
>
> In our case, as currently we don't have a mechanism to reuse parallel workers, we need to account for that cost as well. So based on that, I am planning to add three new parameters cpu_tuple_comm_cost, parallel_setup_cost, parallel_startup_cost
> * cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from worker to master backend, with default value DEFAULT_CPU_TUPLE_COMM_COST as 0.1; this will be multiplied with tuples expected to be selected
> * parallel_setup_cost - Cost of setting up shared memory for parallelism, with default value as 100.0
> * parallel_startup_cost - Cost of starting up parallel workers, with default value as 1000.0 multiplied by number of workers decided for scan.
>
> I will do some experiments to finalise the default values, but in general, I feel developing cost model on above parameters is good.

The parameters sound reasonable but I'm a bit worried about the way you're describing the implementation. Specifically this comment: "Cost of starting up parallel workers with default value as 1000.0 multiplied by number of workers decided for scan."

That appears to imply that we'll decide on the number of workers, figure out the cost, and then consider "parallel" as one path and "not-parallel" as another. I'm worried that if I end up setting the max parallel workers to 32 for my big, beefy, mostly-single-user system then I'll actually end up not getting parallel execution because we'll always be including the full startup cost of 32 threads. For huge queries, it'll probably be fine, but there's a lot of room to parallelize things at levels less than 32 which we won't even consider.

What I was advocating for up-thread was to consider multiple "parallel" paths and to pick whichever ends up being the lowest overall cost. The flip-side to that is increased planning time. Perhaps we can come up with an efficient way of working out where the break-point is based on the non-parallel cost and go at it from that direction instead of building out whole paths for each increment of parallelism.

I'd really like to be able to set the 'max parallel' high and then have the optimizer figure out how many workers should actually be spawned for a given query.

> Execution:
> Most other databases do a partition-level scan, for partitions on different disks, by each individual parallel worker. However, it seems amazon dynamodb [2] also works on something similar to what I have used in the patch, which means on fixed blocks. I think this kind of strategy seems better than dividing the blocks at runtime because dividing the blocks randomly among workers could lead to a random scan for a parallel sequential scan.

Yeah, we also need to consider the i/o side of this, which will definitely be tricky. There are i/o systems out there which are faster than a single CPU and ones where a single CPU can manage multiple i/o channels. There are also cases where the i/o system handles sequential access nearly as fast as random and cases where sequential is much faster than random. Where we can get an idea of that distinction is with seq_page_cost vs. random_page_cost, as folks running on SSDs tend to lower random_page_cost from the default to indicate that.

> Also I find in whatever I have read (Oracle, dynamodb) that most databases divide work among workers and master backend acts as coordinator, atleast that's what I could understand.

Yeah, I agree that's more typical. Robert's point that the master backend should participate is interesting but, as I recall, it was based on the idea that the master could finish faster than the worker- but if that's the case then we've planned it out wrong from the beginning.

Thanks!

Stephen
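Pulling the proposed parameters together, one plausible shape for the costing - dividing the run cost and then adding the parallel overheads, which is one of the two combinations Stephen asks about above - would be the following sketch (the three GUC variables are the proposed ones, Cost is the planner's existing typedef, and the perfect-division assumption is of course the weak spot):

    /*
     * Sketch: cost a parallel seq scan from the serial run cost.  The
     * division assumes the scan splits perfectly across the workers
     * plus the master backend, which real costing should not assume.
     */
    static Cost
    cost_parallel_seqscan(Cost serial_run_cost, double ntuples, int nworkers)
    {
        Cost    total = serial_run_cost / (nworkers + 1);

        total += parallel_setup_cost;               /* shm setup, paid once */
        total += parallel_startup_cost * nworkers;  /* per-worker startup */
        total += cpu_tuple_comm_cost * ntuples;     /* tuples shipped back */

        return total;
    }

Whether the divisor should be nworkers + 1 depends on whether the master backend participates in the scan itself, which is still being debated here.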
Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> On Fri, Jan 9, 2015 at 1:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> > I agree, but we should try and warn the user if they set parallel_seqscan_degree close to max_worker_processes, or at least give some indication of what's going on. This is something you could end up beating your head on wondering why it's not working.
>
> Yet another way to handle the case when enough workers are not available is to let user specify the desired minimum percentage of requested parallel workers with parameter like PARALLEL_QUERY_MIN_PERCENT. For example, if you specify 50 for this parameter, then at least 50% of the parallel workers requested for any parallel operation must be available in order for the operation to succeed else it will give error. If the value is set to null, then all parallel operations will proceed as long as at least two parallel workers are available for processing.

Ugh. I'm not a fan of this.. Based on how we're talking about modeling this, if we decide to parallelize at all, then we expect it to be a win. I don't like the idea of throwing an error if, at execution time, we end up not being able to actually get the number of workers we want- instead, we should degrade gracefully all the way back to serial, if necessary. Perhaps we should send a NOTICE or something along those lines to let the user know we weren't able to get the level of parallelization that the plan originally asked for, but I really don't like just throwing an error.

Now, for debugging purposes, I could see such a parameter being available but it should default to 'off/never-fail'.

Thanks,

Stephen
On 01/09/2015 08:01 PM, Stephen Frost wrote:
> [snip]
>
> Ugh. I'm not a fan of this.. Based on how we're talking about modeling this, if we decide to parallelize at all, then we expect it to be a win. I don't like the idea of throwing an error if, at execution time, we end up not being able to actually get the number of workers we want- instead, we should degrade gracefully all the way back to serial, if necessary. Perhaps we should send a NOTICE or something along those lines to let the user know we weren't able to get the level of parallelization that the plan originally asked for, but I really don't like just throwing an error.

yeah this seems like the behaviour I would expect; if we can't get enough parallel workers we should just use as many as we can get. Everything else, and especially erroring out, will just cause random application failures and easy DoS vectors. I think all we need initially is being able to specify a "maximum number of workers per query" as well as a "maximum number of workers in total for parallel operations".

> Now, for debugging purposes, I could see such a parameter being available but it should default to 'off/never-fail'.

not sure what it really would be useful for - if I execute a query I would truly expect it to get answered - if it can be made faster if done in parallel that's nice, but why would I want it to fail?

Stefan
* Stefan Kaltenbrunner (stefan@kaltenbrunner.cc) wrote:
> On 01/09/2015 08:01 PM, Stephen Frost wrote:
> > Now, for debugging purposes, I could see such a parameter being available but it should default to 'off/never-fail'.
>
> not sure what it really would be useful for - if I execute a query I would truly expect it to get answered - if it can be made faster if done in parallel that's nice, but why would I want it to fail?

I was thinking for debugging only, though I'm not really sure why you'd need it if you get a NOTICE when you don't end up with all the workers you expect.

Thanks,

Stephen
On 1/9/15, 3:34 PM, Stephen Frost wrote:
> [snip]
>
> I was thinking for debugging only, though I'm not really sure why you'd need it if you get a NOTICE when you don't end up with all the workers you expect.

Yeah, debugging is my concern as well. You're working on a query, you expect it to be using parallelism, and EXPLAIN is showing it's not. Now you're scratching your head.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 1/9/15, 11:24 AM, Stephen Frost wrote:
> What I was advocating for up-thread was to consider multiple "parallel" paths and to pick whichever ends up being the lowest overall cost. The flip-side to that is increased planning time. Perhaps we can come up with an efficient way of working out where the break-point is based on the non-parallel cost and go at it from that direction instead of building out whole paths for each increment of parallelism.

I think at some point we'll need the ability to stop planning part-way through for queries producing really small estimates. If the first estimate you get is 1000 units, does it really make sense to do something like try every possible join permutation, or attempt to parallelize?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
> [snip]
>
> > Execution:
> > Most other databases does partition level scan for partition on
> > different disks by each individual parallel worker. However,
> > it seems amazon dynamodb [2] also works on something
> > similar to what I have used in patch which means on fixed
> > blocks. I think this kind of strategy seems better than dividing
> > the blocks at runtime because dividing randomly the blocks
> > among workers could lead to random scan for a parallel
> > sequential scan.
>
> Yeah, we also need to consider the i/o side of this, which will
> definitely be tricky. There are i/o systems out there which are faster
> than a single CPU and ones where a single CPU can manage multiple i/o
> channels. There are also cases where the i/o system handles sequential
> access nearly as fast as random and cases where sequential is much
> faster than random. Where we can get an idea of that distinction is
> with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
> lower random_page_cost from the default to indicate that.
>
I am not clear, do you expect anything different in execution strategy than what I have mentioned, or does that sound reasonable to you?
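To make the arithmetic concrete, here is a sketch of the cost model being discussed. It is an illustration only, not actual costsize.c code: the function name is made up, and the defaults are the ones proposed above.

/* Proposed GUCs, with the proposed defaults. */
static double cpu_tuple_comm_cost = 0.1;      /* per tuple shipped */
static double parallel_setup_cost = 100.0;    /* shared memory setup */
static double parallel_startup_cost = 1000.0; /* per worker started */

static double
parallel_seqscan_cost_sketch(double serial_run_cost,
                             double tuples_returned,
                             int nworkers)
{
    double  cost;

    /* One-time setup, plus startup multiplied by the worker count. */
    cost = parallel_setup_cost + parallel_startup_cost * nworkers;

    /* Every selected tuple travels from a worker to the master. */
    cost += cpu_tuple_comm_cost * tuples_returned;

    /* Assume the scan itself divides over the workers plus master. */
    cost += serial_run_cost / (nworkers + 1);

    return cost;
}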
>
> On 01/09/2015 08:01 PM, Stephen Frost wrote:
> > Amit,
> >
> > * Amit Kapila (amit.kapila16@gmail.com) wrote:
> >> On Fri, Jan 9, 2015 at 1:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> >>> I agree, but we should try and warn the user if they set
> >>> parallel_seqscan_degree close to max_worker_processes, or at least give
> >>> some indication of what's going on. This is something you could end up
> >>> beating your head on wondering why it's not working.
> >>
> >> Yet another way to handle the case when enough workers are not
> >> available is to let user specify the desired minimum percentage of
> >> requested parallel workers with parameter like
> >> PARALLEL_QUERY_MIN_PERCENT. For example, if you specify
> >> 50 for this parameter, then at least 50% of the parallel workers
> >> requested for any parallel operation must be available in order for
> >> the operation to succeed else it will give error. If the value is set to
> >> null, then all parallel operations will proceed as long as at least two
> >> parallel workers are available for processing.
> >
> > Now, for debugging purposes, I could see such a parameter being
> > available but it should default to 'off/never-fail'.
>
> not sure what it really would be useful for - if I execute a query I
> would truly expect it to get answered - if it can be made faster if
> done in parallel that's nice but why would I want it to fail?
>
One use case where I could imagine it to be useful is when the
* Amit Kapila (amit.kapila16@gmail.com) wrote: > At this moment if we can ensure that parallel plan should not be selected > for cases where it will perform poorly is more than enough considering > we have lots of other work left to even make any parallel operation work. The problem with this approach is that it doesn't consider any options between 'serial' and 'parallelize by factor X'. If the startup cost is 1000 and the factor is 32, then a seqscan which costs 31000 won't ever be parallelized, even though a factor of 8 would have parallelized it. You could forget about the per-process startup cost entirely, in fact, and simply say "only parallelize if it's more than X". Again, I don't like the idea of designing this with the assumption that the user dictates the right level of parallelization for each and every query. I'd love to go out and tell users "set the factor to the number of CPUs you have and we'll just use what makes sense." The same goes for max number of backends. If we set the parallel level to the number of CPUs and set the max backends to the same, then we end up with only one parallel query running at a time, ever. That's terrible. Now, we could set the parallel level lower or set the max backends higher, but either way we're going to end up either using less than we could or over-subscribing, neither of which is good. I agree that this makes it a bit different from work_mem, but in this case there's an overall max in the form of the maximum number of background workers. If we had something similar for work_mem, then we could set that higher and still trust the system to only use the amount of memory necessary (eg: a hashjoin doesn't use all available work_mem and neither does a sort, unless the set is larger than available memory). > > > Execution: > > > Most other databases does partition level scan for partition on > > > different disks by each individual parallel worker. However, > > > it seems amazon dynamodb [2] also works on something > > > similar to what I have used in patch which means on fixed > > > blocks. I think this kind of strategy seems better than dividing > > > the blocks at runtime because dividing randomly the blocks > > > among workers could lead to random scan for a parallel > > > sequential scan. > > > > Yeah, we also need to consider the i/o side of this, which will > > definitely be tricky. There are i/o systems out there which are faster > > than a single CPU and ones where a single CPU can manage multiple i/o > > channels. There are also cases where the i/o system handles sequential > > access nearly as fast as random and cases where sequential is much > > faster than random. Where we can get an idea of that distinction is > > with seq_page_cost vs. random_page_cost as folks running on SSDs tend to > > lower random_page_cost from the default to indicate that. > > > I am not clear, do you expect anything different in execution strategy > than what I have mentioned or does that sound reasonable to you? What I'd like is a way to figure out the right amount of CPU for each tablespace (0.25, 1, 2, 4, etc) and then use that many. Using a single CPU for each tablespace is likely to starve the CPU or starve the I/O system and I'm not sure if there's a way to address that. Note that I intentionally said tablespace there because that's how users can tell us what the different i/o channels are. I realize this ends up going beyond the current scope, but the parallel seqscan at the per relation level will only ever be using one i/o channel. 
It'd be neat if we could work out how fast that i/o channel is vs. the CPUs and determine how many CPUs are necessary to keep up with the i/o channel and then use more-or-less exactly that many for the scan. I agree that some of this can come later but I worry that starting out with a design that expects to always be told exactly how many CPUs to use when running a parallel query will be difficult to move away from later. Thanks, Stephen
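In its simplest form, letting the optimizer choose the degree, as Stephen suggests, could look like the sketch below (reusing the costing sketch above). A real implementation would generate paths, and might step geometrically through worker counts (1, 2, 4, 8, ...) to bound planning time.

static int
choose_parallel_degree_sketch(double serial_total_cost,
                              double serial_run_cost,
                              double tuples_returned,
                              int max_workers)
{
    int     w;
    int     best_degree = 0;    /* 0 means: stay serial */
    double  best_cost = serial_total_cost;

    for (w = 1; w <= max_workers; w++)
    {
        double  c = parallel_seqscan_cost_sketch(serial_run_cost,
                                                 tuples_returned, w);

        if (c < best_cost)
        {
            best_cost = c;
            best_degree = w;
        }
    }
    return best_degree;
}

With parallel_startup_cost at 1000.0, a scan whose serial cost is 31000 comes out serial if only the 32-worker plan is costed, but parallel at 8 workers; that is exactly the plan which costing only the configured maximum loses.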
On Thu, Jan 8, 2015 at 6:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Are we sure that in such cases we will consume work_mem during > execution? In cases of parallel_workers we are sure to an extent > that if we reserve the workers then we will use it during execution. > Nonetheless, I have proceeded and integrated the parallel_seq scan > patch with v0.3 of parallel_mode patch posted by you at below link: > http://www.postgresql.org/message-id/CA+TgmoYmp_=XcJEhvJZt9P8drBgW-pDpjHxBhZA79+M4o-CZQA@mail.gmail.com That depends on the costing model. It makes no sense to do a parallel sequential scan on a small relation, because the user backend can scan the whole thing itself faster than the workers can start up. I suspect it may also be true that the useful amount of parallelism increases the larger the relation gets (but maybe not). > 2. To enable two types of shared memory queues (error queue and > tuple queue), we need to ensure that we switch to appropriate queue > during communication of various messages from parallel worker > to master backend. There are two ways to do it > a. Save the information about error queue during startup of parallel > worker (ParallelMain()) and then during error, set the same (switch > to error queue in errstart() and switch back to tuple queue in > errfinish() and errstart() in case errstart() doesn't need to > propagate > error). > b. Do something similar as (a) for tuple queue in printtup or other > place > if any for non-error messages. > I think approach (a) is slightly better as compared to approach (b) as > we need to switch many times for tuple queue (for each tuple) and > there could be multiple places where we need to do the same. For now, > I have used approach (a) in Patch which needs some more work if we > agree on the same. I don't think you should be "switching" queues. The tuples should be sent to the tuple queue, and errors and notices to the error queue. > 3. As per current implementation of Parallel_seqscan, it needs to use > some information from parallel.c which was not exposed, so I have > exposed the same by moving it to parallel.h. Information that is required > is as follows: > ParallelWorkerNumber, FixedParallelState and shm keys - > This is used to decide the blocks that need to be scanned. > We might change it in future the way parallel scan/work distribution > is done, but I don't see any harm in exposing this information. Hmm. I can see why ParallelWorkerNumber might need to be exposed, but the other stuff seems like it shouldn't be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote: > Yeah, if we come up with a plan for X workers and end up not being able > to spawn that many then I could see that being worth a warning or notice > or something. Not sure what EXPLAIN has to do anything with it.. That seems mighty odd to me. If there are 8 background worker processes available, and you allow each session to use at most 4, then when there are >2 sessions trying to do parallelism at the same time, they might not all get their workers. Emitting a notice for that seems like it would be awfully chatty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 9, 2015 at 12:24 PM, Stephen Frost <sfrost@snowman.net> wrote: > The parameters sound reasonable but I'm a bit worried about the way > you're describing the implementation. Specifically this comment: > > "Cost of starting up parallel workers with default value as 1000.0 > multiplied by number of workers decided for scan." > > That appears to imply that we'll decide on the number of workers, figure > out the cost, and then consider "parallel" as one path and > "not-parallel" as another. [...] > I'd really like to be able to set the 'max parallel' high and then have > the optimizer figure out how many workers should actually be spawned for > a given query. +1. > Yeah, we also need to consider the i/o side of this, which will > definitely be tricky. There are i/o systems out there which are faster > than a single CPU and ones where a single CPU can manage multiple i/o > channels. There are also cases where the i/o system handles sequential > access nearly as fast as random and cases where sequential is much > faster than random. Where we can get an idea of that distinction is > with seq_page_cost vs. random_page_cost as folks running on SSDs tend to > lower random_page_cost from the default to indicate that. On my MacOS X system, I've already seen cases where my parallel_count module runs incredibly slowly some of the time. I believe that this is because having multiple workers reading the relation block-by-block at the same time causes the OS to fail to realize that it needs to do aggressive readahead. I suspect we're going to need to account for this somehow. > Yeah, I agree that's more typical. Robert's point that the master > backend should participate is interesting but, as I recall, it was based > on the idea that the master could finish faster than the worker- but if > that's the case then we've planned it out wrong from the beginning. So, if the workers have been started but aren't keeping up, the master should do nothing until they produce tuples rather than participating? That doesn't seem right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>
> On Thu, Jan 8, 2015 at 6:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 2. To enable two types of shared memory queues (error queue and
> > tuple queue), we need to ensure that we switch to appropriate queue
> > during communication of various messages from parallel worker
> > to master backend. There are two ways to do it
> > a. Save the information about error queue during startup of parallel
> > worker (ParallelMain()) and then during error, set the same (switch
> > to error queue in errstart() and switch back to tuple queue in
> > errfinish() and errstart() in case errstart() doesn't need to
> > propagate
> > error).
> > b. Do something similar as (a) for tuple queue in printtup or other
> > place
> > if any for non-error messages.
> > I think approach (a) is slightly better as compared to approach (b) as
> > we need to switch many times for tuple queue (for each tuple) and
> > there could be multiple places where we need to do the same. For now,
> > I have used approach (a) in Patch which needs some more work if we
> > agree on the same.
>
> I don't think you should be "switching" queues. The tuples should be
> sent to the tuple queue, and errors and notices to the error queue.
>
> > 3. As per current implementation of Parallel_seqscan, it needs to use
> > some information from parallel.c which was not exposed, so I have
> > exposed the same by moving it to parallel.h. Information that is required
> > is as follows:
> > ParallelWorkerNumber, FixedParallelState and shm keys -
> > This is used to decide the blocks that need to be scanned.
> > We might change it in future the way parallel scan/work distribution
> > is done, but I don't see any harm in exposing this information.
>
> Hmm. I can see why ParallelWorkerNumber might need to be exposed, but
> the other stuff seems like it shouldn't be.
>
It depends upon how we decide to achieve the scan of blocks
* Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote: > > Yeah, if we come up with a plan for X workers and end up not being able > > to spawn that many then I could see that being worth a warning or notice > > or something. Not sure what EXPLAIN has to do anything with it.. > > That seems mighty odd to me. If there are 8 background worker > processes available, and you allow each session to use at most 4, then > when there are >2 sessions trying to do parallelism at the same time, > they might not all get their workers. Emitting a notice for that > seems like it would be awfully chatty. Yeah, agreed, it could get quite noisy. Did you have another thought for how to address the concern raised? Specifically, that you might not get as many workers as you thought you would? Thanks, Stephen
* Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Jan 9, 2015 at 12:24 PM, Stephen Frost <sfrost@snowman.net> wrote: > > Yeah, we also need to consider the i/o side of this, which will > > definitely be tricky. There are i/o systems out there which are faster > > than a single CPU and ones where a single CPU can manage multiple i/o > > channels. There are also cases where the i/o system handles sequential > > access nearly as fast as random and cases where sequential is much > > faster than random. Where we can get an idea of that distinction is > > with seq_page_cost vs. random_page_cost as folks running on SSDs tend to > > lower random_page_cost from the default to indicate that. > > On my MacOS X system, I've already seen cases where my parallel_count > module runs incredibly slowly some of the time. I believe that this > is because having multiple workers reading the relation block-by-block > at the same time causes the OS to fail to realize that it needs to do > aggressive readahead. I suspect we're going to need to account for > this somehow. So, for my 2c, I've long expected us to parallelize at the relation-file level for these kinds of operations. This goes back to my other thoughts on how we should be thinking about parallelizing inbound data for bulk data loads but it seems appropriate to consider it here also. One of the issues there is that 1G still feels like an awful lot for a minimum work size for each worker and it would mean we don't parallelize for relations less than that size. On a random VM on my personal server, an uncached 1G read takes over 10s. Cached it's less than half that, of course. This is all spinning rust (and only 7200 RPM at that) and there's a lot of other stuff going on but that still seems like too much of a chunk to give to one worker unless the overall data set to go through is really large. There's other issues in there too, of course, if we're dumping data in like this then we have to either deal with jagged relation files somehow or pad the file out to 1G, and that doesn't even get into the issues around how we'd have to redesign the interfaces for relation access and how this thinking is an utter violation of the modularity we currently have there. > > Yeah, I agree that's more typical. Robert's point that the master > > backend should participate is interesting but, as I recall, it was based > > on the idea that the master could finish faster than the worker- but if > > that's the case then we've planned it out wrong from the beginning. > > So, if the workers have been started but aren't keeping up, the master > should do nothing until they produce tuples rather than participating? > That doesn't seem right. Having the master jump in and start working could screw things up also though. Perhaps we need the master to start working as a fail-safe but not plan on having things go that way? Having more processes trying to do X doesn't always result in things getting better and the master needs to keep up with all the tuples being thrown at it from the workers. Thanks, Stephen
Amit, * Amit Kapila (amit.kapila16@gmail.com) wrote: > On Sun, Jan 11, 2015 at 9:09 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I don't think you should be "switching" queues. The tuples should be > > sent to the tuple queue, and errors and notices to the error queue. Agreed. > To achieve what you said (The tuples should be sent to the tuple > queue, and errors and notices to the error queue.), we need to > switch the queues. > The difficulty here is that once we set the queue (using > pq_redirect_to_shm_mq()) through which the communication has to > happen, it will use the same unless we change again the queue > using pq_redirect_to_shm_mq(). For example, assume we have > initially set error queue (using pq_redirect_to_shm_mq()) then to > send tuples, we need to call pq_redirect_to_shm_mq() to > set the tuple queue as the queue that needs to be used for communication > and again if error happens then we need to do the same for error > queue. > Do you have any other idea to achieve the same? I think what Robert's getting at here is that pq_redirect_to_shm_mq() might be fine for the normal data heading back, but we need something separate for errors and notices. Switching everything back and forth between the normal and error queues definitely doesn't sound right to me- they need to be independent. In other words, you need to be able to register a "normal data" queue and then you need to also register a "error/notice" queue and have errors and notices sent there directly. Going off of what I recall, can't this be done by having the callbacks which are registered for sending data back look at what they're being asked to send and then decide which queue it's appropriate for out of the set which have been registered so far? Thanks, Stephen
On 01/11/2015 11:27 AM, Stephen Frost wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote: >>> Yeah, if we come up with a plan for X workers and end up not being able >>> to spawn that many then I could see that being worth a warning or notice >>> or something. Not sure what EXPLAIN has to do anything with it.. >> >> That seems mighty odd to me. If there are 8 background worker >> processes available, and you allow each session to use at most 4, then >> when there are >2 sessions trying to do parallelism at the same time, >> they might not all get their workers. Emitting a notice for that >> seems like it would be awfully chatty. > > Yeah, agreed, it could get quite noisy. Did you have another thought > for how to address the concern raised? Specifically, that you might not > get as many workers as you thought you would? Wild idea: What about dealing with it as some sort of statistic - ie track some global counts in the stats collector or on a per-query basis in pg_stat_activity and/or through pg_stat_statements? Not sure why it is that important to get it on a per-query basis; imho it is simply a configuration limit we have set (similar to work_mem or when switching to geqo) - we don't report "per query" through notice/warning there either (though the effect is kind of visible in explain). Stefan
On Sat, Jan 10, 2015 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I don't think you should be "switching" queues. The tuples should be >> sent to the tuple queue, and errors and notices to the error queue. > To achieve what you said (The tuples should be sent to the tuple > queue, and errors and notices to the error queue.), we need to > switch the queues. > The difficulty here is that once we set the queue (using > pq_redirect_to_shm_mq()) through which the communication has to > happen, it will use the same unless we change again the queue > using pq_redirect_to_shm_mq(). For example, assume we have > initially set error queue (using pq_redirect_to_shm_mq()) then to > send tuples, we need to call pq_redirect_to_shm_mq() to > set the tuple queue as the queue that needs to be used for communication > and again if error happens then we need to do the same for error > queue. > Do you have any other idea to achieve the same? Yeah, you need two separate global variables pointing to shm_mq objects, one of which gets used by pqmq.c for errors and the other of which gets used by printtup.c for tuples. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
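A sketch of the two-queue arrangement Robert describes: shm_mq, shm_mq_attach() and shm_mq_send() are the existing facilities from storage/shm_mq.h, while the variable and function names, and the wiring into printtup.c, are hypothetical.

#include "postgres.h"
#include "storage/shm_mq.h"

static shm_mq_handle *error_mqh;   /* used by the pqmq.c error path */
static shm_mq_handle *tuple_mqh;   /* used by the printtup.c tuple path */

static void
worker_attach_queues(shm_mq *error_mq, shm_mq *tuple_mq)
{
    /* Segment and worker-handle arguments omitted in this sketch. */
    error_mqh = shm_mq_attach(error_mq, NULL, NULL);
    tuple_mqh = shm_mq_attach(tuple_mq, NULL, NULL);
}

/* Tuples always go to the tuple queue; no switching on error. */
static void
send_tuple_message(const void *data, Size len)
{
    (void) shm_mq_send(tuple_mqh, len, data, false);
}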
On Sun, Jan 11, 2015 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Robert Haas (robertmhaas@gmail.com) wrote: >> On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote: >> > Yeah, if we come up with a plan for X workers and end up not being able >> > to spawn that many then I could see that being worth a warning or notice >> > or something. Not sure what EXPLAIN has to do anything with it.. >> >> That seems mighty odd to me. If there are 8 background worker >> processes available, and you allow each session to use at most 4, then >> when there are >2 sessions trying to do parallelism at the same time, >> they might not all get their workers. Emitting a notice for that >> seems like it would be awfully chatty. > > Yeah, agreed, it could get quite noisy. Did you have another thought > for how to address the concern raised? Specifically, that you might not > get as many workers as you thought you would? I'm not sure why that's a condition in need of special reporting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote: > So, for my 2c, I've long expected us to parallelize at the relation-file > level for these kinds of operations. This goes back to my other > thoughts on how we should be thinking about parallelizing inbound data > for bulk data loads but it seems appropriate to consider it here also. > One of the issues there is that 1G still feels like an awful lot for a > minimum work size for each worker and it would mean we don't parallelize > for relations less than that size. Yes, I think that's a killer objection. > [ .. ] and > how this thinking is an utter violation of the modularity we currently > have there. As is that. My thinking is more along the lines that we might need to issue explicit prefetch requests when doing a parallel sequential scan, to make up for any failure of the OS to do that for us. >> So, if the workers have been started but aren't keeping up, the master >> should do nothing until they produce tuples rather than participating? >> That doesn't seem right. > > Having the master jump in and start working could screw things up also > though. I don't think there's any reason why that should screw things up. There's no reason why the master's participation should look any different from one more worker. Look at my parallel_count code on the other thread to see what I mean: the master and all the workers are running the same code, and if fewer workers show up than expected, or run unduly slowly, it's easily tolerated. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 11, 2015 at 6:09 AM, Stephen Frost <sfrost@snowman.net> wrote: > I think what Robert's getting at here is that pq_redirect_to_shm_mq() > might be fine for the normal data heading back, but we need something > separate for errors and notices. Switching everything back and forth > between the normal and error queues definitely doesn't sound right to > me- they need to be independent. You've got that backwards. pq_redirect_to_shm_mq() handles errors and notices, but we need something separate for the tuple stream. > In other words, you need to be able to register a "normal data" queue > and then you need to also register a "error/notice" queue and have > errors and notices sent there directly. Going off of what I recall, > can't this be done by having the callbacks which are registered for > sending data back look at what they're being asked to send and then > decide which queue it's appropriate for out of the set which have been > registered so far? It's pretty simple, really. The functions that need to use the tuple queue are in printtup.c; those, and only those, need to be modified to write to the other queue. Or, possibly, we should pass the tuples around in their native format instead of translating them into binary form and then reconstituting them on the other end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>
> On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
> >> So, if the workers have been started but aren't keeping up, the master
> >> should do nothing until they produce tuples rather than participating?
> >> That doesn't seem right.
> >
> > Having the master jump in and start working could screw things up also
> > though.
>
> I don't think there's any reason why that should screw things up.
>
> On Sun, Jan 11, 2015 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> >> On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote:
> >> > Yeah, if we come up with a plan for X workers and end up not being able
> >> > to spawn that many then I could see that being worth a warning or notice
> >> > or something. Not sure what EXPLAIN has to do anything with it..
> >>
> >> That seems mighty odd to me. If there are 8 background worker
> >> processes available, and you allow each session to use at most 4, then
> >> when there are >2 sessions trying to do parallelism at the same time,
> >> they might not all get their workers. Emitting a notice for that
> >> seems like it would be awfully chatty.
> >
> > Yeah, agreed, it could get quite noisy. Did you have another thought
> > for how to address the concern raised? Specifically, that you might not
> > get as many workers as you thought you would?
>
> I'm not sure why that's a condition in need of special reporting.
>
So what should happen if no workers are available?
On 1/11/15 3:57 PM, Robert Haas wrote: > On Sun, Jan 11, 2015 at 5:27 AM, Stephen Frost <sfrost@snowman.net> wrote: >> * Robert Haas (robertmhaas@gmail.com) wrote: >>> On Thu, Jan 8, 2015 at 2:46 PM, Stephen Frost <sfrost@snowman.net> wrote: >>>> Yeah, if we come up with a plan for X workers and end up not being able >>>> to spawn that many then I could see that being worth a warning or notice >>>> or something. Not sure what EXPLAIN has to do anything with it.. >>> >>> That seems mighty odd to me. If there are 8 background worker >>> processes available, and you allow each session to use at most 4, then >>> when there are >2 sessions trying to do parallelism at the same time, >>> they might not all get their workers. Emitting a notice for that >>> seems like it would be awfully chatty. >> >> Yeah, agreed, it could get quite noisy. Did you have another thought >> for how to address the concern raised? Specifically, that you might not >> get as many workers as you thought you would? > > I'm not sure why that's a condition in need of special reporting. The case raised before (that I think is valid) is: what if you have a query that is massively parallel. You expect it to get 60 cores on the server and take 10 minutes. Instead it gets 10 and takes an hour (or worse, 1 and takes 10 hours). Maybe it's not worth dealing with that in the first version, but I expect it will come up very quickly. We better make sure we're not painting ourselves in a corner. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > So, for my 2c, I've long expected us to parallelize at the relation-file
> > level for these kinds of operations. This goes back to my other
> > thoughts on how we should be thinking about parallelizing inbound data
> > for bulk data loads but it seems appropriate to consider it here also.
> > One of the issues there is that 1G still feels like an awful lot for a
> > minimum work size for each worker and it would mean we don't parallelize
> > for relations less than that size.
>
> Yes, I think that's a killer objection.

One approach that has worked well for me is to break big jobs into much smaller bite size tasks. Each task is small enough to complete quickly.

We add the tasks to a task queue and spawn a generic worker pool which eats through the task queue items.

This solves a lot of problems.

- Small to medium jobs can be parallelized efficiently.
- No need to split big jobs perfectly.
- We don't get into a situation where we are waiting around for a worker to finish chugging through a huge task while the other workers sit idle.
- Worker memory footprint is tiny so we can afford many of them.
- Worker pool management is a well known problem.
- Worker spawn time disappears as a cost factor.
- The worker pool becomes a shared resource that can be managed and reported on and becomes considerably more predictable.
On Tue, Jan 13, 2015 at 4:55 PM, John Gorman <johngorman2@gmail.com> wrote:
>
>
>
> On Sun, Jan 11, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Sun, Jan 11, 2015 at 6:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
>> > So, for my 2c, I've long expected us to parallelize at the relation-file
>> > level for these kinds of operations. This goes back to my other
>> > thoughts on how we should be thinking about parallelizing inbound data
>> > for bulk data loads but it seems appropriate to consider it here also.
>> > One of the issues there is that 1G still feels like an awful lot for a
>> > minimum work size for each worker and it would mean we don't parallelize
>> > for relations less than that size.
>>
>> Yes, I think that's a killer objection.
>
>
> One approach that has worked well for me is to break big jobs into much smaller bite size tasks. Each task is small enough to complete quickly.
> Here we have to decide what should be the strategy and how much each worker should scan. As an example one of the strategies could be if the table size is X MB and there are 8 workers, then divide the work as X/8 MB for each worker (which I have currently used in patch) and another could be each worker does scan 1 block at a time and then check some global structure to see which next block it needs to scan, according to me this could lead to random scan. I have read that some other databases also divide the work based on partitions or segments (size of segment is not very clear).
> We add the tasks to a task queue and spawn a generic worker pool which eats through the task queue items.
>
> This solves a lot of problems.
>
> - Small to medium jobs can be parallelized efficiently.
> - No need to split big jobs perfectly.
> - We don't get into a situation where we are waiting around for a worker to finish chugging through a huge task while the other workers sit idle.
> - Worker memory footprint is tiny so we can afford many of them.
> - Worker pool management is a well known problem.
> - Worker spawn time disappears as a cost factor.
> - The worker pool becomes a shared resource that can be managed and reported on and becomes considerably more predictable.
> Yeah, it is a good idea to maintain a shared worker pool, but it seems to me that for the initial version, even if the workers are not shared, it is still meaningful to make parallel sequential scan work.
--
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jan 13, 2015 at 6:25 AM, John Gorman <johngorman2@gmail.com> wrote: > One approach that has worked well for me is to break big jobs into much > smaller bite size tasks. Each task is small enough to complete quickly. > > We add the tasks to a task queue and spawn a generic worker pool which eats > through the task queue items. > > This solves a lot of problems. > > - Small to medium jobs can be parallelized efficiently. > - No need to split big jobs perfectly. > - We don't get into a situation where we are waiting around for a worker to > finish chugging through a huge task while the other workers sit idle. > - Worker memory footprint is tiny so we can afford many of them. > - Worker pool management is a well known problem. > - Worker spawn time disappears as a cost factor. > - The worker pool becomes a shared resource that can be managed and reported > on and becomes considerably more predictable. I think this is a good idea, but for now I would like to keep our goals somewhat more modest: let's see if we can get parallel sequential scan, and only parallel sequential scan, working and committed. Ultimately, I think we may need something like what you're talking about, because if you have a query with three or six or twelve different parallelizable operations in it, you want the available CPU resources to switch between those as their respective needs may dictate. You certainly don't want to spawn a separate pool of workers for each scan. But I think getting that all working in the first version is probably harder than what we should attempt. We have a bunch of problems to solve here just around parallel sequential scan and the parallel mode infrastructure: heavyweight locking, prefetching, the cost model, and so on. Trying to add to that all of the problems that might attend on a generic task queueing infrastructure fills me with no small amount of fear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/13/15 9:42 PM, Amit Kapila wrote: > As an example one of the strategies > could be if the table size is X MB and there are 8 workers, then > divide the work as X/8 MB for each worker (which I have currently > used in patch) and another could be each worker does scan > 1 block at a time and then check some global structure to see which > next block it needs to scan, according to me this could lead to random > scan. I have read that some other databases also divide the work > based on partitions or segments (size of segment is not very clear). Long-term I think we'll want a mix between the two approaches. Simply doing something like blkno % num_workers is going to cause imbalances, but trying to do this on a per-block basis seems like too much overhead. Also long-term, I think we also need to look at a more specialized version of parallelism at the IO layer. For example, during an index scan you'd really like to get IO requests for heap blocks started in the background while the backend is focused on the mechanics of the index scan itself. While this could be done with the stuff Robert has written I have to wonder if it'd be a lot more efficient to use fadvise or AIO. Or perhaps it would just be better to deal with an entire index page (remembering TIDs) and then hit the heap. But I agree with Robert; there's a lot yet to be done just to get *any* kind of parallel execution working before we start thinking about how to optimize it. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
>
> On Wed, Jan 14, 2015 at 9:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Here we have to decide what should be the strategy and how much
>> each worker should scan. As an example one of the strategies
>> could be if the table size is X MB and there are 8 workers, then
>> divide the work as X/8 MB for each worker (which I have currently
>> used in patch) and another could be each worker does scan
>> 1 block at a time and then check some global structure to see which
>> next block it needs to scan, according to me this could lead to random
>> scan. I have read that some other databases also divide the work
>> based on partitions or segments (size of segment is not very clear).
>
>
> A block can contain useful tuples, i.e tuples which are visible and fulfil the quals + useless tuples i.e. tuples which are dead, invisible or that do not fulfil the quals. Depending upon the contents of these blocks, esp. the ratio of (useful tuples)/(unuseful tuples), even though we divide the relation into equal sized runs, each worker may take different time. So, instead of dividing the relation into number of run = number of workers, it might be better to divide them into fixed sized runs with size < (total number of blocks/ number of workers), and let a worker pick up a run after it finishes with the previous one. The smaller the size of runs the better load balancing but higher cost of starting with the run. So, we have to strike a balance.
>
I think your suggestion is good and it somewhat falls in line
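In dynamic shared memory, the fixed-size runs Ashutosh suggests might be claimed roughly as in the sketch below; the struct and function names are hypothetical, and only the spinlock primitives are the existing ones.

#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

typedef struct ParallelRunStateSketch
{
    slock_t     mutex;
    BlockNumber next_run_start; /* next unclaimed block */
    BlockNumber nblocks;        /* total blocks in the relation */
    BlockNumber run_size;       /* e.g. nblocks / (nworkers * 4) */
} ParallelRunStateSketch;

/* Returns false when no blocks remain; else sets [*start, *end). */
static bool
claim_next_run(ParallelRunStateSketch *rs,
               BlockNumber *start, BlockNumber *end)
{
    bool    found = false;

    SpinLockAcquire(&rs->mutex);
    if (rs->next_run_start < rs->nblocks)
    {
        *start = rs->next_run_start;
        *end = Min(*start + rs->run_size, rs->nblocks);
        rs->next_run_start = *end;
        found = true;
    }
    SpinLockRelease(&rs->mutex);
    return found;
}

Smaller run_size gives better load balancing at the price of more trips through the mutex, which is the balance Ashutosh mentions.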
>
> On Sat, Jan 10, 2015 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I don't think you should be "switching" queues. The tuples should be
> >> sent to the tuple queue, and errors and notices to the error queue.
> > To achieve what you said (The tuples should be sent to the tuple
> > queue, and errors and notices to the error queue.), we need to
> > switch the queues.
> > The difficulty here is that once we set the queue (using
> > pq_redirect_to_shm_mq()) through which the communication has to
> > happen, it will use the same unless we change again the queue
> > using pq_redirect_to_shm_mq(). For example, assume we have
> > initially set error queue (using pq_redirect_to_shm_mq()) then to
> > send tuples, we need to call pq_redirect_to_shm_mq() to
> > set the tuple queue as the queue that needs to be used for communication
> > and again if error happens then we need to do the same for error
> > queue.
> > Do you have any other idea to achieve the same?
>
> Yeah, you need two separate global variables pointing to shm_mq
> objects, one of which gets used by pqmq.c for errors and the other of
> which gets used by printtup.c for tuples.
>
Okay, I will try to change the way as suggested without doing switching, but this way we need to do it separately for 'T', 'D', and 'C' messages.
On Wed, Jan 14, 2015 at 9:00 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > Simply doing > something like blkno % num_workers is going to cause imbalances, Yes. > but trying > to do this on a per-block basis seems like too much overhead. ...but no. Or at least, I doubt it. The cost of handing out blocks one at a time is that, for each block, a worker's got to grab a spinlock, increment and record the block number counter, and release the spinlock. Or, use an atomic add. Now, it's true that spinlock cycles and atomic ops can sometimes impose severe overhead, but you have to look at it as a percentage of the overall work being done. In this case, the backend has to read, pin, and lock the page and process every tuple on the page. Processing every tuple on the page may involve de-TOASTing the tuple (leading to many more page accesses), or evaluating a complex expression, or hitting CLOG to check visibility, but even if it doesn't, I think the amount of work that it takes to process all the tuples on the page will be far larger than the cost of one atomic increment operation per block. As mentioned downthread, a far bigger consideration is the I/O pattern we create. A sequential scan is so-called because it reads the relation sequentially. If we destroy that property, we will be more than slightly sad. It might be OK to do sequential scans of, say, each 1GB segment separately, but I'm pretty sure it would be a real bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ... What I'm thinking about is that we might have something like this: struct this_lives_in_dynamic_shared_memory { BlockNumber last_block; Size prefetch_distance; Size prefetch_increment; slock_t mutex; BlockNumber next_prefetch_block; BlockNumber next_scan_block; }; Each worker takes the mutex and checks whether next_prefetch_block - next_scan_block < prefetch_distance and also whether next_prefetch_block < last_block. If both are true, it prefetches some number of additional blocks, as specified by prefetch_increment. Otherwise, it increments next_scan_block and scans the block corresponding to the old value. So in this way, the prefetching runs ahead of the scan by a configurable amount (prefetch_distance), which should be chosen so that the prefetches have time to complete before the scan actually reaches those blocks. Right now, of course, we rely on the operating system to prefetch for sequential scans, but I have a strong hunch that may not work on all systems if there are multiple processes doing the reads. Now, what of other strategies like dividing up the relation into 1GB chunks and reading each one in a separate process? We could certainly DO that, but what advantage does it have over this? The only benefit I can see is that you avoid accessing a data structure of the type shown above for every block, but that only matters if that cost is material, and I tend to think it won't be. On the flip side, it means that the granularity for dividing up work between processes is now very coarse - when there are less than 6GB of data left in a relation, at most 6 processes can work on it. That might be OK if the data is being read in from disk anyway, but it's certainly not the best we can do when the data is in memory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
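For concreteness, here is a sketch of the per-worker logic over Robert's proposed structure. PrefetchBuffer() is the existing bufmgr call; the function name and the convention of returning InvalidBlockNumber after a prefetch round are hypothetical.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/spin.h"

struct this_lives_in_dynamic_shared_memory
{
    BlockNumber last_block;
    Size        prefetch_distance;
    Size        prefetch_increment;
    slock_t     mutex;
    BlockNumber next_prefetch_block;
    BlockNumber next_scan_block;
};

/* Returns the next block to scan, or InvalidBlockNumber after doing
 * a round of prefetching (the caller simply calls again). */
static BlockNumber
next_block_with_prefetch(struct this_lives_in_dynamic_shared_memory *ps,
                         Relation rel)
{
    BlockNumber result = InvalidBlockNumber;

    SpinLockAcquire(&ps->mutex);
    if (ps->next_prefetch_block - ps->next_scan_block < ps->prefetch_distance &&
        ps->next_prefetch_block < ps->last_block)
    {
        /* This worker becomes the prefetcher for one increment. */
        BlockNumber first = ps->next_prefetch_block;
        BlockNumber i;
        BlockNumber count = Min(ps->prefetch_increment,
                                ps->last_block - first);

        ps->next_prefetch_block += count;
        SpinLockRelease(&ps->mutex);

        for (i = 0; i < count; i++)
            PrefetchBuffer(rel, MAIN_FORKNUM, first + i);
        return result;
    }
    if (ps->next_scan_block < ps->last_block)
        result = ps->next_scan_block++;
    SpinLockRelease(&ps->mutex);

    return result;
}

The mutex covers both counters, which matches Robert's clarification downthread; a worker that just did a round of prefetching simply calls the function again to claim a scan block.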
>
> As mentioned downthread, a far bigger consideration is the I/O pattern
> we create. A sequential scan is so-called because it reads the
> relation sequentially. If we destroy that property, we will be more
> than slightly sad. It might be OK to do sequential scans of, say,
> each 1GB segment separately, but I'm pretty sure it would be a real
> bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ...
>
> What I'm thinking about is that we might have something like this:
>
> struct this_lives_in_dynamic_shared_memory
> {
> BlockNumber last_block;
> Size prefetch_distance;
> Size prefetch_increment;
> slock_t mutex;
> BlockNumber next_prefetch_block;
> BlockNumber next_scan_block;
> };
>
> Each worker takes the mutex and checks whether next_prefetch_block -
> next_scan_block < prefetch_distance and also whether
> next_prefetch_block < last_block. If both are true, it prefetches
> some number of additional blocks, as specified by prefetch_increment.
> Otherwise, it increments next_scan_block and scans the block
> corresponding to the old value.
>
On Fri, Jan 16, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Jan 16, 2015 at 11:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> As mentioned downthread, a far bigger consideration is the I/O pattern >> we create. A sequential scan is so-called because it reads the >> relation sequentially. If we destroy that property, we will be more >> than slightly sad. It might be OK to do sequential scans of, say, >> each 1GB segment separately, but I'm pretty sure it would be a real >> bad idea to read 8kB at a time at blocks 0, 64, 128, 1, 65, 129, ... >> >> What I'm thinking about is that we might have something like this: >> >> struct this_lives_in_dynamic_shared_memory >> { >> BlockNumber last_block; >> Size prefetch_distance; >> Size prefetch_increment; >> slock_t mutex; >> BlockNumber next_prefetch_block; >> BlockNumber next_scan_block; >> }; >> >> Each worker takes the mutex and checks whether next_prefetch_block - >> next_scan_block < prefetch_distance and also whether >> next_prefetch_block < last_block. If both are true, it prefetches >> some number of additional blocks, as specified by prefetch_increment. >> Otherwise, it increments next_scan_block and scans the block >> corresponding to the old value. > > Assuming we will increment next_prefetch_block only after prefetching > blocks (equivalent to prefetch_increment), won't 2 workers > simultaneously see the same value for next_prefetch_block and try to > perform prefetch for the same blocks? The idea is that you can only examine and modify next_prefetch_block or next_scan_block while holding the mutex. > What will be the value of prefetch_increment? > Will it be equal to prefetch_distance or prefetch_distance/2 or > prefetch_distance/4 or .. or will it be totally unrelated to > prefetch_distance? I dunno, that might take some experimentation. prefetch_distance/2 doesn't sound stupid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Fri, Jan 16, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Assuming we will increment next_prefetch_block only after prefetching
> > blocks (equivalent to prefetch_increment), won't 2 workers
> > simultaneously see the same value for next_prefetch_block and try to
> > perform prefetch for the same blocks?
>
> The idea is that you can only examine and modify next_prefetch_block
> or next_scan_block while holding the mutex.
>
> > What will be the value of prefetch_increment?
> > Will it be equal to prefetch_distance or prefetch_distance/2 or
> > prefetch_distance/4 or .. or will it be totally unrelated to
> > prefetch_distance?
>
> I dunno, that might take some experimentation. prefetch_distance/2
> doesn't sound stupid.
>
On Mon, Jan 19, 2015 at 2:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Okay, I think I got the idea what you want to achieve via > prefetching. So assuming prefetch_distance = 100 and > prefetch_increment = 50 (prefetch_distance /2), it seems to me > that as soon as there are less than 100 blocks in prefetch quota, > it will fetch next 50 blocks which means the system will be always > approximately 50 blocks ahead, that will ensure that in this algorithm > it will always perform sequential scan, however eventually this is turning > to be a system where one worker is reading from disk and then other > workers are reading from OS buffers to shared buffers and then getting > the tuple. In this approach only one downside I can see and that is > there could be times during execution where some/all workers will have > to wait on the worker doing prefetching, however I think we should try > this approach and see how it works. Right. We probably want to make prefetch_distance a GUC. After all, we currently rely on the operating system for prefetching, and the operating system has a setting for this, at least on Linux (blockdev --getra). It's possible, however, that we don't need this at all, because the OS might be smart enough to figure it out for us. It's probably worth testing, though. > Another thing is that I think prefetching is not supported on all platforms > (Windows) and for such systems as per above algorithm we need to > rely on block-by-block method. Well, I think we should try to set up a test to see if this is hurting us. First, do a sequential scan of a relation at least twice as large as RAM. Then, do a parallel sequential scan of the same relation with 2 workers. Repeat these in alternation several times. If the operating system is accomplishing meaningful readahead, and the parallel sequential scan is breaking it, then since the test is I/O-bound I would expect to see the parallel scan actually being slower than the normal way. Or perhaps there is some other test that would be better (ideas welcome) but the point is we may need something like this, but we should try to figure out whether we need it before spending too much time on it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
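If prefetch_distance did become a GUC, the knob itself is straightforward. The sketch below uses the extension-style DefineCustomIntVariable() API just to show its shape; the name and defaults are hypothetical, and a core GUC would instead be added to guc.c's tables.

#include "postgres.h"
#include <limits.h>
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* Hypothetical: how far (in blocks) prefetching may run ahead. */
static int parallel_prefetch_distance = 256;

void
_PG_init(void)
{
    DefineCustomIntVariable("parallel_seqscan.prefetch_distance",
                            "Blocks to prefetch ahead of a parallel sequential scan.",
                            NULL,
                            &parallel_prefetch_distance,
                            256, 0, INT_MAX,
                            PGC_USERSET,
                            0,
                            NULL, NULL, NULL);
}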
On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Yeah, you need two separate global variables pointing to shm_mq
> > objects, one of which gets used by pqmq.c for errors and the other of
> > which gets used by printtup.c for tuples.
> >
>
> Okay, I will try to change the way as suggested without doing
> switching, but this way we need to do it separately for 'T', 'D', and
> 'C' messages.
>
I have taken care of integrating the parallel sequence scan with the latest patch posted (parallel-mode-v1.patch) by Robert at below location:
http://www.postgresql.org/message-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com

Changes in this version
-----------------------------------------------
1. As mentioned previously, I have exposed one parameter ParallelWorkerNumber as used in parallel-mode patch.
2. Enabled tuple queue to be used for passing tuples from worker backend to master backend along with error queue as per suggestion by Robert in the mail above.
3. Involved master backend to scan the heap directly when tuples are not available in any shared memory tuple queue.
4. Introduced 3 new parameters (cpu_tuple_comm_cost, parallel_setup_cost, parallel_startup_cost) for deciding the cost of parallel plan. Currently, I have kept the default values for parallel_setup_cost and parallel_startup_cost as 0.0, as those require some experiments.
5. Fixed some issues (related to memory increase as reported upthread by Thom Brown and general feature issues found during test)

Note - I have yet to handle the new node types introduced at some of the places and need to verify prepared queries and some other things, however I think it will be good if I can get some feedback at current stage.
Attachment
thom@swift:~/Development/postgresql$ patch -p1 < ~/Downloads/parallel_seqscan_v4.patch
patching file src/backend/access/Makefile
patching file src/backend/access/common/printtup.c
patching file src/backend/access/shmmq/Makefile
patching file src/backend/access/shmmq/shmmqam.c
patching file src/backend/commands/explain.c
Hunk #1 succeeded at 721 (offset 8 lines).
Hunk #2 succeeded at 918 (offset 8 lines).
Hunk #3 succeeded at 1070 (offset 8 lines).
Hunk #4 succeeded at 1337 (offset 8 lines).
Hunk #5 succeeded at 2239 (offset 83 lines).
patching file src/backend/executor/Makefile
patching file src/backend/executor/execProcnode.c
patching file src/backend/executor/execScan.c
patching file src/backend/executor/execTuples.c
patching file src/backend/executor/nodeParallelSeqscan.c
patching file src/backend/executor/nodeSeqscan.c
patching file src/backend/libpq/pqmq.c
Hunk #1 succeeded at 23 with fuzz 2 (offset -3 lines).
Hunk #2 FAILED at 63.
Hunk #3 succeeded at 132 (offset -31 lines).
1 out of 3 hunks FAILED -- saving rejects to file src/backend/libpq/pqmq.c.rej
patching file src/backend/optimizer/path/Makefile
patching file src/backend/optimizer/path/allpaths.c
patching file src/backend/optimizer/path/costsize.c
patching file src/backend/optimizer/path/parallelpath.c
patching file src/backend/optimizer/plan/createplan.c
patching file src/backend/optimizer/plan/planner.c
patching file src/backend/optimizer/plan/setrefs.c
patching file src/backend/optimizer/util/pathnode.c
patching file src/backend/postmaster/Makefile
patching file src/backend/postmaster/backendworker.c
patching file src/backend/postmaster/postmaster.c
patching file src/backend/tcop/dest.c
patching file src/backend/tcop/postgres.c
Hunk #1 succeeded at 54 (offset -1 lines).
Hunk #2 succeeded at 1132 (offset -1 lines).
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/misc/postgresql.conf.sample
can't find file to patch at input line 2105
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
|index 761ba1f..00ad468 100644
|--- a/src/include/access/parallel.h
|+++ b/src/include/access/parallel.h
--------------------------
File to patch:
>
> On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Jan 15, 2015 at 6:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > On Mon, Jan 12, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > >
>> > > Yeah, you need two separate global variables pointing to shm_mq
>> > > objects, one of which gets used by pqmq.c for errors and the other of
>> > > which gets used by printtup.c for tuples.
>> > >
>> >
>> > Okay, I will try to change the way as suggested without doing
>> > switching, but this way we need to do it separately for 'T', 'D', and
>> > 'C' messages.
>> >
>>
>> I have taken care of integrating the parallel sequence scan with the
>> latest patch posted (parallel-mode-v1.patch) by Robert at below
>> location:
>> http://www.postgresql.org/message-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com
>>
>> Changes in this version
>> -----------------------------------------------
>> 1. As mentioned previously, I have exposed one parameter
>> ParallelWorkerNumber as used in parallel-mode patch.
>> 2. Enabled tuple queue to be used for passing tuples from
>> worker backend to master backend along with error queue
>> as per suggestion by Robert in the mail above.
>> 3. Involved master backend to scan the heap directly when
>> tuples are not available in any shared memory tuple queue.
>> 4. Introduced 3 new parameters (cpu_tuple_comm_cost,
>> parallel_setup_cost, parallel_startup_cost) for deciding the cost
>> of parallel plan. Currently, I have kept the default values for
>> parallel_setup_cost and parallel_startup_cost as 0.0, as those
>> require some experiments.
>> 5. Fixed some issues (related to memory increase as reported
>> upthread by Thom Brown and general feature issues found during
>> test)
>>
>> Note - I have yet to handle the new node types introduced at some
>> of the places and need to verify prepared queries and some other
>> things, however I think it will be good if I can get some feedback
>> at current stage.
>
>
> Which commit is this based against? I'm getting errors with the latest master:
>
On Tue, Jan 20, 2015 at 9:43 PM, Thom Brown <thom@linux.com> wrote:
> On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have taken care of integrating the parallel sequence scan with the
>> latest patch posted (parallel-mode-v1.patch) by Robert at below
>> location:
>> http://www.postgresql.org/message-id/CA+TgmoZdUK4K3XHBxc9vM-82khourEZdvQWTfgLhWsd2R2aAGQ@mail.gmail.com
>> [snip]
>
> Which commit is this based against? I'm getting errors with the latest master:
It seems to me that you have not applied the parallel-mode patch before applying this patch; can you try once again by first applying the patch posted by Robert at the link quoted above?

commit-id used for this patch - 0b49642
Thanks.
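Spelled out, the apply sequence on a fresh checkout would look something like the following; only the commit id and parallel-mode-v1.patch come from the mails above, and the seq scan patch's file name is assumed:

    $ git checkout 0b49642
    $ patch -p1 < parallel-mode-v1.patch      # Robert's patch, linked above
    $ patch -p1 < parallel_seqscan_v4.patch   # Amit's patch; name assumed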
➤ psql://thom@[local]:5488/pgbench
# set parallel_seqscan_degree = 8;
SET
Time: 0.248 ms
➤ psql://thom@[local]:5488/pgbench
# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..21.22 rows=100 width=4)
Number of Workers: 8
Number of Blocks Per Worker: 11
(3 rows)
Time: 0.322 ms
# explain analyse select c1 from t1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..21.22 rows=100 width=4) (actual time=0.024..13.468 rows=100 loops=1)
Number of Workers: 8
Number of Blocks Per Worker: 11
Planning time: 0.040 ms
Execution time: 13.862 ms
(5 rows)
Time: 14.188 ms
➤ psql://thom@[local]:5488/pgbench
# set parallel_seqscan_degree = 10;
SET
Time: 0.219 ms
➤ psql://thom@[local]:5488/pgbench
# explain select c1 from t1;
QUERY PLAN
--------------------------------------------------------------
Parallel Seq Scan on t1 (cost=0.00..19.18 rows=100 width=4)
Number of Workers: 10
Number of Blocks Per Worker: 9
(3 rows)
Time: 0.375 ms
➤ psql://thom@[local]:5488/pgbench
# explain analyse select c1 from t1;
# explain (analyse, buffers, timing) select distinct bid from pgbench_accounts;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1400411.11..1400412.11 rows=100 width=4) (actual time=8504.333..8504.335 rows=13 loops=1)
Group Key: bid
Buffers: shared hit=32 read=18183
-> Parallel Seq Scan on pgbench_accounts (cost=0.00..1375411.11 rows=10000000 width=4) (actual time=0.054..7183.494 rows=10000000 loops=1)
Number of Workers: 8
Number of Blocks Per Worker: 18215
Buffers: shared hit=32 read=18183
Planning time: 0.058 ms
Execution time: 8876.967 ms
(9 rows)
Time: 8877.366 ms
On 1/19/15 7:20 AM, Robert Haas wrote:
>> Another thing is that I think prefetching is not supported on all platforms
>> (Windows) and for such systems as per above algorithm we need to
>> rely on block-by-block method.
>
> Well, I think we should try to set up a test to see if this is hurting
> us. First, do a sequential scan of a relation at least twice
> as large as RAM. Then, do a parallel sequential scan of the same
> relation with 2 workers. Repeat these in alternation several times.
> If the operating system is accomplishing meaningful readahead, and the
> parallel sequential scan is breaking it, then since the test is
> I/O-bound I would expect to see the parallel scan actually being
> slower than the normal way.
>
> Or perhaps there is some other test that would be better (ideas
> welcome) but the point is we may need something like this, but we
> should try to figure out whether we need it before spending too much
> time on it.

I'm guessing that not all supported platforms have prefetching that actually helps us... but it would be good to actually know if that's the case.

Where I think this gets a lot more interesting is if we could apply this to an index scan. My thought is that would result in one worker mostly being responsible for advancing the index scan itself while the other workers were issuing (and waiting on) heap IO. So even if this doesn't turn out to be a win for seqscan, there are other places we might well want to use it.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
>
> On 20 January 2015 at 14:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [snip]
>
>
> I'm getting an issue:
>
>
>
> # set parallel_seqscan_degree = 10;
> SET
> Time: 0.219 ms
>
> ➤ psql://thom@[local]:5488/pgbench
>
>
> ➤ psql://thom@[local]:5488/pgbench
>
> # explain analyse select c1 from t1;
>
>
> So setting parallel_seqscan_degree above max_worker_processes causes the CPU to max out, and the query never returns, or at least not after waiting 2 minutes. Shouldn't it have a ceiling of max_worker_processes?
>
> The original test I performed where I was getting OOM errors now appears to be fine:
>
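Until the patch enforces the ceiling Thom asks about, the fix would presumably be a one-line clamp wherever the worker count is chosen; a sketch using the two GUCs discussed in this thread (illustrative, not code from the patch; Min() is the existing macro from c.h):

	/* Never plan for more workers than the cluster can actually start. */
	int		num_workers = Min(parallel_seqscan_degree, max_worker_processes);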
On 20-01-2015 PM 11:29, Amit Kapila wrote:
> [snip]
> Note - I have yet to handle the new node types introduced at some
> of the places and need to verify prepared queries and some other
> things, however I think it will be good if I can get some feedback
> at current stage.

I got an assertion failure:

In src/backend/executor/execTuples.c: ExecStoreTuple()

	/* passing shouldFree=true for a tuple on a disk page is not sane */
	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

when called from:

In src/backend/executor/nodeParallelSeqscan.c: ParallelSeqNext()

I think something like the following would be necessary (reading from comments in the code):

--- a/src/backend/executor/nodeParallelSeqscan.c
+++ b/src/backend/executor/nodeParallelSeqscan.c
@@ -85,7 +85,7 @@ ParallelSeqNext(ParallelSeqScanState *node)
 	if (tuple)
 		ExecStoreTuple(tuple, slot,
-					   scandesc->rs_cbuf,
+					   fromheap ? scandesc->rs_cbuf : InvalidBuffer,
 					   !fromheap);

After fixing this, the assertion failure seems to be gone though I observed the blocked (CPU maxed out) state as reported elsewhere by Thom Brown.

What I was doing:

CREATE TABLE test(a) AS SELECT generate_series(1, 10000000);

postgres=# SHOW max_worker_processes;
 max_worker_processes
----------------------
 8
(1 row)

postgres=# SET seq_page_cost TO 100;
SET
postgres=# SET parallel_seqscan_degree TO 4;
SET
postgres=# EXPLAIN SELECT * FROM test;
                                QUERY PLAN
-------------------------------------------------------------------------
 Parallel Seq Scan on test  (cost=0.00..1801071.27 rows=8981483 width=4)
   Number of Workers: 4
   Number of Blocks Per Worker: 8849
(3 rows)

Though, EXPLAIN ANALYZE caused the thing.

Thanks,
Amit
On Wed, Jan 21, 2015 at 12:47 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 20-01-2015 PM 11:29, Amit Kapila wrote:
> > Note - I have yet to handle the new node types introduced at some
> > of the places and need to verify prepared queries and some other
> > things, however I think it will be good if I can get some feedback
> > at current stage.
> >
>
> I got an assertion failure:
>
> In src/backend/executor/execTuples.c: ExecStoreTuple()
>
> /* passing shouldFree=true for a tuple on a disk page is not sane */
> Assert(BufferIsValid(buffer) ? (!shouldFree) : true);
Good catch! The reason is that while the master backend is scanning from a heap page, if it finds another tuple (or tuples) in the shared memory message queue it will process those tuples first, and in such a scenario the scan descriptor will still hold a reference to the buffer it is using to scan the heap. Your proposed fix will work.
> After fixing this, the assertion failure seems to be gone though I
> observed the blocked (CPU maxed out) state as reported elsewhere by Thom
> Brown.
Does it happen only when parallel_seqscan_degree > max_worker_processes?
> On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>> Does it happen only when parallel_seqscan_degree > max_worker_processes?
>
>
> I have max_worker_processes set to the default of 8 while parallel_seqscan_degree is 4. So, this may be a case different from Thom's.
>
I think this is due to reason that memory for forming tuple in master backend is retained for longer time which is causing this statement to take much longer time than required. I have fixed the other issue as well reported by you in attached patch.
Attachment
> On Wed, Jan 21, 2015 at 4:31 PM, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> Does it happen only when parallel_seqscan_degree > max_worker_processes?
>>
>> I have max_worker_processes set to the default of 8 while
>> parallel_seqscan_degree is 4. So, this may be a case different from Thom's.
>
> I think this is due to reason that memory for forming tuple in master backend
> is retained for longer time which is causing this statement to take much
> longer time than required. I have fixed the other issue as well reported
> by you in attached patch.
>
> I think this patch is still not completely ready for general purpose testing,
> [snip]

(Please point out to me if my understanding is incorrect.)

What happens if a dynamic background worker process tries to reference temporary tables? Because buffers of temporary table blocks are allocated in private address space, their recent status is not visible to other processes unless flushed to storage every time.

Do we need to prohibit create_parallelscan_paths() from generating a path when the target relation is a temporary one?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
>
> (Please point out to me if my understanding is incorrect.)
>
> What happens if a dynamic background worker process tries to reference
> temporary tables? Because buffers of temporary table blocks are allocated
> in private address space, their recent status is not visible to other
> processes unless flushed to storage every time.
>
> Do we need to prohibit create_parallelscan_paths() from generating a path
> when the target relation is a temporary one?
>
Yes, we need to prohibit parallel scans on temporary relations. Will fix.
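For reference, the guard could be as small as the following sketch at the top of the patch's create_parallelscan_paths(); how the function reaches the Relation is assumed here, while planner_rt_fetch() and RelationUsesLocalBuffers() are the existing helpers:

	/*
	 * Sketch only: skip parallel path generation when the target relation
	 * is temporary, because its buffers live in backend-local memory that
	 * parallel workers cannot see.
	 */
	RangeTblEntry *rte = planner_rt_fetch(rel->relid, root);
	Relation	reln = heap_open(rte->relid, NoLock);
	bool		is_temp = RelationUsesLocalBuffers(reln);

	heap_close(reln, NoLock);
	if (is_temp)
		return;				/* no parallel seq scan path for temp tables */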
On 21-01-2015 PM 09:43, Amit Kapila wrote:
> [snip]
> I think this is due to reason that memory for forming
> tuple in master backend is retained for longer time which
> is causing this statement to take much longer time than
> required. I have fixed the other issue as well reported by
> you in attached patch.

Thanks for fixing.

> I think this patch is still not completely ready for general
> purpose testing, however it could be helpful if we can run
> some tests to see in what kind of scenario's it gives benefit
> [snip]

Perhaps you are aware or you've postponed working on it, but I see that a plan executing in a worker does not know about instrumentation. It results in the EXPLAIN ANALYZE showing incorrect figures. For example compare the normal seqscan and parallel seqscan below:

postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE sqrt(a) < 3456 AND md5(a::text) LIKE 'ac%';
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..310228.52 rows=16120 width=4) (actual time=0.497..17062.436 rows=39028 loops=1)
   Filter: ((sqrt((a)::double precision) < 3456::double precision) AND (md5((a)::text) ~~ 'ac%'::text))
   Rows Removed by Filter: 9960972
 Planning time: 0.206 ms
 Execution time: 17378.413 ms
(5 rows)

postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE sqrt(a) < 3456 AND md5(a::text) LIKE 'ac%';
                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Parallel Seq Scan on test  (cost=0.00..255486.08 rows=16120 width=4) (actual time=7.329..4906.981 rows=39028 loops=1)
   Filter: ((sqrt((a)::double precision) < 3456::double precision) AND (md5((a)::text) ~~ 'ac%'::text))
   Rows Removed by Filter: 1992710
   Number of Workers: 4
   Number of Blocks Per Worker: 8849
 Planning time: 0.137 ms
 Execution time: 6077.782 ms
(7 rows)

Note the "Rows Removed by Filter". I guess the difference may be because all the rows filtered by workers were not accounted for. I'm not quite sure, but since exec_worker_stmt goes the Portal way, QueryDesc.instrument_options remains unset and hence no instrumentation opportunities in a worker backend. One option may be to pass instrument_options down to worker_stmt?

By the way, 17s and 6s compare really well in favor of parallel seqscan above, :)

Thanks,
Amit
>
> On 21-01-2015 PM 09:43, Amit Kapila wrote:
> > On Wed, Jan 21, 2015 at 4:31 PM, Amit Langote <amitlangote09@gmail.com>
> > wrote:
> >> On Wednesday, January 21, 2015, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >>>
> >>>
> >>> Does it happen only when parallel_seqscan_degree > max_worker_processes?
> >>
> >>
> >> I have max_worker_processes set to the default of 8 while
> > parallel_seqscan_degree is 4. So, this may be a case different from Thom's.
> >>
> >
> > I think this is due to reason that memory for forming
> > tuple in master backend is retained for longer time which
> > is causing this statement to take much longer time than
> > required. I have fixed the other issue as well reported by
> > you in attached patch.
> >
>
> Thanks for fixing.
>
> > I think this patch is still not completely ready for general
> > purpose testing, however it could be helpful if we can run
> > some tests to see in what kind of scenario's it gives benefit
> > like in the test you are doing if rather than increasing
> > seq_page_cost, you should add an expensive WHERE condition
> > so that it should automatically select parallel plan. I think it is better
> > to change one of the new parameter's (parallel_setup_cost,
> > parallel_startup_cost and cpu_tuple_comm_cost) if you want
> > your statement to use parallel plan, like in your example if
> > you would have reduced cpu_tuple_comm_cost, it would have
> > selected parallel plan, that way we can get some feedback about
> > what should be the appropriate default values for the newly added
> > parameters. I am already planing to do some tests in that regard,
> > however if I get some feedback from other's that would be helpful.
> >
> >
>
> Perhaps you are aware or you've postponed working on it, but I see that
> a plan executing in a worker does not know about instrumentation.

I have deferred it until other main parts are stabilised/reviewed. Once that is done, we can take a call on what is best to do for instrumentation. Thom has reported the same as well upthread.
> Note the "Rows Removed by Filter". I guess the difference may be
> because all the rows filtered by workers were not accounted for. I'm
> not quite sure, but since exec_worker_stmt goes the Portal way,
> QueryDesc.instrument_options remains unset and hence no instrumentation
> opportunities in a worker backend. One option may be to pass
> instrument_options down to worker_stmt?

I think there is more to it; the master backend needs to process that information as well.
> By the way, 17s and 6s compare really well in favor of parallel seqscan
> above, :)
>
On 22-01-2015 PM 02:30, Amit Kapila wrote:
>> Perhaps you are aware or you've postponed working on it, but I see that
>> a plan executing in a worker does not know about instrumentation.
>
> I have deferred it until other main parts are stabilised/reviewed. Once
> that is done, we can take a call on what is best to do for instrumentation.
> Thom has reported the same as well upthread.

Ah, I missed Thom's report.

>> Note the "Rows Removed by Filter". I guess the difference may be
>> because all the rows filtered by workers were not accounted for. I'm
>> not quite sure, but since exec_worker_stmt goes the Portal way,
>> QueryDesc.instrument_options remains unset and hence no instrumentation
>> opportunities in a worker backend. One option may be to pass
>> instrument_options down to worker_stmt?
>
> I think there is more to it; the master backend needs to process that
> information as well.

I see.

Thanks,
Amit
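If instrument_options were carried over, the worker side of the change might be as small as the following sketch; worker_stmt is the structure the patch ships to workers, and the field shown here is hypothetical:

	/*
	 * Sketch only: copy the leader's instrumentation flags into the
	 * structure handed to each worker, so the worker's executor collects
	 * row counts (e.g. rows removed by a filter) that the leader can then
	 * aggregate into the EXPLAIN ANALYZE output.
	 */
	worker_stmt->instrument_options = queryDesc->instrument_options;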
>
> On Mon, Jan 19, 2015 at 2:24 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Another thing is that I think prefetching is not supported on all platforms
> > (Windows) and for such systems as per above algorithm we need to
> > rely on block-by-block method.
>
> Well, I think we should try to set up a test to see if this is hurting
> us. First, do a sequential scan of a relation at least twice
> as large as RAM. Then, do a parallel sequential scan of the same
> relation with 2 workers. Repeat these in alternation several times.
> If the operating system is accomplishing meaningful readahead, and the
> parallel sequential scan is breaking it, then since the test is
> I/O-bound I would expect to see the parallel scan actually being
> slower than the normal way.
>
I have taken some performance data as per the above discussion, comparing the block-by-block and fixed-chunk approaches. Test machine:
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Block-By-Block

No. of workers/Time (ms) |      0 |      2
Run-1                    | 267798 | 295051
Run-2                    | 276646 | 296665
Run-3                    | 281364 | 314952
Run-4                    | 290231 | 326243
Run-5                    | 288890 | 295684

Fixed-Chunks

No. of workers/Time (ms) |      0 |      2
Run-1                    | 286346 | 234037
Run-2                    | 250051 | 215111
Run-3                    | 255915 | 254934
Run-4                    | 263754 | 242228
Run-5                    | 251399 | 202581
Attachment
On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 1. Scanning block-by-block has negative impact on performance and
> I think it will degrade more if we increase parallel count as that can lead
> to more randomness.
>
> 2. Scanning in fixed chunks improves the performance. Increasing
> parallel count to a very large number might impact the performance,
> but I think we can have a lower bound below which we will not allow
> multiple processes to scan the relation.

I'm confused. Your actual test numbers seem to show that the performance with the block-by-block approach was slightly higher with parallelism than without, whereas the performance with the chunk-by-chunk approach was lower with parallelism than without, but the text quoted above, summarizing those numbers, says the opposite.

Also, I think testing with 2 workers is probably not enough. I think we should test with 8 or even 16.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
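As a rough illustration of what the two strategies being compared amount to, consider the following sketch; the shared-state structure and function names are invented for this example and are not from the patch:

	/*
	 * Block-by-block: every worker claims the next unscanned block, so
	 * consecutive blocks usually go to different workers and each
	 * worker's own access pattern becomes effectively random.
	 */
	static BlockNumber
	next_block_blockwise(ParallelScanShared *shared)	/* hypothetical struct */
	{
		BlockNumber blk;

		SpinLockAcquire(&shared->mutex);
		blk = shared->next_block++;
		SpinLockRelease(&shared->mutex);

		return blk;
	}

	/*
	 * Fixed chunks: split the relation once into one contiguous range per
	 * worker, so each worker still reads sequentially within its chunk.
	 */
	static void
	fixed_chunk_bounds(BlockNumber nblocks, int nworkers, int worker,
					   BlockNumber *start, BlockNumber *end)
	{
		BlockNumber per_worker = (nblocks + nworkers - 1) / nworkers;

		*start = per_worker * worker;
		*end = Min(*start + per_worker, nblocks);
	}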
>
> On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 1. Scanning block-by-block has negative impact on performance and
> > I think it will degrade more if we increase parallel count as that can lead
> > to more randomness.
> >
> > 2. Scanning in fixed chunks improves the performance. Increasing
> > parallel count to a very large number might impact the performance,
> > but I think we can have a lower bound below which we will not allow
> > multiple processes to scan the relation.
>
> I'm confused. Your actual test numbers seem to show that the
> performance with the block-by-block approach was slightly higher with
> parallelism than without, whereas the performance with the
> chunk-by-chunk approach was lower with parallelism than without, but
> the text quoted above, summarizing those numbers, says the opposite.
>
> Also, I think testing with 2 workers is probably not enough. I think
> we should test with 8 or even 16.
>
Sure, will do this and post the numbers.
On Thu, Jan 22, 2015 at 9:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I'm confused. Your actual test numbers seem to show that the
>> performance with the block-by-block approach was slightly higher with
>> parallelism than without, whereas the performance with the
>> chunk-by-chunk approach was lower with parallelism than without, but
>> the text quoted above, summarizing those numbers, says the opposite.
>
> Sorry for causing confusion; I should have been more explicit about
> explaining the numbers. Let me try again. The values in the columns are
> the time in milliseconds to complete the execution, so higher means it
> took more time. If you look at block-by-block, the time taken to complete
> the execution with 2 workers is more than with no workers, which means
> parallelism has degraded the performance.

*facepalm*

Oh, yeah, right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 01/22/2015 05:53 AM, Robert Haas wrote:
> Also, I think testing with 2 workers is probably not enough. I think
> we should test with 8 or even 16.

FWIW, based on my experience there will also be demand to use parallel query with 4 workers, particularly on AWS.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
>
> [snip]
> Also, I think testing with 2 workers is probably not enough. I think
> we should test with 8 or even 16.
>

Here is the performance data with more workers:
Block-By-Block | |||||||
No. of workers/Time (ms) | 0 | 2 | 4 | 8 | 16 | 24 | 32 |
Run-1 | 257851 | 287353 | 350091 | 330193 | 284913 | 338001 | 295057 |
Run-2 | 263241 | 314083 | 342166 | 347337 | 378057 | 351916 | 348292 |
Run-3 | 315374 | 334208 | 389907 | 340327 | 328695 | 330048 | 330102 |
Run-4 | 301054 | 312790 | 314682 | 352835 | 323926 | 324042 | 302147 |
Run-5 | 304547 | 314171 | 349158 | 350191 | 350468 | 341219 | 281315 |
Fixed-Chunks | |||||||
No. of workers/Time (ms) | 0 | 2 | 4 | 8 | 16 | 24 | 32 |
Run-1 | 250536 | 266279 | 251263 | 234347 | 87930 | 50474 | 35474 |
Run-2 | 249587 | 230628 | 225648 | 193340 | 83036 | 35140 | 9100 |
Run-3 | 234963 | 220671 | 230002 | 256183 | 105382 | 62493 | 27903 |
Run-4 | 239111 | 245448 | 224057 | 189196 | 123780 | 63794 | 24746 |
Run-5 | 239937 | 222820 | 219025 | 220478 | 114007 | 77965 | 39766 |
On 1/23/15 5:42 AM, Amit Kapila wrote:
> Fixed-Chunks
>
> No. of workers/Time (ms) |      0 |      2 |      4 |      8 |     16 |    24 |    32
> Run-1                    | 250536 | 266279 | 251263 | 234347 |  87930 | 50474 | 35474
> Run-2                    | 249587 | 230628 | 225648 | 193340 |  83036 | 35140 |  9100
> Run-3                    | 234963 | 220671 | 230002 | 256183 | 105382 | 62493 | 27903
> Run-4                    | 239111 | 245448 | 224057 | 189196 | 123780 | 63794 | 24746
> Run-5                    | 239937 | 222820 | 219025 | 220478 | 114007 | 77965 | 39766
>
> The trend remains same although there is some variation.
> In the block-by-block approach, the performance dips (execution takes
> more time) with more workers, though it stabilizes at some higher
> value; still I feel it is random as it leads to random scan.
> In the fixed-chunk approach, the performance improves with more
> workers, especially at slightly higher worker counts.

Those fixed-chunk numbers look pretty screwy. 2, 4 and 8 workers make no difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time by another 1/2 to 1/3?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 01/23/2015 10:44 AM, Jim Nasby wrote:
>> [snip] the performance improves with more
>> workers, especially at slightly higher worker counts.
>
> Those fixed-chunk numbers look pretty screwy. 2, 4 and 8 workers make no
> difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time
> by another 1/2 to 1/3?

Cached? First couple of runs gets the relations into memory?

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should not be
surprised when they come back as Romans."
>
>
> On 01/23/2015 10:44 AM, Jim Nasby wrote:
>> [snip] the performance improves with more
>> workers, especially at slightly higher worker counts.
>>
>> Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no
>> difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time
>> by another 1/2 to 1/3?
>
>
> Cached? First couple of runs gets the relations into memory?
>
On 1/23/15 10:16 PM, Amit Kapila wrote:
> Further, if we want to just get the benefit of parallel I/O, then
> I think we can get that by parallelising partition scan where different
> table partitions reside on different disk partitions, however that is
> a matter of separate patch.

I don't think we even have to go that far.

My experience with Postgres is that it is *very* sensitive to IO latency (not bandwidth). I believe this is the case because complex queries tend to interleave CPU-intensive code in-between IO requests. So we see this pattern:

Wait 5ms on IO
Compute for a few ms
Wait 5ms on IO
Compute for a few ms
...

We blindly assume that the kernel will magically do read-ahead for us, but I've never seen that work so great. It certainly falls apart on something like an index scan.

If we could instead do this:

Wait for first IO, issue second IO request
Compute
Already have second IO request, issue third
...

We'd be a lot less sensitive to IO latency.

I wonder what kind of gains we would see if every SeqScan in a query spawned a worker just to read tuples and shove them in a queue (or shove a pointer to a buffer in the queue). Similarly, have IndexScans have one worker reading the index and another worker taking index tuples and reading heap tuples...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 1/23/15 10:16 PM, Amit Kapila wrote:
>> Further, if we want to just get the benefit of parallel I/O, then
>> I think we can get that by parallelising partition scan where different
>> table partitions reside on different disk partitions, however that is
>> a matter of separate patch.

> I don't think we even have to go that far.

> My experience with Postgres is that it is *very* sensitive to IO latency
> (not bandwidth). I believe this is the case because complex queries tend
> to interleave CPU-intensive code in-between IO requests. So we see this
> pattern:

> Wait 5ms on IO
> Compute for a few ms
> Wait 5ms on IO
> Compute for a few ms
> ...

> We blindly assume that the kernel will magically do read-ahead for us,
> but I've never seen that work so great. It certainly falls apart on
> something like an index scan.

> If we could instead do this:

> Wait for first IO, issue second IO request
> Compute
> Already have second IO request, issue third
> ...

> We'd be a lot less sensitive to IO latency.

It would take about five minutes of coding to prove or disprove this: stick a PrefetchBuffer call into heapgetpage() to launch a request for the next page as soon as we've read the current one, and then see if that makes any obvious performance difference. I'm not convinced that it will, but if it did then we could think about how to make it work for real.

			regards, tom lane
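For anyone who wants to try Tom's five-minute experiment, the change would look roughly like this against the 9.4-era heapgetpage() in src/backend/access/heap/heapam.c; the context lines are approximate, and PrefetchBuffer() is a no-op unless the build defines USE_PREFETCH:

 	/* read page using selected strategy */
 	scan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
 									   RBM_NORMAL, scan->rs_strategy);
 	scan->rs_cblock = page;
+
+#ifdef USE_PREFETCH
+	/* experiment: hint the OS about the next block of this scan */
+	if (page + 1 < scan->rs_nblocks)
+		PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, page + 1);
+#endif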
>
> On 1/23/15 10:16 PM, Amit Kapila wrote:
>>
>> Further, if we want to just get the benefit of parallel I/O, then
>> I think we can get that by parallelising partition scan where different
>> table partitions reside on different disk partitions, however that is
>> a matter of separate patch.
>
>
> I don't think we even have to go that far.
>
>
> We'd be a lot less sensitive to IO latency.
>
> I wonder what kind of gains we would see if every SeqScan in a query spawned a worker just to read tuples and shove them in a queue (or shove a pointer to a buffer in the queue).
>
Here IIUC, you want to say that just get the read done by one parallel worker and then do all expression calculation (evaluation of qualification and target list) in the main backend. It seems to me that by doing it that way, the benefit of parallelisation will be lost due to tuple communication overhead (maybe the overhead is less if we just pass a pointer to buffer, but that will have other kinds of problems like holding buffer pins for a longer period of time).

I could see the advantage of testing on lines as suggested by Tom Lane, but that seems to be not directly related to what we want to achieve by this patch (parallel seq scan); or if you think otherwise then let me know?
Hi PG devs!

Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Wait for first IO, issue second IO request
>> Compute
>> Already have second IO request, issue third
>> ...
>
>> We'd be a lot less sensitive to IO latency.
>
> It would take about five minutes of coding to prove or disprove this:
> stick a PrefetchBuffer call into heapgetpage() to launch a request for the
> next page as soon as we've read the current one, and then see if that
> makes any obvious performance difference. I'm not convinced that it will,
> but if it did then we could think about how to make it work for real.

Sorry for dropping in so late...

I have done all this two years ago. For TPC-H Q8, Q9, Q17, Q20, and Q21 I see a speedup of ~100% when using IndexScan prefetching + Nested-Loops Look-Ahead (the outer loop!). (On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Regards,
Daniel
--
MSc. Daniel Bausch
Research Assistant (Computer Science)
Technische Universität Darmstadt
http://www.dvs.tu-darmstadt.de/staff/dbausch
On Tue, Jan 27, 2015 at 08:02:37AM +0100, Daniel Bausch wrote:
> [snip]
> I have done all this two years ago. For TPC-H Q8, Q9, Q17, Q20, and Q21
> I see a speedup of ~100% when using IndexScan prefetching + Nested-Loops
> Look-Ahead (the outer loop!).
> (On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Would you be so kind as to pass along any patches (ideally applicable to git master), tests, and specific measurements you made?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter  Skype: davidfetter
XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Script used to test is attached (parallel_count.sh)

Why does this use EXPLAIN ANALYZE instead of \timing ?

> IBM POWER-7 16 cores, 64 hardware threads
> RAM = 64GB
>
> Table Size - 120GB
>
> Used below statements to create table -
> create table tbl_perf(c1 int, c2 char(1000));
> insert into tbl_perf values(generate_series(1,10000000),'aaaaa');
> insert into tbl_perf values(generate_series(10000001,30000000),'aaaaa');
> insert into tbl_perf values(generate_series(30000001,110000000),'aaaaa');

I generated this table using this same method and experimented with copying the whole file to the bit bucket using dd. I did this on hydra, which I think is the same machine you used.

time for i in `seq 0 119`; do if [ $i -eq 0 ]; then f=16388; else f=16388.$i; fi; dd if=$f of=/dev/null bs=8k; done

There is a considerable amount of variation in the amount of time this takes to run based on how much of the relation is cached. Clearly, there's no way for the system to cache it all, but it can cache a significant portion, and that affects the results to no small degree. dd on hydra prints information on the data transfer rate; on uncached 1GB segments, it runs at right around 400 MB/s, but that can soar to upwards of 3GB/s when the relation is fully cached. I tried flushing the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that immediately after doing that, the above command took 5m21s to run - i.e. ~321000 ms. Most of your test times are faster than that, which means they reflect some degree of caching.

When I immediately reran the command a second time, it finished in 4m18s, or ~258000 ms. The rate was the same as the first test - about 400 MB/s - for most of the files, but 27 of the last 28 files went much faster, between 1.3 GB/s and 3.7 GB/s. This tells us that the OS cache on this machine has anti-spoliation logic in it, probably not dissimilar to what we have in PG. If the data were cycled through the system cache in strict LRU fashion, any data that was left over from the first run would have been flushed out by the early part of the second run, so that all the results from the second set of runs would have hit the disk. But in fact, that's not what happened: the last pages from the first run remained cached even after reading an amount of new data that exceeds the size of RAM on that machine.

What I think this demonstrates is that we're going to have to be very careful to control for caching effects, or we may find that we get misleading results. To make this simpler, I've installed a setuid binary /usr/bin/drop_caches that you (or anyone who has an account on that machine) can use to drop the caches; run 'drop_caches 1'.

> Block-By-Block
>
> No. of workers/Time (ms) |      0 |      2
> Run-1                    | 267798 | 295051
> Run-2                    | 276646 | 296665
> Run-3                    | 281364 | 314952
> Run-4                    | 290231 | 326243
> Run-5                    | 288890 | 295684

The next thing I did was run the test with the block-by-block method after having dropped the caches. I did this with 0 workers and with 8 workers. I dropped the caches and restarted postgres before each test, but then ran each test a second time to see the effect of caching by both the OS and by PostgreSQL. I got these results:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help more when the relation is partly cached, and make little or no difference when it isn't cached. But that's not what happened.

I've also got a draft of a prefetching implementation here that I'd like to test out, but I've just discovered that it's buggy, so I'm going to send these results for now and work on fixing that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert, all,

* Robert Haas (robertmhaas@gmail.com) wrote:
> There is a considerable amount of variation in the amount of time this
> takes to run based on how much of the relation is cached. Clearly,
> there's no way for the system to cache it all, but it can cache a
> significant portion, and that affects the results to no small degree.
> [snip]
>
> With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
> With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.
>
> This is a confusing result, because you expect parallelism to help
> more when the relation is partly cached, and make little or no
> difference when it isn't cached. But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded uncached run. If I followed correctly, the uncached 'dd' took 321s, which is relatively close to the uncached-lots-of-workers and the two cached runs. What in the world is the uncached single-thread case doing that it takes an extra 543s, or over twice as long? It's clearly not disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency. The multi-threaded case is able to keep the CPUs and the i/o system going, and the cached results don't have as much latency since things are cached, but the single-threaded uncached case, going i/o -> cpu -> i/o -> cpu, ends up with a lot of wait time as it switches between being on CPU and waiting on the i/o.

Just some thoughts.

Thanks,
Stephen
On Fri, Jan 23, 2015 at 6:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Fixed-Chunks
>
> No. of workers/Time (ms) |      0 |      2 |      4 |      8 |     16 |    24 |    32
> Run-1                    | 250536 | 266279 | 251263 | 234347 |  87930 | 50474 | 35474
> [snip]

I cannot reproduce these results. I applied your fixed-chunk size patch and ran SELECT parallel_count('tbl_perf', 32) a few times. The first thing I notice is that, as I predicted, there's an issue with different workers finishing at different times. For example, from my first run:

2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34700) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34698) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34701) exited with exit code 0
2015-01-27 22:13:10 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34699) exited with exit code 0
2015-01-27 22:15:00 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34683) exited with exit code 0
2015-01-27 22:15:29 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34673) exited with exit code 0
2015-01-27 22:15:58 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34679) exited with exit code 0
2015-01-27 22:16:38 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34689) exited with exit code 0
2015-01-27 22:16:39 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34671) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34677) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34672) exited with exit code 0
2015-01-27 22:16:48 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34680) exited with exit code 0
2015-01-27 22:16:50 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34686) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34670) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34690) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34674) exited with exit code 0
2015-01-27 22:16:52 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34684) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34675) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34682) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34691) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34676) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34685) exited with exit code 0
2015-01-27 22:16:55 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34692) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34687) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34678) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34681) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34688) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34694) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34693) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34695) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34697) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34696) exited with exit code 0

That run started at 22:13:01. Within 4 seconds, 4 workers exited. So clearly we are not getting the promised 32-way parallelism for the whole test. Granted, in this instance, *most* of the workers run until the end, but I think we'll find that there are uncomfortably frequent cases where we get significantly less parallelism than we planned on because the work isn't divided evenly.

But leaving that aside, I've run this test 6 times in a row now, with a warm cache, and the best time I have is 237310.042 ms and the worst time I have is 242936.315 ms. So there's very little variation, and it's reasonably close to the results I got with dd, suggesting that the system is fairly well I/O bound. At a sequential read speed of 400 MB/s, 240 s = 96 GB of data. Assuming it takes no time at all to process the cached data (which seems to be not far from wrong judging by how quickly the first few workers exit), that means we're getting 24 GB of data from cache on a 64 GB machine. That seems a little low, but if the kernel is refusing to cache the whole relation to avoid cache-thrashing, it could be right.

Now, when you did what I understand to be the same test on the same machine, you got times ranging from 9.1 seconds to 35.4 seconds. Clearly, there is some difference between our test setups. Moreover, I'm kind of suspicious about whether your results are actually physically possible. Even in the best case where you somehow had the maximum possible amount of data cached - 64 GB on a 64 GB machine, leaving no space for cache duplication between PG and the OS and no space for the operating system or postgres itself - the table is 120 GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that there could be some speedup from issuing I/O requests in parallel instead of serially, but that is a 15x speedup over dd, so I am a little suspicious that there is some problem with the test setup, especially because I cannot reproduce the results.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/26/15 11:11 PM, Amit Kapila wrote:
> On Tue, Jan 27, 2015 at 3:18 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> On 1/23/15 10:16 PM, Amit Kapila wrote:
>>> Further, if we want to just get the benefit of parallel I/O, then
>>> I think we can get that by parallelising partition scan where different
>>> table partitions reside on different disk partitions, however that is
>>> a matter of separate patch.
>>
>> I don't think we even have to go that far.
>>
>> We'd be a lot less sensitive to IO latency.
>>
>> I wonder what kind of gains we would see if every SeqScan in a query
>> spawned a worker just to read tuples and shove them in a queue (or shove
>> a pointer to a buffer in the queue).
>
> Here IIUC, you want to say that just get the read done by one parallel
> worker and then do all expression calculation (evaluation of qualification
> and target list) in the main backend. It seems to me that by doing it
> that way, the benefit of parallelisation will be lost due to tuple
> communication overhead (maybe the overhead is less if we just
> pass a pointer to buffer, but that will have other kinds of problems
> like holding buffer pins for a longer period of time).
>
> I could see the advantage of testing on lines as suggested by Tom Lane,
> but that seems to be not directly related to what we want to achieve by
> this patch (parallel seq scan); or if you think otherwise then let me know?

There's some low-hanging fruit when it comes to improving our IO performance (or more specifically, decreasing our sensitivity to IO latency). Perhaps the way to do that is with the parallel infrastructure, perhaps not. But I think it's premature to look at parallelism for increasing IO performance, or to worry about things like how many IO threads we should have, before we at least look at simpler things we could do. We shouldn't assume there's nothing to be gained short of a full parallelization implementation.

That's not to say there's nothing else we could use parallelism for. Sort, merge and hash operations come to mind.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 1/27/15 3:46 PM, Stephen Frost wrote:
>> With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
>> With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.
>>
>> This is a confusing result, because you expect parallelism to help
>> more when the relation is partly cached, and make little or no
>> difference when it isn't cached. But that's not what happened.
>
> These numbers seem to indicate that the oddball is the single-threaded
> uncached run. If I followed correctly, the uncached 'dd' took 321s,
> which is relatively close to the uncached-lots-of-workers and the two
> cached runs. What in the world is the uncached single-thread case doing
> that it takes an extra 543s, or over twice as long? It's clearly not
> disk i/o which is causing the slowdown, based on your dd tests.
>
> One possibility might be round-trip latency. [snip]

This exactly mirrors what I've seen on production systems. On a single SeqScan I can't get anywhere close to the IO performance I could get with dd. Once I got up to 4-8 SeqScans of different tables running together, I saw iostat numbers that were similar to what a single dd bs=8k would do. I've tested this with iSCSI SAN volumes on both 1Gbit and 10Gbit ethernet.

This is why I think that when it comes to IO performance, before we start worrying about real parallelization we should investigate ways to do some kind of async IO.

I only have my SSD laptop and a really old server to test on, but I'll try Tom's suggestion of adding a PrefetchBuffer call into heapgetpage() unless someone beats me to it. I should be able to do it tomorrow.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jan 27, 2015 at 4:46 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
>> With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.
>>
>> This is a confusing result, because you expect parallelism to help
>> more when the relation is partly cached, and make little or no
>> difference when it isn't cached. But that's not what happened.
>
> These numbers seem to indicate that the oddball is the single-threaded
> uncached run. If I followed correctly, the uncached 'dd' took 321s,
> which is relatively close to the uncached-lots-of-workers and the two
> cached runs. What in the world is the uncached single-thread case doing
> that it takes an extra 543s, or over twice as long? It's clearly not
> disk i/o which is causing the slowdown, based on your dd tests.

Yeah, I'm wondering if the disk just froze up on that run for a long while, which has been known to occasionally happen on this machine, because I can't reproduce that crappy number. I did the 0-worker test a few more times, with the block-by-block method, dropping the caches and restarting PostgreSQL each time, and got:

322222.968 ms
322873.325 ms
322967.722 ms
321759.273 ms

After that last run, I ran it a few more times without restarting PostgreSQL or dropping the caches, and got:

257629.348 ms
289668.976 ms
290342.970 ms
258035.226 ms
284237.729 ms

Then I redid the 8-client test. Cold cache, I got 337312.554 ms. On the rerun, 323423.813 ms. Third run, 324940.785 ms.

There is more variability than I would like here. Clearly, it goes a bit faster when the cache is warm, but that's about all I can say with any confidence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Now, when you did what I understand to be the same test on the same
> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
> Clearly, there is some difference between our test setups. [snip]

So I thought about this a little more, and I realized after some poking around that hydra's disk subsystem is actually six disks configured in a software RAID5[1]. So one advantage of the chunk-by-chunk approach you are proposing is that you might be able to get all of the disks chugging away at once, because the data is presumably striped across all of them. Reading one block at a time, you'll never have more than 1 or 2 disks going, but if you do sequential reads from a bunch of different places in the relation, you might manage to get all 6. So that's something to think about.

One could imagine an algorithm like this: as long as there are more 1GB segments remaining than there are workers, each worker tries to chug through a separate 1GB segment. When there are not enough 1GB segments remaining for that to work, then they start ganging up on the same segments. That way, you get the benefit of spreading out the I/O across multiple files (and thus hopefully multiple members of the RAID group) when the data is coming from disk, but you can still keep everyone busy until the end, which will be important when the data is all in-memory and you're just limited by CPU bandwidth.

All that aside, I still can't account for the numbers you are seeing. When I run with your patch and what I think is your test case, I get different (slower) numbers. And even if we've got 6 drives cranking along at 400MB/s each, that's still only 2.4 GB/s, not >6 GB/s. So I'm still perplexed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Not my idea.
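Robert's hybrid could be sketched as follows; the shared-state fields are invented for illustration, the caller is assumed to hold whatever lock protects them, and RELSEG_SIZE (131072 blocks, i.e. 1GB at the default 8kB block size) is the existing segment-size constant:

	/*
	 * Hand out a whole 1GB segment per request while more segments remain
	 * than there are workers, so each worker streams a different file of
	 * the striped volume; for the tail, fall back to block-by-block so
	 * nobody sits idle at the end.
	 */
	static BlockNumber
	claim_next_chunk(ParallelScanShared *shared, int nworkers,
					 BlockNumber *nblocks_claimed)
	{
		BlockNumber start = shared->next_block;
		BlockNumber remaining = shared->nblocks - start;

		if (remaining / RELSEG_SIZE > (BlockNumber) nworkers)
			*nblocks_claimed = RELSEG_SIZE;		/* a whole segment */
		else
			*nblocks_claimed = 1;				/* gang up on the tail */

		shared->next_block += *nblocks_claimed;

		return start;
	}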
On 01/28/2015 04:16 AM, Robert Haas wrote:
> On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Now, when you did what I understand to be the same test on the same
>> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
>> Clearly, there is some difference between our test setups. Moreover,
>> I'm kind of suspicious about whether your results are actually
>> physically possible. Even in the best case where you somehow had the
>> maximum possible amount of data - 64 GB on a 64 GB machine - cached,
>> leaving no space for cache duplication between PG and the OS and no
>> space for the operating system or postgres itself - the table is 120
>> GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
>> from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
>> there could be some speedup from issuing I/O requests in parallel
>> instead of serially, but that is a 15x speedup over dd, so I am a
>> little suspicious that there is some problem with the test setup,
>> especially because I cannot reproduce the results.
>
> So I thought about this a little more, and I realized after some
> poking around that hydra's disk subsystem is actually six disks
> configured in a software RAID5[1]. So one advantage of the
> chunk-by-chunk approach you are proposing is that you might be able to
> get all of the disks chugging away at once, because the data is
> presumably striped across all of them. Reading one block at a time,
> you'll never have more than 1 or 2 disks going, but if you do
> sequential reads from a bunch of different places in the relation, you
> might manage to get all 6. So that's something to think about.
>
> One could imagine an algorithm like this: as long as there are more
> 1GB segments remaining than there are workers, each worker tries to
> chug through a separate 1GB segment. When there are not enough 1GB
> segments remaining for that to work, then they start ganging up on the
> same segments. That way, you get the benefit of spreading out the I/O
> across multiple files (and thus hopefully multiple members of the RAID
> group) when the data is coming from disk, but you can still keep
> everyone busy until the end, which will be important when the data is
> all in-memory and you're just limited by CPU bandwidth.

OTOH, spreading the I/O across multiple files is not a good thing, if
you don't have a RAID setup like that. With a single spindle, you'll
just induce more seeks.

Perhaps the OS is smart enough to read in large-enough chunks that the
occasional seek doesn't hurt much. But then again, why isn't the OS
smart enough to read in large-enough chunks to take advantage of the
RAID even when you read just a single file?

- Heikki
>
> On 01/28/2015 04:16 AM, Robert Haas wrote:
>>
>> On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> Now, when you did what I understand to be the same test on the same
>>> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
>>> Clearly, there is some difference between our test setups. Moreover,
>>> I'm kind of suspicious about whether your results are actually
>>> physically possible. Even in the best case where you somehow had the
>>> maximum possible amount of data - 64 GB on a 64 GB machine - cached,
>>> leaving no space for cache duplication between PG and the OS and no
>>> space for the operating system or postgres itself - the table is 120
>>> GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
>>> from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
>>> there could be some speedup from issuing I/O requests in parallel
>>> instead of serially, but that is a 15x speedup over dd, so I am a
>>> little suspicious that there is some problem with the test setup,
>>> especially because I cannot reproduce the results.
>>
>>
>> So I thought about this a little more, and I realized after some
>> poking around that hydra's disk subsystem is actually six disks
>> configured in a software RAID5[1]. So one advantage of the
>> chunk-by-chunk approach you are proposing is that you might be able to
>> get all of the disks chugging away at once, because the data is
>> presumably striped across all of them. Reading one block at a time,
>> you'll never have more than 1 or 2 disks going, but if you do
>> sequential reads from a bunch of different places in the relation, you
>> might manage to get all 6. So that's something to think about.
>>
>> One could imagine an algorithm like this: as long as there are more
>> 1GB segments remaining than there are workers, each worker tries to
>> chug through a separate 1GB segment. When there are not enough 1GB
>> segments remaining for that to work, then they start ganging up on the
>> same segments. That way, you get the benefit of spreading out the I/O
>> across multiple files (and thus hopefully multiple members of the RAID
>> group) when the data is coming from disk, but you can still keep
>> everyone busy until the end, which will be important when the data is
>> all in-memory and you're just limited by CPU bandwidth.
>
>
> OTOH, spreading the I/O across multiple files is not a good thing, if you don't have a RAID setup like that. With a single spindle, you'll just induce more seeks.
>
Yeah, if such a thing happens then there is less chance that user
On Wed, Jan 28, 2015 at 2:08 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> OTOH, spreading the I/O across multiple files is not a good thing, if you
> don't have a RAID setup like that. With a single spindle, you'll just
> induce more seeks.
>
> Perhaps the OS is smart enough to read in large-enough chunks that the
> occasional seek doesn't hurt much. But then again, why isn't the OS smart
> enough to read in large-enough chunks to take advantage of the RAID even
> when you read just a single file?

Suppose we have N spindles and N worker processes and it just so
happens that the amount of computation is such that each spindle can
keep one CPU busy. Let's suppose the chunk size is 4MB. If you read
from the relation at N staggered offsets, you might be lucky enough
that each one of them keeps a spindle busy, and you might be lucky
enough to have that stay true as the scans advance. You don't need
any particularly large amount of read-ahead; you just need to stay at
least one block ahead of the CPU. But if you read the relation in one
pass from beginning to end, you need at least N*4MB of read-ahead to
have data in cache for all N spindles, and the read-ahead will
certainly fail you at the end of every 1GB segment.

The problem here, as I see it, is that we're flying blind. If there's
just one spindle, I think it's got to be right to read the relation
sequentially. But if there are multiple spindles, it might not be,
but it seems hard to predict what we should do. We don't know what
the RAID chunk size is or how many spindles there are, so any guess as
to how to chunk up the relation and divide up the work between workers
is just a shot in the dark.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 28 January 2015 at 14:03, Robert Haas <span dir="ltr"><<ahref="mailto:robertmhaas@gmail.com" target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The problem here,as I see it, is that we're flying blind. If there's<br /> just one spindle, I think it's got to be right to read therelation<br /> sequentially. But if there are multiple spindles, it might not be,<br /> but it seems hard to predictwhat we should do. We don't know what<br /> the RAID chunk size is or how many spindles there are, so any guess as<br/> to how to chunk up the relation and divide up the work between workers<br /> is just a shot in the dark.</blockquote></div><br/></div><div class="gmail_extra">Can't the planner take effective_io_concurrency into account?<brclear="all" /></div><div class="gmail_extra"><br /><div class="gmail_signature">Thom</div></div></div>
>
>
> All that aside, I still can't account for the numbers you are seeing.
> When I run with your patch and what I think is your test case, I get
> different (slower) numbers. And even if we've got 6 drives cranking
> along at 400MB/s each, that's still only 2.4 GB/s, not >6 GB/s. So
> I'm still perplexed.
>
I have tried the tests again and found that I have forgotten to increase
max_worker_processes due to which the data is so different. Basically
at higher client count it is just scanning lesser number of blocks in
fixed chunk approach. So today I again tried with changing
max_worker_processes and found that there is not much difference in
performance at higher client count. I will take some more data for
both block_by_block and fixed_chunk approach and repost the data.
On Wed, Jan 28, 2015 at 9:12 AM, Thom Brown <thom@linux.com> wrote:
> On 28 January 2015 at 14:03, Robert Haas <robertmhaas@gmail.com> wrote:
>> The problem here, as I see it, is that we're flying blind. If there's
>> just one spindle, I think it's got to be right to read the relation
>> sequentially. But if there are multiple spindles, it might not be,
>> but it seems hard to predict what we should do. We don't know what
>> the RAID chunk size is or how many spindles there are, so any guess as
>> to how to chunk up the relation and divide up the work between workers
>> is just a shot in the dark.
>
> Can't the planner take effective_io_concurrency into account?

Maybe. It's somewhat the right question -- it tells us how many
parallel I/O channels we think we've got. But I'm not quite sure what
to do with that information in this case. I mean, if we've got
effective_io_concurrency = 6, does that mean it's right to start scans
in 6 arbitrary places in the relation and hope that keeps all the
drives busy? That seems like throwing darts at the wall. We have no
idea which parts are on which underlying devices. Or maybe it means we
should prefetch 24MB, on the assumption that the RAID stripe is 4MB?
That's definitely blind guesswork.

Considering the email Amit just sent, it looks like on this machine,
regardless of what algorithm we used, the scan took between 3 minutes
and 5.5 minutes, and most of them took between 4 minutes and 5.5
minutes. The results aren't very predictable, more workers don't
necessarily help, and it's not really clear that any algorithm we've
tried is clearly better than any other. I experimented with
prefetching a bit yesterday, too, and it was pretty much the same.
Some settings made it slightly faster. Others made it slower. Whee!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
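[Editor's note: for reference, the bitmap-heap-scan prefetch code already
translates effective_io_concurrency (a spindle count) into a
pages-to-prefetch target using a harmonic-sum heuristic. A paraphrased,
standalone sketch of that computation follows; it is reconstructed from
memory, so treat the surrounding details as approximate.]

    /* n spindles are expected to need roughly n * H(n) pages in flight,
     * where H(n) is the nth harmonic number. */
    #include <stdio.h>

    double
    prefetch_pages_for(int io_concurrency)
    {
        double pages = 0.0;

        for (int i = 1; i <= io_concurrency; i++)
            pages += (double) io_concurrency / (double) i;
        return pages;
    }

    int
    main(void)
    {
        /* 1 -> 1.0, 2 -> 3.0, 4 -> ~8.3, 8 -> ~21.7 pages */
        for (int n = 1; n <= 8; n *= 2)
            printf("effective_io_concurrency=%d -> ~%.1f pages\n",
                   n, prefetch_pages_for(n));
        return 0;
    }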
Robert Haas <robertmhaas@gmail.com> writes:
> The problem here, as I see it, is that we're flying blind. If there's
> just one spindle, I think it's got to be right to read the relation
> sequentially. But if there are multiple spindles, it might not be,
> but it seems hard to predict what we should do. We don't know what
> the RAID chunk size is or how many spindles there are, so any guess as
> to how to chunk up the relation and divide up the work between workers
> is just a shot in the dark.

I thought the proposal to chunk on the basis of "each worker processes
one 1GB-sized segment" should work all right. The kernel should see
that as sequential reads of different files, issued by different
processes; and if it can't figure out how to process that efficiently
then it's a very sad excuse for a kernel.

You are right that trying to do any detailed I/O scheduling by
ourselves is a doomed exercise. For better or worse, we have kept
ourselves at sufficient remove from the hardware that we can't
possibly do that successfully.

			regards, tom lane
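[Editor's note: segment boundaries are a natural chunking unit because a
relation larger than 1GB is already stored as a series of 1GB segment files
(relfilenode, relfilenode.1, relfilenode.2, ...), so the block-to-file
mapping is trivial arithmetic. A simplified sketch using the default build
constants; the real logic lives in md.c, and this helper is only an
illustration of the arithmetic, not that code.]

    /* With the default 8KB block size, a 1GB segment file holds
     * RELSEG_SIZE = 131072 blocks. */
    #define BLCKSZ      8192
    #define RELSEG_SIZE (1024 * 1024 * 1024 / BLCKSZ)

    /* Which 1GB segment file does this block live in? */
    int
    segment_for_block(unsigned blocknum)
    {
        return (int) (blocknum / RELSEG_SIZE);
    }

    /* Block offset within that segment file. */
    unsigned
    block_within_segment(unsigned blocknum)
    {
        return blocknum % RELSEG_SIZE;
    }

[Handing out work on segment_for_block() boundaries means each worker's reads
stay sequential within one file, which is the pattern Tom argues the kernel
should handle well.]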
On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The problem here, as I see it, is that we're flying blind. If there's
>> just one spindle, I think it's got to be right to read the relation
>> sequentially. But if there are multiple spindles, it might not be,
>> but it seems hard to predict what we should do. We don't know what
>> the RAID chunk size is or how many spindles there are, so any guess as
>> to how to chunk up the relation and divide up the work between workers
>> is just a shot in the dark.
>
> I thought the proposal to chunk on the basis of "each worker processes
> one 1GB-sized segment" should work all right. The kernel should see that
> as sequential reads of different files, issued by different processes;
> and if it can't figure out how to process that efficiently then it's a
> very sad excuse for a kernel.

I agree. But there's only value in doing something like that if we
have evidence that it improves anything. Such evidence is presently a
bit thin on the ground.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I thought the proposal to chunk on the basis of "each worker processes
>> one 1GB-sized segment" should work all right. The kernel should see that
>> as sequential reads of different files, issued by different processes;
>> and if it can't figure out how to process that efficiently then it's a
>> very sad excuse for a kernel.
>
> I agree. But there's only value in doing something like that if we
> have evidence that it improves anything. Such evidence is presently a
> bit thin on the ground.

Well, of course none of this should get committed without convincing
evidence that it's a win. But I think that chunking on relation
segment boundaries is a plausible way of dodging the problem that we
can't do explicitly hardware-aware scheduling.

			regards, tom lane
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I thought the proposal to chunk on the basis of "each worker processes
>> one 1GB-sized segment" should work all right. The kernel should see that
>> as sequential reads of different files, issued by different processes;
>> and if it can't figure out how to process that efficiently then it's a
>> very sad excuse for a kernel.

Agreed.

> I agree. But there's only value in doing something like that if we
> have evidence that it improves anything. Such evidence is presently a
> bit thin on the ground.

You need an i/o subsystem that's fast enough to keep a single CPU
busy, otherwise (as you mentioned elsewhere), you're just going to be
i/o bound and having more processes isn't going to help (and could
hurt).

Such i/o systems do exist, but a single RAID5 group over spinning rust
with a simple filter isn't going to cut it with a modern CPU- we're
just too darn efficient to end up i/o bound in that case. A more
complex filter might be able to change it over to being more CPU bound
than i/o bound and produce the performance improvements you're looking
for.

The caveat to this is if you have multiple i/o *channels* (which it
looks like you don't in this case) where you can parallelize across
those channels by having multiple processes involved. We only support
multiple i/o channels today with tablespaces and we can't span tables
across tablespaces. That's a problem when working with large data
sets, but I'm hopeful that this work will eventually lead to a
parallelized Append node that operates against a partitioned/inherited
table to work across multiple tablespaces.

Thanks,

Stephen
* Stephen Frost (sfrost@snowman.net) wrote:
> Such i/o systems do exist, but a single RAID5 group over spinning rust
> with a simple filter isn't going to cut it with a modern CPU- we're just
> too darn efficient to end up i/o bound in that case.

err, to *not* end up i/o bound.

Thanks,

Stephen
On 1/28/15 9:56 AM, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Wed, Jan 28, 2015 at 10:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I thought the proposal to chunk on the basis of "each worker processes
>>> one 1GB-sized segment" should work all right. The kernel should see that
>>> as sequential reads of different files, issued by different processes;
>>> and if it can't figure out how to process that efficiently then it's a
>>> very sad excuse for a kernel.
>
> Agreed.
>
>> I agree. But there's only value in doing something like that if we
>> have evidence that it improves anything. Such evidence is presently a
>> bit thin on the ground.
>
> You need an i/o subsystem that's fast enough to keep a single CPU busy,
> otherwise (as you mentioned elsewhere), you're just going to be i/o
> bound and having more processes isn't going to help (and could hurt).
>
> Such i/o systems do exist, but a single RAID5 group over spinning rust
> with a simple filter isn't going to cut it with a modern CPU- we're just
> too darn efficient to end up i/o bound in that case. A more complex
> filter might be able to change it over to being more CPU bound than i/o
> bound and produce the performance improvements you're looking for.

Except we're nowhere near being IO efficient. The vast difference
between Postgres IO rates and dd shows this. I suspect that's because
we're not giving the OS a list of IO to perform while we're doing our
thing, but that's just a guess.

> The caveat to this is if you have multiple i/o *channels* (which it
> looks like you don't in this case) where you can parallelize across
> those channels by having multiple processes involved.

Keep in mind that multiple processes is in no way a requirement for
that. Async IO would do that, or even just requesting stuff from the
OS before we need it.

> We only support
> multiple i/o channels today with tablespaces and we can't span tables
> across tablespaces. That's a problem when working with large data sets,
> but I'm hopeful that this work will eventually lead to a parallelized
> Append node that operates against a partitioned/inherited table to work
> across multiple tablespaces.

Until we can get a single seqscan close to dd performance, I fear
worrying about tablespaces and IO channels is entirely premature.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
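[Editor's note: as a concrete example of "requesting stuff from the OS
before we need it" from a single process: on Linux, PostgreSQL's
PrefetchBuffer ultimately issues posix_fadvise(POSIX_FADV_WILLNEED).
A standalone sketch follows; the helper name and error handling are
invented for illustration and this is not PostgreSQL code.]

    #define _XOPEN_SOURCE 600   /* for posix_fadvise */
    #include <fcntl.h>
    #include <stdio.h>

    #define BLCKSZ 8192

    /* Hint the kernel to start reading one 8KB block in the background,
     * so the read() we issue later finds it already in cache. */
    void
    prefetch_block(int fd, unsigned blocknum)
    {
        /* posix_fadvise returns an error number directly (not via errno) */
        int rc = posix_fadvise(fd, (off_t) blocknum * BLCKSZ, BLCKSZ,
                               POSIX_FADV_WILLNEED);

        if (rc != 0)
            fprintf(stderr, "posix_fadvise failed: %d\n", rc);
    }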
Jim,

* Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
> On 1/28/15 9:56 AM, Stephen Frost wrote:
>> Such i/o systems do exist, but a single RAID5 group over spinning rust
>> with a simple filter isn't going to cut it with a modern CPU- we're just
>> too darn efficient to end up i/o bound in that case. A more complex
>> filter might be able to change it over to being more CPU bound than i/o
>> bound and produce the performance improvements you're looking for.
>
> Except we're nowhere near being IO efficient. The vast difference
> between Postgres IO rates and dd shows this. I suspect that's because
> we're not giving the OS a list of IO to perform while we're doing our
> thing, but that's just a guess.

Uh, huh? The dd was ~321000 and the slowest uncached PG run from
Robert's latest tests was 337312.554, based on my inbox history at
least. I don't consider ~4-5% difference to be vast.

>> The caveat to this is if you have multiple i/o *channels* (which it
>> looks like you don't in this case) where you can parallelize across
>> those channels by having multiple processes involved.
>
> Keep in mind that multiple processes is in no way a requirement for
> that. Async IO would do that, or even just requesting stuff from the
> OS before we need it.

While I agree with this in principle, experience has shown that it
doesn't tend to work out as well as we'd like with a single process.

>> We only support
>> multiple i/o channels today with tablespaces and we can't span tables
>> across tablespaces. That's a problem when working with large data sets,
>> but I'm hopeful that this work will eventually lead to a parallelized
>> Append node that operates against a partitioned/inherited table to work
>> across multiple tablespaces.
>
> Until we can get a single seqscan close to dd performance, I fear
> worrying about tablespaces and IO channels is entirely premature.

I feel like one of us is misunderstanding the numbers, which is
probably in part because they're a bit piecemeal over email, but the
seqscan speed in this case looks pretty close to dd performance for
this particular test, when things are uncached. Cached numbers are
different, but that's not what we're discussing here, I don't think.

Don't get me wrong- I've definitely seen cases where we're CPU bound
because of complex filters, etc, but that doesn't seem to be the case
here.

Thanks!

Stephen
On Wed, Jan 28, 2015 at 8:27 PM, Stephen Frost <sfrost@snowman.net> wrote:
> I feel like one of us is misunderstanding the numbers, which is
> probably in part because they're a bit piecemeal over email, but the
> seqscan speed in this case looks pretty close to dd performance for
> this particular test, when things are uncached. Cached numbers are
> different, but that's not what we're discussing here, I don't think.
>
> Don't get me wrong- I've definitely seen cases where we're CPU bound
> because of complex filters, etc, but that doesn't seem to be the case
> here.

To try to clarify a bit: what we're testing here is a function I wrote
called parallel_count(regclass), which counts all the visible tuples
in a named relation. That runs as fast as dd, and giving it extra
workers or prefetching or the ability to read the relation with
different I/O patterns never seems to speed anything up very much.
The story with parallel sequential scan itself may well be different,
since that has a lot more CPU overhead than a dumb-simple
tuple-counter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 28, 2015 at 9:12 AM, Thom Brown <thom@linux.com> wrote:
>> On 28 January 2015 at 14:03, Robert Haas <robertmhaas@gmail.com> wrote:
>>> The problem here, as I see it, is that we're flying blind. If there's
>>> just one spindle, I think it's got to be right to read the relation
>>> sequentially. But if there are multiple spindles, it might not be,
>>> but it seems hard to predict what we should do. We don't know what
>>> the RAID chunk size is or how many spindles there are, so any guess as
>>> to how to chunk up the relation and divide up the work between workers
>>> is just a shot in the dark.
>>
>> Can't the planner take effective_io_concurrency into account?
>
> Maybe. It's somewhat the right question -- it tells us how many
> parallel I/O channels we think we've got. But I'm not quite sure what
> to do with that information in this case. I mean, if we've got
> effective_io_concurrency = 6, does that mean it's right to start scans
> in 6 arbitrary places in the relation and hope that keeps all the
> drives busy? That seems like throwing darts at the wall. We have no
> idea which parts are on which underlying devices. Or maybe it means we
> should prefetch 24MB, on the assumption that the RAID stripe is 4MB?
> That's definitely blind guesswork.
>
> Considering the email Amit just sent, it looks like on this machine,
> regardless of what algorithm we used, the scan took between 3 minutes
> and 5.5 minutes, and most of them took between 4 minutes and 5.5
> minutes. The results aren't very predictable, more workers don't
> necessarily help, and it's not really clear that any algorithm we've
> tried is clearly better than any other. I experimented with
> prefetching a bit yesterday, too, and it was pretty much the same.
> Some settings made it slightly faster. Others made it slower. Whee!

I researched this topic quite some time ago. One notable fact is that
active prefetching disables automatic readahead prefetching (by the
Linux kernel), which can occur in larger granularities than 8K.
Automatic readahead prefetching occurs when consecutive addresses are
read, which may happen in a seqscan but also by "accident" through an
indexscan in correlated cases.

My conclusion was to NOT prefetch seqscans, because the OS does well
enough without advice. Prefetching indexscan heap accesses is very
valuable though, but you need to detect the accidental sequential
accesses so as not to hurt your performance in correlated cases.

In general I can give you the hint to not only focus on HDDs with
their single spindle. A single SATA SSD scales up to 32 (31 on Linux)
requests in parallel (without RAID or anything else). The difference
in throughput is extreme for this type of storage device. While
single spinning HDDs can only gain up to ~20% by NCQ, SATA SSDs can
easily gain up to 700%.

+1 for using effective_io_concurrency to tune for this, since
prefetching random addresses is effectively a type of parallel I/O.

Regards,
Daniel

--
MSc. Daniel Bausch
Research Assistant (Computer Science)
Technische Universität Darmstadt
http://www.dvs.tu-darmstadt.de/staff/dbausch
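[Editor's note: a minimal sketch of the sequential-access detection Daniel
recommends: suppress the explicit prefetch hint whenever the next block
simply continues a run, so the kernel's own readahead is not defeated. All
names here are hypothetical; the patches Daniel posts later in the thread
use a related "not the current block, not the last prefetched block"
heuristic inside indexam.c.]

    #include <stdbool.h>

    typedef struct
    {
        unsigned last_block;    /* last block touched or prefetched */
        bool     have_last;
    } prefetch_state;

    /* Decide whether an explicit prefetch hint is worthwhile for
     * 'next_block'.  Sequential continuations are left to the kernel's
     * readahead; only apparently random jumps get an explicit hint. */
    bool
    should_prefetch(prefetch_state *st, unsigned next_block)
    {
        bool sequential = st->have_last &&
                          next_block == st->last_block + 1;

        st->last_block = next_block;
        st->have_last = true;

        return !sequential;
    }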
>
> I have tried the tests again and found that I have forgotten to increase
> max_worker_processes due to which the data is so different. Basically
> at higher client count it is just scanning lesser number of blocks in
> fixed chunk approach. So today I again tried with changing
> max_worker_processes and found that there is not much difference in
> performance at higher client count. I will take some more data for
> both block_by_block and fixed_chunk approach and repost the data.
>
I have again taken the data and found that there is not much difference
between the two approaches:
Fixed-Chunks

No. of workers/Time (ms) |      0 |      8 |     16 |     32
-------------------------+--------+--------+--------+--------
Run-1                    | 322822 | 245759 | 330097 | 330002
Run-2                    | 275685 | 275428 | 301625 | 286251
Run-3                    | 252129 | 244167 | 303494 | 278604
Run-4                    | 252528 | 259273 | 250438 | 258636
Run-5                    | 250612 | 242072 | 235384 | 265918

Block-By-Block

No. of workers/Time (ms) |      0 |      8 |     16 |     32
-------------------------+--------+--------+--------+--------
Run-1                    | 323084 | 341950 | 338999 | 334100
Run-2                    | 310968 | 349366 | 344272 | 322643
Run-3                    | 250312 | 336227 | 346276 | 322274
Run-4                    | 262744 | 314489 | 351652 | 325135
Run-5                    | 265987 | 316260 | 342924 | 319200
On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> OTOH, spreading the I/O across multiple files is not a good thing, if you
> don't have a RAID setup like that. With a single spindle, you'll just
> induce more seeks.
>
> Perhaps the OS is smart enough to read in large-enough chunks that the
> occasional seek doesn't hurt much. But then again, why isn't the OS smart
> enough to read in large-enough chunks to take advantage of the RAID even
> when you read just a single file?

In my experience with RAID, it is smart enough to take advantage of
that. If the raid controller detects a sequential access pattern read,
it initiates a read ahead on each disk to pre-position the data it
will need (or at least, the behavior I observe is as-if it did that).
But maybe if the sequential read is a bunch of "random" reads from
different processes which just happen to add up to sequential, that
confuses the algorithm?

Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <
> hlinnakangas@vmware.com> wrote:
>> OTOH, spreading the I/O across multiple files is not a good thing, if you
>> don't have a RAID setup like that. With a single spindle, you'll just
>> induce more seeks.
>>
>> Perhaps the OS is smart enough to read in large-enough chunks that the
>> occasional seek doesn't hurt much. But then again, why isn't the OS smart
>> enough to read in large-enough chunks to take advantage of the RAID even
>> when you read just a single file?

> In my experience with RAID, it is smart enough to take advantage of that.
> If the raid controller detects a sequential access pattern read, it
> initiates a read ahead on each disk to pre-position the data it will need
> (or at least, the behavior I observe is as-if it did that). But maybe if
> the sequential read is a bunch of "random" reads from different processes
> which just happen to add up to sequential, that confuses the algorithm?

If seqscan detection is being done at the level of the RAID controller,
I rather imagine that the controller would not know which process had
initiated which read anyway. But if it's being done at the level of
the kernel, it's a whole nother thing, and I bet it *would* matter.

			regards, tom lane
On Thu, Jan 29, 2015 at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In my experience with RAID, it is smart enough to take advantage of that.
>> If the raid controller detects a sequential access pattern read, it
>> initiates a read ahead on each disk to pre-position the data it will need
>> (or at least, the behavior I observe is as-if it did that). But maybe if
>> the sequential read is a bunch of "random" reads from different processes
>> which just happen to add up to sequential, that confuses the algorithm?
>
> If seqscan detection is being done at the level of the RAID controller,
> I rather imagine that the controller would not know which process had
> initiated which read anyway. But if it's being done at the level of the
> kernel, it's a whole nother thing, and I bet it *would* matter.

That was my feeling too. On the machine that Amit and I have been
using for testing, we can't find any really convincing evidence that
it matters. I won't be a bit surprised if there are other systems
where it does matter, but I don't know how to find them except to
encourage other people to help test.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/28/15 7:27 PM, Stephen Frost wrote:
> * Jim Nasby (Jim.Nasby@BlueTreble.com) wrote:
>> On 1/28/15 9:56 AM, Stephen Frost wrote:
>>> Such i/o systems do exist, but a single RAID5 group over spinning rust
>>> with a simple filter isn't going to cut it with a modern CPU- we're just
>>> too darn efficient to end up i/o bound in that case. A more complex
>>> filter might be able to change it over to being more CPU bound than i/o
>>> bound and produce the performance improvements you're looking for.
>>
>> Except we're nowhere near being IO efficient. The vast difference
>> between Postgres IO rates and dd shows this. I suspect that's because
>> we're not giving the OS a list of IO to perform while we're doing our
>> thing, but that's just a guess.
>
> Uh, huh? The dd was ~321000 and the slowest uncached PG run from
> Robert's latest tests was 337312.554, based on my inbox history at
> least. I don't consider ~4-5% difference to be vast.

Sorry, I was speaking more generally than this specific test. In the
past I've definitely seen SeqScan performance that was an order of
magnitude slower than what dd would do. This was an older version of
Postgres and an older version of linux, running on an iSCSI SAN. My
suspicion is that the added IO latency imposed by iSCSI is what was
causing this, but that's just conjecture.

I think Robert was saying that he hasn't been able to see this effect
on their test server... that makes me think it's doing read-ahead on
the OS level. But I suspect it's pretty touch and go to rely on that;
I'd prefer we have some way to explicitly get that behavior where we
want it.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Daniel,

* Daniel Bausch (bausch@dvs.tu-darmstadt.de) wrote:
> I researched this topic quite some time ago. One notable fact is that
> active prefetching disables automatic readahead prefetching (by the
> Linux kernel), which can occur in larger granularities than 8K.
> Automatic readahead prefetching occurs when consecutive addresses are
> read, which may happen in a seqscan but also by "accident" through an
> indexscan in correlated cases.

That strikes me as a pretty good point to consider.

> My conclusion was to NOT prefetch seqscans, because the OS does well
> enough without advice. Prefetching indexscan heap accesses is very
> valuable though, but you need to detect the accidental sequential
> accesses so as not to hurt your performance in correlated cases.

Seems like we might be able to do that, it's not that different from
what we do with the bitmap scan case, we'd just look at the bitmap and
see if there's long runs of 1's.

> In general I can give you the hint to not only focus on HDDs with
> their single spindle. A single SATA SSD scales up to 32 (31 on Linux)
> requests in parallel (without RAID or anything else). The difference
> in throughput is extreme for this type of storage device. While
> single spinning HDDs can only gain up to ~20% by NCQ, SATA SSDs can
> easily gain up to 700%.

I definitely agree with the idea that we should be looking at
SSD-based systems but I don't know if anyone happens to have easy
access to server gear with SSDs. I've got an SSD in my laptop, but
that's not really the same thing.

Thanks!

Stephen
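[Editor's note: a rough sketch of the "look for long runs of 1's" idea,
assuming a simple page bitmap where a set bit means the page will be
fetched. This is hypothetical illustration code, not the existing bitmap
heap scan machinery.]

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if 'page' (whose bit is assumed set) sits inside a run
     * of at least 'min_run' consecutive set bits.  Such pages will be read
     * near-sequentially, so kernel readahead likely covers them and an
     * explicit prefetch hint would only get in the way. */
    bool
    in_sequential_run(const uint8_t *bitmap, unsigned npages,
                      unsigned page, unsigned min_run)
    {
        unsigned start = page;
        unsigned end = page;

        /* extend the run backwards... */
        while (start > 0 &&
               (bitmap[(start - 1) / 8] & (1 << ((start - 1) % 8))))
            start--;
        /* ...and forwards */
        while (end + 1 < npages &&
               (bitmap[(end + 1) / 8] & (1 << ((end + 1) % 8))))
            end++;

        return (end - start + 1) >= min_run;
    }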
This patch depends on https://commitfest.postgresql.org/3/22/
Hi David and others!

David Fetter <david@fetter.org> writes:

> On Tue, Jan 27, 2015 at 08:02:37AM +0100, Daniel Bausch wrote:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>> >> Wait for first IO, issue second IO request
>> >> Compute
>> >> Already have second IO request, issue third
>> >> ...
>> >>
>> >> We'd be a lot less sensitive to IO latency.
>> >
>> > It would take about five minutes of coding to prove or disprove this:
>> > stick a PrefetchBuffer call into heapgetpage() to launch a request for the
>> > next page as soon as we've read the current one, and then see if that
>> > makes any obvious performance difference. I'm not convinced that it will,
>> > but if it did then we could think about how to make it work for real.
>>
>> Sorry for dropping in so late...
>>
>> I have done all this two years ago. For TPC-H Q8, Q9, Q17, Q20, and Q21
>> I see a speedup of ~100% when using IndexScan prefetching + Nested-Loops
>> Look-Ahead (the outer loop!).
>> (On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)
>
> Would you be so kind as to pass along any patches (ideally applicable
> to git master), tests, and specific measurements you made?

Attached find my patches based on the old revision
36f4c7843cf3d201279855ed9a6ebc1deb3c9463 (Adjust cube.out expected
output for new test queries.) I have not yet tested whether they apply
against HEAD.

Disclaimer: This was just a proof-of-concept and so is of poor
implementation quality. Nevertheless, performance looked promising; it
still needs a lot of extra rules for special cases, like detecting
accidental sequential scans. The general assumption is: no concurrency
- a single query owning the machine.

Here is a comparison using dbt3. Q8, Q9, Q17, Q20, and Q21 are
significantly improved.

|     |   baseline |  indexscan | indexscan+nestloop |
|     |            |  patch 1+2 |            patch 3 |
|-----+------------+------------+--------------------|
| Q1  |  76.124261 |  73.165161 |          76.323119 |
| Q2  |   9.676956 |  11.211073 |          10.480668 |
| Q3  |  36.836417 |  36.268022 |          36.837226 |
| Q4  |  48.707501 |    64.2255 |          30.872218 |
| Q5  |  59.371467 |  59.205048 |          58.646096 |
| Q6  |  70.514214 |  73.021006 |           72.64643 |
| Q7  |  63.667594 |  63.258499 |          62.758288 |
| Q8  |  70.640973 |  33.144454 |          32.530732 |
| Q9  | 446.630473 | 379.063773 |         219.926094 |
| Q10 |  49.616125 |  49.244744 |          48.411664 |
| Q11 |   6.122317 |   6.158616 |           6.160189 |
| Q12 |  74.294292 |  87.780442 |          87.533936 |
| Q13 |   32.37932 |  32.771938 |          33.483444 |
| Q14 |  47.836053 |  48.093996 |           47.72221 |
| Q15 | 139.350038 | 138.880208 |         138.681336 |
| Q16 |  12.092429 |  12.120661 |          11.668971 |
| Q17 |   9.346636 |   4.106042 |           4.018951 |
| Q18 |  66.106875 | 123.754111 |         122.623193 |
| Q19 |  22.750504 |  23.191532 |           22.34084 |
| Q20 |  80.481986 |  29.906274 |           28.58106 |
| Q21 | 396.897269 |  355.45988 |          214.44184 |
| Q22 |   6.834841 |   6.600922 |           6.524032 |

Regards,
Daniel

--
MSc. Daniel Bausch
Research Assistant (Computer Science)
Technische Universität Darmstadt
http://www.dvs.tu-darmstadt.de/staff/dbausch

>From 569398929d899100b769abfd919bc3383626ac9f Mon Sep 17 00:00:00 2001
From: Daniel Bausch <bausch@dvs.tu-darmstadt.de>
Date: Tue, 22 Oct 2013 15:22:25 +0200
Subject: [PATCH 1/4] Quick proof-of-concept for indexscan prefetching

This implements a prefetching queue of tuples whose tid is read ahead.
Their block number is quickly checked for random properties (not
current block and not the block prefetched last). Random reads are
prefetched. Up to 32 tuples are considered by default. The tids are
queued in a fixed ring buffer.
The prefetching is implemented in the generic part of the index scan, so it applies to all access methods. --- src/backend/access/index/indexam.c | 96 ++++++++++++++++++++++++++++++++++++++ src/include/access/relscan.h | 12 +++++ 2 files changed, 108 insertions(+) diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index b878155..1c54ef5 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -251,6 +251,12 @@ index_beginscan(Relation heapRelation, scan->heapRelation = heapRelation; scan->xs_snapshot = snapshot; +#ifdef USE_PREFETCH + scan->xs_prefetch_head = scan->xs_prefetch_tail = -1; + scan->xs_last_prefetch = -1; + scan->xs_done = false; +#endif + return scan; } @@ -432,6 +438,55 @@ index_restrpos(IndexScanDesc scan) FunctionCall1(procedure, PointerGetDatum(scan)); } +static int +index_prefetch_queue_space(IndexScanDesc scan) +{ + if (scan->xs_prefetch_tail < 0) + return INDEXSCAN_PREFETCH_COUNT; + + Assert(scan->xs_prefetch_head >= 0); + + return (INDEXSCAN_PREFETCH_COUNT + - (scan->xs_prefetch_tail - scan->xs_prefetch_head + 1)) + % INDEXSCAN_PREFETCH_COUNT; +} + +/* makes copy of ItemPointerData */ +static bool +index_prefetch_queue_push(IndexScanDesc scan, ItemPointer tid) +{ + Assert(index_prefetch_queue_space(scan) > 0); + + if (scan->xs_prefetch_tail == -1) + scan->xs_prefetch_head = scan->xs_prefetch_tail = 0; + else + scan->xs_prefetch_tail = + (scan->xs_prefetch_tail + 1) % INDEXSCAN_PREFETCH_COUNT; + + scan->xs_prefetch_queue[scan->xs_prefetch_tail] = *tid; + + return true; +} + +static ItemPointer +index_prefetch_queue_pop(IndexScanDesc scan) +{ + ItemPointer res; + + if (scan->xs_prefetch_head < 0) + return NULL; + + res = &scan->xs_prefetch_queue[scan->xs_prefetch_head]; + + if (scan->xs_prefetch_head == scan->xs_prefetch_tail) + scan->xs_prefetch_head = scan->xs_prefetch_tail = -1; + else + scan->xs_prefetch_head = + (scan->xs_prefetch_head + 1) % INDEXSCAN_PREFETCH_COUNT; + + return res; +} + /* ---------------- * index_getnext_tid - get the next TID from a scan * @@ -444,12 +499,52 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) { FmgrInfo *procedure; bool found; + ItemPointer from_queue; + BlockNumber pf_block; SCAN_CHECKS; GET_SCAN_PROCEDURE(amgettuple); Assert(TransactionIdIsValid(RecentGlobalXmin)); +#ifdef USE_PREFETCH + while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) { + /* + * The AM's amgettuple proc finds the next index entry matching the + * scan keys, and puts the TID into scan->xs_ctup.t_self. It should + * also set scan->xs_recheck and possibly scan->xs_itup, though we pay + * no attention to those fields here. 
+ */ + found = DatumGetBool(FunctionCall2(procedure, + PointerGetDatum(scan), + Int32GetDatum(direction))); + if (found) + { + index_prefetch_queue_push(scan, &scan->xs_ctup.t_self); + pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self); + /* prefetch only if not the current buffer and not exactly the + * previously prefetched buffer (heuristic random detection) + * because sequential read-ahead would be redundant */ + if ((!BufferIsValid(scan->xs_cbuf) || + pf_block != BufferGetBlockNumber(scan->xs_cbuf)) && + pf_block != scan->xs_last_prefetch) + { + PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block); + scan->xs_last_prefetch = pf_block; + } + } + else + scan->xs_done = true; + } + from_queue = index_prefetch_queue_pop(scan); + if (from_queue) + { + scan->xs_ctup.t_self = *from_queue; + found = true; + } + else + found = false; +#else /* * The AM's amgettuple proc finds the next index entry matching the scan * keys, and puts the TID into scan->xs_ctup.t_self. It should also set @@ -459,6 +554,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) found = DatumGetBool(FunctionCall2(procedure, PointerGetDatum(scan), Int32GetDatum(direction))); +#endif /* Reset kill flag immediately for safety */ scan->kill_prior_tuple = false; diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index 3a86ca4..bccc1a4 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -93,6 +93,18 @@ typedef struct IndexScanDescData /* state data for traversing HOT chains in index_getnext */ bool xs_continue_hot; /* T if must keep walking HOT chain */ + +#ifdef USE_PREFETCH +# ifndef INDEXSCAN_PREFETCH_COUNT +# define INDEXSCAN_PREFETCH_COUNT 32 +# endif + /* prefetch queue - ringbuffer */ + ItemPointerData xs_prefetch_queue[INDEXSCAN_PREFETCH_COUNT]; + int xs_prefetch_head; + int xs_prefetch_tail; + BlockNumber xs_last_prefetch; + bool xs_done; +#endif } IndexScanDescData; /* Struct for heap-or-index scans of system tables */ -- 2.0.5 >From 7cb5839dd7751bcdcae6e4cbf69cfd24af10a694 Mon Sep 17 00:00:00 2001 From: Daniel Bausch <bausch@dvs.tu-darmstadt.de> Date: Wed, 23 Oct 2013 09:45:11 +0200 Subject: [PATCH 2/4] Fix index-only scan and rescan Prefetching heap data for index-only scans does not make any sense and it uses a different field (itup), nevertheless. Deactivate the prefetch logic for index-only scans. Reset xs_done and the queue on rescan, so we find tuples again. Remember last prefetch to detect correlation. 
--- src/backend/access/index/indexam.c | 85 +++++++++++++++++++++----------------- 1 file changed, 47 insertions(+), 38 deletions(-) diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 1c54ef5..d8a4622 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -353,6 +353,12 @@ index_rescan(IndexScanDesc scan, scan->kill_prior_tuple = false; /* for safety */ +#ifdef USE_PREFETCH + /* I think, it does not hurt to remember xs_last_prefetch */ + scan->xs_prefetch_head = scan->xs_prefetch_tail = -1; + scan->xs_done = false; +#endif + FunctionCall5(procedure, PointerGetDatum(scan), PointerGetDatum(keys), @@ -508,7 +514,47 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) Assert(TransactionIdIsValid(RecentGlobalXmin)); #ifdef USE_PREFETCH - while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) { + if (!scan->xs_want_itup) + { + while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) { + /* + * The AM's amgettuple proc finds the next index entry matching + * the scan keys, and puts the TID into scan->xs_ctup.t_self. It + * should also set scan->xs_recheck and possibly scan->xs_itup, + * though we pay no attention to those fields here. + */ + found = DatumGetBool(FunctionCall2(procedure, + PointerGetDatum(scan), + Int32GetDatum(direction))); + if (found) + { + index_prefetch_queue_push(scan, &scan->xs_ctup.t_self); + pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self); + /* prefetch only if not the current buffer and not exactly the + * previously prefetched buffer (heuristic random detection) + * because sequential read-ahead would be redundant */ + if ((!BufferIsValid(scan->xs_cbuf) || + pf_block != BufferGetBlockNumber(scan->xs_cbuf)) && + pf_block != scan->xs_last_prefetch) + { + PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block); + scan->xs_last_prefetch = pf_block; + } + } + else + scan->xs_done = true; + } + from_queue = index_prefetch_queue_pop(scan); + if (from_queue) + { + scan->xs_ctup.t_self = *from_queue; + found = true; + } + else + found = false; + } + else +#endif /* * The AM's amgettuple proc finds the next index entry matching the * scan keys, and puts the TID into scan->xs_ctup.t_self. It should @@ -518,43 +564,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) found = DatumGetBool(FunctionCall2(procedure, PointerGetDatum(scan), Int32GetDatum(direction))); - if (found) - { - index_prefetch_queue_push(scan, &scan->xs_ctup.t_self); - pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self); - /* prefetch only if not the current buffer and not exactly the - * previously prefetched buffer (heuristic random detection) - * because sequential read-ahead would be redundant */ - if ((!BufferIsValid(scan->xs_cbuf) || - pf_block != BufferGetBlockNumber(scan->xs_cbuf)) && - pf_block != scan->xs_last_prefetch) - { - PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block); - scan->xs_last_prefetch = pf_block; - } - } - else - scan->xs_done = true; - } - from_queue = index_prefetch_queue_pop(scan); - if (from_queue) - { - scan->xs_ctup.t_self = *from_queue; - found = true; - } - else - found = false; -#else - /* - * The AM's amgettuple proc finds the next index entry matching the scan - * keys, and puts the TID into scan->xs_ctup.t_self. It should also set - * scan->xs_recheck and possibly scan->xs_itup, though we pay no attention - * to those fields here. 
- */ - found = DatumGetBool(FunctionCall2(procedure, - PointerGetDatum(scan), - Int32GetDatum(direction))); -#endif /* Reset kill flag immediately for safety */ scan->kill_prior_tuple = false; -- 2.0.5 >From d8b1533955e3471fb2eb6a030619dcbc258955a8 Mon Sep 17 00:00:00 2001 From: Daniel Bausch <bausch@dvs.tu-darmstadt.de> Date: Mon, 28 Oct 2013 10:43:16 +0100 Subject: [PATCH 3/4] First try on tuple look-ahead in nestloop Similarly to the prefetching logic just added to the index scan, look ahead tuples in the outer loop of a nested loop scan. For every tuple looked ahead issue an explicit request for prefetching to the inner plan. Modify the index scan to react on this request. --- src/backend/access/index/indexam.c | 81 +++++++++----- src/backend/executor/execProcnode.c | 36 +++++++ src/backend/executor/nodeIndexscan.c | 16 +++ src/backend/executor/nodeNestloop.c | 200 ++++++++++++++++++++++++++++++++++- src/include/access/genam.h | 4 + src/include/executor/executor.h | 3 + src/include/executor/nodeIndexscan.h | 1 + src/include/nodes/execnodes.h | 12 +++ 8 files changed, 323 insertions(+), 30 deletions(-) diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index d8a4622..5f44dec 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -493,6 +493,57 @@ index_prefetch_queue_pop(IndexScanDesc scan) return res; } +#ifdef USE_PREFETCH +int +index_prefetch(IndexScanDesc scan, int maxPrefetch, ScanDirection direction) +{ + FmgrInfo *procedure; + int numPrefetched = 0; + bool found; + BlockNumber pf_block; + FILE *logfile; + + GET_SCAN_PROCEDURE(amgettuple); + + while (numPrefetched < maxPrefetch && !scan->xs_done && + index_prefetch_queue_space(scan) > 0) + { + /* + * The AM's amgettuple proc finds the next index entry matching the + * scan keys, and puts the TID into scan->xs_ctup.t_self. It should + * also set scan->xs_recheck and possibly scan->xs_itup, though we pay + * no attention to those fields here. + */ + found = DatumGetBool(FunctionCall2(procedure, + PointerGetDatum(scan), + Int32GetDatum(direction))); + if (found) + { + index_prefetch_queue_push(scan, &scan->xs_ctup.t_self); + pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self); + + /* + * Prefetch only if not the current buffer and not exactly the + * previously prefetched buffer (heuristic random detection) + * because sequential read-ahead would be redundant + */ + if ((!BufferIsValid(scan->xs_cbuf) || + pf_block != BufferGetBlockNumber(scan->xs_cbuf)) && + pf_block != scan->xs_last_prefetch) + { + PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block); + scan->xs_last_prefetch = pf_block; + numPrefetched++; + } + } + else + scan->xs_done = true; + } + + return numPrefetched; +} +#endif + /* ---------------- * index_getnext_tid - get the next TID from a scan * @@ -506,7 +557,6 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) FmgrInfo *procedure; bool found; ItemPointer from_queue; - BlockNumber pf_block; SCAN_CHECKS; GET_SCAN_PROCEDURE(amgettuple); @@ -516,34 +566,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) #ifdef USE_PREFETCH if (!scan->xs_want_itup) { - while (!scan->xs_done && index_prefetch_queue_space(scan) > 0) { - /* - * The AM's amgettuple proc finds the next index entry matching - * the scan keys, and puts the TID into scan->xs_ctup.t_self. It - * should also set scan->xs_recheck and possibly scan->xs_itup, - * though we pay no attention to those fields here. 
- */ - found = DatumGetBool(FunctionCall2(procedure, - PointerGetDatum(scan), - Int32GetDatum(direction))); - if (found) - { - index_prefetch_queue_push(scan, &scan->xs_ctup.t_self); - pf_block = ItemPointerGetBlockNumber(&scan->xs_ctup.t_self); - /* prefetch only if not the current buffer and not exactly the - * previously prefetched buffer (heuristic random detection) - * because sequential read-ahead would be redundant */ - if ((!BufferIsValid(scan->xs_cbuf) || - pf_block != BufferGetBlockNumber(scan->xs_cbuf)) && - pf_block != scan->xs_last_prefetch) - { - PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, pf_block); - scan->xs_last_prefetch = pf_block; - } - } - else - scan->xs_done = true; - } + index_prefetch(scan, INDEXSCAN_PREFETCH_COUNT, direction); from_queue = index_prefetch_queue_pop(scan); if (from_queue) { diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index 76dd62f..a8f2c90 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -741,3 +741,39 @@ ExecEndNode(PlanState *node) break; } } + + +#ifdef USE_PREFETCH +/* ---------------------------------------------------------------- + * ExecPrefetchNode + * + * Request explicit prefetching from a subtree/node without + * actually forming a tuple. + * + * The node shall request at most 'maxPrefetch' pages being + * prefetched. + * + * The function returns how many pages have been requested. + * + * Calling this function for a type that does not support + * prefetching is not an error. It just returns 0 as if no + * prefetching was possible. + * ---------------------------------------------------------------- + */ +int +ExecPrefetchNode(PlanState *node, int maxPrefetch) +{ + if (node == NULL) + return 0; + + switch (nodeTag(node)) + { + case T_IndexScanState: + return ExecPrefetchIndexScan((IndexScanState *) node, + maxPrefetch); + + default: + return 0; + } +} +#endif diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index f1062f1..bab0e7a 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -192,6 +192,22 @@ ExecReScanIndexScan(IndexScanState *node) ExecScanReScan(&node->ss); } +#ifdef USE_PREFETCH +/* ---------------------------------------------------------------- + * ExecPrefetchIndexScan(node, maxPrefetch) + * + * Trigger prefetching of index scan without actually fetching + * a tuple. 
+ * ---------------------------------------------------------------- + */ +int +ExecPrefetchIndexScan(IndexScanState *node, int maxPrefetch) +{ + return index_prefetch(node->iss_ScanDesc, maxPrefetch, + node->ss.ps.state->es_direction); +} +#endif + /* * ExecIndexEvalRuntimeKeys diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c index c7a08ed..21ad5f8 100644 --- a/src/backend/executor/nodeNestloop.c +++ b/src/backend/executor/nodeNestloop.c @@ -25,6 +25,90 @@ #include "executor/nodeNestloop.h" #include "utils/memutils.h" +#ifdef USE_PREFETCH +static int +NestLoopLookAheadQueueSpace(NestLoopState *node) +{ + if (node->nl_lookAheadQueueTail < 0) + return NESTLOOP_PREFETCH_COUNT; + + Assert(node->nl_lookAheadQueueHead >= 0); + + return (NESTLOOP_PREFETCH_COUNT + - (node->nl_lookAheadQueueTail - node->nl_lookAheadQueueHead + 1)) + % NESTLOOP_PREFETCH_COUNT; +} + +/* makes materialized copy of tuple table slot */ +static bool +NestLoopLookAheadQueuePush(NestLoopState *node, TupleTableSlot *tuple) +{ + TupleTableSlot **queueEntry; + + Assert(NestLoopLookAheadQueueSpace(node) > 0); + + if (node->nl_lookAheadQueueTail == -1) + node->nl_lookAheadQueueHead = node->nl_lookAheadQueueTail = 0; + else + node->nl_lookAheadQueueTail = + (node->nl_lookAheadQueueTail +1) % NESTLOOP_PREFETCH_COUNT; + + queueEntry = &node->nl_lookAheadQueue[node->nl_lookAheadQueueTail]; + + if (!(*queueEntry)) + { + *queueEntry = ExecInitExtraTupleSlot(node->js.ps.state); + ExecSetSlotDescriptor(*queueEntry, + ExecGetResultType(outerPlanState(node))); + } + + ExecCopySlot(*queueEntry, tuple); + + return true; +} + +static TupleTableSlot * +NestLoopLookAheadQueuePop(NestLoopState *node) +{ + TupleTableSlot *res; + + if (node->nl_lookAheadQueueHead < 0) + return NULL; + + res = node->nl_lookAheadQueue[node->nl_lookAheadQueueHead]; + + if (node->nl_lookAheadQueueHead == node->nl_lookAheadQueueTail) + node->nl_lookAheadQueueHead = node->nl_lookAheadQueueTail = -1; + else + node->nl_lookAheadQueueHead = + (node->nl_lookAheadQueueHead + 1) % NESTLOOP_PREFETCH_COUNT; + + return res; +} + +static void +NestLoopLookAheadQueueClear(NestLoopState *node) +{ + TupleTableSlot *lookAheadTuple; + int i; + + /* + * As we do not clear the tuple table slots on pop, we need to scan the + * whole array, regardless of the current queue fill. + * + * We cannot really free the slot, as there is no well defined interface + * for that, but the emptied slots will be freed when the query ends. + */ + for (i = 0; i < NESTLOOP_PREFETCH_COUNT; i++) + { + lookAheadTuple = node->nl_lookAheadQueue[i]; + /* look only on pointer - all non NULL fields are non-empty */ + if (lookAheadTuple) + ExecClearTuple(lookAheadTuple); + } + +} +#endif /* USE_PREFETCH */ /* ---------------------------------------------------------------- * ExecNestLoop(node) @@ -120,7 +204,87 @@ ExecNestLoop(NestLoopState *node) if (node->nl_NeedNewOuter) { ENL1_printf("getting new outer tuple"); - outerTupleSlot = ExecProcNode(outerPlan); + +#ifdef USE_PREFETCH + /* + * While we have outer tuples and were not able to request enought + * prefetching from the inner plan to properly load the system, + * request more outer tuples and inner prefetching for them. + * + * Unfortunately we can do outer look-ahead directed prefetching + * only when we are rescanning the inner plan anyway; otherwise we + * would break the inner scan. 
Only an independent copy of the + * inner plan state would allow us to prefetch accross inner loops + * regardless of inner scan position. + */ + while (!node->nl_lookAheadDone && + node->nl_numInnerPrefetched < NESTLOOP_PREFETCH_COUNT && + NestLoopLookAheadQueueSpace(node) > 0) + { + TupleTableSlot *lookAheadTupleSlot = ExecProcNode(outerPlan); + + if (!TupIsNull(lookAheadTupleSlot)) + { + NestLoopLookAheadQueuePush(node, lookAheadTupleSlot); + + /* + * Set inner params according to look-ahead tuple. + * + * Fetch the values of any outer Vars that must be passed + * to the inner scan, and store them in the appropriate + * PARAM_EXEC slots. + */ + foreach(lc, nl->nestParams) + { + NestLoopParam *nlp = (NestLoopParam *) lfirst(lc); + int paramno = nlp->paramno; + ParamExecData *prm; + + prm = &(econtext->ecxt_param_exec_vals[paramno]); + /* Param value should be an OUTER_VAR var */ + Assert(IsA(nlp->paramval, Var)); + Assert(nlp->paramval->varno == OUTER_VAR); + Assert(nlp->paramval->varattno > 0); + prm->value = slot_getattr(lookAheadTupleSlot, + nlp->paramval->varattno, + &(prm->isnull)); + /* Flag parameter value as changed */ + innerPlan->chgParam = + bms_add_member(innerPlan->chgParam, paramno); + } + + /* + * Rescan inner plan with changed parameters and request + * explicit prefetch. Limit the inner prefetch amount + * according to our own bookkeeping. + * + * When the so processed outer tuple gets finally active + * in the inner loop, the inner plan will autonomously + * prefetch the same tuples again. This is redundant but + * avoiding that seems too complicated for now. It should + * not hurt too much and may even help in case the + * prefetched blocks have been evicted again in the + * meantime. + */ + ExecReScan(innerPlan); + node->nl_numInnerPrefetched += + ExecPrefetchNode(innerPlan, + NESTLOOP_PREFETCH_COUNT - + node->nl_numInnerPrefetched); + } + else + node->nl_lookAheadDone = true; /* outer plan exhausted */ + } + + /* + * If there is already the next outerPlan in our look-ahead queue, + * get the next outer tuple from there, otherwise execute the + * outer plan. + */ + outerTupleSlot = NestLoopLookAheadQueuePop(node); + if (TupIsNull(outerTupleSlot) && !node->nl_lookAheadDone) +#endif /* USE_PREFETCH */ + outerTupleSlot = ExecProcNode(outerPlan); /* * if there are no more outer tuples, then the join is complete.. @@ -174,6 +338,18 @@ ExecNestLoop(NestLoopState *node) innerTupleSlot = ExecProcNode(innerPlan); econtext->ecxt_innertuple = innerTupleSlot; +#ifdef USE_PREFETCH + /* + * Decrement prefetch counter as we cosume inner tuples. We need to + * check for >0 because prefetching might not have happened for the + * consumed tuple, maybe because explicit prefetching is not supported + * by the inner plan or because the explicit prefetching requested by + * us is exhausted and the inner plan is doing it on its own now. 
+ */ + if (node->nl_numInnerPrefetched > 0) + node->nl_numInnerPrefetched--; +#endif + if (TupIsNull(innerTupleSlot)) { ENL1_printf("no inner tuple, need new outer tuple"); @@ -296,6 +472,9 @@ NestLoopState * ExecInitNestLoop(NestLoop *node, EState *estate, int eflags) { NestLoopState *nlstate; +#ifdef USE_PREFETCH + int i; +#endif /* check for unsupported flags */ Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK))); @@ -381,6 +560,15 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags) nlstate->nl_NeedNewOuter = true; nlstate->nl_MatchedOuter = false; +#ifdef USE_PREFETCH + nlstate->nl_lookAheadQueueHead = nlstate->nl_lookAheadQueueTail = -1; + nlstate->nl_lookAheadDone = false; + nlstate->nl_numInnerPrefetched = 0; + + for (i = 0; i < NESTLOOP_PREFETCH_COUNT; i++) + nlstate->nl_lookAheadQueue[i] = NULL; +#endif + NL1_printf("ExecInitNestLoop: %s\n", "node initialized"); @@ -409,6 +597,10 @@ ExecEndNestLoop(NestLoopState *node) */ ExecClearTuple(node->js.ps.ps_ResultTupleSlot); +#ifdef USE_PREFETCH + NestLoopLookAheadQueueClear(node); +#endif + /* * close down subplans */ @@ -444,4 +636,10 @@ ExecReScanNestLoop(NestLoopState *node) node->js.ps.ps_TupFromTlist = false; node->nl_NeedNewOuter = true; node->nl_MatchedOuter = false; + +#ifdef USE_PREFETCH + NestLoopLookAheadQueueClear(node); + node->nl_lookAheadDone = false; + node->nl_numInnerPrefetched = 0; +#endif } diff --git a/src/include/access/genam.h b/src/include/access/genam.h index a800041..7733b3c 100644 --- a/src/include/access/genam.h +++ b/src/include/access/genam.h @@ -146,6 +146,10 @@ extern void index_markpos(IndexScanDesc scan); extern void index_restrpos(IndexScanDesc scan); extern ItemPointer index_getnext_tid(IndexScanDesc scan, ScanDirection direction); +#ifdef USE_PREFETCH +extern int index_prefetch(IndexScanDesc scan, int maxPrefetch, + ScanDirection direction); +#endif extern HeapTuple index_fetch_heap(IndexScanDesc scan); extern HeapTuple index_getnext(IndexScanDesc scan, ScanDirection direction); extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap); diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 75841c8..88d0522 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -221,6 +221,9 @@ extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags); extern TupleTableSlot *ExecProcNode(PlanState *node); extern Node *MultiExecProcNode(PlanState *node); extern void ExecEndNode(PlanState *node); +#ifdef USE_PREFETCH +extern int ExecPrefetchNode(PlanState *node, int maxPrefetch); +#endif /* * prototypes from functions in execQual.c diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h index 71dbd9c..f93632c 100644 --- a/src/include/executor/nodeIndexscan.h +++ b/src/include/executor/nodeIndexscan.h @@ -18,6 +18,7 @@ extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags); extern TupleTableSlot *ExecIndexScan(IndexScanState *node); +extern int ExecPrefetchIndexScan(IndexScanState *node, int maxPrefetch); extern void ExecEndIndexScan(IndexScanState *node); extern void ExecIndexMarkPos(IndexScanState *node); extern void ExecIndexRestrPos(IndexScanState *node); diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 3b430e0..27fe65d 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -1526,6 +1526,18 @@ typedef struct NestLoopState bool nl_NeedNewOuter; bool nl_MatchedOuter; TupleTableSlot 
*nl_NullInnerTupleSlot; + +#ifdef USE_PREFETCH +# ifndef NESTLOOP_PREFETCH_COUNT +# define NESTLOOP_PREFETCH_COUNT 32 +# endif + /* look-ahead queue (for prefetching) - ringbuffer */ + TupleTableSlot *nl_lookAheadQueue[NESTLOOP_PREFETCH_COUNT]; + int nl_lookAheadQueueHead; + int nl_lookAheadQueueTail; + bool nl_lookAheadDone; + int nl_numInnerPrefetched; +#endif } NestLoopState; /* ---------------- -- 2.0.5 >From a1fcab2d9d001505a5fc25accdca71e88148e4ff Mon Sep 17 00:00:00 2001 From: Daniel Bausch <bausch@dvs.tu-darmstadt.de> Date: Tue, 29 Oct 2013 16:41:09 +0100 Subject: [PATCH 4/4] Limit recursive prefetching for merge join Add switch facility to limit the prefetching of a subtree recursively. In a first try add support for some variants of merge join. Distribute the prefetch allowance evenly between outer and inner subplan. --- src/backend/access/index/indexam.c | 5 +++- src/backend/executor/execProcnode.c | 47 +++++++++++++++++++++++++++++++++++- src/backend/executor/nodeAgg.c | 10 ++++++++ src/backend/executor/nodeIndexscan.c | 18 ++++++++++++++ src/backend/executor/nodeMaterial.c | 14 +++++++++++ src/backend/executor/nodeMergejoin.c | 22 +++++++++++++++++ src/include/access/relscan.h | 1 + src/include/executor/executor.h | 1 + src/include/executor/nodeAgg.h | 3 +++ src/include/executor/nodeIndexscan.h | 3 +++ src/include/executor/nodeMaterial.h | 3 +++ src/include/executor/nodeMergejoin.h | 3 +++ src/include/nodes/execnodes.h | 6 +++++ 13 files changed, 134 insertions(+), 2 deletions(-) diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 5f44dec..354bde6 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -255,6 +255,7 @@ index_beginscan(Relation heapRelation, scan->xs_prefetch_head = scan->xs_prefetch_tail = -1; scan->xs_last_prefetch = -1; scan->xs_done = false; + scan->xs_prefetch_limit = INDEXSCAN_PREFETCH_COUNT; #endif return scan; @@ -506,7 +507,9 @@ index_prefetch(IndexScanDesc scan, int maxPrefetch, ScanDirection direction) GET_SCAN_PROCEDURE(amgettuple); while (numPrefetched < maxPrefetch && !scan->xs_done && - index_prefetch_queue_space(scan) > 0) + index_prefetch_queue_space(scan) > 0 && + index_prefetch_queue_space(scan) > + (INDEXSCAN_PREFETCH_COUNT - scan->xs_prefetch_limit)) { /* * The AM's amgettuple proc finds the next index entry matching the diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index a8f2c90..a14a0d0 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -745,6 +745,51 @@ ExecEndNode(PlanState *node) #ifdef USE_PREFETCH /* ---------------------------------------------------------------- + * ExecLimitPrefetchNode + * + * Limit the amount of prefetching that may be requested by + * a subplan. + * + * Most of the handlers just pass-through the received value + * to their subplans. That is the case, when they have just + * one subplan that might prefetch. If they have two subplans + * intelligent heuristics need to be applied to distribute the + * prefetch allowance in a way delivering overall advantage. 
+ * ---------------------------------------------------------------- + */ +void +ExecLimitPrefetchNode(PlanState *node, int limit) +{ + if (node == NULL) + return; + + switch (nodeTag(node)) + { + case T_IndexScanState: + ExecLimitPrefetchIndexScan((IndexScanState *) node, limit); + break; + + case T_MergeJoinState: + ExecLimitPrefetchMergeJoin((MergeJoinState *) node, limit); + break; + + case T_MaterialState: + ExecLimitPrefetchMaterial((MaterialState *) node, limit); + break; + + case T_AggState: + ExecLimitPrefetchAgg((AggState *) node, limit); + break; + + default: + elog(INFO, + "missing ExecLimitPrefetchNode handler for node type: %d", + (int) nodeTag(node)); + break; + } +} + +/* ---------------------------------------------------------------- * ExecPrefetchNode * * Request explicit prefetching from a subtree/node without @@ -776,4 +821,4 @@ ExecPrefetchNode(PlanState *node, int maxPrefetch) return 0; } } -#endif +#endif /* USE_PREFETCH */ diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index e02a6ff..94f6d77 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -1877,6 +1877,16 @@ ExecInitAgg(Agg *node, EState *estate, int eflags) return aggstate; } +#ifdef USE_PREFETCH +void +ExecLimitPrefetchAgg(AggState *node, int limit) +{ + Assert(node != NULL); + + ExecLimitPrefetchNode(outerPlanState(node), limit); +} +#endif + static Datum GetAggInitVal(Datum textInitVal, Oid transtype) { diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index bab0e7a..6ea236e 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -640,6 +640,24 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags) return indexstate; } +#ifdef USE_PREFETCH +/* ---------------------------------------------------------------- + * ExecLimitPrefetchIndexScan + * + * Sets/changes the number of tuples whose pages to request in + * advance. 
+ * ---------------------------------------------------------------- + */ +void +ExecLimitPrefetchIndexScan(IndexScanState *node, int limit) +{ + Assert(node != NULL); + Assert(node->iss_ScanDesc != NULL); + + node->iss_ScanDesc->xs_prefetch_limit = limit; +} +#endif + /* * ExecIndexBuildScanKeys diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c index 7a82f56..3370362 100644 --- a/src/backend/executor/nodeMaterial.c +++ b/src/backend/executor/nodeMaterial.c @@ -232,6 +232,20 @@ ExecInitMaterial(Material *node, EState *estate, int eflags) return matstate; } +#ifdef USE_PREFETCH +/* ---------------------------------------------------------------- + * ExecLimitPrefetchMaterial + * ---------------------------------------------------------------- + */ +void +ExecLimitPrefetchMaterial(MaterialState *node, int limit) +{ + Assert(node != NULL); + + ExecLimitPrefetchNode(outerPlanState(node), limit); +} +#endif + /* ---------------------------------------------------------------- * ExecEndMaterial * ---------------------------------------------------------------- diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index e69bc64..f25e074 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -1627,6 +1627,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags) mergestate->mj_OuterTupleSlot = NULL; mergestate->mj_InnerTupleSlot = NULL; +#ifdef USE_PREFETCH + ExecLimitPrefetchMergeJoin(mergestate, MERGEJOIN_PREFETCH_COUNT); +#endif + /* * initialization successful */ @@ -1636,6 +1640,24 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags) return mergestate; } +#ifdef USE_PREFETCH +/* ---------------------------------------------------------------- + * ExecLimitPrefetchMergeJoin + * ---------------------------------------------------------------- + */ +void +ExecLimitPrefetchMergeJoin(MergeJoinState *node, int limit) +{ + int outerLimit = limit/2; + int innerLimit = limit/2; + + Assert(node != NULL); + + ExecLimitPrefetchNode(outerPlanState(node), outerLimit); + ExecLimitPrefetchNode(innerPlanState(node), innerLimit); +} +#endif + /* ---------------------------------------------------------------- * ExecEndMergeJoin * diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index bccc1a4..3297900 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -104,6 +104,7 @@ typedef struct IndexScanDescData int xs_prefetch_tail; BlockNumber xs_last_prefetch; bool xs_done; + int xs_prefetch_limit; #endif } IndexScanDescData; diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 88d0522..09b94e0 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -222,6 +222,7 @@ extern TupleTableSlot *ExecProcNode(PlanState *node); extern Node *MultiExecProcNode(PlanState *node); extern void ExecEndNode(PlanState *node); #ifdef USE_PREFETCH +extern void ExecLimitPrefetchNode(PlanState *node, int limit); extern int ExecPrefetchNode(PlanState *node, int maxPrefetch); #endif diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h index 38823d6..f775ec8 100644 --- a/src/include/executor/nodeAgg.h +++ b/src/include/executor/nodeAgg.h @@ -17,6 +17,9 @@ #include "nodes/execnodes.h" extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags); +#ifdef USE_PREFETCH +extern void ExecLimitPrefetchAgg(AggState *node, int limit); +#endif extern TupleTableSlot 
*ExecAgg(AggState *node); extern void ExecEndAgg(AggState *node); extern void ExecReScanAgg(AggState *node); diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h index f93632c..ccf3121 100644 --- a/src/include/executor/nodeIndexscan.h +++ b/src/include/executor/nodeIndexscan.h @@ -17,6 +17,9 @@ #include "nodes/execnodes.h" extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags); +#ifdef USE_PREFETCH +extern void ExecLimitPrefetchIndexScan(IndexScanState *node, int limit); +#endif extern TupleTableSlot *ExecIndexScan(IndexScanState *node); extern int ExecPrefetchIndexScan(IndexScanState *node, int maxPrefetch); extern void ExecEndIndexScan(IndexScanState *node); diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h index cfca0a5..5c81fe8 100644 --- a/src/include/executor/nodeMaterial.h +++ b/src/include/executor/nodeMaterial.h @@ -17,6 +17,9 @@ #include "nodes/execnodes.h" extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags); +#ifdef USE_PREFETCH +extern void ExecLimitPrefetchMaterial(MaterialState *node, int limit); +#endif extern TupleTableSlot *ExecMaterial(MaterialState *node); extern void ExecEndMaterial(MaterialState *node); extern void ExecMaterialMarkPos(MaterialState *node); diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h index fa6b5e0..e402b42 100644 --- a/src/include/executor/nodeMergejoin.h +++ b/src/include/executor/nodeMergejoin.h @@ -17,6 +17,9 @@ #include "nodes/execnodes.h" extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags); +#ifdef USE_PREFETCH +extern void ExecLimitPrefetchMergeJoin(MergeJoinState *node, int limit); +#endif extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node); extern void ExecEndMergeJoin(MergeJoinState *node); extern void ExecReScanMergeJoin(MergeJoinState *node); diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 27fe65d..64ed6fb 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -1585,6 +1585,12 @@ typedef struct MergeJoinState ExprContext *mj_InnerEContext; } MergeJoinState; +#ifdef USE_PREFETCH +# ifndef MERGEJOIN_PREFETCH_COUNT +# define MERGEJOIN_PREFETCH_COUNT 32 +# endif +#endif + /* ---------------- * HashJoinState information * -- 2.0.5
>
> On Thu, Jan 22, 2015 at 6:37 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >
> > (Please point it out if my understanding is incorrect.)
> >
> > What happens if a dynamic background worker process tries to reference temporary
> > tables? Because the buffers of temporary table blocks are allocated in private
> > address space, their current state is not visible to other processes unless it is
> > flushed to storage every time.
> >
> > Do we need to prohibit create_parallelscan_paths() from generating a path when
> > the target relation is a temporary one?
> >
>
> Yes, we need to prohibit parallel scans on temporary relations. Will fix.
>
Here is the latest patch, which fixes the reported issues and supports Prepared Statements and the Explain statement for parallel sequential scan.

The main purpose is to get feedback, if possible, on the overall structure/design of the code before I go ahead.
Attachment
On Fri, Feb 6, 2015 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Here is the latest patch, which fixes the reported issues and supports
> Prepared Statements and the Explain statement for parallel sequential
> scan.
>
> The main purpose is to get feedback, if possible, on the overall
> structure/design of the code before I go ahead.

I'm not very happy with the way this is modularized:

1. The new parallel sequential scan node runs only in the master. The workers are running a regular sequential scan with a hack to make them scan a subset of the blocks. I think this is wrong; parallel sequential scan shouldn't require this kind of modification to the non-parallel case.

2. InitiateWorkers() is entirely specific to the concerns of parallel sequential scan. After looking this over, I think there are three categories of things that need to be clearly separated. Some stuff is going to be needed for any parallel query; some stuff is going to be needed only for parallel scans but will be needed for any type of parallel scan, not just parallel sequential scan[1]; some stuff is needed for any type of node that returns tuples but not for nodes that don't return tuples (e.g. needed for ParallelSeqScan and ParallelHashJoin, but not needed for ParallelHash); and some stuff is only going to be needed for parallel sequential scan specifically. This patch mixes all of those concerns together in a single function. That won't do; this needs to be easily extensible to whatever someone wants to parallelize next.

3. I think the whole idea of using the portal infrastructure for this is wrong. We've talked about this before, but the fact that you're doing it this way is having a major impact on the whole design of the patch, and I can't believe it's ever going to be committable this way. To create a portal, you have to pretend that you received a protocol message, which you didn't; and you have to pretend there is an SQL query so you can call PortalDefineQuery. That's ugly. As far as I can see, the only thing we really get out of any of that is that we can use the DestReceiver stuff to get the tuples back to the master, but that doesn't really work either, because you're having to hack printtup.c anyway. So from my point of view you're going through a bunch of layers that really don't have any value. Considering the way the parallel mode patch has evolved, I no longer think there's much point to passing anything other than raw tuples between the backends, so the whole idea of going through a deform/send/recv/form cycle seems like something we can entirely skip.

4.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

[1] It is of course arguable whether a parallel index-scan or parallel bitmap index-scan or parallel index-only-scan or parallel custom scan makes sense, but this patch shouldn't assume that we won't want to do those things. We have other places in the code that know about the concept of a scan as opposed to some other kind of executor construct, and we should preserve that distinction here.
On Fri, Feb 6, 2015 at 12:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 4.

Obviously that went out a bit too soon. Anyway, what I think we should do here is back up a bit and talk about what the problems are that we need to solve here and how each of them should be solved. I think there is some good code in this patch, but we really need to think about what the interfaces should look like and achieve a clean separation of concerns.

Looking at the code for the non-parallel SeqScan node, there are basically two things going on here:

1. We call heap_getnext() to get the next tuple and store it into a TupleTableSlot.

2. Via ExecScan(), we do projection and apply quals.

My first comment here is that I think we should actually teach heapam.c about parallelism. In other words, let's have an interface like this:

extern Size heap_parallelscan_estimate(Snapshot snapshot);
extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
                                         Relation relation, Snapshot snapshot);
extern HeapScanDesc heap_beginscan_parallel(ParallelHeapScanDesc);

So the idea is that if you want to do a parallel scan, you call heap_parallelscan_estimate() to figure out how much space to reserve in your dynamic shared memory segment. Then you call heap_parallelscan_initialize() to initialize the chunk of memory once you have it. Then each backend that wants to assist in the parallel scan calls heap_beginscan_parallel() on that chunk of memory and gets its own HeapScanDesc. Then, they can all call heap_getnext() just as in the non-parallel case. The ParallelHeapScanDesc needs to contain the relation OID, the snapshot, the ending block number, and a current-block counter. Instead of automatically advancing to the next block, they use one of Andres's nifty new atomic ops to bump the current-block counter and then scan the block just before the new value. All this seems pretty straightforward, and if we decide to later change the way the relation gets scanned (e.g. in 1GB chunks rather than block-by-block) it can be handled here pretty easily.

Now, let's suppose that we have this interface and for some reason we don't care about quals and projection - we just want to get the tuples back to the master. It's easy enough to create a parallel context that fires up a worker and lets the worker call heap_beginscan_parallel() and then heap_getnext() in a loop, but what does it do with the resulting tuples? We need a tuple queue that can be used to send the tuples back to the master. That's also pretty easy: just set up a shm_mq and use shm_mq_send() to send each tuple. Use shm_mq_receive() in the master to read them back out. The only thing we need to be careful about is that the tuple descriptors match. It must be that they do, because the way the current parallel context patch works, the master is guaranteed to hold a lock on the relation from before the worker starts up until after it dies. But we could stash the tuple descriptor in shared memory and cross-check that it matches just to be sure. Anyway, this doesn't seem terribly complex although we might want to wrap some abstraction around it somehow so that every kind of parallelism that uses tuple queues can benefit from it. Perhaps this could even be built into the parallel context machinery somehow, or maybe it's something executor-specific. At any rate it looks simpler than what you've got now.

The complicated part here seems to me to figure out what we need to pass from the parallel leader to the parallel worker to create enough state for quals and projection.
If we want to be able to call ExecScan() without modification, which seems like a good goal, we're going to need a ScanState node, which is going to need to contain valid pointers to (at least) a ProjectionInfo, an ExprContext, and a List of quals. That in turn is going to require an ExecutorState. Serializing those things directly doesn't seem very practical; what we instead want to do is figure out what we can pass that will allow easy reconstruction of those data structures. Right now, you're passing the target list, the qual list, the range table, and the params, but the range table doesn't seem to be getting used anywhere. I wonder if we need it. If we could get away with just passing the target list and qual list, and params, we'd be doing pretty well, I think. But I'm not sure exactly what that looks like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
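To make the flow described above concrete, here is a minimal sketch of the worker side under the proposed interface. The signatures follow the proposal in the preceding message rather than any committed API, and worker_scan_loop and the zero-length "done" message are invented for illustration; error handling is omitted.

static void
worker_scan_loop(ParallelHeapScanDesc pscan, shm_mq_handle *mqh)
{
    HeapScanDesc scan;
    HeapTuple    tuple;

    /* Join the scan the leader initialized in dynamic shared memory. */
    scan = heap_beginscan_parallel(pscan);

    /* heap_getnext() works just as in the non-parallel case. */
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        /* Ship the raw tuple back to the leader's tuple queue. */
        shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);
    }

    heap_endscan(scan);

    /* Zero-length message as an invented "no more tuples" marker. */
    shm_mq_send(mqh, 0, NULL, false);
}

The leader would then loop over shm_mq_receive() on each worker's queue until every queue has reported completion.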
On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The complicated part here seems to me to figure out what we need to
> pass from the parallel leader to the parallel worker to create enough
> state for quals and projection. If we want to be able to call
> ExecScan() without modification, which seems like a good goal, we're
> going to need a ScanState node, which is going to need to contain
> valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
> List of quals. That in turn is going to require an ExecutorState.
> Serializing those things directly doesn't seem very practical; what we
> instead want to do is figure out what we can pass that will allow easy
> reconstruction of those data structures. Right now, you're passing
> the target list, the qual list, the range table, and the params, but
> the range table doesn't seem to be getting used anywhere. I wonder if
> we need it. If we could get away with just passing the target list
> and qual list, and params, we'd be doing pretty well, I think. But
> I'm not sure exactly what that looks like.

IndexBuildHeapRangeScan shows how to do qual evaluation with relatively little setup:

estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

/* Arrange for econtext's scan tuple to be the tuple under test */
econtext->ecxt_scantuple = slot;

/* Set up execution state for predicate, if any. */
predicate = (List *) ExecPrepareExpr((Expr *) indexInfo->ii_Predicate, estate);

Then, for each tuple:

ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);

And:

if (!ExecQual(predicate, econtext, false))
    continue;

This looks like a good model to follow for parallel sequential scan. The point though is that I think we should do it directly rather than letting the portal machinery do it for us. Not sure how to get projection working yet.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
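Putting those pieces together, a worker's qual-filtering loop might look roughly like the following; qual_list is a hypothetical name standing in for whatever representation of the quals the leader ships over, and everything else follows the snippets quoted above.

EState     *estate;
ExprContext *econtext;
TupleTableSlot *slot;
List       *qual;
HeapTuple   heapTuple;

estate = CreateExecutorState();
econtext = GetPerTupleExprContext(estate);
slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

/* Arrange for econtext's scan tuple to be the tuple under test. */
econtext->ecxt_scantuple = slot;

/* qual_list is whatever the leader passed down; hypothetical name. */
qual = (List *) ExecPrepareExpr((Expr *) qual_list, estate);

while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
{
    ResetExprContext(econtext);     /* free per-tuple memory */
    ExecStoreTuple(heapTuple, slot, InvalidBuffer, false);

    if (!ExecQual(qual, econtext, false))
        continue;                   /* tuple rejected by the quals */

    /* ... hand the surviving tuple onward ... */
}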
On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> My first comment here is that I think we should actually teach
> heapam.c about parallelism.

I coded this up; see attached. I'm also attaching an updated version of the parallel count code revised to use this API. It's now called "parallel_count" rather than "parallel_dummy" and I removed some stupid stuff from it. I'm curious to see what other people think, but this seems much cleaner to me.

With the old approach, the parallel-count code was duplicating some of the guts of heapam.c and dropping the rest on the floor; now it just asks for a parallel scan and away it goes. Similarly, if your parallel-seqscan patch wanted to scan block-by-block rather than splitting the relation into equal parts, or if it wanted to participate in the synchronized-seqscan stuff, there was no clean way to do that. With this approach, those decisions are - as they quite properly should be - isolated within heapam.c, rather than creeping into the executor.

(These patches should be applied over parallel-mode-v4.patch.)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 2015-02-06 22:57:43 -0500, Robert Haas wrote:
> On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > My first comment here is that I think we should actually teach
> > heapam.c about parallelism.
>
> I coded this up; see attached. I'm also attaching an updated version
> of the parallel count code revised to use this API. It's now called
> "parallel_count" rather than "parallel_dummy" and I removed some
> stupid stuff from it. I'm curious to see what other people think, but
> this seems much cleaner to me. With the old approach, the
> parallel-count code was duplicating some of the guts of heapam.c and
> dropping the rest on the floor; now it just asks for a parallel scan
> and away it goes. Similarly, if your parallel-seqscan patch wanted to
> scan block-by-block rather than splitting the relation into equal
> parts, or if it wanted to participate in the synchronized-seqscan
> stuff, there was no clean way to do that. With this approach, those
> decisions are - as they quite properly should be - isolated within
> heapam.c, rather than creeping into the executor.

I'm not convinced that that reasoning is generally valid. While it may work out nicely for seqscans - which might be useful enough on its own - the more stuff we parallelize the *more* the executor will have to know about it to make it sane. To actually scale nicely e.g. a parallel sort will have to execute the nodes below it on each backend, instead of doing that in one backend as a separate step, ferrying over all tuples to individual backends through queues, and only then parallelizing the sort.

Now. None of that is likely to matter immediately, but I think starting to build the infrastructure at the points where we'll later need it does make some sense.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-06 22:57:43 -0500, Robert Haas wrote:
>> On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > My first comment here is that I think we should actually teach
>> > heapam.c about parallelism.
>>
>> I coded this up; see attached. I'm also attaching an updated version
>> of the parallel count code revised to use this API. It's now called
>> "parallel_count" rather than "parallel_dummy" and I removed some
>> stupid stuff from it. I'm curious to see what other people think, but
>> this seems much cleaner to me. With the old approach, the
>> parallel-count code was duplicating some of the guts of heapam.c and
>> dropping the rest on the floor; now it just asks for a parallel scan
>> and away it goes. Similarly, if your parallel-seqscan patch wanted to
>> scan block-by-block rather than splitting the relation into equal
>> parts, or if it wanted to participate in the synchronized-seqscan
>> stuff, there was no clean way to do that. With this approach, those
>> decisions are - as they quite properly should be - isolated within
>> heapam.c, rather than creeping into the executor.
>
> I'm not convinced that that reasoning is generally valid. While it may
> work out nicely for seqscans - which might be useful enough on its own -
> the more stuff we parallelize the *more* the executor will have to know
> about it to make it sane. To actually scale nicely e.g. a parallel sort
> will have to execute the nodes below it on each backend, instead of
> doing that in one backend as a separate step, ferrying over all tuples to
> individual backends through queues, and only then parallelizing the
> sort.
>
> Now. None of that is likely to matter immediately, but I think starting
> to build the infrastructure at the points where we'll later need it does
> make some sense.

Well, I agree with you, but I'm not really sure what that has to do with the issue at hand. I mean, if we were to apply Amit's patch, we'd be in a situation where, for a non-parallel heap scan, heapam.c decides the order in which blocks get scanned, but for a parallel heap scan, nodeParallelSeqscan.c makes that decision. Maybe I'm an old fuddy-duddy[1] but that seems like an abstraction violation to me. I think the executor should see a parallel scan as a stream of tuples that streams into a bunch of backends in parallel, without really knowing how heapam.c is dividing up the work. That's how it's modularized today, and I don't see a reason to change it. Do you?

Regarding tuple flow between backends: I've thought about that before, I agree that we need it, and I don't think I know how to do it. I can see how to have a group of processes executing a single node in parallel, or a single process executing a group of nodes we break off from the query tree and push down to it, but what you're talking about here is a group of processes executing a group of nodes jointly. That seems like an excellent idea, but I don't know how to design it. Actually routing the tuples between whichever backends we want to exchange them between is easy enough, but how do we decide whether to generate such a plan? What does the actual plan tree look like? Maybe we designate nodes as can-generate-multiple-tuple-streams (seq scan, mostly, I would think) and can-absorb-parallel-tuple-streams (sort, hash, materialize), or something like that, but I'm really fuzzy on the details.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] Actually, there's not really any "maybe" about this.
>
> The complicated part here seems to me to figure out what we need to
> pass from the parallel leader to the parallel worker to create enough
> state for quals and projection. If we want to be able to call
> ExecScan() without modification, which seems like a good goal, we're
> going to need a ScanState node, which is going to need to contain
> valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
> List of quals. That in turn is going to require an ExecutorState.
> Serializing those things directly doesn't seem very practical; what we
> instead want to do is figure out what we can pass that will allow easy
> reconstruction of those data structures. Right now, you're passing
> the target list, the qual list, the range table, and the params, but
> the range table doesn't seem to be getting used anywhere. I wonder if
> we need it.
>
> On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2015-02-06 22:57:43 -0500, Robert Haas wrote:
> >> On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> > My first comment here is that I think we should actually teach
> >> > heapam.c about parallelism.
> >>
> >> I coded this up; see attached. I'm also attaching an updated version
> >> of the parallel count code revised to use this API. It's now called
> >> "parallel_count" rather than "parallel_dummy" and I removed some
> >> stupid stuff from it. I'm curious to see what other people think, but
> >> this seems much cleaner to me. With the old approach, the
> >> parallel-count code was duplicating some of the guts of heapam.c and
> >> dropping the rest on the floor; now it just asks for a parallel scan
> >> and away it goes. Similarly, if your parallel-seqscan patch wanted to
> >> scan block-by-block rather than splitting the relation into equal
> >> parts, or if it wanted to participate in the synchronized-seqcan
> >> stuff, there was no clean way to do that. With this approach, those
> >> decisions are - as they quite properly should be - isolated within
> >> heapam.c, rather than creeping into the executor.
> >
> > I'm not convinced that that reasoning is generally valid. While it may
> > work out nicely for seqscans - which might be useful enough on its own -
> > the more stuff we parallelize the *more* the executor will have to know
> > about it to make it sane. To actually scale nicely e.g. a parallel sort
> > will have to execute the nodes below it on each backend, instead of
> > doing that in one backend as a separate step, ferrying over all tuples to
> > individual backends through queues, and only then parallelizing the
> > sort.
> >
> > Now. None of that is likely to matter immediately, but I think starting
> > to build the infrastructure at the points where we'll later need it does
> > make some sense.
>
> Well, I agree with you, but I'm not really sure what that has to do
> with the issue at hand. I mean, if we were to apply Amit's patch,
> we'd be in a situation where, for a non-parallel heap scan, heapam.c
> decides the order in which blocks get scanned, but for a parallel heap
> scan, nodeParallelSeqscan.c makes that decision.

I think other places also decide the order/way heapam.c has to scan; for example, the order in which rows/pages are traversed is decided at the portal/executor layer and passed down to the heap, and in the case of an index the scan limits (heap_setscanlimits()) are decided outside heapam.c; something similar is done for parallel seq scan. In general, the scan is driven by the scan descriptor, which is constructed at an upper level, and there are some APIs exposed to drive the scan. If you are not happy with the current way nodeParallelSeqscan sets the scan limits, we can have some form of callback which does the required work, and this callback can be called from heapam.c.
On Sat, Feb 7, 2015 at 10:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Well, I agree with you, but I'm not really sure what that has to do
>> with the issue at hand. I mean, if we were to apply Amit's patch,
>> we'd be in a situation where, for a non-parallel heap scan, heapam.c
>> decides the order in which blocks get scanned, but for a parallel heap
>> scan, nodeParallelSeqscan.c makes that decision.
>
> I think other places also decide the order/way heapam.c has to scan;
> for example, the order in which rows/pages are traversed is decided
> at the portal/executor layer and passed down to the heap, and in the
> case of an index the scan limits (heap_setscanlimits()) are decided
> outside heapam.c; something similar is done for parallel seq scan.
> In general, the scan is driven by the scan descriptor, which is
> constructed at an upper level, and there are some APIs exposed to
> drive the scan. If you are not happy with the current way
> nodeParallelSeqscan sets the scan limits, we can have some form of
> callback which does the required work, and this callback can be
> called from heapam.c.

I thought about a callback, but what's the benefit of doing that vs. hard-coding it in heapam.c? If the upper layer wants to impose a TID qual or similar, then heap_setscanlimits() makes sense, but that's effectively a filter condition, not a policy decision about the access pattern.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
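For reference, the approach being debated - handing each worker a contiguous slice of the relation via heap_setscanlimits() - amounts to something like the following sketch. worker_id and nworkers are placeholder names, and a straight N-way split like this is exactly what risks the uneven work distribution mentioned earlier in the thread.

BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
BlockNumber perworker = (nblocks + nworkers - 1) / nworkers;
BlockNumber start = perworker * worker_id;  /* this worker's slice */

if (start < nblocks)
{
    HeapScanDesc scan = heap_beginscan(rel, snapshot, 0, NULL);

    /* Restrict this scan to our contiguous block range. */
    heap_setscanlimits(scan, start, Min(perworker, nblocks - start));

    /* ... ordinary heap_getnext() loop ... */

    heap_endscan(scan);
}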
>
> On Fri, Feb 6, 2015 at 2:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > The complicated part here seems to me to figure out what we need to
> > pass from the parallel leader to the parallel worker to create enough
> > state for quals and projection. If we want to be able to call
> > ExecScan() without modification, which seems like a good goal, we're
> > going to need a ScanState node, which is going to need to contain
> > valid pointers to (at least) a ProjectionInfo, an ExprContext, and a
> > List of quals. That in turn is going to require an ExecutorState.
> > Serializing those things directly doesn't seem very practical; what we
> > instead want to do is figure out what we can pass that will allow easy
> > reconstruction of those data structures. Right now, you're passing
> > the target list, the qual list, the range table, and the params, but
> > the range table doesn't seem to be getting used anywhere. I wonder if
> > we need it. If we could get away with just passing the target list
> > and qual list, and params, we'd be doing pretty well, I think. But
> > I'm not sure exactly what that looks like.
>
> IndexBuildHeapRangeScan shows how to do qual evaluation with
> relatively little setup:
>
>
> On Fri, Feb 6, 2015 at 9:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Here is the latest patch, which fixes the reported issues and supports
> > Prepared Statements and the Explain statement for parallel sequential
> > scan.
> >
> > The main purpose is to get feedback, if possible, on the overall
> > structure/design of the code before I go ahead.
>
>
> 2. InitiateWorkers() is entirely specific to the concerns of parallel
> sequential scan. After looking this over, I think there are three
> categories of things that need to be clearly separated. Some stuff is
> going to be needed for any parallel query; some stuff is going to be
> needed only for parallel scans but will be needed for any type of
> parallel scan, not just parallel sequential scan[1]; some stuff is
> needed for any type of node that returns tuples but not for nodes that
> don't return tuples (e.g. needed for ParallelSeqScan and
> ParallelHashJoin, but not needed for ParallelHash); and some stuff is
> only going to be needed for parallel sequential scan specifically.
> This patch mixes all of those concerns together in a single function.
> That won't do; this needs to be easily extensible to whatever someone
> wants to parallelize next.
>
>
> On Sat, Feb 7, 2015 at 10:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Well, I agree with you, but I'm not really sure what that has to do
> >> with the issue at hand. I mean, if we were to apply Amit's patch,
> >> we'd be in a situation where, for a non-parallel heap scan, heapam.c
> >> decides the order in which blocks get scanned, but for a parallel heap
> >> scan, nodeParallelSeqscan.c makes that decision.
> >
> > I think other places also decide the order/way heapam.c has to scan;
> > for example, the order in which rows/pages are traversed is decided
> > at the portal/executor layer and passed down to the heap, and in the
> > case of an index the scan limits (heap_setscanlimits()) are decided
> > outside heapam.c; something similar is done for parallel seq scan.
> > In general, the scan is driven by the scan descriptor, which is
> > constructed at an upper level, and there are some APIs exposed to
> > drive the scan. If you are not happy with the current way
> > nodeParallelSeqscan sets the scan limits, we can have some form of
> > callback which does the required work, and this callback can be
> > called from heapam.c.
>
> I thought about a callback, but what's the benefit of doing that vs.
> hard-coding it in heapam.c?
Another idea is to use Executor level interfaces (like ExecutorStart(), ExecutorRun(), ExecutorEnd()) for execution rather than using Portal level interfaces. I have used Portal level interfaces with the thought that we can reuse the existing Portal infrastructure to make parallel execution of scrollable cursors work, but as per my analysis it is not easy to support them, especially backward scans and absolute/relative fetches, so Executor level interfaces seem more appealing to me (something like how the Explain statement works (ExplainOnePlan)). Using Executor level interfaces will have the advantage that we can reuse them for other parallel functionality. In this approach, we need to take care of constructing the relevant structures (with the information passed by the master backend) required for the Executor interfaces, but I think this should be less than what we need in the previous approach (extracting seqscan specific stuff from the executor).

EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 9, 2015 at 2:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Another idea is to use Executor level interfaces (like ExecutorStart(),
> ExecutorRun(), ExecutorEnd()) for execution rather than using Portal
> level interfaces. I have used Portal level interfaces with the
> thought that we can reuse the existing Portal infrastructure to make
> parallel execution of scrollable cursors work, but as per my analysis
> it is not easy to support them, especially backward scans and
> absolute/relative fetches, so Executor level interfaces seem more
> appealing to me (something like how the Explain statement works
> (ExplainOnePlan)). Using Executor level interfaces will have the
> advantage that we can reuse them for other parallel functionality. In
> this approach, we need to take care of constructing the relevant
> structures (with the information passed by the master backend)
> required for the Executor interfaces, but I think this should be less
> than what we need in the previous approach (extracting seqscan
> specific stuff from the executor).

I think using the executor-level interfaces instead of the portal-level interfaces is a good idea. That would possibly let us altogether prohibit access to the portal layer from within a parallel worker, which seems like it might be a good sanity check to add. But that seems to still require us to have a PlannedStmt and a QueryDesc, and I'm not sure whether that's going to be too much of a pain. We might need to think about an alternative API for starting the Executor, like ExecutorStartParallel() or ExecutorStartExtended(). But I'm not sure. If you can revise things to go through the executor interfaces, I think that would be a good start, and then perhaps after that we can see what else makes sense to do.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
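A worker going through the executor interfaces directly might look roughly like this sketch; pstmt, params, and dest are assumed to be reconstructed from state the master passed down, and the CreateQueryDesc() arguments follow the existing signature rather than any hypothetical ExecutorStartParallel().

QueryDesc  *queryDesc;

/* pstmt, params and dest are rebuilt from shared state; assumptions. */
queryDesc = CreateQueryDesc(pstmt, "<parallel worker>",
                            GetActiveSnapshot(), InvalidSnapshot,
                            dest, params, 0);

ExecutorStart(queryDesc, 0);
ExecutorRun(queryDesc, ForwardScanDirection, 0L);   /* 0 = run to completion */
ExecutorFinish(queryDesc);
ExecutorEnd(queryDesc);
FreeQueryDesc(queryDesc);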
On 2015-02-07 17:16:12 -0500, Robert Haas wrote:
> On Sat, Feb 7, 2015 at 4:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > [ criticism of Amit's heapam integration ]
> > I'm not convinced that that reasoning is generally valid. While it may
> > work out nicely for seqscans - which might be useful enough on its own -
> > the more stuff we parallelize the *more* the executor will have to know
> > about it to make it sane. To actually scale nicely e.g. a parallel sort
> > will have to execute the nodes below it on each backend, instead of
> > doing that in one backend as a separate step, ferrying over all tuples to
> > individual backends through queues, and only then parallelizing the
> > sort.
> >
> > Now. None of that is likely to matter immediately, but I think starting
> > to build the infrastructure at the points where we'll later need it does
> > make some sense.
>
> Well, I agree with you, but I'm not really sure what that has to do
> with the issue at hand. I mean, if we were to apply Amit's patch,
> we'd be in a situation where, for a non-parallel heap scan, heapam.c
> decides the order in which blocks get scanned, but for a parallel heap
> scan, nodeParallelSeqscan.c makes that decision. Maybe I'm an old
> fuddy-duddy[1] but that seems like an abstraction violation to me. I
> think the executor should see a parallel scan as a stream of tuples
> that streams into a bunch of backends in parallel, without really
> knowing how heapam.c is dividing up the work. That's how it's
> modularized today, and I don't see a reason to change it. Do you?

I don't really agree. Normally heapam just sequentially scans the heap in one go; there's not much logic to that. OK, then there's also the synchronized seqscan stuff - which just about every user of heapscans but the executor promptly disables again. I don't think a heap_scan_page() or similar API would constitute a relevant layering violation over what we already have.

Note that I'm not saying that Amit's patch is right - I haven't read it - but that I don't think a 'scan this range of pages' heapscan API would be a bad idea. Not even just for parallelism, but for a bunch of use cases.

> Regarding tuple flow between backends: I've thought about that before,
> I agree that we need it, and I don't think I know how to do it. I can
> see how to have a group of processes executing a single node in
> parallel, or a single process executing a group of nodes we break off
> from the query tree and push down to it, but what you're talking about
> here is a group of processes executing a group of nodes jointly.

I don't think it really is that. I think you'd do it essentially by introducing a couple more nodes. Something like

                     SomeUpperLayerNode
                             |
                             |
                      AggCombinerNode
                     /       |       \
                    /        |        \
                   /         |         \
  PartialHashAggNode PartialHashAggNode ... PartialHashAggNode
          |                  |
          |                  |
          |                  |
          |                  |
    PartialSeqScan     PartialSeqScan

The only thing that might potentially need to end up working jointly would be the block selection of the individual PartialSeqScans, to avoid having to wait too long for stragglers. E.g. each might just ask for a range of 16 megabytes or so that it scans sequentially.

In such a plan - a pretty sensible and not that uncommon thing for parallelized aggregates - you'd need to be able to tell the heap scans which blocks to scan. Right?

> That seems like an excellent idea, but I don't know how to design it.
> Actually routing the tuples between whichever backends we want to
> exchange them between is easy enough, but how do we decide whether to
> generate such a plan?
> What does the actual plan tree look like?

I described above how I think it'd roughly look. Whether to generate it would probably depend on the cardinality (not much point in doing the above if all groups are distinct) and possibly on the aggregates in use (if we have a parallelizable sum/count/avg etc).

> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
> (sort, hash, materialize), or something like that, but I'm really
> fuzzy on the details.

I don't think we really should have individual nodes that produce multiple streams - that seems like it'd end up being really complicated. I'd more say that we have distinct nodes (like the PartialSeqScan ones above) that do a teensy bit of coordination about which work to perform.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
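As a hand-wavy illustration of the combiner idea: none of these node types or helper functions exist, and a real implementation would presumably poll the children concurrently rather than draining them one by one, but the basic shape might be something like this.

static TupleTableSlot *
ExecAggCombiner(AggCombinerState *node)     /* hypothetical node type */
{
    int         i;

    /* Phase 1: drain every PartialHashAggNode child into our table. */
    for (i = 0; i < node->nchildren; i++)
    {
        TupleTableSlot *slot;

        while (!TupIsNull(slot = ExecProcNode(node->children[i])))
            combine_partial_group(node->hashtable, slot);  /* invented helper */
    }

    /* Phase 2: return the merged groups one at a time. */
    return fetch_next_group(node->hashtable);   /* invented helper */
}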
On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Note that I'm not saying that Amit's patch is right - I haven't read it
> - but that I don't think a 'scan this range of pages' heapscan API would
> be a bad idea. Not even just for parallelism, but for a bunch of
> use cases.

We do have that, already. heap_setscanlimits(). I'm just not convinced that that's the right way to split up a parallel scan. There's too much risk of ending up with a very uneven distribution of work.

>> Regarding tuple flow between backends: I've thought about that before,
>> I agree that we need it, and I don't think I know how to do it. I can
>> see how to have a group of processes executing a single node in
>> parallel, or a single process executing a group of nodes we break off
>> from the query tree and push down to it, but what you're talking about
>> here is a group of processes executing a group of nodes jointly.
>
> I don't think it really is that. I think you'd do it essentially by
> introducing a couple more nodes. Something like
>
>                      SomeUpperLayerNode
>                              |
>                              |
>                       AggCombinerNode
>                      /       |       \
>                     /        |        \
>                    /         |         \
>   PartialHashAggNode PartialHashAggNode ... PartialHashAggNode
>           |                  |
>           |                  |
>           |                  |
>           |                  |
>     PartialSeqScan     PartialSeqScan
>
> The only thing that might potentially need to end up working jointly
> would be the block selection of the individual PartialSeqScans, to
> avoid having to wait too long for stragglers. E.g. each might just
> ask for a range of 16 megabytes or so that it scans sequentially.
>
> In such a plan - a pretty sensible and not that uncommon thing for
> parallelized aggregates - you'd need to be able to tell the heap scans
> which blocks to scan. Right?

For this case, what I would imagine is that there is one parallel heap scan, and each PartialSeqScan attaches to it. The executor says "give me a tuple" and heapam.c provides one. Details like the chunk size are managed down inside heapam.c, and the executor does not know about them. It just knows that it can establish a parallel scan and then pull tuples from it.

>> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
>> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
>> (sort, hash, materialize), or something like that, but I'm really
>> fuzzy on the details.
>
> I don't think we really should have individual nodes that produce
> multiple streams - that seems like it'd end up being really
> complicated. I'd more say that we have distinct nodes (like the
> PartialSeqScan ones above) that do a teensy bit of coordination about
> which work to perform.

I think we're in violent agreement here, except for some terminological confusion. Are there N PartialSeqScan nodes, one running in each backend, or is there one ParallelSeqScan node, which is copied and run jointly across N backends? You can talk about it either way and have it make sense, but we haven't had enough conversations about this on this list to have settled on a consistent set of vocabulary yet.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
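The heapam.c-side block hand-out described here can be sketched as follows; the struct fields and the function name are illustrative rather than taken from the attached patch. Note that pg_atomic_fetch_add_u32() returns the old value, so each backend scans the block just before the new counter value.

typedef struct ParallelHeapScanDescData
{
    Oid         phs_relid;          /* relation to scan */
    BlockNumber phs_nblocks;        /* number of blocks in relation */
    pg_atomic_uint32 phs_cblock;    /* next block number to hand out */
    /* ... snapshot data would follow ... */
} ParallelHeapScanDescData;

static BlockNumber
parallel_scan_nextblock(ParallelHeapScanDescData *pscan)
{
    /* Atomically claim the next block for this backend. */
    uint32      next = pg_atomic_fetch_add_u32(&pscan->phs_cblock, 1);

    if (next >= pscan->phs_nblocks)
        return InvalidBlockNumber;  /* scan is exhausted */

    return (BlockNumber) next;
}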
On 2015-02-10 08:52:09 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Note that I'm not saying that Amit's patch is right - I haven't read it
> > - but that I don't think a 'scan this range of pages' heapscan API would
> > be a bad idea. Not even just for parallelism, but for a bunch of
> > use cases.
>
> We do have that, already. heap_setscanlimits(). I'm just not
> convinced that that's the right way to split up a parallel scan.
> There's too much risk of ending up with a very uneven distribution of
> work.

If you make the chunks small enough, and then coordinate only the chunk distribution, not really.

> >> Regarding tuple flow between backends: I've thought about that before,
> >> I agree that we need it, and I don't think I know how to do it. I can
> >> see how to have a group of processes executing a single node in
> >> parallel, or a single process executing a group of nodes we break off
> >> from the query tree and push down to it, but what you're talking about
> >> here is a group of processes executing a group of nodes jointly.
> >
> > I don't think it really is that. I think you'd do it essentially by
> > introducing a couple more nodes. Something like
> >
> >                      SomeUpperLayerNode
> >                              |
> >                              |
> >                       AggCombinerNode
> >                      /       |       \
> >                     /        |        \
> >                    /         |         \
> >   PartialHashAggNode PartialHashAggNode ... PartialHashAggNode
> >           |                  |
> >           |                  |
> >           |                  |
> >           |                  |
> >     PartialSeqScan     PartialSeqScan
> >
> > The only thing that might potentially need to end up working jointly
> > would be the block selection of the individual PartialSeqScans, to
> > avoid having to wait too long for stragglers. E.g. each might just
> > ask for a range of 16 megabytes or so that it scans sequentially.
> >
> > In such a plan - a pretty sensible and not that uncommon thing for
> > parallelized aggregates - you'd need to be able to tell the heap scans
> > which blocks to scan. Right?
>
> For this case, what I would imagine is that there is one parallel heap
> scan, and each PartialSeqScan attaches to it. The executor says "give
> me a tuple" and heapam.c provides one. Details like the chunk size
> are managed down inside heapam.c, and the executor does not know about
> them. It just knows that it can establish a parallel scan and then
> pull tuples from it.

I think that's a horrible approach that'll end up with far more entangled pieces than what you're trying to avoid. Unless the tuple flow is organized to happen only in the necessary cases, the performance will be horrible. And good chunk sizes et al depend on higher layers, selectivity estimates and such. And that's planner/executor work, not the physical layer (which heapam.c pretty much is).

An individual heap scan's state lives in process-private memory. And if the results inside the separate workers should be used directly in these workers without shipping them over the network, it'd be horrible to have the logic in the heapscan. How would you otherwise model an executor tree that does the seqscan and aggregation combined in multiple processes at the same time?

> >> Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
> >> scan, mostly, I would think) and can-absorb-parallel-tuple-streams
> >> (sort, hash, materialize), or something like that, but I'm really
> >> fuzzy on the details.
> >
> > I don't think we really should have individual nodes that produce
> > multiple streams - that seems like it'd end up being really
> > complicated.
> > I'd more say that we have distinct nodes (like the
> > PartialSeqScan ones above) that do a teensy bit of coordination about
> > which work to perform.
>
> I think we're in violent agreement here, except for some
> terminological confusion. Are there N PartialSeqScan nodes, one
> running in each backend, or is there one ParallelSeqScan node, which is
> copied and run jointly across N backends? You can talk about it either
> way and have it make sense, but we haven't had enough conversations
> about this on this list to have settled on a consistent set of
> vocabulary yet.

I pretty strongly believe that it has to be independent scan nodes - both from an implementation and from a conversational POV. They might have some very light cooperation between them (e.g. coordinating block ranges or such), but everything else should be separate. From an implementation POV it seems pretty awful to have an executor node that's accessed by multiple separate backends - that'd mean it'd have to be concurrency safe, have state in shared memory, and everything.

Now, there'll be a node that needs to do some parallel magic - but in the above example that should be the AggCombinerNode, which would not only ask for tuples from one of the children at a time, but ask multiple ones in parallel. But even then it doesn't have to deal with concurrency around its own state.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
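The "very light cooperation" described here might be as small as the following sketch, where workers claim multi-block chunks instead of single blocks, so each still reads large contiguous ranges (good for readahead and NUMA locality). All names are illustrative; 2048 blocks of 8kB corresponds to the ~16MB ranges mentioned upthread.

#define CHUNK_BLOCKS 2048           /* 2048 * 8kB = 16MB per claim */

static bool
claim_chunk(pg_atomic_uint32 *next_chunk, BlockNumber nblocks,
            BlockNumber *first, BlockNumber *last)
{
    /* Atomically take the next chunk number; fetch_add returns the old value. */
    uint32      chunk = pg_atomic_fetch_add_u32(next_chunk, 1);
    BlockNumber start = (BlockNumber) chunk * CHUNK_BLOCKS;

    if (start >= nblocks)
        return false;               /* nothing left to claim */

    *first = start;
    *last = Min(start + CHUNK_BLOCKS, nblocks) - 1;
    return true;
}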
On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> If you make the chunks small enough, and then coordinate only the chunk
> distribution, not really.

True, but why do you want to do that in the executor instead of in heapam?

>> For this case, what I would imagine is that there is one parallel heap
>> scan, and each PartialSeqScan attaches to it. The executor says "give
>> me a tuple" and heapam.c provides one. Details like the chunk size
>> are managed down inside heapam.c, and the executor does not know about
>> them. It just knows that it can establish a parallel scan and then
>> pull tuples from it.
>
> I think that's a horrible approach that'll end up with far more
> entangled pieces than what you're trying to avoid. Unless the tuple flow
> is organized to happen only in the necessary cases, the performance will
> be horrible.

I can't understand this at all. A parallel heap scan, as I've coded it up, involves no tuple flow at all. All that's happening at the heapam.c layer is that we're coordinating which blocks to scan. Not to be disrespectful, but have you actually looked at the patch?

> And good chunk sizes et al depend on higher layers,
> selectivity estimates and such. And that's planner/executor work, not
> the physical layer (which heapam.c pretty much is).

If it's true that a good chunk size depends on the higher layers, then that would be a good argument for doing this differently, or at least for exposing an API for the higher layers to tell heapam.c what chunk size they want. I hadn't considered that possibility - can you elaborate on why you think we might want to vary the chunk size?

> An individual heap scan's state lives in process-private memory. And if
> the results inside the separate workers should be used directly in
> these workers without shipping them over the network, it'd be horrible
> to have the logic in the heapscan. How would you otherwise model an
> executor tree that does the seqscan and aggregation combined in
> multiple processes at the same time?

Again, the heap scan is not shipping anything anywhere, ever, in any design of any patch proposed or written. The results *are* directly used inside each individual worker.

>> I think we're in violent agreement here, except for some
>> terminological confusion. Are there N PartialSeqScan nodes, one
>> running in each backend, or is there one ParallelSeqScan node, which is
>> copied and run jointly across N backends? You can talk about it either
>> way and have it make sense, but we haven't had enough conversations
>> about this on this list to have settled on a consistent set of
>> vocabulary yet.
>
> I pretty strongly believe that it has to be independent scan nodes - both
> from an implementation and from a conversational POV. They might have some
> very light cooperation between them (e.g. coordinating block ranges or
> such), but everything else should be separate. From an implementation
> POV it seems pretty awful to have an executor node that's accessed by
> multiple separate backends - that'd mean it'd have to be concurrency safe,
> have state in shared memory, and everything.

I don't agree with that, but again I think it's a terminological dispute. I think what will happen is that you will have a single node that gets copied into multiple backends, and in some cases a small portion of its state will live in shared memory. That's more or less what you're thinking of too, I think.
But what I don't want is - if we've got a parallel scan-and-aggregate
happening in N nodes, EXPLAIN shows N copies of all of that - not only
because it's display clutter, but also because a plan to do that thing
with 3 workers is fundamentally the same as a plan to do it with 30
workers. Those plans shouldn't look different, except perhaps for a
line some place that says "Number of Workers: N".

> Now, there'll be a node that needs to do some parallel magic - but in
> the above example that should be the AggCombinerNode, which would not
> only ask for tuples from one of the children at a time, but ask multiple
> ones in parallel. But even then it doesn't have to deal with concurrency
> around its own state.

Sure, we clearly want to minimize the amount of coordination between
nodes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
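[To make the "small portion of its state will live in shared memory" idea concrete, here is a minimal sketch - not taken from any posted patch, with all field names being illustrative assumptions - of the kind of structure being discussed for a parallel seqscan:]

#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

/*
 * Sketch only: for a parallel seqscan, the one piece of state that
 * genuinely must be shared is which block the next worker should
 * claim.  Everything else in the scan node can stay backend-local.
 */
typedef struct ParallelScanShared
{
	Oid			relid;			/* relation being scanned */
	BlockNumber	nblocks;		/* relation size at scan start */
	slock_t		mutex;			/* protects next_block */
	BlockNumber	next_block;		/* next block no worker has claimed */
} ParallelScanShared;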
On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > And good chunk sizes et al depend on higher layers,
> > selectivity estimates and such. And that's planner/executor work, not
> > the physical layer (which heapam.c pretty much is).
>
> If it's true that a good chunk size depends on the higher layers, then
> that would be a good argument for doing this differently, or at least
> exposing an API for the higher layers to tell heapam.c what chunk size
> they want. I hadn't considered that possibility - can you elaborate
> on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire
plan. If you have a 1TB table and want to sequentially scan it in
parallel with 10 workers you better use some rather large chunks. That
way readahead will be efficient in a cpu/socket local manner,
i.e. directly reading in the pages into the directly connected memory of
that cpu. Important for performance on a NUMA system, otherwise you'll
constantly have everything go over the shared bus. But if you instead
have a plan where the sequential scan goes over a 1GB table, perhaps
with some relatively expensive filters, you'll really want a small
chunk size to avoid waiting. The chunk size will also really depend on
what other nodes are doing, at least if they can run in the same worker.

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability; even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet
another lock (that's shared between all of them!) for every single page
won't be fun, especially with no or only fast filters.

> >> For this case, what I would imagine is that there is one parallel heap
> >> scan, and each PartialSeqScan attaches to it. The executor says "give
> >> me a tuple" and heapam.c provides one. Details like the chunk size
> >> are managed down inside heapam.c, and the executor does not know about
> >> them. It just knows that it can establish a parallel scan and then
> >> pull tuples from it.
> >
> > I think that's a horrible approach that'll end up with far more
> > entangled pieces than what you're trying to avoid. Unless the tuple flow
> > is organized to only happen in the necessary cases the performance will
> > be horrible.
>
> I can't understand this at all. A parallel heap scan, as I've coded
> it up, involves no tuple flow at all. All that's happening at the
> heapam.c layer is that we're coordinating which blocks to scan. Not
> to be disrespectful, but have you actually looked at the patch?

No, and I said so upthread. I started commenting because you argued that
architecturally parallelism belongs in heapam.c instead of upper layers,
and I can't agree with that. I now have, and it looks less bad than I
had assumed, sorry. Unfortunately I still think it's the wrong approach,
also sorry.

As pointed out above (moved there after reading the patch...) I don't
think a chunk size of 1 or any other constant size can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end). Now, we could move
determining that before the query execution into executor
initialization, but then we won't yet know how many workers we're going
to get.
We could add a function setting that at runtime, but that'd mix up
responsibilities quite a bit.

I also can't agree with having a static snapshot in shared memory put
there by the initialization function. For one it's quite awkward to end
up with several equivalent snapshots at various places in shared memory.
Right now the entire query execution can share one snapshot, this way
we'd end up with several of them. Imo for actual parallel query
execution the plan should be shared once and then be reused for
everything done in the name of the query.

Without the need to do that you end up with pretty much only the setup
infrastructure, so that heap_parallelscan_nextpage is called. How about
instead renaming heap_beginscan_internal() to _extended and offering an
option to provide a callback + state that determines the next page?
Additionally provide some separate functions managing a simple
implementation of such a callback + state?

Btw, using an atomic uint32 you'd end up without the spinlock and just
about the same amount of code... Just do a atomic_fetch_add_until32(var,
1, InvalidBlockNumber)... ;)

> >> I think we're in violent agreement here, except for some
> >> terminological confusion. Are there N PartialSeqScan nodes, one
> >> running in each node, or is there one ParallelSeqScan node, which is
> >> copied and run jointly across N nodes? You can talk about either way
> >> and have it make sense, but we haven't had enough conversations about
> >> this on this list to have settled on a consistent set of vocabulary
> >> yet.
> >
> > I pretty strongly believe that it has to be independent scan nodes, both
> > from an implementation and a conversational POV. They might have some
> > very light cooperation between them (e.g. coordinating block ranges or
> > such), but everything else should be separate. From an implementation
> > POV it seems pretty awful to have an executor node that's accessed by
> > multiple separate backends - that'd mean it'd have to be concurrency
> > safe, have state in shared memory and everything.
>
> I don't agree with that, but again I think it's a terminological
> dispute. I think what will happen is that you will have a single node
> that gets copied into multiple backends, and in some cases a small
> portion of its state will live in shared memory. That's more or less
> what you're thinking of too, I think.

Well, let me put it that way, I think that the tuple flow has to be
pretty much like I'd ascii-art'ed earlier. And that only very few nodes
will need to coordinate between query execution happening in different
workers. With that I mean it has to be possible to have queries like:

ParallelismDrivingNode
          |
---------------- Parallelism boundary
          |
      NestLoop
      /      \
CSeqScan    IndexScan

Where the 'coordinated seqscan' scans a relation so that each tuple
eventually gets returned once across all nodes, but the nested loop (and
through it the index scan) will just run normally, without any
coordination and parallelism. But everything below --- would happen in
multiple nodes. If you agree, yes, then we're in violent agreement
;). The "single node that gets copied" bit above makes me a bit unsure
whether we are though.

To me, given the existing executor code, it seems easiest to achieve
that by having the ParallelismDrivingNode above having a dynamic number
of nestloop children in different backends and point the coordinated
seqscan to some shared state.
As you point out, the number of these children cannot be certainly
known (just targeted for) at plan time; that puts a certain limit on how
independent they are. But since a large number of them can be
independent between workers it seems awkward to generally treat them as
being the same node across workers. But maybe that's just an issue with
my mental model.

> But what I don't want is - if we've got a parallel scan-and-aggregate
> happening in N nodes, EXPLAIN shows N copies of all of that - not only
> because it's display clutter, but also because a plan to do that thing
> with 3 workers is fundamentally the same as a plan to do it with 30
> workers. Those plans shouldn't look different, except perhaps for a
> line some place that says "Number of Workers: N".

I'm really not concerned with what explain is going to show. We can do
quite some fudging there - it's not like it's a 1:1 representation of
the query plan. I think we're getting to the point where having a unique
mapping from the plan to the execution tree is proving to be rather
limiting anyway. Check for example the discussion about join removal.

But even for current code, showing only the custom plans for the first
five EXPLAIN EXECUTEs is pretty nasty (try explaining that to somebody
who doesn't know pg internals - their looks are worth gold and can kill
you at the same time) and should be done differently. And I actually can
very well imagine that you'd want an option to show the different
execution statistics for every worker in the ANALYZE case.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
>> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > And good chunk sizes et al depend on higher layers,
>> > selectivity estimates and such. And that's planner/executor work, not
>> > the physical layer (which heapam.c pretty much is).
>>
>> If it's true that a good chunk size depends on the higher layers, then
>> that would be a good argument for doing this differently, or at least
>> exposing an API for the higher layers to tell heapam.c what chunk size
>> they want. I hadn't considered that possibility - can you elaborate
>> on why you think we might want to vary the chunk size?
>
> Because things like chunk size depend on the shape of the entire
> plan. If you have a 1TB table and want to sequentially scan it in
> parallel with 10 workers you better use some rather large chunks. That
> way readahead will be efficient in a cpu/socket local manner,
> i.e. directly reading in the pages into the directly connected memory of
> that cpu. Important for performance on a NUMA system, otherwise you'll
> constantly have everything go over the shared bus. But if you instead
> have a plan where the sequential scan goes over a 1GB table, perhaps
> with some relatively expensive filters, you'll really want a small
> chunk size to avoid waiting.

I see. That makes sense.

> The chunk size will also really depend on
> what other nodes are doing, at least if they can run in the same worker.

Example?

> Even without things like NUMA and readahead I'm pretty sure that you'll
> want a chunk size a good bit above one page. The locks we acquire for
> the buffercache lookup and for reading the page are already quite bad
> for performance/scalability; even if we don't always/often hit the same
> lock. Making 20 processes that scan pages in parallel acquire yet
> another lock (that's shared between all of them!) for every single page
> won't be fun, especially with no or only fast filters.

This is possible, but I'm skeptical. If the amount of other work we
have to do for that page is so little that the additional spinlock cycle
per page causes meaningful contention, I doubt we should be
parallelizing in the first place.

> No, and I said so upthread. I started commenting because you argued that
> architecturally parallelism belongs in heapam.c instead of upper layers,
> and I can't agree with that. I now have, and it looks less bad than I
> had assumed, sorry.

OK, that's something.

> Unfortunately I still think it's the wrong approach, also sorry.
>
> As pointed out above (moved there after reading the patch...) I don't
> think a chunk size of 1 or any other constant size can make sense. I
> don't even believe it'll necessarily be constant across an entire query
> execution (big initially, small at the end). Now, we could move
> determining that before the query execution into executor
> initialization, but then we won't yet know how many workers we're going
> to get. We could add a function setting that at runtime, but that'd mix
> up responsibilities quite a bit.

I still think this belongs in heapam.c somehow or other. If the logic
is all in the executor, then it becomes impossible for any code that
doesn't use the executor to do a parallel heap scan, and that's probably
bad. It's not hard to imagine something like CLUSTER wanting to reuse
that code, and that won't be possible if the logic is up in some higher
layer.
If the logic we want is to start with a large chunk size and then switch
to a small chunk size when there's not much of the relation left to
scan, there's still no reason that can't be encapsulated in heapam.c.

> Btw, using an atomic uint32 you'd end up without the spinlock and just
> about the same amount of code... Just do a atomic_fetch_add_until32(var,
> 1, InvalidBlockNumber)... ;)

I thought of that, but I think there's an overflow hazard.

> Where the 'coordinated seqscan' scans a relation so that each tuple
> eventually gets returned once across all nodes, but the nested loop (and
> through it the index scan) will just run normally, without any
> coordination and parallelism. But everything below --- would happen in
> multiple nodes. If you agree, yes, then we're in violent agreement
> ;). The "single node that gets copied" bit above makes me a bit unsure
> whether we are though.

Yeah, I think we're talking about the same thing.

> To me, given the existing executor code, it seems easiest to achieve
> that by having the ParallelismDrivingNode above having a dynamic number
> of nestloop children in different backends and point the coordinated
> seqscan to some shared state. As you point out, the number of these
> children cannot be certainly known (just targeted for) at plan time;
> that puts a certain limit on how independent they are. But since a
> large number of them can be independent between workers it seems awkward
> to generally treat them as being the same node across workers. But maybe
> that's just an issue with my mental model.

I think it makes sense to think of a set of tasks in which workers can
assist. So you have a query tree which is just one query tree, with no
copies of the nodes, and then there are certain places in that query
tree where a worker can jump in and assist that node. To do that, it
will have a copy of the node, but that doesn't mean that all of the
stuff inside the node becomes shared data at the code level, because
that would be stupid.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
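[To illustrate how such a policy could stay encapsulated in heapam.c, here is a hedged sketch of a chunk allocator that hands out large block ranges while much of the relation remains and shrinks toward single blocks near the end. The structure, field names, and the divide-by-eight ramp are assumptions for illustration, not the patch's actual behavior:]

#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

/* Sketch of shared scan state; all names are illustrative assumptions. */
typedef struct ParallelScanChunkState
{
	slock_t		mutex;
	BlockNumber	nblocks;		/* total blocks in the relation */
	BlockNumber	next_block;		/* first unclaimed block */
	BlockNumber	max_chunk;		/* e.g. nblocks / (nworkers * 64) */
} ParallelScanChunkState;

/*
 * Hand the caller a [start, start + len) block range.  Large chunks
 * while plenty of the relation remains, shrinking toward single blocks
 * at the end so all workers finish at roughly the same time.
 */
static bool
claim_chunk(ParallelScanChunkState *st, BlockNumber *start, BlockNumber *len)
{
	bool		found = false;

	SpinLockAcquire(&st->mutex);
	if (st->next_block < st->nblocks)
	{
		BlockNumber remaining = st->nblocks - st->next_block;
		BlockNumber chunk = Min(st->max_chunk, remaining / 8 + 1);

		*start = st->next_block;
		*len = Min(chunk, remaining);
		st->next_block += *len;
		found = true;
	}
	SpinLockRelease(&st->mutex);
	return found;
}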
>
> On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
> >> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > As pointed out above (moved there after reading the patch...) I don't
> > think a chunk size of 1 or any other constant size can make sense. I
> > don't even believe it'll necessarily be constant across an entire query
> > execution (big initially, small at the end). Now, we could move
> > determining that before the query execution into executor
> > initialization, but then we won't yet know how many workers we're going
> > to get. We could add a function setting that at runtime, but that'd mix
> > up responsibilities quite a bit.
>
> I still think this belongs in heapam.c somehow or other. If the logic
> is all in the executor, then it becomes impossible for any code that
> doesn't use the executor to do a parallel heap scan, and that's
> probably bad. It's not hard to imagine something like CLUSTER wanting
> to reuse that code, and that won't be possible if the logic is up in
> some higher layer. If the logic we want is to start with a large
> chunk size and then switch to a small chunk size when there's not much
> of the relation left to scan, there's still no reason that can't be
> encapsulated in heapam.c.
>
> > Btw, using an atomic uint32 you'd end up without the spinlock and just
> > about the same amount of code... Just do a atomic_fetch_add_until32(var,
> > 1, InvalidBlockNumber)... ;)
>
> I thought of that, but I think there's an overflow hazard.
>
> > Where the 'coordinated seqscan' scans a relation so that each tuple
> > eventually gets returned once across all nodes, but the nested loop (and
> > through it the index scan) will just run normally, without any
> > coordination and parallelism. But everything below --- would happen in
> > multiple nodes. If you agree, yes, then we're in violent agreement
> > ;). The "single node that gets copied" bit above makes me a bit unsure
> > whether we are though.
>
> Yeah, I think we're talking about the same thing.
>
> > To me, given the existing executor code, it seems easiest to achieve
> > that by having the ParallelismDrivingNode above having a dynamic number
> > of nestloop children in different backends and point the coordinated
> > seqscan to some shared state. As you point out, the number of these
> > children cannot be certainly known (just targeted for) at plan time;
> > that puts a certain limit on how independent they are. But since a
> > large number of them can be independent between workers it seems awkward
> > to generally treat them as being the same node across workers. But maybe
> > that's just an issue with my mental model.
>
> I think it makes sense to think of a set of tasks in which workers can
> assist. So you have a query tree which is just one query tree, with no
> copies of the nodes, and then there are certain places in that query
> tree where a worker can jump in and assist that node. To do that, it
> will have a copy of the node, but that doesn't mean that all of the
> stuff inside the node becomes shared data at the code level, because
> that would be stupid.
>
>
> On Mon, Feb 9, 2015 at 2:31 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Another idea is to use Executor level interfaces (like ExecutorStart(),
> > ExecutorRun(), ExecutorEnd()) for execution rather than using Portal
> > level interfaces. I have used Portal level interfaces with the
> > thought that we can reuse the existing infrastructure of Portal to
> > make parallel execution of scrollable cursors, but as per my analysis
> > it is not so easy to support them especially backward scan, absolute/
> > relative fetch, etc, so Executor level interfaces seem more appealing
> > to me (something like how Explain statement works (ExplainOnePlan)).
> > Using Executor level interfaces will have advantage that we can reuse them
> > for other parallel functionalities. In this approach, we need to take
> > care of constructing relevant structures (with the information passed by
> > master backend) required for Executor interfaces, but I think these should
> > be lesser than what we need in previous approach (extract seqscan specific
> > stuff from executor).
>
> I think using the executor-level interfaces instead of the
> portal-level interfaces is a good idea. That would possibly let us
> altogether prohibit access to the portal layer from within a parallel
> worker, which seems like it might be a good sanity check to add. But
> that seems to still require us to have a PlannedStmt and a QueryDesc,
> and I'm not sure whether that's going to be too much of a pain. We
> might need to think about an alternative API for starting the Executor
> like ExecutorStartParallel() or ExecutorStartExtended(). But I'm not
> sure. If you can revise things to go through the executor interfaces
> I think that would be a good start, and then perhaps after that we can
> see what else makes sense to do.
>
Okay, I have modified the patch to use the Executor level interfaces.
Attachment
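[For reference, the executor-level call sequence under discussion looks roughly like this inside a worker; constructing the QueryDesc from what the master ships over is elided, and the zero eflags/count values are illustrative:]

#include "postgres.h"
#include "access/sdir.h"
#include "executor/executor.h"

/*
 * Sketch of a worker driving a shipped plan through the executor-level
 * interfaces rather than the portal layer.  Building "queryDesc" from
 * the PlannedStmt the master passed down is assumed to have happened
 * already; error handling is elided.
 */
static void
worker_execute_plan_sketch(QueryDesc *queryDesc)
{
	ExecutorStart(queryDesc, 0);						/* no special eflags */
	ExecutorRun(queryDesc, ForwardScanDirection, 0L);	/* 0 = run to completion */
	ExecutorFinish(queryDesc);
	ExecutorEnd(queryDesc);
	FreeQueryDesc(queryDesc);
}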
On 2015-02-11 15:49:17 -0500, Robert Haas wrote:
> On Tue, Feb 10, 2015 at 3:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > And good chunk sizes et al depend on higher layers,
> >> > selectivity estimates and such. And that's planner/executor work, not
> >> > the physical layer (which heapam.c pretty much is).
> >>
> >> If it's true that a good chunk size depends on the higher layers, then
> >> that would be a good argument for doing this differently, or at least
> >> exposing an API for the higher layers to tell heapam.c what chunk size
> >> they want. I hadn't considered that possibility - can you elaborate
> >> on why you think we might want to vary the chunk size?
> >
> > Because things like chunk size depend on the shape of the entire
> > plan. If you have a 1TB table and want to sequentially scan it in
> > parallel with 10 workers you better use some rather large chunks. That
> > way readahead will be efficient in a cpu/socket local manner,
> > i.e. directly reading in the pages into the directly connected memory of
> > that cpu. Important for performance on a NUMA system, otherwise you'll
> > constantly have everything go over the shared bus. But if you instead
> > have a plan where the sequential scan goes over a 1GB table, perhaps
> > with some relatively expensive filters, you'll really want a small
> > chunk size to avoid waiting.
>
> I see. That makes sense.
>
> > The chunk size will also really depend on
> > what other nodes are doing, at least if they can run in the same worker.
>
> Example?

A query whose runtime is dominated by a sequential scan (+ attached
filter) is certainly going to require a bigger prefetch size than one
that does other expensive stuff.

Imagine parallelizing
SELECT * FROM largetable WHERE col = low_cardinality_value;
and
SELECT *
FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
WHERE col = high_cardinality_value;

The first query will be a simple sequential scan, and disk reads on
largetable will be the major cost of executing it. In contrast the
second query might very well sensibly be planned as a parallel
sequential scan with the nested loop executing in the same worker. But
the cost of the sequential scan itself will likely be completely drowned
out by the nestloop execution - index probes are expensive/unpredictable.

My guess is that the batch size will have to be computed based on the
fraction of the cost of the parallelized work it accounts for.

> > Even without things like NUMA and readahead I'm pretty sure that you'll
> > want a chunk size a good bit above one page. The locks we acquire for
> > the buffercache lookup and for reading the page are already quite bad
> > for performance/scalability; even if we don't always/often hit the same
> > lock. Making 20 processes that scan pages in parallel acquire yet
> > another lock (that's shared between all of them!) for every single page
> > won't be fun, especially with no or only fast filters.
>
> This is possible, but I'm skeptical. If the amount of other work we
> have to do for that page is so little that the additional spinlock cycle
> per page causes meaningful contention, I doubt we should be
> parallelizing in the first place.

It's easy to see contention on buffer mapping (many workloads), buffer
content and buffer header (especially btree roots and small foreign key
target tables) locks. And for most of them we already avoid acquiring
the same spinlock in all backends.
Right now to process a page in a sequential scan we acquire a
nonblocking buffer mapping lock (which doesn't use a spinlock anymore
*because* it proved to be a bottleneck), a nonblocking content lock and
the buffer header spinlock. All of those are essentially partitioned -
another spinlock shared between all workers will show up.

> > As pointed out above (moved there after reading the patch...) I don't
> > think a chunk size of 1 or any other constant size can make sense. I
> > don't even believe it'll necessarily be constant across an entire query
> > execution (big initially, small at the end). Now, we could move
> > determining that before the query execution into executor
> > initialization, but then we won't yet know how many workers we're going
> > to get. We could add a function setting that at runtime, but that'd mix
> > up responsibilities quite a bit.
>
> I still think this belongs in heapam.c somehow or other. If the logic
> is all in the executor, then it becomes impossible for any code that
> doesn't use the executor to do a parallel heap scan, and that's
> probably bad. It's not hard to imagine something like CLUSTER wanting
> to reuse that code, and that won't be possible if the logic is up in
> some higher layer.

Yea.

> If the logic we want is to start with a large chunk size and then
> switch to a small chunk size when there's not much of the relation
> left to scan, there's still no reason that can't be encapsulated in
> heapam.c.

I don't mind having some logic in there, but I think you put in too
much. The snapshot stuff should imo go, and the next page logic should
be caller provided.

> > Btw, using an atomic uint32 you'd end up without the spinlock and just
> > about the same amount of code... Just do a atomic_fetch_add_until32(var,
> > 1, InvalidBlockNumber)... ;)
>
> I thought of that, but I think there's an overflow hazard.

That's why I said atomic_fetch_add_until32 - which can't overflow ;). I
now remember that that was actually pulled on Heikki's request from the
committed patch until a user shows up, but we can easily add it
back. compare/exchange makes such things simple luckily.

> > To me, given the existing executor code, it seems easiest to achieve
> > that by having the ParallelismDrivingNode above having a dynamic number
> > of nestloop children in different backends and point the coordinated
> > seqscan to some shared state. As you point out, the number of these
> > children cannot be certainly known (just targeted for) at plan time;
> > that puts a certain limit on how independent they are. But since a
> > large number of them can be independent between workers it seems awkward
> > to generally treat them as being the same node across workers. But maybe
> > that's just an issue with my mental model.
>
> I think it makes sense to think of a set of tasks in which workers can
> assist. So you have a query tree which is just one query tree, with no
> copies of the nodes, and then there are certain places in that query
> tree where a worker can jump in and assist that node. To do that, it
> will have a copy of the node, but that doesn't mean that all of the
> stuff inside the node becomes shared data at the code level, because
> that would be stupid.

My only "problem" with that description is that I think workers will
have to work on more than one node - it'll be entire subtrees of the
executor tree.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
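[Since compare-and-exchange is mentioned: a minimal sketch of the add-until primitive described above, built on the existing pg_atomic_compare_exchange_u32(); atomic_fetch_add_until32 itself is, per the message, not in the committed atomics code:]

#include "postgres.h"
#include "port/atomics.h"

/*
 * Sketch of atomic_fetch_add_until32: atomically add "add" to *var,
 * saturating at "until" (e.g. InvalidBlockNumber), and return the
 * previously observed value.  Saturation is what removes the overflow
 * hazard of a plain fetch-add.
 */
static inline uint32
atomic_fetch_add_until32(pg_atomic_uint32 *var, uint32 add, uint32 until)
{
	uint32		oldval = pg_atomic_read_u32(var);

	while (oldval < until)
	{
		/* overflow-safe: only add if the result stays at or below "until" */
		uint32		newval = (until - oldval > add) ? oldval + add : until;

		/* on failure, oldval is refreshed with the current value */
		if (pg_atomic_compare_exchange_u32(var, &oldval, newval))
			break;
	}
	return oldval;
}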
>
> On 2015-02-11 15:49:17 -0500, Robert Haas wrote:
>
> A query whose runtime is dominated by a sequential scan (+ attached
> filter) is certainly going to require a bigger prefetch size than one
> that does other expensive stuff.
>
> Imagine parallelizing
> SELECT * FROM largetable WHERE col = low_cardinality_value;
> and
> SELECT *
> FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
> WHERE col = high_cardinality_value;
>
> The first query will be a simple sequential scan, and disk reads on largetable
> will be the major cost of executing it. In contrast the second query
> might very well sensibly be planned as a parallel sequential scan with
> the nested loop executing in the same worker. But the cost of the
> sequential scan itself will likely be completely drowned out by the
> nestloop execution - index probes are expensive/unpredictable.
>
> > I think it makes sense to think of a set of tasks in which workers can
> > assist. So you have a query tree which is just one query tree, with no
> > copies of the nodes, and then there are certain places in that query
> > tree where a worker can jump in and assist that node. To do that, it
> > will have a copy of the node, but that doesn't mean that all of the
> > stuff inside the node becomes shared data at the code level, because
> > that would be stupid.
>
> My only "problem" with that description is that I think workers will
> have to work on more than one node - it'll be entire subtrees of the
> executor tree.
>
On 2015-02-18 16:59:26 +0530, Amit Kapila wrote:
> On Tue, Feb 17, 2015 at 9:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > A query whose runtime is dominated by a sequential scan (+ attached
> > filter) is certainly going to require a bigger prefetch size than one
> > that does other expensive stuff.
> >
> > Imagine parallelizing
> > SELECT * FROM largetable WHERE col = low_cardinality_value;
> > and
> > SELECT *
> > FROM largetable JOIN gigantic_table ON (index_nestloop_condition)
> > WHERE col = high_cardinality_value;
> >
> > The first query will be a simple sequential scan, and disk reads on
> > largetable will be the major cost of executing it. In contrast the
> > second query might very well sensibly be planned as a parallel
> > sequential scan with the nested loop executing in the same worker. But
> > the cost of the sequential scan itself will likely be completely
> > drowned out by the nestloop execution - index probes are
> > expensive/unpredictable.
>
> I think the work/task given to each worker should be as granular
> as possible to make it more predictable.
>
> I think the better way to parallelize such a work (Join query) is that
> first worker does sequential scan and filtering on large table and
> then pass it to next worker for doing join with gigantic_table.

I'm pretty sure that'll result in rather horrible performance. IPC is
rather expensive, you want to do as little of it as possible.

> > > I think it makes sense to think of a set of tasks in which workers can
> > > assist. So you have a query tree which is just one query tree, with no
> > > copies of the nodes, and then there are certain places in that query
> > > tree where a worker can jump in and assist that node. To do that, it
> > > will have a copy of the node, but that doesn't mean that all of the
> > > stuff inside the node becomes shared data at the code level, because
> > > that would be stupid.
> >
> > My only "problem" with that description is that I think workers will
> > have to work on more than one node - it'll be entire subtrees of the
> > executor tree.
>
> There could be some cases where it could be beneficial for worker
> to process a sub-tree, but I think there will be more cases where
> it will just work on a part of node and send the result back to either
> master backend or another worker for further processing.

I think many parallelism projects start out that way, and then notice
that it doesn't parallelize very efficiently.

The most extreme example, but common, is aggregation over large amounts
of data - unless you want to ship huge amounts of data between processes
to parallelize it, you have to do the sequential scan and the
pre-aggregate step (that e.g. selects count() and sum() to implement an
avg over all the workers) inside one worker.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
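[As a concrete illustration of that pre-aggregate/combine split - a sketch only, not the aggregate API of any patch in this thread - each worker keeps a local (count, sum) transition state, and the master combines them and computes the final avg:]

#include <stdint.h>

/* Per-worker transition state for a two-stage avg; purely illustrative. */
typedef struct AvgState
{
	int64_t		count;
	double		sum;
} AvgState;

/* Worker side: accumulate one input value into the local state. */
static void
avg_accum(AvgState *state, double value)
{
	state->count++;
	state->sum += value;
}

/* Master side: merge one worker's partial state into the final state. */
static void
avg_combine(AvgState *master, const AvgState *worker)
{
	master->count += worker->count;
	master->sum += worker->sum;
}

/* Master side: finish the aggregate once all partial states are merged. */
static double
avg_final(const AvgState *state)
{
	return (state->count > 0) ? state->sum / state->count : 0.0;
}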
> On 2015-02-18 16:59:26 +0530, Amit Kapila wrote:
>
> > There could be some cases where it could be beneficial for worker
> > to process a sub-tree, but I think there will be more cases where
> > it will just work on a part of node and send the result back to either
> > master backend or another worker for further processing.
>
> I think many parallelism projects start out that way, and then notice
> that it doesn't parallelize very efficiently.
>
> The most extreme example, but common, is aggregation over large amounts
> of data - unless you want to ship huge amounts of data between processes
> to parallelize it you have to do the sequential scan and the
> pre-aggregate step (that e.g. selects count() and sum() to implement an
> avg over all the workers) inside one worker.
>
OTOH if someone wants to parallelize scan (including expensive qual) and
sort then it will be better to perform scan (or part of scan by one worker)
and sort by other worker.
On Sat, Feb 21, 2015 at 12:57 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 18, 2015 at 6:44 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2015-02-18 16:59:26 +0530, Amit Kapila wrote:
>>
>> > There could be some cases where it could be beneficial for worker
>> > to process a sub-tree, but I think there will be more cases where
>> > it will just work on a part of node and send the result back to either
>> > master backend or another worker for further processing.
>>
>> I think many parallelism projects start out that way, and then notice
>> that it doesn't parallelize very efficiently.
>>
>> The most extreme example, but common, is aggregation over large amounts
>> of data - unless you want to ship huge amounts of data between processes
>> to parallelize it, you have to do the sequential scan and the
>> pre-aggregate step (that e.g. selects count() and sum() to implement an
>> avg over all the workers) inside one worker.
>
> OTOH if someone wants to parallelize scan (including expensive qual) and
> sort then it will be better to perform scan (or part of scan by one worker)
> and sort by other worker.

There is a performance problem if we perform the SCAN in one worker and
the SORT operation in another worker, because the tuples need to be
transferred twice, from worker to worker/backend. This is a costly
operation. It is better to combine the SCAN and SORT operations into a
single worker's job. This can be targeted once the parallel scan code is
stable.

Regards,
Hari Babu
Fujitsu Australia
On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I still think this belongs in heapam.c somehow or other. If the logic
>> is all in the executor, then it becomes impossible for any code that
>> doesn't use the executor to do a parallel heap scan, and that's
>> probably bad. It's not hard to imagine something like CLUSTER wanting
>> to reuse that code, and that won't be possible if the logic is up in
>> some higher layer.
>
> Yea.
>
>> If the logic we want is to start with a large chunk size and then
>> switch to a small chunk size when there's not much of the relation
>> left to scan, there's still no reason that can't be encapsulated in
>> heapam.c.
>
> I don't mind having some logic in there, but I think you put in too
> much. The snapshot stuff should imo go, and the next page logic should
> be caller provided.

If we need to provide a way for the caller to provide the next-page
logic, then I think that should be done via configuration arguments or
flags, not a callback. There's just no way that the needs of the
executor are going to be so radically different from a utility command
that only a callback will do.

>> I think it makes sense to think of a set of tasks in which workers can
>> assist. So you have a query tree which is just one query tree, with no
>> copies of the nodes, and then there are certain places in that query
>> tree where a worker can jump in and assist that node. To do that, it
>> will have a copy of the node, but that doesn't mean that all of the
>> stuff inside the node becomes shared data at the code level, because
>> that would be stupid.
>
> My only "problem" with that description is that I think workers will
> have to work on more than one node - it'll be entire subtrees of the
> executor tree.

Amit and I had a long discussion about this on Friday while in Boston
together. I previously argued that the master and the slave should be
executing the same node, ParallelSeqScan. However, Amit argued
persuasively that what the master is doing is really pretty different
from what the worker is doing, and that they really ought to be running
two different nodes. This led us to cast about for a better design, and
we came up with something that I think will be much better.

The basic idea is to introduce a new node called Funnel. A Funnel node
will have a left child but no right child, and its job will be to fire
up a given number of workers. Each worker will execute the plan which
is the left child of the funnel. The funnel node itself will pull
tuples from all of those workers, and can also (if there are no tuples
available from any worker) execute the plan itself. So a parallel
sequential scan will look something like this:

Funnel
  Workers: 4
  -> Partial Heap Scan on xyz

What this is saying is that each worker is going to scan part of the
heap for xyz; together, they will scan the whole thing.

The neat thing about this way of separating things out is that we can
eventually write code to push more stuff down into the funnel. For
example, consider this:

Nested Loop
-> Seq Scan on foo
-> Index Scan on bar
     Index Cond: bar.x = foo.x

Now, if a parallel sequential scan is cheaper than a regular sequential
scan, we can instead do this:

Nested Loop
-> Funnel
     -> Partial Heap Scan on foo
-> Index Scan on bar
     Index Cond: bar.x = foo.x

The problem with this is that the nested loop/index scan is happening
entirely in the master.
But we can have logic that fixes that by knowing that a nested loop can
be pushed through a funnel, yielding this:

Funnel
-> Nested Loop
     -> Partial Heap Scan on foo
     -> Index Scan on bar
          Index Cond: bar.x = foo.x

Now that's pretty neat. One can also imagine doing this with
aggregates. Consider:

HashAggregate
-> Funnel
     -> Partial Heap Scan on foo
          Filter: x = 1

Here, we can't just push the HashAggregate through the Funnel, but
given infrastructure for that we could convert it to something like
this:

HashAggregateFinish
-> Funnel
     -> HashAggregatePartial
          -> Partial Heap Scan on foo
               Filter: x = 1

That'd be swell.

You can see that something like this will also work for breaking off an
entire plan tree and shoving it down into a worker. The master can't
participate in the computation in that case, but it's otherwise the
same idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
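[A hedged sketch of the Funnel node's tuple-pulling loop under this design; shm_mq_receive() and ExecStoreTuple() are existing APIs, while the state layout, the helper, and the (elided) blocking/worker-exit handling are illustrative assumptions:]

#include "postgres.h"
#include "access/htup_details.h"
#include "executor/executor.h"
#include "storage/shm_mq.h"

/* Illustrative state layout; not the patch's actual FunnelState. */
typedef struct FunnelStateSketch
{
	PlanState	ps;				/* left child is outerPlanState(&ps) */
	int			nworkers;		/* workers actually launched */
	shm_mq_handle **queues;		/* one tuple queue per worker */
	bool		local_done;		/* master's copy of the plan exhausted? */
} FunnelStateSketch;

/* Copy a queued tuple into backend-local memory; the slot frees it. */
static TupleTableSlot *
funnel_store_tuple(FunnelStateSketch *node, void *data, Size nbytes)
{
	HeapTupleData htup;

	ItemPointerSetInvalid(&htup.t_self);
	htup.t_tableOid = InvalidOid;
	htup.t_len = nbytes;
	htup.t_data = (HeapTupleHeader) data;
	return ExecStoreTuple(heap_copytuple(&htup),
						  node->ps.ps_ResultTupleSlot,
						  InvalidBuffer, true);
}

/*
 * Return the next tuple from any worker's queue; when none is ready,
 * help out by running the copy of the left child plan in the master.
 */
static TupleTableSlot *
funnel_next_sketch(FunnelStateSketch *node)
{
	int			i;

	/* Poll each worker queue without blocking. */
	for (i = 0; i < node->nworkers; i++)
	{
		Size		nbytes;
		void	   *data;

		if (shm_mq_receive(node->queues[i], &nbytes, &data, true) ==
			SHM_MQ_SUCCESS)
			return funnel_store_tuple(node, data, nbytes);
	}

	/* Nothing ready: pull from our own copy of the child plan. */
	if (!node->local_done)
	{
		TupleTableSlot *slot = ExecProcNode(outerPlanState(node));

		if (!TupIsNull(slot))
			return slot;
		node->local_done = true;
	}

	/* Real code must block here until a worker sends or exits. */
	return NULL;
}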
> Amit and I had a long discussion about this on Friday while in Boston
> together. I previously argued that the master and the slave should be
> executing the same node, ParallelSeqScan. However, Amit argued
> persuasively that what the master is doing is really pretty different
> from what the worker is doing, and that they really ought to be
> running two different nodes. This led us to cast about for a better
> design, and we came up with something that I think will be much
> better.
>
> The basic idea is to introduce a new node called Funnel. A Funnel
> node will have a left child but no right child, and its job will be to
> fire up a given number of workers. Each worker will execute the plan
> which is the left child of the funnel. The funnel node itself will
> pull tuples from all of those workers, and can also (if there are no
> tuples available from any worker) execute the plan itself. So a
> parallel sequential scan will look something like this:
>
> Funnel
>   Workers: 4
>   -> Partial Heap Scan on xyz
>
> What this is saying is that each worker is going to scan part of the
> heap for xyz; together, they will scan the whole thing.
>
What is the best way to determine the number of workers? A fixed number
is one idea. It may also make sense to add a new common field to the
Path node that describes how much of the node's execution can be
parallelized, or is unavailable to run in parallel. Rather than at plan
time, we may be able to determine the number according to the number of
concurrent workers and the number of CPU cores.

> The neat thing about this way of separating things out is that we can
> eventually write code to push more stuff down into the funnel. For
> example, consider this:
>
> Nested Loop
> -> Seq Scan on foo
> -> Index Scan on bar
>      Index Cond: bar.x = foo.x
>
> Now, if a parallel sequential scan is cheaper than a regular
> sequential scan, we can instead do this:
>
> Nested Loop
> -> Funnel
>      -> Partial Heap Scan on foo
> -> Index Scan on bar
>      Index Cond: bar.x = foo.x
>
> The problem with this is that the nested loop/index scan is happening
> entirely in the master. But we can have logic that fixes that by
> knowing that a nested loop can be pushed through a funnel, yielding
> this:
>
> Funnel
> -> Nested Loop
>      -> Partial Heap Scan on foo
>      -> Index Scan on bar
>           Index Cond: bar.x = foo.x
>
> Now that's pretty neat. One can also imagine doing this with
> aggregates. Consider:
>
I guess the planner enhancement will live around add_paths_to_joinrel().
When any of the underlying join paths support multi-node execution, the
new code will add a Funnel node on top of these join paths. Just my
thought.

> HashAggregate
> -> Funnel
>      -> Partial Heap Scan on foo
>           Filter: x = 1
>
> Here, we can't just push the HashAggregate through the Funnel, but
> given infrastructure for that we could convert it to something like
> this:
>
> HashAggregateFinish
> -> Funnel
>      -> HashAggregatePartial
>           -> Partial Heap Scan on foo
>                Filter: x = 1
>
> That'd be swell.
>
> You can see that something like this will also work for breaking off
> an entire plan tree and shoving it down into a worker. The master
> can't participate in the computation in that case, but it's otherwise
> the same idea.
>
I believe the entire vision we've discussed in the combining-aggregate-
functions thread is the above, although people have primarily considered
applying this feature to aggregate push-down across joins. One key piece
of infrastructure may be the capability to define the combining function
of an aggregate.
It informs the planner that the given aggregate supports two-stage
execution. In addition to this, we may need a planner enhancement to
inject the partial aggregate node during path construction. Probably, we
have to set a flag to inform the later stage (that will construct the
Agg plan) that the underlying scan/join node performs partial
aggregation, so the final aggregation has to expect transition-state
data instead of the usual row-by-row arguments.

Also, I think a HashJoin with a very large outer relation but a much
smaller, unbalanced inner one is a good candidate to distribute across
multiple nodes. Even if the multi-node HashJoin has to read the small
inner relation N times, the separation of the very large outer relation
will make a gain.

Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
>
> On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > My only "problem" with that description is that I think workers will
> > have to work on more than one node - it'll be entire subtrees of the
> > executor tree.
>
> Amit and I had a long discussion about this on Friday while in Boston
> together. I previously argued that the master and the slave should be
> executing the same node, ParallelSeqScan. However, Amit argued
> persuasively that what the master is doing is really pretty different
> from what the worker is doing, and that they really ought to be
> running two different nodes. This led us to cast about for a better
> design, and we came up with something that I think will be much
> better.
>
> The basic idea is to introduce a new node called Funnel. A Funnel
> node will have a left child but no right child, and its job will be to
> fire up a given number of workers. Each worker will execute the plan
> which is the left child of the funnel. The funnel node itself will
> pull tuples from all of those workers, and can also (if there are no
> tuples available from any worker) execute the plan itself.
Attachment
>
> On Sun, Feb 22, 2015 at 6:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Feb 17, 2015 at 11:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > My only "problem" with that description is that I think workers will
> > > have to work on more than one node - it'll be entire subtrees of the
> > > executor tree.
> >
> > Amit and I had a long discussion about this on Friday while in Boston
> > together. I previously argued that the master and the slave should be
> > executing the same node, ParallelSeqScan. However, Amit argued
> > persuasively that what the master is doing is really pretty different
> > from what the worker is doing, and that they really ought to be
> > running two different nodes. This led us to cast about for a better
> > design, and we came up with something that I think will be much
> > better.
> >
> > The basic idea is to introduce a new node called Funnel. A Funnel
> > node will have a left child but no right child, and its job will be to
> > fire up a given number of workers. Each worker will execute the plan
> > which is the left child of the funnel. The funnel node itself will
> > pull tuples from all of those workers, and can also (if there are no
> > tuples available from any worker) execute the plan itself.
>
> I have modified the patch to introduce a Funnel node (and left child
> as PartialSeqScan node). Apart from that, some other noticeable
> changes based on feedback include:
> a) Master backend forms and sends the planned stmt to each worker;
> the earlier patch used to send individual elements and form the planned
> stmt in each worker.
> b) Passed tuples directly via tuple queue instead of going via
> FE-BE protocol.
> c) Removed restriction of expressions in target list.
> d) Introduced a parallelmodeneeded flag in plannerglobal structure
> and set it for Funnel plan.
>
> There is still some work left like integrating with
> access-parallel-safety patch (use parallelmodeok flag to decide
> whether parallel path can be generated, Enter/Exit parallel mode is still
> done during execution of funnel node).
>
> I think these are minor points which can be fixed once we decide
> on the other major parts of patch. Find modified patch attached with
> this mail.
>
> Note -
> This patch is based on Head (commit-id: d1479011) +
> parallel-mode-v6.patch [1] + parallel-heap-scan.patch[2]
>
> [1]
> http://www.postgresql.org/message-id/CA+TgmobCMwFOz-9=hFv=hJ4SH7p=5X6Ga5V=WtT8=huzE6C+Mg@mail.gmail.com
> [2]
> http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
>
Attachment
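[Point (b) above - shipping tuples through a tuple queue rather than the FE-BE protocol - boils down to something like this hedged sketch on the worker side; shm_mq_send() is the existing shared-memory queue API, and the error handling here is illustrative:]

#include "postgres.h"
#include "access/htup.h"
#include "storage/shm_mq.h"

/*
 * Sketch of a worker shipping one tuple to the master backend via a
 * shm_mq tuple queue instead of the FE-BE protocol.  Queue detach and
 * retry handling are elided.
 */
static void
worker_send_tuple_sketch(shm_mq_handle *mqh, HeapTuple tuple)
{
	shm_mq_result result;

	result = shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);
	if (result != SHM_MQ_SUCCESS)
		elog(ERROR, "could not send tuple to master backend");
}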
On Tue, Mar 10, 2015 at 1:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Assuming previous patch is in right direction, I have enabled
> join support for the patch and done some minor cleanup of
> patch which leads to attached new version.

Does this patch handle the cases where the re-scan starts without
finishing the earlier scan?

Regards,
Hari Babu
Fujitsu Australia
>
> On Tue, Mar 10, 2015 at 1:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Assuming previous patch is in right direction, I have enabled
> > join support for the patch and done some minor cleanup of
> > patch which leads to attached new version.
>
> Does this patch handle the cases where the re-scan starts without
> finishing the earlier scan?
>
Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
where we scan the next outer tuple and rescan inner table without
completing the previous scan of inner table?
On Tue, Mar 10, 2015 at 3:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
>> On Tue, Mar 10, 2015 at 1:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > Assuming previous patch is in right direction, I have enabled
>> > join support for the patch and done some minor cleanup of
>> > patch which leads to attached new version.
>>
>> Does this patch handle the cases where the re-scan starts without
>> finishing the earlier scan?
>>
>
> Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
> where we scan the next outer tuple and rescan inner table without
> completing the previous scan of inner table?

Yes.

> I have currently modelled it based on existing rescan for seqscan
> (ExecReScanSeqScan()) which means it will begin the scan again.
> Basically if the workers are already started/initialized by previous
> scan, then re-initialize them (refer function ExecReScanFunnel() in
> patch).
>
> Can you elaborate more if you think current handling is not sufficient
> for any case?

From the ExecReScanFunnel function it seems that the re-scan waits till
all the workers have finished before starting the next scan. Will the
workers stop their current ongoing task? Otherwise this may decrease
performance instead of improving it, I feel.

I am not sure whether it is already handled or not: what happens when a
worker is waiting to pass the results while the backend is trying to
start the re-scan?

Regards,
Hari Babu
Fujitsu Australia
>
> On Tue, Mar 10, 2015 at 3:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I have currently modelled it based on existing rescan for seqscan
> > (ExecReScanSeqScan()) which means it will begin the scan again.
> > Basically if the workers are already started/initialized by previous
> > scan, then re-initialize them (refer function ExecReScanFunnel() in
> > patch).
> >
> > Can you elaborate more if you think current handling is not sufficient
> > for any case?
>
> From the ExecReScanFunnel function it seems that the re-scan waits till
> all the workers have finished before starting the next scan. Will the
> workers stop their current ongoing task? Otherwise this may decrease
> performance instead of improving it, I feel.
>
as DestroyParallelContext() will automatically terminate all the workers.
> to pass the results while the backend is trying to start the re-scan?
>
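[A sketch of the rescan shape being discussed - DestroyParallelContext() is the parallel-mode API mentioned above, while the state fields and reset step are illustrative assumptions:]

#include "postgres.h"
#include "access/parallel.h"

/* Illustrative subset of the Funnel node's state. */
typedef struct FunnelRescanSketch
{
	ParallelContext *pcxt;		/* NULL if no workers are running */
} FunnelRescanSketch;

/*
 * Sketch of a Funnel rescan: if workers from the previous scan are
 * still attached, tear down the parallel context - which terminates
 * them - and reset the shared scan position so the next pull starts
 * a fresh scan.
 */
static void
funnel_rescan_sketch(FunnelRescanSketch *node)
{
	if (node->pcxt != NULL)
	{
		DestroyParallelContext(node->pcxt);	/* terminates the workers */
		node->pcxt = NULL;
	}
	/* reset of the shared next-block state would go here */
}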
On Tue, Mar 3, 2015 at 7:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have modified the patch to introduce a Funnel node (and left child
> as PartialSeqScan node). Apart from that, some other noticeable
> changes based on feedback include:
> a) Master backend forms and sends the planned stmt to each worker;
> the earlier patch used to send individual elements and form the planned
> stmt in each worker.
> b) Passed tuples directly via tuple queue instead of going via
> FE-BE protocol.
> c) Removed restriction of expressions in target list.
> d) Introduced a parallelmodeneeded flag in plannerglobal structure
> and set it for Funnel plan.
>
> There is still some work left like integrating with
> access-parallel-safety patch (use parallelmodeok flag to decide
> whether parallel path can be generated, Enter/Exit parallel mode is still
> done during execution of funnel node).
>
> I think these are minor points which can be fixed once we decide
> on the other major parts of patch. Find modified patch attached with
> this mail.

This is definitely progress. I do think you need to integrate it with
the access-parallel-safety patch. Other comments:

- There's not much code left in shmmqam.c. I think that the remaining
logic should be integrated directly into nodeFunnel.c, with the two
bools in worker_result_state becoming part of the FunnelState. It
doesn't make sense to have a separate structure for two booleans and 20
lines of code. If you were going to keep this file around, I'd complain
about its name and its location in the source tree, too, but as it is I
think we can just get rid of it altogether.

- Something is deeply wrong with the separation of concerns between
nodeFunnel.c and nodePartialSeqscan.c. nodeFunnel.c should work
correctly with *any arbitrary plan tree* as its left child, and that is
clearly not the case right now. shm_getnext() can't just do
heap_getnext(). Instead, it's got to call ExecProcNode() on its left
child and let the left child decide what to do about that. The logic in
InitFunnelRelation() belongs in the parallel seq scan node, not the
funnel. ExecReScanFunnel() cannot be calling heap_parallel_rescan(); it
needs to *not know* that there is a parallel scan under it. The comment
in FunnelRecheck is a copy-and-paste from elsewhere that is not
applicable to a generic funnel node.

- The comment in execAmi.c says "Backward scan is not suppotted for
parallel sequiantel scan". "Sequential" is mis-spelled here, but I think
you should just nuke the whole comment. The funnel node is not, in the
long run, just for parallel sequential scan, so putting that comment
above it is not right. If you want to keep the comment, it's got to be
more general than that somehow, like "parallel nodes do not support
backward scans", but I'd just drop it.

- Can we rename create_worker_scan_plannedstmt to
create_parallel_worker_plannedstmt?

- I *strongly* suggest that, for the first version of this, we remove
all of the tts_fromheap stuff. Let's make no special provision for
returning a tuple stored in a tuple queue; instead, just copy it and
store it in the slot as a pfree-able tuple. That may be slightly less
efficient, but I think it's totally worth it to avoid the complexity of
tinkering with the slot mechanism.

- InstrAggNode claims that we only need the master's information for
statistics other than buffer usage and tuple counts, but is that really
true?
The parallel backends can be working on the parallel part of the plan
while the master is doing something else, so the amount of time the
*master* spent in a particular node may not be that relevant. We might
need to think carefully about what it makes sense to display in the
EXPLAIN output in parallel cases.

- The header comment on nodeFunnel.h capitalizes the filename
incorrectly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 11, 2015 at 6:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 3, 2015 at 7:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I have modified the patch to introduce a Funnel node (and left child
>> as PartialSeqScan node). Apart from that, some other noticeable
>> changes based on feedback include:
>> a) Master backend forms and sends the planned stmt to each worker;
>> the earlier patch used to send individual elements and form the planned
>> stmt in each worker.
>> b) Passed tuples directly via tuple queue instead of going via
>> FE-BE protocol.
>> c) Removed restriction of expressions in target list.
>> d) Introduced a parallelmodeneeded flag in plannerglobal structure
>> and set it for Funnel plan.
>>
>> There is still some work left like integrating with
>> access-parallel-safety patch (use parallelmodeok flag to decide
>> whether parallel path can be generated, Enter/Exit parallel mode is still
>> done during execution of funnel node).
>>
>> I think these are minor points which can be fixed once we decide
>> on the other major parts of patch. Find modified patch attached with
>> this mail.
>
> - Something is deeply wrong with the separation of concerns between
> nodeFunnel.c and nodePartialSeqscan.c. nodeFunnel.c should work
> correctly with *any arbitrary plan tree* as its left child, and that
> is clearly not the case right now. shm_getnext() can't just do
> heap_getnext(). Instead, it's got to call ExecProcNode() on its left
> child and let the left child decide what to do about that. The logic
> in InitFunnelRelation() belongs in the parallel seq scan node, not the
> funnel. ExecReScanFunnel() cannot be calling heap_parallel_rescan();
> it needs to *not know* that there is a parallel scan under it. The
> comment in FunnelRecheck is a copy-and-paste from elsewhere that is
> not applicable to a generic funnel node.

In the create_parallelscan_paths() function, the funnel path is added
once the partial seq scan path is generated. I feel the funnel path
could instead be added once, on top of the largest possible parallel
portion of the entire query path. Is this the right patch in which to
add such support as well?

Regards,
Hari Babu
Fujitsu Australia
On 10-03-2015 PM 01:09, Amit Kapila wrote:
> On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
>> Does this patch handle the cases where the re-scan starts without
>> finishing the earlier scan?
>>
>
> Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
> where we scan the next outer tuple and rescan inner table without
> completing the previous scan of inner table?
>
> I have currently modelled it based on existing rescan for seqscan
> (ExecReScanSeqScan()) which means it will begin the scan again.
> Basically if the workers are already started/initialized by previous
> scan, then re-initialize them (refer function ExecReScanFunnel() in
> patch).
>

From Robert's description[1], it looked like the NestLoop with Funnel
would have the Funnel as either the outer plan or the topmost plan node,
or NOT a parameterised plan. In that case, would this case arise or am I
missing something?

Thanks,
Amit

[1] http://www.postgresql.org/message-id/CA+TgmobM7X6jgre442638b+33h1EWa=vcZqnsvzEdX057ZHVuw@mail.gmail.com
>
> On Tue, Mar 3, 2015 at 7:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > There is still some work left like integrating with
> > access-parallel-safety patch (use parallelmodeok flag to decide
> > whether parallel path can be generated, Enter/Exit parallel mode is still
> > done during execution of funnel node).
> >
> > I think these are minor points which can be fixed once we decide
> > on the other major parts of patch. Find modified patch attached with
> > this mail.
>
> This is definitely progress. I do think you need to integrate it with
> the access-parallel-safety patch.
>
> - There's not much code left in shmmqam.c. I think that the remaining
> logic should be integrated directly into nodeFunnel.c, with the two
> bools in worker_result_state becoming part of the FunnelState. It
> doesn't make sense to have a separate structure for two booleans and
> 20 lines of code. If you were going to keep this file around, I'd
> complain about its name and its location in the source tree, too, but
> as it is I think we can just get rid of it altogether.
>
> - Something is deeply wrong with the separation of concerns between
> nodeFunnel.c and nodePartialSeqscan.c. nodeFunnel.c should work
> correctly with *any arbitrary plan tree* as its left child, and that
> is clearly not the case right now. shm_getnext() can't just do
> heap_getnext(). Instead, it's got to call ExecProcNode() on its left
> child and let the left child decide what to do about that. The logic
> in InitFunnelRelation() belongs in the parallel seq scan node, not the
> funnel. ExecReScanFunnel() cannot be calling heap_parallel_rescan();
> it needs to *not know* that there is a parallel scan under it. The
> comment in FunnelRecheck is a copy-and-paste from elsewhere that is
> not applicable to a generic funnel mode.
>
> - The comment in execAmi.c says "Backward scan is not
> suppotted for parallel sequiantel scan". "Sequential" is mis-spelled
> here, but I think you should just nuke the whole comment. The funnel
> node is not, in the long run, just for parallel sequential scan, so
> putting that comment above it is not right. If you want to keep the
> comment, it's got to be more general than that somehow, like "parallel
> nodes do not support backward scans", but I'd just drop it.
>
> - Can we rename create_worker_scan_plannedstmt to
> create_parallel_worker_plannedstmt?
>
> - I *strongly* suggest that, for the first version of this, we remove
> all of the tts_fromheap stuff. Let's make no special provision for
> returning a tuple stored in a tuple queue; instead, just copy it and
> store it in the slot as a pfree-able tuple. That may be slightly less
> efficient, but I think it's totally worth it to avoid the complexity
> of tinkering with the slot mechanism.
>
> - InstrAggNode claims that we only need the master's information for
> statistics other than buffer usage and tuple counts, but is that
> really true? The parallel backends can be working on the parallel
> part of the plan while the master is doing something else, so the
> amount of time the *master* spent in a particular node may not be that
> relevant. We might need to think carefully about what it makes sense
> to display in the EXPLAIN output in parallel cases.
>
>
Attachment
On 12 March 2015 at 14:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
> One additional change (we need to SetLatch() in
> HandleParallelMessageInterrupt) is done to handle the hang issue
> reported on the parallel-mode thread. Without this change it is
> difficult to verify the patch (I will remove this change once a new
> version of the parallel-mode patch containing it is posted).

Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
getting this error when building:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE -c -o brin.o brin.c -MMD -MP -MF .deps/brin.Po
In file included from ../../../../src/include/nodes/execnodes.h:18:0,
                 from ../../../../src/include/access/brin.h:14,
                 from brin.c:18:
../../../../src/include/access/heapam.h:119:34: error: unknown type name ‘ParallelHeapScanDesc’
 extern void heap_parallel_rescan(ParallelHeapScanDesc pscan, HeapScanDesc scan);
                                  ^

Am I missing another patch here?

--
Thom
>
> On 12 March 2015 at 14:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > One additional change (we need to SetLatch() in
> > HandleParallelMessageInterrupt)
> > is done to handle the hang issue reported on parallel-mode thread.
> > Without this change it is difficult to verify the patch (will remove this
> > change
> > once new version of parallel-mode patch containing this change will be
> > posted).
>
> Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
> getting this error when building:
>
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
> -D_GNU_SOURCE -c -o brin.o brin.c -MMD -MP -MF .deps/brin.Po
> In file included from ../../../../src/include/nodes/execnodes.h:18:0,
> from ../../../../src/include/access/brin.h:14,
> from brin.c:18:
> ../../../../src/include/access/heapam.h:119:34: error: unknown type
> name ‘ParallelHeapScanDesc’
> extern void heap_parallel_rescan(ParallelHeapScanDesc pscan,
> HeapScanDesc scan);
> ^
>
> Am I missing another patch here?
Yes, the below parallel-heap-scan patch.
http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com

Please note that parallel_setup_cost and parallel_startup_cost are
still set to zero by default, so you need to set them to higher values
if you don't want the parallel plans once parallel_seqscan_degree
is set. I have yet to come up with default values for them; that needs
some tests.
On 12 March 2015 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 12, 2015 at 8:33 PM, Thom Brown <thom@linux.com> wrote:
>> Applied parallel-mode-v7.patch and parallel_seqscan_v10.patch, but
>> getting this error when building:
>> [snip]
>> Am I missing another patch here?
>
> Yes, the below parallel-heap-scan patch.
> http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
>
> Please note that parallel_setup_cost and parallel_startup_cost are
> still set to zero by default, so you need to set them to higher values
> if you don't want the parallel plans once parallel_seqscan_degree
> is set. I have yet to come up with default values for them; that needs
> some tests.

Thanks. Getting a problem:

createdb pgbench
pgbench -i -s 200 pgbench

CREATE TABLE pgbench_accounts_1 (CHECK (bid = 1)) INHERITS (pgbench_accounts);
...
CREATE TABLE pgbench_accounts_200 (CHECK (bid = 200)) INHERITS (pgbench_accounts);

WITH del AS (DELETE FROM pgbench_accounts WHERE bid = 1 RETURNING *)
  INSERT INTO pgbench_accounts_1 SELECT * FROM del;
...
WITH del AS (DELETE FROM pgbench_accounts WHERE bid = 200 RETURNING *)
  INSERT INTO pgbench_accounts_200 SELECT * FROM del;

VACUUM ANALYSE;

# SELECT name, setting FROM pg_settings WHERE name IN
('parallel_seqscan_degree','max_worker_processes','seq_page_cost');
          name           | setting
-------------------------+---------
 max_worker_processes    | 20
 parallel_seqscan_degree | 8
 seq_page_cost           | 1000
(3 rows)

# EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
ERROR:  too many dynamic shared memory segments

And separately, I've seen this in the logs:

2015-03-12 16:09:30 GMT [7880]: [4-1] user=,db=,client= LOG:  registering background worker "parallel worker for PID 7889"
(the same "registering" line repeats for log entries [5-1] through [11-1])
2015-03-12 16:09:30 GMT [7880]: [12-1] user=,db=,client= LOG:  starting background worker process "parallel worker for PID 7889"
(the same "starting" line repeats for log entries [13-1] through [19-1])
2015-03-12 16:09:30 GMT [7880]: [20-1] user=,db=,client= LOG:  worker process: parallel worker for PID 7889 (PID 7913) exited with exit code 0
2015-03-12 16:09:30 GMT [7880]: [21-1] user=,db=,client= LOG:  unregistering background worker "parallel worker for PID 7889"
(the "exited with exit code 0" / "unregistering" pair repeats, entries [22-1] through [35-1], for worker PIDs 7919, 7916, 7918, 7917, 7914, 7915 and 7912)
2015-03-12 16:09:30 GMT [7880]: [36-1] user=,db=,client= LOG:  server process (PID 7889) was terminated by signal 11: Segmentation fault
2015-03-12 16:09:30 GMT [7880]: [37-1] user=,db=,client= DETAIL:  Failed process was running: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f') AND substring(pg_catalog.quote_ident(c.relname),1,10)='pgbench_br' AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog') UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n WHERE substring(pg_catalog.quote_ident(n.nspname) || '.',1,10)='pgbench_br' AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,10) = substring('pgbench_br',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f') AND substring(pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,10)='pgbench_br' AND substri
2015-03-12 16:09:30 GMT [7880]: [38-1] user=,db=,client= LOG:  terminating any other active server processes
2015-03-12 16:09:30 GMT [7886]: [2-1] user=,db=,client= WARNING:  terminating connection because of crash of another server process
2015-03-12 16:09:30 GMT [7886]: [3-1] user=,db=,client= DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-03-12 16:09:30 GMT [7886]: [4-1] user=,db=,client= HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2015-03-12 16:09:30 GMT [7880]: [39-1] user=,db=,client= LOG:  all server processes terminated; reinitializing
2015-03-12 16:09:30 GMT [7920]: [1-1] user=,db=,client= LOG:  database system was interrupted; last known up at 2015-03-12 16:07:26 GMT
2015-03-12 16:09:30 GMT [7920]: [2-1] user=,db=,client= LOG:  database system was not properly shut down; automatic recovery in progress
2015-03-12 16:09:30 GMT [7920]: [3-1] user=,db=,client= LOG:  invalid record length at 2/7E269A0
2015-03-12 16:09:30 GMT [7920]: [4-1] user=,db=,client= LOG:  redo is not required
2015-03-12 16:09:30 GMT [7880]: [40-1] user=,db=,client= LOG:  database system is ready to accept connections
2015-03-12 16:09:30 GMT [7924]: [1-1] user=,db=,client= LOG:  autovacuum launcher started

I can recreate this by typing:

EXPLAIN SELECT DISTINCT bid FROM pgbench_<tab>

This happens with seq_page_cost = 1000, but not when it's set to 1.

--
Thom
On 12 March 2015 at 16:20, Thom Brown <thom@linux.com> wrote:
> On 12 March 2015 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Please note that parallel_setup_cost and parallel_startup_cost are
>> still set to zero by default, so you need to set them to higher values
>> if you don't want the parallel plans once parallel_seqscan_degree
>> is set. I have yet to come up with default values for them; that needs
>> some tests.
>
> Thanks. Getting a problem:
> [snip]
> # EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
> ERROR:  too many dynamic shared memory segments
> [snip]
> I can recreate this by typing:
>
> EXPLAIN SELECT DISTINCT bid FROM pgbench_<tab>
>
> This happens with seq_page_cost = 1000, but not when it's set to 1.

Another problem. I restarted the instance (just in case), and get this error:

# \df+ *.*
ERROR:  cannot retain locks acquired while in parallel mode

I get this even with seq_page_cost = 1, parallel_seqscan_degree = 1 and
max_worker_processes = 1.

--
Thom
>
> On 10-03-2015 PM 01:09, Amit Kapila wrote:
> > On Tue, Mar 10, 2015 at 6:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> >> Is this patch handles the cases where the re-scan starts without
> >> finishing the earlier scan?
> >>
> >
> > Do you mean to say cases like ANTI, SEMI Join (in nodeNestLoop.c)
> > where we scan the next outer tuple and rescan inner table without
> > completing the previous scan of inner table?
> >
> > I have currently modelled it based on existing rescan for seqscan
> > (ExecReScanSeqScan()) which means it will begin the scan again.
> > Basically if the workers are already started/initialized by previous
> > scan, then re-initialize them (refer function ExecReScanFunnel() in
> > patch).
> >
>
> From Robert's description[1], it looked like the NestLoop with Funnel would
> have Funnel as either outer plan or topmost plan node or NOT a parameterised
> plan. In that case, would this case arise or am I missing something?
>
Probably not if the costing is right and the user doesn't manually disable
plans (like by setting enable_* = off). However, we should have rescan code
in case it chooses a plan such that the Funnel is the inner node, and I
think apart from that there are also a few other cases where a rescan is
required.
On 13-03-2015 AM 10:24, Amit Kapila wrote:
> On Thu, Mar 12, 2015 at 4:22 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
>> From Robert's description [1], it looked like the NestLoop with Funnel
>> would have the Funnel as either the outer plan or the topmost plan node,
>> or NOT a parameterised plan. In that case, would this case arise, or am
>> I missing something?
>>
>
> Probably not if the costing is right and the user doesn't manually disable
> plans (like by setting enable_* = off). However, we should have rescan code
> in case it chooses a plan such that the Funnel is the inner node, and I
> think apart from that there are also a few other cases where a rescan is
> required.
>

I see, thanks.

By the way, is it right that TupleQueueFunnel.queue has one shm_mq_handle
per initialized parallel worker? If so, how does TupleQueueFunnel.maxqueues
relate to ParallelContext.nworkers (of the corresponding parallel context)?

The reason I ask is this code in CreateTupleQueueFunnel():

    funnel->maxqueues = 8;
    funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));

So, is the hardcoded "8" intentional or an oversight?

Thanks,
Amit
On 13-03-2015 PM 01:37, Amit Langote wrote:
> By the way, is it right that TupleQueueFunnel.queue has one shm_mq_handle
> per initialized parallel worker? If so, how does TupleQueueFunnel.maxqueues
> relate to ParallelContext.nworkers (of the corresponding parallel context)?
>
> The reason I ask is this code in CreateTupleQueueFunnel():
>
>     funnel->maxqueues = 8;
>     funnel->queue = palloc(funnel->maxqueues * sizeof(shm_mq_handle *));
>
> So, is the hardcoded "8" intentional or an oversight?
>

Oh, I see that in RegisterTupleQueueOnFunnel(), the TupleQueueFunnel.queue
is expanded (repalloc'd) as needed, up to the corresponding pcxt->nworkers.

Thanks,
Amit
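A hedged sketch of the grow-on-demand pattern just described. The struct
layout and the function signature are inferred from the discussion
(queue/nqueues/maxqueues/nextqueue); they are not copied from the patch:

#include "postgres.h"
#include "storage/shm_mq.h"

typedef struct TupleQueueFunnel
{
	int			nqueues;	/* number of registered queues */
	int			maxqueues;	/* allocated length of 'queue' */
	int			nextqueue;	/* next queue to read a tuple from */
	shm_mq_handle **queue;
} TupleQueueFunnel;

void
RegisterTupleQueueOnFunnel(TupleQueueFunnel *funnel, shm_mq_handle *mqh)
{
	if (funnel->nqueues >= funnel->maxqueues)
	{
		/* Double the array, so the hardcoded initial 8 is only a hint. */
		funnel->maxqueues *= 2;
		funnel->queue = repalloc(funnel->queue,
								 funnel->maxqueues * sizeof(shm_mq_handle *));
	}
	funnel->queue[funnel->nqueues++] = mqh;
}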
On 12-03-2015 PM 11:46, Amit Kapila wrote:
> [parallel_seqscan_v10.patch]

There may be a bug in TupleQueueFunnelNext().

1) I observed a hang with the stack looking like:

#0 0x00000039696df098 in poll () from /lib64/libc.so.6
#1 0x00000000006f1c6a in WaitLatchOrSocket (latch=0x7f29dc3c73b4, wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
#2 0x00000000006f1aca in WaitLatch (latch=0x7f29dc3c73b4, wakeEvents=1, timeout=0) at pg_latch.c:197
#3 0x000000000065088b in TupleQueueFunnelNext (funnel=0x17b4a20, nowait=0 '\000', done=0x17ad481 "") at tqueue.c:269
#4 0x0000000000636cab in funnel_getnext (funnelstate=0x17ad3d0) at nodeFunnel.c:347
...
<snip>

2) In some cases, there can be a segmentation fault with the stack looking like:

#0 0x000000396968990a in memcpy () from /lib64/libc.so.6
#1 0x00000000006507e7 in TupleQueueFunnelNext (funnel=0x263c800, nowait=0 '\000', done=0x2633461 "") at tqueue.c:233
#2 0x0000000000636cab in funnel_getnext (funnelstate=0x26333b0) at nodeFunnel.c:347
#3 0x0000000000636901 in ExecFunnel (node=0x26333b0) at nodeFunnel.c:179
...
<snip>

I could get rid of (1) and (2) with the attached fix.
On 13-03-2015 PM 05:32, Amit Langote wrote:
> On 12-03-2015 PM 11:46, Amit Kapila wrote:
>> [parallel_seqscan_v10.patch]
>
> There may be a bug in TupleQueueFunnelNext().
> [snip]
> I could get rid of (1) and (2) with the attached fix.

Hit send too soon!

By the way, the bug seems to be exposed only with a certain
pattern/sequence of workers being detached (perhaps in immediate
succession) whereby funnel->nextqueue remains incorrectly set.

The patch is attached this time.

By the way, when I have asserts enabled, I hit this compilation error:

createplan.c: In function ‘create_partialseqscan_plan’:
createplan.c:1180: error: ‘Path’ has no member named ‘path’

I see the following line there:

Assert(best_path->path.parent->rtekind == RTE_RELATION);

Thanks,
Amit
Attachment
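The compiler message suggests create_partialseqscan_plan() receives a
plain Path pointer rather than a node with an embedded 'path' field, so
presumably (an inference from the error alone, not taken from any later
patch) the Assert just needs to drop the extra member reference:

	/* best_path is a plain Path *, so access parent directly (assumed fix) */
	Assert(best_path->parent->rtekind == RTE_RELATION);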
>
> On 13-03-2015 PM 05:32, Amit Langote wrote:
> > On 12-03-2015 PM 11:46, Amit Kapila wrote:
> >> [parallel_seqscan_v10.patch]
> >
> > There may be a bug in TupleQueueFunnelNext().
> >
> > 1) I observed a hang with stack looking like:
> >
> > #0 0x00000039696df098 in poll () from /lib64/libc.so.6
> > #1 0x00000000006f1c6a in WaitLatchOrSocket (latch=0x7f29dc3c73b4,
> > wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:333
> > #2 0x00000000006f1aca in WaitLatch (latch=0x7f29dc3c73b4, wakeEvents=1,
> > timeout=0) at pg_latch.c:197
> > #3 0x000000000065088b in TupleQueueFunnelNext (funnel=0x17b4a20, nowait=0
> > '\000', done=0x17ad481 "") at tqueue.c:269
> > #4 0x0000000000636cab in funnel_getnext (funnelstate=0x17ad3d0) at
> > nodeFunnel.c:347
> > ...
> > <snip>
> >
> > 2) In some cases, there can be a segmentation fault with stack looking like:
> >
> > #0 0x000000396968990a in memcpy () from /lib64/libc.so.6
> > #1 0x00000000006507e7 in TupleQueueFunnelNext (funnel=0x263c800, nowait=0
> > '\000', done=0x2633461 "") at tqueue.c:233
> > #2 0x0000000000636cab in funnel_getnext (funnelstate=0x26333b0) at
> > nodeFunnel.c:347
> > #3 0x0000000000636901 in ExecFunnel (node=0x26333b0) at nodeFunnel.c:179
> > ...
> > <snip>
> >
> > I could get rid of (1) and (2) with the attached fix.
>
> Hit send too soon!
>
> By the way, the bug seems to be exposed only with a certain pattern/sequence
> of workers being detached (perhaps in immediate successive) whereby the
> funnel->nextqueue remains incorrectly set.
>
> The patch attached this time.
>
> By the way, when I have asserts enabled, I hit this compilation error:
>
> createplan.c: In function ‘create_partialseqscan_plan’:
> createplan.c:1180: error: ‘Path’ has no member named ‘path’
>
> I see following line there:
>
> Assert(best_path->path.parent->rtekind == RTE_RELATION);
>
Okay, will take care of this.
Attachment
>
>
> Another problem. I restarted the instance (just in case), and get this error:
>
> # \df+ *.*
> ERROR: cannot retain locks acquired while in parallel mode
>
>
> On Tue, Mar 10, 2015 at 10:23 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> >
> > On Tue, Mar 10, 2015 at 3:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > I have currently modelled it based on existing rescan for seqscan
> > > (ExecReScanSeqScan()) which means it will begin the scan again.
> > > Basically if the workers are already started/initialized by previous
> > > scan, then re-initialize them (refer function ExecReScanFunnel() in
> > > patch).
> > >
> > > Can you elaborate more if you think current handling is not sufficient
> > > for any case?
> >
> > From ExecReScanFunnel function it seems that the re-scan waits till
> > all the workers
> > has to be finished to start again the next scan. Are the workers will
> > stop the current
> > ongoing task? otherwise this may decrease the performance instead of
> > improving as i feel.
> >
>
> Okay, performance-wise it might effect such a case, but I think we can
> handle it by not calling WaitForParallelWorkersToFinish(),
> as DestroyParallelContext() will automatically terminate all the workers.
>
We can't directly call DestroyParallelContext() to terminate workers, as
it can so happen that by that time some of the workers are still not
started. So that can lead to a problem. I think what we need here is a
way to know whether all workers have started (basically we need a new
function WaitForParallelWorkersToStart()). This API needs to be provided
by the parallel-mode patch.
>
> In create_parallelscan_paths() function the funnel path is added once
> the partial seq scan
> path is generated. I feel the funnel path can be added once on top of
> the total possible
> parallel path in the entire query path.
>
> Is this the right patch to add such support also?
>
This seems to be an optimization for parallel paths which can be
done later as well.
On Fri, Mar 13, 2015 at 9:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 12, 2015 at 3:44 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> In create_parallelscan_paths() function the funnel path is added once
>> the partial seq scan path is generated. I feel the funnel path can be
>> added once on top of the total possible parallel path in the entire
>> query path.
>>
>> Is this the right patch to add such support also?
>
> This seems to be an optimization for parallel paths which can be
> done later as well.

+1. Let's keep it simple for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 13, 2015 at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> We can't directly call DestroyParallelContext() to terminate workers, as
> it can so happen that by that time some of the workers are still not
> started.

That shouldn't be a problem. TerminateBackgroundWorker() not only
kills an existing worker if there is one, but also tells the
postmaster that if it hasn't started the worker yet, it should not
bother. So at the conclusion of the first loop inside
DestroyParallelContext(), every running worker will have received
SIGTERM and no more workers will be started.

> So that can lead to a problem. I think what we need here is a way to
> know whether all workers have started (basically we need a new function
> WaitForParallelWorkersToStart()). This API needs to be provided by the
> parallel-mode patch.

I don't think so. DestroyParallelContext() is intended to be good
enough for this purpose; if it's not, we should fix that instead of
adding a new function.

No matter what, re-scanning a parallel node is not going to be very
efficient. But the way to deal with that is to make sure that such
nodes have a substantial startup cost, so that the planner won't pick
them in the case where it isn't going to work out well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
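As a hedged illustration of the two-phase shutdown Robert describes.
TerminateBackgroundWorker() and WaitForBackgroundWorkerShutdown() are the
APIs named in this subthread, but the pcxt->worker[i].bgwhandle layout is
an assumption, and the real DestroyParallelContext() in the parallel-mode
patch also tears down the DSM segment and error queues, which is omitted
here:

static void
terminate_and_reap_workers(ParallelContext *pcxt)
{
	int			i;

	/*
	 * Phase 1: ask every worker to stop.  TerminateBackgroundWorker()
	 * SIGTERMs a running worker and also tells the postmaster not to
	 * launch a registered-but-not-yet-started one, so after this loop
	 * no new workers can appear.
	 */
	for (i = 0; i < pcxt->nworkers; i++)
		TerminateBackgroundWorker(pcxt->worker[i].bgwhandle);

	/* Phase 2: wait until each worker is actually gone. */
	for (i = 0; i < pcxt->nworkers; i++)
		WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
}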
On Fri, Mar 13, 2015 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think this can happen if funnel->nextqueue is greater than
> funnel->nqueues.
> Please see if the attached patch fixes the issue, else could you share
> the scenario in more detail where you hit this issue.

Speaking as the guy who wrote the first version of that code...

I don't think this is the right fix; the point of that code is to
remove a tuple queue from the funnel when it gets detached, which is a
correct thing to want to do. funnel->nextqueue should always be less
than funnel->nqueues; how is that failing to be the case here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Fri, Mar 13, 2015 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think this can happen if funnel->nextqueue is greater than
> > funnel->nqueues.
> > Please see if attached patch fixes the issue, else could you share the
> > scenario in more detail where you hit this issue.
>
> Speaking as the guy who wrote the first version of that code...
>
> I don't think this is the right fix; the point of that code is to
> remove a tuple queue from the funnel when it gets detached, which is a
> correct thing to want to do. funnel->nextqueue should always be less
> than funnel->nqueues; how is that failing to be the case here?
>
I could not reproduce the issue, nor is the exact scenario mentioned in
the mail. However, what I think can lead to funnel->nextqueue being
greater than funnel->nqueues is something like below:

Assume 5 queues, so the value of funnel->nqueues will be 5, and assume
the value of funnel->nextqueue is 2. Now let us say 4 workers get
detached one-by-one; in such a case it will always go into the else loop
and will never change funnel->nextqueue, whereas the value of
funnel->nqueues will become 1.

Am I missing something?
On Fri, Mar 13, 2015 at 11:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think this is the right fix; the point of that code is to
>> remove a tuple queue from the funnel when it gets detached, which is a
>> correct thing to want to do. funnel->nextqueue should always be less
>> than funnel->nqueues; how is that failing to be the case here?
>
> I could not reproduce the issue, nor is the exact scenario mentioned in
> the mail. However, what I think can lead to funnel->nextqueue being
> greater than funnel->nqueues is something like below:
> [snip]
> Am I missing something?

Sorry, I did not mention the exact example I'd used, but I thought it was
just any arbitrary example:

CREATE TABLE t1 (c1, c2) AS SELECT g, repeat('x', 5) FROM generate_series(1, 10000000) g;
CREATE TABLE t2 (c1, c2) AS SELECT g, repeat('x', 5) FROM generate_series(1, 1000000) g;

SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;

The observed behavior included a hang or a segfault, arbitrarily (that's
why I guessed it may be the arbitrariness of the sequence of detachment of
workers).

Parameters changed to make the plan include a Funnel:

parallel_seqscan_degree = 8
cpu_tuple_communication_cost = 0.01/0.001

Thanks,
Amit
>
> On 12 March 2015 at 15:29, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Please note that parallel_setup_cost and parallel_startup_cost are
> > still set to zero by default, so you need to set it to higher values
> > if you don't want the parallel plans once parallel_seqscan_degree
> > is set. I have yet to comeup with default values for them, needs
> > some tests.
>
> Thanks. Getting a problem:
>
>
> # SELECT name, setting FROM pg_settings WHERE name IN
> ('parallel_seqscan_degree','max_worker_processes','seq_page_cost');
> name | setting
> -------------------------+---------
> max_worker_processes | 20
> parallel_seqscan_degree | 8
> seq_page_cost | 1000
> (3 rows)
>
> # EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
> ERROR: too many dynamic shared memory segments
>
>
This happens because we have a maximum limit on the number of
dynamic shared memory segments in the system.

In function dsm_postmaster_startup(), it is defined as follows:

maxitems = PG_DYNSHMEM_FIXED_SLOTS
    + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

In the above case, it is choosing a parallel plan for each of the
AppendRelations (because of seq_page_cost = 1000), and that causes the
test to cross the max limit of dsm segments.

Problem-2:
On 13-03-2015 PM 11:03, Amit Kapila wrote:
> On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> I don't think this is the right fix; the point of that code is to
>> remove a tuple queue from the funnel when it gets detached, which is a
>> correct thing to want to do. funnel->nextqueue should always be less
>> than funnel->nqueues; how is that failing to be the case here?
>>
>
> I could not reproduce the issue, nor is the exact scenario mentioned in
> the mail. However, what I think can lead to funnel->nextqueue being
> greater than funnel->nqueues is something like below:
>
> Assume 5 queues, so the value of funnel->nqueues will be 5, and assume
> the value of funnel->nextqueue is 2. Now let us say 4 workers get
> detached one-by-one; in such a case it will always go into the else loop
> and will never change funnel->nextqueue, whereas the value of
> funnel->nqueues will become 1.
>

Or, if the just-detached queue happens to be the last one, we'll make
shm_mq_receive() read from a potentially already-detached queue in the
immediately next iteration. That seems to be caused by not having updated
funnel->nextqueue. With the returned value being SHM_MQ_DETACHED, we'll
again try to remove it from the queue; in this case, it causes the third
argument to memcpy to be negative and hence the segfault.

I can't seem to really figure out the other problem of waiting forever in
WaitLatch(), but I had managed to make it go away with:

- if (funnel->nextqueue == waitpos)
+ if (result != SHM_MQ_DETACHED && funnel->nextqueue == waitpos)

By the way, you can try reproducing this with the example I posted on
Friday.

Thanks,
Amit
>
> On 13-03-2015 PM 11:03, Amit Kapila wrote:
> > On Fri, Mar 13, 2015 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> I don't think this is the right fix; the point of that code is to
> >> remove a tuple queue from the funnel when it gets detached, which is a
> >> correct thing to want to do. funnel->nextqueue should always be less
> >> than funnel->nqueues; how is that failing to be the case here?
> >>
> >
> > I could not reproduce the issue, neither the exact scenario is
> > mentioned in mail. However what I think can lead to funnel->nextqueue
> > greater than funnel->nqueues is something like below:
> >
> > Assume 5 queues, so value of funnel->nqueues will be 5 and
> > assume value of funnel->nextqueue is 2, so now let us say 4 workers
> > got detached one-by-one, so for such a case it will always go in else loop
> > and will never change funnel->nextqueue whereas value of funnel->nqueues
> > will become 1.
> >
>
> Or if the just-detached queue happens to be the last one, we'll make
> shm_mq_receive() to read from a potentially already-detached queue in the
> immediately next iteration.
Won't the last queue case already be handled by the below code?

else
{
	--funnel->nqueues;
	if (funnel->nqueues == 0)
	{
		if (done != NULL)
			*done = true;
		return NULL;
	}
> funnel->nextqueue. With the returned value being SHM_MQ_DETACHED, we'll again
> try to remove it from the queue. In this case, it causes the third argument to
> memcpy to be negative and hence the segfault.
>
> I can't seem to really figure out the other problem of waiting forever in
> WaitLatch()

The reason seems to be that, for certain scenarios, the way we set the
latch before exiting needs some more thought. Currently we are setting
the latch in HandleParallelMessageInterrupt(), and that doesn't seem to
be sufficient.

> By the way, you can try reproducing this with the example I posted on Friday.
>
>
> Amit Kapila wrote:
>
> > I think this can happen if funnel->nextqueue is greater
> > than funnel->nqueues.
> > Please see if attached patch fixes the issue, else could you share the
> > scenario in more detail where you hit this issue.
>
> Uh, isn't this copying an overlapping memory region? If so you should
> be using memmove instead.
>
On 16-03-2015 PM 04:14, Amit Kapila wrote:
> On Mon, Mar 16, 2015 at 9:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Or, if the just-detached queue happens to be the last one, we'll make
>> shm_mq_receive() read from a potentially already-detached queue in the
>> immediately next iteration.
>
> Won't the last queue case already be handled by the below code?
> else
> {
>     --funnel->nqueues;
>     if (funnel->nqueues == 0)
>     {
>         if (done != NULL)
>             *done = true;
>         return NULL;
>     }
>

Actually I meant "currently the last", or:

funnel->nextqueue == funnel->nqueues - 1

So the code you quote would only take care of a subset of the cases.
Imagine funnel->nqueues going down from 5 to 3 in successive iterations
while funnel->nextqueue remains set to 4 (which would have been the
"currently last" when funnel->nqueues was 5).

>> I can't seem to really figure out the other problem of waiting forever
>> in WaitLatch()
>
> The reason seems to be that, for certain scenarios, the way we set the
> latch before exiting needs some more thought. Currently we are setting
> the latch in HandleParallelMessageInterrupt(), and that doesn't seem to
> be sufficient.

How about shm_mq_detach() called from ParallelQueryMain() right after
exec_parallel_stmt() returns? Doesn't that do the SetLatch() that needs
to be done by a worker?

Thanks,
Amit
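Pulling this subthread together, a hedged sketch of the detach handling
it converges on: remove the detached queue by shifting the tail down
(memmove rather than memcpy, per Alvaro's point about overlapping
regions) and keep the read cursor valid by wrapping it back to the
start. Field names follow the discussion, not the patch verbatim:

static void
funnel_remove_queue(TupleQueueFunnel *funnel, int i)
{
	Assert(i >= 0 && i < funnel->nqueues);

	/*
	 * Because i < nqueues, the byte count below can never go negative -
	 * a negative third argument to memcpy was the reported segfault.
	 * The source and destination overlap, hence memmove.
	 */
	memmove(&funnel->queue[i], &funnel->queue[i + 1],
			(funnel->nqueues - i - 1) * sizeof(shm_mq_handle *));
	funnel->nqueues--;

	/*
	 * If the cursor now points past the shrunken array (the "currently
	 * the last" case above), restart from the first remaining queue.
	 */
	if (funnel->nextqueue >= funnel->nqueues)
		funnel->nextqueue = 0;
}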
>
> On Fri, Mar 13, 2015 at 8:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > We can't directly call DestroyParallelContext() to terminate workers as
> > it can so happen that by that time some of the workers are still not
> > started.
>
> That shouldn't be a problem. TerminateBackgroundWorker() not only
> kills an existing worker if there is one, but also tells the
> postmaster that if it hasn't started the worker yet, it should not
> bother. So at the conclusion of the first loop inside
> DestroyParallelContext(), every running worker will have received
> SIGTERM and no more workers will be started.
>
The problem occurs in the second loop inside DestroyParallelContext(),
where it calls WaitForBackgroundWorkerShutdown(). Basically,
WaitForBackgroundWorkerShutdown() just checks for BGWH_STOPPED status;
refer to the below code in the parallel-mode patch:

+ status = GetBackgroundWorkerPid(handle, &pid);
+ if (status == BGWH_STOPPED)
+     return status;

So if the status returned here is BGWH_NOT_YET_STARTED, then it will go
for WaitLatch and will wait there forever.

I think the fix is to check whether the status is BGWH_STOPPED or
BGWH_NOT_YET_STARTED, and then just return the status.

What do you say?
On Tue, Mar 17, 2015 at 1:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The problem occurs in the second loop inside DestroyParallelContext(),
> where it calls WaitForBackgroundWorkerShutdown(). Basically,
> WaitForBackgroundWorkerShutdown() just checks for BGWH_STOPPED status;
> refer to the below code in the parallel-mode patch:
>
> + status = GetBackgroundWorkerPid(handle, &pid);
> + if (status == BGWH_STOPPED)
> +     return status;
>
> So if the status returned here is BGWH_NOT_YET_STARTED, then it will go
> for WaitLatch and will wait there forever.
>
> I think the fix is to check whether the status is BGWH_STOPPED or
> BGWH_NOT_YET_STARTED, and then just return the status.
>
> What do you say?

No, that's not right. If we return when the status is
BGWH_NOT_YET_STARTED, then the postmaster could subsequently start the
worker.

Can you try this:

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f80141a..39b919f 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -244,6 +244,8 @@ BackgroundWorkerStateChange(void)
             rw->rw_terminate = true;
             if (rw->rw_pid != 0)
                 kill(rw->rw_pid, SIGTERM);
+            else
+                ReportBackgroundWorkerPID(rw);
         }
         continue;
     }

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Tue, Mar 17, 2015 at 1:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The problem occurs in second loop inside DestroyParallelContext()
> > where it calls WaitForBackgroundWorkerShutdown(). Basically
> > WaitForBackgroundWorkerShutdown() just checks for BGWH_STOPPED
> > status, refer below code in parallel-mode patch:
> >
> > + status = GetBackgroundWorkerPid(handle, &pid);
> > + if (status == BGWH_STOPPED)
> > + return status;
> >
> > So if the status here returned is BGWH_NOT_YET_STARTED, then it
> > will go for WaitLatch and will there forever.
> >
> > I think fix is to check if status is BGWH_STOPPED or BGWH_NOT_YET_STARTED,
> > then just return the status.
> >
> > What do you say?
>
> No, that's not right. If we return when the status is
> BGWH_NOT_YET_STARTED, then the postmaster could subsequently start the
> worker.
>
> Can you try this:
>
> diff --git a/src/backend/postmaster/bgworker.c
> b/src/backend/postmaster/bgworker.c
> index f80141a..39b919f 100644
> --- a/src/backend/postmaster/bgworker.c
> +++ b/src/backend/postmaster/bgworker.c
> @@ -244,6 +244,8 @@ BackgroundWorkerStateChange(void)
> rw->rw_terminate = true;
> if (rw->rw_pid != 0)
> kill(rw->rw_pid, SIGTERM);
> + else
> + ReportBackgroundWorkerPID(rw);
> }
> continue;
> }
>
It didn't fix the problem. IIUC, you have done this to ensure that,
if the worker is not already started, its pid is updated so that we
can get the required status in WaitForBackgroundWorkerShutdown().
As this is a timing issue, it can so happen that before the postmaster
gets a chance to report the pid, the backend has already started waiting
in WaitLatch().
On Wed, Mar 18, 2015 at 2:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Can you try this:
>>
>> [snip - the bgworker.c diff above]
>
> It didn't fix the problem. IIUC, you have done this to ensure that,
> if the worker is not already started, its pid is updated so that we
> can get the required status in WaitForBackgroundWorkerShutdown().
> As this is a timing issue, it can so happen that before the postmaster
> gets a chance to report the pid, the backend has already started waiting
> in WaitLatch().

I think I figured out the problem. That fix only helps in the case
where the postmaster noticed the new registration previously but
didn't start the worker, and then later notices the termination.
What's much more likely to happen is that the worker is started and
terminated so quickly that both happen before we create a
RegisteredBgWorker for it. The attached patch fixes that case, too.

Assuming this actually fixes the problem, I think we should back-patch
it into 9.4. To recap, the problem is that, at present, if you register
a worker and then terminate it before it's launched,
GetBackgroundWorkerPid() will still return BGWH_NOT_YET_STARTED, which
makes it seem like we're still waiting for it to start. But when or if
the slot is reused for an unrelated registration, then
GetBackgroundWorkerPid() will switch to returning BGWH_STOPPED. It's
hard to believe that's the behavior anyone wants.

With this patch, the return value will always be BGWH_STOPPED in this
situation. That has the virtue of being consistent, and practically
speaking I think it's the behavior that everyone will want, because the
case where this matters is when you are waiting for workers to start or
waiting for workers to stop, and in either case you will want to treat
a worker that was marked for termination before the postmaster actually
started it as already-stopped rather than not-yet-started.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Sat, Mar 14, 2015 at 1:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> # EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
>> ERROR:  too many dynamic shared memory segments
>
> This happens because we have a maximum limit on the number of
> dynamic shared memory segments in the system.
>
> In function dsm_postmaster_startup(), it is defined as follows:
>
> maxitems = PG_DYNSHMEM_FIXED_SLOTS
>     + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
>
> In the above case, it is choosing a parallel plan for each of the
> AppendRelations (because of seq_page_cost = 1000), and that causes the
> test to cross the max limit of dsm segments.

The problem here is, of course, that each parallel sequential scan is
trying to create an entirely separate group of workers. Eventually, I
think we should fix this by rejiggering things so that when there are
multiple parallel nodes in a plan, they all share a pool of workers.
So each worker would actually get a list of plan nodes instead of a
single plan node. Maybe it works on the first node in the list until
that's done, and then moves on to the next; or maybe it round-robins
among all the nodes and works on the ones where the output tuple
queues aren't currently full; or maybe the master somehow notifies the
workers which nodes are most useful to work on at the present time.
But I think trying to figure this out is far too ambitious for 9.5,
and I think we can have a useful feature without implementing any of it.

But we can't just ignore the issue right now, because erroring out on
a large inheritance hierarchy is no good. Instead, we should fall back
to non-parallel operation in this case. By the time we discover the
problem, it's too late to change the plan, because it's already
execution time. So we are going to be stuck executing the parallel
node - just with no workers to help. However, what I think we can do
is use a slab of backend-private memory instead of a dynamic shared
memory segment, and in that way avoid this error. We do something
similar when starting the postmaster in stand-alone mode: the main
shared memory segment is replaced by a backend-private allocation with
the same contents that the shared memory segment would normally have.
The same fix will work here.

Even once we make the planner and executor smarter, so that they don't
create lots of shared memory segments and lots of separate worker
pools in this type of case, it's probably still useful to have this as
a fallback approach, because there's always the possibility that some
other client of the dynamic shared memory system could gobble up all
the segments. So, I'm going to go try to figure out the best way to
implement this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
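To make the fallback concrete, a heavily hedged sketch follows.
dsm_segment_address() is real, but try_dsm_create() - a variant imagined
here to return NULL instead of erroring when the slot table computed by
the maxitems formula above is full - is hypothetical, as is the
caller-visible 'in_dsm' contract:

static void *
create_parallel_scan_state(Size size, bool *in_dsm)
{
	dsm_segment *seg = try_dsm_create(size);	/* hypothetical variant */

	if (seg != NULL)
	{
		*in_dsm = true;
		return dsm_segment_address(seg);
	}

	/*
	 * No segment left: execute the "parallel" node entirely in the
	 * leader, backed by ordinary backend-private memory and zero
	 * workers, instead of erroring out at execution time.
	 */
	*in_dsm = false;
	return palloc0(size);
}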
>
> On Sat, Mar 14, 2015 at 1:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> # EXPLAIN SELECT DISTINCT bid FROM pgbench_accounts;
> >> ERROR: too many dynamic shared memory segments
> >
> > This happens because we have maximum limit on the number of
> > dynamic shared memory segments in the system.
> >
> > In function dsm_postmaster_startup(), it is defined as follows:
> >
> > maxitems = PG_DYNSHMEM_FIXED_SLOTS
> > + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
> >
> > In the above case, it is choosing parallel plan for each of the
> > AppendRelation,
> > (because of seq_page_cost = 1000) and that causes the test to
> > cross max limit of dsm segments.
>
> The problem here is, of course, that each parallel sequential scan is
> trying to create an entirely separate group of workers. Eventually, I
> think we should fix this by rejiggering things so that when there are
> multiple parallel nodes in a plan, they all share a pool of workers.
> So each worker would actually get a list of plan nodes instead of a
> single plan node. Maybe it works on the first node in the list until
> that's done, and then moves onto the next, or maybe it round-robins
> among all the nodes and works on the ones where the output tuple
> queues aren't currently full, or maybe the master somehow notifies the
> workers which nodes are most useful to work on at the present time.
> But I think trying to figure this out is far too ambitious for 9.5,
> and I think we can have a useful feature without implementing any of
> it.
>
> But, we can't just ignore the issue right now, because erroring out on
> a large inheritance hierarchy is no good. Instead, we should fall
> back to non-parallel operation in this case. By the time we discover
> the problem, it's too late to change the plan, because it's already
> execution time. So we are going to be stuck executing the parallel
> node - just with no workers to help. However, what I think we can do
> is use a slab of backend-private memory instead of a dynamic shared
> memory segment, and in that way avoid this error. We do something
> similar when starting the postmaster in stand-alone mode: the main
> shared memory segment is replaced by a backend-private allocation with
> the same contents that the shared memory segment would normally have.
> The same fix will work here.
>
> Even once we make the planner and executor smarter, so that they don't
> create lots of shared memory segments and lots of separate worker
> pools in this type of case, it's probably still useful to have this as
> a fallback approach, because there's always the possibility that some
> other client of the dynamic shared memory system could gobble up all
> the segments. So, I'm going to go try to figure out the best way to
> implement this.
>
Thanks.
>
> On Wed, Mar 18, 2015 at 2:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It didn't fix the problem. IIUC, you have done this to ensure that
> > if worker is not already started, then update it's pid, so that we
> > can get the required status in WaitForBackgroundWorkerShutdown().
> > As this is a timing issue, it can so happen that before Postmaster
> > gets a chance to report the pid, backend has already started waiting
> > on WaitLatch().
>
> I think I figured out the problem. That fix only helps in the case
> where the postmaster noticed the new registration previously but
> didn't start the worker, and then later notices the termination.
> What's much more likely to happen is that the worker is started and
> terminated so quickly that both happen before we create a
> RegisteredBgWorker for it. The attached patch fixes that case, too.
>
> Assuming this actually fixes the problem, I think we should back-patch
> it into 9.4.
On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Patch fixes the problem and now for Rescan, we don't need to Wait
> for workers to finish.
>
>> Assuming this actually fixes the problem, I think we should back-patch
>> it into 9.4.
>
> +1

OK, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On 16-03-2015 PM 04:14, Amit Kapila wrote:
> > On Mon, Mar 16, 2015 at 9:40 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> > wrote:
> >> Or if the just-detached queue happens to be the last one, we'll make
> >> shm_mq_receive() to read from a potentially already-detached queue in the
> >> immediately next iteration.
> >
> > Won't the last queue case already handled by below code:
> > else
> > {
> > --funnel->nqueues;
> > if (funnel->nqueues == 0)
> > {
> > if (done != NULL)
> > *done = true;
> > return NULL;
> > }
> >
>
> Actually I meant "currently the last" or:
>
> funnel->nextqueue == funnel->nqueue - 1
>
> So the code you quote would only take care of subset of the cases.
>
Fixed this issue by resetting funnel->nextqueue to zero (as per offlist
discussion with Robert), so that it restarts from first queue in such
a case.
>
> >> I can't seem to really figure out the other problem of waiting forever in
> >> WaitLatch()
> >>
> >
> > The reason seems that for certain scenarios, the way we set the latch before
> > exiting needs some more thought. Currently we are setting the latch in
> > HandleParallelMessageInterrupt(), that doesn't seem to be sufficient.
> >
>
> How about shm_mq_detach() called from ParallelQueryMain() right after
> exec_parallel_stmt() returns? Doesn't that do the SetLatch() that needs to be
> done by a worker?
>
Attachment
On 20-03-2015 PM 09:06, Amit Kapila wrote:
> On Mon, Mar 16, 2015 at 12:58 PM, Amit Langote <
> Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Actually I meant "currently the last" or:
>>
>> funnel->nextqueue == funnel->nqueue - 1
>>
>> So the code you quote would only take care of subset of the cases.
>>
>
> Fixed this issue by resetting funnel->nextqueue to zero (as per offlist
> discussion with Robert), so that it restarts from first queue in such
> a case.
>
>>
>> How about shm_mq_detach() called from ParallelQueryMain() right after
>> exec_parallel_stmt() returns? Doesn't that do the SetLatch() that needs
>> to be done by a worker?
>>
>
> Fixed this issue by not going for Wait incase of detached queues.

Thanks for fixing. I no longer see the problems.

Regards,
Amit
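[For readers following along, the wraparound fix described above might
look roughly like the sketch below; the FunnelState layout and the
helper name are guesses based on the discussion, not the actual patch.]

/*
 * Rough sketch (assumes postgres.h and storage/shm_mq.h): when a
 * worker's queue detaches, compact the array and make sure nextqueue
 * never points at, or past, a stale slot, so the next
 * shm_mq_receive() cannot touch an already-detached queue.
 */
static void
funnel_forget_queue(FunnelState *funnel, int whichqueue)
{
    /* Close the gap left by the detached queue. */
    memmove(&funnel->queue[whichqueue],
            &funnel->queue[whichqueue + 1],
            (funnel->nqueues - whichqueue - 1) * sizeof(shm_mq_handle *));
    funnel->nqueues--;

    /* If we were sitting on (or past) the last slot, restart at zero. */
    if (funnel->nextqueue >= funnel->nqueues)
        funnel->nextqueue = 0;
}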
On 20 March 2015 17:37, Amit Kapila Wrote:
> So the patches have to be applied in below sequence:
> HEAD Commit-id : 8d1f2390
> parallel-mode-v8.1.patch [2]
> assess-parallel-safety-v4.patch [1]
> parallel-heap-scan.patch [3]
> parallel_seqscan_v11.patch (Attached with this mail)
While I was going through this patch, I observed one invalid ASSERT in the function “ExecInitFunnel” i.e.
Assert(outerPlan(node) == NULL);
Outer node of Funnel node is always non-NULL and currently it will be PartialSeqScan Node.
Maybe asserts were disabled while building the code, which is why this issue has not yet been observed.
Thanks and Regards,
Kumar Rajeev Rastogi
>
>
> So the patches have to be applied in below sequence:
> HEAD Commit-id : 8d1f2390
> parallel-mode-v8.1.patch [2]
> assess-parallel-safety-v4.patch [1]
> parallel-heap-scan.patch [3]
> parallel_seqscan_v11.patch (Attached with this mail)
>
> The reason for not using the latest commit in HEAD is that latest
> version of assess-parallel-safety patch was not getting applied,
> so I generated the patch at commit-id where I could apply that
> patch successfully.
>
> [1] - http://www.postgresql.org/message-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
> [2] - http://www.postgresql.org/message-id/CA+TgmoZJjzYnpXChL3gr7NwRUzkAzPMPVKAtDt5sHvC5Cd7RKw@mail.gmail.com
> [3] - http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
>
assess-parallel-safety-v4.patch [1]
[2] - http://www.postgresql.org/message-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
[3] - http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
Attachment
>
> On 20 March 2015 17:37, Amit Kapila Wrote:
>
> > So the patches have to be applied in below sequence:
> > HEAD Commit-id : 8d1f2390
> > parallel-mode-v8.1.patch [2]
> > assess-parallel-safety-v4.patch [1]
> > parallel-heap-scan.patch [3]
> > parallel_seqscan_v11.patch (Attached with this mail)
>
> While I was going through this patch, I observed one invalid ASSERT in the function “ExecInitFunnel” i.e.
>
> Assert(outerPlan(node) == NULL);
>
> Outer node of Funnel node is always non-NULL and currently it will be PartialSeqScan Node.
>
+ Assert(innerPlan(node) == NULL);
On 25 March 2015 16:00, Amit Kapila Wrote:
> Which version of patch you are looking at?
> I am seeing below code in ExecInitFunnel() in Version-11 to which
> you have replied.
> + /* Funnel node doesn't have innerPlan node. */
> + Assert(innerPlan(node) == NULL
I was seeing the version-10.
I just checked version-11 and version-12 and found to be already fixed.
I should have checked the latest version before sending the report... :-)
Thanks and Regards,
Kumar Rajeev Rastogi
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
Sent: 25 March 2015 16:00
To: Rajeev rastogi
Cc: Amit Langote; Robert Haas; Andres Freund; Kouhei Kaigai; Amit Langote; Fabrízio Mello; Thom Brown; Stephen Frost; pgsql-hackers
Subject: Re: [HACKERS] Parallel Seq Scan
On Wed, Mar 25, 2015 at 3:47 PM, Rajeev rastogi <rajeev.rastogi@huawei.com> wrote:
>
> On 20 March 2015 17:37, Amit Kapila Wrote:
>
> > So the patches have to be applied in below sequence:
> > HEAD Commit-id : 8d1f2390
> > parallel-mode-v8.1.patch [2]
> > assess-parallel-safety-v4.patch [1]
> > parallel-heap-scan.patch [3]
> > parallel_seqscan_v11.patch (Attached with this mail)
>
> While I was going through this patch, I observed one invalid ASSERT in the function “ExecInitFunnel” i.e.
>
> Assert(outerPlan(node) == NULL);
>
> Outer node of Funnel node is always non-NULL and currently it will be PartialSeqScan Node.
>
Which version of patch you are looking at?
I am seeing below code in ExecInitFunnel() in Version-11 to which
you have replied.
+ /* Funnel node doesn't have innerPlan node. */
+ Assert(innerPlan(node) == NULL);
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>
> On 25 March 2015 16:00, Amit Kapila Wrote:
>
> > Which version of patch you are looking at?
>
> > I am seeing below code in ExecInitFunnel() in Version-11 to which
>
> > you have replied.
>
>
>
> > + /* Funnel node doesn't have innerPlan node. */
> > + Assert(innerPlan(node) == NULL
>
>
>
> I was seeing the version-10.
>
> I just checked version-11 and version-12 and found to be already fixed.
>
> I should have checked the latest version before sending the report... :-)
>
No problem, Thanks for looking into the patch.
On Fri, Mar 20, 2015 at 5:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> So the patches have to be applied in below sequence:
> HEAD Commit-id : 8d1f2390
> parallel-mode-v8.1.patch [2]
> assess-parallel-safety-v4.patch [1]
> parallel-heap-scan.patch [3]
> parallel_seqscan_v11.patch (Attached with this mail)
>
> The reason for not using the latest commit in HEAD is that latest
> version of assess-parallel-safety patch was not getting applied,
> so I generated the patch at commit-id where I could apply that
> patch successfully.
>
> [1] - http://www.postgresql.org/message-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
> [2] - http://www.postgresql.org/message-id/CA+TgmoZJjzYnpXChL3gr7NwRUzkAzPMPVKAtDt5sHvC5Cd7RKw@mail.gmail.com
> [3] - http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
Fixed the reported issue on assess-parallel-safety thread and another
bug caught while testing joins and integrated with latest version of
parallel-mode patch (parallel-mode-v9 patch).

Apart from that I have moved the Initialization of dsm segement from
InitNode phase to ExecFunnel() (on first execution) as per suggestion
from Robert. The main idea is that as it creates large shared memory
segment, so do the work when it is really required.

HEAD Commit-Id: 11226e38
parallel-mode-v9.patch [2]
assess-parallel-safety-v4.patch [1]
parallel-heap-scan.patch [3]
parallel_seqscan_v12.patch (Attached with this mail)

[1] - http://www.postgresql.org/message-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
[2] - http://www.postgresql.org/message-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
[3] - http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
grep -r 'starting background worker process "parallel worker for PID 12165"' postgresql-2015-03-25_112522.log | wc -l
2496
QUERY PLAN
---------------------------------------------------------------------------------------------------------
HashAggregate (cost=38856527.50..38856529.50 rows=200 width=4)
Group Key: pgbench_accounts.bid
-> Append (cost=0.00..38806370.00 rows=20063001 width=4)
-> Seq Scan on pgbench_accounts (cost=0.00..0.00 rows=1 width=4)
-> Funnel on pgbench_accounts_1 (cost=0.00..192333.33 rows=100000 width=4)
Number of Workers: 8
-> Partial Seq Scan on pgbench_accounts_1 (cost=0.00..1641000.00 rows=100000 width=4)
-> Funnel on pgbench_accounts_2 (cost=0.00..192333.33 rows=100000 width=4)
Number of Workers: 8
-> Partial Seq Scan on pgbench_accounts_2 (cost=0.00..1641000.00 rows=100000 width=4)
-> Funnel on pgbench_accounts_3 (cost=0.00..192333.33 rows=100000 width=4)
Number of Workers: 8
...
-> Partial Seq Scan on pgbench_accounts_498 (cost=0.00..10002.10 rows=210 width=4)
-> Funnel on pgbench_accounts_499 (cost=0.00..1132.34 rows=210 width=4)
Number of Workers: 8
-> Partial Seq Scan on pgbench_accounts_499 (cost=0.00..10002.10 rows=210 width=4)
-> Funnel on pgbench_accounts_500 (cost=0.00..1132.34 rows=210 width=4)
Number of Workers: 8
-> Partial Seq Scan on pgbench_accounts_500 (cost=0.00..10002.10 rows=210 width=4)
--
Still not sure why 8 workers are needed for each partial scan. I would expect 8 workers to be used for 8 separate scans. Perhaps this is just my misunderstanding of how this feature works.
2015-03-25 13:17:49 GMT [22823]: [124-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [125-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [126-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [127-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [128-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [129-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [130-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [131-1] user=,db=,client= LOG: registering background worker "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [132-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [133-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [134-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [135-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [136-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [137-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [138-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [139-1] user=,db=,client= LOG: starting background worker process "parallel worker for PID 24792"
2015-03-25 13:17:49 GMT [22823]: [140-1] user=,db=,client= LOG: worker process: parallel worker for PID 24792 (PID 24804) was terminated by signal 11: Segmentation fault
2015-03-25 13:17:49 GMT [22823]: [141-1] user=,db=,client= LOG: terminating any other active server processes
2015-03-25 13:17:49 GMT [24777]: [2-1] user=,db=,client= WARNING: terminating connection because of crash of another server process
2015-03-25 13:17:49 GMT [24777]: [3-1] user=,db=,client= DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-03-25 13:17:49 GMT [24777]: [4-1] user=,db=,client= HINT: In a moment you should be able to reconnect to the database and repeat your command.
#0 GrantLockLocal (locallock=locallock@entry=0xfbe7f0, owner=owner@entry=0x1046da0) at lock.c:1544
#1 0x000000000066975c in LockAcquireExtended (locktag=locktag@entry=0x7fffdcb0ea20, lockmode=1,
lockmode@entry=<error reading variable: Cannot access memory at address 0x7fffdcb0e9f0>, sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000',
reportMemoryError=reportMemoryError@entry=1 '\001', ) at lock.c:798
#2 0x000000000066a1c4 in LockAcquire (locktag=locktag@entry=0x7fffdcb0ea20, lockmode=<error reading variable: Cannot access memory at address 0x7fffdcb0e9f0>,
sessionLock=sessionLock@entry=0 '\000', dontWait=dontWait@entry=0 '\000') at lock.c:680
#3 0x0000000000667c48 in LockRelationOid (relid=<error reading variable: Cannot access memory at address 0x7fffdcb0e9e8>,
relid@entry=<error reading variable: Cannot access memory at address 0x7fffdcb0ea48>,
lockmode=<error reading variable: Cannot access memory at address 0x7fffdcb0e9f0>,
lockmode@entry=<error reading variable: Cannot access memory at address 0x7fffdcb0ea48>) at lmgr.c:94
#0 hash_search_with_hash_value (hashp=0x2a2c370, keyPtr=keyPtr@entry=0x7ffff5ad2230, hashvalue=hashvalue@entry=2114233864, action=action@entry=HASH_FIND,
foundPtr=foundPtr@entry=0x0) at dynahash.c:918
#1 0x0000000000654d1a in BufTableLookup (tagPtr=tagPtr@entry=0x7ffff5ad2230, hashcode=hashcode@entry=2114233864) at buf_table.c:96
#2 0x000000000065746b in BufferAlloc (foundPtr=0x7ffff5ad222f <Address 0x7ffff5ad222f out of bounds>, strategy=0x0,
blockNum=<error reading variable: Cannot access memory at address 0x7ffff5ad2204>,
forkNum=<error reading variable: Cannot access memory at address 0x7ffff5ad2208>,
relpersistence=<error reading variable: Cannot access memory at address 0x7ffff5ad2214>, smgr=0x2aaae00) at bufmgr.c:893
#3 ReadBuffer_common (smgr=0x2aaae00, relpersistence=<optimized out>, ) at bufmgr.c:641
#4 0x0000000000657e40 in ReadBufferExtended (reln=<error reading variable: Cannot access memory at address 0x7ffff5ad2278>,
reln@entry=<error reading variable: Cannot access memory at address 0x7ffff5ad22f8>, forkNum=MAIN_FORKNUM, blockNum=6, mode=<optimized out>,
strategy=<optimized out>) at bufmgr.c:560
#0 hash_search_with_hash_value (hashp=0x1d97370, keyPtr=keyPtr@entry=0x7ffff95855f0, hashvalue=hashvalue@entry=2382868486, action=action@entry=HASH_FIND,
foundPtr=foundPtr@entry=0x0) at dynahash.c:907
#1 0x0000000000654d1a in BufTableLookup (tagPtr=tagPtr@entry=0x7ffff95855f0, hashcode=hashcode@entry=2382868486) at buf_table.c:96
#2 0x000000000065746b in BufferAlloc (foundPtr=0x7ffff95855ef "", strategy=0x0, blockNum=9, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x1e15860)
at bufmgr.c:893
#3 ReadBuffer_common (smgr=0x1e15860, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=9, mode=RBM_NORMAL, strategy=0x0,
hit=hit@entry=0x7ffff958569f "") at bufmgr.c:641
#4 0x0000000000657e40 in ReadBufferExtended (reln=reln@entry=0x7f8a17bab2c0, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=9, mode=mode@entry=RBM_NORMAL,
strategy=strategy@entry=0x0) at bufmgr.c:560
#5 0x0000000000657f4d in ReadBuffer (blockNum=<optimized out>, reln=0x7f8a17bab2c0) at bufmgr.c:492
#6 ReleaseAndReadBuffer (buffer=buffer@entry=398111424, relation=relation@entry=0x1, blockNum=<optimized out>) at bufmgr.c:1403
#7 0x000000000049e6bf in _bt_relandgetbuf (rel=0x1, rel@entry=0x7f8a17bab2c0, obuf=398111424, blkno=blkno@entry=9, access=access@entry=1) at nbtpage.c:707
#8 0x00000000004a24b4 in _bt_search (rel=rel@entry=0x7f8a17bab2c0, keysz=keysz@entry=2, scankey=scankey@entry=0x7ffff95858b0, nextkey=nextkey@entry=0 '\000',
bufP=bufP@entry=0x7ffff95857ac, access=access@entry=1) at nbtsearch.c:131
#9 0x00000000004a2cb4 in _bt_first (scan=scan@entry=0x1eb2048, dir=dir@entry=ForwardScanDirection) at nbtsearch.c:940
#10 0x00000000004a1141 in btgettuple (fcinfo=<optimized out>) at nbtree.c:288
#11 0x0000000000759132 in FunctionCall2Coll (flinfo=flinfo@entry=0x1e34390, collation=collation@entry=0, arg1=arg1@entry=32186440, arg2=arg2@entry=1) at fmgr.c:1323
#12 0x000000000049b273 in index_getnext_tid (scan=scan@entry=0x1eb2048, direction=direction@entry=ForwardScanDirection) at indexam.c:462
#13 0x000000000049b450 in index_getnext (scan=0x1eb2048, direction=direction@entry=ForwardScanDirection) at indexam.c:602
#14 0x000000000049a9a9 in systable_getnext (sysscan=sysscan@entry=0x1eb1ff8) at genam.c:416
#15 0x0000000000740452 in SearchCatCache (cache=0x1ddf540, v1=<optimized out>, v2=<optimized out>, v3=<optimized out>, v4=<optimized out>) at catcache.c:1248
#16 0x000000000074bd06 in GetSysCacheOid (cacheId=cacheId@entry=44, key1=key1@entry=140226851237264, key2=<optimized out>, key3=key3@entry=0, key4=key4@entry=0)
at syscache.c:988
#17 0x000000000074d674 in get_relname_relid (relname=relname@entry=0x7f891ba7ed90 "pgbench_accounts_3", relnamespace=<optimized out>) at lsyscache.c:1602
#18 0x00000000004e1228 in RelationIsVisible (relid=relid@entry=16428) at namespace.c:740
#19 0x00000000004e4b6f in pg_table_is_visible (fcinfo=0x1e9dfc8) at namespace.c:4078
#20 0x0000000000595f72 in ExecMakeFunctionResultNoSets (fcache=0x1e9df58, econtext=0x1e99848, isNull=0x7ffff95871bf "", isDone=<optimized out>) at execQual.c:2015
#21 0x000000000059b469 in ExecQual (qual=qual@entry=0x1e9b368, econtext=econtext@entry=0x1e99848, resultForNull=resultForNull@entry=0 '\000') at execQual.c:5206
#22 0x000000000059b9a6 in ExecScan (node=node@entry=0x1e99738, accessMtd=accessMtd@entry=0x5ad780 <PartialSeqNext>,
recheckMtd=recheckMtd@entry=0x5ad770 <PartialSeqRecheck>) at execScan.c:195
#23 0x00000000005ad8d0 in ExecPartialSeqScan (node=node@entry=0x1e99738) at nodePartialSeqscan.c:241
#24 0x0000000000594f68 in ExecProcNode (node=0x1e99738) at execProcnode.c:422
#25 0x00000000005a39b6 in funnel_getnext (funnelstate=0x1e943c8) at nodeFunnel.c:308
#26 ExecFunnel (node=node@entry=0x1e943c8) at nodeFunnel.c:185
#27 0x0000000000594f58 in ExecProcNode (node=0x1e943c8) at execProcnode.c:426
#28 0x00000000005a0212 in ExecAppend (node=node@entry=0x1e941d8) at nodeAppend.c:209
#29 0x0000000000594fa8 in ExecProcNode (node=node@entry=0x1e941d8) at execProcnode.c:399
#30 0x00000000005a0c9e in agg_fill_hash_table (aggstate=0x1e93ba8) at nodeAgg.c:1353
#31 ExecAgg (node=node@entry=0x1e93ba8) at nodeAgg.c:1115
#32 0x0000000000594e38 in ExecProcNode (node=node@entry=0x1e93ba8) at execProcnode.c:506
#33 0x00000000005a8144 in ExecLimit (node=node@entry=0x1e93908) at nodeLimit.c:91
#34 0x0000000000594d98 in ExecProcNode (node=node@entry=0x1e93908) at execProcnode.c:530
#35 0x0000000000592380 in ExecutePlan (dest=0x7f891bbc9f10, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x1e93908,
#36 standard_ExecutorRun (queryDesc=0x1dbb800, direction=<optimized out>, count=0) at execMain.c:342
#37 0x000000000067e9a8 in PortalRunSelect (portal=0x1e639e0, portal@entry=<error reading variable: Cannot access memory at address 0x7ffff95874c8>,
forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:947
#0 ScanKeywordLookup (text=text@entry=0x1d57fa0 "information_schema_catalog_name", keywords=0x84f220 <ScanKeywords>, num_keywords=408) at kwlookup.c:64
#1 0x000000000070aa14 in quote_identifier (ident=0x1d57fa0 "information_schema_catalog_name") at ruleutils.c:9009
#2 0x00000000006f54bd in quote_ident (fcinfo=<optimized out>) at quote.c:31
#3 0x0000000000595f72 in ExecMakeFunctionResultNoSets (fcache=0x1d42cb8, econtext=0x1d3f848, isNull=0x1d42858 "", isDone=<optimized out>) at execQual.c:2015
#4 0x0000000000595f1d in ExecMakeFunctionResultNoSets (fcache=0x1d424a8, econtext=0x1d3f848, isNull=0x1d42048 "", isDone=<optimized out>) at execQual.c:1989
#5 0x0000000000595f1d in ExecMakeFunctionResultNoSets (fcache=0x1d41c98, econtext=0x1d3f848, isNull=0x7fff0bdc61df "", isDone=<optimized out>) at execQual.c:1989
#6 0x000000000059b469 in ExecQual (qual=qual@entry=0x1d41368, econtext=econtext@entry=0x1d3f848, resultForNull=resultForNull@entry=0 '\000') at execQual.c:5206
#7 0x000000000059b9a6 in ExecScan (node=node@entry=0x1d3f738, accessMtd=accessMtd@entry=0x5ad780 <PartialSeqNext>,
recheckMtd=recheckMtd@entry=0x5ad770 <PartialSeqRecheck>) at execScan.c:195
#8 0x00000000005ad8d0 in ExecPartialSeqScan (node=node@entry=0x1d3f738) at nodePartialSeqscan.c:241
#9 0x0000000000594f68 in ExecProcNode (node=0x1d3f738) at execProcnode.c:422
#10 0x00000000005a39b6 in funnel_getnext (funnelstate=0x1d3a3c8) at nodeFunnel.c:308
#11 ExecFunnel (node=node@entry=0x1d3a3c8) at nodeFunnel.c:185
#12 0x0000000000594f58 in ExecProcNode (node=0x1d3a3c8) at execProcnode.c:426
#13 0x00000000005a0212 in ExecAppend (node=node@entry=0x1d3a1d8) at nodeAppend.c:209
#14 0x0000000000594fa8 in ExecProcNode (node=node@entry=0x1d3a1d8) at execProcnode.c:399
#15 0x00000000005a0c9e in agg_fill_hash_table (aggstate=0x1d39ba8) at nodeAgg.c:1353
#16 ExecAgg (node=node@entry=0x1d39ba8) at nodeAgg.c:1115
#17 0x0000000000594e38 in ExecProcNode (node=node@entry=0x1d39ba8) at execProcnode.c:506
#18 0x00000000005a8144 in ExecLimit (node=node@entry=0x1d39908) at nodeLimit.c:91
#19 0x0000000000594d98 in ExecProcNode (node=node@entry=0x1d39908) at execProcnode.c:530
#20 0x0000000000592380 in ExecutePlan (dest=0x7fe8c8a1cf10, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, planstate=0x1d39908,
estate=0x1d01990) at execMain.c:1533
#21 standard_ExecutorRun (queryDesc=0x1c61800, direction=<optimized out>, count=0) at execMain.c:342
#22 0x000000000067e9a8 in PortalRunSelect (portal=portal@entry=0x1d099e0, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x7fe8c8a1cf10) at pquery.c:947
#23 0x000000000067fd0f in PortalRun (portal=portal@entry=0x1d099e0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001',
dest=dest@entry=0x7fe8c8a1cf10, altdest=altdest@entry=0x7fe8c8a1cf10, completionTag=completionTag@entry=0x7fff0bdc6790 "") at pquery.c:791
#24 0x000000000067dab8 in exec_simple_query (
query_string=0x1caf750 "SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f') AND substring(pg_catalog.quote_ident(c.relname),1,3)='pgb' AND pg_catalog.pg_table_is_v"...) at postgres.c:1107
#25 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1c3db60, dbname=0x1c3da18 "pgbench", username=<optimized out>) at postgres.c:4120
#26 0x0000000000462c8e in BackendRun (port=0x1c621f0) at postmaster.c:4148
#27 BackendStartup (port=0x1c621f0) at postmaster.c:3833
#28 ServerLoop () at postmaster.c:1601
#29 0x000000000062e803 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x1c3cca0) at postmaster.c:1248
#30 0x00000000004636dd in main (argc=1, argv=0x1c3cca0) at main.c:221
#0 0x000000000075d757 in hash_search_with_hash_value (hashp=0x1d62310, keyPtr=keyPtr@entry=0x7fffb686f4a0, hashvalue=hashvalue@entry=171639189,
action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7fffb686f44f <Address 0x7fffb686f44f out of bounds>) at dynahash.c:1026
#1 0x0000000000654d52 in BufTableInsert (tagPtr=tagPtr@entry=0x7fffb686f4a0, hashcode=hashcode@entry=171639189, buf_id=169) at buf_table.c:128
#2 0x0000000000657711 in BufferAlloc (foundPtr=0x7fffb686f49f <Address 0x7fffb686f49f out of bounds>, strategy=0x0, blockNum=11, forkNum=MAIN_FORKNUM,
relpersistence=<error reading variable: Cannot access memory at address 0x7fffb686f484>,
smgr=<error reading variable: Cannot access memory at address 0x7fffb686f488>) at bufmgr.c:1089
#3 ReadBuffer_common (smgr=<error reading variable: Cannot access memory at address 0x7fffb686f488>, relpersistence=<optimized out>, forkNum=MAIN_FORKNUM,
forkNum@entry=<error reading variable: Cannot access memory at address 0x7fffb686f4f0>, blockNum=11,
blockNum@entry=<error reading variable: Cannot access memory at address 0x7fffb686f4f8>, mode=RBM_NORMAL, strategy=0x0,
hit=hit@entry=0x7fffb686f54f <Address 0x7fffb686f54f out of bounds>) at bufmgr.c:641
#4 0x0000000000657e40 in ReadBufferExtended (reln=<error reading variable: Cannot access memory at address 0x7fffb686f4e8>,
reln@entry=<error reading variable: Cannot access memory at address 0x7fffb686f568>,
forkNum=<error reading variable: Cannot access memory at address 0x7fffb686f4f0>,
blockNum=<error reading variable: Cannot access memory at address 0x7fffb686f4f8>, mode=<optimized out>, strategy=<optimized out>) at bufmgr.c:560
>
> On 25 March 2015 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Fixed the reported issue on assess-parallel-safety thread and another
>> bug caught while testing joins and integrated with latest version of
>> parallel-mode patch (parallel-mode-v9 patch).
>>
>> Apart from that I have moved the Initialization of dsm segement from
>> InitNode phase to ExecFunnel() (on first execution) as per suggestion
>> from Robert. The main idea is that as it creates large shared memory
>> segment, so do the work when it is really required.
>>
>>
>> HEAD Commit-Id: 11226e38
>> parallel-mode-v9.patch [2]
>> assess-parallel-safety-v4.patch [1]
>> parallel-heap-scan.patch [3]
>> parallel_seqscan_v12.patch (Attached with this mail)
>>
>> [1] - http://www.postgresql.org/message-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
>> [2] - http://www.postgresql.org/message-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
>> [3] - http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
>
>
> Okay, with my pgbench_accounts partitioned into 300, I ran:
>
> SELECT DISTINCT bid FROM pgbench_accounts;
>
> The query never returns,
>
> grep -r 'starting background worker process "parallel worker for PID 12165"' postgresql-2015-03-25_112522.log | wc -l
> 2496
>
> 2,496 workers? This is with parallel_seqscan_degree set to 8. If I set it to 2, this number goes down to 626, and with 16, goes up to 4320.
>
> Still not sure why 8 workers are needed for each partial scan. I would expect 8 workers to be used for 8 separate scans. Perhaps this is just my misunderstanding of how this feature works.
>
On Wed, Mar 25, 2015 at 5:16 PM, Thom Brown <thom@linux.com> wrote:
>
> On 25 March 2015 at 10:27, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Fixed the reported issue on assess-parallel-safety thread and another
>> bug caught while testing joins and integrated with latest version of
>> parallel-mode patch (parallel-mode-v9 patch).
>>
>> Apart from that I have moved the Initialization of dsm segement from
>> InitNode phase to ExecFunnel() (on first execution) as per suggestion
>> from Robert. The main idea is that as it creates large shared memory
>> segment, so do the work when it is really required.
>>
>>
>> HEAD Commit-Id: 11226e38
>> parallel-mode-v9.patch [2]
>> assess-parallel-safety-v4.patch [1]
>> parallel-heap-scan.patch [3]
>> parallel_seqscan_v12.patch (Attached with this mail)
>>
>> [1] - http://www.postgresql.org/message-id/CA+TgmobJSuefiPOk6+i9WERUgeAB3ggJv7JxLX+r6S5SYydBRQ@mail.gmail.com
>> [2] - http://www.postgresql.org/message-id/CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
>> [3] - http://www.postgresql.org/message-id/CA+TgmoYJETgeAXUsZROnA7BdtWzPtqExPJNTV1GKcaVMgSdhug@mail.gmail.com
>
>
> Okay, with my pgbench_accounts partitioned into 300, I ran:
>
> SELECT DISTINCT bid FROM pgbench_accounts;
>
> The query never returns,

You seem to be hitting the issue I have pointed in near-by thread [1]
and I have mentioned the same while replying on assess-parallel-safety
thread. Can you check after applying the patch in mail [1]
parallel-mode-v9.patch
assess-parallel-safety-v4.patch
parallel-heap-scan.patch
parallel_seqscan_v12.patch
release_lock_dsm_v1.patch
> and I also get this:
>
> grep -r 'starting background worker process "parallel worker for PID 12165"' postgresql-2015-03-25_112522.log | wc -l
> 2496
>
> 2,496 workers? This is with parallel_seqscan_degree set to 8. If I set it to 2, this number goes down to 626, and with 16, goes up to 4320.
>
> Still not sure why 8 workers are needed for each partial scan. I would expect 8 workers to be used for 8 separate scans. Perhaps this is just my misunderstanding of how this feature works.
>

The reason is that for each table scan, it tries to use workers
equal to parallel_seqscan_degree if they are available and in this
case as the scan for inheritance hierarchy (tables in hierarchy) happens
one after another, it uses 8 workers for each scan. I think as of now
the strategy to decide number of workers to be used in scan is kept
simple and in future we can try to come with some better mechanism
to decide number of workers.
--
>
> On 25 March 2015 at 15:49, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Mar 25, 2015 at 5:16 PM, Thom Brown <thom@linux.com> wrote:
>> > Okay, with my pgbench_accounts partitioned into 300, I ran:
>> >
>> > SELECT DISTINCT bid FROM pgbench_accounts;
>> >
>> > The query never returns,
>>
>> You seem to be hitting the issue I have pointed in near-by thread [1]
>> and I have mentioned the same while replying on assess-parallel-safety
>> thread. Can you check after applying the patch in mail [1]
>
>
> Ah, okay, here's the patches I've now applied:
>
> parallel-mode-v9.patch
> assess-parallel-safety-v4.patch
> parallel-heap-scan.patch
> parallel_seqscan_v12.patch
> release_lock_dsm_v1.patch
>
> (with perl patch for pg_proc.h)
>
> The query now returns successfully.
>
>> >
>> > Still not sure why 8 workers are needed for each partial scan. I would expect 8 workers to be used for 8 separate scans. Perhaps this is just my misunderstanding of how this feature works.
>> >
>>
>> The reason is that for each table scan, it tries to use workers
>> equal to parallel_seqscan_degree if they are available and in this
>> case as the scan for inheritance hierarchy (tables in hierarchy) happens
>> one after another, it uses 8 workers for each scan. I think as of now
>> the strategy to decide number of workers to be used in scan is kept
>> simple and in future we can try to come with some better mechanism
>> to decide number of workers.
>
>
> Yes, I was expecting the parallel aspect to apply across partitions (a worker per partition up to parallel_seqscan_degree and reallocate to another scan once finished with current job), not individual ones,
parallel_setup_cost, parallel_startup_cost) for costing of parallel plans, so
>
> On 25 March 2015 at 11:46, Thom Brown <thom@linux.com> wrote:
>>
>>
>> Still not sure why 8 workers are needed for each partial scan. I would expect 8 workers to be used for 8 separate scans. Perhaps this is just my misunderstanding of how this feature works.
>
>
> Another issue:
>
> SELECT * FROM pgb<tab>
>
> *crash*
>
The reason of this problem is that above tab-completion is executing
query [1] which contains subplan for the funnel node and currently
we don't have capability (enough infrastructure) to support execution
of subplans by parallel workers. Here one might wonder why we
have choosen Parallel Plan (Funnel node) for such a case and the
reason for same is that subplans are attached after Plan generation
(SS_finalize_plan()) and if want to discard such a plan, it will be
much more costly, tedious and not worth the effort as we have to
eventually make such a plan work.

Here we have two choices to proceed, first one is to support execution
of subplans by parallel workers and second is execute/scan locally for
Funnel node having subplan (don't launch workers).
Attachment
On Fri, Mar 27, 2015 at 2:34 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The reason of this problem is that above tab-completion is executing
> query [1] which contains subplan for the funnel node and currently
> we don't have capability (enough infrastructure) to support execution
> of subplans by parallel workers. Here one might wonder why we
> have choosen Parallel Plan (Funnel node) for such a case and the
> reason for same is that subplans are attached after Plan generation
> (SS_finalize_plan()) and if want to discard such a plan, it will be
> much more costly, tedious and not worth the effort as we have to
> eventually make such a plan work.
>
> Here we have two choices to proceed, first one is to support execution
> of subplans by parallel workers and second is execute/scan locally for
> Funnel node having subplan (don't launch workers).

It looks to me like this is an InitPlan, not a subplan. There
shouldn't be any problem with a Funnel node having an InitPlan; it
looks to me like all of the InitPlan stuff is handled by common code
within the executor (grep for initPlan), so it ought to work here the
same as it does for anything else. What I suspect is failing
(although you aren't being very clear about it here) is the passing
down of the parameters set by the InitPlan to the workers. I think we
need to make that work; it's an integral piece of the executor
infrastructure and we shouldn't leave it out just because it requires
a bit more IPC.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think I figured out the problem. That fix only helps in the case
>> where the postmaster noticed the new registration previously but
>> didn't start the worker, and then later notices the termination.
>> What's much more likely to happen is that the worker is started and
>> terminated so quickly that both happen before we create a
>> RegisteredBgWorker for it. The attached patch fixes that case, too.
>
> Patch fixes the problem and now for Rescan, we don't need to Wait
> for workers to finish.

I realized that there is a problem with this. If an error occurs in
one of the workers just as we're deciding to kill them all, then the
error won't be reported. Also, the new code to propagate
XactLastRecEnd won't work right, either. I think we need to find a
way to shut down the workers cleanly. The idea generally speaking
should be:

1. Tell all of the workers that we want them to shut down gracefully
without finishing the scan.

2. Wait for them to exit via WaitForParallelWorkersToFinish().

My first idea about how to implement this is to have the master detach
all of the tuple queues via a new function TupleQueueFunnelShutdown().
Then, we should change tqueueReceiveSlot() so that it does not throw
an error when shm_mq_send() returns SHM_MQ_DETACHED. We could modify
the receiveSlot method of a DestReceiver to return bool rather than
void; a "true" value can mean "continue processing" whereas a "false"
value can mean "stop early, just as if we'd reached the end of the
scan". This design will cause each parallel worker to finish
producing the tuple it's currently in the middle of generating, and
then shut down.

You can imagine cases where we'd want the worker to respond faster
than that, though; for example, if it's applying a highly selective
filter condition, we'd like it to stop the scan right away, not when
it finds the next matching tuple. I can't immediately see a real
clean way of accomplishing that, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
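[A sketch of what the bool-returning receiveSlot could look like under
this proposal; the struct layout and names are guesses, while
shm_mq_send() and ExecMaterializeSlot() are the real APIs of that era.]

/* Assumed receiver state; the real struct would live in tqueue.c. */
typedef struct TQueueDestReceiver
{
    DestReceiver pub;           /* public DestReceiver part, must be first */
    shm_mq_handle *handle;      /* queue back to the master */
} TQueueDestReceiver;

/*
 * Sketch: a detached queue means the master called
 * TupleQueueFunnelShutdown(), so report "stop early" instead of
 * raising an error.
 */
static bool
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
    TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
    HeapTuple   tuple = ExecMaterializeSlot(slot);
    shm_mq_result result;

    result = shm_mq_send(tqueue->handle, tuple->t_len, tuple->t_data, false);
    if (result == SHM_MQ_DETACHED)
        return false;           /* master gave up on us; end the scan */

    return true;                /* keep producing tuples */
}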
On Wed, Mar 25, 2015 at 6:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Apart from that I have moved the Initialization of dsm segement from
> InitNode phase to ExecFunnel() (on first execution) as per suggestion
> from Robert. The main idea is that as it creates large shared memory
> segment, so do the work when it is really required.

So, suppose we have a plan like this:

Append
-> Funnel
  -> Partial Seq Scan
-> Funnel
  -> Partial Seq Scan
(repeated many times)

In earlier versions of this patch, that was chewing up lots of DSM
segments. But it seems to me, on further reflection, that it should
never use more than one at a time. The first funnel node should
initialize its workers and then when it finishes, all those workers
should get shut down cleanly and the DSM destroyed before the next
scan is initialized.

Obviously we could do better here: if we put the Funnel on top of the
Append instead of underneath it, we could avoid shutting down and
restarting workers for every child node. But even without that, I'm
hoping it's no longer the case that this uses more than one DSM at a
time. If that's not the case, we should see if we can't fix that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Fri, Mar 27, 2015 at 2:34 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The reason of this problem is that above tab-completion is executing
> > query [1] which contains subplan for the funnel node and currently
> > we don't have capability (enough infrastructure) to support execution
> > of subplans by parallel workers. Here one might wonder why we
> > have choosen Parallel Plan (Funnel node) for such a case and the
> > reason for same is that subplans are attached after Plan generation
> > (SS_finalize_plan()) and if want to discard such a plan, it will be
> > much more costly, tedious and not worth the effort as we have to
> > eventually make such a plan work.
> >
> > Here we have two choices to proceed, first one is to support execution
> > of subplans by parallel workers and second is execute/scan locally for
> > Funnel node having subplan (don't launch workers).
>
> It looks to me like the is an InitPlan, not a subplan. There
> shouldn't be any problem with a Funnel node having an InitPlan; it
> looks to me like all of the InitPlan stuff is handled by common code
> within the executor (grep for initPlan), so it ought to work here the
> same as it does for anything else. What I suspect is failing
> (although you aren't being very clear about it here) is the passing
> down of the parameters set by the InitPlan to the workers.
>
> On Wed, Mar 25, 2015 at 6:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Apart from that I have moved the Initialization of dsm segement from
> > InitNode phase to ExecFunnel() (on first execution) as per suggestion
> > from Robert. The main idea is that as it creates large shared memory
> > segment, so do the work when it is really required.
>
> So, suppose we have a plan like this:
>
> Append
> -> Funnel
> -> Partial Seq Scan
> -> Funnel
> -> Partial Seq Scan
> (repeated many times)
>
> In earlier versions of this patch, that was chewing up lots of DSM
> segments. But it seems to me, on further reflection, that it should
> never use more than one at a time. The first funnel node should
> initialize its workers and then when it finishes, all those workers
> should get shut down cleanly and the DSM destroyed before the next
> scan is initialized.
>
> Obviously we could do better here: if we put the Funnel on top of the
> Append instead of underneath it, we could avoid shutting down and
> restarting workers for every child node. But even without that, I'm
> hoping it's no longer the case that this uses more than one DSM at a
> time. If that's not the case, we should see if we can't fix that.
>
Currently it doesn't behave you are expecting, it destroys the DSM and
perform clean shutdown of workers (DestroyParallelContext()) at the
time of ExecEndFunnel() which in this case happens when we finish
Execution of AppendNode.

One way to change it is do the clean up for parallel context when we
fetch last tuple from the FunnelNode (into ExecFunnel) as at that point
we are sure that we don't need workers or dsm anymore. Does that
sound reasonable to you?
>
> On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I think I figured out the problem. That fix only helps in the case
> >> where the postmaster noticed the new registration previously but
> >> didn't start the worker, and then later notices the termination.
> >> What's much more likely to happen is that the worker is started and
> >> terminated so quickly that both happen before we create a
> >> RegisteredBgWorker for it. The attached patch fixes that case, too.
> >
> > Patch fixes the problem and now for Rescan, we don't need to Wait
> > for workers to finish.
>
> I realized that there is a problem with this. If an error occurs in
> one of the workers just as we're deciding to kill them all, then the
> error won't be reported. Also, the new code to propagate
> XactLastRecEnd won't work right, either.
+ latestXid = RecordTransactionAbort(false);
+ else
+ {
+ latestXid = InvalidTransactionId;
+
+ /*
+ * Since the parallel master won't get our value of XactLastRecEnd in this
+ * case, we nudge WAL-writer ourselves in this case. See related comments in
+ * RecordTransactionAbort for why this matters.
+ */
+ XLogSetAsyncXactLSN(XactLastRecEnd);
+ }
On Tue, Mar 31, 2015 at 8:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> It looks to me like the is an InitPlan, not a subplan. There
>> shouldn't be any problem with a Funnel node having an InitPlan; it
>> looks to me like all of the InitPlan stuff is handled by common code
>> within the executor (grep for initPlan), so it ought to work here the
>> same as it does for anything else. What I suspect is failing
>> (although you aren't being very clear about it here) is the passing
>> down of the parameters set by the InitPlan to the workers.
>
> It is failing because we are not passing InitPlan itself (InitPlan is
> nothing but a list of SubPlan) and I tried tried to describe in previous
> mail [1] what we need to do to achieve the same, but in short, it is not
> difficult to pass down the required parameters (like plan->InitPlan or
> plannedstmt->subplans), rather the main missing part is the handling
> of such parameters in worker side (mainly we need to provide support
> for all plan nodes which can be passed as part of InitPlan in readfuncs.c).
> I am not against supporting InitPlan's on worker side, but just wanted to
> say that if possible why not leave that for first version.

Well, if we *don't* handle it, we're going to need to insert some hack
to ensure that the planner doesn't create plans. And that seems
pretty unappealing. Maybe it'll significantly compromise plan
quality, and maybe it won't, but at the least, it's ugly.

> [1]
> I have tried to evaluate what it would take us to support execution
> of subplans by parallel workers. We need to pass the sub plans
> stored in Funnel Node (initPlan) and corresponding subplans stored
> in planned statement (subplans) as subplan's stored in Funnel node
> has reference to subplans in planned statement. Next currently
> readfuncs.c (functions to read different type of nodes) doesn't support
> reading any type of plan node, so we need to add support for reading all
> kind
> of plan nodes (as subplan can have any type of plan node) and similarly
> to execute any type of Plan node, we might need more work (infrastructure).

I don't think you need to do anything that complicated. I'm not
proposing to *run* the initPlan in the workers, just to pass the
parameter values down.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 1, 2015 at 6:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 30, 2015 at 8:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> So, suppose we have a plan like this:
>>
>> Append
>> -> Funnel
>>   -> Partial Seq Scan
>> -> Funnel
>>   -> Partial Seq Scan
>> (repeated many times)
>>
>> In earlier versions of this patch, that was chewing up lots of DSM
>> segments. But it seems to me, on further reflection, that it should
>> never use more than one at a time. The first funnel node should
>> initialize its workers and then when it finishes, all those workers
>> should get shut down cleanly and the DSM destroyed before the next
>> scan is initialized.
>>
>> Obviously we could do better here: if we put the Funnel on top of the
>> Append instead of underneath it, we could avoid shutting down and
>> restarting workers for every child node. But even without that, I'm
>> hoping it's no longer the case that this uses more than one DSM at a
>> time. If that's not the case, we should see if we can't fix that.
>>
> Currently it doesn't behave you are expecting, it destroys the DSM and
> perform clean shutdown of workers (DestroyParallelContext()) at the
> time of ExecEndFunnel() which in this case happens when we finish
> Execution of AppendNode.
>
> One way to change it is do the clean up for parallel context when we
> fetch last tuple from the FunnelNode (into ExecFunnel) as at that point
> we are sure that we don't need workers or dsm anymore. Does that
> sound reasonable to you?

Yeah, I think that's exactly what we should do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
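[The agreed change is small; in ExecFunnel() terms it could be on the
order of the sketch below. The "pcxt" field and funnel_getnext() follow
the patch's apparent naming and are assumptions;
DestroyParallelContext() is the real parallel-mode API.]

/*
 * Sketch of the tail of ExecFunnel(): tear the parallel context down
 * as soon as the funnel runs dry, rather than in ExecEndFunnel(), so
 * that an Append over many funnels holds at most one DSM segment at a
 * time.
 */
slot = funnel_getnext(funnelstate);
if (TupIsNull(slot) && funnelstate->pcxt != NULL)
{
    /* Waits for workers to exit, then destroys the DSM segment. */
    DestroyParallelContext(funnelstate->pcxt);
    funnelstate->pcxt = NULL;
}
return slot;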
On Wed, Apr 1, 2015 at 7:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > Patch fixes the problem and now for Rescan, we don't need to Wait
>> > for workers to finish.
>>
>> I realized that there is a problem with this. If an error occurs in
>> one of the workers just as we're deciding to kill them all, then the
>> error won't be reported.
>
> We are sending SIGTERM to worker for terminating the worker, so
> if the error occurs before the signal is received then it should be
> sent to master backend. Am I missing something here?

The master only checks for messages at intervals - each
CHECK_FOR_INTERRUPTS(), basically. So when the master terminates the
workers, any errors generated after the last check for messages will
be lost.

>> Also, the new code to propagate
>> XactLastRecEnd won't work right, either.
>
> As we are generating FATAL error on termination of worker
> (bgworker_die()), so won't it be handled in AbortTransaction path
> by below code in parallel-mode patch?

That will asynchronously flush the WAL, but if the master goes on to
commit, we'll wait synchronously for WAL flush, and possibly sync rep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Tue, Mar 31, 2015 at 8:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> It looks to me like the is an InitPlan, not a subplan. There
> >> shouldn't be any problem with a Funnel node having an InitPlan; it
> >> looks to me like all of the InitPlan stuff is handled by common code
> >> within the executor (grep for initPlan), so it ought to work here the
> >> same as it does for anything else. What I suspect is failing
> >> (although you aren't being very clear about it here) is the passing
> >> down of the parameters set by the InitPlan to the workers.
> >
> > It is failing because we are not passing InitPlan itself (InitPlan is
> > nothing but a list of SubPlan) and I tried tried to describe in previous
> > mail [1] what we need to do to achieve the same, but in short, it is not
> > difficult to pass down the required parameters (like plan->InitPlan or
> > plannedstmt->subplans), rather the main missing part is the handling
> > of such parameters in worker side (mainly we need to provide support
> > for all plan nodes which can be passed as part of InitPlan in readfuncs.c).
> > I am not against supporting InitPlan's on worker side, but just wanted to
> > say that if possible why not leave that for first version.
>
> Well, if we *don't* handle it, we're going to need to insert some hack
> to ensure that the planner doesn't create plans. And that seems
> pretty unappealing. Maybe it'll significantly compromise plan
> quality, and maybe it won't, but at the least, it's ugly.
>
> > [1]
> > I have tried to evaluate what it would take us to support execution
> > of subplans by parallel workers. We need to pass the sub plans
> > stored in Funnel Node (initPlan) and corresponding subplans stored
> > in planned statement (subplans) as subplan's stored in Funnel node
> > has reference to subplans in planned statement. Next currently
> > readfuncs.c (functions to read different type of nodes) doesn't support
> > reading any type of plan node, so we need to add support for reading all
> > kind
> > of plan nodes (as subplan can have any type of plan node) and similarly
> > to execute any type of Plan node, we might need more work (infrastructure).
>
> I don't think you need to do anything that complicated. I'm not
> proposing to *run* the initPlan in the workers, just to pass the
> parameter values down.
>
Sorry, but I am not able to understand how it will help if just parameters
(If I understand correctly you want to say about Bitmapset *extParam;
Bitmapset *allParam; in Plan structure) are passed to workers and I
think they are already getting passed only initPlan and related Subplan
in planned statement is not passed and the reason is that ss_finalize_plan()
attaches initPlan to top node (which in this case is Funnel node and not
PartialSeqScan)

By any chance, do you mean that we run the part of the statement in
workers and then run initPlan in master backend?
On Wed, Apr 1, 2015 at 10:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Well, if we *don't* handle it, we're going to need to insert some hack
>> to ensure that the planner doesn't create plans. And that seems
>> pretty unappealing. Maybe it'll significantly compromise plan
>> quality, and maybe it won't, but at the least, it's ugly.
>
> I also think changing anything in planner related to this is not a
> good idea, but what about detecting this during execution (into
> ExecFunnel) and then just run the plan locally (by master backend)?

That seems like an even bigger hack; we want to minimize the number of
cases where we create a parallel plan and then don't go parallel.
Doing that in the hopefully-rare case where we manage to blow out the
DSM segments seems OK, but doing it every time a plan of a certain
type gets created doesn't seem very appealing to me.

>> > [1]
>> > I have tried to evaluate what it would take us to support execution
>> > of subplans by parallel workers. We need to pass the sub plans
>> > stored in Funnel Node (initPlan) and corresponding subplans stored
>> > in planned statement (subplans) as subplan's stored in Funnel node
>> > has reference to subplans in planned statement. Next currently
>> > readfuncs.c (functions to read different type of nodes) doesn't support
>> > reading any type of plan node, so we need to add support for reading all
>> > kind
>> > of plan nodes (as subplan can have any type of plan node) and similarly
>> > to execute any type of Plan node, we might need more work
>> > (infrastructure).
>>
>> I don't think you need to do anything that complicated. I'm not
>> proposing to *run* the initPlan in the workers, just to pass the
>> parameter values down.
>
> Sorry, but I am not able to understand how it will help if just parameters
> (If I understand correctly you want to say about Bitmapset *extParam;
> Bitmapset *allParam; in Plan structure) are passed to workers and I
> think they are already getting passed only initPlan and related Subplan
> in planned statement is not passed and the reason is that ss_finalize_plan()
> attaches initPlan to top node (which in this case is Funnel node and not
> PartialSeqScan)
>
> By any chance, do you mean that we run the part of the statement in
> workers and then run initPlan in master backend?

If I'm not confused, it would be the other way around. We would run
the initPlan in the master backend *first* and then the rest in the
workers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> >> I don't think you need to do anything that complicated. I'm not
> >> proposing to *run* the initPlan in the workers, just to pass the
> >> parameter values down.
> >
> > Sorry, but I am not able to understand how it will help if just parameters
> > (If I understand correctly you want to say about Bitmapset *extParam;
> > Bitmapset *allParam; in Plan structure) are passed to workers and I
> > think they are already getting passed only initPlan and related Subplan
> > in planned statement is not passed and the reason is that ss_finalize_plan()
> > attaches initPlan to top node (which in this case is Funnel node and not
> > PartialSeqScan)
> >
> > By any chance, do you mean that we run the part of the statement in
> > workers and then run initPlan in master backend?
>
> If I'm not confused, it would be the other way around. We would run
> the initPlan in the master backend *first* and then the rest in the
> workers.
>
>
> On Wed, Apr 1, 2015 at 7:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Also, the new code to propagate
> >> XactLastRecEnd won't work right, either.
> >
> > As we are generating FATAL error on termination of worker
> > (bgworker_die()), so won't it be handled in AbortTransaction path
> > by below code in parallel-mode patch?
>
> That will asynchronously flush the WAL, but if the master goes on to
> commit, we've wait synchronously for WAL flush, and possibly sync rep.
>
Okay, so you mean if master backend later commits, then there is
a chance of loss of WAL data written by worker.
Can't we report the location to master as the patch does in case of
Commit in worker?
On Thu, Apr 2, 2015 at 2:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> If I'm not confused, it would be the other way around. We would run
>> the initPlan in the master backend *first* and then the rest in the
>> workers.
>
> Either one of us is confused, let me try to describe my understanding in
> somewhat more detail. Let me try to explain w.r.t the tab completion
> query [1]. In this, the initPlan is generated for Qualification expression
> [2], so it will be executed during qualification and the callstack will
> look like:
>
> postgres.exe!ExecSeqScan(ScanState * node=0x000000000c33bce8) Line 113 C
> postgres.exe!ExecProcNode(PlanState * node=0x000000000c33bce8) Line 418 + 0xa bytes C
> postgres.exe!ExecSetParamPlan(SubPlanState * node=0x000000000c343930, ExprContext * econtext=0x000000000c33de50) Line 1001 + 0xa bytes C
>> postgres.exe!ExecEvalParamExec(ExprState * exprstate=0x000000000c33f980, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000c33f481, ExprDoneCond * isDone=0x0000000000000000) Line 1111 C
> postgres.exe!ExecMakeFunctionResultNoSets(FuncExprState * fcache=0x000000000c33f0d0, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000042f1c8, ExprDoneCond * isDone=0x0000000000000000) Line 1992 + 0x2d bytes C
> postgres.exe!ExecEvalOper(FuncExprState * fcache=0x000000000c33f0d0, ExprContext * econtext=0x000000000c33de50, char * isNull=0x000000000042f1c8, ExprDoneCond * isDone=0x0000000000000000) Line 2443 C
> postgres.exe!ExecQual(List * qual=0x000000000c33fa08, ExprContext * econtext=0x000000000c33de50, char resultForNull=0) Line 5206 + 0x1a bytes C
> postgres.exe!ExecScan(ScanState * node=0x000000000c33dd38, TupleTableSlot * (ScanState *)* accessMtd=0x0000000140232940, char (ScanState *, TupleTableSlot *)* recheckMtd=0x00000001402329e0) Line 195 + 0x1a bytes C
> postgres.exe!ExecSeqScan(ScanState * node=0x000000000c33dd38) Line 114 C
>
> Basically here initPlan is getting executed during Qualification.

OK, I failed to realize that the initPlan doesn't get evaluated until
first use. Maybe in the case of a funnel node we should force all of
the initplans to be run before starting parallelism, so that we can
pass down the resulting value to each worker. If we try to push the
whole plan tree down from the worker then, aside from the issue of
needing to copy the plan tree, it'll get evaluated N times instead of
once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
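[Forcing the initPlans to run up front could look roughly like this in
the funnel's startup path. ExecSetParamPlan() and the initPlan list are
real executor infrastructure; SerializeParamExecValues() is a made-up
placeholder for the extra IPC that would ship the values to workers.]

/*
 * Sketch: evaluate every initPlan attached to the Funnel node in the
 * master before launching workers, then ship only the resulting
 * PARAM_EXEC values instead of the initPlan trees themselves.
 */
EState     *estate = funnelstate->ss.ps.state;
ListCell   *lc;

foreach(lc, funnelstate->ss.ps.initPlan)
{
    SubPlanState *sps = (SubPlanState *) lfirst(lc);

    /* Run it once, now, instead of lazily on first parameter use. */
    ExecSetParamPlan(sps, funnelstate->ss.ps.ps_ExprContext);
}

/* Hypothetical helper: copy estate->es_param_exec_vals into the DSM. */
SerializeParamExecValues(estate, funnelstate->pcxt);

[This also sidesteps the N-times-evaluation problem Robert mentions,
since the initPlan runs exactly once, in the master.]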
On Thu, Apr 2, 2015 at 3:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 1, 2015 at 6:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Apr 1, 2015 at 7:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >> Also, the new code to propagate
>> >> XactLastRecEnd won't work right, either.
>> >
>> > As we are generating FATAL error on termination of worker
>> > (bgworker_die()), so won't it be handled in AbortTransaction path
>> > by below code in parallel-mode patch?
>>
>> That will asynchronously flush the WAL, but if the master goes on to
>> commit, we've wait synchronously for WAL flush, and possibly sync rep.
>
> Okay, so you mean if master backend later commits, then there is
> a chance of loss of WAL data written by worker.
> Can't we report the location to master as the patch does in case of
> Commit in worker?

That's exactly why I think it needs to call
WaitForParallelWorkersToFinish() - because it will do just that. We
only need to invent a way of telling the worker to stop the scan and
shut down cleanly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I think we're getting to the point where having a unique mapping from
the plan to the execution tree is proving to be rather limiting
anyway. Check, for example, the discussion about join removal. But even
for the current code, showing only the custom plans for the first five
EXPLAIN EXECUTEs is pretty nasty (try explaining that to somebody who
doesn't know pg internals; their looks are worth gold and can kill you
at the same time) and should be done differently.
David Rowley <dgrowleyml@gmail.com> wrote:

> If we attempt to do this parallel stuff at plan time, and we
> happen to plan at some quiet period, or perhaps worse, some
> application's start-up process happens to PREPARE a load of
> queries when the database is nice and quiet, then quite possibly
> we'll end up with some highly parallel queries. Then perhaps come
> the time these queries are actually executed the server is very
> busy... Things will fall apart quite quickly due to the masses of
> IPC and context switches that would be going on.
>
> I completely understand that this parallel query stuff is all
> quite new to us all and we're likely still trying to nail down
> the correct infrastructure for it to work well, so this is why
> I'm proposing that the planner should know nothing of parallel
> query; instead I think it should work more along the lines of:
>
> * Planner should be completely oblivious to what parallel query
> is.
> * Before executor startup the plan is passed to a function which
> decides if we should parallelise it, and does so if the plan
> meets the correct requirements. This should likely have a very
> fast exit path such as:
> if root node's cost < parallel_query_cost_threshold
> return; /* the query is not expensive enough to attempt to make parallel */
>
> The above check will allow us to have an almost zero overhead for
> small low cost queries.
>
> This function would likely also have some sort of logic in order
> to determine if the server has enough spare resource at the
> current point in time to allow queries to be parallelised

There is a lot to like about this suggestion.

I've seen enough performance crashes due to too many concurrent processes (even when each connection can only use a single process) to believe that, for a plan which will be saved, it is not possible to know at planning time whether parallelization will be a nice win or a devastating over-saturation of resources during some later execution phase.

Another thing to consider is that this is not entirely unrelated to the concept of admission control policies. Perhaps this phase could be a more general execution start-up admission control phase, where parallel processing would be one adjustment that could be considered. Initially it might be the *only* consideration, but it might be good to try to frame it in a way that allowed implementation of other policies, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
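For concreteness, here is a rough sketch of the pre-execution hook David is describing. This is strictly an illustration: parallel_query_cost_threshold, have_spare_resources(), and inject_funnel_nodes() are all hypothetical names, not code from any posted patch.

#include "postgres.h"
#include "nodes/plannodes.h"

/* hypothetical GUC and helpers */
extern double parallel_query_cost_threshold;
extern bool have_spare_resources(void);
extern void inject_funnel_nodes(PlannedStmt *stmt);

/*
 * Sketch of a "parallelizer" run between planning and executor startup.
 * The fast exit keeps the overhead near zero for cheap queries.
 */
static void
maybe_parallelise(PlannedStmt *stmt)
{
    /* fast exit: query not expensive enough to bother parallelising */
    if (stmt->planTree->total_cost < parallel_query_cost_threshold)
        return;

    /* only parallelise if the machine currently has capacity to spare */
    if (!have_spare_resources())
        return;

    /* rewrite qualifying parts of the plan tree to run under Funnel nodes */
    inject_funnel_nodes(stmt);
}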
On Sat, Apr 4, 2015 at 5:19 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> Going over the previous emails in this thread I see that it has been a long
> time since anyone discussed anything around how we might decide at planning
> time how many workers should be used for the query, and from the emails I
> don't recall anyone proposing a good idea about how this might be done, and
> I for one can't see how this is at all possible to do at planning time.
>
> I think that the planner should know nothing of parallel query at all, and
> the planner quite possibly should go completely unmodified for this patch.
> One major problem I can see is that, given a query such as:
>
> SELECT * FROM million_row_product_table WHERE category = 'ELECTRONICS';
>
> where we have a non-unique index on category, some plans which may be
> considered might be:
>
> 1. Index scan on the category index to get all rows matching 'ELECTRONICS'
> 2. Sequence scan on the table, filter matching rows.
> 3. Parallel plan which performs a series of partial sequence scans pulling
> out all matching rows.
>
> I really think that if we end up choosing things like plan 3, when plan 2 was
> thrown out because of its cost, then we'll end up consuming more CPU and I/O
> than we can possibly justify using. The environmentalist in me screams that
> this is wrong. What if we kicked off 128 worker processes on some high-end
> hardware to do this? I certainly wouldn't want to pay the power bill. I
> understand there's costing built in to perhaps stop this, but I still think
> it's wrong headed, and we need to still choose the fastest non-parallel plan
> and only consider parallelising that later.

I agree that this is an area that needs more thought. I don't (currently, anyway) agree that the planner shouldn't know anything about parallelism. The problem with that is that there's lots of relevant stuff that can only be known at plan time. For example, consider the query you mention above on a table with no index. If the WHERE clause is highly selective, a parallel plan may well be best. But if the selectivity is only, say, 50%, a parallel plan is stupid: the IPC costs of shipping many rows back to the master will overwhelm any benefit we could possibly have hoped to get, and the overall result will likely be that the parallel plan both runs slower and uses more resources. At plan time, we have the selectivity information conveniently at hand, and can use that as part of the cost model to make educated decisions. Execution time is way too late to be thinking about those kinds of questions.

I think one of the philosophical questions that has to be answered here is "what does it mean to talk about the cost of a parallel plan?". For a non-parallel plan, the cost of the plan means both "the amount of effort we will spend executing the plan" and also "the amount of time we think the plan will take to complete", but those two things are different for parallel plans. I'm inclined to think it's right to view the cost of a parallel plan as a proxy for execution time, because the fundamental principle of the planner is that we pick the lowest-cost plan. But there also clearly needs to be some way to prevent the selection of a plan which runs slightly faster at the cost of using vastly more resources.

Currently, the planner tracks the best unsorted path for each relation as well as the best path for each useful sort order. Suppose we treat parallelism as another axis for judging the quality of a plan: we keep the best unsorted, non-parallel path; the best non-parallel path for each useful sort order; the best unsorted, parallel path; and the best parallel path for each sort order. Each time we plan a node, we generate non-parallel paths first, and then parallel paths. But, if a parallel plan isn't markedly faster than the non-parallel plan for the same sort order, then we discard it. I'm not sure exactly what the thresholds should be here, and they probably need to be configurable, because on a single-user system with excess capacity available it may be absolutely desirable to use ten times the resources to get an answer 25% faster, but on a heavily-loaded system that will stink.

Some ideas for GUCs:

max_parallel_degree = The largest number of processes we'll consider using for a single query.

min_parallel_speedup = The minimum percentage by which a parallel path must be cheaper (in terms of execution time) than a non-parallel path in order to survive. I'm imagining the default here might be something like 15%.

min_parallel_speedup_per_worker = Like the previous one, but per worker. e.g. if this is 5%, which might be a sensible default, then a plan with 4 workers must be at least 20% better to survive, but a plan using only 2 workers only needs to be 10% better.

An additional benefit of this line of thinking is that planning would always produce a best non-parallel path. And sometimes, there would also be a best parallel path that is expected to run faster. We could then choose between them dynamically at execution time.

I think it's pretty hard to imagine a scenario as extreme as the one you mention above ever actually occurring in practice. I mean, even the most naive implementation of parallel query will presumably have something like max_parallel_degree, and you probably won't have that set to 128. For starters, it can't possibly make sense unless your server has at least 128 CPUs, and even then it only makes sense if you don't mind a single query using all of them, and even if the first of those things is true, the second one probably isn't. I don't doubt that less extreme variants of this scenario are possible, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
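To illustrate the survival test described above, here is a minimal sketch, assuming the hypothetical GUCs from the email exist and are stored as fractions; combining the two thresholds with Max() is one possible interpretation, not the patch's code.

#include "postgres.h"
#include "nodes/relation.h"

/* hypothetical GUCs matching the proposal above, stored as fractions */
static double min_parallel_speedup = 0.15;              /* 15% */
static double min_parallel_speedup_per_worker = 0.05;   /* 5% per worker */

/*
 * Sketch: keep a parallel path only if its estimated execution time
 * beats the best serial path for the same sort order by enough of a
 * margin. With the per-worker variant, a 4-worker plan must be at
 * least 20% cheaper and a 2-worker plan at least 10% cheaper.
 */
static bool
parallel_path_survives(Cost serial_total, Cost parallel_total, int nworkers)
{
    double      required = Max(min_parallel_speedup,
                               nworkers * min_parallel_speedup_per_worker);

    return parallel_total <= serial_total * (1.0 - required);
}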
On 08-04-2015 PM 12:46, Amit Kapila wrote: > Going forward, I think we can improve the same if we decide not to shutdown > parallel workers till postmaster shutdown once they are started and > then just allocate them during executor-start phase. > I wonder if it makes sense to invent the notion of a global pool of workers with configurable number of workers that are created at postmaster start and destroyed at shutdown and requested for use when a query uses parallelizable nodes. That way, parallel costing model might be better able to factor in the available-resources-for-parallelization aspect, too. Though, I'm not quite sure how that helps solve (if at all) the problem of occasional unjustifiable resource consumption due to parallelization. Thanks, Amit
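As a rough illustration of the pool idea, and assuming it would be built on the existing background-worker machinery, registration at postmaster start could look something like the sketch below; parallel_pool_size and ParallelPoolWorkerMain are hypothetical, and this ignores the per-database attachment problem discussed later.

#include "postgres.h"
#include "postmaster/bgworker.h"

/* hypothetical GUC: number of pool workers to create at postmaster start */
static int  parallel_pool_size = 8;

/* hypothetical worker entry point that sits idle waiting for work */
extern void ParallelPoolWorkerMain(Datum arg);

/* to be called from _PG_init() of a shared_preload_libraries module */
static void
register_parallel_pool(void)
{
    int         i;

    for (i = 0; i < parallel_pool_size; i++)
    {
        BackgroundWorker worker;

        memset(&worker, 0, sizeof(worker));
        snprintf(worker.bgw_name, BGW_MAXLEN, "parallel pool worker %d", i);
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
            BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        worker.bgw_main = ParallelPoolWorkerMain;
        worker.bgw_main_arg = Int32GetDatum(i);
        RegisterBackgroundWorker(&worker);
    }
}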
On Wed, Apr 8, 2015 at 1:53 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> David Rowley <dgrowleyml@gmail.com> wrote:
>
> > If we attempt to do this parallel stuff at plan time, and we
> > happen to plan at some quiet period, or perhaps worse, some
> > application's start-up process happens to PREPARE a load of
> > queries when the database is nice and quiet, then quite possibly
> > we'll end up with some highly parallel queries. Then perhaps come
> > the time these queries are actually executed the server is very
> > busy... Things will fall apart quite quickly due to the masses of
> > IPC and context switches that would be going on.
> >
> > I completely understand that this parallel query stuff is all
> > quite new to us all and we're likely still trying to nail down
> > the correct infrastructure for it to work well, so this is why
> > I'm proposing that the planner should know nothing of parallel
> > query, instead I think it should work more along the lines of:
> >
> > * Planner should be completely oblivious to what parallel query
> > is.
> > * Before executor startup the plan is passed to a function which
> > decides if we should parallelise it, and does so if the plan
> > meets the correct requirements. This should likely have a very
> > fast exit path such as:
> > if root node's cost < parallel_query_cost_threshold
> > return; /* the query is not expensive enough to attempt to make parallel */
> >
> > The above check will allow us to have an almost zero overhead for
> > small low cost queries.
> >
> > This function would likely also have some sort of logic in order
> > to determine if the server has enough spare resource at the
> > current point in time to allow queries to be parallelised
>
> There is a lot to like about this suggestion.
>
> I've seen enough performance crashes due to too many concurrent
> processes (even when each connection can only use a single process)
> to believe that, for a plan which will be saved, it is not possible to
> know at planning time whether parallelization will be a nice win or
> a devastating over-saturation of resources during some later
> execution phase.
>
> Another thing to consider is that this is not entirely unrelated to
> the concept of admission control policies. Perhaps this phase
> could be a more general execution start-up admission control phase,
> where parallel processing would be one adjustment that could be
> considered.

I think there is always a chance that resources (like parallel-workers) won't be available at run-time even if we decide about them at executor-start phase unless we block it for that node's usage and OTOH if we block it (by allocating) those resources during executor-start phase then we might end up blocking it too early or may be they won't even get used if we decide not to execute that node. On that basis, it seems to me current strategy is not bad where we decide during planning time and later during execution time if not all resources (particularly parallel-workers) are not available, then we use only the available ones to execute the plan. Going forward, I think we can improve the same if we decide not to shutdown parallel workers till postmaster shutdown once they are started and then just allocate them during executor-start phase.
On Wed, Apr 8, 2015 at 3:30 PM, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On 8 April 2015 at 15:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I think there is always a chance that resources (like parallel-workers)
>> won't be available at run-time even if we decide about them at
>> executor-start phase unless we block it for that node's usage and OTOH
>> if we block it (by allocating) those resources during executor-start phase
>> then we might end up blocking it too early or may be they won't even get
>> used if we decide not to execute that node. On that basis, it seems to
>> me current strategy is not bad where we decide during planning time and
>> later during execution time if not all resources (particularly parallel-workers)
>> are not available, then we use only the available ones to execute the plan.
>> Going forward, I think we can improve the same if we decide not to shutdown
>> parallel workers till postmaster shutdown once they are started and
>> then just allocate them during executor-start phase.
>>
>>
>
> Yeah, but what about when workers are not available in cases when the plan was only a win because the planner thought there would be lots of workers... There could have been a more optimal serial plan already thrown out by the planner which is no longer available to the executor.
>

That could also happen even if we decide in executor-start phase.

I agree that there is a chance of loss in case appropriate resources are not available during execution, but the same is true for work_mem as well for a non-parallel plan. I think we need some advanced way to handle the case when resources are not available during execution, by either re-planning the statement or by some other way, but that can also be done separately.
On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > One disadvantage of retaining parallel-paths could be that it can > increase the number of combinations planner might need to evaluate > during planning (in particular during join path evaluation) unless we > do some special handling to avoid evaluation of such combinations. Yes, that's true. But the overhead might not be very much. In the common case, many baserels and joinrels will have no parallel paths because the non-parallel path is known to be better anyway. Also, if parallelism does seem to be winning, we're probably planning a query that involves accessing a fair amount of data, so a little extra planner overhead may not be so bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 8, 2015 at 3:34 AM, David Rowley <dgrowleyml@gmail.com> wrote: > On 8 April 2015 at 14:24, Robert Haas <robertmhaas@gmail.com> wrote: >> I think one of the philosophical questions that has to be answered >> here is "what does it mean to talk about the cost of a parallel >> plan?". For a non-parallel plan, the cost of the plan means both "the >> amount of effort we will spend executing the plan" and also "the >> amount of time we think the plan will take to complete", but those two >> things are different for parallel plans. I'm inclined to think it's >> right to view the cost of a parallel plan as a proxy for execution >> time, because the fundamental principle of the planner is that we pick >> the lowest-cost plan. But there also clearly needs to be some way to >> prevent the selection of a plan which runs slightly faster at the cost >> of using vastly more resources. > > I'd agree with that as far as CPU costs, or maybe I'd just disagree with the > alternative, as if we costed in <cost of individual worker's work> * <number > of workers> then we'd never choose a parallel plan, as by the time we costed > in tuple communication costs between the processes a parallel plan would > always cost more than the serial equivalent. I/O costs are different, I'd > imagine these shouldn't be divided by the estimated number of workers. It's hard to say. If the I/O is from the OS buffer cache, then there's no reason why several workers can't run in parallel. And even if it's from the actual storage, we don't know what degree of I/O parallelism will be possible. Maybe effective_io_concurrency should play into the costing formula somehow, but it's not very clear to me that captures the information we care about. In general, I'm not sure how common it is for the execution speed of a sequential scan to be limited by I/O. For example, on a pgbench database, scale factor 300, on a POWERPC machine provided by IBM for performance testing (thanks, IBM!) a cached read of the pgbench_accounts files took 1.122 seconds. After dropping the caches, it took 10.427 seconds. "select * from pgbench_accounts where abalance > 30000" took 10.244 seconds with a cold cache and 5.029 seconds with a warm cache. So on this particular hardware, on this particular test, parallelism is useless if the cache is cold, but it could be right to use ~4-5 processes for the scan if the cache is warm. However, we have no way of knowing whether the cache will be cold or warm at execution time. This isn't a new problem. As it is, the user has to set seq_page_cost and random_page_cost based on either a cold-cache assumption or a warm-cache assumption, and if they guess wrong, their costing estimates will be off (on this platform, on this test case) by 4-5x. That's pretty bad, and it's totally unclear to me what to do about it. I'm guessing it's unclear to other people, too, or we would likely have done something about it by now. >> Some ideas for GUCs: >> >> max_parallel_degree = The largest number of processes we'll consider >> using for a single query. >> min_parallel_speedup = The minimum percentage by which a parallel path >> must be cheaper (in terms of execution time) than a non-parallel path >> in order to survive. I'm imagining the default here might be >> something like 15%. >> min_parallel_speedup_per_worker = Like the previous one, but per >> worker. e.g. 
if this is 5%, which might be a sensible default, then a >> plan with 4 workers must be at least 20% better to survive, but a plan >> using only 2 workers only needs to be 10% better. > > max_parallel_degree feels awfully like it would have to be set > conservatively, similar to how work_mem is today. Like with work_mem, during > quiet periods it sure would be nice if it could magically increase. Absolutely. But, similar to work_mem, that's a really hard problem. We can't know at plan time how much work memory, or how many CPUs, will be available at execution time. And even if we did, it need not be constant throughout the whole of query execution. It could be that when execution starts, there's lots of memory available, so we do a quicksort rather than a tape-sort. But midway through, the machine comes under intense memory pressure and there's no way for the system to switch strategies. Now, having said that, I absolutely believe that it's correct for the planner to make the initial decisions in this area. Parallelism changes the cost of execution nodes, and it's completely wrong to assume that this couldn't alter planner decisions at higher levels of the plan tree. At the same time, it's pretty clear that it would be a great thing for the executor to be able to adjust the strategy if the planner's assumptions don't pan out, or if conditions have changed. For example, if we choose a seq-scan-sort-and-filter over an index-scan-and-filter thinking that we'll be able to do a quicksort, and then it turns out that we're short on memory, it's too late to switch gears and adopt the index-scan-and-filter plan after all. That's long since been discarded. But it's still better to switch to a heap sort than to persist with a quicksort that's either going to fail outright, or (maybe worse) succeed but drive the machine into swap, which will just utterly obliterate performance. >> An additional benefit of this line of thinking is that planning would >> always produce a best non-parallel path. And sometimes, there would >> also be a best parallel path that is expected to run faster. We could >> then choose between them dynamically at execution time. > > Actually store 2 plans within the plan? Like with an AlternativePlanNode? Yeah. I'm not positive that's a good idea, but it seems like it might be. >> I think it's pretty hard to imagine a scenario as extreme as the one >> you mention above ever actually occurring in practice. I mean, even >> the most naive implementation of parallel query will presumably have >> something like max_parallel_degree, and you probably won't have that >> set to 128. For starters, it can't possibly make sense unless your >> server has at least 128 CPUs, and even then it only makes sense if you >> don't mind a single query using all of them, and even if the first of >> those things is true, the second one probably isn't. I don't doubt >> that less extreme variants of this scenario are possible, though. > > Yeah maybe, it does seem quite extreme, but maybe less so as the years roll > on a bit... perhaps in 5-10 years it might be quite common to have that many > spare CPU cores to throw at a task. That is certainly possible, but we need to start small. It's completely OK for the first version of this feature to have some rough edges that get improved later. Indeed, it's absolutely vital, or we'll never get this thing off the ground. 
> I think if we have this percentage GUC you mentioned to prefer parallel > plans if they're within a % threshold of the serial plan, then we could end > up with problems with I/O and buffers getting thrown out of caches due to > the extra I/O involved in parallel plans going with seq scans instead of > serial plans choosing index scans. That's possible, but the non-parallel planner doesn't account for caching effects, either. > In summary it sounds like with my idea we get: > > Pros > * Optimal plan if no workers are available at execution time. > * Parallelism possible if the chosen optimal plan happens to support > parallelism, e.g not index scan. > * No planning overhead The third one isn't really true. You've just moved some of the planning to execution time. > Cons: > * The plan "Parallelizer" must make changes to the plan just before > execution time, which ruins the 1 to 1 ratio of plan/executor nodes by the > time you inject Funnel nodes. > > If we parallelise during planning time: > > Pros > * More chance of getting a parallel friendly plan which could end up being > very fast if we get enough workers at executor time. This, to me, is by far the biggest "con" of trying to do something at execution time. If planning doesn't take into account the gains that are possible from parallelism, then you'll only be able to come up with the best parallel plan when it happens to be a parallelized version of the best serial plan. So long as the only parallel operator is parallel seq scan, that will probably be a common scenario. But once we assemble a decent selection of parallel operators, and a reasonably intelligent parallel query optimizer, I'm not so sure it'll still be true. > Cons: > * May produce non optimal plans if no worker processes are available during > execution time. > * Planning overhead for considering parallel paths. > * The parallel plan may blow out buffer caches due to increased I/O of > parallel plan. > > Of course please say if I've missed any pro or con. I think I generally agree with your list; but we might not agree on the relative importance of the items on it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 08-04-2015 PM 12:46, Amit Kapila wrote: >> Going forward, I think we can improve the same if we decide not to shutdown >> parallel workers till postmaster shutdown once they are started and >> then just allocate them during executor-start phase. > > I wonder if it makes sense to invent the notion of a global pool of workers > with configurable number of workers that are created at postmaster start and > destroyed at shutdown and requested for use when a query uses parallelizable > nodes. Short answer: Yes, but not for the first version of this feature. Longer answer: We can't actually very reasonably have a "global" pool of workers so long as we retain the restriction that a backend connected to one database cannot subsequently disconnect from it and connect to some other database instead. However, it's certainly a good idea to reuse the same workers for subsequent operations on the same database, especially if they are also by the same user. At the very minimum, it would be good to reuse the same workers for subsequent operations within the same query, instead of destroying the old ones and creating new ones. Notwithstanding the obvious value of all of these ideas, I don't think we should do any of them for the first version of this feature. This is too big a thing to get perfect on the first try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-04-21 AM 03:29, Robert Haas wrote: > On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote: >> On 08-04-2015 PM 12:46, Amit Kapila wrote: >>> Going forward, I think we can improve the same if we decide not to shutdown >>> parallel workers till postmaster shutdown once they are started and >>> then just allocate them during executor-start phase. >> >> I wonder if it makes sense to invent the notion of a global pool of workers >> with configurable number of workers that are created at postmaster start and >> destroyed at shutdown and requested for use when a query uses parallelizable >> nodes. > > Short answer: Yes, but not for the first version of this feature. > > Longer answer: We can't actually very reasonably have a "global" pool > of workers so long as we retain the restriction that a backend > connected to one database cannot subsequently disconnect from it and > connect to some other database instead. However, it's certainly a > good idea to reuse the same workers for subsequent operations on the > same database, especially if they are also by the same user. At the > very minimum, it would be good to reuse the same workers for > subsequent operations within the same query, instead of destroying the > old ones and creating new ones. Notwithstanding the obvious value of > all of these ideas, I don't think we should do any of them for the > first version of this feature. This is too big a thing to get perfect > on the first try. > Agreed. Perhaps, Amit has worked (is working) on "reuse the same workers for subsequent operations within the same query" Thanks, Amit
On Wed, Apr 8, 2015 at 3:34 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> In summary it sounds like with my idea we get:
>
> Pros
> * Optimal plan if no workers are available at execution time.
> * Parallelism possible if the chosen optimal plan happens to support
> parallelism, e.g not index scan.
> * No planning overhead
The third one isn't really true. You've just moved some of the
planning to execution time.
> Cons:
> * The plan "Parallelizer" must make changes to the plan just before
> execution time, which ruins the 1 to 1 ratio of plan/executor nodes by the
> time you inject Funnel nodes.
>
> If we parallelise during planning time:
>
> Pros
> * More chance of getting a parallel friendly plan which could end up being
> very fast if we get enough workers at executor time.
This, to me, is by far the biggest "con" of trying to do something at
execution time. If planning doesn't take into account the gains that
are possible from parallelism, then you'll only be able to come up
with the best parallel plan when it happens to be a parallelized
version of the best serial plan. So long as the only parallel
operator is parallel seq scan, that will probably be a common
scenario. But once we assemble a decent selection of parallel
operators, and a reasonably intelligent parallel query optimizer, I'm
not so sure it'll still be true.
> Cons:
> * May produce non optimal plans if no worker processes are available during
> execution time.
> * Planning overhead for considering parallel paths.
> * The parallel plan may blow out buffer caches due to increased I/O of
> parallel plan.
>
> Of course please say if I've missed any pro or con.
I think I generally agree with your list; but we might not agree on
the relative importance of the items on it.
I've also been thinking about how, instead of having to have a special PartialSeqScan node which contains a bunch of code to store tuples in a shared memory queue, could we not have a "TupleBuffer" or "ParallelTupleReader" node, one of which would always be the root node of a plan branch that's handed off to a worker process. This node would just try to keep its shared tuple store full, and perhaps once it fills it could have a bit of a sleep and be woken up when there's a bit more space on the queue. When no more tuples were available from the node below this, then the worker could exit (providing there was no rescan required).

I think between the Funnel node and a ParallelTupleReader we could actually parallelise plans that don't even have parallel safe nodes... Let me explain:

Let's say we have a 4-way join, and the join order must be {a,b}, {c,d} => {a,b,c,d}. Assuming the cost of joining a to b and c to d are around the same, the Parallelizer may notice this and decide to inject a Funnel and then a ParallelTupleReader just below the node for c join d and have c join d in parallel. Meanwhile the main worker process could be executing the root node, as normal. This way the main worker wouldn't have to go to the trouble of joining c to d itself, as the worker would have done all that hard work.

I know the current patch is still very early in the evolution of PostgreSQL's parallel query, but how would that work with the current method of selecting which parts of the plan to parallelise?
I really think the plan needs to be a complete plan before it can be best analysed on how to divide the workload between workers, and also, it would be quite useful to know how many workers are going to be able to lend a hand in order to know best how to divide the plan up as evenly as possible.
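As a loose sketch of the reader loop described above (not code from any posted patch: the function name is made up, and the detach step assumes the shm_mq_from_handle() accessor discussed later in this thread):

#include "postgres.h"
#include "executor/executor.h"
#include "storage/shm_mq.h"

/* new accessor from the patch discussed later in this thread */
extern shm_mq *shm_mq_from_handle(shm_mq_handle *mqh);

/*
 * Sketch: execute the plan branch handed to this worker and keep the
 * shared queue fed. shm_mq_send() blocks while the queue is full, which
 * supplies the sleep/wake-up behaviour described above.
 */
static void
ParallelTupleReaderRun(PlanState *subtree, shm_mq_handle *mqh)
{
    for (;;)
    {
        TupleTableSlot *slot = ExecProcNode(subtree);
        MinimalTuple tuple;

        if (TupIsNull(slot))
            break;              /* no more tuples below; worker can exit */

        tuple = ExecFetchSlotMinimalTuple(slot);
        if (shm_mq_send(mqh, tuple->t_len, tuple, false) == SHM_MQ_DETACHED)
            break;              /* master went away; stop early */
    }

    shm_mq_detach(shm_mq_from_handle(mqh));
}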
On Tue, Apr 21, 2015 at 9:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Apr 20, 2015 at 10:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > One disadvantage of retaining parallel-paths could be that it can >> > increase the number of combinations planner might need to evaluate >> > during planning (in particular during join path evaluation) unless we >> > do some special handling to avoid evaluation of such combinations. >> >> Yes, that's true. But the overhead might not be very much. In the >> common case, many baserels and joinrels will have no parallel paths >> because the non-parallel path is known to be better anyway. Also, if >> parallelism does seem to be winning, we're probably planning a query >> that involves accessing a fair amount of data, > > Am I understanding right that by the above you mean we should retain both > the parallel and non-parallel paths only if the parallel path wins over the > non-parallel one? Yes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>
> On Wed, Mar 18, 2015 at 11:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I think I figured out the problem. That fix only helps in the case
> >> where the postmaster noticed the new registration previously but
> >> didn't start the worker, and then later notices the termination.
> >> What's much more likely to happen is that the worker is started and
> >> terminated so quickly that both happen before we create a
> >> RegisteredBgWorker for it. The attached patch fixes that case, too.
> >
> > Patch fixes the problem and now for Rescan, we don't need to Wait
> > for workers to finish.
>
> I realized that there is a problem with this. If an error occurs in
> one of the workers just as we're deciding to kill them all, then the
> error won't be reported. Also, the new code to propagate
> XactLastRecEnd won't work right, either. I think we need to find a
> way to shut down the workers cleanly. The idea generally speaking
> should be:
>
> 1. Tell all of the workers that we want them to shut down gracefully
> without finishing the scan.
>
> 2. Wait for them to exit via WaitForParallelWorkersToFinish().
>
> My first idea about how to implement this is to have the master detach
> all of the tuple queues via a new function TupleQueueFunnelShutdown().
> Then, we should change tqueueReceiveSlot() so that it does not throw
> an error when shm_mq_send() returns SHM_MQ_DETACHED. We could modify
> the receiveSlot method of a DestReceiver to return bool rather than
> void; a "true" value can mean "continue processing" where as a "false"
> value can mean "stop early, just as if we'd reached the end of the
> scan".
>
I have implemented this idea (note that I have to expose a new API shm_mq_from_handle, as TupleQueueFunnel stores shm_mq_handle* and we need shm_mq* to call shm_mq_detach) and apart from this I have fixed other problems reported on this thread:

1. Execution of initPlan by master backend and then pass the required PARAM_EXEC parameter values to workers.
2. Avoid consuming dsm's by freeing the parallel context after the last tuple is fetched.
3. Allow execution of Result node in worker backend as that can be added as a gating filter on top of PartialSeqScan.
4. Merged parallel heap scan descriptor patch

To apply the patch, please follow below sequence:

HEAD Commit-Id: 4d930eee
parallel-mode-v9.patch [1]
assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in the patch)
parallel_seqscan_v14.patch (Attached with this mail)
Attachment
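Sketching the receiveSlot change described in the quoted review above — returning bool so that SHM_MQ_DETACHED stops the scan instead of raising an error. The TQueueDestReceiver layout here is assumed for illustration, not taken from the patch.

#include "postgres.h"
#include "executor/tuptable.h"
#include "storage/shm_mq.h"
#include "tcop/dest.h"

/* assumed layout of the tuple-queue DestReceiver */
typedef struct TQueueDestReceiver
{
    DestReceiver pub;
    shm_mq_handle *queue;
} TQueueDestReceiver;

/*
 * Sketch: send the tuple to the master's queue; report false (rather
 * than erroring out) if the master has detached, so the caller can
 * treat it just as if we'd reached the end of the scan.
 */
static bool
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
    TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
    MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot);
    shm_mq_result result;

    result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
    if (result == SHM_MQ_DETACHED)
        return false;           /* master detached: stop early, no error */
    return true;
}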
On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have implemented this idea (note that I have to expose a new API > shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and > we sum_mq* to call shm_mq_detach) and apart this I have fixed other > problems reported on this thread: > > 1. Execution of initPlan by master backend and then pass the > required PARAM_EXEC parameter values to workers. > 2. Avoid consuming dsm's by freeing the parallel context after > the last tuple is fetched. > 3. Allow execution of Result node in worker backend as that can > be added as a gating filter on top of PartialSeqScan. > 4. Merged parallel heap scan descriptor patch > > To apply the patch, please follow below sequence: > > HEAD Commit-Id: 4d930eee > parallel-mode-v9.patch [1] > assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in > the patch) > parallel_seqscan_v14.patch (Attached with this mail) Thanks, this version looks like an improvement. However, I still see some problems: - I believe the separation of concerns between ExecFunnel() and ExecEndFunnel() is not quite right. If the scan is shut down before it runs to completion (e.g. because of LIMIT), then I think we'll call ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I think you probably need to create a static subroutine that is called both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each case cleaning up whatever resources remain. - InitializeParallelWorkers() still mixes together general parallel executor concerns with concerns specific to parallel sequential scan (e.g. EstimatePartialSeqScanSpace). We have to eliminate everything that assumes that what's under a funnel will be, specifically, a partial sequential scan. To make this work properly, I think we should introduce a new function that recurses over the plan tree and invokes some callback for each plan node. I think this could be modeled on this code from ExplainNode(), beginning around line 1593: /* initPlan-s */ if (planstate->initPlan) ExplainSubPlans(planstate->initPlan, ancestors, "InitPlan", es); /* lefttree */ if (outerPlanState(planstate)) ExplainNode(outerPlanState(planstate), ancestors, "Outer", NULL, es); /* righttree */ if (innerPlanState(planstate)) ExplainNode(innerPlanState(planstate), ancestors, "Inner", NULL, es); /* special child plans */ switch (nodeTag(plan)) { /* a bunch of special cases */ } /* subPlan-s */ if (planstate->subPlan) ExplainSubPlans(planstate->subPlan, ancestors, "SubPlan", es); The new function would do the same sort of thing, but instead of explaining each node, it would invoke a callback for each node. Possibly explain.c could use it instead of having hard-coded logic. Possibly it should use the same sort of return-true convention as expression_tree_walker, query_tree_walker, and friends. So let's call it planstate_tree_walker. Now, instead of directly invoking logic specific to parallel sequential scan, it should call planstate_tree_walker() on its lefttree and pass a new function ExecParallelEstimate() as the callback. That function ignores any node that's not parallel aware, but when it sees a partial sequential scan (or, in the future, some a parallel bitmap scan, parallel sort, or what have you) it does the appropriate estimation work. When ExecParallelEstimate() finishes, we InitializeParallelDSM(). Then, we call planstate_tree_walker() on the lefttree again, and this time we pass another new function ExecParallelInitializeDSM(). 
Like the previous one, that ignores the callbacks from non-parallel nodes, but if it hits a parallel node, then it fills in the parallel bits (i.e. ParallelHeapScanDesc for a partial sequential scan). - shm_mq_from_handle() is probably reasonable, but can we rename it shm_mq_get_queue()? - It's hard to believe this is right: + if (parallelstmt->inst_options) + receiver = None_Receiver; Really? Flush the tuples if there are *any instrumentation options whatsoever*? At the very least, that doesn't look too future-proof, but I'm suspicious that it's outright incorrect. - I think ParallelStmt probably shouldn't be defined in parsenodes.h. That file is included in a lot of places, and adding all of those extra #includes there doesn't seem like a good idea for modularity reasons even if you don't care about partial rebuilds. Something that includes a shm_mq obviously isn't a "parse" node in any meaningful sense anyway. - I don't think you need both setup cost and startup cost. Starting up more workers isn't particularly more expensive than starting up fewer of them, because most of the overhead is in waiting for them to actually start, and the number of workers is reasonable, then they're all be doing that in parallel with each other. I suggest removing parallel_startup_cost and keeping parallel_setup_cost. - In cost_funnel(), I don't think it's right to divide the run cost by nWorkers + 1. Suppose we've got a plan that looks like this: Funnel -> Hash Join -> Partial Seq Scan on a -> Hash -> Seq Scan on b The sequential scan on b is going to get executed once per worker, whereas the effort for the sequential scan on a is going to be divided over all the workers. So the right way to cost this is as follows: (a) The cost of the partial sequential scan on a is equal to the cost of a regular sequential scan, plus a little bit of overhead to account for communication via the ParallelHeapScanDesc, divided by the number of workers + 1. (b) The cost of the remaining nodes under the funnel works normally. (c) The cost of the funnel is equal to the cost of the hash join plus number of tuples multiplied by per-tuple communication overhead plus a large fixed overhead reflecting the time it takes the workers to start. - While create_parallelscan_paths() is quite right to limit the number of workers to no more than the number of pages, it's pretty obvious that in practice that's way too conservative. I suggest we get significantly more aggressive about that, like limiting ourselves to one worker per thousand pages. We don't really know exactly what the costing factors should be here just yet, but we certainly know that spinning up lots of workers to read a handful of pages each must be dumb. And we can save a significant amount of planning time here by not bothering to generate parallel paths for little tiny relations. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
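To make the proposed traversal concrete, here is a minimal sketch of what planstate_tree_walker() might look like, following the ExplainNode() structure quoted above. The signature and handling are assumptions (the special child-plan cases for Append, SubqueryScan, and friends are elided), not code from any patch.

#include "postgres.h"
#include "nodes/execnodes.h"

typedef bool (*planstate_walker_fn) (PlanState *planstate, void *context);

/*
 * Sketch: invoke the walker on each child of the given planstate,
 * using the expression_tree_walker() convention of returning true
 * to abort the walk early.
 */
static bool
planstate_tree_walker(PlanState *planstate,
                      planstate_walker_fn walker,
                      void *context)
{
    ListCell   *lc;

    /* initPlan-s */
    foreach(lc, planstate->initPlan)
    {
        SubPlanState *sstate = (SubPlanState *) lfirst(lc);

        if (walker(sstate->planstate, context))
            return true;
    }

    /* lefttree */
    if (outerPlanState(planstate) &&
        walker(outerPlanState(planstate), context))
        return true;

    /* righttree */
    if (innerPlanState(planstate) &&
        walker(innerPlanState(planstate), context))
        return true;

    /* special child plans (Append, SubqueryScan, etc.) elided here */

    /* subPlan-s */
    foreach(lc, planstate->subPlan)
    {
        SubPlanState *sstate = (SubPlanState *) lfirst(lc);

        if (walker(sstate->planstate, context))
            return true;
    }

    return false;
}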
>
> On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have implemented this idea (note that I have to expose a new API
> > shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
> > shm_mq_handle* and we need shm_mq* to call shm_mq_detach) and apart
> > from this I have fixed other
> > problems reported on this thread:
> >
> > 1. Execution of initPlan by master backend and then pass the
> > required PARAM_EXEC parameter values to workers.
> > 2. Avoid consuming dsm's by freeing the parallel context after
> > the last tuple is fetched.
> > 3. Allow execution of Result node in worker backend as that can
> > be added as a gating filter on top of PartialSeqScan.
> > 4. Merged parallel heap scan descriptor patch
> >
> > To apply the patch, please follow below sequence:
> >
> > HEAD Commit-Id: 4d930eee
> > parallel-mode-v9.patch [1]
> > assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
> > the patch)
> > parallel_seqscan_v14.patch (Attached with this mail)
>
> Thanks, this version looks like an improvement. However, I still see
> some problems:
>
> - I believe the separation of concerns between ExecFunnel() and
> ExecEndFunnel() is not quite right. If the scan is shut down before
> it runs to completion (e.g. because of LIMIT), then I think we'll call
> ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
> think you probably need to create a static subroutine that is called
> both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
> case cleaning up whatever resources remain.
>
> - InitializeParallelWorkers() still mixes together general parallel
> executor concerns with concerns specific to parallel sequential scan
> (e.g. EstimatePartialSeqScanSpace).
> - shm_mq_from_handle() is probably reasonable, but can we rename it
> shm_mq_get_queue()?
>
> - It's hard to believe this is right:
>
> + if (parallelstmt->inst_options)
> + receiver = None_Receiver;
>
> Really? Flush the tuples if there are *any instrumentation options
> whatsoever*? At the very least, that doesn't look too future-proof,
> but I'm suspicious that it's outright incorrect.
>
> - I think ParallelStmt probably shouldn't be defined in parsenodes.h.
> That file is included in a lot of places, and adding all of those
> extra #includes there doesn't seem like a good idea for modularity
> reasons even if you don't care about partial rebuilds. Something that
> includes a shm_mq obviously isn't a "parse" node in any meaningful
> sense anyway.
>
> - I don't think you need both setup cost and startup cost. Starting
> up more workers isn't particularly more expensive than starting up
> fewer of them, because most of the overhead is in waiting for them to
> actually start, and if the number of workers is reasonable, then they'll
> all be doing that in parallel with each other. I suggest removing
> parallel_startup_cost and keeping parallel_setup_cost.
>
> - In cost_funnel(), I don't think it's right to divide the run cost by
> nWorkers + 1. Suppose we've got a plan that looks like this:
>
> Funnel
> -> Hash Join
> -> Partial Seq Scan on a
> -> Hash
> -> Seq Scan on b
>
> The sequential scan on b is going to get executed once per worker,
> whereas the effort for the sequential scan on a is going to be divided
> over all the workers. So the right way to cost this is as follows:
>
> (a) The cost of the partial sequential scan on a is equal to the cost
> of a regular sequential scan, plus a little bit of overhead to account
> for communication via the ParallelHeapScanDesc, divided by the number
> of workers + 1.
> (b) The cost of the remaining nodes under the funnel works normally.
> (c) The cost of the funnel is equal to the cost of the hash join plus
> number of tuples multiplied by per-tuple communication overhead plus a
> large fixed overhead reflecting the time it takes the workers to
> start.
>
> - While create_parallelscan_paths() is quite right to limit the number
> of workers to no more than the number of pages, it's pretty obvious
> that in practice that's way too conservative. I suggest we get
> significantly more aggressive about that, like limiting ourselves to
> one worker per thousand pages. We don't really know exactly what the
> costing factors should be here just yet, but we certainly know that
> spinning up lots of workers to read a handful of pages each must be
> dumb. And we can save a significant amount of planning time here by
> not bothering to generate parallel paths for little tiny relations.
>
makes sense, will change.
On Fri, Apr 24, 2015 at 8:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> - InitializeParallelWorkers() still mixes together general parallel
>> executor concerns with concerns specific to parallel sequential scan
>> (e.g. EstimatePartialSeqScanSpace).
>
> Here we are doing 2 things, first one is for the planned statement and
> the second one is node specific, which in this case is the parallel heap
> scan descriptor. So if I understand correctly, you want that we remove
> the second one and have a recursive function to achieve the same.

Right.

>> - It's hard to believe this is right:
>>
>> + if (parallelstmt->inst_options)
>> + receiver = None_Receiver;
>>
>> Really? Flush the tuples if there are *any instrumentation options
>> whatsoever*? At the very least, that doesn't look too future-proof,
>> but I'm suspicious that it's outright incorrect.
>
> instrumentation info is for the explain statement where we don't need
> tuples and it is set the same way for it as well, refer ExplainOnePlan().
> What makes you feel this is incorrect?

Well, for one thing, it's going to completely invalidate the result of
EXPLAIN. I mean, consider this:

Hash Join
-> Parallel Seq Scan
-> Hash
  -> Seq Scan

If you have the workers throw away the rows from the parallel seq scan
instead of sending them back to the master, the master won't join those
rows against the other table. And then the "actual" row counts, timing,
etc. will all be totally wrong. Worse, if the user is EXPLAIN-ing a
SELECT INTO command, the results will be totally wrong.

I don't think you can use ExplainOnePlan() as precedent for the theory
that explain_options != 0 means discard everything, because that
function does not do that. It bases the decision to throw away the
output on the fact that EXPLAIN was used, and throws it away unless an
IntoClause was also specified. It does this even if instrument_options
== 0. Meanwhile, auto_explain does NOT throw away the output even if
instrument_options != 0, nor should it!

But even if none of that were an issue, throwing away part of the
results from an internal plan tree is not the same thing as throwing
away the final result stream, and is dead wrong.

>> - I think ParallelStmt probably shouldn't be defined in parsenodes.h.
>> That file is included in a lot of places, and adding all of those
>> extra #includes there doesn't seem like a good idea for modularity
>> reasons even if you don't care about partial rebuilds. Something that
>> includes a shm_mq obviously isn't a "parse" node in any meaningful
>> sense anyway.
>
> How about tcop/tcopprot.h?

The comment of that file is "prototypes for postgres.c". Generally,
unless there is some reason to do otherwise, the prototypes for a .c
file in src/backend go in a .h file with the same name in src/include.
I don't see why we should do differently here. ParallelStmt should be
defined and used in a file living in src/backend/executor, and the
header should have the same name and go in src/include/executor.

>> - I don't think you need both setup cost and startup cost. Starting
>> up more workers isn't particularly more expensive than starting up
>> fewer of them, because most of the overhead is in waiting for them to
>> actually start, and if the number of workers is reasonable, then
>> they'll all be doing that in parallel with each other. I suggest
>> removing parallel_startup_cost and keeping parallel_setup_cost.
>
> There is some work (like creation of shm queues, launching of workers)
> which is done proportional to the number of workers during setup time.
> I have kept 2 parameters to distinguish such work. I think you have a
> point that the start of some or all workers could be parallel, but I
> feel that is still work proportional to the number of workers. For
> future parallel operations also such a parameter could be useful where
> we need to set up IPC between workers or some other stuff where work is
> proportional to the number of workers.

That's technically true, but the incremental work involved in
supporting a new worker is extremely small compared with worker startup
times. I'm guessing that the setup cost is going to be on the order of
hundred-thousands or millions and the startup cost is going to be on
the order of tens or ones. Unless you can present some contrary
evidence, I think we should rip it out.

And I actually hope you *can't* present some contrary evidence.
Because if you can, then that might mean that we need to cost every
possible path from 0 up to N workers and let the costing machinery
decide which one is better. If you can't, then we can cost the
non-parallel path and the maximally-parallel path and be done. And
that would be much better, because it will be faster. Remember, just
because we cost a bunch of parallel paths doesn't mean that any of them
will actually be chosen. We need to avoid generating too much
additional planner work in cases where we don't end up deciding on
parallelism anyway.

>> - In cost_funnel(), I don't think it's right to divide the run cost by
>> nWorkers + 1. Suppose we've got a plan that looks like this:
>>
>> Funnel
>> -> Hash Join
>> -> Partial Seq Scan on a
>> -> Hash
>> -> Seq Scan on b
>>
>> The sequential scan on b is going to get executed once per worker,
>> whereas the effort for the sequential scan on a is going to be divided
>> over all the workers. So the right way to cost this is as follows:
>>
>> (a) The cost of the partial sequential scan on a is equal to the cost
>> of a regular sequential scan, plus a little bit of overhead to account
>> for communication via the ParallelHeapScanDesc, divided by the number
>> of workers + 1.
>> (b) The cost of the remaining nodes under the funnel works normally.
>> (c) The cost of the funnel is equal to the cost of the hash join plus
>> number of tuples multiplied by per-tuple communication overhead plus a
>> large fixed overhead reflecting the time it takes the workers to
>> start.
>
> IIUC, the change for this would be to remove the change related to
> run cost (divide the run cost by nWorkers + 1) from cost_funnel
> and make a similar change as suggested by point (a) in the cost
> calculation of the partial sequence scan.

Right.

> As of now, we don't do anything which can
> move a Funnel node on top of a hash join, so not sure if you are
> expecting any extra handling as part of point (b) or (c).

But we will want to do that in the future, so we should set up the
costing correctly now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
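For readers following along: the ExplainOnePlan() behavior Robert
describes corresponds to logic roughly like the sketch below
(paraphrased from memory rather than quoted from the tree; only the
discard-unless-INTO shape is the point). CreateIntoRelDestReceiver()
and None_Receiver are existing executor symbols; the surrounding
variable names are illustrative:

    /*
     * EXPLAIN discards tuples because it is EXPLAIN, independent of
     * instrument_options -- except that an INTO clause still needs a
     * real destination.
     */
    DestReceiver *dest;

    if (into)
        dest = CreateIntoRelDestReceiver(into);  /* EXPLAIN ... INTO */
    else
        dest = None_Receiver;                    /* plain EXPLAIN: discard */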
>
> On Fri, Apr 24, 2015 at 8:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> ExecEndFunnel() is not quite right. If the scan is shut down before
>> it runs to completion (e.g. because of LIMIT), then I think we'll call
>> ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
>> think you probably need to create a static subroutine that is called
>> both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
>> case cleaning up whatever resources remain.
>>
> >> - I don't think you need both setup cost and startup cost. Starting
> >> up more workers isn't particularly more expensive than starting up
> >> fewer of them, because most of the overhead is in waiting for them to
> >> actually start, and if the number of workers is reasonable, then they'll
> >> all be doing that in parallel with each other. I suggest removing
> >> parallel_startup_cost and keeping parallel_setup_cost.
> >
> > There is some work (like creation of shm queues, launching of workers)
> > which is done proportional to number of workers during setup time. I
> > have kept 2 parameters to distinguish such work. I think you have a
> > point that start of some or all workers could be parallel, but I feel
> > that still is work proportional to the number of workers. For future
> > parallel operations also such a parameter could be useful where we need
> > to setup IPC between workers or some other stuff where work is proportional
> > to workers.
>
> That's technically true, but the incremental work involved in
> supporting a new worker is extremely small compared with worker startup
> times. I'm guessing that the setup cost is going to be on the order
> of hundred-thousands or millions and the startup cost is going to
> be on the order of tens or ones.
> And I actually hope you *can't* present some contrary evidence.
> Because if you can, then that might mean that we need to cost every
> possible path from 0 up to N workers and let the costing machinery
> decide which one is better.
On Wed, May 6, 2015 at 7:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> - I believe the separation of concerns between ExecFunnel() and
>>> ExecEndFunnel() is not quite right. If the scan is shut down before
>>> it runs to completion (e.g. because of LIMIT), then I think we'll call
>>> ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
>>> think you probably need to create a static subroutine that is called
>>> both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
>>> case cleaning up whatever resources remain.
>
>> Right, will fix as per suggestion.
>
> I observed one issue while working on this review comment. When we
> try to destroy the parallel setup via ExecEndNode (as due to the Limit
> node, it could not be destroyed after consuming all tuples), it waits
> for parallel workers to finish (WaitForParallelWorkersToFinish()) and
> the parallel workers are waiting for the master backend to signal them
> as their queue is full. I think in such a case the master backend needs
> to inform the workers either when the scan is discontinued due to the
> limit node or while waiting for parallel workers to finish.

Isn't this why TupleQueueFunnelShutdown() calls shm_mq_detach()?
That's supposed to unstick the workers; any impending or future writes
will just return SHM_MQ_DETACHED without waiting.

>> That's technically true, but the incremental work involved in
>> supporting a new worker is extremely small compared with worker startup
>> times. I'm guessing that the setup cost is going to be on the order
>> of hundred-thousands or millions and the startup cost is going to
>> be on the order of tens or ones.
>
> Can we safely estimate the cost of restoring parallel state (GUCs,
> combo CID, transaction state, snapshot, etc.) in each worker as a setup
> cost? There could be some work like restoration of locks (acquire all
> or relevant locks at the start of a parallel worker, if we follow your
> proposed design, and even if we don't follow that there could be some
> similar substantial work) which could be substantial and we need to do
> the same for each worker. If you think restoration of parallel state in
> each worker is a pretty small amount of work, then what you say makes
> sense to me.

Well, all the workers restore that state in parallel, so adding it up
across all workers doesn't really make sense. But anyway, no, I don't
think that's a big cost. I think the big cost is going to be the
operating system overhead of process creation. The new process will
incur lots of page faults as it populates its address space and
dirties pages marked copy-on-write. That's where I expect most of the
expense to be.

>> And I actually hope you *can't* present some contrary evidence.
>> Because if you can, then that might mean that we need to cost every
>> possible path from 0 up to N workers and let the costing machinery
>> decide which one is better.
>
> Not necessarily, we can follow a rule that the number of workers
> that need to be used for any parallel statement is equal to the degree
> of parallelism (parallel_seqscan_degree) as set by the user. I think we
> need to do some split up of the number of workers when there are
> multiple parallel operations in a single statement (like sort and
> parallel scan).

Yeah. I'm hoping we will be able to use the same pool of workers for
multiple operations, but I realize that's a feature we haven't designed
yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
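The "static subroutine called from both places" suggested above could
take a shape like this sketch; TupleQueueFunnelShutdown() is the
routine named in this thread, while the funnel field and the done-flag
on FunnelState are assumptions for illustration:

    /*
     * Idempotent shutdown, reachable both from the TupIsNull path in
     * ExecFunnel() and from ExecEndFunnel().
     */
    static void
    FunnelShutdown(FunnelState *node)
    {
        if (node->shutdown_done)                   /* assumed flag */
            return;
        TupleQueueFunnelShutdown(node->funnel);    /* detaches the shm_mqs */
        node->shutdown_done = true;
    }

Because the routine is idempotent, it is safe to call it eagerly as
soon as the last tuple is seen and again defensively from
ExecEndFunnel().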
>
> On Wed, May 6, 2015 at 7:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>> - I believe the separation of concerns between ExecFunnel() and
> >>> ExecEndFunnel() is not quite right. If the scan is shut down before
> >>> it runs to completion (e.g. because of LIMIT), then I think we'll call
> >>> ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
> >>> think you probably need to create a static subroutine that is called
> >>> both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
> >>> case cleaning up whatever resources remain.
> >
> >> Right, will fix as per suggestion.
> >
> > I observed one issue while working on this review comment. When we
> > try to destroy the parallel setup via ExecEndNode (as due to Limit
> > Node, it could not destroy after consuming all tuples), it waits for
> > parallel
> > workers to finish (WaitForParallelWorkersToFinish()) and parallel workers
> > are waiting for master backend to signal them as their queue is full.
> > I think in such a case master backend needs to inform workers either when
> > the scan is discontinued due to limit node or while waiting for parallel
> > workers to finish.
>
> Isn't this why TupleQueueFunnelShutdown() calls shm_mq_detach()?
> That's supposed to unstick the workers; any impending or future writes
> will just return SHM_MQ_DETACHED without waiting.
>
Okay, that can work if we call it in ExecEndNode() before
>
> Well, all the workers restore that state in parallel, so adding it up
> across all workers doesn't really make sense. But anyway, no, I don't
> think that's a big cost. I think the big cost is going to the
> operating system overhead of process creation. The new process will
> incur lots of page faults as it populates its address space and
> dirties pages marked copy-on-write. That's where I expect most of the
> expense to be.
>
Okay, will remove parallel_startup_cost from patch in next version.
On Thu, May 7, 2015 at 3:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > I observed one issue while working on this review comment. When we
>> > try to destroy the parallel setup via ExecEndNode (as due to the
>> > Limit node, it could not be destroyed after consuming all tuples),
>> > it waits for parallel workers to finish
>> > (WaitForParallelWorkersToFinish()) and parallel workers are waiting
>> > for the master backend to signal them as their queue is full. I
>> > think in such a case the master backend needs to inform the workers
>> > either when the scan is discontinued due to the limit node or while
>> > waiting for parallel workers to finish.
>>
>> Isn't this why TupleQueueFunnelShutdown() calls shm_mq_detach()?
>> That's supposed to unstick the workers; any impending or future writes
>> will just return SHM_MQ_DETACHED without waiting.
>
> Okay, that can work if we call it in ExecEndNode() before
> WaitForParallelWorkersToFinish(), however what if we want to do
> something like TupleQueueFunnelShutdown() when the Limit node decides
> to stop processing the outer node. We can traverse the whole plan tree
> and find the nodes where parallel workers need to be stopped, but I
> don't think that's a good way to handle it. If we don't want to stop
> workers from processing until ExecutorEnd()--->ExecEndNode(), then it
> will lead to workers continuing till that time and it won't be easy to
> get instrumentation/buffer usage information from workers (workers fill
> such information for the master backend after execution is complete) as
> that is done before ExecutorEnd(). For Explain Analyze .., we can
> ensure that workers are stopped before fetching that information from
> the Funnel node, but the same is not easy for the buffer usage stats
> required by plugins as that operates at the ExecutorRun() and
> ExecutorFinish() level where we don't have direct access to node level
> information. You can refer to pgss_ExecutorEnd() where it completes the
> storage of stats information before calling ExecutorEnd(). Offhand, I
> could not think of a good way to do this, but one crude way could be to
> introduce a new API (ParallelExecutorEnd()) for such plugins which
> needs to be called before completing the stats accumulation. This API
> will call ExecEndPlan() if the parallelmodeNeeded flag is set and allow
> accumulation of stats (InstrStartNode()/InstrStopNode()).

OK, so if I understand you here, the problem is what to do about an
"orphaned" worker. The Limit node just stops fetching from the lower
nodes, and those nodes don't get any clue that this has happened, so
their workers just sit there until the end of the query.

Of course, that happens already, but it doesn't usually hurt very much,
because the Limit node usually appears at or near the top of the plan.
It could matter, though. Suppose the Limit is for a subquery that has a
Sort somewhere (not immediately) beneath it. My guess is the Sort's
tuplestore will stick around until after the subquery finishes
executing for as long as the top-level query is executing, which in
theory could be a huge waste of resources. In practice, I guess people
don't really write queries that way. If they did, I think we'd have
already developed some general method for fixing this sort of problem.

I think it might be better to try to solve this problem in a more
localized way. Can we arrange for planstate->instrumentation to point
directly into the DSM, instead of copying the data over later? That
seems like it might help, or perhaps there's another approach.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
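A hedged sketch of what "point directly into the DSM" could look like;
shm_toc_allocate() and mul_size() are real parallel-infrastructure
calls, but the per-node array layout and the node-id indexing are
assumptions, not anything from the patch:

    /* Master: carve one Instrumentation slot per plan node out of the
     * DSM segment, so workers can update it in place. */
    Instrumentation *instarray;

    instarray = shm_toc_allocate(pcxt->toc,
                                 mul_size(nplannodes,
                                          sizeof(Instrumentation)));

    /* Worker and master both hang the shared slot off the planstate
     * instead of a backend-local allocation. */
    planstate->instrument = &instarray[node_id];   /* node_id: assumed */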
On Wed, Apr 22, 2015 at 10:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> parallel_seqscan_v14.patch (Attached with this mail)

This patch is not applying/working with the latest head after the
parallel mode patch got committed. Can you please rebase the patch?

Regards,
Hari Babu
Fujitsu Australia
>
> On Wed, Apr 22, 2015 at 10:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > parallel_seqscan_v14.patch (Attached with this mail)
>
> This patch is not applying/working with the latest head after parallel
> mode patch got committed.
> can you please rebase the patch.
>
Thanks for reminding, I am planning to work on remaining review
>
>
> I think it might be better to try to solve this problem in a more
> localized way. Can we arrange for planstate->instrumentation to point
> directly into the DSM, instead of copying the data over later?
On Tue, May 19, 2015 at 8:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, May 11, 2015 at 3:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think it might be better to try to solve this problem in a more
>> localized way. Can we arrange for planstate->instrumentation to point
>> directly into the DSM, instead of copying the data over later?
>
> Yes, we can do that but I am not sure we can do that for pgBufferUsage,
> which is a separate piece of information we need to pass back to the
> master backend. One way could be to change pgBufferUsage to a pointer
> and then allocate the memory for the same at backend startup time and,
> for parallel workers, it should point to DSM. Do you see any simple way
> to handle it?

No, that seems problematic.

> Another way could be that the master backend waits for parallel workers
> to finish before collecting the instrumentation information and buffer
> usage stats. It seems to me that we need this information (stats) after
> execution in the master backend is over, so I think we can safely
> assume that it is okay to finish the execution of parallel workers if
> they have not already finished the execution.

I'm not sure exactly where you plan to insert the wait.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
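If the wait goes where Amit proposes, the shape might be roughly the
sketch below; WaitForParallelWorkersToFinish() and pgBufferUsage are
the real symbols under discussion, while worker_bufusage[] and the
accumulate helper are assumptions:

    /* Before the master reads back stats, make sure every worker has
     * finished and published its counters into the DSM. */
    WaitForParallelWorkersToFinish(node->pcxt);

    for (int i = 0; i < node->pcxt->nworkers; i++)
        accumulate_buffer_usage(&pgBufferUsage,        /* assumed helper */
                                &worker_bufusage[i]);  /* assumed array  */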
>
> On Wed, Apr 22, 2015 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have implemented this idea (note that I have to expose a new API
> > shm_mq_from_handle as TupleQueueFunnel stores shm_mq_handle* and
> > we sum_mq* to call shm_mq_detach) and apart this I have fixed other
> > problems reported on this thread:
> >
> > 1. Execution of initPlan by master backend and then pass the
> > required PARAM_EXEC parameter values to workers.
> > 2. Avoid consuming dsm's by freeing the parallel context after
> > the last tuple is fetched.
> > 3. Allow execution of Result node in worker backend as that can
> > be added as a gating filter on top of PartialSeqScan.
> > 4. Merged parallel heap scan descriptor patch
> >
> > To apply the patch, please follow below sequence:
> >
> > HEAD Commit-Id: 4d930eee
> > parallel-mode-v9.patch [1]
> > assess-parallel-safety-v4.patch [2] (don't forget to run fixpgproc.pl in
> > the patch)
> > parallel_seqscan_v14.patch (Attached with this mail)
>
> Thanks, this version looks like an improvement. However, I still see
> some problems:
>
> - I believe the separation of concerns between ExecFunnel() and
> ExecEndFunnel() is not quite right. If the scan is shut down before
> it runs to completion (e.g. because of LIMIT), then I think we'll call
> ExecEndFunnel() before ExecFunnel() hits the TupIsNull(slot) path. I
> think you probably need to create a static subroutine that is called
> both as soon as TupIsNull(slot) and also from ExecEndFunnel(), in each
> case cleaning up whatever resources remain.
>
> - InitializeParallelWorkers() still mixes together general parallel
> executor concerns with concerns specific to parallel sequential scan
> (e.g. EstimatePartialSeqScanSpace). We have to eliminate everything
> that assumes that what's under a funnel will be, specifically, a
> partial sequential scan.
>
>
> - shm_mq_from_handle() is probably reasonable, but can we rename it
> shm_mq_get_queue()?
>
> - It's hard to believe this is right:
>
> + if (parallelstmt->inst_options)
> + receiver = None_Receiver;
>
> Really? Flush the tuples if there are *any instrumentation options
> whatsoever*? At the very least, that doesn't look too future-proof,
> but I'm suspicious that it's outright incorrect.
>
> - I think ParallelStmt probably shouldn't be defined in parsenodes.h.
> That file is included in a lot of places, and adding all of those
> extra #includes there doesn't seem like a good idea for modularity
> reasons even if you don't care about partial rebuilds. Something that
> includes a shm_mq obviously isn't a "parse" node in any meaningful
> sense anyway.
>
> - I don't think you need both setup cost and startup cost. Starting
> up more workers isn't particularly more expensive than starting up
> fewer of them, because most of the overhead is in waiting for them to
> actually start, and if the number of workers is reasonable, then they'll
> all be doing that in parallel with each other. I suggest removing
> parallel_startup_cost and keeping parallel_setup_cost.
>
> - In cost_funnel(), I don't think it's right to divide the run cost by
> nWorkers + 1. Suppose we've got a plan that looks like this:
>
> Funnel
> -> Hash Join
> -> Partial Seq Scan on a
> -> Hash
> -> Seq Scan on b
>
> The sequential scan on b is going to get executed once per worker,
> whereas the effort for the sequential scan on a is going to be divided
> over all the workers. So the right way to cost this is as follows:
>
> (a) The cost of the partial sequential scan on a is equal to the cost
> of a regular sequential scan, plus a little bit of overhead to account
> for communication via the ParallelHeapScanDesc, divided by the number
> of workers + 1.
> (b) The cost of the remaining nodes under the funnel works normally.
> (c) The cost of the funnel is equal to the cost of the hash join plus
> number of tuples multiplied by per-tuple communication overhead plus a
> large fixed overhead reflecting the time it takes the workers to
> start.
>
> - While create_parallelscan_paths() is quite right to limit the number
> of workers to no more than the number of pages, it's pretty obvious
> that in practice that's way too conservative. I suggest we get
> significantly more aggressive about that, like limiting ourselves to
> one worker per thousand pages. We don't really know exactly what the
> costing factors should be here just yet, but we certainly know that
> spinning up lots of workers to read a handful of pages each must be
> dumb. And we can save a significant amount of planning time here by
> not bothering to generate parallel paths for little tiny relations.
>
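Robert's costing rules (a)-(c) from the quoted review can be restated
as two small functions; this is only a transcription of the quoted text
with assumed parameter names, not code from the patch:

    /* (a) partial scan: full scan cost plus a little coordination
     * overhead, spread over the master and nworkers helpers. */
    double
    cost_partial_seqscan(double seqscan_run_cost,
                         double parallel_hscan_overhead, int nworkers)
    {
        return (seqscan_run_cost + parallel_hscan_overhead) / (nworkers + 1);
    }

    /* (b) nodes between the scan and the funnel cost as usual;
     * (c) the funnel then adds per-tuple transfer cost and a large
     * fixed worker-startup term. */
    double
    cost_funnel_total(double subplan_total_cost, double ntuples,
                      double per_tuple_comm_cost, double parallel_setup_cost)
    {
        return subplan_total_cost
               + ntuples * per_tuple_comm_cost
               + parallel_setup_cost;
    }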
[Jumping in without catching up on entire thread. Please let me know
if these questions have already been covered.]

1. Can you change the name to something like ParallelHeapScan?
Parallel Sequential is a contradiction. (I know this is bikeshedding
and I won't protest further if you keep the name.)

2. Where is the speedup coming from? How much of it is CPU and IO
overlapping (i.e. not leaving disk or CPU idle while the other is
working), and how much from the CPU parallelism? I know this is
difficult to answer rigorously, but it would be nice to have some
breakdown even if for a specific machine.

Regards,
Jeff Davis
>
> [Jumping in without catching up on entire thread. Please let me know
> if these questions have already been covered.]
>
> 1. Can you change the name to something like ParallelHeapScan?
> Parallel Sequential is a contradiction. (I know this is bikeshedding
> and I won't protest further if you keep the name.)
>
> 2. Where is the speedup coming from? How much of it is CPU and IO
> overlapping (i.e. not leaving disk or CPU idle while the other is
> working), and how much from the CPU parallelism? I know this is
> difficult to answer rigorously, but it would be nice to have some
> breakdown even if for a specific machine.
>
On 2015-07-01 PM 02:37, Amit Kapila wrote:
>
> In terms of completeness, I think we should add some documentation
> for this patch, one way is to update about the execution mechanism in
> src/backend/access/transam/README.parallel and then explain about
> new configuration knobs in documentation (.sgml files). Also we
> can have a separate page in itself in documentation under Server
> Programming Section (Parallel Query -> Parallel Scan;
> Parallel Scan Examples; ...)
>
> Another thing to think about with this patch at this stage is do we
> need to break up this patch and if yes, how to break it up into
> multiple patches, so that it can be easier to complete the review.
> I could see that it can be split into 2 or 3 patches.
> a. Infrastructure for parallel execution, like some of the stuff in
> execparallel.c, heapam.c, tqueue.c, etc and all other generic
> (non-nodes specific) code.
> b. Nodes (Funnel and PartialSeqScan) specific code for optimiser
> and executor.
> c. Documentation
>
> Suggestions?

A src/backend/executor/README.parallel?

Thanks,
Amit
On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote:
> For what you are asking to change name for?

There are still some places, at least in the comments, that call it a
parallel sequential scan.

> a. Infrastructure for parallel execution, like some of the stuff in
> execparallel.c, heapam.c, tqueue.c, etc and all other generic
> (non-nodes specific) code.

Did you consider passing tuples through the tqueue by reference rather
than copying? The page should be pinned by the worker process, but
perhaps that's a bad assumption to make?

Regards,
Jeff Davis
> > a. Infrastructure for parallel execution, like some of the stuff in
> > execparallel.c, heapam.c, tqueue.c, etc and all other generic
> > (non-nodes specific) code.
>
> Did you consider passing tuples through the tqueue by reference rather
> than copying? The page should be pinned by the worker process, but
> perhaps that's a bad assumption to make?
>
Is the upcoming PartialAggregate/FinalAggregate a solution for the
problem? More or less, the Funnel node running on a single core has to
process a massive amount of tuples that are fetched in parallel.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
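To spell out the trade-off Jeff raises: today the worker serializes
each tuple into its shm_mq, while by-reference passing would enqueue
only a tuple locator and oblige the worker to hold its buffer pin until
the master has consumed the tuple. A hedged sketch (shm_mq_send() is
the real queue call; the locator struct is hypothetical):

    /* Current approach per the thread: copy the tuple bytes into the
     * queue. */
    shm_mq_send(mqh, tuple->t_len, tuple->t_data, false);

    /* Hypothetical by-reference variant: ship only a locator and keep
     * the page pinned until the master is done with it. */
    struct { BlockNumber blkno; OffsetNumber offnum; } ref = {blkno, offnum};
    shm_mq_send(mqh, sizeof(ref), &ref, false);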
>
> On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote:
>
> > For what you are asking to change name for?
>
> There are still some places, at least in the comments, that call it a
> parallel sequential scan.
>
>
> > a. Infrastructure for parallel execution, like some of the stuff in
> > execparallel.c, heapam.c,tqueue.c, etc and all other generic
> > (non-nodes specific) code.
>
> Did you consider passing tuples through the tqueue by reference rather
> than copying? The page should be pinned by the worker process, but
> perhaps that's a bad assumption to make?
>
On 01/07/15 17:37, Amit Kapila wrote:
> On Tue, Jun 30, 2015 at 4:00 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> >
> > [Jumping in without catching up on entire thread.
> [...]
> >
> > 2. Where is the speedup coming from? How much of it is CPU and IO
> > overlapping (i.e. not leaving disk or CPU idle while the other is
> > working), and how much from the CPU parallelism? I know this is
> > difficult to answer rigorously, but it would be nice to have some
> > breakdown even if for a specific machine.
> >
> Yes, you are right and we have done quite some testing (on the hardware
> available) with this patch (with different approaches) to see how much
> difference it creates for IO and CPU, with respect to IO we have found
> that it doesn't help much [1], though it helps when the data is cached
> and there are really good benefits in terms of CPU [2].
> [...]

I assume your answer refers to a table on one spindle of spinning rust.

QUESTIONS:

1. what about I/O using an SSD?

2. what if the table is in a RAID array (of various types), would
having the table spread over multiple spindles help?

Cheers,
Gavin
>
> On 01/07/15 17:37, Amit Kapila wrote:
>>
>> Yes, you are right and we have done quite some testing (on the hardware
>> available) with this patch (with different approaches) to see how much
>> difference it creates for IO and CPU, with respect to IO we have found
>> that it doesn't help much [1], though it helps when the data is cached
>> and there are really good benefits in terms of CPU [2].
>>
> [...]
>
> I assume your answer refers to a table on one spindle of spinning rust.
>
>
> QUESTIONS:
>
> 1. what about I/O using an SSD?
>
> 2. what if the table is in a RAID array (of various types), would
> having the table spread over multiple spindles help?
>
I think it will be helpful if we could get the numbers on more types of machines,
On Fri, 2015-07-03 at 17:35 +0530, Amit Kapila wrote:
> Attached, find the rebased version of patch.
>
Comments:

* The heapam.c changes seem a little ad-hoc. Conceptually, which
portions should be affected by parallelism? How do we know we didn't
miss something?

* Why is initscan getting the number of blocks from the structure? Is
it just to avoid an extra syscall, or is there a correctness issue
there? Is initscan expecting that heap_parallelscan_initialize is
always called first (if parallel)? Please add a comment explaining the
above.

* What's the difference between scan->rs_nblocks and
scan->rs_parallel->phs_nblocks? Same for rs_rd->rd_id and phs_relid.

* It might be good to separate out some fields which differ between the
normal heap scan and the parallel heap scan. Perhaps put rs_ctup,
rs_cblock, and rs_cbuf into a separate structure, which is always NULL
during a parallel scan. That way we don't accidentally use a
non-parallel field when doing a parallel scan.

* Is there a reason that partial scans can't work with syncscan? It
looks like you're not choosing the starting block in the same way, so
it always starts at zero and never does syncscan. If we don't want to
mix syncscan and partial scan, that's fine, but it should be more
explicit.

I'm trying to understand where tqueue.c fits in. It seems very closely
tied to the Funnel operator, because any change to the way Funnel works
would almost certainly require changes in tqueue.c. But "tqueue" is a
generic name for the file, so something seems off. Either we should
explicitly make it the supporting routines for the Funnel operator, or
we should try to generalize it a little.

I still have quite a bit to look at, but this is a start.

Regards,
Jeff Davis
>
> On Fri, 2015-07-03 at 17:35 +0530, Amit Kapila wrote:
>
> > Attached, find the rebased version of patch.
> >
>
> Comments:
>
>
> * The heapam.c changes seem a little ad-hoc. Conceptually, which
> portions should be affected by parallelism? How do we know we didn't
> miss something?
> * Why is initscan getting the number of blocks from the structure? Is it
> just to avoid an extra syscall, or is there a correctness issue there?
> Is initscan expecting that heap_parallelscan_initialize is always called
> first (if parallel)? Please add a comment explaining above.
> * What's the difference between scan->rs_nblocks and
> scan->rs_parallel->phs_nblocks? Same for rs_rd->rd_id and phs_relid.
> * It might be good to separate out some fields which differ between the
> normal heap scan and the parallel heap scan. Perhaps put rs_ctup,
> rs_cblock, and rs_cbuf into a separate structure, which is always NULL
> during a parallel scan. That way we don't accidentally use a
> non-parallel field when doing a parallel scan.
> * Is there a reason that partial scans can't work with syncscan? It
> looks like you're not choosing the starting block in the same way, so it
> always starts at zero and never does syncscan. If we don't want to mix
> syncscan and partial scan, that's fine, but it should be more explicit.
>
> I'm trying to understand where tqueue.c fits in. It seems very closely
> tied to the Funnel operator, because any change to the way Funnel works
> would almost certainly require changes in tqueue.c. But "tqueue" is a
> generic name for the file, so something seems off. Either we should
> explicitly make it the supporting routines for the Funnel operator, or
> we should try to generalize it a little.
>
> I still have quite a bit to look at, but this is a start.
>
Thanks for the review.
On Mon, 2015-07-06 at 10:37 +0530, Amit Kapila wrote:
> Or the other way to look at it could be separate out fields which are
> required for parallel scan which is done currently by forming a
> separate structure ParallelHeapScanDescData.

I was suggesting that you separate out both the normal scan fields and
the partial scan fields, that way we're sure that rs_nblocks is not
accessed during a parallel scan.

Or, you could try wrapping the parts of heapam.c that are affected by
parallelism into new static functions.

> The reason why partial scan can't be mixed with sync scan is that in
> parallel scan, it performs the scan of the heap by synchronizing blocks
> (each parallel worker scans a block and then asks for a next block to
> scan) among parallel workers. Now if we try to make sync scans work
> along with it, the synchronization among parallel workers will go for
> a toss. It might not be impossible to make that work in some way, but
> not sure if it is important enough for sync scans to work along with
> parallel scan.

I haven't tested it, but I think it would still be helpful. The block
accesses are still in order even during a partial scan, so why wouldn't
it help?

You might be concerned about the reporting of a block location, which
would become more noisy with increased parallelism. But in my original
testing, sync scans weren't very sensitive to slight deviations,
because of caching effects.

> tqueue.c is mainly designed to pass tuples between parallel workers
> and currently it is used in the Funnel operator to gather the tuples
> generated by all the parallel workers. I think we can use it for any
> other operator which needs tuple communication among parallel workers.

Some specifics of the Funnel operator seem to be a part of tqueue,
which doesn't make sense to me. For instance, reading from the set of
queues in a round-robin fashion is part of the Funnel algorithm, and
doesn't seem suitable for a generic tuple communication mechanism (that
would never allow order-sensitive reading, for example).

Regards,
Jeff Davis
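To make the round-robin point concrete, here is a tiny standalone model
(deliberately not PostgreSQL code) of the read pattern Jeff describes;
any consumer built this way cannot honor a per-queue ordering, which is
why it reads as Funnel-specific rather than generic:

    typedef struct { int *items; int len; int pos; } WorkerQueue;

    /* Return the next available item, polling queues in turn;
     * NULL when every queue is drained. */
    static int *
    funnel_next(WorkerQueue *queues, int nqueues, int *cursor)
    {
        for (int tried = 0; tried < nqueues; tried++)
        {
            WorkerQueue *q = &queues[*cursor];

            *cursor = (*cursor + 1) % nqueues;
            if (q->pos < q->len)
                return &q->items[q->pos++];
        }
        return NULL;
    }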
On Fri, Jul 3, 2015 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached, find the rebased version of patch.
>
> Note - You need to first apply the assess-parallel-safety patch which you
> can find at:
> http://www.postgresql.org/message-id/CAA4eK1JjsfE_dOsHTr_z1P_cBKi_X4C4X3d7Nv=VWX9fs7qdJA@mail.gmail.com

I ran some performance tests on a 16 core machine with large shared
buffers, so there is no IO involved.
With the default value of cpu_tuple_comm_cost, parallel plan is not
getting generated even if we are selecting 100K records from 40
million records. So I changed the value to '0' and collected the
performance readings.

Here are the performance numbers:

selectivity   Seq scan   Parallel scan (ms)
(millions)    (ms)       2 workers  4 workers  8 workers
0.1           11498.93    4821.40    3305.84    3291.90
0.4           10942.98    4967.46    3338.58    3374.00
0.8           11619.44    5189.61    3543.86    3534.40
1.5           12585.51    5718.07    4162.71    2994.90
2.7           14725.66    8346.96   10429.05    8049.11
5.4           18719.00   20212.33   21815.19   19026.99
7.2           21955.79   28570.74   28217.60   27042.27

The average table row size is around 500 bytes and the query selection
column width is around 36 bytes. When the query selectivity goes beyond
10% of the total table records, the parallel scan performance drops.

Regards,
Hari Babu
Fujitsu Australia
>
> On Mon, 2015-07-06 at 10:37 +0530, Amit Kapila wrote:
>
> > Or the other way to look at it could be separate out fields which are
> > required for parallel scan which is done currently by forming a
> > separate structure ParallelHeapScanDescData.
> >
> I was suggesting that you separate out both the normal scan fields and
> the partial scan fields, that way we're sure that rs_nblocks is not
> accessed during a parallel scan.
>
> Or, you could try wrapping the parts of heapam.c that are affected by
> parallelism into new static functions.
>
> > The reason why partial scan can't be mixed with sync scan is that in
> > parallel
> > scan, it performs the scan of heap by synchronizing blocks (each
> > parallel worker
> > scans a block and then asks for a next block to scan) among parallel
> > workers.
> > Now if we try to make sync scans work along with it, the
> > synchronization among
> > parallel workers will go for a toss. It might not be impossible to
> > make that
> > work in some way, but not sure if it is important enough for sync
> > scans to work
> > along with parallel scan.
>
> I haven't tested it, but I think it would still be helpful. The block
> accesses are still in order even during a partial scan, so why wouldn't
> it help?
>
> You might be concerned about the reporting of a block location, which
> would become more noisy with increased parallelism. But in my original
> testing, sync scans weren't very sensitive to slight deviations, because
> of caching effects.
>
> > tqueue.c is mainly designed to pass tuples between parallel workers
> > and currently it is used in Funnel operator to gather the tuples
> > generated
> > by all the parallel workers. I think we can use it for any other
> > operator
> > which needs tuple communication among parallel workers.
>
> Some specifics of the Funnel operator seem to be a part of tqueue, which
> doesn't make sense to me. For instance, reading from the set of queues
> in a round-robin fashion is part of the Funnel algorithm, and doesn't
> seem suitable for a generic tuple communication mechanism (that would
> never allow order-sensitive reading, for example).
>
>
> On Fri, Jul 3, 2015 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Attached, find the rebased version of patch.
> >
> > Note - You need to first apply the assess-parallel-safety patch which you
> > can find at:
> > http://www.postgresql.org/message-id/CAA4eK1JjsfE_dOsHTr_z1P_cBKi_X4C4X3d7Nv=VWX9fs7qdJA@mail.gmail.com
>
> I ran some performance tests on a 16 core machine with large shared
> buffers, so there is no IO involved.
> With the default value of cpu_tuple_comm_cost, parallel plan is not
> getting generated even if we are selecting 100K records from 40
> million records. So I changed the value to '0' and collected the
> performance readings.
>
> Here are the performance numbers:
>
>
> The average table row size is around 500 bytes and query selection
> column width is around 36 bytes.
> when the query selectivity goes more than 10% of total table records,
> the parallel scan performance is dropping.
>
On Tue, 2015-07-07 at 09:27 +0530, Amit Kapila wrote:
> I am not sure how many blocks difference could be considered okay for
> deviation?

In my testing (a long time ago) deviations of tens of blocks didn't
show a problem. However, an assumption of the sync scan work was that
the CPU is processing faster than the IO system; whereas the parallel
scan patch assumes that the IO system is faster than a single core. So
perhaps the features are incompatible after all. Only testing will say
for sure.

Then again, syncscans are designed in such a way that they are unlikely
to hurt in any situation. Even if the scans diverge (or never converge
in the first place), it shouldn't be worse than starting at block zero
every time.

I'd prefer to leave syncscans intact for parallel scans unless you find
a reasonable situation where they perform worse. This shouldn't add any
complexity to the patch (if it does, let me know).

Regards,
Jeff Davis
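As a concrete illustration of the difference being discussed: a
sync-aware partial scan would pick its starting block from the location
reported by concurrent scans and wrap around, instead of always
starting at block zero. A minimal standalone sketch, with illustrative
names that are not from the patch:

    typedef unsigned int BlockNum;

    /* Start where other scans currently are (when syncscan applies);
     * the caller then scans nblocks blocks with wraparound. */
    static BlockNum
    partial_scan_start(BlockNum reported_location, BlockNum nblocks,
                       int syncscan_allowed)
    {
        if (!syncscan_allowed || nblocks == 0)
            return 0;
        return reported_location % nblocks;
    }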
Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached, find the rebased version of patch.

[I haven't read this thread so far, sorry for possibly redundant
comment.]

I noticed that false is passed for the required_outer argument of
create_partialseqscan_path(), while NULL seems to be cleaner in terms
of the C language.

But in terms of semantics, I'm not sure this is correct anyway. Why
does create_parallelscan_paths() not accept the actual
rel->lateral_relids, just like create_seqscan_path() does? (See
set_plain_rel_pathlist().) If there's a reason for your approach, I
think it's worth a comment.

BTW, emacs shows whitespace on an otherwise empty line
parallelpath.c:57.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
>
> Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Attached, find the rebased version of patch.
>
> [I haven't read this thread so far, sorry for possibly redundant comment.]
>
> I noticed that false is passed for the required_outer argument of
> create_partialseqscan_path(), while NULL seems to be cleaner in terms of C
> language.
>
> But in terms of semantics, I'm not sure this is correct anyway. Why does
> create_parallelscan_paths() not accept the actual rel->lateral_relids, just
> like create_seqscan_path() does? (See set_plain_rel_pathlist().) If there's
> reason for your approach, I think it's worth a comment.
>
On Thu, Jul 16, 2015 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Thanks, I will fix this in next version of patch.
>

I am posting in this thread as I am not sure whether it needs a
separate thread or not.

I went through the code and found that the newly added funnel node is
tightly coupled with partial seq scan; in order to add many more
parallel plans along with parallel seq scan, we need to remove the
integration of this node with partial seq scan.

To achieve the same, I have the following ideas.

Plan:
1) Add the funnel path immediately for every parallel path similar to
the current parallel seq scan, but during plan generation generate the
funnel plan only for the top funnel path and ignore the rest of the
funnel paths.
2) Instead of adding a funnel path immediately after the partial seq
scan path is generated, add the funnel path in grouping_planner once
the final rel path is generated, before creating the plan.

Execution:
The funnel execution varies based on the plan node below it.
1) partial scan - Funnel does the local scan also and returns the tuples
2) partial agg - Funnel does the merging of aggregate results and
returns the final result.

Any other better ideas to achieve the same?

Regards,
Hari Babu
Fujitsu Australia
>
> On Thu, Jul 16, 2015 at 1:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks, I will fix this in next version of patch.
> >
>
> I am posting in this thread as I am not sure, whether it needs a
> separate thread or not?
>
> I went through the code and found that the newly added funnel node
> is tightly coupled with
> partial seq scan, in order to add many more parallel plans along with
> parallel seq scan,
> we need to remove the integration of this node with partial seq scan.
>
> To achieve the same, I have the following ideas.
>
>
> Execution:
> The funnel execution varies based on the below plan node.
> 1) partial scan - Funnel does the local scan also and returns the tuples
> 2) partial agg - Funnel does the merging of aggregate results and
> returns the final result.
>
> Any other better ideas to achieve the same?
>
On Mon, Jul 20, 2015 at 3:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 17, 2015 at 1:22 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> I went through the code and found that the newly added funnel node is
>> tightly coupled with partial seq scan; in order to add many more
>> parallel plans along with parallel seq scan, we need to remove the
>> integration of this node with partial seq scan.
>
> This assumption is wrong, the Funnel node can execute any node beneath
> it (Refer ExecFunnel->funnel_getnext->ExecProcNode, similarly you
> can see exec_parallel_stmt).

Yes, the funnel node can execute any node beneath it. But during the
planning phase, the funnel path is added on top of the partial scan
path. I just want the same to be enhanced to support other parallel
nodes.

> Yes, currently the nodes supported under Funnel nodes are limited,
> like partialseqscan and result (due to reasons mentioned upthread,
> like readfuncs.c doesn't have support to read Plan nodes, which is
> required for the worker backend to read the PlannedStmt; of course we
> can add them, but as we are supporting parallelism for limited nodes,
> I have not enhanced readfuncs.c), but in general the basic
> infrastructure is designed in such a way that it can support other
> nodes beneath it.
>
> Basically Funnel will execute any node beneath it, the Funnel node
> itself is not responsible for doing the local scan or any form of
> consolidation of results; as of now, it has these 3 basic properties:
> – Has one child, runs multiple copies in parallel.
> – Combines the results into a single tuple stream.
> – Can run the child itself if no workers are available.

+ if (!funnelstate->local_scan_done)
+ {
+     outerPlan = outerPlanState(funnelstate);
+
+     outerTupleSlot = ExecProcNode(outerPlan);

From the above code in the funnel_getnext function, it directly calls
the node below to do the scan on the backend side also. This code
should refer to the type of the node below it, and based on that decide
whether to go for the backend scan. I feel always executing the outer
plan may not be correct for other parallel nodes.

>> Any other better ideas to achieve the same?
>
> Refer slides 16-19 in the Parallel Sequential Scan presentation at PGCon
> https://www.pgcon.org/2015/schedule/events/785.en.html

Thanks for the information.

> I don't have a very clear idea what is the best way to transform the
> nodes in the optimizer, but I think we can figure that out later unless
> the majority of people see that as a blocking factor.

I am also not finding it as a blocking factor for parallel scan.
I wrote the above mail to get some feedback/suggestions from hackers
on how to proceed in adding other parallelism nodes along with
parallel scan.

Regards,
Hari Babu
Fujitsu Australia
On Mon, Jul 6, 2015 at 8:49 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I ran some performance tests on a 16 core machine with large shared
> buffers, so there is no IO involved.
> With the default value of cpu_tuple_comm_cost, parallel plan is not
> getting generated even if we are selecting 100K records from 40
> million records. So I changed the value to '0' and collected the
> performance readings.
>
> Here are the performance numbers:
>
> selectivity   Seq scan   Parallel scan (ms)
> (millions)    (ms)       2 workers  4 workers  8 workers
> 0.1           11498.93    4821.40    3305.84    3291.90
> 0.4           10942.98    4967.46    3338.58    3374.00
> 0.8           11619.44    5189.61    3543.86    3534.40
> 1.5           12585.51    5718.07    4162.71    2994.90
> 2.7           14725.66    8346.96   10429.05    8049.11
> 5.4           18719.00   20212.33   21815.19   19026.99
> 7.2           21955.79   28570.74   28217.60   27042.27
>
> The average table row size is around 500 bytes and query selection
> column width is around 36 bytes.
> when the query selectivity goes more than 10% of total table records,
> the parallel scan performance is dropping.

Thanks for doing this testing. I think that is quite valuable.

I am not too concerned about the fact that queries where more than 10%
of records are selected do not speed up. Obviously, it would be nice to
improve that, but I think that can be left as an area for future
improvement.

One thing I noticed that is a bit dismaying is that we don't get a lot
of benefit from having more workers. Look at the 0.1 data. At 2
workers, if we scaled perfectly, we would be 3x faster (since the
master can do work too), but we are actually 2.4x faster. Each process
is on the average 80% efficient. That's respectable. At 4 workers, we
would be 5x faster with perfect scaling; here we are 3.5x faster. So
the third and fourth worker were about 50% efficient. Hmm, not as good.
But then going up to 8 workers bought us basically nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> One thing I noticed that is a bit dismaying is that we don't get a lot
> of benefit from having more workers. Look at the 0.1 data. At 2
> workers, if we scaled perfectly, we would be 3x faster (since the
> master can do work too), but we are actually 2.4x faster. Each
> process is on the average 80% efficient. That's respectable. At 4
> workers, we would be 5x faster with perfect scaling; here we are 3.5x
> faster. So the third and fourth worker were about 50% efficient.
> Hmm, not as good. But then going up to 8 workers bought us basically
> nothing.
>
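Robert's efficiency figures can be checked directly against the
0.1M-selectivity row of Haribabu's table; the helper below just
restates the arithmetic:

    /* speedup = T_seq / T_parallel; perfect scaling would be
     * (nworkers + 1) because the master works too. */
    static double
    parallel_efficiency(double t_seq, double t_par, int nworkers)
    {
        return (t_seq / t_par) / (nworkers + 1);
    }

    /* 0.1M rows: 2 workers: (11498.93 / 4821.40) / 3 = 0.80
     *            4 workers: (11498.93 / 3305.84) / 5 = 0.70
     *            8 workers: (11498.93 / 3291.90) / 9 = 0.39 */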
Hi Amit,

The latest v16 patch cannot be applied to the latest master as is.
434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various lines in
heapam.c, so it probably conflicts with this.

[kaigai@magro sepgsql]$ cat ~/patch/parallel_seqscan_v16.patch | patch -p1
patching file src/backend/access/common/printtup.c
patching file src/backend/access/heap/heapam.c
Hunk #4 succeeded at 499 (offset 10 lines).
Hunk #5 succeeded at 533 (offset 10 lines).
Hunk #6 FAILED at 678.
Hunk #7 succeeded at 790 (offset 10 lines).
Hunk #8 succeeded at 821 (offset 10 lines).
Hunk #9 FAILED at 955.
Hunk #10 succeeded at 1365 (offset 10 lines).
Hunk #11 succeeded at 1375 (offset 10 lines).
Hunk #12 succeeded at 1384 (offset 10 lines).
Hunk #13 succeeded at 1393 (offset 10 lines).
Hunk #14 succeeded at 1402 (offset 10 lines).
Hunk #15 succeeded at 1410 (offset 10 lines).
Hunk #16 succeeded at 1439 (offset 10 lines).
Hunk #17 succeeded at 1533 (offset 10 lines).
2 out of 17 hunks FAILED -- saving rejects to file
src/backend/access/heap/heapam.c.rej

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Thu, Jul 23, 2015 at 9:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jul 22, 2015 at 9:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> One thing I noticed that is a bit dismaying is that we don't get a lot
>> of benefit from having more workers. Look at the 0.1 data. At 2
>> workers, if we scaled perfectly, we would be 3x faster (since the
>> master can do work too), but we are actually 2.4x faster. Each
>> process is on the average 80% efficient. That's respectable. At 4
>> workers, we would be 5x faster with perfect scaling; here we are 3.5x
>> faster. So the third and fourth worker were about 50% efficient.
>> Hmm, not as good. But then going up to 8 workers bought us basically
>> nothing.
>
> I think the improvement also depends on how costly the qualification
> is; if it is costly, even for the same selectivity the gains will show
> up to a higher number of clients, and for simple qualifications we will
> see that the cost of having more workers will start dominating
> (processing data over multiple tuple queues) over the benefit we can
> achieve by them.

Yes, that's correct. When the qualification cost is increased, the
performance also increases with the number of workers.

Instead of using all the configured workers per query, how about
deciding the number of workers based on the cost of the qualification?
I am not sure whether we have any information available to find out the
qualification cost. This way the workers will be distributed to all
backends properly.

Regards,
Hari Babu
Fujitsu Australia
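One way to read Haribabu's suggestion is a heuristic like the sketch
below. The planner does have a per-tuple qual cost available
(cost_qual_eval() computes one), but the formula and names here are
purely illustrative, not from the patch:

    /* Hypothetical: scale the worker count by how expensive the qual
     * is relative to the basic per-tuple CPU cost. */
    static int
    workers_for_qual(double qual_cost_per_tuple, double cpu_tuple_cost,
                     int max_workers)
    {
        int w = (int) (qual_cost_per_tuple / cpu_tuple_cost);

        if (w < 1)
            w = 1;
        return (w > max_workers) ? max_workers : w;
    }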
Hi Amit,

Could you tell me the code intention around ExecInitFunnel()?

ExecInitFunnel() calls InitFunnel() that opens the relation to be
scanned by the underlying PartialSeqScan and sets up the
ss_ScanTupleSlot of its scanstate.
According to the comment of InitFunnel(), it opens the relation and
takes an appropriate lock on it. However, an equivalent initialization
is also done in InitPartialScanRelation().

Why does it acquire the relation lock twice?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
>
> Hi Amit,
>
> Could you tell me the code intention around ExecInitFunnel()?
>
> ExecInitFunnel() calls InitFunnel() that opens the relation to be
> scanned by the underlying PartialSeqScan and setup ss_ScanTupleSlot
> of its scanstate.
> According to the comment of InitFunnel(), it open the relation and
> takes appropriate lock on it. However, an equivalent initialization
> is also done on InitPartialScanRelation().
>
> Why does it acquire the relation lock twice?
>
I think locking twice is not required, it is just that I have used the API
> On Wed, Jul 29, 2015 at 7:32 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> >
> > Could you tell me the code intention around ExecInitFunnel()?
> >
> > ExecInitFunnel() calls InitFunnel() that opens the relation to be
> > scanned by the underlying PartialSeqScan and sets up the
> > ss_ScanTupleSlot of its scanstate.
>
> The main need is for the relation descriptor, which is then required to
> set the scan tuple's slot. Basically it is required for tuples flowing
> from a worker, which will use the scan tuple slot of FunnelState.
>
> > According to the comment of InitFunnel(), it opens the relation and
> > takes an appropriate lock on it. However, an equivalent initialization
> > is also done in InitPartialScanRelation().
> >
> > Why does it acquire the relation lock twice?
>
> I think locking twice is not required, it is just that I have used the
> API ExecOpenScanRelation() which is used during other nodes'
> initialisation, due to which it locks twice. I think in general it
> should be harmless.

Thanks, I understood the reason for the implementation.

It looks to me like this design is not problematic even if Funnel gets
the capability to have multiple sub-plans, and thus is not associated
with a particular relation, as long as the target-list and
projection-info are appropriately initialized.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
Amit,

Let me ask three more detailed questions.

Why does Funnel have a valid qual of the subplan?
The 2nd argument of make_funnel() is the qualifier of the subplan
(PartialSeqScan); it is then initialized at ExecInitFunnel, but never
executed at run-time. Why does the Funnel node have a useless qualifier
expression here (even though it is harmless)?

Why is Funnel derived from Scan? Even though it constructs
a compatible target-list with the underlying partial-scan node,
that does not require the node to also be derived from Scan.
For example, Sort or Append don't change the target-list definition
from their input, and also don't have their own qualifier.
It seems to me the definition below is more suitable...

typedef struct Funnel
{
    Plan        plan;
    int         num_workers;
} Funnel;

Does ExecFunnel() need to have a special code path to handle
EvalPlanQual()? Probably, it just calls the underlying node in the
local context. ExecScan() of PartialSeqScan will check its
qualifier against estate->es_epqTuple[].

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
>
> Amit,
>
> Let me ask three more detailed questions.
>
> Why does Funnel have a valid qual of the subplan?
> The 2nd argument of make_funnel() is the qualifier of the subplan
> (PartialSeqScan); it is initialized at ExecInitFunnel, but never
> executed at run-time. Why does the Funnel node have a useless
> qualifier expression here (even though it is harmless)?
>
> Why is Funnel derived from Scan? Even though it constructs
> a target-list compatible with the underlying partial-scan node,
> that does not require the node to also be derived from Scan.
>
> Does ExecFunnel() need to have a special code path to handle
> EvalPlanQual()? Probably, it just calls the underlying node in the
> local context. ExecScan() of PartialSeqScan will check its
> qualifier against estate->es_epqTuple[].
>
>
> Hi Amit,
>
> The latest v16 patch cannot be applied to the latest
> master as is.
> 434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various
> lines in heapam.c, so it probably conflicts with this.
>
Attachment
On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 23, 2015 at 7:43 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>>
>> Hi Amit,
>>
>> The latest v16 patch cannot be applied to the latest
>> master as is.
>> 434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various
>> lines in heapam.c, so it probably conflicts with this.
>>
>
> Attached, find the rebased version of patch. It fixes the comments raised
> by Jeff Davis and Antonin Houska. The main changes in this version are
> now it supports sync scan along with parallel sequential scan (refer
> heapam.c)
> and the patch has been split into two parts, first contains the code for
> Funnel node and infrastructure to support the same and second contains
> the code for PartialSeqScan node and its infrastructure.

Thanks for the updated patch.

With a subquery, parallel scan has a problem; please refer below.

postgres=# explain select * from test01 where kinkocord not in (select
kinkocord from test02 where tenpocord = '001');
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Funnel on test01  (cost=0.00..155114352184.12 rows=20000008 width=435)
   Filter: (NOT (SubPlan 1))
   Number of Workers: 16
   ->  Partial Seq Scan on test01  (cost=0.00..155114352184.12 rows=20000008 width=435)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..130883.67 rows=385333 width=5)
                 ->  Funnel on test02  (cost=0.00..127451.01 rows=385333 width=5)
                       Filter: (tenpocord = '001'::bpchar)
                       Number of Workers: 16
                       ->  Partial Seq Scan on test02  (cost=0.00..127451.01 rows=385333 width=5)
                             Filter: (tenpocord = '001'::bpchar)
   SubPlan 1
     ->  Materialize  (cost=0.00..130883.67 rows=385333 width=5)
           ->  Funnel on test02  (cost=0.00..127451.01 rows=385333 width=5)
                 Filter: (tenpocord = '001'::bpchar)
                 Number of Workers: 16
                 ->  Partial Seq Scan on test02  (cost=0.00..127451.01 rows=385333 width=5)
                       Filter: (tenpocord = '001'::bpchar)
(19 rows)

postgres=# explain analyze select * from test01 where kinkocord not in
(select kinkocord from test02 where tenpocord = '001');
ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
CONTEXT: parallel worker, pid 32879
postgres=#

And also the number of workers (16) that is shown in the explain analyze
plan is not what was actually allotted, because in my configuration I set
max_worker_processes to only 8. I feel the plan should show the allotted
workers, not the planned workers.
If the query execution takes time because of lack of workers and the
plan is showing 16 workers, the user may think that even with 16 workers
the query is slower, but actually it is not.

Regards,
Hari Babu
Fujitsu Australia
On Wed, Sep 9, 2015 at 2:17 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> And also the number of workers (16) that is shown in the explain analyze
> plan is not what was actually allotted, because in my configuration I set
> max_worker_processes to only 8. I feel the plan should show the allotted
> workers, not the planned workers.
> If the query execution takes time because of lack of workers and the
> plan is showing 16 workers, the user may think that even with 16 workers
> the query is slower, but actually it is not.

I would expect EXPLAIN should show the # of workers planned, and
EXPLAIN ANALYZE should show both the planned and actual values.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
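As an illustration of that suggestion (hypothetical output, not produced by
the patch as posted; the labels are assumptions), the plan header under
EXPLAIN ANALYZE might then read:

 Funnel on test01  (cost=... rows=... width=...) (actual time=... rows=... loops=1)
   Number of Workers Planned: 16
   Number of Workers Launched: 8

while plain EXPLAIN would print only the planned count, since nothing has
been launched at that point.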
>
> With a subquery, parallel scan has a problem; please refer below.
>
>
> postgres=# explain analyze select * from test01 where kinkocord not in
> (select kinkocord from test02 where tenpocord = '001');
> ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
> CONTEXT: parallel worker, pid 32879
> postgres=#
>
>
> On Wed, Sep 9, 2015 at 2:17 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> > And also the number of workers (16) that is shown in the explain analyze
> > plan is not what was actually allotted, because in my configuration I set
> > max_worker_processes to only 8. I feel the plan should show the allotted
> > workers, not the planned workers.
> > If the query execution takes time because of lack of workers and the
> > plan is showing 16 workers, the user may think that even with 16 workers
> > the query is slower, but actually it is not.
>
> I would expect EXPLAIN should show the # of workers planned, and
> EXPLAIN ANALYZE should show both the planned and actual values.
>
On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>> With a subquery, parallel scan has a problem; please refer below.
>>
>> postgres=# explain analyze select * from test01 where kinkocord not in
>> (select kinkocord from test02 where tenpocord = '001');
>> ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
>> CONTEXT: parallel worker, pid 32879
>> postgres=#
>
> The problem here is that readfuncs.c doesn't have support for reading
> SubPlan nodes. I have added support for some of the nodes, but it seems
> the SubPlan node also needs to be added. Now I think this is okay if the
> SubPlan is any node other than Funnel, but if the SubPlan contains a
> Funnel, then each worker needs to spawn other workers to execute the
> SubPlan, which I am not sure is the best way. Another possibility could
> be to store the results of the SubPlan in some tuplestore or some other
> way and then pass those to the workers, which again doesn't sound like a
> promising way considering we might have a hashed SubPlan for which we
> need to build a hashtable. Yet another way could be, for such cases, to
> execute the Filter in the master node only.

IIUC, there are two separate issues here:

1. We need to have readfuncs support for all the right plan nodes.
Maybe we should just bite the bullet and add readfuncs support for all
plan nodes. But if not, we can add support for whatever we need.

2. I think it's probably a good idea - at least for now, and maybe
forever - to avoid nesting parallel plans inside of other parallel
plans. It's hard to imagine that being a win in a case like this, and
it certainly adds a lot more cases to think about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Wed, Sep 9, 2015 at 11:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Sep 9, 2015 at 11:47 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> > wrote:
> >> With a subquery, parallel scan has a problem; please refer below.
> >>
> >> postgres=# explain analyze select * from test01 where kinkocord not in
> >> (select kinkocord from test02 where tenpocord = '001');
> >> ERROR: badly formatted node string "SUBPLAN :subLinkType 2 :testexpr"...
> >> CONTEXT: parallel worker, pid 32879
> >> postgres=#
> >
> > The problem here is that readfuncs.c doesn't have support for reading
> > SubPlan nodes. I have added support for some of the nodes, but it seems
> > the SubPlan node also needs to be added. Now I think this is okay if the
> > SubPlan is any node other than Funnel, but if the SubPlan contains a
> > Funnel, then each worker needs to spawn other workers to execute the
> > SubPlan, which I am not sure is the best way. Another possibility could
> > be to store the results of the SubPlan in some tuplestore or some other
> > way and then pass those to the workers, which again doesn't sound like a
> > promising way considering we might have a hashed SubPlan for which we
> > need to build a hashtable. Yet another way could be, for such cases, to
> > execute the Filter in the master node only.
>
> IIUC, there are two separate issues here:
>
> 1. We need to have readfuncs support for all the right plan nodes.
> Maybe we should just bite the bullet and add readfuncs support for all
> plan nodes. But if not, we can add support for whatever we need.
>
> 2. I think it's probably a good idea - at least for now, and maybe
> forever - to avoid nesting parallel plans inside of other parallel
> plans. It's hard to imagine that being a win in a case like this, and
> it certainly adds a lot more cases to think about.
>
I also think that avoiding nested parallel plans is a good step forward.
On Thu, Sep 10, 2015 at 2:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Sep 10, 2015 at 4:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> IIUC, there are two separate issues here:
>
> Yes.
>
>> 1. We need to have readfuncs support for all the right plan nodes.
>> Maybe we should just bite the bullet and add readfuncs support for all
>> plan nodes. But if not, we can add support for whatever we need.
>>
>> 2. I think it's probably a good idea - at least for now, and maybe
>> forever - to avoid nesting parallel plans inside of other parallel
>> plans. It's hard to imagine that being a win in a case like this, and
>> it certainly adds a lot more cases to think about.
>
> I also think that avoiding nested parallel plans is a good step forward.

I reviewed the parallel_seqscan_funnel_v17.patch and following are my
comments. I will continue my review with the
parallel_seqscan_partialseqscan_v17.patch.

+ if (inst_options)
+ {
+ instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
+ *inst_options = *instoptions;
+ if (inst_options)

Same pointer variable check; it should be if (*inst_options), as per the
estimate and store functions.

+ if (funnelstate->ss.ps.ps_ProjInfo)
+ slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
+ else
+ slot = funnelstate->ss.ss_ScanTupleSlot;

Currently, there will not be a projinfo for the funnel node, so it always
uses the scan tuple slot. In case it is different, we need to add an
ExecProject call in the ExecFunnel function. Currently it is not present;
either we can document it or add the function call.

+ if (!((*dest->receiveSlot) (slot, dest)))
+ break;

and

+void
+TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
+{
+ if (funnel)
+ {
+ int i;
+ shm_mq_handle *mqh;
+ shm_mq *mq;
+ for (i = 0; i < funnel->nqueues; i++)
+ {
+ mqh = funnel->queue[i];
+ mq = shm_mq_get_queue(mqh);
+ shm_mq_detach(mq);
+ }
+ }
+}

Using this function, the backend detaches from the message queue, so
that the workers which are trying to put results into the queues get
an error of SHM_MQ_DETACHED. Then the worker finishes the execution
of the plan. For this reason all the printtup return types are changed
from void to bool.
But this way the worker doesn't exit until it tries to put a tuple in
the queue. If there are no valid tuples that satisfy the condition,
then it may take time for the workers to exit. Am I correct? I am not
sure how frequently such scenarios can occur.

+ if (parallel_seqscan_degree >= MaxConnections)
+ {
+ write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+ ExitPostmaster(1);
+ }

The error condition works only during server start. The user can still
set parallel seqscan degree to more than max_connections at superuser
session level, etc.

+ if (!parallelstmt->inst_options)
+ (*receiver->rDestroy) (receiver);

Why does the receiver need to be destroyed only when there is no
instrumentation?

Regards,
Hari Babu
Fujitsu Australia
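One way the cap could also be enforced at SET time, rather than only at
postmaster start, is a GUC check hook. A minimal sketch, assuming the hook
is wired up to the GUC definition in guc.c (the hook itself is not part of
the posted patch):

/*
 * Hypothetical check hook for parallel_seqscan_degree.  Unlike the
 * startup-time check quoted above, this would also reject SET at
 * session level.  Illustrative only.
 */
static bool
check_parallel_seqscan_degree(int *newval, void **extra, GucSource source)
{
	if (*newval >= max_worker_processes)
	{
		GUC_check_errdetail("parallel_seqscan_degree must be less than max_worker_processes.");
		return false;
	}
	return true;
}

A cross-variable check like this has ordering caveats while GUCs are being
initialized at startup, which may be a reason to prefer clamping the value
at use time instead.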
On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> 2. I think it's probably a good idea - at least for now, and maybe
>> forever - to avoid nesting parallel plans inside of other parallel
>> plans. It's hard to imagine that being a win in a case like this, and
>> it certainly adds a lot more cases to think about.
>
> I also think that avoiding nested parallel plans is a good step forward.

Doing that as a part of the assess parallel safety patch was trivial, so I did.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached, find the rebased version of patch. It fixes the comments raised
> by Jeff Davis and Antonin Houska. The main changes in this version are
> now it supports sync scan along with parallel sequential scan (refer
> heapam.c)
> and the patch has been split into two parts, first contains the code for
> Funnel node and infrastructure to support the same and second contains
> the code for PartialSeqScan node and its infrastructure.

+ if (es->analyze && nodeTag(plan) == T_Funnel)

Why not IsA()?

+ FinishParallelSetupAndAccumStats((FunnelState *)planstate);

Shouldn't there be a space before planstate?

+ /* inform executor to collect buffer usage stats from parallel workers. */
+ estate->total_time = queryDesc->totaltime ? 1 : 0;

Boy, the comment sure doesn't seem to match the code.

+ * Accumulate the stats by parallel workers before stopping the
+ * node.

Suggest: "Accumulate stats from parallel workers before stopping node".

+ * If we are not able to send the tuple, then we assume that
+ * destination has closed and we won't be able to send any more
+ * tuples so we just end the loop.

Suggest: "If we are not able to send the tuple, we assume the destination
has closed and no more tuples can be sent. If that's the case, end the
loop."

+static void
+EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
+ List *serialized_param_exec_vals,
+ int instOptions, Size *params_size,
+ Size *params_exec_size);
+static void
+StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
+ List *serialized_param_exec_vals,
+ int instOptions, Size params_size,
+ Size params_exec_size,
+ char **inst_options_space,
+ char **buffer_usage_space);

Whitespace doesn't look like PostgreSQL style. Maybe run pgindent on
the newly-added files?

+/*
+ * This is required for parallel plan execution to fetch the information
+ * from dsm.
+ */

This comment doesn't really say anything. Can we get a better one?

+ /*
+ * We expect each worker to populate the BufferUsage structure
+ * allocated by master backend and then master backend will aggregate
+ * all the usage along with it's own, so account it for each worker.
+ */

This also needs improvement. Especially because...

+ /*
+ * We expect each worker to populate the instrumentation structure
+ * allocated by master backend and then master backend will aggregate
+ * all the information, so account it for each worker.
+ */

...it's almost identical to this one.

+ * Store bind parameter's list in dynamic shared memory. This is
+ * used for parameters in prepared query.

s/bind parameter's list/bind parameters/. I think you could drop the
second sentence, too.

+ /*
+ * Store PARAM_EXEC parameters list in dynamic shared memory. This is
+ * used for evaluation plan->initPlan params.
+ */

So is the previous block for PARAM_EXTERN and this is PARAM_EXEC? If
so, maybe that could be more clearly laid out.

+GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,

Could this be a static function? Will it really be needed outside this
file? And is there any use case for letting some of the arguments be
NULL? Seems kind of an awkward API.
+bool
+ExecParallelBufferUsageAccum(Node *node)
+{
+ if (node == NULL)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_FunnelState:
+ {
+ FinishParallelSetupAndAccumStats((FunnelState*)node);
+ return true;
+ }
+ break;
+ default:
+ break;
+ }
+
+ (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
+ ExecParallelBufferUsageAccum, 0);
+ (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
+ ExecParallelBufferUsageAccum, 0);
+ return false;
+}

This seems wacky. I mean, isn't the point of planstate_tree_walker()
that the callback itself doesn't have to handle recursion like this?
And if not, then this wouldn't be adequate anyway, because some
planstate nodes have children that are not in lefttree or righttree
(cf. explain.c).

+ currentRelation = ExecOpenScanRelation(estate,
+ ((SeqScan *) node->ss.ps.plan)->scanrelid,
+ eflags);

I can't see how this can possibly be remotely correct. The funnel
node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).

+void ExecAccumulateInstInfo(FunnelState *node)

Another place where pgindent would help. There are a bunch of others
I noticed too, but I'm just mentioning a few here to make the point.

+ buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));

Cast it to a BufferUsage * first. Then you can use &foo[i] to find
the i'th element.

+ /*
+ * Re-initialize the parallel context and workers to perform
+ * rescan of relation. We want to gracefully shutdown all the
+ * workers so that they should be able to propagate any error
+ * or other information to master backend before dying.
+ */
+ FinishParallelSetupAndAccumStats(node);

Somehow, this makes me feel like that function is badly named.

+/*
+ * _readPlanInvalItem
+ */
+static PlanInvalItem *
+_readPlanInvalItem(void)
+{
+ READ_LOCALS(PlanInvalItem);
+
+ READ_INT_FIELD(cacheId);
+ READ_UINT_FIELD(hashValue);
+
+ READ_DONE();
+}

I don't see why we should need to be able to copy PlanInvalItems. In
fact, it seems like a bad idea.

+#parallel_setup_cost = 0.0 # same scale as above
+#define DEFAULT_PARALLEL_SETUP_COST 0.0

This value is probably a bit on the low side.

+int parallel_seqscan_degree = 0;

I think we should have a GUC for the maximum degree of parallelism in
a query generally, not the maximum degree of parallel sequential scan.

+ if (parallel_seqscan_degree >= MaxConnections)
+ {
+ write_stderr("%s: parallel_scan_degree must be less than max_connections\n", progname);
+ ExitPostmaster(1);
+ }

I think this check is thoroughly unnecessary. It's comparing to the
wrong thing anyway, because what actually matters is
max_worker_processes, not max_connections. But in any case there is
no need for the check. If somebody stupidly tries an unreasonable
value for the maximum degree of parallelism, they won't get that many
workers, but nothing will break. It's no worse than setting any other
query planner costing parameter to an insane value.

--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,6 +126,7 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
+extern bool heap_fetch(Relation relation, Snapshot snapshot,

Stray whitespace change.

More later, that's what I noticed on a first read through.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 22, 2015 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> One thing I noticed that is a bit dismaying is that we don't get a lot
> of benefit from having more workers. Look at the 0.1 data. At 2
> workers, if we scaled perfectly, we would be 3x faster (since the
> master can do work too), but we are actually 2.4x faster. Each
> process is on the average 80% efficient. That's respectable. At 4
> workers, we would be 5x faster with perfect scaling; here we are 3.5x
> faster. So the third and fourth worker were about 50% efficient.
> Hmm, not as good. But then going up to 8 workers bought us basically
> nothing.

...sorry for bumping up this mail from July...

I don't think you meant to imply it, but why should we be able to scale
perfectly? Even when the table fits entirely in shared_buffers, I would
expect memory bandwidth to become the bottleneck before a large number
of workers are added. Context switching might also be problematic.

I have almost no sense of whether this is below or above par, which is
what I'm really curious about.

FWIW, I think that parallel sort will scale somewhat better.

--
Peter Geoghegan
On Thu, Sep 17, 2015 at 6:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> 2. I think it's probably a good idea - at least for now, and maybe
>>> forever - to avoid nesting parallel plans inside of other parallel
>>> plans. It's hard to imagine that being a win in a case like this, and
>>> it certainly adds a lot more cases to think about.
>>
>> I also think that avoiding nested parallel plans is a good step forward.
>
> Doing that as a part of the assess parallel safety patch was trivial, so I did.

I tried with the latest HEAD code; it seems the problem is present in
other scenarios.

postgres=# explain select * from tbl a where exists (select 1 from tbl b
where a.f1=b.f1 limit 0);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Funnel on tbl a  (cost=0.00..397728310227.27 rows=5000000 width=214)
   Filter: (SubPlan 1)
   Number of Workers: 10
   ->  Partial Seq Scan on tbl a  (cost=0.00..397727310227.27 rows=5000000 width=214)
         Filter: (SubPlan 1)
         SubPlan 1
           ->  Limit  (cost=0.00..437500.00 rows=1 width=0)
                 ->  Seq Scan on tbl b  (cost=0.00..437500.00 rows=1 width=0)
                       Filter: (a.f1 = f1)
   SubPlan 1
     ->  Limit  (cost=0.00..437500.00 rows=1 width=0)
           ->  Seq Scan on tbl b  (cost=0.00..437500.00 rows=1 width=0)
                 Filter: (a.f1 = f1)
(13 rows)

postgres=# explain analyze select * from tbl a where exists (select 1
from tbl b where a.f1=b.f1 limit 0);
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9121) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9116) exited with exit code 1
LOG: worker process: parallel worker for PID 8775 (PID 9119) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9117) exited with exit code 1
LOG: worker process: parallel worker for PID 8775 (PID 9114) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
LOG: worker process: parallel worker for PID 8775 (PID 9118) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
CONTEXT: parallel worker, pid 9115
STATEMENT: explain analyze select * from tbl a where exists (select 1 from tbl b where a.f1=b.f1 limit 0);
LOG: worker process: parallel worker for PID 8775 (PID 9115) exited with exit code 1
LOG: worker process: parallel worker for PID 8775 (PID 9120) exited with exit code 1
ERROR: badly formatted node string "SUBPLAN :subLinkType 0 :testexpr"...
CONTEXT: parallel worker, pid 9115

Regards,
Hari Babu
Fujitsu Australia
On Mon, Sep 14, 2015 at 11:04 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Using this function, the backend detaches from the message queue, so
> that the workers which are trying to put results into the queues get
> an error of SHM_MQ_DETACHED. Then the worker finishes the execution
> of the plan. For this reason all the printtup return types are changed
> from void to bool.
>
> But this way the worker doesn't exit until it tries to put a tuple in
> the queue. If there are no valid tuples that satisfy the condition,
> then it may take time for the workers to exit. Am I correct? I am not
> sure how frequently such scenarios can occur.

Yes, that's a problem. It's probably not that bad as long as the only
thing that can occur under a Funnel node is a sequential scan, although
even then the filter condition on the sequential scan could be something
expensive or highly selective. But it will get a lot worse when we get
the ability to push joins below the funnel. I welcome ideas for solving
this problem.

Basically, the problem is that we may need to shut down the executor
before execution is complete. This can happen because we're beneath a
limit node; it can also happen because we're on the inner side of a
semijoin and have already found one match. Presumably, parallel plans
in such cases will be rare. But there may be cases where they happen,
and so we need some way to handle it.

One idea is that the workers could exit by throwing an ERROR, maybe
after setting some flag first to say, hey, this isn't a *real* error,
we're just doing this to achieve a non-local transfer of control. But
then we need to make sure that any instrumentation statistics still get
handled properly, which is maybe not so easy. And it seems like there
might be other problems with things not getting shut down properly as
well. Any code that expects a non-local exit to lead to a
(sub)transaction abort potentially gets broken by this approach.

Another idea is to try to gradually enrich the set of places that check
for shutdown. So for example at the beginning of ExecProcNode() we
could add a check to return NULL if the flag's been set; that would
probably dampen the amount of additional work that could get done in
many common scenarios. But that might break a bunch of things too, and
it's far from a complete solution anyway: for example, we could be
stuck down inside some user-defined function, and I don't see that
there's much choice in that case but to run the function to conclusion.

This problem essentially happens because we're hoping that the workers
in parallel mode will "run ahead" of the master, producing tuples for
it to read before it gets to the point of sitting and waiting for them.
Indeed, if that doesn't happen, we've missed the boat entirely. But
then that opens up the problem that the master could always decide it
doesn't need any tuples after all.

Anyone have a smart idea for how to attack this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
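To make the second idea concrete, a minimal sketch of the early-exit check
(the es_worker_shutdown flag and whatever sets it are assumptions for
illustration; no such field exists in the patch):

TupleTableSlot *
ExecProcNode(PlanState *node)
{
	/*
	 * Hypothetical shutdown check: if the leader has detached and some
	 * signalling mechanism has set this flag, report end-of-data instead
	 * of doing more work under this node.
	 */
	if (node->state->es_worker_shutdown)
		return NULL;

	/* ... the existing per-nodeTag dispatch would continue here ... */
}

As the mail says, this only bounds the work done between node boundaries;
it cannot interrupt a long-running user-defined function.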
On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [ new patches ]

+ pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

This is total nonsense. You can't hard-code the key that's used for
the scan, because we need to be able to support more than one parallel
operator beneath the same funnel. For example:

Append
-> Partial Seq Scan
-> Partial Seq Scan

Each partial sequential scan needs to have a *separate* key, which
will need to be stored in either the Plan or the PlanState or both
(not sure exactly). Each partial seq scan needs to get assigned a
unique key there in the master, probably starting from 0 or 100 or
something and counting up, and then this code needs to extract that
value and use it to look up the correct data for that scan.

+ case T_ResultState:
+ {
+ PlanState *planstate = ((ResultState*)node)->ps.lefttree;
+
+ return planstate_tree_walker((Node*)planstate, pcxt,
+ ExecParallelInitializeDSM, pscan_size);
+ }

This looks like another instance of using the walker incorrectly.
Nodes where you just want to let the walk continue shouldn't need to
be enumerated; dispatching like this should be the default case.

+ case T_Result:
+ fix_opfuncids((Node*) (((Result *)node)->resconstantqual));
+ break;

Seems similarly wrong.

+ * cost_patialseqscan

Typo. The actual function name has the same typo.

+ num_parallel_workers = parallel_seqscan_degree;
+ if (parallel_seqscan_degree <= estimated_parallel_workers)
+ num_parallel_workers = parallel_seqscan_degree;
+ else
+ num_parallel_workers = estimated_parallel_workers;

Use Min?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
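A sketch of that per-node key scheme (PARALLEL_KEY_SCAN_BASE and
plan_node_id are illustrative names, not identifiers from the posted
patch):

/* Keys above the fixed ones; one toc entry per Partial Seq Scan node. */
#define PARALLEL_KEY_SCAN_BASE		UINT64CONST(100)

/* Master, while setting up DSM for each Partial Seq Scan: */
shm_toc_insert(pcxt->toc, PARALLEL_KEY_SCAN_BASE + plan_node_id, pscan);

/* Worker, initializing the same node, looks up its own entry: */
pscan = shm_toc_lookup(toc, PARALLEL_KEY_SCAN_BASE + plan_node_id);

With distinct keys, two Partial Seq Scans under one Append get separate
ParallelHeapScanDesc entries, and per-scan instrumentation can be matched
up the same way.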
>
> On Thu, Sep 17, 2015 at 6:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>> 2. I think it's probably a good idea - at least for now, and maybe
> >>> forever - to avoid nesting parallel plans inside of other parallel
> >>> plans. It's hard to imagine that being a win in a case like this, and
> >>> it certainly adds a lot more cases to think about.
> >>
> >> I also think that avoiding nested parallel plans is a good step forward.
> >
> > Doing that as a part of the assess parallel safety patch was trivial, so I did.
> >
>
> I tried with the latest HEAD code; it seems the problem is present in
> other scenarios.
>
With Regards,
Amit Kapila.
On Thu, Sep 17, 2015 at 12:03 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As mentioned previously [1], we have to do two different things to make
> this work. Robert seems to have taken care of one of those (basically
> the second point in mail [1]) and still another one needs to be taken
> care of, which is to provide support for reading subplans in readfuncs.c;
> that will solve the problem you are seeing now.

Thanks for the information. During my test, I saw a plan change from
parallel seq scan to seq scan for the first reported query. So I thought
that all scenarios had been corrected so as not to generate the parallel
seq scan.

Regards,
Hari Babu
Fujitsu Australia
>
> On Thu, Sep 10, 2015 at 12:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> 2. I think it's probably a good idea - at least for now, and maybe
> >> forever - to avoid nesting parallel plans inside of other parallel
> >> plans. It's hard to imagine that being a win in a case like this, and
> >> it certainly adds a lot more cases to think about.
> >
> > I also think that avoiding nested parallel plans is a good step forward.
>
> Doing that as a part of the assess parallel safety patch was trivial, so I did.
>
+ * a parallel worker. We might eventually be able to relax this
+ * restriction, but for now it seems best not to have parallel workers
+ * trying to create their own parallel workers.
+ */
+ glob->parallelModeOK = (cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
+ IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
+ parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
+ parse->utilityStmt == NULL && !IsParallelWorker() &&
+ !contain_parallel_unsafe((Node *) parse);
>
> >
>
> I reviewed the parallel_seqscan_funnel_v17.patch and following are my comments.
> I will continue my review with the parallel_seqscan_partialseqscan_v17.patch.
>
> + if (inst_options)
> + {
> + instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
> + *inst_options = *instoptions;
> + if (inst_options)
>
> Same pointer variable check, it should be if (*inst_options) as per the
> estimate and store functions.
>
>
> + if (funnelstate->ss.ps.ps_ProjInfo)
> + slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
> + else
> + slot = funnelstate->ss.ss_ScanTupleSlot;
>
> Currently, there will not be a projinfo for the funnel node, so it always
> uses the scan tuple slot. In case it is different, we need to add an
> ExecProject call in the ExecFunnel function.
>
>
> + if (!((*dest->receiveSlot) (slot, dest)))
> + break;
>
> and
>
> +void
> +TupleQueueFunnelShutdown(TupleQueueFunnel *funnel)
> +{
> + if (funnel)
> + {
> + int i;
> + shm_mq_handle *mqh;
> + shm_mq *mq;
> + for (i = 0; i < funnel->nqueues; i++)
> + {
> + mqh = funnel->queue[i];
> + mq = shm_mq_get_queue(mqh);
> + shm_mq_detach(mq);
> + }
> + }
> +}
>
>
> Using this function, the backend detaches from the message queue, so
> that the workers which are trying to put results into the queues get
> an error of SHM_MQ_DETACHED. Then the worker finishes the execution
> of the plan. For this reason all the printtup return types are changed
> from void to bool.
>
> But this way the worker doesn't exit until it tries to put a tuple in
> the queue. If there are no valid tuples that satisfy the condition,
> then it may take time for the workers to exit. Am I correct? I am not
> sure how frequently such scenarios can occur.
>
>
> + if (parallel_seqscan_degree >= MaxConnections)
> + {
> + write_stderr("%s: parallel_scan_degree must be less than
> max_connections\n", progname);
> + ExitPostmaster(1);
> + }
>
> The error condition works only during server start. The user can still
> set parallel seqscan degree to more than max_connections at superuser
> session level, etc.
>
>
> + if (!parallelstmt->inst_options)
> + (*receiver->rDestroy) (receiver);
>
> Why does the receiver need to be destroyed only when there is no
> instrumentation?
>
On Thu, Sep 17, 2015 at 2:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> As per my understanding, what you have done there will not prohibit such
> cases.
>
> + * For now, we don't try to use parallel mode if we're running inside
> + * a parallel worker. We might eventually be able to relax this
> + * restriction, but for now it seems best not to have parallel workers
> + * trying to create their own parallel workers.
> + */
> + glob->parallelModeOK = (cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> + IsUnderPostmaster && dynamic_shared_memory_type != DSM_IMPL_NONE &&
> + parse->commandType == CMD_SELECT && !parse->hasModifyingCTE &&
> + parse->utilityStmt == NULL && !IsParallelWorker() &&
> + !contain_parallel_unsafe((Node *) parse);
>
> IIUC, you are referring to the !IsParallelWorker() check in the above
> code. If yes, then I think it won't work because we generate the plan in
> the master backend; the parallel worker will never exercise this code. I
> have tested it as well with the below example and it still generates the
> SubPlan as a Funnel.

You're right. That's still a good check, because some function called
in the worker might try to execute a query all of its own, but it
doesn't prevent the case you are talking about.

> Here the subplan is generated before the top level plan and while generation
> of subplan we can't predict whether it is okay to generate it as Funnel or
> not, because it might be that top level plan is non-Funnel. Also if such a
> subplan is actually an InitPlan, then we are safe (as we execute the
> InitPlans in master backend and then pass the result to parallel worker)
> even if top level plan is Funnel. I think the place where we can catch
> this is during the generation of Funnel path, basically we can evaluate if
> any nodes beneath Funnel node has 'filter' or 'targetlist' as another
> Funnel node, then we have two options to proceed:
> a. Mark such a filter or target list as non-pushable which will indicate
> that they need to be executed only in master backend. If we go with this
> option, then we have to make Funnel node capable of evaluating Filter
> and Targetlist which is not a big thing.
> b. Don't choose the current path as Funnel path.
>
> I prefer second one as that seems to be simpler as compare to first and
> there doesn't seem to be much benefit in going by first.
>
> Any better ideas?

I haven't studied the planner logic in enough detail yet to have a
clear opinion on this. But what I do think is that this is a very
good reason why we should bite the bullet and add outfuncs/readfuncs
support for all Plan nodes. Otherwise, we're going to have to scan
subplans for nodes we're not expecting to see there, which seems
silly. We eventually want to allow all of those nodes in the worker
anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patches ]
>
> + pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);
>
> This is total nonsense. You can't hard-code the key that's used for
> the scan, because we need to be able to support more than one parallel
> operator beneath the same funnel. For example:
>
> Append
> -> Partial Seq Scan
> -> Partial Seq Scan
>
> Each partial sequential scan needs to have a *separate* key, which
> will need to be stored in either the Plan or the PlanState or both
> (not sure exactly). Each partial seq scan needs to get assigned a
> unique key there in the master, probably starting from 0 or 100 or
> something and counting up, and then this code needs to extract that
> value and use it to look up the correct data for that scan.
>
On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Attached, find the rebased version of patch.

Here are the performance test results:

Query         selectivity   HashAgg +      HashAgg + parallel seq scan (ms)
              (million)     seqscan (ms)   2 workers   4 workers   8 workers
$1 <= '001'   0.1           16717.00       7086.00     4459.00     2912.00
$1 <= '004'   0.4           17962.00       7410.00     4651.00     2977.00
$1 <= '008'   0.8           18870.00       7849.00     4868.00     3092.00
$1 <= '016'   1.5           21368.00       8645.00     6800.00     3486.00
$1 <= '030'   2.7           24622.00       14796.00    13108.00    9981.00
$1 <= '060'   5.4           31690.00       29839.00    26544.00    23814.00
$1 <= '080'   7.2           37147.00       40485.00    35763.00    32679.00

Table Size - 18GB
Total rows - 40 million

Configuration:
shared_buffers - 12GB
max_wal_size - 5GB
checkpoint_timeout - 15min
work_mem - 1GB

System:
CPU - 16 core
RAM - 64GB

Query:
SELECT col1, col2, SUM(col3) AS sum_col3, SUM(col4) AS sum_col4,
       SUM(col5) AS sum_col5, SUM(col6) AS sum_col6
FROM public.test01
WHERE col1 <= $1 AND col7 = '01' AND col8 = '0'
GROUP BY col2, col1;

And also attached perf results for the selectivity of 0.1 million and 5.4
million cases for analysis.

Regards,
Hari Babu
Fujitsu Australia
Attachment
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> + /*
> + * We expect each worker to populate the BufferUsage structure
> + * allocated by master backend and then master backend will aggregate
> + * all the usage along with it's own, so account it for each worker.
> + */
>
> This also needs improvement. Especially because...
>
> + /*
> + * We expect each worker to populate the instrumentation structure
> + * allocated by master backend and then master backend will aggregate
> + * all the information, so account it for each worker.
> + */
>
> ...it's almost identical to this one.
>
>
> +GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
>
> Could this be a static function? Will it really be needed outside this file?
>
> And is there any use case for letting some of the arguments be NULL?
>
> +bool
> +ExecParallelBufferUsageAccum(Node *node)
> +{
> + if (node == NULL)
> + return false;
> +
> + switch (nodeTag(node))
> + {
> + case T_FunnelState:
> + {
> + FinishParallelSetupAndAccumStats((FunnelState*)node);
> + return true;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
> + ExecParallelBufferUsageAccum, 0);
> + (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
> + ExecParallelBufferUsageAccum, 0);
> + return false;
> +}
>
> This seems wacky. I mean, isn't the point of planstate_tree_walker()
> that the callback itself doesn't have to handle recursion like this?
> And if not, then this wouldn't be adequate anyway, because some
> planstate nodes have children that are not in lefttree or righttree
> (cf. explain.c).
>
> + currentRelation = ExecOpenScanRelation(estate,
> + ((SeqScan *)
> node->ss.ps.plan)->scanrelid,
> + eflags);
>
> I can't see how this can possibly be remotely correct. The funnel
> node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).
>
>
> + buffer_usage_worker = (BufferUsage *)(buffer_usage + (i *
> sizeof(BufferUsage)));
>
> Cast it to a BufferUsage * first. Then you can use &foo[i] to find
> the i'th element.
>
> + /*
> + * Re-initialize the parallel context and workers to perform
> + * rescan of relation. We want to gracefully shutdown all the
> + * workers so that they should be able to propagate any error
> + * or other information to master backend before dying.
> + */
> + FinishParallelSetupAndAccumStats(node);
>
> Somehow, this makes me feel like that function is badly named.
>
> +/*
> + * _readPlanInvalItem
> + */
> +static PlanInvalItem *
> +_readPlanInvalItem(void)
> +{
> + READ_LOCALS(PlanInvalItem);
> +
> + READ_INT_FIELD(cacheId);
> + READ_UINT_FIELD(hashValue);
> +
> + READ_DONE();
> +}
>
> I don't see why we should need to be able to copy PlanInvalItems. In
> fact, it seems like a bad idea.
>
> +#parallel_setup_cost = 0.0 # same scale as above
> +#define DEFAULT_PARALLEL_SETUP_COST 0.0
>
> This value is probably a bit on the low side.
>
On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Attached, find the rebased version of patch.
Here are the performance test results:
Query         selectivity   HashAgg +      HashAgg + parallel seq scan (ms)
              (million)     seqscan (ms)   2 workers   4 workers   8 workers
$1 <= '001'   0.1           16717.00       7086.00     4459.00     2912.00
$1 <= '004'   0.4           17962.00       7410.00     4651.00     2977.00
$1 <= '008'   0.8           18870.00       7849.00     4868.00     3092.00
$1 <= '016'   1.5           21368.00       8645.00     6800.00     3486.00
$1 <= '030'   2.7           24622.00       14796.00    13108.00    9981.00
$1 <= '060'   5.4           31690.00       29839.00    26544.00    23814.00
$1 <= '080'   7.2           37147.00       40485.00    35763.00    32679.00
And also attached perf results for selectivity of 0.1 million and 5.4
million cases for analysis.
Amit Kapila.
> On Thu, Sep 17, 2015 at 2:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> > Here the subplan is generated before the top level plan and while generation
> > of subplan we can't predict whether it is okay to generate it as Funnel or
> > not,
> > because it might be that top level plan is non-Funnel. Also if such a
> > subplan
> > is actually an InitPlan, then we are safe (as we execute the InitPlans in
> > master backend and then pass the result to parallel worker) even if top
> > level
> > plan is Funnel. I think the place where we can catch this is during the
> > generation of Funnel path, basically we can evaluate if any nodes beneath
> > Funnel node has 'filter' or 'targetlist' as another Funnel node, then we
> > have
> > two options to proceed:
> > a. Mark such a filter or target list as non-pushable which will indicate
> > that
> > they need to be executed only in master backend. If we go with this
> > option, then we have to make Funnel node capable of evaluating Filter
> > and Targetlist which is not a big thing.
> > b. Don't choose the current path as Funnel path.
> >
> > I prefer second one as that seems to be simpler as compare to first and
> > there doesn't seem to be much benefit in going by first.
> >
> > Any better ideas?
>
> I haven't studied the planner logic in enough detail yet to have a
> clear opinion on this. But what I do think is that this is a very
> good reason why we should bite the bullet and add outfuncs/readfuncs
> support for all Plan nodes. Otherwise, we're going to have to scan
> subplans for nodes we're not expecting to see there, which seems
> silly. We eventually want to allow all of those nodes in the worker
> anyway.
>
makes sense to me. There are 39 plan nodes and it seems we have
support for all of them in outfuncs and need to add support for most
of them in readfuncs.
On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Attached, find the rebased version of patch.
>
> Here are the performance test results:

Thanks, this is really interesting. I'm very surprised by how much
kernel overhead this shows. I wonder where that's coming from. The
writes to and reads from the shm_mq shouldn't need to touch the kernel
at all except for page faults; that's why I chose this form of IPC.
It could be that the signals which are sent for flow control are
chewing up a lot of cycles, but if that's the problem, it's not very
clear from here. copy_user_generic_string doesn't sound like
something related to signals. And why all the kernel time in
_spin_lock? Maybe perf -g would help us tease out where this kernel
time is coming from.

Some of this may be due to rapid context switching. Suppose the
master process is the bottleneck. Then each worker will fill up the
queue and go to sleep. When the master reads a tuple, the worker has
to wake up and write a tuple, and then it goes back to sleep. This
might be an indication that we need a bigger shm_mq size. I think
that would be worth experimenting with: if we double or quadruple or
increase by 10x the queue size, what happens to performance?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
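For anyone who wants to try that experiment: in this design the per-worker
tuple queue size is a compile-time constant, so the test is a one-line
change. The constant's name and the 64kB base value below are assumptions,
not necessarily what the patch calls them:

/* Illustrative only: quadruple the per-worker tuple queue for testing. */
#define PARALLEL_TUPLE_QUEUE_SIZE	(65536 * 4)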
On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but I think the same can be achieved with this as well. Basic idea
> is that each worker will work on one planned statement at a time and in
> above case there will be two different planned statements and they will
> store partial seq scan related information in two different locations in
> toc, although the key (PARALLEL_KEY_SCAN) would be same and I think this
> will be quite similar to what we are already doing for response queues.
> The worker will work on one of those keys based on planned statement
> which it chooses to execute. I have explained this in somewhat more details
> in one of my previous mails [1].

shm_toc keys are supposed to be unique. If you added more than one
with the same key, there would be no way to look up the second one.
That was intentional, and I don't want to revise it.

I don't want to have multiple PlannedStmt objects in any case. That
doesn't seem like the right approach. I think passing down an Append
tree with multiple Partial Seq Scan children to be run in order is
simple and clear, and I don't see why we would do it any other way.
The master should be able to generate a plan and then copy the part of
it below the Funnel and send it to the worker. But there's clearly
never more than one PlannedStmt in the master, so where would the
other ones come from in the worker? There's no reason to introduce
that complexity.

>> Each partial sequential scan needs to have a *separate* key, which
>> will need to be stored in either the Plan or the PlanState or both
>> (not sure exactly). Each partial seq scan needs to get assigned a
>> unique key there in the master, probably starting from 0 or 100 or
>> something and counting up, and then this code needs to extract that
>> value and use it to look up the correct data for that scan.
>
> In that case also, multiple workers can work on the same key, assuming
> in your above example, multiple workers will be required to execute
> each partial seq scan. In this case we might need to see how to map
> instrumentation information for a particular execution.

That was discussed on the nearby thread about numbering plan nodes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 18, 2015 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> + currentRelation = ExecOpenScanRelation(estate,
>> + ((SeqScan *) node->ss.ps.plan)->scanrelid,
>> + eflags);
>>
>> I can't see how this can possibly be remotely correct. The funnel
>> node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).
>
> This is mainly used for generating the tuple descriptor, and that tuple
> descriptor will be used for forming the scan slot; the funnel node itself
> won't do any scan. However, we can completely eliminate this InitFunnel()
> function and use ExecAssignProjectionInfo() instead of
> ExecAssignScanProjectionInfo() to form the projection info.

That sounds like a promising approach.

>> + buffer_usage_worker = (BufferUsage *)(buffer_usage + (i * sizeof(BufferUsage)));
>>
>> Cast it to a BufferUsage * first. Then you can use &foo[i] to find
>> the i'th element.
>
> Do you mean to say that the way the code is written won't work?
> Values of the BufferUsage structure for each worker are copied into the
> string buffer_usage, which I believe could be fetched in the above way.

I'm just complaining about the style. If bar is a char *, then these
are all equivalent:

foo = (Quux *) (bar + (i * sizeof(Quux)));

foo = ((Quux *) bar) + i;

foo = &((Quux *) bar)[i];

baz = (Quux *) bar;
foo = &baz[i];

>> + /*
>> + * Re-initialize the parallel context and workers to perform
>> + * rescan of relation. We want to gracefully shutdown all the
>> + * workers so that they should be able to propagate any error
>> + * or other information to master backend before dying.
>> + */
>> + FinishParallelSetupAndAccumStats(node);
>>
>> Somehow, this makes me feel like that function is badly named.
>
> I think the comment here seems to be slightly misleading; shall we
> change the comment as below:
>
> Destroy the parallel context to gracefully shutdown all the
> workers so that they should be able to propagate any error
> or other information to master backend before dying.

Well, why does a function that destroys the parallel context have a
name that starts with FinishParallelSetup? It sounds like it is
tearing things down, not finishing setup.

>> +#parallel_setup_cost = 0.0 # same scale as above
>> +#define DEFAULT_PARALLEL_SETUP_COST 0.0
>>
>> This value is probably a bit on the low side.
>
> How about keeping it as 10.0?

Really? I would have guessed that the correct value was in the tens
of thousands.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 18, 2015 at 12:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 17, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Okay, but I think the same can be achieved with this as well. Basic idea
>> is that each worker will work on one planned statement at a time and in
>> above case there will be two different planned statements and they will
>> store partial seq scan related information in two different locations in
>> toc, although the key (PARALLEL_KEY_SCAN) would be same and I think this
>> will be quite similar to what we are already doing for response queues.
>> The worker will work on one of those keys based on planned statement
>> which it chooses to execute. I have explained this in somewhat more details
>> in one of my previous mails [1].
>
> shm_toc keys are supposed to be unique. If you added more than one
> with the same key, there would be no way to look up the second one.
> That was intentional, and I don't want to revise it.
>
> I don't want to have multiple PlannedStmt objects in any case. That
> doesn't seem like the right approach. I think passing down an Append
> tree with multiple Partial Seq Scan children to be run in order is
> simple and clear, and I don't see why we would do it any other way.
> The master should be able to generate a plan and then copy the part of
> it below the Funnel and send it to the worker. But there's clearly
> never more than one PlannedStmt in the master, so where would the
> other ones come from in the worker? There's no reason to introduce
> that complexity.

Also, as KaiGai pointed out on the other thread, even if you DID pass
two PlannedStmt nodes to the worker, you still need to know which one
goes with which ParallelHeapScanDesc. If both of the
ParallelHeapScanDesc nodes are stored under the same key, then you
can't do that. That's why, as discussed in the other thread, we need
some way of uniquely identifying a plan node.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 18, 2015 at 9:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Sep 18, 2015 at 1:33 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>> > Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>>
>> Query         selectivity   HashAgg +      HashAgg + parallel seq scan (ms)
>>               (million)     seqscan (ms)   2 workers   4 workers   8 workers
>> $1 <= '001'   0.1           16717.00       7086.00     4459.00     2912.00
>> $1 <= '004'   0.4           17962.00       7410.00     4651.00     2977.00
>> $1 <= '008'   0.8           18870.00       7849.00     4868.00     3092.00
>> $1 <= '016'   1.5           21368.00       8645.00     6800.00     3486.00
>> $1 <= '030'   2.7           24622.00       14796.00    13108.00    9981.00
>> $1 <= '060'   5.4           31690.00       29839.00    26544.00    23814.00
>> $1 <= '080'   7.2           37147.00       40485.00    35763.00    32679.00
>
> I think here probably when the selectivity is more than 5, then it should
> not have selected Funnel plan. Have you by any chance changed
> cpu_tuple_comm_cost? If not, then you can try by setting value of
> parallel_setup_cost (may be 10) and then see if it selects the Funnel
> Plan. Is it possible for you to check the cost difference of Sequence
> and Funnel plan, hopefully explain or explain analyze should be sufficient?

Yes, I changed cpu_tuple_comm_cost to zero to observe how parallel seq
scan performs at high selectivity; I forgot to mention that in the earlier
mail. Overall the parallel seq scan performance is good.

>> And also attached perf results for selectivity of 0.1 million and 5.4
>> million cases for analysis.
>
> I have checked perf reports and it seems that when selectivity is more,
> it seems to be spending time in some kernel calls, which could be due to
> communication of tuples.

Yes. And also, at low selectivity, with the increase of workers, usage of
the tas and s_lock functions increases. Maybe these are also among the
reasons for the scaling problem.

Regards,
Hari Babu
Fujitsu Australia
On Sat, Sep 19, 2015 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 18, 2015 at 4:03 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> On Thu, Sep 3, 2015 at 8:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> Attached, find the rebased version of patch.
>>
>> Here are the performance test results:
>
> Thanks, this is really interesting. I'm very surprised by how much
> kernel overhead this shows. I wonder where that's coming from. The
> writes to and reads from the shm_mq shouldn't need to touch the kernel
> at all except for page faults; that's why I chose this form of IPC.
> It could be that the signals which are sent for flow control are
> chewing up a lot of cycles, but if that's the problem, it's not very
> clear from here. copy_user_generic_string doesn't sound like
> something related to signals. And why all the kernel time in
> _spin_lock? Maybe perf -g would help us tease out where this kernel
> time is coming from.

The copy_user_generic_string kernel time comes from file read operations.
In my test, I gave shared_buffers as 12GB with a table size of 18GB. I
tried to reduce the use of copy_user_generic_string by loading all the
pages into shared buffers, with different combinations of 12GB and 20GB
shared_buffers settings.

The _spin_lock calls are from the signals that are generated by the
workers. With the increase of tuple queue size, there is a change in
kernel system call usage. I have attached the perf reports collected
with the -g option for your reference.

> Some of this may be due to rapid context switching. Suppose the
> master process is the bottleneck. Then each worker will fill up the
> queue and go to sleep. When the master reads a tuple, the worker has
> to wake up and write a tuple, and then it goes back to sleep. This
> might be an indication that we need a bigger shm_mq size. I think
> that would be worth experimenting with: if we double or quadruple or
> increase by 10x the queue size, what happens to performance?

I tried with 1, 2, 4, 8 and 10 multiply factors for the tuple queue size
and collected the performance readings. Summary of the results:

- There is not much change in low selectivity cases with the increase of
  tuple queue size.
- Up to 1.5 million selectivity, the time taken to execute a query is
  8 workers < 4 workers < 2 workers with any tuple queue size.
- With tuple queue multiply factor 4 (i.e. 4 * tuple queue size), for
  selectivity greater than 1.5 million: 4 workers < 2 workers < 8 workers.
- With tuple queue multiply factor 8 or 10, for selectivity greater than
  1.5 million: 2 workers < 4 workers < 8 workers.
- From the above readings, an increase of tuple queue size benefits
  lower numbers of workers more than higher numbers of workers.
- Maybe the tuple queue size can be calculated automatically based on the
  selectivity, average tuple width and number of workers.
- When the buffers are loaded into shared_buffers using the prewarm
  utility, not much scaling is visible with the increase of workers.

A performance report is attached for your reference.

Apart from the performance, I have the following observations.

Workers are getting started irrespective of the system load. Suppose the
user configures 16 workers, but because of a sudden increase in the
system load only 2 or 3 CPUs are idle.
In this case, if any parallel seq scan eligible query is executed, the backend may start 16 workers; this can lead to an overall increase of system usage and may decrease the performance of the other backend sessions.

If the query has two parallel seq scan plan nodes, how will the workers be distributed across the two nodes? Currently parallel_seqscan_degree is used per plan node; even if we change that to per query, I think we need a worker distribution logic, instead of letting a single plan node use all the workers.

A select with a limit clause has a performance drawback with parallel seq scan in some scenarios, because when very few rows are needed it performs worse than a plain seq scan; it would be better if we document this. Users can take necessary actions based on that for queries with a limit clause.

Regards,
Hari Babu
Fujitsu Australia
Attachment
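Hari's auto-sizing idea above lends itself to a simple heuristic. The following is only a sketch; the 64kB base, the 1/16 scale factor, and the function itself are illustrative assumptions, not anything from the patch:

/* Hypothetical sizing rule: grow the per-worker tuple queue with the
 * expected output volume, within fixed bounds.  Size, Max and Min are
 * the usual PostgreSQL c.h definitions. */
static Size
suggested_tuple_queue_size(double est_rows, int avg_width, int nworkers)
{
	Size		base = 65536;	/* assumed base queue size (64kB) */
	Size		per_worker;

	per_worker = (Size) (est_rows * avg_width) / Max(nworkers, 1);

	/* hold about 1/16th of one worker's expected output, bounded */
	return Min(Max(base, per_worker / 16), base * 10);
}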
>
> On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > I haven't studied the planner logic in enough detail yet to have a
> > clear opinion on this. But what I do think is that this is a very
> > good reason why we should bite the bullet and add outfuncs/readfuncs
> > support for all Plan nodes. Otherwise, we're going to have to scan
> > subplans for nodes we're not expecting to see there, which seems
> > silly. We eventually want to allow all of those nodes in the worker
> > anyway.
> >
>
> makes sense to me. There are 39 plan nodes and it seems we have
> support for all of them in outfuncs, and we need to add support for
> most of them in readfuncs.
>
Attached patch (read_funcs_v1.patch) contains support for all the plan and other nodes (like SubPlan which could be required for worker) except the CustomScan node. CustomScan contains TextOutCustomScan and doesn't contain a corresponding read function pointer; we could add support for the same, but I am not sure if CustomScan is required to be passed to the worker in the near future, so I am leaving it for now.
Attachment
On Tue, Sep 22, 2015 at 3:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch (read_funcs_v1.patch) contains support for all the plan
> and other nodes (like SubPlan which could be required for worker) except
> CustomScan node.

It looks like you need to update the top-of-file comment for outfuncs.c.

Doesn't _readCommonPlan() leak? I think we should avoid that. _readCommonScan() and _readJoin() are worse: they leak multiple objects. It should be simple enough to avoid this: just have your helper function take a Plan * as argument and then use READ_TEMP_LOCALS() rather than READ_LOCALS(). Then the caller can use READ_LOCALS, call the helper to fill in all the Plan fields, and then read the other stuff itself.

Instead of passing the Plan down by casting, how about passing &local_node->plan? And similarly for scans and joins.

readAttrNumberCols uses sizeof(Oid) instead of sizeof(AttrNumber).

I still don't understand why we need to handle PlanInvalItem. Actually, come to think of it, I'm not sure we need PlannedStmt either. Let's leave those out; they seem like trouble.

I think it would be worth doing something like this:

#define READ_ATTRNUMBER_ARRAY(fldname, len) \
	token = pg_strtok(&length); \
	local_node->fldname = readAttrNumberCols(len);

And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.

In general these routines are in the same order as plannodes.h, which is good. But _readNestLoopParam is out of place. Can we move it just after _readNestLoop?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
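To make the suggested structure concrete, here is a minimal sketch of the leak-free helper pattern using the existing readfuncs.c macros. The field list is abridged, and the helper's parameter is deliberately named local_node so the READ_* macros work unchanged; treat this as an illustration, not the committed code:

static void
ReadCommonPlan(Plan *local_node)
{
	READ_TEMP_LOCALS();		/* only token/length; no makeNode(), so nothing leaks */

	READ_FLOAT_FIELD(startup_cost);
	READ_FLOAT_FIELD(total_cost);
	READ_FLOAT_FIELD(plan_rows);
	READ_INT_FIELD(plan_width);
	READ_NODE_FIELD(targetlist);
	READ_NODE_FIELD(qual);
	READ_NODE_FIELD(lefttree);
	READ_NODE_FIELD(righttree);
	/* ... remaining Plan fields ... */
}

static SeqScan *
_readSeqScan(void)
{
	READ_LOCALS(SeqScan);	/* the one and only makeNode() call */

	ReadCommonPlan(&local_node->plan);	/* no cast needed */
	READ_UINT_FIELD(scanrelid);

	READ_DONE();
}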
> > On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > I haven't studied the planner logic in enough detail yet to have a
> > > clear opinion on this. But what I do think is that this is a very
> > > good reason why we should bite the bullet and add outfuncs/readfuncs
> > > support for all Plan nodes. Otherwise, we're going to have to scan
> > > subplans for nodes we're not expecting to see there, which seems
> > > silly. We eventually want to allow all of those nodes in the worker
> > > anyway.
> > >
> >
> > makes sense to me. There are 39 plan nodes and it seems we have
> > support for all of them in outfuncs, and we need to add support for
> > most of them in readfuncs.
> >
> Attached patch (read_funcs_v1.patch) contains support for all the plan
> and other nodes (like SubPlan which could be required for worker) except
> the CustomScan node. CustomScan contains TextOutCustomScan and doesn't
> contain a corresponding read function pointer; we could add support for
> the same, but I am not sure if CustomScan is required to be passed to the
> worker in the near future, so I am leaving it for now.
>
Oh... I did exactly the same job a few days before:
https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c

Regarding the CustomScan node, I'd like to run it on a worker process as soon as possible once it gets supported. I'm highly motivated.

Andres raised a related topic a few weeks before:
http://www.postgresql.org/message-id/20150825181933.GA19326@awork2.anarazel.de

Here are two issues:

* How to reproduce the "methods" pointer on another process. The extension may not be loaded via shared_preload_libraries.
  -> One solution is to provide a pair of library and symbol name of the method table, instead of the pointer. I think it is a reasonable idea.

* How to treat the additional output of TextOutCustomScan.
  -> Here are two solutions. (1) Mark TextOutCustomScan as an obsolete callback; however, that still leaves Andres's concern, because we need to form/deform private data to be copyObject-safe. (2) Add TextReadCustomScan (and NodeEqualCustomScan?) callbacks to process private fields.

> To verify the patch, I have done 2 things, first I have added elog to
> the newly supported read funcs and then in planner, I have used
> nodeToString and stringToNode on planned_stmt and then used the
> newly generated planned_stmt for further execution. After making these
> changes, I have ran make check-world and ensures that it covers all the
> newly added nodes.
>
> Note, that as we don't populate funcid's in expressions during read, the
> same has to be updated by traversing the tree and updating in different
> expressions based on node type. Attached patch (read_funcs_test_v1)
> contains the changes required for testing the patch. I am not very sure
> about what do about some of the ForeignScan fields (fdw_private) in order
> to update the funcid as the data in those expressions could be FDW specific.
> This is anyway for test, so doesn't matter much, but the same will be
> required to support read of ForeignScan node by worker.
>
Because of the interface contract, it is the role of the FDW driver to put only nodes which are safe to copyObject on the fdw_exprs and fdw_private fields. Unless the FDW driver violates this, fdw_exprs and fdw_private shall be reproduced on the worker side. (Of course, we cannot guarantee nobody keeps a local pointer into the private field...)

Sorry, I cannot understand the sentence about funcid population. It seems to me funcid is displayed as-is, and _readFuncExpr() does nothing special...?
Robert Haas said:
> I think it would be worth doing something like this:
>
> #define READ_ATTRNUMBER_ARRAY(fldname, len) \
>     token = pg_strtok(&length); \
>     local_node->fldname = readAttrNumberCols(len);
>
> And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.

Like this?
https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c#L133

I think outfuncs.c should also have a similar macro to centralize the format of arrays. Actually, most boolean arrays are displayed using booltostr(); only _outMergeJoin() uses the "%d" format to display booleans as integers. That is a bit inconsistent.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
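For what it's worth, an outfuncs.c counterpart could centralize the array format the same way and settle the booltostr()-versus-"%d" inconsistency; the macro below is hypothetical, patterned on the existing WRITE_BOOL_FIELD:

#define WRITE_BOOL_ARRAY(fldname, len) \
	do { \
		int		i; \
		/* field label, then one booltostr()-formatted entry per element */ \
		appendStringInfoString(str, " :" CppAsString(fldname)); \
		for (i = 0; i < (len); i++) \
			appendStringInfo(str, " %s", booltostr(node->fldname[i])); \
	} while (0)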
>
> On Tue, Sep 22, 2015 at 3:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> readAttrNumberCols uses sizeof(Oid) instead of sizeof(AttrNumber).
>
> I still don't understand why we need to handle PlanInvalItem.
As such this is not required, just to maintain consistency as I have added
>
> Actually, come to think of it, I'm not sure we need PlannedStmt
> either.

PlannedStmt is needed because we are passing the same from master to worker for execution, and the reason was that Executor interfaces expect it.
On Tue, Sep 22, 2015 at 9:18 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> PlannedStmt is needed because we are passing the same from master
> to worker for execution and the reason was that Executor interfaces
> expect it.

I thought we were passing the Plan and then the worker constructed a PlannedStmt around it. If we're passing the PlannedStmt then I guess we need PlanInvalItem too, since there is a list of those hanging off of the PlannedStmt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 22, 2015 at 9:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Oh... I did exactly the same job a few days before:
> https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c

Please post the patch here, and clarify that it is under the PostgreSQL license.

> Regarding the CustomScan node, I'd like to run it on a worker process as
> soon as possible once it gets supported. I'm highly motivated.

Great.

> Andres raised a related topic a few weeks before:
> http://www.postgresql.org/message-id/20150825181933.GA19326@awork2.anarazel.de
>
> Here are two issues:
>
> * How to reproduce the "methods" pointer on another process. The extension
>   may not be loaded via shared_preload_libraries.

The parallel mode stuff already has code to make sure that the same libraries that were loaded in the original backend get loaded in the new one. But that's not going to make the same pointer valid there.

> -> One solution is to provide a pair of library and symbol name of the
>    method table, instead of the pointer. I think it is a reasonable idea.

I agree.

> * How to treat the additional output of TextOutCustomScan.
> -> Here are two solutions. (1) Mark TextOutCustomScan as an obsolete
>    callback; however, that still leaves Andres's concern, because we need
>    to form/deform private data to be copyObject-safe. (2) Add
>    TextReadCustomScan (and NodeEqualCustomScan?) callbacks to process
>    private fields.

I don't see how making it obsolete solves anything. Any node that wants to run in a worker needs to have outfuncs and readfuncs support.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Tue, Sep 22, 2015 at 9:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > Oh... I did exactly the same job a few days before:
> > https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c
>
> Please post the patch here, and clarify that it is under the PostgreSQL license.
>
Of course. I intend to submit it.

> > Regarding the CustomScan node, I'd like to run it on a worker process as
> > soon as possible once it gets supported. I'm highly motivated.
>
> Great.
>
> > Andres raised a related topic a few weeks before:
> > http://www.postgresql.org/message-id/20150825181933.GA19326@awork2.anarazel.de
> >
> > Here are two issues:
> >
> > * How to reproduce the "methods" pointer on another process. The extension
> >   may not be loaded via shared_preload_libraries.
>
> The parallel mode stuff already has code to make sure that the same
> libraries that were loaded in the original backend get loaded in the
> new one. But that's not going to make the same pointer valid there.
>
> > -> One solution is to provide a pair of library and symbol name of the
> >    method table, instead of the pointer. I think it is a reasonable idea.
>
> I agree.
>
> > * How to treat the additional output of TextOutCustomScan.
> > -> Here are two solutions. (1) Mark TextOutCustomScan as an obsolete
> >    callback; however, that still leaves Andres's concern, because we need
> >    to form/deform private data to be copyObject-safe. (2) Add
> >    TextReadCustomScan (and NodeEqualCustomScan?) callbacks to process
> >    private fields.
>
> I don't see how making it obsolete solves anything. Any node that
> wants to run in a worker needs to have outfuncs and readfuncs support.
>
Actually, I'm inclined toward (2) rather than (1). In case of (2), we shall need two new callbacks for _copyCustomScan and _readCustomScan. I'll try to work it up.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
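A rough sketch of the (library, symbol) idea discussed above. The struct and function names are hypothetical; the one real ingredient is load_external_function() from fmgr.h, which loads the library if necessary and returns the address of the named symbol:

/* Hypothetical serialized form shipped to the worker instead of a pointer */
typedef struct SerializedCustomScanMethods
{
	char		library_name[MAXPGPATH];	/* shared library holding the table */
	char		symbol_name[NAMEDATALEN];	/* exported name of the methods table */
} SerializedCustomScanMethods;

static const CustomScanMethods *
restore_custom_scan_methods(SerializedCustomScanMethods *s)
{
	/*
	 * The symbol here is a data object (the methods table), not a
	 * function, so the PGFunction result is cast back accordingly.
	 */
	return (const CustomScanMethods *)
		load_external_function(s->library_name, s->symbol_name,
							   true,	/* error out if not found */
							   NULL);
}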
>
> On Tue, Sep 22, 2015 at 3:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Attached patch (read_funcs_v1.patch) contains support for all the plan
> > and other nodes (like SubPlan which could be required for worker) except
> > CustomScan node.
>
> It looks like you need to update the top-of-file comment for outfuncs.c.
>
> Doesn't _readCommonPlan() leak?
> _readCommonScan() and _readJoin() are worse: they leak multiple
> objects. It should be simple enough to avoid this: just have your
> helper function take a Plan * as argument and then use
> READ_TEMP_LOCALS() rather than READ_LOCALS(). Then the caller can use
> READ_LOCALS, call the helper to fill in all the Plan fields, and then
> read the other stuff itself.
>
> Instead of passing the Plan down by casting, how about passing
> &local_node->plan? And similarly for scans and joins.
>
> readAttrNumberCols uses sizeof(Oid) instead of sizeof(AttrNumber).
>
> I still don't understand why we need to handle PlanInvalItem.
> Actually, come to think of it, I'm not sure we need PlannedStmt
> either. Let's leave those out; they seem like trouble.
>
> I think it would be worth doing something like this:
>
> #define READ_ATTRNUMBER_ARRAY(fldname, len) \
> token = pg_strtok(&length); \
> local_node->fldname = readAttrNumberCols(len);
>
> And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.
>
> In general these routines are in the same order as plannodes.h, which
> is good. But _readNestLoopParam is out of place. Can we move it just
> after _readNestLoop?
>
Attachment
>
> > > On Thu, Sep 17, 2015 at 4:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> > To verify the patch, I have done 2 things, first I have added elog to
> > the newly supported read funcs and then in planner, I have used
> > nodeToString and stringToNode on planned_stmt and then used the
> > newly generated planned_stmt for further execution. After making these
> > changes, I have ran make check-world and ensures that it covers all the
> > newly added nodes.
> >
> > Note, that as we don't populate funcid's in expressions during read, the
> > same has to be updated by traversing the tree and updating in different
> > expressions based on node type. Attached patch (read_funcs_test_v1)
> > contains the changes required for testing the patch. I am not very sure
> > about what do about some of the ForeignScan fields (fdw_private) in order
> > to update the funcid as the data in those expressions could be FDW specific.
> > This is anyway for test, so doesn't matter much, but the same will be
> > required to support read of ForeignScan node by worker.
> >
> Because of the interface contract, it is the role of the FDW driver to put
> only nodes which are safe to copyObject on the fdw_exprs and fdw_private
> fields. Unless the FDW driver violates this, fdw_exprs and fdw_private
> shall be reproduced on the worker side. (Of course, we cannot guarantee
> nobody keeps a local pointer into the private field...)
> Sorry, I cannot understand the sentence about funcid population. It seems to me
> funcid is displayed as-is, and _readFuncExpr() does nothing special...?
>
> Robert Haas said:
> > I think it would be worth doing something like this:
> >
> > #define READ_ATTRNUMBER_ARRAY(fldname, len) \
> > token = pg_strtok(&length); \
> > local_node->fldname = readAttrNumberCols(len);
> >
> > And similarly for READ_OID_ARRAY, READ_BOOL_ARRAY, READ_INT_ARRAY.
> >
> Like this?
> https://github.com/kaigai/sepgsql/blob/readfuncs/src/backend/nodes/readfuncs.c#L133
>
> I think outfuncs.c should also have a similar macro to centralize the format
> of arrays. Actually, most boolean arrays are displayed using booltostr();
> only _outMergeJoin() uses the "%d" format to display booleans as integers.
> That is a bit inconsistent.
>
Yes, I have also noticed the same and thought of sending a patch which I
On Wed, Sep 23, 2015 at 3:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Instead of passing the Plan down by casting, how about passing >> &local_node->plan? And similarly for scans and joins. > Changed as per suggestion. The point of this change was to make it so that we wouldn't need the casts any more. You changed it so we didn't, but then didn't actually get rid of them. I did that, tweaked a comment, and committed this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 22, 2015 at 3:14 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> copy_user_generic_string shows up because of file read operations.
> In my test, I gave shared_buffers as 12GB with a table size of 18GB.

OK, cool. So that's actually good: all that work would have to be done either way, and parallelism lets several CPUs work on it at once.

> The _spin_lock calls are from the signals that are generated by the workers.
> With the increase of tuple queue size, there is a change in kernel system
> call usage.

And this part is not so good: that's additional work created by parallelism that wouldn't have to be done if we weren't in parallel mode. Of course, it's impossible to eliminate that, but we should try to reduce it.

> - From the above performance readings, an increase of tuple queue size
> benefits a smaller number of workers more than a higher number of workers.

That makes sense to me, because there's a separate queue for each worker. If we have more workers, then the total amount of queue space available rises in proportion to the number of workers available.

> Workers are getting started irrespective of the system load. Suppose a user
> configures 16 workers, but because of a sudden increase in the system load
> only 2 or 3 CPUs are idle. In this case, if any parallel seq scan eligible
> query is executed, the backend may start 16 workers; this can lead to an
> overall increase of system usage and may decrease the performance of the
> other backend sessions.

Yep, that could happen. It's something we should work on, but the first version isn't going to try to be that smart. It's similar to the problem we already have with work_mem, and I want to work on it, but we need to get this working first.

> If the query has two parallel seq scan plan nodes, how will the workers
> be distributed across the two nodes? Currently parallel_seqscan_degree is
> used per plan node; even if we change that to per query, I think we need
> a worker distribution logic, instead of letting a single plan node use
> all the workers.

Yes, we need that, too. Again, at some point.

> A select with a limit clause has a performance drawback with parallel seq
> scan in some scenarios, because when very few rows are needed it performs
> worse than a plain seq scan; it would be better if we document this.

This is something I want to think further about in the near future. We don't have a great plan for shutting down workers when no further tuples are needed because, for example, an upper node has filled a limit. That makes using parallel query in contexts like Limit and InitPlan significantly more costly than you might expect. Perhaps we should avoid parallel plans altogether in those contexts, or maybe there is some other approach that can work. I haven't figured it out yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [ new patches ]

More review comments:

ExecParallelEstimate() and ExecParallelInitializeDSM() should use planstate_tree_walker to iterate over the whole planstate tree. That's because the parallel node need not be at the top of the tree. By using the walker, you'll find plan state nodes that need work of the relevant type even if they are deeply buried. The logic should be something like:

if (node == NULL)
    return false;

switch (nodeTag(node))
{
    ... /* parallel aware nodes enumerated here */
}

return planstate_tree_walker(node, ExecParallelEstimate, context);

The function signature should be changed to:

bool
ExecParallelEstimate(PlanState *planstate, parallel_estimate_ctx *context)

where parallel_estimate_ctx is a structure containing ParallelContext *context and Size *psize. The comment about handling only a few node types should go away, because by using planstate_tree_walker we can iterate over anything.

It looks to me like there would be trouble if an initPlan or subPlan were kept below a Funnel, or as I guess we're going to call it, a Gather node. That's because a SubPlan doesn't actually have a pointer to the node tree for the sub-plan in it. It just has an index into PlannedStmt.subplans. But create_parallel_worker_plannedstmt sets the subplans list to NIL. So that's not gonna work. Now maybe there's no way for an initPlan or a subPlan to creep down under the funnel, but I don't immediately see what would prevent it.

+ * There should be atleast thousand pages to scan for each worker.

"at least a thousand"

+cost_patialseqscan(Path *path, PlannerInfo *root,

patial->partial.

I also don't see where you are checking that a partial seq scan has nothing parallel-restricted in its quals.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [ new patches ]

Still more review comments:

+ /* Allow space for terminating zero-byte */
+ size = add_size(size, 1);

This is pointless. The length is already stored separately, and if it weren't, this wouldn't be adequate anyway because a varlena can contain NUL bytes. It won't if it's text, but it could be bytea or numeric or whatever.

RestoreBoundParams is broken, because it can do unaligned reads, which will core dump on some architectures (and merely be slow on others). If there are two or more parameters, and the first one is a varlena with a length that is not a multiple of MAXIMUM_ALIGNOF, the second SerializedParamExternData will be misaligned.

Also, it's pretty lame that we send the useless pointer even for a pass-by-reference data type and then overwrite the bad pointer with a good one a few lines later. It would be better to design the serialization format so that we don't send the bogus pointer over the wire in the first place.

It's also problematic in my view that there is so much duplicated code here. SerializedParamExternData and SerializedParamExecData are very similar and there are large swaths of very similar code to handle each case. Both structures contain Datum value, Size length, bool isnull, and Oid ptype, albeit not in the same order for some reason. The only difference is that SerializedParamExternData contains uint16 pflags and SerializedParamExecData contains int paramid. I think we need to refactor this code to get rid of all this duplication.

I suggest that we decide to represent a datum here in a uniform fashion, perhaps like this: First, store a 4-byte header word. If this is -2, the value is NULL and no data follows. If it's -1, the value is pass-by-value and sizeof(Datum) bytes follow. If it's >0, the value is pass-by-reference and the value gives the number of following bytes that should be copied into a brand-new palloc'd chunk.

Using a format like this, we can serialize and restore datums in various contexts - including bind and exec params - without having to rewrite the code each time. For example, for param extern data, you can dump an array of all the ptypes and paramids and then follow it with all of the params one after another. For param exec data, you can dump an array of all the ptypes and paramids and then follow it with the values one after another. The code that reads and writes the datums in both cases can be the same. If we need to send datums in other contexts, we can use the same code for it.

The attached patch - which I even tested! - shows what I have in mind. It can save and restore the ParamListInfo (bind parameters). I haven't tried to adapt it to the exec parameters because I don't quite understand what you are doing there yet, but you can see that the datum-serialization logic is separated from the stuff that knows about the details of ParamListInfo, so datumSerialize() should be reusable for other purposes. This also doesn't have the other problems mentioned above.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
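For illustration, here is a sketch of the header-word format described in the mail above; the function names and exact signatures are illustrative rather than those of the attached patch (datumGetSize() is the real helper from utils/datum.h):

static void
datum_serialize(Datum value, bool isnull, bool typByVal, int typLen,
				char **start_address)
{
	int			header;

	if (isnull)
		header = -2;			/* NULL: no payload follows */
	else if (typByVal)
		header = -1;			/* pass-by-value: sizeof(Datum) bytes follow */
	else
		header = (int) datumGetSize(value, typByVal, typLen);

	/* memcpy, not pointer assignment, so unaligned destinations are fine */
	memcpy(*start_address, &header, sizeof(int));
	*start_address += sizeof(int);

	if (header == -1)
	{
		memcpy(*start_address, &value, sizeof(Datum));
		*start_address += sizeof(Datum);
	}
	else if (header > 0)
	{
		memcpy(*start_address, DatumGetPointer(value), header);
		*start_address += header;
	}
}

static Datum
datum_restore(char **start_address, bool *isnull)
{
	int			header;
	Datum		value = (Datum) 0;

	memcpy(&header, *start_address, sizeof(int));
	*start_address += sizeof(int);

	*isnull = (header == -2);
	if (header == -1)
	{
		memcpy(&value, *start_address, sizeof(Datum));
		*start_address += sizeof(Datum);
	}
	else if (header > 0)
	{
		char	   *copy = palloc(header);	/* brand-new palloc'd chunk */

		memcpy(copy, *start_address, header);
		*start_address += header;
		value = PointerGetDatum(copy);
	}
	return value;
}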
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have fixed most of the review comments raised in this mail as well as other e-mails and rebased the patch on commit 020235a5. Even though I have fixed many of the things, quite a few comments are yet to be handled. This patch is mainly a rebased version to ease the review. We can continue to have discussion on the left over things and I will address those in consecutive patches.
>
> + if (es->analyze && nodeTag(plan) == T_Funnel)
>
> Why not IsA()?
>
> + FinishParallelSetupAndAccumStats((FunnelState *)planstate);
>
> Shouldn't there be a space before planstate?
>
> + /* inform executor to collect buffer usage stats from parallel workers. */
> + estate->total_time = queryDesc->totaltime ? 1 : 0;
>
> Boy, the comment sure doesn't seem to match the code.
>
> + * Accumulate the stats by parallel workers before stopping the
> + * node.
>
> Suggest: "Accumulate stats from parallel workers before stopping node".
>
> + * If we are not able to send the tuple, then we assume that
> + * destination has closed and we won't be able to send any more
> + * tuples so we just end the loop.
>
> Suggest: "If we are not able to send the tuple, we assume the
> destination has closed and no more tuples can be sent. If that's the
> case, end the loop."
>
> +static void
> +EstimateParallelSupportInfoSpace(ParallelContext *pcxt, ParamListInfo params,
> + List *serialized_param_exec_vals,
> + int instOptions, Size *params_size,
> + Size *params_exec_size);
> +static void
> +StoreParallelSupportInfo(ParallelContext *pcxt, ParamListInfo params,
> + List *serialized_param_exec_vals,
> + int instOptions, Size params_size,
> + Size params_exec_size,
> + char **inst_options_space,
> + char **buffer_usage_space);
>
> Whitespace doesn't look like PostgreSQL style. Maybe run pgindent on
> the newly-added files?
>
> +/*
> + * This is required for parallel plan execution to fetch the information
> + * from dsm.
> + */
>
> This comment doesn't really say anything. Can we get a better one?
>
> + /*
> + * We expect each worker to populate the BufferUsage structure
> + * allocated by master backend and then master backend will aggregate
> + * all the usage along with it's own, so account it for each worker.
> + */
>
> This also needs improvement. Especially because...
>
> + /*
> + * We expect each worker to populate the instrumentation structure
> + * allocated by master backend and then master backend will aggregate
> + * all the information, so account it for each worker.
> + */
>
> ...it's almost identical to this one.
>
> + * Store bind parameter's list in dynamic shared memory. This is
> + * used for parameters in prepared query.
>
> s/bind parameter's list/bind parameters/. I think you could drop the
> second sentence, too.
>
> + /*
> + * Store PARAM_EXEC parameters list in dynamic shared memory. This is
> + * used for evaluation plan->initPlan params.
> + */
>
> So is the previous block for PARAM_EXTERN and this is PARAM_EXEC? If
> so, maybe that could be more clearly laid out.
>
> +GetParallelSupportInfo(shm_toc *toc, ParamListInfo *params,
>
> Could this be a static function? Will it really be needed outside this file?
>
> And is there any use case for letting some of the arguments be NULL?
> Seems kind of an awkward API.
>
> +bool
> +ExecParallelBufferUsageAccum(Node *node)
> +{
> + if (node == NULL)
> + return false;
> +
> + switch (nodeTag(node))
> + {
> + case T_FunnelState:
> + {
> + FinishParallelSetupAndAccumStats((FunnelState*)node);
> + return true;
> + }
> + break;
> + default:
> + break;
> + }
> +
> + (void) planstate_tree_walker((Node*)((PlanState *)node)->lefttree, NULL,
> + ExecParallelBufferUsageAccum, 0);
> + (void) planstate_tree_walker((Node*)((PlanState *)node)->righttree, NULL,
> + ExecParallelBufferUsageAccum, 0);
> + return false;
> +}
>
> This seems wacky. I mean, isn't the point of planstate_tree_walker()
> that the callback itself doesn't have to handle recursion like this?
> And if not, then this wouldn't be adequate anyway, because some
> planstate nodes have children that are not in lefttree or righttree
> (cf. explain.c).
>
> + currentRelation = ExecOpenScanRelation(estate,
> + ((SeqScan *)
> node->ss.ps.plan)->scanrelid,
> + eflags);
>
> I can't see how this can possibly be remotely correct. The funnel
> node shouldn't be limited to scanning a baserel (cf. fdw_scan_tlist).
>
> +void ExecAccumulateInstInfo(FunnelState *node)
>
> Another place where pgindent would help. There are a bunch of others
> I noticed too, but I'm just mentioning a few here to make the point.
>
> + buffer_usage_worker = (BufferUsage *)(buffer_usage + (i *
> sizeof(BufferUsage)));
>
> Cast it to a BufferUsage * first. Then you can use &foo[i] to find
> the i'th element.
>
> + /*
> + * Re-initialize the parallel context and workers to perform
> + * rescan of relation. We want to gracefully shutdown all the
> + * workers so that they should be able to propagate any error
> + * or other information to master backend before dying.
> + */
> + FinishParallelSetupAndAccumStats(node);
>
> Somehow, this makes me feel like that function is badly named.
>
> +/*
> + * _readPlanInvalItem
> + */
> +static PlanInvalItem *
> +_readPlanInvalItem(void)
> +{
> + READ_LOCALS(PlanInvalItem);
> +
> + READ_INT_FIELD(cacheId);
> + READ_UINT_FIELD(hashValue);
> +
> + READ_DONE();
> +}
>
> I don't see why we should need to be able to copy PlanInvalItems. In
> fact, it seems like a bad idea.
>
> +#parallel_setup_cost = 0.0 # same scale as above
> +#define DEFAULT_PARALLEL_SETUP_COST 0.0
>
> This value is probably a bit on the low side.
>
> +int parallel_seqscan_degree = 0;
>
> I think we should have a GUC for the maximum degree of parallelism in
> a query generally, not the maximum degree of parallel sequential scan.
>
> + if (parallel_seqscan_degree >= MaxConnections)
> + {
> + write_stderr("%s: parallel_scan_degree must be less than
> max_connections\n", progname);
> + ExitPostmaster(1);
> + }
>
> I think this check is thoroughly unnecessary. It's comparing to the
> wrong thing anyway, because what actually matters is
> max_worker_processes, not max_connections. But in any case there is
> no need for the check. If somebody stupidly tries an unreasonable
> value for the maximum degree of parallelism, they won't get that many
> workers, but nothing will break. It's no worse than setting any other
> query planner costing parameter to an insane value.
>
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -126,6 +126,7 @@ extern void heap_rescan_set_params(HeapScanDesc
> scan, ScanKey key,
> extern void heap_endscan(HeapScanDesc scan);
> extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
>
> +
> extern bool heap_fetch(Relation relation, Snapshot snapshot,
>
> Stray whitespace change.
>
> Nodes where you just want to let the walk continue shouldn't need to
> be enumerated; dispatching like this should be the default case.
> + case T_Result:
> + fix_opfuncids((Node*) (((Result
> *)node)->resconstantqual));
> + break;
> Seems similarly wrong.
> + * cost_patialseqscan
> Typo. The actual function name has the same typo.
> + num_parallel_workers = parallel_seqscan_degree;
> + if (parallel_seqscan_degree <= estimated_parallel_workers)
> + num_parallel_workers = parallel_seqscan_degree;
> + else
> + num_parallel_workers = estimated_parallel_workers;
> Use Min?
> + {
> + instoptions = shm_toc_lookup(toc, PARALLEL_KEY_INST_OPTIONS);
> + *inst_options = *instoptions;
> + if (inst_options)
> Same pointer variable check, it should be if (*inst_options) as per the
> estimate and store functions.
Fixed.
> + if (funnelstate->ss.ps.ps_ProjInfo)
> + slot = funnelstate->ss.ps.ps_ProjInfo->pi_slot;
> + else
> + slot = funnelstate->ss.ss_ScanTupleSlot;
> Currently, there will not be a projinfo for funnel node. So always it uses
> the scan tuple slot. In case if it is different, we need to add the ExecProject
> call in ExecFunnel function. Currently it is not present, either we can document
> it or add the function call.
Added an appropriate comment as discussed upthread.
> + if (parallel_seqscan_degree >= MaxConnections)
> + {
> + write_stderr("%s: parallel_scan_degree must be less than
> max_connections\n", progname);
> + ExitPostmaster(1);
> + }
> The error condition works only during server start. User still can set
> parallel seqscan degree
> more than max connection at super user session level and etc.
Removed this check.
> + if (!parallelstmt->inst_options)
> + (*receiver->rDestroy) (receiver);
> Why only when there is no instruementation only, the receiver needs to
> be destroyed?
2. Add an ID to each plan node and use that ID as the DSM key.
EXPLAIN ANALYZE should show both the planned and actual values.
Attachment
On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have fixed most of the review comments raised in this mail
> as well as other e-mails and rebased the patch on commit
> 020235a5. Even though I have fixed many of the things,
> quite a few comments are yet to be handled. This patch
> is mainly a rebased version to ease the review. We can continue
> to have discussion on the left over things and I will address
> those in consecutive patches.

Thanks for the update. Here are some more review comments:

1. parallel_seqscan_degree is still not what we should have here. As previously mentioned, I think we should have a GUC for the maximum degree of parallelism in a query generally, not the maximum degree of parallel sequential scan.

2. fix_node_funcids() can go away because of commit 9f1255ac859364a86264a67729dbd1a36dd63ff2.

3. cost_patialseqscan is still misspelled. I pointed this out before, too.

4. Much more seriously than any of the above, create_parallelscan_paths() generates plans that are badly broken:

rhaas=# explain select * from pgbench_accounts where filler < random()::text;
                                        QUERY PLAN
-----------------------------------------------------------------------------------------
 Funnel on pgbench_accounts  (cost=0.00..35357.73 rows=3333333 width=97)
   Filter: ((filler)::text < (random())::text)
   Number of Workers: 10
   ->  Partial Seq Scan on pgbench_accounts  (cost=0.00..35357.73 rows=3333333 width=97)
         Filter: ((filler)::text < (random())::text)
(5 rows)

This is wrong both because random() is parallel-restricted and thus can't be executed in a parallel worker, and also because enforcing a volatile qual twice is no good.

rhaas=# explain select * from pgbench_accounts where aid % 10000 = 0;
                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Funnel on pgbench_accounts  (cost=0.00..28539.55 rows=50000 width=97)
   Filter: ((aid % 10000) = 0)
   Number of Workers: 10
   ->  Partial Seq Scan on pgbench_accounts  (cost=0.00..28539.55 rows=50000 width=97)
         Filter: ((aid % 10000) = 0)
(5 rows)

This will work, but it's a bad plan because we shouldn't need to re-apply the filter condition in the parallel leader after we've already checked it in the worker.

rhaas=# explain select * from pgbench_accounts a where a.bid not in
(select bid from pgbench_branches);
                                         QUERY PLAN
-------------------------------------------------------------------------------------------
 Funnel on pgbench_accounts a  (cost=2.25..26269.07 rows=5000000 width=97)
   Filter: (NOT (hashed SubPlan 1))
   Number of Workers: 10
   ->  Partial Seq Scan on pgbench_accounts a  (cost=2.25..26269.07 rows=5000000 width=97)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Seq Scan on pgbench_branches  (cost=0.00..2.00 rows=100 width=4)
   SubPlan 1
     ->  Seq Scan on pgbench_branches  (cost=0.00..2.00 rows=100 width=4)
(9 rows)

This will not work, because the subplan isn't available inside the worker. Trying it without the EXPLAIN crashes the server. This is more or less the same issue as one of the known issues you already mentioned, but I mention it again here because I think this case is closely related to the previous two.

Basically, you need to have some kind of logic for deciding which things need to go below the funnel and which on the funnel itself. The stuff that's safe should get pushed down, and the stuff that's not safe should get attached to the funnel. The unsafe stuff is whatever contains references to initplans or subplans, or anything that contains parallel-restricted functions ... and there might be some other stuff, too, but at least those things.

Instead of preventing initplans or subplans from getting pushed down into the funnel, we could instead try to teach the system to push them down. However, that's very complicated; e.g. a subplan that references a CTE isn't safe to push down, and a subplan that references another subplan must be pushed down if that other subplan is pushed down, and an initplan that contains volatile functions can't be pushed down because each worker would execute it separately and they might not all get the same answer, and an initplan that references a temporary table can't be pushed down because it can't be referenced from a worker. All in all, it seems better not to go there right now.

Now, when you fix this, you're quickly going to run into another problem, which is that as you have this today, the funnel node does not actually enforce its filter condition, so the EXPLAIN plan is a dastardly lie. And when you try to fix that, you're going to run into a third problem, which is that the stuff the funnel node needs in order to evaluate its filter condition might not be in the partial seq scan's target list. So you need to fix both of those problems, too. Even if you cheat and just don't generate a parallel path at all except when all quals can be pushed down, you're still going to have to fix these problems: it's not OK to print out a filter condition on the funnel as if you were going to enforce it, and then not actually enforce it there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
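To make the safe/unsafe split concrete, here is a hypothetical sketch in the spirit of create_indexscan_plan()'s index/non-index qual separation. The per-function safety test is an assumption (pg_proc carries no parallel-safety marking at this point), so function_is_parallel_safe() stands in for whatever checker the patch eventually grows:

static bool
contains_parallel_restricted(Node *node, void *context)
{
	if (node == NULL)
		return false;
	/* subplan and initplan (Param) references must stay above the funnel */
	if (IsA(node, SubPlan) || IsA(node, Param))
		return true;
	/* hypothetical per-function parallel-safety check */
	if (IsA(node, FuncExpr) &&
		!function_is_parallel_safe(((FuncExpr *) node)->funcid))
		return true;
	return expression_tree_walker(node, contains_parallel_restricted, context);
}

static void
classify_quals(List *quals, List **safe_quals, List **funnel_quals)
{
	ListCell   *lc;

	foreach(lc, quals)
	{
		Node	   *qual = (Node *) lfirst(lc);

		if (contains_parallel_restricted(qual, NULL))
			*funnel_quals = lappend(*funnel_quals, qual);	/* leader enforces */
		else
			*safe_quals = lappend(*safe_quals, qual);	/* workers enforce */
	}
}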
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patches ]
>
>
> It looks to me like there would be trouble if an initPlan or subPlan
> were kept below a Funnel, or as I guess we're going to call it, a
> Gather node. That's because a SubPlan doesn't actually have a pointer
> to the node tree for the sub-plan in it. It just has an index into
> PlannedStmt.subplans. But create_parallel_worker_plannedstmt sets the
> subplans list to NIL. So that's not gonna work. Now maybe there's no
> way for an initPlan or a subPlan to creep down under the funnel, but I
> don't immediately see what would prevent it.
>
+ /*
+ * Evaluate the InitPlan and pass the PARAM_EXEC params, so that
+ * values can be shared with worker backend. This is different from
+ * the way InitPlans are evaluated (lazy evaluation) at other places
+ * as instead of sharing the InitPlan to all the workers and let them
+ * execute, we pass the values which can be directly used by worker
+ * backends.
+ */
+ serialized_param_exec = ExecAndFormSerializeParamExec(econtext,
+ node->ss.ps.plan->lefttree->allParam);
>
> On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have fixed most of the review comments raised in this mail
> > as well as other e-mails and rebased the patch on commit-
> > 020235a5. Even though I have fixed many of the things, but
> > still quite a few comments are yet to be handled. This patch
> > is mainly a rebased version to ease the review. We can continue
> > to have discussion on the left over things and I will address
> > those in consecutive patches.
>
> Thanks for the update. Here are some more review comments:
>
> 1. parallel_seqscan_degree is still not what we should have here. As
> previously mentioned, I think we should have a GUC for the maximum
> degree of parallelism in a query generally, not the maximum degree of
> parallel sequential scan.
>
> 2. fix_node_funcids() can go away because of commit
> 9f1255ac859364a86264a67729dbd1a36dd63ff2.
>
> 3. cost_patialseqscan is still misspelled. I pointed this out before, too.
>
> 4. Much more seriously than any of the above,
> create_parallelscan_paths() generates plans that are badly broken:
>
> rhaas=# explain select * from pgbench_accounts where filler < random()::text;
> QUERY PLAN
> -----------------------------------------------------------------------------------------
> Funnel on pgbench_accounts (cost=0.00..35357.73 rows=3333333 width=97)
> Filter: ((filler)::text < (random())::text)
> Number of Workers: 10
> -> Partial Seq Scan on pgbench_accounts (cost=0.00..35357.73
> rows=3333333 width=97)
> Filter: ((filler)::text < (random())::text)
> (5 rows)
>
> This is wrong both because random() is parallel-restricted and thus
> can't be executed in a parallel worker, and also because enforcing a
> volatile qual twice is no good.
>
Yes, the patch needs more work in terms of dealing with parallel-restricted expressions/functions. One idea which I have explored previously is to push down only safe clauses to workers (via the partialseqscan node) and execute restricted clauses in the master (via the Funnel node). My analysis is as follows:

Usage of restricted functions in quals -
During create_plan() phase, separate out the quals that need to be executed at the funnel node versus quals that need to be executed on the partial seq scan node (do something similar to what is done in create_indexscan_plan for index and non-index quals).

Basically the PartialSeqScan node can contain two different lists of quals, one for non-restrictive quals and the other for restrictive quals, and then the Funnel node can retrieve the restrictive quals from the partialseqscan node, assuming the partialseqscan node is its left child.

Now, I think the above can only be possible under the assumption that the partialseqscan node is always a left child of the funnel node; however, that is not true, because a gating node (Result node) can be added between the two, and tomorrow there could be more cases when other nodes are added between the two. If we consider the case of aggregation, the situation will be more complex, as before partial aggregation all the quals should be executed.
Unless there is a good way to achieve the partial execution of quals,
Usage of restricted functions in target list -
One way could be if target list contains any restricted function, then parallel
> rhaas=# explain select * from pgbench_accounts where aid % 10000 = 0;
> QUERY PLAN
> ---------------------------------------------------------------------------------------
> Funnel on pgbench_accounts (cost=0.00..28539.55 rows=50000 width=97)
> Filter: ((aid % 10000) = 0)
> Number of Workers: 10
> -> Partial Seq Scan on pgbench_accounts (cost=0.00..28539.55
> rows=50000 width=97)
> Filter: ((aid % 10000) = 0)
> (5 rows)
>
> This will work, but it's a bad plan because we shouldn't need to
> re-apply the filter condition in the parallel leader after we've
> already checked it in the worker.
>
> rhaas=# explain select * from pgbench_accounts a where a.bid not in
> (select bid from pgbench_branches);
> QUERY PLAN
> -------------------------------------------------------------------------------------------
> Funnel on pgbench_accounts a (cost=2.25..26269.07 rows=5000000 width=97)
> Filter: (NOT (hashed SubPlan 1))
> Number of Workers: 10
> -> Partial Seq Scan on pgbench_accounts a (cost=2.25..26269.07
> rows=5000000 width=97)
> Filter: (NOT (hashed SubPlan 1))
> SubPlan 1
> -> Seq Scan on pgbench_branches (cost=0.00..2.00 rows=100 width=4)
> SubPlan 1
> -> Seq Scan on pgbench_branches (cost=0.00..2.00 rows=100 width=4)
> (9 rows)
>
> This will not work, because the subplan isn't available inside the
> worker. Trying it without the EXPLAIN crashes the server. This is
> more or less the same issue as one of the known issues you already
> mentioned, but I mention it again here because I think this case is
> closely related to the previous two.
>
> Basically, you need to have some kind of logic for deciding which
> things need to go below the funnel and which on the funnel itself.
> The stuff that's safe should get pushed down, and the stuff that's not
> safe should get attached to the funnel. The unsafe stuff is whatever
> contains references to initplans or subplans, or anything that
> contains parallel-restricted functions ... and there might be some
> other stuff, too, but at least those things.
>
> Instead of preventing initplans or subplans from getting pushed down
> into the funnel, we could instead try to teach the system to push them
> down. However, that's very complicated; e.g. a subplan that
> references a CTE isn't safe to push down, and a subplan that
> references another subplan must be pushed down if that other subplan
> is pushed down, and an initplan that contains volatile functions can't
> be pushed down because each worker would execute it separately and
> they might not all get the same answer, and an initplan that
> references a temporary table can't be pushed down because it can't be
> referenced from a worker. All in all, it seems better not to go there
> right now.
>
> Now, when you fix this, you're quickly going to run into another
> problem, which is that as you have this today, the funnel node does
> not actually enforce its filter condition, so the EXPLAIN plan is a
> dastardly lie. And when you try to fix that, you're going to run into
> a third problem, which is that the stuff the funnel node needs in
> order to evaluate its filter condition might not be in the partial seq
> scan's target list. So you need to fix both of those problems, too.
> Even if you cheat and just don't generate a parallel path at all
> except when all quals can be pushed down, you're still going to have
> to fix these problems: it's not OK to print out a filter condition on
> the funnel as if you were going to enforce it, and then not actually
> enforce it there.
>
>
> On Thu, Sep 3, 2015 at 6:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patches ]
>
> Still more review comments:
>
> + /* Allow space for terminating zero-byte */
> + size = add_size(size, 1);
>
> This is pointless. The length is already stored separately, and if it
> weren't, this wouldn't be adequate anyway because a varlena can
> contain NUL bytes. It won't if it's text, but it could be bytea or
> numeric or whatever.
>
> RestoreBoundParams is broken, because it can do unaligned reads, which
> will core dump on some architectures (and merely be slow on others).
> If there are two or more parameters, and the first one is a varlena
> with a length that is not a multiple of MAXIMUM_ALIGNOF, the second
> SerializedParamExternData will be misaligned.
>
> Also, it's pretty lame that we send the useless pointer even for a
> pass-by-reference data type and then overwrite the bad pointer with a
> good one a few lines later. It would be better to design the
> serialization format so that we don't send the bogus pointer over the
> wire in the first place.
>
> It's also problematic in my view that there is so much duplicated code
> here. SerializedParamExternData and SerializedParamExecData are very
> similar and there are large swaths of very similar code to handle each
> case. Both structures contain Datum value, Size length, bool isnull,
> and Oid ptype, albeit not in the same order for some reason. The only
> difference is that SerializedParamExternData contains uint16 pflags
> and SerializedParamExecData contains int paramid. I think we need to
> refactor this code to get rid of all this duplication.
> I suggest that we decide to represent a datum here in a uniform fashion,
> perhaps like this:
>
> First, store a 4-byte header word. If this is -2, the value is NULL
> and no data follows. If it's -1, the value is pass-by-value and
> sizeof(Datum) bytes follow. If it's >0, the value is
> pass-by-reference and the value gives the number of following bytes
> that should be copied into a brand-new palloc'd chunk.
>
> Using a format like this, we can serialize and restore datums in
> various contexts - including bind and exec params - without having to
> rewrite the code each time. For example, for param extern data, you
> can dump an array of all the ptypes and paramids and then follow it
> with all of the params one after another. For param exec data, you
> can dump an array of all the ptypes and paramids and then follow it
> with the values one after another. The code that reads and writes the
> datums in both cases can be the same. If we need to send datums in
> other contexts, we can use the same code for it.
>
> The attached patch - which I even tested! - shows what I have in mind.
> It can save and restore the ParamListInfo (bind parameters). I
> haven't tried to adapt it to the exec parameters because I don't quite
> understand what you are doing there yet, but you can see that the
> datum-serialization logic is separated from the stuff that knows about
> the details of ParamListInfo, so datumSerialize() should be reusable
> for other purposes.
> This also doesn't have the other problems mentioned above.
>
I have a question here, which is why this format doesn't have a similar problem as the current version: basically, in the current patch the second read of SerializedParamExternData can be misaligned, and for the same reason in your patch the second read of Oid could be misaligned?
On Fri, Sep 25, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think initPlan will work with the existing patches as we are always > executing it in master and then sending the result to workers. Refer > below code in funnel patch: Sure, *if* that's what we're doing, then it will work. But if an initPlan actually attaches below a funnel, then it will break. Maybe that can't happen; I'm just sayin' ... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 25, 2015 at 12:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > In the latest patch (parallel_seqscan_partialseqscan_v18.patch) posted by > me yesterday, this was fixed. Am I missing something or by any chance > you are referring to wrong version of patch You're right, I'm wrong. Sorry. > Yes, the patch needs more work in terms of dealing with parallel-restricted > expressions/functions. One idea which I have explored previously is > push down only safe clauses to workers (via partialseqscan node) and > execute restricted clauses in master (via Funnel node). My analysis > is as follows: > > Usage of restricted functions in quals- > During create_plan() phase, separate out the quals that needs to be > executed at funnel node versus quals that needs to be executed on > partial seq scan node (do something similar to what is done in > create_indexscan_plan for index and non-index quals). > > Basically PartialSeqScan node can contain two different list of quals, > one for non-restrictive quals and other for restrictive quals and then > Funnel node can retrieve restrictive quals from partialseqscan node, > assuming partialseqscan node is its left child. > > Now, I think the above can only be possible under the assumption that > partialseqscan node is always a left child of funnel node, however that is > not true because a gating node (Result node) can be added between the > two and tomorrow there could be more cases when other nodes will be > added between the two, if we consider the case of aggregation, the > situation will be more complex as before partial aggregation, all the > quals should be executed. What's the situation where the gating Result node sneaks in there? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 25, 2015 at 7:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I have a question here which is why this format doesn't have a similar > problem > as the current version, basically in current patch the second read of > SerializedParamExternData can be misaligned and for same reason in your > patch the second read of Oid could by misaligned? memcpy() can cope with unaligned data; structure member assignment can't. I've worked some of this code over fairly heavily today and I'm pretty happy with how my copy of execParallel.c looks now, but I've run into one issue where I wanted to check with you. There are three places where Instrumentation can be attached to a query: a ResultRelInfo's ri_TrigInstrument (which doesn't matter for us because we don't support parallel write queries, and triggers don't run on reads), a PlanState's instrument, and a QueryDesc's total time. Your patch makes provision to copy ONE Instrumentation structure per worker back to the parallel leader. I assumed this must be the QueryDesc's totaltime, but it looks like it's actually the PlanState of the top node passed to the worker. That's of course no good if we ever push more than one node down to the worker, which we may very well want to do in the initial version, and surely want to do eventually. We can't just deal with the top node and forget all the others. Is that really what's happening here, or am I confused? Assuming I'm not confused, I'm planning to see about fixing this... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>
> On Fri, Sep 25, 2015 at 12:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Yes, the patch needs more work in terms of dealing with parallel-restricted
> > expressions/functions. One idea which I have explored previously is
> > push down only safe clauses to workers (via partialseqscan node) and
> > execute restricted clauses in master (via Funnel node). My analysis
> > is as follows:
> >
> > Usage of restricted functions in quals-
> > During create_plan() phase, separate out the quals that needs to be
> > executed at funnel node versus quals that needs to be executed on
> > partial seq scan node (do something similar to what is done in
> > create_indexscan_plan for index and non-index quals).
> >
> > Basically PartialSeqScan node can contain two different list of quals,
> > one for non-restrictive quals and other for restrictive quals and then
> > Funnel node can retrieve restrictive quals from partialseqscan node,
> > assuming partialseqscan node is its left child.
> >
> > Now, I think the above can only be possible under the assumption that
> > partialseqscan node is always a left child of funnel node, however that is
> > not true because a gating node (Result node) can be added between the
> > two and tomorrow there could be more cases when other nodes will be
> > added between the two, if we consider the case of aggregation, the
> > situation will be more complex as before partial aggregation, all the
> > quals should be executed.
>
> What's the situation where the gating Result node sneaks in there?
>
The plan node structure will be something like
>
> On Fri, Sep 25, 2015 at 7:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have a question here which is why this format doesn't have a similar
> > problem
> > as the current version, basically in current patch the second read of
> > SerializedParamExternData can be misaligned and for same reason in your
> > patch the second read of Oid could by misaligned?
>
> memcpy() can cope with unaligned data; structure member assignment can't.
>
> I've worked some of this code over fairly heavily today and I'm pretty
> happy with how my copy of execParallel.c looks now, but I've run into
> one issue where I wanted to check with you. There are three places
> where Instrumentation can be attached to a query: a ResultRelInfo's
> ri_TrigInstrument (which doesn't matter for us because we don't
> support parallel write queries, and triggers don't run on reads), a
> PlanState's instrument, and a QueryDesc's total time.
>
> Your patch makes provision to copy ONE Instrumentation structure per
> worker back to the parallel leader. I assumed this must be the
> QueryDesc's totaltime, but it looks like it's actually the PlanState
> of the top node passed to the worker. That's of course no good if we
> ever push more than one node down to the worker, which we may very
> well want to do in the initial version, and surely want to do
> eventually. We can't just deal with the top node and forget all the
> others. Is that really what's happening here, or am I confused?
>
> Assuming I'm not confused, I'm planning to see about fixing this...
>
Can't we just traverse the queryDesc->planstate tree and fetch/add all the instrument information if there are multiple nodes?
>
> On Fri, Sep 25, 2015 at 12:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think initPlan will work with the existing patches as we are always
> > executing it in master and then sending the result to workers. Refer
> > below code in funnel patch:
>
> Sure, *if* that's what we're doing, then it will work. But if an
> initPlan actually attaches below a funnel, then it will break.
>
> On Sat, Sep 26, 2015 at 6:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
>
> > Assuming I'm not confused, I'm planning to see about fixing this...
> >
>
> Can't we just traverse the queryDesc->planstate tree and fetch/add
> all the instrument information if there are multiple nodes?
>
I think the above suggestion made by me won't work, because we want
On Sat, Sep 26, 2015 at 3:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> memcpy() can cope with unaligned data; structure member assignment can't.
>
> So doesn't coping mean it anyway has to pay the performance penalty to
> make it equivalent to aligned address access? Apart from that, today I
> read about memcpy's behaviour in case of unaligned addresses, and it
> seems from some of the information on the net that it could be unsafe
> [1],[2].

I'm not concerned about the performance penalty for unaligned access in this case; I'm concerned about the fact that on some platforms it causes a segmentation fault. The links you've provided there are examples of cases where that wasn't true, and people reported that as a bug in memcpy.

> Yes, you have figured out correctly, I was under the impression that we
> would have single node execution in the worker for the first version and
> then extend it later.

No, I really want it to work with multiple nodes from the start, and I've pretty much got that working here now.

> QueryDesc's totaltime is for instrumentation information for plugins
> like pg_stat_statements, and we need only the total buffer usage
> of each worker to make it work, as the other information is already
> collected in the master backend, so I think that should work as I have
> written.

I don't think that's right at all. First, an extension can choose to look at any part of the Instrumentation, not just the buffer usage. Secondly, the buffer usage inside QueryDesc's totaltime isn't the same as the global pgBufferUsage.

>> Assuming I'm not confused, I'm planning to see about fixing this...
>
> Can't we just traverse the queryDesc->planstate tree and fetch/add
> all the instrument information if there are multiple nodes?

Well, you need to add each node's information in each worker to the corresponding node in the leader. You're not just adding them all up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
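A tiny standalone illustration of the memcpy()-versus-assignment point; plain C, with a stand-in typedef rather than the PostgreSQL one:

#include <string.h>

typedef unsigned int Oid;		/* stand-in for the PostgreSQL typedef */

static Oid
read_oid_unaligned(const char *buf)
{
	/*
	 * "*(const Oid *) buf" is a misaligned load whenever buf is not
	 * 4-byte aligned: merely slow on x86, but a SIGBUS/segfault on
	 * strict-alignment platforms.  Copying the bytes with memcpy() is
	 * safe everywhere, and compilers turn it into a single load where
	 * the hardware allows it.
	 */
	Oid			result;

	memcpy(&result, buf, sizeof(Oid));
	return result;
}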
On Sat, Sep 26, 2015 at 8:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> QueryDesc's totaltime is for instrumentation information for plugins
>> like pg_stat_statements, and we need only the total buffer usage
>> of each worker to make it work, as the other information is already
>> collected in the master backend, so I think that should work as I
>> have written it.
>
> I don't think that's right at all. First, an extension can choose to
> look at any part of the Instrumentation, not just the buffer usage.
> Secondly, the buffer usage inside QueryDesc's totaltime isn't the same
> as the global pgBufferUsage.

Oh... but I'm wrong. As long as our local pgBufferUsage gets updated
correctly to incorporate the data from the other workers, the
InstrStopNode(queryDesc->totaltime) will suck in those statistics.
And the only other things getting updated are nTuples (which shouldn't
count anything that the workers did), firsttuple (similarly), and
counter (where the behavior is a bit more arguable, but just counting
the master's wall-clock time is at least defensible). So now I think
you're right: this should be OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
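For illustration, here is a minimal standalone sketch of the
accumulation scheme described above: each worker reports its buffer
counters, the leader folds them into its own global usage, and
anything that later snapshots the global counters (the analogue of
InstrStopNode(queryDesc->totaltime)) picks up the workers' I/O for
free. The struct and function names are invented stand-ins, not
PostgreSQL's actual symbols:

    #include <stdio.h>

    /* Stand-in for PostgreSQL's BufferUsage counters (simplified). */
    typedef struct BufUsage
    {
        long shared_blks_hit;
        long shared_blks_read;
    } BufUsage;

    /* Stand-in for the leader's global pgBufferUsage. */
    static BufUsage global_usage;

    /* Fold one worker's counters into the leader's global counters. */
    static void
    accum_worker_usage(const BufUsage *w)
    {
        global_usage.shared_blks_hit += w->shared_blks_hit;
        global_usage.shared_blks_read += w->shared_blks_read;
    }

    int
    main(void)
    {
        BufUsage workers[2] = {{10, 3}, {7, 5}};

        for (int i = 0; i < 2; i++)
            accum_worker_usage(&workers[i]);

        /*
         * Anything that snapshots global_usage after this point now
         * sees the workers' I/O as well as the leader's own.
         */
        printf("hit=%ld read=%ld\n",
               global_usage.shared_blks_hit,
               global_usage.shared_blks_read);
        return 0;
    }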
On Sat, Sep 26, 2015 at 10:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Sep 26, 2015 at 8:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> QueryDesc's totaltime is for instrumentation information for plugins
>>> like pg_stat_statements, and we need only the total buffer usage
>>> of each worker to make it work, as the other information is already
>>> collected in the master backend, so I think that should work as I
>>> have written it.
>>
>> I don't think that's right at all. First, an extension can choose to
>> look at any part of the Instrumentation, not just the buffer usage.
>> Secondly, the buffer usage inside QueryDesc's totaltime isn't the same
>> as the global pgBufferUsage.
>
> Oh... but I'm wrong. As long as our local pgBufferUsage gets updated
> correctly to incorporate the data from the other workers, the
> InstrStopNode(queryDesc->totaltime) will suck in those statistics.
> And the only other things getting updated are nTuples (which shouldn't
> count anything that the workers did), firsttuple (similarly), and
> counter (where the behavior is a bit more arguable, but just counting
> the master's wall-clock time is at least defensible). So now I think
> you're right: this should be OK.

OK, so here's a patch extracted from your
parallel_seqscan_partialseqscan_v18.patch with a fairly substantial
amount of rework by me:

- I left out the Funnel node itself; this is just the infrastructure
portion of the patch. I also left out the stop-the-executor-early
stuff and the serialization of PARAM_EXEC values. I want to have
those things, but I think they need more thought and study first.

- I reorganized the code a fair amount into a form that I thought was
clearer, and certainly is closer to what I did previously in
parallel.c. I found your version had lots of functions with lots of
parameters, and I found that made the logic difficult to follow, at
least for me. As part of that, I munged the interface a bit so that
execParallel.c returns a structure with a bunch of pointers in it
instead of separately returning each one as an out parameter. I
think that's cleaner. If we need to add more stuff in the future,
that way we don't break existing callers.

- I reworked the interface with instrument.c and tried to preserve
something of an abstraction boundary there. I also changed the way
that stuff accumulated statistics to include more things; I couldn't
see any reason to make it as narrow as you had it.

- I did a bunch of cosmetic cleanup, including changing function
names and rewriting comments.

- I replaced your code for serializing and restoring a ParamListInfo
with my version.

- I fixed the code so that it can handle collecting instrumentation
data from multiple nodes, bringing all the data back to the leader
and associating it with the right plan node. This involved giving
every plan node a unique ID, as discussed with Tom on another recent
thread.

After I did all that, I wrote some test code, which is also attached
here, that adds a new GUC force_parallel_worker. If you set that GUC,
when you run a query, it'll run the query in a parallel worker and
feed the results back to the master. I've tested this and it seems to
work, at least on the queries where you'd expect it to work. It's
just test code, so it doesn't have error checking or make any attempt
not to push down queries that will fail in parallel mode. But you can
use it to see what happens.
You can also run queries under EXPLAIN ANALYZE this way and, lo and
behold, the worker stats show up attached to the correct plan nodes.

I intend to commit this patch (but not the crappy test code, of
course) pretty soon, and then I'm going to start working on the
portion of the patch that actually adds the Funnel node, which I think
you are working on renaming to Gather. I think that getting that part
committed is likely to be pretty straightforward; it doesn't need to
do a lot more than call this stuff and tell it to go do its thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
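For illustration, a minimal standalone sketch of the "associate worker
stats with the right plan node" idea just described: every plan node
carries a unique ID, workers ship an (id, instrumentation) array back
through shared memory, and the leader walks its own tree adding the
matching entries. All names here are invented stand-ins, not the
committed execParallel.c code:

    #include <stdio.h>

    /* Stand-in for per-node Instrumentation, cut down to one counter. */
    typedef struct Instr
    {
        double ntuples;
    } Instr;

    /* Stand-in plan node: a unique id, local stats, and one child. */
    typedef struct PlanNode
    {
        int   plan_node_id;     /* unique across the whole plan tree */
        Instr instr;
        struct PlanNode *child;
    } PlanNode;

    /* What a worker writes into shared memory: one slot per node. */
    typedef struct WorkerEntry
    {
        int   plan_node_id;
        Instr instr;
    } WorkerEntry;

    /* Leader: fold a worker's entries into the matching tree nodes. */
    static void
    accumulate(PlanNode *node, const WorkerEntry *entries, int nentries)
    {
        if (node == NULL)
            return;
        for (int i = 0; i < nentries; i++)
            if (entries[i].plan_node_id == node->plan_node_id)
                node->instr.ntuples += entries[i].instr.ntuples;
        accumulate(node->child, entries, nentries);
    }

    int
    main(void)
    {
        PlanNode scan = {2, {0}, NULL};
        PlanNode top = {1, {0}, &scan};
        WorkerEntry from_worker[] = {{1, {5}}, {2, {100}}};

        accumulate(&top, from_worker, 2);
        printf("node 1: %.0f, node 2: %.0f\n",
               top.instr.ntuples, scan.instr.ntuples);
        return 0;
    }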
>
> I intend to commit this patch (but not the crappy test code, of
> course) pretty soon, and then I'm going to start working on the
> portion of the patch that actually adds the Funnel node, which I think
> you are working on renaming to Gather.
Attachment
On Tue, Sep 29, 2015 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached patch is a rebased patch based on latest commit (d1b7c1ff)
> for Gather node.
>
> - I have to reorganize the defines in execParallel.h and .c. To keep
> ParallelExecutorInfo, in GatherState node, we need to include execParallel.h
> in execnodes.h which was creating compilation issues as execParallel.h
> also includes execnodes.h, so for now I have defined ParallelExecutorInfo
> in execnodes.h and instrumentation related structures in instrument.h.
> - Renamed parallel_seqscan_degree to degree_of_parallelism
> - Rename Funnel to Gather
> - Removed PARAM_EXEC parameter handling code, I think we can do this
> separately.
>
> I have to work more on partial seq scan patch for rebasing it and handling
> review comments for the same, so for now I am sending the first part of
> patch (which included Gather node functionality and some general support
> for parallel-query execution).

Thanks for the fast rebase.

This patch needs a bunch of cleanup:

- The formatting for the GatherState node's comment block is unlike
that of surrounding comment blocks. It lacks the ------- dividers,
and the indentation is not the same. Also, it refers to
ParallelExecutorInfo by the type name, but the other members by
structure member name. The convention is to refer to them by
structure member name, so please do that.

- The naming of fs_workersReady is inconsistent with the other
structure members. The other members use all lower-case names,
separating words with underscores, but this one uses a capital
letter. The other members also don't prefix the names with anything,
but this uses an "fs_" prefix which I assume is left over from when
this was called FunnelState. Finally, this doesn't actually tell you
when workers are ready, just whether they were launched. I suggest we
rename this to "any_worker_launched".

- Instead of moving the declaration of ParallelExecutorInfo, please
just refer to it as "struct ParallelExecutorInfo" in execnodes.h.
That way, you're not sucking these includes into all kinds of places
they don't really need to be.

- Let's not create a new PARALLEL_QUERY category of GUC. Instead,
let's put the GUC for the number of workers under resource usage ->
asynchronous behavior.

- I don't believe that shm_toc *toc has any business being part of a
generic PlanState node. At most, it should be part of an individual
type of PlanState, like a GatherState or PartialSeqScanState. But
really, I don't see why we need it there at all. It should, I think,
only be needed during startup to dig out the information we need. So
we should just dig that stuff out and keep pointers to whatever we
actually need - in this case the ParallelExecutorInfo, I think - in
the particular type of PlanState node that's at issue - here
GatherState. After that we don't need a pointer to the toc any more.

- I'd like to do some renaming of the new GUCs. I suggest we rename
cpu_tuple_comm_cost to parallel_tuple_cost and degree_of_parallelism
to max_parallel_degree.

- I think that a Gather node should inherit from Plan, not Scan. A
Gather node really shouldn't have a scanrelid. Now, admittedly, if
the only thing under the Gather is a Partial Seq Scan, it wouldn't be
totally bonkers to think of the Gather as scanning the same relation
that the Partial Seq Scan is scanning. But in any more complex case,
like where it's scanning a join, you're out of luck. You'll have to
set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
is not going to work. In fact, as far as I can see, the only way
nodeGather.c is actually using any of the generic scan stuff is by
calling ExecInitScanTupleSlot, which is all of one line of code.
ExecEndGather fetches node->ss.ss_currentRelation but then does
nothing with it. So I think this is just a holdover from an early
version of this patch where what's now Gather and PartialSeqScan were
a single node, and I think we should rip it out.

- On a related note, the assertions in cost_gather() are both bogus
and should be removed. Similarly with create_gather_plan(). As
previously mentioned, the Gather node should not care what sort of
thing is under it; I am not interested in restricting it to baserels
and then undoing that later.

- For the same reason, cost_gather() should refer to its third
argument as "rel", not "baserel".

- Also, I think this stuff about physical tlists in
create_gather_plan() is bogus. use_physical_tlist is ignorant of the
possibility that the RelOptInfo passed to it might be anything other
than a baserel, and I think it won't be happy if it gets a joinrel.
Moreover, I think our plan here is that, at least for now, the
Gather's tlist will always match the tlist of its child. If that's
so, there's no point to this: it will end up with the same tlist
either way. If any projection is needed, it should be done by the
Gather node's child, not the Gather node itself.

- Let's rename DestroyParallelSetupAndAccumStats to
ExecShutdownGather. Instead of encasing the entire function in an if
statement, let's start with: if (node->pei == NULL || node->pei->pcxt
== NULL) return.

- ExecParallelBufferUsageAccum should be declared to take an argument
of type PlanState, not Node. Then you don't have to cast what you are
passing to it, and it doesn't have to cast before calling itself.
And let's also rename it to ExecShutdownNode and move it to
execProcnode.c. Having a "shutdown phase" that stops a node from
asynchronously consuming additional resources could be useful for
non-parallel node types - especially ForeignScan and CustomScan. And
we could eventually extend this to be called in other situations,
like when a Limit is filled, to give everything beneath it a chance
to ease up. We don't have to do those bits of work right now, but it
seems well worth making this look like a generic facility.

- Calling DestroyParallelSetupAndAccumStats from ExplainNode when we
actually reach the Gather node is much too late. We should really be
shutting down parallel workers at the end of the ExecutorRun phase,
or certainly no later than ExecutorFinish. In fact, you have
standard_ExecutorRun calling ExecParallelBufferUsageAccum() but only
if queryDesc->totaltime is set. What I think you should do instead is
call ExecShutdownNode a few lines earlier, before shutting down the
tuple receiver, and do so unconditionally. That way, the workers are
always shut down in the ExecutorRun phase, which should eliminate the
need for this bit in explain.c.

- The changes to postmaster.c and postgres.c consist of only
additional #includes. Those can, presumably, be reverted.

Other than that, hah hah, it looks pretty cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [ parallel_seqscan_partialseqscan_v18.patch ]

I spent a bit of time reviewing the heapam.c changes in this patch
this evening, and I think that your attempt to add support for
synchronized scans has some problems.

- In both heapgettup() and heapgettup_pagemode(), you call
ss_report_location() on discovering that we're trying to initialize
after the scan is already complete. This seems wrong. For the reasons
noted in the existing comments, it's good for the backend that
finishes the scan to report the starting position as the current
position hint, but you don't want every parallel backend to do it in
turn. Unrelated, overlapping scans might be trying to continue
advancing the scan, and you don't want to drag the position hint
backward for no reason.

- heap_parallelscan_initialize_startblock() calls ss_get_location()
while holding a spinlock. This is clearly no good, because spinlocks
can only be held while executing straight-line code that does not
itself acquire locks - and ss_get_location() takes an *LWLock*. Among
other problems, an error anywhere inside ss_get_location() would
leave behind a stuck spinlock.

- There's no point that I can see in initializing rs_startblock at
all when a ParallelHeapScanDesc is in use. The ParallelHeapScanDesc,
not rs_startblock, is going to tell you what page to scan next. I
think heap_parallelscan_initialize_startblock() should basically do
this, in the synchronized scan case:

    SpinLockAcquire(&parallel_scan->phs_mutex);
    page = parallel_scan->phs_startblock;
    SpinLockRelease(&parallel_scan->phs_mutex);

    if (page != InvalidBlockNumber)
        return;     /* some other process already did this */

    page = ss_get_location(scan->rs_rd, scan->rs_nblocks);

    SpinLockAcquire(&parallel_scan->phs_mutex);
    /* even though we checked before, someone might have beaten us here */
    if (parallel_scan->phs_startblock == InvalidBlockNumber)
    {
        parallel_scan->phs_startblock = page;
        parallel_scan->phs_cblock = page;
    }
    SpinLockRelease(&parallel_scan->phs_mutex);

- heap_parallelscan_nextpage() seems to have gotten unnecessarily
complicated. I particularly dislike the way you increment phs_cblock
and then sometimes try to back it out later. Let's decide that
phs_cblock == InvalidBlockNumber means that the scan is finished,
while phs_cblock == phs_startblock means that we're just starting.
We then don't need phs_firstpass at all, and can write:

    SpinLockAcquire(&parallel_scan->phs_mutex);
    page = parallel_scan->phs_cblock;
    if (page != InvalidBlockNumber)
    {
        parallel_scan->phs_cblock++;
        if (parallel_scan->phs_cblock >= scan->rs_nblocks)
            parallel_scan->phs_cblock = 0;
        if (parallel_scan->phs_cblock == parallel_scan->phs_startblock)
        {
            parallel_scan->phs_cblock = InvalidBlockNumber;
            report_scan_done = true;
        }
    }
    SpinLockRelease(&parallel_scan->phs_mutex);

At this point, if page contains InvalidBlockNumber, then the scan is
done, and if it contains anything else, that's the next page that the
current process should scan. If report_scan_done is true, we are the
first to observe that the scan is done and should call
ss_report_location().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
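To see that the suggested protocol hands out every page exactly once
and terminates, here is a minimal standalone model of it - a
wrap-around page counter shared by the participants, with a plain
pthread mutex standing in for the spinlock. The type names, the block
count, and the start position are invented stand-ins for illustration
(compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define INVALID_BLOCK ((unsigned) -1)
    #define NBLOCKS 10          /* pretend the relation has 10 pages */

    /* Stand-in for ParallelHeapScanDesc. */
    static struct
    {
        pthread_mutex_t mutex;  /* stands in for phs_mutex */
        unsigned startblock;    /* phs_startblock */
        unsigned cblock;        /* phs_cblock */
    } pscan = {PTHREAD_MUTEX_INITIALIZER, 3, 3};  /* sync scan began at page 3 */

    /* Next page to scan, or INVALID_BLOCK when the scan is done. */
    static unsigned
    next_page(int *report_scan_done)
    {
        unsigned page;

        *report_scan_done = 0;
        pthread_mutex_lock(&pscan.mutex);
        page = pscan.cblock;
        if (page != INVALID_BLOCK)
        {
            pscan.cblock++;
            if (pscan.cblock >= NBLOCKS)
                pscan.cblock = 0;             /* wrap around */
            if (pscan.cblock == pscan.startblock)
            {
                pscan.cblock = INVALID_BLOCK; /* full circle: finished */
                *report_scan_done = 1;        /* first to notice reports */
            }
        }
        pthread_mutex_unlock(&pscan.mutex);
        return page;
    }

    int
    main(void)
    {
        unsigned page;
        int done;

        /* pages come out as 3..9, 0, 1, 2 - each exactly once */
        while ((page = next_page(&done)) != INVALID_BLOCK)
            printf("scan page %u%s\n", page, done ? " (last)" : "");
        return 0;
    }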
>
> On Tue, Sep 29, 2015 at 12:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Attached patch is a rebased patch based on latest commit (d1b7c1ff)
> > for Gather node.
> >
> > - I have to reorganize the defines in execParallel.h and .c. To keep
> > ParallelExecutorInfo, in GatherState node, we need to include execParallel.h
> > in execnodes.h which was creating compilation issues as execParallel.h
> > also includes execnodes.h, so for now I have defined ParallelExecutorInfo
> > in execnodes.h and instrumentation related structures in instrument.h.
> > - Renamed parallel_seqscan_degree to degree_of_parallelism
> > - Rename Funnel to Gather
> > - Removed PARAM_EXEC parameter handling code, I think we can do this
> > separately.
> >
> > I have to work more on partial seq scan patch for rebasing it and handling
> > review comments for the same, so for now I am sending the first part of
> > patch (which included Gather node functionality and some general support
> > for parallel-query execution).
>
> Thanks for the fast rebase.
>
> This patch needs a bunch of cleanup:
>
> - The formatting for the GatherState node's comment block is unlike
> that of surrounding comment blocks. It lacks the ------- dividers,
> and the indentation is not the same. Also, it refers to
> ParallelExecutorInfo by the type name, but the other members by
> structure member name. The convention is to refer to them by
> structure member name, so please do that.
>
> - The naming of fs_workersReady is inconsistent with the other
> structure members. The other members use all lower-case names,
> separating words with underscores, but this one uses a capital letter. The
> other members also don't prefix the names with anything, but this uses
> a "fs_" prefix which I assume is left over from when this was called
> FunnelState. Finally, this doesn't actually tell you when workers are
> ready, just whether they were launched. I suggest we rename this to
> "any_worker_launched".
>
> - Instead of moving the declaration of ParallelExecutorInfo, please
> just refer to it as "struct ParallelExecutorInfo" in execnodes.h.
> That way, you're not sucking these includes into all kinds of places
> they don't really need to be.
>
> - Let's not create a new PARALLEL_QUERY category of GUC. Instead,
> let's put the GUC for the number of workers under resource usage ->
> asynchronous behavior.
>
> - I don't believe that shm_toc *toc has any business being part of a
> generic PlanState node. At most, it should be part of an individual
> type of PlanState, like a GatherState or PartialSeqScanState. But
> really, I don't see why we need it there at all.
> - I'd like to do some renaming of the new GUCs. I suggest we rename
> cpu_tuple_comm_cost to parallel_tuple_cost and degree_of_parallelism
> to max_parallel_degree.
>
> - I think that a Gather node should inherit from Plan, not Scan. A
> Gather node really shouldn't have a scanrelid. Now, admittedly, if
> the only thing under the Gather is a Partial Seq Scan, it wouldn't be
> totally bonkers to think of the Gather as scanning the same relation
> that the Partial Seq Scan is scanning. But in any more complex case,
> like where it's scanning a join, you're out of luck. You'll have to
> set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
> is not going to work. In fact, as far as I can see, the only way
> nodeGather.c is actually using any of the generic scan stuff is by
> calling ExecInitScanTupleSlot, which is all of one line of code.
> ExecEndGather fetches node->ss.ss_currentRelation but then does
> nothing with it. So I think this is just a holdover from an early
> version of this patch where what's now Gather and PartialSeqScan were
> a single node, and I think we should rip it out.
>
> - On a related note, the assertions in cost_gather() are both bogus
> and should be removed. Similarly with create_gather_plan(). As
> previously mentioned, the Gather node should not care what sort of
> thing is under it; I am not interested in restricting it to baserels
> and then undoing that later.
>
> - For the same reason, cost_gather() should refer to its third
> argument as "rel", not "baserel".
>
> - Also, I think this stuff about physical tlists in
> create_gather_plan() is bogus. use_physical_tlist is ignorant of the
> possibility that the RelOptInfo passed to it might be anything other
> than a baserel, and I think it won't be happy if it gets a joinrel.
> Moreover, I think our plan here is that, at least for now, the
> Gather's tlist will always match the tlist of its child. If that's
> so, there's no point to this: it will end up with the same tlist
> either way. If any projection is needed, it should be done by the
> Gather node's child, not the Gather node itself.
>
> - Let's rename DestroyParallelSetupAndAccumStats to
> ExecShutdownGather. Instead of encasing the entire function in an if
> statement, let's start with: if (node->pei == NULL || node->pei->pcxt
> == NULL) return.
>
> - ExecParallelBufferUsageAccum should be declared to take an argument
> of type PlanState, not Node. Then you don't have to cast what you are
> passing to it, and it doesn't have to cast before calling itself. And,
> let's also rename it to ExecShutdownNode and move it to
> execProcnode.c. Having a "shutdown phase" that stops a node from
> asynchronously consuming additional resources could be useful for
> non-parallel node types - especially ForeignScan and CustomScan. And
> we could eventually extend this to be called in other situations, like
> when a Limit is filled, to give everything beneath it a chance to ease up.
> We don't have to do those bits of work right now but it seems well
> worth making this look like a generic facility.
>
> - Calling DestroyParallelSetupAndAccumStats from ExplainNode when we
> actually reach the Gather node is much too late. We should really be
> shutting down parallel workers at the end of the ExecutorRun phase, or
> certainly no later than ExecutorFinish. In fact, you have
> standard_ExecutorRun calling ExecParallelBufferUsageAccum() but only
> if queryDesc->totaltime is set. What I think you should do instead is
> call ExecShutdownNode a few lines earlier, before shutting down the
> tuple receiver, and do so unconditionally. That way, the workers are
> always shut down in the ExecutorRun phase, which should eliminate the
> need for this bit in explain.c.
>
> - The changes to postmaster.c and postgres.c consist of only
> additional #includes. Those can, presumably, be reverted.
>
Attachment
On Wed, Sep 30, 2015 at 11:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> - I don't believe that shm_toc *toc has any business being part of a
>> generic PlanState node. At most, it should be part of an individual
>> type of PlanState, like a GatherState or PartialSeqScanState. But
>> really, I don't see why we need it there at all.
>
> We need it for getting the parallel heap scan descriptor in the case
> of the partial sequence scan node; it doesn't seem like a good idea
> to retrieve it at the beginning, as we need to dig into the plan tree
> to get the node_id for getting the value of the parallel heap scan
> descriptor from the toc.
>
> Now, I think we can surely keep it in PartialSeqScanState or any
> other node state which might need it later, but I felt this is quite
> generic and we might need to fetch node-specific information from the
> toc going forward.

It's true that the PartialSeqScanState will need a way to get at the
toc, but I don't think that means we should stash it in the PlanState.
I've taken that part out for now.

>> - I think that a Gather node should inherit from Plan, not Scan. A
>> Gather node really shouldn't have a scanrelid. Now, admittedly, if
>> the only thing under the Gather is a Partial Seq Scan, it wouldn't be
>> totally bonkers to think of the Gather as scanning the same relation
>> that the Partial Seq Scan is scanning. But in any more complex case,
>> like where it's scanning a join, you're out of luck. You'll have to
>> set scanrelid == 0, I suppose, but then, for example, ExecScanReScan
>> is not going to work. In fact, as far as I can see, the only way
>> nodeGather.c is actually using any of the generic scan stuff is by
>> calling ExecInitScanTupleSlot, which is all of one line of code.
>> ExecEndGather fetches node->ss.ss_currentRelation but then does
>> nothing with it. So I think this is just a holdover from an early
>> version of this patch where what's now Gather and PartialSeqScan were
>> a single node, and I think we should rip it out.
>
> Makes sense, and I think GatherState should also inherit from
> PlanState instead of ScanState, which I have changed in the attached
> patch.

You missed a number of things while doing this - I cleaned them up.

>> - Also, I think this stuff about physical tlists in
>> create_gather_plan() is bogus. use_physical_tlist is ignorant of the
>> possibility that the RelOptInfo passed to it might be anything other
>> than a baserel, and I think it won't be happy if it gets a joinrel.
>> Moreover, I think our plan here is that, at least for now, the
>> Gather's tlist will always match the tlist of its child. If that's
>> so, there's no point to this: it will end up with the same tlist
>> either way. If any projection is needed, it should be done by the
>> Gather node's child, not the Gather node itself.
>
> Yes, the Gather node itself doesn't need to do projection, but it
> needs the projection info to store the same in the slot after
> fetching the tuple from the tuple queue. Now this is not required for
> the Gather node itself, but it might be required for any node on top
> of the Gather node.
>
> Here, I think one thing we could do is use the subplan's target list,
> as is currently being done for quals. The only risk is what if a
> Gating node is added on top of the partialseqscan (subplan), but I
> have checked that that is safe, because the Gating plan uses the same
> target list as its child. Also I don't think we need to process any
> quals at the Gather node, so I will make that Null; I will do this
> change in the next version unless you see any problem with it.
>
> Yet another idea is that during set_plan_refs(), we can assign the
> left child's target list to the parent in the case of a Gather node
> (right now it's done in the reverse way, which needs to be changed.)
>
> What is your preference?

I made it work like other nodes that inherit their left child's target
list. I made a few other changes as well:

- I wrote documentation for the GUCs. This probably needs to be
expanded once we get the whole feature in, but it's something.

- I added a new single_copy option to the Gather. A single-copy Gather
never tries to execute the plan itself, unless it can't get any
workers. This is very handy for testing, since it lets you stick a
Gather node on top of an arbitrary plan and, if everything's working,
it should work just as if the Gather node weren't there. I did a bit
of minor fiddling with the contents of the GatherState to make this
work. It's also useful in real life, since somebody can stick a
single-copy Gather node into a plan someplace and run everything below
that in a worker.

- I fixed a bug in ExecGather - you were testing whether
node->pei->pcxt is NULL, which seg faults on the first time through.
The correct thing is to test node->pei.

- Assorted cosmetic changes.

- I again left out the early-executor-stop stuff, preferring to leave
that for a separate commit.

That done, I have committed this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,

The Gather node was overlooked by readfunc.c, even though it should
never actually be serialized. Also, outfuncs.c used an incompatible
WRITE_xxx_FIELD() macro for it.

The attached patch fixes both inconsistencies.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Wednesday, September 30, 2015 2:19 AM
> To: Amit Kapila
> Cc: Kaigai Kouhei(海外 浩平); Haribabu Kommi; Gavin Flower; Jeff Davis;
> Andres Freund; Amit Langote; Amit Langote; Fabrízio Mello; Thom Brown;
> Stephen Frost; pgsql-hackers
> Subject: Re: [HACKERS] Parallel Seq Scan
>
> [snip]
Attachment
>
> On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ parallel_seqscan_partialseqscan_v18.patch ]
>
> I spent a bit of time reviewing the heapam.c changes in this patch
> this evening, and I think that your attempt to add support for
> synchronized scans has some problems.
>
Thanks for the review and I agree with all the suggestions provided
by you. Fixed all of them in attached patch
(parallel_seqscan_heapscan_v1.patch).
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Oct 1, 2015 at 2:35 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> The Gather node was overlooked by readfunc.c, even though it should
> never actually be serialized. Also, outfuncs.c used an incompatible
> WRITE_xxx_FIELD() macro for it.
>
> The attached patch fixes both inconsistencies.

Thanks. You missed READ_DONE(), but fortunately my compiler noticed
that oversight. Committed with that fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 1, 2015 at 7:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Sep 30, 2015 at 7:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Sep 24, 2015 at 2:31 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > [ parallel_seqscan_partialseqscan_v18.patch ]
>>
>> I spent a bit of time reviewing the heapam.c changes in this patch
>> this evening, and I think that your attempt to add support for
>> synchronized scans has some problems.
>
> Thanks for the review and I agree with all the suggestions provided
> by you. Fixed all of them in attached patch
> (parallel_seqscan_heapscan_v1.patch).

Thanks.

Does heap_parallelscan_nextpage really need a pscan_finished output
parameter, or can it just return InvalidBlockNumber to indicate end of
scan? I think the latter can be done and would be cleaner.

There doesn't seem to be anything that ensures that everybody who's
scanning the relation agrees on whether we're doing a synchronized
scan. I think that heap_parallelscan_initialize() should take an
additional Boolean argument, allow_sync. It should decide whether to
actually perform a syncscan using the logic from initscan(), and then
it should store a phs_syncscan flag into the ParallelHeapScanDesc.
heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
regardless of anything else.

I think heap_parallel_rescan() is an unworkable API. When rescanning
a synchronized scan, the existing logic keeps the original start-block
so that the scan's results are reproducible, but no longer reports the
scan position since we're presumably out of step with other backends.
This isn't going to work at all with this API. I don't think you can
swap out the ParallelHeapScanDesc for another one once you've
installed it; the point of a rescan is that you are letting the
HeapScanDesc (or ParallelHeapScanDesc) hold onto some state from the
first time, and that doesn't work at all here. So, I think this
function should just go away, and callers should be able to just use
heap_rescan().

Now this presents a bit of a problem for PartialSeqScan, because, on a
rescan, nodeGather.c completely destroys the DSM and creates a new
one. I think we're going to need to change that. I think we can adapt
the parallel context machinery so that after
WaitForParallelWorkersToFinish(), you can again call
LaunchParallelWorkers(). (That might already work, but I wouldn't be
surprised if it doesn't quite work.) This would make rescans somewhat
more efficient because we wouldn't have to destroy and re-create the
DSM each time. It means that the DSM would have to stick around until
we're totally done with the query, rather than going away when
ExecGather() returns the last tuple, but that doesn't sound too bad.
We can still clean up the workers when we've returned all the tuples,
which I think is the most important thing.

This is obviously going to present some design complications for the
as-yet-uncommitted code to push down PARAM_EXEC parameters, because if
the new value takes more bytes to store than the old value, there
won't be room to update the existing DSM in place. There are several
possible solutions to that problem, but the one that appeals to me
most right now is: just don't generate plans that would require that
feature. It doesn't seem particularly appealing to me to put a Gather
node on the inner side of a NestLoop - at least not until we can
execute that without restarting workers, which we're certainly some
distance from today. And maybe not even then.

For initPlans, the existing code might be adequate, because I think we
never re-evaluate those. And for subPlans, we can potentially handle
cases where each worker can evaluate the subPlan separately below the
Gather; we just can't handle cases where the subPlan attaches above
the Gather and is used below it. Or, we can get around these
limitations by redesigning the PARAM_EXEC pushdown mechanism in some
way. But even if we don't, it's not crippling.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
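For illustration, a minimal standalone sketch of the rescan flow being
proposed - keep one shared-memory segment alive for the whole query
and relaunch workers per rescan instead of rebuilding everything. The
functions here are invented stand-ins modeling the parallel-context
calls, not the real PostgreSQL API:

    #include <stdio.h>

    /* Stand-ins for a parallel context and its lifecycle. */
    typedef struct ParallelCtx { int dsm_alive; int workers; } ParallelCtx;

    static void create_dsm(ParallelCtx *c)            { c->dsm_alive = 1; }
    static void launch_workers(ParallelCtx *c, int n) { c->workers = n; }
    static void wait_for_finish(ParallelCtx *c)       { c->workers = 0; }
    static void destroy_dsm(ParallelCtx *c)           { c->dsm_alive = 0; }

    int
    main(void)
    {
        ParallelCtx ctx = {0, 0};

        create_dsm(&ctx);           /* once, at first execution */

        for (int rescan = 0; rescan < 3; rescan++)
        {
            /*
             * Per (re)scan: reset shared scan state in place, relaunch
             * workers against the same DSM, then wait for them.  No
             * destroy/re-create cycle per rescan.
             */
            launch_workers(&ctx, 2);
            printf("rescan %d: %d workers, dsm_alive=%d\n",
                   rescan, ctx.workers, ctx.dsm_alive);
            wait_for_finish(&ctx);
        }

        destroy_dsm(&ctx);          /* only when the query is done */
        return 0;
    }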
> On Thu, Oct 1, 2015 at 2:35 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > The Gather node was overlooked by readfunc.c, even though it should
> > never actually be serialized. Also, outfuncs.c used an incompatible
> > WRITE_xxx_FIELD() macro for it.
> >
> > The attached patch fixes both inconsistencies.
>
> Thanks. You missed READ_DONE(), but fortunately my compiler noticed
> that oversight. Committed with that fix.
>
I could find one other strangeness, in explain.c:

    case T_Gather:
        {
            Gather *gather = (Gather *) plan;

            show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
            if (plan->qual)
                show_instrumentation_count("Rows Removed by Filter", 1,
                                           planstate, es);
            ExplainPropertyInteger("Number of Workers",
                                   gather->num_workers, es);
            if (gather->single_copy)
                ExplainPropertyText("Single Copy",
                                    gather->single_copy ? "true" : "false",
                                    es);
        }
        break;

What is the intention of the last if-check? The single_copy flag is
already checked in the argument of ExplainPropertyText().

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Fri, Oct 2, 2015 at 4:27 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> Thanks. You missed READ_DONE(), but fortunately my compiler noticed
>> that oversight. Committed with that fix.
>
> I could find one other strangeness, in explain.c:
>
>     case T_Gather:
>         {
>             Gather *gather = (Gather *) plan;
>
>             show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
>             if (plan->qual)
>                 show_instrumentation_count("Rows Removed by Filter", 1,
>                                            planstate, es);
>             ExplainPropertyInteger("Number of Workers",
>                                    gather->num_workers, es);
>             if (gather->single_copy)
>                 ExplainPropertyText("Single Copy",
>                                     gather->single_copy ? "true" : "false",
>                                     es);
>         }
>         break;
>
> What is the intention of the last if-check? The single_copy flag is
> already checked in the argument of ExplainPropertyText().

Oops, that was dumb. single_copy only makes sense if num_workers == 1,
so I intended the if-test to be based on num_workers, not single_copy.

Not sure if we should just make that change now or if there's a better
way to display it. I'm sort of tempted to try to come up with a
shorthand that only uses one line in text mode - e.g. set pname to
something like "Gather 3" if there are 3 workers, "Gather 1" if there
is one worker, or "Gather Single" if there is one worker and we're in
single_copy mode. In non-text mode, of course, the properties should
be displayed separately, for machine parse-ability. But maybe I'm
getting ahead of myself and we should just change it to if
(gather->num_workers == 1) for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
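For illustration, a standalone sketch of the corrected test - print
the Single Copy property only when there is exactly one worker. The
Gather struct and the output routine below are toy stand-ins for the
planner node and EXPLAIN machinery, not the real explain.c code:

    #include <stdio.h>

    /* Toy stand-ins for the plan node and EXPLAIN property output. */
    typedef struct Gather { int num_workers; int single_copy; } Gather;

    static void
    explain_property_text(const char *name, const char *value)
    {
        printf("%s: %s\n", name, value);
    }

    static void
    explain_gather(const Gather *gather)
    {
        printf("Number of Workers: %d\n", gather->num_workers);
        /* single_copy is only meaningful with exactly one worker */
        if (gather->num_workers == 1)
            explain_property_text("Single Copy",
                                  gather->single_copy ? "true" : "false");
    }

    int
    main(void)
    {
        Gather g1 = {1, 1}, g3 = {3, 0};

        explain_gather(&g1);   /* prints the Single Copy line */
        explain_gather(&g3);   /* suppresses it */
        return 0;
    }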
>
>
> Does heap_parallelscan_nextpage really need a pscan_finished output
> parameter, or can it just return InvalidBlockNumber to indicate end of
> scan? I think the latter can be done and would be cleaner.
>
> There doesn't seem to be anything that ensures that everybody who's
> scanning the relation agrees on whether we're doing a synchronized
> scan. I think that heap_parallelscan_initialize() should take an
> additional Boolean argument, allow_sync. It should decide whether to
> actually perform a syncscan using the logic from initscan(), and then
> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
> regardless of anything else.
>
> I think heap_parallel_rescan() is an unworkable API. When rescanning
> a synchronized scan, the existing logic keeps the original start-block
> so that the scan's results are reproducible, but no longer reports the
> scan position since we're presumably out of step with other backends.
> This isn't going to work at all with this API. I don't think you can
> swap out the ParallelHeapScanDesc for another one once you've
> installed it;
> This is obviously going to present some design complications for the
> as-yet-uncommitted code to push down PARAM_EXEC parameters, because if
> the new value takes more bytes to store than the old value, there
> won't be room to update the existing DSM in place.
On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Does heap_parallelscan_nextpage really need a pscan_finished output
>> parameter, or can it just return InvalidBlockNumber to indicate end of
>> scan? I think the latter can be done and would be cleaner.
>
> I think we can do that way as well; however, we have to keep the check
> of page == InvalidBlockNumber at all the callers to indicate the
> finish of the scan, which made me write the function as it exists in
> the patch. However, I don't mind changing it the way you have
> suggested if you feel that would be cleaner.

I think it would. I mean, you just end up testing the other thing instead.

>> I think that heap_parallelscan_initialize() should take an
>> additional Boolean argument, allow_sync. It should decide whether to
>> actually perform a syncscan using the logic from initscan(), and then
>> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
>> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
>> regardless of anything else.
>
> I think this will ensure that any future change in this area won't
> break the guarantee for sync scans for parallel workers; is that the
> reason you prefer to implement this function in the way suggested by
> you?

Basically. It seems pretty fragile the way you have it now.

>> I think heap_parallel_rescan() is an unworkable API. When rescanning
>> a synchronized scan, the existing logic keeps the original start-block
>> so that the scan's results are reproducible,
>
> It seems from the code comments in initscan that the reason for
> keeping the previous startblock is to allow rewinding the cursor,
> which doesn't hold for a parallel scan. We might or might not want to
> support such cases with parallel query, but even if we want to, there
> is a way we can do it with the current logic (as mentioned in one of
> my replies below).

You don't need to rewind a cursor; you just need to restart the scan.
So for example if we were on the inner side of a NestLoop, this would
be a real issue.

>> but no longer reports the
>> scan position since we're presumably out of step with other backends.
>
> Is it true for all forms of rescans, or are you talking about some
> case (like SampleScan) in particular? As per my reading of the code
> (and verified by debugging that code path), it doesn't seem to be true
> for rescan in the case of a seqscan.

I think it is:

    if (keep_startblock)
    {
        /*
         * When rescanning, we want to keep the previous startblock
         * setting, so that rewinding a cursor doesn't generate
         * surprising results.  Reset the active syncscan setting,
         * though.
         */
        scan->rs_syncscan = (allow_sync && synchronize_seqscans);
    }

>> This isn't going to work at all with this API. I don't think you can
>> swap out the ParallelHeapScanDesc for another one once you've
>> installed it;
>
> Sure, but if we really need some such parameters like the startblock
> position, then we can preserve those in the ScanDesc.

Sure, we could transfer the information out of the
ParallelHeapScanDesc and then transfer it back into the new one, but I
have a hard time thinking that's a good design.

> PARAM_EXEC parameters will be used to support initPlan in parallel
> query (as it is done in the initial patch), so it seems to me this is
> the main downside of this approach. I think rather than trying to
> come up with various possible solutions for this problem, let's
> prohibit sync scans with parallel query if you see some problem with
> the suggestions made by me above. Do you see any main use case
> getting hit due to non-support of sync scans with parallel seq scan?

Yes. Synchronized scans are particularly important with large tables,
and those are the kind you're likely to want to use a parallel
sequential scan on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
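For illustration, a standalone sketch of the initialization contract
Robert describes - the leader decides once whether the scan is
synchronized, records that in the shared descriptor, and every
participant then derives its local flag from the shared one, nothing
else. All names and the NBuffers/4 threshold shape are stand-ins
modeled on initscan(), not the actual patch:

    #include <stdio.h>
    #include <stdbool.h>

    #define INVALID_BLOCK ((unsigned) -1)

    /* Stand-in for ParallelHeapScanDesc (lives in shared memory). */
    typedef struct PScanDesc
    {
        bool     phs_syncscan;    /* decided once by the leader */
        unsigned phs_startblock;
    } PScanDesc;

    /* Stand-in for a backend-local scan descriptor. */
    typedef struct ScanDesc
    {
        bool rs_syncscan;
    } ScanDesc;

    /* Leader: decide sync-scan policy once, store it in shared state. */
    static void
    parallelscan_initialize(PScanDesc *pscan, unsigned nblocks,
                            unsigned nbuffers, bool allow_sync)
    {
        /* same shape as initscan()'s heuristic: big tables only */
        pscan->phs_syncscan = allow_sync && nblocks > nbuffers / 4;
        pscan->phs_startblock = INVALID_BLOCK;  /* filled in lazily */
    }

    /* Every participant: local flag comes from shared state only. */
    static void
    beginscan(ScanDesc *scan, const PScanDesc *pscan)
    {
        scan->rs_syncscan = pscan->phs_syncscan;
    }

    int
    main(void)
    {
        PScanDesc pscan;
        ScanDesc leader, worker;

        parallelscan_initialize(&pscan, 100000, 16384, true);
        beginscan(&leader, &pscan);
        beginscan(&worker, &pscan);
        printf("leader sync=%d worker sync=%d\n",
               leader.rs_syncscan, worker.rs_syncscan);
        return 0;
    }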
>
> On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> Does heap_parallelscan_nextpage really need a pscan_finished output
> >> parameter, or can it just return InvalidBlockNumber to indicate end of
> >> scan? I think the latter can be done and would be cleaner.
> >
> > I think we can do that way as well, however we have to keep the check
> > of page == InvalidBlockNumber at all the callers to indicate finish
> > of scan which made me write the function as it exists in patch. However,
> > I don't mind changing it the way you have suggested if you feel that would
> > be cleaner.
>
> I think it would. I mean, you just end up testing the other thing instead.
>
> >> I think that heap_parallelscan_initialize() should take an
> >> additional Boolean argument, allow_sync. It should decide whether to
> >> actually perform a syncscan using the logic from initscan(), and then
> >> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
> >> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
> >> regardless of anything else.
> >
> > I think this will ensure that any future change in this area won't break the
> > guarantee for sync scans for parallel workers, is that the reason you
> > prefer to implement this function in the way suggested by you?
>
> Basically. It seems pretty fragile the way you have it now.
>
> >> I think heap_parallel_rescan() is an unworkable API. When rescanning
> >> a synchronized scan, the existing logic keeps the original start-block
> >> so that the scan's results are reproducible,
> >
> > It seems from the code comments in initscan, the reason for keeping
> > previous startblock is to allow rewinding the cursor which doesn't hold for
> > parallel scan. We might or might not want to support such cases with
> > parallel query, but even if we want to there is a way we can do with
> > current logic (as mentioned in one of my replies below).
>
> You don't need to rewind a cursor; you just need to restart the scan.
> So for example if we were on the inner side of a NestLoop, this would
> be a real issue.
>
> >> but no longer reports the
> >> scan position since we're presumably out of step with other backends.
> >
> > Is it true for all forms of rescans, or are you talking about some
> > case (like SampleScan) in particular? As per my reading of code
> > (and verified by debugging that code path), it doesn't seem to be true
> > for rescan in case of seqscan.
>
> I think it is:
>
> if (keep_startblock)
> {
> /*
> * When rescanning, we want to keep the previous startblock setting,
> * so that rewinding a cursor doesn't generate surprising results.
> * Reset the active syncscan setting, though.
> */
> scan->rs_syncscan = (allow_sync && synchronize_seqscans);
> }
>
> >> This isn't going to work at all with this API. I don't think you can
> >> swap out the ParallelHeapScanDesc for another one once you've
> >> installed it;
> >>
> >
> > Sure, but if we really need some such parameters like startblock position,
> > then we can preserve those in ScanDesc.
>
> Sure, we could transfer the information out of the
> ParallelHeapScanDesc and then transfer it back into the new one, but I
> have a hard time thinking that's a good design.
>
>
> On Sat, Oct 3, 2015 at 11:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Oct 2, 2015 at 11:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Thu, Oct 1, 2015 at 7:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >> Does heap_parallelscan_nextpage really need a pscan_finished output
> > >> parameter, or can it just return InvalidBlockNumber to indicate end of
> > >> scan? I think the latter can be done and would be cleaner.
> > >
> > > I think we can do that way as well, however we have to keep the check
> > > of page == InvalidBlockNumber at all the callers to indicate finish
> > > of scan which made me write the function as it exists in patch. However,
> > > I don't mind changing it the way you have suggested if you feel that would
> > > be cleaner.
> >
> > I think it would. I mean, you just end up testing the other thing instead.
> >
>
> No issues, will change in next version of patch.
>
> > >> I think that heap_parallelscan_initialize() should take an
> > >> additional Boolean argument, allow_sync. It should decide whether to
> > >> actually perform a syncscan using the logic from initscan(), and then
> > >> it should store a phs_syncscan flag into the ParallelHeapScanDesc.
> > >> heap_beginscan_internal should set rs_syncscan based on phs_syncscan,
> > >> regardless of anything else.
> > >
> > > I think this will ensure that any future change in this area won't break the
> > > guarantee for sync scans for parallel workers, is that the reason you
> > > prefer to implement this function in the way suggested by you?
> >
> > Basically. It seems pretty fragile the way you have it now.
> >
>
> Okay, in that case I will change it as per your suggestion.
>
> > >> I think heap_parallel_rescan() is an unworkable API. When rescanning
> > >> a synchronized scan, the existing logic keeps the original start-block
> > >> so that the scan's results are reproducible,
> > >
> > > It seems from the code comments in initscan, the reason for keeping
> > > previous startblock is to allow rewinding the cursor which doesn't hold for
> > > parallel scan. We might or might not want to support such cases with
> > > parallel query, but even if we want to there is a way we can do with
> > > current logic (as mentioned in one of my replies below).
> >
> > You don't need to rewind a cursor; you just need to restart the scan.
> > So for example if we were on the inner side of a NestLoop, this would
> > be a real issue.
> >
>
> Sorry, but I am not able to see the exact issue you have in mind for NestLoop,
> if we don't preserve the start block position for parallel scan.
Attachment
On Sat, Sep 26, 2015 at 04:09:12PM -0400, Robert Haas wrote:
> +/*-------------------------------------------------------------------------
> + * datumSerialize
> + *
> + * Serialize a possibly-NULL datum into caller-provided storage.
> + */
> +void
> +datumSerialize(Datum value, bool isnull, bool typByVal, int typLen,
> +               char **start_address)
> +{
> +    int header;
> +
> +    /* Write header word. */
> +    if (isnull)
> +        header = -2;
> +    else if (typByVal)
> +        header = -1;
> +    else
> +        header = datumGetSize(value, typByVal, typLen);
> +    memcpy(*start_address, &header, sizeof(int));
> +    *start_address += sizeof(int);
> +
> +    /* If not null, write payload bytes. */
> +    if (!isnull)
> +    {
> +        if (typByVal)
> +        {
> +            memcpy(*start_address, &value, sizeof(Datum));
> +            *start_address += sizeof(Datum);
> +        }
> +        else
> +        {
> +            memcpy(*start_address, DatumGetPointer(value), header);
> +            *start_address += header;
> +        }
> +    }
> +}

I see no mention in this thread of varatt_indirect, but I anticipated
datumSerialize() reacting to it the same way datumCopy() reacts. If
datumSerialize() can get away without doing so, why is that?
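The concern here is that datumCopy() dereferences indirect
(varatt_indirect) datums before copying, so a serializer would
plausibly need to do the same before measuring and writing the
payload - otherwise it would serialize a pointer that is meaningless
in another process. Here is a minimal standalone model of that
precaution, with toy stand-ins for the varlena machinery (the actual
fix may differ):

    #include <string.h>
    #include <stdio.h>

    /*
     * Toy model: a "datum" is either a plain buffer or an indirect
     * pointer to one, tagged explicitly (stand-in for varatt_indirect).
     */
    typedef struct Var
    {
        int   is_indirect;
        int   len;
        const char *data;       /* payload, or target if indirect */
    } Var;

    /* Stand-in for "detoast": resolve indirection before use. */
    static const Var *
    flatten(const Var *v)
    {
        while (v->is_indirect)
            v = (const Var *) v->data;
        return v;
    }

    static void
    serialize(const Var *v, char *out, int *outlen)
    {
        /* The key step: never serialize the pointer itself. */
        v = flatten(v);
        memcpy(out, &v->len, sizeof(int));
        memcpy(out + sizeof(int), v->data, v->len);
        *outlen = (int) sizeof(int) + v->len;
    }

    int
    main(void)
    {
        Var  plain = {0, 5, "hello"};
        Var  indirect = {1, 0, (const char *) &plain};
        char buf[64];
        int  n;

        serialize(&indirect, buf, &n);
        printf("wrote %d bytes: %.5s\n", n, buf + sizeof(int));
        return 0;
    }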
On Mon, Oct 5, 2015 at 11:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> For now, I have fixed this by not preserving the startblock in case
> of a rescan for a parallel scan. Note that I have created a separate
> patch (parallel_seqscan_heaprescan_v1.patch) for support of rescan
> (for parallel scan).

While testing parallel seqscan, my colleague Jing Wang found a problem
in parallel_seqscan_heapscan_v2.patch.

In the function initscan, the allow_sync flag is set to false when the
number of pages in the table is less than NBuffers/4:

    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > NBuffers / 4)

As the allow_sync flag is false, the function
heap_parallelscan_initialize_startblock is not called in initscan to
initialize the parallel_scan->phs_cblock parameter. Because of this,
while getting the next page in heap_parallelscan_nextpage, it returns
InvalidBlockNumber, thus ending the scan without returning any results.

Regards,
Hari Babu
Fujitsu Australia
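For illustration, a sketch of the shape of fix this report implies -
initialize the shared scan position unconditionally, and consult the
sync-scan machinery only for the choice of start block. This is a
standalone model with invented names, not the actual follow-up patch:

    #include <stdio.h>
    #include <stdbool.h>

    #define INVALID_BLOCK ((unsigned) -1)

    typedef struct PScanDesc
    {
        bool     phs_syncscan;
        unsigned phs_startblock;
        unsigned phs_cblock;
    } PScanDesc;

    /* Stand-in for ss_get_location(): pick up a position hint. */
    static unsigned
    sync_scan_hint(unsigned nblocks)
    {
        return nblocks / 2;  /* pretend another scan is halfway through */
    }

    static void
    initialize_startblock(PScanDesc *pscan, unsigned nblocks)
    {
        /*
         * Choose the start block from the sync-scan hint only when
         * synchronization is in use; otherwise start at block 0.
         * Either way, phs_cblock must be initialized, or the very
         * first next-page call would see INVALID_BLOCK and end the
         * scan early - the bug reported above.
         */
        pscan->phs_startblock = pscan->phs_syncscan
            ? sync_scan_hint(nblocks) : 0;
        pscan->phs_cblock = pscan->phs_startblock;
    }

    int
    main(void)
    {
        PScanDesc small = {false, INVALID_BLOCK, INVALID_BLOCK};

        initialize_startblock(&small, 100);  /* small table, no sync scan */
        printf("start=%u cblock=%u\n",
               small.phs_startblock, small.phs_cblock);
        return 0;
    }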
>
> On Mon, Oct 5, 2015 at 11:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > For now, I have fixed this by not preserving the startblock incase of rescan
> > for parallel scan. Note that, I have created a separate patch
> > (parallel_seqscan_heaprescan_v1.patch) for support of rescan (for parallel
> > scan).
>
> While testing parallel seqscan, my colleague Jing Wang found a problem
> in parallel_seqscan_heapscan_v2.patch.
>
> In the function initscan, the allow_sync flag is set to false when the
> number of pages in the table is less than NBuffers/4:
>
> if (!RelationUsesLocalBuffers(scan->rs_rd) &&
> scan->rs_nblocks > NBuffers / 4)
>
> As the allow_sync flag is false, the function
> heap_parallelscan_initialize_startblock is not
> called in the initscan function to initialize the
> parallel_scan->phs_cblock parameter. Because
> of this, while getting the next page in
> heap_parallelscan_nextpage, it returns
> InvalidBlockNumber, thus ending the scan without returning the results.
>
Attachment
On Sun, Oct 11, 2015 at 7:56 PM, Noah Misch <noah@leadboat.com> wrote:
> I see no mention in this thread of varatt_indirect, but I anticipated
> datumSerialize() reacting to it the same way datumCopy() reacts. If
> datumSerialize() can get away without doing so, why is that?

Good point. I don't think it can. Attached is a patch to fix that.
This patch also includes some somewhat-related changes to
plpgsql_param_fetch() upon which I would appreciate any input you can
provide.

plpgsql_param_fetch() assumes that it can detect whether it's being
called from copyParamList() by checking whether params !=
estate->paramLI. I don't know why this works, but I do know that this
test fails to detect the case where it's being called from
SerializeParamList(), which causes failures in exec_eval_datum() as
predicted. Calls from SerializeParamList() need the same treatment as
calls from copyParamList() because it, too, will try to evaluate every
parameter in the list.

Here, I've taken the approach of making that check unconditional, which
seems to work, but I'm not sure if some other approach would be better,
such as adding an additional Boolean (or enum context?) argument to
ParamFetchHook. I *think* that skipping this check is merely a
performance optimization rather than anything that affects correctness,
and bms_is_member() is pretty cheap, so perhaps the way I've done it is
OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
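The attached fix isn't reproduced in the archive, but the shape of the change
Noah's question points at can be sketched: before writing payload bytes for a
pass-by-reference varlena, an indirect TOAST pointer must be dereferenced,
because the in-memory pointer it hides is meaningless in another process. A
minimal sketch, which may differ from the attached patch in detail:

    /*
     * Sketch: chase an indirect TOAST pointer before measuring and copying
     * the payload.  The serialized bytes cross a process boundary, so the
     * bare pointer inside a VARATT_IS_EXTERNAL_INDIRECT datum cannot be
     * copied verbatim the way datumCopy() copies ordinary varlenas.
     */
    if (!isnull && !typByVal && typLen == -1 &&
        VARATT_IS_EXTERNAL_INDIRECT(DatumGetPointer(value)))
    {
        struct varatt_indirect redirect;

        VARATT_EXTERNAL_GET_POINTER(redirect, DatumGetPointer(value));
        value = PointerGetDatum(redirect.pointer);
    }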
>
>
> Right, it should initialize the parallel scan properly even for non-synchronized
> scans. Fixed the issue in the attached patch. The rebased heap rescan patch is
> attached as well.
>
Attachment
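The fix itself is in the attachment; a sketch of the change being described,
using the field and function names from this patch series (so treat them as
provisional), is to seed the shared scan position in initscan() whether or
not synchronized scanning is in use:

    /*
     * Sketch: initialize the shared parallel-scan position unconditionally,
     * not only when allow_sync came out true.
     */
    if (scan->rs_parallel != NULL)
    {
        if (scan->rs_syncscan)
            scan->rs_parallel->phs_cblock =
                ss_get_location(scan->rs_rd, scan->rs_nblocks);
        else
            scan->rs_parallel->phs_cblock = 0;  /* always start at block 0 */
    }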
On Tue, Oct 13, 2015 at 2:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Attached is rebased patch for partial seqscan support.

Review comments:

- If you're going to pgindent execParallel.c, you need to add some
entries to typedefs.list so it doesn't mangle the formatting.
ExecParallelEstimate's parameter list is misformatted, for example.
Also, I think if we're going to do this we had better extract the
pgindent changes and commit those first. It's pretty distracting the
way you have it.

- Instead of inlining the work needed by each parallel mode in
ExecParallelEstimate(), I think you should mimic the style of
ExecProcNode and call a node-type-specific function that is part of
that node's public interface - here, ExecPartialSeqScanEstimate,
perhaps. Similarly for ExecParallelInitializeDSM. Perhaps
ExecPartialSeqScanInitializeDSM.

- I continue to think GetParallelShmToc is the wrong approach.
Instead, each time ExecParallelEstimate or ExecParallelInitializeDSM
calls a node-type-specific initialization function (as described in
the previous point), have it pass d->pcxt as an argument. The node
can get the toc from there if it needs it. I suppose it could store a
pointer to the toc in its scanstate if it needs it, but it really
shouldn't. Instead, it should store a pointer to, say, the
ParallelHeapScanDesc in the scanstate. It really should only care
about its own shared data, so once it finds that, the toc shouldn't be
needed any more. Then ExecPartialSeqScan doesn't need to look up
pscan; it's already recorded in the scanstate.

- ExecParallelInitializeDSMContext's new pscan_len member is 100%
wrong. Individual scan nodes don't get to add stuff to that context
object. They should store details like this in their own ScanState as
needed.

- The positioning of the new nodes in various lists doesn't seem to be
entirely consistent. nodes.h adds them after SampleScan, which isn't
terrible, though maybe immediately after SeqScan would be better, but
_outNode has it right after BitmapOr and the switch in _copyObject has
it somewhere else again.

- Although the changes in parallelpaths.c are in a good direction, I'm
pretty sure this is not yet up to scratch. I am less sure exactly
what needs to be fixed, so I'll have to give some more thought to
that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
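The dispatch style suggested above mirrors ExecProcNode: a switch on node
type that hands each parallel-aware node its own entry point. A sketch using
the function names proposed in the review (none of this is committed API):

    /* Sketch: inside the ExecParallelEstimate() planstate walker. */
    switch (nodeTag(planstate))
    {
        case T_PartialSeqScanState:
            ExecPartialSeqScanEstimate((PartialSeqScanState *) planstate,
                                       e->pcxt);
            break;
        default:
            break;
    }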
On Mon, Oct 12, 2015 at 11:46:08AM -0400, Robert Haas wrote:
> plpgsql_param_fetch() assumes that it can detect whether it's being
> called from copyParamList() by checking whether params !=
> estate->paramLI. I don't know why this works, but I do know that this

It works because PL/pgSQL creates an unshared list whenever copyParamList()
is forthcoming. (This in turn relies on intimate knowledge of how the rest
of the system processes param lists.) The comments at setup_param_list()
and setup_unshared_param_list() are most pertinent.

> test fails to detect the case where it's being called from
> SerializeParamList(), which causes failures in exec_eval_datum() as
> predicted. Calls from SerializeParamList() need the same treatment as
> calls from copyParamList() because it, too, will try to evaluate every
> parameter in the list. Here, I've taken the approach of making that
> check unconditional, which seems to work, but I'm not sure if some
> other approach would be better, such as adding an additional Boolean
> (or enum context?) argument to ParamFetchHook. I *think* that
> skipping this check is merely a performance optimization rather than
> anything that affects correctness, and bms_is_member() is pretty
> cheap, so perhaps the way I've done it is OK.

Like you, I don't expect bms_is_member() to be expensive relative to the
task at hand. However, copyParamList() and SerializeParamList() copy
non-dynamic params without calling plpgsql_param_fetch(). Given the shared
param list, they will copy non-dynamic params the current query doesn't
use. That cost is best avoided, not being well-bounded; consider the case
of an unrelated variable containing a TOAST pointer to a 1-GiB value. One
approach is to have setup_param_list() copy the paramnos pointer to a new
ParamListInfoData field:

    Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

Test it directly in copyParamList() and SerializeParamList(). As a bonus,
that would allow use of the shared param list for more cases involving
cursors. Furthermore, plpgsql_param_fetch() would never need to test
paramnos. A more-general alternative is to have a distinct "paramIsUsed"
callback, but I don't know how one would exploit the extra generality.
>
> On Tue, Oct 13, 2015 at 2:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Attached is rebased patch for partial seqscan support.
>
> Review comments:
>
>
> - I continue to think GetParallelShmToc is the wrong approach.
> Instead, each time ExecParallelEstimate or
> ExecParallelInitializeDSM calls a node-type-specific initialization
> function (as described in the previous point), have it pass d->pcxt as
> an argument. The node can get the toc from there if it needs it. I
> suppose it could store a pointer to the toc in its scanstate if it
> needs it, but it really shouldn't. Instead, it should store a pointer
> to, say, the ParallelHeapScanDesc in the scanstate.
>
With Regards,
Amit Kapila.
On Wed, Oct 14, 2015 at 12:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> - I continue to think GetParallelShmToc is the wrong approach.
>> Instead, each time ExecParallelEstimate or
>> ExecParallelInitializeDSM calls a node-type-specific initialization
>> function (as described in the previous point), have it pass d->pcxt as
>> an argument. The node can get the toc from there if it needs it. I
>> suppose it could store a pointer to the toc in its scanstate if it
>> needs it, but it really shouldn't. Instead, it should store a pointer
>> to, say, the ParallelHeapScanDesc in the scanstate.
>
> How will this idea work for the worker backend? Basically, in the worker,
> if we want something like this to work, the toc has to be passed via
> QueryDesc to Estate, and then we can retrieve the ParallelHeapScanDesc
> during PartialSeqScan initialization (ExecInitPartialSeqScan).
> Do you have something else in mind?

Good question. I think when the worker starts up it should call yet
another planstate-walker, e.g. ExecParallelInitializeWorker, which can
call nodetype-specific functions for parallel-aware nodes and give each
of them a chance to access the toc and store a pointer to their parallel
shared state (ParallelHeapScanDesc in this case) in their scanstate. I
think this should get called from ParallelQueryMain after ExecutorStart
and before ExecutorRun: ExecParallelInitializeWorker(queryDesc->planstate,
toc).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
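A sketch of what such a walker might look like, assuming a generic
planstate_tree_walker() and the node-specific entry point named above (the
names are proposals from this discussion, not committed API):

    static bool
    ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
    {
        if (planstate == NULL)
            return false;

        /* Let parallel-aware nodes look up their shared state in the toc. */
        switch (nodeTag(planstate))
        {
            case T_PartialSeqScanState:
                ExecPartialSeqScanInitializeWorker(
                    (PartialSeqScanState *) planstate, toc);
                break;
            default:
                break;
        }

        return planstate_tree_walker(planstate,
                                     ExecParallelInitializeWorker, toc);
    }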
On Tue, Oct 13, 2015 at 9:08 PM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Oct 12, 2015 at 11:46:08AM -0400, Robert Haas wrote:
>> plpgsql_param_fetch() assumes that it can detect whether it's being
>> called from copyParamList() by checking whether params !=
>> estate->paramLI. I don't know why this works, but I do know that this
>
> It works because PL/pgSQL creates an unshared list whenever
> copyParamList() is forthcoming. (This in turn relies on intimate
> knowledge of how the rest of the system processes param lists.) The
> comments at setup_param_list() and setup_unshared_param_list() are most
> pertinent.

Thanks for the pointer.

>> test fails to detect the case where it's being called from
>> SerializeParamList(), which causes failures in exec_eval_datum() as
>> predicted. Calls from SerializeParamList() need the same treatment as
>> calls from copyParamList() because it, too, will try to evaluate every
>> parameter in the list. Here, I've taken the approach of making that
>> check unconditional, which seems to work, but I'm not sure if some
>> other approach would be better, such as adding an additional Boolean
>> (or enum context?) argument to ParamFetchHook. I *think* that
>> skipping this check is merely a performance optimization rather than
>> anything that affects correctness, and bms_is_member() is pretty
>> cheap, so perhaps the way I've done it is OK.
>
> Like you, I don't expect bms_is_member() to be expensive relative to the
> task at hand. However, copyParamList() and SerializeParamList() copy
> non-dynamic params without calling plpgsql_param_fetch(). Given the
> shared param list, they will copy non-dynamic params the current query
> doesn't use. That cost is best avoided, not being well-bounded; consider
> the case of an unrelated variable containing a TOAST pointer to a 1-GiB
> value. One approach is to have setup_param_list() copy the paramnos
> pointer to a new ParamListInfoData field:
>
> Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */
>
> Test it directly in copyParamList() and SerializeParamList(). As a
> bonus, that would allow use of the shared param list for more cases
> involving cursors. Furthermore, plpgsql_param_fetch() would never need
> to test paramnos. A more-general alternative is to have a distinct
> "paramIsUsed" callback, but I don't know how one would exploit the extra
> generality.

I'm anxious to minimize the number of things that must be fixed in order
for a stable version of parallel query to exist in our master repository,
and I fear that trying to improve ParamListInfo generally could take me
fairly far afield. How about adding a paramListCopyHook to
ParamListInfoData? SerializeParamList() would, if it found a parameter
with !OidIsValid(prm->prmtype) && param->paramFetch != NULL, call this
function, which would return a new ParamListInfo to be serialized in
place of the original? This wouldn't require any modification to the
current plpgsql_param_fetch() at all, but the new function would steal
its bms_is_member() test. Furthermore, no user of ParamListInfo other
than plpgsql needs to care at all -- which, with your proposals, they
would.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 14, 2015 at 07:52:15PM -0400, Robert Haas wrote:
> On Tue, Oct 13, 2015 at 9:08 PM, Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Oct 12, 2015 at 11:46:08AM -0400, Robert Haas wrote:
> >> Calls from SerializeParamList() need the same treatment as
> >> calls from copyParamList() because it, too, will try to evaluate every
> >> parameter in the list.
> > Like you, I don't expect bms_is_member() to be expensive relative to the task
> > at hand. However, copyParamList() and SerializeParamList() copy non-dynamic
> > params without calling plpgsql_param_fetch(). Given the shared param list,
> > they will copy non-dynamic params the current query doesn't use. That cost is
> > best avoided, not being well-bounded; consider the case of an unrelated
> > variable containing a TOAST pointer to a 1-GiB value. One approach is to have
> > setup_param_list() copy the paramnos pointer to a new ParamListInfoData field:
> >
> > Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */
> >
> > Test it directly in copyParamList() and SerializeParamList(). As a bonus,
> > that would allow use of the shared param list for more cases involving
> > cursors. Furthermore, plpgsql_param_fetch() would never need to test
> > paramnos. A more-general alternative is to have a distinct "paramIsUsed"
> > callback, but I don't know how one would exploit the extra generality.
>
> I'm anxious to minimize the number of things that must be fixed in
> order for a stable version of parallel query to exist in our master
> repository, and I fear that trying to improve ParamListInfo generally
> could take me fairly far afield. How about adding a paramListCopyHook
> to ParamListInfoData? SerializeParamList() would, if it found a
> parameter with !OidIsValid(prm->prmtype) && param->paramFetch != NULL,
> call this function, which would return a new ParamListInfo to be
> serialized in place of the original?

Tests of prm->prmtype and paramLI->paramFetch appear superfluous. Given
that the paramListCopyHook callback would return a complete substitute
ParamListInfo, I wouldn't expect SerializeParamList() to examine the
original paramLI->params at all. If that's correct, the paramListCopyHook
design sounds fine. However, its implementation will be more complex than
paramMask would have been.

> This wouldn't require any
> modification to the current plpgsql_param_fetch() at all, but the new
> function would steal its bms_is_member() test. Furthermore, no user
> of ParamListInfo other than plpgsql needs to care at all -- which,
> with your proposals, they would.

To my knowledge, none of these approaches would compel existing users to
care. They would leave paramMask or paramListCopyHook NULL and get
today's behavior.
>
> On Tue, Oct 13, 2015 at 2:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Attached is rebased patch for partial seqscan support.
>
> Review comments:
>
> - If you're going to pgindent execParallel.c, you need to add some
> entries to typedefs.list so it doesn't mangle the formatting.
> ExecParallelEstimate's parameter list is misformatted, for example.
> Also, I think if we're going to do this we had better extract the
> pgindent changes and commit those first. It's pretty distracting the
> way you have it.
>
> - Instead of inlining the work needed by each parallel mode in
> ExecParallelEstimate(), I think you should mimic the style of
> ExecProcNode and call a node-type specific function that is part of
> that node's public interface - here, ExecPartialSeqScanEstimate,
> perhaps. Similarly for ExecParallelInitializeDSM. Perhaps
> ExecPartialSeqScanInitializeDSM.
>
> - I continue to think GetParallelShmToc is the wrong approach.
> Instead, each time ExecParallelEstimate or
> ExecParallelInitializeDSM calls a node-type-specific initialization
> function (as described in the previous point), have it pass d->pcxt as
> an argument. The node can get the toc from there if it needs it. I
> suppose it could store a pointer to the toc in its scanstate if it
> needs it, but it really shouldn't. Instead, it should store a pointer
> to, say, the ParallelHeapScanDesc in the scanstate. It really should
> only care about its own shared data, so once it finds that, the toc
> shouldn't be needed any more. Then ExecPartialSeqScan doesn't need to
> look up pscan; it's already recorded in the scanstate.
>
> - ExecParallelInitializeDSMContext's new pscan_len member is 100%
> wrong. Individual scan nodes don't get to add stuff to that context
> object. They should store details like this in their own ScanState as
> needed.
>
> - The positioning of the new nodes in various lists doesn't seem to be
> entirely consistent. nodes.h adds them after SampleScan, which isn't
> terrible, though maybe immediately after SeqScan would be better, but
> _outNode has it right after BitmapOr and the switch in _copyObject has
> it somewhere else again.
>
Attachment
>
> On Sun, Oct 11, 2015 at 7:56 PM, Noah Misch <noah@leadboat.com> wrote:
> > I see no mention in this thread of varatt_indirect, but I anticipated
> > datumSerialize() reacting to it the same way datumCopy() reacts. If
> > datumSerialize() can get away without doing so, why is that?
>
> Good point. I don't think it can. Attached is a patch to fix that.
> This patch also includes some somewhat-related changes to
> plpgsql_param_fetch() upon which I would appreciate any input you can
> provide.
>
> plpgsql_param_fetch() assumes that it can detect whether it's being
> called from copyParamList() by checking whether params !=
> estate->paramLI. I don't know why this works, but I do know that this
> test fails to detect the case where it's being called from
> SerializeParamList(), which causes failures in exec_eval_datum() as
> predicted. Calls from SerializeParamList() need the same treatment as
> calls from copyParamList() because it, too, will try to evaluate every
> parameter in the list.
>
On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this got messed up while rebasing on top of Gather node
> changes, but nonetheless, I have changed it such that PartialSeqScan
> node handling is after SeqScan.
Currently, the explain analyze of parallel seq scan plan is not showing
the allocated number of workers including the planned workers. I feel this
information is good for users in understanding the performance difference
that is coming with parallel seq scan. It may have been missed in the
recent patch series. It was discussed in [1].

Currently there is no qualification evaluation at Result and Gather nodes;
because of this, any query that contains a parallel-restricted function is
not chosen for parallel scan. As a result, there is currently no
difference between parallel-restricted and parallel-unsafe functions.
[1] http://www.postgresql.org/message-id/CA+TgmobhQ0_+YObMLbJexvt4QEf6XbLfUdaX1OwL-ivgaN5qxw@mail.gmail.com
Regards,
Hari Babu
Fujitsu Australia
>
>
>
> On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I think this got messed up while rebasing on top of Gather node
> > changes, but nonetheless, I have changed it such that PartialSeqScan
> > node handling is after SeqScan.
>
> Currently, the explain analyze of parallel seq scan plan is not showing the allocated number of workers
> including the planned workers. I feel this information is good for users in understanding the performance
> difference that is coming with parallel seq scan. It may have been missed in the recent patch series. It
> was discussed in [1].
>
> Currently there is no qualification evaluation at Result and Gather nodes; because of this, any
> query that contains a parallel-restricted function is not chosen for parallel scan. As a result,
> there is currently no difference between parallel-restricted and parallel-unsafe functions.
On Thu, Oct 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Oct 11, 2015 at 7:56 PM, Noah Misch <noah@leadboat.com> wrote:
>> > I see no mention in this thread of varatt_indirect, but I anticipated
>> > datumSerialize() reacting to it the same way datumCopy() reacts. If
>> > datumSerialize() can get away without doing so, why is that?
>>
>> Good point. I don't think it can. Attached is a patch to fix that.
>> This patch also includes some somewhat-related changes to
>> plpgsql_param_fetch() upon which I would appreciate any input you can
>> provide.
>>
>> plpgsql_param_fetch() assumes that it can detect whether it's being
>> called from copyParamList() by checking whether params !=
>> estate->paramLI. I don't know why this works, but I do know that this
>> test fails to detect the case where it's being called from
>> SerializeParamList(), which causes failures in exec_eval_datum() as
>> predicted. Calls from SerializeParamList() need the same treatment as
>> calls from copyParamList() because it, too, will try to evaluate every
>> parameter in the list.
>
> From what I understood by looking at code in this area, I think the check
> params != estate->paramLI and code under it is required for parameters
> that are set up by setup_unshared_param_list(). Now unshared params
> are only created for cursors and expressions that are passing a R/W
> object pointer; for cursors we explicitly prohibit the parallel plan
> generation, and I am not sure if it makes sense to generate parallel
> plans for expressions involving a R/W object pointer. If we don't
> generate a parallel plan where expressions involve such parameters, then
> SerializeParamList() should not be affected by the check mentioned by
> you. Is it by any chance happening because you are testing by forcing a
> Gather node on top of all kinds of plans?

Yeah, but I think the scenario is legitimate. When a query gets run from
within PL/pgsql, parallelism is an option, at least as we have the code
today. So if a Gather were present, and the query used a parameter, then
you could have this issue. For example:

SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;

So this can happen, I think, even with parallel sequential scan only,
even if the Gather node is not otherwise used.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 15, 2015 at 1:51 AM, Noah Misch <noah@leadboat.com> wrote:
> Tests of prm->prmtype and paramLI->paramFetch appear superfluous. Given
> that the paramListCopyHook callback would return a complete substitute
> ParamListInfo, I wouldn't expect SerializeParamList() to examine the
> original paramLI->params at all. If that's correct, the
> paramListCopyHook design sounds fine. However, its implementation will
> be more complex than paramMask would have been.

Well, I think there are two use cases we care about. If the
ParamListInfo came from Bind parameters sent via a protocol message, then
it will neither have a copy method nor require one. If it came from some
source that plays fancy games, like PL/pgsql, then it needs a safe way to
copy the list.

>> This wouldn't require any
>> modification to the current plpgsql_param_fetch() at all, but the new
>> function would steal its bms_is_member() test. Furthermore, no user
>> of ParamListInfo other than plpgsql needs to care at all -- which,
>> with your proposals, they would.
>
> To my knowledge, none of these approaches would compel existing users to
> care. They would leave paramMask or paramListCopyHook NULL and get
> today's behavior.

Well, looking at this proposal:

Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */

I read that to imply that every consumer of ParamListInfo objects would
need to account for the possibility of getting one with a non-NULL
paramMask. Would it work to define this as "if non-NULL, params lacking
a 1-bit may be safely ignored"? Or some other tweak that basically says
that you don't need to care about this, but you can if you want to.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 15, 2015 at 12:05:53PM -0400, Robert Haas wrote:
> On Thu, Oct 15, 2015 at 1:51 AM, Noah Misch <noah@leadboat.com> wrote:
> >> This wouldn't require any
> >> modification to the current plpgsql_param_fetch() at all, but the new
> >> function would steal its bms_is_member() test. Furthermore, no user
> >> of ParamListInfo other than plpgsql needs to care at all -- which,
> >> with your proposals, they would.
> >
> > To my knowledge, none of these approaches would compel existing users
> > to care. They would leave paramMask or paramListCopyHook NULL and get
> > today's behavior.
>
> Well, looking at this proposal:
>
> Bitmapset *paramMask; /* if non-NULL, ignore params lacking a 1-bit */
>
> I read that to imply that every consumer of ParamListInfo objects
> would need to account for the possibility of getting one with a
> non-NULL paramMask.

Agreed. More specifically, I had in mind for copyParamList() to check the
mask while e.g. ExecEvalParamExtern() would either check nothing or merely
assert that any mask included the requested parameter. It would be tricky
to verify that as safe, so ...

> Would it work to define this as "if non-NULL,
> params lacking a 1-bit may be safely ignored"? Or some other tweak
> that basically says that you don't need to care about this, but you
> can if you want to.

... this is a better specification.
On Thu, Oct 15, 2015 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>> On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com>
>> > wrote:
>> > I think this got messed up while rebasing on top of Gather node
>> > changes, but nonetheless, I have changed it such that PartialSeqScan
>> > node handling is after SeqScan.
>>
>> Currently, the explain analyze of parallel seq scan plan is not showing
>> the allocated number of workers including the planned workers. I feel
>> this information is good for users in understanding the performance
>> difference that is coming with parallel seq scan. It may have been
>> missed in the recent patch series. It was discussed in [1].
>
> I am aware of that and purposefully kept it for a consecutive patch.
> There are other things as well which I have left out from this patch
> and those are:
> a. Early stop of executor for Rescan purpose
> b. Support of pushdown for plans containing InitPlan and SubPlans
>
> Then there is more related work like
> a. Support for prepared statements

OK.

During the test with the latest patch, I found a deadlock between worker
and backend on a relation lock. To minimize the test scenario, I changed
the number of pages required to start one worker to 1 and all parallel
cost parameters to zero.

The backend is waiting for the tuples from the workers; the workers are
waiting on a lock on the relation. Attached is the sql script that can
reproduce this issue.

Regards,
Hari Babu
Fujitsu Australia
Attachment
On Fri, Oct 16, 2015 at 2:10 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Thu, Oct 15, 2015 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
>> wrote:
>>> On Thu, Oct 15, 2015 at 6:32 PM, Amit Kapila <amit.kapila16@gmail.com>
>>> wrote:
>>> > On Wed, Oct 14, 2015 at 3:29 AM, Robert Haas <robertmhaas@gmail.com>
>>> > wrote:
>>> > I think this got messed up while rebasing on top of Gather node
>>> > changes, but nonetheless, I have changed it such that PartialSeqScan
>>> > node handling is after SeqScan.
>>>
>>> Currently, the explain analyze of parallel seq scan plan is not showing
>>> the allocated number of workers including the planned workers. I feel
>>> this information is good for users in understanding the performance
>>> difference that is coming with parallel seq scan. It may have been
>>> missed in the recent patch series. It was discussed in [1].
>>
>> I am aware of that and purposefully kept it for a consecutive patch.
>> There are other things as well which I have left out from this patch
>> and those are:
>> a. Early stop of executor for Rescan purpose
>> b. Support of pushdown for plans containing InitPlan and SubPlans
>>
>> Then there is more related work like
>> a. Support for prepared statements
>
> OK.
>
> During the test with the latest patch, I found a deadlock between worker
> and backend on a relation lock. To minimize the test scenario, I changed
> the number of pages required to start one worker to 1 and all parallel
> cost parameters to zero.
>
> The backend is waiting for the tuples from the workers; the workers are
> waiting on a lock on the relation. Attached is the sql script that can
> reproduce this issue.

Some more tests that failed in similar configuration settings:

1. A table created under a begin statement is not visible in the worker.
2. A permission problem on the worker side for the set role command.

Regards,
Hari Babu
Fujitsu Australia
Attachment
On Mon, Oct 5, 2015 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [ new patch for heapam.c changes ]

I went over this in a fair amount of detail and reworked it somewhat.
The result is attached as parallel-heapscan-revised.patch. I think the
way I did this is a bit cleaner than what you had, although it's
basically the same thing. There are fewer changes to initscan, and we
don't need one function to initialize the starting block that must be
called in each worker and then another one to get the next block, and
generally the changes are a bit more localized. I also went over the
comments and, I think, improved them. I tweaked the logic for reporting
the starting scan position as the last position report; I think the way
you had it the last report would be for one block earlier. I'm pretty
happy with this version and hope to commit it soon.

There's a second patch attached here as well, parallel-relaunch.patch,
which makes it possible to relaunch workers with the same parallel
context. Currently, after you WaitForParallelWorkersToFinish(), you must
proceed without fail to DestroyParallelContext(). With this rather
simple patch, you have the option to instead go back and again
LaunchParallelWorkers(), which is nice because it avoids the overhead of
setting up a new DSM and filling it with all of your transaction state a
second time. I'd like to commit this as well, and I think we should
revise execParallel.c to use it.

Finally, I've attached some test code in parallel-dummy.patch. This
demonstrates how the code in 0001 and 0002 can be used. It scans a
relation, counts the tuples, and then gratuitously rescans it and counts
the tuples again. This shows that rescanning works and that the
syncscan position gets left in the right place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
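The calling pattern the relaunch patch enables looks roughly like this, with
count_tuples_worker standing in for a caller-supplied entry point (only the
second LaunchParallelWorkers() call is new behavior; the rest is the existing
parallel-context API):

    /* Sketch: reusing one parallel context across a rescan. */
    EnterParallelMode();
    pcxt = CreateParallelContext(count_tuples_worker, nworkers);
    InitializeParallelDSM(pcxt);

    LaunchParallelWorkers(pcxt);
    WaitForParallelWorkersToFinish(pcxt);

    /*
     * With the patch, we may go back and launch again instead of
     * proceeding straight to DestroyParallelContext().
     */
    LaunchParallelWorkers(pcxt);
    WaitForParallelWorkersToFinish(pcxt);

    DestroyParallelContext(pcxt);
    ExitParallelMode();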
>
> On Thu, Oct 15, 2015 at 11:45 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Oct 15, 2015 at 5:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> > wrote:
> >>
> >
> > I am aware of that and purposefully kept it for a consecutive patch.
> > There are other things as well which I have left out from this patch
> > and those are:
> > a. Early stop of executor for Rescan purpose
> > b. Support of pushdown for plans containing InitPlan and SubPlans
> >
> > Then there is more related work like
> > a. Support for prepared statements
> >
>
> OK.
>
> During the test with the latest patch, I found a deadlock between worker
> and backend
> on a relation lock. To minimize the test scenario, I changed the number
> of pages required
> to start one worker to 1 and all parallel cost parameters to zero.
>
> The backend is waiting for the tuples from the workers; the workers are
> waiting on a lock on the relation.
>
> On Thu, Oct 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > From what I understood by looking at code in this area, I think the check
> > params != estate->paramLI and code under it is required for parameters
> > that are set up by setup_unshared_param_list(). Now unshared params
> > are only created for cursors and expressions that are passing a R/W
> > object pointer; for cursors we explicitly prohibit the parallel plan
> > generation, and I am not sure if it makes sense to generate parallel
> > plans for expressions involving a R/W object pointer. If we don't
> > generate a parallel plan where expressions involve such parameters,
> > then SerializeParamList() should not be affected by the check
> > mentioned by you. Is it by any chance happening because you are
> > testing by forcing a Gather node on top of all kinds of plans?
>
> Yeah, but I think the scenario is legitimate. When a query gets run
> from within PL/pgsql, parallelism is an option, at least as we have
> the code today. So if a Gather were present, and the query used a
> parameter, then you could have this issue. For example:
>
> SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;
>
>
> On Mon, Oct 5, 2015 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patch for heapam.c changes ]
>
> I went over this in a fair amount of detail and reworked it somewhat.
> The result is attached as parallel-heapscan-revised.patch. I think
> the way I did this is a bit cleaner than what you had, although it's
> basically the same thing. There are fewer changes to initscan, and we
> don't need one function to initialize the starting block that must be
> called in each worker and then another one to get the next block, and
> generally the changes are a bit more localized. I also went over the
> comments and, I think, improved them. I tweaked the logic for
> reporting the starting scan position as the last position report; I
> think the way you had it the last report would be for one block
> earlier. I'm pretty happy with this version and hope to commit it
> soon.
>
On Fri, Oct 16, 2015 at 7:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think due to above changes it will report sync location on each page
> scan, don't we want to report it once at end of scan?

I think reporting for each page is correct. Isn't that what the
non-parallel case does?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
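For reference, the serial path does report once per page; heapgettup() does
this roughly along the following lines, so the parallel scan matching that
behavior is consistent:

    /* Report our new scan position for synchronization purposes. */
    if (scan->rs_syncscan)
        ss_report_location(scan->rs_rd, page);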
On Thu, Oct 15, 2015 at 11:38 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Some more tests that failed in similar configuration settings:
> 1. A table created under a begin statement is not visible in the worker.
> 2. A permission problem on the worker side for the set role command.

The second problem, too, I have already posted a bug fix for, on a thread
which also contains a whole bunch of other bug fixes. I'll get those
committed today so you can avoid wasting time finding bugs I've already
found and fixed. Thanks for testing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Fri, Oct 16, 2015 at 7:42 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think due to above changes it will report sync location on each page
> > scan, don't we want to report it once at end of scan?
>
> I think reporting for each page is correct. Isn't that what the
> non-parallel case does?
>
Yes, sorry I got confused.
On Fri, Oct 16, 2015 at 2:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Yeah, but I think the scenario is legitimate. When a query gets run
>> from within PL/pgsql, parallelism is an option, at least as we have
>> the code today. So if a Gather were present, and the query used a
>> parameter, then you could have this issue. For example:
>>
>> SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;
>
> I don't think for such statements the control flow will set up an
> unshared param list. I have tried a couple of such statements [1] and
> found that such parameters are always set up by setup_param_list(). I
> think there are only two possibilities which could lead to setting up
> of unshared params:
>
> 1. Usage of cursors - This is already prohibited for parallel-mode.
> 2. Usage of read-write-param - This only happens for expressions like
> x := array_append(x, foo) (Refer exec_check_rw_parameter()). Read-write
> params are not used for SQL statements, so this also won't be used for
> parallel-mode.
>
> There is a chance that I might be missing some case where unshared
> params will be required for parallel-mode (as of today), but if not then
> I think we can live without the current changes.

*shrug*

The gather-test stuff isn't failing for no reason. Either PL/pgsql
shouldn't be passing CURSOR_OPT_PARALLEL_OK, or having a parallel plan
get generated there should work. There's not a third option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:
> On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > plpgsql_param_fetch() assumes that it can detect whether it's being
> > called from copyParamList() by checking whether params !=
> > estate->paramLI. I don't know why this works, but I do know that this
> > test fails to detect the case where it's being called from
> > SerializeParamList(), which causes failures in exec_eval_datum() as
> > predicted. Calls from SerializeParamList() need the same treatment as
> > calls from copyParamList() because it, too, will try to evaluate every
> > parameter in the list.
>
> From what I understood by looking at code in this area, I think the check
> params != estate->paramLI and code under it is required for parameters
> that are set up by setup_unshared_param_list(). Now unshared params
> are only created for cursors and expressions that are passing a R/W
> object pointer; for cursors we explicitly prohibit the parallel plan
> generation, and I am not sure if it makes sense to generate parallel
> plans for expressions involving a R/W object pointer. If we don't
> generate a parallel plan where expressions involve such parameters, then
> SerializeParamList() should not be affected by the check mentioned by
> you.

The trouble comes from the opposite direction. A
setup_unshared_param_list() list is fine under today's code, but a shared
param list needs more help. To say it another way, parallel queries that
use the shared estate->paramLI need, among other help, the logic now
guarded by "params != estate->paramLI".
>
> On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:
> > On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > plpgsql_param_fetch() assumes that it can detect whether it's being
> > > called from copyParamList() by checking whether params !=
> > > estate->paramLI. I don't know why this works, but I do know that this
> > > test fails to detect the case where it's being called from
> > > SerializeParamList(), which causes failures in exec_eval_datum() as
> > > predicted. Calls from SerializeParamList() need the same treatment as
> > > calls from copyParamList() because it, too, will try to evaluate every
> > > parameter in the list.
> >
> > From what I understood by looking at code in this area, I think the check
> > params != estate->paramLI and code under it is required for parameters
> > that are set up by setup_unshared_param_list(). Now unshared params
> > are only created for cursors and expressions that are passing a R/W
> > object pointer; for cursors we explicitly prohibit the parallel
> > plan generation, and I am not sure if it makes sense to generate
> > parallel plans for expressions involving a R/W object pointer. If we
> > don't generate a parallel plan where expressions involve such
> > parameters, then SerializeParamList() should not be affected by the
> > check mentioned by you.
>
> The trouble comes from the opposite direction. A setup_unshared_param_list()
> list is fine under today's code, but a shared param list needs more help. To
> say it another way, parallel queries that use the shared estate->paramLI need,
> among other help, the logic now guarded by "params != estate->paramLI".
>
> On Fri, Oct 16, 2015 at 2:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> Yeah, but I think the scenario is legitimate. When a query gets run
> >> from within PL/pgsql, parallelism is an option, at least as we have
> >> the code today. So if a Gather were present, and the query used a
> >> parameter, then you could have this issue. For example:
> >>
> >> SELECT * FROM bigtable WHERE unindexed_column = some_plpgsql_variable;
> >>
> >
> > I don't think for such statements the control flow will set up an unshared
> > param list. I have tried a couple of such statements [1] and found that
> > such parameters are always set up by setup_param_list(). I think there
> > are only two possibilities which could lead to setting up of unshared
> > params:
> >
> > 1. Usage of cursors - This is already prohibited for parallel-mode.
> > 2. Usage of read-write-param - This only happens for expressions like
> > x := array_append(x, foo) (Refer exec_check_rw_parameter()). Read-write
> > params are not used for SQL statements. So this also won't be used for
> > parallel-mode
> >
> > There is a chance that I might be missing some case where unshared
> > params will be required for parallel-mode (as of today), but if not then
> > I think we can live without the current changes.
>
> *shrug*
>
> The gather-test stuff isn't failing for no reason. Either PL/pgsql
> shouldn't be passing CURSOR_OPT_PARALLEL_OK, or having a parallel plan
> get generated there should work. There's not a third option.
>
static int
exec_run_select(PLpgSQL_execstate *estate,
                PLpgSQL_expr *expr, long maxtuples, Portal *portalP,
                bool parallelOK)
{
    ParamListInfo paramLI;
    int         rc;

    /*
     * On the first call for this expression generate the plan
     */
    if (expr->plan == NULL)
        exec_prepare_plan(estate, expr, parallelOK ?
                          CURSOR_OPT_PARALLEL_OK : 0);

    /*
     * If a portal was requested, put the query into the portal
     */
    if (portalP != NULL)
    {
        /*
         * Set up short-lived ParamListInfo
         */
        paramLI = setup_unshared_param_list(estate, expr);

        *portalP = SPI_cursor_open_with_paramlist(NULL, expr->plan,
                                                  paramLI,
                                                  estate->readonly_func);

and, in exec_stmt_return_query():

    if (stmt->query != NULL)
    {
        /* static query */
        exec_run_select(estate, stmt->query, 0, &portal, true);
>
> Agreed and on looking at code, I think in below code, if we pass
> parallelOK as true for the cases where Portal is non-null, such a
> problem could happen.
>
> static int
> exec_run_select(PLpgSQL_execstate *estate,
>                 PLpgSQL_expr *expr, long maxtuples, Portal *portalP,
>                 bool parallelOK)
> {
>     ParamListInfo paramLI;
>     int         rc;
>
>     /*
>      * On the first call for this expression generate the plan
>      */
>     if (expr->plan == NULL)
>         exec_prepare_plan(estate, expr, parallelOK ?
>                           CURSOR_OPT_PARALLEL_OK : 0);
>
>     /*
>      * If a portal was requested, put the query into the portal
>      */
>     if (portalP != NULL)
>     {
>         /*
>          * Set up short-lived ParamListInfo
>          */
>         paramLI = setup_unshared_param_list(estate, expr);
>
>         *portalP = SPI_cursor_open_with_paramlist(NULL, expr->plan,
>                                                   paramLI,
>                                                   estate->readonly_func);
>
> and one such case is
>
> exec_stmt_return_query()
> {
> ..
>     if (stmt->query != NULL)
>     {
>         /* static query */
>         exec_run_select(estate, stmt->query, 0, &portal, true);
> ..
> }
>
> In this function we are using a controlled fetch mechanism (with a count
> of 50) to fetch the tuples, which we initially thought of not supporting
> for parallelism, as such a mechanism is not built for parallel workers;
> that is the reason we want to prohibit cases wherever a cursor is
> getting created.
>
On Sat, Oct 17, 2015 at 11:00:57AM +0530, Amit Kapila wrote:
> On Sat, Oct 17, 2015 at 6:15 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:
> > > On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > > plpgsql_param_fetch() assumes that it can detect whether it's being
> > > > called from copyParamList() by checking whether params !=
> > > > estate->paramLI. I don't know why this works, but I do know that this
> > > > test fails to detect the case where it's being called from
> > > > SerializeParamList(), which causes failures in exec_eval_datum() as
> > > > predicted. Calls from SerializeParamList() need the same treatment as
> > > > calls from copyParamList() because it, too, will try to evaluate every
> > > > parameter in the list.
> > >
> > > From what I understood by looking at code in this area, I think the check
> > > params != estate->paramLI and code under it is required for parameters
> > > that are set up by setup_unshared_param_list(). Now unshared params
> > > are only created for cursors and expressions that are passing a R/W
> > > object pointer; for cursors we explicitly prohibit the parallel
> > > plan generation, and I am not sure if it makes sense to generate
> > > parallel plans for expressions involving a R/W object pointer. If we
> > > don't generate a parallel plan where expressions involve such
> > > parameters, then SerializeParamList() should not be affected by the
> > > check mentioned by you.
> >
> > The trouble comes from the opposite direction. A setup_unshared_param_list()
> > list is fine under today's code, but a shared param list needs more help. To
> > say it another way, parallel queries that use the shared estate->paramLI need,
> > among other help, the logic now guarded by "params != estate->paramLI".
>
> Why would a parallel query need such logic? That logic is needed mainly
> for cursors with params, and we don't want to parallelize such cases.

This is not about mixing cursors with parallelism. Cursors get special
treatment because each cursor copies its param list. Parallel query also
copies (more precisely, serializes) its param list. You need certain
logic for every param list subject to being copied. If PostgreSQL had no
concept of cursors, we'd be writing that same logic from scratch for
parallel query.
>
> On Sat, Oct 17, 2015 at 11:00:57AM +0530, Amit Kapila wrote:
> > On Sat, Oct 17, 2015 at 6:15 AM, Noah Misch <noah@leadboat.com> wrote:
> > > On Thu, Oct 15, 2015 at 04:30:01PM +0530, Amit Kapila wrote:
> > > > On Mon, Oct 12, 2015 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > > > plpgsql_param_fetch() assumes that it can detect whether it's being
> > > > > called from copyParamList() by checking whether params !=
> > > > > estate->paramLI. I don't know why this works, but I do know that this
> > > > > test fails to detect the case where it's being called from
> > > > > SerializeParamList(), which causes failures in exec_eval_datum() as
> > > > > predicted. Calls from SerializeParamList() need the same treatment as
> > > > > calls from copyParamList() because it, too, will try to evaluate every
> > > > > parameter in the list.
> > > >
> > > > From what I understood by looking at code in this area, I think the check
> > > > params != estate->paramLI and code under it is required for parameters
> > > > that are set up by setup_unshared_param_list(). Now unshared params
> > > > are only created for cursors and expressions that are passing a R/W
> > > > object pointer; for cursors we explicitly prohibit the parallel
> > > > plan generation, and I am not sure if it makes sense to generate
> > > > parallel plans for expressions involving a R/W object pointer. If we
> > > > don't generate a parallel plan where expressions involve such
> > > > parameters, then SerializeParamList() should not be affected by the
> > > > check mentioned by you.
> > >
> > > The trouble comes from the opposite direction. A setup_unshared_param_list()
> > > list is fine under today's code, but a shared param list needs more help. To
> > > say it another way, parallel queries that use the shared estate->paramLI need,
> > > among other help, the logic now guarded by "params != estate->paramLI".
> > >
> >
> > Why would a parallel query need such logic? That logic is needed mainly
> > for cursors with params, and we don't want to parallelize such cases.
>
> This is not about mixing cursors with parallelism. Cursors get special
> treatment because each cursor copies its param list. Parallel query also
> copies (more precisely, serializes) its param list. You need certain logic
> for every param list subject to being copied.
>
On Sat, Oct 17, 2015 at 2:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I am not denying that fact; the point I wanted to convey here is that
> the logic guarded by "params != estate->paramLI" in plpgsql_param_fetch
> is only needed if cursors are in use, otherwise we won't need it even
> for parallel query.

Well, I think what Noah and I are trying to explain is that this is not
true. The problem is that, even if there are no cursors anywhere in the
picture, there might be some variable in the param list that is not used
by the parallel query but which, if evaluated, leads to an error.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 17, 2015 at 2:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Agreed, and on looking at the code, I think in the code below, if we pass
> parallelOK as true for the cases where Portal is non-null, such a
> problem could happen.
>
> and one such case is
>
> exec_stmt_return_query()
> {
> ..
>     if (stmt->query != NULL)
>     {
>         /* static query */
>         exec_run_select(estate, stmt->query, 0, &portal, true);
> ..
> }
>
> In this function we are using a controlled fetch mechanism (with a count
> of 50) to fetch the tuples, which we initially thought of not supporting
> for parallelism, as such a mechanism is not built for parallel workers;
> that is the reason we want to prohibit cases wherever a cursor is
> getting created.
>
> Do we want to support parallelism for this case on the basis that this
> API will eventually fetch all the tuples by using the controlled fetch
> mechanism?

That was my idea when I made that change, but I think it's not going to
work out well given the way the rest of the code works. Possibly that
should be reverted for now, but maybe only after testing it.

It's worth noting that, as of commit
bfc78d7196eb28cd4e3d6c24f7e607bacecf1129, the consequences of invoking the
executor with a fetch count have been greatly reduced. Previously, the
assumption was that doing that was broken, and if you did it you got to
keep both pieces. But that commit rejiggered things so that your parallel
plan just gets run serially in that case. That might not be great from a
performance perspective, but it beats undefined behavior by a wide
margin.

So I suspect that there are some decisions about where to pass
CURSOR_OPT_PARALLEL_OK that need to be revisited in the light of that
change. I haven't had time to do that yet, but we should do it as soon
as we get time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Sat, Oct 17, 2015 at 2:44 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I am not denying from that fact, the point I wanted to convey here is that
> > the logic guarded by "params != estate->paramLI" in plpgsql_param_fetch
> > is only needed if cursors are in use otherwise we won't need them even
> > for parallel query.
>
> Well, I think what Noah and are trying to explain is that this is not
> true. The problem is that, even if there are no cursors anywhere in
> the picture, there might be some variable in the param list that is
> not used by the parallel query but which, if evaluated, leads to an
> error.
>
I understand what Noah wants to say, it's just that I am not able to see
>
> On Mon, Oct 5, 2015 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ new patch for heapam.c changes ]
>
> There's a second patch attached here as well, parallel-relaunch.patch,
> which makes it possible to relaunch workers with the same parallel
> context. Currently, after you WaitForParallelWorkersToFinish(), you
> must proceed without fail to DestroyParallelContext(). With this
> rather simple patch, you have the option to instead go back and again
> LaunchParallelWorkers(), which is nice because it avoids the overhead
> of setting up a new DSM and filling it with all of your transaction
> state a second time. I'd like to commit this as well, and I think we
> should revise execParallel.c to use it.
>
Attachment
On Tue, Oct 13, 2015 at 5:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> - Although the changes in parallelpaths.c are in a good direction, I'm
> pretty sure this is not yet up to scratch. I am less sure exactly
> what needs to be fixed, so I'll have to give some more thought to
> that.

Please find attached a proposed set of changes that I think are better.
These changes compute a consider_parallel flag for each RelOptInfo, which
is true if it's a non-temporary relation whose baserestrictinfo references
no PARAM_EXEC parameters, sublinks, or parallel-restricted functions.
Actually, I made an effort to set the flag correctly even for baserels
other than plain tables, and for joinrels, though we don't technically
need that stuff until we get to the point of pushing joins beneath Gather
nodes. When we get there, it will be important - any joinrel for which
consider_parallel = false needn't even try to generate parallel paths,
while if consider_parallel = true then we can consider it, if the costing
makes sense.

The advantage of this is that the logic is centralized. If we have
parallel seq scan and also, say, parallel bitmap heap scan, your approach
would require that we duplicate the logic to check for
parallel-restricted functions for each path generation function. By
caching it in the RelOptInfo, we don't have to do that. The function you
wrote to generate parallel paths can just check the flag; if it's false,
return without generating any paths. If it's true, then parallel paths
can be considered.

Ultimately, I think that each RelOptInfo should have a new List * member
containing a list of partial paths for that relation. For a baserel, we
generate a partial path (e.g. Partial Seq Scan). Then, we can consider
turning each partial path into a complete path by pushing a Gather path
on top of it. For a joinrel, we can consider generating a partial hash
join or partial nest loop path by taking an outer partial path and an
ordinary inner path and putting the appropriate path on top. In theory
it would also be correct to generate merge join paths this way, but it's
difficult to believe that such a plan would ever be anything but a
disaster. These can then be used to generate a complete path by putting
a Gather node on top of them, or they can bubble up to the next level of
the join tree in the same way.

However, I think for the first version of this we can keep it simple: if
the consider_parallel flag is set on a relation, consider Gather ->
Partial Seq Scan. If not, forget it.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
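In outline, the flag computation described above might look like the
following for a baserel; get_rel_persistence() and has_parallel_hazard()
stand in here for the temp-table test and for whatever the combined test
for PARAM_EXEC params, sublinks, and parallel-restricted functions ends up
being called, so read both as placeholders rather than the patch's actual
names:

    /*
     * Sketch: decide whether rel can appear below a Gather.  The relation
     * must not be temporary, and its restriction clauses must contain
     * nothing parallel-restricted.
     */
    static void
    set_rel_consider_parallel(PlannerInfo *root, RelOptInfo *rel,
                              RangeTblEntry *rte)
    {
        rel->consider_parallel =
            rte->rtekind == RTE_RELATION &&
            get_rel_persistence(rte->relid) != RELPERSISTENCE_TEMP &&
            !has_parallel_hazard((Node *) rel->baserestrictinfo, false);
    }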
On Thu, Oct 15, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:
> Agreed. More specifically, I had in mind for copyParamList() to check
> the mask while e.g. ExecEvalParamExtern() would either check nothing or
> merely assert that any mask included the requested parameter. It would
> be tricky to verify that as safe, so ...
>
>> Would it work to define this as "if non-NULL,
>> params lacking a 1-bit may be safely ignored"? Or some other tweak
>> that basically says that you don't need to care about this, but you
>> can if you want to.
>
> ... this is a better specification.

Here's an attempt to implement that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
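The attached attempt isn't shown in the archive, but under the agreed
specification the copy-time check reduces to something like this inside
copyParamList()'s per-parameter loop; paramMask is the proposed new field,
and zeroing out a masked slot is one possible convention, not necessarily
the one the patch uses:

    ParamExternData *oprm = &from->params[i];
    ParamExternData *nprm = &retval->params[i];

    /* Ignore parameters we don't need, to save cycles and space. */
    if (from->paramMask != NULL &&
        !bms_is_member(i, from->paramMask))
    {
        nprm->value = (Datum) 0;
        nprm->isnull = true;
        nprm->pflags = 0;
        nprm->ptype = InvalidOid;
        continue;
    }

    /* ...otherwise copy *oprm into *nprm as the function already does. */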
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Oct 15, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:
>>> Would it work to define this as "if non-NULL,
>>> params lacking a 1-bit may be safely ignored"? Or some other tweak
>>> that basically says that you don't need to care about this, but you
>>> can if you want to.
>> ... this is a better specification.

> Here's an attempt to implement that.

BTW, my Salesforce colleagues have been bit^H^H^Hgriping for quite some
time about the performance costs associated with translating between
plpgsql's internal PLpgSQL_datum-array format and the ParamListInfo
representation. Maybe it's time to think about some wholesale redesign
of ParamListInfo? Because TBH this patch doesn't seem like much but a
kluge. It's mostly layering still-another bunch of ad-hoc restrictions
on copyParamList, without removing any one of the kluges we had already.

            regards, tom lane
On Tue, Oct 20, 2015 at 3:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have rebased the partial seq scan patch based on the above committed
> patches. Now for rescanning it reuses the dsm, and to achieve that we
> need to ensure that workers have been completely shut down and then
> reinitialize the dsm. To ensure complete shutdown of workers, the
> current function WaitForParallelWorkersToFinish is not sufficient, as
> that just waits for the last message to be received from the worker
> backend, so I have written a new function WaitForParallelWorkersToDie.
> Also, on receiving the 'X' message in HandleParallelMessage, it just
> frees the worker handle without ensuring that the worker has died, due
> to which it will later be difficult to even find out whether the worker
> has died or not, so I have removed that code from HandleParallelMessage.
> Another change is that after receiving the last tuple in the Gather
> node, it just shuts down the workers without destroying the dsm.

+    /*
+     * We can't finish transaction commit or abort until all of the
+     * workers are dead.  This means, in particular, that we can't respond
+     * to interrupts at this stage.
+     */
+    HOLD_INTERRUPTS();
+    status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
+    RESUME_INTERRUPTS();

These comments are correct when this code is called from
DestroyParallelContext(), but they're flat wrong when called from
ReinitializeParallelDSM(). I suggest moving the comment back to
DestroyParallelContext and following it with this:

HOLD_INTERRUPTS();
WaitForParallelWorkersToDie();
RESUME_INTERRUPTS();

Then ditch the HOLD/RESUME interrupts in WaitForParallelWorkersToDie()
itself.

This hunk is a problem:

         case 'X':            /* Terminate, indicating clean exit */
             {
-                pfree(pcxt->worker[i].bgwhandle);
                 pfree(pcxt->worker[i].error_mqh);
-                pcxt->worker[i].bgwhandle = NULL;
                 pcxt->worker[i].error_mqh = NULL;
                 break;
             }

If you do that on receipt of the 'X' message, then
DestroyParallelContext() might SIGTERM a worker that has supposedly
exited cleanly. That seems bad. I think maybe the solution is to make
DestroyParallelContext() terminate the worker only if
pcxt->worker[i].error_mqh != NULL. So make error_mqh == NULL mean a
clean loss of a worker: either we couldn't register it, or it exited
cleanly. And bgwhandle == NULL would mean it's actually gone.

It makes sense to have ExecShutdownGather and ExecShutdownGatherWorkers,
but couldn't the former call the latter instead of duplicating the code?

I think ReInitialize should be capitalized as Reinitialize throughout.

ExecParallelReInitializeTupleQueues is almost a cut-and-paste duplicate
of ExecParallelSetupTupleQueues. Please refactor this to avoid
duplication - e.g. change ExecParallelSetupTupleQueues(ParallelContext
*pcxt) to take a second argument bool reinit.
ExecParallelReInitializeTupleQueues can just do
ExecParallelSetupTupleQueues(pcxt, true).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
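The suggested convention - error_mqh == NULL meaning the worker was cleanly
lost, bgwhandle == NULL meaning it is actually gone - would make the
termination loop in DestroyParallelContext() look roughly like this (a
sketch of the suggestion, not the committed code):

    /* Sketch: only SIGTERM workers that haven't already exited cleanly. */
    for (i = 0; i < pcxt->nworkers; ++i)
    {
        if (pcxt->worker[i].error_mqh != NULL &&
            pcxt->worker[i].bgwhandle != NULL)
            TerminateBackgroundWorker(pcxt->worker[i].bgwhandle);
    }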
On Fri, Oct 23, 2015 at 12:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, my Salesforce colleagues have been bit^H^H^Hgriping for quite some
> time about the performance costs associated with translating between
> plpgsql's internal PLpgSQL_datum-array format and the ParamListInfo
> representation.  Maybe it's time to think about some wholesale redesign of
> ParamListInfo?  Because TBH this patch doesn't seem like much but a kluge.
> It's mostly layering still-another bunch of ad-hoc restrictions on
> copyParamList, without removing any of the kluges we had already.

I have no objection to some kind of a redesign there, but (1) I don't
think we're going to be better off doing that before getting Partial
Seq Scan committed and (2) I don't think I'm the best-qualified person
to do the work.

With respect to the first point, despite my best efforts, this feature
is going to have bugs, and getting it committed in November without a
ParamListInfo redesign is surely going to be better for the overall
stability of PostgreSQL and the timeliness of our release schedule
than getting it committed in February with such a redesign -- never
mind that this is far from the only redesign into which I could get
sucked.  I want to put in place some narrow fix for this issue so that
I can move forward.  Three alternatives have been proposed so far:
(1) this, (2) the fix I coded and posted previously, which made
plpgsql_param_fetch's bms_is_member test unconditional, and (3) not
allowing PL/pgsql to run parallel queries.  (3) sounds worse to me
than either (1) or (2); I defer to others on which of (1) and (2) is
preferable, or perhaps you have another proposal.

On the second point, I really don't know enough about the problems
with ParamListInfo to know what would be better, so I can't really
help there.  If you do and want to redesign it, fine, but I really
need whatever you replace it with to have an easy way of serializing
and restoring it - be it nodeToString() and stringToNode(),
SerializeParamList and RestoreParamList, or whatever.  Without that,
parallel query is going to have to be disabled for any query involving
parameters, and that would be, uh, extremely sad.

Also, FWIW, in my opinion, it would be far more useful to PostgreSQL
for you to finish the work on upper planner path-ification ... an
awful lot of people are waiting for that to be completed to start
their own work, or are doing work that may have to be completely
redone when that lands.  YMMV, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Tue, Oct 13, 2015 at 5:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > - Although the changes in parallelpaths.c are in a good direction, I'm
> > pretty sure this is not yet up to scratch. I am less sure exactly
> > what needs to be fixed, so I'll have to give some more thought to
> > that.
>
> Please find attached a proposed set of changes that I think are
> better. These changes compute a consider_parallel flag for each
> RelOptInfo, which is true if it's a non-temporary relation whose
> baserestrictinfo references no PARAM_EXEC parameters, sublinks, or
> parallel-restricted functions. Actually, I made an effort to set the
> flag correctly even for baserels other than plain tables, and for
> joinrels, though we don't technically need that stuff until we get to
> the point of pushing joins beneath Gather nodes. When we get there,
> it will be important - any joinrel for which consider_parallel = false
> needn't even try to generate parallel paths, while if
> consider_parallel = true then we can consider it, if the costing makes
> sense.
>
> The advantage of this is that the logic is centralized. If we have
> parallel seq scan and also, say, parallel bitmap heap scan, your
> approach would require that we duplicate the logic to check for
> parallel-restricted functions for each path generation function.
>
> +   /*
> +    * We can't finish transaction commit or abort until all of the
> +    * workers are dead.  This means, in particular, that we can't respond
> +    * to interrupts at this stage.
> +    */
> +   HOLD_INTERRUPTS();
> +   status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
> +   RESUME_INTERRUPTS();
>
> These comments are correct when this code is called from
> DestroyParallelContext(), but they're flat wrong when called from
> ReinitializeParallelDSM(). I suggest moving the comment back to
> DestroyParallelContext and following it with this:
>
> HOLD_INTERRUPTS();
> WaitForParallelWorkersToDie();
> RESUME_INTERRUPTS();
>
> Then ditch the HOLD/RESUME interrupts in WaitForParallelWorkersToDie() itself.
>
> This hunk is a problem:
>
>             case 'X':           /* Terminate, indicating clean exit */
>                 {
> -                   pfree(pcxt->worker[i].bgwhandle);
>                     pfree(pcxt->worker[i].error_mqh);
> -                   pcxt->worker[i].bgwhandle = NULL;
>                     pcxt->worker[i].error_mqh = NULL;
>                     break;
>                 }
>
> If you do that on receipt of the 'X' message, then
> DestroyParallelContext() might SIGTERM a worker that has supposedly
> exited cleanly. That seems bad. I think maybe the solution is to
> make DestroyParallelContext() terminate the worker only if
> pcxt->worker[i].error_mqh != NULL.  So make error_mqh == NULL mean a
> clean loss of a worker: either we couldn't register it, or it exited
> cleanly. And bgwhandle == NULL would mean it's actually gone.
>
> It makes sense to have ExecShutdownGather and
> ExecShutdownGatherWorkers, but couldn't the former call the latter
> instead of duplicating the code?
>
> I think ReInitialize should be capitalized as Reinitialize throughout.
>
> ExecParallelReInitializeTupleQueues is almost a cut-and-paste
> duplicate of ExecParallelSetupTupleQueues. Please refactor this to
> avoid duplication - e.g. change
> ExecParallelSetupTupleQueues(ParallelContext *pcxt) to take a second
> argument bool reinit. ExecParallelReInitializeTupleQueues can just do
> ExecParallelSetupTupleQueues(pcxt, true).
>
Changed as per suggestion.
Attachment
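As a reference point for the refactoring requested above, the merged
function could take roughly the following shape.  This is a sketch of the
suggestion rather than the patch itself; the key name and the shm_toc calls
are modeled on execParallel.c conventions and should be treated as
assumptions.

static shm_mq_handle **
ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinit)
{
	shm_mq_handle **responseq;
	char	   *tqueuespace;
	int			i;

	/* Skip this if no workers. */
	if (pcxt->nworkers == 0)
		return NULL;

	/* Allocate memory for shared memory queue handles. */
	responseq = (shm_mq_handle **)
		palloc(pcxt->nworkers * sizeof(shm_mq_handle *));

	/* Allocate DSM space on first call; on reinit, reuse what is there. */
	if (!reinit)
		tqueuespace = shm_toc_allocate(pcxt->toc,
									   PARALLEL_TUPLE_QUEUE_SIZE * pcxt->nworkers);
	else
		tqueuespace = shm_toc_lookup(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE);

	/* Create the queues, and become the receiver for each. */
	for (i = 0; i < pcxt->nworkers; ++i)
	{
		shm_mq	   *mq;

		mq = shm_mq_create(tqueuespace + i * PARALLEL_TUPLE_QUEUE_SIZE,
						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
		shm_mq_set_receiver(mq, MyProc);
		responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
	}

	/* Add the TOC entry only once; lookups on reinit will find it. */
	if (!reinit)
		shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tqueuespace);

	return responseq;
}

ExecParallelReinitializeTupleQueues() then reduces to a one-line wrapper
that calls ExecParallelSetupTupleQueues(pcxt, true).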
On Fri, Oct 23, 2015 at 3:35 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Considering parallelism at the RelOptInfo level in the way done in the
> patch won't consider the RelOptInfos for child relations in the case of an
> Append node.  Refer build_simple_rel().

Hmm, true, but what can go wrong there?  The same quals apply to both,
and either both are temp or neither is.

> Also, for cases when parallelism is not enabled, like max_parallel_degree
> = 0, the current way of doing things could add the overhead of traversing
> the baserestrictinfo without need.  I think one way to avoid that would be
> to check that while setting the parallelModeOK flag.

Good idea.

> Another point is that it will consider parallelism for cases where we
> really can't parallelize, for example for a foreign table or sample scan.

As soon as we add the ability to push joins below Gather nodes, we
will be able to parallelize that stuff if it is joined to something we
can parallelize.  That's why this flag is so handy.

> One thing to note here is that we already have precedent for verifying
> qual pushdown safety during path generation (during subquery path
> generation), so it doesn't seem wrong to consider the same for parallel
> paths, and it would minimize the cases where we need to evaluate
> parallelism.

Mmm, yeah.

>> The advantage of this is that the logic is centralized.  If we have
>> parallel seq scan and also, say, parallel bitmap heap scan, your
>> approach would require that we duplicate the logic to check for
>> parallel-restricted functions for each path generation function.
>
> Don't we anyway need that irrespective of caching it in RelOptInfo?
> During bitmappath creation, bitmapqual could contain something
> which needs to be evaluated for parallel-safety, as it is built based
> on index paths which in turn can be based on some join clause.  As per
> the patch, the join clause parallel-safety is checked much later than
> generation of the bitmappath.

Yes, it's possible there could be some additional checks needed here
for parameterized paths.  But we're not quite there yet, so I think we
can solve that problem when we get there.  I have it in mind that in
the future we may want a parallel_safe flag on each path, which would
normally match the consider_parallel flag on the RelOptInfo but could
instead be false if the path internally uses parallelism (since,
currently, Gather nodes cannot be nested) or if it's got
parallel-restricted parameterized quals.  However, that seems like
future work.

> + else if (IsA(node, SubPlan) || IsA(node, SubLink) ||
> +          IsA(node, AlternativeSubPlan) || IsA(node, Param))
> + {
> +     /*
> +      * Since we don't have the ability to push subplans down to workers
> +      * at present, we treat subplan references as parallel-restricted.
> +      */
> +     if (!context->allow_restricted)
> +         return true;
> + }
>
> I think it is better to do this for PARAM_EXEC paramkind, as those are
> the cases where it would be a subplan or initplan.

Right, OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
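A minimal sketch of the PARAM_EXEC test settled on above, pulled out of the
walker into a standalone predicate for clarity (the function name is
hypothetical; in the patch this logic lives inside the parallel-safety
expression walker):

static bool
param_is_parallel_restricted(Node *node)
{
	if (IsA(node, Param))
	{
		Param	   *param = (Param *) node;

		/*
		 * PARAM_EXEC params are filled in at runtime by another plan
		 * node, typically a subplan or initplan, which we can't yet
		 * ship to workers.  PARAM_EXTERN values travel to workers in
		 * the serialized ParamListInfo, so they are fine.
		 */
		return param->paramkind == PARAM_EXEC;
	}

	return false;
}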
>
> On Fri, Oct 23, 2015 at 3:35 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Considering parallelism at RelOptInfo level in the way as done in patch,
> > won't consider the RelOptInfo's for child relations in case of Append node.
> > Refer build_simple_rel().
>
> Hmm, true, but what can go wrong there? The same quals apply to both,
> and either both are temp or neither is.
>
>
> >> The advantage of this is that the logic is centralized. If we have
> >> parallel seq scan and also, say, parallel bitmap heap scan, your
> >> approach would require that we duplicate the logic to check for
> >> parallel-restricted functions for each path generation function.
> >
> > Don't we anyway need that irrespective of caching it in RelOptInfo?
> > During bitmappath creation, bitmapqual could contain something
> > which needs to be evaluated for parallel-safety as it is built based
> > on index paths which inturn can be based on some join clause. As per
> > patch, the join clause parallel-safety is checked much later than
> > generation bitmappath.
>
> Yes, it's possible there could be some additional checks needed here
> for parameterized paths. But we're not quite there yet, so I think we
> can solve that problem when we get there. I have it in mind that in
> the future we may want a parallel_safe flag on each path, which would
> normally match the consider_parallel flag on the RelOptInfo but could
> instead be false if the path internally uses parallelism (since,
> currently, Gather nodes cannot be nested) or if it's got
> parallel-restricted parameterized quals. However, that seems like
> future work.
>
On Thu, Oct 22, 2015 at 11:59:58PM -0400, Robert Haas wrote:
> On Thu, Oct 15, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:
> > Agreed.  More specifically, I had in mind for copyParamList() to check
> > the mask while e.g. ExecEvalParamExtern() would either check nothing or
> > merely assert that any mask included the requested parameter.  It would
> > be tricky to verify that as safe, so ...
> >
> >> Would it work to define this as "if non-NULL,
> >> params lacking a 1-bit may be safely ignored"?  Or some other tweak
> >> that basically says that you don't need to care about this, but you
> >> can if you want to.
> >
> > ... this is a better specification.
>
> Here's an attempt to implement that.

Since that specification permits ParamListInfo consumers to ignore
paramMask, the plpgsql_param_fetch() change from
copy-paramlistinfo-fixes.patch is still formally required.

> @@ -50,6 +51,7 @@ copyParamList(ParamListInfo from)
>  	retval->parserSetup = NULL;
>  	retval->parserSetupArg = NULL;
>  	retval->numParams = from->numParams;
> +	retval->paramMask = bms_copy(from->paramMask);

Considering that this function squashes the masked params, I wonder if
it should just store NULL here.

>  	for (i = 0; i < from->numParams; i++)
>  	{
> @@ -58,6 +60,20 @@ copyParamList(ParamListInfo from)
>  		int16		typLen;
>  		bool		typByVal;
>
> +		/*
> +		 * Ignore parameters we don't need, to save cycles and space, and
> +		 * in case the fetch hook might fail.
> +		 */
> +		if (retval->paramMask != NULL &&
> +			!bms_is_member(i, retval->paramMask))

The "and in case the fetch hook might fail" in this comment and its
clones is contrary to the above specification.  Under that
specification, it would be a bug in the ParamListInfo producer to rely
on consumers checking paramMask.  Saving cycles/space would be the
spec-approved paramMask use.

Consider adding an XXX comment to the effect that cursors ought to
stop using unshared param lists.  The leading comment at
setup_unshared_param_list() is a good home for such an addition.
On Fri, Oct 23, 2015 at 9:38 PM, Noah Misch <noah@leadboat.com> wrote:
> Since that specification permits ParamListInfo consumers to ignore
> paramMask, the plpgsql_param_fetch() change from
> copy-paramlistinfo-fixes.patch is still formally required.

So why am I not just doing that, then?  Seems a lot more surgical.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 24, 2015 at 07:49:07AM -0400, Robert Haas wrote:
> On Fri, Oct 23, 2015 at 9:38 PM, Noah Misch <noah@leadboat.com> wrote:
> > Since that specification permits ParamListInfo consumers to ignore
> > paramMask, the plpgsql_param_fetch() change from
> > copy-paramlistinfo-fixes.patch is still formally required.
>
> So why am I not just doing that, then?  Seems a lot more surgical.

do $$
declare
	param_unused text := repeat('a', 100 * 1024 * 1024);
	param_used oid := 403;
begin
	perform count(*) from pg_am where oid = param_used;
end
$$;

I expect that if you were to inspect the EstimateParamListSpace()
return values when executing that, you would find that it serializes
the irrelevant 100 MiB datum.  No possible logic in
plpgsql_param_fetch() could stop that from happening, because
copyParamList() and SerializeParamList() call the paramFetch hook only
for dynamic parameters.  Cursors faced the same problem, which is the
raison d'être for setup_unshared_param_list().
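To illustrate what honoring the mask during serialization buys here, a
rough sketch of the estimate side (hypothetical function name; type lookup,
hook invocation, and the slot-alignment bookkeeping that restore needs are
all elided):

Size
EstimateParamListSpaceSketch(ParamListInfo paramLI)
{
	Size		sz = sizeof(int);	/* room for numParams */
	int			i;

	if (paramLI == NULL || paramLI->numParams <= 0)
		return sz;

	for (i = 0; i < paramLI->numParams; i++)
	{
		ParamExternData *prm = &paramLI->params[i];

		/*
		 * A parameter whose bit is absent from paramMask may be safely
		 * ignored, so it contributes nothing; this is what would keep
		 * the irrelevant 100 MiB datum above out of the DSM.  (A real
		 * implementation still has to keep slots aligned for restore,
		 * e.g. by serializing a NULL placeholder.)
		 */
		if (paramLI->paramMask != NULL &&
			!bms_is_member(i, paramLI->paramMask))
			continue;

		sz += sizeof(Oid);		/* paramtype */
		sz += datumEstimateSpace(prm->value, prm->isnull, false, -1);
	}

	return sz;
}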
On Sat, Oct 24, 2015 at 6:31 PM, Noah Misch <noah@leadboat.com> wrote:
> On Sat, Oct 24, 2015 at 07:49:07AM -0400, Robert Haas wrote:
>> On Fri, Oct 23, 2015 at 9:38 PM, Noah Misch <noah@leadboat.com> wrote:
>> > Since that specification permits ParamListInfo consumers to ignore
>> > paramMask, the plpgsql_param_fetch() change from
>> > copy-paramlistinfo-fixes.patch is still formally required.
>>
>> So why am I not just doing that, then?  Seems a lot more surgical.
>
> do $$
> declare
>     param_unused text := repeat('a', 100 * 1024 * 1024);
>     param_used oid := 403;
> begin
>     perform count(*) from pg_am where oid = param_used;
> end
> $$;
>
> I expect that if you were to inspect the EstimateParamListSpace() return
> values when executing that, you would find that it serializes the
> irrelevant 100 MiB datum.  No possible logic in plpgsql_param_fetch()
> could stop that from happening, because copyParamList() and
> SerializeParamList() call the paramFetch hook only for dynamic
> parameters.  Cursors faced the same problem, which is the raison d'être
> for setup_unshared_param_list().

Well, OK.  That's not strictly a correctness issue, but here's an
updated patch along the lines you suggested.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Oct 28, 2015 at 01:04:12AM +0100, Robert Haas wrote:
> Well, OK.  That's not strictly a correctness issue, but here's an
> updated patch along the lines you suggested.

> Finally, have setup_param_list set a new ParamListInfo field,
> paramMask, to the parameters actually used in the expression, so that
> we don't try to fetch those that are not needed when serializing a
> parameter list.  This isn't necessary for performance, but it makes

s/performance/correctness/

> the performance of the parallel executor code comparable to what we
> do for cases involving cursors.

With that, the patch is ready.
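On the producer side, the change being described amounts to something like
this sketch (simplified; the real setup_param_list does considerably more,
and reusing expr->paramnos directly as the mask is an assumption drawn from
the discussion):

static ParamListInfo
setup_param_list_sketch(PLpgSQL_execstate *estate, PLpgSQL_expr *expr)
{
	ParamListInfo paramLI = estate->paramLI;

	/*
	 * expr->paramnos is the bitmapset of datums this expression actually
	 * references.  Advertising it as paramMask lets copyParamList() and
	 * SerializeParamList() skip everything else, e.g. a huge local
	 * variable that the query never touches.
	 */
	paramLI->paramMask = expr->paramnos;

	return paramLI;
}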
On Fri, Oct 30, 2015 at 11:12 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Oct 28, 2015 at 01:04:12AM +0100, Robert Haas wrote:
>> Well, OK.  That's not strictly a correctness issue, but here's an
>> updated patch along the lines you suggested.
>
>> Finally, have setup_param_list set a new ParamListInfo field,
>> paramMask, to the parameters actually used in the expression, so that
>> we don't try to fetch those that are not needed when serializing a
>> parameter list.  This isn't necessary for performance, but it makes
>
> s/performance/correctness/
>
>> the performance of the parallel executor code comparable to what we
>> do for cases involving cursors.
>
> With that, the patch is ready.

Thanks, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Fri, Oct 23, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Please find the rebased partial seq scan patch attached with this
mail.
Attachment
On Tue, Nov 3, 2015 at 9:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Oct 23, 2015 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Oct 23, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Please find the rebased partial seq scan patch attached with this
> mail.
>
> Robert suggested to me off list that we should try to see if we
> can use the Seq Scan node instead of introducing a new Partial Seq Scan
> node.  I have analyzed whether we can use the SeqScan node (containing a
> parallel flag) instead of introducing a new partial seq scan, and found
> that we primarily need to change most of the functions in nodeSeqScan.c
> to have a parallel-flag check and do something special for Partial Seq
> Scan, and apart from that we need special handling in
> ExecSupportsBackwardScan().  In general, I think we can make the SeqScan
> node parallel-aware by having some special paths without introducing much
> complexity, and that can save us code duplication between nodeSeqScan.c
> and nodePartialSeqScan.c.  One thing that makes me slightly uncomfortable
> with this approach is that for partial seq scan, currently the plan looks
> like:
>
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Gather  (cost=0.00..2588194.25 rows=9990667 width=4)
>    Number of Workers: 1
>    ->  Partial Seq Scan on t1  (cost=0.00..89527.51 rows=9990667 width=4)
>          Filter: (c1 > 10000)
> (4 rows)
>
> Now instead of displaying Partial Seq Scan, if we just display Seq Scan,
> then it might confuse the user, so it is better to add something
> indicating a parallel node if we want to go this route.

IMO, the change from Partial Seq Scan to Seq Scan may not confuse the
user if we clearly specify in the documentation that all plans under a
Gather node are parallel plans.

This is possible for the execution nodes that execute fully under a
Gather node.  The same is not possible for parallel aggregates, so we
have to mention the aggregate node below the Gather node as partial only.

I feel this suggestion arises because of some duplicate code between
Partial Seq Scan and Seq Scan.  What if we use the Seq Scan node only,
but display it as Partial Seq Scan by storing some flag in the plan?
That avoids the need to add new plan nodes.

Regards,
Hari Babu
Fujitsu Australia
On Thu, Nov 5, 2015 at 12:52 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> Now instead of displaying Partial Seq Scan, if we just display Seq Scan,
>> then it might confuse the user, so it is better to add something
>> indicating a parallel node if we want to go this route.
>
> IMO, the change from Partial Seq Scan to Seq Scan may not confuse the
> user if we clearly specify in the documentation that all plans under a
> Gather node are parallel plans.
>
> This is possible for the execution nodes that execute fully under a
> Gather node.  The same is not possible for parallel aggregates, so we
> have to mention the aggregate node below the Gather node as partial only.
>
> I feel this suggestion arises because of some duplicate code between
> Partial Seq Scan and Seq Scan.  What if we use the Seq Scan node only,
> but display it as Partial Seq Scan by storing some flag in the plan?
> That avoids the need to add new plan nodes.

I was thinking about this idea:

1. Add a parallel_aware flag to each plan.

2. If the flag is set, have EXPLAIN print the word "Parallel" before
the node name.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> I was thinking about this idea:
>
> 1. Add a parallel_aware flag to each plan.
>
> 2. If the flag is set, have EXPLAIN print the word "Parallel" before
> the node name.

Okay, so shall we add it in the generic Plan node or to specific plan nodes
like SeqScan, IndexScan, etc.?  To me, it appears that parallelism is
a node-specific property, so we should add it to specific nodes, and
for now as we are parallelising seq scan, we can add this flag in the
SeqScan node.  What do you say?
On Thu, Nov 5, 2015 at 10:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Nov 5, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I was thinking about this idea:
>>
>> 1. Add a parallel_aware flag to each plan.
>
> Okay, so shall we add it in the generic Plan node or to specific plan nodes
> like SeqScan, IndexScan, etc.?  To me, it appears that parallelism is
> a node-specific property, so we should add it to specific nodes, and
> for now as we are parallelising seq scan, we can add this flag in the
> SeqScan node.  What do you say?

I think it should go in the Plan node itself.  Parallel Append is
going to need a way to test whether a node is parallel-aware, and
there's nothing simpler than if (plan->parallel_aware).  That makes
life simple for EXPLAIN, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
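With the flag in the generic Plan node, the EXPLAIN side reduces to a
sketch like the following (hypothetical helper; the real change belongs in
ExplainNode in explain.c):

static const char *
explain_node_name(Plan *plan, const char *sname)
{
	/* "Seq Scan" is displayed as "Parallel Seq Scan" when the flag is set. */
	if (plan->parallel_aware)
		return psprintf("Parallel %s", sname);

	return sname;
}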
On Fri, Oct 23, 2015 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The base rel's consider_parallel flag won't be percolated to childrels, so
> even if we mark the base rel as parallel-capable, while generating the
> path it won't be considered.  I think we need to find a way to pass on
> that information if we want to follow this way.

Fixed in the attached version.  I added a max_parallel_degree check,
too, per your suggestion.

> True, we can do it that way.  What I was trying to convey by the above is
> that we anyway need checks during path creation at least in some
> of the cases, so why not do all the checks at that time only, as I
> think all the information will be available at that time.
>
> I think if we store parallelism-related info in RelOptInfo, that can also
> be made to work, but the only worry I have with that approach is we
> need to have checks at two levels: one at RelOptInfo formation time
> and the other at Path formation time.

I don't really see that as a problem.  What I'm thinking about doing
(but it's not implemented in the attached patch) is additionally
adding a ppi_consider_parallel flag to the ParamPathInfo.  This would
be meaningful only for baserels, and would indicate whether the
ParamPathInfo's ppi_clauses are parallel-safe.

If we're thinking about adding a parallel path to a baserel, we need
the RelOptInfo to have consider_parallel set and, if there is a
ParamPathInfo, we need the ParamPathInfo's ppi_consider_parallel flag
to be set also.  That shows that both the rel's baserestrictinfo and
the parameterized join clauses are parallel-safe.  For a joinrel, we
can add a path if (1) the joinrel has consider_parallel set and (2)
the paths being joined are parallel-safe.  Testing condition (2) will
require a per-Path flag, so we'll end up with one flag in the
RelOptInfo, a second in the ParamPathInfo, and a third in the Path.
That doesn't seem like a problem, though: it's a sign that we're doing
this in a way that fits into the existing infrastructure, and it
should be pretty efficient.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
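The rule described above can be sketched as two predicates.  Both helpers
are hypothetical: the message is explicit that ppi_consider_parallel and
the per-Path parallel_safe flag are not yet implemented in the attached
patch.

static bool
baserel_path_may_be_parallel(RelOptInfo *rel, ParamPathInfo *param_info)
{
	/* The rel's baserestrictinfo must have been proven parallel-safe... */
	if (!rel->consider_parallel)
		return false;

	/* ...and so must any parameterized join clauses. */
	if (param_info != NULL && !param_info->ppi_consider_parallel)
		return false;

	return true;
}

static bool
joinrel_path_may_be_parallel(RelOptInfo *joinrel, Path *outer_path,
							 Path *inner_path)
{
	/* The joinrel itself and both input paths must be parallel-safe. */
	return joinrel->consider_parallel &&
		outer_path->parallel_safe &&
		inner_path->parallel_safe;
}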
>
> On Fri, Oct 23, 2015 at 9:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The base rel's consider_parallel flag won't be percolated to childrels, so
> > even
> > if we mark base rel as parallel capable, while generating the path it won't
> > be considered. I think we need to find a way to pass on that information if
> > we want to follow this way.
>
> Fixed in the attached version. I added a max_parallel_degree check,
> too, per your suggestion.
>
> > True, we can do that way. What I was trying to convey by above is
> > that we anyway need checks during path creation at least in some
> > of the cases, so why not do all the checks at that time only as I
> > think all the information will be available at that time.
> >
> > I think if we store parallelism related info in RelOptInfo, that can also
> > be made to work, but the only worry I have with that approach is we
> > need to have checks at two levels one at RelOptInfo formation time
> > and other at Path formation time.
>
> I don't really see that as a problem. What I'm thinking about doing
> (but it's not implemented in the attached patch) is additionally
> adding a ppi_consider_parallel flag to the ParamPathInfo. This would
> be meaningful only for baserels, and would indicate whether the
> ParamPathInfo's ppi_clauses are parallel-safe.
>
> If we're thinking about adding a parallel path to a baserel, we need
> the RelOptInfo to have consider_parallel set and, if there is a
> ParamPathInfo, we need the ParamPathInfo's ppi_consider_parallel flag
> to be set also. That shows that both the rel's baserestrictinfo and
> the parameterized join clauses are parallel-safe.  For a joinrel, we
> can add a path if (1) the joinrel has consider_parallel set and (2)
> the paths being joined are parallel-safe. Testing condition (2) will
> require a per-Path flag, so we'll end up with one flag in the
> RelOptInfo, a second in the ParamPathInfo, and a third in the Path.
>
> On Thu, Nov 5, 2015 at 10:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Nov 5, 2015 at 11:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I was thinking about this idea:
> >>
> >> 1. Add a parallel_aware flag to each plan.
> >
> > Okay, so shall we add it in generic Plan node or to specific plan nodes
> > like SeqScan, IndexScan, etc. To me, it appears that parallelism is
> > a node specific property, so we should add it to specific nodes and
> > for now as we are parallelising seq scan, so we can add this flag in
> > SeqScan node. What do you say?
>
> I think it should go in the Plan node itself. Parallel Append is
> going to need a way to test whether a node is parallel-aware, and
> there's nothing simpler than if (plan->parallel_aware). That makes
> life simple for EXPLAIN, too.
>
Okay, I have updated the patch to make the seq scan node parallel-aware.
To make that happen we need to have the parallel_aware flag both in Plan
as well as Path, so that we can pass that information from Path to Plan.
I think the right place to copy parallel_aware info from path to
plan is copy_path_costsize, and ideally we should change the name
of the function to something like copy_generic_path_info(), but for
now I have retained its original name as it is used in many places;
let me know if you think we should go ahead and change the name
of the function as well.

I have changed Explain as well so that it adds Parallel for Seq Scan if
the SeqScan node is parallel_aware.

I have not integrated it with the consider-parallel patch, so that this and
the Partial Seq Scan version of the patch can be compared without much
difficulty.

Thoughts?
Attachment
On Mon, Nov 9, 2015 at 11:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, I have updated the patch to make the seq scan node parallel-aware.
> To make that happen we need to have the parallel_aware flag both in Plan
> as well as Path, so that we can pass that information from Path to Plan.
> I think the right place to copy parallel_aware info from path to
> plan is copy_path_costsize, and ideally we should change the name
> of the function to something like copy_generic_path_info(), but for
> now I have retained its original name as it is used in many places;
> let me know if you think we should go ahead and change the name
> of the function as well.
>
> I have changed Explain as well so that it adds Parallel for Seq Scan if
> the SeqScan node is parallel_aware.
>
> I have not integrated it with the consider-parallel patch, so that this and
> the Partial Seq Scan version of the patch can be compared without much
> difficulty.
>
> Thoughts?

I've committed most of this, except for some planner bits that I
didn't like, and after a bunch of cleanup.  Instead, I committed the
consider-parallel-v2.patch with some additional planner bits to make
up for the ones I removed from your patch.  So, now we have parallel
sequential scan!

For those following along at home, here's a demo:

rhaas=# \timing
Timing is on.
rhaas=# select * from pgbench_accounts where filler like '%a%';
 aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 743.061 ms
rhaas=# set max_parallel_degree = 4;
SET
Time: 0.270 ms
rhaas=# select * from pgbench_accounts where filler like '%a%';
 aid | bid | abalance | filler
-----+-----+----------+--------
(0 rows)

Time: 213.412 ms

This is all pretty primitive at this point - there are still lots of
things that need to be fixed and improved, and it applies to only the
very simplest cases at present, but, hey, parallel query.  Check it
out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 11, 2015 at 11:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> For those following along at home, here's a demo:
>
> rhaas=# \timing
> Timing is on.
> rhaas=# select * from pgbench_accounts where filler like '%a%';
>  aid | bid | abalance | filler
> -----+-----+----------+--------
> (0 rows)
>
> Time: 743.061 ms
> rhaas=# set max_parallel_degree = 4;
> SET
> Time: 0.270 ms
> rhaas=# select * from pgbench_accounts where filler like '%a%';
>  aid | bid | abalance | filler
> -----+-----+----------+--------
> (0 rows)
>
> Time: 213.412 ms
>
> This is all pretty primitive at this point - there are still lots of
> things that need to be fixed and improved, and it applies to only the
> very simplest cases at present, but, hey, parallel query.  Check it
> out.

Yay!  Great work, guys!

Thanks,
Amit
On 11 November 2015 at 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
> [snip]
>
> I've committed most of this, except for some planner bits that I
> didn't like, and after a bunch of cleanup.  Instead, I committed the
> consider-parallel-v2.patch with some additional planner bits to make
> up for the ones I removed from your patch.  So, now we have parallel
> sequential scan!
>
> [snip]
>
> This is all pretty primitive at this point - there are still lots of
> things that need to be fixed and improved, and it applies to only the
> very simplest cases at present, but, hey, parallel query.  Check it
> out.

Congratulations to both you and Amit.  This is a significant landmark
in PostgreSQL feature development.

Thom
On 11 November 2015 at 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 11:15 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Okay, I have updated the patch to make seq scan node parallel aware.
>> To make that happen we need to have parallel_aware flag both in Plan
>> as well as Path, so that we can pass that information from Path to Plan.
>> I think the right place to copy parallel_aware info from path to
>> plan is copy_path_costsize and ideally we should change the name
>> of function to something like copy_generic_path_info(), but for
>> now I have retained it's original name as it is used at many places,
>> let me know if you think we should goahead and change the name
>> of function as well.
>>
>> I have changed Explain as well so that it adds Parallel for Seq Scan if
>> SeqScan node is parallel_aware.
>>
>> I have not integrated it with consider-parallel patch, so that this and
>> Partial Seq Scan version of the patch can be compared without much
>> difficulity.
>>
>> Thoughts?
>
> I've committed most of this, except for some planner bits that I
> didn't like, and after a bunch of cleanup. Instead, I committed the
> consider-parallel-v2.patch with some additional planner bits to make
> up for the ones I removed from your patch. So, now we have parallel
> sequential scan!
>
> For those following along at home, here's a demo:
>
> rhaas=# \timing
> Timing is on.
> rhaas=# select * from pgbench_accounts where filler like '%a%';
> aid | bid | abalance | filler
> -----+-----+----------+--------
> (0 rows)
>
> Time: 743.061 ms
> rhaas=# set max_parallel_degree = 4;
> SET
> Time: 0.270 ms
> rhaas=# select * from pgbench_accounts where filler like '%a%';
> aid | bid | abalance | filler
> -----+-----+----------+--------
> (0 rows)
>
> Time: 213.412 ms
>
> This is all pretty primitive at this point - there are still lots of
> things that need to be fixed and improved, and it applies to only the
> very simplest cases at present, but, hey, parallel query. Check it
> out.
Congratulations to both you and Amit.  This is a significant landmark
in PostgreSQL feature development.

Thom

+1
Hi

I have a first query.  I looked at the EXPLAIN ANALYZE output and the
numbers of filtered rows are different:

postgres=# set max_parallel_degree to 4;
SET
Time: 0.717 ms
postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Aggregate (cost=9282.50..9282.51 rows=1 width=0) (actual time=142.541..142.541 rows=1 loops=1) │
│ -> Gather (cost=1000.00..9270.00 rows=5000 width=0) (actual time=0.633..130.926 rows=100000 loops=1) │
│ Number of Workers: 2 │
│ -> Parallel Seq Scan on xxx (cost=0.00..7770.00 rows=5000 width=0) (actual time=0.052..411.303 rows=169631 loops=1) │
│ Filter: ((a % 10) = 0) │
│ Rows Removed by Filter: 1526399 │
│ Planning time: 0.167 ms │
│ Execution time: 144.519 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)
Time: 145.374 ms
postgres=# set max_parallel_degree to 1;
SET
Time: 0.706 ms
postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Aggregate (cost=14462.50..14462.51 rows=1 width=0) (actual time=163.355..163.355 rows=1 loops=1) │
│ -> Gather (cost=1000.00..14450.00 rows=5000 width=0) (actual time=0.485..152.827 rows=100000 loops=1) │
│ Number of Workers: 1 │
│ -> Parallel Seq Scan on xxx (cost=0.00..12950.00 rows=5000 width=0) (actual time=0.043..309.740 rows=145364 loops=1) │
│ Filter: ((a % 10) = 0) │
│ Rows Removed by Filter: 1308394 │
│ Planning time: 0.129 ms │
│ Execution time: 165.102 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Rows removed by filter: 1308394 X 1526399.  Is it expected?
On 11 November 2015 at 17:59, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Hi
>
> I have a first query
>
> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
> differen
>
> postgres=# set max_parallel_degree to 4;
> SET
> Time: 0.717 ms
> postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
> ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN
> │
> ╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
> │ Aggregate (cost=9282.50..9282.51 rows=1 width=0) (actual
> time=142.541..142.541 rows=1 loops=1) │
> │ -> Gather (cost=1000.00..9270.00 rows=5000 width=0) (actual
> time=0.633..130.926 rows=100000 loops=1) │
> │ Number of Workers: 2
> │
> │ -> Parallel Seq Scan on xxx (cost=0.00..7770.00 rows=5000
> width=0) (actual time=0.052..411.303 rows=169631 loops=1) │
> │ Filter: ((a % 10) = 0)
> │
> │ Rows Removed by Filter: 1526399
> │
> │ Planning time: 0.167 ms
> │
> │ Execution time: 144.519 ms
> │
> └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (8 rows)
>
> Time: 145.374 ms
> postgres=# set max_parallel_degree to 1;
> SET
> Time: 0.706 ms
> postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
> ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN
> │
> ╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
> │ Aggregate (cost=14462.50..14462.51 rows=1 width=0) (actual
> time=163.355..163.355 rows=1 loops=1) │
> │ -> Gather (cost=1000.00..14450.00 rows=5000 width=0) (actual
> time=0.485..152.827 rows=100000 loops=1) │
> │ Number of Workers: 1
> │
> │ -> Parallel Seq Scan on xxx (cost=0.00..12950.00 rows=5000
> width=0) (actual time=0.043..309.740 rows=145364 loops=1) │
> │ Filter: ((a % 10) = 0)
> │
> │ Rows Removed by Filter: 1308394
> │
> │ Planning time: 0.129 ms
> │
> │ Execution time: 165.102 ms
> │
> └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (8 rows)
>
> Rows removed by filter: 1308394 X 1526399. Is it expected?
Yeah, I noticed the same thing, but more pronounced:

With set max_parallel_degree = 4:
# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=49575.51..49575.52 rows=1 width=0) (actual
time=744.267..744.267 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=175423
-> Gather (cost=1000.00..49544.27 rows=12496 width=0) (actual
time=0.351..731.662 rows=55151 loops=1)
Output: content
Number of Workers: 4
Buffers: shared hit=175423
-> Parallel Seq Scan on public.js (cost=0.00..47294.67
rows=12496 width=0) (actual time=0.030..5912.118 rows=96062 loops=1)
Output: content
Filter: (((((js.content -> 'tags'::text) -> 0) ->>
'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
-> 0) ->> 'term'::text) ~~ 'web%'::text))
Rows Removed by Filter: 2085546
Buffers: shared hit=305123
Planning time: 0.123 ms
Execution time: 759.313 ms
(14 rows)
With set max_parallel_degree = 0:
# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=212857.25..212857.26 rows=1 width=0) (actual
time=1235.082..1235.082 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=175243
-> Seq Scan on public.js (cost=0.00..212826.01 rows=12496
width=0) (actual time=0.019..1228.515 rows=55151 loops=1)
Output: content
Filter: (((((js.content -> 'tags'::text) -> 0) ->>
'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
-> 0) ->> 'term'::text) ~~ 'web%'::text))
Rows Removed by Filter: 1197822
Buffers: shared hit=175243
Planning time: 0.064 ms
Execution time: 1235.108 ms
(10 rows)
Time: 1235.517 ms
Rows removed: 2085546 vs 1197822
Buffers hit: 305123 vs 175243
Thom
On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I have a first query
>
> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
> differen
Hmm, I see I was right about people finding more bugs once this was
committed. That didn't take long.
There's supposed to be code to handle this - see the
SharedPlanStateInstrumentation stuff in execParallel.c - but it's
evidently a few bricks shy of a load.
ExecParallelReportInstrumentation is supposed to transfer the counts
from each worker to the DSM:
ps_instrument = &instrumentation->ps_instrument[i];
SpinLockAcquire(&ps_instrument->mutex);
InstrAggNode(&ps_instrument->instr, planstate->instrument);
SpinLockRelease(&ps_instrument->mutex);
And ExecParallelRetrieveInstrumentation is supposed to slurp those
counts back into the leader's PlanState objects:
/* No need to acquire the spinlock here; workers have exited already. */
ps_instrument = &instrumentation->ps_instrument[i];
InstrAggNode(planstate->instrument, &ps_instrument->instr);
This might be a race condition, or it might be just wrong logic.
Could you test what happens if you insert something like a 1-second
sleep in ExecParallelFinish just after the call to
WaitForParallelWorkersToFinish()? If that makes the results
consistent, this is a race. If it doesn't, something else is wrong:
then it would be useful to know whether the workers are actually
calling ExecParallelReportInstrumentation, and whether the leader is
actually calling ExecParallelRetrieveInstrumentation, and if so
whether they are doing it for the correct set of nodes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
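The suggested experiment is just an artificial delay between worker
shutdown and anything that depends on the retrieved counts, roughly
(abbreviated sketch of a throwaway debugging change, not a real patch):

/* In ExecParallelFinish, right after the workers have finished: */
WaitForParallelWorkersToFinish(pei->pcxt);

pg_usleep(1000000L);	/* XXX temporary: 1 second; pg_usleep takes microseconds */

/* ...instrumentation retrieval proceeds as before. */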
postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
QUERY PLAN
═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Aggregate (cost=9282.50..9282.51 rows=1 width=0) (actual time=154.535..154.535 rows=1 loops=1)
-> Gather (cost=1000.00..9270.00 rows=5000 width=0) (actual time=0.675..142.320 rows=100000 loops=1)
Number of Workers: 2
-> Parallel Seq Scan on xxx (cost=0.00..7770.00 rows=5000 width=0) (actual time=0.075..445.999 rows=168927 loops=1)
Filter: ((a % 10) = 0)
Rows Removed by Filter: 1520549
Planning time: 0.117 ms
Execution time: 1155.505 ms
(8 rows)
postgres=# EXPLAIN ANALYZE select count(*) from xxx where a % 10 = 0;
QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════
Aggregate (cost=19437.50..19437.51 rows=1 width=0) (actual time=171.233..171.233 rows=1 loops=1)
-> Seq Scan on xxx (cost=0.00..19425.00 rows=5000 width=0) (actual time=0.187..162.627 rows=100000 loops=1)
Filter: ((a % 10) = 0)
Rows Removed by Filter: 900000
Planning time: 0.119 ms
Execution time: 171.322 ms
(6 rows)
create table xxx(a int);
insert into xxx select generate_series(1,1000000);
On 11 November 2015 at 19:26, Robert Haas <robertmhaas@gmail.com> wrote:
> [snip]
>
> This might be a race condition, or it might be just wrong logic.
> Could you test what happens if you insert something like a 1-second
> sleep in ExecParallelFinish just after the call to
> WaitForParallelWorkersToFinish()?  If that makes the results
> consistent, this is a race.  If it doesn't, something else is wrong:
> then it would be useful to know whether the workers are actually
> calling ExecParallelReportInstrumentation, and whether the leader is
> actually calling ExecParallelRetrieveInstrumentation, and if so
> whether they are doing it for the correct set of nodes.

Hmm.. I made the change, but clearly it's not sleeping properly with
my change (I'm expecting a total runtime in excess of 1 second):

max_parallel_degree = 4:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

 Aggregate  (cost=49578.18..49578.19 rows=1 width=0) (actual time=797.518..797.518 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=174883 read=540
   ->  Gather  (cost=1000.00..49546.93 rows=12500 width=0) (actual time=0.245..784.959 rows=55151 loops=1)
         Output: content
         Number of Workers: 4
         Buffers: shared hit=174883 read=540
         ->  Parallel Seq Scan on public.js  (cost=0.00..47296.93 rows=12500 width=0) (actual time=0.019..6153.679 rows=94503 loops=1)
               Output: content
               Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
               Rows Removed by Filter: 2051330
               Buffers: shared hit=299224 read=907
 Planning time: 0.086 ms
 Execution time: 803.026 ms

max_parallel_degree = 0:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->0->>'term' like 'design%' or
content->'tags'->0->>'term' like 'web%';

 Aggregate  (cost=212867.43..212867.44 rows=1 width=0) (actual time=1278.717..1278.717 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=174671 read=572
   ->  Seq Scan on public.js  (cost=0.00..212836.18 rows=12500 width=0) (actual time=0.018..1272.030 rows=55151 loops=1)
         Output: content
         Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
         Rows Removed by Filter: 1197822
         Buffers: shared hit=174671 read=572
 Planning time: 0.064 ms
 Execution time: 1278.741 ms
(10 rows)

Time: 1279.145 ms

I did, however, notice that repeated runs of the query with
max_parallel_degree = 4 yield different counts of rows removed by
filter:

Run 1: 2051330
Run 2: 2081252
Run 3: 2065112
Run 4: 2022045
Run 5: 2025384
Run 6: 2059360
Run 7: 2079620
Run 8: 2058541

--
Thom
public.js (cost=0.00..212836.18 rows=12500 width=0) (actual time=0.018..1272.030 rows=55151 loops=1) Output: content Filter: (((((js.content -> 'tags'::text)-> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text)) Rows Removed by Filter: 1197822 Buffers: shared hit=174671 read=572Planningtime: 0.064 msExecution time: 1278.741 ms (10 rows) Time: 1279.145 ms I did, however, notice that repeated runs of the query with max_parallel_degree = 4 yields different counts of rows removed by filter: Run 1: 2051330 Run 2: 2081252 Run 3: 2065112 Run 4: 2022045 Run 5: 2025384 Run 6: 2059360 Run 7: 2079620 Run 8: 2058541 -- Thom
On 11 November 2015 at 19:51, Thom Brown <thom@linux.com> wrote: > On 11 November 2015 at 19:26, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>> I have a first query >>> >>> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are >>> differen >> >> Hmm, I see I was right about people finding more bugs once this was >> committed. That didn't take long. >> >> There's supposed to be code to handle this - see the >> SharedPlanStateInstrumentation stuff in execParallel.c - but it's >> evidently a few bricks shy of a load. >> ExecParallelReportInstrumentation is supposed to transfer the counts >> from each worker to the DSM: >> >> ps_instrument = &instrumentation->ps_instrument[i]; >> SpinLockAcquire(&ps_instrument->mutex); >> InstrAggNode(&ps_instrument->instr, planstate->instrument); >> SpinLockRelease(&ps_instrument->mutex); >> >> And ExecParallelRetrieveInstrumentation is supposed to slurp those >> counts back into the leader's PlanState objects: >> >> /* No need to acquire the spinlock here; workers have exited already. */ >> ps_instrument = &instrumentation->ps_instrument[i]; >> InstrAggNode(planstate->instrument, &ps_instrument->instr); >> >> This might be a race condition, or it might be just wrong logic. >> Could you test what happens if you insert something like a 1-second >> sleep in ExecParallelFinish just after the call to >> WaitForParallelWorkersToFinish()? If that makes the results >> consistent, this is a race. If it doesn't, something else is wrong: >> then it would be useful to know whether the workers are actually >> calling ExecParallelReportInstrumentation, and whether the leader is >> actually calling ExecParallelRetrieveInstrumentation, and if so >> whether they are doing it for the correct set of nodes. > > Hmm.. 
> I made the change, but clearly it's not sleeping properly with
> my change (I'm expecting a total runtime in excess of 1 second):
>
> max_parallel_degree = 4:
>
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->0->>'term' like 'design%' or
> content->'tags'->0->>'term' like 'web%';
>
>                                  QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate  (cost=49578.18..49578.19 rows=1 width=0) (actual
> time=797.518..797.518 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=174883 read=540
>    ->  Gather  (cost=1000.00..49546.93 rows=12500 width=0) (actual
> time=0.245..784.959 rows=55151 loops=1)
>          Output: content
>          Number of Workers: 4
>          Buffers: shared hit=174883 read=540
>          ->  Parallel Seq Scan on public.js  (cost=0.00..47296.93
> rows=12500 width=0) (actual time=0.019..6153.679 rows=94503 loops=1)
>                Output: content
>                Filter: (((((js.content -> 'tags'::text) -> 0) ->>
> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
> -> 0) ->> 'term'::text) ~~ 'web%'::text))
>                Rows Removed by Filter: 2051330
>                Buffers: shared hit=299224 read=907
> Planning time: 0.086 ms
> Execution time: 803.026 ms
>
>
> max_parallel_degree = 0:
>
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->0->>'term' like 'design%' or
> content->'tags'->0->>'term' like 'web%';
>
>                                  QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate  (cost=212867.43..212867.44 rows=1 width=0) (actual
> time=1278.717..1278.717 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=174671 read=572
>    ->  Seq Scan on public.js  (cost=0.00..212836.18 rows=12500
> width=0) (actual time=0.018..1272.030 rows=55151 loops=1)
>          Output: content
>          Filter: (((((js.content -> 'tags'::text) -> 0) ->>
> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text)
> -> 0) ->> 'term'::text) ~~ 'web%'::text))
>          Rows Removed by Filter: 1197822
>          Buffers: shared hit=174671 read=572
> Planning time: 0.064 ms
> Execution time: 1278.741 ms
> (10 rows)
>
> Time: 1279.145 ms
>
>
> I did, however, notice that repeated runs of the query with
> max_parallel_degree = 4 yields different counts of rows removed by
> filter:
>
> Run 1: 2051330
> Run 2: 2081252
> Run 3: 2065112
> Run 4: 2022045
> Run 5: 2025384
> Run 6: 2059360
> Run 7: 2079620
> Run 8: 2058541

Here's another oddity, with max_parallel_degree = 1:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';

                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual time=382.987..382.987 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175288
   ->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual time=382.983..382.983 rows=0 loops=1)
         Output: content
         Number of Workers: 1
         Buffers: shared hit=175288
         ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24 rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%design%'::text)
               Rows Removed by Filter: 1724810
               Buffers: shared hit=241201
 Planning time: 0.104 ms
 Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

The actual time of the sequential scan was 1141.437ms, but the total
execution time was 403.045ms.

And successive runs with max_parallel_degree = 1 also yield a different
number of rows removed by the filter, as well as a different number of
buffers being hit:

Run: rows removed / buffers hit
1: 1738517 / 243143
2: 1729361 / 241900
3: 1737168 / 242974
4: 1734440 / 242591

Thom
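An aside for anyone reproducing this: a minimal sketch of the diagnostic Robert suggests above, assuming ExecParallelFinish() in src/backend/executor/execParallel.c has roughly this shape. The pg_usleep() call is the only addition, and it is a throwaway probe, not a fix:

#include "miscadmin.h"          /* pg_usleep() */

void
ExecParallelFinish(ParallelExecutorInfo *pei)
{
    /* Wait for all launched workers to exit. */
    WaitForParallelWorkersToFinish(pei->pcxt);

    /*
     * Diagnostic probe: if a one-second delay here makes the EXPLAIN
     * ANALYZE row counts stable across runs, the leader is racing with
     * the workers' instrumentation reports; if the counts still vary,
     * the aggregation logic itself is wrong.
     */
    pg_usleep(1000000L);        /* 1 second */

    /* ... then retrieve instrumentation from the DSM as before ... */
}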
On 2015/11/12 4:26, Robert Haas wrote:
> On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I have a first query
>>
>> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
>> differen
>
> Hmm, I see I was right about people finding more bugs once this was
> committed.  That didn't take long.

I encountered one more odd behavior:

postgres=# EXPLAIN ANALYZE SELECT abalance FROM pgbench_accounts WHERE aid = 23466;

                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..65207.88 rows=1 width=4) (actual time=17450.595..17451.151 rows=1 loops=1)
   Number of Workers: 4
   ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..64207.78 rows=1 width=4) (actual time=55.934..157001.134 rows=2 loops=1)
         Filter: (aid = 23466)
         Rows Removed by Filter: 18047484
 Planning time: 0.198 ms
 Execution time: 17453.565 ms
(7 rows)

The #rows removed here is almost twice the number of rows in the table
(10m). Also, the #rows selected shown is 2 for Parallel Seq Scan whereas
only 1 row is selected.

Thanks,
Amit
>
> Hi
>
> I have a first query
>
> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are differen
>
Thanks for the report. The reason for this problem is that instrumentation
information from workers is getting aggregated multiple times. In
ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
will wait for workers to finish and then accumulate stats from workers.
Now ExecShutdownGatherWorkers() could be called multiple times
(once we read all tuples from workers, at end of node) and it should be
ensured that repeated calls should not try to redo the work done by first
call. The same is ensured for tuplequeues, but not for parallel executor
info. I think we can safely assume that we need to call ExecParallelFinish()
only when there are workers started by the Gather node, so on those lines
attached patch should fix the problem.
Attachment
On 12 November 2015 at 15:23, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule <pavel.stehule@gmail.com> > wrote: >> >> Hi >> >> I have a first query >> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are >> differen >> > > Thanks for the report. The reason for this problem is that instrumentation > information from workers is getting aggregated multiple times. In > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it > will wait for workers to finish and then accumulate stats from workers. > Now ExecShutdownGatherWorkers() could be called multiple times > (once we read all tuples from workers, at end of node) and it should be > ensured that repeated calls should not try to redo the work done by first > call. > The same is ensured for tuplequeues, but not for parallel executor info. > I think we can safely assume that we need to call ExecParallelFinish() only > when there are workers started by the Gathers node, so on those lines > attached patch should fix the problem. That fixes the count issue for me, although not the number of buffers hit, or the actual time taken. Thom
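The shape of the guard Amit describes is roughly the following sketch; the authoritative version is the attached patch, and node->reader / node->pei are the Gather node's existing fields:

static void
ExecShutdownGatherWorkers(GatherState *node)
{
    /* Tuple queue shutdown is already safe to repeat. */
    ...

    /*
     * Only finish parallel execution when workers were actually
     * launched; a second call then becomes a no-op instead of
     * aggregating the workers' instrumentation into the leader's
     * counts a second time.
     */
    if (node->reader != NULL && node->pei != NULL)
        ExecParallelFinish(node->pei);
}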
>
> On 12 November 2015 at 15:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule <pavel.stehule@gmail.com>
> > wrote:
> >>
> >> Hi
> >>
> >> I have a first query
> >>
> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are
> >> differen
> >>
> >
> > Thanks for the report. The reason for this problem is that instrumentation
> > information from workers is getting aggregated multiple times. In
> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
> > will wait for workers to finish and then accumulate stats from workers.
> > Now ExecShutdownGatherWorkers() could be called multiple times
> > (once we read all tuples from workers, at end of node) and it should be
> > ensured that repeated calls should not try to redo the work done by first
> > call.
> > The same is ensured for tuplequeues, but not for parallel executor info.
> > I think we can safely assume that we need to call ExecParallelFinish() only
> > when there are workers started by the Gathers node, so on those lines
> > attached patch should fix the problem.
>
> That fixes the count issue for me, although not the number of buffers
> hit,
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=132489.34..132489.35 rows=1 width=0) (actual
time=382.987..382.987 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=175288
-> Gather (cost=1000.00..132488.34 rows=401 width=0) (actual
time=382.983..382.983 rows=0 loops=1)
Output: content
Number of Workers: 1
Buffers: shared hit=175288
-> Parallel Seq Scan on public.js (cost=0.00..131448.24
rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
Output: content
Filter: (((js.content -> 'tags'::text) ->>
'title'::text) ~~ '%design%'::text)
Rows Removed by Filter: 1724810
Buffers: shared hit=241201
Planning time: 0.104 ms
Execution time: 403.045 ms
(14 rows)
Time: 403.596 ms
>
On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > The number of shared buffers hit could be different across different runs > because the read sequence of parallel workers can't be guaranteed, also > I don't think same is even guaranteed for Seq Scan node, The number of hits could be different. However, it seems like any sequential scan, parallel or not, should have a number of accesses (hit + read) equal to the size of the relation. Not sure if that's what is happening here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13 November 2015 at 03:39, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Nov 12, 2015 at 9:05 PM, Thom Brown <thom@linux.com> wrote: >> >> On 12 November 2015 at 15:23, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > On Wed, Nov 11, 2015 at 11:29 PM, Pavel Stehule >> > <pavel.stehule@gmail.com> >> > wrote: >> >> >> >> Hi >> >> >> >> I have a first query >> >> >> >> I looked on EXPLAIN ANALYZE output and the numbers of filtered rows are >> >> differen >> >> >> > >> > Thanks for the report. The reason for this problem is that >> > instrumentation >> > information from workers is getting aggregated multiple times. In >> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it >> > will wait for workers to finish and then accumulate stats from workers. >> > Now ExecShutdownGatherWorkers() could be called multiple times >> > (once we read all tuples from workers, at end of node) and it should be >> > ensured that repeated calls should not try to redo the work done by >> > first >> > call. >> > The same is ensured for tuplequeues, but not for parallel executor info. >> > I think we can safely assume that we need to call ExecParallelFinish() >> > only >> > when there are workers started by the Gathers node, so on those lines >> > attached patch should fix the problem. >> >> That fixes the count issue for me, although not the number of buffers >> hit, >> > > The number of shared buffers hit could be different across different runs > because the read sequence of parallel workers can't be guaranteed, also > I don't think same is even guaranteed for Seq Scan node, the other > operations > in parallel could lead to different number, however the actual problem was > that in one of the plans shown by you [1], the Buffers hit at Gather node > (175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201). > Do you still (after applying above patch) see that Gather node is showing > lesser hit buffers than Parallel Seq Scan node? Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there. > [1] > # explain (analyse, buffers, timing, verbose, costs) select count(*) > from js where content->'tags'->>'title' like '%design%'; > QUERY > PLAN > ------------------------------------------------------------------------------------------------------------------------------------ > Aggregate (cost=132489.34..132489.35 rows=1 width=0) (actual > time=382.987..382.987 rows=1 loops=1) > Output: count(*) > Buffers: shared hit=175288 > -> Gather (cost=1000.00..132488.34 rows=401 width=0) (actual > time=382.983..382.983 rows=0 loops=1) > Output: content > Number of Workers: 1 > Buffers: shared hit=175288 > -> Parallel Seq Scan on public.js (cost=0.00..131448.24 > rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1) > Output: content > Filter: (((js.content -> 'tags'::text) ->> > 'title'::text) ~~ '%design%'::text) > Rows Removed by Filter: 1724810 > Buffers: shared hit=241201 > Planning time: 0.104 ms > Execution time: 403.045 ms > (14 rows) > > Time: 403.596 ms > >> or the actual time taken. >> > > Exactly what time you are referring here, Execution Time or actual time > shown on Parallel Seq Scan node and what problem do you see with > the reported time? I'm referring to the Parallel Seq Scan actual time, showing "379.407..1141.437" with 1 worker, but the total execution time shows 403.045. If one worker is taking over a second, how come the whole query was less than half a second? Thom
>
> On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node,
>
> The number of hits could be different. However, it seems like any
> sequential scan, parallel or not, should have a number of accesses
> (hit + read) equal to the size of the relation. Not sure if that's
> what is happening here.
>
>
> >
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node, the other
> > operations
> > in parallel could lead to different number, however the actual problem was
> > that in one of the plans shown by you [1], the Buffers hit at Gather node
> > (175288) is lesser than the Buffers hit at Parallel Seq Scan node (241201).
> > Do you still (after applying above patch) see that Gather node is showing
> > lesser hit buffers than Parallel Seq Scan node?
>
> Hmm... that's odd, I'm not seeing the problem now, so maybe I'm mistaken there.
>
Thanks for confirming the same.
> >
> >> or the actual time taken.
> >>
> >
> > Exactly what time you are referring here, Execution Time or actual time
> > shown on Parallel Seq Scan node and what problem do you see with
> > the reported time?
>
> I'm referring to the Parallel Seq Scan actual time, showing
> "379.407..1141.437" with 1 worker, but the total execution time shows
> 403.045. If one worker is taking over a second, how come the whole
> query was less than half a second?
>
Yeah, this could be possible due to the way currently time is accumulated,
yes - another thing that is a little bit unclean in EXPLAIN is the number of workers. If I understand the behaviour correctly, the query is processed by two processes when the explain shows one worker.
On 13 November 2015 at 13:38, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule <pavel.stehule@gmail.com>
> wrote:
>>
>> yes - the another little bit unclean in EXPLAIN is number of workers. If I
>> understand to the behave, the query is processed by two processes if workers
>> in the explain is one.
>>
>
> You are right and I think that is current working model of Gather
> node which seems okay.  I think the more serious thing here
> is that there is possibility that Explain Analyze can show the
> number of workers as more than actual workers working for Gather
> node.  We have already discussed that Explain Analyze should
> the actual number of workers used in query execution, patch for
> the same is still pending.

This may have already been discussed before, but in a verbose output,
would it be possible to see the nodes for each worker?  e.g.

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%de%';

                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=105557.59..105557.60 rows=1 width=0) (actual time=400.752..400.752 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175333
   ->  Gather  (cost=1000.00..104931.04 rows=250621 width=0) (actual time=400.748..400.748 rows=0 loops=1)
         Output: content
         Number of Workers: 2
         Buffers: shared hit=175333
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=125310 width=0) (actual time=182.256..398.14 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 626486
               Buffers: shared hit=87666
         ->  Parallel Seq Scan on public.js  (cost=0.00..39434.47 rows=1253101 width=0) (actual time=214.11..325.31 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%de%'::text)
               Rows Removed by Filter: 6264867
               Buffers: shared hit=876667
 Planning time: 0.085 ms
 Execution time: 414.713 ms
(14 rows)

And perhaps associated PIDs?

Thom
>
> On 13 November 2015 at 13:38, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule <pavel.stehule@gmail.com>
> > wrote:
> >>
> >>
> >> yes - the another little bit unclean in EXPLAIN is number of workers. If I
> >> understand to the behave, the query is processed by two processes if workers
> >> in the explain is one.
> >>
> >
> > You are right and I think that is current working model of Gather
> > node which seems okay. I think the more serious thing here
> > is that there is possibility that Explain Analyze can show the
> > number of workers as more than actual workers working for Gather
> > node. We have already discussed that Explain Analyze should
> > the actual number of workers used in query execution, patch for
> > the same is still pending.
>
> This may have already been discussed before, but in a verbose output,
> would it be possible to see the nodes for each worker?
>
There will be hardly any difference in nodes for each worker and it could
be very long plan for large number of workers.  What kind of additional
information you want which can't be shown in current format.
>
> And perhaps associated PIDs?
>
Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.
On 13 November 2015 at 15:22, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2015 at 7:59 PM, Thom Brown <thom@linux.com> wrote: >> >> On 13 November 2015 at 13:38, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > On Wed, Nov 11, 2015 at 11:40 PM, Pavel Stehule >> > <pavel.stehule@gmail.com> >> > wrote: >> >> >> >> >> >> yes - the another little bit unclean in EXPLAIN is number of workers. >> >> If I >> >> understand to the behave, the query is processed by two processes if >> >> workers >> >> in the explain is one. >> >> >> > >> > You are right and I think that is current working model of Gather >> > node which seems okay. I think the more serious thing here >> > is that there is possibility that Explain Analyze can show the >> > number of workers as more than actual workers working for Gather >> > node. We have already discussed that Explain Analyze should >> > the actual number of workers used in query execution, patch for >> > the same is still pending. >> >> This may have already been discussed before, but in a verbose output, >> would it be possible to see the nodes for each worker? >> > > There will be hardly any difference in nodes for each worker and it could > be very long plan for large number of workers. What kind of additional > information you want which can't be shown in current format. For explain plans, not that useful, but it's useful to see how long each worker took for explain analyse. And I imagine as more functionality is added to scan partitions and foreign scans, it will perhaps be more useful when the plans won't be identical. (or would they?) >> >> And perhaps associated PIDs? >> > > Yeah, that can be useful, if others also feel like it is important, I can > look into preparing a patch for the same. Thanks. Thom
On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I've committed most of this, except for some planner bits that I
> didn't like, and after a bunch of cleanup.  Instead, I committed the
> consider-parallel-v2.patch with some additional planner bits to make
> up for the ones I removed from your patch.  So, now we have parallel
> sequential scan!

Pretty cool.  All I had to do is mark my slow plperl functions as
being parallel safe, and bang, parallel execution of them for seq
scans.

But, there does seem to be a memory leak.

The setup (warning: 20GB of data):

create table foobar as
  select md5(floor(random()*1500000)::text) as id,
         random() as volume
  from generate_series(1,200000000);

set max_parallel_degree TO 8;

explain select count(*) from foobar where volume > 0.9;

                                       QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=2626202.44..2626202.45 rows=1 width=0)
   ->  Gather  (cost=1000.00..2576381.76 rows=19928272 width=0)
         Number of Workers: 7
         ->  Parallel Seq Scan on foobar  (cost=0.00..582554.56 rows=19928272 width=0)
               Filter: (volume > '0.9'::double precision)

Now running this query leads to an OOM condition:

explain (analyze, buffers) select count(*) from foobar where volume > 0.9;

WARNING:  terminating connection because of crash of another server process

Running it without the explain also causes the problem.

A memory dump taken at some point before the crash looks like:

TopMemoryContext: 62496 total in 9 blocks; 16976 free (60 chunks); 45520 used
  TopTransactionContext: 8192 total in 1 blocks; 4024 free (8 chunks); 4168 used
  ExecutorState: 1795153920 total in 223 blocks; 4159872 free (880 chunks); 1790994048 used
    ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  ....other insignificant stuff...

I don't have enough RAM for each of 7 workers to use all that much more
than 2GB.  work_mem is 25MB, maintenance work_mem is 64MB.

Cheers,

Jeff
>
> On 13 November 2015 at 15:22, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > There will be hardly any difference in nodes for each worker and it could
> > be very long plan for large number of workers. What kind of additional
> > information you want which can't be shown in current format.
>
> For explain plans, not that useful, but it's useful to see how long
> each worker took for explain analyse.
>
> On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > I've committed most of this, except for some planner bits that I
> > didn't like, and after a bunch of cleanup. Instead, I committed the
> > consider-parallel-v2.patch with some additional planner bits to make
> > up for the ones I removed from your patch. So, now we have parallel
> > sequential scan!
>
> Pretty cool. All I had to do is mark my slow plperl functions as
> being parallel safe, and bang, parallel execution of them for seq
> scans.
>
> But, there does seem to be a memory leak.
>
Attachment
On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:
>>> And perhaps associated PIDs?
>>
>> Yeah, that can be useful, if others also feel like it is important, I can
>> look into preparing a patch for the same.
>
> Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically?  Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker.  That sounds like a loser.  But we
could add additional lines to the output for each node, like this:

Parallel Seq Scan on foo  (cost=0.00..XXX rows=YYY width=ZZZ) (actual time=AAA..BBB rows=CCC loops=1)
  Leader: actual time=AAA..BBB rows=CCC loops=1
  Worker 0: actual time=AAA..BBB rows=CCC loops=1
  Worker 1: actual time=AAA..BBB rows=CCC loops=1
  Worker 2: actual time=AAA..BBB rows=CCC loops=1

If "buffers" is specified, we could display the summary information
after the Parallel Seq Scan as normal and then display an additional
per-worker line after the "Leader" line and each "Worker N" line.  I
think displaying the worker index is more useful than displaying the
PID, especially if we think that a plan tree like this might ever get
executed multiple times with different PIDs on each pass.

Like?  Dislike?  Other ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
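In text-format EXPLAIN, emitting such per-worker lines would presumably come down to something like the sketch below in explain.c; worker_instrument and nworkers are invented names here, not existing fields:

int     n;

for (n = 0; n < nworkers; n++)
{
    Instrumentation *instr = &worker_instrument[n];

    /* Indent to the current plan-node depth, then one line per worker. */
    appendStringInfoSpaces(es->str, es->indent * 2);
    appendStringInfo(es->str,
                     "Worker %d: actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
                     n,
                     1000.0 * instr->startup, 1000.0 * instr->total,
                     instr->ntuples, instr->nloops);
}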
On 16/11/15 12:05, Robert Haas wrote:
> On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:
>>>> And perhaps associated PIDs?
>>> Yeah, that can be useful, if others also feel like it is important, I can
>>> look into preparing a patch for the same.
>> Thanks.
> Thom, what do you think the EXPLAIN output should look like,
> specifically?  Or anyone else who feels like answering.
>
> I don't think it would be very useful to repeat the entire EXPLAIN
> output n times, once per worker.  That sounds like a loser.  But we
> could add additional lines to the output for each node, like this:
>
> Parallel Seq Scan on foo  (cost=0.00..XXX rows=YYY width=ZZZ) (actual
> time=AAA..BBB rows=CCC loops=1)
>   Leader: actual time=AAA..BBB rows=CCC loops=1
>   Worker 0: actual time=AAA..BBB rows=CCC loops=1
>   Worker 1: actual time=AAA..BBB rows=CCC loops=1
>   Worker 2: actual time=AAA..BBB rows=CCC loops=1
>
> If "buffers" is specified, we could display the summary information
> after the Parallel Seq Scan as normal and then display an additional
> per-worker line after the "Leader" line and each "Worker N" line.  I
> think displaying the worker index is more useful than displaying the
> PID, especially if we think that a plan tree like this might ever get
> executed multiple times with different PIDs on each pass.
>
> Like?  Dislike?  Other ideas?
>
Possibly have an option to include the PID?

Consider altering the format field width of the Worker number
(depending on the number of workers) so you don't get:

Worker 9 ...
Worker 10 ...

but something like:

Worker  9 ...
Worker 10 ...

Cheers,
Gavin
On Sun, Nov 15, 2015 at 1:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Thanks for the report.
>
> I think main reason of the leak in workers seems to be due the reason
> that one of the buffer used while sending tuples (in function
> BuildRemapInfo)
> from worker to master is not getting freed and it is allocated for each
> tuple worker sends back to master.  I couldn't find use of such a buffer,
> so I think we can avoid the allocation of same or atleast we need to free
> it.  Attached patch remove_unused_buf_allocation_v1.patch should fix the
> issue.

Oops.  Committed.

> Another thing I have noticed is that we need to build the remap info
> only when the target list contains record type of attrs, so ideally it
> should not even go in this path when such attrs are not present.  The
> reason for the same was that the tuple descriptor stored in
> TQueueDestReceiver was not updated, attached patch
> fix_initialization_tdesc_v1 fixes this issue.

I don't understand this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
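For readers following along, the bug class is a common one in executor code: anything palloc'd once per tuple in a query-lifetime memory context accumulates until end of query. A schematic illustration, not the actual tqueue.c code:

static void
leaky_receive(TupleTableSlot *slot)
{
    /*
     * tqueueReceiveSlot() runs once per tuple in the per-query
     * (ExecutorState) memory context, so an allocation here that is
     * never pfree'd grows without bound; hence the ~1.8GB
     * ExecutorState in Jeff's memory-context dump.
     */
    char   *buf = palloc(64);

    /* ... use buf while forming the outgoing message ... */

    /* Fix: pfree(buf) before returning, or avoid the allocation. */
}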
>
> On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:
> >>> And perhaps associated PIDs?
> >>
> >> Yeah, that can be useful, if others also feel like it is important, I can
> >> look into preparing a patch for the same.
> >
> > Thanks.
>
> Thom, what do you think the EXPLAIN output should look like,
> specifically? Or anyone else who feels like answering.
>
> I don't think it would be very useful to repeat the entire EXPLAIN
> output n times, once per worker. That sounds like a loser.
On Mon, Nov 16, 2015 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown <thom@linux.com> wrote:
> >>> And perhaps associated PIDs?
> >>
> >> Yeah, that can be useful, if others also feel like it is important, I can
> >> look into preparing a patch for the same.
> >
> > Thanks.
>
> Thom, what do you think the EXPLAIN output should look like,
> specifically? Or anyone else who feels like answering.
>
> I don't think it would be very useful to repeat the entire EXPLAIN
> output n times, once per worker.  That sounds like a loser.

Yes, it doesn't seem a good idea to repeat the information, but what
about the cases when different workers perform a scan on different
relations (partitions in case of an Append node), or perhaps perform a
different operation in Sort or join node parallelism.
+1
On Sat, Nov 14, 2015 at 10:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> >> On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com> >> wrote: >> > >> > I've committed most of this, except for some planner bits that I >> > didn't like, and after a bunch of cleanup. Instead, I committed the >> > consider-parallel-v2.patch with some additional planner bits to make >> > up for the ones I removed from your patch. So, now we have parallel >> > sequential scan! >> >> Pretty cool. All I had to do is mark my slow plperl functions as >> being parallel safe, and bang, parallel execution of them for seq >> scans. >> >> But, there does seem to be a memory leak. >> > > Thanks for the report. > > I think main reason of the leak in workers seems to be due the reason > that one of the buffer used while sending tuples (in function > BuildRemapInfo) > from worker to master is not getting freed and it is allocated for each > tuple worker sends back to master. I couldn't find use of such a buffer, > so I think we can avoid the allocation of same or atleast we need to free > it. Attached patch remove_unused_buf_allocation_v1.patch should fix the > issue. Thanks, that patch (as committed) has fixed the problem for me. I don't understand the second one. Cheers, Jeff
Hey,

I've just pulled and compiled the new code.
I'm running a TPC-DS like test on different PostgreSQL installations, but
running (max) 12 queries in parallel on a server with 12 cores.
I've configured max_parallel_degree to 2, and I get messages that backend
processes crash.
I am running the same test now with 6 queries in parallel, and parallel
degree to 2, and they seem to work. for now. :)

This is the output I get in /var/log/messages
Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at 7fa3437bf104 ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in postgres[400000+5b5000]

Is there something else I should get?
On Sat, Nov 14, 2015 at 10:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> On Wed, Nov 11, 2015 at 6:53 AM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> >
>> > I've committed most of this, except for some planner bits that I
>> > didn't like, and after a bunch of cleanup. Instead, I committed the
>> > consider-parallel-v2.patch with some additional planner bits to make
>> > up for the ones I removed from your patch. So, now we have parallel
>> > sequential scan!
>>
>> Pretty cool. All I had to do is mark my slow plperl functions as
>> being parallel safe, and bang, parallel execution of them for seq
>> scans.
>>
>> But, there does seem to be a memory leak.
>>
>
> Thanks for the report.
>
> I think main reason of the leak in workers seems to be due the reason
> that one of the buffer used while sending tuples (in function
> BuildRemapInfo)
> from worker to master is not getting freed and it is allocated for each
> tuple worker sends back to master. I couldn't find use of such a buffer,
> so I think we can avoid the allocation of same or atleast we need to free
> it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
> issue.
Thanks, that patch (as committed) has fixed the problem for me. I
don't understand the second one.
Cheers,
Jeff
--
0477/305361
On Mon, Nov 16, 2015 at 2:51 PM, Bert <biertie@gmail.com> wrote: > I've just pulled and compiled the new code. > I'm running a TPC-DS like test on different PostgreSQL installations, but > running (max) 12queries in parallel on a server with 12cores. > I've configured max_parallel_degree to 2, and I get messages that backend > processes crash. > I am running the same test now with 6queries in parallel, and parallel > degree to 2, and they seem to work. for now. :) > > This is the output I get in /var/log/messages > Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at 7fa3437bf104 > ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in postgres[400000+5b5000] > > Is there something else I should get? Can you enable core dumps e.g. by passing the -c option to pg_ctl start? If you can get a core file, you can then get a backtrace using: gdb /path/to/postgres /path/to/core bt full q That should be enough to find and fix whatever the bug is. Thanks for testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Sun, Nov 15, 2015 at 1:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks for the report.
> >
> > I think main reason of the leak in workers seems to be due the reason
> > that one of the buffer used while sending tuples (in function
> > BuildRemapInfo)
> > from worker to master is not getting freed and it is allocated for each
> > tuple worker sends back to master. I couldn't find use of such a buffer,
> > so I think we can avoid the allocation of same or atleast we need to free
> > it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
> > issue.
>
> Oops. Committed.
>
> > Another thing I have noticed is that we need to build the remap info
> > only when the target list contains record type of attrs, so ideally it should not even go
> > in
> > this path when such attrs are not present. The reason for the same was
> > that the tuple descriptor stored in TQueueDestReceiver was not updated,
> > attached patch fix_initialization_tdesc_v1 fixes this issue.
>
> I don't understand this part.
>
tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
{
    ...
    if (tqueue->tupledesc != tupledesc ||
        tqueue->remapinfo->natts != tupledesc->natts)
    {
        if (tqueue->remapinfo != NULL)
            pfree(tqueue->remapinfo);
        tqueue->remapinfo = BuildRemapInfo(tupledesc);
    }
    ...
}
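Presumably the fix amounts to remembering the descriptor whenever the remap info is rebuilt, so the test above can come out false on later calls. A hedged reconstruction; the authoritative change is in the attached patch:

if (tqueue->tupledesc != tupledesc ||
    tqueue->remapinfo->natts != tupledesc->natts)
{
    if (tqueue->remapinfo != NULL)
        pfree(tqueue->remapinfo);
    tqueue->remapinfo = BuildRemapInfo(tupledesc);
    tqueue->tupledesc = tupledesc;      /* the missing assignment */
}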
>
> On Sat, Nov 14, 2015 at 10:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Nov 13, 2015 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> > I think main reason of the leak in workers seems to be due the reason
> > that one of the buffer used while sending tuples (in function
> > BuildRemapInfo)
> > from worker to master is not getting freed and it is allocated for each
> > tuple worker sends back to master. I couldn't find use of such a buffer,
> > so I think we can avoid the allocation of same or atleast we need to free
> > it. Attached patch remove_unused_buf_allocation_v1.patch should fix the
> > issue.
>
> Thanks, that patch (as committed) has fixed the problem for me.
>
gdb /var/lib/pgsql/9.6/data/ /var/lib/pgsql/9.6/data/core.7877
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
/var/lib/pgsql/9.6/data/: Success.
[New LWP 7877]
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/02/20b77a9ab8f607b0610082794165fccedf210d
Core was generated by `postgres: postgres tpcds [loca'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000490b56 in ?? ()
(gdb) bt full
#0 0x0000000000490b56 in ?? ()
No symbol table info available.
#1 0x0000000000003668 in ?? ()
No symbol table info available.
#2 0x00007f956249a008 in ?? ()
No symbol table info available.
#3 0x000000000228c498 in ?? ()
No symbol table info available.
#4 0x0000000000000001 in ?? ()
No symbol table info available.
#5 0x000000000228ad00 in ?? ()
No symbol table info available.
#6 0x0000000000493fdf in ?? ()
No symbol table info available.
#7 0x00000000021a8e50 in ?? ()
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb) q
On Mon, Nov 16, 2015 at 2:51 PM, Bert <biertie@gmail.com> wrote:
> I've just pulled and compiled the new code.
> I'm running a TPC-DS like test on different PostgreSQL installations, but
> running (max) 12queries in parallel on a server with 12cores.
> I've configured max_parallel_degree to 2, and I get messages that backend
> processes crash.
> I am running the same test now with 6queries in parallel, and parallel
> degree to 2, and they seem to work. for now. :)
>
> This is the output I get in /var/log/messages
> Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at 7fa3437bf104
> ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in postgres[400000+5b5000]
>
> Is there something else I should get?
Can you enable core dumps e.g. by passing the -c option to pg_ctl
start? If you can get a core file, you can then get a backtrace
using:
gdb /path/to/postgres /path/to/core
bt full
q
That should be enough to find and fix whatever the bug is. Thanks for testing.
--
0477/305361
(gdb) bt full
#0 0x0000000000490b56 in heap_parallelscan_nextpage ()
No symbol table info available.
#1 0x0000000000493fdf in heap_getnext ()
No symbol table info available.
#2 0x00000000005c0733 in SeqNext ()
No symbol table info available.
#3 0x00000000005ac5d9 in ExecScan ()
No symbol table info available.
#4 0x00000000005a5c08 in ExecProcNode ()
No symbol table info available.
#5 0x00000000005b5298 in ExecGather ()
No symbol table info available.
#6 0x00000000005a5aa8 in ExecProcNode ()
No symbol table info available.
#7 0x00000000005b68b9 in MultiExecHash ()
No symbol table info available.
#8 0x00000000005b7256 in ExecHashJoin ()
No symbol table info available.
#9 0x00000000005a5b18 in ExecProcNode ()
No symbol table info available.
#10 0x00000000005b0ac9 in fetch_input_tuple ()
No symbol table info available.
#11 0x00000000005b1eaf in ExecAgg ()
No symbol table info available.
#12 0x00000000005a5ad8 in ExecProcNode ()
No symbol table info available.
#13 0x00000000005c11e1 in ExecSort ()
No symbol table info available.
#14 0x00000000005a5af8 in ExecProcNode ()
No symbol table info available.
#15 0x00000000005ba164 in ExecLimit ()
No symbol table info available.
#16 0x00000000005a5a38 in ExecProcNode ()
No symbol table info available.
#17 0x00000000005a2343 in standard_ExecutorRun ()
No symbol table info available.
#18 0x000000000069cb08 in PortalRunSelect ()
No symbol table info available.
#19 0x000000000069de5f in PortalRun ()
No symbol table info available.
#20 0x000000000069bc16 in PostgresMain ()
No symbol table info available.
#21 0x0000000000466f55 in ServerLoop ()
No symbol table info available.
#22 0x0000000000648436 in PostmasterMain ()
No symbol table info available.
#23 0x00000000004679f0 in main ()
No symbol table info available.
Is there something else I can do?

--
Bert Desmet
0477/305361
On Mon, Nov 16, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I don't understand this part. >> > > The code in question is as below: > > tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self) > > { > .. > > if (tqueue->tupledesc != tupledesc || > > tqueue->remapinfo->natts != tupledesc->natts) > > { > > if (tqueue->remapinfo != NULL) > > pfree(tqueue->remapinfo); > > tqueue->remapinfo = BuildRemapInfo(tupledesc); > > } > > .. > } > > Here the above check always passes as tqueue->tupledesc is not > set due to which it always try to build remap info. Is there any reason > for doing so? Groan. The problem here is that tqueue->tupledesc never gets set. I think this should be fixed as in the attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Nov 17, 2015 at 6:52 AM, Bert <biertie@gmail.com> wrote: > edit: maybe this is more useful? :) Definitely. But if you've built with --enable-debug and not stripped the resulting executable, we ought to get line numbers as well, plus the arguments to each function on the stack. That would help a lot more. The only things that get dereferenced in that function are "scan" and "parallel_scan", so it's a good bet that one of those pointers is pointing off into never-never land. I can't immediately guess how that's happening, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 12, 2015 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks for the report. The reason for this problem is that instrumentation > information from workers is getting aggregated multiple times. In > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it > will wait for workers to finish and then accumulate stats from workers. > Now ExecShutdownGatherWorkers() could be called multiple times > (once we read all tuples from workers, at end of node) and it should be > ensured that repeated calls should not try to redo the work done by first > call. > The same is ensured for tuplequeues, but not for parallel executor info. > I think we can safely assume that we need to call ExecParallelFinish() only > when there are workers started by the Gathers node, so on those lines > attached patch should fix the problem. I suggest that we instead fix ExecParallelFinish() to be idempotent. Add a "bool finished" flag to ParallelExecutorInfo and return at once if it's already set. Get rid of the exposed ExecParallelReinitializeTupleQueues() interface and have ExecParallelReinitialize(pei) instead. Have that call ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt, true), and set pei->finished = false. I think that would give us a slightly cleaner separation of concerns between nodeGather.c and execParallel.c. Your fix seems a little fragile. You're relying on node->reader != NULL to tell you whether the readers need to be cleaned up, but in fact node->reader is set to a non-NULL value AFTER the pei has been created. Granted, we currently always create a reader unless we don't get any workers, and if we don't get any workers then failing to call ExecParallelFinish is currently harmless, but nonetheless I think we should be more explicit about this so it doesn't accidentally get broken later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
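A sketch of what that would look like, assuming ParallelExecutorInfo simply grows a boolean; the committed version may differ in detail:

/* Idempotent: safe to call any number of times per execution. */
void
ExecParallelFinish(ParallelExecutorInfo *pei)
{
    if (pei->finished)
        return;

    /* Wait for workers, then pull their stats across exactly once. */
    WaitForParallelWorkersToFinish(pei->pcxt);

    /* ... accumulate worker instrumentation ... */

    pei->finished = true;
}

/* Re-arm the parallel context for another execution of the same plan. */
void
ExecParallelReinitialize(ParallelExecutorInfo *pei)
{
    ReinitializeParallelDSM(pei->pcxt);
    pei->tqueue = ExecParallelSetupTupleQueues(pei->pcxt, true);
    pei->finished = false;
}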
Hey Robert,

Thank you for the help.  As you might (not) know, I'm quite new to the
community, but I'm learning, with the help of people like you.

This run is compiled from commit 5f10b7a604c87fc61a2c20a56552301f74c9bd5f
and your latest patch attached in this mail thread.  Anyhow, find attached
a third attempt at a valid backtrace file.
>
> On Mon, Nov 16, 2015 at 9:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> I don't understand this part.
> >>
> >
> > Here the above check always passes as tqueue->tupledesc is not
> > set due to which it always try to build remap info. Is there any reason
> > for doing so?
>
> Groan. The problem here is that tqueue->tupledesc never gets set.
> I think this should be fixed as in the attached.
>
>
> On Thu, Nov 12, 2015 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks for the report. The reason for this problem is that instrumentation
> > information from workers is getting aggregated multiple times. In
> > ExecShutdownGatherWorkers(), we call ExecParallelFinish where it
> > will wait for workers to finish and then accumulate stats from workers.
> > Now ExecShutdownGatherWorkers() could be called multiple times
> > (once we read all tuples from workers, at end of node) and it should be
> > ensured that repeated calls should not try to redo the work done by first
> > call.
> > The same is ensured for tuplequeues, but not for parallel executor info.
> > I think we can safely assume that we need to call ExecParallelFinish() only
> > when there are workers started by the Gathers node, so on those lines
> > attached patch should fix the problem.
>
> I suggest that we instead fix ExecParallelFinish() to be idempotent.
> Add a "bool finished" flag to ParallelExecutorInfo and return at once
> if it's already set. Get rid of the exposed
> ExecParallelReinitializeTupleQueues() interface and have
> ExecParallelReinitialize(pei) instead. Have that call
> ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt,
> true), and set pei->finished = false. I think that would give us a
> slightly cleaner separation of concerns between nodeGather.c and
> execParallel.c.
>
Okay, attached patch fixes the issue as per above suggestion.
Attachment
On Wed, Nov 18, 2015 at 12:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I suggest that we instead fix ExecParallelFinish() to be idempotent. >> Add a "bool finished" flag to ParallelExecutorInfo and return at once >> if it's already set. Get rid of the exposed >> ExecParallelReinitializeTupleQueues() interface and have >> ExecParallelReinitialize(pei) instead. Have that call >> ReinitializeParallelDSM(), ExecParallelSetupTupleQueues(pei->pcxt, >> true), and set pei->finished = false. I think that would give us a >> slightly cleaner separation of concerns between nodeGather.c and >> execParallel.c. > > Okay, attached patch fixes the issue as per above suggestion. Thanks, committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Hey,
>
> I've just pulled and compiled the new code.
> I'm running a TPC-DS like test on different PostgreSQL installations, but
> running (max) 12 queries in parallel on a server with 12 cores.
> I've configured max_parallel_degree to 2, and I get messages that backend
> processes crash.
> I am running the same test now with 6 queries in parallel, and parallel
> degree to 2, and they seem to work. for now. :)
>
> This is the output I get in /var/log/messages
> Nov 16 20:40:05 woludwha02 kernel: postgres[22918]: segfault at 7fa3437bf104
> ip 0000000000490b56 sp 00007ffdf2f083a0 error 6 in postgres[400000+5b5000]

I think whats going on here is that when any of the session doesn't
get any workers, we shutdown the Gather node which internally destroys
the dynamic shared memory segment as well.  However the same is
needed as per current design for doing scan by master backend as
well.  So I think the fix would be to just do shutdown of workers which
actually won't do anything in this scenario.
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Nov 18, 2015 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think whats going on here is that when any of the session doesn't > get any workers, we shutdown the Gather node which internally destroys > the dynamic shared memory segment as well. However the same is > needed as per current design for doing scan by master backend as > well. So I think the fix would be to just do shutdown of workers which > actually won't do anything in this scenario. It seems silly to call ExecGatherShutdownWorkers() here when that's going to be a no-op. I think we should just remove that line and the if statement before it altogether and replace it with a comment explaining why we can't nuke the DSM at this stage. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>
> On Wed, Nov 18, 2015 at 10:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think whats going on here is that when any of the session doesn't
> > get any workers, we shutdown the Gather node which internally destroys
> > the dynamic shared memory segment as well. However the same is
> > needed as per current design for doing scan by master backend as
> > well. So I think the fix would be to just do shutdown of workers which
> > actually won't do anything in this scenario.
>
> It seems silly to call ExecGatherShutdownWorkers() here when that's
> going to be a no-op. I think we should just remove that line and the
> if statement before it altogether and replace it with a comment
> explaining why we can't nuke the DSM at this stage.
>
Attachment
On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Isn't it better to destroy the memory for readers array as that gets > allocated > even if there are no workers available for execution? > > Attached patch fixes the issue by just destroying readers array. Well, then you're making ExecGatherShutdownWorkers() not a no-op any more. I'll go commit a combination of your two patches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
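Something along these lines, presumably; a sketch of the combined fix, with DestroyTupleQueueReader() standing in for whatever per-reader cleanup the committed patch actually performs:

static void
ExecGatherShutdownWorkers(GatherState *node)
{
    if (node->reader != NULL)
    {
        int     i;

        for (i = 0; i < node->nreaders; i++)
            DestroyTupleQueueReader(node->reader[i]);

        /* Free the array itself, allocated even when no workers ran. */
        pfree(node->reader);
        node->reader = NULL;
    }

    if (node->pei != NULL)
        ExecParallelFinish(node->pei);
}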
>
> On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Isn't it better to destroy the memory for readers array as that gets
> > allocated
> > even if there are no workers available for execution?
> >
> > Attached patch fixes the issue by just destroying readers array.
>
> Well, then you're making ExecGatherShutdownWorkers() not a no-op any
> more. I'll go commit a combination of your two patches.
>
On Sun, Nov 22, 2015 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 20, 2015 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > Isn't it better to destroy the memory for readers array as that gets >> > allocated >> > even if there are no workers available for execution? >> > >> > Attached patch fixes the issue by just destroying readers array. >> >> Well, then you're making ExecGatherShutdownWorkers() not a no-op any >> more. I'll go commit a combination of your two patches. >> > > Thanks! There is still an entry in the CF app for this thread as "Parallel Seq scan". The basic infrastructure has been committed, and I understand that this is a never-ending tasks and that there will be many optimizations. Still, are you guys fine to switch this entry as committed for now? -- Michael
>
> On Sun, Nov 22, 2015 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Nov 20, 2015 at 11:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> On Thu, Nov 19, 2015 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > Isn't it better to destroy the memory for readers array as that gets
> >> > allocated
> >> > even if there are no workers available for execution?
> >> >
> >> > Attached patch fixes the issue by just destroying readers array.
> >>
> >> Well, then you're making ExecGatherShutdownWorkers() not a no-op any
> >> more. I'll go commit a combination of your two patches.
> >>
> >
> > Thanks!
>
> There is still an entry in the CF app for this thread as "Parallel Seq
> scan". The basic infrastructure has been committed, and I understand
> that this is a never-ending tasks and that there will be many
> optimizations. Still, are you guys fine to switch this entry as
> committed for now?
>
On Wed, Dec 2, 2015 at 5:45 PM, Amit Kapila wrote: > I am fine with it. I think the further optimizations can be done > separately. Done. -- Michael