Thread: How to set parallel_tuple_cost

How to set parallel_tuple_cost

From: Jeff Janes
The docs for parallel_tuple_cost are quite terse, as the reference sections of the docs usually are:

"Sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to another process. The default is 0.1."

Usually you can find more extensive discussion of such settings in informal resources like blog posts or mailing lists, but Googling the name I don't find much for this setting.  Is there good information out there somewhere?

If you were to take the doc description literally, then the default value seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost to transfer a tuple up from a parallel worker.  On the other hand, you probably don't want a query which consumes 8x the CPU resources just to finish only 5% faster (on an otherwise idle server with 8 CPUs).  Maybe this Amdahl factor is what inspired the high default value?
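
For reference, the knobs involved and their stock defaults (a quick sketch; the final SET is just an example of experimenting in one session):

show cpu_tuple_cost;        -- 0.01 by default
show parallel_tuple_cost;   -- 0.1 by default, i.e. 10x cpu_tuple_cost
show parallel_setup_cost;   -- 1000 by default, charged once per Gather
set parallel_tuple_cost = 0.01;  -- e.g. price tuple transfer the same as local handoff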

Cheers,

Jeff

Re: How to set parallel_tuple_cost

From: Laurenz Albe
On Fri, 2019-12-20 at 13:03 -0500, Jeff Janes wrote:
> The docs for parallel_tuple_cost are quite terse, as the reference sections of the docs usually are:
> 
> "Sets the planner's estimate of the cost of transferring one tuple from a parallel worker process to another process.
> The default is 0.1."
> 
> Usually you can find more extensive discussion of such settings in informal resources like blog posts or mailing lists,
> but Googling the name I don't find much for this setting.  Is there good information out there somewhere?
> 
> If you were to take the doc description literally, then the default value seems much too high, as it doesn't take
> 10x the (default) cpu_tuple_cost to transfer a tuple up from a parallel worker.  On the other hand, you probably
> don't want a query which consumes 8x the CPU resources just to finish only 5% faster (on an otherwise idle server
> with 8 CPUs).
> Maybe this Amdahl factor is what inspired the high default value?

Hmm.  The parameter was introduced into the discussion here:
https://www.postgresql.org/message-id/CAA4eK1L0dk9D3hARoAb84v2pGvUw4B5YoS4x18ORQREwR%2B1VCg%40mail.gmail.com
and while the name was changed from "cpu_tuple_comm_cost" to "parallel_tuple_cost"
later, the default value seems not to have been the subject of discussion.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: How to set parallel_tuple_cost

From: Tom Lane
Jeff Janes <jeff.janes@gmail.com> writes:
> The docs for parallel_tuple_cost are quite terse, as the reference sections
> of the docs usually are:
> "Sets the planner's estimate of the cost of transferring one tuple from a
> parallel worker process to another process. The default is 0.1."

> If you were to take the doc description literally, then the default value
> seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost to
> transfer a tuple up from a parallel worker.

Really?  If anything, I'd have thought it might be worse than 10x.
Cross-process communication isn't cheap, at least not according to
my instincts.

> On the other hand, you
> probably don't want a query which consumes 8x the CPU resources just to
> finish only 5% faster (on an otherwise idle server with 8 CPUs).  Maybe
> this Amdahl factor is what inspired the high default value?

I think the large value of parallel_setup_cost is what's meant to
discourage that scenario.
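
For scale, parallel_setup_cost defaults to 1000 versus 0.1 per tuple, so the
setup charge dominates unless a lot of tuples flow through the Gather.  A
quick sketch to watch the planner's eagerness change ("big_t" standing in
for whatever large table is handy):

set parallel_setup_cost = 0;
explain select count(*) from big_t;  -- even small scans now tend to go parallel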

            regards, tom lane

Re: How to set parallel_tuple_cost

From: Andres Freund
Hi,

On 2019-12-20 13:58:57 -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > The docs for parallel_tuple_cost are quite terse, as the reference sections
> > of the docs usually are:
> > "Sets the planner's estimate of the cost of transferring one tuple from a
> > parallel worker process to another process. The default is 0.1."
> 
> > If you were to take the doc description literally, then the default value
> > seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost to
> > transfer a tuple up from a parallel worker.
> 
> Really?  If anything, I'd have thought it might be worse than 10x.
> Cross-process communication isn't cheap, at least not according to
> my instincts.

+1. I did at some point measure the cost of transferring tuples through a
tuplequeue, and it's quite expensive compared to local tuple handoff.  Some
of that is not intrinsic and could be fixed, e.g. by putting just pointers
to tuples into the queue instead of the whole tuple (but that's hard, because
our process model means dynamic shared memory gets mapped at differing
addresses in each process).  What's worse, putting a tuple into a tuplequeue
requires the input slot to be materialized into a HeapTuple (should probably
be a MinimalTuple...), which the input often will not yet be.  So I think
it'll often be much worse than 10x.
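
You can see some of that from the SQL level by comparing a parallel aggregate,
where each worker sends the leader only a single partial-result tuple, against
a parallel scan where every surviving row is copied through the queue (a
sketch only; "big_t" with an "id" column stands in for any large table, and
the numbers vary a lot by hardware):

explain analyze select count(*) from big_t;            -- workers return one partial count each
explain analyze select * from big_t where id < 9500;   -- every surviving row crosses the queue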

Greetings,

Andres Freund

Re: How to set parallel_tuple_cost

From: Jeff Janes
On Fri, Dec 20, 2019 at 1:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> The docs for parallel_tuple_cost are quite terse, as the reference sections
> of the docs usually are:
> "Sets the planner's estimate of the cost of transferring one tuple from a
> parallel worker process to another process. The default is 0.1."

> If you were to take the doc description literally, then the default value
> seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost to
> transfer a tuple up from a parallel worker.

Really?  If anything, I'd have thought it might be worse than 10x.
Cross-process communication isn't cheap, at least not according to
my instincts.

I was a bit surprised.  I set it up so that there was a fine-tunable filter which can be applied in the parallel workers, so that only the surviving tuples get passed up to the leader.  The parallel seq scan didn't become slower than the non-parallel version until over 95% of the tuples were surviving the filter.  If I wanted to make the estimated cost cross-over point match the measured time cross-over point, I had to mark parallel_tuple_cost down to about 0.011.

This was an 8-CPU machine, an AWS m5.4xlarge, with max_parallel_workers_per_gather=7.  (On my crummy 2-CPU Windows 10 laptop running Ubuntu via VirtualBox, the cross-over point was closer to 40% of the tuples surviving, and the parallel_tuple_cost to match the cross-over point would be about 0.016, but I don't have enough RAM to make a large enough all-in-shared-buffers table to really get a good assessment.)

My method was to make shared_buffers a large fraction of RAM (55GB out of 64GB), then make a table slightly smaller than that and force it into shared_buffers with pg_prewarm.  I set seq_page_cost = random_page_cost = 0, to accurately reflect the fact that no IO is occurring.
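
In SQL terms (shared_buffers = 55GB was set in postgresql.conf, which needs a restart; the rest is per-session):

set seq_page_cost = 0;
set random_page_cost = 0;
set max_parallel_workers_per_gather = 7;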

create extension if not exists pg_prewarm;  -- needed for pg_prewarm() below
create table para_seq as
  select floor(random()*10000)::int as id,
         random() as x,
         md5(random()::text) || md5(random()::text) as t
  from generate_series(1, 8000000*55);   -- sized to just fit in 55GB of shared_buffers
vacuum analyze para_seq;
select pg_prewarm('para_seq');

explain (analyze, buffers, settings, timing off) select * from para_seq where id < 9500;

Where you can change the 9500 to tune the selectivity of the filter.  Is this the correct way to isolate just the overhead of transferring a tuple, away from other considerations, so it can be measured?
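
For the non-parallel baseline of the same query, one way (a sketch) is to disable parallelism for the session:

set max_parallel_workers_per_gather = 0;
explain (analyze, buffers, settings, timing off) select * from para_seq where id < 9500;
reset max_parallel_workers_per_gather;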

I don't think the fact that EXPLAIN ANALYZE throws away the result set without reading it should change anything.  Reading it would add the same fixed overhead to both the parallel and non-parallel cases, so it would dilute the percentage difference without changing the absolute differences.

I tried it with wider tuples as well, but not so wide they would activate TOAST, and didn't really see a difference in the conclusion.
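
(Something along these lines, for concreteness; a sketch only, with the row count needing adjustment to fit shared_buffers, and the column width kept well under the ~2kB threshold where TOAST kicks in:

create table para_seq_wide as
  select floor(random()*10000)::int as id,
         random() as x,
         repeat(md5(random()::text), 20) as t   -- 640 bytes, well below the TOAST threshold
  from generate_series(1, 8000000*5);
)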


> On the other hand, you
> probably don't want a query which consumes 8x the CPU resources just to
> finish only 5% faster (on an otherwise idle server with 8 CPUs).  Maybe
> this Amdahl factor is what inspired the high default value?

I think the large value of parallel_setup_cost is what's meant to
discourage that scenario.

I think that can only account for fixed overhead like forking and setting up memory segments.  The overhead of moving tuples around (beyond what single-threaded execution already does) would need to scale with the number of tuples moved.
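
(As I understand the costing, a Gather is charged roughly parallel_setup_cost + parallel_tuple_cost * rows, so only the second term scales with the number of tuples passed up to the leader.)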
 
Cheers,

Jeff