Re: Parallel Seq Scan - Mailing list pgsql-hackers
From | Amit Kapila
---|---
Subject | Re: Parallel Seq Scan
Date |
Msg-id | CAA4eK1KtGfdMSF=UpAqcba43H3PC7w67ea7gEojUmYm68cBMGg@mail.gmail.com
In response to | Re: Parallel Seq Scan (Stephen Frost <sfrost@snowman.net>)
Responses | Re: Parallel Seq Scan
List | pgsql-hackers
On Fri, Jan 9, 2015 at 10:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Amit Kapila (amit.kapila16@gmail.com) wrote:
> > In our case as currently we don't have a mechanism to reuse parallel
> > workers, so we need to account for that cost as well. So based on that,
> > I am planning to add three new parameters cpu_tuple_comm_cost,
> > parallel_setup_cost, parallel_startup_cost
> > * cpu_tuple_comm_cost - Cost of CPU time to pass a tuple from worker
> > to master backend with default value
> > DEFAULT_CPU_TUPLE_COMM_COST as 0.1, this will be multiplied
> > with tuples expected to be selected
> > * parallel_setup_cost - Cost of setting up shared memory for parallelism
> > with default value as 100.0
> > * parallel_startup_cost - Cost of starting up parallel workers with default
> > value as 1000.0 multiplied by number of workers decided for scan.
> >
> > I will do some experiments to finalise the default values, but in general,
> > I feel developing cost model on above parameters is good.
>
> The parameters sound reasonable but I'm a bit worried about the way
> you're describing the implementation. Specifically this comment:
>
> "Cost of starting up parallel workers with default value as 1000.0
> multiplied by number of workers decided for scan."
>
> That appears to imply that we'll decide on the number of workers, figure
> out the cost, and then consider "parallel" as one path and
> "not-parallel" as another. I'm worried that if I end up setting the max
> parallel workers to 32 for my big, beefy, mostly-single-user system then
> I'll actually end up not getting parallel execution because we'll always
> be including the full startup cost of 32 threads. For huge queries,
> it'll probably be fine, but there's a lot of room to parallelize things
> at levels less than 32 which we won't even consider.
>
Actually, the main factors deciding whether a parallel plan gets
selected will be the selectivity and cpu_tuple_comm_cost;
parallel_startup_cost is mainly there to prevent cases where the user
has set parallel_seqscan_degree but the table is small enough
(let us say 10,000 tuples) that it doesn't need parallelism. If you are
worried about the default values of the cost parameters, then I think
those still need to be decided based on experiments.
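To make that concrete, here is a rough sketch of how the three
parameters could enter the cost estimate (illustrative only: the
variable names follow the proposal above, and the split of the scan
work across workers is my assumption, not final code):

    /* Sketch only: proposed GUCs with the default values mentioned above. */
    static double cpu_tuple_comm_cost = 0.1;
    static double parallel_setup_cost = 100.0;
    static double parallel_startup_cost = 1000.0;

    /*
     * run_cost_serial: the usual CPU + I/O cost of scanning the whole
     * relation in a single backend; ntuples_out: tuples expected to be
     * selected and passed back to the master backend.
     */
    static double
    parallel_seqscan_cost(double run_cost_serial, double ntuples_out,
                          int nworkers)
    {
        double cost = 0.0;

        cost += parallel_setup_cost;               /* shared memory setup */
        cost += parallel_startup_cost * nworkers;  /* starting the workers */
        cost += run_cost_serial / (nworkers + 1);  /* scan divided among
                                                    * workers + master
                                                    * (an assumption) */
        cost += cpu_tuple_comm_cost * ntuples_out; /* tuple transfer */

        return cost;
    }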
> What I was advocating for up-thread was to consider multiple "parallel"
> paths and to pick whichever ends up being the lowest overall cost. The
> flip-side to that is increased planning time.
The main idea behind providing a parameter like parallel_seqscan_degree
is that it will try to use that many workers for a single parallel
operation (intra-node parallelism), and in case we have to perform
inter-node parallelism, having such a parameter means that each node
can use that many parallel workers. For example, if we have to
parallelize both a scan and a sort (Select * from t1 order by c1) and
parallel_degree is specified as 2, then the scan and the sort can each
use 2 parallel workers.
This is somewhat similar to how degree of parallelism (DOP) works in
other databases; refer to the Oracle documentation [1] (Setting Degree
of Parallelism).
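To spell out the arithmetic of the scan-plus-sort example, a toy
illustration (not code from the patch):

    #include <stdio.h>

    /*
     * Toy illustration of per-node DOP: each parallel-capable node gets
     * its own set of workers, so the total grows with the number of
     * parallel nodes in the plan.
     */
    int
    main(void)
    {
        int parallel_seqscan_degree = 2; /* workers per parallel node */
        int parallel_nodes = 2;          /* e.g. a scan and a sort */

        printf("total workers = %d\n",
               parallel_seqscan_degree * parallel_nodes); /* prints 4 */
        return 0;
    }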
I don't deny that making the optimizer smarter about choosing parallel
plans is an idea worth exploring, but it seems to me an advanced topic
that will become more valuable when we try to parallelize joins or
other similar operations; most papers discuss it only in that context.
At this moment, ensuring that a parallel plan is not selected for cases
where it will perform poorly is more than enough, considering we have
lots of other work left just to make any parallel operation work at all.
> Perhaps we can come up
> with an efficient way of working out where the break-point is based on
> the non-parallel cost and go at it from that direction instead of
> building out whole paths for each increment of parallelism.
>
> I'd really like to be able to set the 'max parallel' high and then have
> the optimizer figure out how many workers should actually be spawned for
> a given query.
>
> > Execution:
> > Most other databases does partition level scan for partition on
> > different disks by each individual parallel worker. However,
> > it seems amazon dynamodb [2] also works on something
> > similar to what I have used in patch which means on fixed
> > blocks. I think this kind of strategy seems better than dividing
> > the blocks at runtime because dividing randomly the blocks
> > among workers could lead to random scan for a parallel
> > sequential scan.
>
> Yeah, we also need to consider the i/o side of this, which will
> definitely be tricky. There are i/o systems out there which are faster
> than a single CPU and ones where a single CPU can manage multiple i/o
> channels. There are also cases where the i/o system handles sequential
> access nearly as fast as random and cases where sequential is much
> faster than random. Where we can get an idea of that distinction is
> with seq_page_cost vs. random_page_cost as folks running on SSDs tend to
> lower random_page_cost from the default to indicate that.
>
I am not clear: do you expect anything different in the execution
strategy than what I have mentioned, or does that sound reasonable to you?
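For reference, the fixed-block strategy amounts to giving each worker a
contiguous range of heap blocks, so that each worker's reads stay
sequential instead of hopping randomly across the relation. A
simplified sketch (my own illustration, not the actual patch code):

    #include <stdio.h>

    typedef struct
    {
        unsigned int start;     /* first block of the range */
        unsigned int end;       /* one past the last block */
    } BlockRange;

    /* Divide nblocks into fixed, contiguous per-worker ranges. */
    static BlockRange
    worker_block_range(unsigned int nblocks, int nworkers, int worker_id)
    {
        unsigned int chunk = (nblocks + nworkers - 1) / nworkers; /* ceil */
        BlockRange r;

        r.start = chunk * worker_id;
        r.end = r.start + chunk;
        if (r.end > nblocks)
            r.end = nblocks;
        return r;
    }

    int
    main(void)
    {
        unsigned int nblocks = 1000;
        int nworkers = 3;

        for (int w = 0; w < nworkers; w++)
        {
            BlockRange r = worker_block_range(nblocks, nworkers, w);
            printf("worker %d scans blocks [%u, %u)\n", w, r.start, r.end);
        }
        return 0;
    }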