Thread: [HACKERS] Increasing parallel workers at runtime

[HACKERS] Increasing parallel workers at runtime

From
Haribabu Kommi
Date:

In the current parallel implementation, if the planned number of
workers is not available at the start of query execution, the query
runs with whatever workers are available until the end of the query.

The required number of workers may become available while the query
is still being processed, so how about increasing the parallel workers
up to the planned number (only when the main backend is about to wait
for the workers to send tuples, or has to execute the plan by itself)?

The wait for the workers to send tuples may be because of too few
workers, so increasing the number of parallel workers may improve
performance.

A POC patch is attached for the same.

This still needs some adjustments for the cases where the main
backend also performs the scan instead of waiting for the workers to
finish the job, since the worker-increasing logic shouldn't add any
overhead in that case.
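
To illustrate the idea (this is only a sketch of the approach, not the
attached patch: TryToAddMissingWorkers, AllLaunchedWorkersBusy and
LaunchRemainingParallelWorkers are made-up names, the
GatherState/ParallelContext fields are as in the current executor, and
how exactly the remaining workers would be relaunched through the
existing parallel context is left as a placeholder):

/*
 * Sketch: just before the leader blocks waiting for tuples in
 * gather_readnext(), try to bring the worker count up to the planned
 * number, but only while every worker launched so far is still busy.
 * Relies on executor/execParallel.h and access/parallel.h.
 */
static void
TryToAddMissingWorkers(GatherState *gatherstate)
{
	ParallelContext *pcxt = gatherstate->pei->pcxt;

	/* Already got everything the planner asked for?  Then do nothing. */
	if (pcxt->nworkers_launched >= pcxt->nworkers)
		return;

	/*
	 * If some previously launched worker has already finished, the wait
	 * is probably not caused by the worker shortfall, so don't retry.
	 */
	if (!AllLaunchedWorkersBusy(pcxt))	/* hypothetical check */
		return;

	/*
	 * Ask for the remaining workers; treat this as a placeholder for
	 * whatever re-registration the real patch does through the existing
	 * ParallelContext.  This may still yield no additional workers.
	 */
	LaunchRemainingParallelWorkers(pcxt);	/* hypothetical */
}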

Regards,
Hari Babu
Fujitsu Australia
Attachment

Re: [HACKERS] Increasing parallel workers at runtime

From
Dilip Kumar
Date:
On Mon, May 15, 2017 at 7:36 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> The wait for the workers to send tuples may be because of too few
> workers, so increasing the number of parallel workers may improve
> performance.

Yes, for the cases where a significant amount of work is still
pending; but it might hurt performance where the work is about to
complete but has not yet completed.

@@ -346,6 +346,38 @@ gather_readnext(GatherState *gatherstate)
 	if (nvisited >= gatherstate->nreaders)
 	{
+		/*
+ * As we are going to wait for the workers to send tuples, this may be
+ * possible because of not sufficient workers that are planned?
+ * Does the gather have all the required parallel workers? If not try to get
+ * some more workers (only when all the previously allocated workers are still
+ * doing the job) before we wait, this will further increase the performance
+ * of the query as planned.
+ */
You might want to make similar changes in gather_merge_readnext.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Increasing parallel workers at runtime

From
Robert Haas
Date:
On Mon, May 15, 2017 at 10:06 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> This still needs some adjustments for the cases where the main
> backend also performs the scan instead of waiting for the workers to
> finish the job, since the worker-increasing logic shouldn't add any
> overhead in that case.

I think it would be pretty crazy to try relaunching workers after
every tuple, as this patch does.  The overhead of that will be very
high for queries where the number of tuples passing through the Gather
is large, whereas when the number of tuples passing through Gather is
small, or where tuples are sent all at once at the end of processing,
it will not actually be very effective at getting hold of more
workers.  A different idea is to have an area in shared memory where
queries can advertise that they didn't get all of the workers they
wanted, plus a background process that periodically tries to launch
workers to help those queries as parallel workers become available.
It can recheck for available workers after some interval, say 10s.
There are some problems there -- the process won't have bgw_notify_pid
pointing at the parallel leader -- but I think it might be best to try
to solve those problems instead of making it the leader's job to try
to grab more workers as we go along.  For one thing, the background
process idea can attempt to achieve fairness.  Suppose there are two
processes that didn't get all of their workers; one got 3 of 4, the
other 1 of 4.  When a worker becomes available, we'd presumably like
to give it to the process that got 1 of 4, rather than having the
leaders race to see who grabs the new worker first.  Similarly if
there are four workers available and two queries that each got 1 of 5
workers they wanted, we'd like to split the workers two and two,
rather than having one leader grab all four of them.  Or at least, I
think that's what we want.
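
To make the fairness rule concrete, here is a small illustrative
sketch (the struct and the function are invented for this example, not
taken from any patch) of how such a distributor process could pick
which leader gets a newly freed worker:

/*
 * One advertisement per parallel leader that fell short of its planned
 * worker count.  A worker that becomes available goes to the leader
 * with the worst obtained/wanted ratio, so 1 of 4 beats 3 of 4.
 */
typedef struct WorkerShortfallSlot
{
	int		leader_pid;			/* 0 means the slot is unused */
	int		nworkers_wanted;
	int		nworkers_obtained;
} WorkerShortfallSlot;

static int
ChooseNeediestLeader(WorkerShortfallSlot *slots, int nslots)
{
	int		best = -1;
	double	best_ratio = 1.0;
	int		i;

	for (i = 0; i < nslots; i++)
	{
		double	ratio;

		/* Skip unused slots and leaders that already got everything. */
		if (slots[i].leader_pid == 0 ||
			slots[i].nworkers_obtained >= slots[i].nworkers_wanted)
			continue;

		ratio = (double) slots[i].nworkers_obtained /
			(double) slots[i].nworkers_wanted;

		if (best < 0 || ratio < best_ratio)
		{
			best = i;
			best_ratio = ratio;
		}
	}
	return best;				/* slot index to help, or -1 if none */
}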

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Increasing parallel workers at runtime

From
Ashutosh Bapat
Date:
On Mon, May 15, 2017 at 9:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 15, 2017 at 10:06 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>> This still needs some adjustments for the cases where the main
>> backend also performs the scan instead of waiting for the workers to
>> finish the job, since the worker-increasing logic shouldn't add any
>> overhead in that case.
>
> I think it would be pretty crazy to try relaunching workers after
> every tuple, as this patch does.  The overhead of that will be very
> high for queries where the number of tuples passing through the Gather
> is large, whereas when the number of tuples passing through Gather is
> small, or where tuples are sent all at once at the end of processing,
> it will not actually be very effective at getting hold of more
> workers.

+1

> A different idea is to have an area in shared memory where
> queries can advertise that they didn't get all of the workers they
> wanted, plus a background process that periodically tries to launch
> workers to help those queries as parallel workers become available.
> It can recheck for available workers after some interval, say 10s.
> There are some problems there -- the process won't have bgw_notify_pid
> pointing at the parallel leader -- but I think it might be best to try
> to solve those problems instead of making it the leader's job to try
> to grab more workers as we go along.  For one thing, the background
> process idea can attempt to achieve fairness.  Suppose there are two
> processes that didn't get all of their workers; one got 3 of 4, the
> other 1 of 4.  When a worker becomes available, we'd presumably like
> to give it to the process that got 1 of 4, rather than having the
> leaders race to see who grabs the new worker first.  Similarly if
> there are four workers available and two queries that each got 1 of 5
> workers they wanted, we'd like to split the workers two and two,
> rather than having one leader grab all four of them.  Or at least, I
> think that's what we want.

+1 for a separate process distributing workers. But I am not sure
whether we want to spend a full background process on this. The user
is expected to configure enough parallel workers so that every
parallel query gets enough of them under normal circumstances, so that
process may not find anybody to dispatch idle workers to. But then the
question is which process should do that job. The postmaster, since
it's the one which spawns workers when they die? But the postmaster is
itself too busy to also execute the balancing algorithm. So maybe a
new background worker is indeed needed.

Also, looking at the patch, it doesn't look like it takes enough care
to build the execution state of a new worker so that it can
participate in a running query. I may be wrong, but aren't the
execution-state initialization routines written with the assumption
that all the workers start simultaneously?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Increasing parallel workers at runtime

From
Haribabu Kommi
Date:


On Tue, May 16, 2017 at 1:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, May 15, 2017 at 10:06 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> This still needs some adjustments for the cases where the main
> backend also performs the scan instead of waiting for the workers to
> finish the job, since the worker-increasing logic shouldn't add any
> overhead in that case.

I think it would be pretty crazy to try relaunching workers after
every tuple, as this patch does.  The overhead of that will be very
high for queries where the number of tuples passing through the Gather
is large, whereas when the number of tuples passing through Gather is
small, or where tuples are sent all at once at the end of processing,
it will not actually be very effective at getting hold of more
workers.

In the current state of the patch, the main backend tries to start
the extra workers only when no tuples are available from the workers
that are already running. I feel that the attempt to get more workers
doesn't happen for every tuple.

1. When a large number of tuples is being transferred from the workers,
I feel there is very little chance that the backend is free enough to
start more workers, because the backend itself may not need to execute
the plan locally.

2. When tuples are transferred only at the end of the plan, for cases
where the plan involves a sort node or an aggregate etc., the backend
is either waiting for the tuples to arrive or executing the plan by
itself along with the workers, after trying to extend the number of
workers once.

3. When a small number of tuples is being transferred, there is a
chance of invoking extra workers more often than in the other
scenarios. But even in this case, the small number of tuples being
transferred is probably because of a complex filter condition that is
taking time and filtering out more records. So in this case too, once
the backend has tried to extend the number of workers, it also
participates in executing the plan, and it takes time to get a tuple
by executing the plan locally. By that time there is a good chance
that the workers are already ready with tuples.

The problem of invoking more workers too often is possible when there
is only one worker allotted to the query execution.

Am I missing something?
 
  A different idea is to have an area in shared memory where
queries can advertise that they didn't get all of the workers they
wanted, plus a background process that periodically tries to launch
workers to help those queries as parallel workers become available.
It can recheck for available workers after some interval, say 10s.
There are some problems there -- the process won't have bgw_notify_pid
pointing at the parallel leader -- but I think it might be best to try
to solve those problems instead of making it the leader's job to try
to grab more workers as we go along.  For one thing, the background
process idea can attempt to achieve fairness.  Suppose there are two
processes that didn't get all of their workers; one got 3 of 4, the
other 1 of 4.  When a worker becomes available, we'd presumably like
to give it to the process that got 1 of 4, rather than having the
leaders race to see who grabs the new worker first.  Similarly if
there are four workers available and two queries that each got 1 of 5
workers they wanted, we'd like to split the workers two and two,
rather than having one leader grab all four of them.  Or at least, I
think that's what we want.

The background process logic can produce a fair distribution of
workers to the parallel queries. In this case too, the backend should
advertise only when the allotted workers are not enough; there may be
a case where the planned number of workers is 5, but because of other
parts of the query the main backend is kept fed with tuples by just 2
workers, so there is no need to provide extra workers.

The background process approach of waiting for an interval before
reassigning more workers doesn't work for queries that finish before
the configured wait time. Maybe we can ignore those scenarios?

This needs some smarter logic to share the details required to start
the worker, since it is normally started by the main backend itself.
But I feel this approach is useful for the cases where the query
doesn't get any workers at all.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Increasing parallel workers at runtime

From
Robert Haas
Date:
On Tue, May 16, 2017 at 8:18 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> In the current state of the patch, the main backend tries to start
> the extra workers only when no tuples are available from the workers
> that are already running. I feel that the attempt to get more workers
> doesn't happen for every tuple.

Well, the point is, the frequency with which the leader will try to
acquire more workers is going to be highly variable with your patch,
and might sometimes be extremely frequent.  It depends on the timing
of the workers and of the leader, and I don't think that's something
we want to depend on here.

> The background process logic can produce a fair distribution of
> workers to the parallel queries. In this case too, the backend should
> advertise only when the allotted workers are not enough; there may be
> a case where the planned number of workers is 5, but because of other
> parts of the query the main backend is kept fed with tuples by just 2
> workers, so there is no need to provide extra workers.

That's true, but I think taking it into account might be difficult.

> The background process approach of waiting for an interval before
> reassigning more workers doesn't work for queries that finish before
> the configured wait time. Maybe we can ignore those scenarios?

I think we can ignore that.  We can still help a lot of cases, and
queries with very short run times obviously aren't being seriously
hurt by the lack of full parallelism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Increasing parallel workers at runtime

From
Haribabu Kommi
Date:


On Wed, May 17, 2017 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 16, 2017 at 8:18 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> In the current state of the patch, the main backend tries to start
> the extra workers only when no tuples are available from the workers
> that are already running. I feel that the attempt to get more workers
> doesn't happen for every tuple.

Well, the point is, the frequency with which the leader will try to
acquire more workers is going to be highly variable with your patch,
and might sometimes be extremely frequent.  It depends on the timing
of the workers and of the leader, and I don't think that's something
we want to depend on here.

Ok.
 
> The background process logic can produce a fair distribution of
> workers to the parallel queries. In this case too, the backend should
> advertise only when the allotted workers are not enough; there may be
> a case where the planned number of workers is 5, but because of other
> parts of the query the main backend is kept fed with tuples by just 2
> workers, so there is no need to provide extra workers.

That's true, but I think taking it into account might be difficult.

> The background process approach of waiting for an interval before
> reassigning more workers doesn't work for queries that finish before
> the configured wait time. Maybe we can ignore those scenarios?

I think we can ignore that.  We can still help a lot of cases, and
queries with very short run times obviously aren't being seriously
hurt by the lack of full parallelism.

Ok. In this approach, we need to share some of the GatherState
structure members in shared memory so that the other background
process can access them, or find some better logic to update these
details whenever a new worker gets allotted.

I will give it a try and see how easy it is to implement.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Increasing parallel workers at runtime

From
Amit Kapila
Date:
On Wed, May 17, 2017 at 5:45 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Wed, May 17, 2017 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>
>
> Ok. In this approach, we need to share some of the GatherState
> structure members in shared memory so that the other background
> process can access them, or find some better logic to update these
> details whenever a new worker gets allotted.
>

I don't think you need to share that stuff, as we initialize the
DSM/queues based on the planned workers.  One thing you might want to
consider is whether this new background worker can detect that the
query has already finished before launching workers.  Yet another way
to look at this problem is to have alternative non-parallel plans
corresponding to parallel plans.  As of now, we have only one plan
during execution of a parallel plan, and if we don't have enough
workers we have no choice but to proceed with that plan; however, if
we have some alternative plan which is non-parallel, it might be
better to use it when we can't allocate enough workers for the query.
IIRC, this was discussed previously during the development of the
parallel sequential scan patch, and I think some other databases use a
similar technique for this problem.




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Increasing parallel workers at runtime

From
Amit Kapila
Date:
On Tue, May 16, 2017 at 2:14 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, May 15, 2017 at 9:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Also, looking at the patch, it doesn't look like it takes enough care
> to build the execution state of a new worker so that it can
> participate in a running query. I may be wrong, but aren't the
> execution-state initialization routines written with the assumption
> that all the workers start simultaneously?
>

No such assumptions, workers started later can also join the execution
of the query.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Increasing parallel workers at runtime

From
Robert Haas
Date:
On Tue, May 16, 2017 at 8:15 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Ok. In this approach, we need to share some of the GatherState
> structure members in shared memory so that the other background
> process can access them, or find some better logic to update these
> details whenever a new worker gets allotted.

What I'm imagining is a shared memory array with up to, say, 32 slots
(or, say, max_worker_processes slots).  Each slot stores the PID of
the leader, the number of workers expected, and the number of workers
actually obtained.  The leader tries to claim a slot for its details
(just giving up if there are none available) and clears the slot again
at the end of the query (if it got one, and even in case of error).
The whole array is protected by a new LWLock.
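
A rough sketch of how that could look (the struct, the lock and the
function names here are all assumptions made only to restate the
proposal, not code from a patch; it relies on storage/lwlock.h and
miscadmin.h for LWLockAcquire/LWLockRelease, MyProcPid and InvalidPid):

#define PARALLEL_SHORTFALL_SLOTS	32	/* or max_worker_processes */

typedef struct ShortfallSlot
{
	pid_t	leader_pid;			/* InvalidPid when the slot is free */
	int		nworkers_wanted;
	int		nworkers_obtained;
} ShortfallSlot;

static ShortfallSlot *ShortfallArray;	/* lives in shared memory */
static LWLock *ShortfallLock;			/* the proposed new LWLock */

/* Advertise this leader's shortfall; just give up if the array is full. */
static int
ShortfallSlotClaim(int wanted, int obtained)
{
	int		slot = -1;
	int		i;

	LWLockAcquire(ShortfallLock, LW_EXCLUSIVE);
	for (i = 0; i < PARALLEL_SHORTFALL_SLOTS; i++)
	{
		if (ShortfallArray[i].leader_pid == InvalidPid)
		{
			ShortfallArray[i].leader_pid = MyProcPid;
			ShortfallArray[i].nworkers_wanted = wanted;
			ShortfallArray[i].nworkers_obtained = obtained;
			slot = i;
			break;
		}
	}
	LWLockRelease(ShortfallLock);
	return slot;
}

/* Must run at end of query, including on the error path. */
static void
ShortfallSlotRelease(int slot)
{
	if (slot < 0)
		return;
	LWLockAcquire(ShortfallLock, LW_EXCLUSIVE);
	ShortfallArray[slot].leader_pid = InvalidPid;
	LWLockRelease(ShortfallLock);
}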

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Increasing parallel workers at runtime

From
Rafia Sabih
Date:
On Wed, May 17, 2017 at 2:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, May 16, 2017 at 2:14 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Mon, May 15, 2017 at 9:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> Also, looking at the patch, it doesn't look like it takes enough care
>> to build the execution state of a new worker so that it can
>> participate in a running query. I may be wrong, but aren't the
>> execution-state initialization routines written with the assumption
>> that all the workers start simultaneously?
>>
>
> No such assumptions, workers started later can also join the execution
> of the query.
>
If we are talking of run-time allocation of workers, I'd like to
propose an idea to safeguard parallelism from selectivity-estimation
errors. Start each query (if it qualifies for the use of parallelism)
with a minimum number of workers (say 2), irrespective of the #planned
workers. Then, as the query proceeds and we find that there is more
work to do, we allocate more workers.

Let's get into the details a little; we'll have the following new variables:
- T_int - a time interval at which we'll periodically check if the
query requires more workers,
- work_remaining - a variable which estimates the work yet to be done.
This will use the selectivity estimates to find the total work done
and the remaining work accordingly. Once the actual number of rows
crosses the estimated number of rows, take the maximum possible
number of tuples for that operator as the new estimate.

Now, at the Gather node, we'll check after each T_int whether work is
remaining, and if so allocate another 2 (say) workers. This way we
keep adding workers in small chunks rather than all in one go, saving
resources in case of over-estimation.

Some of the things we may improvise upon are:
- Check whether we want to increase workers or kill some of them,
e.g. if the filtering is not happening as estimated at the node, i.e.
#output tuples is the same as #input tuples, then do not add any more
workers, as that will only increase the work at the Gather.
- Instead of just starting with 2 or 4 workers, allocate x% of the
planned workers.
- As the query progresses, we may alter the value of T_int and/or
#workers to allocate, e.g. while a query is less than 50% done, check
at every T_int; after that, increase T_int to T_int(1 + .5), and
similarly for #workers, because at that point allocating more workers
might not achieve much and the cost of adding them could outweigh the
benefit.

This scheme is likely to safeguard parallelism against selectivity
estimation errors, in the sense that it uses resources only when
required.
Certainly there are many more things to think about in terms of
implementation and the cases where this can help or regress; I will be
glad to know the opinions of more people on this.
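
To restate the scheme in code form, a rough sketch (the struct, the
function and the 50%/chunk-of-2 thresholds are illustrative
assumptions, not an existing executor API; the "assume the worst case"
branch is a simplification of taking the operator's maximum possible
tuples as the new estimate):

typedef struct GatherFeedbackState
{
	double	estimated_rows;		/* planner's estimate below the Gather */
	double	rows_processed;		/* actual rows seen so far */
	int		nworkers_now;		/* workers currently running */
	int		nworkers_planned;	/* planner's worker count */
} GatherFeedbackState;

/* Called at the Gather node every T_int; returns how many workers to add. */
static int
WorkersToAdd(GatherFeedbackState *st)
{
	double	work_remaining;
	int		deficit = st->nworkers_planned - st->nworkers_now;

	/*
	 * Once the actual row count crosses the estimate, treat the remaining
	 * work as unknown-but-large so that under-estimation keeps the request
	 * for more workers alive.
	 */
	if (st->rows_processed >= st->estimated_rows)
		work_remaining = 1.0;
	else
		work_remaining = 1.0 - st->rows_processed / st->estimated_rows;

	/* Plenty of work left and still below the planned count: add a chunk. */
	if (work_remaining > 0.5 && deficit > 0)
		return (deficit < 2) ? deficit : 2;

	return 0;
}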

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



Re: [HACKERS] Increasing parallel workers at runtime

From
Kuntal Ghosh
Date:
On Mon, May 22, 2017 at 2:54 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> On Wed, May 17, 2017 at 2:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, May 16, 2017 at 2:14 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> On Mon, May 15, 2017 at 9:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> Also, looking at the patch, it doesn't look like it takes enough care
>>> to build the execution state of a new worker so that it can
>>> participate in a running query. I may be wrong, but aren't the
>>> execution-state initialization routines written with the assumption
>>> that all the workers start simultaneously?
>>>
>>
>> No such assumptions, workers started later can also join the execution
>> of the query.
>>
> If we are talking of run-time allocation of workers, I'd like to
> propose an idea to safeguard parallelism from selectivity-estimation
> errors. Start each query (if it qualifies for the use of parallelism)
> with a minimum number of workers (say 2), irrespective of the
> #planned workers. Then, as the query proceeds and we find that there
> is more work to do, we allocate more workers.
>
> Let's get into the details a little; we'll have the following new variables:
> - T_int - a time interval at which we'll periodically check if the
> query requires more workers,
> - work_remaining - a variable which estimates the work yet to be done.
> This will use the selectivity estimates to find the total work done
> and the remaining work accordingly. Once the actual number of rows
> crosses the estimated number of rows, take the maximum possible
> number of tuples for that operator as the new estimate.
>
> Now, at the Gather node, we'll check after each T_int whether work is
> remaining, and if so allocate another 2 (say) workers. This way we
> keep adding workers in small chunks rather than all in one go, saving
> resources in case of over-estimation.
>
I understand your concern about selectivity estimation errors, which
affect the number of workers planned as well. But in that case, I
would like to fix the optimizer so that it calculates the number of
workers correctly. If the optimizer thinks that we should start with n
workers, we probably SHOULD start with n workers.

However, errors in selectivity estimation (the root of all evil, the
Achilles heel of query optimization, according to Guy Lohman et al.
:)) can always prove the optimizer wrong. In that case, +1 for your
suggested approach of dynamically adding or killing workers based on
the estimated work left to do.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Increasing parallel workers at runtime

From
Amit Kapila
Date:
On Mon, May 22, 2017 at 2:54 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> On Wed, May 17, 2017 at 2:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, May 16, 2017 at 2:14 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> On Mon, May 15, 2017 at 9:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> Also, looking at the patch, it doesn't look like it takes enough care
>>> to build the execution state of a new worker so that it can
>>> participate in a running query. I may be wrong, but aren't the
>>> execution-state initialization routines written with the assumption
>>> that all the workers start simultaneously?
>>>
>>
>> No such assumptions, workers started later can also join the execution
>> of the query.
>>
> If we are talking of run-time allocation of workers, I'd like to
> propose an idea to safeguard parallelism from selectivity-estimation
> errors. Start each query (if it qualifies for the use of parallelism)
> with a minimum number of workers (say 2), irrespective of the
> #planned workers. Then, as the query proceeds and we find that there
> is more work to do, we allocate more workers.
>
> Let's get into the details a little; we'll have the following new variables:
> - T_int - a time interval at which we'll periodically check if the
> query requires more workers,
> - work_remaining - a variable which estimates the work yet to be done.
> This will use the selectivity estimates to find the total work done
> and the remaining work accordingly. Once the actual number of rows
> crosses the estimated number of rows, take the maximum possible
> number of tuples for that operator as the new estimate.
>
> Now, at the Gather node, we'll check after each T_int whether work is
> remaining, and if so allocate another 2 (say) workers. This way we
> keep adding workers in small chunks rather than all in one go, saving
> resources in case of over-estimation.
>
> Some of the things we may improvise upon are:
> - Check whether we want to increase workers or kill some of them,
> e.g. if the filtering is not happening as estimated at the node, i.e.
> #output tuples is the same as #input tuples, then do not add any more
> workers, as that will only increase the work at the Gather.
> - Instead of just starting with 2 or 4 workers, allocate x% of the
> planned workers.
> - As the query progresses, we may alter the value of T_int and/or
> #workers to allocate, e.g. while a query is less than 50% done, check
> at every T_int; after that, increase T_int to T_int(1 + .5), and
> similarly for #workers, because at that point allocating more workers
> might not achieve much and the cost of adding them could outweigh the
> benefit.
>
> This scheme is likely to safeguard parallelism against selectivity
> estimation errors, in the sense that it uses resources only when
> required.

Isn't this point contradictory?  Basically, on one side you are
suggesting calculating the additional workers (work_remaining) based
on selectivity, and on the other side you are saying that it will fix
estimation errors.  IIUC, you are talking about some sort of executor
feedback to adjust the number of workers, which might be a good thing
to do, but I think that is a different problem to solve.  As of now,
I don't think we have that type of mechanism even for non-parallel
execution.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Increasing parallel workers at runtime

From
Rafia Sabih
Date:
On Tue, May 23, 2017 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Isn't this point contradictory?  Basically, on one side you are
> suggesting calculating the additional workers (work_remaining) based
> on selectivity, and on the other side you are saying that it will fix
> estimation errors.  IIUC, you are talking about some sort of executor
> feedback to adjust the number of workers, which might be a good thing
> to do, but I think that is a different problem to solve.  As of now,
> I don't think we have that type of mechanism even for non-parallel
> execution.
>
Yes, I am talking of an executor-feedback sort of mechanism. For
non-parallel execution it's a lot more challenging, since the
execution plan might need to be changed at runtime, but in the
parallel query case only the addition or removal of workers has to be
dealt with, which appears to be a less complex thing.
Why I am thinking in this direction is that, in my experience, it
seems very easy to regress with more workers when over-estimation
happens, and this runtime calculation of workers can mitigate that
largely. However, I understand that this might require more proof than
my experience alone. Let's see if anybody else shares my gut feeling. :)

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/