Thread: Re: strange parallel query behavior after OOM crashes
On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
> >
> > 1. Put an Assert(0) in ParallelQueryMain(), start server and execute
> > any parallel query.
> > In LaunchParallelWorkers, you can see
> >    nworkers = n nworkers_launched = n (n>0)
> > But, all the workers will crash because of the assert statement.
> > 2. the server restarts automatically, initialize
> > BackgroundWorkerData->parallel_register_count and
> > BackgroundWorkerData->parallel_terminate_count in the shared memory.
> > After that, it calls ForgetBackgroundWorker and it increments
> > parallel_terminate_count. In LaunchParallelWorkers, we have the
> > following condition:
> >    if ((BackgroundWorkerData->parallel_register_count -
> >         BackgroundWorkerData->parallel_terminate_count) >=
> >        max_parallel_workers)
> >        DO NOT launch any parallel worker.
> > Hence, nworkers = n nworkers_launched = 0.
> parallel_register_count and parallel_terminate_count, both are
> unsigned integer. So, whenever the difference is negative, it'll be a
> well-defined unsigned integer and certainly much larger than
> max_parallel_workers. Hence, no workers will be launched. I've
> attached a patch to fix this.

The current explanation of active number of parallel workers is:
* The active
* number of parallel workers is the number of registered workers minus the
* terminated ones.

In situations like the one you mentioned above, this formula can give a negative
number of active parallel workers. However, a negative number of active
parallel workers does not make any sense.

I feel it would be better to explain in the code in which situations the formula
can generate a negative result and what that means.

Regards,
Neha
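
The unsigned-arithmetic point raised in the message above can be reproduced in isolation. The following is a minimal sketch, not PostgreSQL code: the names ToyWorkerCounts, toy_active_workers and toy_launch_refused are hypothetical, and the counters are assumed to be 32 bits wide, since the thread only says "unsigned integer".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept in BackgroundWorkerData. */
typedef struct ToyWorkerCounts
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} ToyWorkerCounts;

/*
 * "Active" workers as described in the quoted comment: registered minus
 * terminated. Storing the result in a uint32_t makes the subtraction wrap
 * modulo 2^32 instead of going negative.
 */
uint32_t
toy_active_workers(const ToyWorkerCounts *counts)
{
    uint32_t active = counts->parallel_register_count -
                      counts->parallel_terminate_count;

    return active;
}

/*
 * Mirrors the condition quoted from LaunchParallelWorkers: returns true
 * when no parallel worker would be launched.
 */
bool
toy_launch_refused(const ToyWorkerCounts *counts, int max_parallel_workers)
{
    assert(max_parallel_workers >= 0);
    return toy_active_workers(counts) >= (uint32_t) max_parallel_workers;
}
```

With parallel_register_count = 0 and parallel_terminate_count = 1, which is the state described after the crash restart, toy_active_workers() yields 4294967295 rather than -1, so toy_launch_refused() returns true for any realistic max_parallel_workers.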

On Fri, Mar 31, 2017 at 5:43 AM, Neha Khatri <nehakhatri5@gmail.com> wrote:
>
> On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
> wrote:
>>
>> On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>> >
>> > 1. Put an Assert(0) in ParallelQueryMain(), start server and execute
>> > any parallel query.
>> > In LaunchParallelWorkers, you can see
>> >    nworkers = n nworkers_launched = n (n>0)
>> > But, all the workers will crash because of the assert statement.
>> > 2. the server restarts automatically, initialize
>> > BackgroundWorkerData->parallel_register_count and
>> > BackgroundWorkerData->parallel_terminate_count in the shared memory.
>> > After that, it calls ForgetBackgroundWorker and it increments
>> > parallel_terminate_count. In LaunchParallelWorkers, we have the
>> > following condition:
>> >    if ((BackgroundWorkerData->parallel_register_count -
>> >         BackgroundWorkerData->parallel_terminate_count) >=
>> >        max_parallel_workers)
>> >        DO NOT launch any parallel worker.
>> > Hence, nworkers = n nworkers_launched = 0.
>> parallel_register_count and parallel_terminate_count, both are
>> unsigned integer. So, whenever the difference is negative, it'll be a
>> well-defined unsigned integer and certainly much larger than
>> max_parallel_workers. Hence, no workers will be launched. I've
>> attached a patch to fix this.
>
>
> The current explanation of active number of parallel workers is:
>
> * The active
> * number of parallel workers is the number of registered workers minus the
> * terminated ones.
>
> In the situations like you mentioned above, this formula can give negative
> number for active parallel workers. However a negative number for active
> parallel workers does not make any sense.
Agreed.

> I feel it would be better to explain in code that in what situations, the
> formula can generate a negative result and what that means.
I think that we need to find a fix so that it never generates a
negative result. The last patch submitted by me generates a negative
value correctly. But, surely that's not enough.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
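
One way to read "never generates a negative result" is as an invariant: the terminate count must never exceed the register count. The sketch below is a toy model of that idea only. It is not the patch discussed in this thread and not PostgreSQL code; ToyCounts, ToySlot and all toy_* functions are hypothetical names, and the per-slot flag is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counters, reinitialized on crash restart. */
typedef struct ToyCounts
{
    uint32_t registered;
    uint32_t terminated;
} ToyCounts;

/* Hypothetical per-worker bookkeeping that survives the restart. */
typedef struct ToySlot
{
    bool in_use;
    bool counted_as_parallel;   /* true only if it bumped "registered" */
} ToySlot;

void
toy_register_parallel(ToyCounts *counts, ToySlot *slot)
{
    slot->in_use = true;
    slot->counted_as_parallel = true;
    counts->registered++;
}

/* Crash restart: counters start over, so old slots no longer count. */
void
toy_reset_after_crash(ToyCounts *counts, ToySlot *slots, int nslots)
{
    counts->registered = 0;
    counts->terminated = 0;
    for (int i = 0; i < nslots; i++)
        slots[i].counted_as_parallel = false;
}

/* Only workers counted since the last reset bump the terminate count. */
void
toy_forget(ToyCounts *counts, ToySlot *slot)
{
    if (slot->counted_as_parallel)
        counts->terminated++;
    slot->in_use = false;
    slot->counted_as_parallel = false;

    /* The invariant that keeps (registered - terminated) non-negative. */
    assert(counts->terminated <= counts->registered);
}
```

In this model, forgetting workers that were registered before the crash leaves both counters at zero, so the difference used in the launch check stays at zero instead of wrapping around.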