Re: strange parallel query behavior after OOM crashes - Mailing list pgsql-hackers

From Neha Khatri
Subject Re: strange parallel query behavior after OOM crashes
Msg-id CAFO0U+874hTAooRdPgvE7f0bPc-QfUTywLS1baM8cMp-tSjvTw@mail.gmail.com
Responses Re: strange parallel query behavior after OOM crashes  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
List pgsql-hackers

On Fri, Mar 31, 2017 at 8:29 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 2:05 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
> >
> > 1. Put an Assert(0) in ParallelQueryMain(), start server and execute
> > any parallel query.
> >  In LaunchParallelWorkers, you can see
> >        nworkers = n nworkers_launched = n (n>0)
> > But, all the workers will crash because of the assert statement.
> > 2. the server restarts automatically, initialize
> > BackgroundWorkerData->parallel_register_count and
> > BackgroundWorkerData->parallel_terminate_count in the shared memory.
> > After that, it calls ForgetBackgroundWorker and it increments
> > parallel_terminate_count. In LaunchParallelWorkers, we have the
> > following condition:
> > if ((BackgroundWorkerData->parallel_register_count -
> >                      BackgroundWorkerData->parallel_terminate_count) >=
> >         max_parallel_workers)
> > DO NOT launch any parallel worker.
> > Hence, nworkers = n nworkers_launched = 0.
> parallel_register_count and parallel_terminate_count, both are
> unsigned integer. So, whenever the difference is negative, it'll be a
> well-defined unsigned integer and certainly much larger than
> max_parallel_workers. Hence, no workers will be launched. I've
> attached a patch to fix this.

The current comment describing the number of active parallel workers reads:

 * The active
 * number of parallel workers is the number of registered workers minus the
 * terminated ones.

In situations like the one you describe above, this formula can yield a
negative number of active parallel workers, which does not make sense.

I think it would be better to explain in the code in which situations the
formula can produce a negative result and what that means.
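
To make the scenario concrete, here is a minimal standalone sketch of the
wraparound, not actual PostgreSQL code. The WorkerCounters struct and the
helper functions are hypothetical; only the two counter names and the
comparison are taken from the condition quoted above, and the counters are
assumed to be 32-bit unsigned integers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept in BackgroundWorkerData. */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} WorkerCounters;

/* "Registered minus terminated", computed in unsigned arithmetic. */
static uint32_t
active_workers(const WorkerCounters *c)
{
    return c->parallel_register_count - c->parallel_terminate_count;
}

/* Mirrors the quoted check: do not launch once the limit is reached. */
static bool
can_launch_worker(const WorkerCounters *c, uint32_t max_parallel_workers)
{
    return active_workers(c) < max_parallel_workers;
}

/*
 * Simulates the sequence from the quoted steps: both counters are
 * reinitialized to zero after the restart, then n workers are forgotten,
 * each one incrementing parallel_terminate_count.
 */
static WorkerCounters
counters_after_crash_restart(uint32_t n)
{
    WorkerCounters c = {0, 0};

    for (uint32_t i = 0; i < n; i++)
        c.parallel_terminate_count++;
    return c;
}
```

With n = 4, the difference 0 - 4 wraps around to 4294967292, so
can_launch_worker() returns false for any realistic limit (8 is used here
only as an assumed value). That is the "negative" case a code comment would
need to explain.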

Regards,
Neha
