Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> 2. the server restarts automatically, initialize
>> BackgroundWorkerData->parallel_register_count and
>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>> After that, it calls ForgetBackgroundWorker and it increments
>> parallel_terminate_count.
>
> Hmm.  So this seems like the root of the problem.  Presumably those
> things need to be reset AFTER forgetting any background workers from
> before the crash.
>
IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: strange parallel query behavior after OOM crashes

From
Neha Khatri
Date:
Looking further in this context, the number of active parallel workers is:
    parallel_register_count - parallel_terminate_count

Can the number of active workers ever be greater than max_parallel_workers? I think not. Then why is there a greater-than check in the following condition:

    if (parallel && (BackgroundWorkerData->parallel_register_count -
                     BackgroundWorkerData->parallel_terminate_count) >= max_parallel_workers)

I feel there should be an assert if
    (BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers)

And the check could be 
    if (parallel && (active_parallel_workers == max_parallel_workers))
        then do not register new parallel workers and return.

There should be no tolerance for the case where active_parallel_workers > max_parallel_workers. After all, that is the purpose of max_parallel_workers.

Or is it that multiple backends might try to register parallel workers at the same time, and that is why the greater-than check is present?

Thoughts?

Regards,
Neha

Re: strange parallel query behavior after OOM crashes

From
Robert Haas
Date:
On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>> 2. the server restarts automatically, initialize
>>> BackgroundWorkerData->parallel_register_count and
>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>> After that, it calls ForgetBackgroundWorker and it increments
>>> parallel_terminate_count.
>>
>> Hmm.  So this seems like the root of the problem.  Presumably those
>> things need to be reset AFTER forgetting any background workers from
>> before the crash.
>>
> IMHO, the fix would be not to increase the terminated parallel worker
> count whenever ForgetBackgroundWorker is called due to a bgworker
> crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it.  If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should perform
the increment, fine.  But with what you've got here, you're
essentially relying on "spooky action at a distance".  It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.

BTW, if this isn't on the open items list, it should be.  It's
presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:
On 04/04/2017 06:52 PM, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>> <kuntalghosh.2007@gmail.com> wrote:
>>>> 2. the server restarts automatically, initialize
>>>> BackgroundWorkerData->parallel_register_count and
>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>> parallel_terminate_count.
>>>
>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>> things need to be reset AFTER forgetting any background workers from
>>> before the crash.
>>>
>> IMHO, the fix would be not to increase the terminated parallel worker
>> count whenever ForgetBackgroundWorker is called due to a bgworker
>> crash. I've attached a patch for the same. PFA.
>
> While I'm not opposed to that approach, I don't think this is a good
> way to implement it.  If you want to pass an explicit flag to
> ForgetBackgroundWorker telling it whether or not it should performing
> the increment, fine.  But with what you've got here, you're
> essentially relying on "spooky action at a distance".  It would be
> easy for future code changes to break this, not realizing that
> somebody's got a hard dependency on 0 having a specific meaning.
>

I'm probably missing something, but I don't quite understand how these 
values actually survive the crash. I mean, what I observed is OOM 
followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply 
reset the values back to 0? Or do we call ForgetBackgroundWorker() after 
the crash for some reason?


In any case, the comment right before BackgroundWorkerArray says this:
 * These counters can of course overflow, but it's not important here
 * since the subtraction will still give the right number.

which means that this assert

+    Assert(BackgroundWorkerData->parallel_register_count >=
+        BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple 
comparisons of these two counters, no?

> BTW, if this isn't on the open items list, it should be.  It's
> presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.
>

Unrelated, but perhaps we should also add this to open items:

https://www.postgresql.org/message-id/flat/57bbc52c-5d40-bb80-2992-7e1027d0b67c%402ndquadrant.com

(although that's more 9.6 material).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 04/04/2017 06:52 PM, Robert Haas wrote:
>>
>> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
>> wrote:
>>>
>>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com>
>>> wrote:
>>>>
>>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>>> <kuntalghosh.2007@gmail.com> wrote:
>>>>>
>>>>> 2. the server restarts automatically, initialize
>>>>> BackgroundWorkerData->parallel_register_count and
>>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>>> parallel_terminate_count.
>>>>
>>>>
>>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>>> things need to be reset AFTER forgetting any background workers from
>>>> before the crash.
>>>>
>>> IMHO, the fix would be not to increase the terminated parallel worker
>>> count whenever ForgetBackgroundWorker is called due to a bgworker
>>> crash. I've attached a patch for the same. PFA.
>>
>>
>> While I'm not opposed to that approach, I don't think this is a good
>> way to implement it.  If you want to pass an explicit flag to
>> ForgetBackgroundWorker telling it whether or not it should performing
>> the increment, fine.  But with what you've got here, you're
>> essentially relying on "spooky action at a distance".  It would be
>> easy for future code changes to break this, not realizing that
>> somebody's got a hard dependency on 0 having a specific meaning.
>>
>
> I'm probably missing something, but I don't quite understand how these
> values actually survive the crash. I mean, what I observed is OOM followed
> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
> some reason?
AFAICU, during crash recovery, we wait for all non-syslogger children
to exit, then reset shared memory (via BackgroundWorkerShmemInit) and
perform StartupDataBase. While starting the startup process, we check
whether any bgworker is scheduled for a restart. If a bgworker has
crashed and is not meant to be restarted (a parallel worker, in our
case), we call ForgetBackgroundWorker() to discard it.

>
> In any case, the comment right before BackgroundWorkerArray says this:
>
>  * These counters can of course overflow, but it's not important here
>  * since the subtraction will still give the right number.
>
> which means that this assert
>
> +       Assert(BackgroundWorkerData->parallel_register_count >=
> +               BackgroundWorkerData->parallel_terminate_count);
>
> is outright broken, just like any other attempts to rely on simple
> comparisons of these two counters, no?
>
IIUC, the assert statement ensures that the register count is always
greater than or equal to the terminate count.
The comment, on the other hand, refers to the fact that the register
count and the terminate count indicate the total number of parallel
workers spawned and terminated, respectively, since the server was
started. We never reset the counts between sessions, hence they can
overflow. But their subtraction should still give the number of active
parallel workers at any instant.
So, I'm not able to see how the assert is broken w.r.t. the aforesaid
comment. Am I missing something here?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Tue, Apr 4, 2017 at 12:16 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:

> I feel there should be an assert if
>     (BackgroundWorkerData->parallel_register_count -
> BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers)
>
Backend 1 > SET max_parallel_workers = 8;
Backend 1 > Execute a long-running parallel query q1 that spawns, say,
4 parallel workers.
Backend 2 > SET max_parallel_workers = 3;
Backend 2 > Execute a long-running parallel query q2 that spawns > 0
parallel workers.

The above assert statement will bring down the server unnecessarily
while executing q2.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:

On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 04/04/2017 06:52 PM, Robert Haas wrote:
>>>
>>> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
>>> wrote:
>>>>
>>>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>>>> <kuntalghosh.2007@gmail.com> wrote:
>>>>>>
>>>>>> 2. the server restarts automatically, initialize
>>>>>> BackgroundWorkerData->parallel_register_count and
>>>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>>>> parallel_terminate_count.
>>>>>
>>>>>
>>>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>>>> things need to be reset AFTER forgetting any background workers from
>>>>> before the crash.
>>>>>
>>>> IMHO, the fix would be not to increase the terminated parallel worker
>>>> count whenever ForgetBackgroundWorker is called due to a bgworker
>>>> crash. I've attached a patch for the same. PFA.
>>>
>>>
>>> While I'm not opposed to that approach, I don't think this is a good
>>> way to implement it.  If you want to pass an explicit flag to
>>> ForgetBackgroundWorker telling it whether or not it should performing
>>> the increment, fine.  But with what you've got here, you're
>>> essentially relying on "spooky action at a distance".  It would be
>>> easy for future code changes to break this, not realizing that
>>> somebody's got a hard dependency on 0 having a specific meaning.
>>>
>>
>> I'm probably missing something, but I don't quite understand how these
>> values actually survive the crash. I mean, what I observed is OOM followed
>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>> some reason?
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process we check if any
> bgworker is scheduled for a restart. If a bgworker is crashed and not
> meant for a restart(parallel worker in our case), we call
> ForgetBackgroundWorker() to discard it.
> 

OK, so essentially we reset the counters, getting

    parallel_register_count = 0
    parallel_terminate_count = 0

and then ForgetBackgroundWorker() runs and increments the 
terminate_count, breaking the logic?

And you're using (parallel_register_count > 0) to detect whether we're 
still in the init phase or not. Hmmm.

>>
>> In any case, the comment right before BackgroundWorkerArray says this:
>>
>>   * These counters can of course overflow, but it's not important here
>>   * since the subtraction will still give the right number.
>>
>> which means that this assert
>>
>> +       Assert(BackgroundWorkerData->parallel_register_count >=
>> +               BackgroundWorkerData->parallel_terminate_count);
>>
>> is outright broken, just like any other attempts to rely on simple
>> comparisons of these two counters, no?
>>
> IIUC, the assert statements ensures that register count should always
> be greater than or equal to the terminate count.
> Whereas, the comment refers to the fact that register count and
> terminate count indicate the total number of parallel workers spawned
> and terminated respectively since the server has been started. At
> every session, we're not resetting the counts, hence they can
> overflow. But, their substraction should give you the number of active
> parallel worker at any instance.
> So, I'm not able to see how the assert is broken w.r.t the aforesaid
> comment. Am I missing something here?
> 

The comment says that the counters are allowed to overflow, i.e. after a 
long uptime you might get these values

    parallel_register_count = UINT_MAX + 1 = 0
    parallel_terminate_count = UINT_MAX

which is fine, because C handles the overflow during subtraction and so

    parallel_register_count - parallel_terminate_count = 1

But the assert is not doing subtraction, it's comparing the values 
directly:

    Assert(parallel_register_count >= parallel_terminate_count);

and the (perfectly valid) values trivially violate this comparison.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
>>
>> AFAICU, during crash recovery, we wait for all non-syslogger children
>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
>> StartupDataBase. While starting the startup process we check if any
>> bgworker is scheduled for a restart. If a bgworker is crashed and not
>> meant for a restart(parallel worker in our case), we call
>> ForgetBackgroundWorker() to discard it.
>>
>
> OK, so essentially we reset the counters, getting
>
>     parallel_register_count = 0
>     parallel_terminate_count = 0
>
> and then ForgetBackgroundWworker() runs and increments the terminate_count,
> breaking the logic?
>
> And you're using (parallel_register_count > 0) to detect whether we're still
> in the init phase or not. Hmmm.
>
Yes. But, as Robert suggested up in the thread, we should not use
(parallel_register_count = 0) as a way to detect a bgworker crash.
Hence, I've added an argument named 'wasCrashed' to
ForgetBackgroundWorker to indicate a bgworker crash.

>>>
>>> In any case, the comment right before BackgroundWorkerArray says this:
>>>
>>>   * These counters can of course overflow, but it's not important here
>>>   * since the subtraction will still give the right number.
>>>
>>> which means that this assert
>>>
>>> +       Assert(BackgroundWorkerData->parallel_register_count >=
>>> +               BackgroundWorkerData->parallel_terminate_count);
>>>
>>> is outright broken, just like any other attempts to rely on simple
>>> comparisons of these two counters, no?
>>>
>> IIUC, the assert statements ensures that register count should always
>> be greater than or equal to the terminate count.
>> Whereas, the comment refers to the fact that register count and
>> terminate count indicate the total number of parallel workers spawned
>> and terminated respectively since the server has been started. At
>> every session, we're not resetting the counts, hence they can
>> overflow. But, their substraction should give you the number of active
>> parallel worker at any instance.
>> So, I'm not able to see how the assert is broken w.r.t the aforesaid
>> comment. Am I missing something here?
>>
>
> The comment says that the counters are allowed to overflow, i.e. after a
> long uptime you might get these values
>
>     parallel_register_count = UINT_MAX + 1 = 1
>     parallel_terminate_count = UINT_MAX
>
> which is fine, because the C handles the overflow during subtraction and so
>
>     parallel_register_count - parallel_terminate_count = 1
>
> But the assert is not doing subtraction, it's comparing the values directly:
>
>     Assert(parallel_register_count >= parallel_terminate_count);
>
> and the (perfectly valid) values trivially violate this comparison.
>
Thanks for the explanation. So, we can't use the above assert
statement. Even the following assert statement will not be helpful:

    Assert(parallel_register_count - parallel_terminate_count >= 0);

because it fails to catch the scenario where parallel_register_count
has not overflowed but is still less than parallel_terminate_count. :(

It seems to me the only thing we can check in ForgetBackgroundWorker,
in case of a crash, is that
(parallel_register_count == parallel_terminate_count == 0).

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:

On 04/05/2017 12:36 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
>>>
>>> AFAICU, during crash recovery, we wait for all non-syslogger children
>>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
>>> StartupDataBase. While starting the startup process we check if any
>>> bgworker is scheduled for a restart. If a bgworker is crashed and not
>>> meant for a restart(parallel worker in our case), we call
>>> ForgetBackgroundWorker() to discard it.
>>>
>>
>> OK, so essentially we reset the counters, getting
>>
>>      parallel_register_count = 0
>>      parallel_terminate_count = 0
>>
>> and then ForgetBackgroundWworker() runs and increments the terminate_count,
>> breaking the logic?
>>
>> And you're using (parallel_register_count > 0) to detect whether we're still
>> in the init phase or not. Hmmm.
>>
> Yes. But, as Robert suggested up in the thread, we should not use
> (parallel_register_count = 0) as an alternative to define a bgworker
> crash. Hence, I've added an argument named 'wasCrashed' in
> ForgetBackgroundWorker to indicate a bgworker crash.
> 

Sure, and I agree that having an explicit flag is the right solution. 
I'm just trying to make sure I understand what's happening.

>>
>> The comment says that the counters are allowed to overflow, i.e. after a
>> long uptime you might get these values
>>
>>      parallel_register_count = UINT_MAX + 1 = 1
>>      parallel_terminate_count = UINT_MAX
>>
>> which is fine, because the C handles the overflow during subtraction and so
>>
>>      parallel_register_count - parallel_terminate_count = 1
>>
>> But the assert is not doing subtraction, it's comparing the values directly:
>>
>>      Assert(parallel_register_count >= parallel_terminate_count);
>>
>> and the (perfectly valid) values trivially violate this comparison.
>>
> Thanks for the explanation. So, we can't use the above assert
> statement. Even the following assert statement will not be helpful:
> Assert(parallel_register_count - parallel_terminate_count >= 0);
> Because, it'll fail to track the scenario when parallel_register_count
> is not overflowed, still less than parallel_terminate_count. :(
> 

Actually, that assert would work, because C does handle overflows on 
uint values during subtraction just fine. That is,
    (UINT_MAX+x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims 
this is the case.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>>>
>>> The comment says that the counters are allowed to overflow, i.e. after a
>>> long uptime you might get these values
>>>
>>>      parallel_register_count = UINT_MAX + 1 = 1
>>>      parallel_terminate_count = UINT_MAX
>>>
>>> which is fine, because the C handles the overflow during subtraction and
>>> so
>>>
>>>      parallel_register_count - parallel_terminate_count = 1
>>>
>>> But the assert is not doing subtraction, it's comparing the values
>>> directly:
>>>
>>>      Assert(parallel_register_count >= parallel_terminate_count);
>>>
>>> and the (perfectly valid) values trivially violate this comparison.
>>>
>> Thanks for the explanation. So, we can't use the above assert
>> statement. Even the following assert statement will not be helpful:
>> Assert(parallel_register_count - parallel_terminate_count >= 0);
>> Because, it'll fail to track the scenario when parallel_register_count
>> is not overflowed, still less than parallel_terminate_count. :(
>>
>
> Actually, that assert would work, because C does handle overflows on uint
> values during subtraction just fine. That is,
>
>     (UINT_MAX+x) - UINT_MAX = x
>
> Also, the comment about overflows before BackgroundWorkerArray claims this
> is the case.
>
Agreed on the overflowed case. But my concern is when an overflow has
not yet happened:

Suppose,

    uint parallel_register_count = 1;  /* didn't overflow */
    uint parallel_terminate_count = 2; /* didn't overflow */

    Assert(parallel_register_count - parallel_terminate_count >= 0);

We want the assert statement to fail here, but I think it won't, since
the unsigned subtraction wraps -1 around to UINT_MAX, which is greater
than 0, no?



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:

On 04/05/2017 01:09 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> The comment says that the counters are allowed to overflow, i.e. after a
>>>> long uptime you might get these values
>>>>
>>>>       parallel_register_count = UINT_MAX + 1 = 1
>>>>       parallel_terminate_count = UINT_MAX
>>>>
>>>> which is fine, because the C handles the overflow during subtraction and
>>>> so
>>>>
>>>>       parallel_register_count - parallel_terminate_count = 1
>>>>
>>>> But the assert is not doing subtraction, it's comparing the values
>>>> directly:
>>>>
>>>>       Assert(parallel_register_count >= parallel_terminate_count);
>>>>
>>>> and the (perfectly valid) values trivially violate this comparison.
>>>>
>>> Thanks for the explanation. So, we can't use the above assert
>>> statement. Even the following assert statement will not be helpful:
>>> Assert(parallel_register_count - parallel_terminate_count >= 0);
>>> Because, it'll fail to track the scenario when parallel_register_count
>>> is not overflowed, still less than parallel_terminate_count. :(
>>>
>>
>> Actually, that assert would work, because C does handle overflows on uint
>> values during subtraction just fine. That is,
>>
>>      (UINT_MAX+x) - UINT_MAX = x
>>
>> Also, the comment about overflows before BackgroundWorkerArray claims this
>> is the case.
>>
> Agreed on the overflowed case. But, my concern is when an overflow has
> not yet happened:
> 
> Suppose,
> uint parallel_register_count = 1; /* Didn't overflow* /
> uint parallel_terminate_count = 2; /* Didn't overflow */
> 
> Assert(parallel_register_count - parallel_terminate_count >= 0);
> We want the assert statement to fail here, but I think it won't since
> -1 has a valid representation in unsigned int and it is greater than
> 0, no?

I don't follow. How exactly do you get into this situation, assuming you 
actually enforce the (register_count > terminate_count) invariant 
consistently? In the modulo arithmetic of course.

But now that I'm thinking about it, the subtraction actually happens in 
unsigned ints, so the result will be unsigned int too, i.e. always >=0. 
Whether it makes sense depends on the invariant being true.

But I think this would work:

Assert(parallel_register_count - parallel_terminate_count <= 
max_parallel_workers);

If the difference gets 'negative' and wraps around, it'll be close to 
UINT_MAX.

But man, this unsigned int arithmetic makes my brain hurt ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Amit Kapila
Date:
On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 04/04/2017 06:52 PM, Robert Haas wrote:
>>>
>>> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com>
>>> wrote:
>>>>
>>>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmhaas@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>>>> <kuntalghosh.2007@gmail.com> wrote:
>>>>>>
>>>>>> 2. the server restarts automatically, initialize
>>>>>> BackgroundWorkerData->parallel_register_count and
>>>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>>>> parallel_terminate_count.
>>>>>
>>>>>
>>>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>>>> things need to be reset AFTER forgetting any background workers from
>>>>> before the crash.
>>>>>
>>>> IMHO, the fix would be not to increase the terminated parallel worker
>>>> count whenever ForgetBackgroundWorker is called due to a bgworker
>>>> crash. I've attached a patch for the same. PFA.
>>>
>>>
>>> While I'm not opposed to that approach, I don't think this is a good
>>> way to implement it.  If you want to pass an explicit flag to
>>> ForgetBackgroundWorker telling it whether or not it should performing
>>> the increment, fine.  But with what you've got here, you're
>>> essentially relying on "spooky action at a distance".  It would be
>>> easy for future code changes to break this, not realizing that
>>> somebody's got a hard dependency on 0 having a specific meaning.
>>>
>>
>> I'm probably missing something, but I don't quite understand how these
>> values actually survive the crash. I mean, what I observed is OOM followed
>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>> some reason?
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process we check if any
> bgworker is scheduled for a restart.
>

In general, your theory appears right, but can you check how it
behaves on a standby server? There is a difference in how the startup
process behaves during master and standby startup: on the master it
stops after recovery, whereas on a standby it keeps running to receive
WAL.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Robert Haas
Date:
On Tue, Apr 4, 2017 at 1:52 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In any case, the comment right before BackgroundWorkerArray says this:
>
>  * These counters can of course overflow, but it's not important here
>  * since the subtraction will still give the right number.
>
> which means that this assert
>
> +       Assert(BackgroundWorkerData->parallel_register_count >=
> +               BackgroundWorkerData->parallel_terminate_count);
>
> is outright broken, just like any other attempts to rely on simple
> comparisons of these two counters, no?

Yeah, that's busted.  Good catch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: strange parallel query behavior after OOM crashes

From
Robert Haas
Date:
On Wed, Apr 5, 2017 at 6:36 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Yes. But, as Robert suggested up in the thread, we should not use
> (parallel_register_count = 0) as an alternative to define a bgworker
> crash. Hence, I've added an argument named 'wasCrashed' in
> ForgetBackgroundWorker to indicate a bgworker crash.

Did you intend to attach that patch to this email?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Wed, Apr 5, 2017 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 6:36 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Yes. But, as Robert suggested up in the thread, we should not use
>> (parallel_register_count = 0) as an alternative to define a bgworker
>> crash. Hence, I've added an argument named 'wasCrashed' in
>> ForgetBackgroundWorker to indicate a bgworker crash.
>
> Did you intend to attach that patch to this email?
>
Actually, I'm confused about how we should enforce the (register_count >
terminate_count) invariant. I think the assert Tomas suggested up in the
thread can bring down the system:

Assert(parallel_register_count - parallel_terminate_count <=
max_parallel_workers);

Backend 1 > SET max_parallel_workers = 8;
Backend 1 > Execute a long-running parallel query q1 that spawns, say,
4 parallel workers.
Backend 2 > SET max_parallel_workers = 3;
Backend 2 > Try to execute any parallel query q2 that spawns > 0
parallel workers.

The above assert statement will bring down the server unnecessarily
while executing q2. If the assert statement were not there, the query
could have gone ahead without launching any workers.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Robert Haas
Date:
On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>> Did you intend to attach that patch to this email?
>>
> Actually, I'm confused how we should ensure (register_count >
> terminate_count) invariant. I think there can be a system crash what
> Tomas has suggested up in the thread.
>
> Assert(parallel_register_count - parallel_terminate_count <=
> max_parallel_workers);
> Backend 1 > SET max_parallel_worker = 8;
> Backend 1 > Execute a long running parallel query q1 with number of
> parallel worker spawned is say, 4.

At this point, parallel_register_count should be equal to
parallel_terminate_count.  4 workers were started, and 4 have
terminated.

> Backend 2> SET max_parallel_worker = 3;
> Backend 2 > Try to execute any parallel query q2 with number of
> parallel worker spawned > 0.

Now here parallel_register_count should get bumped up to 4+(# of
workers now launched) at the beginning and then
parallel_terminate_count at the end.  No problem.

What's going wrong, here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:
On 04/05/2017 04:09 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 7:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Apr 5, 2017 at 6:36 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>> Yes. But, as Robert suggested up in the thread, we should not use
>>> (parallel_register_count = 0) as an alternative to define a bgworker
>>> crash. Hence, I've added an argument named 'wasCrashed' in
>>> ForgetBackgroundWorker to indicate a bgworker crash.
>>
>> Did you intend to attach that patch to this email?
>>
> Actually, I'm confused how we should ensure (register_count >
> terminate_count) invariant. I think there can be a system crash what
> Tomas has suggested up in the thread.
>
> Assert(parallel_register_count - parallel_terminate_count <=
> max_parallel_workers);
> Backend 1 > SET max_parallel_worker = 8;
> Backend 1 > Execute a long running parallel query q1 with number of
> parallel worker spawned is say, 4.
> Backend 2> SET max_parallel_worker = 3;
> Backend 2 > Try to execute any parallel query q2 with number of
> parallel worker spawned > 0.
>
> The above assert statement will bring down the server unnecessarily
> while executing q2. If the assert statement was not there, it could
> have gone ahead without launching any workers.
>

Ah, right. I forgot max_parallel_workers may be changed per session. I 
think we can use max_worker_processes instead:

  Assert(parallel_register_count - parallel_terminate_count <= 
         max_worker_processes);

The whole point is that if parallel_terminate_count exceeds 
parallel_register_count, the unsigned subtraction wraps around to a 
value close to UINT_MAX. All we need is a maximum possible delta between 
the two values to detect this, and max_worker_processes seems fine. We 
could also use UINT_MAX/2, for example.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:
On 04/05/2017 04:15 PM, Robert Haas wrote:
> On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>>> Did you intend to attach that patch to this email?
>>>
>> Actually, I'm confused how we should ensure (register_count >
>> terminate_count) invariant. I think there can be a system crash what
>> Tomas has suggested up in the thread.
>>
>> Assert(parallel_register_count - parallel_terminate_count <=
>> max_parallel_workers);
>> Backend 1 > SET max_parallel_worker = 8;
>> Backend 1 > Execute a long running parallel query q1 with number of
>> parallel worker spawned is say, 4.
>
> At this point, parallel_register_count should be equal to
> parallel_terminate_count.  4 workers were started, and 4 have
> terminated.
>

No, the workers were started, but are still running, so
    register_count = 4    terminate_count = 0

>> Backend 2> SET max_parallel_worker = 3;
>> Backend 2 > Try to execute any parallel query q2 with number of
>> parallel worker spawned > 0.
>
> Now here parallel_register_count should get bumped up to 4+(# of
> workers now launched) at the beginning and then
> parallel_terminate_count at the end.  No problem.
>
> What's going wrong, here?
>

Well, assuming the other workers are still running, you get
   register_count = 4   terminate_count = 0

and the difference is greater than max_parallel_workers = 3. Which 
violates the assert.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Wed, Apr 5, 2017 at 7:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>>> Did you intend to attach that patch to this email?
>>>
>> Actually, I'm confused how we should ensure (register_count >
>> terminate_count) invariant. I think there can be a system crash what
>> Tomas has suggested up in the thread.
>>
>> Assert(parallel_register_count - parallel_terminate_count <=
>> max_parallel_workers);
>> Backend 1 > SET max_parallel_worker = 8;
>> Backend 1 > Execute a long running parallel query q1 with number of
>> parallel worker spawned is say, 4.
>
> At this point, parallel_register_count should be equal to
> parallel_terminate_count.  4 workers were started, and 4 have
> terminated.
>
Actually, I'm referring to the case when q1 is still running. In that
case, parallel_register_count = 4 and parallel_terminate_count = 0.

>> Backend 2> SET max_parallel_worker = 3;
Now, parallel_register_count - parallel_terminate_count = 4 >
max_parallel_workers.

>> Backend 2 > Try to execute any parallel query q2 with number of
>> parallel worker spawned > 0.
>
Hence, the assert will fail here.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: strange parallel query behavior after OOM crashes

From
Tomas Vondra
Date:
On 04/05/2017 04:26 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 7:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Apr 5, 2017 at 10:09 AM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>>> Did you intend to attach that patch to this email?
>>>>
>>> Actually, I'm confused how we should ensure (register_count >
>>> terminate_count) invariant. I think there can be a system crash what
>>> Tomas has suggested up in the thread.
>>>
>>> Assert(parallel_register_count - parallel_terminate_count <=
>>> max_parallel_workers);
>>> Backend 1 > SET max_parallel_worker = 8;
>>> Backend 1 > Execute a long running parallel query q1 with number of
>>> parallel worker spawned is say, 4.
>>
>> At this point, parallel_register_count should be equal to
>> parallel_terminate_count.  4 workers were started, and 4 have
>> terminated.
>>
> Actually, I'm referring to the case when q1 is still running. In that
> case, parallel_register_count = 4 and parallel_terminate_count = 0.
>
>>> Backend 2> SET max_parallel_worker = 3;
> Now, parallel_register_count - parallel_terminate_count = 4 >
> max_parallel_worker.
>
>>> Backend 2 > Try to execute any parallel query q2 with number of
>>> parallel worker spawned > 0.
>>
> Hence, the assert will fail here.
>

Actually, you probably don't even need two sessions to trigger the 
assert. All you need is to tweak the max_parallel_workers and then 
reload the config while the parallel query is running. Then 
ForgetBackgroundWorker() will see the new value.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: strange parallel query behavior after OOM crashes

From
Neha Khatri
Date:
On Wed, Apr 5, 2017 at 5:34 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
On Tue, Apr 4, 2017 at 12:16 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:

> I feel there should be an assert if
>     (BackgroundWorkerData->parallel_register_count -
> BackgroundWorkerData->parallel_terminate_count) > max_parallel_workers)
>
Backend 1 > SET max_parallel_worker = 8;
Backend 1 > Execute a long running parallel query q1 with number of
parallel worker spawned is say, 4.
Backend 2> SET max_parallel_worker = 3;
Backend 2 > Execute a long running parallel query q2 with number of
parallel worker spawned > 0.

The above assert statement will bring down the server unnecessarily
while executing q2.

Right, with multiple backends trying to fiddle with max_parallel_workers, that might bring the server down with the said assert:
    Assert(parallel_register_count - parallel_terminate_count <= max_parallel_workers)

The problem here seems to be the change in the max_parallel_workers value while the parallel workers are still executing. So this poses two questions:

1. From a use-case point of view, why would there be a need to tweak max_parallel_workers exactly when the parallel workers are at play?
2. Could there be a restriction on tweaking max_parallel_workers while the parallel workers are at play? At least, do not allow setting max_parallel_workers lower than the current number of active parallel workers.

Regards,
Neha


Re: strange parallel query behavior after OOM crashes

From
Robert Haas
Date:
On Wed, Apr 5, 2017 at 8:17 PM, Neha Khatri <nehakhatri5@gmail.com> wrote:
> The problem here seem to be the change in the max_parallel_workers value
> while the parallel workers are still under execution. So this poses two
> questions:
>
> 1. From usecase point of view, why could there be a need to tweak the
> max_parallel_workers exactly at the time when the parallel workers are at
> play.
> 2. Could there be a restriction on tweaking of max_parallel_workers while
> the parallel workers are at play? At least do not allow setting the
> max_parallel_workers less than the current # of active parallel workers.

Well, that would be letting the tail wag the dog.  The maximum value
of max_parallel_workers is only 1024, and what we're really worried
about here is seeing a value near PG_UINT32_MAX, which leaves a lot of
daylight.  How about just creating a #define that's used by guc.c as
the maximum for the GUC, and here we assert that we're <= that value?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: strange parallel query behavior after OOM crashes

From
Kuntal Ghosh
Date:
On Wed, Apr 5, 2017 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
>>> I'm probably missing something, but I don't quite understand how these
>>> values actually survive the crash. I mean, what I observed is OOM followed
>>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>>> some reason?
>> AFAICU, during crash recovery, we wait for all non-syslogger children
>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
>> StartupDataBase. While starting the startup process we check if any
>> bgworker is scheduled for a restart.
>>
>
> In general, your theory appears right, but can you check how it
> behaves in standby server because there is a difference in how the
> startup process behaves during master and standby startup?  In master,
> it stops after recovery whereas in standby it will keep on running to
> receive WAL.
>
While performing StartupDataBase, both master and standby servers
behave the same way until the postmaster spawns the startup process.
In master, the startup process completes its job and dies. As a result,
reaper() is called, which in turn calls maybe_start_bgworker().
In standby, after getting a valid snapshot, the startup process signals
the postmaster to enable connections. The signal handler in the
postmaster calls maybe_start_bgworker().
In maybe_start_bgworker(), if we find a crashed bgworker (crashed_at !=
0) registered with BGW_NEVER_RESTART, we call ForgetBackgroundWorker()
to forget the bgworker process.

I've attached a patch that adds an argument to ForgetBackgroundWorker()
to indicate a crash situation; based on that, we can take the necessary
actions. I've not included the Assert statement in this patch.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment