Re: strange parallel query behavior after OOM crashes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: strange parallel query behavior after OOM crashes
Date
Msg-id da03d726-e87d-ef5e-396d-188549e6cc23@2ndquadrant.com
Whole thread Raw
In response to Re: strange parallel query behavior after OOM crashes  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
Responses Re: strange parallel query behavior after OOM crashes  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
List pgsql-hackers

On 04/05/2017 12:36 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 04/05/2017 09:05 AM, Kuntal Ghosh wrote:
>>>
>>> AFAICU, during crash recovery, we wait for all non-syslogger children
>>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
>>> StartupDataBase. While starting the startup process we check if any
>>> bgworker is scheduled for a restart. If a bgworker is crashed and not
>>> meant for a restart(parallel worker in our case), we call
>>> ForgetBackgroundWorker() to discard it.
>>>
>>
>> OK, so essentially we reset the counters, getting
>>
>>      parallel_register_count = 0
>>      parallel_terminate_count = 0
>>
>> and then ForgetBackgroundWworker() runs and increments the terminate_count,
>> breaking the logic?
>>
>> And you're using (parallel_register_count > 0) to detect whether we're still
>> in the init phase or not. Hmmm.
>>
> Yes. But, as Robert suggested up in the thread, we should not use
> (parallel_register_count = 0) as an alternative to define a bgworker
> crash. Hence, I've added an argument named 'wasCrashed' in
> ForgetBackgroundWorker to indicate a bgworker crash.
> 

Sure, and I agree that having an explicit flag is the right solution. 
I'm just trying to make sure I understand what's happening.

>>
>> The comment says that the counters are allowed to overflow, i.e. after a
>> long uptime you might get these values
>>
>>      parallel_register_count = UINT_MAX + 1 = 1
>>      parallel_terminate_count = UINT_MAX
>>
>> which is fine, because the C handles the overflow during subtraction and so
>>
>>      parallel_register_count - parallel_terminate_count = 1
>>
>> But the assert is not doing subtraction, it's comparing the values directly:
>>
>>      Assert(parallel_register_count >= parallel_terminate_count);
>>
>> and the (perfectly valid) values trivially violate this comparison.
>>
> Thanks for the explanation. So, we can't use the above assert
> statement. Even the following assert statement will not be helpful:
> Assert(parallel_register_count - parallel_terminate_count >= 0);
> Because, it'll fail to track the scenario when parallel_register_count
> is not overflowed, still less than parallel_terminate_count. :(
> 

Actually, that assert would work, because C does handle overflows on 
uint values during subtraction just fine. That is,
    (UINT_MAX+x) - UINT_MAX = x

Also, the comment about overflows before BackgroundWorkerArray claims 
this is the case.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Kuntal Ghosh
Date:
Subject: Re: strange parallel query behavior after OOM crashes
Next
From: Kuntal Ghosh
Date:
Subject: Re: strange parallel query behavior after OOM crashes