Re: strange parallel query behavior after OOM crashes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: strange parallel query behavior after OOM crashes
Date
Msg-id eaf07984-078f-1918-1304-237934b9fe20@2ndquadrant.com
Whole thread Raw
In response to Re: strange parallel query behavior after OOM crashes  (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
List pgsql-hackers

On 04/05/2017 01:09 PM, Kuntal Ghosh wrote:
> On Wed, Apr 5, 2017 at 4:13 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>> The comment says that the counters are allowed to overflow, i.e. after a
>>>> long uptime you might get these values
>>>>
>>>>       parallel_register_count = UINT_MAX + 1 = 1
>>>>       parallel_terminate_count = UINT_MAX
>>>>
>>>> which is fine, because the C handles the overflow during subtraction and
>>>> so
>>>>
>>>>       parallel_register_count - parallel_terminate_count = 1
>>>>
>>>> But the assert is not doing subtraction, it's comparing the values
>>>> directly:
>>>>
>>>>       Assert(parallel_register_count >= parallel_terminate_count);
>>>>
>>>> and the (perfectly valid) values trivially violate this comparison.
>>>>
>>> Thanks for the explanation. So, we can't use the above assert
>>> statement. Even the following assert statement will not be helpful:
>>> Assert(parallel_register_count - parallel_terminate_count >= 0);
>>> Because, it'll fail to track the scenario when parallel_register_count
>>> is not overflowed, still less than parallel_terminate_count. :(
>>>
>>
>> Actually, that assert would work, because C does handle overflows on uint
>> values during subtraction just fine. That is,
>>
>>      (UINT_MAX+x) - UINT_MAX = x
>>
>> Also, the comment about overflows before BackgroundWorkerArray claims this
>> is the case.
>>
> Agreed on the overflowed case. But, my concern is when an overflow has
> not yet happened:
> 
> Suppose,
> uint parallel_register_count = 1; /* Didn't overflow* /
> uint parallel_terminate_count = 2; /* Didn't overflow */
> 
> Assert(parallel_register_count - parallel_terminate_count >= 0);
> We want the assert statement to fail here, but I think it won't since
> -1 has a valid representation in unsigned int and it is greater than
> 0, no?

I don't follow. How exactly do you get into this situation, assuming you 
actually enforce the (register_count > terminate_count) invariant 
consistently? In the modulo arithmetic of course.

But now that I'm thinking about it, the subtraction actually happens in 
unsigned ints, so the result will be unsigned int too, i.e. always >=0. 
Whether it makes sense depends on the invariant being true.

But I think this would work:

Assert(parallel_register_count - parallel_terminate_count <= 
max_parallel_workers);

If the difference gets 'negative' and wraps around, it'll be close to 
UINT_MAX.

But man, this unsigned int arithmetic makes my brain hurt ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: increasing the default WAL segment size
Next
From: Peter Eisentraut
Date:
Subject: Re: Functions Immutable but not parallel safe?