Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)?
Date
Msg-id CAEepm=2HmnTnCuYaWZvetRP3nsW1hMWtq+P8m+8pDh389in1wA@mail.gmail.com
Whole thread Raw
In response to Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Mar 5, 2018 at 4:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 03/04/2018 10:27 AM, Thomas Munro wrote:
>> I can fix it with the following patch, which writes XXX out to the log
>> where it would otherwise miss a final message sent just before
>> detaching with sufficiently bad timing/memory ordering.  This patch
>> isn't my proposed fix, it's just a demonstration of what's busted.
>> There could be a better way to structure things than this.
>
> I can confirm this resolves the issue for me. Before the patch, I've
> seen 112 failures in ~11500 runs. With the patch I saw 0 failures, but
> about 100 messages XXX in the log.
>
> So my conclusion is that your analysis is likely correct.

Thanks!  Here are a couple of patches.  I'm not sure which I prefer.
The "pessimistic" one looks simpler and is probably the way to go, but
the "optimistic" one avoids doing an extra read until it has actually
run out of data and seen mq_detached == true.

I realised that the pg_write_barrier() added to
shm_mq_detach_internal() from the earlier demonstration/hack patch was
not needed... I had a notion that SpinLockAcquire() might not include
a strong enough barrier (unlike SpinLockRelease()), but after reading
s_lock.h I think it's not needed (since you get either TAS() or a
syscall-based slow path, both expected to include a full fence).  I
haven't personally tested this on a weak memory order system.

Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [HACKERS] user-defined numeric data types triggering ERROR:unsupported type
Next
From: Daniel Gustafsson
Date:
Subject: Re: [HACKERS] Support for Secure Transport SSL library on macOS as OpenSSL alternative