Re: [HACKERS] Issues with logical replication - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] Issues with logical replication
Date
Msg-id 68879b3a-46c0-0aab-f6a2-d374010b4e71@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] Issues with logical replication  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Issues with logical replication  (Stas Kelvich <s.kelvich@postgrespro.ru>)
List pgsql-hackers
Hi,

(sorry for not being active here, I am still catching up after being
away for some family issues)

On 16/11/17 21:12, Robert Haas wrote:
> On Thu, Nov 16, 2017 at 2:41 PM, Andres Freund <andres@anarazel.de> wrote:
>>> To me, it seems like SnapBuildWaitSnapshot() is fundamentally
>>> misdesigned
>>
>> Maybe I'm confused, but why is it fundamentally misdesigned? It's not
>> such an absurd idea to wait for an xid in a WAL record.  I get that
>> there's a race condition here, which obviously bad, but I don't really
>> see as evidence of the above claim.
>>
>> I actually think this code used to be safe because ProcArrayLock used to
>> be held while generating and logging the running snapshots record.  That
>> was removed when fixing some other bug, but perhaps that shouldn't have
>> been done...
> 
> OK.  Well, I might be overstating the case.  My comment about
> fundamental misdesign was really just based on the assumption that
> XactLockTableWait() could be used to wait for an XID the instant it
> was generated.  That was never gonna work and there's no obvious clean
> workaround for the problem.  Getting snapshot building to work
> properly seems to be Hard (TM).
> 

No kidding, we've been at it since 9.4.

But anyway, while sure we have race condition because
XactLockTableWait() finishes too early, all that should mean is we call
LogStandbySnapshot() too early and as a result taking snapshot (and
hence slot creation) will take longer as we'll wait for next call of
LogStandbySnapshot() from some other caller (because the tx we care
about will still be running). The whole SnapBuildWaitSnapshot() is just
optimization to make slot creation take less time (and also to be able
to write tests).

What I don't understand is how it leads to crash (and I could not
reproduce it using the pgbench file attached in this thread either) and
moreover how it leads to 0 xid being logged. The only explanation I can
come up is that some kind of similar race has to be in
LogStandbySnapshot() but we explicitly check for 0 xid value there.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services


pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: using index or check in ALTER TABLE SET NOT NULL
Next
From: Tom Lane
Date:
Subject: Re: using index or check in ALTER TABLE SET NOT NULL