Re: [HACKERS] Issues with logical replication - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] Issues with logical replication
Date
Msg-id 74a30eb7-4c1c-29cf-5e13-811b2c5ebd3c@2ndquadrant.com
Whole thread Raw
In response to Re: [HACKERS] Issues with logical replication  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] Issues with logical replication  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 30/11/17 00:40, Andres Freund wrote:
> On 2017-11-30 00:25:58 +0100, Petr Jelinek wrote:
>> Yes that helps thanks. Now that I reproduced it I understand, I was
>> confused by the backtrace that said xid was 0 on the input to
>> XactLockTableWait() but that's not the case, it's what xid is changed to
>> in the inner loop.
> 
>> So what happens is that we manage to do LogStandbySnapshot(), decode the
>> logged snapshot, and run SnapBuildWaitSnapshot() for a transaction in
>> between GetNewTransactionId() and XactLockTableInsert() calls in
>> AssignTransactionId() for that same transaction.
>>
>> I guess the probability of this happening is increased by the fact that
>> GetRunningTransactionData() acquires XidGenLock so if there is
>> GetNewTransactionId() running in parallel it will wait for it to finish
>> and we'll log immediately after that.
>>
>> Hmm that means that Robert's loop idea will not help and ProcArrayLock
>> will not save us either. Maybe we could either rewrite XactLockTableWait
>> or create another version of it with SubTransGetParent() call replaced
>> by SubTransGetTopmostTransaction() as that will return the same top
>> level xid in case the input xid wasn't a subxact. That would make it
>> safe to be called on transactions that didn't acquire lock on themselves
>> yet.
> 
> I've not really looked into this deeply, but afair we can just make this
> code accept that edgecase be done with it. As the comment says:
> 
>  * Iterate through xids in record, wait for all older than the cutoff to
>  * finish.  Then, if possible, log a new xl_running_xacts record.
>  *
> --- highlight ---
>  * This isn't required for the correctness of decoding, but to:
> --- highlight ---
>  * a) allow isolationtester to notice that we're currently waiting for
>  *      something.
>  * b) log a new xl_running_xacts record where it'd be helpful, without having
>  *      to write for bgwriter or checkpointer.
> 

I don't understand. I mean sure the SnapBuildWaitSnapshot() can live
with it, but the problematic logic happens inside the
XactLockTableInsert() and SnapBuildWaitSnapshot() has no way of
detecting the situation short of reimplementing the
XactLockTableInsert() instead of calling it.

The problem is that SubTransGetParent() returns InvalidTransactionId
when the race happens and SubTransGetParen() is called because
XactLockTableInsert() assumes that if transaction lock was acquired and
the xid is still in progress, the input must have been xid of a
subtransaction). This is why I suggested replacing that call with
SubTransGetTopmostTransaction() which will just return same xid it got
on input and we'll simply retry to do the lock instead of crashing.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] Issues with logical replication
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] Issues with logical replication