Re: [HACKERS] Issues with logical replication - Mailing list pgsql-hackers

From	Petr Jelinek
Subject	Re: [HACKERS] Issues with logical replication
Date	November 30, 2017 05:45:44
Msg-id	74a30eb7-4c1c-29cf-5e13-811b2c5ebd3c@2ndquadrant.com
In response to	Re: [HACKERS] Issues with logical replication (Andres Freund <andres@anarazel.de>)
Responses	Re: [HACKERS] Issues with logical replication
List	pgsql-hackers

On 30/11/17 00:40, Andres Freund wrote:
> On 2017-11-30 00:25:58 +0100, Petr Jelinek wrote:
>> Yes that helps thanks. Now that I reproduced it I understand, I was
>> confused by the backtrace that said xid was 0 on the input to
>> XactLockTableWait() but that's not the case, it's what xid is changed to
>> in the inner loop.
> 
>> So what happens is that we manage to do LogStandbySnapshot(), decode the
>> logged snapshot, and run SnapBuildWaitSnapshot() for a transaction in
>> between GetNewTransactionId() and XactLockTableInsert() calls in
>> AssignTransactionId() for that same transaction.
>>
>> I guess the probability of this happening is increased by the fact that
>> GetRunningTransactionData() acquires XidGenLock so if there is
>> GetNewTransactionId() running in parallel it will wait for it to finish
>> and we'll log immediately after that.
>>
>> Hmm that means that Robert's loop idea will not help and ProcArrayLock
>> will not save us either. Maybe we could either rewrite XactLockTableWait
>> or create another version of it with SubTransGetParent() call replaced
>> by SubTransGetTopmostTransaction() as that will return the same top
>> level xid in case the input xid wasn't a subxact. That would make it
>> safe to be called on transactions that didn't acquire lock on themselves
>> yet.
> 
> I've not really looked into this deeply, but afair we can just make this
> code accept that edge case and be done with it. As the comment says:
> 
>  * Iterate through xids in record, wait for all older than the cutoff to
>  * finish.  Then, if possible, log a new xl_running_xacts record.
>  *
>  * This isn't required for the correctness of decoding, but to:
>  * a) allow isolationtester to notice that we're currently waiting for
>  *      something.
>  * b) log a new xl_running_xacts record where it'd be helpful, without having
>  *      to write for bgwriter or checkpointer.
> 

I don't understand. Sure, SnapBuildWaitSnapshot() can live with it, but
the problematic logic is inside XactLockTableWait(), and
SnapBuildWaitSnapshot() has no way of detecting the situation short of
reimplementing XactLockTableWait() instead of calling it.

The problem is that SubTransGetParent() returns InvalidTransactionId
when the race happens. (SubTransGetParent() is called because
XactLockTableWait() assumes that if the transaction lock was acquired
and the xid is still in progress, the input must have been the xid of a
subtransaction.) This is why I suggested replacing that call with
SubTransGetTopmostTransaction(), which will just return the same xid it
got on input, so we'll simply retry the lock instead of crashing.
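
To illustrate the difference, here is a toy model of the wait loop. It is
not PostgreSQL code: every toy_* name, the fixed parent table, and the
"ticks" counter standing in for the transaction still being in progress
are assumptions made up for this sketch. It only shows that a "get
parent" step turns a top-level xid into the invalid xid during the race
window, while a "get topmost" step hands back the same xid so the loop
can retry.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int ToyXid;

#define TOY_INVALID_XID 0u
#define TOY_MAX_XID     8

/*
 * Hypothetical subtransaction map: toy_parent_of[x] is the parent of x,
 * or TOY_INVALID_XID if x is a top-level xid.  Here 5 -> 4 -> 3, and 6
 * is a top-level xid.
 */
static const ToyXid toy_parent_of[TOY_MAX_XID] = {0, 0, 0, 0, 3, 4, 0, 0};

/* Stand-in for a "get parent" lookup: invalid for a top-level xid. */
ToyXid toy_get_parent(ToyXid xid)
{
    assert(xid < TOY_MAX_XID);
    return toy_parent_of[xid];
}

/* Stand-in for a "get topmost" lookup: a top-level xid maps to itself. */
ToyXid toy_get_topmost(ToyXid xid)
{
    assert(xid < TOY_MAX_XID);
    while (toy_parent_of[xid] != TOY_INVALID_XID)
        xid = toy_parent_of[xid];
    return xid;
}

/*
 * Toy wait loop.  ticks_until_done models how many polls see the target
 * as still in progress while it has not yet taken the lock on its own
 * xid, so our lock acquire/release would succeed immediately.
 *
 * Returns true if the wait completes, false if xid became invalid (the
 * point where the real code would fail instead of retrying).
 */
bool toy_wait(ToyXid xid, bool use_topmost, int ticks_until_done)
{
    for (;;)
    {
        if (xid == TOY_INVALID_XID)
            return false;

        /* lock acquire + release on xid would go here */

        if (ticks_until_done-- <= 0)
            return true;    /* no longer in progress, done waiting */

        xid = use_topmost ? toy_get_topmost(xid) : toy_get_parent(xid);
    }
}
```

With this model, toy_wait(6, false, 2) gives up on the second pass
because the parent of top-level xid 6 is invalid, while
toy_wait(6, true, 2) keeps retrying with xid 6 until the counter runs
out.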

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services