Logical decoding restart problems - Mailing list pgsql-hackers

From konstantin knizhnik
Subject Logical decoding restart problems
Date
Msg-id CC0B0B78-273C-4D5D-B4B2-6BD2A774B303@postgrespro.ru
Responses Re: Logical decoding restart problems  (Petr Jelinek <petr@2ndquadrant.com>)
Re: Logical decoding restart problems  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
Hi,

We are using logical decoding in multimaster and have run into a problem: inconsistent transactions are sent to the replica.
Briefly, multimaster uses logical decoding as follows:
1. Each multimaster node is connected to every other node through a logical decoding channel, so each pair of nodes has its own replication slot.
2. In the normal scenario, each replication channel replicates only those transactions that originated at the source node.
We use the origin mechanism to skip "foreign" transactions.
3. When an offline cluster node returns to the multimaster, we need to recover it to the current cluster state.
Recovery is performed from one of the cluster's nodes, so we use only one replication channel to receive all (own and foreign) transactions.
Only in this way can we guarantee a consistent order of applying transactions at the recovered node.
After recovery ends, we need to recreate the replication slots with all other cluster nodes (because we have already replayed the transactions from these nodes).
To restart logical decoding, we first drop the existing slot, then create a new one, and then start logical replication from WAL position 0/0 (invalid LSN).
In this case, recovery should start from the last consistent point.
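
The channel rule in points 2 and 3 can be sketched as a small predicate. This is only an illustrative model, not multimaster's actual code: `NodeId`, `ChannelMode` and `should_send` are hypothetical names introduced here.

```c
#include <stdbool.h>

/* Hypothetical node identifier; not a PostgreSQL or multimaster type. */
typedef int NodeId;

/* Hypothetical channel modes matching points 2 and 3 above. */
typedef enum { CHANNEL_NORMAL, CHANNEL_RECOVERY } ChannelMode;

/*
 * Decide whether a transaction is sent over a channel whose source is
 * source_node.  In normal mode only transactions that originated at the
 * source node pass; in recovery mode own and foreign transactions pass.
 */
bool
should_send(ChannelMode mode, NodeId source_node, NodeId txn_origin)
{
    if (mode == CHANNEL_RECOVERY)
        return true;
    return txn_origin == source_node;
}
```

For example, `should_send(CHANNEL_NORMAL, 1, 2)` rejects a transaction that originated at node 2 on a channel sourced from node 1, while the same call with `CHANNEL_RECOVERY` accepts it.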

The problem is that, for some reason, the consistent point is not so consistent and we get partly decoded transactions.
That is, the transaction body consists of two UPDATEs, but reorder_buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things that I do not understand.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica, and the replica confirms flush_lsn=300.
If we now want to restart logical decoding, we cannot start from a position less than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *      The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *      from the slot's confirmed_flush; otherwise, start from the specified
 *      location (but move it forwards to confirmed_flush if it's older than
 *      that, see below).
 *

    else if (start_lsn < slot->data.confirmed_flush)
    {
        /*
         * It might seem like we should error out in this case, but it's
         * pretty common for a client to acknowledge a LSN it doesn't have to
         * do anything for, and thus didn't store persistently, because the
         * xlog records didn't result in anything relevant for logical
         * decoding. Clients have to be able to do that to support synchronous
         * replication.
         */
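
Read literally, the comment above describes the following start-position rule. The sketch below is a standalone restatement of that comment, not the PostgreSQL source: `effective_start_lsn` is a hypothetical helper, and the `XLogRecPtr` typedef and `InvalidXLogRecPtr` macro are local stand-ins for the real definitions.

```c
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions (assumed here). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: the position decoding starts from, given the
 * requested start_lsn and the slot's confirmed_flush.
 */
XLogRecPtr
effective_start_lsn(XLogRecPtr start_lsn, XLogRecPtr confirmed_flush)
{
    /* No position requested: restart from confirmed_flush. */
    if (start_lsn == InvalidXLogRecPtr)
        return confirmed_flush;

    /* Older than confirmed_flush: moved forwards instead of erroring out. */
    if (start_lsn < confirmed_flush)
        return confirmed_flush;

    return start_lsn;
}
```

With confirmed_flush=300, a request for position 100 (the start of T1) yields 300, as does a request for 0/0.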

So does this mean that we have no chance to restore T1?
What is worse, if there are valid T1 transaction records with lsn >= 300, then we can partly decode T1 and send this T1' to the replica.
Am I missing something here?
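
The concern about T1' can be illustrated with a toy model. It assumes that a reader sees only records at or after the start position, which is the premise of the question rather than a confirmed description of how PostgreSQL decodes. The `WalRecord` layout, the `changes_seen` helper and the position of T1's second change (350) are assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;   /* local stand-in for the PostgreSQL type */

typedef enum { REC_CHANGE, REC_COMMIT } RecKind;

/* Hypothetical, heavily simplified WAL record. */
typedef struct
{
    XLogRecPtr lsn;
    int        xid;    /* 1 = T1, 2 = T2 */
    RecKind    kind;
} WalRecord;

/* T1 spans 100..400 and T2 spans 200..300, as in the example above. */
const WalRecord example_wal[] = {
    {100, 1, REC_CHANGE},   /* T1: first UPDATE */
    {200, 2, REC_CHANGE},   /* T2: UPDATE */
    {300, 2, REC_COMMIT},   /* T2: commit, confirmed by the replica */
    {350, 1, REC_CHANGE},   /* T1: second UPDATE (position assumed) */
    {400, 1, REC_COMMIT},   /* T1: commit */
};
const size_t example_wal_len = sizeof(example_wal) / sizeof(example_wal[0]);

/*
 * Count the changes of transaction xid visible to a reader that only
 * looks at records with lsn >= start_lsn.
 */
size_t
changes_seen(const WalRecord *wal, size_t n, int xid, XLogRecPtr start_lsn)
{
    size_t seen = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (wal[i].xid == xid && wal[i].kind == REC_CHANGE &&
            wal[i].lsn >= start_lsn)
            seen++;
    }
    return seen;
}
```

Under this model, reading from position 100 sees both of T1's changes, while reading from 300 sees only the second one, which is the truncated T1' described above.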

Is there any alternative way to "seek" a slot to the proper position without actually fetching data from it or recreating the slot?
Is there any mechanism in xlog that can enforce consistent decoding of a transaction (so that no transaction records are missed)?
Maybe I missed something, but I didn't find any "record_number" or anything else that can identify the first record of a transaction.

Thanks in advance,
Konstantin Knizhnik,
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
