Re: Logical decoding restart problems - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: Logical decoding restart problems
Msg-id 9b686524-60e5-3dcb-cda1-af01d1ed8145@2ndquadrant.com
In response to Logical decoding restart problems  (konstantin knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: Logical decoding restart problems  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
On 19/08/16 09:34, konstantin knizhnik wrote:
>
> We are using logical decoding in multimaster and we are faced with the
> problem that inconsistent transactions are sent to replica.
> Briefly, multimaster is using logical decoding in this way:
> 1. Each multimaster node is connected with each other using logical
> decoding channel and so each pair of nodes
> has its own replication slot.
> 2. In normal scenario each replication channel is used to replicate only
> those transactions which were originated at the source node.
> We are using origin mechanism to skip "foreign" transactions.
> When offline cluster node is returned back to the multimaster we need
> to recover this node to the current cluster state.
> Recovery is performed from one of the cluster's node. So we are using
> only one replication channel to receive all (self and foreign) transactions.
> Only in this case we can guarantee consistent order of applying
> transactions at recovered node.
> After the end of recovery we need to recreate the replication slots with
> all other cluster nodes (because we have already replayed transactions
> from these nodes).
> To restart logical decoding we first drop the existing slot, then create
> a new one and then start logical replication from the WAL position 0/0
> (invalid LSN).
> In this case recovery should be started from the last consistent point.
>

I don't think this will work correctly. There will be a gap between the
drop of the old slot and the point where the new slot starts to decode,
because the new slot first needs to build a snapshot.

Do I understand correctly that you are not using replication origins?

> The problem is that for some reason the consistent point is not so
> consistent and we get partly decoded transactions.
> I.e. the transaction body consists of two UPDATEs but reorder_buffer
> extracts only one (the last) update and sends this truncated transaction
> to the destination, causing a consistency violation at the replica.
> I started investigating the logical decoding code and found several
> things which I do not understand.

I have never seen this happen. Do you have more details about what
exactly is happening?

>
> Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
> T2={start_lsn=200, end_lsn=300}.
> Transaction T2 is sent to the replica and replica confirms that
> flush_lsn=300.
> If now we want to restart logical decoding, we can not start with
> position less than 300, because CreateDecodingContext doesn't allow it:
>
>  * start_lsn
>  *      The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
>  *      from the slot's confirmed_flush; otherwise, start from the specified
>  *      location (but move it forwards to confirmed_flush if it's older than
>  *      that, see below).
>  *
>     else if (start_lsn < slot->data.confirmed_flush)
>     {
>         /*
>          * It might seem like we should error out in this case, but it's
>          * pretty common for a client to acknowledge a LSN it doesn't have to
>          * do anything for, and thus didn't store persistently, because the
>          * xlog records didn't result in anything relevant for logical
>          * decoding. Clients have to be able to do that to support synchronous
>          * replication.
>          */
>
> So does it mean that we have no chance to restore T1?
> What is worse, if there are valid T1 transaction records with lsn >=
> 300, then we can partly decode T1 and send this T1' to the replica.
> Did I miss something here?

Decoding always starts from the restart_lsn of the slot, not from
start_lsn. start_lsn is only used to decide which transactions to skip.
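
To make the distinction concrete, here is a minimal sketch in C using the
T1/T2 numbers from your example. It is a simplified model, not the
PostgreSQL implementation: the type and function names (Lsn, Xact,
effective_start, xacts_to_send) are made up for illustration, and the skip
test on the end of the transaction is an assumption that stands in for
what the real code does with the commit record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for XLogRecPtr */

typedef struct {
    int id;
    Lsn start_lsn;              /* first record of the transaction */
    Lsn end_lsn;                /* end of its commit record */
} Xact;

/*
 * Model of the rule quoted above: 0 (invalid LSN) means "use
 * confirmed_flush", and a requested start older than confirmed_flush
 * is moved forward to it.
 */
Lsn effective_start(Lsn requested, Lsn confirmed_flush)
{
    if (requested == 0 || requested < confirmed_flush)
        return confirmed_flush;
    return requested;
}

/*
 * Store in out[] the ids of the transactions that would be sent.
 * WAL is read from restart_lsn, so a transaction that began at or
 * after restart_lsn is reassembled from its first record. The start
 * position only decides whether a whole transaction is skipped.
 */
size_t xacts_to_send(const Xact *xacts, size_t n,
                     Lsn restart_lsn, Lsn start, int *out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        assert(xacts[i].start_lsn <= xacts[i].end_lsn);
        if (xacts[i].start_lsn < restart_lsn)
            continue;           /* outside the model: WAL not read */
        if (xacts[i].end_lsn <= start)
            continue;           /* already confirmed: skipped whole */
        out[count++] = xacts[i].id;
    }
    return count;
}
```

With T1={100,400}, T2={200,300}, restart_lsn=100 and confirmed_flush=300,
this model sends only T1, and it sends it from its first record rather
than from LSN 300.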

> Is there any alternative way to "seek" the slot to the proper position
> without actually fetching data from it or recreating the slot?

You can seek forward just fine: just specify the start position in the
START_REPLICATION command.
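
As a rough illustration, a client could build that command as below. This
is only a sketch: the helper name format_start_replication and the slot
name are hypothetical, and any output plugin options that would follow the
LSN in parentheses are left out. The LSN is written in the usual
high-word/low-word hexadecimal form.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a START_REPLICATION command for a logical slot into buf.
 * Returns the snprintf() result; a value >= size means the output
 * was truncated. The helper itself is hypothetical.
 */
int format_start_replication(char *buf, size_t size,
                             const char *slot, uint64_t lsn)
{
    uint32_t hi = (uint32_t) (lsn >> 32);   /* high 32 bits of the LSN */
    uint32_t lo = (uint32_t) lsn;           /* low 32 bits of the LSN */

    assert(buf != NULL && slot != NULL);
    assert(strlen(slot) > 0);

    return snprintf(buf, size,
                    "START_REPLICATION SLOT \"%s\" LOGICAL %" PRIX32 "/%" PRIX32,
                    slot, hi, lo);
}
```

A position older than the slot's confirmed_flush is moved forward by the
server, as the comment you quoted describes, so this only lets you seek
forward.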

> Is there any mechanism in xlog which can enforce consistent decoding of
> a transaction (so that no transaction records are missed)?
> Maybe I missed something, but I didn't find any "record_number" or
> anything else which can identify the first record of a transaction.

As I mentioned above, what you probably want to do is use replication
origins. When you use those, you get the origin info when decoding a
transaction, which you can then send downstream so that the downstream
node can update its idea of where it is for that origin. This is
especially useful for the transaction forwarding you are doing (see the
BDR and/or pglogical code for an example of that).
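
The bookkeeping this gives you on the downstream side can be sketched
roughly as follows. This is a toy in-memory model in C, not the
PostgreSQL replication origin API (which also makes the progress
durable). OriginProgress, origin_should_apply, origin_resume_lsn and
MAX_ORIGINS are hypothetical names, and origins are assumed to be small
integer node ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ORIGINS 8           /* hypothetical limit for this sketch */

/* Last commit LSN applied per origin node; zero-initialise before use. */
typedef struct {
    uint64_t remote_lsn[MAX_ORIGINS];
} OriginProgress;

/*
 * Decide whether a transaction that originated on node origin_id and
 * committed there at origin_commit_lsn still has to be applied. If so,
 * record the new progress for that origin. It does not matter which
 * channel delivered the transaction (direct or forwarded).
 */
bool origin_should_apply(OriginProgress *p, int origin_id,
                         uint64_t origin_commit_lsn)
{
    assert(origin_id >= 0 && origin_id < MAX_ORIGINS);

    if (origin_commit_lsn <= p->remote_lsn[origin_id])
        return false;           /* already applied, e.g. via forwarding */

    p->remote_lsn[origin_id] = origin_commit_lsn;
    return true;
}

/* Position to request when reconnecting directly to that origin. */
uint64_t origin_resume_lsn(const OriginProgress *p, int origin_id)
{
    assert(origin_id >= 0 && origin_id < MAX_ORIGINS);
    return p->remote_lsn[origin_id];
}
```

With progress tracked per origin, a node that was recovered through a
single channel knows how far it has got for every other node, and can ask
each existing slot to start from that position instead of dropping and
recreating the slots.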

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


