Re: Logical decoding restart problems - Mailing list pgsql-hackers

From: Craig Ringer
Subject: Re: Logical decoding restart problems
Msg-id: CAMsr+YGWap3QzC7B951a95MR-isxjrtzUwUAqZ_M=zJTRL1rMA@mail.gmail.com
In response to: Re: Logical decoding restart problems (konstantin knizhnik <k.knizhnik@postgrespro.ru>)
Responses: Re: Logical decoding restart problems (Stas Kelvich <s.kelvich@postgrespro.ru>)
List: pgsql-hackers


On 20 August 2016 at 14:56, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Thank you for the answers.

>> No, you don't need to recreate them. Just advance your replication identifier downstream and request a replay position in the future. Let the existing slot skip over unwanted data and resume where you want to start replay.
>>
>> You can advance the replication origins on the peers as you replay forwarded xacts from your master.
>>
>> Have a look at how the BDR code does this during "catchup mode" replay.
>>
>> So while your problem discussed below seems concerning, you don't have to drop and recreate slots like you are currently doing.

> The only reason for recreating the slot is that I want to move it to the current "horizon" and skip all pending transactions without explicitly specifying the restart position.

Why not just specify the restart position as the upstream server's xlog insert position?

Anyway, you _should_ specify the restart position. Otherwise, if there's concurrent write activity, you might have a gap between when you stop replaying from your forwarding slot on the recovery node and start replaying from the other nodes.

Again, really, go read the BDR catchup mode code. Really.
 
> If I do not drop the slot and just restart replication specifying position 0/0 (an invalid LSN), replication will continue from the current slot position in WAL, won't it?

The "current slot position" isn't in WAL. It's stored in the replication slot in pg_replslot/ . But yes, if you pass 0/0 it'll use the stored confirmed_flush_lsn from the replication slot.
 
> So there is no way to specify something like "start replication from the end of WAL", analogous to lseek(0, SEEK_END).

Correct, but you can fetch the server's xlog insert position separately and pass it.
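
E.g. something like this (a rough, untested sketch using psycopg2's replication support; 'myslot' and the DSN are placeholders):

    # Untested sketch: fetch the upstream insert position over a normal
    # connection, then ask the walsender to start replay from there.
    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection

    dsn = "dbname=postgres"  # placeholder DSN

    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_current_xlog_insert_location()")
            insert_lsn = cur.fetchone()[0]   # e.g. '0/16B2B58'

    repl_conn = psycopg2.connect(dsn,
                                 connection_factory=LogicalReplicationConnection)
    repl_cur = repl_conn.cursor()
    # Nothing that commits before insert_lsn will be sent to the client.
    repl_cur.start_replication(slot_name='myslot', start_lsn=insert_lsn,
                               decode=True)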

I guess I can see it being a little bit useful to be able to say "start decoding at the first commit after this command". Send a patch, see if Andres agrees.

I still think your whole approach is wrong and you need to use replication origins or similar to co-ordinate a consistent switchover.
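
The origin bookkeeping is just a few SQL calls, roughly like this (untested sketch; 'node_a' and the LSN are placeholders):

    # Untested sketch: record "everything from node_a up to this LSN is
    # already applied" so replay can skip it, then bind the apply session
    # to the origin so its progress is tracked crash-safely.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT pg_replication_origin_create('node_a')")
    cur.execute("SELECT pg_replication_origin_advance('node_a', %s)",
                ('0/16B2B58',))   # placeholder LSN: the switchover point
    cur.execute("SELECT pg_replication_origin_session_setup('node_a')")
    conn.commit()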
 
 
> The slot is created by the peer node using a standard libpq connection with replication=database in the connection string.


So, the walsender interface then.
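
For reference, creating the slot over that kind of connection looks roughly like this (untested sketch; slot and plugin names are placeholders):

    # Untested sketch: open a "replication=database" connection and issue
    # CREATE_REPLICATION_SLOT ... LOGICAL over the walsender protocol.
    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection, REPLICATION_LOGICAL

    conn = psycopg2.connect("dbname=postgres",  # placeholder DSN
                            connection_factory=LogicalReplicationConnection)
    cur = conn.cursor()
    cur.create_replication_slot('myslot', slot_type=REPLICATION_LOGICAL,
                                output_plugin='test_decoding')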
 
> The problem is that for some reason the consistent point is not so consistent, and we get partially decoded transactions.
> I.e. the transaction body consists of two UPDATEs, but the reorder buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation on the replica. I started investigating the logical decoding code and found several things which I do not understand.

Yeah, that sounds concerning and shouldn't happen.

> I looked at the replication code more closely and understand that my first concerns were wrong.
> Confirming the flush position should not prevent replaying transactions with smaller LSNs.

Strictly, confirming the flush position does not prevent replay of transactions *with changes* at lower LSNs. It does prevent replay of transactions that *commit* at lower LSNs.
 
> But unfortunately the problem is really present. Maybe it is caused by race conditions (although most logical decoder data is local to the backend).
> This is why I will try to create a reproducing scenario without multimaster.
 
> Yeah, but unfortunately it happens. We need to understand why...

Yes. I think we need a simple standalone test case. I've never yet seen a partially decoded transaction like this.
>> It's all already there. See logical decoding's use of xl_running_xacts.
> But how is this information persisted?

restart_lsn points to an xl_running_xacts record in WAL, which is of course persistent. The restart_lsn is persisted in the replication slot, as are catalog_xmin and confirmed_flush_lsn.
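
You can see all three on the slot itself, e.g. (untested sketch):

    # Untested sketch: the persistent per-slot state is visible in
    # pg_replication_slots.
    import psycopg2

    conn = psycopg2.connect("dbname=postgres")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("""
        SELECT slot_name, restart_lsn, catalog_xmin, confirmed_flush_lsn
          FROM pg_replication_slots
         WHERE slot_type = 'logical'
    """)
    for slot in cur.fetchall():
        print(slot)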
 
> What will happen if the walsender is restarted?

That's why the restart_lsn exists. Decoding restarts from the restart_lsn when you START_REPLICATION on the new walsender. It continues without sending data to the client until it decodes the first commit past confirmed_flush_lsn, or past whatever greater LSN you requested by passing it to the START_REPLICATION command.
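
On the client side that looks roughly like this (untested sketch; 'myslot' is a placeholder): confirmed_flush_lsn only advances when the client sends flush feedback, so after a walsender restart nothing the client already confirmed gets re-sent.

    # Untested sketch: consume the stream and confirm flush positions so a
    # restarted walsender resumes past already-confirmed commits.
    import psycopg2
    from psycopg2.extras import LogicalReplicationConnection

    conn = psycopg2.connect("dbname=postgres",  # placeholder DSN
                            connection_factory=LogicalReplicationConnection)
    cur = conn.cursor()
    # start_lsn=0 is the same as START_REPLICATION ... 0/0:
    # resume from the slot's confirmed_flush_lsn.
    cur.start_replication(slot_name='myslot', start_lsn=0, decode=True)

    def consume(msg):
        print(msg.data_start, msg.payload)   # stand-in for real apply logic
        # Reporting the flush position is what moves confirmed_flush_lsn.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)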

The snapshot builder is also involved; see snapbuild.c and the comments there.

I'll wait for a test case or some more detail.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
