Re: [HACKERS] Replication status in logical replication - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: [HACKERS] Replication status in logical replication
Date
Msg-id 20180706052148.GA776@paquier.xyz
Whole thread Raw
In response to Re: [HACKERS] Replication status in logical replication  (Michael Paquier <michael@paquier.xyz>)
Responses Re: [HACKERS] Replication status in logical replication  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Thu, Jul 05, 2018 at 05:13:27PM +0900, Michael Paquier wrote:
> This concerns as well v10, so that's not actually an open item...
> Well, it was an open item last year.  The last set of patches is from
> Simon here:
> https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com

logical_repl_caught_up_v4alt2.patch is actually incorrect after I tested
the thing, and that logical_repl_caught_up_v4.patch gets the correct
call.

> Simon, do you feel confident with your patch?  If yes, could you finish
> wrapping it?  I am getting myself familiar with the problem as this has
> been around for some time now so I am reviewing the thing as well and
> then I can board the ship..

Okay, I have spent some time today looking at this patch, and the error
is very easy to reproduce once you do that in the TAP tests:
1) Stop and start once the publisher in one of the tests of
src/test/subscription.
2) Enforce wait_for_catchup() to check for state = 'streaming'.
And then you would see the tests waiting until timeout is reached and
then die.

I would be inclined to add those tests in the final patch, the
disadvantage being that 1) makes one of the test scripts a bit longer,
but it can reproduce the failures most of the time.  Having 2) is
actually nice for physical replication as the tests in
src/test/recovery/ use wait_for_catchup() in various ways.

Some other notes about the patch:
- I switched the error message in WalSndLoop as mentioned upthread for
nodes catching up, aka no more "primary" but "upstream server".
- Added a note about using only GetFlushRecPtr in XLogSendLogical as
logical decoding cannot be used on standby nodes.  If one day logical
decoding gets supports on standby then this would need an update as
well.

Does this look fine to all the folks involved in this thread?  It is
Friday afternoon here so my brain is getting fried, but I can finish
wrapping up this patch at the beginning of next week if there are no
objections.  At quick glance this indeed would need a backpatch down to
9.4 but I have not spent time testing those configurations yet.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Legacy GiST invalid tuples
Next
From: Kyotaro HORIGUCHI
Date:
Subject: Re: using expression syntax for partition bounds