Thread: [HACKERS] Replication status in logical replication

[HACKERS] Replication status in logical replication

From: Masahiko Sawada

Hi all,

When using logical replication, I ran into a situation where
pg_stat_replication.state is not updated until some WAL record is sent
after startup. For example, I set up logical replication with two
subscribers and restarted the publisher server, but I see the following
status for a while (perhaps until autovacuum runs).

=# select application_name, state, sent_location, write_location,
flush_location, replay_location, sync_state from pg_stat_replication ;
 application_name |  state  | sent_location | write_location | flush_location | replay_location | sync_state
------------------+---------+---------------+----------------+----------------+-----------------+------------
 node1            | catchup | 0/16329F8     | 0/16329F8      | 0/16329F8      | 0/16329F8       | potential
 node2            | catchup | 0/16329F8     | 0/16329F8      | 0/16329F8      | 0/16329F8       | async
(2 rows)

It seems that all WAL senders have caught up, but
pg_stat_replication.state is still "catchup". The reason for this
behavior is that in logical replication, WalSndCaughtUp is updated only
in WalSndWaitForWal while running, and logical_read_xlog_page always
tries to read the next WAL record (i.e. it calls
WalSndWaitForWal(targetPagePtr + reqLen)). So WalSndWaitForWal cannot
update WalSndCaughtUp until some new WAL record is created after
startup and the WAL sender has read it.

The attached patch fixes this behavior by updating WalSndCaughtUp
before trying to read the next WAL record, if we have already caught up.
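
To illustrate, the idea looks roughly like this in the logical-decoding
page-read path (a sketch only, abbreviated from walsender.c; the
attached patch has the real change):

    /*
     * Before asking WalSndWaitForWal() for WAL beyond the last record
     * we read, mark ourselves caught up if everything flushed so far
     * has already been sent.  Otherwise the flag is only set once we
     * actually have to wait for brand-new WAL, which after a restart
     * may not happen for a long time.
     */
    if (!WalSndCaughtUp && sentPtr >= GetFlushRecPtr())
        WalSndCaughtUp = true;

    flushptr = WalSndWaitForWal(targetPagePtr + reqLen);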

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Replication status in logical replication

From: Simon Riggs

On 22 March 2017 at 02:50, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> When using logical replication, I ran into a situation where
> pg_stat_replication.state is not updated until some WAL record is sent
> after startup. For example, I set up logical replication with two
> subscribers and restarted the publisher server, but I see the following
> status for a while (perhaps until autovacuum runs).
...

> The attached patch fixes this behavior by updating WalSndCaughtUp
> before trying to read the next WAL record, if we have already caught up.

Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
as far as it goes).

Objections to commit?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 March 2017 at 02:50, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>> When using logical replication, I ran into a situation where
>> pg_stat_replication.state is not updated until some WAL record is sent
>> after startup. For example, I set up logical replication with two
>> subscribers and restarted the publisher server, but I see the following
>> status for a while (perhaps until autovacuum runs).
> ...
>
>> The attached patch fixes this behavior by updating WalSndCaughtUp
>> before trying to read the next WAL record, if we have already caught up.
>
> Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
> as far as it goes).
>
> Objections to commit?
>

Seems we still have this issue. Any update or comment on this? Barring
any objections, I'll add this to the open items list so it doesn't get
missed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Replication status in logical replication

From: Noah Misch

On Fri, May 19, 2017 at 11:33:48AM +0900, Masahiko Sawada wrote:
> On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 22 March 2017 at 02:50, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >> When using logical replication, I ran into a situation where
> >> pg_stat_replication.state is not updated until some WAL record is sent
> >> after startup. For example, I set up logical replication with two
> >> subscribers and restarted the publisher server, but I see the following
> >> status for a while (perhaps until autovacuum runs).
> > ...
> >
> >> The attached patch fixes this behavior by updating WalSndCaughtUp
> >> before trying to read the next WAL record, if we have already caught up.
> >
> > Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
> > as far as it goes).
> >
> > Objections to commit?
> >
> 
> Seems we still have this issue. Any update or comment on this? Barring
> any objections, I'll add this to the open items list so it doesn't get
> missed.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] Replication status in logical replication

From: Peter Eisentraut

On 5/29/17 22:56, Noah Misch wrote:
> On Fri, May 19, 2017 at 11:33:48AM +0900, Masahiko Sawada wrote:
>> On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
>>> as far as it goes).
>>>
>>> Objections to commit?
>>>
>>
>> Seems we still have this issue. Any update or comment on this? Barring
>> any objections, I'll add this to the open items list so it doesn't get
>> missed.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.

I would ask Simon to go ahead with this patch if he feels comfortable
with it.

I'm disclaiming this open item, since it's an existing bug from previous
releases (and I have other open items to focus on).

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Replication status in logical replication

From: Daniel Gustafsson

> On 30 May 2017, at 19:55, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 5/29/17 22:56, Noah Misch wrote:
>> On Fri, May 19, 2017 at 11:33:48AM +0900, Masahiko Sawada wrote:
>>> On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
>>>> as far as it goes).
>>>>
>>>> Objections to commit?
>>>>
>>>
>>> Seems we still have this issue. Any update or comment on this? Barring
>>> any objections, I'll add this to the open items list so it doesn't get
>>> missed.
>>
>> [Action required within three days.  This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.
>
> I would ask Simon to go ahead with this patch if he feels comfortable
> with it.
>
> I'm disclaiming this open item, since it's an existing bug from previous
> releases (and I have other open items to focus on).

I’m not entirely sure why this was flagged as “Waiting for Author” by the
automatic run; the patch applies for me and builds, so I’m resetting it
back to “Needs review”.

Simon: do you think you will have time to look at this patch in this CF?

cheers ./daniel


Re: [HACKERS] Replication status in logical replication

From: Vaishnavi Prabakaran

Hi,

On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:

I’m not entirely sure why this was flagged as “Waiting for Author” by the
automatic run; the patch applies for me and builds, so I’m resetting it
back to “Needs review”.


This patch applies and builds cleanly. I did some testing with one publisher and one subscriber, and confirmed that the replication state after restarting the server is now "streaming" and not "catchup".
 
Also, I don't find any issues with the code, so the patch to me is ready for committer; I have marked it the same in the CF entry.
 
Thanks & Regards,
Vaishnavi,
Fujitsu Australia. 

Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
<vaishnaviprabakaran@gmail.com> wrote:
> Hi,
>
> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>
>> I’m not entirely sure why this was flagged as “Waiting for Author” by the
>> automatic run; the patch applies for me and builds, so I’m resetting it
>> back to “Needs review”.
>>
>
> This patch applies and builds cleanly. I did some testing with one publisher
> and one subscriber, and confirmed that the replication state after restarting
> the server is now "streaming" and not "catchup".
>
> Also, I don't find any issues with the code, so the patch to me is ready
> for committer; I have marked it the same in the CF entry.
>

Thank you for reviewing the patch!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Replication status in logical replication

From: Thomas Munro

On Tue, Sep 26, 2017 at 3:45 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
> <vaishnaviprabakaran@gmail.com> wrote:
>> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:
>>> I’m not entirely sure why this was flagged as “Waiting for Author” by the
>>> automatic run; the patch applies for me and builds, so I’m resetting it
>>> back to “Needs review”.
>>>
>>
>> This patch applies and builds cleanly. I did some testing with one publisher
>> and one subscriber, and confirmed that the replication state after restarting
>> the server is now "streaming" and not "catchup".
>>
>> Also, I don't find any issues with the code, so the patch to me is ready
>> for committer; I have marked it the same in the CF entry.

Hi Sawada-san,

My patch-testing robot doesn't like this patch[1].  I just tried it on
my laptop to double-check and get some more details, and saw the same
failures:

(1) "make check" under src/test/recovery fails like this:

t/006_logical_decoding.pl ............ 2/16 # Looks like your test
exited with 29 just after 4.
t/006_logical_decoding.pl ............ Dubious, test returned 29
(wstat 7424, 0x1d00)
Failed 12/16 subtests

regress_log_006_logical_decoding says:

ok 4 - got same expected output from pg_recvlogical decoding session
pg_recvlogical timed out at
/opt/local/lib/perl5/vendor_perl/5.24/IPC/Run.pm line 2918.
waiting for endpos 0/1609B60 with stdout '', stderr '' at
/Users/munro/projects/postgres/src/test/recovery/../../../src/test/perl/PostgresNode.pm
line 1700.
### Stopping node "master" using mode immediate
# Running: pg_ctl -D
/Users/munro/projects/postgres/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
-m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "master"
# Looks like your test exited with 29 just after 4.

(2) "make check" under src/test/subscription says:

t/001_rep_changes.pl .. ok
t/002_types.pl ........ #
# Looks like your test exited with 60 before it could output anything.
t/002_types.pl ........ Dubious, test returned 60 (wstat 15360, 0x3c00)
Failed 3/3 subtests
t/003_constraints.pl ..

Each of those took several minutes, and I stopped it there.  It may
be going to say some more things but is taking a very long time
(presumably timing out, but the 001 took ages and then succeeded...
hmm).  In fact I had to run this on my laptop to see that because on
Travis CI the whole test job just gets killed after 10 minutes of
non-output and the above output was never logged because of the way
concurrent test jobs' output is buffered.

I didn't try to figure out what is going wrong.

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/301586561

--
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Tue, Nov 14, 2017 at 6:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Sep 26, 2017 at 3:45 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
>> <vaishnaviprabakaran@gmail.com> wrote:
>>> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:
>>>> I’m not entirely sure why this was flagged as “Waiting for Author” by the
>>>> automatic run; the patch applies for me and builds, so I’m resetting it
>>>> back to “Needs review”.
>>>>
>>>
>>> This patch applies and builds cleanly. I did some testing with one publisher
>>> and one subscriber, and confirmed that the replication state after restarting
>>> the server is now "streaming" and not "catchup".
>>>
>>> Also, I don't find any issues with the code, so the patch to me is ready
>>> for committer; I have marked it the same in the CF entry.
>
> Hi Sawada-san,
>
> My patch-testing robot doesn't like this patch[1].  I just tried it on
> my laptop to double-check and get some more details, and saw the same
> failures:
>
> (1) "make check" under src/test/recovery fails like this:
>
> t/006_logical_decoding.pl ............ 2/16 # Looks like your test
> exited with 29 just after 4.
> t/006_logical_decoding.pl ............ Dubious, test returned 29
> (wstat 7424, 0x1d00)
> Failed 12/16 subtests
>
> regress_log_006_logical_decoding says:
>
> ok 4 - got same expected output from pg_recvlogical decoding session
> pg_recvlogical timed out at
> /opt/local/lib/perl5/vendor_perl/5.24/IPC/Run.pm line 2918.
>  waiting for endpos 0/1609B60 with stdout '', stderr '' at
> /Users/munro/projects/postgres/src/test/recovery/../../../src/test/perl/PostgresNode.pm
> line 1700.
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D
> /Users/munro/projects/postgres/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -m immediate stop
> waiting for server to shut down.... done
> server stopped
> # No postmaster PID for node "master"
> # Looks like your test exited with 29 just after 4.
>
> (2) "make check" under src/test/subscription says:
>
> t/001_rep_changes.pl .. ok
> t/002_types.pl ........ #
> # Looks like your test exited with 60 before it could output anything.
> t/002_types.pl ........ Dubious, test returned 60 (wstat 15360, 0x3c00)
> Failed 3/3 subtests
> t/003_constraints.pl ..
>
> Each of those took several minutes, and I stopped it there.  It may
> be going to say some more things but is taking a very long time
> (presumably timing out, but the 001 took ages and then succeeded...
> hmm).  In fact I had to run this on my laptop to see that because on
> Travis CI the whole test job just gets killed after 10 minutes of
> non-output and the above output was never logged because of the way
> concurrent test jobs' output is buffered.
>
> I didn't try to figure out what is going wrong.
>

Thank you for the notification!

After investigation, I found out that my previous patch was headed in
the wrong direction. I should have changed XLogSendLogical() so that we
check the read LSN and set WalSndCaughtUp = true even after reading a
record, without waiting. The attached updated patch passes 'make
check-world'. Please review it.
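
To illustrate, the revised approach makes XLogSendLogical() look
roughly like this (a sketch only, abbreviated; see the attached patch
for the real change):

    record = XLogReadRecord(logical_decoding_ctx->reader,
                            logical_startptr, &errm);
    logical_startptr = InvalidXLogRecPtr;

    if (record != NULL)
    {
        LogicalDecodingProcessRecord(logical_decoding_ctx,
                                     logical_decoding_ctx->reader);
        sentPtr = logical_decoding_ctx->reader->EndRecPtr;
    }

    /*
     * Whether or not we read a record, compare how far we have read
     * against the flush pointer, so we can report having caught up
     * without waiting for brand-new WAL to be written.
     */
    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
        WalSndCaughtUp = true;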

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: Michael Paquier

On Wed, Nov 22, 2017 at 6:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> After investigation, I found out that my previous patch was headed in
> the wrong direction. I should have changed XLogSendLogical() so that we
> check the read LSN and set WalSndCaughtUp = true even after reading a
> record, without waiting. The attached updated patch passes 'make
> check-world'. Please review it.

Moved to next CF.
-- 
Michael


Re: [HACKERS] Replication status in logical replication

From: Petr Jelinek

On 21/11/17 22:06, Masahiko Sawada wrote:
> 
> After investigation, I found out that my previous patch was headed in
> the wrong direction. I should have changed XLogSendLogical() so that we
> check the read LSN and set WalSndCaughtUp = true even after reading a
> record, without waiting. The attached updated patch passes 'make
> check-world'. Please review it.
> 

Hi,

This version looks good to me and seems to be in line with what we do in
physical replication.
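
For comparison, the physical walsender does roughly this near the top
of XLogSendPhysical() (quoting from memory, so details may differ):

    /* If nothing remains to be sent, we have caught up. */
    if (SendRqstPtr <= sentPtr)
    {
        WalSndCaughtUp = true;
        return;
    }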

Marking as ready for committer.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 21/11/17 22:06, Masahiko Sawada wrote:
>>
>> After investigation, I found out that my previous patch was headed in
>> the wrong direction. I should have changed XLogSendLogical() so that we
>> check the read LSN and set WalSndCaughtUp = true even after reading a
>> record, without waiting. The attached updated patch passes 'make
>> check-world'. Please review it.
>>
>
> Hi,
>
> This version looks good to me and seems to be in line with what we do in
> physical replication.
>
> Marking as ready for committer.
>

Thank you for reviewing this patch! I think this patch can be
back-patched to 9.4 as Simon mentioned.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: "Tels"

Moin,

On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote:
> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>
>>> After investigation, I found out that my previous patch was headed in
>>> the wrong direction. I should have changed XLogSendLogical() so that we
>>> check the read LSN and set WalSndCaughtUp = true even after reading a
>>> record, without waiting. The attached updated patch passes 'make
>>> check-world'. Please review it.
>>>
>>
>> Hi,
>>
>> This version looks good to me and seems to be in line with what we do in
>> physical replication.
>>
>> Marking as ready for committer.

(Sorry Masahiko, you'll get this twice, as I fumbled the reply button.)

I have not verified that the comment and/or code are correct, just a grammar fix:

+                /*
+                 * If we've sent a record is at or beyond the flushed
+                 * point, then we're caught up.

That should read more like this:

"If we've sent a record that is at or beyond the flushed point, we have
caught up."

All the best,

Tels



Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Tue, Dec 26, 2017 at 6:19 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
> Moin,
>
> On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote:
>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>>
>>>> After investigation, I found out that my previous patch was headed in
>>>> the wrong direction. I should have changed XLogSendLogical() so that we
>>>> check the read LSN and set WalSndCaughtUp = true even after reading a
>>>> record, without waiting. The attached updated patch passes 'make
>>>> check-world'. Please review it.
>>>>
>>>
>>> Hi,
>>>
>>> This version looks good to me and seems to be in line with what we do in
>>> physical replication.
>>>
>>> Marking as ready for committer.
>
> (Sorry Masahiko, you'll get this twice, as I fumbled the reply button.)
>
> I have not verified that the comment and/or code are correct, just a grammar fix:
>
> +                /*
> +                 * If we've sent a record is at or beyond the flushed
> +                 * point, then we're caught up.
>
> That should read more like this:
>
> "If we've sent a record that is at or beyond the flushed point, we have
> caught up."
>

Thank you for reviewing the patch!
Actually, that comment is inspired by the comment just below it.
ISTM it's better to fix both if their grammar is not appropriate.

+
+       /*
+        * If we've sent a record that is at or beyond the flushed point, we
+        * have caught up.
+        */
+       if (sentPtr >= GetFlushRecPtr())
+           WalSndCaughtUp = true;
   }
   else
   {
       /*
        * If the record we just wanted read is at or beyond the flushed
        * point, then we're caught up.
        */
       if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
       {

Attached an updated patch. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: "Tels"

Moin,

On Tue, December 26, 2017 5:26 am, Masahiko Sawada wrote:
> On Tue, Dec 26, 2017 at 6:19 PM, Tels <nospam-pg-abuse@bloodgate.com>
> wrote:
>> Moin,
>>
>> On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote:
>>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>>>
>>>>> After investigation, I found out that my previous patch was headed in
>>>>> the wrong direction. I should have changed XLogSendLogical() so that we
>>>>> check the read LSN and set WalSndCaughtUp = true even after reading a
>>>>> record, without waiting. The attached updated patch passes 'make
>>>>> check-world'. Please review it.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> This version looks good to me and seems to be in line with what we do
>>>> in
>>>> physical replication.
>>>>
>>>> Marking as ready for committer.
>>
>> (Sorry Masahiko, you'll get this twice, as I fumbled the reply button.)
>>
>> I have not verified that the comment and/or code are correct, just a
>> grammar fix:
>>
>> +                /*
>> +                 * If we've sent a record is at or beyond the flushed
>> +                 * point, then we're caught up.
>>
>> That should read more like this:
>>
>> "If we've sent a record that is at or beyond the flushed point, we have
>> caught up."
>>
>
> Thank you for reviewing the patch!
> Actually, that comment is inspired by the comment just below it.
> ISTM it's better to fix both if their grammar is not appropriate.

Oh yes. Your attached version reads fine to me.

All the best,

Tels


Re: [HACKERS] Replication status in logical replication

From: Simon Riggs

On 26 December 2017 at 00:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>
>>> After investigation, I found out that my previous patch was headed in
>>> the wrong direction. I should have changed XLogSendLogical() so that we
>>> check the read LSN and set WalSndCaughtUp = true even after reading a
>>> record, without waiting. The attached updated patch passes 'make
>>> check-world'. Please review it.
>>>
>>
>> Hi,
>>
>> This version looks good to me and seems to be in line with what we do in
>> physical replication.
>>
>> Marking as ready for committer.
>>
>
> Thank you for reviewing this patch!

The patch calls this AFTER processing the record
   if (sentPtr >= GetFlushRecPtr())
but it seems better to call GetFlushRecPtr() before we process the
record; otherwise the flush pointer might have moved forwards while we
processed the record, and we would never report having caught up.
(Physical replication works like that.)
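
A sketch of that ordering (abbreviated; the v4 patch has the real
change):

    /*
     * Take the flush pointer *before* decoding the record, so that WAL
     * flushed while we are busy processing does not stop us from
     * reporting that we have caught up with what existed when we
     * started.
     */
    flushPtr = GetFlushRecPtr();

    if (record != NULL)
    {
        LogicalDecodingProcessRecord(logical_decoding_ctx,
                                     logical_decoding_ctx->reader);
        sentPtr = logical_decoding_ctx->reader->EndRecPtr;
    }

    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        WalSndCaughtUp = true;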

New patch version attached for discussion before commit. (v4)

I'd rather not call it at all at that point though, so if we made
RecentFlushPtr static at the module level rather than within
WalSndWaitForWal we could use it here also. That's a bit more invasive
for backpatching, so I have not implemented that here.

Overall, I find setting WalSndCaughtUp = false at the top of
XLogSendLogical() to be incredibly ugly and I would like to remove it.
It can't be correct to have a static status variable that oscillates
between false and true with every record. This seems to be done
because of the lack of a logical initialization call. Petr? Peter?
A version with this removed is attached (v4alt2).

I've removed the edit that fusses over English grammar: both ways are correct.

> I think this patch can be
> back-patched to 9.4 as Simon mentioned.

This patch appears to cause this DEBUG1 message

"standby \"%s\" has now caught up with primary"

which probably isn't the right message, but might be OK to backpatch.

Thoughts on better wording?
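
For context, the message is emitted from the state transition in
WalSndLoop(), roughly (quoting from memory, so details may differ):

    if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
    {
        ereport(DEBUG1,
                (errmsg("standby \"%s\" has now caught up with primary",
                        application_name)));
        WalSndSetState(WALSNDSTATE_STREAMING);
    }

so the wording predates logical walsenders.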

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Sun, Jan 7, 2018 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 December 2017 at 00:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>>
>>>> After investigation, I found out that my previous patch was headed in
>>>> the wrong direction. I should have changed XLogSendLogical() so that we
>>>> check the read LSN and set WalSndCaughtUp = true even after reading a
>>>> record, without waiting. The attached updated patch passes 'make
>>>> check-world'. Please review it.
>>>>
>>>
>>> Hi,
>>>
>>> This version looks good to me and seems to be in line with what we do in
>>> physical replication.
>>>
>>> Marking as ready for committer.
>>>
>>
>> Thank you for reviewing this patch!
>
> The patch calls this AFTER processing the record
>    if (sentPtr >= GetFlushRecPtr())
> but it seems better to call GetFlushRecPtr() before we process the
> record; otherwise the flush pointer might have moved forwards while we
> processed the record, and we would never report having caught up.
> (Physical replication works like that.)

Agreed.

> New patch version attached for discussion before commit. (v4)

v4 patch looks good to me.

>
> I'd rather not call it at all at that point though, so if we made
> RecentFlushPtr static at the module level rather than within
> WalSndWaitForWal we could use it here also. That's a bit more invasive
> for backpatching, so I have not implemented that here.
>
> Overall, I find setting WalSndCaughtUp = false at the top of
> XLogSendLogical() to be incredibly ugly and I would like to remove it.
> It can't be correct to have a static status variable that oscillates
> between false and true with every record. This seems to be done
> because of the lack of a logical initialization call. Petr? Peter?
> A version with this removed is attached (v4alt2).
>
> I've removed the edit that fusses over English grammar: both ways are correct.
>
>> I think this patch can be
>> back-patched to 9.4 as Simon mentioned.
>
> This patch appears to cause this DEBUG1 message
>
> "standby \"%s\" has now caught up with primary"
>
> which probably isn't the right message, but might be OK to backpatch.
>
> Thoughts on better wording?
>

I think that this DEBUG1 message appears whenever any WAL is sent after
catching up, even without this patch. This patch makes the message
appear at the proper timing. Or am I missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: Simon Riggs

On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>> This patch appears to cause this DEBUG1 message
>>
>> "standby \"%s\" has now caught up with primary"
>>
>> which probably isn't the right message, but might be OK to backpatch.
>>
>> Thoughts on better wording?
>>
>
> I think that this DEBUG1 message appears whenever any WAL is sent after
> catching up, even without this patch. This patch makes the message
> appear at the proper timing. Or am I missing something?

We're not talking about standbys, so the message is incorrect.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>>> This patch appears to cause this DEBUG1 message
>>>
>>> "standby \"%s\" has now caught up with primary"
>>>
>>> which probably isn't the right message, but might be OK to backpatch.
>>>
>>> Thoughts on better wording?
>>>
>>
>> I think that this DEBUG1 message appears whenever any WAL is sent after
>> catching up, even without this patch. This patch makes the message
>> appear at the proper timing. Or am I missing something?
>
> We're not talking about standbys, so the message is incorrect.
>

Ah, I understood. How about "\"%s\"  has now caught up with upstream server"?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: Michael Paquier

On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> We're not talking about standbys, so the message is incorrect.
>
> Ah, I understood. How about "\"%s\"  has now caught up with upstream
> server"?

+1.
--
Michael


Re: [HACKERS] Replication status in logical replication

From: Simon Riggs

On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> We're not talking about standbys, so the message is incorrect.
>>
>> Ah, I understood. How about "\"%s\"  has now caught up with upstream
>> server"?
>
> +1.

upstream is what I would have suggested, so +1 here also.

Will commit.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> We're not talking about standbys, so the message is incorrect.
>>>
>>> Ah, I understood. How about "\"%s\"  has now caught up with upstream
>>> server"?
>>
>> +1.
>
> upstream is what I would have suggested, so +1 here also.
>
> Will commit.
>

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: Fujii Masao

On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> We're not talking about standbys, so the message is incorrect.
>>>
>>> Ah, I understood. How about "\"%s\"  has now caught up with upstream
>>> server"?
>>
>> +1.
>
> upstream is what I would have suggested, so +1 here also.
>
> Will commit.

ping?

Regards,

-- 
Fujii Masao


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Fri, Mar 30, 2018 at 5:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> We're not talking about standbys, so the message is incorrect.
>>>>
>>>> Ah, I understood. How about "\"%s\"  has now caught up with upstream
>>>> server"?
>>>
>>> +1.
>>
>> upstream is what I would have suggested, so +1 here also.
>>
>> Will commit.
>
> ping?
>

Should I add this item to "Older Bugs" on the open items page, since we
regarded it as a bug?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: David Steele

On 4/10/18 6:14 AM, Masahiko Sawada wrote:
> On Fri, Mar 30, 2018 at 5:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>>>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>>>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> We're not talking about standbys, so the message is incorrect.
>>>>>
>>>>> Ah, I understood. How about "\"%s\"  has now caught up with upstream
>>>>> server"?
>>>>
>>>> +1.
>>>
>>> upstream is what I would have suggested, so +1 here also.
>>>
>>> Will commit.
>>
>> ping?
>>
> 
> Should I add this item to "Older Bugs" on the open items page, since we
> regarded it as a bug?

I'm going to reclassify this as a bug since everyone seems to agree it
is one.

Simon, are you still planning to commit this?

Thanks,
-- 
-David
david@pgmasters.net


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Tue, Apr 10, 2018 at 11:11 PM, David Steele <david@pgmasters.net> wrote:
> On 4/10/18 6:14 AM, Masahiko Sawada wrote:
>> On Fri, Mar 30, 2018 at 5:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>>>>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>>>>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>> We're not talking about standbys, so the message is incorrect.
>>>>>>
>>>>>> Ah, I understood. How about "\"%s\"  has now caught up with upstream
>>>>>> server"?
>>>>>
>>>>> +1.
>>>>
>>>> upstream is what I would have suggested, so +1 here also.
>>>>
>>>> Will commit.
>>>
>>> ping?
>>>
>>
>> Should I add this item to "Older Bugs" on the open items page, since we
>> regarded it as a bug?
>
> I'm going to reclassify this as a bug since everyone seems to agree it
> is one.

I've added this to Open Items so as not to forget.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: Michael Paquier

On Mon, May 21, 2018 at 10:15:02AM +0900, Masahiko Sawada wrote:
> I've added this to Open Items so as not to forget.

This concerns v10 as well, so that's not actually an open item...
Well, it was an open item last year.  The last set of patches is from
Simon here:
https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com

Simon, do you feel confident with your patch?  If yes, could you finish
wrapping it?  I am getting myself familiar with the problem, as this has
been around for some time now, so I am reviewing the thing as well and
then I can board the ship.
--
Michael


Re: [HACKERS] Replication status in logical replication

From: Michael Paquier

On Thu, Jul 05, 2018 at 05:13:27PM +0900, Michael Paquier wrote:
> This concerns v10 as well, so that's not actually an open item...
> Well, it was an open item last year.  The last set of patches is from
> Simon here:
> https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com

Having tested the thing, logical_repl_caught_up_v4alt2.patch is
actually incorrect, while logical_repl_caught_up_v4.patch gets the
correct call.

> Simon, do you feel confident with your patch?  If yes, could you finish
> wrapping it?  I am getting myself familiar with the problem, as this has
> been around for some time now, so I am reviewing the thing as well and
> then I can board the ship.

Okay, I have spent some time today looking at this patch, and the error
is very easy to reproduce by doing the following in the TAP tests:
1) Stop and start the publisher once in one of the tests of
src/test/subscription.
2) Make wait_for_catchup() check for state = 'streaming'.
Then you would see the tests waiting until the timeout is reached and
then dying.

I would be inclined to add those tests to the final patch, the
disadvantage being that 1) makes one of the test scripts a bit longer,
but it can reproduce the failures most of the time.  Having 2) is
actually nice for physical replication as the tests in
src/test/recovery/ use wait_for_catchup() in various ways.

Some other notes about the patch:
- I switched the error message in WalSndLoop as mentioned upthread for
nodes catching up, aka no more "primary" but "upstream server".
- Added a note about using only GetFlushRecPtr in XLogSendLogical as
logical decoding cannot be used on standby nodes.  If one day logical
decoding gets support on standby then this would need an update as
well.

Does this look fine to all the folks involved in this thread?  It is
Friday afternoon here so my brain is getting fried, but I can finish
wrapping up this patch at the beginning of next week if there are no
objections.  At a quick glance this indeed would need a backpatch down to
9.4 but I have not spent time testing those configurations yet.
--
Michael


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Fri, Jul 6, 2018 at 2:21 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jul 05, 2018 at 05:13:27PM +0900, Michael Paquier wrote:
>> This concerns v10 as well, so that's not actually an open item...
>> Well, it was an open item last year.  The last set of patches is from
>> Simon here:
>> https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com
>
> Having tested the thing, logical_repl_caught_up_v4alt2.patch is
> actually incorrect, while logical_repl_caught_up_v4.patch gets the
> correct call.
>
>> Simon, do you feel confident with your patch?  If yes, could you finish
>> wrapping it?  I am getting myself familiar with the problem, as this has
>> been around for some time now, so I am reviewing the thing as well and
>> then I can board the ship.
>
> Okay, I have spent some time today looking at this patch, and the error
> is very easy to reproduce by doing the following in the TAP tests:
> 1) Stop and start the publisher once in one of the tests of
> src/test/subscription.
> 2) Make wait_for_catchup() check for state = 'streaming'.
> Then you would see the tests waiting until the timeout is reached and
> then dying.
>
> I would be inclined to add those tests to the final patch, the
> disadvantage being that 1) makes one of the test scripts a bit longer,
> but it can reproduce the failures most of the time.  Having 2) is
> actually nice for physical replication as the tests in
> src/test/recovery/ use wait_for_catchup() in various ways.
>
> Some other notes about the patch:
> - I switched the error message in WalSndLoop as mentioned upthread for
> nodes catching up, aka no more "primary" but "upstream server".
> - Added a note about using only GetFlushRecPtr in XLogSendLogical as
> logical decoding cannot be used on standby nodes.  If one day logical
> decoding gets support on standby then this would need an update as
> well.
>
> Does this look fine to all the folks involved in this thread?  It is
> Friday afternoon here so my brain is getting fried, but I can finish
> wrapping up this patch at the beginning of next week if there are no
> objections.  At a quick glance this indeed would need a backpatch down to
> 9.4 but I have not spent time testing those configurations yet.

Thank you for updating the patch. The patch looks fine to me, and I
agree with all changes you made.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Replication status in logical replication

From: Michael Paquier

On Mon, Jul 09, 2018 at 05:25:55PM +0900, Masahiko Sawada wrote:
> Thank you for updating the patch. The patch looks fine to me, and I
> agree with all changes you made.

Thanks.  If there are no objections, then I will try to wrap this stuff
on Thursday my time.
--
Michael


Re: [HACKERS] Replication status in logical replication

From: Michael Paquier

On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote:
> Thanks.  If there are no objections, then I will try to wrap this stuff
> on Thursday my time.

And done down to 9.4.
--
Michael


Re: [HACKERS] Replication status in logical replication

From: Masahiko Sawada

On Thu, Jul 12, 2018 at 10:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote:
>> Thanks.  If there are no objections, then I will try to wrap this stuff
>> on Thursday my time.
>
> And done down to 9.4.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center