Thread: [HACKERS] Replication status in logical replication
Hi all,

When using logical replication, I ran into a situation where
pg_stat_replication.state is not updated until some WAL record is sent
after startup. For example, I set up logical replication with 2
subscribers and restarted the publisher server, but I see the following
status for a while (maybe until autovacuum runs).

=# select application_name, state, sent_location, write_location,
   flush_location, replay_location, sync_state from pg_stat_replication;
 application_name |  state  | sent_location | write_location | flush_location | replay_location | sync_state
------------------+---------+---------------+----------------+----------------+-----------------+------------
 node1            | catchup | 0/16329F8     | 0/16329F8      | 0/16329F8      | 0/16329F8       | potential
 node2            | catchup | 0/16329F8     | 0/16329F8      | 0/16329F8      | 0/16329F8       | async
(2 rows)

It seems that all WAL senders have caught up, but
pg_stat_replication.state is still "catchup". The reason for this
behavior is that in logical replication WalSndCaughtUp is updated only
in WalSndWaitForWal, and logical_read_xlog_page always tries to read
the next WAL record (i.e. it calls WalSndWaitForWal(targetPagePtr +
reqLen)). So WalSndWaitForWal cannot update WalSndCaughtUp until a new
WAL record is created after startup and the WAL sender reads it.

The attached patch fixes this behavior by updating WalSndCaughtUp
before trying to read the next WAL record if we have already caught up.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
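To make the failure mode easier to follow, here is a small self-contained
C model of the flow described above. It is only an illustration of the
report, not the actual walsender.c code: wait_for_wal() and
read_next_page() are hypothetical stand-ins for WalSndWaitForWal() and
logical_read_xlog_page(), and the LSN values are taken from the example
output.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static bool       WalSndCaughtUp = false;
static XLogRecPtr flushPtr = 0x16329F8;    /* no new WAL since restart */

/* Stand-in for WalSndWaitForWal(): the flag only flips when we must wait. */
static XLogRecPtr
wait_for_wal(XLogRecPtr loc)
{
    if (loc <= flushPtr)
        return flushPtr;        /* enough WAL on disk already: flag untouched */

    WalSndCaughtUp = true;      /* only reached when waiting for new WAL */
    return flushPtr;
}

/* Stand-in for logical_read_xlog_page(): always asks for the next page. */
static void
read_next_page(XLogRecPtr targetPagePtr, int reqLen)
{
    wait_for_wal(targetPagePtr + reqLen);
}

int
main(void)
{
    /*
     * After the publisher restarts, every page the logical walsender asks
     * for is already flushed, so wait_for_wal() returns without waiting
     * and the state reported in pg_stat_replication stays "catchup".
     */
    read_next_page(0x1632000, 0x9F8);
    printf("reported state: %s\n", WalSndCaughtUp ? "streaming" : "catchup");
    return 0;
}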
On 22 March 2017 at 02:50, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> When using logical replication, I ran into a situation where the
> pg_stat_replication.state is not updated until any wal record is sent
> after started up. For example, I set up logical replication with 2
> subscriber and restart the publisher server, but I see the following
> status for a while (maybe until autovacuum run).
...
> Attached patch fixes this behavior by updating WalSndCaughtUp before
> trying to read next WAL if already caught up.

Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
as far as it goes).

Objections to commit?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 March 2017 at 02:50, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>> When using logical replication, I ran into a situation where the
>> pg_stat_replication.state is not updated until any wal record is sent
>> after started up. For example, I set up logical replication with 2
>> subscriber and restart the publisher server, but I see the following
>> status for a while (maybe until autovacuum run).
> ...
>
>> Attached patch fixes this behavior by updating WalSndCaughtUp before
>> trying to read next WAL if already caught up.
>
> Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
> as far as it goes).
>
> Objections to commit?
>

Seems we still have this issue. Any update or comment on this? Barring
any objections, I'll add this to the open items so it doesn't get
missed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, May 19, 2017 at 11:33:48AM +0900, Masahiko Sawada wrote:
> On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On 22 March 2017 at 02:50, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >> When using logical replication, I ran into a situation where the
> >> pg_stat_replication.state is not updated until any wal record is sent
> >> after started up. For example, I set up logical replication with 2
> >> subscriber and restart the publisher server, but I see the following
> >> status for a while (maybe until autovacuum run).
> > ...
> >
> >> Attached patch fixes this behavior by updating WalSndCaughtUp before
> >> trying to read next WAL if already caught up.
> >
> > Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
> > as far as it goes).
> >
> > Objections to commit?
> >
>
> Seems we still have this issue. Any update or comment on this? Barring
> any objections, I'll add this to the open item so it doesn't get
> missed.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this
open item. If some other commit is more relevant or if this does not
belong as a v10 open item, please let us know. Otherwise, please observe
the policy on open item ownership[1] and send a status update within
three calendar days of this message. Include a date for your subsequent
status update. Testers may discover new open items at any time, and I
want to plan to get them all fixed well in advance of shipping v10.
Consequently, I will appreciate your efforts toward speedy resolution.
Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 5/29/17 22:56, Noah Misch wrote:
> On Fri, May 19, 2017 at 11:33:48AM +0900, Masahiko Sawada wrote:
>> On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
>>> as far as it goes).
>>>
>>> Objections to commit?
>>>
>>
>> Seems we still have this issue. Any update or comment on this? Barring
>> any objections, I'll add this to the open item so it doesn't get
>> missed.
>
> [Action required within three days. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item. Peter,
> since you committed the patch believed to have created it, you own this open
> item. If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.

I would ask Simon to go ahead with this patch if he feels comfortable
with it.

I'm disclaiming this open item, since it's an existing bug from previous
releases (and I have other open items to focus on).

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 30 May 2017, at 19:55, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 5/29/17 22:56, Noah Misch wrote:
>> On Fri, May 19, 2017 at 11:33:48AM +0900, Masahiko Sawada wrote:
>>> On Wed, Apr 12, 2017 at 5:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> Looks like a bug that we should fix in PG10, with backpatch to 9.4 (or
>>>> as far as it goes).
>>>>
>>>> Objections to commit?
>>>>
>>>
>>> Seems we still have this issue. Any update or comment on this? Barring
>>> any objections, I'll add this to the open item so it doesn't get
>>> missed.
>>
>> [Action required within three days. This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 10 open item. Peter,
>> since you committed the patch believed to have created it, you own this open
>> item. If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.
>
> I would ask Simon to go ahead with this patch if he feels comfortable
> with it.
>
> I'm disclaiming this open item, since it's an existing bug from previous
> releases (and I have other open items to focus on).

I'm not entirely sure why this was flagged as "Waiting for Author" by the
automatic run; the patch applies for me and builds, so I'm resetting it
back to "Needs review".

Simon: do you think you will have time to look at this patch in this CF?

cheers ./daniel
Hi,
On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:
> I'm not entirely sure why this was flagged as "Waiting for Author" by the
> automatic run, the patch applies for me and builds so resetting back to "Needs
> review".
This patch applies and builds cleanly. I did some testing with one publisher and one subscriber, and confirm that the replication state after restarting the server is now "streaming" rather than "catchup".
I don't find any issues with the code, so the patch looks ready for committer to me; I have marked it as such in the CF entry.
Thanks & Regards,
Vaishnavi,
Fujitsu Australia.
On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
<vaishnaviprabakaran@gmail.com> wrote:
> Hi,
>
> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>> I'm not entirely sure why this was flagged as "Waiting for Author" by the
>> automatic run, the patch applies for me and builds so resetting back to
>> "Needs review".
>>
>
> This patch applies and build cleanly and I did a testing with one publisher
> and one subscriber, and confirm that the replication state after restarting
> the server now is "streaming" and not "Catchup".
>
> And, I don't find any issues with code and patch to me is ready for
> committer, marked the same in cf entry.
>

Thank you for reviewing the patch!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Sep 26, 2017 at 3:45 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
> <vaishnaviprabakaran@gmail.com> wrote:
>> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson <daniel@yesql.se> wrote:
>>> I'm not entirely sure why this was flagged as "Waiting for Author" by the
>>> automatic run, the patch applies for me and builds so resetting back to
>>> "Needs review".
>>>
>>
>> This patch applies and build cleanly and I did a testing with one publisher
>> and one subscriber, and confirm that the replication state after restarting
>> the server now is "streaming" and not "Catchup".
>>
>> And, I don't find any issues with code and patch to me is ready for
>> committer, marked the same in cf entry.

Hi Sawada-san,

My patch-testing robot doesn't like this patch[1]. I just tried it on
my laptop to double-check and get some more details, and saw the same
failures:

(1) "make check" under src/test/recovery fails like this:

t/006_logical_decoding.pl ............ 2/16 # Looks like your test exited with 29 just after 4.
t/006_logical_decoding.pl ............ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 12/16 subtests

regress_log_006_logical_decoding says:

ok 4 - got same expected output from pg_recvlogical decoding session
pg_recvlogical timed out at /opt/local/lib/perl5/vendor_perl/5.24/IPC/Run.pm line 2918.
waiting for endpos 0/1609B60 with stdout '', stderr '' at /Users/munro/projects/postgres/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1700.
### Stopping node "master" using mode immediate
# Running: pg_ctl -D /Users/munro/projects/postgres/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata -m immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "master"
# Looks like your test exited with 29 just after 4.

(2) "make check" under src/test/subscription says:

t/001_rep_changes.pl .. ok
t/002_types.pl ........ #
# Looks like your test exited with 60 before it could output anything.
t/002_types.pl ........ Dubious, test returned 60 (wstat 15360, 0x3c00)
Failed 3/3 subtests
t/003_constraints.pl ..

Each of those took several minutes, and I stopped it there. It may
be going to say some more things but is taking a very long time
(presumably timing out, but the 001 took ages and then succeeded...
hmm). In fact I had to run this on my laptop to see that because on
Travis CI the whole test job just gets killed after 10 minutes of
non-output, and the above output was never logged because of the way
concurrent test jobs' output is buffered.

I didn't try to figure out what is going wrong.

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/301586561

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Nov 14, 2017 at 6:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Sep 26, 2017 at 3:45 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
>> <vaishnaviprabakaran@gmail.com> wrote:
>>> This patch applies and build cleanly and I did a testing with one publisher
>>> and one subscriber, and confirm that the replication state after restarting
>>> the server now is "streaming" and not "Catchup".
>>>
>>> And, I don't find any issues with code and patch to me is ready for
>>> committer, marked the same in cf entry.
>
> Hi Sawada-san,
>
> My patch-testing robot doesn't like this patch[1]. I just tried it on
> my laptop to double-check and get some more details, and saw the same
> failures:
>
> (1) "make check" under src/test/recovery fails like this:
>
> t/006_logical_decoding.pl ............ 2/16 # Looks like your test
> exited with 29 just after 4.
> t/006_logical_decoding.pl ............ Dubious, test returned 29
> (wstat 7424, 0x1d00)
> Failed 12/16 subtests
>
> regress_log_006_logical_decoding says:
>
> ok 4 - got same expected output from pg_recvlogical decoding session
> pg_recvlogical timed out at
> /opt/local/lib/perl5/vendor_perl/5.24/IPC/Run.pm line 2918.
> waiting for endpos 0/1609B60 with stdout '', stderr '' at
> .../src/test/perl/PostgresNode.pm line 1700.
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D .../tmp_check/t_006_logical_decoding_master_data/pgdata
> -m immediate stop
> waiting for server to shut down.... done
> server stopped
> # No postmaster PID for node "master"
> # Looks like your test exited with 29 just after 4.
>
> (2) "make check" under src/test/subscription says:
>
> t/001_rep_changes.pl .. ok
> t/002_types.pl ........ #
> # Looks like your test exited with 60 before it could output anything.
> t/002_types.pl ........ Dubious, test returned 60 (wstat 15360, 0x3c00)
> Failed 3/3 subtests
> t/003_constraints.pl ..
>
> Each of those took several minutes, and I stopped it there.
>
> I didn't try to figure out what is going wrong.

Thank you for the notification! After investigation, I found out that
my previous patch was the wrong direction. I should have changed
XLogSendLogical() so that we can check the read LSN and set
WalSndCaughtUp = true even after reading a record without waiting.
The attached updated patch passed 'make check-world'. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Nov 22, 2017 at 6:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> After investigation, I found out that my previous patch was wrong
> direction. I should have changed XLogSendLogical() so that we can
> check the read LSN and set WalSndCaughtUp = true even after read a
> record without wait. Attached updated patch passed 'make check-world'.
> Please review it.

Moved to next CF.

--
Michael
On 21/11/17 22:06, Masahiko Sawada wrote:
>
> After investigation, I found out that my previous patch was wrong
> direction. I should have changed XLogSendLogical() so that we can
> check the read LSN and set WalSndCaughtUp = true even after read a
> record without wait. Attached updated patch passed 'make check-world'.
> Please review it.
>

Hi,

This version looks good to me and seems to be in line with what we do in
physical replication.

Marking as ready for committer.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 21/11/17 22:06, Masahiko Sawada wrote:
>>
>> After investigation, I found out that my previous patch was wrong
>> direction. I should have changed XLogSendLogical() so that we can
>> check the read LSN and set WalSndCaughtUp = true even after read a
>> record without wait. Attached updated patch passed 'make check-world'.
>> Please review it.
>>
>
> Hi,
>
> This version looks good to me and seems to be in line with what we do in
> physical replication.
>
> Marking as ready for committer.
>

Thank you for reviewing this patch! I think this patch can be
back-patched to 9.4 as Simon mentioned.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Moin,

On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote:
> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>
>>> After investigation, I found out that my previous patch was wrong
>>> direction. I should have changed XLogSendLogical() so that we can
>>> check the read LSN and set WalSndCaughtUp = true even after read a
>>> record without wait. Attached updated patch passed 'make check-world'.
>>> Please review it.
>>>
>>
>> Hi,
>>
>> This version looks good to me and seems to be in line with what we do in
>> physical replication.
>>
>> Marking as ready for committer.

(Sorry Masahiko, you'll get this twice, as I fumbled the reply button.)

I have not verified that the comment and/or code are correct, just a grammar fix:

+        /*
+         * If we've sent a record is at or beyond the flushed point, then
+         * we're caught up.

That should read more like this:

"If we've sent a record that is at or beyond the flushed point, we have
caught up."

All the best,

Tels
On Tue, Dec 26, 2017 at 6:19 PM, Tels <nospam-pg-abuse@bloodgate.com> wrote:
> Moin,
>
> On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote:
>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>>
>>>> After investigation, I found out that my previous patch was wrong
>>>> direction. I should have changed XLogSendLogical() so that we can
>>>> check the read LSN and set WalSndCaughtUp = true even after read a
>>>> record without wait. Attached updated patch passed 'make check-world'.
>>>> Please review it.
>>>>
>>>
>>> Hi,
>>>
>>> This version looks good to me and seems to be in line with what we do in
>>> physical replication.
>>>
>>> Marking as ready for committer.
>
> (Sorry Masahiko, you'll get this twice, as fumbled the reply button.)
>
> I have not verifed that comment and/or code are correct, just a grammar fix:
>
> +        /*
> +         * If we've sent a record is at or beyond the flushed point, then
> +         * we're caught up.
>
> That should read more like this:
>
> "If we've sent a record that is at or beyond the flushed point, we have
> caught up."
>

Thank you for reviewing the patch!

Actually, that comment is inspired by the comment just below it. ISTM
it's better to fix both if their grammar is not appropriate.

+
+       /*
+        * If we've sent a record that is at or beyond the flushed point, we
+        * have caught up.
+        */
+       if (sentPtr >= GetFlushRecPtr())
+           WalSndCaughtUp = true;
    }
    else
    {
        /*
         * If the record we just wanted read is at or beyond the flushed
         * point, then we're caught up.
         */
        if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
        {

Attached an updated patch. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Moin,

On Tue, December 26, 2017 5:26 am, Masahiko Sawada wrote:
> On Tue, Dec 26, 2017 at 6:19 PM, Tels <nospam-pg-abuse@bloodgate.com>
> wrote:
>> Moin,
>>
>> On Mon, December 25, 2017 7:26 pm, Masahiko Sawada wrote:
>>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> This version looks good to me and seems to be in line with what we do
>>>> in physical replication.
>>>>
>>>> Marking as ready for committer.
>>
>> (Sorry Masahiko, you'll get this twice, as fumbled the reply button.)
>>
>> I have not verifed that comment and/or code are correct, just a grammar
>> fix:
>>
>> +        /*
>> +         * If we've sent a record is at or beyond the flushed point, then
>> +         * we're caught up.
>>
>> That should read more like this:
>>
>> "If we've sent a record that is at or beyond the flushed point, we have
>> caught up."
>>
>
> Thank you for reviewing the patch!
> Actually, that comment is inspired by the comment just below comment.
> ISTM it's better to fix both if grammar of them is not appropriate.

Oh yes. Your attached version reads fine to me.

All the best,

Tels
On 26 December 2017 at 00:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 21/11/17 22:06, Masahiko Sawada wrote:
>>>
>>> After investigation, I found out that my previous patch was wrong
>>> direction. I should have changed XLogSendLogical() so that we can
>>> check the read LSN and set WalSndCaughtUp = true even after read a
>>> record without wait. Attached updated patch passed 'make check-world'.
>>> Please review it.
>>>
>>
>> Hi,
>>
>> This version looks good to me and seems to be in line with what we do in
>> physical replication.
>>
>> Marking as ready for committer.
>>
>
> Thank you for reviewing this patch!

The patch calls this AFTER processing the record

    if (sentPtr >= GetFlushRecPtr())

but it seems better to call GetFlushRecPtr() before we process the
record, otherwise the flush pointer might have moved forwards while we
process the record and we wouldn't report having caught up. (Physical
replication works like that.)

New patch version attached for discussion before commit. (v4)

I'd rather not call it at all at that point, though, so if we made
RecentFlushPtr static at the module level rather than within
WalSndWaitForWal we could use it here also. That's a bit more invasive
for backpatching, so I have not implemented that here.

Overall, I find setting WalSndCaughtUp = false at the top of
XLogSendLogical() to be incredibly ugly and I would like to remove it.
It can't be correct to have a static status variable that oscillates
between false and true with every record. This seems to be done
because of the lack of a logical initialization call. Petr? Peter?
Version with this removed (v4alt2).

I've removed the edit that fusses over English grammar: both ways are correct.

> I think this patch can be
> back-patched to 9.4 as Simon mentioned.

This patch appears to cause this DEBUG1 message

    "standby \"%s\" has now caught up with primary"

which probably isn't the right message, but might be OK to backpatch.

Thoughts on better wording?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
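For concreteness, here is a rough sketch of what XLogSendLogical() looks
like with that ordering, pieced together from the hunks quoted in this
thread. It is an approximation for discussion only, not the verbatim
patch: keepalive and shutdown handling is omitted, and WalSndCaughtUp,
sentPtr, logical_decoding_ctx and logical_startptr are the walsender
globals referred to above.

static void
XLogSendLogical(void)
{
    XLogRecord *record;
    char       *errm;
    XLogRecPtr  flushPtr;

    /* Assume "not caught up" until the check at the bottom proves otherwise. */
    WalSndCaughtUp = false;

    record = XLogReadRecord(logical_decoding_ctx->reader,
                            logical_startptr, &errm);
    logical_startptr = InvalidXLogRecPtr;

    if (errm != NULL)
        elog(ERROR, "%s", errm);

    /*
     * Capture the flush pointer before decoding the record, so that WAL
     * flushed while we are busy decoding cannot keep pushing the target
     * ahead of us and leave the reported state stuck in "catchup".
     */
    flushPtr = GetFlushRecPtr();

    if (record != NULL)
    {
        LogicalDecodingProcessRecord(logical_decoding_ctx,
                                     logical_decoding_ctx->reader);
        sentPtr = logical_decoding_ctx->reader->EndRecPtr;
    }

    /* If we've sent a record at or beyond the flushed point, we have caught up. */
    if (sentPtr >= flushPtr)
        WalSndCaughtUp = true;
}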
On Sun, Jan 7, 2018 at 7:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 December 2017 at 00:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Tue, Dec 26, 2017 at 1:10 AM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> This version looks good to me and seems to be in line with what we do in
>>> physical replication.
>>>
>>> Marking as ready for committer.
>>>
>>
>> Thank you for reviewing this patch!
>
> The patch calls this AFTER processing the record
>     if (sentPtr >= GetFlushRecPtr())
> but it seems better to call GetFlushRecPtr() before we process the
> record, otherwise the flush pointer might have moved forwards while we
> process the record and it wouldn't catch up. (Physical replication
> works like that).

Agreed.

> New patch version attached for discussion before commit. (v4)

The v4 patch looks good to me.

> I'd rather not call it at all at that point though, so if we made
> RecentFlushPtr static at the module level rather than within
> WalSndWaitForWal we could use it here also. That's a bit more invasive
> for backpatching, so not implemented that here.
>
> Overall, I find setting WalSndCaughtUp = false at the top of
> XLogSendLogical() to be incredibly ugly and I would like to remove it.
> It can't be correct to have a static status variable that oscillates
> between false and true with every record. This seems to be done
> because of the lack of a logical initialization call. Petr? Peter?
> Version with this removed (v4alt2)
>
> I've removed the edit that fusses over English grammar: both ways are correct.
>
>> I think this patch can be
>> back-patched to 9.4 as Simon mentioned.
>
> This patch appears to cause this DEBUG1 message
>
> "standby \"%s\" has now caught up with primary"
>
> which probably isn't the right message, but might be OK to backpatch.
>
> Thoughts on better wording?
>

I think that this DEBUG1 message appears when any WAL is sent after
catching up, even without this patch. This patch just makes the message
appear at the proper time. Or am I missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>> This patch appears to cause this DEBUG1 message
>>
>> "standby \"%s\" has now caught up with primary"
>>
>> which probably isn't the right message, but might be OK to backpatch.
>>
>> Thoughts on better wording?
>>
>
> I think that this DEBUG1 message appears when sent any WAL after
> caught up even without this patch. This patch makes this message
> appear at a properly timing. Or am I missing something?

We're not talking about standbys, so the message is incorrect.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>>> This patch appears to cause this DEBUG1 message
>>>
>>> "standby \"%s\" has now caught up with primary"
>>>
>>> which probably isn't the right message, but might be OK to backpatch.
>>>
>>> Thoughts on better wording?
>>>
>>
>> I think that this DEBUG1 message appears when sent any WAL after
>> caught up even without this patch. This patch makes this message
>> appear at a properly timing. Or am I missing something?
>
> We're not talking about standbys, so the message is incorrect.
>

Ah, I understand. How about "\"%s\" has now caught up with upstream
server"?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> We're not talking about standbys, so the message is incorrect.
>
> Ah, I understood. How about "\"%s\" has now caught up with upstream
> server"?

+1.

--
Michael
On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>> On Sun, Jan 14, 2018 at 12:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 9 January 2018 at 04:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> We're not talking about standbys, so the message is incorrect.
>>
>> Ah, I understood. How about "\"%s\" has now caught up with upstream
>> server"?
>
> +1.

"upstream" is what I would have suggested, so +1 here also.

Will commit.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>> Ah, I understood. How about "\"%s\" has now caught up with upstream
>>> server"?
>>
>> +1.
>
> upstream is what I would have suggested, so +1 here also.
>
> Will commit.
>

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>> Ah, I understood. How about "\"%s\" has now caught up with upstream
>>> server"?
>>
>> +1.
>
> upstream is what I would have suggested, so +1 here also.
>
> Will commit.

ping?

Regards,

--
Fujii Masao
On Fri, Mar 30, 2018 at 5:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>>> Ah, I understood. How about "\"%s\" has now caught up with upstream
>>>> server"?
>>>
>>> +1.
>>
>> upstream is what I would have suggested, so +1 here also.
>>
>> Will commit.
>
> ping?
>

Should I add this item to "Older Bugs" of the open items, since we
regarded it as a bug?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 4/10/18 6:14 AM, Masahiko Sawada wrote:
> On Fri, Mar 30, 2018 at 5:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 16 January 2018 at 06:21, Michael Paquier <michael.paquier@gmail.com> wrote:
>>>> On Tue, Jan 16, 2018 at 10:40:43AM +0900, Masahiko Sawada wrote:
>>>>> Ah, I understood. How about "\"%s\" has now caught up with upstream
>>>>> server"?
>>>>
>>>> +1.
>>>
>>> upstream is what I would have suggested, so +1 here also.
>>>
>>> Will commit.
>>
>> ping?
>>
>
> Should I add this item to "Older Bugs" of the open item since we
> regarded it as a bug?

I'm going to reclassify this as a bug since everyone seems to agree it
is one.

Simon, are you still planning to commit this?

Thanks,

--
-David
david@pgmasters.net
On Tue, Apr 10, 2018 at 11:11 PM, David Steele <david@pgmasters.net> wrote:
> On 4/10/18 6:14 AM, Masahiko Sawada wrote:
>> On Fri, Mar 30, 2018 at 5:37 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Wed, Jan 17, 2018 at 2:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> upstream is what I would have suggested, so +1 here also.
>>>>
>>>> Will commit.
>>>
>>> ping?
>>>
>>
>> Should I add this item to "Older Bugs" of the open item since we
>> regarded it as a bug?
>
> I'm going to reclassify this as a bug since everyone seems to agree it
> is one.

I've added this to Open Items so as not to forget.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, May 21, 2018 at 10:15:02AM +0900, Masahiko Sawada wrote:
> I've added this to Open Items so as not to forget.

This concerns v10 as well, so it's not actually an open item... well,
it was an open item last year. The last set of patches is from Simon
here:
https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com

Simon, do you feel confident with your patch? If yes, could you finish
wrapping it? I am getting myself familiar with the problem as this has
been around for some time now, so I am reviewing the thing as well and
then I can board the ship.

--
Michael
On Thu, Jul 05, 2018 at 05:13:27PM +0900, Michael Paquier wrote:
> This concerns as well v10, so that's not actually an open item...
> Well, it was an open item last year. The last set of patches is from
> Simon here:
> https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com

logical_repl_caught_up_v4alt2.patch is actually incorrect after I tested
the thing, and logical_repl_caught_up_v4.patch gets the correct call.

> Simon, do you feel confident with your patch? If yes, could you finish
> wrapping it? I am getting myself familiar with the problem as this has
> been around for some time now so I am reviewing the thing as well and
> then I can board the ship..

Okay, I have spent some time today looking at this patch, and the error
is very easy to reproduce once you do the following in the TAP tests:
1) Stop and start the publisher once in one of the tests of
src/test/subscription.
2) Enforce wait_for_catchup() to check for state = 'streaming'.
And then you would see the tests waiting until the timeout is reached
and then dying.

I would be inclined to add those tests in the final patch, the
disadvantage being that 1) makes one of the test scripts a bit longer,
but it can reproduce the failures most of the time. Having 2) is
actually nice for physical replication as the tests in
src/test/recovery/ use wait_for_catchup() in various ways.

Some other notes about the patch:
- I switched the error message in WalSndLoop as mentioned upthread for
  nodes catching up, aka no more "primary" but "upstream server".
- Added a note about using only GetFlushRecPtr in XLogSendLogical as
  logical decoding cannot be used on standby nodes. If one day logical
  decoding gets support on standby then this would need an update as
  well.

Does this look fine to all the folks involved in this thread? It is
Friday afternoon here so my brain is getting fried, but I can finish
wrapping up this patch at the beginning of next week if there are no
objections. At a quick glance this would indeed need a backpatch down to
9.4, but I have not spent time testing those configurations yet.

--
Michael
On Fri, Jul 6, 2018 at 2:21 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jul 05, 2018 at 05:13:27PM +0900, Michael Paquier wrote:
>> This concerns as well v10, so that's not actually an open item...
>> Well, it was an open item last year. The last set of patches is from
>> Simon here:
>> https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com
>
> logical_repl_caught_up_v4alt2.patch is actually incorrect after I tested
> the thing, and that logical_repl_caught_up_v4.patch gets the correct
> call.
>
>> Simon, do you feel confident with your patch? If yes, could you finish
>> wrapping it? I am getting myself familiar with the problem as this has
>> been around for some time now so I am reviewing the thing as well and
>> then I can board the ship..
>
> Okay, I have spent some time today looking at this patch, and the error
> is very easy to reproduce once you do that in the TAP tests:
> 1) Stop and start once the publisher in one of the tests of
> src/test/subscription.
> 2) Enforce wait_for_catchup() to check for state = 'streaming'.
> And then you would see the tests waiting until timeout is reached and
> then die.
>
> I would be inclined to add those tests in the final patch, the
> disadvantage being that 1) makes one of the test scripts a bit longer,
> but it can reproduce the failures most of the time. Having 2) is
> actually nice for physical replication as the tests in
> src/test/recovery/ use wait_for_catchup() in various ways.
>
> Some other notes about the patch:
> - I switched the error message in WalSndLoop as mentioned upthread for
> nodes catching up, aka no more "primary" but "upstream server".
> - Added a note about using only GetFlushRecPtr in XLogSendLogical as
> logical decoding cannot be used on standby nodes. If one day logical
> decoding gets supports on standby then this would need an update as
> well.
>
> Does this look fine to all the folks involved in this thread? It is
> Friday afternoon here so my brain is getting fried, but I can finish
> wrapping up this patch at the beginning of next week if there are no
> objections. At quick glance this indeed would need a backpatch down to
> 9.4 but I have not spent time testing those configurations yet.

Thank you for updating the patch. The patch looks fine to me, and I
agree with all changes you made.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jul 09, 2018 at 05:25:55PM +0900, Masahiko Sawada wrote:
> Thank you for updating the patch. The patch looks fine to me, and I
> agree with all changes you made.

Thanks. If there are no objections, then I will try to wrap this stuff
on Thursday my time.

--
Michael
On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote:
> Thanks. If there are no objections, then I will try to wrap this stuff
> on Thursday my time.

And done, down to 9.4.

--
Michael
On Thu, Jul 12, 2018 at 10:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 10, 2018 at 10:14:35AM +0900, Michael Paquier wrote:
>> Thanks. If there are no objections, then I will try to wrap this stuff
>> on Thursday my time.
>
> And done down to 9.4.

Thank you!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center