Thread: pg_recvlogical prints bogus error when interrupted

pg_recvlogical prints bogus error when interrupted

From
Andres Freund
Date:
Hi,

While reviewing
https://postgr.es/m/CAD21AoBe2o2D%3Dxyycsxw2bQOD%3DzPj7ETuJ5VYGN%3DdpoTiCMRJQ%40mail.gmail.com
I noticed that pg_recvlogical prints
"pg_recvlogical: error: unexpected termination of replication stream: "

when signalled with SIGINT/TERM.

Oddly enough, that looks to have "always" been the case, even though clearly
the code tried to make provisions for a different outcome.


It looks to me like all that's needed is to gate the block printing the
message with an !time_to_abort.


I also then noticed that we don't fsync the output file in cases of errors -
that seems wrong to me? Looks to me like that block should be moved till after
the error:?

Greetings,

Andres Freund



Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Thu, Oct 20, 2022 at 3:10 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> While reviewing
> https://postgr.es/m/CAD21AoBe2o2D%3Dxyycsxw2bQOD%3DzPj7ETuJ5VYGN%3DdpoTiCMRJQ%40mail.gmail.com
> I noticed that pg_recvlogical prints
> "pg_recvlogical: error: unexpected termination of replication stream: "
>
> when signalled with SIGINT/TERM.
>
> Oddly enough, that looks to have "always" been the case, even though clearly
> the code tried to make provisions for a different outcome.
>
>
> It looks to me like all that's needed is to gate the block printing the
> message with an !time_to_abort.

+1. How about emitting a message like its friend pg_receivewal, like
the attached patch?

> I also then noticed that we don't fsync the output file in cases of errors -
> that seems wrong to me? Looks to me like that block should be moved till after
> the error:?

How about something like the attached patch?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Kyotaro Horiguchi
Date:
At Thu, 20 Oct 2022 13:28:45 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Thu, Oct 20, 2022 at 3:10 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > While reviewing
> > https://postgr.es/m/CAD21AoBe2o2D%3Dxyycsxw2bQOD%3DzPj7ETuJ5VYGN%3DdpoTiCMRJQ%40mail.gmail.com
> > I noticed that pg_recvlogical prints
> > "pg_recvlogical: error: unexpected termination of replication stream: "
> >
> > when signalled with SIGINT/TERM.
> >
> > Oddly enough, that looks to have "always" been the case, even though clearly
> > the code tried to make provisions for a different outcome.
> >
> >
> > It looks to me like all that's needed is to gate the block printing the
> > message with an !time_to_abort.

+1

> +1. How about emitting a message like its friend pg_receivewal, like
> the attached patch?

I'm not a fan of treating SIGINT as an error in this case. It calls
prepareToTerminate() when time_to_abort and everything goes fine after
then. So I think we should do the same thing after receiving an
interrupt.  This also does file-sync naturally as a part of normal
shutdown.  I'm also not a fan of doing fsync at error.

> > I also then noticed that we don't fsync the output file in cases of errors -
> > that seems wrong to me? Looks to me like that block should be moved till after
> > the error:?
> 
> How about something like the attached patch?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Fri, Oct 21, 2022 at 7:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> > +1. How about emitting a message like its friend pg_receivewal, like
> > the attached patch?
>
> I'm not a fan of treating SIGINT as an error in this case. It calls
> prepareToTerminate() when time_to_abort and everything goes fine after
> then. So I think we should do the same thing after receiving an
> interrupt.  This also does file-sync naturally as a part of normal
> shutdown.  I'm also not a fan of doing fsync at error.

I think the pg_recvlogical can gracefully exit on both SIGINT and
SIGTERM to keep things simple.

> > > I also then noticed that we don't fsync the output file in cases of errors -
> > > that seems wrong to me? Looks to me like that block should be moved till after
> > > the error:?
> >
> > How about something like the attached patch?

The attached patch (pg_recvlogical_graceful_interrupt.text) has a
couple of problems, I believe. We're losing prepareToTerminate() with
keepalive true and we're not skipping pg_log_error("unexpected
termination of replication stream: %s" upon interrupt, after all we're
here discussing how to avoid it.

I came up with the attached v2 patch, please have a look.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Mon, Oct 24, 2022 at 8:15 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Oct 21, 2022 at 7:52 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > > +1. How about emitting a message like its friend pg_receivewal, like
> > > the attached patch?
> >
> > I'm not a fan of treating SIGINT as an error in this case. It calls
> > prepareToTerminate() when time_to_abort and everything goes fine after
> > then. So I think we should do the same thing after receiving an
> > interrupt.  This also does file-sync naturally as a part of normal
> > shutdown.  I'm also not a fan of doing fsync at error.
>
> I think the pg_recvlogical can gracefully exit on both SIGINT and
> SIGTERM to keep things simple.
>
> > > > I also then noticed that we don't fsync the output file in cases of errors -
> > > > that seems wrong to me? Looks to me like that block should be moved till after
> > > > the error:?
> > >
> > > How about something like the attached patch?
>
> The attached patch (pg_recvlogical_graceful_interrupt.text) has a
> couple of problems, I believe. We're losing prepareToTerminate() with
> keepalive true and we're not skipping pg_log_error("unexpected
> termination of replication stream: %s" upon interrupt, after all we're
> here discussing how to avoid it.
>
> I came up with the attached v2 patch, please have a look.

FWIW, I added it to CF - https://commitfest.postgresql.org/40/3966/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pg_recvlogical prints bogus error when interrupted

From
Andres Freund
Date:
Hi,

On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:
> I came up with the attached v2 patch, please have a look.

Thanks for working on this!


> +        /* When we get SIGINT/SIGTERM, we exit */
> +        if (ready_to_exit)
> +        {
> +            /*
> +             * Try informing the server about our exit, but don't wait around
> +             * or retry on failure.
> +             */
> +            (void) PQputCopyEnd(conn, NULL);
> +            (void) PQflush(conn);
> +            time_to_abort = ready_to_exit;

This doesn't strike me as great - because the ready_to_exit isn't checked in
the loop around StreamLogicalLog(), we'll reconnect if something else causes
StreamLogicalLog() to return.

Why do we need both time_to_abort and ready_to_exit? Perhaps worth noting that
time_to_abort is still an sig_atomic_t, but isn't modified in a signal
handler, which seems a bit unnecessarily confusing.

Greetings,

Andres Freund



Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Fri, Oct 28, 2022 at 4:41 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:
>
>
> > +             /* When we get SIGINT/SIGTERM, we exit */
> > +             if (ready_to_exit)
> > +             {
> > +                     /*
> > +                      * Try informing the server about our exit, but don't wait around
> > +                      * or retry on failure.
> > +                      */
> > +                     (void) PQputCopyEnd(conn, NULL);
> > +                     (void) PQflush(conn);
> > +                     time_to_abort = ready_to_exit;
>
> This doesn't strike me as great - because the ready_to_exit isn't checked in
> the loop around StreamLogicalLog(), we'll reconnect if something else causes
> StreamLogicalLog() to return.

Fixed.

> Why do we need both time_to_abort and ready_to_exit?

Intention to have ready_to_exit is to be able to distinguish between
SIGINT/SIGTERM and aborting when endpos is reached so that necessary
code is skipped/executed and proper logs are printed.

> Perhaps worth noting that
> time_to_abort is still an sig_atomic_t, but isn't modified in a signal
> handler, which seems a bit unnecessarily confusing.

time_to_abort is just a static variable, no?

+static bool    time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;

Please see the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Cary Huang
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hello

The patch applies and tests fine. I like the way to have both ready_to_exit and time_to_abort variables to control the
exitsequence. I think the (void) cast can be removed in front of PQputCopyEnd(), PQflush for consistency purposes as it
doesnot give warnings and everywhere else does not have those casts. 
 

thank you
Cary

Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Mon, Oct 24, 2022 at 08:15:11AM +0530, Bharath Rupireddy wrote:
> The attached patch (pg_recvlogical_graceful_interrupt.text) has a
> couple of problems, I believe. We're losing prepareToTerminate() with
> keepalive true and we're not skipping pg_log_error("unexpected
> termination of replication stream: %s" upon interrupt, after all we're
> here discussing how to avoid it.
>
> I came up with the attached v2 patch, please have a look.

This thread has slipped through the feature freeze deadline.  Would
people be OK to do something now on HEAD?  A backpatch is also in
order, IMO, as the current behavior looks confusing under SIGINT and
SIGTERM.
--
Michael

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Tue, Apr 11, 2023 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Oct 24, 2022 at 08:15:11AM +0530, Bharath Rupireddy wrote:
> > The attached patch (pg_recvlogical_graceful_interrupt.text) has a
> > couple of problems, I believe. We're losing prepareToTerminate() with
> > keepalive true and we're not skipping pg_log_error("unexpected
> > termination of replication stream: %s" upon interrupt, after all we're
> > here discussing how to avoid it.
> >
> > I came up with the attached v2 patch, please have a look.
>
> This thread has slipped through the feature freeze deadline.  Would
> people be OK to do something now on HEAD?  A backpatch is also in
> order, IMO, as the current behavior looks confusing under SIGINT and
> SIGTERM.

IMO, +1 for HEAD/PG16 and +0.5 for backpatching as it may not be so
critical to backpatch all the way down. What may happen without this
patch is that the output file isn't fsync-ed upon SIGINT/SIGTERM.
Well, is it a critical issue on production servers?

On Fri, Apr 7, 2023 at 5:12 AM Cary Huang <cary.huang@highgo.ca> wrote:
>
> The following review has been posted through the commitfest application:
>
> The patch applies and tests fine. I like the way to have both ready_to_exit and time_to_abort variables to control
theexit sequence. I think the (void) cast can be removed in front of PQputCopyEnd(), PQflush for consistency purposes
asit does not give warnings and everywhere else does not have those casts. 

Thanks for reviewing. I removed the (void) casts like elsewhere in the
code, however, I didn't change such casts in prepareToTerminate() to
not create a diff.

I'm attaching the v4 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Thu, Apr 27, 2023 at 11:24:52AM +0530, Bharath Rupireddy wrote:
> IMO, +1 for HEAD/PG16 and +0.5 for backpatching as it may not be so
> critical to backpatch all the way down. What may happen without this
> patch is that the output file isn't fsync-ed upon SIGINT/SIGTERM.
> Well, is it a critical issue on production servers?

It is also true that it's been this way for years with basically
nobody complaining outside this thread.  So there is also an argument
about leaving v16 out of the picture, and do that only in 17~ to avoid
playing with the branch stability more than necessary?  I see 7 open
items as of today, and there are TAP tests linked to pg_recvlogical.
That should be OK, because none of these tests rely on specific
signals, but the buildfarm has some weird setups, as well..
--
Michael

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
"Tristan Partin"
Date:
On Thu Apr 27, 2023 at 12:54 AM CDT, Bharath Rupireddy wrote:
> Thanks for reviewing. I removed the (void) casts like elsewhere in the
> code, however, I didn't change such casts in prepareToTerminate() to
> not create a diff.
>
> I'm attaching the v4 patch for further review.

Bharath,

I signed up to review the patch for the commitfest. The patch looks
pretty good to me, but I would like to come to a conclusion on what
Andres posted earlier in the thread.

> Why do we need both time_to_abort and ready_to_exit?

I am trying to understand why we need both as well. Maybe I am missing
something important :).

> /* It is not unexepected termination error when Ctrl-C'ed. */

My only other comment is that it would be nice to have the word "an"
before unexpected.

--
Tristan Partin
Neon (https://neon.tech)



Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Thu, Jul 06, 2023 at 10:29:10AM -0500, Tristan Partin wrote:
> On Thu Apr 27, 2023 at 12:54 AM CDT, Bharath Rupireddy wrote:
>> Why do we need both time_to_abort and ready_to_exit?
>
> I am trying to understand why we need both as well. Maybe I am missing
> something important :).

As StreamLogicalLog() states once it leaves its main loop because
time_to_abort has been switched to true, we want a clean exit.  I
think that this patch is just a more complicated way to avoid doing
twice the operations done by prepareToTerminate().  So how about
moving the prepareToTerminate() call outside the main streaming loop
and call it when time_to_abort is true?  Then, I would suggest to
change the keepalive argument of prepareToTerminate() to an enum able
to handle three values to log the reason why the tool is stopping: the
end of WAL, an interruption or a keepalive when logging.  There are
two of them now, but we want a third mode for the signals.

> > /* It is not unexepected termination error when Ctrl-C'ed. */
>
> My only other comment is that it would be nice to have the word "an"
> before unexpected.

s/unexepected/unexpected/.  Still, it seems to me that we don't need
this comment.
--
Michael

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Mon, Jul 10, 2023 at 01:44:45PM +0900, Michael Paquier wrote:
> As StreamLogicalLog() states once it leaves its main loop because
> time_to_abort has been switched to true, we want a clean exit.  I
> think that this patch is just a more complicated way to avoid doing
> twice the operations done by prepareToTerminate().  So how about
> moving the prepareToTerminate() call outside the main streaming loop
> and call it when time_to_abort is true?  Then, I would suggest to
> change the keepalive argument of prepareToTerminate() to an enum able
> to handle three values to log the reason why the tool is stopping: the
> end of WAL, an interruption or a keepalive when logging.  There are
> two of them now, but we want a third mode for the signals.

It took me some time to come back to this one, but attached is what I
had in mind.  This stuff has three reasons to stop: keepalive, end LSN
or signal.  This makes the code easier to follow.

Thoughts or comments?
--
Michael

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Wed, Jul 19, 2023 at 8:04 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> It took me some time to come back to this one, but attached is what I
> had in mind.  This stuff has three reasons to stop: keepalive, end LSN
> or signal.  This makes the code easier to follow.
>
> Thoughts or comments?

Thanks. I have some comments on v5:

1. I don't think we need a stop_lsn variable, the cur_record_lsn can
help if it's defined outside the loop. With this, the patch can
further be simplified as attached v6.

2. And, I'd prefer not to assume the stop_reason as signal by default.
While it works for now because the remaining place where time_to_abort
is set to true is only in the signal handler, it is not extensible, if
there's another exit condition comes in future that sets time_to_abort
to true.

3. pg_log_info("end position %X/%X reached on signal", .... For
signal, end position is a bit vague wording and I think we can just
say pg_log_info("received interrupt signal, exiting"); like
pg_receivewal. We really can't have a valid stop_lsn for signal exit
because that depends on when signal arrives in the while loop. If at
all, someone wants to know the last received LSN - they can look at
the other messages that pg_recvlogical emits - pg_recvlogical:
confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0 (slot myslot).

4. With v5, it was taking a while to exit after the first CTRL+C, see
multiple CTRL+Cs at the end:
ubuntu::~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
--file=lsub1.data --start --verbose
pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
^Cpg_recvlogical: end position 0/2867D70 reached on signal
^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C^C^C^C^C^C^C^C^C^C

5. FWIW, on HEAD we'd get the following and the patch emits a better messaging:
ubuntu:~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
--file=lsub1.data --start --dbname='host=localhost port=5432
dbname=postgres user=ubuntu' --verbose
pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
^Cpg_recvlogical: error: unexpected termination of replication stream:

Attaching v6 patch with the above changes to v6. Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Wed, Jul 19, 2023 at 12:07 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> 4. With v5, it was taking a while to exit after the first CTRL+C, see
> multiple CTRL+Cs at the end:
> ubuntu::~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
> --file=lsub1.data --start --verbose
> pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
> pg_recvlogical: streaming initiated
> pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
> pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
> (slot lsub1_repl_slot)
> pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
> (slot lsub1_repl_slot)
> pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
> (slot lsub1_repl_slot)
> ^Cpg_recvlogical: end position 0/2867D70 reached on signal
> ^C^C^C^C^C^C^C^C^C^C^C^C
> ^C^C^C^C^C^C^C^C^C^C^C^C

I think the delay is expected for the reason specified below and is
not because of any of the changes in v5. As far as CTRL+C is
concerned, it is a clean exit and hence we can't escape the while(1)
loop.

/*
         * We're doing a client-initiated clean exit and have sent CopyDone to
         * the server. Drain any messages, so we don't miss a last-minute
         * ErrorResponse. The walsender stops generating XLogData records once
         * it sees CopyDone, so expect this to finish quickly. After CopyDone,
         * it's too late for sendFeedback(), even if this were to take a long
         * time. Hence, use synchronous-mode PQgetCopyData().
         */
        while (1)
        {

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Wed, Jul 19, 2023 at 12:07:21PM +0530, Bharath Rupireddy wrote:
> 1. I don't think we need a stop_lsn variable, the cur_record_lsn can
> help if it's defined outside the loop. With this, the patch can
> further be simplified as attached v6.

Okay by me.

> 2. And, I'd prefer not to assume the stop_reason as signal by default.
> While it works for now because the remaining place where time_to_abort
> is set to true is only in the signal handler, it is not extensible, if
> there's another exit condition comes in future that sets time_to_abort
> to true.

Yeah, I was also wondering whether this should be set by the signal
handler in this case while storing the reason statically on a specific
sig_atomic_t.

> 3. pg_log_info("end position %X/%X reached on signal", .... For
> signal, end position is a bit vague wording and I think we can just
> say pg_log_info("received interrupt signal, exiting"); like
> pg_receivewal. We really can't have a valid stop_lsn for signal exit
> because that depends on when signal arrives in the while loop. If at
> all, someone wants to know the last received LSN - they can look at
> the other messages that pg_recvlogical emits - pg_recvlogical:
> confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0 (slot myslot).

+    case STREAM_STOP_SIGNAL:
+            pg_log_info("received interrupt signal, exiting");
+            break;

Still it is useful to report the location we have finished with when
stopping on a signal, no?  Why couldn't we use "lsn" here, aka
cur_record_lsn?

> 4. With v5, it was taking a while to exit after the first CTRL+C, see
> multiple CTRL+Cs at the end:
> ubuntu::~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
> --file=lsub1.data --start --verbose
> pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
> pg_recvlogical: streaming initiated
> pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
> pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
> (slot lsub1_repl_slot)
> pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
> (slot lsub1_repl_slot)
> pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
> (slot lsub1_repl_slot)
> ^Cpg_recvlogical: end position 0/2867D70 reached on signal
> ^C^C^C^C^C^C^C^C^C^C^C^C
> ^C^C^C^C^C^C^C^C^C^C^C^C

Isn't that where we'd need to look at a long change but we need to
stop cleanly?  That sounds expected to me.
--
Michael

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Bharath Rupireddy
Date:
On Wed, Jul 19, 2023 at 1:41 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> > 3. pg_log_info("end position %X/%X reached on signal", .... For
> > signal, end position is a bit vague wording and I think we can just
> > say pg_log_info("received interrupt signal, exiting"); like
> > pg_receivewal. We really can't have a valid stop_lsn for signal exit
> > because that depends on when signal arrives in the while loop. If at
> > all, someone wants to know the last received LSN - they can look at
> > the other messages that pg_recvlogical emits - pg_recvlogical:
> > confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0 (slot myslot).
>
> +    case STREAM_STOP_SIGNAL:
> +            pg_log_info("received interrupt signal, exiting");
> +            break;
>
> Still it is useful to report the location we have finished with when
> stopping on a signal, no?  Why couldn't we use "lsn" here, aka
> cur_record_lsn?

Printing LSN on signal exit won't be correct - if signal is received
before cur_record_lsn gets assigned, we will be showing an old LSN if
it was previously assigned or invalid LSN if it wasn't assigned
previously. Signal arrival and processing are indeterministic, so we
can't always show the right info. Instead, we can just be simple in
the messaging without an lsn like pg_receivewal.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Wed, Jul 19, 2023 at 01:33:15PM +0530, Bharath Rupireddy wrote:
> I think the delay is expected for the reason specified below and is
> not because of any of the changes in v5. As far as CTRL+C is
> concerned, it is a clean exit and hence we can't escape the while(1)
> loop.

Yes, that's also what I am expecting.  That's costly when replaying a
large change chunk, but we also want a clean exit on a signal as the
code comments document, so..
--
Michael

Attachment

Re: pg_recvlogical prints bogus error when interrupted

From
Michael Paquier
Date:
On Wed, Jul 19, 2023 at 01:46:02PM +0530, Bharath Rupireddy wrote:
> Printing LSN on signal exit won't be correct - if signal is received
> before cur_record_lsn gets assigned, we will be showing an old LSN if
> it was previously assigned or invalid LSN if it wasn't assigned
> previously. Signal arrival and processing are indeterministic, so we
> can't always show the right info.

I think that there's an argument to be made because cur_record_lsn
will be set before coming back to the beginning of the replay loop
when a stop is triggered by a signal.

> Instead, we can just be simple in the messaging without an lsn like
> pg_receivewal.

Anyway, I'm OK with simple for now as it looks that you don't feel
about that either, and the patch is enough to fix the report of this
thread.  And one would get periodic information in --verbose mode
depending the sync message frequency, as well.

So, I have applied v6 after fixing two issues with it:
- I have kept the reason as an argument of prepareToTerminate(), to be
able to take advantage of the switch structure where compilers would
generate a warning if adding a new value to StreamStopReason.
- cur_record_lsn was not initialized at the beginning of
StreamLogicalLog(), which is where the compiler complains about the
case of receiving a signal before entering in the replay loop, after
establising a connection.
--
Michael

Attachment