Thread: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo

Hi,

I had reported a possible subscription 'disable_on_error' bug found
while reviewing another patch.

I am including my initial report and Nisha's analysis again here so
that this topic has its own thread.

==================
INITIAL REPORT [1]
==================

...
I see now that any ALTER of the subscription's connection, even to
some value that fails, will restart a new worker (like ALTER of any
other subscription parameters). For a bad connection, it will continue
to relaunch-worker/ERROR over and over. e.g.

----------
test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG:  logical
replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG:  background worker "logical
replication apply worker" (PID 11274) exited with exit code 1
2024-01-17 09:34:33.669 AEDT [11391] LOG:  logical replication apply
worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG:  background worker "logical
replication apply worker" (PID 11391) exited with exit code 1
etc...
----------

While experimenting with the bad connection ALTER I also tried setting
'disable_on_error' like below:

ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.

That looks like it may be a bug. Thoughts?

=====================
ANALYSIS BY NISHA [2]
=====================

Ideally, if the already running apply worker in
"LogicalRepApplyLoop()" has any exception/error it will be handled and
the subscription will be disabled if 'disable_on_error' is set -

start_apply(XLogRecPtr origin_startpos)
{
PG_TRY();
{
LogicalRepApplyLoop(origin_startpos);
}
PG_CATCH();
{
if (MySubscription->disableonerr)
DisableSubscriptionAndExit();
...

What is happening in this case is that the control reaches the
function - run_apply_worker() -> start_apply() -> LogicalRepApplyLoop
-> maybe_reread_subscription()
...
/*
* Exit if any parameter that affects the remote connection was changed.
* The launcher will start a new worker but note that the parallel apply
* worker won't restart if the streaming option's value is changed from
* 'parallel' to any other value or the server decides not to stream the
* in-progress transaction.
*/
if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
...

and it sees a change in the parameter and calls apply_worker_exit().
This will exit the current process, without throwing an exception to
the caller and the postmaster will try to restart the apply worker.
The new apply worker, before reaching the start_apply() [where we
handle exception], will hit the code to establish the connection to
the publisher -

ApplyWorkerMain() -> run_apply_worker() -
...
LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
true /* replication */ ,
true,
must_use_password,
MySubscription->name, &err);

if (LogRepWorkerWalRcvConn == NULL)
  ereport(ERROR,
  (errcode(ERRCODE_CONNECTION_FAILURE),
   errmsg("could not connect to the publisher: %s", err)));
...

and due to the bad connection string in the subscription, it will error out.
[28680] ERROR:  could not connect to the publisher: invalid port number: "-1"
[3196] LOG:  background worker "logical replication apply worker" (PID
28680) exited with exit code 1

Now, the postmaster keeps trying to restart the apply worker and it
will keep failing until the connection string is corrected or the
subscription is disabled manually.

I think this is a bug that needs to be handled in run_apply_worker()
when disable_on_error is set.

======
[1] https://www.postgresql.org/message-id/CAHut%2BPvcf7P2dHbCeWPM4jQ%3DyHqf4WFS_C6eVb8V%3DbcZPMMp7A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CABdArM6ORu%2BKpS_kXd-jwwPBqYPo1YqZjwwGnqmVanWgdHCggA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Jan 18, 2024 at 8:16 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi,
>
> I had reported a possible subscription 'disable_on_error' bug found
> while reviewing another patch.
>
> I am including my initial report and Nisha's analysis again here so
> that this topic has its own thread.
>
> ==================
> INITIAL REPORT [1]
> ==================
>
> ...
> I see now that any ALTER of the subscription's connection, even to
> some value that fails, will restart a new worker (like ALTER of any
> other subscription parameters). For a bad connection, it will continue
> to relaunch-worker/ERROR over and over. e.g.
>
> ----------
> test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG:  logical
> replication apply worker for subscription "sub4" has started
> 2024-01-17 09:34:28.666 AEDT [11274] ERROR:  could not connect to the
> publisher: invalid port number: "-1"
> 2024-01-17 09:34:28.667 AEDT [928] LOG:  background worker "logical
> replication apply worker" (PID 11274) exited with exit code 1
> 2024-01-17 09:34:33.669 AEDT [11391] LOG:  logical replication apply
> worker for subscription "sub4" has started
> 2024-01-17 09:34:33.669 AEDT [11391] ERROR:  could not connect to the
> publisher: invalid port number: "-1"
> 2024-01-17 09:34:33.670 AEDT [928] LOG:  background worker "logical
> replication apply worker" (PID 11391) exited with exit code 1
> etc...
> ----------
>
> While experimenting with the bad connection ALTER I also tried setting
> 'disable_on_error' like below:
>
> ALTER SUBSCRIPTION sub4 SET (disable_on_error);
> ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
>
> ...but here the subscription did not become DISABLED as I expected it
> would do on the next connection error iteration. It remains enabled
> and just continues to loop relaunch/ERROR indefinitely same as before.
>
> That looks like it may be a bug. Thoughts?

Although we can improve it to handle this case too, I'm not sure it's
a bug. The doc says[1]:

Specifies whether the subscription should be automatically disabled if
any errors are detected by subscription workers during data
replication from the publisher.

When an apply worker is trying to establish a connection, it's not
replicating data from the publisher.

Regards,

[1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
...
>
> Although we can improve it to handle this case too, I'm not sure it's
> a bug. The doc says[1]:
>
> Specifies whether the subscription should be automatically disabled if
> any errors are detected by subscription workers during data
> replication from the publisher.
>
> When an apply worker is trying to establish a connection, it's not
> replicating data from the publisher.
>
> Regards,
>
> [1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

Yeah, I had also seen that wording of those docs. And I agree it
leaves open some room for doubts because strictly from that wording it
can be interpreted that establishing the connection is not actually
"data replication from the publisher" in which case maybe there is no
bug.

OTOH, it was not clear to me if that precise wording was the intention
or not. It could have been written as "Specifies whether the
subscription should be automatically disabled if any errors are
detected by subscription workers." which would mean the same thing 99%
of the time except that would mean that the current behaviour is a
bug.

I tried looking at the original thread where this feature was born [1]
but it is still unclear to me if 'disable_on_error' was meant for
every kind of error or only data replication errors.

Indeed. even the commit message [2] seems to have an each-way bet:
* It talks about errors applying changes --- "Logical replication
apply workers for a subscription can easily get stuck in an infinite
loop of attempting to apply a change..."
* But, it also says it covers any errors --- "When true, both the
tablesync worker and apply worker catch any errors thrown..."

~

Maybe we should be asking ourselves how a user intuitively expects
this option to behave. IMO the answer is right there in the option
name - the subscription says 'disable_on_error' and I got an error, so
I expected the subscription to be disabled. To wriggle out of it by
saying "Ah, but we did not mean _those_ kinds of errors" doesn't quite
seem genuine to me.

======
[1]
https://www.postgresql.org/message-id/flat/CAA4eK1KsaVgkO%3DRbjj0bcXZTpeV1QVm0TGkdxZiH73MHfxf6oQ%40mail.gmail.com#d4a0db154fbeca356a494c50ac877ff1
[2] https://github.com/postgres/postgres/commit/705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33

Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, Jan 18, 2024 at 11:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> ...
> >
> > Although we can improve it to handle this case too, I'm not sure it's
> > a bug. The doc says[1]:
> >
> > Specifies whether the subscription should be automatically disabled if
> > any errors are detected by subscription workers during data
> > replication from the publisher.
> >
> > When an apply worker is trying to establish a connection, it's not
> > replicating data from the publisher.
> >
> > Regards,
> >
> > [1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
>
> Yeah, I had also seen that wording of those docs. And I agree it
> leaves open some room for doubts because strictly from that wording it
> can be interpreted that establishing the connection is not actually
> "data replication from the publisher" in which case maybe there is no
> bug.
>

As far as I remember that was the intention. The idea was if there is
any conflict during apply that users manually need to fix, they have
the provision to stop repeating the error. If we wish to extend the
purpose of this option for another valid use case and there is a good
way to achieve the same then we can discuss but I don't think we need
to change it in back-branches.

--
With Regards,
Amit Kapila.



On Thu, Jan 18, 2024 at 8:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 11:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > ...
> > >
> > > Although we can improve it to handle this case too, I'm not sure it's
> > > a bug. The doc says[1]:
> > >
> > > Specifies whether the subscription should be automatically disabled if
> > > any errors are detected by subscription workers during data
> > > replication from the publisher.
> > >
> > > When an apply worker is trying to establish a connection, it's not
> > > replicating data from the publisher.
> > >
> > > Regards,
> > >
> > > [1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> > >
> > > --
> > > Masahiko Sawada
> > > Amazon Web Services: https://aws.amazon.com
> >
> > Yeah, I had also seen that wording of those docs. And I agree it
> > leaves open some room for doubts because strictly from that wording it
> > can be interpreted that establishing the connection is not actually
> > "data replication from the publisher" in which case maybe there is no
> > bug.
> >
>
> As far as I remember that was the intention. The idea was if there is
> any conflict during apply that users manually need to fix, they have
> the provision to stop repeating the error. If we wish to extend the
> purpose of this option for another valid use case and there is a good
> way to achieve the same then we can discuss but I don't think we need
> to change it in back-branches.
>
> --

In case we want to proceed with this, here is a simple POC patch that
seems to do the job.

~~~

RESULT:

test_sub=# create subscription sub1 connection 'dbname=test_pub'
publication pub1;
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
2024-01-19 11:50:33.385 AEDT [17905] LOG:  logical replication apply
worker for subscription "sub1" has started
2024-01-19 11:50:33.398 AEDT [17907] LOG:  logical replication table
synchronization worker for subscription "sub1", table "t1" has started
2024-01-19 11:50:33.481 AEDT [17907] LOG:  logical replication table
synchronization worker for subscription "sub1", table "t1" has
finished

test_sub=# alter subscription sub1 set (disable_on_error);
ALTER SUBSCRIPTION

test_sub=# alter subscription sub1 connection 'port=-1';
2024-01-19 11:51:00.696 AEDT [17905] LOG:  logical replication worker
for subscription "sub1" will restart because of a parameter change
ALTER SUBSCRIPTION
2024-01-19 11:51:00.704 AEDT [18649] LOG:  logical replication apply
worker for subscription "sub1" has started
2024-01-19 11:51:00.704 AEDT [18649] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-19 11:51:00.705 AEDT [18649] LOG:  subscription "sub1" has
been disabled because of an error

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment
On Thu, Jan 18, 2024 at 6:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 11:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > ...
> > >
> > > Although we can improve it to handle this case too, I'm not sure it's
> > > a bug. The doc says[1]:
> > >
> > > Specifies whether the subscription should be automatically disabled if
> > > any errors are detected by subscription workers during data
> > > replication from the publisher.
> > >
> > > When an apply worker is trying to establish a connection, it's not
> > > replicating data from the publisher.
> > >
> > > Regards,
> > >
> > > [1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> > >
> > > --
> > > Masahiko Sawada
> > > Amazon Web Services: https://aws.amazon.com
> >
> > Yeah, I had also seen that wording of those docs. And I agree it
> > leaves open some room for doubts because strictly from that wording it
> > can be interpreted that establishing the connection is not actually
> > "data replication from the publisher" in which case maybe there is no
> > bug.
> >
>
> As far as I remember that was the intention. The idea was if there is
> any conflict during apply that users manually need to fix, they have
> the provision to stop repeating the error. If we wish to extend the
> purpose of this option for another valid use case and there is a good
> way to achieve the same then we can discuss but I don't think we need
> to change it in back-branches.

I agree not to change it in back-branches.

What is the use case of extending disable_on_error?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 19, 2024 at 7:21 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 6:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 11:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > ...
> > > >
> > > > Although we can improve it to handle this case too, I'm not sure it's
> > > > a bug. The doc says[1]:
> > > >
> > > > Specifies whether the subscription should be automatically disabled if
> > > > any errors are detected by subscription workers during data
> > > > replication from the publisher.
> > > >
> > > > When an apply worker is trying to establish a connection, it's not
> > > > replicating data from the publisher.
> > > >
> > > > Regards,
> > > >
> > > > [1]
https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> > > >
> > > > --
> > > > Masahiko Sawada
> > > > Amazon Web Services: https://aws.amazon.com
> > >
> > > Yeah, I had also seen that wording of those docs. And I agree it
> > > leaves open some room for doubts because strictly from that wording it
> > > can be interpreted that establishing the connection is not actually
> > > "data replication from the publisher" in which case maybe there is no
> > > bug.
> > >
> >
> > As far as I remember that was the intention. The idea was if there is
> > any conflict during apply that users manually need to fix, they have
> > the provision to stop repeating the error. If we wish to extend the
> > purpose of this option for another valid use case and there is a good
> > way to achieve the same then we can discuss but I don't think we need
> > to change it in back-branches.
>
> I agree not to change it in back-branches.
>
> What is the use case of extending disable_on_error?
>

The use-case is that with my user-hat on I had assumed
disable_on_error behaviour was as per the name implied so the
subscription would disable on getting (any) error. OTOH I agree, my
expectation is not exactly what the current docs say.

Also, I had thought the motivation for that option was to avoid having
infinite repeating errors that might be caused by the user or data.
e.g. a simple typo in the conninfo can cause this error and AFAIK the
ALTER will appear successful so the user won't know anything about it
unless they also check the logs. OTOH something like a connection
error may only be temporary (caused by a network issue?) and not
caused by a user typo at all, so I can see perhaps that is why
disable_on_error is OK to be excluding connection errors.

 TBH I think there are pros and cons to doing nothing and leaving the
existing behaviour as-is or extending it -- I'm happy to go either
way.

Another idea is to leave behaviour unchanged, but add a note in the
docs like "Note: connection errors (e.g. specifying a bad conninfo
using ALTER SUBSCRIPTION) will not cause the subscription to become
disabled"

Thoughts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia