Thread: Add an option to skip loading missing publication to avoid logical replication failure

Hi,

Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the
logical replication in certain cases. This can happen as the apply
worker will get restarted after SET PUBLICATION, the apply worker will
use the existing slot and replication origin corresponding to the
subscription. Now, it is possible that before restart the origin has
not been updated and the WAL start location points to a location prior
to where PUBLICATION pub exists which can lead to such an error. Once
this error occurs, apply worker will never be able to proceed and will
always return the same error.

There was discussion on this and Amit had posted a patch to handle
this at [2]. Amit's patch does continue using a historic snapshot but
ignores publications that are not found for the purpose of computing
RelSyncEntry attributes. We won't mark such an entry as valid till all
the publications are loaded without anything missing. This means we
won't publish operations on tables corresponding to that publication
till we found such a publication and that seems okay.
I have added an option skip_not_exist_publication to enable this
operation only when skip_not_exist_publication is specified as true.
There is no change in default behavior when skip_not_exist_publication
is specified as false.

But one thing to note with the patch (with skip_not_exist_publication
option) is that replication of few WAL entries will be skipped till
the publication is loaded like in the below example:
-- Create table in publisher and subscriber
create table t1(c1 int);
create table t2(c1 int);

-- Create publications
create publication pub1 for table t1;
create publication pub2 for table t2;

-- Create subscription
create subscription test1 connection 'dbname=postgres host=localhost
port=5432' publication pub1, pub2;

-- Drop one publication
drop publication pub1;

-- Insert in the publisher
insert into t1 values(11);
insert into t2 values(21);

-- Select in subscriber
postgres=# select * from t1;
 c1
----
(0 rows)

postgres=# select * from t2;
 c1
----
 21
(1 row)

-- Create the dropped publication in publisher
create publication pub1 for table t1;

-- Insert in the publisher
insert into t1 values(12);
postgres=# select * from t1;
 c1
----
 11
 12
(2 rows)

-- Select data in subscriber
postgres=# select * from t1; -- record with value 11 will be missing
in subscriber
 c1
----
 12
(1 row)

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com

Regards,
Vignesh

Attachment
On Mon, 19 Feb 2024 at 12:48, vignesh C <vignesh21@gmail.com> wrote:
>
> Hi,
>
> Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the
> logical replication in certain cases. This can happen as the apply
> worker will get restarted after SET PUBLICATION, the apply worker will
> use the existing slot and replication origin corresponding to the
> subscription. Now, it is possible that before restart the origin has
> not been updated and the WAL start location points to a location prior
> to where PUBLICATION pub exists which can lead to such an error. Once
> this error occurs, apply worker will never be able to proceed and will
> always return the same error.
>
> There was discussion on this and Amit had posted a patch to handle
> this at [2]. Amit's patch does continue using a historic snapshot but
> ignores publications that are not found for the purpose of computing
> RelSyncEntry attributes. We won't mark such an entry as valid till all
> the publications are loaded without anything missing. This means we
> won't publish operations on tables corresponding to that publication
> till we found such a publication and that seems okay.
> I have added an option skip_not_exist_publication to enable this
> operation only when skip_not_exist_publication is specified as true.
> There is no change in default behavior when skip_not_exist_publication
> is specified as false.

I have updated the patch to now include changes for pg_dump, added few
tests, describe changes and added documentation changes. The attached
v2 version patch has the changes for the same.

Regards,
Vignesh

Attachment
On Mon, Feb 19, 2024 at 12:49 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the
> logical replication in certain cases. This can happen as the apply
> worker will get restarted after SET PUBLICATION, the apply worker will
> use the existing slot and replication origin corresponding to the
> subscription. Now, it is possible that before restart the origin has
> not been updated and the WAL start location points to a location prior
> to where PUBLICATION pub exists which can lead to such an error. Once
> this error occurs, apply worker will never be able to proceed and will
> always return the same error.
>
> There was discussion on this and Amit had posted a patch to handle
> this at [2]. Amit's patch does continue using a historic snapshot but
> ignores publications that are not found for the purpose of computing
> RelSyncEntry attributes. We won't mark such an entry as valid till all
> the publications are loaded without anything missing. This means we
> won't publish operations on tables corresponding to that publication
> till we found such a publication and that seems okay.
> I have added an option skip_not_exist_publication to enable this
> operation only when skip_not_exist_publication is specified as true.
> There is no change in default behavior when skip_not_exist_publication
> is specified as false.
>

Did you try to measure the performance impact of this change? We can
try a few cases where DDL and DMLs are involved, missing publication
(drop publication and recreate after a varying number of records to
check the impact).

The other names for the option could be:
skip_notexistant_publications, or ignore_nonexistant_publications. Can
we think of any others?

--
With Regards,
Amit Kapila.



On Tue, Feb 18, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Did you try to measure the performance impact of this change? We can
> > try a few cases where DDL and DMLs are involved, missing publication
> > (drop publication and recreate after a varying number of records to
> > check the impact).
>
> Since we don't have an exact scenario to compare with the patch
> (because, in the current HEAD, when the publication is missing, an
> error is thrown and the walsender/worker restarts), I compared the
> positive case, where records are successfully replicated to the
> subscriber, as shown below. For the scenario with the patch, I ran the
> same test, where the publication is dropped before the insert,
> allowing the walsender to check whether the publication is present.
> The test results, which represent the median of 7 runs and the
> execution run is in milliseconds, are provided below:
>
> Brach/records  |  100     |  1000   |  10000     |  100000  |  1000000
> Head               |   1.214  |  2.548  |  10.823    |  90.3       |  951.833
> Patch              |   1.215  |  2.5485 |  10.8545 |  90.94     |   955.134
> % diff              |   0.082  |   0.020   |   0.291   |   0.704    |    0.347
>
> I noticed that the test run with patches is very negligible. The
> scripts used for execution are attached.
>

You have used the synchronous_standby_name to evaluate the performance
which covers other parts of replication than the logical decoding. It
would be better to test using pg_recvlogical.

--
With Regards,
Amit Kapila.



On Tue, 18 Feb 2025 at 16:53, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 18, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Did you try to measure the performance impact of this change? We can
> > > try a few cases where DDL and DMLs are involved, missing publication
> > > (drop publication and recreate after a varying number of records to
> > > check the impact).
> >
> > Since we don't have an exact scenario to compare with the patch
> > (because, in the current HEAD, when the publication is missing, an
> > error is thrown and the walsender/worker restarts), I compared the
> > positive case, where records are successfully replicated to the
> > subscriber, as shown below. For the scenario with the patch, I ran the
> > same test, where the publication is dropped before the insert,
> > allowing the walsender to check whether the publication is present.
> > The test results, which represent the median of 7 runs and the
> > execution run is in milliseconds, are provided below:
> >
> > Brach/records  |  100     |  1000   |  10000     |  100000  |  1000000
> > Head               |   1.214  |  2.548  |  10.823    |  90.3       |  951.833
> > Patch              |   1.215  |  2.5485 |  10.8545 |  90.94     |   955.134
> > % diff              |   0.082  |   0.020   |   0.291   |   0.704    |    0.347
> >
> > I noticed that the test run with patches is very negligible. The
> > scripts used for execution are attached.
> >
>
> You have used the synchronous_standby_name to evaluate the performance
> which covers other parts of replication than the logical decoding. It
> would be better to test using pg_recvlogical.

Here are the test runs with pg_recvlogical, the test results, which
represent the median of 10 runs and the execution run is in
milliseconds, are provided below:
Brach/records  |  100     |  1000   |  10000   |  100000  |  1000000
Head               |   9.95   |  15.26  |   62.62    |  536.57  |   8480.83
Patch              |   9.218  |  10.32 |   23.05    |  143.83  |   4852.43
% diff              |   7.356  |  32.38  |   63.19   |    73.193|       42.783

We observe that test execution with the patch performs better between
7.35 percent to 73.19 percent. This is because, in HEAD, after loading
and verifying that the publication is valid, it must continue
processing to output the change. In contrast, with the patch,
outputting the change is skipped since the publication does not exist.

The attached script has the script that was used for testing. Here the
NUM_RECORDS count should be changed accordingly for each of the tests
and while running the test with the patch change uncomment the drop
publication command.

Regards,
Vignesh

Attachment
On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:
>
> The attached script has the script that was used for testing. Here the
> NUM_RECORDS count should be changed accordingly for each of the tests
> and while running the test with the patch change uncomment the drop
> publication command.

I have done further analysis on the test and changed the test to
compare it better with HEAD. The execution time is in milliseconds.
Brach/records  |  100     |  1000   |  10000    |  100000  |  1000000
Head               |   10.43  |  15.86  |   64.44    |  550.56  |   8991.04
Patch              |   11.35  |  17.26   |   73.50    |  640.21  |  10104.72
% diff              |   -8.82  |  -8.85    |   -14.08   |   -16.28  |  -12.38

There is a  performance degradation in the range of 8.8 to 16.2 percent.

Test Details (With Head):
a) Create two publications for the same table. b) Insert the records
listed in the table below. c) Use pg_recvlogical to capture the
changes.
Test Details (With Patch):
a) Create two publications for the same table.b) Drop one
publication(to check the impact of skip missing publication), ensuring
that changes from the remaining publication continue to be captured.
c) Insert the records listed in the table below.d) Use pg_recvlogical
to capture the changes.

The performance degradation is in the range of 8.8 to 16.2
percentage.The script used for the testing is attached, while running
with patch the drop publication command in script should be
uncommented and the record count should be changed for each of the
run. Also I changed  the patch so that  we need not execute the
get_rel_sync_entry code flow for every record in case of missing
publication case and to do so only in case when the publications have
changed. The attached v4 version patch has the changes for the same.

Regards,
Vignesh

Attachment
On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:
> >
> > The attached script has the script that was used for testing. Here the
> > NUM_RECORDS count should be changed accordingly for each of the tests
> > and while running the test with the patch change uncomment the drop
> > publication command.
>
> I have done further analysis on the test and changed the test to
> compare it better with HEAD. The execution time is in milliseconds.
> Brach/records  |  100     |  1000   |  10000    |  100000  |  1000000
> Head               |   10.43  |  15.86  |   64.44    |  550.56  |   8991.04
> Patch              |   11.35  |  17.26   |   73.50    |  640.21  |  10104.72
> % diff              |   -8.82  |  -8.85    |   -14.08   |   -16.28  |  -12.38
>
> There is a  performance degradation in the range of 8.8 to 16.2 percent.
>

- /* Validate the entry */
- if (!entry->replicate_valid)
+ /*
+ * If the publication is invalid, check for updates.
+ * This optimization ensures that the next block, which queries the system
+ * tables and builds the relation entry, runs only if a new publication was
+ * created.
+ */
+ if (!publications_valid && data->publications)
+ {
+ bool skipped_pub = false;
+ List    *publications;
+
+ publications = LoadPublications(data->publication_names, &skipped_pub);

The publications_valid flag indicates whether the publications cache
is valid or not; the flag is set to false for any invalidation in the
pg_publication catalog. I wonder that instead of using the same flag
what if we use a separate publications_skipped flag? If that works,
you don't even need to change the current location where we
LoadPublications.

--
With Regards,
Amit Kapila.



On Mon, 3 Mar 2025 at 16:41, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > The attached script has the script that was used for testing. Here the
> > > NUM_RECORDS count should be changed accordingly for each of the tests
> > > and while running the test with the patch change uncomment the drop
> > > publication command.
> >
> > I have done further analysis on the test and changed the test to
> > compare it better with HEAD. The execution time is in milliseconds.
> > Brach/records  |  100     |  1000   |  10000    |  100000  |  1000000
> > Head               |   10.43  |  15.86  |   64.44    |  550.56  |   8991.04
> > Patch              |   11.35  |  17.26   |   73.50    |  640.21  |  10104.72
> > % diff              |   -8.82  |  -8.85    |   -14.08   |   -16.28  |  -12.38
> >
> > There is a  performance degradation in the range of 8.8 to 16.2 percent.
> >
>
> - /* Validate the entry */
> - if (!entry->replicate_valid)
> + /*
> + * If the publication is invalid, check for updates.
> + * This optimization ensures that the next block, which queries the system
> + * tables and builds the relation entry, runs only if a new publication was
> + * created.
> + */
> + if (!publications_valid && data->publications)
> + {
> + bool skipped_pub = false;
> + List    *publications;
> +
> + publications = LoadPublications(data->publication_names, &skipped_pub);
>
> The publications_valid flag indicates whether the publications cache
> is valid or not; the flag is set to false for any invalidation in the
> pg_publication catalog. I wonder that instead of using the same flag
> what if we use a separate publications_skipped flag? If that works,
> you don't even need to change the current location where we
> LoadPublications.

There is almost negligible dip with the above suggested way, the test
results for the same is given below(execution time is in milli
seconds):
Brach/records  |  100     |  1000   |  10000    |  100000 |  1000000
Head               |   10.25  |  15.85  |   65.53    |  569.15  |  9194.19
Patch              |   10.25  |  15.84  |   65.91    |  571.75  |  9208.66
% diff              |      0.00 |    0.06  |    -0.58    |     -0.46  |    -0.16

There is a performance dip in the range of 0 to 0.58 percent.
The attached patch has the changes for the same. The test script used
is also attached.

Regards,
Vignesh

Attachment
On Tue, 4 Mar 2025 at 12:22, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 3 Mar 2025 at 16:41, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > The attached script has the script that was used for testing. Here the
> > > > NUM_RECORDS count should be changed accordingly for each of the tests
> > > > and while running the test with the patch change uncomment the drop
> > > > publication command.
> > >
> > > I have done further analysis on the test and changed the test to
> > > compare it better with HEAD. The execution time is in milliseconds.
> > > Brach/records  |  100     |  1000   |  10000    |  100000  |  1000000
> > > Head               |   10.43  |  15.86  |   64.44    |  550.56  |   8991.04
> > > Patch              |   11.35  |  17.26   |   73.50    |  640.21  |  10104.72
> > > % diff              |   -8.82  |  -8.85    |   -14.08   |   -16.28  |  -12.38
> > >
> > > There is a  performance degradation in the range of 8.8 to 16.2 percent.
> > >
> >
> > - /* Validate the entry */
> > - if (!entry->replicate_valid)
> > + /*
> > + * If the publication is invalid, check for updates.
> > + * This optimization ensures that the next block, which queries the system
> > + * tables and builds the relation entry, runs only if a new publication was
> > + * created.
> > + */
> > + if (!publications_valid && data->publications)
> > + {
> > + bool skipped_pub = false;
> > + List    *publications;
> > +
> > + publications = LoadPublications(data->publication_names, &skipped_pub);
> >
> > The publications_valid flag indicates whether the publications cache
> > is valid or not; the flag is set to false for any invalidation in the
> > pg_publication catalog. I wonder that instead of using the same flag
> > what if we use a separate publications_skipped flag? If that works,
> > you don't even need to change the current location where we
> > LoadPublications.
>
> There is almost negligible dip with the above suggested way, the test
> results for the same is given below(execution time is in milli
> seconds):
> Brach/records  |  100     |  1000   |  10000    |  100000 |  1000000
> Head               |   10.25  |  15.85  |   65.53    |  569.15  |  9194.19
> Patch              |   10.25  |  15.84  |   65.91    |  571.75  |  9208.66
> % diff              |      0.00 |    0.06  |    -0.58    |     -0.46  |    -0.16
>
> There is a performance dip in the range of 0 to 0.58 percent.
> The attached patch has the changes for the same. The test script used
> is also attached.

On further thinking, I felt the use of publications_updated variable
is not required we can use publications_valid itself which will be set
if the publication system table is invalidated. Here is a patch for
the same.

Regards,
Vignesh

Attachment
On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21@gmail.com> wrote:
>
> There is almost negligible dip with the above suggested way, the test
> results for the same is given below(execution time is in milli
> seconds):
> Brach/records  |  100     |  1000   |  10000    |  100000 |  1000000
> Head               |   10.25  |  15.85  |   65.53    |  569.15  |  9194.19
> Patch              |   10.25  |  15.84  |   65.91    |  571.75  |  9208.66
> % diff              |      0.00 |    0.06  |    -0.58    |     -0.46  |    -0.16
>
> There is a performance dip in the range of 0 to 0.58 percent.
> The attached patch has the changes for the same. The test script used
> is also attached.
>

The patch still needs more review but the change has negligible
performance impact. The next step is to get more opinions on whether
we should add a new subscription option (say
skip_not_existant_publication) for this work. See patch v1-0002-* in
email [1]. The problem summary is explained in email [2] and in the
commit message of the 0001 patch in this thread. But still, let me
write briefly for the ease of others.

The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will
lead to restarting of apply worker, and after the restart, the apply
worker will use the existing slot and replication origin corresponding
to the subscription. Now, it is possible that before the restart, the
origin has not been updated, and the WAL start location points to a
location before where PUBLICATION pointed to by SET PUBLICATION
exists. This leads to an error: "ERROR:  publication "pub1" does not
exist". Once this error occurs, apply worker will never be able to
proceed and will always return the same error. For users, this is a
problem because they would have created a publication before executing
ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to
proceed.

The solution we came up with is to skip loading the publication if the
publication does not exist. We load the publication later and update
the relation entry when the publication gets created.

The two main concerns with this idea, as shared in email [3], are
performance implications of this change and the possibility of current
behaviour expectations from the users.

We came up with a solution where the performance impact is negligible,
as shown in the tests [4]. For that, we won't try to reload the
skipped/missing publication for each change but will attempt it only
when any new publication is created/dropped for a valid relation entry
in RelationSyncCache (maintained by pgoutput).

The new option skip_not_existant_publication is to address the second
concern "Imagine you have a subscriber using two publications p1 and
p2, and someone comes around and drops p1 by mistake. With the
proposed patch, the subscription will notice this, but it'll continue
sending data ignoring the missing publication. Yes, it will continue
working, but it's quite possible this breaks the subscriber and it's
be better to fail and stop replicating.".

I see the point of adding such an option to avoid breaking the current
applications (if there are any) that are relying on current behaviour.
But OTOH, I am not sure if users expect us to fail explicitly in such
scenarios.

This is a long-standing behaviour for which we get reports from time
to time, and once analyzing a failure, Tom also looked at it and
agreed that we don't have much choice to avoid skipping non-existent
publications [5]. But we never concluded as to whether skipping should
be a default behavior or an optional one. So, we need more opinions on
it.

Thoughts?

[1] -
https://www.postgresql.org/message-id/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com
[2] -
https://www.postgresql.org/message-id/CAA4eK1Lc%3DNDV1HrY2gNasFK90MtysnA575a%2Brd0p%2BPOjXN%2BSpw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com
[4] - https://www.postgresql.org/message-id/CALDaNm2Xkm1M-ik2RLJZ9rMhW2zW2GRLL6ePyZJbXcAjOVwzXg%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/631312.1707251789%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.



On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > There is almost negligible dip with the above suggested way, the test
> > results for the same is given below(execution time is in milli
> > seconds):
> > Brach/records  |  100     |  1000   |  10000    |  100000 |  1000000
> > Head               |   10.25  |  15.85  |   65.53    |  569.15  |  9194.19
> > Patch              |   10.25  |  15.84  |   65.91    |  571.75  |  9208.66
> > % diff              |      0.00 |    0.06  |    -0.58    |     -0.46  |    -0.16
> >
> > There is a performance dip in the range of 0 to 0.58 percent.
> > The attached patch has the changes for the same. The test script used
> > is also attached.
> >
>
> The patch still needs more review but the change has negligible
> performance impact. The next step is to get more opinions on whether
> we should add a new subscription option (say
> skip_not_existant_publication) for this work. See patch v1-0002-* in
> email [1]. The problem summary is explained in email [2] and in the
> commit message of the 0001 patch in this thread. But still, let me
> write briefly for the ease of others.
>
> The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will
> lead to restarting of apply worker, and after the restart, the apply
> worker will use the existing slot and replication origin corresponding
> to the subscription. Now, it is possible that before the restart, the
> origin has not been updated, and the WAL start location points to a
> location before where PUBLICATION pointed to by SET PUBLICATION
> exists. This leads to an error: "ERROR:  publication "pub1" does not
> exist". Once this error occurs, apply worker will never be able to
> proceed and will always return the same error. For users, this is a
> problem because they would have created a publication before executing
> ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to
> proceed.
>
> The solution we came up with is to skip loading the publication if the
> publication does not exist. We load the publication later and update
> the relation entry when the publication gets created.
>
> The two main concerns with this idea, as shared in email [3], are
> performance implications of this change and the possibility of current
> behaviour expectations from the users.
>
> We came up with a solution where the performance impact is negligible,
> as shown in the tests [4]. For that, we won't try to reload the
> skipped/missing publication for each change but will attempt it only
> when any new publication is created/dropped for a valid relation entry
> in RelationSyncCache (maintained by pgoutput).

Thank you for summarizing the issue. That helps catch up a lot.

>
> The new option skip_not_existant_publication is to address the second
> concern "Imagine you have a subscriber using two publications p1 and
> p2, and someone comes around and drops p1 by mistake. With the
> proposed patch, the subscription will notice this, but it'll continue
> sending data ignoring the missing publication. Yes, it will continue
> working, but it's quite possible this breaks the subscriber and it's
> be better to fail and stop replicating.".

I think that in this particular situation the current behavior would
be likely to miss more changes than the patch'ed behavior case.

After the logical replication stops, the user would have to alter the
subscription to subscribe to only p1 by executing 'ALTER SUBSCRIPTION
... SET PUBLICATION p1' in order to resume the logical replication. In
any case, the publisher might be receiving further changes but the
subscriber would end up missing changes for tables associated with p2
generated while p2 doesn't exist. Even if the user re-creates the
publication p2 after that, it would be hard for users to re-alter the
subscription to get changes for tables associated with p1 and p2 from
the exact point of p2 being created. Therefore, the subscriber could
end up missing some changes that happened between 'CREATE PUBLICATION
p2' and 'ALTER SUBSCRIPTION ... SET PUBLICATION p1, p2'.

On the other hand, with the patch, the publication can send the
changes to tables associated with p1 and p2 as soon as it decodes the
WAL record of re-CREATE PUBLICATION p2.

>
> I see the point of adding such an option to avoid breaking the current
> applications (if there are any) that are relying on current behaviour.
> But OTOH, I am not sure if users expect us to fail explicitly in such
> scenarios.

On the side note, with the patch since we ignore missing publications
we will be able to create a subscription with whatever publications
regardless their existence like:

CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000;

The walsender corresponding to such subscriber can stream changes as
soon as the listed publications are created on the publisher and
REFRESH PUBLICATION is executed.

>
> This is a long-standing behaviour for which we get reports from time
> to time, and once analyzing a failure, Tom also looked at it and
> agreed that we don't have much choice to avoid skipping non-existent
> publications [5]. But we never concluded as to whether skipping should
> be a default behavior or an optional one. So, we need more opinions on
> it.

I'm leaning toward making the skipping behavior a default as I could
not find a good benefit for the current behavior (i.e., stopping
logical replication due to missing publications).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Sun, Mar 9, 2025 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > I see the point of adding such an option to avoid breaking the current
> > applications (if there are any) that are relying on current behaviour.
> > But OTOH, I am not sure if users expect us to fail explicitly in such
> > scenarios.
>
> On the side note, with the patch since we ignore missing publications
> we will be able to create a subscription with whatever publications
> regardless their existence like:
>
> CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000;
>
> The walsender corresponding to such subscriber can stream changes as
> soon as the listed publications are created on the publisher and
> REFRESH PUBLICATION is executed.
>

Right, but OTOH, one can expect that the data should start replicating
as soon as one creates a publication on the publisher. However, the
data for tables that are part of the publication will start
replicating from the point when the decoding process will process the
WAL corresponding to Create Publication. I suggest to add something
for this in docs unless it is already explained.

> >
> > This is a long-standing behaviour for which we get reports from time
> > to time, and once analyzing a failure, Tom also looked at it and
> > agreed that we don't have much choice to avoid skipping non-existent
> > publications [5]. But we never concluded as to whether skipping should
> > be a default behavior or an optional one. So, we need more opinions on
> > it.
>
> I'm leaning toward making the skipping behavior a default as I could
> not find a good benefit for the current behavior (i.e., stopping
> logical replication due to missing publications).
>

Sounds reasonable. We can always add the option at a later point if
required. Thanks for your input. We can continue reviewing and
committing the current patch.

--
With Regards,
Amit Kapila.



On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On further thinking, I felt the use of publications_updated variable
> is not required we can use publications_valid itself which will be set
> if the publication system table is invalidated. Here is a patch for
> the same.
>

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread  [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

[1] -
https://www.postgresql.org/message-id/OSCPR01MB14966C09AA201EFFA706576A7F5C92%40OSCPR01MB14966.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On further thinking, I felt the use of publications_updated variable
> is not required we can use publications_valid itself which will be set
> if the publication system table is invalidated. Here is a patch for
> the same.
>

The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread  [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.

Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.

I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed, the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created and this behavior looks fine. 

BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.

Somehow this looks like a bug fix which should be backported no?  Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
>> >
>> > On further thinking, I felt the use of publications_updated variable
>> > is not required we can use publications_valid itself which will be set
>> > if the publication system table is invalidated. Here is a patch for
>> > the same.
>> >
>>
>> The patch relies on the fact that whenever a publication's data is
>> invalidated, it will also invalidate all the RelSyncEntires as per
>> publication_invalidation_cb. But note that we are discussing removing
>> that inefficiency in the thread  [1]. So, we should try to rebuild the
>> entry when we have skipped the required publication previously.
>>
>> Apart from this, please consider updating the docs, as mentioned in my
>> response to Sawada-San's email.
>
>
> I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the
issueoccurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed,
thesubscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than
theLSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the
fixby skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created
andthis behavior looks fine. 
>

Yes, your understanding is correct, but note that as such, the patch
is simply skipping the missing publication. The skipped changes are
because those were on the table that is not part of any publication
w.r.t historic snapshot we have at the point of time.

>> BTW, I am planning to commit this only on HEAD as this is a behavior
>> change. Please let me know if you guys think otherwise.
>
>
> Somehow this looks like a bug fix which should be backported no?  Am I missing something?
>

We can consider this a bug-fix and backpatch it, but I am afraid that
because this is a behavior change, some users may not like it. Also, I
don't remember seeing public reports for this behavior; that is
probably because it is hard to hit. FYI, we found this via BF
failures. So, I thought it would be better to get this field tested
via HEAD only at this point in time.

--
With Regards,
Amit Kapila.



On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
> >> >
> >> > On further thinking, I felt the use of publications_updated variable
> >> > is not required we can use publications_valid itself which will be set
> >> > if the publication system table is invalidated. Here is a patch for
> >> > the same.
> >> >
> >>
> >> The patch relies on the fact that whenever a publication's data is
> >> invalidated, it will also invalidate all the RelSyncEntires as per
> >> publication_invalidation_cb. But note that we are discussing removing
> >> that inefficiency in the thread  [1]. So, we should try to rebuild the
> >> entry when we have skipped the required publication previously.
> >>
> >> Apart from this, please consider updating the docs, as mentioned in my
> >> response to Sawada-San's email.
> >
> >
> > I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC,
theissue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is
executed,the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the
WALthan the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as
partof the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication
iscreated and this behavior looks fine. 
> >
>
> Yes, your understanding is correct, but note that as such, the patch
> is simply skipping the missing publication. The skipped changes are
> because those were on the table that is not part of any publication
> w.r.t historic snapshot we have at the point of time.

So, it will skip loading the missing publication up to the LSN where
the publication is created and then load it from there, correct? Do we
have a test case for this? I couldn't find one in the latest patch or
in the email thread to demonstrate this behavior.


> >> BTW, I am planning to commit this only on HEAD as this is a behavior
> >> change. Please let me know if you guys think otherwise.
> >
> >
> > Somehow this looks like a bug fix which should be backported no?  Am I missing something?
> >
>
> We can consider this a bug-fix and backpatch it, but I am afraid that
> because this is a behavior change, some users may not like it. Also, I
> don't remember seeing public reports for this behavior; that is
> probably because it is hard to hit. FYI, we found this via BF
> failures. So, I thought it would be better to get this field tested
> via HEAD only at this point in time.

At the moment, I don't have a strong opinion on this. Since no one has
encountered or reported this issue, it might be the case that it's not
affecting anyone, and we could simply backpatch without causing any
dissatisfaction. However, I'm fine with whatever others decide.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Mon, 10 Mar 2025 at 09:33, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On further thinking, I felt the use of publications_updated variable
> > is not required we can use publications_valid itself which will be set
> > if the publication system table is invalidated. Here is a patch for
> > the same.
> >
>
> The patch relies on the fact that whenever a publication's data is
> invalidated, it will also invalidate all the RelSyncEntires as per
> publication_invalidation_cb. But note that we are discussing removing
> that inefficiency in the thread  [1]. So, we should try to rebuild the
> entry when we have skipped the required publication previously.
>
> Apart from this, please consider updating the docs, as mentioned in my
> response to Sawada-San's email.

The create subscription documentation already has "We allow
non-existent publications to be specified so that users can add those
later. This means pg_subscription can have non-existent publications."
and should be enough at [1]. Let me know if we need to add more
documentation.

Apart from this I have changed the log level that logs "skipped
loading publication" to WARNING as we log a warning "WARNING:
publications "pub2", "pub3" do not exist on the publisher" in case of
CREATE SUBSCRIPTION and looked similar to this. I can change it to a
different log level in case you feel this is not the right level.

Also I have added a test case for dilip's comment from [2].
The attached v7 version patch has the changes for the same.

[1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html
[2] - https://www.postgresql.org/message-id/CAFiTN-tgUR6QLSs3UHK7gx4VP7cURGNkufA_xkrQLw9eCnbGQw%40mail.gmail.com

Regards,
Vignesh

Attachment
On Tue, Mar 11, 2025 at 9:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > >> BTW, I am planning to commit this only on HEAD as this is a behavior
> > >> change. Please let me know if you guys think otherwise.
> > >
> > >
> > > Somehow this looks like a bug fix which should be backported no?  Am I missing something?
> > >
> >
> > We can consider this a bug-fix and backpatch it, but I am afraid that
> > because this is a behavior change, some users may not like it. Also, I
> > don't remember seeing public reports for this behavior; that is
> > probably because it is hard to hit. FYI, we found this via BF
> > failures. So, I thought it would be better to get this field tested
> > via HEAD only at this point in time.
>
> At the moment, I don't have a strong opinion on this. Since no one has
> encountered or reported this issue, it might be the case that it's not
> affecting anyone, and we could simply backpatch without causing any
> dissatisfaction. However, I'm fine with whatever others decide.
>

Sawada-San, others, do you have an opinion on whether to backpatch this change?

--
With Regards,
Amit Kapila.



On Tue, Mar 11, 2025 at 4:01 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 10 Mar 2025 at 09:33, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On further thinking, I felt the use of publications_updated variable
> > > is not required we can use publications_valid itself which will be set
> > > if the publication system table is invalidated. Here is a patch for
> > > the same.
> > >
> >
> > The patch relies on the fact that whenever a publication's data is
> > invalidated, it will also invalidate all the RelSyncEntires as per
> > publication_invalidation_cb. But note that we are discussing removing
> > that inefficiency in the thread  [1]. So, we should try to rebuild the
> > entry when we have skipped the required publication previously.
> >
> > Apart from this, please consider updating the docs, as mentioned in my
> > response to Sawada-San's email.
>
> The create subscription documentation already has "We allow
> non-existent publications to be specified so that users can add those
> later. This means pg_subscription can have non-existent publications."
> and should be enough at [1]. Let me know if we need to add more
> documentation.
>
> Apart from this I have changed the log level that logs "skipped
> loading publication" to WARNING as we log a warning "WARNING:
> publications "pub2", "pub3" do not exist on the publisher" in case of
> CREATE SUBSCRIPTION and looked similar to this. I can change it to a
> different log level in case you feel this is not the right level.
>
> Also I have added a test case for dilip's comment from [2].
> The attached v7 version patch has the changes for the same.
>

Thanks, Vignesh, for adding the test. I believe you've tested the
effect of DROP PUBLICATION. However, I think we should also test the
behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the
PUBLICATION, and then create the PUBLICATION at a later stage.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wed, Mar 12, 2025 at 3:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 11, 2025 at 9:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > >> BTW, I am planning to commit this only on HEAD as this is a behavior
> > > >> change. Please let me know if you guys think otherwise.
> > > >
> > > >
> > > > Somehow this looks like a bug fix which should be backported no?  Am I missing something?
> > > >
> > >
> > > We can consider this a bug-fix and backpatch it, but I am afraid that
> > > because this is a behavior change, some users may not like it. Also, I
> > > don't remember seeing public reports for this behavior; that is
> > > probably because it is hard to hit. FYI, we found this via BF
> > > failures. So, I thought it would be better to get this field tested
> > > via HEAD only at this point in time.
> >
> > At the moment, I don't have a strong opinion on this. Since no one has
> > encountered or reported this issue, it might be the case that it's not
> > affecting anyone, and we could simply backpatch without causing any
> > dissatisfaction. However, I'm fine with whatever others decide.
> >
>
> Sawada-San, others, do you have an opinion on whether to backpatch this change?

I'm also afraid of backpatching it so I guess it would be better to
push it to only HEAD. I think if users have encountered and we see
reported the issue we can consider backpatching again. If regression
tests on backbranches continue to fail intermittently, probably we can
consider adding waits as the patch Osumi-san proposed[1]?

Regards,

[1]
https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, 12 Mar 2025 at 16:15, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks, Vignesh, for adding the test. I believe you've tested the
> effect of DROP PUBLICATION. However, I think we should also test the
> behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the
> PUBLICATION, and then create the PUBLICATION at a later stage.

I felt having only one test case for this is enough, I have removed
the DROP PUBLICATION test and added the SET PUBLICATION test. The
attached v8 version patch has the changes for the same.

Regards,
Vignesh

Attachment
On Thu, Mar 13, 2025 at 7:38 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 12 Mar 2025 at 16:15, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Thanks, Vignesh, for adding the test. I believe you've tested the
> > effect of DROP PUBLICATION. However, I think we should also test the
> > behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the
> > PUBLICATION, and then create the PUBLICATION at a later stage.
>
> I felt having only one test case for this is enough, I have removed
> the DROP PUBLICATION test and added the SET PUBLICATION test. The
> attached v8 version patch has the changes for the same.

Thanks looks good to me.

While looking at the patch, I have a few comments/questions

+ if (pub)
+ result = lappend(result, pub);
+ else
+ {
+ /*
+ * When executing 'ALTER SUBSCRIPTION ... SET PUBLICATION', the
+ * apply worker continues using the existing replication slot and
+ * origin after restarting. If the replication origin is not
+ * updated before the restart, the WAL start location may point to
+ * a position before the specified publication exists, causing
+ * persistent apply worker restarts and errors.
+ *
+ * This ensures that the publication is skipped if it does not
+ * exist and is loaded when the corresponding WAL record is
+ * encountered.
+ */
+ ereport(WARNING,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("skipped loading publication: %s", pubname),
+ errhint("If the publication already exists, ignore it as it will be
loaded upon reaching the corresponding WAL record; otherwise, create
it."));
+ }

This comment focuses on a specific use case regarding the problem with
'ALTER SUBSCRIPTION ... SET PUBLICATION,' but in reality, we are
addressing a more general case where the user is trying to SET
PUBLICATION or even CREATE SUBSCRIPTION, and some publications are
missing. Wouldn't it be better to rephrase the comment?

2. + errhint("If the publication already exists, ignore it as it will
be loaded upon reaching the corresponding WAL record; otherwise,
create it."));

Is this hint correct? This is a question rather than a comment: When
we reach a particular WAL where the publication was created, will the
publication automatically load, or does the user need to REFRESH the
publications?



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, 13 Mar 2025 at 09:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks looks good to me.
>
> While looking at the patch, I have a few comments/questions
>
> + if (pub)
> + result = lappend(result, pub);
> + else
> + {
> + /*
> + * When executing 'ALTER SUBSCRIPTION ... SET PUBLICATION', the
> + * apply worker continues using the existing replication slot and
> + * origin after restarting. If the replication origin is not
> + * updated before the restart, the WAL start location may point to
> + * a position before the specified publication exists, causing
> + * persistent apply worker restarts and errors.
> + *
> + * This ensures that the publication is skipped if it does not
> + * exist and is loaded when the corresponding WAL record is
> + * encountered.
> + */
> + ereport(WARNING,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("skipped loading publication: %s", pubname),
> + errhint("If the publication already exists, ignore it as it will be
> loaded upon reaching the corresponding WAL record; otherwise, create
> it."));
> + }
>
> This comment focuses on a specific use case regarding the problem with
> 'ALTER SUBSCRIPTION ... SET PUBLICATION,' but in reality, we are
> addressing a more general case where the user is trying to SET
> PUBLICATION or even CREATE SUBSCRIPTION, and some publications are
> missing. Wouldn't it be better to rephrase the comment?

How about a comment something like below:
/*
* In 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and
* 'CREATE SUBSCRIPTION', if the replication origin is not updated
* before the worker exits, the WAL start location might point to a
* position before the publication's WAL record. This can lead to
* persistent apply worker restarts and errors.
*
* Additionally, dropping a subscription's publication should not
* disrupt logical replication.
*
* This ensures that a missing publication is skipped and loaded
* when its corresponding WAL record is encountered.
*/
ereport(WARNING,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("skipped loading publication: %s", pubname),
errhint("If the publication is missing, create and refresh it.
Otherwise, wait for the slot to reach the WAL record, then refresh"));

> 2. + errhint("If the publication already exists, ignore it as it will
> be loaded upon reaching the corresponding WAL record; otherwise,
> create it."));
>
> Is this hint correct? This is a question rather than a comment: When
> we reach a particular WAL where the publication was created, will the
> publication automatically load, or does the user need to REFRESH the
> publications?

Users need to refresh the publication in case the relation is not
already added to pg_subscription_rel and apply incremental changes.
How about an error hint like:
"If the publication is missing, create and refresh it. Otherwise, wait
for the slot to reach the WAL record for the created publication, then
refresh"

Regards,
Vignesh



On Thu, Mar 13, 2025 at 10:49 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 13 Mar 2025 at 09:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Thanks looks good to me.
> >
> > While looking at the patch, I have a few comments/questions
> >
> > + if (pub)
> > + result = lappend(result, pub);
> > + else
> > + {
> > + /*
> > + * When executing 'ALTER SUBSCRIPTION ... SET PUBLICATION', the
> > + * apply worker continues using the existing replication slot and
> > + * origin after restarting. If the replication origin is not
> > + * updated before the restart, the WAL start location may point to
> > + * a position before the specified publication exists, causing
> > + * persistent apply worker restarts and errors.
> > + *
> > + * This ensures that the publication is skipped if it does not
> > + * exist and is loaded when the corresponding WAL record is
> > + * encountered.
> > + */
> > + ereport(WARNING,
> > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("skipped loading publication: %s", pubname),
> > + errhint("If the publication already exists, ignore it as it will be
> > loaded upon reaching the corresponding WAL record; otherwise, create
> > it."));
> > + }
> >
> > This comment focuses on a specific use case regarding the problem with
> > 'ALTER SUBSCRIPTION ... SET PUBLICATION,' but in reality, we are
> > addressing a more general case where the user is trying to SET
> > PUBLICATION or even CREATE SUBSCRIPTION, and some publications are
> > missing. Wouldn't it be better to rephrase the comment?
>
> How about a comment something like below:
> /*
> * In 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and
> * 'CREATE SUBSCRIPTION', if the replication origin is not updated
> * before the worker exits, the WAL start location might point to a
> * position before the publication's WAL record. This can lead to
> * persistent apply worker restarts and errors.
> *
> * Additionally, dropping a subscription's publication should not
> * disrupt logical replication.
> *
> * This ensures that a missing publication is skipped and loaded
> * when its corresponding WAL record is encountered.
> */

Looks fine, shall we add the missing publication point as well
something like below

/*
* In operations like 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and
* 'CREATE SUBSCRIPTION', if the specified publication does not exist or
* if the replication origin is not updated before the worker exits,
* the WAL start location may point to a position prior to the publication's
* WAL record. This can cause persistent restarts and errors
* in the apply worker.
*
> * Additionally, dropping a subscription's publication should not
> * disrupt logical replication.
*
> * This ensures that a missing publication is skipped and loaded
> * when its corresponding WAL record is encountered.
*/



> ereport(WARNING,
> errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("skipped loading publication: %s", pubname),
> errhint("If the publication is missing, create and refresh it.
> Otherwise, wait for the slot to reach the WAL record, then refresh"));
>
> > 2. + errhint("If the publication already exists, ignore it as it will
> > be loaded upon reaching the corresponding WAL record; otherwise,
> > create it."));
> >
> > Is this hint correct? This is a question rather than a comment: When
> > we reach a particular WAL where the publication was created, will the
> > publication automatically load, or does the user need to REFRESH the
> > publications?
>
> Users need to refresh the publication in case the relation is not
> already added to pg_subscription_rel and apply incremental changes.
> How about an error hint like:
> "If the publication is missing, create and refresh it. Otherwise, wait
> for the slot to reach the WAL record for the created publication, then
> refresh"

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 13, 2025 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Looks fine, shall we add the missing publication point as well
> something like below
>
> /*
> * In operations like 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and
> * 'CREATE SUBSCRIPTION', if the specified publication does not exist or
> * if the replication origin is not updated before the worker exits,
> * the WAL start location may point to a position prior to the publication's
> * WAL record. This can cause persistent restarts and errors
> * in the apply worker.
> *

I think that is too much related to pub-sub model, and ideally,
pgoutput should not care about it. I have written a comment
considering somebody using pgoutput decoding module via APIs.

> > * Additionally, dropping a subscription's publication should not
> > * disrupt logical replication.
> *
> > * This ensures that a missing publication is skipped and loaded
> > * when its corresponding WAL record is encountered.
> */
>
>
>
> > ereport(WARNING,
> > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > errmsg("skipped loading publication: %s", pubname),
> > errhint("If the publication is missing, create and refresh it.
> > Otherwise, wait for the slot to reach the WAL record, then refresh"));
> >
> > > 2. + errhint("If the publication already exists, ignore it as it will
> > > be loaded upon reaching the corresponding WAL record; otherwise,
> > > create it."));
> > >
> > > Is this hint correct? This is a question rather than a comment: When
> > > we reach a particular WAL where the publication was created, will the
> > > publication automatically load, or does the user need to REFRESH the
> > > publications?
> >
> > Users need to refresh the publication in case the relation is not
> > already added to pg_subscription_rel and apply incremental changes.
> > How about an error hint like:
> > "If the publication is missing, create and refresh it. Otherwise, wait
> > for the slot to reach the WAL record for the created publication, then
> > refresh"
>

I have tried to split this information into errdetail and errhint in
the attached. See and let me know what you think of the same.

--
With Regards,
Amit Kapila.

Attachment
On Thu, Mar 13, 2025 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 13, 2025 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Looks fine, shall we add the missing publication point as well
> > something like below
> >
> > /*
> > * In operations like 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and
> > * 'CREATE SUBSCRIPTION', if the specified publication does not exist or
> > * if the replication origin is not updated before the worker exits,
> > * the WAL start location may point to a position prior to the publication's
> > * WAL record. This can cause persistent restarts and errors
> > * in the apply worker.
> > *
>
> I think that is too much related to pub-sub model, and ideally,
> pgoutput should not care about it. I have written a comment
> considering somebody using pgoutput decoding module via APIs.

I agree, here we just need to talk about skipping the missing
publication, not different scenarios where that can happen. This
comments look much better.

> > > * Additionally, dropping a subscription's publication should not
> > > * disrupt logical replication.
> > *
> > > * This ensures that a missing publication is skipped and loaded
> > > * when its corresponding WAL record is encountered.
> > */
> >
> >
> >
> > > ereport(WARNING,
> > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > errmsg("skipped loading publication: %s", pubname),
> > > errhint("If the publication is missing, create and refresh it.
> > > Otherwise, wait for the slot to reach the WAL record, then refresh"));
> > >
> > > > 2. + errhint("If the publication already exists, ignore it as it will
> > > > be loaded upon reaching the corresponding WAL record; otherwise,
> > > > create it."));
> > > >
> > > > Is this hint correct? This is a question rather than a comment: When
> > > > we reach a particular WAL where the publication was created, will the
> > > > publication automatically load, or does the user need to REFRESH the
> > > > publications?
> > >
> > > Users need to refresh the publication in case the relation is not
> > > already added to pg_subscription_rel and apply incremental changes.
> > > How about an error hint like:
> > > "If the publication is missing, create and refresh it. Otherwise, wait
> > > for the slot to reach the WAL record for the created publication, then
> > > refresh"
> >
>
> I have tried to split this information into errdetail and errhint in
> the attached. See and let me know what you think of the same.

Yes, the errhint also makes sense to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 13, 2025 at 7:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 13, 2025 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think that is too much related to pub-sub model, and ideally,
> > pgoutput should not care about it. I have written a comment
> > considering somebody using pgoutput decoding module via APIs.
>

Thanks, Dilip and Sawada-San, for the inputs, and Vignesh for the
patch. I have pushed the change.

--
With Regards,
Amit Kapila.