Thread: Add an option to skip loading missing publication to avoid logical replication failure
Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
Hi, Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the logical replication in certain cases. This can happen as the apply worker will get restarted after SET PUBLICATION, the apply worker will use the existing slot and replication origin corresponding to the subscription. Now, it is possible that before restart the origin has not been updated and the WAL start location points to a location prior to where PUBLICATION pub exists which can lead to such an error. Once this error occurs, apply worker will never be able to proceed and will always return the same error. There was discussion on this and Amit had posted a patch to handle this at [2]. Amit's patch does continue using a historic snapshot but ignores publications that are not found for the purpose of computing RelSyncEntry attributes. We won't mark such an entry as valid till all the publications are loaded without anything missing. This means we won't publish operations on tables corresponding to that publication till we found such a publication and that seems okay. I have added an option skip_not_exist_publication to enable this operation only when skip_not_exist_publication is specified as true. There is no change in default behavior when skip_not_exist_publication is specified as false. But one thing to note with the patch (with skip_not_exist_publication option) is that replication of few WAL entries will be skipped till the publication is loaded like in the below example: -- Create table in publisher and subscriber create table t1(c1 int); create table t2(c1 int); -- Create publications create publication pub1 for table t1; create publication pub2 for table t2; -- Create subscription create subscription test1 connection 'dbname=postgres host=localhost port=5432' publication pub1, pub2; -- Drop one publication drop publication pub1; -- Insert in the publisher insert into t1 values(11); insert into t2 values(21); -- Select in subscriber postgres=# select * from t1; c1 ---- (0 rows) postgres=# select * from t2; c1 ---- 21 (1 row) -- Create the dropped publication in publisher create publication pub1 for table t1; -- Insert in the publisher insert into t1 values(12); postgres=# select * from t1; c1 ---- 11 12 (2 rows) -- Select data in subscriber postgres=# select * from t1; -- record with value 11 will be missing in subscriber c1 ---- 12 (1 row) Thoughts? [1] - https://www.postgresql.org/message-id/CAA4eK1%2BT-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q%40mail.gmail.com Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Mon, 19 Feb 2024 at 12:48, vignesh C <vignesh21@gmail.com> wrote: > > Hi, > > Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the > logical replication in certain cases. This can happen as the apply > worker will get restarted after SET PUBLICATION, the apply worker will > use the existing slot and replication origin corresponding to the > subscription. Now, it is possible that before restart the origin has > not been updated and the WAL start location points to a location prior > to where PUBLICATION pub exists which can lead to such an error. Once > this error occurs, apply worker will never be able to proceed and will > always return the same error. > > There was discussion on this and Amit had posted a patch to handle > this at [2]. Amit's patch does continue using a historic snapshot but > ignores publications that are not found for the purpose of computing > RelSyncEntry attributes. We won't mark such an entry as valid till all > the publications are loaded without anything missing. This means we > won't publish operations on tables corresponding to that publication > till we found such a publication and that seems okay. > I have added an option skip_not_exist_publication to enable this > operation only when skip_not_exist_publication is specified as true. > There is no change in default behavior when skip_not_exist_publication > is specified as false. I have updated the patch to now include changes for pg_dump, added few tests, describe changes and added documentation changes. The attached v2 version patch has the changes for the same. Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Mon, Feb 19, 2024 at 12:49 PM vignesh C <vignesh21@gmail.com> wrote: > > Currently ALTER SUBSCRIPTION ... SET PUBLICATION will break the > logical replication in certain cases. This can happen as the apply > worker will get restarted after SET PUBLICATION, the apply worker will > use the existing slot and replication origin corresponding to the > subscription. Now, it is possible that before restart the origin has > not been updated and the WAL start location points to a location prior > to where PUBLICATION pub exists which can lead to such an error. Once > this error occurs, apply worker will never be able to proceed and will > always return the same error. > > There was discussion on this and Amit had posted a patch to handle > this at [2]. Amit's patch does continue using a historic snapshot but > ignores publications that are not found for the purpose of computing > RelSyncEntry attributes. We won't mark such an entry as valid till all > the publications are loaded without anything missing. This means we > won't publish operations on tables corresponding to that publication > till we found such a publication and that seems okay. > I have added an option skip_not_exist_publication to enable this > operation only when skip_not_exist_publication is specified as true. > There is no change in default behavior when skip_not_exist_publication > is specified as false. > Did you try to measure the performance impact of this change? We can try a few cases where DDL and DMLs are involved, missing publication (drop publication and recreate after a varying number of records to check the impact). The other names for the option could be: skip_notexistant_publications, or ignore_nonexistant_publications. Can we think of any others? -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Tue, Feb 18, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote: > > On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Did you try to measure the performance impact of this change? We can > > try a few cases where DDL and DMLs are involved, missing publication > > (drop publication and recreate after a varying number of records to > > check the impact). > > Since we don't have an exact scenario to compare with the patch > (because, in the current HEAD, when the publication is missing, an > error is thrown and the walsender/worker restarts), I compared the > positive case, where records are successfully replicated to the > subscriber, as shown below. For the scenario with the patch, I ran the > same test, where the publication is dropped before the insert, > allowing the walsender to check whether the publication is present. > The test results, which represent the median of 7 runs and the > execution run is in milliseconds, are provided below: > > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > Head | 1.214 | 2.548 | 10.823 | 90.3 | 951.833 > Patch | 1.215 | 2.5485 | 10.8545 | 90.94 | 955.134 > % diff | 0.082 | 0.020 | 0.291 | 0.704 | 0.347 > > I noticed that the test run with patches is very negligible. The > scripts used for execution are attached. > You have used the synchronous_standby_name to evaluate the performance which covers other parts of replication than the logical decoding. It would be better to test using pg_recvlogical. -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Tue, 18 Feb 2025 at 16:53, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 18, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Fri, 14 Feb 2025 at 15:36, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Did you try to measure the performance impact of this change? We can > > > try a few cases where DDL and DMLs are involved, missing publication > > > (drop publication and recreate after a varying number of records to > > > check the impact). > > > > Since we don't have an exact scenario to compare with the patch > > (because, in the current HEAD, when the publication is missing, an > > error is thrown and the walsender/worker restarts), I compared the > > positive case, where records are successfully replicated to the > > subscriber, as shown below. For the scenario with the patch, I ran the > > same test, where the publication is dropped before the insert, > > allowing the walsender to check whether the publication is present. > > The test results, which represent the median of 7 runs and the > > execution run is in milliseconds, are provided below: > > > > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > > Head | 1.214 | 2.548 | 10.823 | 90.3 | 951.833 > > Patch | 1.215 | 2.5485 | 10.8545 | 90.94 | 955.134 > > % diff | 0.082 | 0.020 | 0.291 | 0.704 | 0.347 > > > > I noticed that the test run with patches is very negligible. The > > scripts used for execution are attached. > > > > You have used the synchronous_standby_name to evaluate the performance > which covers other parts of replication than the logical decoding. It > would be better to test using pg_recvlogical. Here are the test runs with pg_recvlogical, the test results, which represent the median of 10 runs and the execution run is in milliseconds, are provided below: Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 Head | 9.95 | 15.26 | 62.62 | 536.57 | 8480.83 Patch | 9.218 | 10.32 | 23.05 | 143.83 | 4852.43 % diff | 7.356 | 32.38 | 63.19 | 73.193| 42.783 We observe that test execution with the patch performs better between 7.35 percent to 73.19 percent. This is because, in HEAD, after loading and verifying that the publication is valid, it must continue processing to output the change. In contrast, with the patch, outputting the change is skipped since the publication does not exist. The attached script has the script that was used for testing. Here the NUM_RECORDS count should be changed accordingly for each of the tests and while running the test with the patch change uncomment the drop publication command. Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote: > > The attached script has the script that was used for testing. Here the > NUM_RECORDS count should be changed accordingly for each of the tests > and while running the test with the patch change uncomment the drop > publication command. I have done further analysis on the test and changed the test to compare it better with HEAD. The execution time is in milliseconds. Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04 Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72 % diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38 There is a performance degradation in the range of 8.8 to 16.2 percent. Test Details (With Head): a) Create two publications for the same table. b) Insert the records listed in the table below. c) Use pg_recvlogical to capture the changes. Test Details (With Patch): a) Create two publications for the same table.b) Drop one publication(to check the impact of skip missing publication), ensuring that changes from the remaining publication continue to be captured. c) Insert the records listed in the table below.d) Use pg_recvlogical to capture the changes. The performance degradation is in the range of 8.8 to 16.2 percentage.The script used for the testing is attached, while running with patch the drop publication command in script should be uncommented and the record count should be changed for each of the run. Also I changed the patch so that we need not execute the get_rel_sync_entry code flow for every record in case of missing publication case and to do so only in case when the publications have changed. The attached v4 version patch has the changes for the same. Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote: > > > > The attached script has the script that was used for testing. Here the > > NUM_RECORDS count should be changed accordingly for each of the tests > > and while running the test with the patch change uncomment the drop > > publication command. > > I have done further analysis on the test and changed the test to > compare it better with HEAD. The execution time is in milliseconds. > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04 > Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72 > % diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38 > > There is a performance degradation in the range of 8.8 to 16.2 percent. > - /* Validate the entry */ - if (!entry->replicate_valid) + /* + * If the publication is invalid, check for updates. + * This optimization ensures that the next block, which queries the system + * tables and builds the relation entry, runs only if a new publication was + * created. + */ + if (!publications_valid && data->publications) + { + bool skipped_pub = false; + List *publications; + + publications = LoadPublications(data->publication_names, &skipped_pub); The publications_valid flag indicates whether the publications cache is valid or not; the flag is set to false for any invalidation in the pg_publication catalog. I wonder that instead of using the same flag what if we use a separate publications_skipped flag? If that works, you don't even need to change the current location where we LoadPublications. -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Mon, 3 Mar 2025 at 16:41, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote: > > > > > > The attached script has the script that was used for testing. Here the > > > NUM_RECORDS count should be changed accordingly for each of the tests > > > and while running the test with the patch change uncomment the drop > > > publication command. > > > > I have done further analysis on the test and changed the test to > > compare it better with HEAD. The execution time is in milliseconds. > > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > > Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04 > > Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72 > > % diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38 > > > > There is a performance degradation in the range of 8.8 to 16.2 percent. > > > > - /* Validate the entry */ > - if (!entry->replicate_valid) > + /* > + * If the publication is invalid, check for updates. > + * This optimization ensures that the next block, which queries the system > + * tables and builds the relation entry, runs only if a new publication was > + * created. > + */ > + if (!publications_valid && data->publications) > + { > + bool skipped_pub = false; > + List *publications; > + > + publications = LoadPublications(data->publication_names, &skipped_pub); > > The publications_valid flag indicates whether the publications cache > is valid or not; the flag is set to false for any invalidation in the > pg_publication catalog. I wonder that instead of using the same flag > what if we use a separate publications_skipped flag? If that works, > you don't even need to change the current location where we > LoadPublications. There is almost negligible dip with the above suggested way, the test results for the same is given below(execution time is in milli seconds): Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19 Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66 % diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16 There is a performance dip in the range of 0 to 0.58 percent. The attached patch has the changes for the same. The test script used is also attached. Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Tue, 4 Mar 2025 at 12:22, vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 3 Mar 2025 at 16:41, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Mar 3, 2025 at 2:30 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Tue, 25 Feb 2025 at 15:32, vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > The attached script has the script that was used for testing. Here the > > > > NUM_RECORDS count should be changed accordingly for each of the tests > > > > and while running the test with the patch change uncomment the drop > > > > publication command. > > > > > > I have done further analysis on the test and changed the test to > > > compare it better with HEAD. The execution time is in milliseconds. > > > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > > > Head | 10.43 | 15.86 | 64.44 | 550.56 | 8991.04 > > > Patch | 11.35 | 17.26 | 73.50 | 640.21 | 10104.72 > > > % diff | -8.82 | -8.85 | -14.08 | -16.28 | -12.38 > > > > > > There is a performance degradation in the range of 8.8 to 16.2 percent. > > > > > > > - /* Validate the entry */ > > - if (!entry->replicate_valid) > > + /* > > + * If the publication is invalid, check for updates. > > + * This optimization ensures that the next block, which queries the system > > + * tables and builds the relation entry, runs only if a new publication was > > + * created. > > + */ > > + if (!publications_valid && data->publications) > > + { > > + bool skipped_pub = false; > > + List *publications; > > + > > + publications = LoadPublications(data->publication_names, &skipped_pub); > > > > The publications_valid flag indicates whether the publications cache > > is valid or not; the flag is set to false for any invalidation in the > > pg_publication catalog. I wonder that instead of using the same flag > > what if we use a separate publications_skipped flag? If that works, > > you don't even need to change the current location where we > > LoadPublications. > > There is almost negligible dip with the above suggested way, the test > results for the same is given below(execution time is in milli > seconds): > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19 > Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66 > % diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16 > > There is a performance dip in the range of 0 to 0.58 percent. > The attached patch has the changes for the same. The test script used > is also attached. On further thinking, I felt the use of publications_updated variable is not required we can use publications_valid itself which will be set if the publication system table is invalidated. Here is a patch for the same. Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21@gmail.com> wrote: > > There is almost negligible dip with the above suggested way, the test > results for the same is given below(execution time is in milli > seconds): > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19 > Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66 > % diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16 > > There is a performance dip in the range of 0 to 0.58 percent. > The attached patch has the changes for the same. The test script used > is also attached. > The patch still needs more review but the change has negligible performance impact. The next step is to get more opinions on whether we should add a new subscription option (say skip_not_existant_publication) for this work. See patch v1-0002-* in email [1]. The problem summary is explained in email [2] and in the commit message of the 0001 patch in this thread. But still, let me write briefly for the ease of others. The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will lead to restarting of apply worker, and after the restart, the apply worker will use the existing slot and replication origin corresponding to the subscription. Now, it is possible that before the restart, the origin has not been updated, and the WAL start location points to a location before where PUBLICATION pointed to by SET PUBLICATION exists. This leads to an error: "ERROR: publication "pub1" does not exist". Once this error occurs, apply worker will never be able to proceed and will always return the same error. For users, this is a problem because they would have created a publication before executing ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to proceed. The solution we came up with is to skip loading the publication if the publication does not exist. We load the publication later and update the relation entry when the publication gets created. The two main concerns with this idea, as shared in email [3], are performance implications of this change and the possibility of current behaviour expectations from the users. We came up with a solution where the performance impact is negligible, as shown in the tests [4]. For that, we won't try to reload the skipped/missing publication for each change but will attempt it only when any new publication is created/dropped for a valid relation entry in RelationSyncCache (maintained by pgoutput). The new option skip_not_existant_publication is to address the second concern "Imagine you have a subscriber using two publications p1 and p2, and someone comes around and drops p1 by mistake. With the proposed patch, the subscription will notice this, but it'll continue sending data ignoring the missing publication. Yes, it will continue working, but it's quite possible this breaks the subscriber and it's be better to fail and stop replicating.". I see the point of adding such an option to avoid breaking the current applications (if there are any) that are relying on current behaviour. But OTOH, I am not sure if users expect us to fail explicitly in such scenarios. This is a long-standing behaviour for which we get reports from time to time, and once analyzing a failure, Tom also looked at it and agreed that we don't have much choice to avoid skipping non-existent publications [5]. But we never concluded as to whether skipping should be a default behavior or an optional one. So, we need more opinions on it. Thoughts? [1] - https://www.postgresql.org/message-id/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1Lc%3DNDV1HrY2gNasFK90MtysnA575a%2Brd0p%2BPOjXN%2BSpw%40mail.gmail.com [3] - https://www.postgresql.org/message-id/dc08add3-10a8-738b-983a-191c7406707b%40enterprisedb.com [4] - https://www.postgresql.org/message-id/CALDaNm2Xkm1M-ik2RLJZ9rMhW2zW2GRLL6ePyZJbXcAjOVwzXg%40mail.gmail.com [5] - https://www.postgresql.org/message-id/631312.1707251789%40sss.pgh.pa.us -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Masahiko Sawada
Date:
On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 4, 2025 at 12:23 PM vignesh C <vignesh21@gmail.com> wrote: > > > > There is almost negligible dip with the above suggested way, the test > > results for the same is given below(execution time is in milli > > seconds): > > Brach/records | 100 | 1000 | 10000 | 100000 | 1000000 > > Head | 10.25 | 15.85 | 65.53 | 569.15 | 9194.19 > > Patch | 10.25 | 15.84 | 65.91 | 571.75 | 9208.66 > > % diff | 0.00 | 0.06 | -0.58 | -0.46 | -0.16 > > > > There is a performance dip in the range of 0 to 0.58 percent. > > The attached patch has the changes for the same. The test script used > > is also attached. > > > > The patch still needs more review but the change has negligible > performance impact. The next step is to get more opinions on whether > we should add a new subscription option (say > skip_not_existant_publication) for this work. See patch v1-0002-* in > email [1]. The problem summary is explained in email [2] and in the > commit message of the 0001 patch in this thread. But still, let me > write briefly for the ease of others. > > The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will > lead to restarting of apply worker, and after the restart, the apply > worker will use the existing slot and replication origin corresponding > to the subscription. Now, it is possible that before the restart, the > origin has not been updated, and the WAL start location points to a > location before where PUBLICATION pointed to by SET PUBLICATION > exists. This leads to an error: "ERROR: publication "pub1" does not > exist". Once this error occurs, apply worker will never be able to > proceed and will always return the same error. For users, this is a > problem because they would have created a publication before executing > ALTER SUBSCRIPTION ... SET PUBLICATION .. and now they have no way to > proceed. > > The solution we came up with is to skip loading the publication if the > publication does not exist. We load the publication later and update > the relation entry when the publication gets created. > > The two main concerns with this idea, as shared in email [3], are > performance implications of this change and the possibility of current > behaviour expectations from the users. > > We came up with a solution where the performance impact is negligible, > as shown in the tests [4]. For that, we won't try to reload the > skipped/missing publication for each change but will attempt it only > when any new publication is created/dropped for a valid relation entry > in RelationSyncCache (maintained by pgoutput). Thank you for summarizing the issue. That helps catch up a lot. > > The new option skip_not_existant_publication is to address the second > concern "Imagine you have a subscriber using two publications p1 and > p2, and someone comes around and drops p1 by mistake. With the > proposed patch, the subscription will notice this, but it'll continue > sending data ignoring the missing publication. Yes, it will continue > working, but it's quite possible this breaks the subscriber and it's > be better to fail and stop replicating.". I think that in this particular situation the current behavior would be likely to miss more changes than the patch'ed behavior case. After the logical replication stops, the user would have to alter the subscription to subscribe to only p1 by executing 'ALTER SUBSCRIPTION ... SET PUBLICATION p1' in order to resume the logical replication. In any case, the publisher might be receiving further changes but the subscriber would end up missing changes for tables associated with p2 generated while p2 doesn't exist. Even if the user re-creates the publication p2 after that, it would be hard for users to re-alter the subscription to get changes for tables associated with p1 and p2 from the exact point of p2 being created. Therefore, the subscriber could end up missing some changes that happened between 'CREATE PUBLICATION p2' and 'ALTER SUBSCRIPTION ... SET PUBLICATION p1, p2'. On the other hand, with the patch, the publication can send the changes to tables associated with p1 and p2 as soon as it decodes the WAL record of re-CREATE PUBLICATION p2. > > I see the point of adding such an option to avoid breaking the current > applications (if there are any) that are relying on current behaviour. > But OTOH, I am not sure if users expect us to fail explicitly in such > scenarios. On the side note, with the patch since we ignore missing publications we will be able to create a subscription with whatever publications regardless their existence like: CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000; The walsender corresponding to such subscriber can stream changes as soon as the listed publications are created on the publisher and REFRESH PUBLICATION is executed. > > This is a long-standing behaviour for which we get reports from time > to time, and once analyzing a failure, Tom also looked at it and > agreed that we don't have much choice to avoid skipping non-existent > publications [5]. But we never concluded as to whether skipping should > be a default behavior or an optional one. So, we need more opinions on > it. I'm leaning toward making the skipping behavior a default as I could not find a good benefit for the current behavior (i.e., stopping logical replication due to missing publications). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Sun, Mar 9, 2025 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 4, 2025 at 9:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > I see the point of adding such an option to avoid breaking the current > > applications (if there are any) that are relying on current behaviour. > > But OTOH, I am not sure if users expect us to fail explicitly in such > > scenarios. > > On the side note, with the patch since we ignore missing publications > we will be able to create a subscription with whatever publications > regardless their existence like: > > CREATE SUBSCRIPTION ... PUBLICATION pub1, pub2, pub3, pub4, pub5, ..., pub1000; > > The walsender corresponding to such subscriber can stream changes as > soon as the listed publications are created on the publisher and > REFRESH PUBLICATION is executed. > Right, but OTOH, one can expect that the data should start replicating as soon as one creates a publication on the publisher. However, the data for tables that are part of the publication will start replicating from the point when the decoding process will process the WAL corresponding to Create Publication. I suggest to add something for this in docs unless it is already explained. > > > > This is a long-standing behaviour for which we get reports from time > > to time, and once analyzing a failure, Tom also looked at it and > > agreed that we don't have much choice to avoid skipping non-existent > > publications [5]. But we never concluded as to whether skipping should > > be a default behavior or an optional one. So, we need more opinions on > > it. > > I'm leaning toward making the skipping behavior a default as I could > not find a good benefit for the current behavior (i.e., stopping > logical replication due to missing publications). > Sounds reasonable. We can always add the option at a later point if required. Thanks for your input. We can continue reviewing and committing the current patch. -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > On further thinking, I felt the use of publications_updated variable > is not required we can use publications_valid itself which will be set > if the publication system table is invalidated. Here is a patch for > the same. > The patch relies on the fact that whenever a publication's data is invalidated, it will also invalidate all the RelSyncEntires as per publication_invalidation_cb. But note that we are discussing removing that inefficiency in the thread [1]. So, we should try to rebuild the entry when we have skipped the required publication previously. Apart from this, please consider updating the docs, as mentioned in my response to Sawada-San's email. BTW, I am planning to commit this only on HEAD as this is a behavior change. Please let me know if you guys think otherwise. [1] - https://www.postgresql.org/message-id/OSCPR01MB14966C09AA201EFFA706576A7F5C92%40OSCPR01MB14966.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Dilip Kumar
Date:
On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On further thinking, I felt the use of publications_updated variable
> is not required we can use publications_valid itself which will be set
> if the publication system table is invalidated. Here is a patch for
> the same.
>
The patch relies on the fact that whenever a publication's data is
invalidated, it will also invalidate all the RelSyncEntires as per
publication_invalidation_cb. But note that we are discussing removing
that inefficiency in the thread [1]. So, we should try to rebuild the
entry when we have skipped the required publication previously.
Apart from this, please consider updating the docs, as mentioned in my
response to Sawada-San's email.
I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issue occurs when a publication is created after a certain LSN. When
ALTER SUBSCRIPTION ... SET PUBLICATION
is executed, the subscriber workers restart and request the changes based on restart_lsn
, which is at an earlier LSN in the WAL than the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created and this behavior looks fine. BTW, I am planning to commit this only on HEAD as this is a behavior
change. Please let me know if you guys think otherwise.
Somehow this looks like a bug fix which should be backported no? Am I missing something?
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: >> > >> > On further thinking, I felt the use of publications_updated variable >> > is not required we can use publications_valid itself which will be set >> > if the publication system table is invalidated. Here is a patch for >> > the same. >> > >> >> The patch relies on the fact that whenever a publication's data is >> invalidated, it will also invalidate all the RelSyncEntires as per >> publication_invalidation_cb. But note that we are discussing removing >> that inefficiency in the thread [1]. So, we should try to rebuild the >> entry when we have skipped the required publication previously. >> >> Apart from this, please consider updating the docs, as mentioned in my >> response to Sawada-San's email. > > > I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, the issueoccurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed, thesubscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WAL than theLSN at which the publication was created. This leads to an error, and we are addressing this behavior as part of the fixby skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication is created andthis behavior looks fine. > Yes, your understanding is correct, but note that as such, the patch is simply skipping the missing publication. The skipped changes are because those were on the table that is not part of any publication w.r.t historic snapshot we have at the point of time. >> BTW, I am planning to commit this only on HEAD as this is a behavior >> change. Please let me know if you guys think otherwise. > > > Somehow this looks like a bug fix which should be backported no? Am I missing something? > We can consider this a bug-fix and backpatch it, but I am afraid that because this is a behavior change, some users may not like it. Also, I don't remember seeing public reports for this behavior; that is probably because it is hard to hit. FYI, we found this via BF failures. So, I thought it would be better to get this field tested via HEAD only at this point in time. -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Dilip Kumar
Date:
On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Mar 10, 2025 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > >> > > >> > On further thinking, I felt the use of publications_updated variable > >> > is not required we can use publications_valid itself which will be set > >> > if the publication system table is invalidated. Here is a patch for > >> > the same. > >> > > >> > >> The patch relies on the fact that whenever a publication's data is > >> invalidated, it will also invalidate all the RelSyncEntires as per > >> publication_invalidation_cb. But note that we are discussing removing > >> that inefficiency in the thread [1]. So, we should try to rebuild the > >> entry when we have skipped the required publication previously. > >> > >> Apart from this, please consider updating the docs, as mentioned in my > >> response to Sawada-San's email. > > > > > > I'm not sure I fully understand it, but based on your previous email and the initial email from Vignesh, if IIUC, theissue occurs when a publication is created after a certain LSN. When ALTER SUBSCRIPTION ... SET PUBLICATION is executed,the subscriber workers restart and request the changes based on restart_lsn, which is at an earlier LSN in the WALthan the LSN at which the publication was created. This leads to an error, and we are addressing this behavior as partof the fix by skipping the changes which are between the restart_lsn of subscriber and the lsn at which publication iscreated and this behavior looks fine. > > > > Yes, your understanding is correct, but note that as such, the patch > is simply skipping the missing publication. The skipped changes are > because those were on the table that is not part of any publication > w.r.t historic snapshot we have at the point of time. So, it will skip loading the missing publication up to the LSN where the publication is created and then load it from there, correct? Do we have a test case for this? I couldn't find one in the latest patch or in the email thread to demonstrate this behavior. > >> BTW, I am planning to commit this only on HEAD as this is a behavior > >> change. Please let me know if you guys think otherwise. > > > > > > Somehow this looks like a bug fix which should be backported no? Am I missing something? > > > > We can consider this a bug-fix and backpatch it, but I am afraid that > because this is a behavior change, some users may not like it. Also, I > don't remember seeing public reports for this behavior; that is > probably because it is hard to hit. FYI, we found this via BF > failures. So, I thought it would be better to get this field tested > via HEAD only at this point in time. At the moment, I don't have a strong opinion on this. Since no one has encountered or reported this issue, it might be the case that it's not affecting anyone, and we could simply backpatch without causing any dissatisfaction. However, I'm fine with whatever others decide. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Mon, 10 Mar 2025 at 09:33, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On further thinking, I felt the use of publications_updated variable > > is not required we can use publications_valid itself which will be set > > if the publication system table is invalidated. Here is a patch for > > the same. > > > > The patch relies on the fact that whenever a publication's data is > invalidated, it will also invalidate all the RelSyncEntires as per > publication_invalidation_cb. But note that we are discussing removing > that inefficiency in the thread [1]. So, we should try to rebuild the > entry when we have skipped the required publication previously. > > Apart from this, please consider updating the docs, as mentioned in my > response to Sawada-San's email. The create subscription documentation already has "We allow non-existent publications to be specified so that users can add those later. This means pg_subscription can have non-existent publications." and should be enough at [1]. Let me know if we need to add more documentation. Apart from this I have changed the log level that logs "skipped loading publication" to WARNING as we log a warning "WARNING: publications "pub2", "pub3" do not exist on the publisher" in case of CREATE SUBSCRIPTION and looked similar to this. I can change it to a different log level in case you feel this is not the right level. Also I have added a test case for dilip's comment from [2]. The attached v7 version patch has the changes for the same. [1] - https://www.postgresql.org/docs/devel/sql-createsubscription.html [2] - https://www.postgresql.org/message-id/CAFiTN-tgUR6QLSs3UHK7gx4VP7cURGNkufA_xkrQLw9eCnbGQw%40mail.gmail.com Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Tue, Mar 11, 2025 at 9:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > >> BTW, I am planning to commit this only on HEAD as this is a behavior > > >> change. Please let me know if you guys think otherwise. > > > > > > > > > Somehow this looks like a bug fix which should be backported no? Am I missing something? > > > > > > > We can consider this a bug-fix and backpatch it, but I am afraid that > > because this is a behavior change, some users may not like it. Also, I > > don't remember seeing public reports for this behavior; that is > > probably because it is hard to hit. FYI, we found this via BF > > failures. So, I thought it would be better to get this field tested > > via HEAD only at this point in time. > > At the moment, I don't have a strong opinion on this. Since no one has > encountered or reported this issue, it might be the case that it's not > affecting anyone, and we could simply backpatch without causing any > dissatisfaction. However, I'm fine with whatever others decide. > Sawada-San, others, do you have an opinion on whether to backpatch this change? -- With Regards, Amit Kapila.
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Dilip Kumar
Date:
On Tue, Mar 11, 2025 at 4:01 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 10 Mar 2025 at 09:33, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Mar 4, 2025 at 6:54 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On further thinking, I felt the use of publications_updated variable > > > is not required we can use publications_valid itself which will be set > > > if the publication system table is invalidated. Here is a patch for > > > the same. > > > > > > > The patch relies on the fact that whenever a publication's data is > > invalidated, it will also invalidate all the RelSyncEntires as per > > publication_invalidation_cb. But note that we are discussing removing > > that inefficiency in the thread [1]. So, we should try to rebuild the > > entry when we have skipped the required publication previously. > > > > Apart from this, please consider updating the docs, as mentioned in my > > response to Sawada-San's email. > > The create subscription documentation already has "We allow > non-existent publications to be specified so that users can add those > later. This means pg_subscription can have non-existent publications." > and should be enough at [1]. Let me know if we need to add more > documentation. > > Apart from this I have changed the log level that logs "skipped > loading publication" to WARNING as we log a warning "WARNING: > publications "pub2", "pub3" do not exist on the publisher" in case of > CREATE SUBSCRIPTION and looked similar to this. I can change it to a > different log level in case you feel this is not the right level. > > Also I have added a test case for dilip's comment from [2]. > The attached v7 version patch has the changes for the same. > Thanks, Vignesh, for adding the test. I believe you've tested the effect of DROP PUBLICATION. However, I think we should also test the behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the PUBLICATION, and then create the PUBLICATION at a later stage. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Masahiko Sawada
Date:
On Wed, Mar 12, 2025 at 3:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 11, 2025 at 9:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Mar 10, 2025 at 10:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > >> BTW, I am planning to commit this only on HEAD as this is a behavior > > > >> change. Please let me know if you guys think otherwise. > > > > > > > > > > > > Somehow this looks like a bug fix which should be backported no? Am I missing something? > > > > > > > > > > We can consider this a bug-fix and backpatch it, but I am afraid that > > > because this is a behavior change, some users may not like it. Also, I > > > don't remember seeing public reports for this behavior; that is > > > probably because it is hard to hit. FYI, we found this via BF > > > failures. So, I thought it would be better to get this field tested > > > via HEAD only at this point in time. > > > > At the moment, I don't have a strong opinion on this. Since no one has > > encountered or reported this issue, it might be the case that it's not > > affecting anyone, and we could simply backpatch without causing any > > dissatisfaction. However, I'm fine with whatever others decide. > > > > Sawada-San, others, do you have an opinion on whether to backpatch this change? I'm also afraid of backpatching it so I guess it would be better to push it to only HEAD. I think if users have encountered and we see reported the issue we can consider backpatching again. If regression tests on backbranches continue to fail intermittently, probably we can consider adding waits as the patch Osumi-san proposed[1]? Regards, [1] https://www.postgresql.org/message-id/TYCPR01MB83737A68CD5D554EA82BD7B9EDD39%40TYCPR01MB8373.jpnprd01.prod.outlook.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Wed, 12 Mar 2025 at 16:15, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Thanks, Vignesh, for adding the test. I believe you've tested the > effect of DROP PUBLICATION. However, I think we should also test the > behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the > PUBLICATION, and then create the PUBLICATION at a later stage. I felt having only one test case for this is enough, I have removed the DROP PUBLICATION test and added the SET PUBLICATION test. The attached v8 version patch has the changes for the same. Regards, Vignesh
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Dilip Kumar
Date:
On Thu, Mar 13, 2025 at 7:38 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 12 Mar 2025 at 16:15, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Thanks, Vignesh, for adding the test. I believe you've tested the > > effect of DROP PUBLICATION. However, I think we should also test the > > behavior of ALTER SUBSCRIPTION...SET PUBLICATION before creating the > > PUBLICATION, and then create the PUBLICATION at a later stage. > > I felt having only one test case for this is enough, I have removed > the DROP PUBLICATION test and added the SET PUBLICATION test. The > attached v8 version patch has the changes for the same. Thanks looks good to me. While looking at the patch, I have a few comments/questions + if (pub) + result = lappend(result, pub); + else + { + /* + * When executing 'ALTER SUBSCRIPTION ... SET PUBLICATION', the + * apply worker continues using the existing replication slot and + * origin after restarting. If the replication origin is not + * updated before the restart, the WAL start location may point to + * a position before the specified publication exists, causing + * persistent apply worker restarts and errors. + * + * This ensures that the publication is skipped if it does not + * exist and is loaded when the corresponding WAL record is + * encountered. + */ + ereport(WARNING, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("skipped loading publication: %s", pubname), + errhint("If the publication already exists, ignore it as it will be loaded upon reaching the corresponding WAL record; otherwise, create it.")); + } This comment focuses on a specific use case regarding the problem with 'ALTER SUBSCRIPTION ... SET PUBLICATION,' but in reality, we are addressing a more general case where the user is trying to SET PUBLICATION or even CREATE SUBSCRIPTION, and some publications are missing. Wouldn't it be better to rephrase the comment? 2. + errhint("If the publication already exists, ignore it as it will be loaded upon reaching the corresponding WAL record; otherwise, create it.")); Is this hint correct? This is a question rather than a comment: When we reach a particular WAL where the publication was created, will the publication automatically load, or does the user need to REFRESH the publications? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
vignesh C
Date:
On Thu, 13 Mar 2025 at 09:18, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Thanks looks good to me. > > While looking at the patch, I have a few comments/questions > > + if (pub) > + result = lappend(result, pub); > + else > + { > + /* > + * When executing 'ALTER SUBSCRIPTION ... SET PUBLICATION', the > + * apply worker continues using the existing replication slot and > + * origin after restarting. If the replication origin is not > + * updated before the restart, the WAL start location may point to > + * a position before the specified publication exists, causing > + * persistent apply worker restarts and errors. > + * > + * This ensures that the publication is skipped if it does not > + * exist and is loaded when the corresponding WAL record is > + * encountered. > + */ > + ereport(WARNING, > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("skipped loading publication: %s", pubname), > + errhint("If the publication already exists, ignore it as it will be > loaded upon reaching the corresponding WAL record; otherwise, create > it.")); > + } > > This comment focuses on a specific use case regarding the problem with > 'ALTER SUBSCRIPTION ... SET PUBLICATION,' but in reality, we are > addressing a more general case where the user is trying to SET > PUBLICATION or even CREATE SUBSCRIPTION, and some publications are > missing. Wouldn't it be better to rephrase the comment? How about a comment something like below: /* * In 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and * 'CREATE SUBSCRIPTION', if the replication origin is not updated * before the worker exits, the WAL start location might point to a * position before the publication's WAL record. This can lead to * persistent apply worker restarts and errors. * * Additionally, dropping a subscription's publication should not * disrupt logical replication. * * This ensures that a missing publication is skipped and loaded * when its corresponding WAL record is encountered. */ ereport(WARNING, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("skipped loading publication: %s", pubname), errhint("If the publication is missing, create and refresh it. Otherwise, wait for the slot to reach the WAL record, then refresh")); > 2. + errhint("If the publication already exists, ignore it as it will > be loaded upon reaching the corresponding WAL record; otherwise, > create it.")); > > Is this hint correct? This is a question rather than a comment: When > we reach a particular WAL where the publication was created, will the > publication automatically load, or does the user need to REFRESH the > publications? Users need to refresh the publication in case the relation is not already added to pg_subscription_rel and apply incremental changes. How about an error hint like: "If the publication is missing, create and refresh it. Otherwise, wait for the slot to reach the WAL record for the created publication, then refresh" Regards, Vignesh
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Dilip Kumar
Date:
On Thu, Mar 13, 2025 at 10:49 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 13 Mar 2025 at 09:18, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Thanks looks good to me. > > > > While looking at the patch, I have a few comments/questions > > > > + if (pub) > > + result = lappend(result, pub); > > + else > > + { > > + /* > > + * When executing 'ALTER SUBSCRIPTION ... SET PUBLICATION', the > > + * apply worker continues using the existing replication slot and > > + * origin after restarting. If the replication origin is not > > + * updated before the restart, the WAL start location may point to > > + * a position before the specified publication exists, causing > > + * persistent apply worker restarts and errors. > > + * > > + * This ensures that the publication is skipped if it does not > > + * exist and is loaded when the corresponding WAL record is > > + * encountered. > > + */ > > + ereport(WARNING, > > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("skipped loading publication: %s", pubname), > > + errhint("If the publication already exists, ignore it as it will be > > loaded upon reaching the corresponding WAL record; otherwise, create > > it.")); > > + } > > > > This comment focuses on a specific use case regarding the problem with > > 'ALTER SUBSCRIPTION ... SET PUBLICATION,' but in reality, we are > > addressing a more general case where the user is trying to SET > > PUBLICATION or even CREATE SUBSCRIPTION, and some publications are > > missing. Wouldn't it be better to rephrase the comment? > > How about a comment something like below: > /* > * In 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and > * 'CREATE SUBSCRIPTION', if the replication origin is not updated > * before the worker exits, the WAL start location might point to a > * position before the publication's WAL record. This can lead to > * persistent apply worker restarts and errors. > * > * Additionally, dropping a subscription's publication should not > * disrupt logical replication. > * > * This ensures that a missing publication is skipped and loaded > * when its corresponding WAL record is encountered. > */ Looks fine, shall we add the missing publication point as well something like below /* * In operations like 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and * 'CREATE SUBSCRIPTION', if the specified publication does not exist or * if the replication origin is not updated before the worker exits, * the WAL start location may point to a position prior to the publication's * WAL record. This can cause persistent restarts and errors * in the apply worker. * > * Additionally, dropping a subscription's publication should not > * disrupt logical replication. * > * This ensures that a missing publication is skipped and loaded > * when its corresponding WAL record is encountered. */ > ereport(WARNING, > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("skipped loading publication: %s", pubname), > errhint("If the publication is missing, create and refresh it. > Otherwise, wait for the slot to reach the WAL record, then refresh")); > > > 2. + errhint("If the publication already exists, ignore it as it will > > be loaded upon reaching the corresponding WAL record; otherwise, > > create it.")); > > > > Is this hint correct? This is a question rather than a comment: When > > we reach a particular WAL where the publication was created, will the > > publication automatically load, or does the user need to REFRESH the > > publications? > > Users need to refresh the publication in case the relation is not > already added to pg_subscription_rel and apply incremental changes. > How about an error hint like: > "If the publication is missing, create and refresh it. Otherwise, wait > for the slot to reach the WAL record for the created publication, then > refresh" +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Thu, Mar 13, 2025 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Looks fine, shall we add the missing publication point as well > something like below > > /* > * In operations like 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and > * 'CREATE SUBSCRIPTION', if the specified publication does not exist or > * if the replication origin is not updated before the worker exits, > * the WAL start location may point to a position prior to the publication's > * WAL record. This can cause persistent restarts and errors > * in the apply worker. > * I think that is too much related to pub-sub model, and ideally, pgoutput should not care about it. I have written a comment considering somebody using pgoutput decoding module via APIs. > > * Additionally, dropping a subscription's publication should not > > * disrupt logical replication. > * > > * This ensures that a missing publication is skipped and loaded > > * when its corresponding WAL record is encountered. > */ > > > > > ereport(WARNING, > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("skipped loading publication: %s", pubname), > > errhint("If the publication is missing, create and refresh it. > > Otherwise, wait for the slot to reach the WAL record, then refresh")); > > > > > 2. + errhint("If the publication already exists, ignore it as it will > > > be loaded upon reaching the corresponding WAL record; otherwise, > > > create it.")); > > > > > > Is this hint correct? This is a question rather than a comment: When > > > we reach a particular WAL where the publication was created, will the > > > publication automatically load, or does the user need to REFRESH the > > > publications? > > > > Users need to refresh the publication in case the relation is not > > already added to pg_subscription_rel and apply incremental changes. > > How about an error hint like: > > "If the publication is missing, create and refresh it. Otherwise, wait > > for the slot to reach the WAL record for the created publication, then > > refresh" > I have tried to split this information into errdetail and errhint in the attached. See and let me know what you think of the same. -- With Regards, Amit Kapila.
Attachment
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Dilip Kumar
Date:
On Thu, Mar 13, 2025 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 13, 2025 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Looks fine, shall we add the missing publication point as well > > something like below > > > > /* > > * In operations like 'ALTER SUBSCRIPTION ... ADD/SET PUBLICATION' and > > * 'CREATE SUBSCRIPTION', if the specified publication does not exist or > > * if the replication origin is not updated before the worker exits, > > * the WAL start location may point to a position prior to the publication's > > * WAL record. This can cause persistent restarts and errors > > * in the apply worker. > > * > > I think that is too much related to pub-sub model, and ideally, > pgoutput should not care about it. I have written a comment > considering somebody using pgoutput decoding module via APIs. I agree, here we just need to talk about skipping the missing publication, not different scenarios where that can happen. This comments look much better. > > > * Additionally, dropping a subscription's publication should not > > > * disrupt logical replication. > > * > > > * This ensures that a missing publication is skipped and loaded > > > * when its corresponding WAL record is encountered. > > */ > > > > > > > > > ereport(WARNING, > > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > errmsg("skipped loading publication: %s", pubname), > > > errhint("If the publication is missing, create and refresh it. > > > Otherwise, wait for the slot to reach the WAL record, then refresh")); > > > > > > > 2. + errhint("If the publication already exists, ignore it as it will > > > > be loaded upon reaching the corresponding WAL record; otherwise, > > > > create it.")); > > > > > > > > Is this hint correct? This is a question rather than a comment: When > > > > we reach a particular WAL where the publication was created, will the > > > > publication automatically load, or does the user need to REFRESH the > > > > publications? > > > > > > Users need to refresh the publication in case the relation is not > > > already added to pg_subscription_rel and apply incremental changes. > > > How about an error hint like: > > > "If the publication is missing, create and refresh it. Otherwise, wait > > > for the slot to reach the WAL record for the created publication, then > > > refresh" > > > > I have tried to split this information into errdetail and errhint in > the attached. See and let me know what you think of the same. Yes, the errhint also makes sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add an option to skip loading missing publication to avoid logical replication failure
From
Amit Kapila
Date:
On Thu, Mar 13, 2025 at 7:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Mar 13, 2025 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think that is too much related to pub-sub model, and ideally, > > pgoutput should not care about it. I have written a comment > > considering somebody using pgoutput decoding module via APIs. > Thanks, Dilip and Sawada-San, for the inputs, and Vignesh for the patch. I have pushed the change. -- With Regards, Amit Kapila.