Re: Data is copied twice when specifying both child and parent table in publication - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Data is copied twice when specifying both child and parent table in publication |
Date | |
Msg-id | CAHut+Pv=TROv8mhHtFmXfkZwF2Sc9x5wD-QXWhGjZmStWg+Zog@mail.gmail.com Whole thread Raw |
In response to | Re: Data is copied twice when specifying both child and parent table in publication (Jacob Champion <jchampion@timescale.com>) |
Responses |
Re: Data is copied twice when specifying both child and parent table in publication
|
List | pgsql-hackers |
On Fri, Mar 31, 2023 at 5:15 AM Jacob Champion <jchampion@timescale.com> wrote: > > On Wed, Mar 29, 2023 at 2:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Pushed. > > While rebasing my logical-roots patch over the top of this, I ran into > another situation where mixed viaroot settings can duplicate data. The > key idea is to subscribe to two publications with mixed settings, as > before, and add a partition root that's already been replicated with > viaroot=false to the other publication with viaroot=true. > > pub=# CREATE TABLE part (a int) PARTITION BY RANGE (a); > pub=# CREATE PUBLICATION pub_all FOR ALL TABLES; > pub=# CREATE PUBLICATION pub_other FOR TABLE other WITH > (publish_via_partition_root); > -- populate with data, then switch to subscription side > sub=# CREATE SUBSCRIPTION sub CONNECTION ... PUBLICATION pub_all, pub_other; > -- switch back to publication > pub=# ALTER PUBLICATION pub_other ADD TABLE part; > -- and back to subscription > sub=# ALTER SUBSCRIPTION sub REFRESH PUBLICATION; > -- data is now duplicated > > (Standalone reproduction attached.) > > This is similar to what happens if you alter the > publish_via_partition_root setting for an existing publication, but > I'd argue it's easier to hit by accident. Is this part of the same > class of bugs, or is it different (or even expected) behavior? > Hi Jacob. I tried your example. And I can see after the REFRESH the added table 'part' tablesync is launched and so does the copy causing duplicate data. sub=# ALTER SUBSCRIPTION sub REFRESH PUBLICATION; ALTER SUBSCRIPTION sub=# 2023-03-31 13:09:30.348 AEDT [334] LOG: logical replication table synchronization worker for subscription "sub", table "part" has started ... Duplicate data happens because REFRESH PUBLICATION has the default "refresh_option of copy_data=true. Although the result is at first a bit unexpected, I am not sure if anything can be done to make it do what you probably hoped it would do: For example, Just imagine if logic could be made smarter to recognize that since there was already the 'part_def' being subscribed so it should NOT use the default 'copy_data=true' when the REFRESH launches the ancestor table 'part'... Even if that logic was implemented, I have a feeling you could *still* run into problems if the 'part' table was made of multiple partitions. I think you might get to a situation where you DO want some partition data copied (because you did not have it yet but now you are subscribing to the root you want it) while at the same time, you DON'T want to get duplicated data from other partitions (because you already knew about those ones -- like your example does). So, I am not sure what the answer is, or maybe there isn't one. At least, we need to check there are sufficient "BE CAREFUL" warnings in the documentation for scenarios like this. ------ Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: