Re: Data is copied twice when specifying both child and parent table in publication - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Data is copied twice when specifying both child and parent table in publication
Date
Msg-id CAHut+Pv=TROv8mhHtFmXfkZwF2Sc9x5wD-QXWhGjZmStWg+Zog@mail.gmail.com
Whole thread Raw
In response to Re: Data is copied twice when specifying both child and parent table in publication  (Jacob Champion <jchampion@timescale.com>)
Responses Re: Data is copied twice when specifying both child and parent table in publication
List pgsql-hackers
On Fri, Mar 31, 2023 at 5:15 AM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Wed, Mar 29, 2023 at 2:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Pushed.
>
> While rebasing my logical-roots patch over the top of this, I ran into
> another situation where mixed viaroot settings can duplicate data. The
> key idea is to subscribe to two publications with mixed settings, as
> before, and add a partition root that's already been replicated with
> viaroot=false to the other publication with viaroot=true.
>
>   pub=# CREATE TABLE part (a int) PARTITION BY RANGE (a);
>   pub=# CREATE PUBLICATION pub_all FOR ALL TABLES;
>   pub=# CREATE PUBLICATION pub_other FOR TABLE other WITH
> (publish_via_partition_root);
>   -- populate with data, then switch to subscription side
>   sub=# CREATE SUBSCRIPTION sub CONNECTION ... PUBLICATION pub_all, pub_other;
>   -- switch back to publication
>   pub=# ALTER PUBLICATION pub_other ADD TABLE part;
>   -- and back to subscription
>   sub=# ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
>   -- data is now duplicated
>
> (Standalone reproduction attached.)
>
> This is similar to what happens if you alter the
> publish_via_partition_root setting for an existing publication, but
> I'd argue it's easier to hit by accident. Is this part of the same
> class of bugs, or is it different (or even expected) behavior?
>

Hi Jacob. I tried your example. And I can see after the REFRESH the
added table 'part' tablesync is launched and so does the copy causing
duplicate data.

sub=# ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
ALTER SUBSCRIPTION
sub=# 2023-03-31 13:09:30.348 AEDT [334] LOG:  logical replication
table synchronization worker for subscription "sub", table "part" has
started
...

Duplicate data happens because REFRESH PUBLICATION has the default
"refresh_option of copy_data=true.

Although the result is at first a bit unexpected, I am not sure if
anything can be done to make it do what you probably hoped it would
do:

For example, Just imagine if logic could be made smarter to recognize
that since there was already the 'part_def' being subscribed so it
should NOT use the default 'copy_data=true' when the REFRESH launches
the ancestor table 'part'...

Even if that logic was implemented, I have a feeling you could *still*
run into problems if the 'part' table was made of multiple partitions.
I think you might get to a situation where you DO want some partition
data copied (because you did not have it yet but now you are
subscribing to the root you want it) while at the same time, you DON'T
want to get duplicated data from other partitions (because you already
knew about those ones  -- like your example does).

So, I am not sure what the answer is, or maybe there isn't one.

At least, we need to check there are sufficient "BE CAREFUL" warnings
in the documentation for scenarios like this.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Add shared buffer hits to pg_stat_io
Next
From: John Naylor
Date:
Subject: Re: Thoughts on using Text::Template for our autogenerated code?