Re: Data is copied twice when specifying both child and parent table in publication - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Data is copied twice when specifying both child and parent table in publication
Date
Msg-id CAHut+PuHr=nkBXva7Qi1XrkL96dFvQfkPeo7+GZjOFUH=j5Pjw@mail.gmail.com
Whole thread Raw
In response to RE: Data is copied twice when specifying both child and parent table in publication  ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
Responses Re: Data is copied twice when specifying both child and parent table in publication  (Amit Kapila <amit.kapila16@gmail.com>)
RE: Data is copied twice when specifying both child and parent table in publication  ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
List pgsql-hackers
Here are some review comments for patch v20-0001.

======
General.

1.
That function 'pg_get_publication_tables' does not seem to be
described in the PG documentation. Why isn't it in the "System Catalog
Information Functions" table [1] ?

I asked this same question a long time ago but then the reply [2] was
like "it doesn't seem to be a function provided to users".

Well, perhaps that just means that the documentation has been
accidentally missing for a long time. Does anybody know for sure if
the omission of this function from the documentation is deliberate? If
nobody here knows, then maybe this can be asked/addressed in a
separate thread.

======
src/backend/catalog/pg_publication.c

2. filter_partitions

(review comment from my v19 review)

> 2a.
> My previous review [1] (see #1) suggested putting most code within the
> condition. AFAICT my comment still is applicable but was not yet
> addressed.

22/3 Wang-san replied: "Personally, I prefer the current style because
the approach you mentioned adds some indentations."

Sure, but there is more than just indentation/style differences here.
Currently, there is some unnecessary code executed if the table is not
a partition. And the reader cannot tell at-a-glance if (skip) will be
true/false without looking more closely at the loop logic. So, I think
changing it would be better, but anyway I won’t debate about it any
more because it's not a functional problem.

======
src/backend/commands/subscriptioncmds.c

3. fetch_table_list

+ /* Get the list of tables from the publisher. */
+ if (server_version >= 160000)
+ {
+ StringInfoData pub_names;

- appendStringInfoString(&cmd, "FROM pg_catalog.pg_publication_tables t\n"
-    " WHERE t.pubname IN (");
- get_publications_str(publications, &cmd, true);
- appendStringInfoChar(&cmd, ')');
+ initStringInfo(&pub_names);
+ get_publications_str(publications, &pub_names, true);
+
+ /*
+ * From version 16, we allowed passing multiple publications to the
+ * function pg_get_publication_tables. This helped to filter out the
+ * partition table whose ancestor is also published in this
+ * publication array.
+ *
+ * Join pg_get_publication_tables with pg_publication to exclude
+ * non-existing publications.
+ */
+ appendStringInfo(&cmd, "SELECT DISTINCT N.nspname, C.relname,\n"
+ "              ( SELECT array_agg(a.attname ORDER BY a.attnum)\n"
+ "                FROM pg_attribute a\n"
+ "                WHERE a.attrelid = GPT.relid AND\n"
+ "                      a.attnum = ANY(GPT.attrs)\n"
+ "              ) AS attnames\n"
+ " FROM pg_class C\n"
+ "   JOIN pg_namespace N ON N.oid = C.relnamespace\n"
+ "   JOIN ( SELECT (pg_get_publication_tables(VARIADIC
array_agg(pubname::text))).*\n"
+ "          FROM pg_publication\n"
+ "          WHERE pubname IN ( %s )) as GPT\n"
+ "       ON GPT.relid = C.oid\n",
+ pub_names.data);
+
+ pfree(pub_names.data);
+ }
+ else
+ {
+ appendStringInfoString(&cmd, "SELECT DISTINCT t.schemaname, t.tablename \n");
+
+ /* Get column lists for each relation if the publisher supports it */
+ if (check_columnlist)
+ appendStringInfoString(&cmd, ", t.attnames\n");
+
+ appendStringInfoString(&cmd, "FROM pg_catalog.pg_publication_tables t\n"
+    " WHERE t.pubname IN (");
+ get_publications_str(publications, &cmd, true);
+ appendStringInfoChar(&cmd, ')');
+ }

I noticed the SQL "if" part is using uppercase aliases, but the SQL in
the "else" part is using lowercase aliases. I think it would be better
to be consistent (pick one).

======
src/test/subscription/t/013_partition.pl

4.
-# for tab4, we publish changes through the "middle" partitioned table
+# If we subscribe only to pub_lower_level, changes for tab4 will be published
+# through the "middle" partition table. However, since we will be subscribing
+# to both pub_lower_level and pub_all (see subscription sub2 below), we will
+# publish changes via the root table (tab4).
 $node_publisher->safe_psql('postgres',
  "CREATE PUBLICATION pub_lower_level FOR TABLE tab4_1 WITH
(publish_via_partition_root = true)"
 );

~

This comment seemed a bit overkill IMO. I don't think you need to say
much here except maybe:

# Note that subscription "sub2" will later subscribe simultaneously to
both pub_lower_level (i.e. just table tab4_1) and  pub_all.

~~~

5.
I think maybe you could have another test scenario where you INSERT
something into tab4_1_1 while only the publication for tab4_1 has
publish_via_partition_root=true

~~~

6.
AFAICT the tab4 tests are only testing the initial sync, but are not
testing normal replication. Maybe some more normal (post sync) INSERTS
are needed to tab4, tab4_1, tab4_1_1 ?


======
src/test/subscription/t/028_row_filter.pl

7.
+# insert data into partitioned table.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab_rowfilter_viaroot_part(a) VALUES(13), (17)");
+
 $node_subscriber->safe_psql('postgres',
  "CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub_1, tap_pub_2,
tap_pub_3, tap_pub_4a, tap_pub_4b, tap_pub_5a, tap_pub_5b,
tap_pub_toast, tap_pub_inherits, tap_pub_viaroot_2, tap_pub_viaroot_1"
 );
@@ -707,13 +711,17 @@ is($result, qq(t|1), 'check replicated rows to
tab_rowfilter_toast');
 # the row filter for the top-level ancestor:
 #
 # tab_rowfilter_viaroot_part filter is: (a > 15)
+# - INSERT (13)        NO, 13 < 15
 # - INSERT (14)        NO, 14 < 15
 # - INSERT (15)        NO, 15 = 15
 # - INSERT (16)        YES, 16 > 15
+# - INSERT (17)        YES, 17 > 15
 $result =
   $node_subscriber->safe_psql('postgres',
- "SELECT a FROM tab_rowfilter_viaroot_part");
-is($result, qq(16), 'check replicated rows to tab_rowfilter_viaroot_part');
+ "SELECT a FROM tab_rowfilter_viaroot_part ORDER BY 1");
+is( $result, qq(16
+17),
+ 'check replicated rows to tab_rowfilter_viaroot_part');

~

I'm not 100% sure this is testing quite what you want to test. AFAICT
the subscription is created like:

"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub_1, tap_pub_2,
tap_pub_3, tap_pub_4a, tap_pub_4b, tap_pub_5a, tap_pub_5b,
tap_pub_toast, tap_pub_inherits, tap_pub_viaroot_2, tap_pub_viaroot_1"

Notice in this case  BOTH the partitioned table and the partition had
been published using "WITH (publish_via_partition_root)". But, IIUC
won't it be better to test when only the partition's publication was
using that option?

For example, I think then it would be a better test of this "At least one" code:

/* At least one publication is using publish_via_partition_root. */
if (pub_elem->pubviaroot)
    viaroot = true;
======
src/test/subscription/t/031_column_list.pl

8.
- CREATE PUBLICATION pub_root_true FOR TABLE test_root (a) WITH
(publish_via_partition_root = true);
+ CREATE PUBLICATION pub_root_true_1 FOR TABLE test_root (a) WITH
(publish_via_partition_root = true);
+ CREATE PUBLICATION pub_root_true_2 FOR TABLE test_root_1 (a, b) WITH
(publish_via_partition_root = true);

  -- initial data
  INSERT INTO test_root VALUES (1, 2, 3);
  INSERT INTO test_root VALUES (10, 20, 30);
 ));

+# Subscribe to pub_root_true_1 and pub_root_true_2 at the same time, which
+# means that the initial data will be synced once, and only the column list of
+# the parent table (test_root) in the publication pub_root_true_1 will be used
+# for both table sync and data replication.
 $node_subscriber->safe_psql(
  'postgres', qq(
- CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION
pub_root_true;
+ CREATE

~

(This is simlar to the previous review comment #7 above)

Won't it be a better test of the "At least one" code when only the
publication of partition (test_root_1) is using "WITH
(publish_via_partition_root = true)".

e.g
CREATE PUBLICATION pub_root_true_1 FOR TABLE test_root (a);
CREATE PUBLICATION pub_root_true_2 FOR TABLE test_root_1 (a, b) WITH
(publish_via_partition_root = true);

------
[1] https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE
[2]
https://www.postgresql.org/message-id/OS3PR01MB6275FB5397C6A647F262A3A69E009%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Ajin Cherian
Date:
Subject: Re: Support logical replication of DDLs
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Error "initial slot snapshot too large" in create replication slot