Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root - Mailing list pgsql-hackers
| From | Masahiko Sawada |
|---|---|
| Subject | Re: [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root |
| Date | |
| Msg-id | CAD21AoD42cJ61nKjriDuX0K-iZWK6xXA2w35NBK_jNN9qKmW8w@mail.gmail.com Whole thread Raw |
| In response to | [PATCH] Fix replica identity mismatch for partitioned tables with publish_via_partition_root (Mikhail Kharitonov <mikhail.kharitonov.dev@gmail.com>) |
| List | pgsql-hackers |
On Mon, May 5, 2025 at 1:56 AM Mikhail Kharitonov <mikhail.kharitonov.dev@gmail.com> wrote: > > Hi hackers, > > An inconsistency was observed when using logical replication on partitioned > tables with the option `publish_via_partition_root = true`: if REPLICA IDENTITY > FULL is set only on the parent table, but not on all partitions, logical > decoding emits UPDATE and DELETE messages with tag 'O' (old tuple) even for > partitions that do not have full replica identity. In those cases, only the > primary key columns are included in the message, which contradicts the expected > meaning of 'O' and violates the logical replication message protocol: > > https://www.postgresql.org/docs/current/protocol-logicalrep-message-formats.html > > This can cause issues in downstream consumers, which interpret > the 'O' tag as implying that a full tuple is present. > > The attached patch resolves the inconsistency by selecting the correct tuple > type ('O' vs 'K') based on the replica identity of the actual leaf relation > being published, rather than using the setting of the root relation alone. > As a result, the format of logical replication messages aligns with > the semantics > defined by the protocol. > > Steps to reproduce: > > 1. Create a partitioned table with REPLICA IDENTITY FULL on the parent > and only one of the partitions. > > 2. Create a publication with `publish_via_partition_root = true`. > > 3. Perform INSERT, UPDATE, DELETE operations through the root table. > > 4. Observe via `pg_recvlogical` that for a partition without full replica > identity, the logical replication stream contains 'O' records with > only key fields. I tested this scenario but what I've seen in my env is somewhat different from the above analysis; pgoutput plugin writes 'O' records as you mentioned, but it doesn't omit non-key fields, but writes NULL as non-key fields. Here are my reproducible steps: create table p (a int not null, b int) partition by list (a); create table c1 partition of p for values in (1); create table c2 partition of p for values in (2); create unique index on c2 (a); alter table p replica identity full; alter table c1 replica identity full; alter table c2 replica identity using INDEX c2_a_idx ; insert into p values (1, 10), (2, 20); create publication pub for all tables with (publish_via_partition_root = 'true'); select pg_create_logical_replication_slot('sub', 'pgoutput'); delete from p where a = 1; delete from p where a = 2; select encode(data, 'escape') from pg_logical_slot_peek_binary_changes('sub', null, null, 'proto_version', '1', 'publication_names', 'pub'); The last pg_logical_slot_peek_binary_changes() writes the two 'D' (delete) messages: 1. D\000\000@\000O\000\x02t\000\000\000\x011t\000\000\000\x0210 2. D\000\000@\000O\000\x02t\000\000\000\x012n What we can know from these messages are: - Both messages have 'O'. - Both messages have two columns ('\000\x02'). - The first message has: the first column '1' (length is 1 ('\000\000\000\x01')), and the second column '10' (length is 2 ('\000\000\000\x02')). - The second message has: the first column '2', and the second column NULL ('n'). From these facts, I guess there could be problematic cases even in the native logical replication. Here are reproducible steps: -- Publisher create table p (a int not null, b int) partition by list (a); create table c1 partition of p for values in (1); create table c2 partition of p for values in (2); create unique index on c2 (a); alter table p replica identity full; alter table c1 replica identity full; alter table c2 replica identity using INDEX c2_a_idx ; insert into p values (1, 10), (2, 20); create publication pub for all tables with (publish_via_partition_root = 'true'); -- Subscriber create table p (a int, b int, c int); create subscription sub connection 'dbname=postgres port=5551' publication pub; -- Publisher delete from p where a = 1; -- generate a message 'DELETE (1, 10)' delete from p where a = 2; -- generate a message 'DELETE (2, NULL)' The second delete message cannot find the tuple on the subscriber, so the table contents are now inconsistent between the publisher and the subscriber. I need more investigation to verify that it's a problem, but this behavior doesn't change even with the proposed change. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: