Thread: RE: Subscription sometimes loses txns after initial table sync

RE: Subscription sometimes loses txns after initial table sync

From: Zhijie Hou (Fujitsu)
Date:
On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam@pritambaral.com> wrote:
> To: pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Subscription sometimes loses txns after initial table sync
> 
> This was discovered when testing the plan for a major version upgrade via
> logical replication. Said plan requires that some tables be synced before
> others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> that sometimes, for some tables added this way, txns after the initial data copy
> are lost by the subscription.
> 
> A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> even 12.22 (on either side of the replication setup). The script runs at a
> default scale of 100 tables with 10k inserts each. This scale is enough to
> demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
> 
> In attempts to analyse why this happens, it has been observed that the sender
> sometimes does not pick up a published table, even when the receiver that
> started the sender process has seen the table as available (as returned by
> pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> finishes (and the tablesync worker is finished), the apply loop on the receiver
> expects to receive (and apply) subsequent changes for such tables, but simply
> isn't sent any. This was observed by dumping every CopyData message sent over
> the wire.
> 
> The attached script (like the original migration plan) uses a single publication
> and adds tables to it successively. Curiously, when the script was changed to
> use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> tables with data loss jumped to 100%.
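
For readers following along, the two workflows described in the report can be sketched as a sequence of statements. This is only an illustration, not the attached reproducer: the helper name, the publication name "mig_pub", and the subscription name "mig_sub" are hypothetical, and table names are assumed to be simple identifiers that need no quoting.

```python
from typing import Iterable, Iterator, Tuple


def incremental_sync_statements(
    tables: Iterable[str],
    publication: str = "mig_pub",      # hypothetical publication name
    subscription: str = "mig_sub",     # hypothetical subscription name
    per_table_publication: bool = False,
) -> Iterator[Tuple[str, str]]:
    """Yield (node, sql) pairs for adding tables to replication one at a time.

    With per_table_publication=False, a single publication grows table by
    table and the subscription is refreshed after each addition. With
    per_table_publication=True, each table gets a dedicated publication
    that is then added to the subscription.
    """
    for table in tables:
        if per_table_publication:
            pub = f"{publication}_{table}"
            yield ("publisher", f"CREATE PUBLICATION {pub} FOR TABLE {table}")
            yield ("subscriber",
                   f"ALTER SUBSCRIPTION {subscription} ADD PUBLICATION {pub}")
        else:
            yield ("publisher",
                   f"ALTER PUBLICATION {publication} ADD TABLE {table}")
            yield ("subscriber",
                   f"ALTER SUBSCRIPTION {subscription} REFRESH PUBLICATION")


if __name__ == "__main__":
    for node, sql in incremental_sync_statements(["t1", "t2"]):
        print(f"[{node}] {sql};")
```

The sketch only generates the statements; running them against the right node, and waiting for each table's initial sync to finish before moving on, is left to the caller.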

Thanks for reporting the issue.

The described behavior looks similar to another bug discussed in [1]. If
possible, could you please check if the latest patch in that thread can fix the
bug you reported?

If it does, it would be helpful to share the feedback in that thread.

[1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hou zj

Re: Subscription sometimes loses txns after initial table sync

From: Shlok Kyal
Date:
On Tue, 10 Dec 2024 at 07:24, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam@pritambaral.com> wrote:
> > To: pgsql-hackers <pgsql-hackers@postgresql.org>
> > Subject: Subscription sometimes loses txns after initial table sync
> >
> > This was discovered when testing the plan for a major version upgrade via
> > logical replication. Said plan requires that some tables be synced before
> > others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> > by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> > that sometimes, for some tables added this way, txns after the initial data copy
> > are lost by the subscription.
> >
> > A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> > even 12.22 (on either side of the replication setup). The script runs at a
> > default scale of 100 tables with 10k inserts each. This scale is enough to
> > demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
> >
> > In attempts to analyse why this happens, it has been observed that the sender
> > sometimes does not pick up a published table, even when the receiver that
> > started the sender process has seen the table as available (as returned by
> > pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> > finishes (and the tablesync worker is finished), the apply loop on the receiver
> > expects to receive (and apply) subsequent changes for such tables, but simply
> > isn't sent any. This was observed by dumping every CopyData message sent over
> > the wire.
> >
> > The attached script (like the original migration plan) uses a single publication
> > and adds tables to it successively. Curiously, when the script was changed to
> > use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> > PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> > tables with data loss jumped to 100%.
>
> Thanks for reporting the issue.
>
> The described behavior looks similar to another bug discussed in [1]. If
> possible, could you please check if the latest patch in that thread can fix the
> bug you reported?
>
> If it does, it would be helpful to share the feedback in that thread.
>
> [1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
>

Hi,

I tried to reproduce the issue on the HEAD and REL_17_STABLE branches and
found it to be intermittent. I ran the script provided in [1] 50 times on
both branches and was able to reproduce the issue 4 times on HEAD and
5 times on REL_17_STABLE.
I then tested both branches after applying the patches in [2] and ran the
script 50 times. With the patches applied, I was not able to reproduce the
issue.
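
Deciding whether a given run reproduced the issue requires some per-table check for lost rows. A minimal sketch of such a check, assuming per-table row counts have already been collected from both nodes (for example with SELECT count(*)); the function names are hypothetical, and the script in [1] may verify this differently.

```python
from typing import Dict


def tables_with_lost_rows(
    publisher_counts: Dict[str, int],
    subscriber_counts: Dict[str, int],
) -> Dict[str, int]:
    """Return {table: missing_row_count} for tables the subscriber is behind on.

    A table absent from subscriber_counts is treated as having zero rows.
    """
    lost = {}
    for table, expected in publisher_counts.items():
        actual = subscriber_counts.get(table, 0)
        if actual < expected:
            lost[table] = expected - actual
    return lost


def failure_rate(
    publisher_counts: Dict[str, int],
    subscriber_counts: Dict[str, int],
) -> float:
    """Fraction of published tables that are missing rows on the subscriber."""
    if not publisher_counts:
        return 0.0
    lost = tables_with_lost_rows(publisher_counts, subscriber_counts)
    return len(lost) / len(publisher_counts)


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    pub = {"t1": 10000, "t2": 10000}
    sub = {"t1": 10000, "t2": 9500}
    print(tables_with_lost_rows(pub, sub), failure_rate(pub, sub))
```

Such a comparison is only meaningful once replication has had time to catch up, so the counts should be taken after the publisher is idle and the subscriber has applied everything it was sent.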

I think, as Hou-san suggested, the patches in [2] can fix this issue.

[1]: https://www.postgresql.org/message-id/8b595156-d8b6-4b53-a788-7d945726cd2f%40pritambaral.com
[2]: https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Thanks and Regards,
Shlok Kyal