On Tue, 10 Dec 2024 at 07:24, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam@pritambaral.com> wrote:
> > To: pgsql-hackers <pgsql-hackers@postgresql.org>
> > Subject: Subscription sometimes loses txns after initial table sync
> >
> > This was discovered when testing the plan for a major version upgrade via
> > logical replication. Said plan requires that some tables be synced before
> > others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> > by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> > that sometimes, for some tables added this way, txns after the initial data copy
> > are lost by the subscription.
> >
> > A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> > even 12.22 (on either side of the replication setup). The script runs at a
> > default scale of 100 tables with 10k inserts each. This scale is enough to
> > demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
> >
> > In attempts to analyse why this happens, it has been observed that the sender
> > sometimes does not pick up a published table, even when the receiver that
> > started the sender process has seen the table as available (as returned by
> > pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> > finishes (and the tablesync worker is finished), the apply loop on the receiver
> > expects to receive (and apply) subsequent changes for such tables, but simply
> > isn't sent any. This was observed by dumping every CopyData message sent over
> > the wire.
> >
> > The attached script (like the original migration plan) uses a single publication
> > and adds tables to it successively. Curiously, when the script was changed to
> > use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> > PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> > tables with data loss jumped to 100%.
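
For anyone following along, the two variants described above can be sketched
roughly as follows. This is only an illustration of the statement order, not the
attached reproducer: the helper staged_sync_statements and the names "pub",
"sub", "t1" and "t2" are hypothetical, and table names are assumed to be plain
identifiers that need no quoting.

```python
def staged_sync_statements(tables, pub="pub", sub="sub", per_table_publication=False):
    """Return one (publisher_sql, subscriber_sql) pair per table, in sync order.

    All object names here are placeholders chosen for illustration.
    """
    steps = []
    for table in tables:
        if per_table_publication:
            # Variant 2: a dedicated publication per table, attached to the
            # existing subscription with ADD PUBLICATION.
            pub_name = f"{pub}_{table}"
            publisher_sql = f"CREATE PUBLICATION {pub_name} FOR TABLE {table};"
            subscriber_sql = f"ALTER SUBSCRIPTION {sub} ADD PUBLICATION {pub_name};"
        else:
            # Variant 1: a single publication that grows one table at a time,
            # followed by a refresh on the subscriber.
            publisher_sql = f"ALTER PUBLICATION {pub} ADD TABLE {table};"
            subscriber_sql = f"ALTER SUBSCRIPTION {sub} REFRESH PUBLICATION;"
        steps.append((publisher_sql, subscriber_sql))
    return steps


if __name__ == "__main__":
    for publisher_sql, subscriber_sql in staged_sync_statements(["t1", "t2"]):
        print("publisher: ", publisher_sql)
        print("subscriber:", subscriber_sql)
```

Each pair would be run in order, publisher statement first, while inserts into
the table continue on the publisher.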
>
> Thanks for reporting the issue.
>
> The described behavior looks similar to another bug discussed in [1]. If
> possible, could you please check whether the latest patch in that thread fixes
> the bug you reported?
>
> If it does, it would be helpful to share the feedback in that thread.
>
> [1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
>
Hi,

I tried to reproduce the issue on the HEAD and REL_17_STABLE branches and
found it to be intermittent. I ran the script provided in [1] 50 times on
each branch and reproduced the issue 4 times on HEAD and 5 times on
REL_17_STABLE.

I then tested both branches after applying the patches in [2] and ran the
script 50 times. With the patches applied, I was not able to reproduce the
issue.

I think, as Hou-san suggested, the patches in [2] fix this issue.
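
For reference, the kind of check that counts a run as "reproduced" can be
sketched as below. This is a minimal illustration and not the logic of the
script in [1]: the function name tables_with_lost_rows is hypothetical, and the
per-table row counts are assumed to have been collected separately (for
example with SELECT count(*) on each side once the subscriber has caught up).

```python
def tables_with_lost_rows(publisher_counts, subscriber_counts):
    """Map each table that is short on the subscriber to its missing row count.

    Both arguments are dicts of table name -> row count. A table that is
    absent from subscriber_counts is treated as having zero rows.
    """
    lost = {}
    for table, expected in publisher_counts.items():
        actual = subscriber_counts.get(table, 0)
        if actual < expected:
            lost[table] = expected - actual
    return lost


if __name__ == "__main__":
    publisher = {"t1": 10000, "t2": 10000}
    subscriber = {"t1": 10000, "t2": 9990}
    print(tables_with_lost_rows(publisher, subscriber))
```

A run would count as a reproduction if the returned dict is non-empty.
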
[1]: https://www.postgresql.org/message-id/8b595156-d8b6-4b53-a788-7d945726cd2f%40pritambaral.com
[2]: https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
Thanks and Regards,
Shlok Kyal