Re: Subscription sometimes loses txns after initial table sync - Mailing list pgsql-hackers
From | Shlok Kyal |
---|---|
Subject | Re: Subscription sometimes loses txns after initial table sync |
Date | |
Msg-id | CANhcyEUcY8YxMC0zBS3WQWxUkcTJQ_80rzV8Eu1y2e-sFVxLrg@mail.gmail.com |
In response to | RE: Subscription sometimes loses txns after initial table sync ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
List | pgsql-hackers |
On Tue, 10 Dec 2024 at 07:24, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 9, 2024 9:21 PM Pritam Baral <pritam@pritambaral.com> wrote:
> > To: pgsql-hackers <pgsql-hackers@postgresql.org>
> > Subject: Subscription sometimes loses txns after initial table sync
> >
> > This was discovered when testing the plan for a major version upgrade via
> > logical replication. Said plan requires that some tables be synced before
> > others. So I implemented it using ALTER PUBLICATION ... ADD TABLE ... followed
> > by ALTER SUBSCRIPTION ... REFRESH PUBLICATION. A test for correctness revealed
> > that sometimes, for some tables added this way, txns after the initial data copy
> > are lost by the subscription.
> >
> > A reproducer script is attached. It has been tested with PG 17.2, 14.15, and
> > even 12.22 (on either side of the replication setup). The script runs at a
> > default scale of 100 tables with 10k inserts each. This scale is enough to
> > demonstrate a failure rate of 1% to 9% of tables on my modest laptop.
> >
> > In attempts to analyse why this happens, it has been observed that the sender
> > sometimes does not pick up a published table, even when the receiver that
> > started the sender process has seen the table as available (as returned by
> > pg_get_publication_tables()) and has thus begun COPYing its data. When the COPY
> > finishes (and the tablesync worker is finished), the apply loop on the receiver
> > expects to receive (and apply) subsequent changes for such tables, but simply
> > isn't sent any. This was observed by dumping every CopyData message sent over
> > the wire.
> >
> > The attached script (like the original migration plan) uses a single publication
> > and adds tables to it successively. Curiously, when the script was changed to
> > use a dedicated publication per table (and thus, ALTER SUBSCRIPTION ... ADD
> > PUBLICATION instead of ALTER SUBSCRIPTION ... REFRESH PUBLICATION), the no. of
> > tables with data loss jumped to 100%.
>
> Thanks for reporting the issue.
>
> The described behavior looks similar to another bug discussed in [1]. If
> possible, could you please check if the latest patch in that thread can fix the
> bug you reported?
>
> If it does, it would be helpful to share the feedback in that thread.
>
> [1] https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
>

Hi,

I tried to reproduce the issue on the HEAD and REL_17_STABLE branches. I found
that the issue is intermittent for me. I ran the script provided in [1] 50 times
on both branches and was able to reproduce the issue 4 times and 5 times
respectively.

Then I tested both branches after applying the patches in [2] and ran the script
50 times. I was not able to reproduce the issue with the patches applied.

I think, as Hou-san suggested, the patches in [2] can fix this issue.

[1]: https://www.postgresql.org/message-id/8b595156-d8b6-4b53-a788-7d945726cd2f%40pritambaral.com
[2]: https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Thanks and Regards,
Shlok Kyal
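
For anyone repeating this kind of test, the per-table correctness check can be approximated along the following lines. This is only a minimal sketch and not the reproducer script from [1]: the function names and the sample counts are hypothetical, and it assumes the per-table row counts have already been collected on the publisher and the subscriber after the subscriber has caught up.

```python
from typing import Dict, List


def tables_with_lost_rows(publisher: Dict[str, int],
                          subscriber: Dict[str, int]) -> List[str]:
    """Return the tables whose subscriber row count is below the publisher's.

    A table missing on the subscriber is treated as having zero rows.
    """
    return sorted(
        table
        for table, expected in publisher.items()
        if subscriber.get(table, 0) < expected
    )


def failure_rate(lossy: List[str], total_tables: int) -> float:
    """Fraction of tables that lost at least one row."""
    if total_tables == 0:
        return 0.0
    return len(lossy) / total_tables


# Hypothetical counts for three tables (illustrative values only).
publisher_counts = {"t1": 10_000, "t2": 10_000, "t3": 10_000}
subscriber_counts = {"t1": 10_000, "t2": 9_412, "t3": 10_000}

lossy = tables_with_lost_rows(publisher_counts, subscriber_counts)
print(lossy, failure_rate(lossy, len(publisher_counts)))
```

Because the problem is intermittent, a single clean run says little; the check is only meaningful when repeated over many runs, as with the 50 runs per branch above.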