On Fri, Feb 18, 2022 at 5:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote:
> > On 05.02.22 20:59, Andres Freund wrote:
> > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote:
> > > > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001
> > > > From: Peter Eisentraut<peter@eisentraut.org>
> > > > Date: Mon, 3 Jan 2022 14:43:36 +0100
> > > > Subject: [PATCH v3] Synchronize logical replication slots from primary to
> > > > standby
> > > I've just skimmed the patch and the related threads. As far as I can tell this
> > > cannot be safely used without the conflict handling in [1], is that correct?
> >
> > This or similar questions have been asked a few times about this or similar
> > patches, but they always come with some doubt.
>
> I'm certain it's a problem - the only reason I couched it was that there could
> have been something clever in the patch preventing problems that I missed
> because I just skimmed it.
>
>
> > If we think so, it would be
> > useful perhaps if we could come up with test cases that would demonstrate
> > why that other patch/feature is necessary. (I'm not questioning it
> > personally, I'm just throwing out ideas here.)
>
> The patch as-is just breaks one of the fundamental guarantees necessary for
> logical decoding, that no rows versions can be removed that are still required
> for logical decoding (signalled via catalog_xmin). So there needs to be an
> explicit mechanism upholding that guarantee, but there is not right now from
> what I can see.
I've been working on adding test coverage to prove this out, but I've
encountered the problem reported in [1].
My assumption, but Andres please correct me if I'm wrong, that we
should see issues with the following steps (given the primary,
physical replica, and logical subscriber already created in the test):
1. Ensure both logical subscriber and physical replica are caught up
2. Disable logical subscription
3. Make a catalog change on the primary (currently renaming the
primary key column)
4. Vacuum pg_class
5. Ensure physical replication is caught up
6. Stop primary and promote the replica
7. Write to the changed table
8. Update subscription to point to promoted replica
9. Re-enable logical subscription
I'm attaching my test as an additional patch in the series for
reference. Currently I have steps 3 and 4 commented out to show that
the issues in [1] occur without any attempt to trigger the catalog
xmin problem.
Given this error seems pretty significant in terms of indicating
fundamental lack of test coverage (the primary stated benefit of the
patch is physical failover), and it currently is a blocker to testing
more deeply.
Thanks,
James Coleman
1:
https://www.postgresql.org/message-id/TYCPR01MB684949EA7AA904EE938548C79F3A9%40TYCPR01MB6849.jpnprd01.prod.outlook.com