Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From James Coleman
Subject Re: Synchronizing slots from primary to standby
Msg-id CAAaqYe9FdKODa1a9n=qj+w3NiB9gkwvhRHhcJNginuYYRCnLrg@mail.gmail.com
In response to Re: Synchronizing slots from primary to standby  (Andres Freund <andres@anarazel.de>)
Responses Re: Synchronizing slots from primary to standby  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Fri, Feb 18, 2022 at 5:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote:
> > On 05.02.22 20:59, Andres Freund wrote:
> > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote:
> > > >  From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001
> > > > From: Peter Eisentraut<peter@eisentraut.org>
> > > > Date: Mon, 3 Jan 2022 14:43:36 +0100
> > > > Subject: [PATCH v3] Synchronize logical replication slots from primary to
> > > >   standby
> > > I've just skimmed the patch and the related threads. As far as I can tell this
> > > cannot be safely used without the conflict handling in [1], is that correct?
> >
> > This or similar questions have been asked a few times about this or similar
> > patches, but they always come with some doubt.
>
> I'm certain it's a problem - the only reason I couched it was that there could
> have been something clever in the patch preventing problems that I missed
> because I just skimmed it.
>
>
> > If we think so, it would be
> > useful perhaps if we could come up with test cases that would demonstrate
> > why that other patch/feature is necessary.  (I'm not questioning it
> > personally, I'm just throwing out ideas here.)
>
> The patch as-is just breaks one of the fundamental guarantees necessary for
> logical decoding, that no rows versions can be removed that are still required
> for logical decoding (signalled via catalog_xmin). So there needs to be an
> explicit mechanism upholding that guarantee, but there is not right now from
> what I can see.

I've been working on adding test coverage to prove this out, but I've
encountered the problem reported in [1].

My assumption (Andres, please correct me if I'm wrong) is that we
should see issues with the following steps (given the primary,
physical replica, and logical subscriber already created in the test):

1. Ensure both logical subscriber and physical replica are caught up
2. Disable logical subscription
3. Make a catalog change on the primary (currently renaming the
primary key column)
4. Vacuum pg_class
5. Ensure physical replication is caught up
6. Stop primary and promote the replica
7. Write to the changed table
8. Update subscription to point to promoted replica
9. Re-enable logical subscription
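
For anyone who wants to follow along without reading the attached test,
the steps above can be sketched as an ordered plan of commands. This is
only an illustration, not the test itself: the table, column,
subscription, application names, data directory, and connection string
are all hypothetical placeholders, and the catch-up check assumes the
logical walsender reports the subscription name as its
application_name.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    number: int   # step number from the list above
    node: str     # server the command runs on
    command: str  # SQL, except the "stop primary" entry, which is shell


def caught_up(application_name: str) -> str:
    """SQL for the upstream node: true once the downstream has replayed all WAL."""
    return (
        "SELECT pg_current_wal_lsn() <= replay_lsn "
        "FROM pg_stat_replication "
        f"WHERE application_name = '{application_name}'"
    )


def failover_plan(
    table: str = "tbl",                    # hypothetical table name
    old_col: str = "id",                   # hypothetical integer primary key column
    new_col: str = "id_renamed",           # hypothetical new column name
    sub: str = "sub",                      # hypothetical subscription name
    standby_app: str = "standby",          # hypothetical standby application_name
    primary_datadir: str = "/path/to/primary",  # hypothetical data directory
    promoted_conninfo: str = "host=standby dbname=postgres",  # hypothetical
) -> List[Step]:
    return [
        Step(1, "primary", caught_up(sub)),
        Step(1, "primary", caught_up(standby_app)),
        Step(2, "subscriber", f"ALTER SUBSCRIPTION {sub} DISABLE"),
        Step(3, "primary", f"ALTER TABLE {table} RENAME COLUMN {old_col} TO {new_col}"),
        Step(4, "primary", "VACUUM pg_class"),
        Step(5, "primary", caught_up(standby_app)),
        Step(6, "primary", f"pg_ctl stop -D {primary_datadir}"),  # shell, not SQL
        Step(6, "standby", "SELECT pg_promote()"),
        Step(7, "standby", f"INSERT INTO {table} ({new_col}) VALUES (42)"),
        Step(8, "subscriber", f"ALTER SUBSCRIPTION {sub} CONNECTION '{promoted_conninfo}'"),
        Step(9, "subscriber", f"ALTER SUBSCRIPTION {sub} ENABLE"),
    ]


if __name__ == "__main__":
    for step in failover_plan():
        print(f"{step.number}. [{step.node}] {step.command}")
```

The catch-up queries are meant to be polled until they return true
before moving on to the next step.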

I'm attaching my test as an additional patch in the series for
reference. Currently I have steps 3 and 4 commented out to show that
the issues in [1] occur without any attempt to trigger the catalog
xmin problem.

This error seems pretty significant in that it indicates a
fundamental lack of test coverage (the primary stated benefit of the
patch is physical failover), and it is currently a blocker to testing
more deeply.

Thanks,
James Coleman

[1] https://www.postgresql.org/message-id/TYCPR01MB684949EA7AA904EE938548C79F3A9%40TYCPR01MB6849.jpnprd01.prod.outlook.com
