Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Date
Msg-id CAA4eK1+uJ2u=MOQ=Fy4NqY8KEyxw41GXBCbifD-meVTC0p1mDw@mail.gmail.com
In response to Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION  (Michail Nikolaev <michail.nikolaev@gmail.com>)
List pgsql-hackers
On Tue, Jan 3, 2023 at 8:50 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> > Does that by any chance mean you are using a non-community version of
> > Postgres which has some other changes?
>
> It is a managed Postgres service in the general cloud. Usually, such
> providers apply some custom minor patches.
> The only one I know of forbids canceling queries while they are
> waiting for synchronous replication acknowledgement.
>

Okay, but it would be better to know what other changes they have made.

> > It is possible but ideally, in that case, the client should request
> > such a transaction again.
>
> I am not sure I get you here.
>
> I'll try to explain what I mean:
>
> The patch I'm referring to does not allow canceling a query while it
> is waiting for the ACK of the COMMIT message in case of synchronous
> replication.
> If the synchronous standby is down, the query and connection are simply
> stuck until the server restarts (or until the standby becomes available
> to process the ACK).
> Tuples changed by such a hanging transaction are not visible to other
> transactions. This is all done to prevent spurious tuples from being
> seen in case of a network split.
>
> So it seems we had such a situation during that incident, because our
> synchronous standby was down (before the server restart).
> My thought is just that such transactions (waiting for the ACK of the
> COMMIT) might somehow be handled incorrectly by the logical
> replication engine.
>

I understood this point yesterday, but we do have handling for such
cases. Say the subscriber is down while such synchronous transactions
are in progress: after the restart, it will request that replication
restart from a point prior to those transactions. We ensure this by
using replication origins. See the docs [1] for more information. Now,
it is possible that there is a bug in that mechanism, but it is
difficult to find without some hints from the logs or a reproducible
test. It is also possible that the bug is in another area of the
Postgres code. But, OTOH, we can't rule out the possibility that it is
caused by some feature added by the managed service unless you can
reproduce it on a community Postgres build.
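
To illustrate the idea, here is a toy model in Python. It is only a
sketch of the resume-from-origin behaviour described above, not the
actual Postgres code: the Publisher and Subscriber classes, the integer
LSNs, and the crash_after parameter are all made up for illustration.
The subscriber records the last applied remote commit LSN together with
the applied change, and after a restart it asks the publisher to stream
from that recorded point.

```python
from dataclasses import dataclass, field


@dataclass
class Publisher:
    """Toy publisher: an ordered log of committed transactions."""
    wal: list = field(default_factory=list)  # (commit_lsn, payload) pairs

    def commit(self, payload):
        lsn = len(self.wal) + 1  # hypothetical, monotonically increasing LSN
        self.wal.append((lsn, payload))
        return lsn

    def stream_from(self, start_lsn):
        """Return every transaction whose commit LSN is after start_lsn."""
        return [(lsn, p) for lsn, p in self.wal if lsn > start_lsn]


@dataclass
class Subscriber:
    """Toy subscriber: applied rows plus the origin's remote LSN."""
    applied: list = field(default_factory=list)
    origin_lsn: int = 0  # last remote commit LSN known to be applied

    def apply(self, lsn, payload):
        # In this model the row and the origin progress change together,
        # standing in for "recorded in the same local transaction".
        self.applied.append(payload)
        self.origin_lsn = lsn

    def catch_up(self, publisher, crash_after=None):
        """Request changes after origin_lsn; optionally stop early to mimic a crash."""
        changes = publisher.stream_from(self.origin_lsn)
        for count, (lsn, payload) in enumerate(changes):
            if crash_after is not None and count == crash_after:
                return  # connection lost; origin_lsn still marks the resume point
            self.apply(lsn, payload)


pub = Publisher()
sub = Subscriber()
for row in ["a", "b", "c", "d"]:
    pub.commit(row)

sub.catch_up(pub, crash_after=2)  # applies "a" and "b", then "crashes"
pub.commit("e")                   # committed while the subscriber is down
sub.catch_up(pub)                 # resumes after LSN 2

print(sub.applied)     # ['a', 'b', 'c', 'd', 'e']
print(sub.origin_lsn)  # 5
```

In this model nothing is skipped or applied twice, because the resume
point only advances together with an applied change. A bug of the kind
discussed here would show up as a resume point that is ahead of what
was really applied.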

[1] - https://www.postgresql.org/docs/devel/replication-origins.html

-- 
With Regards,
Amit Kapila.


