Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION
Date
Msg-id CAA4eK1LYq+gJO6V34dVnnYy2adBxZDarvhhxTMFkxDr3Vh5OZg@mail.gmail.com
In response to Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION  (Michail Nikolaev <michail.nikolaev@gmail.com>)
Responses Re: Data loss on logical replication, 12.12 to 14.5, ALTER SUBSCRIPTION  (Michail Nikolaev <michail.nikolaev@gmail.com>)
List pgsql-hackers
On Wed, Dec 28, 2022 at 4:52 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello.
>
> > None of these entries are from the point mentioned by you [1]
> > yesterday where you didn't find the corresponding data in the
> > subscriber. How did you identify that the entries corresponding to
> > that timing were missing?
>
> Some of the before the interval, some after... But the source database
> was generating a lot of WAL during logical replication
> - some of these log entries from time AFTER completion of initial sync
> of B but (probably) BEFORE finishing B table catch up (entering
> streaming mode).
>
...
...
>
> So, shortly the story looks like:
>
> * initial sync of A (and other tables) started and completed, they are
> in streaming mode
> * B and C initial sync started (by altering PUBLICATION and SUBSCRIPTION)
> * B sync completed, but new changes are still applying to the tables
> to catch up primary
>

The point that is not completely clear from your description is the
timing of the missing records. In one of your previous emails, you
seemed to indicate that the data missing from Table B is from the time
when the initial sync for Table B was in progress, right? Also, from
your description, it seems that no error or restart happened during
the initial sync for Table B. Is that understanding correct?
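
If the issue recurs, one way to relate the gap to the sync phases is to
look at the per-table sync state on the subscriber. The following is
only a sketch: it reads the pg_subscription_rel catalog, and the column
aliases are my own choice for illustration.

```sql
-- Run on the subscriber (14.x).
-- srsubstate: i = initialize, d = data is being copied,
--             f = finished table copy, s = synchronized, r = ready
-- srsublsn:   remote LSN of the state change (set in the s and r states)
SELECT s.subname,
       sr.srrelid::regclass AS table_name,
       sr.srsubstate        AS sync_state,
       sr.srsublsn          AS state_lsn
FROM pg_subscription_rel sr
JOIN pg_subscription s ON s.oid = sr.srsubid
ORDER BY s.subname, table_name;
```

Sampling this periodically while Table B is syncing would show when it
moves from the copy phase to catch-up and then to ready, which could
then be compared against the timestamps of the missing rows.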

> * logical replication apply worker is restarted because IO error on WAL receive
> * Postgres killed
> * Postgres restarted
> * C initial sync restarted
> * logical replication apply worker few times restarted because IO
> error on WAL receive
> * finally every table in streaming mode but with small gap in B
>

I am not able to see how these steps can lead to the problem. If the
problem is reproducible at your end, you might want to increase the
log verbosity to DEBUG1 and see whether the logs contain additional
information that can help. It would also be really good to have a
self-contained test that reproduces it.
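
For example, one way to raise the verbosity on the subscriber without a
restart is sketched below. This assumes superuser access and that
log_min_messages is not overridden elsewhere (for instance per database
or per role); adjust as needed.

```sql
-- Raise server log verbosity so that DEBUG1 messages are written
-- to the server log.
ALTER SYSTEM SET log_min_messages = 'debug1';
SELECT pg_reload_conf();

-- Confirm the value now in effect.
SHOW log_min_messages;

-- After collecting the logs, revert to the previous setting.
ALTER SYSTEM RESET log_min_messages;
SELECT pg_reload_conf();
```

DEBUG1 can produce a lot of output on a busy system, so it is probably
best enabled only for the duration of the reproduction.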

-- 
With Regards,
Amit Kapila.


