On Tue, Jan 3, 2023 at 2:14 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> > The point which is not completely clear from your description is the
> > timing of missing records. In one of your previous emails, you seem to
> > have indicated that the data missed from Table B is from the time when
> > the initial sync for Table B was in-progress, right? Also, from your
> > description, it seems there is no error or restart that happened
> > during the time of initial sync for Table B. Is that understanding
> > correct?
>
> Yes and yes.
> * B sync started - 08:08:34
> * lost records were created - 09:49:xx
> * B initial sync finished - 10:19:08
> * I/O error with WAL - 10:19:22
> * SIGTERM - 10:35:20
>
> "Finished" here is `logical replication table synchronization worker
> for subscription "cloud_production_main_sub_v4", table "B" has
> finished`.
> As far as I know, that message refers to the COPY command.
>
> > I am not able to see how these steps can lead to the problem.
>
> One idea I have is that it is related to the patch that forbids
> canceling queries while they wait for synchronous replication
> acknowledgement [1].
> It is applied to the Postgres in the cloud we were using [2]. We started
> to see errors like this at 10:24:18:
>
> `The COMMIT record has already flushed to WAL locally and might
> not have been replicated to the standby. We must wait here.`
>
Does that by any chance mean you are using a non-community version of
Postgres which has some other changes?

> I wonder whether it could be some tricky race caused by downtime of the
> synchronous replica, with queries stuck waiting for an ACK forever?
>
It is possible, but ideally, in that case, the client should retry
such a transaction.

--
With Regards,
Amit Kapila.