Hi,

On 2021-02-10 08:02:17 +0530, Amit Kapila wrote:
> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I think similar happens without any of the work done in PG-14 as well
> > > if we restart the apply worker before the commit completes on the
> > > subscriber. After the restart, we will send the start_decoding_at
> > > point based on some previous commit which will make publisher send the
> > > entire transaction again. I don't think restart of WAL sender or WAL
> > > receiver is such a common thing. It can only happen due to some bug in
> > > code or user wishes to stop the nodes or some crash happened.
> >
> > Really? My impression is that the logical replication protocol is
> > supposed to be designed in such a way that once a transaction is
> > successfully confirmed, it won't be sent again. Now if something is
> > not confirmed then it has to be sent again. But if it is confirmed
> > then it shouldn't happen.

Correct.

> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.

That's not quite right. Yes, the flush location isn't guaranteed to be
updated at that point, but a replication client sends the last
location it has received and successfully processed, and that has to
*guarantee* that it won't receive anything twice or miss
anything. Otherwise you've broken the protocol.
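
To make the expectation concrete, here is a toy Python model of it. This
is only a sketch, not PostgreSQL code: `Publisher`, `Subscriber`,
`run_demo` and the integer LSNs are made up for illustration, and it
assumes the subscriber records its progress in the same step as applying
the transaction. A transaction whose apply did not complete is sent
again after a restart; one that was processed and reported is not.

```python
from dataclasses import dataclass, field


@dataclass
class Publisher:
    # (commit_lsn, payload) pairs in commit order; LSNs are plain ints here.
    txns: list

    def stream(self, start_lsn):
        """Send every transaction that committed after start_lsn."""
        return [t for t in self.txns if t[0] > start_lsn]


@dataclass
class Subscriber:
    applied: list = field(default_factory=list)
    origin_lsn: int = 0  # survives restarts; advanced together with the apply

    def apply(self, txn, crash_before_commit=False):
        if crash_before_commit:
            return False  # nothing applied, origin_lsn unchanged
        lsn, payload = txn
        # Data and progress are recorded in one step, standing in for
        # the local commit.
        self.applied.append(payload)
        self.origin_lsn = lsn
        return True


def run_demo():
    pub = Publisher(txns=[(10, "a"), (20, "b"), (30, "c")])
    sub = Subscriber()

    # First connection: "a" is applied, then the worker dies during "b".
    for i, txn in enumerate(pub.stream(sub.origin_lsn)):
        if not sub.apply(txn, crash_before_commit=(i == 1)):
            break

    # After the restart the client asks to resume from the last location
    # it processed, so "a" is not sent again and "b" is not skipped.
    for txn in pub.stream(sub.origin_lsn):
        sub.apply(txn)

    return sub.applied, sub.origin_lsn
```

In this model `run_demo()` ends with each payload applied exactly once,
in commit order. If the reported location could lag behind what was
actually applied, the second `stream()` call would resend "a", which is
the breakage described above.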

Greetings,
Andres Freund