Thread: repeated decoding of prepared transactions

From
Markus Wanner
Date:
Amit, Ajin, hackers,

testing logical decoding for two-phase transactions, I stumbled over 
what I first thought was a bug.  But comments seem to indicate this is 
intended behavior.  Could you please clarify or elaborate on the design 
decision?  Or confirm this indeed is a bug?

What puzzled me is that if a decoder is restarted in between the PREPARE 
and the COMMIT PREPARED, it repeats the entire transaction, despite it 
being already sent and potentially prepared on the receiving side.

In terms of `pg_logical_slot_get_changes` (and roughly from the 
prepare.sql test), this looks as follows:

                         data
----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
(3 rows)


This is the first delivery of the transaction.  After a restart, it will 
get all of the changes again, though:


                         data
----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
  COMMIT PREPARED 'test_prepared#1'
(4 rows)


I did not expect this, as any receiver that wants decoded 2PC likely 
supports some kind of two-phase commit itself, and would therefore 
prepare the transaction upon its first reception.  Potentially 
receiving it a second time would require complicated filtering on every 
prepared transaction.

Furthermore, this unnecessarily holds back the restart LSN: even a 
single prepared transaction can block advancing it.  In most cases, 
prepared transactions are short lived.  But on the other hand, there 
may be an arbitrary number of other transactions in between a PREPARE 
and the corresponding COMMIT PREPARED in the WAL.  Not being able to 
advance past a prepared transaction seems like a bad thing in such a 
case.

I fail to see where this repetition would ever be useful.  Is there any 
reason for the current implementation that I'm missing or can this be 
corrected?  Thanks for elaborating.

Regards

Markus



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> Amit, Ajin, hackers,
>
> testing logical decoding for two-phase transactions, I stumbled over
> what I first thought is a bug.  But comments seems to indicate this is
> intended behavior.  Could you please clarify or elaborate on the design
> decision?  Or indicate this indeed is a bug?
>
> What puzzled me is that if a decoder is restarted in between the PREPARE
> and the COMMIT PREPARED, it repeats the entire transaction, despite it
> being already sent and potentially prepared on the receiving side.
>
> In terms of `pg_logical_slot_get_changes` (and roughly from the
> prepare.sql test), this looks as follows:
>
>                          data
> ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
> (3 rows)
>
>
> This is the first delivery of the transaction.  After a restart, it will
> get all of the changes again, though:
>
>
>                          data
> ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
>   COMMIT PREPARED 'test_prepared#1'
> (4 rows)
>
>
> I did not expect this, as any receiver that wants to have decoded 2PC is
> likely supporting some kind of two-phase commits itself.  And would
> therefore prepare the transaction upon its first reception.  Potentially
> receiving it a second time would require complicated filtering on every
> prepared transaction.
>

The reason was mentioned in ReorderBufferFinishPrepared(). See the
following comment in the code:
/*
 * It is possible that this transaction is not decoded at prepare time
 * either because by that time we didn't have a consistent snapshot or it
 * was decoded earlier but we have restarted. We can't distinguish between
 * those two cases so we send the prepare in both the cases and let
 * downstream decide whether to process or skip it. We don't need to
 * decode the xact for aborts if it is not done already.
 */
This won't happen when we replicate via pgoutput (the patch for which
is still not committed) because it won't restart from a previous point
(unless the server needs to be restarted for some reason) the way you
are doing via the logical decoding APIs. Also, we don't send the
prepared xacts again on repeated calls of
pg_logical_slot_get_changes() unless we encounter the commit. This
behavior, and a way to skip the repeated prepare, is already explained
in the docs [1] (see the Transaction Begin Prepare Callback section).

> Furthermore, this clearly and unnecessarily holds back the restart LSN.
> Meaning even just a single prepared transaction can block advancing the
> restart LSN.  In most cases, these are short lived.  But on the other
> hand, there may be an arbitrary amount of other transactions in between
> a PREPARE and the corresponding COMMIT PREPARED in the WAL.  Not being
> able to advance over a prepared transaction seems like a bad thing in
> such a case.
>

That is true even without this work, where restart_lsn can only be
advanced on commits. We haven't changed anything in that regard.

[1] - https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin.html


-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Markus Wanner
Date:
Hello Amit,

thanks for your very quick response.

On 08.02.21 11:13, Amit Kapila wrote:
> /*
>   * It is possible that this transaction is not decoded at prepare time
>   * either because by that time we didn't have a consistent snapshot or it
>   * was decoded earlier but we have restarted. We can't distinguish between
>   * those two cases so we send the prepare in both the cases and let
>   * downstream decide whether to process or skip it. We don't need to
>   * decode the xact for aborts if it is not done already.
>   */

The way I read the surrounding code, the only case in which a 2PC 
transaction does not get decoded at prepare time is if the transaction 
is empty.  Or are you aware of any other situation that might currently 
happen?

> (unless the server needs to be restarted due to some reason)

Right, the repetition occurs only after a restart of the walsender in 
between a prepare and a commit prepared record.

> That anyway is true without this work as well where restart_lsn can be
> advanced on commits. We haven't changed anything in that regard.

I did not mean to blame the patch, but merely try to understand some of 
the design decisions behind it.

And as I just learned, even if we managed to avoid the repetition, a 
restarted walsender still needs to see prepared transactions as 
in-progress in its snapshots.  So we cannot move the restart_lsn 
forward past a prepare record (until the final commit or rollback is 
consumed).

Best Regards

Markus



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> Hello Amit,
>
> thanks for your very quick response.
>
> On 08.02.21 11:13, Amit Kapila wrote:
> > /*
> >   * It is possible that this transaction is not decoded at prepare time
> >   * either because by that time we didn't have a consistent snapshot or it
> >   * was decoded earlier but we have restarted. We can't distinguish between
> >   * those two cases so we send the prepare in both the cases and let
> >   * downstream decide whether to process or skip it. We don't need to
> >   * decode the xact for aborts if it is not done already.
> >   */
>
> The way I read the surrounding code, the only case a 2PC transaction
> does not get decoded a prepare time is if the transaction is empty.  Or
> are you aware of any other situation that might currently happen?
>

We also skip decoding at prepare time if we haven't reached a
consistent snapshot by that time. See below code in DecodePrepare().
DecodePrepare()
{
..
    /* We can't start streaming unless a consistent state is reached. */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
    {
        ReorderBufferSkipPrepare(ctx->reorder, xid);
        return;
    }
..
}

There are other reasons as well, e.g. the output plugin may not want
to allow decoding at prepare time, but I don't think those are
relevant to the discussion here.

> > (unless the server needs to be restarted due to some reason)
>
> Right, the repetition occurs only after a restart of the walsender in
> between a prepare and a commit prepared record.
>
> > That anyway is true without this work as well where restart_lsn can be
> > advanced on commits. We haven't changed anything in that regard.
>
> I did not mean to blame the patch, but merely try to understand some of
> the design decisions behind it.
>
> And as I just learned, even if we managed to avoid the repetition, a
> restarted walsender still needs to see prepared transactions as
> in-progress in its snapshots.  So we cannot move forward the restart_lsn
> to after a prepare record (until the final commit or rollback is consumed).
>

Right. And say we forget the prepared transactions and move the
restart_lsn forward once we get the prepare for a transaction. Then we
open up a window where we haven't actually sent the prepared xact
(because, say, the snapshot had not yet reached a consistent state)
but we have moved the restart_lsn. Later, when we get the commit
corresponding to that prepared transaction, by which time the snapshot
has reached a consistent state, we will miss sending the transaction
contents and its prepare. I think it is for such reasons that we allow
the restart_lsn to move only once the transaction is finished
(committed or rolled back).

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Ashutosh Bapat
Date:
On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> <markus.wanner@enterprisedb.com> wrote:
> >
> > Hello Amit,
> >
> > thanks for your very quick response.
> >
> > On 08.02.21 11:13, Amit Kapila wrote:
> > > /*
> > >   * It is possible that this transaction is not decoded at prepare time
> > >   * either because by that time we didn't have a consistent snapshot or it
> > >   * was decoded earlier but we have restarted. We can't distinguish between
> > >   * those two cases so we send the prepare in both the cases and let
> > >   * downstream decide whether to process or skip it. We don't need to
> > >   * decode the xact for aborts if it is not done already.
> > >   */
> >
> > The way I read the surrounding code, the only case a 2PC transaction
> > does not get decoded a prepare time is if the transaction is empty.  Or
> > are you aware of any other situation that might currently happen?
> >
>
> We also skip decoding at prepare time if we haven't reached a
> consistent snapshot by that time. See below code in DecodePrepare().
> DecodePrepare()
> {
> ..
> /* We can't start streaming unless a consistent state is reached. */
> if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
> {
> ReorderBufferSkipPrepare(ctx->reorder, xid);
> return;
> }
> ..
> }

Can you please provide steps which can lead to this situation? If
there is an earlier discussion which has example scenarios, please
point us to the relevant thread.

If we are not sending PREPARED transactions, that's fine; but sending
the same prepared transaction as many times as the WAL sender is
restarted between sending the prepare and the commit prepared is a
waste of network bandwidth. The waste is proportional to the amount of
changes in the transaction and the number of such transactions, and it
will also degrade performance. So if we can avoid resending prepared
transactions, that will help.

-- 
Best Wishes,
Ashutosh Bapat



Re: repeated decoding of prepared transactions

From
Ajin Cherian
Date:
On Tue, Feb 9, 2021 at 4:59 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

> Can you please provide steps which can lead to this situation? If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>
> If we are not sending PREPARED transactions that's fine, but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth. The wastage is proportional to the amount of
> changes in the transaction and number of such transactions themselves.
> Also this will cause performance degradation. So if we can avoid
> resending prepared transactions twice that will help.

One such scenario is demonstrated by the test case in

postgres/contrib/test_decoding/specs/twophase_snapshot.spec

regards,
Ajin Cherian
Fujitsu Australia



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Tue, Feb 9, 2021 at 11:29 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> > <markus.wanner@enterprisedb.com> wrote:
> > >
> > > Hello Amit,
> > >
> > > thanks for your very quick response.
> > >
> > > On 08.02.21 11:13, Amit Kapila wrote:
> > > > /*
> > > >   * It is possible that this transaction is not decoded at prepare time
> > > >   * either because by that time we didn't have a consistent snapshot or it
> > > >   * was decoded earlier but we have restarted. We can't distinguish between
> > > >   * those two cases so we send the prepare in both the cases and let
> > > >   * downstream decide whether to process or skip it. We don't need to
> > > >   * decode the xact for aborts if it is not done already.
> > > >   */
> > >
> > > The way I read the surrounding code, the only case a 2PC transaction
> > > does not get decoded a prepare time is if the transaction is empty.  Or
> > > are you aware of any other situation that might currently happen?
> > >
> >
> > We also skip decoding at prepare time if we haven't reached a
> > consistent snapshot by that time. See below code in DecodePrepare().
> > DecodePrepare()
> > {
> > ..
> > /* We can't start streaming unless a consistent state is reached. */
> > if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
> > {
> > ReorderBufferSkipPrepare(ctx->reorder, xid);
> > return;
> > }
> > ..
> > }
>
> Can you please provide steps which can lead to this situation?
>

Ajin has already shared the example with you.

> If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>

It started with the email [1], and from there you can read the later
emails to learn more about this.

> If we are not sending PREPARED transactions that's fine,
>

Hmm, I am not sure if that is fine because if the output plugin sets
the two-phase-commit option, it would expect all prepared xacts to
arrive, not only some of them.

> but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth.
>

I think something similar happens without any of the work done in
PG-14 as well if we restart the apply worker before the commit
completes on the subscriber. After the restart, we will send the
start_decoding_at point based on some previous commit, which will make
the publisher send the entire transaction again. I don't think a
restart of the WAL sender or WAL receiver is such a common thing; it
can only happen due to some bug in the code, because the user stops
the nodes, or because of a crash.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bd3gzCyzsYjt1m6sfGf_C_uFmo9JK%3D3Wafp6yR8Mg8uQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Robert Haas
Date:
On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think similar happens without any of the work done in PG-14 as well
> if we restart the apply worker before the commit completes on the
> subscriber. After the restart, we will send the start_decoding_at
> point based on some previous commit which will make publisher send the
> entire transaction again. I don't think restart of WAL sender or WAL
> receiver is such a common thing. It can only happen due to some bug in
> code or user wishes to stop the nodes or some crash happened.

Really? My impression is that the logical replication protocol is
supposed to be designed in such a way that once a transaction is
successfully confirmed, it won't be sent again. Now if something is
not confirmed then it has to be sent again. But if it is confirmed
then it shouldn't happen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think similar happens without any of the work done in PG-14 as well
> > if we restart the apply worker before the commit completes on the
> > subscriber. After the restart, we will send the start_decoding_at
> > point based on some previous commit which will make publisher send the
> > entire transaction again. I don't think restart of WAL sender or WAL
> > receiver is such a common thing. It can only happen due to some bug in
> > code or user wishes to stop the nodes or some crash happened.
>
> Really? My impression is that the logical replication protocol is
> supposed to be designed in such a way that once a transaction is
> successfully confirmed, it won't be sent again. Now if something is
> not confirmed then it has to be sent again. But if it is confirmed
> then it shouldn't happen.
>

If by successfully confirmed you mean that once the subscriber node
has received a transaction it won't be sent again, then as far as I
know that is not true. We rely on the flush location sent by the
subscriber to advance the decoding locations. We update the flush
locations after we apply the transaction's commit successfully. Also,
after a restart, we use the replication origin's last flush location
as the point from which we need the transactions, and the origin's
progress is updated at commit time.

OTOH, if by successfully confirmed you mean that once the subscriber
has applied the complete transaction (including the commit), then you
are right that it won't be sent again.

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Ashutosh Bapat
Date:


On Wed, Feb 10, 2021 at 8:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I think similar happens without any of the work done in PG-14 as well
> > > if we restart the apply worker before the commit completes on the
> > > subscriber. After the restart, we will send the start_decoding_at
> > > point based on some previous commit which will make publisher send the
> > > entire transaction again. I don't think restart of WAL sender or WAL
> > > receiver is such a common thing. It can only happen due to some bug in
> > > code or user wishes to stop the nodes or some crash happened.
> >
> > Really? My impression is that the logical replication protocol is
> > supposed to be designed in such a way that once a transaction is
> > successfully confirmed, it won't be sent again. Now if something is
> > not confirmed then it has to be sent again. But if it is confirmed
> > then it shouldn't happen.
> >
>
> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.
>
> OTOH, If by successfully confirmed, you mean that once the subscriber
> has applied the complete transaction (including commit), then you are
> right that it won't be sent again.

I think we need to treat a prepared transaction slightly different from an uncommitted transaction when sending downstream. We need to send a whole uncommitted transaction downstream again because previously applied changes must have been aborted and hence lost by the downstream and thus it needs to get all of those again. But when a downstream prepares a transaction, even if it's not committed, those changes are not lost even after restart of a walsender. If the downstream confirms that it has "flushed" PREPARE, there is no need to send all the changes again.

--
Best Wishes,
Ashutosh

Re: repeated decoding of prepared transactions

From
Ajin Cherian
Date:
On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:


> I think we need to treat a prepared transaction slightly different from an uncommitted transaction when sending
> downstream. We need to send a whole uncommitted transaction downstream again because previously applied changes must
> have been aborted and hence lost by the downstream and thus it needs to get all of those again. But when a downstream
> prepares a transaction, even if it's not committed, those changes are not lost even after restart of a walsender. If the
> downstream confirms that it has "flushed" PREPARE, there is no need to send all the changes again.

But the other side of the problem is that, without this, if the
prepared transaction is prior to a consistent snapshot when decoding
starts/restarts, then only the "commit prepared" is sent downstream
(as seen in the test scenario I shared above), and the downstream has
to error out on the commit prepared because it does not have the
corresponding prepared transaction. We did not find an easy way to
distinguish between these two scenarios for prepared transactions:
a. A consistent snapshot being formed in between a prepare and a
commit prepared for the first time.
b. Decoder restarting between a prepare and a commit prepared.

For plugins to be able to handle this, we have added a special
callback "Begin Prepare", as explained in [1], section 49.6.4.10:

"The required begin_prepare_cb callback is called whenever the start
of a prepared transaction has been decoded. The gid field, which is
part of the txn parameter can be used in this callback to check if the
plugin has already received this prepare in which case it can skip the
remaining changes of the transaction. This can only happen if the user
restarts the decoding after receiving the prepare for a transaction
but before receiving the commit prepared say because of some error."

The pgoutput plugin is also being updated to be able to handle this
situation of duplicate prepared transactions.

regards,
Ajin Cherian
Fujitsu Australia



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>
>
> > I think we need to treat a prepared transaction slightly different from an uncommitted transaction when sending
> > downstream. We need to send a whole uncommitted transaction downstream again because previously applied changes must
> > have been aborted and hence lost by the downstream and thus it needs to get all of those again. But when a downstream
> > prepares a transaction, even if it's not committed, those changes are not lost even after restart of a walsender. If the
> > downstream confirms that it has "flushed" PREPARE, there is no need to send all the changes again.
>
> But the other side of the problem is that ,without this, if the
> prepared transaction is prior to a consistent snapshot when decoding
> starts/restarts, then only the "commit prepared" is sent to downstream
> (as seen in the test scenario I shared above), and downstream has to
> error away the commit prepared because it does not have the
> corresponding prepared transaction.
>

I think it is not merely a matter of error handling; it is required
for data consistency. We need to send the transactions whose commits
are encountered after a consistent snapshot is reached.

--
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Markus Wanner
Date:
On 10.02.21 07:32, Amit Kapila wrote:
> On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
>> But the other side of the problem is that ,without this, if the
>> prepared transaction is prior to a consistent snapshot when decoding
>> starts/restarts, then only the "commit prepared" is sent to downstream
>> (as seen in the test scenario I shared above), and downstream has to
>> error away the commit prepared because it does not have the
>> corresponding prepared transaction.
> 
> I think it is not only simple error handling, it is required for
> data-consistency. We need to send the transactions whose commits are
> encountered after a consistent snapshot is reached.

I'm with Ashutosh here.  If a replica is properly in sync, it knows 
about the prepared transactions and all of their gids.  Sending the 
transactional changes and the prepare again is inconsistent.

The point of a two-phase transaction is to have two phases.  An output 
plugin must have the chance to treat them as independent events.  Once 
a PREPARE is confirmed, it must not be sent again, even if the 
transaction is still in progress and its changes are not yet visible 
on the origin node.

Regards

Markus



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Wed, Feb 10, 2021 at 1:40 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> On 10.02.21 07:32, Amit Kapila wrote:
> > On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >> But the other side of the problem is that ,without this, if the
> >> prepared transaction is prior to a consistent snapshot when decoding
> >> starts/restarts, then only the "commit prepared" is sent to downstream
> >> (as seen in the test scenario I shared above), and downstream has to
> >> error away the commit prepared because it does not have the
> >> corresponding prepared transaction.
> >
> > I think it is not only simple error handling, it is required for
> > data-consistency. We need to send the transactions whose commits are
> > encountered after a consistent snapshot is reached.
>
> I'm with Ashutosh here.  If a replica is properly in sync, it knows
> about prepared transactions and all the gids of those.  Sending the
> transactional changes and the prepare again is inconsistent.
>
> The point of a two-phase transaction is to have two phases.  An output
> plugin must have the chance of treating them as independent events.
>

I am not sure I understand what problem you are facing in dealing with
this in the output plugin; it is explained in the docs, and Ajin also
pointed out the same. Ajin and I have explained the design constraints
on the publisher side due to which we have done it this way. Do you
have any better ideas to deal with this?

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Robert Haas
Date:
On Tue, Feb 9, 2021 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.
>
> OTOH, If by successfully confirmed, you mean that once the subscriber
> has applied the complete transaction (including commit), then you are
> right that it won't be sent again.

I meant - once the subscriber has advanced the flush location.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> I did not expect this, as any receiver that wants to have decoded 2PC is
> likely supporting some kind of two-phase commits itself.  And would
> therefore prepare the transaction upon its first reception.  Potentially
> receiving it a second time would require complicated filtering on every
> prepared transaction.
>

I would like to bring one other scenario to your notice where you
might want to handle things differently for prepared transactions on
the plugin side. Assume we have multiple publications (for simplicity,
say 2) on the publisher, with corresponding subscriptions (say 2, each
corresponding to one publication on the publisher). When a user
performs a transaction on the publisher that involves tables from both
publications, on the subscriber side we do that work via two different
transactions, one corresponding to each subscription. But we need some
way to deal with prepared xacts because they need a GID, and we can't
use the same GID for both subscriptions. Please see the detailed
example and one idea to deal with this in the main thread [1]. It
would be really helpful if you or others working on the plugin side
could share your opinion on it.

Now, coming back to the restart case where the prepared transaction
can be sent again by the publisher. I understand your and others'
point that we should not send the prepared transaction if there is a
restart between prepare and commit, but there are reasons why we have
done it that way, and I am open to your suggestions. I'll once again
try to explain the exact case, which is not very apparent. The basic
idea is that we ship/replay all transactions whose commit happens
after the snapshot has reached a consistent state
(SNAPBUILD_CONSISTENT); see atop snapbuild.c for details. Now, for
transactions where the prepare is before snapshot state
SNAPBUILD_CONSISTENT and the commit prepared is after
SNAPBUILD_CONSISTENT, we need to send the entire transaction,
including the prepare, at commit time. One might think it is quite
easy to detect that: basically, if we skip a prepare while the
snapshot state is not yet SNAPBUILD_CONSISTENT, mark a flag in
ReorderBufferTxn and use it during commit to decide whether to send
the prepare. Unfortunately, it is not that easy. There is always a
chance that on restart we reuse a snapshot serialized by some other
walsender at a location prior to the prepare, and if that happens then
this time the prepare won't be skipped due to the snapshot state
(SNAPBUILD_CONSISTENT) but due to the start_decoding_at point
(considering we have already shipped some of the later commits but not
the prepare). Now, this is effectively the same situation as a restart
happening after we have sent the prepare but not the commit. This is
the reason we have to resend the prepare when the subscriber restarts
between prepare and commit.

You can reproduce the case where we can't distinguish between two
situations by using the test case in twophase_snapshot.spec and
additionally starting a separate session via the debugger. So, the
steps in the test case are as below:

"s2b" "s2txid" "s1init" "s3b" "s3txid" "s2c" "s2b" "s2insert" "s2p"
"s3c" "s1insert" "s1start" "s2cp" "s1start"

Define new steps as

"s4init" {SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot_1',
'test_decoding');}
"s4start" {SELECT data  FROM
pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit',
'1');}

The first thing we need to do is s4init and stop the debugger in
SnapBuildProcessRunningXacts. Now perform steps from 's2b' till first
's1start' in twophase_snapshot.spec. Then continue in the s4 session
and perform s4start. After this, if you debug (or add the logs) the
second s1start, you will notice that we are skipping prepare not
because of inconsistent snapshot but a forward location in
start_decoding_at. If you don't involve session-4, then it will always
skip the prepare due to an inconsistent snapshot state. This involves a
debugger, so it is not easy to write an automated test for it.

I have used a somewhat tricky scenario to explain this, but I am not
sure there was any simpler way.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BLvkeX%3DB3xon7RcBwD4CVaFSryPj3pTBAALrDxQVPDwA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Markus Wanner
Date:
Hello Amit,

thanks a lot for your extensive explanation and examples, I appreciate 
this very much.  I'll need to think this through and see how we can make 
this work for us.

Best Regards

Markus



Re: repeated decoding of prepared transactions

From
Robert Haas
Date:
On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> to explain the exact case to you which is not very apparent. The basic
> idea is that we ship/replay all transactions where commit happens
> after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> atop snapbuild.c for details. Now, for transactions where prepare is
> before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> including prepare at the commit time.

This might be a dumb question, but: why?

Is this because the effects of the prepared transaction might
otherwise be included neither in the initial synchronization of the
data nor in any subsequently decoded transaction, thus leaving the
replica out of sync?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Fri, Feb 12, 2021 at 1:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > to explain the exact case to you which is not very apparent. The basic
> > idea is that we ship/replay all transactions where commit happens
> > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> > atop snapbuild.c for details. Now, for transactions where prepare is
> > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> > after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> > including prepare at the commit time.
>
> This might be a dumb question, but: why?
>
> Is this because the effects of the prepared transaction might
> otherwise be included neither in the initial synchronization of the
> data nor in any subsequently decoded transaction, thus leaving the
> replica out of sync?
>

Yes.

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Andres Freund
Date:
Hi,

On 2021-02-10 08:02:17 +0530, Amit Kapila wrote:
> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I think similar happens without any of the work done in PG-14 as well
> > > if we restart the apply worker before the commit completes on the
> > > subscriber. After the restart, we will send the start_decoding_at
> > > point based on some previous commit which will make publisher send the
> > > entire transaction again. I don't think restart of WAL sender or WAL
> > > receiver is such a common thing. It can only happen due to some bug in
> > > code or user wishes to stop the nodes or some crash happened.
> >
> > Really? My impression is that the logical replication protocol is
> > supposed to be designed in such a way that once a transaction is
> > successfully confirmed, it won't be sent again. Now if something is
> > not confirmed then it has to be sent again. But if it is confirmed
> > then it shouldn't happen.

Correct.


> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.

That's not quite right. Yes, the flush location isn't guaranteed to be
updated at that point, but a replication client will send the last
location they've received and successfully processed, and that has to
*guarantee* that they won't receive anything twice, or miss
something. Otherwise you've broken the protocol.

Greetings,

Andres Freund



Re: repeated decoding of prepared transactions

From
Petr Jelinek
Date:
>
> On 13 Feb 2021, at 17:32, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-02-10 08:02:17 +0530, Amit Kapila wrote:
>> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>> If by successfully confirmed, you mean that once the subscriber node
>> has received, it won't be sent again then as far as I know that is not
>> true. We rely on the flush location sent by the subscriber to advance
>> the decoding locations. We update the flush locations after we apply
>> the transaction's commit successfully. Also, after the restart, we use
>> the replication origin's last flush location as a point from where we
>> need the transactions and the origin's progress is updated at commit
>> time.
>
> That's not quite right. Yes, the flush location isn't guaranteed to be
> updated at that point, but a replication client will send the last
> location they've received and successfully processed, and that has to
> *guarantee* that they won't receive anything twice, or miss
> something. Otherwise you've broken the protocol.
>

Agreed, if we relied purely on the flush location of a slot, there would
be no need for origins to track the lsn. AFAIK this is exactly why
origins are WAL-logged along with the transaction; it allows us to
guarantee never getting anything that has been durably written.

—
Petr


Re: repeated decoding of prepared transactions

From
Andres Freund
Date:
Hi,

On 2021-02-13 17:37:29 +0100, Petr Jelinek wrote:
> Agreed, if we relied purely on flush location of a slot, there would
> be no need for origins to track the lsn.

And we would be latency bound replicating transactions, which'd not be
fun for single-insert ones for example...


> AFAIK this is exactly why origins are WAL-logged along with the
> transaction, it allows us to guarantee never getting anything that has
> been durably written.

I think you'd need something like origins in that case, because
something could still go wrong before the other side has received the
flush (network disconnect, primary crash, ...).

Greetings,

Andres Freund



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Sat, Feb 13, 2021 at 10:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-02-13 17:37:29 +0100, Petr Jelinek wrote:
>
> > AFAIK this is exactly why origins are WAL-logged along with the
> > transaction, it allows us to guarantee never getting anything that has
> > been durably written.
>
> I think you'd need something like origins in that case, because
> something could still go wrong before the other side has received the
> flush (network disconnect, primary crash, ...).
>

We are already using origins in apply-worker to guarantee that and
with each commit, the origin's lsn location is also WAL-logged. That
helps us to send the start location for a slot after the restart. As
far as I understand this is how it works from the apply-worker side. I
am not sure if I am missing something here?
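
The mechanism can be modeled roughly like this (an illustrative sketch,
not PostgreSQL code; the class and method names are made up):

```python
# Illustrative model of how a replication origin's progress, recorded
# atomically with each applied commit, gives the restart position.

class Origin:
    def __init__(self):
        self.remote_lsn = 0  # last remote commit durably applied

    def apply_commit(self, remote_commit_lsn):
        # The origin advance is logged in the same WAL record as the
        # commit, so a crash cannot separate "applied" from "remembered".
        self.remote_lsn = remote_commit_lsn

o = Origin()
for lsn in (10, 20, 30):
    o.apply_commit(lsn)

# After a restart, the subscriber asks the publisher to stream from the
# origin's last recorded position rather than the slot's flush location.
assert o.remote_lsn == 30
```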

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Thu, Feb 11, 2021 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
> <markus.wanner@enterprisedb.com> wrote:
> >
> Now, coming back to the restart case where the prepared transaction
> can be sent again by the publisher. I understand yours and others
> point that we should not send a prepared transaction if there is a
> restart between prepare and commit but there are reasons why we have
> done that way and I am open to your suggestions. I'll once again try
> to explain the exact case to you which is not very apparent. The basic
> idea is that we ship/replay all transactions where commit happens
> after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> atop snapbuild.c for details. Now, for transactions where prepare is
> before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> including prepare at the commit time. One might think it is quite easy
> to detect that, basically if we skip prepare when the snapshot state
> was not SNAPBUILD_CONSISTENT, then mark a flag in ReorderBufferTxn and
> use the same to detect during commit and accordingly take the decision
> to send prepare but unfortunately it is not that easy. There is always
> a chance that on restart we reuse the snapshot serialized by some
> other Walsender at a location prior to Prepare and if that happens
> then this time the prepare won't be skipped due to snapshot state
> (SNAPBUILD_CONSISTENT) but due to the start_decoding_at point (considering
> we have already shipped some of the later commits but not prepare).
> Now, this will actually become the same situation where the restart
> has happened after we have sent the prepare but not commit. This is
> the reason we have to resend the prepare when the subscriber restarts
> between prepare and commit.
>

After further thinking on this problem and some off-list discussions
with Ajin, there appears to be another way to solve the above problem
by which we can avoid resending the prepare after restart if it has
already been processed by the subscriber. The main reason why we were
not able to distinguish between the two cases ((a) prepare happened
before SNAPBUILD_CONSISTENT state but commit prepared happened after
we reach SNAPBUILD_CONSISTENT state and (b) prepare is already
decoded, successfully processed by the subscriber and we have
restarted the decoding) is that we can re-use the serialized snapshot
at LSN location prior to Prepare of some concurrent WALSender after
the restart. Now, if we ensure that we don't use serialized snapshots
for decoding via slots where two_phase decoding option is enabled then
we won't have that problem. The drawback is that in some cases it can
take a bit more time for initial snapshot building but maybe that is
better than the current solution.
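
A rough sketch of the proposed rule (illustrative only, not the actual
snapbuild.c logic):

```python
# Sketch of the proposed restriction: a two_phase slot never reuses a
# snapshot serialized by another walsender, so "prepare skipped before
# consistency" can no longer be confused with "prepare already shipped".

def may_restore_serialized_snapshot(two_phase_enabled, snapshot_on_disk):
    if two_phase_enabled:
        # Always rebuild from scratch; reaching SNAPBUILD_CONSISTENT may
        # take longer, which is the stated drawback.
        return False
    return snapshot_on_disk

assert may_restore_serialized_snapshot(True, True) is False
assert may_restore_serialized_snapshot(False, True) is True
```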

Any suggestions?

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Amit Kapila
Date:
On Tue, Feb 16, 2021 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 11, 2021 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
> > <markus.wanner@enterprisedb.com> wrote:
> > >
> > Now, coming back to the restart case where the prepared transaction
> > can be sent again by the publisher. I understand yours and others
> > point that we should not send a prepared transaction if there is a
> > restart between prepare and commit but there are reasons why we have
> > done that way and I am open to your suggestions. I'll once again try
> > to explain the exact case to you which is not very apparent. The basic
> > idea is that we ship/replay all transactions where commit happens
> > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> > atop snapbuild.c for details. Now, for transactions where prepare is
> > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> > after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> > including prepare at the commit time. One might think it is quite easy
> > to detect that, basically if we skip prepare when the snapshot state
> > was not SNAPBUILD_CONSISTENT, then mark a flag in ReorderBufferTxn and
> > use the same to detect during commit and accordingly take the decision
> > to send prepare but unfortunately it is not that easy. There is always
> > a chance that on restart we reuse the snapshot serialized by some
> > other Walsender at a location prior to Prepare and if that happens
> > then this time the prepare won't be skipped due to snapshot state
> > (SNAPBUILD_CONSISTENT) but due to the start_decoding_at point (considering
> > we have already shipped some of the later commits but not prepare).
> > Now, this will actually become the same situation where the restart
> > has happened after we have sent the prepare but not commit. This is
> > the reason we have to resend the prepare when the subscriber restarts
> > between prepare and commit.
> >
>
> After further thinking on this problem and some off-list discussions
> with Ajin, there appears to be another way to solve the above problem
> by which we can avoid resending the prepare after restart if it has
> already been processed by the subscriber. The main reason why we were
> not able to distinguish between the two cases ((a) prepare happened
> before SNAPBUILD_CONSISTENT state but commit prepared happened after
> we reach SNAPBUILD_CONSISTENT state and (b) prepare is already
> decoded, successfully processed by the subscriber and we have
> restarted the decoding) is that we can re-use the serialized snapshot
> at LSN location prior to Prepare of some concurrent WALSender after
> the restart. Now, if we ensure that we don't use serialized snapshots
> for decoding via slots where two_phase decoding option is enabled then
> we won't have that problem. The drawback is that in some cases it can
> take a bit more time for initial snapshot building but maybe that is
> better than the current solution.
>

I see another thing we need to address if we use the above solution.
The issue: suppose the two-pc option for the subscription is initially
off and we skip the prepare because of that; then some unrelated commit
happens, which allows the start_decoding_at point to move ahead; and
then the user enables the two-pc option for the subscription. We will
again skip the prepare because it is behind the start_decoding_at
point, which becomes the same case where the prepare seems to have
already been sent. So, in such a situation, with the above solution, we
will miss sending the prepared transaction and its data and hence risk
making the replica out-of-sync. Now, this can be avoided if we don't
allow users to alter the two-pc option once the subscription is
created. I am not sure, but maybe for the first version of this feature
that might be okay, and we can improve it later if we have better
ideas. This will definitely allow us to avoid checks in the plugins
and/or apply-worker, which seems like a good trade-off, and it will
address the concern most people have raised in this thread.
Any thoughts?
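
The hazard can be shown with the same kind of illustrative model
(made-up LSNs, not PostgreSQL code): a prepare skipped while the option
was off ends up behind start_decoding_at, and enabling the option later
cannot bring it back:

```python
# A prepare is decoded only if the two-phase option is on and the
# prepare lies at or after start_decoding_at.

def decodes_prepare(prepare_lsn, start_decoding_at, two_phase_enabled):
    return two_phase_enabled and prepare_lsn >= start_decoding_at

# Option off at prepare time: the prepare is skipped.
assert decodes_prepare(100, 90, False) is False

# An unrelated commit at LSN 120 then advances start_decoding_at past
# 100. Enabling two-pc afterwards still skips the prepare, because it
# now looks as if it had already been shipped: it is silently lost.
assert decodes_prepare(100, 121, True) is False
```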

-- 
With Regards,
Amit Kapila.



Re: repeated decoding of prepared transactions

From
Ajin Cherian
Date:
On Tue, Feb 16, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

>
> After further thinking on this problem and some off-list discussions
> with Ajin, there appears to be another way to solve the above problem
> by which we can avoid resending the prepare after restart if it has
> already been processed by the subscriber. The main reason why we were
> not able to distinguish between the two cases ((a) prepare happened
> before SNAPBUILD_CONSISTENT state but commit prepared happened after
> we reach SNAPBUILD_CONSISTENT state and (b) prepare is already
> decoded, successfully processed by the subscriber and we have
> restarted the decoding) is that we can re-use the serialized snapshot
> at LSN location prior to Prepare of some concurrent WALSender after
> the restart. Now, if we ensure that we don't use serialized snapshots
> for decoding via slots where two_phase decoding option is enabled then
> we won't have that problem. The drawback is that in some cases it can
> take a bit more time for initial snapshot building but maybe that is
> better than the current solution.

Based on this suggestion, I have created a patch on HEAD which now
does not allow repeated decoding
of prepared transactions. For this, the code now enforces
full_snapshot if two-phase decoding is enabled.
Do have a look at the patch and see if you have any comments.

Currently, one problem with this, as you also mentioned in your last
mail, is that if two-phase is initially disabled in test_decoding while
decoding the prepare (causing the prepared transaction to not be
decoded) and later enabled after the commit prepared (at which point
decoding assumes the transaction was decoded at prepare time), then the
transaction is not decoded at all. For example:

postgres=# begin;
BEGIN
postgres=*# INSERT INTO do_write DEFAULT VALUES;
INSERT 0 1
postgres=*# PREPARE TRANSACTION 'test1';
PREPARE TRANSACTION
postgres=# SELECT data  FROM
pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit',
'0');
data
------
(0 rows)
postgres=# commit prepared 'test1';
COMMIT PREPARED
postgres=# SELECT data  FROM
pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit',
'1');
         data
-------------------------
 COMMIT PREPARED 'test1'
(1 row)

1st pg_logical_slot_get_changes is called with two-phase-commit off,
2nd is called with two-phase-commit on. You can see that the
transaction is not decoded at all.
For this, I am planning to change the semantics such that
two-phase-commit can only be specified while creating the slot using
pg_create_logical_replication_slot()
and not in pg_logical_slot_get_changes, thus preventing
the two-phase-commit flag from being toggled between restarts of the
decoder. Let me know if anybody objects to this
change, else I will update that in the next patch.


regards,
Ajin Cherian
Fujitsu Australia

Attachment

Re: repeated decoding of prepared transactions

From
Markus Wanner
Date:
Ajin, Amit,

thank you both a lot for thinking this through and even providing a patch.

The changes in expectation for twophase.out match exactly what I 
prepared.  And the switch with pg_logical_slot_get_changes is indeed 
something I had not yet considered, either.

On 19.02.21 03:50, Ajin Cherian wrote:
> For this, I am planning to change the semantics such that
> two-phase-commit can only be specified while creating the slot using
> pg_create_logical_replication_slot()
> and not in pg_logical_slot_get_changes, thus preventing
> the two-phase-commit flag from being toggled between restarts of the
> decoder. Let me know if anybody objects to this
> change, else I will update that in the next patch.

This sounds like a good plan to me, yes.


However, more generally speaking, I suspect you are overthinking this. 
All of the complexity arises because of the assumption that an output 
plugin receiving and confirming a PREPARE may not be able to persist 
that first phase of transaction application.  Instead, you are trying to 
somehow resurrect the transactional changes and the prepare at COMMIT 
PREPARED time and decode it in a deferred way.

Instead, I'm arguing that a PREPARE is an atomic operation just like a 
transaction's COMMIT.  The decoder should always feed these in the order 
of appearance in the WAL.  For example, if you have PREPARE A, COMMIT B, 
COMMIT PREPARED A in the WAL, the decoder should always output these 
events in exactly that order.  And not ever COMMIT B, PREPARE A, COMMIT 
PREPARED A (which is currently violated in the expectation for 
twophase_snapshot, because the COMMIT for `s1insert` there appears after 
the PREPARE of `s2p` in the WAL, but gets decoded before it).

The patch I'm attaching corrects this expectation in twophase_snapshot, 
adds an explanatory diagram, and eliminates any danger of sending 
PREPAREs at COMMIT PREPARED time.  Thereby preserving the ordering of 
PREPAREs vs COMMITs.

Given the output plugin supports two-phase commit, I argue there must be 
a good reason for it setting the start_decoding_at LSN to a point in 
time after a PREPARE.  To me that means the output plugin (or its 
downstream replica) has processed the PREPARE (and the downstream 
replica did whatever it needed to do on its side in order to make the 
transaction ready to be committed in a second phase).

(In the weird case of an output plugin that wants to enable two-phase 
commit but does not really support it downstream, it's still possible 
for it to hold back LSN confirmations for prepared-but-still-in-flight 
transactions.  However, I'm having a hard time justifying this use case.)

With that line of thinking, the point in time (or in WAL) of the COMMIT 
PREPARED does not matter at all to reason about the decoding of the 
PREPARE operation.  Instead, there are only exactly two cases to consider:

a) the PREPARE happened before the start_decoding_at LSN and must not be 
decoded. (But the effects of the PREPARE must then be included in the 
initial synchronization. If that's not supported, the output plugin 
should not enable two-phase commit.)

b) the PREPARE happens after the start_decoding_at LSN and must be 
decoded.  (It obviously is not included in the initial synchronization 
or decoded by a previous instance of the decoder process.)

The case where the PREPARE lies before SNAPBUILD_CONSISTENT must always 
be case a) where we must not repeat the PREPARE, anyway.  And in case b) 
where we need a consistent snapshot to decode the PREPARE, existing 
provisions already guarantee that to be possible (or how would this be 
different from a regular single-phase commit?).
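
For illustration, the rule I am arguing for reduces to a single check
against start_decoding_at, with events always emitted in WAL order (a
toy model, not decoder code):

```python
# Toy model: emit every event at or after start_decoding_at, in WAL order.

def decode_stream(wal_events, start_decoding_at):
    return [(kind, gid) for lsn, kind, gid in wal_events
            if lsn >= start_decoding_at]

wal = [(10, "PREPARE", "A"),
       (20, "COMMIT", "B"),
       (30, "COMMIT PREPARED", "A")]

# Fresh decode: PREPARE A precedes COMMIT B, matching WAL order.
assert decode_stream(wal, 0) == [("PREPARE", "A"), ("COMMIT", "B"),
                                 ("COMMIT PREPARED", "A")]

# Restart after the PREPARE was confirmed (case a): it is not repeated.
assert decode_stream(wal, 15) == [("COMMIT", "B"),
                                  ("COMMIT PREPARED", "A")]
```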

Please let me know what you think and whether this approach is feasible 
for you as well.

Regards

Markus

Attachment