Thread: repeated decoding of prepared transactions
Amit, Ajin, hackers,

testing logical decoding for two-phase transactions, I stumbled over
what I first thought was a bug. But comments seem to indicate this is
intended behavior. Could you please clarify or elaborate on the design
decision? Or confirm this indeed is a bug?

What puzzled me is that if a decoder is restarted in between the PREPARE
and the COMMIT PREPARED, it repeats the entire transaction, despite it
being already sent and potentially prepared on the receiving side.

In terms of `pg_logical_slot_get_changes` (and roughly from the
prepare.sql test), this looks as follows:

                        data
 ----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
 (3 rows)

This is the first delivery of the transaction. After a restart, it will
get all of the changes again, though:

                        data
 ----------------------------------------------------
  BEGIN
  table public.test_prepared1: INSERT: id[integer]:1
  PREPARE TRANSACTION 'test_prepared#1'
  COMMIT PREPARED 'test_prepared#1'
 (4 rows)

I did not expect this, as any receiver that wants decoded 2PC is likely
to support some kind of two-phase commit itself, and would therefore
prepare the transaction upon its first reception. Potentially receiving
it a second time would require complicated filtering on every prepared
transaction.

Furthermore, this unnecessarily holds back the restart LSN: even a
single prepared transaction can block advancing it. In most cases,
prepared transactions are short-lived, but there may be an arbitrary
number of other transactions in between a PREPARE and the corresponding
COMMIT PREPARED in the WAL. Not being able to advance over a prepared
transaction seems like a bad thing in such a case.

I fail to see where this repetition would ever be useful. Is there any
reason for the current implementation that I'm missing, or can this be
corrected?

Thanks for elaborating.

Regards

Markus
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> Amit, Ajin, hackers,
>
> testing logical decoding for two-phase transactions, I stumbled over
> what I first thought was a bug. But comments seem to indicate this is
> intended behavior. Could you please clarify or elaborate on the design
> decision? Or confirm this indeed is a bug?
>
> What puzzled me is that if a decoder is restarted in between the PREPARE
> and the COMMIT PREPARED, it repeats the entire transaction, despite it
> being already sent and potentially prepared on the receiving side.
>
> In terms of `pg_logical_slot_get_changes` (and roughly from the
> prepare.sql test), this looks as follows:
>
>                         data
>  ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
>  (3 rows)
>
> This is the first delivery of the transaction. After a restart, it will
> get all of the changes again, though:
>
>                         data
>  ----------------------------------------------------
>   BEGIN
>   table public.test_prepared1: INSERT: id[integer]:1
>   PREPARE TRANSACTION 'test_prepared#1'
>   COMMIT PREPARED 'test_prepared#1'
>  (4 rows)
>
> I did not expect this, as any receiver that wants decoded 2PC is likely
> to support some kind of two-phase commit itself, and would therefore
> prepare the transaction upon its first reception. Potentially receiving
> it a second time would require complicated filtering on every prepared
> transaction.
>

The reason was mentioned in ReorderBufferFinishPrepared(). See the
comments in the code:

/*
 * It is possible that this transaction is not decoded at prepare time
 * either because by that time we didn't have a consistent snapshot or it
 * was decoded earlier but we have restarted. We can't distinguish between
 * those two cases so we send the prepare in both the cases and let
 * downstream decide whether to process or skip it. We don't need to
 * decode the xact for aborts if it is not done already.
 */

This won't happen when we replicate via pgoutput (the patch for which
is still not committed) because it won't restart from a previous point
(unless the server needs to be restarted for some reason) as you are
doing via the logical decoding APIs. Note that we don't send the
prepared xacts again on repeated calls of pg_logical_slot_get_changes()
unless we encounter the commit. This behavior is already explained in
the docs [1] (see "Transaction Begin Prepare Callback"), along with a
way to skip the prepare.

> Furthermore, this unnecessarily holds back the restart LSN: even a
> single prepared transaction can block advancing it. In most cases,
> prepared transactions are short-lived, but there may be an arbitrary
> number of other transactions in between a PREPARE and the corresponding
> COMMIT PREPARED in the WAL. Not being able to advance over a prepared
> transaction seems like a bad thing in such a case.
>

That anyway is true without this work as well, where restart_lsn can
only be advanced on commits. We haven't changed anything in that
regard.

[1] - https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin.html

--
With Regards,
Amit Kapila.
Hello Amit,

thanks for your very quick response.

On 08.02.21 11:13, Amit Kapila wrote:
> /*
>  * It is possible that this transaction is not decoded at prepare time
>  * either because by that time we didn't have a consistent snapshot or it
>  * was decoded earlier but we have restarted. We can't distinguish between
>  * those two cases so we send the prepare in both the cases and let
>  * downstream decide whether to process or skip it. We don't need to
>  * decode the xact for aborts if it is not done already.
>  */

The way I read the surrounding code, the only case a 2PC transaction
does not get decoded at prepare time is if the transaction is empty. Or
are you aware of any other situation that might currently happen?

> (unless the server needs to be restarted due to some reason)

Right, the repetition occurs only after a restart of the walsender in
between a prepare and a commit prepared record.

> That anyway is true without this work as well where restart_lsn can be
> advanced on commits. We haven't changed anything in that regard.

I did not mean to blame the patch, but merely tried to understand some
of the design decisions behind it.

And as I just learned, even if we managed to avoid the repetition, a
restarted walsender still needs to see prepared transactions as
in-progress in its snapshots. So we cannot move the restart_lsn forward
past a prepare record (until the final commit or rollback is consumed).

Best Regards

Markus
On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> Hello Amit,
>
> thanks for your very quick response.
>
> On 08.02.21 11:13, Amit Kapila wrote:
> > /*
> >  * It is possible that this transaction is not decoded at prepare time
> >  * either because by that time we didn't have a consistent snapshot or it
> >  * was decoded earlier but we have restarted. We can't distinguish between
> >  * those two cases so we send the prepare in both the cases and let
> >  * downstream decide whether to process or skip it. We don't need to
> >  * decode the xact for aborts if it is not done already.
> >  */
>
> The way I read the surrounding code, the only case a 2PC transaction
> does not get decoded at prepare time is if the transaction is empty. Or
> are you aware of any other situation that might currently happen?
>

We also skip decoding at prepare time if we haven't reached a
consistent snapshot by that time. See this code in DecodePrepare():

DecodePrepare()
{
..
    /* We can't start streaming unless a consistent state is reached. */
    if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
    {
        ReorderBufferSkipPrepare(ctx->reorder, xid);
        return;
    }
..
}

There are other reasons as well, like the output plugin not allowing
decoding at prepare time, but I don't think those are relevant to the
discussion here.

> > (unless the server needs to be restarted due to some reason)
>
> Right, the repetition occurs only after a restart of the walsender in
> between a prepare and a commit prepared record.
>
> > That anyway is true without this work as well where restart_lsn can be
> > advanced on commits. We haven't changed anything in that regard.
>
> I did not mean to blame the patch, but merely tried to understand some
> of the design decisions behind it.
>
> And as I just learned, even if we managed to avoid the repetition, a
> restarted walsender still needs to see prepared transactions as
> in-progress in its snapshots. So we cannot move the restart_lsn forward
> past a prepare record (until the final commit or rollback is consumed).
>

Right, and say we forget the prepared transactions and move the
restart_lsn forward once we get the prepare for any transaction. Then
we will open up a window where we haven't actually sent the prepared
xact (because, say, the snapshot had not yet reached a consistent
state) but we have moved the restart_lsn. Later, when we get the commit
corresponding to the prepared transaction, by which time the snapshot
may have reached a consistent state, we will miss sending the
transaction's contents and its prepare. For such reasons we allow
restart_lsn to be moved only once the transaction is finished
(committed or rolled back).

--
With Regards,
Amit Kapila.
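The constraint Amit describes can be illustrated with a small model. This is a sketch only; `computed_restart_lsn` and its inputs are illustrative names, not PostgreSQL internals. The point is that restart_lsn cannot move past the start of the oldest transaction still unfinished from the decoder's point of view, which includes PREPAREd transactions until their COMMIT/ROLLBACK PREPARED is decoded:

```python
# Toy model of how an unfinished (e.g. prepared-but-not-committed)
# transaction pins the restart LSN.  Illustrative names only, not
# PostgreSQL internals.

def computed_restart_lsn(confirmed_flush_lsn, open_txns):
    """open_txns maps xid -> start LSN of transactions still in
    progress from the decoder's point of view; a PREPAREd transaction
    stays here until COMMIT/ROLLBACK PREPARED is decoded."""
    if not open_txns:
        return confirmed_flush_lsn
    # Decoding may have to re-read WAL from the start of the oldest
    # open transaction, so restart_lsn cannot move past it.
    return min(confirmed_flush_lsn, min(open_txns.values()))

# A prepared transaction that started at LSN 100 holds restart_lsn
# back even though everything up to LSN 500 has been confirmed.
assert computed_restart_lsn(500, {7: 100}) == 100
# Once the transaction finishes and is removed, restart_lsn advances.
assert computed_restart_lsn(500, {}) == 500
```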
On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner > <markus.wanner@enterprisedb.com> wrote: > > > > Hello Amit, > > > > thanks for your very quick response. > > > > On 08.02.21 11:13, Amit Kapila wrote: > > > /* > > > * It is possible that this transaction is not decoded at prepare time > > > * either because by that time we didn't have a consistent snapshot or it > > > * was decoded earlier but we have restarted. We can't distinguish between > > > * those two cases so we send the prepare in both the cases and let > > > * downstream decide whether to process or skip it. We don't need to > > > * decode the xact for aborts if it is not done already. > > > */ > > > > The way I read the surrounding code, the only case a 2PC transaction > > does not get decoded a prepare time is if the transaction is empty. Or > > are you aware of any other situation that might currently happen? > > > > We also skip decoding at prepare time if we haven't reached a > consistent snapshot by that time. See below code in DecodePrepare(). > DecodePrepare() > { > .. > /* We can't start streaming unless a consistent state is reached. */ > if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT) > { > ReorderBufferSkipPrepare(ctx->reorder, xid); > return; > } > .. > } Can you please provide steps which can lead to this situation? If there is an earlier discussion which has example scenarios, please point us to the relevant thread. If we are not sending PREPARED transactions that's fine, but sending the same prepared transaction as many times as the WAL sender is restarted between sending prepare and commit prepared is a waste of network bandwidth. The wastage is proportional to the amount of changes in the transaction and number of such transactions themselves. Also this will cause performance degradation. So if we can avoid resending prepared transactions twice that will help. -- Best Wishes, Ashutosh Bapat
On Tue, Feb 9, 2021 at 4:59 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Can you please provide steps which can lead to this situation? If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>
> If we are not sending PREPARED transactions that's fine, but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth. The wastage is proportional to the amount of
> changes in the transaction and number of such transactions themselves.
> Also this will cause performance degradation. So if we can avoid
> resending prepared transactions twice that will help.

One such scenario is exercised by the test case in
contrib/test_decoding/specs/twophase_snapshot.spec.

regards,
Ajin Cherian
Fujitsu Australia
On Tue, Feb 9, 2021 at 11:29 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner > > <markus.wanner@enterprisedb.com> wrote: > > > > > > Hello Amit, > > > > > > thanks for your very quick response. > > > > > > On 08.02.21 11:13, Amit Kapila wrote: > > > > /* > > > > * It is possible that this transaction is not decoded at prepare time > > > > * either because by that time we didn't have a consistent snapshot or it > > > > * was decoded earlier but we have restarted. We can't distinguish between > > > > * those two cases so we send the prepare in both the cases and let > > > > * downstream decide whether to process or skip it. We don't need to > > > > * decode the xact for aborts if it is not done already. > > > > */ > > > > > > The way I read the surrounding code, the only case a 2PC transaction > > > does not get decoded a prepare time is if the transaction is empty. Or > > > are you aware of any other situation that might currently happen? > > > > > > > We also skip decoding at prepare time if we haven't reached a > > consistent snapshot by that time. See below code in DecodePrepare(). > > DecodePrepare() > > { > > .. > > /* We can't start streaming unless a consistent state is reached. */ > > if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT) > > { > > ReorderBufferSkipPrepare(ctx->reorder, xid); > > return; > > } > > .. > > } > > Can you please provide steps which can lead to this situation? > Ajin has already shared the example with you. > If > there is an earlier discussion which has example scenarios, please > point us to the relevant thread. > It started in the email [1] and from there you can read later emails to know more about this. 
> If we are not sending PREPARED transactions that's fine,
>

Hmm, I am not sure if that is fine, because if the output plugin sets
the two-phase-commit option, it would expect all prepared xacts to
arrive, not only some of them.

> but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth.
>

I think similar happens without any of the work done in PG-14 as well
if we restart the apply worker before the commit completes on the
subscriber. After the restart, we will send the start_decoding_at
point based on some previous commit which will make publisher send the
entire transaction again. I don't think restart of WAL sender or WAL
receiver is such a common thing. It can only happen due to some bug in
code or user wishes to stop the nodes or some crash happened.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2Bd3gzCyzsYjt1m6sfGf_C_uFmo9JK%3D3Wafp6yR8Mg8uQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think similar happens without any of the work done in PG-14 as well > if we restart the apply worker before the commit completes on the > subscriber. After the restart, we will send the start_decoding_at > point based on some previous commit which will make publisher send the > entire transaction again. I don't think restart of WAL sender or WAL > receiver is such a common thing. It can only happen due to some bug in > code or user wishes to stop the nodes or some crash happened. Really? My impression is that the logical replication protocol is supposed to be designed in such a way that once a transaction is successfully confirmed, it won't be sent again. Now if something is not confirmed then it has to be sent again. But if it is confirmed then it shouldn't happen. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think similar happens without any of the work done in PG-14 as well > > if we restart the apply worker before the commit completes on the > > subscriber. After the restart, we will send the start_decoding_at > > point based on some previous commit which will make publisher send the > > entire transaction again. I don't think restart of WAL sender or WAL > > receiver is such a common thing. It can only happen due to some bug in > > code or user wishes to stop the nodes or some crash happened. > > Really? My impression is that the logical replication protocol is > supposed to be designed in such a way that once a transaction is > successfully confirmed, it won't be sent again. Now if something is > not confirmed then it has to be sent again. But if it is confirmed > then it shouldn't happen. > If by successfully confirmed, you mean that once the subscriber node has received, it won't be sent again then as far as I know that is not true. We rely on the flush location sent by the subscriber to advance the decoding locations. We update the flush locations after we apply the transaction's commit successfully. Also, after the restart, we use the replication origin's last flush location as a point from where we need the transactions and the origin's progress is updated at commit time. OTOH, If by successfully confirmed, you mean that once the subscriber has applied the complete transaction (including commit), then you are right that it won't be sent again. -- With Regards, Amit Kapila.
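The confirmation machinery Amit describes can be modelled roughly as follows (a sketch with made-up names, not the actual walsender/apply-worker code): because the replication origin's progress is only advanced at commit time, a restart between receipt and commit causes the whole transaction to be streamed again:

```python
# Toy model of commit-time confirmation: the subscriber advances its
# replication origin only when a transaction's commit is applied, so
# anything received but not yet committed is streamed again after a
# restart.  Names are illustrative.

def transactions_resent_after_restart(txns, origin_progress_lsn):
    """txns: list of (commit_lsn, name).  After a restart the
    publisher decodes from the origin's last reported progress and
    resends every transaction committing after that point."""
    return [name for commit_lsn, name in txns
            if commit_lsn > origin_progress_lsn]

# The subscriber applied and confirmed up to commit LSN 200.  It had
# already *received* 'B' (commit at 300) but restarted before applying
# its commit, so 'B' is sent again in full.
txns = [(200, "A"), (300, "B")]
assert transactions_resent_after_restart(txns, 200) == ["B"]
assert transactions_resent_after_restart(txns, 300) == []
```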
On Wed, Feb 10, 2021 at 8:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think similar happens without any of the work done in PG-14 as well
> > if we restart the apply worker before the commit completes on the
> > subscriber. After the restart, we will send the start_decoding_at
> > point based on some previous commit which will make publisher send the
> > entire transaction again. I don't think restart of WAL sender or WAL
> > receiver is such a common thing. It can only happen due to some bug in
> > code or user wishes to stop the nodes or some crash happened.
>
> Really? My impression is that the logical replication protocol is
> supposed to be designed in such a way that once a transaction is
> successfully confirmed, it won't be sent again. Now if something is
> not confirmed then it has to be sent again. But if it is confirmed
> then it shouldn't happen.
>
If by successfully confirmed, you mean that once the subscriber node
has received, it won't be sent again then as far as I know that is not
true. We rely on the flush location sent by the subscriber to advance
the decoding locations. We update the flush locations after we apply
the transaction's commit successfully. Also, after the restart, we use
the replication origin's last flush location as a point from where we
need the transactions and the origin's progress is updated at commit
time.
OTOH, If by successfully confirmed, you mean that once the subscriber
has applied the complete transaction (including commit), then you are
right that it won't be sent again.
I think we need to treat a prepared transaction slightly differently
from an uncommitted transaction when sending it downstream. We need to
send a whole uncommitted transaction downstream again because
previously applied changes must have been aborted and hence lost by
the downstream, and thus it needs to get all of those again. But when
a downstream prepares a transaction, even if it's not committed, those
changes are not lost even after a restart of a walsender. If the
downstream confirms that it has "flushed" the PREPARE, there is no
need to send all the changes again.
--
Best Wishes,
Ashutosh
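Ashutosh's suggestion amounts to treating a durably flushed PREPARE as a confirmation point of its own. A minimal sketch of that hypothetical protocol behavior (not what PostgreSQL currently implements):

```python
# Sketch of the behavior argued for above: if the downstream has
# durably flushed a PREPARE, the upstream skips resending the
# transaction body after a restart and delivers only what is still
# owed.  Hypothetical protocol, not what PostgreSQL implements.

def messages_after_restart(gid, downstream_flushed_prepares):
    if gid in downstream_flushed_prepares:
        # The changes and the PREPARE survived the restart on the
        # downstream, so only the final COMMIT PREPARED is needed.
        return ["COMMIT PREPARED"]
    # Nothing confirmed for this GID: the whole transaction is owed.
    return ["BEGIN", "changes", "PREPARE", "COMMIT PREPARED"]

assert messages_after_restart("tx#1", {"tx#1"}) == ["COMMIT PREPARED"]
assert messages_after_restart("tx#2", set()) == [
    "BEGIN", "changes", "PREPARE", "COMMIT PREPARED"]
```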
On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I think we need to treat a prepared transaction slightly differently
> from an uncommitted transaction when sending it downstream. We need to
> send a whole uncommitted transaction downstream again because
> previously applied changes must have been aborted and hence lost by
> the downstream, and thus it needs to get all of those again. But when
> a downstream prepares a transaction, even if it's not committed, those
> changes are not lost even after a restart of a walsender. If the
> downstream confirms that it has "flushed" the PREPARE, there is no
> need to send all the changes again.

But the other side of the problem is that, without this, if the
prepared transaction is prior to a consistent snapshot when decoding
starts/restarts, then only the "commit prepared" is sent to downstream
(as seen in the test scenario I shared above), and downstream has to
error away the commit prepared because it does not have the
corresponding prepared transaction.

We did not find an easy way to distinguish between these two scenarios
for prepared transactions:

a. A consistent snapshot being formed in between a prepare and a
commit prepared for the first time.
b. The decoder restarting between a prepare and a commit prepared.

For plugins to be able to handle this, we have added a special callback
"Begin Prepare", as explained in [1], section 49.6.4.10:

"The required begin_prepare_cb callback is called whenever the start of
a prepared transaction has been decoded. The gid field, which is part
of the txn parameter can be used in this callback to check if the
plugin has already received this prepare in which case it can skip the
remaining changes of the transaction. This can only happen if the user
restarts the decoding after receiving the prepare for a transaction but
before receiving the commit prepared say because of some error."
The pgoutput plugin is also being updated to be able to handle this situation of duplicate prepared transactions. regards, Ajin Cherian Fujitsu Australia
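The skip logic described for the begin_prepare callback boils down to a GID lookup on the downstream. A rough sketch, with illustrative names (a real plugin or apply worker would consult its own durable two-phase state rather than an in-memory set):

```python
# Downstream-side filtering of a duplicated PREPARE, keyed on the GID,
# in the spirit of the begin_prepare callback described above.  The
# class and method names are illustrative, not an actual plugin API.

class Downstream:
    def __init__(self):
        self.prepared_gids = set()   # stands in for durable 2PC state
        self.applied = []

    def begin_prepare(self, gid):
        # Returning False means: skip the remaining changes of this
        # transaction, since we already hold it prepared from before a
        # decoder restart.
        if gid in self.prepared_gids:
            return False
        self.prepared_gids.add(gid)
        return True

    def receive(self, gid, changes):
        if self.begin_prepare(gid):
            self.applied.extend(changes)

d = Downstream()
d.receive("test_prepared#1", ["INSERT id=1"])
d.receive("test_prepared#1", ["INSERT id=1"])  # resent after restart
assert d.applied == ["INSERT id=1"]  # applied exactly once
```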
On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
> >
> > I think we need to treat a prepared transaction slightly differently
> > from an uncommitted transaction when sending it downstream. We need to
> > send a whole uncommitted transaction downstream again because
> > previously applied changes must have been aborted and hence lost by
> > the downstream, and thus it needs to get all of those again. But when
> > a downstream prepares a transaction, even if it's not committed, those
> > changes are not lost even after a restart of a walsender. If the
> > downstream confirms that it has "flushed" the PREPARE, there is no
> > need to send all the changes again.
>
> But the other side of the problem is that, without this, if the
> prepared transaction is prior to a consistent snapshot when decoding
> starts/restarts, then only the "commit prepared" is sent to downstream
> (as seen in the test scenario I shared above), and downstream has to
> error away the commit prepared because it does not have the
> corresponding prepared transaction.
>

I think it is not only simple error handling, it is required for data
consistency. We need to send the transactions whose commits are
encountered after a consistent snapshot is reached.

--
With Regards,
Amit Kapila.
On 10.02.21 07:32, Amit Kapila wrote: > On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <itsajin@gmail.com> wrote: >> But the other side of the problem is that ,without this, if the >> prepared transaction is prior to a consistent snapshot when decoding >> starts/restarts, then only the "commit prepared" is sent to downstream >> (as seen in the test scenario I shared above), and downstream has to >> error away the commit prepared because it does not have the >> corresponding prepared transaction. > > I think it is not only simple error handling, it is required for > data-consistency. We need to send the transactions whose commits are > encountered after a consistent snapshot is reached. I'm with Ashutosh here. If a replica is properly in sync, it knows about prepared transactions and all the gids of those. Sending the transactional changes and the prepare again is inconsistent. The point of a two-phase transaction is to have two phases. An output plugin must have the chance of treating them as independent events. Once a PREPARE is confirmed, it must not be sent again. Even if the transaction is still in-progress and its changes are not yet visible on the origin node. Regards Markus
On Wed, Feb 10, 2021 at 1:40 PM Markus Wanner <markus.wanner@enterprisedb.com> wrote: > > On 10.02.21 07:32, Amit Kapila wrote: > > On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian <itsajin@gmail.com> wrote: > >> But the other side of the problem is that ,without this, if the > >> prepared transaction is prior to a consistent snapshot when decoding > >> starts/restarts, then only the "commit prepared" is sent to downstream > >> (as seen in the test scenario I shared above), and downstream has to > >> error away the commit prepared because it does not have the > >> corresponding prepared transaction. > > > > I think it is not only simple error handling, it is required for > > data-consistency. We need to send the transactions whose commits are > > encountered after a consistent snapshot is reached. > > I'm with Ashutosh here. If a replica is properly in sync, it knows > about prepared transactions and all the gids of those. Sending the > transactional changes and the prepare again is inconsistent. > > The point of a two-phase transaction is to have two phases. An output > plugin must have the chance of treating them as independent events. > I am not sure I understand what problem you are facing to deal with this in the output plugin, it is explained in docs and Ajin also pointed out the same. Ajin and I have explained to you the design constraints on the publisher-side due to which we have done this way. Do you have any better ideas to deal with this? -- With Regards, Amit Kapila.
On Tue, Feb 9, 2021 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > If by successfully confirmed, you mean that once the subscriber node > has received, it won't be sent again then as far as I know that is not > true. We rely on the flush location sent by the subscriber to advance > the decoding locations. We update the flush locations after we apply > the transaction's commit successfully. Also, after the restart, we use > the replication origin's last flush location as a point from where we > need the transactions and the origin's progress is updated at commit > time. > > OTOH, If by successfully confirmed, you mean that once the subscriber > has applied the complete transaction (including commit), then you are > right that it won't be sent again. I meant - once the subscriber has advanced the flush location. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
<markus.wanner@enterprisedb.com> wrote:
>
> I did not expect this, as any receiver that wants decoded 2PC is likely
> to support some kind of two-phase commit itself, and would therefore
> prepare the transaction upon its first reception. Potentially receiving
> it a second time would require complicated filtering on every prepared
> transaction.
>

I would like to bring one other scenario to your notice where you might
want to handle things differently for prepared transactions on the
plugin side. Assume we have multiple publications (for simplicity, say
2) on the publisher, with corresponding subscriptions (say 2, each
corresponding to one publication on the publisher). When a user
performs a transaction on the publisher that involves tables from both
publications, on the subscriber side we do that work via two different
transactions, one corresponding to each subscription. But we need some
way to deal with prepared xacts because they need a GID, and we can't
use the same GID for both subscriptions. Please see the detailed
example and one idea to deal with this in the main thread [1]. It would
be really helpful if you or others working on the plugin side could
share your opinion on it.

Now, coming back to the restart case where the prepared transaction can
be sent again by the publisher. I understand your and others' point
that we should not send a prepared transaction again if there is a
restart between prepare and commit, but there are reasons why we have
done it that way, and I am open to your suggestions. I'll once again
try to explain the exact case to you which is not very apparent. The
basic idea is that we ship/replay all transactions where commit happens
after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
atop snapbuild.c for details.
Now, for transactions where prepare is before snapshot state
SNAPBUILD_CONSISTENT and commit prepared is after SNAPBUILD_CONSISTENT,
we need to send the entire transaction including prepare at the commit
time. One might think it is quite easy to detect that: basically, if we
skip the prepare when the snapshot state was not SNAPBUILD_CONSISTENT,
then mark a flag in ReorderBufferTxn and use it during commit to decide
whether to send the prepare. Unfortunately, it is not that easy. There
is always a chance that on restart we reuse a snapshot serialized by
some other walsender at a location prior to the prepare, and if that
happens then this time the prepare won't be skipped due to snapshot
state (SNAPBUILD_CONSISTENT) but due to the start_decoding_at point
(considering we have already shipped some of the later commits but not
the prepare). Now, this actually becomes the same situation as a
restart after we have sent the prepare but not the commit. This is the
reason we have to resend the prepare when the subscriber restarts
between prepare and commit.

You can reproduce the case where we can't distinguish between the two
situations by using the test case in twophase_snapshot.spec and
additionally starting a separate session via the debugger. The steps in
the test case are as below:

"s2b" "s2txid" "s1init" "s3b" "s3txid" "s2c" "s2b" "s2insert" "s2p"
"s3c" "s1insert" "s1start" "s2cp" "s1start"

Define new steps as:

"s4init" {SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');}
"s4start" {SELECT data FROM
pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit', '1');}

The first thing we need to do is s4init and stop the debugger in
SnapBuildProcessRunningXacts. Now perform the steps from 's2b' till the
first 's1start' in twophase_snapshot.spec. Then continue in the s4
session and perform s4start.
After this, if you debug (or add logs to) the second s1start, you will
notice that we are skipping the prepare not because of an inconsistent
snapshot but because of a forward location in start_decoding_at. If you
don't involve session 4, it will always skip the prepare due to an
inconsistent snapshot state. This involves a debugger, so it is not
easy to write an automated test for it. I have used a somewhat tricky
scenario to explain this, but I am not sure if there was any simpler
way.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BLvkeX%3DB3xon7RcBwD4CVaFSryPj3pTBAALrDxQVPDwA%40mail.gmail.com

--
With Regards,
Amit Kapila.
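The ambiguity Amit walks through can be reduced to a small decision rule (a sketch of the described behavior, not the ReorderBuffer code): at COMMIT PREPARED time the decoder only knows that the prepare was not decoded in this run, not why, so it ships the whole transaction in both cases and leaves deduplication to the downstream:

```python
# Toy reduction of the ambiguity: at COMMIT PREPARED time the decoder
# only knows "the prepare was not decoded in this run", not whether
# that was due to an inconsistent snapshot or due to a restart after
# the prepare had already been shipped.  Illustrative only.

def on_commit_prepared(prepare_decoded_this_run):
    if prepare_decoded_this_run:
        return ["COMMIT PREPARED"]
    # Case (a): the snapshot was not yet consistent at prepare time, or
    # case (b): we restarted after shipping the prepare.  The decoder
    # cannot tell them apart, so it (re)sends the whole transaction
    # and the downstream must deduplicate by GID.
    return ["BEGIN", "changes", "PREPARE", "COMMIT PREPARED"]

assert on_commit_prepared(True) == ["COMMIT PREPARED"]
assert on_commit_prepared(False) == [
    "BEGIN", "changes", "PREPARE", "COMMIT PREPARED"]
```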
Hello Amit, thanks a lot for your extensive explanation and examples, I appreciate this very much. I'll need to think this through and see how we can make this work for us. Best Regards Markus
On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > to explain the exact case to you which is not very apparent. The basic > idea is that we ship/replay all transactions where commit happens > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see > atop snapbuild.c for details. Now, for transactions where prepare is > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is > after SNAPBUILD_CONSISTENT, we need to send the entire transaction > including prepare at the commit time. This might be a dumb question, but: why? Is this because the effects of the prepared transaction might otherwise be included neither in the initial synchronization of the data nor in any subsequently decoded transaction, thus leaving the replica out of sync? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 12, 2021 at 1:10 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Feb 11, 2021 at 5:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > to explain the exact case to you which is not very apparent. The basic > > idea is that we ship/replay all transactions where commit happens > > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see > > atop snapbuild.c for details. Now, for transactions where prepare is > > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is > > after SNAPBUILD_CONSISTENT, we need to send the entire transaction > > including prepare at the commit time. > > This might be a dumb question, but: why? > > Is this because the effects of the prepared transaction might > otherwise be included neither in the initial synchronization of the > data nor in any subsequently decoded transaction, thus leaving the > replica out of sync? > Yes. -- With Regards, Amit Kapila.
Hi, On 2021-02-10 08:02:17 +0530, Amit Kapila wrote: > On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think similar happens without any of the work done in PG-14 as well > > > if we restart the apply worker before the commit completes on the > > > subscriber. After the restart, we will send the start_decoding_at > > > point based on some previous commit which will make publisher send the > > > entire transaction again. I don't think restart of WAL sender or WAL > > > receiver is such a common thing. It can only happen due to some bug in > > > code or user wishes to stop the nodes or some crash happened. > > > > Really? My impression is that the logical replication protocol is > > supposed to be designed in such a way that once a transaction is > > successfully confirmed, it won't be sent again. Now if something is > > not confirmed then it has to be sent again. But if it is confirmed > > then it shouldn't happen. Correct. > If by successfully confirmed, you mean that once the subscriber node > has received, it won't be sent again then as far as I know that is not > true. We rely on the flush location sent by the subscriber to advance > the decoding locations. We update the flush locations after we apply > the transaction's commit successfully. Also, after the restart, we use > the replication origin's last flush location as a point from where we > need the transactions and the origin's progress is updated at commit > time. That's not quite right. Yes, the flush location isn't guaranteed to be updated at that point, but a replication client will send the last location they've received and successfully processed, and that has to *guarantee* that they won't receive anything twice, or miss something. Otherwise you've broken the protocol. Greetings, Andres Freund
> > On 13 Feb 2021, at 17:32, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-02-10 08:02:17 +0530, Amit Kapila wrote: >> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas <robertmhaas@gmail.com> wrote: >>> >> If by successfully confirmed, you mean that once the subscriber node >> has received, it won't be sent again then as far as I know that is not >> true. We rely on the flush location sent by the subscriber to advance >> the decoding locations. We update the flush locations after we apply >> the transaction's commit successfully. Also, after the restart, we use >> the replication origin's last flush location as a point from where we >> need the transactions and the origin's progress is updated at commit >> time. > > That's not quite right. Yes, the flush location isn't guaranteed to be > updated at that point, but a replication client will send the last > location they've received and successfully processed, and that has to > *guarantee* that they won't receive anything twice, or miss > something. Otherwise you've broken the protocol. > Agreed, if we relied purely on the flush location of a slot, there would be no need for origins to track the LSN. AFAIK this is exactly why origins are WAL-logged along with the transaction; it allows us to guarantee never again getting anything that has been durably written. — Petr
Hi, On 2021-02-13 17:37:29 +0100, Petr Jelinek wrote: > Agreed, if we relied purely on flush location of a slot, there would > be no need for origins to track the lsn. And we would be latency bound replicating transactions, which'd not be fun for single-insert ones for example... > AFAIK this is exactly why origins are WAL-logged along with > transaction, it allows us to guarantee never getting anything that has > been durably written. I think you'd need something like origins in that case, because something could still go wrong before the other side has received the flush (network disconnect, primary crash, ...). Greetings, Andres Freund
On Sat, Feb 13, 2021 at 10:23 PM Andres Freund <andres@anarazel.de> wrote: > > On 2021-02-13 17:37:29 +0100, Petr Jelinek wrote: > > > AFAIK this is exactly why origins are Wal logged along with > > transaction, it allows us to guarantee never getting anything that has > > beed durably written. > > I think you'd need something like origins in that case, because > something could still go wrong before the other side has received the > flush (network disconnect, primary crash, ...). > We are already using origins in apply-worker to guarantee that and with each commit, the origin's lsn location is also WAL-logged. That helps us to send the start location for a slot after the restart. As far as I understand this is how it works from the apply-worker side. I am not sure if I am missing something here? -- With Regards, Amit Kapila.
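The origin mechanism discussed above can be summarized in a small toy model. This is assumed semantics for illustration, not actual PostgreSQL code (the real mechanism lives in replorigin and the apply worker): the origin's remote LSN is advanced as part of the same local commit that applies the remote transaction, so after a crash the worker can ask the publisher to resume exactly there.

```python
# Toy model of replication-origin progress tracking (illustration only).

class ReplicationOrigin:
    def __init__(self):
        self.remote_lsn = 0  # durably advanced with each applied commit

    def apply_commit(self, commit_lsn, apply_changes):
        # In the real system both steps below happen atomically in one
        # local transaction; modeled here as two sequential statements.
        apply_changes()               # apply the remote transaction...
        self.remote_lsn = commit_lsn  # ...and advance the origin with it

origin = ReplicationOrigin()
applied = []
origin.apply_commit(120, lambda: applied.append("txn@120"))

# After a restart, the worker requests streaming from the origin's LSN;
# the publisher should then neither resend anything at or before that
# point nor skip anything after it.
start_decoding_at = origin.remote_lsn
assert start_decoding_at == 120
```

The atomicity of "apply + advance origin" is what gives the exactly-once guarantee Andres and Petr describe; the slot's flush location alone cannot, because it lags behind.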
On Thu, Feb 11, 2021 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner > <markus.wanner@enterprisedb.com> wrote: > > > Now, coming back to the restart case where the prepared transaction > can be sent again by the publisher. I understand yours and others > point that we should not send prepared transaction if there is a > restart between prepare and commit but there are reasons why we have > done that way and I am open to your suggestions. I'll once again try > to explain the exact case to you which is not very apparent. The basic > idea is that we ship/replay all transactions where commit happens > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see > atop snapbuild.c for details. Now, for transactions where prepare is > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is > after SNAPBUILD_CONSISTENT, we need to send the entire transaction > including prepare at the commit time. One might think it is quite easy > to detect that, basically if we skip prepare when the snapshot state > was not SNAPBUILD_CONSISTENT, then mark a flag in ReorderBufferTxn and > use the same to detect during commit and accordingly take the decision > to send prepare but unfortunately it is not that easy. There is always > a chance that on restart we reuse the snapshot serialized by some > other Walsender at a location prior to Prepare and if that happens > then this time the prepare won't be skipped due to snapshot state > (SNAPBUILD_CONSISTENT) but due to start_decoding_at point (considering > we have already shipped some of the later commits but not prepare). > Now, this will actually become the same situation where the restart > has happened after we have sent the prepare but not commit. This is > the reason we have to resend the prepare when the subscriber restarts > between prepare and commit.
After further thinking on this problem and some off-list discussions with Ajin, there appears to be another way to solve the above problem, by which we can avoid resending the prepare after a restart if it has already been processed by the subscriber. The main reason why we were not able to distinguish between the two cases ((a) the prepare happened before the SNAPBUILD_CONSISTENT state but the commit prepared happened after we reach the SNAPBUILD_CONSISTENT state, and (b) the prepare is already decoded, successfully processed by the subscriber, and we have restarted the decoding) is that after the restart we can re-use a snapshot serialized by some concurrent walsender at an LSN prior to the Prepare. Now, if we ensure that we don't use serialized snapshots for decoding via slots where the two_phase decoding option is enabled, then we won't have that problem. The drawback is that in some cases it can take a bit more time to build the initial snapshot, but maybe that is better than the current solution. Any suggestions? -- With Regards, Amit Kapila.
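The proposal can be reduced to one branch in the snapshot-building path. The sketch below is my paraphrase of the idea, not the actual patch; restore_serialized_snapshot is a stand-in for SnapBuildRestore and the dict return value is invented for illustration:

```python
# Illustrative sketch of the proposed rule: slots with two_phase
# enabled never reuse a snapshot serialized by another walsender.

def restore_serialized_snapshot(lsn):
    # Placeholder for restoring an on-disk snapshot found at this LSN.
    return {"consistent_at": lsn}

def initial_snapshot(slot_two_phase, serialized_snapshot_lsn):
    if slot_two_phase:
        # Build from scratch (return None -> caller must do a full
        # snapshot build). Slower startup, but a skipped PREPARE now
        # always means "snapshot was not consistent yet", never "some
        # other walsender's snapshot made us consistent early".
        return None
    return restore_serialized_snapshot(serialized_snapshot_lsn)

assert initial_snapshot(True, 90) is None
assert initial_snapshot(False, 90) == {"consistent_at": 90}
```

With this rule, the ambiguous case (b) from the earlier discussion can no longer arise, so a skipped prepare need not be resent at COMMIT PREPARED time.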
On Tue, Feb 16, 2021 at 9:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 11, 2021 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner > > <markus.wanner@enterprisedb.com> wrote: > > > > > Now, coming back to the restart case where the prepared transaction > > can be sent again by the publisher. I understand yours and others > > point that we should not send prepared transaction if there is a > > restart between prepare and commit but there are reasons why we have > > done that way and I am open to your suggestions. I'll once again try > > to explain the exact case to you which is not very apparent. The basic > > idea is that we ship/replay all transactions where commit happens > > after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see > > atop snapbuild.c for details. Now, for transactions where prepare is > > before snapshot state SNAPBUILD_CONSISTENT and commit prepared is > > after SNAPBUILD_CONSISTENT, we need to send the entire transaction > > including prepare at the commit time. One might think it is quite easy > > to detect that, basically if we skip prepare when the snapshot state > > was not SNAPBUILD_CONSISTENT, then mark a flag in ReorderBufferTxn and > > use the same to detect during commit and accordingly take the decision > > to send prepare but unfortunately it is not that easy. There is always > > a chance that on restart we reuse the snapshot serialized by some > > other Walsender at a location prior to Prepare and if that happens > > then this time the prepare won't be skipped due to snapshot state > > (SNAPBUILD_CONSISTENT) but due to start_decodint_at point (considering > > we have already shipped some of the later commits but not prepare). > > Now, this will actually become the same situation where the restart > > has happened after we have sent the prepare but not commit. 
This is > > the reason we have to resend the prepare when the subscriber restarts > > between prepare and commit. > > > > After further thinking on this problem and some off-list discussions > with Ajin, there appears to be another way to solve the above problem > by which we can avoid resending the prepare after restart if it has > already been processed by the subscriber. The main reason why we were > not able to distinguish between the two cases ((a) prepare happened > before SNAPBUILD_CONSISTENT state but commit prepared happened after > we reach SNAPBUILD_CONSISTENT state and (b) prepare is already > decoded, successfully processed by the subscriber and we have > restarted the decoding) is that we can re-use the serialized snapshot > at LSN location prior to Prepare of some concurrent WALSender after > the restart. Now, if we ensure that we don't use serialized snapshots > for decoding via slots where two_phase decoding option is enabled then > we won't have that problem. The drawback is that in some cases it can > take a bit more time for initial snapshot building but maybe that is > better than the current solution. > I see another thing which we need to address if we have to use the above solution. The issue is: if initially the two-pc option for the subscription is off, we skip the prepare because of that, and then some unrelated commit happens which allows the start_decoding_at point to move ahead. If the user then enables the two-pc option for the subscription, we will again skip the prepare because it is behind the start_decoding_at point, which becomes the same case where the prepare seems to have already been sent. So, in such a situation, with the above solution we will miss sending the prepared transaction and its data and hence risk making the replica out-of-sync. Now, this can be avoided if we don't allow users to alter the two-pc option once the subscription is created.
I am not sure but maybe for the first version of this feature that might be okay and we can improve it later if we have better ideas. This will definitely allow us to avoid checks in the plugins and or apply-worker which seems like a good trade-off and it will address the concern most people have raised in this thread. Any thoughts? -- With Regards, Amit Kapila.
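The out-of-sync scenario Amit raises can be walked through concretely. The timeline below is a toy model with made-up LSNs and simplified decode semantics (not the real decoder): the PREPARE is skipped while the option is off, an unrelated commit advances the confirmed position past it, and after the option is turned on the PREPARE looks as if it had already been sent.

```python
# Toy timeline showing why toggling the two-phase option after slot
# creation can silently lose a prepared transaction (illustration only).

def decode(events, two_phase, start_decoding_at):
    """Return (records shipped, highest LSN seen) for one decode run."""
    out = []
    last_lsn = 0
    for lsn, kind in events:
        last_lsn = max(last_lsn, lsn)
        if lsn < start_decoding_at:
            continue                 # already confirmed, never resent
        if kind == "PREPARE" and not two_phase:
            continue                 # option off: prepare is skipped
        out.append(kind)
    return out, last_lsn

wal = [(10, "PREPARE"), (20, "COMMIT other")]

# First run: two-phase off; prepare skipped, confirmed up to LSN 20.
sent, confirmed = decode(wal, two_phase=False, start_decoding_at=0)
assert sent == ["COMMIT other"]

# User enables two-phase and restarts; decoding resumes past LSN 20,
# so the PREPARE at LSN 10 is never shipped -> replica out of sync.
sent2, _ = decode(wal + [(30, "COMMIT PREPARED")], two_phase=True,
                  start_decoding_at=confirmed + 1)
assert "PREPARE" not in sent2
assert sent2 == ["COMMIT PREPARED"]
```

This is why fixing the two-phase option at slot/subscription creation time, as proposed, sidesteps the problem entirely.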
On Tue, Feb 16, 2021 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > After further thinking on this problem and some off-list discussions > with Ajin, there appears to be another way to solve the above problem > by which we can avoid resending the prepare after restart if it has > already been processed by the subscriber. The main reason why we were > not able to distinguish between the two cases ((a) prepare happened > before SNAPBUILD_CONSISTENT state but commit prepared happened after > we reach SNAPBUILD_CONSISTENT state and (b) prepare is already > decoded, successfully processed by the subscriber and we have > restarted the decoding) is that we can re-use the serialized snapshot > at LSN location prior to Prepare of some concurrent WALSender after > the restart. Now, if we ensure that we don't use serialized snapshots > for decoding via slots where two_phase decoding option is enabled then > we won't have that problem. The drawback is that in some cases it can > take a bit more time for initial snapshot building but maybe that is > better than the current solution. Based on this suggestion, I have created a patch on HEAD which now does not allow repeated decoding of prepared transactions. For this, the code now enforces full_snapshot if two-phase decoding is enabled. Do have a look at the patch and see if you have any comments. Currently one problem with this, as you have also mentioned in your last mail, is that if initially two-phase is disabled in test_decoding while decoding prepare (causing the prepared transaction to not be decoded) and later enabled after the commit prepared (where it assumes that the transaction was decoded at prepare time), then the transaction is not decoded at all. 
For eg:

postgres=# begin;
BEGIN
postgres=*# INSERT INTO do_write DEFAULT VALUES;
INSERT 0 1
postgres=*# PREPARE TRANSACTION 'test1';
PREPARE TRANSACTION
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit', '0');
 data
------
(0 rows)

postgres=# commit prepared 'test1';
COMMIT PREPARED
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', 'false', 'skip-empty-xacts', '1', 'two-phase-commit', '1');
          data
-------------------------
 COMMIT PREPARED 'test1'
(1 row)

The 1st pg_logical_slot_get_changes is called with two-phase-commit off, the 2nd is called with two-phase-commit on. You can see that the transaction is not decoded at all. For this, I am planning to change the semantics such that two-phase-commit can only be specified while creating the slot using pg_create_logical_replication_slot() and not in pg_logical_slot_get_changes, thus preventing the two-phase-commit flag from being toggled between restarts of the decoder. Let me know if anybody objects to this change, else I will update that in the next patch.

regards, Ajin Cherian Fujitsu Australia
Ajin, Amit, thank you both a lot for thinking this through and even providing a patch. The changes in expectation for twophase.out match exactly what I had prepared. And the switch with pg_logical_slot_get_changes is indeed something I had not yet considered, either. On 19.02.21 03:50, Ajin Cherian wrote: > For this, I am planning to change the semantics such that > two-phase-commit can only be specified while creating the slot using > pg_create_logical_replication_slot() > and not in pg_logical_slot_get_changes, thus preventing > two-phase-commit flag from being toggled between restarts of the > decoder. Let me know if anybody objects to this > change, else I will update that in the next patch. This sounds like a good plan to me, yes. However, more generally speaking, I suspect you are overthinking this. All of the complexity arises from the assumption that an output plugin receiving and confirming a PREPARE may not be able to persist that first phase of transaction application. Instead, you are trying to somehow resurrect the transactional changes and the prepare at COMMIT PREPARED time and decode them in a deferred way. Instead, I'm arguing that a PREPARE is an atomic operation, just like a transaction's COMMIT. The decoder should always feed these in their order of appearance in the WAL. For example, if you have PREPARE A, COMMIT B, COMMIT PREPARED A in the WAL, the decoder should always output these events in exactly that order, and never COMMIT B, PREPARE A, COMMIT PREPARED A (which is currently violated in the expectation for twophase_snapshot, because the COMMIT for `s1insert` there appears after the PREPARE of `s2p` in the WAL, but gets decoded before it). The patch I'm attaching corrects this expectation in twophase_snapshot, adds an explanatory diagram, and eliminates any danger of sending PREPAREs at COMMIT PREPARED time, thereby preserving the ordering of PREPAREs vs. COMMITs.
Given that the output plugin supports two-phase commit, I argue there must be a good reason for it to set the start_decoding_at LSN to a point in time after a PREPARE. To me that means the output plugin (or its downstream replica) has processed the PREPARE (and the downstream replica did whatever it needed to do on its side to make the transaction ready to be committed in a second phase). (In the weird case of an output plugin that wants to enable two-phase commit but does not really support it downstream, it is still possible for it to hold back LSN confirmations for prepared-but-still-in-flight transactions. However, I'm having a hard time justifying this use case.) With that line of thinking, the point in time (or in WAL) of the COMMIT PREPARED does not matter at all when reasoning about the decoding of the PREPARE operation. Instead, there are only exactly two cases to consider: a) the PREPARE happened before the start_decoding_at LSN and must not be decoded. (But the effects of the PREPARE must then be included in the initial synchronization. If that's not supported, the output plugin should not enable two-phase commit.) b) the PREPARE happens after the start_decoding_at LSN and must be decoded. (It obviously is not included in the initial synchronization or decoded by a previous instance of the decoder process.) The case where the PREPARE lies before SNAPBUILD_CONSISTENT must always be case a), where we must not repeat the PREPARE anyway. And in case b), where we need a consistent snapshot to decode the PREPARE, existing provisions already guarantee that to be possible (or how would this be different from a regular single-phase commit?). Please let me know what you think and whether this approach is feasible for you as well. Regards Markus
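The rule Markus argues for fits in one predicate. This sketch is a restatement of his cases a) and b) with invented LSN values, not code from any patch: the COMMIT PREPARED position plays no role, and a PREPARE is decoded exactly when a plain COMMIT at the same LSN would be.

```python
# Minimal sketch of the proposed rule: decode a PREPARE iff it lies at
# or after start_decoding_at, exactly like an ordinary COMMIT record.

def should_decode_prepare(prepare_lsn, start_decoding_at):
    # case a) prepare before start_decoding_at: its effects must already
    #         be covered by the initial synchronization, so skip it
    # case b) otherwise: decode it, in WAL order relative to commits
    return prepare_lsn >= start_decoding_at

assert should_decode_prepare(100, 200) is False  # case a): skip
assert should_decode_prepare(300, 200) is True   # case b): decode
```

Under this rule the decoder never resurrects a PREPARE at COMMIT PREPARED time, so the ordering of PREPAREs and COMMITs in the output always matches the WAL.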