Thread: [BUG] "FailedAssertion" reported when streaming in logical replication
[BUG] "FailedAssertion" reported when streaming in logical replication
From
"tanghy.fnst@fujitsu.com"
Date:
Hi I think I may found a bug when using streaming in logical replication. Could anyone please take a look at this? Here's what I did to produce the problem. I set logical_decoding_work_mem and created multiple publications at publisher, created multiple subscriptions with "streaming= on" at subscriber. However, an assertion failed at publisher when I COMMIT and ROLLBACK multiple transactions at the same time. The log reported a FailedAssertion: TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line: 3465, PID: 911730) The problem happens both in synchronous mode and asynchronous mode. When there are only one or two publications, It doesn'tseem to happen. (In my case, there are 8 publications and the failure always happened). The scripts and the log are attached. It took me about 4 minutes to run the script on my machine. Please contact me if you need more specific info for the problem. Regards Tang
Attachment
Hi
I think I may found a bug when using streaming in logical replication. Could anyone please take a look at this?
Here's what I did to produce the problem.
I set logical_decoding_work_mem and created multiple publications at publisher, created multiple subscriptions with "streaming = on" at subscriber.
However, an assertion failed at publisher when I COMMIT and ROLLBACK multiple transactions at the same time.
The log reported a FailedAssertion:
TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line: 3465, PID: 911730)
The problem happens both in synchronous mode and asynchronous mode. When there are only one or two publications, It doesn't seem to happen. (In my case, there are 8 publications and the failure always happened).
The scripts and the log are attached. It took me about 4 minutes to run the script on my machine.
Please contact me if you need more specific info for the problem.
Thanks for reporting. I will look into it.
On Mon, Apr 26, 2021 at 1:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, 26 Apr 2021 at 12:45 PM, tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: >> >> Hi >> >> I think I may found a bug when using streaming in logical replication. Could anyone please take a look at this? >> >> Here's what I did to produce the problem. >> I set logical_decoding_work_mem and created multiple publications at publisher, created multiple subscriptions with "streaming= on" at subscriber. >> However, an assertion failed at publisher when I COMMIT and ROLLBACK multiple transactions at the same time. >> >> The log reported a FailedAssertion: >> TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line: 3465, PID: 911730) >> >> The problem happens both in synchronous mode and asynchronous mode. When there are only one or two publications, It doesn'tseem to happen. (In my case, there are 8 publications and the failure always happened). >> >> The scripts and the log are attached. It took me about 4 minutes to run the script on my machine. >> Please contact me if you need more specific info for the problem. > > > > Thanks for reporting. I will look into it. I am able to reproduce this and I think I have done the initial investigation. The cause of the issue is that, this transaction has only one change and that change is XLOG_HEAP2_NEW_CID, which is added through SnapBuildProcessNewCid. Basically, when we add any changes through SnapBuildProcessChange we set the base snapshot but when we add SnapBuildProcessNewCid this we don't set the base snapshot, because there is nothing to be done for this change. Now, this transaction is identified as the biggest transaction with non -partial changes, and now in ReorderBufferStreamTXN, it will return immediately because the base_snapshot is NULL. I think the fix should be while selecting the largest transaction in ReorderBufferLargestTopTXN, we should check the base_snapshot should not be NULL. I will think more about this and post the patch. From the core dump, we can see that base_snapshot is 0x0 and ntuplecids = 1, and txn_flags = 1 also proves that it has a new command id change. And the size of the txn also shows that it has only one change and that is REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID because in that case, the change size will be just the sizeof(ReorderBufferChange) which is 80. (gdb) p *txn $4 = {txn_flags = 1, xid = 1115, toplevel_xid = 0, gid = 0x0, first_lsn = 1061159120, final_lsn = 0, end_lsn = 0, toptxn = 0x0, restart_decoding_lsn = 958642624, origin_id = 0, origin_lsn = 0, commit_time = 0, base_snapshot = 0x0, base_snapshot_lsn = 0, base_snapshot_node = {prev = 0x0, next = 0x0}, snapshot_now = 0x0, command_id = 4294967295, nentries = 1, nentries_mem = 1, changes = {head = {prev = 0x3907c18, next = 0x3907c18}}, tuplecids = {head = {prev = 0x39073d8, next = 0x39073d8}}, ntuplecids = 1, tuplecid_hash = 0x0, toast_hash = 0x0, subtxns = {head = {prev = 0x30f1cd8, next = 0x30f1cd8}}, nsubtxns = 0, ninvalidations = 0, invalidations = 0x0, node = {prev = 0x30f1a98, next = 0x30c64f8}, size = 80, total_size = 80, concurrent_abort = false, output_plugin_private = 0x0} -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 26, 2021 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Apr 26, 2021 at 1:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, 26 Apr 2021 at 12:45 PM, tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > >> > >> Hi > >> > >> I think I may found a bug when using streaming in logical replication. Could anyone please take a look at this? > >> > >> Here's what I did to produce the problem. > >> I set logical_decoding_work_mem and created multiple publications at publisher, created multiple subscriptions with"streaming = on" at subscriber. > >> However, an assertion failed at publisher when I COMMIT and ROLLBACK multiple transactions at the same time. > >> > >> The log reported a FailedAssertion: > >> TRAP: FailedAssertion("txn->size == 0", File: "reorderbuffer.c", Line: 3465, PID: 911730) > >> > >> The problem happens both in synchronous mode and asynchronous mode. When there are only one or two publications, Itdoesn't seem to happen. (In my case, there are 8 publications and the failure always happened). > >> > >> The scripts and the log are attached. It took me about 4 minutes to run the script on my machine. > >> Please contact me if you need more specific info for the problem. > > > > > > > > Thanks for reporting. I will look into it. > > I am able to reproduce this and I think I have done the initial investigation. > > The cause of the issue is that, this transaction has only one change > and that change is XLOG_HEAP2_NEW_CID, which is added through > SnapBuildProcessNewCid. Basically, when we add any changes through > SnapBuildProcessChange we set the base snapshot but when we add > SnapBuildProcessNewCid this we don't set the base snapshot, because > there is nothing to be done for this change. Now, this transaction is > identified as the biggest transaction with non -partial changes, and > now in ReorderBufferStreamTXN, it will return immediately because the > base_snapshot is NULL. > Your analysis sounds correct to me. -- With Regards, Amit Kapila.
On Mon, Apr 26, 2021 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 26, 2021 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I am able to reproduce this and I think I have done the initial investigation. > > > > The cause of the issue is that, this transaction has only one change > > and that change is XLOG_HEAP2_NEW_CID, which is added through > > SnapBuildProcessNewCid. Basically, when we add any changes through > > SnapBuildProcessChange we set the base snapshot but when we add > > SnapBuildProcessNewCid this we don't set the base snapshot, because > > there is nothing to be done for this change. Now, this transaction is > > identified as the biggest transaction with non -partial changes, and > > now in ReorderBufferStreamTXN, it will return immediately because the > > base_snapshot is NULL. > > > > Your analysis sounds correct to me. > Thanks, I have attached a patch to fix this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Apr 26, 2021 at 7:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Apr 26, 2021 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 26, 2021 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I am able to reproduce this and I think I have done the initial investigation. > > > > > > The cause of the issue is that, this transaction has only one change > > > and that change is XLOG_HEAP2_NEW_CID, which is added through > > > SnapBuildProcessNewCid. Basically, when we add any changes through > > > SnapBuildProcessChange we set the base snapshot but when we add > > > SnapBuildProcessNewCid this we don't set the base snapshot, because > > > there is nothing to be done for this change. Now, this transaction is > > > identified as the biggest transaction with non -partial changes, and > > > now in ReorderBufferStreamTXN, it will return immediately because the > > > base_snapshot is NULL. > > > > > > > Your analysis sounds correct to me. > > > > Thanks, I have attached a patch to fix this. There is also one very silly mistake in below condition, basically, once we got any transaction for next transaction it is unconditionally selecting without comparing the size because largest != NULL is wrong, ideally this should be largest == NULL, basically, if we haven't select any transaction then only we can approve next transaction without comparing the size. if ((largest != NULL || txn->total_size > largest_size) && (txn->base_snapshot != NULL) && (txn->total_size > 0) && !(rbtxn_has_incomplete_tuple(txn))) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Apr 26, 2021 at 7:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Apr 26, 2021 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 26, 2021 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I am able to reproduce this and I think I have done the initial investigation. > > > > > > The cause of the issue is that, this transaction has only one change > > > and that change is XLOG_HEAP2_NEW_CID, which is added through > > > SnapBuildProcessNewCid. Basically, when we add any changes through > > > SnapBuildProcessChange we set the base snapshot but when we add > > > SnapBuildProcessNewCid this we don't set the base snapshot, because > > > there is nothing to be done for this change. Now, this transaction is > > > identified as the biggest transaction with non -partial changes, and > > > now in ReorderBufferStreamTXN, it will return immediately because the > > > base_snapshot is NULL. > > > > > > > Your analysis sounds correct to me. > > > > Thanks, I have attached a patch to fix this. > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is ensured in ReorderBufferSetBaseSnapshot that we always assign base_snapshot to a top-level transaction if the current is a known subxact. I think that will be true because we always form xid-subxid relation before processing each record in LogicalDecodingProcessRecord. Few other minor comments: 1. I think we can update the comments atop function ReorderBufferLargestTopTXN. 2. minor typo in comments atop ReorderBufferLargestTopTXN "...There is a scope of optimization here such that we can select the largest transaction which has complete changes...". In this 'complete' should be incomplete. This is not related to this patch but I think we can fix it along with this because anyway we are going to change surrounding comments. -- With Regards, Amit Kapila.
On Tue, Apr 27, 2021 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 26, 2021 at 7:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Apr 26, 2021 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Apr 26, 2021 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > I am able to reproduce this and I think I have done the initial investigation. > > > > > > > > The cause of the issue is that, this transaction has only one change > > > > and that change is XLOG_HEAP2_NEW_CID, which is added through > > > > SnapBuildProcessNewCid. Basically, when we add any changes through > > > > SnapBuildProcessChange we set the base snapshot but when we add > > > > SnapBuildProcessNewCid this we don't set the base snapshot, because > > > > there is nothing to be done for this change. Now, this transaction is > > > > identified as the biggest transaction with non -partial changes, and > > > > now in ReorderBufferStreamTXN, it will return immediately because the > > > > base_snapshot is NULL. > > > > > > > > > > Your analysis sounds correct to me. > > > > > > > Thanks, I have attached a patch to fix this. > > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is > ensured in ReorderBufferSetBaseSnapshot that we always assign > base_snapshot to a top-level transaction if the current is a known > subxact. I think that will be true because we always form xid-subxid > relation before processing each record in > LogicalDecodingProcessRecord. Yeah, we can do that, but here we are only interested in top transactions and this list will give us sub-transaction as well so we will have to skip it in the below if condition. So I think using toplevel_by_lsn and skipping the txn without base_snapshot in below if condition will be cheaper compared to process all the transactions with base snapshot i.e. txns_by_base_snapshot_lsn and skipping the sub-transactions in the below if conditions. Whats your thoughts on this? > Few other minor comments: > 1. I think we can update the comments atop function ReorderBufferLargestTopTXN. > 2. minor typo in comments atop ReorderBufferLargestTopTXN "...There is > a scope of optimization here such that we can select the largest > transaction which has complete changes...". In this 'complete' should > be incomplete. This is not related to this patch but I think we can > fix it along with this because anyway we are going to change > surrounding comments. I will work on these in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 27, 2021 at 11:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Apr 27, 2021 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 26, 2021 at 7:52 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Apr 26, 2021 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Apr 26, 2021 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > I am able to reproduce this and I think I have done the initial investigation. > > > > > > > > > > The cause of the issue is that, this transaction has only one change > > > > > and that change is XLOG_HEAP2_NEW_CID, which is added through > > > > > SnapBuildProcessNewCid. Basically, when we add any changes through > > > > > SnapBuildProcessChange we set the base snapshot but when we add > > > > > SnapBuildProcessNewCid this we don't set the base snapshot, because > > > > > there is nothing to be done for this change. Now, this transaction is > > > > > identified as the biggest transaction with non -partial changes, and > > > > > now in ReorderBufferStreamTXN, it will return immediately because the > > > > > base_snapshot is NULL. > > > > > > > > > > > > > Your analysis sounds correct to me. > > > > > > > > > > Thanks, I have attached a patch to fix this. > > > > > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is > > ensured in ReorderBufferSetBaseSnapshot that we always assign > > base_snapshot to a top-level transaction if the current is a known > > subxact. I think that will be true because we always form xid-subxid > > relation before processing each record in > > LogicalDecodingProcessRecord. > > Yeah, we can do that, but here we are only interested in top > transactions and this list will give us sub-transaction as well so we > will have to skip it in the below if condition. > I am not so sure about this point. I have explained above why I think there won't be any subtransactions in this. Can you please let me know what am I missing if anything? -- With Regards, Amit Kapila.
On Tue, Apr 27, 2021 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is > > > ensured in ReorderBufferSetBaseSnapshot that we always assign > > > base_snapshot to a top-level transaction if the current is a known > > > subxact. I think that will be true because we always form xid-subxid > > > relation before processing each record in > > > LogicalDecodingProcessRecord. > > > > Yeah, we can do that, but here we are only interested in top > > transactions and this list will give us sub-transaction as well so we > > will have to skip it in the below if condition. > > > > I am not so sure about this point. I have explained above why I think > there won't be any subtransactions in this. Can you please let me know > what am I missing if anything? Got your point, yeah this will only have top transactions so we can use this. I will change this in the next patch. In fact we can put an assert that it should not be an sub transaction? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 27, 2021 at 12:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Apr 27, 2021 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is > > > > ensured in ReorderBufferSetBaseSnapshot that we always assign > > > > base_snapshot to a top-level transaction if the current is a known > > > > subxact. I think that will be true because we always form xid-subxid > > > > relation before processing each record in > > > > LogicalDecodingProcessRecord. > > > > > > Yeah, we can do that, but here we are only interested in top > > > transactions and this list will give us sub-transaction as well so we > > > will have to skip it in the below if condition. > > > > > > > I am not so sure about this point. I have explained above why I think > > there won't be any subtransactions in this. Can you please let me know > > what am I missing if anything? > > Got your point, yeah this will only have top transactions so we can > use this. I will change this in the next patch. In fact we can put > an assert that it should not be an sub transaction? > Right. It is good to have an assert. -- With Regards, Amit Kapila.
On Tue, Apr 27, 2021 at 12:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Apr 27, 2021 at 12:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Apr 27, 2021 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is > > > > > ensured in ReorderBufferSetBaseSnapshot that we always assign > > > > > base_snapshot to a top-level transaction if the current is a known > > > > > subxact. I think that will be true because we always form xid-subxid > > > > > relation before processing each record in > > > > > LogicalDecodingProcessRecord. > > > > > > > > Yeah, we can do that, but here we are only interested in top > > > > transactions and this list will give us sub-transaction as well so we > > > > will have to skip it in the below if condition. > > > > > > > > > > I am not so sure about this point. I have explained above why I think > > > there won't be any subtransactions in this. Can you please let me know > > > what am I missing if anything? > > > > Got your point, yeah this will only have top transactions so we can > > use this. I will change this in the next patch. In fact we can put > > an assert that it should not be an sub transaction? > > > > Right. It is good to have an assert. I have modified the patch based on the above comments. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
RE: [BUG] "FailedAssertion" reported when streaming in logical replication
From
"tanghy.fnst@fujitsu.com"
Date:
> I have modified the patch based on the above comments. Thanks for your patch. I tested again after applying your patch and the problem is fixed. Regards Tang
On Wed, Apr 28, 2021 at 12:25 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > > I have modified the patch based on the above comments. > > Thanks for your patch. > I tested again after applying your patch and the problem is fixed. Thanks for confirming. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 28, 2021 at 1:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Apr 28, 2021 at 12:25 PM tanghy.fnst@fujitsu.com > <tanghy.fnst@fujitsu.com> wrote: > > > > > I have modified the patch based on the above comments. > > > > Thanks for your patch. > > I tested again after applying your patch and the problem is fixed. > > Thanks for confirming. I tried to think about how to write a test case for this scenario, but I think it will not be possible to generate an automated test case for this. Basically, we need 2 concurrent transactions and out of that, we need one transaction which just has processed only one change i.e XLOG_HEAP2_NEW_CID and another transaction should overflow the logical decoding work mem, so that we select the wrong transaction which doesn't have the base snapshot. But how to control that the transaction which is performing the DDL just write the XLOG_HEAP2_NEW_CID wal and before it writes any other WAL we should get the WAl from other transaction which overflows the buffer. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
RE: [BUG] "FailedAssertion" reported when streaming in logical replication
From
"tanghy.fnst@fujitsu.com"
Date:
On Thursday, April 29, 2021 3:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote >I tried to think about how to write a test case for this scenario, but >I think it will not be possible to generate an automated test case for this. >Basically, we need 2 concurrent transactions and out of that, >we need one transaction which just has processed only one change i.e >XLOG_HEAP2_NEW_CID and another transaction should overflow the logical >decoding work mem, so that we select the wrong transaction which >doesn't have the base snapshot. But how to control that the >transaction which is performing the DDL just write the >XLOG_HEAP2_NEW_CID wal and before it writes any other WAL we should >get the WAl from other transaction which overflows the buffer. Thanks for your updating. Actually, I tried to make the automated test for the problem, too. But made no process on this. Agreed on your opinion " not be possible to generate an automated test case for this ". If anyone figure out a good solution for the test automation of this case. Please be kind to share that with us. Thanks. Regards, Tang
On Thu, Apr 29, 2021 at 12:09 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > > On Thursday, April 29, 2021 3:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote > > >I tried to think about how to write a test case for this scenario, but > >I think it will not be possible to generate an automated test case for this. > >Basically, we need 2 concurrent transactions and out of that, > >we need one transaction which just has processed only one change i.e > >XLOG_HEAP2_NEW_CID and another transaction should overflow the logical > >decoding work mem, so that we select the wrong transaction which > >doesn't have the base snapshot. But how to control that the > >transaction which is performing the DDL just write the > >XLOG_HEAP2_NEW_CID wal and before it writes any other WAL we should > >get the WAl from other transaction which overflows the buffer. > > Thanks for your updating. > Actually, I tried to make the automated test for the problem, too. But made no process on this. > Agreed on your opinion " not be possible to generate an automated test case for this ". Thanks for trying this out. > If anyone figure out a good solution for the test automation of this case. > Please be kind to share that with us. Thanks. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 27, 2021 at 5:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have modified the patch based on the above comments. > The patch looks good to me. I have slightly modified the comments and commit message. See, what you think of the attached? I think we can leave the test for this as there doesn't seem to be an easy way to automate it. -- With Regards, Amit Kapila.
Attachment
On Thu, Apr 29, 2021 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Apr 27, 2021 at 5:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have modified the patch based on the above comments. > > > > The patch looks good to me. I have slightly modified the comments and > commit message. See, what you think of the attached? I think we can > leave the test for this as there doesn't seem to be an easy way to > automate it. Your changes look good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com