Thread: Reorderbuffer crash during recovery
Hi,

I found a couple of crashes in the reorderbuffer while reviewing/testing logical_work_mem and logical streaming of large in-progress transactions. Stack traces of both are given below:

Issue 1:
#0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6
#1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec514d in ExceptionalCondition (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 "FailedAssertion", fileName=0x10eab00 "../../../../src/include/lib/ilist.h", lineNumber=458) at assert.c:54
#3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, off=48) at ../../../../src/include/lib/ilist.h:458
#4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, oldestRunningXid=895) at reorderbuffer.c:1910
#5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, buf=0x7fff7e7b1e40) at decode.c:332
#6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0, record=0x3342770) at decode.c:121
#7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
#8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b <XLogSendLogical>) at walsender.c:2199
#9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at walsender.c:1128
#10 0x0000000000b6ce83 in exec_replication_command (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0 (proto_version '1', publication_names '\"pub1\"')") at walsender.c:1545
#11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0, dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at postgres.c:4256
#12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
#13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
#14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
#15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at postmaster.c:1400
#16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210

Issue 2:
#0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2 0x0000000000ec4e1d in ExceptionalCondition (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", errorType=0x10ea284 "FailedAssertion", fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:3052
#4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:1318
#5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2b9d778) at reorderbuffer.c:1257
#6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, oldestRunningXid=3835) at reorderbuffer.c:1973
#7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, buf=0x7ffcbc74cc00) at decode.c:332
#8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, record=0x2b67990) at decode.c:121
#9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

From initial analysis it looks like:
Issue 1 occurs if all of the reorderbuffer changes have been flushed (serialized) to disk and the server then restarts.
Issue 2 occurs if there are many subtransactions present and the server then restarts. The subtransaction's final_lsn is not being set, so when ReorderBufferRestoreCleanup is called the assert fails. Maybe we have to set the subtransaction's final_lsn before cleanup (not sure).

I could not reproduce these issues consistently with a test case, but they look like problems from review.

For issue 1, I could reproduce it with the following steps:
1) Change ReorderBufferCheckSerializeTXN so that it always flushes the transaction.
2) Have many open transactions with subtransactions open.
3) Attach to one of the transactions from gdb and call abort().

I'm not sure of the fix for this. If I get time I will try to spend more time to find the fix.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
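For step 1 of the repro above, a minimal sketch of the kind of change intended is shown below. It assumes the shape ReorderBufferCheckSerializeTXN has on master around this time (a threshold check on txn->nentries_mem); it is a debugging hack to force every transaction to spill to disk, not a proposed fix:

static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	/*
	 * Debugging hack for the repro: spill the transaction as soon as it
	 * has any change in memory instead of waiting for the normal
	 * max_changes_in_memory threshold.  Afterwards txn->changes is empty
	 * and the transaction is marked as serialized, which is exactly the
	 * state the issue-1 assertion trips over after a crash/restart.
	 */
	if (txn->nentries_mem > 0)
	{
		ReorderBufferSerializeTXN(rb, txn);
		Assert(txn->nentries_mem == 0);
	}
}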
On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote: > > Hi, > > I found couple of crashes in reorderbuffer while review/testing of > logical_work_mem and logical streaming of large in-progress > transactions. Stack trace of the same are given below: > Issue 1: > #0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6 > #1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6 > #2 0x0000000000ec514d in ExceptionalCondition > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 > "FailedAssertion", > fileName=0x10eab00 "../../../../src/include/lib/ilist.h", > lineNumber=458) at assert.c:54 > #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, > off=48) at ../../../../src/include/lib/ilist.h:458 > #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, > oldestRunningXid=895) at reorderbuffer.c:1910 > #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, > buf=0x7fff7e7b1e40) at decode.c:332 > #6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0, > record=0x3342770) at decode.c:121 > #7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845 > #8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b > <XLogSendLogical>) at walsender.c:2199 > #9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at > walsender.c:1128 > #10 0x0000000000b6ce83 in exec_replication_command > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0 > (proto_version '1', publication_names '\"pub1\"')") > at walsender.c:1545 > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0, > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at > postgres.c:4256 > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498 > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189 > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727 > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at > postmaster.c:1400 > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210 > > Issue 2: > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 > #2 0x0000000000ec4e1d in ExceptionalCondition > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > errorType=0x10ea284 "FailedAssertion", > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:3052 > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:1318 > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2b9d778) at reorderbuffer.c:1257 > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > oldestRunningXid=3835) at reorderbuffer.c:1973 > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, > buf=0x7ffcbc74cc00) at decode.c:332 > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, > record=0x2b67990) at decode.c:121 > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 > > From initial analysis it looks like: > Issue1 it seems like if all the reorderbuffer has been flushed and > then the server restarts. This problem occurs. > Issue 2 it seems like if there are many subtransactions present and > then the server restarts. This problem occurs. The subtransaction's > final_lsn is not being set and when ReorderBufferRestoreCleanup is > called the assert fails. 
May be for this we might have to set the > subtransaction's final_lsn before cleanup(not sure). > > I could not reproduce this issue consistently with a test case, But I > felt this looks like a problem from review. > > For issue1, I could reproduce by the following steps: > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always. > 2) Have many open transactions with subtransactions open. > 3) Attach one of the transaction from gdb and call abort(). Do you need subtransactions for issue 1? It appears that after the restart, if the changes list is empty, it will hit the assert. Am I missing something? > > I'm not sure of the fix for this. If I get time I will try to spend > more time to find out the fix. > Thoughts? > -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Hi, > > > > I found couple of crashes in reorderbuffer while review/testing of > > logical_work_mem and logical streaming of large in-progress > > transactions. Stack trace of the same are given below: > > Issue 1: > > #0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6 > > #1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6 > > #2 0x0000000000ec514d in ExceptionalCondition > > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 > > "FailedAssertion", > > fileName=0x10eab00 "../../../../src/include/lib/ilist.h", > > lineNumber=458) at assert.c:54 > > #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, > > off=48) at ../../../../src/include/lib/ilist.h:458 > > #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, > > oldestRunningXid=895) at reorderbuffer.c:1910 > > #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, > > buf=0x7fff7e7b1e40) at decode.c:332 > > #6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0, > > record=0x3342770) at decode.c:121 > > #7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845 > > #8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b > > <XLogSendLogical>) at walsender.c:2199 > > #9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at > > walsender.c:1128 > > #10 0x0000000000b6ce83 in exec_replication_command > > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0 > > (proto_version '1', publication_names '\"pub1\"')") > > at walsender.c:1545 > > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0, > > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at > > postgres.c:4256 > > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498 > > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189 > > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727 > > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at > > postmaster.c:1400 > > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210 > > > > Issue 2: > > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 > > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 > > #2 0x0000000000ec4e1d in ExceptionalCondition > > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > > errorType=0x10ea284 "FailedAssertion", > > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > > txn=0x2bafb08) at reorderbuffer.c:3052 > > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, > > txn=0x2bafb08) at reorderbuffer.c:1318 > > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > > txn=0x2b9d778) at reorderbuffer.c:1257 > > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > > oldestRunningXid=3835) at reorderbuffer.c:1973 > > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, > > buf=0x7ffcbc74cc00) at decode.c:332 > > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, > > record=0x2b67990) at decode.c:121 > > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 > > > > From initial analysis it looks like: > > Issue1 it seems like if all the reorderbuffer has been flushed and > > then the server restarts. This problem occurs. 
> > Issue 2 it seems like if there are many subtransactions present and > > then the server restarts. This problem occurs. The subtransaction's > > final_lsn is not being set and when ReorderBufferRestoreCleanup is > > called the assert fails. May be for this we might have to set the > > subtransaction's final_lsn before cleanup(not sure). > > > > I could not reproduce this issue consistently with a test case, But I > > felt this looks like a problem from review. > > > > For issue1, I could reproduce by the following steps: > > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always. > > 2) Have many open transactions with subtransactions open. > > 3) Attach one of the transaction from gdb and call abort(). > > Do you need subtransactions for the issue1? It appears that after the > restart if the changes list is empty it will hit the assert. Am I > missing something? > When I reported this issue I could reproduce it with sub-transactions. Now I have tried without sub-transactions and can still reproduce it. You are right: issue 1 will appear in both cases, with and without subtransactions. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 7, 2019 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > Hi, > > > > > > I found couple of crashes in reorderbuffer while review/testing of > > > logical_work_mem and logical streaming of large in-progress > > > transactions. Stack trace of the same are given below: > > > Issue 1: > > > #0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6 > > > #1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6 > > > #2 0x0000000000ec514d in ExceptionalCondition > > > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 > > > "FailedAssertion", > > > fileName=0x10eab00 "../../../../src/include/lib/ilist.h", > > > lineNumber=458) at assert.c:54 > > > #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, > > > off=48) at ../../../../src/include/lib/ilist.h:458 > > > #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, > > > oldestRunningXid=895) at reorderbuffer.c:1910 > > > #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, > > > buf=0x7fff7e7b1e40) at decode.c:332 > > > #6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0, > > > record=0x3342770) at decode.c:121 > > > #7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845 > > > #8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b > > > <XLogSendLogical>) at walsender.c:2199 > > > #9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at > > > walsender.c:1128 > > > #10 0x0000000000b6ce83 in exec_replication_command > > > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0 > > > (proto_version '1', publication_names '\"pub1\"')") > > > at walsender.c:1545 > > > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0, > > > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at > > > postgres.c:4256 > > > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498 > > > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189 > > > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727 > > > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at > > > postmaster.c:1400 > > > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210 > > > > > > Issue 2: > > > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 > > > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 > > > #2 0x0000000000ec4e1d in ExceptionalCondition > > > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > > > errorType=0x10ea284 "FailedAssertion", > > > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > > > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > > > txn=0x2bafb08) at reorderbuffer.c:3052 > > > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, > > > txn=0x2bafb08) at reorderbuffer.c:1318 > > > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > > > txn=0x2b9d778) at reorderbuffer.c:1257 > > > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > > > oldestRunningXid=3835) at reorderbuffer.c:1973 > > > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, > > > buf=0x7ffcbc74cc00) at decode.c:332 > > > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, > > > record=0x2b67990) at decode.c:121 > > > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 > > > > > > From initial analysis it looks 
like: > > > Issue1 it seems like if all the reorderbuffer has been flushed and > > > then the server restarts. This problem occurs. > > > Issue 2 it seems like if there are many subtransactions present and > > > then the server restarts. This problem occurs. The subtransaction's > > > final_lsn is not being set and when ReorderBufferRestoreCleanup is > > > called the assert fails. May be for this we might have to set the > > > subtransaction's final_lsn before cleanup(not sure). > > > > > > I could not reproduce this issue consistently with a test case, But I > > > felt this looks like a problem from review. > > > > > > For issue1, I could reproduce by the following steps: > > > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always. > > > 2) Have many open transactions with subtransactions open. > > > 3) Attach one of the transaction from gdb and call abort(). > > > > Do you need subtransactions for the issue1? It appears that after the > > restart if the changes list is empty it will hit the assert. Am I > > missing something? > > > > When I had reported this issue I could reproduce this issue with > sub-transactions. Now I have tried without using sub-transactions and > could still reproduce this issue. You are right Issue 1 will appear in > both the cases with and without subtransactions. Okay, thanks for the confirmation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 07, 2019 at 11:01:17AM +0530, Dilip Kumar wrote: >On Thu, Nov 7, 2019 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote: >> >> On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> > >> > On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote: >> > > >> > > Hi, >> > > >> > > ... >> > > >> > > Issue1 it seems like if all the reorderbuffer has been flushed and >> > > then the server restarts. This problem occurs. >> > > Issue 2 it seems like if there are many subtransactions present and >> > > then the server restarts. This problem occurs. The subtransaction's >> > > final_lsn is not being set and when ReorderBufferRestoreCleanup is >> > > called the assert fails. May be for this we might have to set the >> > > subtransaction's final_lsn before cleanup(not sure). >> > > >> > > I could not reproduce this issue consistently with a test case, But I >> > > felt this looks like a problem from review. >> > > >> > > For issue1, I could reproduce by the following steps: >> > > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always. >> > > 2) Have many open transactions with subtransactions open. >> > > 3) Attach one of the transaction from gdb and call abort(). >> > >> > Do you need subtransactions for the issue1? It appears that after the >> > restart if the changes list is empty it will hit the assert. Am I >> > missing something? >> > >> >> When I had reported this issue I could reproduce this issue with >> sub-transactions. Now I have tried without using sub-transactions and >> could still reproduce this issue. You are right Issue 1 will appear in >> both the cases with and without subtransactions. > >Okay, thanks for the confirmation. > I'm a bit confused - does this happen only with the logical_work_mem patches, or with clean master too? If only with the patches, which version exactly? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > I'm a bit confused - does this happen only with the logical_work_mem > patches, or with clean master too? > This occurs with the clean master. This is a base code problem revealed while doing stress testing of logical_work_mem patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Hi, On 2019-11-07 17:03:44 +0530, Amit Kapila wrote: > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > > > I'm a bit confused - does this happen only with the logical_work_mem > > patches, or with clean master too? > > > > This occurs with the clean master. This is a base code problem > revealed while doing stress testing of logical_work_mem patches. As far as I can tell there are no repro steps included? Any chance to get those? Greetings, Andres Freund
On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > I'm a bit confused - does this happen only with the logical_work_mem
> > > patches, or with clean master too?
> > >
> >
> > This occurs with the clean master. This is a base code problem
> > revealed while doing stress testing of logical_work_mem patches.
>
> As far as I can tell there are no repro steps included? Any chance to
> get those?
>
This problem does not occur consistently. I found the crashes while reviewing and testing "logical streaming for large in-progress transactions" [1].
This issue does not occur directly: it shows up only after some other crash occurs in a postgres process (not in the reorderbuffer, but due to some other issue); the reorderbuffer crash then appears after that original, non-reorderbuffer crash.
To simplify the reorderbuffer crash, I used the following steps:
1) Set up logical replication with a publisher and subscriber for some table
2) Prepare a sql file with the below:
begin;
4096 insert statements;
select pg_sleep(120)
3) Execute the above script.
4) Attach gdb to the postgres backend while pg_sleep is in progress.
5) Call abort() from the attached gdb.
6) After some time there will be many core files in the publisher installation's data directory.
[1] https://commitfest.postgresql.org/25/1927/
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote: > > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra > > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > > > I'm a bit confused - does this happen only with the logical_work_mem > > > > patches, or with clean master too? > > > > > > > > > > This occurs with the clean master. This is a base code problem > > > revealed while doing stress testing of logical_work_mem patches. > > > > As far as I can tell there are no repro steps included? Any chance to > > get those? > > I think it will be a bit tricky to write a test case, but the callstack indicates the problem. For example, for issue-1, see the below part of the callstack: #2 0x0000000000ec514d in ExceptionalCondition (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 "FailedAssertion", fileName=0x10eab00 "../../../../src/include/lib/ilist.h", lineNumber=458) at assert.c:54 #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, off=48) at ../../../../src/include/lib/ilist.h:458 #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, oldestRunningXid=895) at reorderbuffer.c:1910 #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, buf=0x7fff7e7b1e40) at decode.c:332 The assertion failure indicates that the changes list is empty when we are trying to get the 'lsn' of the last change. I think this is possible when the server gets restarted immediately after serializing the transaction changes (say it crashed or somebody switched it off and restarted it). The reason is that after serializing the changes, the changes list will be empty and the serialized flag for the txn will be set. I am not sure if there is any reason to set the final_lsn if the changes list is empty. Similarly, if we see the call stack of issue-2, the problem is clear. #2 0x0000000000ec4e1d in ExceptionalCondition (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", errorType=0x10ea284 "FailedAssertion", fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:3052 #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2bafb08) at reorderbuffer.c:1318 #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, txn=0x2b9d778) at reorderbuffer.c:1257 #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, oldestRunningXid=3835) at reorderbuffer.c:1973 I think this also has a symptom similar to the prior issue, but for sub-transactions. ReorderBufferAbortOld() tries to set the final_lsn of the toplevel transaction, but not of its subtransactions; later in that path ReorderBufferRestoreCleanup expects it to be set even for subtransactions. Is that Assert in ReorderBufferRestoreCleanup() required, given that immediately after the assert we are anyway setting the value of final_lsn? > > This problem does not occur consistently. When I was reviewing and testing "logical streaming for large in-progress transactions" link [1] I found the crashes. > > This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres process (not from reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this reorderbuffer crash appears.
> > To simplify the reorderbuffer crash, I used the following steps: > 1) Make replication setup with publisher/subscriber for some table > 2) Prepare a sql file with the below: > begin; > 4096 insert statements; > select pg_sleep(120) > 3) Execute the above script. > 4) Attach the postgres process when pg_sleep is in progress. > 5) call abort() from attached gdb. > Isn't it important to call this abort immediately after the changes are serialized as explained above? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
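To make the issue-1 part of this analysis concrete, here is a rough sketch of the guard being discussed, written against the reorderbuffer.c structures visible in the stack trace (txn->changes, ReorderBufferChange, final_lsn). It only illustrates the idea raised above and is not the agreed fix:

/*
 * Sketch only: in ReorderBufferAbortOld(), the existing code derives
 * final_lsn for a crashed transaction from the tail of txn->changes
 * (the dlist_tail_element call in frames #3/#4 of the issue-1 stack).
 * When all changes were already serialized, the in-memory list is empty
 * and the dlist assertion fires, so guard the dereference:
 */
if (!dlist_is_empty(&txn->changes))
{
	ReorderBufferChange *last =
		dlist_tail_element(ReorderBufferChange, node, &txn->changes);

	txn->final_lsn = last->lsn;
}

/*
 * Open questions from the discussion above: how a serialized-only
 * transaction (and its subtransactions) should then get a usable
 * final_lsn, e.g. by remembering it at serialization time, and whether
 * the Assert in ReorderBufferRestoreCleanup() can stay as it is.
 */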
On Sat, Nov 9, 2019 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > Hi, > > > > > > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote: > > > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra > > > > <tomas.vondra@2ndquadrant.com> wrote: > > > > > > > > > > I'm a bit confused - does this happen only with the logical_work_mem > > > > > patches, or with clean master too? > > > > > > > > > > > > > This occurs with the clean master. This is a base code problem > > > > revealed while doing stress testing of logical_work_mem patches. > > > > > > As far as I can tell there are no repro steps included? Any chance to > > > get those? > > > > > I think it will be bit tricky to write a test case, but the callstack > indicates the problem. For ex. for issue-1, see the below part of > callstack, > > #2 0x0000000000ec514d in ExceptionalCondition > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 > "FailedAssertion", > fileName=0x10eab00 "../../../../src/include/lib/ilist.h", > lineNumber=458) at assert.c:54 > #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, > off=48) at ../../../../src/include/lib/ilist.h:458 > #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, > oldestRunningXid=895) at reorderbuffer.c:1910 > #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, > buf=0x7fff7e7b1e40) at decode.c:332 > > The Assertion failure indicates that the changes list is empty when we > are trying to get 'lsn' of the last change. I think this is possible > when immediately after serializing the transaction changes the server > got restarted (say it crashed or somebody switched off and restarted > it). The reason is that after serializing the changes, the changes > list will be empty and serialized flag for txn will be set. I am not > if there is any reason to set the final_lsn if the changes list is > empty. > > Similarly, if we see the call stack of issue-2, the problem is clear. > > #2 0x0000000000ec4e1d in ExceptionalCondition > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > errorType=0x10ea284 "FailedAssertion", > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:3052 > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:1318 > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2b9d778) at reorderbuffer.c:1257 > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > oldestRunningXid=3835) at reorderbuffer.c:1973 > > I think this also has a symptom similar to the prior issue but for > sub-transactions. The ReorderBufferAbortOld() tries to set the > final_lsn of toplevel transaction, but not for subtransactions, later > in that path ReorderBufferRestoreCleanup expects it to be set even for > subtransaction. Is that Assert in ReorderBufferRestoreCleanup() > required, because immediately after assert, we are anyway setting the > value of final_lsn. > Thanks Amit for more analysis of the issues. > > > > This problem does not occur consistently. When I was reviewing and testing "logical streaming for large in-progress transactions"link [1] I found the crashes. 
> > > > This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres process (not from reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this reorderbuffer crash appears. > > > > To simplify the reorderbuffer crash, I used the following steps: > > 1) Make replication setup with publisher/subscriber for some table > > 2) Prepare a sql file with the below: > > begin; > > 4096 insert statements; > > select pg_sleep(120) > > 3) Execute the above script. > > 4) Attach the postgres process when pg_sleep is in progress. > > 5) call abort() from attached gdb. > > > > Isn't it important to call this abort immediately after the changes > are serialized as explained above? > As we are performing exactly 4096 insert statements on a fresh replication setup, the transaction gets serialized immediately after the 4096th insert. The variable max_changes_in_memory is initialized to 4096 (in reorderbuffer.c), and txn->nentries_mem is incremented for every insert statement in the ReorderBufferQueueChange function. Since we have executed 4096 insert statements, txn->nentries_mem reaches 4096, which triggers the serialization from the ReorderBufferCheckSerializeTXN function. Attaching gdb and calling abort() at that point helped in reproducing the issue consistently. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
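For reference, an abridged outline of the accounting Vignesh describes, with names as used in reorderbuffer.c around this time; the body is trimmed to the relevant lines, so treat it as an outline rather than verbatim source:

void
ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
						 ReorderBufferChange *change)
{
	ReorderBufferTXN *txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

	change->lsn = lsn;
	dlist_push_tail(&txn->changes, &change->node);
	txn->nentries++;
	txn->nentries_mem++;		/* reaches 4096 on the script's last INSERT */

	/*
	 * Once nentries_mem reaches max_changes_in_memory (4096) this spills
	 * the whole transaction to disk, leaving txn->changes empty in memory,
	 * which is the state the issue-1 assertion later trips over.
	 */
	ReorderBufferCheckSerializeTXN(rb, txn);
}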
> I found couple of crashes in reorderbuffer while review/testing of > logical_work_mem and logical streaming of large in-progress > transactions. Stack trace of the same are given below: > Issue 1: > #0 0x00007f985c7d8337 in raise () from /lib64/libc.so.6 > #1 0x00007f985c7d9a28 in abort () from /lib64/libc.so.6 > #2 0x0000000000ec514d in ExceptionalCondition > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24 > "FailedAssertion", > fileName=0x10eab00 "../../../../src/include/lib/ilist.h", > lineNumber=458) at assert.c:54 > #3 0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60, > off=48) at ../../../../src/include/lib/ilist.h:458 > #4 0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0, > oldestRunningXid=895) at reorderbuffer.c:1910 > #5 0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0, > buf=0x7fff7e7b1e40) at decode.c:332 > #6 0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0, > record=0x3342770) at decode.c:121 > #7 0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845 > #8 0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b > <XLogSendLogical>) at walsender.c:2199 > #9 0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at > walsender.c:1128 > #10 0x0000000000b6ce83 in exec_replication_command > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0 > (proto_version '1', publication_names '\"pub1\"')") > at walsender.c:1545 > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0, > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at > postgres.c:4256 > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498 > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189 > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727 > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at > postmaster.c:1400 > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210 > > Issue 2: > #0 0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6 > #1 0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6 > #2 0x0000000000ec4e1d in ExceptionalCondition > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr", > errorType=0x10ea284 "FailedAssertion", > fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54 > #3 0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:3052 > #4 0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2bafb08) at reorderbuffer.c:1318 > #5 0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0, > txn=0x2b9d778) at reorderbuffer.c:1257 > #6 0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0, > oldestRunningXid=3835) at reorderbuffer.c:1973 > #7 0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0, > buf=0x7ffcbc74cc00) at decode.c:332 > #8 0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0, > record=0x2b67990) at decode.c:121 > #9 0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845 > > From initial analysis it looks like: > Issue1 it seems like if all the reorderbuffer has been flushed and > then the server restarts. This problem occurs. > Issue 2 it seems like if there are many subtransactions present and > then the server restarts. This problem occurs. The subtransaction's > final_lsn is not being set and when ReorderBufferRestoreCleanup is > called the assert fails. May be for this we might have to set the > subtransaction's final_lsn before cleanup(not sure). 
> > I could not reproduce this issue consistently with a test case, But I > felt this looks like a problem from review. > > For issue1, I could reproduce by the following steps: > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always. > 2) Have many open transactions with subtransactions open. > 3) Attach one of the transaction from gdb and call abort(). > > I'm not sure of the fix for this. If I get time I will try to spend > more time to find out the fix. I have analyzed the issue further and found the following: after abort() is called and the process restarts, the reorder information of the aborted transactions is cleaned up in the ReorderBufferAbortOld function. It crashes in the code below because no changes are present in memory any more, all of them having been serialized: ....... if (txn->serialized && txn->final_lsn == 0) { ReorderBufferChange *last = dlist_tail_element(ReorderBufferChange, node, &txn->changes); txn->final_lsn = last->lsn; } ....... It sets the final_lsn here so that it can iterate from the start_lsn to final_lsn and clean up the serialized files in the ReorderBufferRestoreCleanup function. One solution we were considering is to store the lsn of the last serialized change while serializing, and use it to set final_lsn in the case above that crashes, like the code below: ...... if (txn->serialized && txn->final_lsn == 0 && !dlist_is_empty(&txn->changes)) { ReorderBufferChange *last = dlist_tail_element(ReorderBufferChange, node, &txn->changes); txn->final_lsn = last->lsn; } else { /* * If there are no changes present as all of the changes were * serialized, use the last lsn that was serialized. */ txn->final_lsn = txn->current_serialized_lsn; } ...... I have tested the same scenario and verified that it works. The patch for the same is attached. I could not add a test case for this as it involves attaching gdb and calling abort(). Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 11, 2019 at 11:13 AM vignesh C <vignesh21@gmail.com> wrote: > > > It sets the final_lsn here so that it can iterate from the start_lsn > to final_lsn and cleanup the serialized files in > ReorderBufferRestoreCleanup function. One solution We were thinking > was to store the lsn of the last serialized change while serialiizing > and set the final_lsn in the above case where it crashes like the > below code: Sure, we can do something along the lines of what you are suggesting, but why can't we update final_lsn at the time of serializing the changes? If we do that, we don't even need to compute it separately during ReorderBufferAbortOld. Let me try to explain the problem and the proposed solutions. Currently, after serializing the changes we remove the 'changes' from ReorderBufferTXN. Now, if the system crashes for any reason after that, we won't be able to compute final_lsn after the restart, and that leads to an access violation in ReorderBufferAbortOld, which tries to access the changes list from ReorderBufferTXN to compute final_lsn. We could fix it either by tracking 'last_serialized_change' as proposed by Vignesh, or by updating final_lsn while we serialize the changes. IIUC, commit df9f682c7bf81674b6ae3900fd0146f35df0ae2e [1] tried to fix a related issue, which led to this other problem. Alvaro, Andres, do you have any suggestions? [1] - commit df9f682c7bf81674b6ae3900fd0146f35df0ae2e Author: Alvaro Herrera <alvherre@alvh.no-ip.org> Date: Fri Jan 5 12:17:10 2018 -0300 Fix failure to delete spill files of aborted transactions -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 17, 2019 at 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 11, 2019 at 11:13 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > It sets the final_lsn here so that it can iterate from the start_lsn > > to final_lsn and cleanup the serialized files in > > ReorderBufferRestoreCleanup function. One solution We were thinking > > was to store the lsn of the last serialized change while serialiizing > > and set the final_lsn in the above case where it crashes like the > > below code: > > Sure, we can do something on the lines what you are suggesting, but > why can't we update final_lsn at the time of serializing the changes? > If we do that then we don't even need to compute it separately during > ReorderBufferAbortOld. > > Let me try to explain the problem and proposed solutions for the same. > Currently, after serializing the changes we remove the 'changes' from > ReorderBufferTXN. Now, if the system crashes due to any reason after > that, we won't be able to compute final_lsn after the restart. And > that leads to access violation in ReorderBufferAbortOld which is > trying to access changes list from ReorderBufferTXN to compute > final_lsn. > > We could fix it by either tracking 'last_serialized_change' as > proposed by Vignesh or we could update the final_lsn while we > serialize the changes. > I felt Amit's solution also solves the problem. The attached patch has the fix based on the proposed solution. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On 2019-Dec-27, vignesh C wrote: > I felt amit solution also solves the problem. Attached patch has the > fix based on the solution proposed. > Thoughts? This seems a sensible fix to me, though I didn't try to reproduce the failure. > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > } > > ReorderBufferSerializeChange(rb, txn, fd, change); > + txn->final_lsn = change->lsn; > dlist_delete(&change->node); > ReorderBufferReturnChange(rb, change); Should this be done inside ReorderBufferSerializeChange itself, instead of in its caller? Also, would it be sane to verify that the TXN doesn't already have a newer final_lsn? Maybe as an Assert. > @@ -188,8 +188,7 @@ typedef struct ReorderBufferTXN > * * plain abort record > * * prepared transaction abort > * * error during decoding > - * * for a crashed transaction, the LSN of the last change, regardless of > - * what it was. > + * * last serialized lsn > * ---- I propose "for a transaction with serialized changes, the LSN of the latest serialized one, unless one of the above cases." -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Dec-27, vignesh C wrote: > > > I felt amit solution also solves the problem. Attached patch has the > > fix based on the solution proposed. > > Thoughts? > > This seems a sensible fix to me, though I didn't try to reproduce the > failure. > > > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > } > > > > ReorderBufferSerializeChange(rb, txn, fd, change); > > + txn->final_lsn = change->lsn; > > dlist_delete(&change->node); > > ReorderBufferReturnChange(rb, change); > > Should this be done insider ReorderBufferSerializeChange itself, instead > of in its caller? > makes sense. But, I think we should add a comment specifying the reason why it is important to set final_lsn while serializing the change. > Also, would it be sane to verify that the TXN > doesn't already have a newer final_lsn? Maybe as an Assert. > I don't think this is a good idea because we update the final_lsn with commit_lsn in ReorderBufferCommit after which we can try to serialize the remaining changes. Instead, we should update it only if the change_lsn value is greater than final_lsn. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
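Putting the suggestions above together (set final_lsn while each change is serialized, do it inside ReorderBufferSerializeChange, and never move it backwards), the change under discussion would look roughly like the following; this is only a sketch of the idea, not necessarily the committed hunk:

/*
 * At the end of ReorderBufferSerializeChange(), after the change has been
 * written to its spill file: remember the LSN of the last change spilled
 * to disk, so that after a crash ReorderBufferAbortOld() and
 * ReorderBufferRestoreCleanup() still know how far the spill files extend
 * even though txn->changes is empty.  Don't move final_lsn backwards: it
 * may already hold the commit LSN if remaining changes are serialized at
 * commit time.
 */
if (txn->final_lsn < change->lsn)
    txn->final_lsn = change->lsn;

Guarding with the comparison, rather than asserting that final_lsn is still unset, avoids clobbering a commit_lsn that may already have been recorded, which is the concern raised above.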
On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > On 2019-Dec-27, vignesh C wrote: > > > > > I felt amit solution also solves the problem. Attached patch has the > > > fix based on the solution proposed. > > > Thoughts? > > > > This seems a sensible fix to me, though I didn't try to reproduce the > > failure. > > > > > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > } > > > > > > ReorderBufferSerializeChange(rb, txn, fd, change); > > > + txn->final_lsn = change->lsn; > > > dlist_delete(&change->node); > > > ReorderBufferReturnChange(rb, change); > > > > Should this be done insider ReorderBufferSerializeChange itself, instead > > of in its caller? > > > > makes sense. But, I think we should add a comment specifying the > reason why it is important to set final_lsn while serializing the > change. Fixed > > Also, would it be sane to verify that the TXN > > doesn't already have a newer final_lsn? Maybe as an Assert. > > > > I don't think this is a good idea because we update the final_lsn with > commit_lsn in ReorderBufferCommit after which we can try to serialize > the remaining changes. Instead, we should update it only if the > change_lsn value is greater than final_lsn. > Fixed. Thanks Alvaro & Amit for your suggestions. I have made the changes based on your suggestions. Please find the updated patch for the same. I have also verified the patch in back branches. Separate patch was required for Release-10 branch, patch for the same is attached as 0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 31, 2019 at 11:35 AM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > > On 2019-Dec-27, vignesh C wrote: > > > > > > > I felt amit solution also solves the problem. Attached patch has the > > > > fix based on the solution proposed. > > > > Thoughts? > > > > > > This seems a sensible fix to me, though I didn't try to reproduce the > > > failure. > > > > > > > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) > > > > } > > > > > > > > ReorderBufferSerializeChange(rb, txn, fd, change); > > > > + txn->final_lsn = change->lsn; > > > > dlist_delete(&change->node); > > > > ReorderBufferReturnChange(rb, change); > > > > > > Should this be done insider ReorderBufferSerializeChange itself, instead > > > of in its caller? > > > > > > > makes sense. But, I think we should add a comment specifying the > > reason why it is important to set final_lsn while serializing the > > change. > > Fixed > > > > Also, would it be sane to verify that the TXN > > > doesn't already have a newer final_lsn? Maybe as an Assert. > > > > > > > I don't think this is a good idea because we update the final_lsn with > > commit_lsn in ReorderBufferCommit after which we can try to serialize > > the remaining changes. Instead, we should update it only if the > > change_lsn value is greater than final_lsn. > > > > Fixed. > Thanks Alvaro & Amit for your suggestions. I have made the changes > based on your suggestions. Please find the updated patch for the same. > I have also verified the patch in back branches. Separate patch was > required for Release-10 branch, patch for the same is attached as > 0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch. > Thoughts? One minor comment. Otherwise, the patch looks fine to me. + /* + * We set final_lsn on a transaction when we decode its commit or abort + * record, but we never see those records for crashed transactions. To + * ensure cleanup of these transactions, set final_lsn to that of their + * last change; this causes ReorderBufferRestoreCleanup to do the right + * thing. Final_lsn would have been set with commit_lsn earlier when we + * decode it commit, no need to update in that case + */ + if (txn->final_lsn < change->lsn) + txn->final_lsn = change->lsn; /decode it commit,/decode its commit, -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > One minor comment. Otherwise, the patch looks fine to me. > + /* > + * We set final_lsn on a transaction when we decode its commit or abort > + * record, but we never see those records for crashed transactions. To > + * ensure cleanup of these transactions, set final_lsn to that of their > + * last change; this causes ReorderBufferRestoreCleanup to do the right > + * thing. Final_lsn would have been set with commit_lsn earlier when we > + * decode it commit, no need to update in that case > + */ > + if (txn->final_lsn < change->lsn) > + txn->final_lsn = change->lsn; > > /decode it commit,/decode its commit, > Thanks Dilip for reviewing. I have fixed the comments you have suggested. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 17, 2020 at 7:42 AM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > One minor comment. Otherwise, the patch looks fine to me. > > + /* > > + * We set final_lsn on a transaction when we decode its commit or abort > > + * record, but we never see those records for crashed transactions. To > > + * ensure cleanup of these transactions, set final_lsn to that of their > > + * last change; this causes ReorderBufferRestoreCleanup to do the right > > + * thing. Final_lsn would have been set with commit_lsn earlier when we > > + * decode it commit, no need to update in that case > > + */ > > + if (txn->final_lsn < change->lsn) > > + txn->final_lsn = change->lsn; > > > > /decode it commit,/decode its commit, > > > > Thanks Dilip for reviewing. > I have fixed the comments you have suggested. > Thanks for the updated patch. It looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 2020-Jan-17, vignesh C wrote: > Thanks Dilip for reviewing. > I have fixed the comments you have suggested. I ended up rewording that comment completely; I thought the original was not explaining things well enough. I also changed the comment for final_lsn in reorderbuffer.h: not only do I remove the line that I added in df9f682c7bf8 (the previous bugfix), but I also remove the line that says "error during decoding", which came in with the very first logical decoding commit (b89e151054a); I couldn't find any evidence that final_lsn is being set on errors of any kind (other than transaction abort, which doesn't seem an "error" in that sense.) Please give these comments a read; maybe I have misunderstood something and my comment is wrong. Pushed now to all branches. Thanks, Vignesh, Amit and Dilip. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 18, 2020 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2020-Jan-17, vignesh C wrote: > > > Thanks Dilip for reviewing. > > I have fixed the comments you have suggested. > > I ended up rewording that comment completely; I thought the original was > not explaining things well enough. > > I also changed the comment for final_lsn in reorderbuffer.h: not only I > remove the line that I added in df9f682c7bf8 (the previous bugfix), but > I also remove the line that says "error during decoding", which came in > with the very first logical decoding commit (b89e151054a); I couldn't > find any evidence that final_lsn is being set on errors of any kind > (other than transaction abort, which doesn't seem an "error" in that > sense.) > > Please give these comments a read; maybe I have misunderstood something > and my comment is wrong. > The comments added by you look correct and good to me. Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 18, 2020 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2020-Jan-17, vignesh C wrote: > > > Thanks Dilip for reviewing. > > I have fixed the comments you have suggested. > > I ended up rewording that comment completely; I thought the original was > not explaining things well enough. > > I also changed the comment for final_lsn in reorderbuffer.h: not only I > remove the line that I added in df9f682c7bf8 (the previous bugfix), but > I also remove the line that says "error during decoding", which came in > with the very first logical decoding commit (b89e151054a); I couldn't > find any evidence that final_lsn is being set on errors of any kind > (other than transaction abort, which doesn't seem an "error" in that > sense.) > > Please give these comments a read; maybe I have misunderstood something > and my comment is wrong. > > Pushed now to all branches. Thanks, Vignesh, Amit and Dilip. > Thanks Alvaro for pushing this patch. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com