Thread: Reorderbuffer crash during recovery

Reorderbuffer crash during recovery

From
vignesh C
Date:
Hi,

I found a couple of crashes in the reorderbuffer while reviewing and
testing logical_work_mem and logical streaming of large in-progress
transactions.  The stack traces are given below:
Issue 1:
#0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
#1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
    fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332
#6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
record=0x3342770) at decode.c:121
#7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
#8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
<XLogSendLogical>) at walsender.c:2199
#9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
walsender.c:1128
#10 0x0000000000b6ce83 in exec_replication_command
(cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
    at walsender.c:1545
#11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
postgres.c:4256
#12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
#13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
#14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
#15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
postmaster.c:1400
#16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210

Issue 2:
#0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
    fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

From initial analysis it looks like:
Issue 1 seems to occur when all of the reorderbuffer changes have been
flushed (serialized) to disk and the server then restarts.
Issue 2 seems to occur when there are many subtransactions present and
the server then restarts.  The subtransaction's final_lsn is not being
set, so when ReorderBufferRestoreCleanup is called the assert fails.
Maybe for this we have to set the subtransaction's final_lsn before
cleanup (not sure).

I could not reproduce these issues consistently with a test case, but
from code review they look like genuine problems.

For issue 1, I could reproduce it with the following steps:
1) Change ReorderBufferCheckSerializeTXN so that it always serializes
(flushes) the transaction; a sketch of such a change is given below.
2) Keep many transactions with subtransactions open.
3) Attach gdb to one of those backends and call abort().
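
For reference, a minimal sketch of the kind of one-line hack meant in
step 1.  It is for reproduction only, paraphrased from reorderbuffer.c
from memory, so the exact condition may differ between branches:

/*
 * Hack for reproduction only: spill the transaction to disk after every
 * queued change instead of waiting for max_changes_in_memory, so the
 * in-memory changes list is left empty almost all of the time.
 */
static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	/* original condition: txn->nentries_mem >= max_changes_in_memory */
	if (true)
	{
		ReorderBufferSerializeTXN(rb, txn);
		Assert(txn->nentries_mem == 0);
	}
}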

I'm not sure of the fix for this.  If I get time, I will spend more
time trying to find one.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
Dilip Kumar
Date:
On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Hi,
>
> I found couple of crashes in reorderbuffer while review/testing of
> logical_work_mem and logical streaming of large in-progress
> transactions. Stack trace of the same are given below:
> Issue 1:
> #0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> #1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
>     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
> #6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> record=0x3342770) at decode.c:121
> #7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> #8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> <XLogSendLogical>) at walsender.c:2199
> #9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> walsender.c:1128
> #10 0x0000000000b6ce83 in exec_replication_command
> (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> (proto_version '1', publication_names '\"pub1\"')")
>     at walsender.c:1545
> #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> postgres.c:4256
> #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> postmaster.c:1400
> #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
>
> Issue 2:
> #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
> #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> From initial analysis it looks like:
> Issue1 it seems like if all the reorderbuffer has been flushed and
> then the server restarts. This problem occurs.
> Issue 2 it seems like if there are many subtransactions present and
> then the server restarts. This problem occurs. The subtransaction's
> final_lsn is not being set and when ReorderBufferRestoreCleanup is
> called the assert fails. May be for this we might have to set the
> subtransaction's final_lsn before cleanup(not sure).
>
> I could not reproduce this issue consistently with a test case, But I
> felt this looks like a problem from review.
>
> For issue1, I could reproduce by the following steps:
> 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> 2) Have many open transactions with subtransactions open.
> 3) Attach one of the transaction from gdb and call abort().

Do you need subtransactions for issue 1? It appears that after the
restart, if the changes list is empty, it will hit the assert.  Am I
missing something?

>
> I'm not sure of the fix for this. If I get time I will try to spend
> more time to find out the fix.
> Thoughts?
>


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > Hi,
> >
> > I found couple of crashes in reorderbuffer while review/testing of
> > logical_work_mem and logical streaming of large in-progress
> > transactions. Stack trace of the same are given below:
> > Issue 1:
> > #0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> > #1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec514d in ExceptionalCondition
> > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> > "FailedAssertion",
> >     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> > off=48) at ../../../../src/include/lib/ilist.h:458
> > #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> > oldestRunningXid=895) at reorderbuffer.c:1910
> > #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> > buf=0x7fff7e7b1e40) at decode.c:332
> > #6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> > record=0x3342770) at decode.c:121
> > #7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> > #8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> > <XLogSendLogical>) at walsender.c:2199
> > #9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> > walsender.c:1128
> > #10 0x0000000000b6ce83 in exec_replication_command
> > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> > (proto_version '1', publication_names '\"pub1\"')")
> >     at walsender.c:1545
> > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> > postgres.c:4256
> > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> > postmaster.c:1400
> > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
> >
> > Issue 2:
> > #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> > #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec4e1d in ExceptionalCondition
> > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> > errorType=0x10ea284 "FailedAssertion",
> >     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> > #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:3052
> > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:1318
> > #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2b9d778) at reorderbuffer.c:1257
> > #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> > oldestRunningXid=3835) at reorderbuffer.c:1973
> > #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> > buf=0x7ffcbc74cc00) at decode.c:332
> > #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> > record=0x2b67990) at decode.c:121
> > #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
> >
> > From initial analysis it looks like:
> > Issue1 it seems like if all the reorderbuffer has been flushed and
> > then the server restarts. This problem occurs.
> > Issue 2 it seems like if there are many subtransactions present and
> > then the server restarts. This problem occurs. The subtransaction's
> > final_lsn is not being set and when ReorderBufferRestoreCleanup is
> > called the assert fails. May be for this we might have to set the
> > subtransaction's final_lsn before cleanup(not sure).
> >
> > I could not reproduce this issue consistently with a test case, But I
> > felt this looks like a problem from review.
> >
> > For issue1, I could reproduce by the following steps:
> > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> > 2) Have many open transactions with subtransactions open.
> > 3) Attach one of the transaction from gdb and call abort().
>
> Do you need subtransactions for the issue1? It appears that after the
> restart if the changes list is empty it will hit the assert.  Am I
> missing something?
>

When I reported this issue I had reproduced it with subtransactions.
I have now tried without subtransactions and could still reproduce it.
You are right: issue 1 appears both with and without subtransactions.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
Dilip Kumar
Date:
On Thu, Nov 7, 2019 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I found couple of crashes in reorderbuffer while review/testing of
> > > logical_work_mem and logical streaming of large in-progress
> > > transactions. Stack trace of the same are given below:
> > > Issue 1:
> > > #0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> > > #1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> > > #2  0x0000000000ec514d in ExceptionalCondition
> > > (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> > > "FailedAssertion",
> > >     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> > > lineNumber=458) at assert.c:54
> > > #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> > > off=48) at ../../../../src/include/lib/ilist.h:458
> > > #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> > > oldestRunningXid=895) at reorderbuffer.c:1910
> > > #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> > > buf=0x7fff7e7b1e40) at decode.c:332
> > > #6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> > > record=0x3342770) at decode.c:121
> > > #7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> > > #8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> > > <XLogSendLogical>) at walsender.c:2199
> > > #9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> > > walsender.c:1128
> > > #10 0x0000000000b6ce83 in exec_replication_command
> > > (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> > > (proto_version '1', publication_names '\"pub1\"')")
> > >     at walsender.c:1545
> > > #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> > > dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> > > postgres.c:4256
> > > #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> > > #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> > > #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> > > #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> > > postmaster.c:1400
> > > #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
> > >
> > > Issue 2:
> > > #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> > > #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> > > #2  0x0000000000ec4e1d in ExceptionalCondition
> > > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> > > errorType=0x10ea284 "FailedAssertion",
> > >     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> > > #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> > > txn=0x2bafb08) at reorderbuffer.c:3052
> > > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > > txn=0x2bafb08) at reorderbuffer.c:1318
> > > #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > > txn=0x2b9d778) at reorderbuffer.c:1257
> > > #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> > > oldestRunningXid=3835) at reorderbuffer.c:1973
> > > #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> > > buf=0x7ffcbc74cc00) at decode.c:332
> > > #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> > > record=0x2b67990) at decode.c:121
> > > #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
> > >
> > > From initial analysis it looks like:
> > > Issue1 it seems like if all the reorderbuffer has been flushed and
> > > then the server restarts. This problem occurs.
> > > Issue 2 it seems like if there are many subtransactions present and
> > > then the server restarts. This problem occurs. The subtransaction's
> > > final_lsn is not being set and when ReorderBufferRestoreCleanup is
> > > called the assert fails. May be for this we might have to set the
> > > subtransaction's final_lsn before cleanup(not sure).
> > >
> > > I could not reproduce this issue consistently with a test case, But I
> > > felt this looks like a problem from review.
> > >
> > > For issue1, I could reproduce by the following steps:
> > > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> > > 2) Have many open transactions with subtransactions open.
> > > 3) Attach one of the transaction from gdb and call abort().
> >
> > Do you need subtransactions for the issue1? It appears that after the
> > restart if the changes list is empty it will hit the assert.  Am I
> > missing something?
> >
>
> When I had reported this issue I could reproduce this issue with
> sub-transactions. Now I have tried without using sub-transactions and
> could still reproduce this issue. You are right Issue 1 will appear in
> both the cases with and without subtransactions.

Okay, thanks for the confirmation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
Tomas Vondra
Date:
On Thu, Nov 07, 2019 at 11:01:17AM +0530, Dilip Kumar wrote:
>On Thu, Nov 7, 2019 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
>>
>> On Wed, Nov 6, 2019 at 5:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Wed, Nov 6, 2019 at 5:20 PM vignesh C <vignesh21@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > ...
>> > >
>> > > Issue1 it seems like if all the reorderbuffer has been flushed and
>> > > then the server restarts. This problem occurs.
>> > > Issue 2 it seems like if there are many subtransactions present and
>> > > then the server restarts. This problem occurs. The subtransaction's
>> > > final_lsn is not being set and when ReorderBufferRestoreCleanup is
>> > > called the assert fails. May be for this we might have to set the
>> > > subtransaction's final_lsn before cleanup(not sure).
>> > >
>> > > I could not reproduce this issue consistently with a test case, But I
>> > > felt this looks like a problem from review.
>> > >
>> > > For issue1, I could reproduce by the following steps:
>> > > 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
>> > > 2) Have many open transactions with subtransactions open.
>> > > 3) Attach one of the transaction from gdb and call abort().
>> >
>> > Do you need subtransactions for the issue1? It appears that after the
>> > restart if the changes list is empty it will hit the assert.  Am I
>> > missing something?
>> >
>>
>> When I had reported this issue I could reproduce this issue with
>> sub-transactions. Now I have tried without using sub-transactions and
>> could still reproduce this issue. You are right Issue 1 will appear in
>> both the cases with and without subtransactions.
>
>Okay, thanks for the confirmation.
>

I'm a bit confused - does this happen only with the logical_work_mem
patches, or with clean master too? If only with the patches, which
version exactly?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reorderbuffer crash during recovery

From
Amit Kapila
Date:
On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> I'm a bit confused - does this happen only with the logical_work_mem
> patches, or with clean master too?
>

This occurs with clean master as well.  It is a problem in the base
code, revealed while doing stress testing of the logical_work_mem
patches.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
Andres Freund
Date:
Hi,

On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > I'm a bit confused - does this happen only with the logical_work_mem
> > patches, or with clean master too?
> >
> 
> This occurs with the clean master.  This is a base code problem
> revealed while doing stress testing of logical_work_mem patches.

As far as I can tell there are no repro steps included? Any chance to
get those?

Greetings,

Andres Freund



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > I'm a bit confused - does this happen only with the logical_work_mem
> > > patches, or with clean master too?
> > >
> >
> > This occurs with the clean master.  This is a base code problem
> > revealed while doing stress testing of logical_work_mem patches.
>
> As far as I can tell there are no repro steps included? Any chance to
> get those?
>

This problem does not occur consistently.  I found the crashes while reviewing and testing "logical streaming for large in-progress transactions" [1].

This issue does not occur directly: it shows up only after some other crash of a postgres process (one not caused by the reorderbuffer but by some other problem); the reorderbuffer crash then appears after that original, unrelated crash.

To simplify reproducing the reorderbuffer crash, I used the following steps:
1) Make a logical replication setup with a publisher/subscriber for some table.
2) Prepare an SQL file with the following:
begin;
4096 insert statements;
select pg_sleep(120);
3) Execute the above script.
4) Attach gdb to the backend running it while pg_sleep is in progress.
5) Call abort() from the attached gdb.
6) After some time there will be many core files in the publisher's data directory.

[1] https://commitfest.postgresql.org/25/1927/

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Reorderbuffer crash during recovery

From
Amit Kapila
Date:
On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > > <tomas.vondra@2ndquadrant.com> wrote:
> > > >
> > > > I'm a bit confused - does this happen only with the logical_work_mem
> > > > patches, or with clean master too?
> > > >
> > >
> > > This occurs with the clean master.  This is a base code problem
> > > revealed while doing stress testing of logical_work_mem patches.
> >
> > As far as I can tell there are no repro steps included? Any chance to
> > get those?
> >

I think it will be a bit tricky to write a test case, but the call
stack indicates the problem.  For example, for issue 1, see the below
part of the call stack:

#2  0x0000000000ec514d in ExceptionalCondition
(conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
"FailedAssertion",
    fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
off=48) at ../../../../src/include/lib/ilist.h:458
#4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
oldestRunningXid=895) at reorderbuffer.c:1910
#5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
buf=0x7fff7e7b1e40) at decode.c:332

The assertion failure indicates that the changes list is empty when we
are trying to get the lsn of the last change.  I think this is possible
when the server is restarted immediately after the transaction's
changes have been serialized (say it crashed, or somebody switched it
off and restarted it).  The reason is that after serializing the
changes, the changes list will be empty and the serialized flag for the
txn will be set.  I am not sure there is any reason to set final_lsn at
all if the changes list is empty; a sketch of such a guard is below.
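
To make that concrete, here is a rough sketch (illustration only, not a
proposed patch; the surrounding ReorderBufferAbortOld() code is
paraphrased from memory) of guarding the final_lsn assignment when the
changes list is empty:

	/*
	 * Sketch only: derive final_lsn from the last in-memory change
	 * only when there actually is one.  Right after the changes have
	 * been spilled to disk the list is empty, which is what the
	 * failed Assert in dlist_tail_element_off() complains about.
	 */
	if (txn->final_lsn == InvalidXLogRecPtr &&
		!dlist_is_empty(&txn->changes))
	{
		ReorderBufferChange *last =
			dlist_tail_element(ReorderBufferChange, node, &txn->changes);

		txn->final_lsn = last->lsn;
	}

Whether skipping the assignment entirely is safe for the on-disk
cleanup path is exactly the open question above.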

Similarly, if we see the call stack of issue-2, the problem is clear.

#2  0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
    fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973

I think this also has a symptom similar to the prior issue, but for
subtransactions.  ReorderBufferAbortOld() sets the final_lsn of the
toplevel transaction but not of its subtransactions, yet later in that
path ReorderBufferRestoreCleanup() expects it to be set even for a
subtransaction.  Is that Assert in ReorderBufferRestoreCleanup()
required at all, given that immediately after the assert we are anyway
setting the value of final_lsn?  The other option, setting the
subtransactions' final_lsn before cleanup, is sketched below.
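
And for issue 2, a rough sketch (again illustration only, not a
proposed patch) of that other option, i.e. giving each subtransaction
some final_lsn before ReorderBufferCleanupTXN() recurses into it from
ReorderBufferAbortOld():

	/*
	 * Sketch only: before cleaning up a crashed toplevel transaction
	 * whose commit/abort record was never decoded, copy its final_lsn
	 * down to any subtransaction that never got one, so that
	 * ReorderBufferRestoreCleanup() can find the spilled files.
	 */
	dlist_iter	subtxn_i;

	dlist_foreach(subtxn_i, &txn->subtxns)
	{
		ReorderBufferTXN *subtxn;

		subtxn = dlist_container(ReorderBufferTXN, node, subtxn_i.cur);

		if (subtxn->final_lsn == InvalidXLogRecPtr)
			subtxn->final_lsn = txn->final_lsn;
	}

	ReorderBufferCleanupTXN(rb, txn);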

>
> This problem does not occur consistently. When I was reviewing and testing "logical streaming for large
> in-progress transactions" [1] I found the crashes.
>
> This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres
> process (not from reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this
> reorderbuffer crash appears.
>
> To simplify the reorderbuffer crash, I used the following steps:
> 1) Make replication setup with publisher/subscriber for some table
> 2) Prepare a sql file with the below:
> begin;
> 4096 insert statements;
> select pg_sleep(120)
> 3) Execute the above script.
> 4) Attach the postgres process when pg_sleep is in progress.
> 5) call abort() from attached gdb.
>

Isn't it important to call this abort immediately after the changes
are serialized as explained above?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Sat, Nov 9, 2019 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 8, 2019 at 10:05 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2019 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2019-11-07 17:03:44 +0530, Amit Kapila wrote:
> > > > On Thu, Nov 7, 2019 at 4:48 PM Tomas Vondra
> > > > <tomas.vondra@2ndquadrant.com> wrote:
> > > > >
> > > > > I'm a bit confused - does this happen only with the logical_work_mem
> > > > > patches, or with clean master too?
> > > > >
> > > >
> > > > This occurs with the clean master.  This is a base code problem
> > > > revealed while doing stress testing of logical_work_mem patches.
> > >
> > > As far as I can tell there are no repro steps included? Any chance to
> > > get those?
> > >
>
> I think it will be bit tricky to write a test case, but the callstack
> indicates the problem.  For ex. for issue-1, see the below part of
> callstack,
>
> #2  0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
>     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
>
> The Assertion failure indicates that the changes list is empty when we
> are trying to get 'lsn' of the last change.  I think this is possible
> when immediately after serializing the transaction changes the server
> got restarted (say it crashed or somebody switched off and restarted
> it).  The reason is that after serializing the changes, the changes
> list will be empty and serialized flag for txn will be set.  I am not
> if there is any reason to set the final_lsn if the changes list is
> empty.
>
> Similarly, if we see the call stack of issue-2, the problem is clear.
>
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
>
> I think this also has a symptom similar to the prior issue but for
> sub-transactions.  The ReorderBufferAbortOld() tries to set the
> final_lsn of toplevel transaction, but not for subtransactions, later
> in that path ReorderBufferRestoreCleanup expects it to be set even for
> subtransaction.  Is that Assert in  ReorderBufferRestoreCleanup()
> required, because immediately after assert, we are anyway setting the
> value of final_lsn.
>

Thanks, Amit, for the further analysis of the issues.

> >
> > This problem does not occur consistently. When I was reviewing and testing "logical streaming for large
> > in-progress transactions" [1] I found the crashes.
> >
> > This issue does not occur directly, meaning this issue will occur only when some crash occurs in postgres
> > process (not from reorderbuffer but due to some other issue), after the original non-reorderbuffer crash this
> > reorderbuffer crash appears.
> >
> > To simplify the reorderbuffer crash, I used the following steps:
> > 1) Make replication setup with publisher/subscriber for some table
> > 2) Prepare a sql file with the below:
> > begin;
> > 4096 insert statements;
> > select pg_sleep(120)
> > 3) Execute the above script.
> > 4) Attach the postgres process when pg_sleep is in progress.
> > 5) call abort() from attached gdb.
> >
>
> Isn't it important to call this abort immediately after the changes
> are serialized as explained above?
>

Since we perform exactly 4096 insert statements on a fresh replication
setup, the transaction gets serialized immediately after the 4096th
insert.  The variable max_changes_in_memory is initialized to 4096 in
reorderbuffer.c, and txn->nentries_mem is incremented for every insert
statement in ReorderBufferQueueChange().  After the 4096 inserts,
txn->nentries_mem reaches 4096, which triggers serialization from
ReorderBufferCheckSerializeTXN().  Attaching gdb at that point and
calling abort() then reproduces the issue consistently.
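
For anyone trying to reproduce this the same way, the flow described
above looks roughly like the following (paraphrased from
reorderbuffer.c from memory, so details may differ):

/*
 * Rough paraphrase: every decoded change is queued into the transaction
 * and counted, and once the in-memory count reaches
 * max_changes_in_memory (4096) the whole transaction is spilled to
 * disk, leaving txn->changes empty and txn->serialized set.
 */
void
ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid,
						 XLogRecPtr lsn, ReorderBufferChange *change)
{
	ReorderBufferTXN *txn;

	txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

	change->lsn = lsn;
	dlist_push_tail(&txn->changes, &change->node);
	txn->nentries++;
	txn->nentries_mem++;

	/* spills the whole transaction once nentries_mem hits 4096 */
	ReorderBufferCheckSerializeTXN(rb, txn);
}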

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
> I found couple of crashes in reorderbuffer while review/testing of
> logical_work_mem and logical streaming of large in-progress
> transactions. Stack trace of the same are given below:
> Issue 1:
> #0  0x00007f985c7d8337 in raise () from /lib64/libc.so.6
> #1  0x00007f985c7d9a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec514d in ExceptionalCondition
> (conditionName=0x10eab34 "!dlist_is_empty(head)", errorType=0x10eab24
> "FailedAssertion",
>     fileName=0x10eab00 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fd13 in dlist_tail_element_off (head=0x338fe60,
> off=48) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b547b7 in ReorderBufferAbortOld (rb=0x32ae7a0,
> oldestRunningXid=895) at reorderbuffer.c:1910
> #5  0x0000000000b3cb5e in DecodeStandbyOp (ctx=0x33424b0,
> buf=0x7fff7e7b1e40) at decode.c:332
> #6  0x0000000000b3c363 in LogicalDecodingProcessRecord (ctx=0x33424b0,
> record=0x3342770) at decode.c:121
> #7  0x0000000000b704b2 in XLogSendLogical () at walsender.c:2845
> #8  0x0000000000b6e9f8 in WalSndLoop (send_data=0xb7038b
> <XLogSendLogical>) at walsender.c:2199
> #9  0x0000000000b6bbf5 in StartLogicalReplication (cmd=0x33167a8) at
> walsender.c:1128
> #10 0x0000000000b6ce83 in exec_replication_command
> (cmd_string=0x328a0a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
> (proto_version '1', publication_names '\"pub1\"')")
>     at walsender.c:1545
> #11 0x0000000000c39f85 in PostgresMain (argc=1, argv=0x32b51c0,
> dbname=0x32b50e0 "testdb", username=0x32b50c0 "user1") at
> postgres.c:4256
> #12 0x0000000000b10dc7 in BackendRun (port=0x32ad890) at postmaster.c:4498
> #13 0x0000000000b0ff3e in BackendStartup (port=0x32ad890) at postmaster.c:4189
> #14 0x0000000000b08505 in ServerLoop () at postmaster.c:1727
> #15 0x0000000000b0781a in PostmasterMain (argc=3, argv=0x3284cb0) at
> postmaster.c:1400
> #16 0x000000000097492d in main (argc=3, argv=0x3284cb0) at main.c:210
>
> Issue 2:
> #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
> #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> From initial analysis it looks like:
> Issue1 it seems like if all the reorderbuffer has been flushed and
> then the server restarts. This problem occurs.
> Issue 2 it seems like if there are many subtransactions present and
> then the server restarts. This problem occurs. The subtransaction's
> final_lsn is not being set and when ReorderBufferRestoreCleanup is
> called the assert fails. May be for this we might have to set the
> subtransaction's final_lsn before cleanup(not sure).
>
> I could not reproduce this issue consistently with a test case, But I
> felt this looks like a problem from review.
>
> For issue1, I could reproduce by the following steps:
> 1) Change ReorderBufferCheckSerializeTXN so that it gets flushed always.
> 2) Have many open transactions with subtransactions open.
> 3) Attach one of the transaction from gdb and call abort().
>
> I'm not sure of the fix for this. If I get time I will try to spend
> more time to find out the fix.

I have analyzed the issue further and found the following:
After abort() is called and the process restarts, it cleans up the
reorderbuffer information for the aborted transactions in the
ReorderBufferAbortOld function. It crashes in the code below because
no changes are present in memory any more, all of them having been
serialized:
.......
if (txn->serialized && txn->final_lsn == 0)
{
	ReorderBufferChange *last =
		dlist_tail_element(ReorderBufferChange, node, &txn->changes);

	txn->final_lsn = last->lsn;
}
.......

It sets final_lsn here so that ReorderBufferRestoreCleanup can iterate
from start_lsn to final_lsn and clean up the serialized files. One
solution we were considering is to store the lsn of the last
serialized change while serializing, and to use it to set final_lsn in
the case that currently crashes, as in the code below:
......
if (txn->serialized && txn->final_lsn == 0 &&
	!dlist_is_empty(&txn->changes))
{
	ReorderBufferChange *last =
		dlist_tail_element(ReorderBufferChange, node, &txn->changes);

	txn->final_lsn = last->lsn;
}
else
{
	/*
	 * If there are no changes present as all of the changes were
	 * serialized, use the last lsn that was serialized.
	 */
	txn->final_lsn = txn->current_serialized_lsn;
}
......
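
To illustrate the recording side of that idea, the serialization loop
in ReorderBufferSerializeTXN would roughly look like the sketch below
(current_serialized_lsn is the proposed new field, not existing code;
the spill-file handling is elided):

	dlist_foreach_modify(iter, &txn->changes)
	{
		ReorderBufferChange *change;

		change = dlist_container(ReorderBufferChange, node, iter.cur);

		/* ... opening/rotating the spill file elided ... */

		ReorderBufferSerializeChange(rb, txn, fd, change);
		/* remember the LSN of the last change spilled to disk */
		txn->current_serialized_lsn = change->lsn;
		dlist_delete(&change->node);
		ReorderBufferReturnChange(rb, change);
	}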

I have tested the same scenario and verified that it works. The patch
for it is attached. I could not add a test case for this, as it
involves attaching gdb and calling abort().

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Reorderbuffer crash during recovery

From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:13 AM vignesh C <vignesh21@gmail.com> wrote:
>
>
> It sets the final_lsn here so that it can iterate from the start_lsn
> to final_lsn and cleanup the serialized files in
> ReorderBufferRestoreCleanup function. One solution We were thinking
> was to store the lsn of the last serialized change while serialiizing
> and set the final_lsn in the above case where it crashes like the
> below code:

Sure, we can do something along the lines of what you are suggesting,
but why can't we update final_lsn at the time of serializing the
changes?  If we do that, then we don't even need to compute it
separately during ReorderBufferAbortOld.

Let me try to explain the problem and the proposed solutions.
Currently, after serializing the changes we remove them from the
'changes' list of ReorderBufferTXN.  Now, if the system crashes for
any reason after that, we won't be able to compute final_lsn after the
restart.  That leads to the access violation in ReorderBufferAbortOld,
which tries to access the changes list of ReorderBufferTXN to compute
final_lsn.

We could fix it either by tracking the 'last_serialized_change' as
proposed by Vignesh, or by updating final_lsn while we serialize the
changes.
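
In code terms, the second option is just a small addition to the
serialization loop, roughly along these lines (a sketch, not the final
patch):

		ReorderBufferSerializeChange(rb, txn, fd, change);
		/*
		 * Keep final_lsn pointing at the last change written to disk, so
		 * that after a crash ReorderBufferAbortOld and
		 * ReorderBufferRestoreCleanup can still bound the transaction's
		 * spill files.
		 */
		txn->final_lsn = change->lsn;
		dlist_delete(&change->node);
		ReorderBufferReturnChange(rb, change);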

IIUC, commit df9f682c7bf81674b6ae3900fd0146f35df0ae2e [1] tried to fix
a related issue, which led to this other problem.  Alvaro, Andres, do
you have any suggestions?


[1] -
commit df9f682c7bf81674b6ae3900fd0146f35df0ae2e
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Fri Jan 5 12:17:10 2018 -0300

    Fix failure to delete spill files of aborted transactions

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Tue, Dec 17, 2019 at 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 11, 2019 at 11:13 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> >
> > It sets the final_lsn here so that it can iterate from the start_lsn
> > to final_lsn and cleanup the serialized files in
> > ReorderBufferRestoreCleanup function. One solution We were thinking
> > was to store the lsn of the last serialized change while serialiizing
> > and set the final_lsn in the above case where it crashes like the
> > below code:
>
> Sure, we can do something on the lines what you are suggesting, but
> why can't we update final_lsn at the time of serializing the changes?
> If we do that then we don't even need to compute it separately during
> ReorderBufferAbortOld.
>
> Let me try to explain the problem and proposed solutions for the same.
> Currently, after serializing the changes we remove the 'changes' from
> ReorderBufferTXN.  Now, if the system crashes due to any reason after
> that, we won't be able to compute final_lsn after the restart.  And
> that leads to access violation in ReorderBufferAbortOld which is
> trying to access changes list from ReorderBufferTXN to compute
> final_lsn.
>
> We could fix it by either tracking 'last_serialized_change' as
> proposed by Vignesh or we could update the final_lsn while we
> serialize the changes.
>

I felt Amit's solution also solves the problem. The attached patch has
the fix based on the proposed solution.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Reorderbuffer crash during recovery

From
Alvaro Herrera
Date:
On 2019-Dec-27, vignesh C wrote:

> I felt amit solution also solves the problem. Attached patch has the
> fix based on the solution proposed.
> Thoughts?

This seems a sensible fix to me, though I didn't try to reproduce the
failure.

> @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
>          }
>  
>          ReorderBufferSerializeChange(rb, txn, fd, change);
> +        txn->final_lsn = change->lsn;
>          dlist_delete(&change->node);
>          ReorderBufferReturnChange(rb, change);

Should this be done inside ReorderBufferSerializeChange itself, instead
of in its caller?  Also, would it be sane to verify that the TXN
doesn't already have a newer final_lsn?  Maybe as an Assert.

> @@ -188,8 +188,7 @@ typedef struct ReorderBufferTXN
>       * * plain abort record
>       * * prepared transaction abort
>       * * error during decoding
> -     * * for a crashed transaction, the LSN of the last change, regardless of
> -     *   what it was.
> +     * * last serialized lsn
>       * ----

I propose "for a transaction with serialized changes, the LSN of the
latest serialized one, unless one of the above cases."

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reorderbuffer crash during recovery

From
Amit Kapila
Date:
On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Dec-27, vignesh C wrote:
>
> > I felt amit solution also solves the problem. Attached patch has the
> > fix based on the solution proposed.
> > Thoughts?
>
> This seems a sensible fix to me, though I didn't try to reproduce the
> failure.
>
> > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> >               }
> >
> >               ReorderBufferSerializeChange(rb, txn, fd, change);
> > +             txn->final_lsn = change->lsn;
> >               dlist_delete(&change->node);
> >               ReorderBufferReturnChange(rb, change);
>
> Should this be done insider ReorderBufferSerializeChange itself, instead
> of in its caller?
>

makes sense.  But, I think we should add a comment specifying the
reason why it is important to set final_lsn while serializing the
change.

>  Also, would it be sane to verify that the TXN
> doesn't already have a newer final_lsn?  Maybe as an Assert.
>

I don't think this is a good idea because we update the final_lsn with
commit_lsn in ReorderBufferCommit after which we can try to serialize
the remaining changes.  Instead, we should update it only if the
change_lsn value is greater than final_lsn.
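
In other words, something along these lines in the serialization path
(a sketch):

	/*
	 * A plain Assert(txn->final_lsn <= change->lsn) could fire here,
	 * because final_lsn may already hold commit_lsn by the time
	 * ReorderBufferCommit spills the remaining changes.  So only ever
	 * advance final_lsn, never move it backwards.
	 */
	if (txn->final_lsn < change->lsn)
		txn->final_lsn = change->lsn;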

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > On 2019-Dec-27, vignesh C wrote:
> >
> > > I felt amit solution also solves the problem. Attached patch has the
> > > fix based on the solution proposed.
> > > Thoughts?
> >
> > This seems a sensible fix to me, though I didn't try to reproduce the
> > failure.
> >
> > > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > >               }
> > >
> > >               ReorderBufferSerializeChange(rb, txn, fd, change);
> > > +             txn->final_lsn = change->lsn;
> > >               dlist_delete(&change->node);
> > >               ReorderBufferReturnChange(rb, change);
> >
> > Should this be done insider ReorderBufferSerializeChange itself, instead
> > of in its caller?
> >
>
> makes sense.  But, I think we should add a comment specifying the
> reason why it is important to set final_lsn while serializing the
> change.

Fixed

> >  Also, would it be sane to verify that the TXN
> > doesn't already have a newer final_lsn?  Maybe as an Assert.
> >
>
> I don't think this is a good idea because we update the final_lsn with
> commit_lsn in ReorderBufferCommit after which we can try to serialize
> the remaining changes.  Instead, we should update it only if the
> change_lsn value is greater than final_lsn.
>

Fixed.
Thanks Alvaro & Amit for your suggestions. I have made the changes
based on them; please find the updated patch attached. I have also
verified the patch on the back branches. A separate patch was required
for the REL_10 branch; it is attached as
0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Reorderbuffer crash during recovery

From
Dilip Kumar
Date:
On Tue, Dec 31, 2019 at 11:35 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Dec 27, 2019 at 8:37 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > >
> > > On 2019-Dec-27, vignesh C wrote:
> > >
> > > > I felt amit solution also solves the problem. Attached patch has the
> > > > fix based on the solution proposed.
> > > > Thoughts?
> > >
> > > This seems a sensible fix to me, though I didn't try to reproduce the
> > > failure.
> > >
> > > > @@ -2472,6 +2457,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > >               }
> > > >
> > > >               ReorderBufferSerializeChange(rb, txn, fd, change);
> > > > +             txn->final_lsn = change->lsn;
> > > >               dlist_delete(&change->node);
> > > >               ReorderBufferReturnChange(rb, change);
> > >
> > > Should this be done insider ReorderBufferSerializeChange itself, instead
> > > of in its caller?
> > >
> >
> > makes sense.  But, I think we should add a comment specifying the
> > reason why it is important to set final_lsn while serializing the
> > change.
>
> Fixed
>
> > >  Also, would it be sane to verify that the TXN
> > > doesn't already have a newer final_lsn?  Maybe as an Assert.
> > >
> >
> > I don't think this is a good idea because we update the final_lsn with
> > commit_lsn in ReorderBufferCommit after which we can try to serialize
> > the remaining changes.  Instead, we should update it only if the
> > change_lsn value is greater than final_lsn.
> >
>
> Fixed.
> Thanks Alvaro & Amit for your suggestions. I have made the changes
> based on your suggestions. Please find the updated patch for the same.
> I have also verified the patch in back branches. Separate patch was
> required for Release-10 branch, patch for the same is attached as
> 0001-Reorder-buffer-crash-while-aborting-old-transactions-REL_10.patch.
> Thoughts?

One minor comment.  Otherwise, the patch looks fine to me.
+ /*
+ * We set final_lsn on a transaction when we decode its commit or abort
+ * record, but we never see those records for crashed transactions.  To
+ * ensure cleanup of these transactions, set final_lsn to that of their
+ * last change; this causes ReorderBufferRestoreCleanup to do the right
+ * thing. Final_lsn would have been set with commit_lsn earlier when we
+ * decode it commit, no need to update in that case
+ */
+ if (txn->final_lsn < change->lsn)
+ txn->final_lsn = change->lsn;

/decode it commit,/decode its commit,

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> One minor comment.  Otherwise, the patch looks fine to me.
> + /*
> + * We set final_lsn on a transaction when we decode its commit or abort
> + * record, but we never see those records for crashed transactions.  To
> + * ensure cleanup of these transactions, set final_lsn to that of their
> + * last change; this causes ReorderBufferRestoreCleanup to do the right
> + * thing. Final_lsn would have been set with commit_lsn earlier when we
> + * decode it commit, no need to update in that case
> + */
> + if (txn->final_lsn < change->lsn)
> + txn->final_lsn = change->lsn;
>
> /decode it commit,/decode its commit,
>

Thanks Dilip for reviewing.
I have fixed the comments you have suggested.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Reorderbuffer crash during recovery

From
Dilip Kumar
Date:
On Fri, Jan 17, 2020 at 7:42 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Jan 16, 2020 at 9:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > One minor comment.  Otherwise, the patch looks fine to me.
> > + /*
> > + * We set final_lsn on a transaction when we decode its commit or abort
> > + * record, but we never see those records for crashed transactions.  To
> > + * ensure cleanup of these transactions, set final_lsn to that of their
> > + * last change; this causes ReorderBufferRestoreCleanup to do the right
> > + * thing. Final_lsn would have been set with commit_lsn earlier when we
> > + * decode it commit, no need to update in that case
> > + */
> > + if (txn->final_lsn < change->lsn)
> > + txn->final_lsn = change->lsn;
> >
> > /decode it commit,/decode its commit,
> >
>
> Thanks Dilip for reviewing.
> I have fixed the comments you have suggested.
>
Thanks for the updated patch.  It looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
Alvaro Herrera
Date:
On 2020-Jan-17, vignesh C wrote:

> Thanks Dilip for reviewing.
> I have fixed the comments you have suggested.

I ended up rewording that comment completely; I thought the original was
not explaining things well enough.

I also changed the comment for final_lsn in reorderbuffer.h: not only I
remove the line that I added in df9f682c7bf8 (the previous bugfix), but
I also remove the line that says "error during decoding", which came in
with the very first logical decoding commit (b89e151054a); I couldn't
find any evidence that final_lsn is being set on errors of any kind
(other than transaction abort, which doesn't seem an "error" in that
sense.)

Please give these comments a read; maybe I have misunderstood something
and my comment is wrong.

Pushed now to all branches.  Thanks, Vignesh, Amit and Dilip.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Reorderbuffer crash during recovery

From
Amit Kapila
Date:
On Sat, Jan 18, 2020 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jan-17, vignesh C wrote:
>
> > Thanks Dilip for reviewing.
> > I have fixed the comments you have suggested.
>
> I ended up rewording that comment completely; I thought the original was
> not explaining things well enough.
>
> I also changed the comment for final_lsn in reorderbuffer.h: not only I
> remove the line that I added in df9f682c7bf8 (the previous bugfix), but
> I also remove the line that says "error during decoding", which came in
> with the very first logical decoding commit (b89e151054a); I couldn't
> find any evidence that final_lsn is being set on errors of any kind
> (other than transaction abort, which doesn't seem an "error" in that
> sense.)
>
> Please give these comments a read; maybe I have misunderstood something
> and my comment is wrong.
>

The comments added by you look correct and good to me.  Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Reorderbuffer crash during recovery

From
vignesh C
Date:
On Sat, Jan 18, 2020 at 2:42 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jan-17, vignesh C wrote:
>
> > Thanks Dilip for reviewing.
> > I have fixed the comments you have suggested.
>
> I ended up rewording that comment completely; I thought the original was
> not explaining things well enough.
>
> I also changed the comment for final_lsn in reorderbuffer.h: not only I
> remove the line that I added in df9f682c7bf8 (the previous bugfix), but
> I also remove the line that says "error during decoding", which came in
> with the very first logical decoding commit (b89e151054a); I couldn't
> find any evidence that final_lsn is being set on errors of any kind
> (other than transaction abort, which doesn't seem an "error" in that
> sense.)
>
> Please give these comments a read; maybe I have misunderstood something
> and my comment is wrong.
>
> Pushed now to all branches.  Thanks, Vignesh, Amit and Dilip.
>

Thanks Alvaro for pushing this patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


