Thread: Parallel scan with SubTransGetTopmostTransaction assert coredump
Hi Hackers,
My last email had a formatting error and was missing some information, so I am resending it.
With PG 13.2 (3fb4c75e857adee3da4386e947ba58a75f3e74b7), I tested subtransactions with parallel scan and got a coredump, as below:
```
(gdb) bt
#0 0x00001517ce61f7ff in raise () from /lib64/libc.so.6
#1 0x00001517ce609c35 in abort () from /lib64/libc.so.6
#2 0x0000000000aaf93f in ExceptionalCondition (conditionName=0xb4c920 "TransactionIdFollowsOrEquals(xid, TransactionXmin)", errorType=0xb4c796 "FailedAssertion",
fileName=0xb4c738 "/home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/subtrans.c", lineNumber=156)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/utils/error/assert.c:67
#3 0x0000000000563111 in SubTransGetTopmostTransaction (xid=196963)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/subtrans.c:156
#4 0x0000000000b05206 in XidInMVCCSnapshot (xid=196963, snapshot=0x2f8ec58)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/utils/time/snapmgr.c:2293
#5 0x00000000004ff2bc in HeapTupleSatisfiesMVCC (htup=0x7ffc21807120, snapshot=0x2f8ec58, buffer=1946)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam_visibility.c:1073
#6 0x0000000000500363 in HeapTupleSatisfiesVisibility (tup=0x7ffc21807120, snapshot=0x2f8ec58, buffer=1946)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam_visibility.c:1695
#7 0x00000000004e423b in heapgetpage (sscan=0x2f8e840, page=1685) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam.c:447
#8 0x00000000004e68bf in heapgettup_pagemode (scan=0x2f8e840, dir=ForwardScanDirection, nkeys=0, key=0x0)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam.c:1077
#9 0x00000000004e6eb6 in heap_getnextslot (sscan=0x2f8e840, direction=ForwardScanDirection, slot=0x2fd8a38)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam.c:1333
#10 0x000000000075350e in table_scan_getnextslot (sscan=0x2f8e840, direction=ForwardScanDirection, slot=0x2fd8a38)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/include/access/tableam.h:906
#11 0x00000000007535d6 in SeqNext (node=0x2fd86a8) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeSeqscan.c:80
#12 0x000000000071af3c in ExecScanFetch (node=0x2fd86a8, accessMtd=0x753542 <SeqNext>, recheckMtd=0x7535e7 <SeqRecheck>)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execScan.c:133
#13 0x000000000071afdd in ExecScan (node=0x2fd86a8, accessMtd=0x753542 <SeqNext>, recheckMtd=0x7535e7 <SeqRecheck>)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execScan.c:199
#14 0x0000000000753631 in ExecSeqScan (pstate=0x2fd86a8) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeSeqscan.c:112
#15 0x00000000007173af in ExecProcNodeFirst (node=0x2fd86a8) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execProcnode.c:450
#16 0x000000000070b7f7 in ExecProcNode (node=0x2fd86a8) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/include/executor/executor.h:248
#17 0x000000000070e302 in ExecutePlan (estate=0x2fd7ca0, planstate=0x2fd86a8, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x2f8ddd8, execute_once=true) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execMain.c:1632
#18 0x000000000070be22 in standard_ExecutorRun (queryDesc=0x2f8e7a8, direction=ForwardScanDirection, count=0, execute_once=true)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execMain.c:350
#19 0x000000000070bc50 in ExecutorRun (queryDesc=0x2f8e7a8, direction=ForwardScanDirection, count=0, execute_once=true)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execMain.c:294
#20 0x00000000007131d5 in ParallelQueryMain (seg=0x2ef30b8, toc=0x1517cf85c000)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execParallel.c:1448
#21 0x000000000055f70c in ParallelWorkerMain (main_arg=897455922) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c:1470
#22 0x000000000086e255 in StartBackgroundWorker () at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/bgworker.c:879
#23 0x000000000088192c in do_start_bgworker (rw=0x2f221c0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:5870
#24 0x0000000000881cd8 in maybe_start_bgworkers () at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:6095
#25 0x0000000000880d14 in sigusr1_handler (postgres_signal_arg=10) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:5255
#26 <signal handler called>
#27 0x00001517ce6dc4bb in select () from /lib64/libc.so.6
#28 0x000000000087c867 in ServerLoop () at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:1703
#29 0x000000000087c232 in PostmasterMain (argc=3, argv=0x2ef1070) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:1412
#30 0x0000000000783418 in main (argc=3, argv=0x2ef1070) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/main/main.c:210
(gdb) f 3
#3 0x0000000000563111 in SubTransGetTopmostTransaction (xid=196963)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/subtrans.c:325
325 Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
(gdb) p xid
$1 = 196963
(gdb) f 4
#4 0x0000000000b05206 in XidInMVCCSnapshot (xid=196963, snapshot=0x2f8ec58)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/utils/time/snapmgr.c:2293
2293 xid = SubTransGetTopmostTransaction(xid);
(gdb) p *snapshot
$2 = {snapshot_type = SNAPSHOT_MVCC, xmin = 196962, xmax = 210314, xip = 0x2f8ecc0, xcnt = 82, subxip = 0x0, subxcnt = 0, suboverflowed = true,
takenDuringRecovery = false, copied = true, curcid = 14, speculativeToken = 2139062143, active_count = 0, regd_count = 1, ph_node = {first_child = 0x0,
next_sibling = 0xf65ca0 <CatalogSnapshotData+64>, prev_or_parent = 0x2f8dbc8}, whenTaken = 0, lsn = 0}
(gdb) p TransactionXmin
$3 = 196992
(gdb) f 3
#3 0x0000000000563111 in SubTransGetTopmostTransaction (xid=196963)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/subtrans.c:325
325 Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
(gdb) p xid
$4 = 196963
(gdb) p TransactionXmin
$5 = 196992
```
After some simple analysis, I think this is a bug.
1, The main process first gets the active snapshot (xmin 196962, xmax 210314) and pushes it as the active snapshot.
2, The main process calls InitializeParallelDSM, gets the transaction snapshot (xmin 196992, xmax 210320), and serializes this snapshot under the key PARALLEL_KEY_TRANSACTION_SNAPSHOT.
3, The main process calls ExecParallelInitializeDSM->ExecSeqScanInitializeDSM->table_parallelscan_initialize, which serializes the active snapshot (xmin 196962, xmax 210314) with SerializeSnapshot.
4, The main process launches the parallel worker processes.
The parallel worker process work flow:
1, ParallelWorkerMain->BackgroundWorkerInitializeConnectionByOid->GetTransactionSnapshot gets a snapshot (xmin 196992, xmax 210320) and sets TransactionXmin to 196992.
2, ParallelWorkerMain->RestoreTransactionSnapshot->SetTransactionSnapshot->ProcArrayInstallRestoredXmin sets TransactionXmin to 196992.
3, ParallelWorkerMain->ExecParallelInitializeWorker->ExecSeqScanInitializeWorker->table_beginscan_parallel restores the active snapshot (xmin 196962, xmax 210314) from the main process.
4, Coredump in ParallelWorkerMain->ParallelQueryMain->ExecutorRun...ExecScan->heapgetpage->HeapTupleSatisfiesMVCC->XidInMVCCSnapshot->SubTransGetTopmostTransaction.
So the root cause is that the parallel worker process sets TransactionXmin from the later transaction snapshot, but during the parallel scan it uses the older active snapshot.
That leads to the subtrans assert coredump. I don't know how to fix it. Are there any ideas?
Thanks
Pengcheng
Hi,

On 2021-05-07 11:32:57 +0800, Pengchengliu wrote:
> Hi Hackers,
>
> Last email, format error, missing some information, so I resend this email.
>
> With PG 13.2(3fb4c75e857adee3da4386e947ba58a75f3e74b7), I tested subtransaction with parallel scan, I got a subtransaction coredump as below:

> So the root cause is the Parallel Workers process set the TransactionXmin with later transaction snapshot. When parallel scan, Parallel Workers process use the older active snapshot.
>
> It leads to subtrans assert coredump. I don't know how to fix it. Is there any ideas?

Do you have steps to reliably reproduce this?

Greetings,

Andres Freund
Hi Andres,

Reproduce steps:

1, Modify NUM_SUBTRANS_BUFFERS from 32 to 128 in the file "src/include/access/subtrans.h", line 15.
2, configure with asserts enabled and build it.
3, Init a new database cluster.
4, Modify postgres.conf and add the parameters below. As the coredump comes from parallel scan, we adjust the parallel settings to make it easier to reproduce.

max_connections = 2000

parallel_setup_cost=0
parallel_tuple_cost=0
min_parallel_table_scan_size=0
max_parallel_workers_per_gather=8
max_parallel_workers = 32

5, Start the database cluster.
6, Use the script init_test.sql in the attachment to create tables.
7, Use pgbench with the script sub_120.sql in the attachment to test it. Try it a few times and you should get the coredump file.

pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 120

Thanks
Pengcheng

-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: May 7, 2021 11:55
To: Pengchengliu <pengchengliu@tju.edu.cn>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

Hi,

On 2021-05-07 11:32:57 +0800, Pengchengliu wrote:
> Hi Hackers,
>
> Last email, format error, missing some information, so I resend this email.
>
> With PG 13.2(3fb4c75e857adee3da4386e947ba58a75f3e74b7), I tested subtransaction with parallel scan, I got a subtransaction coredump as below:

> So the root cause is the Parallel Workers process set the TransactionXmin with later transaction snapshot. When parallel scan, Parallel Workers process use the older active snapshot.
>
> It leads to subtrans assert coredump. I don't know how to fix it. Is there any ideas?

Do you have steps to reliably reproduce this?

Greetings,

Andres Freund
On Tue, May 11, 2021 at 11:28 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Andres,
>   Reproduce steps.
>
> 1, Modify and adjust NUM_SUBTRANS_BUFFERS to 128 from 32 in the file "src/include/access/subtrans.h" line number 15.
> 2, configure with enable assert and build it.
> 3, init a new database cluster.
> 4, modify postgres.conf and add some parameters as below. As the coredump from parallel scan, so we adjust parallel setting, make it easy to reproduce.
>
> max_connections = 2000
>
> parallel_setup_cost=0
> parallel_tuple_cost=0
> min_parallel_table_scan_size=0
> max_parallel_workers_per_gather=8
> max_parallel_workers = 32
>
> 5, start the database cluster.
> 6, use the script init_test.sql in attachment to create tables.
> 7, use pgbench with script sub_120.sql in attachment to test it. Try it sometimes, you should get the coredump file.
> pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 120
>

Hi,

I had a go at reproducing your reported issue, making sure to follow all your steps.
Unfortunately, your script seemed to run OK with pgbench and no crash/coredump occurred for me (and yes, I definitely had asserts enabled).
I tried with both the 13.2 source code (3fb4c75e857adee3da4386e947ba58a75f3e74b7), running through the script with pgbench twice to completion, and also did the same using the latest Postgres source code.

Will be interesting to see if anyone is able to reproduce your issue.

Regards,
Greg Nancarrow
Fujitsu Australia
Hi Andres,
Thanks for your reply.
If you still cannot reproduce it within 2 minutes, could you run pgbench for a longer time, such as 30 or 60 minutes?
This coredump should come from a parallel scan only.
For a normal scan (without parallelism), the SubTransGetTopmostTransaction assert (HeapTupleSatisfiesMVCC->XidInMVCCSnapshot->SubTransGetTopmostTransaction->Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)))
is correct, I think, because a single-process scan gets the transaction snapshot and sets TransactionXmin and Snapshot->xmin together.
In XidInMVCCSnapshot, it first checks whether the xid precedes snapshot->xmin. If it does, XidInMVCCSnapshot returns false directly.
So in XidInMVCCSnapshot->SubTransGetTopmostTransaction, the xid cannot precede snapshot->xmin, and therefore it cannot precede TransactionXmin either.
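To make that ordering concrete, here is a small standalone model (my own simplification for illustration, not the real PostgreSQL functions; the xip/subxip array searches are omitted) using the xmin values from the first backtrace in this thread. A single-process scan, where TransactionXmin and snapshot->xmin come from the same GetSnapshotData() call, passes the check; a parallel worker whose TransactionXmin comes from the later transaction snapshot violates the asserted condition:
```
/*
 * Standalone simplification (not PostgreSQL source) of the checks in
 * XidInMVCCSnapshot() that matter for this report.
 * "lookup_topmost_in_pg_subtrans" stands in for
 * SubTransGetTopmostTransaction() and carries its assertion.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* modulo-2^32 ordering used for normal xids, special xids ignored here */
static bool xid_precedes(TransactionId a, TransactionId b) { return (int32_t) (a - b) < 0; }

typedef struct { TransactionId xmin, xmax; bool suboverflowed; } SnapshotModel;

static TransactionId TransactionXmin;

static TransactionId
lookup_topmost_in_pg_subtrans(TransactionId xid)
{
	/* pg_subtrans may be truncated for anything older than TransactionXmin,
	 * hence the assertion in the real function */
	assert(!xid_precedes(xid, TransactionXmin));
	return xid;					/* pretend it was already a top-level xid */
}

static bool
xid_in_snapshot_model(TransactionId xid, const SnapshotModel *snap)
{
	if (xid_precedes(xid, snap->xmin))		/* 1: older than the snapshot -> not in it */
		return false;
	if (!xid_precedes(xid, snap->xmax))		/* 2: at or after xmax -> treated as in progress */
		return true;
	if (snap->suboverflowed)
	{
		/* 3: overflowed snapshot: map the xid to its topmost parent before
		 * searching the xip array */
		xid = lookup_topmost_in_pg_subtrans(xid);
		if (xid_precedes(xid, snap->xmin))
			return false;
	}
	return false;							/* xip/subxip searches omitted in this model */
}

int
main(void)
{
	SnapshotModel active = { .xmin = 196962, .xmax = 210314, .suboverflowed = true };

	/* single-process scan: the same snapshot set both values */
	TransactionXmin = 196962;
	printf("single scan, tuple xid 196963 in snapshot: %d\n",
	       xid_in_snapshot_model(196963, &active));

	/* parallel worker: TransactionXmin came from the later transaction snapshot */
	TransactionXmin = 196992;
	printf("parallel worker case: ");
	xid_in_snapshot_model(196963, &active);	/* assertion fails here, modelling the coredump */
	return 0;
}
```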
But for a parallel scan it is different. I modified the code, replacing the SubTransGetTopmostTransaction Assert with a sleep.
Then we can check TransactionXmin and the snapshot from the DSA.
The stack is as below when the Assert condition is violated.
(gdb) bt
#0 0x0000149fb3d254bb in select () from /lib64/libc.so.6
#1 0x0000000000b1d3b3 in pg_usleep (microsec=1000000) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/port/pgsleep.c:56
#2 0x0000000000562a3b in SubTransGetTopmostTransaction (xid=799225) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/subtrans.c:164
#3 0x0000000000b04acb in XidInMVCCSnapshot (xid=799225, snapshot=0x2af2d00) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/utils/time/snapmgr.c:2293
#4 0x00000000004ff24c in HeapTupleSatisfiesMVCC (htup=0x7fffc1465f60, snapshot=0x2af2d00, buffer=109832)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam_visibility.c:1073
#5 0x00000000005002f3 in HeapTupleSatisfiesVisibility (tup=0x7fffc1465f60, snapshot=0x2af2d00, buffer=109832)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam_visibility.c:1695
#6 0x00000000004e41cb in heapgetpage (sscan=0x2af3118, page=10846) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam.c:447
#7 0x00000000004e684f in heapgettup_pagemode (scan=0x2af3118, dir=ForwardScanDirection, nkeys=0, key=0x0)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam.c:1077
#8 0x00000000004e6e46 in heap_getnextslot (sscan=0x2af3118, direction=ForwardScanDirection, slot=0x2affab0)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam.c:1333
#9 0x0000000000752e1a in table_scan_getnextslot (sscan=0x2af3118, direction=ForwardScanDirection, slot=0x2affab0)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/include/access/tableam.h:906
#10 0x0000000000752ee2 in SeqNext (node=0x2aff538) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeSeqscan.c:80
#11 0x000000000071a848 in ExecScanFetch (node=0x2aff538, accessMtd=0x752e4e <SeqNext>, recheckMtd=0x752ef3 <SeqRecheck>)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execScan.c:133
#12 0x000000000071a8e9 in ExecScan (node=0x2aff538, accessMtd=0x752e4e <SeqNext>, recheckMtd=0x752ef3 <SeqRecheck>)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execScan.c:199
#13 0x0000000000752f3d in ExecSeqScan (pstate=0x2aff538) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeSeqscan.c:112
#14 0x0000000000725794 in ExecProcNode (node=0x2aff538) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/include/executor/executor.h:248
#15 0x0000000000725c7f in fetch_input_tuple (aggstate=0x2afeff0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeAgg.c:589
#16 0x0000000000728f98 in agg_retrieve_direct (aggstate=0x2afeff0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeAgg.c:2463
#17 0x00000000007289f2 in ExecAgg (pstate=0x2afeff0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/nodeAgg.c:2183
#18 0x0000000000716cbb in ExecProcNodeFirst (node=0x2afeff0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execProcnode.c:450
#19 0x000000000070b103 in ExecProcNode (node=0x2afeff0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/include/executor/executor.h:248
#20 0x000000000070dc0e in ExecutePlan (estate=0x2afeb30, planstate=0x2afeff0, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x2ab0578, execute_once=true) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execMain.c:1632
#21 0x000000000070b72e in standard_ExecutorRun (queryDesc=0x2af2c68, direction=ForwardScanDirection, count=0, execute_once=true)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execMain.c:350
#22 0x000000000070b55c in ExecutorRun (queryDesc=0x2af2c68, direction=ForwardScanDirection, count=0, execute_once=true)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execMain.c:294
#23 0x0000000000712ae1 in ParallelQueryMain (seg=0x2a0a0c8, toc=0x149fb4dab000) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/executor/execParallel.c:1448
#24 0x000000000055f69c in ParallelWorkerMain (main_arg=1403863538) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c:1470
#25 0x000000000086db61 in StartBackgroundWorker () at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/bgworker.c:879
#26 0x0000000000881238 in do_start_bgworker (rw=0x2a351b0) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:5870
#27 0x00000000008815e4 in maybe_start_bgworkers () at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:6095
#28 0x0000000000880620 in sigusr1_handler (postgres_signal_arg=10) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:5255
#29 <signal handler called>
#30 0x0000149fb3d254bb in select () from /lib64/libc.so.6
#31 0x000000000087c173 in ServerLoop () at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:1703
#32 0x000000000087bb3e in PostmasterMain (argc=3, argv=0x2a08080) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/postmaster/postmaster.c:1412
#33 0x0000000000782d24 in main (argc=3, argv=0x2a08080) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/main/main.c:210
(gdb) f 24
#24 0x000000000055f69c in ParallelWorkerMain (main_arg=1403863538) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c:1470
1470 entrypt(seg, toc);
(gdb) p *ActiveSnapshot->as_snap //active snapshot from main process
$18 = {snapshot_type = SNAPSHOT_MVCC, xmin = 799162, xmax = 822061, xip = 0x2ab0190, xcnt = 169, subxip = 0x0, subxcnt = 0, suboverflowed = true, takenDuringRecovery = false,
copied = true, curcid = 119, speculativeToken = 2139062143, active_count = 1, regd_count = 2, ph_node = {first_child = 0x2af2d40, next_sibling = 0x0, prev_or_parent = 0x0},
whenTaken = 0, lsn = 0}
(gdb) p *CurrentSnapshot //transaction snapshot from main process
$19 = {snapshot_type = SNAPSHOT_MVCC, xmin = 799425, xmax = 822293, xip = 0x2ab1c00, xcnt = 172, subxip = 0x149f29302010, subxcnt = 0, suboverflowed = true,
takenDuringRecovery = false, copied = false, curcid = 119, speculativeToken = 0, active_count = 0, regd_count = 0, ph_node = {first_child = 0x0, next_sibling = 0x0,
prev_or_parent = 0x0}, whenTaken = 0, lsn = 0}
(gdb) f 4
#4 0x00000000004ff24c in HeapTupleSatisfiesMVCC (htup=0x7fffc1465f60, snapshot=0x2af2d00, buffer=109832)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/heap/heapam_visibility.c:1073
1073 XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
(gdb) p *snapshot //ative snap shot from main process for scan
$20 = {snapshot_type = SNAPSHOT_MVCC, xmin = 799162, xmax = 822061, xip = 0x2af2d68, xcnt = 169, subxip = 0x0, subxcnt = 0, suboverflowed = true, takenDuringRecovery = false,
copied = true, curcid = 119, speculativeToken = 2139062143, active_count = 0, regd_count = 1, ph_node = {first_child = 0x0, next_sibling = 0xf65ca0 <CatalogSnapshotData+64>,
prev_or_parent = 0x2ab0168}, whenTaken = 0, lsn = 0}
(gdb) p TransactionXmin
$21 = 799425
(gdb) f 3
#3 0x0000000000b04acb in XidInMVCCSnapshot (xid=799225, snapshot=0x2af2d00) at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/utils/time/snapmgr.c:2293
2293 xid = SubTransGetTopmostTransaction(xid);
(gdb) p xid
$22 = 799225
The main process:
1, The main process gets the transaction snapshot (xmin 799162, xmax 822061), pushes it as the active snapshot, and sets this active snapshot in the QueryDesc in CreateQueryDesc.
2, The main process collects the active snapshot (xmin 799162, xmax 822061) and gets the newer transaction snapshot (xmin 799425, xmax 822293), then stores the transaction snapshot and the active snapshot under the keys PARALLEL_KEY_TRANSACTION_SNAPSHOT and PARALLEL_KEY_ACTIVE_SNAPSHOT.
3, The main process, via ExecGather->ExecInitParallelPlan->ExecParallelInitializeDSM->ExecSeqScanInitializeDSM->table_parallelscan_initialize, sends the active snapshot (xmin 799162, xmax 822061) with the plan id to the parallel worker processes.
4, Launch the parallel worker processes.
The parallel worker process:
1, Gets a snapshot and sets TransactionXmin itself, in ParallelWorkerMain->BackgroundWorkerInitializeConnectionByOid->GetTransactionSnapshot->GetSnapshotData.
2, According to PARALLEL_KEY_TRANSACTION_SNAPSHOT (xmin 799425, xmax 822293) from the main process, sets TransactionXmin to 799425 in ParallelWorkerMain->RestoreTransactionSnapshot->SetTransactionSnapshot->ProcArrayInstallRestoredXmin.
3, ExecParallelInitializeWorker->ExecSeqScanInitializeWorker->table_beginscan_parallel gets the active snapshot (xmin 799162, xmax 822061) from the main process and sets it as scan->rs_base.rs_snapshot.
4, The parallel scan begins with the active snapshot (xmin 799162, xmax 822061) and TransactionXmin 799425; when it scans a tuple with xmin 799225, the SubTransGetTopmostTransaction assert fires,
in HeapTupleSatisfiesMVCC->XidInMVCCSnapshot->SubTransGetTopmostTransaction.
The main process gets the active snapshot (xmin 799162, xmax 822061), which is earlier than the transaction snapshot (xmin 799425, xmax 822293). The parallel worker process sets TransactionXmin from the transaction snapshot (xmin 799425, xmax 822293),
but scans tuples with the active snapshot (xmin 799162, xmax 822061).
Thanks
Pengcheng
-----Original Message-----
From: Greg Nancarrow <gregn4422@gmail.com>
Sent: May 11, 2021 19:08
To: Pengchengliu <pengchengliu@tju.edu.cn>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
On Tue, May 11, 2021 at 11:28 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Andres,
> Reproduce steps.
>
> 1, Modify and adjust NUM_SUBTRANS_BUFFERS to 128 from 32 in the file "src/include/access/subtrans.h" line number 15.
> 2, configure with enable assert and build it.
> 3, init a new database cluster.
> 4, modify postgres.conf and add some parameters as below. As the coredump from parallel scan, so we adjust parallel setting, make it easy to reproduce.
>
> max_connections = 2000
>
> parallel_setup_cost=0
> parallel_tuple_cost=0
> min_parallel_table_scan_size=0
> max_parallel_workers_per_gather=8
> max_parallel_workers = 32
>
> 5, start the database cluster.
> 6, use the script init_test.sql in attachment to create tables.
> 7, use pgbench with script sub_120.sql in attachment to test it. Try it sometimes, you should get the coredump file.
> pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 120
>
Hi,
I had a go at reproducing your reported issue, making sure to follow all your steps.
Unfortunately, your script seemed to run OK with pgbench and no crash/coredump occurred for me (and yes, I definitely had asserts enabled).
I tried with both the 13.2 source code
(3fb4c75e857adee3da4386e947ba58a75f3e74b7), running through the script with pgbench twice to completion, and also did the same using the latest Postgres source code.
Will be interesting to see if anyone is able to reproduce your issue.
Regards,
Greg Nancarrow
Fujitsu Australia
On Thu, May 13, 2021 at 11:25 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Andres,
>   Thanks for you replay.

Er, it's Greg who has replied so far (not Andres).

>
> And If you still cannot reproduce it in 2 minitus. Could you run pgbench longer time, such as 30 or 60 minutes.
>

Actually, I did run it, multiple times, for more than 60 minutes, but no assert/crash/coredump occurred in my environment.

>
> The parallel work process:
>
> 1, Get Snapshot and set TransactionXmin itself, in ParallelWorkerMain->BackgroundWorkerInitializeConnectionByOid->GetTransactionSnapshot->GetSnapshotData.
>
> 2, Acooding PARALLEL_KEY_TRANSACTION_SNAPSHOT(xmin 799425, xmax 82229) from main process, and set TransactionXmin 799425 in ParallelWorkerMain->RestoreTransactionSnapshot->SetTransactionSnapshot->ProcArrayInstallRestoredXmin.
>
> 3, ExecParallelInitializeWorker->ExecSeqScanInitializeWorker->table_beginscan_parallel get the active snapshot(xmin 799162, xmax 82206) from main process, and set this snapshot to scan->rs_base.rs_snapshot.
>
> 4, parallel scan begin, with active snapshot(xmin 799162, xmax 82206) and TransactionXmin(799425), when scan tuple(xmin 799225) SubTransGetTopmostTransaction assert got.
>
> In HeapTupleSatisfiesMVCC->XidInMVCCSnapshot->SubTransGetTopmostTransaction.
>

I added some logging at a couple of points in the code:
1) In the Worker process code - ParallelWorkerMain() - where it restores the serialized transaction and active snapshots (i.e. passed to the Worker from the main process).
2) In the HeapTupleSatisfiesMVCC() function, immediately before it calls XidInMVCCSnapshot()

After running it for an hour, examination of the log showed that in ALL cases, the transaction snapshot xmin,xmax was always THE SAME as the active snapshot xmin,xmax.
(Can you verify that this occurs on your system when things are working, prior to the coredump?)

This is different to what you are getting in your environment (at least, different to what you described when the problem occurs).
In your case, you say that the main process gets "the newer transaction snapshot" - where exactly is this happening in your case?
(or this is what you don't yet know?)

Perhaps very occasionally this somehow happens on your system and triggers the Assert (and coredump)? I have not been able to reproduce that on my system.

Have you reproduced this issue on any other system, using the same steps as you provided?
I'm wondering if there might be something else in your environment that may be influencing this problem.

Regards,
Greg Nancarrow
Fujitsu Australia
Hi Greg,
Thanks for your reply and testing.
When the main process gets the transaction snapshot in InitializeParallelDSM->GetTransactionSnapshot, the transaction snapshot xmin very likely follows the active snapshot xmin.
With gdb it is easy to verify. Create the environment as below:
1, Use PG 13.2 (3fb4c75e857adee3da4386e947ba58a75f3e74b7) and init a database cluster.
2, Append the following to postgres.conf:
max_connections = 2000
parallel_setup_cost=0
parallel_tuple_cost=0
min_parallel_table_scan_size=0
max_parallel_workers_per_gather=8
max_parallel_workers = 128
3, Start the database cluster. Use the init_test.sql script in the attachment to create some test tables.
4, Use the sub_120.sql script in the attachment with pgbench to test it.
pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 1800
Then you can log in to the database and use GDB to verify it.
1, First use explain to make sure a forced parallel plan is used.
postgres=# explain (verbose,analyze) select count(*) from contend1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=12006.11..12006.12 rows=1 width=8) (actual time=1075.214..1075.449 rows=1 loops=1)
Output: count(*)
-> Gather (cost=12006.08..12006.09 rows=8 width=8) (actual time=1075.198..1075.433 rows=1 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 8
Workers Launched: 0
-> Partial Aggregate (cost=12006.08..12006.09 rows=1 width=8) (actual time=1074.674..1074.676 rows=1 loops=1)
Output: PARTIAL count(*)
-> Parallel Seq Scan on public.contend1 (cost=0.00..11690.06 rows=126406 width=0) (actual time=0.008..587.454 rows=1010200 loops=1)
Output: id, val, c2, c3, c4, c5, c6, c7, c8, c9, c10, crt_time
Planning Time: 0.123 ms
Execution Time: 1075.588 ms
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
2182678
2, Use gdb to debug the backend process. Add a breakpoint at parallel.c:219 and continue.
gdb -q -p 2182678
...
(gdb) b parallel.c:219
Breakpoint 1 at 0x55d085: file /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c, line 219.
(gdb) c
Continuing.
3, In the psql client, execute the explain command from step 1 again.
After the breakpoint is hit in gdb, wait a moment, then execute next.
Use gdb to check active_snapshot and transaction_snapshot: active_snapshot->xmin is 158987 and transaction_snapshot->xmin is 162160.
When I tested it with gdb, sometimes active_snapshot was the same as transaction_snapshot. In that case try it multiple times, and before executing next, try waiting a longer time.
Breakpoint 1, InitializeParallelDSM (pcxt=0x2d53670)
at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c:219
219 Snapshot transaction_snapshot = GetTransactionSnapshot();
(gdb) n
220 Snapshot active_snapshot = GetActiveSnapshot();
(gdb)
223 oldcontext = MemoryContextSwitchTo(TopTransactionContext);
(gdb) p *transaction_snapshot
$1 = {snapshot_type = SNAPSHOT_MVCC, xmin = 162160, xmax = 183011, xip = 0x2d50d10, xcnt = 179, subxip = 0x148a9cddf010,
subxcnt = 0, suboverflowed = true, takenDuringRecovery = false, copied = false, curcid = 0, speculativeToken = 0,
active_count = 0, regd_count = 0, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x0}, whenTaken = 0, lsn = 0}
(gdb) p *active_snapshot
$2 = {snapshot_type = SNAPSHOT_MVCC, xmin = 158987, xmax = 173138, xip = 0x2d53288, xcnt = 178, subxip = 0x0, subxcnt = 0,
suboverflowed = true, takenDuringRecovery = false, copied = true, curcid = 0, speculativeToken = 0, active_count = 1,
regd_count = 2, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x2d52e48}, whenTaken = 0, lsn = 0}
(gdb)
Thanks
Pengcheng
-----Original Message-----
From: Greg Nancarrow <gregn4422@gmail.com>
Sent: May 13, 2021 22:15
To: Pengchengliu <pengchengliu@tju.edu.cn>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
On Thu, May 13, 2021 at 11:25 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Andres,
> Thanks for you replay.
Er, it's Greg who has replied so far (not Andres).
>
> And If you still cannot reproduce it in 2 minitus. Could you run pgbench longer time, such as 30 or 60 minutes.
>
Actually, I did run it, multiple times, for more than 60 minutes, but no assert/crash/coredump occurred in my environment.
>
> The parallel work process:
>
> 1, Get Snapshot and set TransactionXmin itself, in ParallelWorkerMain->BackgroundWorkerInitializeConnectionByOid->GetTransactionSnapshot->GetSnapshotData.
>
> 2, Acooding PARALLEL_KEY_TRANSACTION_SNAPSHOT(xmin 799425, xmax 82229) from main process, and set TransactionXmin 799425 in ParallelWorkerMain->RestoreTransactionSnapshot->SetTransactionSnapshot->ProcArrayInstallRestoredXmin.
>
> 3, ExecParallelInitializeWorker->ExecSeqScanInitializeWorker->table_beginscan_parallel get the active snapshot(xmin 799162, xmax 82206) from main process, and set this snapshot to scan->rs_base.rs_snapshot.
>
> 4, parallel scan begin, with active snapshot(xmin 799162, xmax 82206) and TransactionXmin(799425),when scan tuple(xmin 799225) SubTransGetTopmostTransaction assert got.
>
> In HeapTupleSatisfiesMVCC->XidInMVCCSnapshot->SubTransGetTopmostTransaction.
>
I added some logging at a couple of points in the code:
1) In the Worker process code - ParallelWorkerMain() - where it restores the serialized transaction and active snapshots (i.e. passed to the Worker from the main process).
2) In the HeapTupleSatisfiesMVCC() function, immediately before it calls XidInMVCCSnapshot()
After running it for an hour, examination of the log showed that in ALL cases, the transaction snapshot xmin,xmax was always THE SAME as the active snapshot xmin,xmax.
(Can you verify that this occurs on your system when things are working, prior to the coredump?)
This is different to what you are getting in your environment (at least, different to what you described when the problem occurs).
In your case, you say that the main process gets "the newer transaction snapshot" - where exactly is this happening in your case?
(or this is what you don't yet know?)
Perhaps very occasionally this somehow happens on your system and triggers the Assert (and coredump)? I have not been able to reproduce that on my system.
Have you reproduced this issue on any other system, using the same steps as you provided?
I'm wondering if there might be something else in your environment that may be influencing this problem.
Regards,
Greg Nancarrow
Fujitsu Australia
On Fri, May 14, 2021 at 12:25 PM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Greg,
>
>   Thanks for you replay and test.
>
> When main process gets the transaction snapshot in InitializeParallelDSM->GetTransactionSnapshot, the transaction snapshot xmin is very likely follows active snapshot xmin.
>
> Use gdb it is easy to verify it.
> Create env as blow:
>
> 1, Use PG13.2(3fb4c75e857adee3da4386e947ba58a75f3e74b7), init a cluster database.
> 2, Append the postgres.conf as below:
>
> max_connections = 2000
> parallel_setup_cost=0
> parallel_tuple_cost=0
> min_parallel_table_scan_size=0
> max_parallel_workers_per_gather=8
> max_parallel_workers = 128
>
> 3, Start the cluster database. Use the init_test.sql script in attachment to create some test tables.
> 4, Use the sub_120.sql script in attachment with pgbench to test it.
>
> pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 1800
>
> Then you can login the database, and use GDB to verify it.
> 1, First use explain, make sure force Parallel is OK.
>
> postgres=# explain (verbose,analyze) select count(*) from contend1;
>                                                               QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=12006.11..12006.12 rows=1 width=8) (actual time=1075.214..1075.449 rows=1 loops=1)
>    Output: count(*)
>    ->  Gather  (cost=12006.08..12006.09 rows=8 width=8) (actual time=1075.198..1075.433 rows=1 loops=1)
>          Output: (PARTIAL count(*))
>          Workers Planned: 8
>          Workers Launched: 0
>          ->  Partial Aggregate  (cost=12006.08..12006.09 rows=1 width=8) (actual time=1074.674..1074.676 rows=1 loops=1)
>                Output: PARTIAL count(*)
>                ->  Parallel Seq Scan on public.contend1  (cost=0.00..11690.06 rows=126406 width=0) (actual time=0.008..587.454 rows=1010200 loops=1)
>                      Output: id, val, c2, c3, c4, c5, c6, c7, c8, c9, c10, crt_time
>  Planning Time: 0.123 ms
>  Execution Time: 1075.588 ms
>
> postgres=# select pg_backend_pid();
>  pg_backend_pid
> ----------------
>  2182678
>
> 2, use gdb to debug our backend process. Add the breakpoint in parallel.c:219 and continue.
>
> gdb -q -p 2182678
> ...
> (gdb) b parallel.c:219
> Breakpoint 1 at 0x55d085: file /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c, line 219.
> (gdb) c
> Continuing.
>
> 3, In the psql clinet, we can execute the explain command in step1 again.
> After we get the breakpoint in gdb, we wait a moment. Then we execute next.
> Use gdb check active_snapshot and transaction_snapshot, active_snapshot->xmin is 158987 and transaction_snapshot->xmin is 162160.
> When I use gdb test it, sometimes active_snapshot is the same as transaction_snapshot. Then you can try it multiple times, and before execute next, try wait longer time.
>
> Breakpoint 1, InitializeParallelDSM (pcxt=0x2d53670)
>     at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c:219
> 219             Snapshot        transaction_snapshot = GetTransactionSnapshot();
> (gdb) n
> 220             Snapshot        active_snapshot = GetActiveSnapshot();
> (gdb)
> 223             oldcontext = MemoryContextSwitchTo(TopTransactionContext);
> (gdb) p *transaction_snapshot
> $1 = {snapshot_type = SNAPSHOT_MVCC, xmin = 162160, xmax = 183011, xip = 0x2d50d10, xcnt = 179, subxip = 0x148a9cddf010,
>   subxcnt = 0, suboverflowed = true, takenDuringRecovery = false, copied = false, curcid = 0, speculativeToken = 0,
>   active_count = 0, regd_count = 0, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x0}, whenTaken = 0, lsn = 0}
> (gdb) p *active_snapshot
> $2 = {snapshot_type = SNAPSHOT_MVCC, xmin = 158987, xmax = 173138, xip = 0x2d53288, xcnt = 178, subxip = 0x0, subxcnt = 0,
>   suboverflowed = true, takenDuringRecovery = false, copied = true, curcid = 0, speculativeToken = 0, active_count = 1,
>   regd_count = 2, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x2d52e48}, whenTaken = 0, lsn = 0}
> (gdb)
>

Hi Pengcheng,

I followed all your steps.

However, I perhaps get different behavior in my environment.
99% of the time, the xmin and xmax of the active_snapshot and transaction_snapshot are the same (regardless of how long I wait at different points after the breakpoint is hit). I've had one or two instances where the xmax values differ. I managed to catch just one case where there were different xmin and xmax values in the snapshots, but this occurred just prior to the pgbench client completing and terminating, and when I continued in the debugger, there was no crash/coredump.

However, I think I've spotted something potentially important to this issue:
For me, almost always "suboverflowed = false" in the snapshots (except in that one case, right at the end of the pgbench run), yet in your gdb example "suboverflowed = true" in both of the snapshots (i.e. the snapshot subxip array has overflowed). I'm guessing that this may be related to the coredump issue, but I'm not exactly sure how it has happened, and why it seemingly isn't being handled correctly and causes that Assert to fire in your case.

Can you try and find out how the snapshot suboverflow is being set in your case? (since you are getting this readily in your examined snapshots???) I think there's only several places where it can be set to "true" (e.g. procarray.c:1641).

Also, does increasing PGPROC_MAX_CACHED_SUBXIDS avoid, or delay, the problem for you? It's currently defined as 64.

I notice that there's been some changes related to snapshot data handling and subxid overflow since 13.2, so I'm wondering whether your coredump issue can be reproduced with the latest code?

Regards,
Greg Nancarrow
Fujitsu Australia
Hi Greg,
When you get different xmins between the active snapshot and the transaction snapshot, there may still be no coredump,
because there may be no tuple whose xmin lies between ActiveSnapshot->xmin and TransactionSnapshot->xmin that needs to be scanned by the parallel process.
There is no doubt that it is very likely for ActiveSnapshot->xmin to precede TransactionSnapshot->xmin.
For this coredump, we must make sure the scan is parallel and the snapshot has overflowed. If the snapshot has not overflowed, you cannot get the coredump,
because the coredump comes from the MVCC visibility check in a parallel scan when the snapshot is overflowed.
Did you use pgbench with the script sub_120.sql which I provided in the attachment?
The default PGPROC_MAX_CACHED_SUBXIDS is 64. In the script sub_120.sql, one transaction uses 120 subtransactions, which is much larger than 64.
So while getting a snapshot, it must be overflowed. I really don't know why your snapshot is not overflowed.
Did you increase PGPROC_MAX_CACHED_SUBXIDS? Please don't change any code; for now we just use the original PG 13.2 code.
I have checked the code in the master branch; there is no change to this mechanism, so the issue should still exist.
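For illustration, a tiny standalone sketch (not PostgreSQL source; PGPROC_MAX_CACHED_SUBXIDS = 64 is the PG 13 default, and 120 is the per-transaction subtransaction count in sub_120.sql as described above) of why every snapshot taken during this test must be marked suboverflowed, which is what forces XidInMVCCSnapshot into the SubTransGetTopmostTransaction path:
```
/*
 * Standalone sketch: each backend publishes at most PGPROC_MAX_CACHED_SUBXIDS
 * subtransaction xids in shared memory; once a transaction creates more than
 * that, GetSnapshotData() can only mark the snapshot suboverflowed, and
 * readers must fall back to pg_subtrans (SubTransGetTopmostTransaction).
 */
#include <stdbool.h>
#include <stdio.h>

#define PGPROC_MAX_CACHED_SUBXIDS 64	/* default in PG 13 */

int
main(void)
{
	int		subxacts_per_xact = 120;	/* subtransactions per transaction in sub_120.sql */
	bool	suboverflowed = subxacts_per_xact > PGPROC_MAX_CACHED_SUBXIDS;

	printf("snapshot suboverflowed: %s\n", suboverflowed ? "true" : "false");
	printf("=> XidInMVCCSnapshot() must call SubTransGetTopmostTransaction()\n");
	return 0;
}
```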
Thanks
Pengcheng
-----Original Message-----
From: Greg Nancarrow <gregn4422@gmail.com>
Sent: May 14, 2021 16:47
To: Pengchengliu <pengchengliu@tju.edu.cn>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
On Fri, May 14, 2021 at 12:25 PM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Greg,
>
> Thanks for you replay and test.
>
>
>
> When main process gets the transaction snapshot in InitializeParallelDSM->GetTransactionSnapshot, the transaction snapshot xmin is very likely follows active snapshot xmin.
>
>
>
> Use gdb it is easy to verify it.
>
> Create env as blow:
>
>
>
> 1, Use PG13.2(3fb4c75e857adee3da4386e947ba58a75f3e74b7), init a cluster database.
>
> 2, Append the postgres.conf as below:
>
>
>
> max_connections = 2000
>
> parallel_setup_cost=0
>
> parallel_tuple_cost=0
>
> min_parallel_table_scan_size=0
>
> max_parallel_workers_per_gather=8
>
> max_parallel_workers = 128
>
>
>
> 3, Start the cluster database. Use the init_test.sql script in attachment to create some test tables.
>
> 4, Use the sub_120.sql script in attachment with pgbench to test it.
>
>
>
> pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 1800
>
>
>
>
>
>
>
> Then you can login the database, and use GDB to verify it.
>
> 1, First use explain, make sure force Parallel is OK.
>
>
>
> postgres=# explain (verbose,analyze) select count(*) from contend1;
>
>
> QUERY PLAN
>
>
>
> ----------------------------------------------------------------------
> ---------------------------------------------------------------
>
> ----------------
>
> Finalize Aggregate (cost=12006.11..12006.12 rows=1 width=8) (actual
> time=1075.214..1075.449 rows=1 loops=1)
>
> Output: count(*)
>
> -> Gather (cost=12006.08..12006.09 rows=8 width=8) (actual
> time=1075.198..1075.433 rows=1 loops=1)
>
> Output: (PARTIAL count(*))
>
> Workers Planned: 8
>
> Workers Launched: 0
>
> -> Partial Aggregate (cost=12006.08..12006.09 rows=1
> width=8) (actual time=1074.674..1074.676 rows=1 loops=1)
>
> Output: PARTIAL count(*)
>
> -> Parallel Seq Scan on public.contend1
> (cost=0.00..11690.06 rows=126406 width=0) (actual time=0.008..587.454
> rows=1
>
> 010200 loops=1)
>
> Output: id, val, c2, c3, c4, c5, c6, c7, c8, c9,
> c10, crt_time
>
> Planning Time: 0.123 ms
>
> Execution Time: 1075.588 ms
>
> postgres=# select pg_backend_pid();
>
> pg_backend_pid
>
> ----------------
>
> 2182678
>
>
>
>
>
> 2, use gdb to debug our backend process. Add the breakpoint in parallel.c:219 and continue.
>
>
>
> gdb -q -p 2182678
>
> ...
>
> (gdb) b parallel.c:219
>
> Breakpoint 1 at 0x55d085: file /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c, line 219.
>
> (gdb) c
>
> Continuing.
>
>
>
> 3, In the psql clinet, we can execute the explain command in step1 again.
>
> After we get the breakpoint in gdb, we wait a moment. Then we execute next.
>
> Use gdb check active_snapshot and transaction_snapshot, active_snapshot->xmin is 158987 and transaction_snapshot->xmin is 162160.
>
> When I use gdb test it, sometimes active_snapshot is the same as transaction_snapshot. Then you can try it multiple times, and before execute next, try wait longer time.
>
>
>
> Breakpoint 1, InitializeParallelDSM (pcxt=0x2d53670)
>
> at
> /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/ac
> cess/transam/parallel.c:219
>
> 219 Snapshot transaction_snapshot = GetTransactionSnapshot();
>
> (gdb) n
>
> 220 Snapshot active_snapshot = GetActiveSnapshot();
>
> (gdb)
>
> 223 oldcontext = MemoryContextSwitchTo(TopTransactionContext);
>
> (gdb) p *transaction_snapshot
>
> $1 = {snapshot_type = SNAPSHOT_MVCC, xmin = 162160, xmax = 183011, xip
> = 0x2d50d10, xcnt = 179, subxip = 0x148a9cddf010,
>
> subxcnt = 0, suboverflowed = true, takenDuringRecovery = false,
> copied = false, curcid = 0, speculativeToken = 0,
>
> active_count = 0, regd_count = 0, ph_node = {first_child = 0x0,
> next_sibling = 0x0, prev_or_parent = 0x0}, whenTaken = 0, lsn = 0}
>
> (gdb) p *active_snapshot
>
> $2 = {snapshot_type = SNAPSHOT_MVCC, xmin = 158987, xmax = 173138, xip
> = 0x2d53288, xcnt = 178, subxip = 0x0, subxcnt = 0,
>
> suboverflowed = true, takenDuringRecovery = false, copied = true,
> curcid = 0, speculativeToken = 0, active_count = 1,
>
> regd_count = 2, ph_node = {first_child = 0x0, next_sibling = 0x0,
> prev_or_parent = 0x2d52e48}, whenTaken = 0, lsn = 0}
>
> (gdb)
>
>
Hi Pengcheng,
I followed all your steps.
However, I perhaps get different behavior in my environment.
99% of the time, the xmin and xmax of the active_snapshot and transaction_snapshot are the same (regardless of how long I wait at different points after the breakpoint is hit). I've had one or two instances where the xmax values differ. I managed to catch just one case where there were different xmin and xmax values in the snapshots, but this occurred just prior to the pgbench client completing and terminating, and when I continued in the debugger, there was no crash/coredump.
However, I think I've spotted something potentially important to this issue:
For me, almost always "suboverflowed = false" in the snapshots (except in that one case, right at the end of the pgbench run), yet in your gdb example "suboverflowed = true" in both of the snapshots (i.e. the snapshot subxip array has overflowed). I'm guessing that this may be related to the coredump issue, but I'm not exactly sure how it has happened, and why it seemingly isn't being handled correctly and causes that Assert to fire in your case.
Can you try and find out how the snapshot suboverflow is being set in your case? (since you are getting this readily in your examined
snapshots???) I think there's only several places where it can be set to "true" (e.g. procarray.c:1641).
Also, does increasing PGPROC_MAX_CACHED_SUBXIDS avoid, or delay, the problem for you? It's currently defined as 64.
I notice that there's been some changes related to snapshot data handling and subxid overflow since 13.2, so I'm wondering whether your coredump issue can be reproduced with the latest code?
Regards,
Greg Nancarrow
Fujitsu Australia
On Fri, May 14, 2021 at 8:36 PM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
> Did you use pgbench with the script sub_120.sql which I provide in attachment?

yes

>
> Did you increase the number PGPROC_MAX_CACHED_SUBXIDS? Please don't change any codes, now we just use the origin codes in PG13.2.
>

No, I have made no source code changes at all.
That was my suggestion, for you to try - because if the problem is avoided by increasing PGPROC_MAX_CACHED_SUBXIDS (to say 128) then it probably indicates the overflow condition is affecting the xmin,xmax of the two snapshots such that it invalidates the condition that is asserted.

I think one problem is that in your settings, you haven't set "max_worker_processes", yet have set "max_parallel_workers = 128".
I'm finding no more than 8 parallel workers are actually active at any one time.
On top of this, you've got pgbench running with 200 concurrent clients.
So many queries are actually executing parallel plans without using parallel workers, as the workers can't actually be launched (and this is probably why I'm finding it hard to reproduce the issue, if the problem involves snapshot suboverflow and parallel workers).

I find that the following settings improve the parallelism per query and the whole test runs very much faster:

max_connections = 2000
parallel_setup_cost=0
parallel_tuple_cost=0
min_parallel_table_scan_size=0
max_parallel_workers_per_gather=4
max_parallel_workers = 100
max_worker_processes = 128

and adjust the pgbench command-line:

pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 25 -j 25 -T 1800

Problem is, I still get no coredump when using this.
Can you try these settings and let me know if the crash still happens if you use these settings?

I also tried:

max_connections = 2000
parallel_setup_cost=0
parallel_tuple_cost=0
min_parallel_table_scan_size=0
max_parallel_workers_per_gather=2
max_parallel_workers = 280
max_worker_processes = 300

and the pgbench command-line:

pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 140 -j 140 -T 1800

- but I still get no coredump.

Regards,
Greg Nancarrow
Fujitsu Australia
Hi Tom & Robert, Could you review this Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)) in SubTransGetTopmostTransaction. I think this assert is unsuitable for parallel work process. Before we discuss it in https://www.postgresql-archive.org/Parallel-scan-with-SubTransGetTopmostTransaction-assert-coredump-td6197408.html Thanks Pengcheng -----Original Message----- From: Greg Nancarrow <gregn4422@gmail.com> Sent: 2021年5月15日 0:44 To: Pengchengliu <pengchengliu@tju.edu.cn> Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org> Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump On Fri, May 14, 2021 at 8:36 PM Pengchengliu <pengchengliu@tju.edu.cn> wrote: > Did you use pgbench with the script sub_120.sql which I provide in attachment? yes > > Did you increase the number PGPROC_MAX_CACHED_SUBXIDS? Please don't change any codes, now we just use the origin codesin PG13.2. > No, I have made no source code changes at all. That was my suggestion, for you to try - because if the problem is avoided by increasing PGPROC_MAX_CACHED_SUBXIDS (to say128) then it probably indicates the overflow condition is affecting the xmin.xmax of the two snapshots such that it invalidatesthe condition that is asserted. I think one problem is that in your settings, you haven't set "max_worker_processes", yet have set "max_parallel_workers= 128". I'm finding no more than 8 parallel workers are actually active at any one time. On top of this, you've got pgbench running with 200 concurrent clients. So many queries are actually executing parallel plans without using parallel workers, as the workers can't actually be launched(and this is probably why I'm finding it hard to reproduce the issue, if the problem involves snapshot suboverflowand parallel workers). I find that the following settings improve the parallelism per query and the whole test runs very much faster: max_connections = 2000 parallel_setup_cost=0 parallel_tuple_cost=0 min_parallel_table_scan_size=0 max_parallel_workers_per_gather=4 max_parallel_workers = 100 max_worker_processes = 128 and adjust the pgbench command-line: pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 25 -j 25 -T 1800 Problem is, I still get no coredump when using this. Can you try these settings and let me know if the crash still happens if you use these settings? I also tried: max_connections = 2000 parallel_setup_cost=0 parallel_tuple_cost=0 min_parallel_table_scan_size=0 max_parallel_workers_per_gather=2 max_parallel_workers = 280 max_worker_processes = 300 and the pgbench command-line: pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 140 -j 140 -T 1800 - but I still get no coredump. Regards, Greg Nancarrow Fujitsu Australia
On Sat, May 15, 2021 at 12:37 PM 刘鹏程 <pengchengliu@tju.edu.cn> wrote:
>
>   BTW, I test it in a high performance server. It is verly easily be reproduced. My colleague and me use different environment both can reproduce it.
>

Hi Pengcheng,

Although the issue won't reproduce easily in my system, I can certainly see how, for the snapshots used in the parallel worker case, the Active snapshot used is potentially an earlier snapshot than the Transaction snapshot. I don't know why it is getting a newer Transaction snapshot in InitializeParallelDSM(), when it has previously pushed the return value of GetTransactionSnapshot() as the Active snapshot.

So I too hope Tom or Robert can explain what is going on here and how to resolve it (as you requested them to, in your other post).

I actually think that the Assert in SubTransGetTopmostTransaction() is correct, but in the parallel-worker case, the snapshots are not being setup correctly.

Can you try the trivial change below and see if it prevents the coredump?

Regards,
Greg Nancarrow
Fujitsu Australia

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 14a8690019..870889053f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -216,7 +216,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 	int			i;
 	FixedParallelState *fps;
 	dsm_handle	session_dsm_handle = DSM_HANDLE_INVALID;
-	Snapshot	transaction_snapshot = GetTransactionSnapshot();
+	Snapshot	transaction_snapshot = GetActiveSnapshot();
 	Snapshot	active_snapshot = GetActiveSnapshot();
 
 	/* We might be running in a very short-lived memory context. */
Hi Greg,

> I actually think that the Assert in SubTransGetTopmostTransaction() is correct, but in the parallel-worker case, the snapshots are not being setup correctly.

I agree with you that the Assert in SubTransGetTopmostTransaction() is correct. The root cause is that TransactionXmin is not being set up correctly in the parallel worker.

Actually I am very confused about ActiveSnapshot and TransactionSnapshot. I don't know why the main process sends ActiveSnapshot and TransactionSnapshot separately. And what is the exact difference between them?
If you know, could you explain it to me? It would be very much appreciated.

Before we know them exactly, I think we should not modify the TransactionSnapshot to the ActiveSnapshot in the main process. If we do that, the main process should send the ActiveSnapshot only.

Thanks
Pengcheng

-----Original Message-----
From: Greg Nancarrow <gregn4422@gmail.com>
Sent: May 17, 2021 20:59
To: 刘鹏程 <pengchengliu@tju.edu.cn>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

On Sat, May 15, 2021 at 12:37 PM 刘鹏程 <pengchengliu@tju.edu.cn> wrote:
>
>   BTW, I test it in a high performance server. It is verly easily be reproduced. My colleague and me use different environment both can reproduce it.
>

Hi Pengcheng,

Although the issue won't reproduce easily in my system, I can certainly see how, for the snapshots used in the parallel worker case, the Active snapshot used is potentially an earlier snapshot than the Transaction snapshot. I don't know why it is getting a newer Transaction snapshot in InitializeParallelDSM(), when it has previously pushed the return value of GetTransactionSnapshot() as the Active snapshot.

So I too hope Tom or Robert can explain what is going on here and how to resolve it (as you requested them to, in your other post).

I actually think that the Assert in SubTransGetTopmostTransaction() is correct, but in the parallel-worker case, the snapshots are not being setup correctly.

Can you try the trivial change below and see if it prevents the coredump?

Regards,
Greg Nancarrow
Fujitsu Australia

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 14a8690019..870889053f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -216,7 +216,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 	int			i;
 	FixedParallelState *fps;
 	dsm_handle	session_dsm_handle = DSM_HANDLE_INVALID;
-	Snapshot	transaction_snapshot = GetTransactionSnapshot();
+	Snapshot	transaction_snapshot = GetActiveSnapshot();
 	Snapshot	active_snapshot = GetActiveSnapshot();
 
 	/* We might be running in a very short-lived memory context. */
On Tue, May 18, 2021 at 11:27 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Greg,
>
> Actually I am very confused about ActiveSnapshot and TransactionSnapshot. I don't know why the main process sends ActiveSnapshot and TransactionSnapshot separately. And what is the exact difference between them?
> If you know, could you explain that for me? It will be very much appreciated.

In the context of a parallel-worker, I am a little confused too, so I can't explain it either.
It is not really explained in the file "src\backend\access\transam\README.parallel"; it only mentions the following as part of the state that needs to be copied to each worker:

- The transaction snapshot.
- The active snapshot, which might be different from the transaction snapshot.

So they might be different, but exactly when and why?

When I debugged a typical parallel-SELECT case, I found that prior to plan execution, GetTransactionSnapshot() was called and its return value was stored in both the QueryDesc and the estate (es_snapshot), which was then pushed on the ActiveSnapshot stack.
So by the time InitializeParallelDSM() was called, the (top) ActiveSnapshot was the last snapshot returned from GetTransactionSnapshot().
So why InitializeParallelDSM() calls GetTransactionSnapshot() again is not clear to me (because isn't then the ActiveSnapshot a potentially earlier snapshot? - which it shouldn't be, AFAIK. And also, it's then different to the non-parallel case).

> Before we know them exactly, I think we should not modify the TransactionSnapshot to the ActiveSnapshot in the main process. If we do, the main process should send the ActiveSnapshot only.

I think it would be worth you trying my suggested change (if you have a development environment, which I assume you have). Sure, IF it was deemed a proper solution, you'd only send the one snapshot, and adjust accordingly in ParallelWorkerMain(), but we need not worry about that in order to test it.

Regards,
Greg Nancarrow
Fujitsu Australia
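For reference, the flow being questioned here looks roughly like this, paraphrasing the PG13/PG14-era parallel.c (abbreviated, not an exact excerpt; "leader_pgproc" stands in for the leader PGPROC pointer carried in the FixedParallelState): the leader serializes two snapshots into the DSM, and each worker restores them separately.
```
/* Leader side, InitializeParallelDSM() -- paraphrased sketch */
Snapshot	transaction_snapshot = GetTransactionSnapshot(); /* may build a NEW snapshot */
Snapshot	active_snapshot = GetActiveSnapshot();           /* the snapshot the query is running with */
/* ... */
SerializeSnapshot(transaction_snapshot, tsnapspace);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, tsnapspace);
SerializeSnapshot(active_snapshot, asnapspace);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, asnapspace);

/* Worker side, ParallelWorkerMain() -- paraphrased sketch */
tsnapspace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, false);
RestoreTransactionSnapshot(RestoreSnapshot(tsnapspace), leader_pgproc); /* this is what sets TransactionXmin */
asnapspace = shm_toc_lookup(toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, false);
PushActiveSnapshot(RestoreSnapshot(asnapspace));
```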
> To: Pengchengliu <pengchengliu@tju.edu.cn>
> Cc: Greg Nancarrow <gregn4422@gmail.com>; Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
> Subject: Re: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
>
> I've also seen the reports of the same Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)) with a subsequent crash in a parallel worker in a PostgreSQL v11-based
> build, though I was unable to investigate deeper and reproduce the issue. The details above in the thread make me think it is a real and long-time-persistent error that is
> surely worth fixing.

I followed Liu's reproduce steps and successfully reproduced it in about half an hour of running.
My compile options are: "./configure --enable-cassert --prefix=/home/pgsql".

After applying Greg-san's change, the coredump has not happened in two hours (it is still running).
Note, I have not taken a deep look into the change, just providing some test information in advance.

Best regards,
houzj
On Tue, May 18, 2021 at 9:41 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote:
>
> > I've also seen the reports of the same Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin)) with a subsequent crash in a parallel worker in a PostgreSQL v11-based
> > build, though I was unable to investigate deeper and reproduce the issue. The details above in the thread make me think it is a real and long-time-persistent error that is
> > surely worth fixing.
>
> I followed Liu's reproduce steps and successfully reproduced it in about half an hour of running.
> My compile options are: "./configure --enable-cassert --prefix=/home/pgsql".
>
> After applying Greg-san's change, the coredump has not happened in two hours (it is still running).
> Note, I have not taken a deep look into the change, just providing some test information in advance.
>

+1
Thanks for doing that.
I'm unsure if that "fix" is the right approach, so please investigate it too.

Regards,
Greg Nancarrow
Fujitsu Australia
Hi Greg,

Thanks a lot for your explanation and your fix.

I think your fix can resolve the core dump issue. With your fix, the parallel worker process resets TransactionXmin from the ActiveSnapshot.
But it will change the Transaction snapshot for all parallel scenarios. I don't know whether it brings in other issues. For testing only, I think it is enough.

So is there anybody who can explain what exactly the difference is between ActiveSnapshot and TransactionSnapshot in the parallel worker process?
Then maybe we can find a better solution and try to really fix it.

Thanks
Pengcheng

-----Original Message-----
From: Greg Nancarrow <gregn4422@gmail.com>
Sent: 2021年5月18日 17:15
To: Pengchengliu <pengchengliu@tju.edu.cn>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

On Tue, May 18, 2021 at 11:27 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Greg,
>
> Actually I am very confused about ActiveSnapshot and TransactionSnapshot. I don't know why the main process sends ActiveSnapshot and TransactionSnapshot separately. And what is the exact difference between them?
> If you know, could you explain that for me? It will be very much appreciated.

In the context of a parallel-worker, I am a little confused too, so I can't explain it either.
It is not really explained in the file "src\backend\access\transam\README.parallel"; it only mentions the following as part of the state that needs to be copied to each worker:

- The transaction snapshot.
- The active snapshot, which might be different from the transaction snapshot.

So they might be different, but exactly when and why?

When I debugged a typical parallel-SELECT case, I found that prior to plan execution, GetTransactionSnapshot() was called and its return value was stored in both the QueryDesc and the estate (es_snapshot), which was then pushed on the ActiveSnapshot stack.
So by the time InitializeParallelDSM() was called, the (top) ActiveSnapshot was the last snapshot returned from GetTransactionSnapshot().
So why InitializeParallelDSM() calls GetTransactionSnapshot() again is not clear to me (because isn't then the ActiveSnapshot a potentially earlier snapshot? - which it shouldn't be, AFAIK. And also, it's then different to the non-parallel case).

> Before we know them exactly, I think we should not modify the TransactionSnapshot to the ActiveSnapshot in the main process. If we do, the main process should send the ActiveSnapshot only.

I think it would be worth you trying my suggested change (if you have a development environment, which I assume you have). Sure, IF it was deemed a proper solution, you'd only send the one snapshot, and adjust accordingly in ParallelWorkerMain(), but we need not worry about that in order to test it.

Regards,
Greg Nancarrow
Fujitsu Australia
On Thu, May 20, 2021 at 11:18 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:
>
> Hi Greg,
> Thanks a lot for your explanation and your fix.
>
> I think your fix can resolve the core dump issue. With your fix, the parallel worker process resets TransactionXmin from the ActiveSnapshot.
> But it will change the Transaction snapshot for all parallel scenarios. I don't know whether it brings in other issues.
> For testing only, I think it is enough.
>
> So is there anybody who can explain what exactly the difference is between ActiveSnapshot and TransactionSnapshot in the parallel worker process?
> Then maybe we can find a better solution and try to really fix it.
>

I am proposing the attached patch to fix this issue (patches for both PG13.2 and the latest PG14 source are provided).

Perhaps this will trigger others who better know the intricacies of snapshot handling, and know the reasons and history behind why the transaction_snapshot and active_snapshot are currently passed separately to parallel workers, to speak up here and discuss the issue further, and check my fix (and note, I haven't attempted to modify README.parallel in the patch until I get further clarification on this issue).

Perhaps someone can explain the purpose of calling GetTransactionSnapshot() AGAIN in InitializeParallelDSM() and how this is consistent with the current ActiveSnapshot? AFAICS, that doesn't seem correct, and that's why the patch removes it.

I've rebuilt Postgres with the patch applied and run the regression tests, with and without "force_parallel_mode=regress", and all tests pass. So if the fix is wrong, no tests currently exist that detect issues with it.

Regards,
Greg Nancarrow
Fujitsu Australia
On Thu, May 20, 2021 at 4:08 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > Keep cfbot happy, use the PG14 patch as latest. Regards, Greg Nancarrow Fujitsu Australia
On Mon, May 24, 2021 at 12:04:37PM +1000, Greg Nancarrow wrote: > Keep cfbot happy, use the PG14 patch as latest. This stuff is usually very tricky. Do we have a way to reliably reproduce the report discussed here? -- Michael
On Mon, May 24, 2021 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, May 24, 2021 at 12:04:37PM +1000, Greg Nancarrow wrote:
> > Keep cfbot happy, use the PG14 patch as latest.
>
> This stuff is usually very tricky.

Agreed. That's why I was looking for experts in this snapshot-handling code, to look closer at this issue, check my proposed fix, come up with a better solution etc.

> Do we have a way to reliably
> reproduce the report discussed here?

I couldn't reproduce it in my environment (though I could understand what was going wrong, based on the description provided).
houzj (houzj.fnst@fujitsu.com) was able to reproduce it in his environment and kindly provided to me the following information:
(He said that he followed most of the steps described by the original problem reporter, Pengcheng, but perhaps steps 2 and 7 are a little different from his steps. See the emails higher in the thread for the two scripts "init_test.sql" and "sub_120.sql")

===
1. Modify and adjust NUM_SUBTRANS_BUFFERS to 128 from 32 in the file "src/include/access/subtrans.h", line number 15.
2. Configure with asserts enabled and build it (./configure --enable-cassert --prefix=/home/pgsql).
3. Init a new database cluster.
4. Modify postgres.conf and add some parameters as below. As the coredump comes from a parallel scan, we adjust the parallel settings to make it easy to reproduce.

   max_connections = 2000
   parallel_setup_cost=0
   parallel_tuple_cost=0
   min_parallel_table_scan_size=0
   max_parallel_workers_per_gather=8
   max_parallel_workers = 32

5. Start the database cluster.
6. Use the script init_test.sql in the attachment to create tables.
7. Use pgbench with the script sub_120.sql in the attachment to test it. Run it a few times and you should get the coredump file.

   pgbench -d postgres -p 33550 -n -r -f sub_120.sql -c 200 -j 200 -T 12000

(If you cannot reproduce it, maybe you can try running two parallel pgbench runs at the same time.)

In my environment (CentOS 8.2, 128G RAM, 40 processors, SAS disk, Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz), sometimes I can reproduce it in about 5 minutes, but sometimes it needs about half an hour.

Best regards,
houzj
===

Regards,
Greg Nancarrow
Fujitsu Australia
On Mon, May 24, 2021 at 11:56 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> Using a recipe similar to what has been described above in the thread, I reliably reproduced the bug in many Postgres versions (v11, v13 etc.).
> 1. Make & make install
> 2. Make check
> 3. run SubTransGetTopmostTransaction-rep.sh in the Postgres source code dir.
>

Thanks, using the latest PG14 code (713a431c781fbfe1a22fae4991836077f0f4c513), with no modifications, and building with "--enable-cassert", your script reproduced the coredump on my system in about 10 minutes, on the 1st run.

#0 0x00007f1542d1137f in raise () from /lib64/libc.so.6
#1 0x00007f1542cfbdb5 in abort () from /lib64/libc.so.6
#2 0x00000000009380cb in ExceptionalCondition ()
#3 0x00000000005aadbc in SubTransGetTopmostTransaction ()
#4 0x00000000009741ef in XidInMVCCSnapshot ()
#5 0x0000000000563326 in HeapTupleSatisfiesVisibility ()
#6 0x0000000000559d50 in heapgetpage ()
#7 0x000000000055a0e9 in heapgettup_pagemode ()
#8 0x000000000055b98e in heap_getnextslot ()
#9 0x00000000006df713 in SeqNext ()
#10 0x00000000006b725f in ExecScan ()
#11 0x00000000006bef41 in fetch_input_tuple ()
#12 0x00000000006c1f6f in ExecAgg ()
#13 0x00000000006ade83 in standard_ExecutorRun ()
#14 0x00000000006b21ed in ParallelQueryMain ()
#15 0x00000000005a8c11 in ParallelWorkerMain ()
#16 0x0000000000795f3c in StartBackgroundWorker ()
#17 0x00000000007a1bc1 in maybe_start_bgworkers ()
#18 0x00000000007a25c5 in sigusr1_handler ()
#19 <signal handler called>
#20 0x00007f1542dce25b in select () from /lib64/libc.so.6
#21 0x00000000007a2cf6 in ServerLoop ()
#22 0x00000000007a4659 in PostmasterMain ()
#23 0x00000000005042ed in main ()

(Using a debug build, it seems much harder to reproduce on my system.)

With my patch applied, I've had no crashes, from many runs.

Regards,
Greg Nancarrow
Fujitsu Australia
Just a note here. After examining the core dump I did notice something.
While in the XidInMVCCSnapshot call, snapshot->suboverflowed is set to true although subxip == NULL and subxcnt == 0. As far as I understand, snapshot->suboverflowed is set to true in the GetRunningTransactionData call.

And then I decided to put an elog around CurrentRunningXacts->subxcnt's assignment.

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 42a89fc5dc9..3d2db02f580 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2781,6 +2781,9 @@ GetRunningTransactionData(void)
 	 * increases if slots do.
 	 */

+	if (suboverflowed)
+		elog(WARNING, " >>> CurrentRunningXacts->subxid_overflow is true");
+
 	CurrentRunningXacts->xcnt = count - subcount;
 	CurrentRunningXacts->subxcnt = subcount;
 	CurrentRunningXacts->subxid_overflow = suboverflowed;

... and did get a bunch of messages. I.e. subxid_overflow is set to true very often.

I've increased the value of PGPROC_MAX_CACHED_SUBXIDS. Once it becomes more than 120, there are no messages and no failed assertions are produced any more.

---
Best regards,
Maxim Orlov.
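For context, the overflow Maxim observes is governed by how many subtransaction XIDs fit in each backend's PGPROC cache; the reproducer's sub_120.sql drives roughly 120 subtransactions per transaction, well past that limit. The values below are taken from the headers of that era and are worth double-checking against the exact branch being tested:
```
/* src/include/storage/proc.h: per-backend cache of subtransaction XIDs.
 * When a transaction has more subxacts than this, the "suboverflowed" flag
 * is set and readers must fall back to pg_subtrans lookups. */
#define PGPROC_MAX_CACHED_SUBXIDS	64

/* src/include/access/subtrans.h: default number of SLRU buffers for
 * pg_subtrans (bumped to 128 in step 1 of the reproduction recipe above). */
#define NUM_SUBTRANS_BUFFERS	32
```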
>Just a note here. After examining the core dump I did notice something.
>While in XidInMVCCSnapshot call the snapshot->suboverflowed is set true
>although subxip == NULL and subxcnt == 0. As far as I understand,
>snapshot->suboverflowed is set true in the GetRunningTransactionData
>call.
>And then I decided to put elog around CurrentRunningXacts->subxcnt's
>assignment.
>diff --git a/src/backend/storage/ipc/procarray.c
>b/src/backend/storage/ipc/procarray.c
>index 42a89fc5dc9..3d2db02f580 100644
>--- a/src/backend/storage/ipc/procarray.c
>+++ b/src/backend/storage/ipc/procarray.c
>@@ -2781,6 +2781,9 @@ GetRunningTransactionData(void)
> * increases if slots do.
> */
>+ if (suboverflowed)
>+ elog(WARNING, " >>> CurrentRunningXacts->subxid_overflow
>is true");
>+
> CurrentRunningXacts->xcnt = count - subcount;
> CurrentRunningXacts->subxcnt = subcount;
> CurrentRunningXacts->subxid_overflow = suboverflowed;
>... and did get a bunch of messages. I.e. subxid_overflow is set true
>very often.
>I've increased the value of PGPROC_MAX_CACHED_SUBXIDS. Once it becomes
>more than 120 there are no messages and no failed assertions are
>provided any more.
Please, avoid using decimal based values.
128 is a multiple of 64.
See:
15.6.1 The Power of Powers of 2
regards,
Ranier Vilela
Please, avoid using decimal based values.
128 is a multiple of 64.
It's true that 128 is better to use than 120, but the main problem is not in the value; it's in the fact that, once the value is more than 120, we never get CurrentRunningXacts->subxid_overflow = suboverflowed set to true. This solves the problem, but it doesn't seem the right way to fix the issue.
Instead, it's better to handle the suboverflowed state, which is legitimate in itself, without it resulting in a crash on the Assert.
So the discussion of a "better" value doesn't seem related to the problem. It was for demonstration only.
I've been studying commits, and on several occasions problems have been fixed like this.
But what is important is documenting the problem.
The analysis at the beginning of the discussion seems to be right, but the v2 fix looks too invasive to me.

Personally, I'd like not to remove the snapshot even if the transaction is read-only. I propose to consider "xid < TransactionXmin" as a legit case and just promote xid to TransactionXmin.

It's annoying that this old bug is still not fixed. What do you think?

---
Best regards,
Maxim Orlov.
On Wed, Jun 23, 2021 at 1:06 AM Maxim Orlov <m.orlov@postgrespro.ru> wrote: > > The analysis in the beginning of the discussion seems to be right, but > the fix v2 looks too invasive for me. > > Personally, I'd like not to remove snapshot even if transaction is > read-only. I propose to consider "xid < TransactionXmin" as a legit case > and just promote xid to TransactionXmin. > > It's annoying this old bug still not fixed. What do you think? This v3 patch doesn't look right to me at all. It's not addressing the fundamental problem, it just seems to be working around it, by fiddling a xid value to avoid an assertion failure. I really can't see how the v2 patch "removes a snapshot" in the read-only transaction case, and is "invasive". Have you found a case that the v2 patch breaks? The v2 patch does follow the analysis in the beginning of the discussion, which identified that in setting up parallel workers, a "later transaction snapshot" was taken than the one actually used in the statement execution, and that's what eventually leads to the observed Assertion failure. The original problem reporter stated: "So the root cause is the Parallel Workers process set the TransactionXmin with later transcation snapshot". Also, as far as I can see, it makes no sense to pass parallel workers both an active snapshot and a (later) transaction snapshot. In the leader, prior to execution and running parallel workers, a transaction snapshot is obtained and pushed as the active snapshot (note: only ONE snapshot at this point). It makes little sense to me to then obtain ANOTHER transaction snapshot when setting up parallel workers, and pass that (only) to the workers along with the original (earlier) active snapshot. (If there is a reason for passing the TWO snapshots to parallel workers, original code authors and/or snapshot experts should speak up now, and code comments should be added accordingly to explain the purpose and how it is MEANT to work) This is why the v2 patch updates the code to just pass one snapshot (the active snapshot) to the parallel workers, which gets restored in each worker as the transaction snapshot and set as the active snapshot (so is then consistent with the snapshot setup in the parallel leader). The v3 patch doesn't address any of this at all. It is just checking if the xid is < the TransactionXMin in the snapshot, and if so, setting the xid to be TransactionXMin, and thus avoiding the Assertion failure. But that TransactionXMin was from the "later transaction snapshot", which was only obtained for the workers and doesn't necessarily match the earlier active snapshot taken, that is associated with the actual statement execution. As far as I am concerned, that "later transaction snapshot", taken for use by the workers, doesn't seem right, so that's why the v2 patch removes the use of it and only uses the active snapshot, for consistency with the rest of the code. I think that if there was something fundamentally wrong with the v2 patch's removal of that "later transaction snapshot" used by the parallel workers, then surely there would be a test failure somewhere - but that is not the case. I think the v2 patch should be restored as the proposed solution here. Regards, Greg Nancarrow Fujitsu Australia
to be the minimum of the top transaction id and the snapshots.
Hi,

I took a quick look on this - I'm no expert in the details of snapshots, so take my comments with a grain of salt.

AFAICS both Greg Nancarrow and Pavel Borisov are kinda right. I think Greg is right that the v3 patch does not seem like the right (or correct) solution, for a couple of reasons:

1) It fixes the symptoms, not the cause. If we set TransactionXmin to a bogus value, this only fixes it locally in SubTransGetTopmostTransaction, but I'd be surprised if other places did not have the same issue.

2) The xid/TransactionXmin comparison in the v3 fix:

    xid = xid > TransactionXmin ? xid : TransactionXmin;

seems broken, because it compares the XIDs directly, but that's not going to work after generating enough transactions.

3) But even if this uses TransactionIdFollowsOrEquals, it seems very strange because the "xid" comes from

    XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))

i.e. it's the raw xmin from the tuple, so why should we be setting it to TransactionXmin? That seems pretty strange, TBH.

So yeah, I think this is due to confusion with two snapshots and failing to consider both of them when calculating TransactionXmin.

But I think removing one of the snapshots (as the v2 does it) is rather strange too. I very much doubt having both the transaction and active snapshots in the parallel worker is not intentional, and Pavel may very well be right that this breaks isolation levels that use the xact snapshot (i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though.

So I think we need to keep both snapshots, but fix TransactionXmin to consider both of them - I suppose ParallelWorkerMain could simply look at the two snapshots, and use the minimum. Which is what [1] (already linked by Pavel) talks about, although that's related to concerns about one of the processes dying, which is not what's happening here.

I'm wondering what consequences this may have on production systems, though. We've only seen this failing because of the assert, so what happens when the build has asserts disabled?

Looking at SubTransGetTopmostTransaction, it seems the while loop ends immediately (because it's pretty much the opposite of the assert), so we just return the "xid" as the topmost XID. But will that produce incorrect query results, or what?

regards

[1] https://www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
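To make the second point above concrete for readers less familiar with XID arithmetic, here is a small self-contained illustration (not PostgreSQL code): plain integer comparison between 32-bit XIDs gives the wrong answer once the counter wraps around, which is why the backend compares normal XIDs modulo 2^31, in the spirit of TransactionIdPrecedes()/TransactionIdFollowsOrEquals().
```
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "id1 precedes id2", like PostgreSQL's
 * TransactionIdPrecedes() does for normal (non-permanent) XIDs. */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
	int32_t diff = (int32_t) (id1 - id2);

	return diff < 0;
}

int
main(void)
{
	TransactionId old_xid = 4294967000u;	/* allocated just before wraparound */
	TransactionId new_xid = 100u;			/* allocated just after wraparound */

	/* Raw comparison claims the newer XID is smaller: wrong answer. */
	printf("raw:        new > old ? %d\n", new_xid > old_xid);
	/* Modulo-2^31 comparison still sees old_xid as preceding new_xid. */
	printf("wraparound: precedes(old, new) ? %d\n", xid_precedes(old_xid, new_xid));

	return 0;
}
```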
On 2021-07-09 20:36, Tomas Vondra wrote: > Hi, > > I took a quick look on this - I'm no expert in the details of > snapshots, > so take my comments with a grain of salt. > > AFAICS both Greg Nancarrow and Pavel Borisov are kinda right. I think > Greg is right the v3 patch does not seem like the right (or correct) > solution, for a couple reasons: > > > 1) It fixes the symptoms, not the cause. If we set TransactionXmin to a > bogus value, this only fixes it locally in > SubTransGetTopmostTransaction > but I'd be surprised if other places did not have the same issue. > > > 2) The xid/TransactionXmin comparison in the v2 fix: > > xid = xid > TransactionXmin ? xid : TransactionXmin; > > seems broken, because it compares the XIDs directly, but that's not > going to work after generating enough transactions. > > > 3) But even if this uses TransactionIdFollowsOrEquals, it seems very > strange because the "xid" comes from > > XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot)) > > i.e. it's the raw xmin from the tuple, so why should we be setting it > to > TransactionXmin? That seems pretty strange, TBH. > > > So yeah, I think this is due to confusion with two snapshots and > failing > to consider both of them when calculating TransactionXmin. > > But I think removing one of the snapshots (as the v2 does it) is rather > strange too. I very much doubt having both the transaction and active > snapshots in the parallel worker is not intentional, and Pavel may very > well be right this breaks isolation levels that use the xact snapshot > (i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though. > > So I think we need to keep both snapshots, but fix TransactionXmin to > consider both of them - I suppose ParallelWorkerMain could simply look > at the two snapshots, and use the minimum. Which is what [1] (already > linked by Pavel) talks about, although that's related to concerns about > one of the processes dying, which is not what's happening here. > > > I'm wondering what consequences this may have on production systems, > though. We've only seen this failing because of the assert, so what > happens when the build has asserts disabled? > > Looking at SubTransGetTopmostTransaction, it seems the while loop ends > immediately (because it's pretty much the opposite of the assert), so > we > just return the "xid" as topmost XID. But will that produce incorrect > query results, or what? > > > > regards > > > [1] > https://www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de PFA v4 patch based on the ideas above. In principle I see little difference with v3. But I agree it is more correct. I did test this patch on Linux and MacOS using testing algo above and got no error. On master branch before the patch I still see this error. --- Best regards, Maxim Orlov.
On Sat, Jul 10, 2021 at 3:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > But I think removing one of the snapshots (as the v2 does it) is rather > strange too. I very much doubt having both the transaction and active > snapshots in the parallel worker is not intentional, and Pavel may very > well be right this breaks isolation levels that use the xact snapshot > (i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though. > Unfortunately there is currently no test, code-comment, README or developer-discussion that definitively determines which approach (v2 vs v3/v4) is a valid fix for this issue. We don't know if having both the transaction and active snapshots in a parallel worker is intentional or not, and if so, why so? (certainly in the non-parallel case of the same statement execution, there is only one snapshot in question here - the obtained transaction snapshot is pushed as the active snapshot, as it is done in 95% of cases in the code) It seems that only the original code authors know how the snapshot handling in parallel-workers is MEANT to work, and they have yet to speak up about it here. At this point, we can only all agree that there is a problem to be fixed here. My concern with the v3/v4 patch approach is that because the parallel-workers use a later snapshot to what is actually used in the execution context for the statement in the parallel leader, then it is possible for the parallel leader and parallel workers to have different transaction visibility, and surely this cannot be correct. For example, suppose a transaction that deletes a row, completes in the window between these two snapshots. Couldn't the row be visible to the parallel workers but not to the parallel leader? My guess is that currently there are not enough concurrent-transactions tests to expose such a problem, and the window here is fairly small. So we can fiddle xmin values to avoid the immediate Assert issue here, but it's not addressing potential xmax-related issues. Regards, Greg Nancarrow Fujitsu Australia
On Wed, Jul 14, 2021 at 10:34 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > Unfortunately there is currently no test, code-comment, README or > developer-discussion that definitively determines which approach (v2 > vs v3/v4) is a valid fix for this issue. > We don't know if having both the transaction and active snapshots in a > parallel worker is intentional or not, and if so, why so? > (certainly in the non-parallel case of the same statement execution, > there is only one snapshot in question here - the obtained transaction > snapshot is pushed as the active snapshot, as it is done in 95% of > cases in the code) > It seems that only the original code authors know how the snapshot > handling in parallel-workers is MEANT to work, and they have yet to > speak up about it here. > At this point, we can only all agree that there is a problem to be fixed here. Hi. Thanks to Thomas Munro for drawing my attention to this thread. I wasn't intentionally ignoring it, but there's a lot of email in the world and only so much time. When I wrote this code originally, the idea that I had in mind was simply this: whatever state we have in the leader ought to be reproduced in each worker. So if there's an active snapshot in the leader, we ought to make that active in all the workers, and if there's a transaction snapshot in the leader, we ought to make that the transaction snapshot in all of the workers. But I see now that my thinking was fuzzy, and I'm going to blame that on the name GetTransactionSnapshot() being slightly misleading. If IsolationUsesXactSnapshot() is true, then there's really such a thing as a transaction snapshot and reproducing that in the worker is a sensible thing to do. But when !IsolationUsesXactSnapshot(), GetTransactionSnapshot() doesn't just "get the transaction snapshot", because there isn't any such thing. It takes a whole new snapshot, on the theory that you wouldn't be calling this function unless you had finished up with the snapshot you got the last time you called this function. And in the case of initiating parallel query, that is the wrong thing. I think that, at least in the case where IsolationUsesXactSnapshot() is true, we need to make sure that calling GetTransactionSnapshot() in a worker produces the same result that it would have produced in the leader. Say one of the workers calls an sql or plpgsql function and that function runs a bunch of SQL statements. It seems to me that there's probably a way for this to result in calls inside the worker to GetTransactionSnapshot(), and if that doesn't return the same snapshot as in the leader, then we've broken MVCC. What about when IsolationUsesXactSnapshot() is false? Perhaps it's OK to just skip this altogether in that case. Certainly what we're doing can't be right, because copying a snapshot that wouldn't have been taken without parallel query can't ever be the right thing to do. Perhaps we need to copy something else instead. I'm not really sure. So I think v2 is probably on the right track, but wrong when the transaction isolation level is REPEATABLE READ or SERIALIZABLE, and v3 and v4 just seem like unprincipled hacks that try to avoid the assertion failure by lying about whether there's a problem. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jul 22, 2021 at 2:15 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Thanks to Thomas Munro for drawing my attention to this thread. I > wasn't intentionally ignoring it, but there's a lot of email in the > world and only so much time. > > When I wrote this code originally, the idea that I had in mind was > simply this: whatever state we have in the leader ought to be > reproduced in each worker. So if there's an active snapshot in the > leader, we ought to make that active in all the workers, and if > there's a transaction snapshot in the leader, we ought to make that > the transaction snapshot in all of the workers. > > But I see now that my thinking was fuzzy, and I'm going to blame that > on the name GetTransactionSnapshot() being slightly misleading. If > IsolationUsesXactSnapshot() is true, then there's really such a thing > as a transaction snapshot and reproducing that in the worker is a > sensible thing to do. But when !IsolationUsesXactSnapshot(), > GetTransactionSnapshot() doesn't just "get the transaction snapshot", > because there isn't any such thing. It takes a whole new snapshot, on > the theory that you wouldn't be calling this function unless you had > finished up with the snapshot you got the last time you called this > function. And in the case of initiating parallel query, that is the > wrong thing. > > I think that, at least in the case where IsolationUsesXactSnapshot() > is true, we need to make sure that calling GetTransactionSnapshot() in > a worker produces the same result that it would have produced in the > leader. Say one of the workers calls an sql or plpgsql function and > that function runs a bunch of SQL statements. It seems to me that > there's probably a way for this to result in calls inside the worker > to GetTransactionSnapshot(), and if that doesn't return the same > snapshot as in the leader, then we've broken MVCC. > > What about when IsolationUsesXactSnapshot() is false? Perhaps it's OK > to just skip this altogether in that case. Certainly what we're doing > can't be right, because copying a snapshot that wouldn't have been > taken without parallel query can't ever be the right thing to do. > Perhaps we need to copy something else instead. I'm not really sure. > > So I think v2 is probably on the right track, but wrong when the > transaction isolation level is REPEATABLE READ or SERIALIZABLE, and v3 > and v4 just seem like unprincipled hacks that try to avoid the > assertion failure by lying about whether there's a problem. > Many thanks for taking time to respond to this (and thanks to Thomas Munro too). It's much appreciated, as this is a complex area. For the time being, I'll reinstate the v2 patch (say as v5) as a partial solution, and then work on addressing the REPEATABLE READ and SERIALIZABLE transaction isolation levels, which you point out are not handled correctly by the patch. Regards, Greg Nancarrow Fujitsu Australia
On Tue, Aug 3, 2021 at 9:59 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
> I've looked at the v5 patch. It is completely similar to the v2 patch, which I've already tested using the workflow I've described in the comments above. Before the patch I get the errors quite soon; the patch corrects them. Furthermore, I've tested the same patch under REPEATABLE READ and SERIALIZABLE and detected no flaws. So, now, when we've got Robert's explanation, it seems to me that the v2 (aka v5) patch can be committed (leaving possible REPEATABLE READ and SERIALIZABLE improvements for the future). I'm not really sure it is still possible in the 07/21 CF, but I'd change the status of the patch to ready-for-committer. Also I'd like the bugfix to be backported to the previous PG versions.

I agree that the fix should be back-ported, but I'm not keen to commit anything unless it works for all isolation levels.

The idea I sort of had floating around in my mind is a little different than what Greg has implemented. I was thinking that we could just skip SerializeSnapshot and the corresponding shm_toc_allocate() if !IsolationUsesXactSnapshot(). Then on the restore side we could just call shm_toc_lookup() with noError = true and skip RestoreTransactionSnapshot/RestoreSnapshot if it returns NULL.

I don't know why Greg's patch is changing anything related to the active snapshot (as opposed to the transaction snapshot). Maybe there's a reason why we need that change, but I don't know what it is.

--
Robert Haas
EDB: http://www.enterprisedb.com
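Read literally, Robert's suggestion amounts to something like the sketch below (an illustration only, not a tested patch; "leader_pgproc" stands in for the leader PGPROC pointer carried in the FixedParallelState). As the following messages show, this alone turned out not to be enough.
```
/* Leader, InitializeParallelDSM(): only ship a transaction snapshot when
 * the isolation level really has one. */
if (IsolationUsesXactSnapshot())
{
	tsnapspace = shm_toc_allocate(pcxt->toc, tsnaplen);
	SerializeSnapshot(transaction_snapshot, tsnapspace);
	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, tsnapspace);
}

/* Worker, ParallelWorkerMain(): restore it only if the leader shipped one. */
tsnapspace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, true);	/* noError = true */
if (tsnapspace != NULL)
	RestoreTransactionSnapshot(RestoreSnapshot(tsnapspace), leader_pgproc);
```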
On Wed, Aug 4, 2021 at 3:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> The idea I sort of had floating around in my mind is a little
> different than what Greg has implemented. I was thinking that we could
> just skip SerializeSnapshot and the corresponding shm_toc_allocate()
> if !IsolationUsesXactSnapshot(). Then on the restore side we could
> just call shm_toc_lookup() with noError = true and skip
> RestoreTransactionSnapshot/RestoreSnapshot if it returns NULL.

I've tried to follow your description and have attached a patch to hopefully match it, but it doesn't pass "make check-world".
Perhaps I messed something up (apologies if so), or additional changes are needed to match what you had in mind or correct additional issues you didn't foresee?

t/001_pgbench_with_server.pl .. 10/?
# Failed test 'pgbench scale 1 initialization status (got 1 vs expected 0)'
# at t/001_pgbench_with_server.pl line 108.
...
# creating primary keys...
# pgbench: fatal: query failed: ERROR: cannot take query snapshot during a parallel operation
# CONTEXT: parallel worker
# pgbench: query was: alter table pgbench_accounts add primary key (aid)

> I don't know why Greg's patch is changing anything related to the
> active snapshot (as opposed to the transaction snapshot). Maybe
> there's a reason why we need that change, but I don't know what it is.

I don't think my v2/v5 patch is changing anything related to the active snapshot (is it?).
It's restoring the serialized active snapshot, installing it as the transaction snapshot and setting it as the active snapshot.
The patch removes the additionally-acquired transaction snapshot in InitializeParallelDSM (which seems like a later snapshot than that which is actually being used in the execution state for the statement, back in the leader, if I recall correctly).
Basically, in the parallel workers, I've tried to match the snapshot setup to that used in the leader.

If the v2/v5 patch doesn't work correctly for some isolation level, I've yet to find it (but can't absolutely rule out there's some case not accounted for).

Regards,
Greg Nancarrow
Fujitsu Australia
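The worker-side behaviour Greg describes for the v2/v5 patch amounts to roughly the following (a paraphrase of his description, not the attached patch text; "leader_pgproc" again stands in for the leader PGPROC pointer from the FixedParallelState):
```
/* Worker, ParallelWorkerMain(): one serialized snapshot, used in both roles. */
asnapspace = shm_toc_lookup(toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, false);
asnapshot = RestoreSnapshot(asnapspace);

/* Install it as the transaction snapshot (this is what sets TransactionXmin)... */
RestoreTransactionSnapshot(asnapshot, leader_pgproc);

/* ...and also make it the active snapshot, mirroring the leader's setup. */
PushActiveSnapshot(asnapshot);
```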
On Wed, Aug 4, 2021 at 7:55 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>
> Greg, thanks for the fast response! I suppose that a check for IsolationUsesXactSnapshot() is also useful in GetTransactionSnapshot for the correct processing of the case with a NULL transaction snapshot.
> This corrects the mentioned check-world test.
> PFA v7 patch.

Ah, thanks for that (I didn't debug that failure).
But is the coredump issue reproducible now? (using v7 and your test script)

Regards,
Greg Nancarrow
Fujitsu Australia
On Wed, Aug 4, 2021 at 8:17 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Ah, thanks for that (I didn't debug that failure).
> But is the coredump issue reproducible now? (using v7 and your test script)
>

Er, with the v7 patch, the problem still occurs (that Assert still fires during a run of the SubTransGetTopmostTransaction-rep.sh script).
Stack trace from the coredump is below:

#0 0x00007f06e5e7d37f in raise () from /lib64/libc.so.6
#1 0x00007f06e5e67db5 in abort () from /lib64/libc.so.6
#2 0x0000000000b06547 in ExceptionalCondition (
    conditionName=0xba8d78 "TransactionIdFollowsOrEquals(xid, TransactionXmin)",
    errorType=0xba8d0b "FailedAssertion", fileName=0xba8d00 "subtrans.c",
    lineNumber=156) at assert.c:69
#3 0x0000000000576f0f in SubTransGetTopmostTransaction (xid=3676) at subtrans.c:156
#4 0x0000000000b5f55a in XidInMVCCSnapshot (xid=3676, snapshot=0x2e44560) at snapmgr.c:2293
#5 0x000000000050e014 in HeapTupleSatisfiesMVCC (htup=0x7ffdc6dee6f0, snapshot=0x2e44560, buffer=15559) at heapam_visibility.c:1070
#6 0x000000000050f148 in HeapTupleSatisfiesVisibility (tup=0x7ffdc6dee6f0, snapshot=0x2e44560, buffer=15559) at heapam_visibility.c:1771
#7 0x00000000004f1d26 in heapgetpage (sscan=0x2e9a9a0, page=10310) at heapam.c:466
#8 0x00000000004f45cf in heapgettup_pagemode (scan=0x2e9a9a0, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1118
#9 0x00000000004f4c91 in heap_getnextslot (sscan=0x2e9a9a0, direction=ForwardScanDirection, slot=0x2e8f9a0) at heapam.c:1396
#10 0x000000000077abb1 in table_scan_getnextslot (sscan=0x2e9a9a0, direction=ForwardScanDirection, slot=0x2e8f9a0) at ../../../src/include/access/tableam.h:1044
#11 0x000000000077ac79 in SeqNext (node=0x2e8f428) at nodeSeqscan.c:80
#12 0x000000000073d997 in ExecScanFetch (node=0x2e8f428, accessMtd=0x77abe5 <SeqNext>, recheckMtd=0x77ac8a <SeqRecheck>) at execScan.c:133

Regards,
Greg Nancarrow
Fujitsu Australia
At SERIALIZABLE level with v2/v5 I get an error which I don't have before the patch (but no crash):

pgbench: error: client 6 script 0 aborted in command 594 query 0: ERROR: could not serialize access due to read/write dependencies among transactions
DETAIL: Reason code: Canceled on identification as a pivot, during conflict out checking.
On Tue, Aug 3, 2021 at 11:41 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> I've tried to follow your description and have attached a patch to
> hopefully match it, but it doesn't pass "make check-world".
> Perhaps I messed something up (apologies if so), or additional changes
> are needed to match what you had in mind or correct additional issues
> you didn't foresee?

This is the sort of thing I was thinking about but I don't understand why it doesn't fix the reported problem. Apparently I haven't correctly understood what the issue is.

> >I don't know why Greg's patch is changing anything related to the
> >active snapshot (as opposed to the transaction snapshot). Maybe
> >there's a reason why we need that change, but I don't know what it is.
>
> I don't think my v2/v5 patch is changing anything related to the
> active snapshot (is it?).
> It's restoring the serialized active snapshot, installing it as the
> transaction snapshot and setting it as the active snapshot.

Why do you think it's right to install the serialized *active* snapshot as the *transaction* snapshot? I've been operating on the presumption that we wanted the worker to install the leader's transaction snapshot as its transaction snapshot and the leader's active snapshot as its active snapshot, because in my mind the active snapshot and the transaction snapshot are two separate things. However, there's clearly something wrong with that line of reasoning, because apparently your proposal fixes the problem and mine doesn't.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 4, 2021 at 11:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Why do you think it's right to install the serialized *active*
> snapshot as the *transaction* snapshot? I've been operating on the
> presumption that we wanted the worker to install the leader's
> transaction snapshot as its transaction snapshot and the leader's
> active snapshot as its active snapshot, because in my mind the active
> snapshot and the transaction snapshot are two separate things. However,
> there's clearly something wrong with that line of reasoning, because
> apparently your proposal fixes the problem and mine doesn't.
>

In setting up the snapshot for the execution state used in command execution, GetTransactionSnapshot() is called and (possibly a copy of) the returned snapshot is pushed as the ActiveSnapshot.
The way I see it is that there is essentially only one snapshot here, the last-acquired TransactionSnapshot, which the ActiveSnapshot points to (or a copy of it).
Rightly or wrongly, my v2/v5 patch is basically changing the code to do the same snapshot setup in each of the parallel workers.

So why (current Postgres code, no patches applied) in setting up for parallel-worker execution (in InitializeParallelDSM) does the Postgres code then acquire ANOTHER TransactionSnapshot (by calling GetTransactionSnapshot(), which could return CurrentSnapshot or a new snapshot) and serialize that, as well as serializing what the ActiveSnapshot points to, and then restore those in the workers as two separate snapshots?
Is it a mistake? Or if intentional and correct, how so?

Regards,
Greg Nancarrow
Fujitsu Australia
On Thu, Aug 5, 2021 at 12:03 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > As the current v7 patch doesn't fix the coredump issue and also the cfbot is now failing (as one of the regression tests fails) I'm reinstating my v2/v5 patch (as v8) as the current best solution to this issue. So far I haven't found a test case (e.g. isolation level) that breaks the patch. Pavel did encounter an error (but no crash) when testing the patch with SERIALIZABLE, but he found that this error occurred anyway without the patch. Regards, Greg Nancarrow Fujitsu Australia
On Wed, Aug 4, 2021 at 10:03 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> In setting up the snapshot for the execution state used in command
> execution, GetTransactionSnapshot() is called and (possibly a copy of)
> the returned snapshot is pushed as the ActiveSnapshot.

I mean, there are LOTS of PushActiveSnapshot() calls in the code. A lot of those specifically say PushActiveSnapshot(GetTransactionSnapshot()) but others are pushing snapshots obtained from various other places. I really don't think it can possibly be correct in general to assume that the snapshot on top of the active snapshot stack is the same as the transaction snapshot.

> So why (current Postgres code, no patches applied) in setting up for
> parallel-worker execution (in InitializeParallelDSM) does the Postgres
> code then acquire ANOTHER TransactionSnapshot (by calling
> GetTransactionSnapshot(), which could return CurrentSnapshot or a new
> snapshot) and serialize that, as well as serializing what the
> ActiveSnapshot points to, and then restore those in the workers as two
> separate snapshots? Is it a mistake? Or if intentional and correct,
> how so?

Well, I already agreed that in cases where GetTransactionSnapshot() will acquire an altogether new snapshot, we shouldn't call it, but beyond that I don't see why you think this is wrong.

I mean, suppose we only call GetTransactionSnapshot() at parallel worker startup when IsolationUsesXactSnapshot(). In that case, CurrentSnapshot is a durable, transaction-lifetime piece of backend-local state that can affect the results of future calls to GetTransactionSnapshot(), and therefore seems to need to be replicated to workers. Separately, regardless of IsolationUsesXactSnapshot(), the active snapshot is accessible via calls to GetActiveSnapshot() and therefore should also be replicated to workers.

Now, I don't know of any necessary reason why those two things need to be the same, because again, there are PushActiveSnapshot() calls all over the place, and they're not all pushing the transaction snapshot. So therefore it makes sense to me that we need to replicate those two things separately - the active snapshot in the leader becomes the active snapshot in the workers and the transaction snapshot in the leader becomes the transaction snapshot in the worker.

Now there is evidently something wrong with this line of reasoning because the code is buggy and my proposed fix doesn't work, but I don't know what is wrong with it. You seem to think that it's crazy that we try to replicate the active snapshot to the active snapshot and the transaction snapshot to the transaction snapshot, but that did, and still does, seem really sane to me. The only part that now seems clearly wrong to me is that the !IsolationUsesXactSnapshot() case takes an *extra* snapshot, but since fixing that didn't fix the bug, there's evidently more to the problem than that.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 10, 2021 at 12:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Now there is evidently something wrong with this line of reasoning
> because the code is buggy and my proposed fix doesn't work, but I
> don't know what is wrong with it. You seem to think that it's crazy
> that we try to replicate the active snapshot to the active snapshot
> and the transaction snapshot to the transaction snapshot, but that
> did, and still does, seem really sane to me. The only part that now
> seems clearly wrong to me is that the !IsolationUsesXactSnapshot() case
> takes an *extra* snapshot, but since fixing that didn't fix the bug,
> there's evidently more to the problem than that.
>

I traced through snapshot processing during a parallel SELECT, up to the point of the existing GetTransactionSnapshot() and GetActiveSnapshot() calls in InitializeParallelDSM().
I'm seeing the following sequence of calls, numbered below:

PortalStart():
    case PORTAL_ONE_SELECT:
(1)     PushActiveSnapshot(GetTransactionSnapshot());
        ...
        queryDesc = CreateQueryDesc(linitial_node(PlannedStmt, portal->stmts),
                                    portal->sourceText,
(2)                                 GetActiveSnapshot(),
                                    InvalidSnapshot,
                                    None_Receiver,
                                    params,
                                    portal->queryEnv,
                                    0);
        ...
(3)     PopActiveSnapshot();

PortalRunSelect():
(4) PushActiveSnapshot(queryDesc->snapshot);
    ExecutorRun(queryDesc, direction, (uint64) count, portal->run_once);
        InitializeParallelDSM():
(5)         Snapshot transaction_snapshot = GetTransactionSnapshot();
(6)         Snapshot active_snapshot = GetActiveSnapshot();
    nprocessed = queryDesc->estate->es_processed;
(7) PopActiveSnapshot();

The snapshot used in execution of the query is clearly the ActiveSnapshot at the time of creating the QueryDesc [at (2)], which is a copy of the TransactionSnapshot originally acquired [at (1)].
In InitializeParallelDSM() it acquires both the TransactionSnapshot [at (5)] and the ActiveSnapshot [at (6)], to be serialized in the DSM for the workers (each of which will deserialize and restore these).

But the problem I see is that the GetTransactionSnapshot() call [at (5)] may acquire a new snapshot (i.e. a later snapshot than the ActiveSnapshot used in the execution of the query), for example, if a concurrent transaction has completed since GetTransactionSnapshot() was last called [in (1)].
In this case, GetTransactionSnapshot() calls GetSnapshotDataReuse() and it returns false, causing a new snapshot to be built by GetTransactionSnapshot().

    curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
    if (curXactCompletionCount != snapshot->snapXactCompletionCount)
        return false;

When this TransactionSnapshot is restored in a worker process, it accordingly sets TransactionXmin, and if we look back at the coredump stacktrace and the Assert condition that failed in the worker, we see that the xid was expected to be >= TransactionXmin, but the Assert fired because the xid was < TransactionXmin.

    Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));

This is explained by the TransactionSnapshot being a later snapshot in this case.
So this is why it seems to be wrong to call GetTransactionSnapshot() in InitializeParallelDSM() and use a separate, potentially later, snapshot than that used in the execution state for the query.

Regards,
Greg Nancarrow
Fujitsu Australia
On Wed, Aug 11, 2021 at 8:32 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> This is explained by the TransactionSnapshot being a later snapshot in
> this case.
> So this is why it seems to be wrong to call GetTransactionSnapshot()
> in InitializeParallelDSM() and use a separate, potentially later,
> snapshot than that used in the execution state for the query.

Thanks for the research. I agree with your logic here, but:

1. Then why doesn't the approach I proposed fix it?

2. Consider the case where the toplevel query is something like SELECT complexfunc() FROM generate_series(1,10) g -- in a case like this, I think complexfunc() can cause snapshots to be taken internally. For example, suppose we end up inside exec_eval_simple_expr, or SPI_cursor_open_internal, in either case with read_only = false. Here we're going to again call GetTransactionSnapshot() and then execute a query which may use parallelism.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Aug 12, 2021 at 5:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> 1. Then why doesn't the approach I proposed fix it?
>

I think that with your approach, it is not doing the expected initialization done by SetTransactionSnapshot() (which is called by RestoreTransactionSnapshot(), and which your approach skips in the case of the SQL script that reproduces the problem, because IsolationUsesXactSnapshot() returns false for XACT_READ_COMMITTED).

There are some comments in SetTransactionSnapshot() explaining the tricky parts of this initialization: testing that the source transaction is still running, dealing with a race condition, and setting up TransactionXmin. Also, there's an "if (IsolationUsesXactSnapshot()) ..." block within that function, doing some required setup for transaction-snapshot mode, so it doesn't seem like a good idea to skip calling RestoreTransactionSnapshot() if !IsolationUsesXactSnapshot(); the function is obviously catering for both cases, when the isolation level does and doesn't use a transaction snapshot. So I think SetTransactionSnapshot() always needs to be called.

With your proposed approach, what I'm seeing is that the worker calls GetTransactionSnapshot() at some point, which then builds a new snapshot and results in increasing TransactionXmin (probably because another concurrent transaction has since completed). This snapshot is thus later than the snapshot used in the execution state of the query being executed. This causes the Assert in SubTransGetTopmostTransaction() to fire, because the xid doesn't follow or equal the TransactionXmin value.

> 2. Consider the case where the toplevel query is something like SELECT
> complexfunc() FROM generate_series(1,10) g -- in a case like this, I
> think complexfunc() can cause snapshots to be taken internally. For
> example suppose we end up inside exec_eval_simple_expr, or
> SPI_cursor_open_internal, in either case with read_only = false. Here
> we're going to again call GetTransactionSnapshot() and then execute a
> query which may use parallelism.
>

A query always uses the ActiveSnapshot at the time the QueryDesc is created - so as long as you don't (as the current code does) obtain a potentially later snapshot and try to restore that in the worker as the TransactionSnapshot (or let the worker create a new snapshot, because no TransactionSnapshot was restored, which may have a greater xmin than the ActiveSnapshot), then I think it should be OK.

Regards,
Greg Nancarrow
Fujitsu Australia
On Fri, Aug 13, 2021 at 2:52 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> With your proposed approach, what I'm seeing is that the worker calls
> GetTransactionSnapshot() at some point, which then builds a new
> snapshot, and results in increasing TransactionXmin (probably because
> another concurrent transaction has since completed). This snapshot is
> thus later than the snapshot used in the execution state of the query
> being executed. This causes the Assert in
> SubTransGetTopmostTransaction() to fire because the xid doesn't follow
> or equal the TransactionXmin value.

Ah ha! Thank you. So I think what I was missing here is that even though the transaction snapshot is not a well-defined concept when !IsolationUsesXactSnapshot(), we still need TransactionXmin to be set to a value that's earlier than any XID we might inquire about. So the proposal to install the leader's active snapshot as the worker's transaction snapshot is really just a way of making that happen. Now that I understand better, that seems OK to me when !IsolationUsesXactSnapshot(), but otherwise I think we need to serialize and restore the actual transaction snapshot. Do you agree?

--
Robert Haas
EDB: http://www.enterprisedb.com
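One way the leader side could mirror this decision is sketched below. It is an illustration of the idea, not the eventual committed patch: it assumes serialization still happens in InitializeParallelDSM(), and toc space allocation and plumbing for `tsnapspace`/`asnapspace` are elided.

```
/* Leader side, sketch only: allocation of tsnapspace/asnapspace elided. */
Snapshot	active_snapshot = GetActiveSnapshot();

if (IsolationUsesXactSnapshot())
{
	/* REPEATABLE READ and above: ship the existing transaction snapshot. */
	Snapshot	transaction_snapshot = GetTransactionSnapshot();

	SerializeSnapshot(transaction_snapshot, tsnapspace);
	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, tsnapspace);
}

/* Always ship the active snapshot; workers can fall back to it. */
SerializeSnapshot(active_snapshot, asnapspace);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, asnapspace);
```

At REPEATABLE READ and above, GetTransactionSnapshot() simply returns the already-established transaction snapshot, so no extra snapshot is taken - which is the part everyone already agreed was wrong in the !IsolationUsesXactSnapshot() case.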
On Wed, Aug 18, 2021 at 5:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Ah ha! Thank you. So I think what I was missing here is that even
> though the transaction snapshot is not a well-defined concept when
> !IsolationUsesXactSnapshot(), we still need TransactionXmin to be set
> to a value that's earlier than any XID we might inquire about. So the
> proposal to install the leader's active snapshot as the worker's
> transaction snapshot is really just a way of making that happen. Now
> that I understand better, that seems OK to me when
> !IsolationUsesXactSnapshot(), but otherwise I think we need to
> serialize and restore the actual transaction snapshot. Do you agree?
>

Yes, I think I agree on that.

I've updated the patch to restore the actual transaction snapshot in the IsolationUsesXactSnapshot() case; otherwise the active snapshot is installed as the transaction snapshot.

I've tested the patch for the different transaction isolation levels, and the reported coredump (from the assertion failure) is not occurring. (In the "serializable" case there are "could not serialize access due to read/write dependencies among transactions" errors, as Pavel has previously reported, but these occur without the patch and it appears to be an unrelated issue.)

Regards,
Greg Nancarrow
Fujitsu Australia
On Wed, Aug 18, 2021 at 9:28 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> Yes, I think I agree on that.
> I've updated the patch to restore the actual transaction snapshot in
> the IsolationUsesXactSnapshot() case, otherwise the active snapshot is
> installed as the transaction snapshot.
> I've tested the patch for the different transaction isolation levels,
> and the reported coredump (from assertion failure) is not occurring.
> (In the "serializable" case there are "could not serialize access due
> to read/write dependencies among transactions" errors, as Pavel has
> previously reported, but these occur without the patch and it appears
> to be an unrelated issue)

I think this looks pretty good. I am not sure I see any reason to introduce a new function RestoreTxnSnapshotAndSetAsActive. Couldn't we just use RestoreTransactionSnapshot() and then call PushActiveSnapshot() from parallel.c? That seems preferable to me from the standpoint of avoiding multiplication of APIs.

I also think that the comments should explain why we are doing this, rather than just that we are doing this. So instead of this:

```
+ /*
+  * If the transaction snapshot was serialized, restore it and restore the
+  * active snapshot too. Otherwise, the active snapshot is also installed as
+  * the transaction snapshot.
+  */
```

...perhaps something like:

If the transaction isolation level is READ COMMITTED or SERIALIZABLE, the leader has serialized the transaction snapshot and we must restore it. At lower isolation levels, there is no transaction-lifetime snapshot, but we need TransactionXmin to get set to a value which is less than or equal to the xmin of every snapshot that will be used by this worker. The easiest way to accomplish that is to install the active snapshot as the transaction snapshot. Code running in this parallel worker might take new snapshots via GetTransactionSnapshot() or GetLatestSnapshot(), but it shouldn't have any way of acquiring a snapshot older than the active snapshot.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 24, 2021 at 5:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think this looks pretty good. I am not sure I see any reason to
> introduce a new function RestoreTxnSnapshotAndSetAsActive. Couldn't we
> just use RestoreTransactionSnapshot() and then call
> PushActiveSnapshot() from parallel.c? That seems preferable to me from
> the standpoint of avoiding multiplication of APIs.
>

I initially thought this too, but RestoreTransactionSnapshot() sets up the resultant transaction snapshot in "CurrentSnapshot", which is static to snapmgr.c (like the other pointers to valid snapshots), and I didn't really want to mess with the visibility of that to allow a call to PushActiveSnapshot(CurrentSnapshot) in parallel.c. Also, I wasn't sure if it was safe to call GetTransactionSnapshot() here without the risk of unwanted side-effects - but, looking at it again, I think it is probably OK, so I did use it in my revised patch (attached) and removed that new function RestoreTxnSnapshotAndSetAsActive(). Do you agree that it is OK to call GetTransactionSnapshot() here?

> I also think that the comments should explain why we are doing this,
> rather than just that we are doing this. So instead of this:
>
> + /*
> +  * If the transaction snapshot was serialized, restore it and restore the
> +  * active snapshot too. Otherwise, the active snapshot is also installed as
> +  * the transaction snapshot.
> +  */
>
> ...perhaps something like:
>
> If the transaction isolation level is READ COMMITTED or SERIALIZABLE,
> the leader has serialized the transaction snapshot and we must restore
> it. At lower isolation levels, there is no transaction-lifetime
> snapshot, but we need TransactionXmin to get set to a value which is
> less than or equal to the xmin of every snapshot that will be used by
> this worker. The easiest way to accomplish that is to install the
> active snapshot as the transaction snapshot. Code running in this
> parallel worker might take new snapshots via GetTransactionSnapshot()
> or GetLatestSnapshot(), but it shouldn't have any way of acquiring a
> snapshot older than the active snapshot.
>

I agree, that is a better comment and detailed description, but didn't you mean "If the transaction isolation level is REPEATABLE READ or SERIALIZABLE ..."? Since we have:

```
#define XACT_READ_UNCOMMITTED	0
#define XACT_READ_COMMITTED		1
#define XACT_REPEATABLE_READ	2
#define XACT_SERIALIZABLE		3

#define IsolationUsesXactSnapshot()		(XactIsoLevel >= XACT_REPEATABLE_READ)
```

Regards,
Greg Nancarrow
Fujitsu Australia
On Tue, Aug 24, 2021 at 12:20 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> I initially thought this too, but RestoreTransactionSnapshot() sets up
> the resultant transaction snapshot in "CurrentSnapshot", which is
> static to snapmgr.c (like the other pointers to valid snapshots) and I
> didn't really want to mess with the visibility of that, to allow a
> call to PushActiveSnapshot(CurrentSnapshot) in parallel.c. Also, I
> wasn't sure if it was safe to call GetTransactionSnapshot() here
> without the risk of unwanted side-effects - but, looking at it again,
> I think it is probably OK, so I did use it in my revised patch
> (attached) and removed that new function
> RestoreTxnSnapshotAndSetAsActive(). Do you agree that it is OK to call
> GetTransactionSnapshot() here?

I guess I was thinking more of rejiggering things so that we save the results of each RestoreSnapshot() call in local variables, e.g. asnapshot and tsnapshot. And then I think we could just call RestoreTransactionSnapshot() on whichever one we want, and then PushActiveSnapshot(asnapshot) either way. I think it would be worth trying to move the PushActiveSnapshot() call out of the if statement instead of having it in two places, written differently but doing the same thing.

> I agree, that is a better comment and detailed description, but didn't
> you mean "If the transaction isolation level is REPEATABLE READ or
> SERIALIZABLE ..."?

I sure did!

--
Robert Haas
EDB: http://www.enterprisedb.com
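Concretely, the worker-side shape being suggested might look something like the sketch below (an illustration of the suggestion, not the committed code): both restored snapshots go into locals, RestoreTransactionSnapshot() is called on whichever one applies, and PushActiveSnapshot() appears exactly once. It assumes the leader serialized the transaction snapshot only at REPEATABLE READ and above, so that toc entry may be absent; `leader_pgproc` is a placeholder for the leader's PGPROC pointer.

```
/* Worker side, in ParallelWorkerMain(): sketch of the suggested shape. */
char	   *asnapspace = shm_toc_lookup(toc, PARALLEL_KEY_ACTIVE_SNAPSHOT, false);
char	   *tsnapspace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, true);	/* may be absent */
Snapshot	asnapshot = RestoreSnapshot(asnapspace);
Snapshot	tsnapshot = tsnapspace ? RestoreSnapshot(tsnapspace) : asnapshot;

/*
 * At REPEATABLE READ and above the leader shipped its transaction
 * snapshot and we restore it; otherwise, installing the active snapshot
 * as the transaction snapshot keeps TransactionXmin no later than the
 * xmin of any snapshot this worker will use.
 */
RestoreTransactionSnapshot(tsnapshot, leader_pgproc);	/* leader_pgproc: placeholder */
PushActiveSnapshot(asnapshot);
```

Either way, the PushActiveSnapshot() call ends up in exactly one place, which is the simplification being asked for.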
On Wed, Aug 25, 2021 at 1:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I guess I was thinking more of rejiggering things so that we save the
> results of each RestoreSnapshot() call in local variables, e.g.
> asnapshot and tsnapshot. And then I think we could just call
> RestoreTransactionSnapshot() on whichever one we want, and then
> PushActiveSnapshot(asnapshot) either way. I think it would be worth
> trying to move the PushActiveSnapshot() call out of the if statement
> instead of having it in two places, written differently but doing the
> same thing.
>

I've attached an updated patch, hopefully more along the lines that you were thinking of.

Regards,
Greg Nancarrow
Fujitsu Australia
On Wed, Aug 25, 2021 at 5:36 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> I've attached an updated patch, hopefully more along the lines that
> you were thinking of.

LGTM. Committed and back-patched to v10 and up. In theory the same bug exists in 9.6, but you'd have to have third-party code using the parallel context infrastructure in order to hit it. If the patch had back-patched cleanly I would have done so just in case, but shm_toc_lookup lacks a bool noError option in that version.

--
Robert Haas
EDB: http://www.enterprisedb.com