RE: [HACKERS] logical decoding of two-phase transactions - Mailing list pgsql-hackers

From tanghy.fnst@fujitsu.com
Subject RE: [HACKERS] logical decoding of two-phase transactions
Date
Msg-id OS0PR01MB61138651E2130F3E9152C45BFB2C9@OS0PR01MB6113.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [HACKERS] logical decoding of two-phase transactions  (Peter Smith <smithpb2250@gmail.com>)
List pgsql-hackers
Hi Ajin

>The above patch had some changes missing which resulted in some tap
>tests failing. Sending an updated patchset. Keeping the patchset
>version the same.

Thanks for your patch.  I see a problem about Segmentation fault when using it. Please take a look at this.
The steps to reproduce the problem are as follows.

------publisher------
create table test (a int primary key, b varchar);
create publication pub for table test;

------subscriber------
create table test (a int primary key, b varchar);
create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on);

Then, I prepare, commit, rollback transactions and TRUNCATE table in a sql as follows:
-------------
BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1, 10000) s(i);
PREPARE TRANSACTION 't1';
COMMIT PREPARED 't1';

BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(10001, 20000) s(i);
PREPARE TRANSACTION 't2';
ROLLBACK PREPARED 't2';

TRUNCATE test;
-------------

To make sure the problem produce easily, I looped above operations in my sql file about 10 times, then I can 100%
reproduceit and got segmentation fault in publisher log as follows:
 
-------------
2021-05-18 16:30:56.952 CST [548189] postmaster LOG:  server process (PID 548222) was terminated by signal 11:
Segmentationfault
 
2021-05-18 16:30:56.952 CST [548189] postmaster DETAIL:  Failed process was running: START_REPLICATION SLOT "sub"
LOGICAL0/0 (proto_version '3', two_phase 'on', publication_names '"pub"')
 
-------------

Here is the core dump information :
-------------
#0  0x000000000090afe4 in pq_sendstring (buf=buf@entry=0x251ca80, str=0x0) at pqformat.c:199
#1  0x0000000000ab0a2b in logicalrep_write_begin_prepare (out=0x251ca80, txn=txn@entry=0x25346e8) at proto.c:124
#2  0x00007f9528842dd6 in pgoutput_begin_prepare (ctx=ctx@entry=0x2514700, txn=txn@entry=0x25346e8) at pgoutput.c:495
#3  0x00007f9528843f70 in pgoutput_truncate (ctx=0x2514700, txn=0x25346e8, nrelations=1, relations=0x262f678,
change=0x25370b8)at pgoutput.c:905
 
#4  0x0000000000aa57cb in truncate_cb_wrapper (cache=<optimized out>, txn=<optimized out>, nrelations=<optimized out>,
relations=<optimizedout>, change=<optimized out>)
 
    at logical.c:1103
#5  0x0000000000abf333 in ReorderBufferApplyTruncate (streaming=false, change=0x25370b8, relations=0x262f678,
nrelations=1,txn=0x25346e8, rb=0x2516710)
 
    at reorderbuffer.c:1918
#6  ReorderBufferProcessTXN (rb=rb@entry=0x2516710, txn=0x25346e8, commit_lsn=commit_lsn@entry=27517176,
snapshot_now=<optimizedout>, command_id=command_id@entry=0,
 
    streaming=streaming@entry=false) at reorderbuffer.c:2278
#7  0x0000000000ac0b14 in ReorderBufferReplay (txn=<optimized out>, rb=rb@entry=0x2516710, xid=xid@entry=738,
commit_lsn=commit_lsn@entry=27517176,
    end_lsn=end_lsn@entry=27517544, commit_time=commit_time@entry=674644388404356, origin_id=0, origin_lsn=0) at
reorderbuffer.c:2591
#8  0x0000000000ac1713 in ReorderBufferCommit (rb=0x2516710, xid=xid@entry=738, commit_lsn=27517176, end_lsn=27517544,
commit_time=commit_time@entry=674644388404356,
    origin_id=origin_id@entry=0, origin_lsn=0) at reorderbuffer.c:2615
#9  0x0000000000a9f702 in DecodeCommit (ctx=ctx@entry=0x2514700, buf=buf@entry=0x7ffdd027c2b0,
parsed=parsed@entry=0x7ffdd027c140,xid=xid@entry=738,
 
    two_phase=<optimized out>) at decode.c:742
#10 0x0000000000a9fc6c in DecodeXactOp (ctx=ctx@entry=0x2514700, buf=buf@entry=0x7ffdd027c2b0) at decode.c:278
#11 0x0000000000aa1b75 in LogicalDecodingProcessRecord (ctx=0x2514700, record=0x2514ac0) at decode.c:142
#12 0x0000000000af6db1 in XLogSendLogical () at walsender.c:2876
#13 0x0000000000afb6aa in WalSndLoop (send_data=send_data@entry=0xaf6d49 <XLogSendLogical>) at walsender.c:2306
#14 0x0000000000afbdac in StartLogicalReplication (cmd=cmd@entry=0x24da288) at walsender.c:1206
#15 0x0000000000afd646 in exec_replication_command (
    cmd_string=cmd_string@entry=0x2452570 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '3', two_phase
'on',publication_names '\"pub\"')") at walsender.c:1646
 
#16 0x0000000000ba3514 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffdd027c560, dbname=<optimized out>,
username=<optimizedout>) at postgres.c:4482
 
#17 0x0000000000a7284a in BackendRun (port=port@entry=0x2477b60) at postmaster.c:4491
#18 0x0000000000a78bba in BackendStartup (port=port@entry=0x2477b60) at postmaster.c:4213
#19 0x0000000000a78ff9 in ServerLoop () at postmaster.c:1745
#20 0x0000000000a7bbdf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x244bae0) at postmaster.c:1417
#21 0x000000000090dc80 in main (argc=3, argv=0x244bae0) at main.c:209
-------------

I noticed that it called pgoutput_truncate function and pgoutput_begin_prepare function. It seems odd because TRUNCATE
isnot in a prepared transaction in my case.
 

I tried to debug this to learn more and found that in pgoutput_truncate function, the value of in_prepared_txn was
true.Later, it got a segmentation fault when it tried to get gid in logicalrep_write_begin_prepare function - it has no
gidso we got the segmentation fault.
 

FYI:
I also tested the case in synchronous mode, and it can execute successfully. So, I think the value of in_prepared_txn
issometimes incorrect in asynchronous mode. Maybe there's a better way to get this.
 

Regards
Tang

pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: PG 14 release notes, first draft
Next
From: "Drouvot, Bertrand"
Date:
Subject: Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619