Thread: [Logical Replication] TRAP: FailedAssertion("rel->rd_rel->relreplident== REPLICA_IDENTITY_DEFAULT || rel->rd_rel->relreplident ==REPLICA_IDENTITY_FULL || rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX"

Hi,
I am getting a server crash on publication server on HEAD for the below test case.

Commit: b9c130a1fdf16cd99afb390c186d19acaea7d132

Data setup:
Publication server:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB

Subscription server:
wal_level = logical
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
log_min_messages=debug1
port=5433
logical_decoding_work_mem=64kB

Test case:
Publication server:
create table test(a int);
create publication test_pub for all tables;
alter table test replica identity NOTHING ;

Subscription server:
create table test(a int);
create subscription test_sub CONNECTION 'host=172.16.208.32 port=5432 dbname=postgres user=centos' PUBLICATION test_pub WITH ( slot_name = test_slot_sub);

Publication server:
insert into test values(generate_series(1,5),'aa');

After executing the DML in publication server ,it crashed with the mentioned assert.

Publication Server log File snippet:
2020-01-07 11:54:00.476 UTC [17417] DETAIL:  Streaming transactions committing after 0/163CC30, reading WAL from 0/163CC30.
2020-01-07 11:54:00.476 UTC [17417] LOG:  logical decoding found consistent point at 0/163CC30
2020-01-07 11:54:00.476 UTC [17417] DETAIL:  There are no running transactions.
TRAP: FailedAssertion("rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT || rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL || rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX", File: "proto.c", Line: 148)
postgres: walsender centos 172.16.208.32(40292) idle(ExceptionalCondition+0x53)[0x8ca453]
postgres: walsender centos 172.16.208.32(40292) idle[0x74c515]
/home/centos/PG_master/postgresql/inst/lib/pgoutput.so(+0x2114)[0x7f3bb463d114]
postgres: walsender centos 172.16.208.32(40292) idle[0x747fa8]
postgres: walsender centos 172.16.208.32(40292) idle(ReorderBufferCommit+0x12ee)[0x75187e]
postgres: walsender centos 172.16.208.32(40292) idle[0x7455a8]
postgres: walsender centos 172.16.208.32(40292) idle(LogicalDecodingProcessRecord+0x2ea)[0x74593a]
postgres: walsender centos 172.16.208.32(40292) idle[0x766c24]
postgres: walsender centos 172.16.208.32(40292) idle[0x7693a2]
postgres: walsender centos 172.16.208.32(40292) idle(exec_replication_command+0xbb1)[0x76a091]
postgres: walsender centos 172.16.208.32(40292) idle(PostgresMain+0x4b9)[0x7b1099]
postgres: walsender centos 172.16.208.32(40292) idle[0x482bc7]
postgres: walsender centos 172.16.208.32(40292) idle(PostmasterMain+0xdbf)[0x73339f]
postgres: walsender centos 172.16.208.32(40292) idle(main+0x44f)[0x48403f]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3bc53f23d5]
postgres: walsender centos 172.16.208.32(40292) idle[0x4840a6]
2020-01-07 11:54:00.802 UTC [17359] LOG:  server process (PID 17417) was terminated by signal 6: Aborted
2020-01-07 11:54:00.802 UTC [17359] LOG:  terminating any other active server processes
2020-01-07 11:54:00.802 UTC [17413] WARNING:  terminating connection because of crash of another server process
2020-01-07 11:54:00.802 UTC [17413] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.


Stack Trace:
Core was generated by `postgres: walsender centos 172.16.208.32(40286) idle      '.
Program terminated with signal 6, Aborted.
#0  0x00007f3bc5406207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f3bc5406207 in raise () from /lib64/libc.so.6
#1  0x00007f3bc54078f8 in abort () from /lib64/libc.so.6
#2  0x00000000008ca472 in ExceptionalCondition (
    conditionName=conditionName@entry=0xa5b9c8 "rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT || rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL || rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX", errorType=errorType@entry=0x918fe9 "FailedAssertion", fileName=fileName@entry=0xa5b903 "proto.c", lineNumber=lineNumber@entry=148) at assert.c:67
#3  0x000000000074c515 in logicalrep_write_insert (out=0xe8c768, rel=rel@entry=0x7f3bc69548d8, newtuple=0x7f3bb3e3a068) at proto.c:146
#4  0x00007f3bb463d114 in pgoutput_change (ctx=0xe8a8b0, txn=<optimized out>, relation=0x7f3bc69548d8, change=0xf3e018) at pgoutput.c:345
#5  0x0000000000747fa8 in change_cb_wrapper (cache=<optimized out>, txn=<optimized out>, relation=<optimized out>, change=<optimized out>) at logical.c:754
#6  0x000000000075187e in ReorderBufferCommit (rb=0xf2e090, xid=xid@entry=489, commit_lsn=23318344, end_lsn=<optimized out>, commit_time=commit_time@entry=631713235036663, origin_id=origin_id@entry=0,
    origin_lsn=origin_lsn@entry=0) at reorderbuffer.c:1661
#7  0x00000000007455a8 in DecodeCommit (xid=489, parsed=0x7ffec53910a0, buf=0x7ffec5391260, ctx=0xe8a8b0) at decode.c:637
#8  DecodeXactOp (ctx=0xe8a8b0, buf=buf@entry=0x7ffec5391260) at decode.c:245
#9  0x000000000074593a in LogicalDecodingProcessRecord (ctx=0xe8a8b0, record=0xe8ab70) at decode.c:114
#10 0x0000000000766c24 in XLogSendLogical () at walsender.c:2806
#11 0x00000000007693a2 in WalSndLoop (send_data=send_data@entry=0x766bc0 <XLogSendLogical>) at walsender.c:2230
#12 0x000000000076a091 in StartLogicalReplication (cmd=0xee9d68) at walsender.c:1153
#13 exec_replication_command (cmd_string=cmd_string@entry=0xe652d0 "START_REPLICATION SLOT \"test_slot_sub\" LOGICAL 0/0 (proto_version '1', publication_names '\"test_pub\"')") at walsender.c:1576
#14 0x00000000007b1099 in PostgresMain (argc=<optimized out>, argv=argv@entry=0xe90a58, dbname=0xe90960 "postgres", username=<optimized out>) at postgres.c:4287
#15 0x0000000000482bc7 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4498
#16 BackendStartup (port=0xe88920) at postmaster.c:4189
#17 ServerLoop () at postmaster.c:1727
#18 0x000000000073339f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xe5fe20) at postmaster.c:1400
#19 0x000000000048403f in main (argc=3, argv=0xe5fe20) at main.c:210

Thanks.
--
Regards,
Neha Sharma
On Tue, Jan 07, 2020 at 05:38:49PM +0530, Neha Sharma wrote:
> I am getting a server crash on publication server on HEAD for the below
> test case.
>
> Test case:
> Publication server:
> create table test(a int);
> create publication test_pub for all tables;
> alter table test replica identity NOTHING ;
>
> Subscription server:
> create table test(a int);
> create subscription test_sub CONNECTION 'host=172.16.208.32 port=5432
> dbname=postgres user=centos' PUBLICATION test_pub WITH ( slot_name =
> test_slot_sub);
>
> Publication server:
> insert into test values(generate_series(1,5),'aa');

This would not work as your relation has only one column.  There are
some TAP tests for logical replication (none actually stressing
NOTHING as replica identity), which do not fail, and I cannot
reproduce the failure myself.

> After executing the DML in publication server ,it crashed with the
> mentioned assert.

Do you have other objects defined on your schema on the publication or
the subscription side?  Like, er, triggers?
--
Michael

Attachment
Hi Michael,
Thanks for looking into the issue. Sorry by mistake I had mentioned the incorrect DML query,please use the query as mentioned below.

On Thu, Jan 9, 2020 at 11:38 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jan 07, 2020 at 05:38:49PM +0530, Neha Sharma wrote:
> I am getting a server crash on publication server on HEAD for the below
> test case.
>
> Test case:
> Publication server:
> create table test(a int);
> create publication test_pub for all tables;
> alter table test replica identity NOTHING ;
>
> Subscription server:
> create table test(a int);
> create subscription test_sub CONNECTION 'host=172.16.208.32 port=5432
> dbname=postgres user=centos' PUBLICATION test_pub WITH ( slot_name =
> test_slot_sub);
>
> Publication server:
> insert into test values(generate_series(1,5),'aa');
 insert into test values(generate_series(1,5));

This would not work as your relation has only one column.  There are
some TAP tests for logical replication (none actually stressing
NOTHING as replica identity), which do not fail, and I cannot
reproduce the failure myself.

> After executing the DML in publication server ,it crashed with the
> mentioned assert.

Do you have other objects defined on your schema on the publication or
the subscription side?  Like, er, triggers?
I had only one table in the publication server.

I am able to reproduce the issue consistently.

2020-01-09 07:14:31.727 UTC [20436] LOG:  logical decoding found consistent point at 0/1632FC0
2020-01-09 07:14:31.727 UTC [20436] DETAIL:  There are no running transactions.
TRAP: FailedAssertion("rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT || rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL || rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX", File: "proto.c", Line: 148)
postgres: walsender centos 172.16.208.32(40324) idle(ExceptionalCondition+0x53)[0x8ca453]
postgres: walsender centos 172.16.208.32(40324) idle[0x74c515]
/home/centos/PG_master/postgresql/inst/lib/pgoutput.so(+0x2114)[0x7fb105038114]
postgres: walsender centos 172.16.208.32(40324) idle[0x747fa8]
postgres: walsender centos 172.16.208.32(40324) idle(ReorderBufferCommit+0x12ee)[0x75187e]
postgres: walsender centos 172.16.208.32(40324) idle[0x7455a8]
postgres: walsender centos 172.16.208.32(40324) idle(LogicalDecodingProcessRecord+0x2ea)[0x74593a]
postgres: walsender centos 172.16.208.32(40324) idle[0x766c24]
postgres: walsender centos 172.16.208.32(40324) idle[0x7693a2]
postgres: walsender centos 172.16.208.32(40324) idle(exec_replication_command+0xbb1)[0x76a091]
postgres: walsender centos 172.16.208.32(40324) idle(PostgresMain+0x4b9)[0x7b1099]
postgres: walsender centos 172.16.208.32(40324) idle[0x482bc7]
postgres: walsender centos 172.16.208.32(40324) idle(PostmasterMain+0xdbf)[0x73339f]
postgres: walsender centos 172.16.208.32(40324) idle(main+0x44f)[0x48403f]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb115ded3d5]
postgres: walsender centos 172.16.208.32(40324) idle[0x4840a6]
2020-01-09 07:14:32.055 UTC [20357] LOG:  server process (PID 20436) was terminated by signal 6: Aborted
--
Michael
On Thu, Jan 9, 2020 at 12:50 PM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
>
> Hi Michael,
> Thanks for looking into the issue. Sorry by mistake I had mentioned the incorrect DML query,please use the query as
mentionedbelow.
 
>
> On Thu, Jan 9, 2020 at 11:38 AM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Tue, Jan 07, 2020 at 05:38:49PM +0530, Neha Sharma wrote:
>> > I am getting a server crash on publication server on HEAD for the below
>> > test case.
>> >
>> > Test case:
>> > Publication server:
>> > create table test(a int);
>> > create publication test_pub for all tables;
>> > alter table test replica identity NOTHING ;
>> >
>> > Subscription server:
>> > create table test(a int);
>> > create subscription test_sub CONNECTION 'host=172.16.208.32 port=5432
>> > dbname=postgres user=centos' PUBLICATION test_pub WITH ( slot_name =
>> > test_slot_sub);
>> >
>> > Publication server:
>> > insert into test values(generate_series(1,5),'aa');
>
>  insert into test values(generate_series(1,5));
>>
>>
>> This would not work as your relation has only one column.  There are
>> some TAP tests for logical replication (none actually stressing
>> NOTHING as replica identity), which do not fail, and I cannot
>> reproduce the failure myself.

I am able to reproduce the failure,  I think the assert in the
'logicalrep_write_insert' is not correct.  IMHO even if the replica
identity is set to NOTHING we should be able to replicate INSERT?

This will fix the issue.

diff --git a/src/backend/replication/logical/proto.c
b/src/backend/replication/logical/proto.c
index dcf7c08..471461c 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -145,7 +145,8 @@ logicalrep_write_insert(StringInfo out, Relation
rel, HeapTuple newtuple)

        Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
                   rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
-                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
+                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX ||
+                  rel->rd_rel->relreplident == REPLICA_IDENTITY_NOTHING);

        /* use Oid as relation identifier */
        pq_sendint32(out, RelationGetRelid(rel));

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, 9 Jan 2020 at 08:48, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 12:50 PM Neha Sharma
> <neha.sharma@enterprisedb.com> wrote:
> >
> > Hi Michael,
> > Thanks for looking into the issue. Sorry by mistake I had mentioned the incorrect DML query,please use the query as
mentionedbelow.
 
> >
> > On Thu, Jan 9, 2020 at 11:38 AM Michael Paquier <michael@paquier.xyz> wrote:
> >>
> >> On Tue, Jan 07, 2020 at 05:38:49PM +0530, Neha Sharma wrote:
> >> > I am getting a server crash on publication server on HEAD for the below
> >> > test case.
> >> >
> >> > Test case:
> >> > Publication server:
> >> > create table test(a int);
> >> > create publication test_pub for all tables;
> >> > alter table test replica identity NOTHING ;
> >> >
> >> > Subscription server:
> >> > create table test(a int);
> >> > create subscription test_sub CONNECTION 'host=172.16.208.32 port=5432
> >> > dbname=postgres user=centos' PUBLICATION test_pub WITH ( slot_name =
> >> > test_slot_sub);
> >> >
> >> > Publication server:
> >> > insert into test values(generate_series(1,5),'aa');
> >
> >  insert into test values(generate_series(1,5));
> >>
> >>
> >> This would not work as your relation has only one column.  There are
> >> some TAP tests for logical replication (none actually stressing
> >> NOTHING as replica identity), which do not fail, and I cannot
> >> reproduce the failure myself.
>
> I am able to reproduce the failure,  I think the assert in the
> 'logicalrep_write_insert' is not correct.  IMHO even if the replica
> identity is set to NOTHING we should be able to replicate INSERT?
>
True that.
> This will fix the issue.
>
> diff --git a/src/backend/replication/logical/proto.c
> b/src/backend/replication/logical/proto.c
> index dcf7c08..471461c 100644
> --- a/src/backend/replication/logical/proto.c
> +++ b/src/backend/replication/logical/proto.c
> @@ -145,7 +145,8 @@ logicalrep_write_insert(StringInfo out, Relation
> rel, HeapTuple newtuple)
>
>         Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
>                    rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
> -                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
> +                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX ||
> +                  rel->rd_rel->relreplident == REPLICA_IDENTITY_NOTHING);
>
>         /* use Oid as relation identifier */
>         pq_sendint32(out, RelationGetRelid(rel));
>
+1



-- 
Regards,
Rafia Sabih



Hi,

On 2020-01-09 13:17:59 +0530, Dilip Kumar wrote:
> I am able to reproduce the failure,  I think the assert in the
> 'logicalrep_write_insert' is not correct.  IMHO even if the replica
> identity is set to NOTHING we should be able to replicate INSERT?
> 
> This will fix the issue.
> 
> diff --git a/src/backend/replication/logical/proto.c
> b/src/backend/replication/logical/proto.c
> index dcf7c08..471461c 100644
> --- a/src/backend/replication/logical/proto.c
> +++ b/src/backend/replication/logical/proto.c
> @@ -145,7 +145,8 @@ logicalrep_write_insert(StringInfo out, Relation
> rel, HeapTuple newtuple)
> 
>         Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
>                    rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
> -                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
> +                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX ||
> +                  rel->rd_rel->relreplident == REPLICA_IDENTITY_NOTHING);
> 
>         /* use Oid as relation identifier */
>         pq_sendint32(out, RelationGetRelid(rel));

There's not much point in having this assert, right? Given that it
covers all choices? Seems better to just drop it.

Greetings,

Andres Freund



On Thu, 9 Jan 2020 at 10:43 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2020-01-09 13:17:59 +0530, Dilip Kumar wrote:
> I am able to reproduce the failure,  I think the assert in the
> 'logicalrep_write_insert' is not correct.  IMHO even if the replica
> identity is set to NOTHING we should be able to replicate INSERT?
>
> This will fix the issue.
>
> diff --git a/src/backend/replication/logical/proto.c
> b/src/backend/replication/logical/proto.c
> index dcf7c08..471461c 100644
> --- a/src/backend/replication/logical/proto.c
> +++ b/src/backend/replication/logical/proto.c
> @@ -145,7 +145,8 @@ logicalrep_write_insert(StringInfo out, Relation
> rel, HeapTuple newtuple)
>
>         Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
>                    rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
> -                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
> +                  rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX ||
> +                  rel->rd_rel->relreplident == REPLICA_IDENTITY_NOTHING);
>
>         /* use Oid as relation identifier */
>         pq_sendint32(out, RelationGetRelid(rel));

There's not much point in having this assert, right? Given that it
covers all choices? Seems better to just drop
it.

Yeah right!
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 10, 2020 at 07:30:34AM +0530, Dilip Kumar wrote:
> On Thu, 9 Jan 2020 at 10:43 PM, Andres Freund <andres@anarazel.de> wrote:
>> There's not much point in having this assert, right? Given that it
>> covers all choices? Seems better to just drop it.
>
> Yeah right!

Refreshing my mind on that...  The two remaining assertions still make
sense for update and delete changes per the restrictions in place in
CheckCmdReplicaIdentity(), and there is a gap with the regression
tests.  So combining all that I get the attached patch (origin point
is 665d1fa).  Thoughts?
--
Michael

Attachment
On Fri, Jan 10, 2020 at 10:31 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Jan 10, 2020 at 07:30:34AM +0530, Dilip Kumar wrote:
> > On Thu, 9 Jan 2020 at 10:43 PM, Andres Freund <andres@anarazel.de> wrote:
> >> There's not much point in having this assert, right? Given that it
> >> covers all choices? Seems better to just drop it.
> >
> > Yeah right!
>
> Refreshing my mind on that...  The two remaining assertions still make
> sense for update and delete changes per the restrictions in place in
> CheckCmdReplicaIdentity(),

Right

 and there is a gap with the regression
> tests.  So combining all that I get the attached patch (origin point
> is 665d1fa).  Thoughts?

LGTM

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Jan 10, 2020 at 11:01:13AM +0530, Dilip Kumar wrote:
> On Fri, Jan 10, 2020 at 10:31 AM Michael Paquier <michael@paquier.xyz> wrote:
>> and there is a gap with the regression
>> tests.  So combining all that I get the attached patch (origin point
>> is 665d1fa).  Thoughts?
>
> LGTM

Thanks for the lookup.  I'll look at that again in a couple of days
and hopefully wrap it by then.
--
Michael

Attachment
On Sat, Jan 11, 2020 at 10:34:20AM +0900, Michael Paquier wrote:
> Thanks for the lookup.  I'll look at that again in a couple of days
> and hopefully wrap it by then.

And done.
--
Michael

Attachment