Thread: Problem with logical replication

Problem with logical replication

From
Dilip Kumar
Date:
While try to setup a cascading replication, I have observed that if we
set the REPLICA IDENTITY to FULL on the subscriber side then there is
an Assert hit.

After analysis I have found that, when we set the REPLICA IDENTITY to
FULL on subscriber side (because I wanted to make this a publisher for
another subscriber).
then it will set relation->rd_replidindex to InvalidOid refer below code snippet
RelationGetIndexList()
{
....
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
relation->rd_replidindex = candidateIndex;
else
relation->rd_replidindex = InvalidOid;
}

But, while appying the update and if the table have an index we have
this assert in build_replindex_scan_key

static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
TupleTableSlot *searchslot)
{
...
Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
}

To me it appears like this assert is not correct.  Attached patch has
removed this assert and things works fine.


#0  0x00007ff2a0c8d5d7 in raise () from /lib64/libc.so.6
#1  0x00007ff2a0c8ecc8 in abort () from /lib64/libc.so.6
#2  0x0000000000aa7c7d in ExceptionalCondition (conditionName=0xc1bb30
"RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)",
errorType=0xc1bad9 "FailedAssertion",
    fileName=0xc1bb1c "execReplication.c", lineNumber=60) at assert.c:67
#3  0x00000000007153c3 in build_replindex_scan_key
(skey=0x7fff25711560, rel=0x7ff2a1b0b800, idxrel=0x7ff2a1b0bd98,
searchslot=0x21328c8) at execReplication.c:60
#4  0x00000000007156ac in RelationFindReplTupleByIndex
(rel=0x7ff2a1b0b800, idxoid=16387, lockmode=LockTupleExclusive,
searchslot=0x21328c8, outslot=0x2132bb0) at execReplication.c:141
#5  0x00000000008aeba5 in FindReplTupleInLocalRel (estate=0x2150170,
localrel=0x7ff2a1b0b800, remoterel=0x214a7c8, remoteslot=0x21328c8,
localslot=0x7fff25711f28) at worker.c:989
#6  0x00000000008ae6f2 in apply_handle_update_internal
(relinfo=0x21327b0, estate=0x2150170, remoteslot=0x21328c8,
newtup=0x7fff25711fd0, relmapentry=0x214a7c8) at worker.c:820
#7  0x00000000008ae609 in apply_handle_update (s=0x7fff25719560) at worker.c:788
#8  0x00000000008af8b1 in apply_dispatch (s=0x7fff25719560) at worker.c:1362
#9  0x00000000008afd52 in LogicalRepApplyLoop (last_received=22926832)
at worker.c:1570
#10 0x00000000008b0c3a in ApplyWorkerMain (main_arg=0) at worker.c:2114
#11 0x0000000000869c15 in StartBackgroundWorker () at bgworker.c:813
#12 0x000000000087d28f in do_start_bgworker (rw=0x20a07a0) at postmaster.c:5852
#13 0x000000000087d63d in maybe_start_bgworkers () at postmaster.c:6078
#14 0x000000000087c685 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5247
#15 <signal handler called>
#16 0x00007ff2a0d458d3 in __select_nocancel () from /lib64/libc.so.6
#17 0x0000000000878153 in ServerLoop () at postmaster.c:1691
#18 0x0000000000877b42 in PostmasterMain (argc=3, argv=0x2079120) at
postmaster.c:1400
#19 0x000000000077f256 in main (argc=3, argv=0x2079120) at main.c:210

To reproduce this issue
run start1.sh
then execute below commands on publishers.
insert into pgbench_accounts values(1,2);
update pgbench_accounts set b=30 where a=1;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
Confirmed and added to opened items - this is a live bug back to v10.

for a in data data1; do ./tmp_install/usr/local/pgsql/bin/initdb -D $a --no-sync& done; wait
echo "wal_level = logical">> data/postgresql.conf; echo "port=5433" >> data1/postgresql.conf
for a in data data1; do ./tmp_install/usr/local/pgsql/bin/postmaster -D $a& done
for a in "CREATE TABLE pgbench_accounts(a int primary key, b int)" "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL"
"CREATEPUBLICATION mypub FOR TABLE pgbench_accounts"; do \
 
    for p in 5432 5433; do psql -d postgres -h /tmp -p "$p" -c "$a"; done; done

psql -d postgres -h /tmp -p 5433 -c "CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432 dbname=postgres'
PUBLICATIONmypub"
 
psql -d postgres -h /tmp -p 5432 -c "insert into pgbench_accounts values(1,2); update pgbench_accounts set b=30 where
a=1;"

-- 
Justin



Re: Problem with logical replication

From
Masahiko Sawada
Date:
On Thu, 16 Apr 2020 at 17:48, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> While try to setup a cascading replication, I have observed that if we
> set the REPLICA IDENTITY to FULL on the subscriber side then there is
> an Assert hit.
>
> After analysis I have found that, when we set the REPLICA IDENTITY to
> FULL on subscriber side (because I wanted to make this a publisher for
> another subscriber).
> then it will set relation->rd_replidindex to InvalidOid refer below code snippet
> RelationGetIndexList()
> {
> ....
> if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
> relation->rd_replidindex = pkeyIndex;
> else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
> relation->rd_replidindex = candidateIndex;
> else
> relation->rd_replidindex = InvalidOid;
> }
>
> But, while appying the update and if the table have an index we have
> this assert in build_replindex_scan_key
>
> static bool
> build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
> TupleTableSlot *searchslot)
> {
> ...
> Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
> }
>
> To me it appears like this assert is not correct.  Attached patch has
> removed this assert and things works fine.
>
>
> #0  0x00007ff2a0c8d5d7 in raise () from /lib64/libc.so.6
> #1  0x00007ff2a0c8ecc8 in abort () from /lib64/libc.so.6
> #2  0x0000000000aa7c7d in ExceptionalCondition (conditionName=0xc1bb30
> "RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)",
> errorType=0xc1bad9 "FailedAssertion",
>     fileName=0xc1bb1c "execReplication.c", lineNumber=60) at assert.c:67
> #3  0x00000000007153c3 in build_replindex_scan_key
> (skey=0x7fff25711560, rel=0x7ff2a1b0b800, idxrel=0x7ff2a1b0bd98,
> searchslot=0x21328c8) at execReplication.c:60
> #4  0x00000000007156ac in RelationFindReplTupleByIndex
> (rel=0x7ff2a1b0b800, idxoid=16387, lockmode=LockTupleExclusive,
> searchslot=0x21328c8, outslot=0x2132bb0) at execReplication.c:141
> #5  0x00000000008aeba5 in FindReplTupleInLocalRel (estate=0x2150170,
> localrel=0x7ff2a1b0b800, remoterel=0x214a7c8, remoteslot=0x21328c8,
> localslot=0x7fff25711f28) at worker.c:989
> #6  0x00000000008ae6f2 in apply_handle_update_internal
> (relinfo=0x21327b0, estate=0x2150170, remoteslot=0x21328c8,
> newtup=0x7fff25711fd0, relmapentry=0x214a7c8) at worker.c:820
> #7  0x00000000008ae609 in apply_handle_update (s=0x7fff25719560) at worker.c:788
> #8  0x00000000008af8b1 in apply_dispatch (s=0x7fff25719560) at worker.c:1362
> #9  0x00000000008afd52 in LogicalRepApplyLoop (last_received=22926832)
> at worker.c:1570
> #10 0x00000000008b0c3a in ApplyWorkerMain (main_arg=0) at worker.c:2114
> #11 0x0000000000869c15 in StartBackgroundWorker () at bgworker.c:813
> #12 0x000000000087d28f in do_start_bgworker (rw=0x20a07a0) at postmaster.c:5852
> #13 0x000000000087d63d in maybe_start_bgworkers () at postmaster.c:6078
> #14 0x000000000087c685 in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5247
> #15 <signal handler called>
> #16 0x00007ff2a0d458d3 in __select_nocancel () from /lib64/libc.so.6
> #17 0x0000000000878153 in ServerLoop () at postmaster.c:1691
> #18 0x0000000000877b42 in PostmasterMain (argc=3, argv=0x2079120) at
> postmaster.c:1400
> #19 0x000000000077f256 in main (argc=3, argv=0x2079120) at main.c:210
>
> To reproduce this issue
> run start1.sh
> then execute below commands on publishers.
> insert into pgbench_accounts values(1,2);
> update pgbench_accounts set b=30 where a=1;
>

I could reproduce this issue by the steps you shared. For the bug fix
patch, I basically agree to remove that assertion from
build_replindex_scan_key() but I think it's better to update the
assertion instead of removal and update the following comment:

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.
*/
static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
                         TupleTableSlot *searchslot)
{

An alternative solution would be that logical replication worker
determines the access path based on its replica identity instead of
seeking the chance to use the primary key as follows:

@@ -981,7 +981,7 @@ FindReplTupleInLocalRel(EState *estate, Relation localrel,

    *localslot = table_slot_create(localrel, &estate->es_tupleTable);

-   idxoid = GetRelationIdentityOrPK(localrel);
+   idxoid = RelationGetReplicaIndex(localrel);
    Assert(OidIsValid(idxoid) ||
           (remoterel->replident == REPLICA_IDENTITY_FULL));

That way, we can avoid such mismatch between replica identity and an
index for index scans. But a downside is that it will end up with a
sequential scan even if the local table has the primary key. IIUC if
the table has the primary key, a logical replication worker can use
the primary key for the update and delete even if its replica identity
is FULL, because the columns of the primary key are always a subset of
all columns. So I'll look at this closely but I agree with your idea.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problem with logical replication

From
Euler Taveira
Date:
On Mon, 20 Apr 2020 at 10:25, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Thu, 16 Apr 2020 at 17:48, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I could reproduce this issue by the steps you shared. For the bug fix
patch, I basically agree to remove that assertion from
build_replindex_scan_key() but I think it's better to update the
assertion instead of removal and update the following comment:

IMO the assertion is using the wrong function because it should test a replica
identity or primary key (GetRelationIdentityOrPK). RelationGetReplicaIndex
returns InvalidOid even though the table has a primary key.
GetRelationIdentityOrPK tries to obtain a replica identity and if it fails, it
tries a primary key. That's exact what this assertion should use. We should
also notice that FindReplTupleInLocalRel uses GetRelationIdentityOrPK and after
a code path like RelationFindReplTupleByIndex -> build_replindex_scan_key it
should also use the same function.

Since GetRelationIdentityOrPK is a fallback function that
uses RelationGetReplicaIndex and RelationGetPrimaryKeyIndex, I propose that we
move this static function to execReplication.c.

I attached a patch with the described solution. I also included a test that
covers this scenario.


--
Euler Taveira                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment

Re: Problem with logical replication

From
Michael Paquier
Date:
On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
> I attached a patch with the described solution. I also included a test that
> covers this scenario.

-   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));

Not much a fan of adding a routine to relcache.c to do the work of two
routines already present, so I think that we had better add an extra
condition based on RelationGetPrimaryKeyIndex, and give up on
GetRelationIdentityOrPK() in execReplication.c.  Wouldn't it also be
better to cross-check the replica identity here depending on if
RelationGetReplicaIndex() returns an invalid OID or not?
--
Michael

Attachment

Re: Problem with logical replication

From
Masahiko Sawada
Date:
On Mon, 11 May 2020 at 16:28, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
> > I attached a patch with the described solution. I also included a test that
> > covers this scenario.
>
> -   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
> +   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));
>
> Not much a fan of adding a routine to relcache.c to do the work of two
> routines already present, so I think that we had better add an extra
> condition based on RelationGetPrimaryKeyIndex, and give up on
> GetRelationIdentityOrPK() in execReplication.c.

+1

In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

 * This is not generic routine, it expects the idxrel to be replication
 * identity of a rel and meet all limitations associated with that.

For example, we can update the above:

 * This is not generic routine, it expects the idxrel to be replication
 * identity or primary key of a rel and meet all limitations associated
* with that.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problem with logical replication

From
Euler Taveira
Date:
On Tue, 12 May 2020 at 06:36, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 11 May 2020 at 16:28, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
> > I attached a patch with the described solution. I also included a test that
> > covers this scenario.
>
> -   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
> +   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));
>
> Not much a fan of adding a routine to relcache.c to do the work of two
> routines already present, so I think that we had better add an extra
> condition based on RelationGetPrimaryKeyIndex, and give up on
> GetRelationIdentityOrPK() in execReplication.c.

Although, I think this solution is fragile, I updated the patch accordingly.
(When/If someone changed GetRelationIdentityOrPK() it will break this assert)


In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

 * This is not generic routine, it expects the idxrel to be replication
 * identity of a rel and meet all limitations associated with that.

It is implicit that a primary key can be a replica identity so I think this
comment is fine.


--
Euler Taveira                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment

Re: Problem with logical replication

From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 6:15 AM Euler Taveira
<euler.taveira@2ndquadrant.com> wrote:
>
> On Tue, 12 May 2020 at 06:36, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Mon, 11 May 2020 at 16:28, Michael Paquier <michael@paquier.xyz> wrote:
>> >
>> > On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:
>> > > I attached a patch with the described solution. I also included a test that
>> > > covers this scenario.
>> >
>> > -   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
>> > +   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));
>> >
>> > Not much a fan of adding a routine to relcache.c to do the work of two
>> > routines already present, so I think that we had better add an extra
>> > condition based on RelationGetPrimaryKeyIndex, and give up on
>> > GetRelationIdentityOrPK() in execReplication.c.
>>
> Although, I think this solution is fragile, I updated the patch accordingly.
> (When/If someone changed GetRelationIdentityOrPK() it will break this assert)
>
>>
>> In any case, it seems to me that the comment of
>> build_replindex_scan_key needs to be updated.
>>
>>  * This is not generic routine, it expects the idxrel to be replication
>>  * identity of a rel and meet all limitations associated with that.
>>
> It is implicit that a primary key can be a replica identity so I think this
> comment is fine.

I like your idea of modifying the assert instead of completely removing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Problem with logical replication

From
Michael Paquier
Date:
On Tue, May 12, 2020 at 09:45:45PM -0300, Euler Taveira wrote:
>> In any case, it seems to me that the comment of
>> build_replindex_scan_key needs to be updated.
>>
>>  * This is not generic routine, it expects the idxrel to be replication
>>  * identity of a rel and meet all limitations associated with that.
>
> It is implicit that a primary key can be a replica identity so I think this
> comment is fine.

Agreed.  I don't think either that we need to update this comment.  I
was playing with this patch and what you have here looks fine by me.
Two nits: the extra parenthesis in the assert are not necessary, and
the indentation had some diffs.  Tom has just reindented the whole
tree, so let's keep things clean.
--
Michael

Attachment

Re: Problem with logical replication

From
Euler Taveira
Date:
On Fri, 15 May 2020 at 02:47, Michael Paquier <michael@paquier.xyz> wrote:

Agreed.  I don't think either that we need to update this comment.  I
was playing with this patch and what you have here looks fine by me.
Two nits: the extra parenthesis in the assert are not necessary, and
the indentation had some diffs.  Tom has just reindented the whole
tree, so let's keep things clean.


LGTM.


--
Euler Taveira                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Problem with logical replication

From
Michael Paquier
Date:
On Fri, May 15, 2020 at 08:48:53AM -0300, Euler Taveira wrote:
> On Fri, 15 May 2020 at 02:47, Michael Paquier <michael@paquier.xyz> wrote:
>> Agreed.  I don't think either that we need to update this comment.  I
>> was playing with this patch and what you have here looks fine by me.
>> Two nits: the extra parenthesis in the assert are not necessary, and
>> the indentation had some diffs.  Tom has just reindented the whole
>> tree, so let's keep things clean.
>
> LGTM.

Thanks for double-checking.  Applied and back-patched down to 10
then.
--
Michael

Attachment