RE: Replica Identity check of partition table on subscriber - Mailing list pgsql-hackers

From houzj.fnst@fujitsu.com
Subject RE: Replica Identity check of partition table on subscriber
Date
Msg-id OS0PR01MB57161D8BBF9FA24658C1F95394B39@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Replica Identity check of partition table on subscriber  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Replica Identity check of partition table on subscriber
List pgsql-hackers
On Tuesday, June 21, 2022 1:29 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Jun 21, 2022 at 8:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 21, 2022 at 7:49 AM Amit Langote <amitlangote09@gmail.com>
> wrote:
> > >
> > >
> > > I think it should spell out REPLICA IDENTITY explicitly to avoid the
> > > commit being confused to have to do with "Referential Integrity
> > > checking".
> > >
> >
> > This makes sense. I'll take care of this.
> >
> 
> After pushing this patch, buildfarm member prion has failed.
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prion&br=HEAD
> 
> It seems to me that the problem could be that the entry returned by
> logicalrep_partition_open() may not have the correct value for localrel
> when we find an existing entry whose localrelvalid is true. The point is
> that before this commit we never used the localrel value from the rel
> entry returned by logicalrep_partition_open(). I think we need to always
> update the localrel value in logicalrep_partition_open().

Agreed.

And I have confirmed that the failure is due to a segmentation violation when
accessing the cached relation. I reproduced it by building with the
-DRELCACHE_FORCE_RELEASE and -DCATCACHE_FORCE_RELEASE options, as hinted by Tom.

Stack:
#0  check_relation_updatable (rel=0x1cf4548) at worker.c:1745
#1  0x0000000000909cbb in apply_handle_tuple_routing (edata=0x1cbf4e8, remoteslot=0x1cbf908, newtup=0x0, operation=CMD_DELETE) at worker.c:2181
#2  0x00000000009097a5 in apply_handle_delete (s=0x7ffcef7fd730) at worker.c:2005
#3 0x000000000090a794 in apply_dispatch (s=0x7ffcef7fd730) at worker.c:2503
#4 0x000000000090ad43 in LogicalRepApplyLoop (last_received=22299920) at worker.c:2775
#5 0x000000000090c2ab in start_apply (origin_startpos=0) at worker.c:3549
#6 0x000000000090ca8d in ApplyWorkerMain (main_arg=0) at worker.c:3805
#7 0x00000000008c4c64 in StartBackgroundWorker () at bgworker.c:858
#8 0x00000000008ceaeb in do_start_bgworker (rw=0x1c3c6b0) at postmaster.c:5815
#9 0x00000000008cee97 in maybe_start_bgworkers () at postmaster.c:6039
#10 0x00000000008cdf4e in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5204
#11 <signal handler called>
#12 0x00007fd8fbe0d4ab in select () from /lib64/libc.so.6
#13 0x00000000008c9cfb in ServerLoop () at postmaster.c:1770
#14 0x00000000008c96e4 in PostmasterMain (argc=4, argv=0x1c110a0) at postmaster.c:1478
#15 0x00000000007c665b in main (argc=4, argv=0x1c110a0) at main.c:202
(gdb) p rel->localrel->rd_rel
$5 = (Form_pg_class) 0x7f7f7f7f7f7f7f7f

We didn't hit this problem before because we only accessed that relation when
we were about to report an error[1]; after the error the worker restarts and
the cache is rebuilt, so everything appeared OK.

The problem seems to have existed already; we hit it now because we have
started to access the cached relation in more places.

I think we should update the cached localrel every time, since the relation is
opened and closed by the caller. Attached is a patch to do that.

[1]
    /*
     * We are in error mode so it's fine this is somewhat slow. It's better to
     * give user correct error.
     */
    if (OidIsValid(GetRelationIdentityOrPK(rel->localrel)))

Best regards,
Hou zj

Attachment
