RE: Replica Identity check of partition table on subscriber - Mailing list pgsql-hackers
From:           houzj.fnst@fujitsu.com
Subject:        RE: Replica Identity check of partition table on subscriber
Msg-id:         OS0PR01MB57165891A6EA9021E8A1896494B39@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to: Re: Replica Identity check of partition table on subscriber (Amit Langote <amitlangote09@gmail.com>)
Responses:      Re: Replica Identity check of partition table on subscriber
List:           pgsql-hackers
On Tuesday, June 21, 2022 3:21 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Tue, Jun 21, 2022 at 3:35 PM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> > On Tuesday, June 21, 2022 1:29 PM Amit Kapila <amit.kapila16@gmail.com>:
> > > After pushing this patch, buildfarm member prion has failed.
> > > https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=prion&br=HEAD
> > >
> > > It seems to me that the problem could be due to the reason that the
> > > entry returned by logicalrep_partition_open() may not have the correct
> > > value for localrel when we found the entry and localrelvalid is also
> > > true. The point is that before this commit we never use the localrel
> > > value from the rel entry returned by logicalrep_partition_open(). I
> > > think we need to always update the localrel value in
> > > logicalrep_partition_open().
> >
> > Agreed.
> >
> > And I have confirmed that the failure is due to a segmentation violation
> > when accessing the cached relation. I reproduced this by building with
> > -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE, as hinted by Tom.
> >
> > Stack:
> > #0  check_relation_updatable (rel=0x1cf4548) at worker.c:1745
> > #1  0x0000000000909cbb in apply_handle_tuple_routing (edata=0x1cbf4e8,
> >     remoteslot=0x1cbf908, newtup=0x0, operation=CMD_DELETE) at worker.c:2181
> > #2  0x00000000009097a5 in apply_handle_delete (s=0x7ffcef7fd730) at worker.c:2005
> > #3  0x000000000090a794 in apply_dispatch (s=0x7ffcef7fd730) at worker.c:2503
> > #4  0x000000000090ad43 in LogicalRepApplyLoop (last_received=22299920) at worker.c:2775
> > #5  0x000000000090c2ab in start_apply (origin_startpos=0) at worker.c:3549
> > #6  0x000000000090ca8d in ApplyWorkerMain (main_arg=0) at worker.c:3805
> > #7  0x00000000008c4c64 in StartBackgroundWorker () at bgworker.c:858
> > #8  0x00000000008ceaeb in do_start_bgworker (rw=0x1c3c6b0) at postmaster.c:5815
> > #9  0x00000000008cee97 in maybe_start_bgworkers () at postmaster.c:6039
> > #10 0x00000000008cdf4e in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5204
> > #11 <signal handler called>
> > #12 0x00007fd8fbe0d4ab in select () from /lib64/libc.so.6
> > #13 0x00000000008c9cfb in ServerLoop () at postmaster.c:1770
> > #14 0x00000000008c96e4 in PostmasterMain (argc=4, argv=0x1c110a0) at postmaster.c:1478
> > #15 0x00000000007c665b in main (argc=4, argv=0x1c110a0) at main.c:202
> > (gdb) p rel->localrel->rd_rel
> > $5 = (Form_pg_class) 0x7f7f7f7f7f7f7f7f
> >
> > We didn't hit this problem before because we only access that relation
> > when we are about to report an error[1], after which the worker restarts
> > and the cache is rebuilt, so everything seems OK.
> >
> > The problem already existed; we hit it now because we started to access
> > the cached relation in more places.
> >
> > I think we should update the relation every time, since the relation is
> > opened and closed by the caller. Here is a patch to do that.
>
> Thanks for the patch.
>
> I agree it's an old bug.
> A partition map entry's localrel may point
> to a stale Relation pointer, because once the caller has closed the
> relation, the relcache subsystem is free to "clear" it, as in the
> case of a RELCACHE_FORCE_RELEASE build.

Hi,

Thanks for replying.

> Fixing it the way the patch does seems fine, though it feels like
> localrelvalid will lose some of its meaning for the partition map
> entries -- we will now overwrite localrel even if localrelvalid is
> true.

To me, localrelvalid doesn't mean that the cached relation pointer itself is
valid. In logicalrep_rel_open(), we likewise reopen and update the relation
even if localrelvalid is true.

> +	/*
> +	 * Relation is opened and closed by caller, so we need to always update
> +	 * the partrel in case the cached relation was closed.
> +	 */
> +	entry->localrel = partrel;
> +
> +	if (entry->localrelvalid)
> 		return entry;
>
> Maybe we should add a comment here about why it's okay to overwrite
> localrel even if localrelvalid is true. How about the following hunk:
>
> @@ -596,8 +596,20 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
>
> 	entry = &part_entry->relmapentry;
>
> +	/*
> +	 * We must always overwrite entry->localrel with the latest partition
> +	 * Relation pointer, because the Relation pointed to by the old value may
> +	 * have been cleared after the caller would have closed the partition
> +	 * relation after the last use of this entry. Note that localrelvalid is
> +	 * only updated by the relcache invalidation callback, so it may still be
> +	 * true irrespective of whether the Relation pointed to by localrel has
> +	 * been cleared or not.
> +	 */
> 	if (found && entry->localrelvalid)
> +	{
> +		entry->localrel = partrel;
> 		return entry;
> +	}
>
> Attached a patch containing the above to consider as an alternative.

This looks fine to me as well.

Best regards,
Hou zj