RE: Truncate in synchronous logical replication failed - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: Truncate in synchronous logical replication failed
Date
Msg-id OSBPR01MB4888C84371882A4E001E3A44ED459@OSBPR01MB4888.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Truncate in synchronous logical replication failed  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Truncate in synchronous logical replication failed
List pgsql-hackers
On Thursday, April 22, 2021 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Apr 20, 2021 at 9:00 AM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > On  Tuesday, April 20, 2021 10:53 AM Ajin Cherian <itsajin@gmail.com>
> wrote:
> > >
> > > I reviewed the patch, ran make check, no issues. One minor comment:
> > >
> > > Could you add the comment similar to RelationGetIndexAttrBitmap() on
> > > why the redo, it's not very obvious to someone reading the code, why
> > > we are refetching the index list here.
> > >
> > > + /* Check if we need to redo */
> > >
> > > + newindexoidlist = RelationGetIndexList(relation);
> > Yeah, makes sense. Fixed.
> 
> I don't think here we need to restart to get a stable list of indexes as we do in
> RelationGetIndexAttrBitmap. The reason is here we build the cache entry
> using a historic snapshot and all the later changes are absorbed while
> decoding WAL.
I rechecked this and I agree with this.
I don't see any problem to remove the redo check.
Based on this direction, I'll share my several minor comments.

[1] a typo of RelationGetIdentityKeyBitmap's comment

+ * This is a special purpose function used during logical replication. Here,
+ * unlike RelationGetIndexAttrBitmap(), we don't a acquire lock on the required

We have "a" in an unnatural place. It should be "we don't acquire...".

[2] suggestion to fix RelationGetIdentityKeyBitmap's comment

+ * later changes are absorbed while decoding WAL. Due to this reason, we don't
+ * need to retry here in case of a change in the set of indexes.

I think it's better to use "Because of" instead of "Due to".
This patch works to solve the deadlock.

[3] wrong comment in RelationGetIdentityKeyBitmap

+       /* Save some values to compare after building attributes */

I wrote this comment for the redo check
that has been removed already. We can delete it.

[4] suggestion to remove local variable relreplindex in RelationGetIdentityKeyBitmap

I think we don't need relreplindex.
We can just pass relaton->rd_replidindex to RelationIdGetRelation().
There is no more reference of the variable.

[5] suggestion to fix the place to release indexoidlist

I thought we can do its list_free() ealier,
after checking if there is no indexes.

[6] suggestion to remove period in one comment.

+       /* Quick exit if we already computed the result. */

This comes from RelationGetIndexAttrBitmap (and my posted versions)
but I think we can remove the period to give better alignment shared with other comments in the function.

> I have updated that and modified few comments in the
> attached patch. Can you please test this in clobber_cache_always mode? I
> think just testing subscription/t/010_truncate.pl would be sufficient.
I did it. It didn't fail. No problem.

> Now, this bug exists in prior versions (>= v11) where we have introduced
> decoding of Truncate. Do we want to backpatch this? As no user has reported
> this till now apart from Tang who I think got it while doing some other patch
> testing, we might want to just push it in HEAD. I am fine either way. Petr,
> others, do you have any opinion on this matter?
I think we are fine with applying this patch to the HEAD only,
since nobody has complained about the hang issue.


Best Regards,
    Takamichi Osumi


pgsql-hackers by date:

Previous
From: "tsunakawa.takay@fujitsu.com"
Date:
Subject: RE: A test for replay of regression tests
Next
From: "wangyukun@fujitsu.com"
Date:
Subject: fix a comment