Re: Potential data loss due to race condition during logical replication slot creation - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: Potential data loss due to race condition during logical replication slot creation
Date
Msg-id CAD21AoBV7odVHcTsXS2VNUKqw3J0AB_jaRNhL0UuXJ7=kOkLFw@mail.gmail.com
Whole thread Raw
In response to Re: Potential data loss due to race condition during logical replication slot creation  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Potential data loss due to race condition during logical replication slot creation
List pgsql-bugs
On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Sawada-san,
> >
> > Thanks for creating the patch!
> >
> > > I've attached updated patches for HEAD and pg17 for now (I will create
> > > the patch for other backbranches).
> > >
> > > In the patches, I used a different approach in between HEAD and
> > > backbranches. On HEAD, we store a flag indicating whether or not we
> > > should skip snapshot restores into the SnapBuild struct and set it
> > > only while finding the start point. Therefore we have to bump
> > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed
> > > above; store the flag in LogicalDecodingContext and set it when
> > > creating the LogicalDecodingContext for a new logical slot. A possible
> > > downside of the approach taken for backbranches is that we implicitly
> > > require for users to use the same LogicalDecodingContext for  both
> > > initializing the context for a new slot and finding its start point.
> > > IIUC it was not strictly required. This restriction would not be a
> > > problem at least in the core, but I'm not sure if there are no
> > > external extensions that create a logical slot in that way. This is
> > > the main reason why I used a different approach on HEAD and
> > > backbranches. Therefore, if it does not matter, I think we can use the
> > > same approach on all branches, which is better in terms of
> > > maintainability.
> >
> > I want to confirm your point. You meant that you wanted to unify appraoches,
> > especially you wanted to store the flag in LogicalDecodingContext, rigth?
>
> Yes. Ideally I'd like to use the same approach in all branches
> regardless of how to fix it for better maintainability.
>
> > I
> > briefly grepped github repos with "DecodingContextFindStartpoint", and cannot
> > find such pattern. E.g., [1]. But this point must be pointed by extension developers.
>
> Thank you for searching on github. It will also affect the future
> extension developments so I'm slightly hesitant to do that.
>
> > >
> > > Also, I excluded the test case for the problem that Kuroda-san
> > > reported[1] since the reported problem occurred due to the same cause
> > > of the problem originally reported on this thread. The bug that
> > > decodes only partial transactions could lead to various symptoms. IIUC
> > > these test cases test the same behavior.
> >
> > Yes, I also think they are caused by the same root cause, so we can skip.
> >
> > Comments for HEAD patch:
> > 1.
> > You must modify test_decoding/meson.build. It was also missing in patches for
> > backbranches.
> >
> > 2.
> > The test code needs cleanup. E.g.,
> > - Steps "s0_insert_cat" and "s0_savepoint" is not used
> > - Table user_cat is not used
> > It was also missing in patches for backbranches.
>
> Will fix the above points in the next version patch.
>
> >
> > 3.
> > Just to confirm - In SnapshotRestore(), the added attribute is not restored
> > from the disk. This is intentional because 1) it has been set to false when it
> > was serilizing to disk and 2) the destination (SnapBuild *builder) is initialized
> > by palloc0() so it was set to false. Is it right?
>
> Right.
>
> > Comments for backblanches patch:
> >
> > 1.
> > According to wikipage [2], new attribute must be at the end of struct.
>
> Good point. Will fix.
>

I've attached the new version patches for all branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-bugs by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Next
From: Amit Kapila
Date:
Subject: Re: Potential data loss due to race condition during logical replication slot creation